General articles on technical subjects.

Do your (X86) CPU and GPU support OpenCL?

Does your computer have OpenCL-capable hardware? Read on and find out if your computer is compatible…

If you want to know what other non-PC hardware (phones, tablets, FPGAs, DSPs, etc) is running OpenCL, see the OpenCL SDK page.

If you only want to run OpenCL software and have recent hardware, this paragraph is all you need. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has supported OpenCL 1.1 since driver 280.13, so if you need OpenCL 1.1, make sure you have that version or later. If you want to use Intel processors and don’t have an AMD GPU installed, you need to download Intel’s OpenCL runtime.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have an 8-core processor with two-way threading (Hyper-Threading), it is commonly understood that 16 instruction pipelines are available. OpenCL takes care of this threading, but it also uses the parallelism provided by the SSE and AVX extensions. I talked more about this here and here. This means that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel – you could see it as parallelism of parallelism. SSE was designed with multimedia operations in mind, but offers enough to be usable with OpenCL. The minimum requirement for OpenCL-on-a-CPU is SSE 4.2, though.
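To make the vector part concrete, below is a tiny kernel sketch of my own (not from the article) using OpenCL’s explicit float8 vector type. On a CPU device one float8 operation maps nicely onto a single 256-bit AVX register, while the work-items themselves are spread over the cores and hardware threads.

    /* Generic sketch, not from the article: one work-item handles 8 floats (32 bytes). */
    __kernel void scale(__global const float8 *in,
                        __global float8 *out,
                        const float factor)
    {
        size_t i = get_global_id(0);
        out[i] = in[i] * factor;   /* maps to one AVX multiply per work-item on a CPU device */
    }

Intel’s CPU runtime, for instance, can also auto-vectorise scalar kernels across work-items, so explicit vector types are a hint rather than a hard requirement.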

A question I often see is what to do if you have multiple devices. There is no single OpenCL package for all available devices, so you need to install drivers for each device. CPU drivers are often included in the GPU drivers.

Read on to find out exactly which processors are supported.

Continue reading “Do your (X86) CPU and GPU support OpenCL?”

Basic concepts: Function Qualifiers

Optimisation of one’s thoughts is a complex problem: a lot of interacting processes can be identified, if you think about it.

In OpenCL you have to distinguish the compile-time of the host C-code from the compile-time of the kernel; it is very important to keep these apart, as it can get confusing. The kernel is compiled at run-time of the software, after the compute devices have been queried. The OpenCL compiler can generate better optimised code when you give it as much information as possible. One of the methods is using function qualifiers. A function qualifier is written as a kernel attribute:

__kernel __attribute__((qualifier(qualification))) void foo ( ... ) { ... }

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them here in the official documentation, with more examples.
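As a preview, here is a sketch of my own showing how the three kernel-attribute qualifiers from the OpenCL 1.x specification look on dummy kernels; their exact effects are discussed after the break:

    /* Hint that the kernel mostly computes on float4 vectors. */
    __kernel __attribute__((vec_type_hint(float4)))
    void a(__global float4 *data) { data[get_global_id(0)] *= 2.0f; }

    /* Hint the expected work-group size, without enforcing it. */
    __kernel __attribute__((work_group_size_hint(64, 1, 1)))
    void b(__global float *data) { data[get_global_id(0)] += 1.0f; }

    /* Require a fixed work-group size, so the compiler can optimise for it. */
    __kernel __attribute__((reqd_work_group_size(64, 1, 1)))
    void c(__global float *data) { data[get_global_id(0)] -= 1.0f; }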

Continue reading “Basic concepts: Function Qualifiers”

Black-Scholes mixing on SandyBridge, Radeon and Geforce

Intel, AMD and NVidia have all written implementations of the Black-Scholes algorithm for their devices. Intel has described a kernel in their OpenCL optimisation document (page 28 and further) with three random factors as input: S, K and T, and two configuration constants: R and V. NVidia’s is easy to compare to Intel’s, while AMD chose to write the algorithm down quite differently.
So we have three different but comparable kernels in total. What will happen if we run these, all optimised for a specific type of hardware, on the following devices? (A rough sketch of such a kernel follows the device list.)

  • Intel(R) Core(TM) i7-2600 CPU @3.4GHz, Mem @1333MHz
  • GeForce GTX 560 @810MHz, Mem @1000MHz
  • Radeon HD 6870 @930MHz, Mem @1030MHz
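For readers who have not seen the algorithm before, here is a rough Black-Scholes kernel sketch of my own using the same parameter names – it is not one of the three vendor kernels being compared:

    /* Cumulative normal distribution, Abramowitz-Stegun polynomial approximation. */
    float cnd(float d)
    {
        const float A1 =  0.319381530f, A2 = -0.356563782f, A3 = 1.781477937f,
                    A4 = -1.821255978f, A5 =  1.330274429f;
        float k = 1.0f / (1.0f + 0.2316419f * fabs(d));
        float w = 1.0f - 0.39894228f * exp(-0.5f * d * d) *
                  k * (A1 + k * (A2 + k * (A3 + k * (A4 + k * A5))));
        return (d < 0.0f) ? 1.0f - w : w;
    }

    /* Generic sketch, not a vendor kernel.
       S = spot price, K = strike, T = time to maturity, R = risk-free rate, V = volatility. */
    __kernel void black_scholes(__global const float *S, __global const float *K,
                                __global const float *T, const float R, const float V,
                                __global float *call, __global float *put)
    {
        size_t i = get_global_id(0);
        float sqrtT = sqrt(T[i]);
        float d1 = (log(S[i] / K[i]) + (R + 0.5f * V * V) * T[i]) / (V * sqrtT);
        float d2 = d1 - V * sqrtT;
        float expRT = exp(-R * T[i]);
        call[i] = S[i] * cnd(d1) - K[i] * expRT * cnd(d2);
        put[i]  = K[i] * expRT * cnd(-d2) - S[i] * cnd(-d1);
    }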

Three different architectures and three different drivers. To complete the comparison I also try to see if there is a difference when using Intel’s and AMD’s driver for CPUs. Continue reading “Black-Scholes mixing on SandyBridge, Radeon and Geforce”

OpenCL potentials: Watermarked media for content-protection

HTML5 is the future, now that Flash and Silverlight are leaving the market to make way for HTML5 video. There is one big problem: it is hard to protect the content – before you know it, the movie is traded on the free market. DRM is only a temporary solution and often ends in frustration for users who just want to watch the movie wherever they want.

If you look at e-books, you see a much better way to keep PDFs from spreading all over the web: personalisation. With images and videos this could be done too. The example here at the right has a very obvious, clearly visible watermark (source), but there are many methods which are not easy to see – and thus easy to miss for people who want to clean the file. It therefore has a clear advantage over DRM, where it is obvious what has to be removed. Watermarks give buyers freedom of use. The only disadvantage is that ownership of a personalised video cannot be transferred.

Continue reading “OpenCL potentials: Watermarked media for content-protection”

Differences from OpenCL 1.1 to 1.2

This article will be of interest if you don’t want to read the whole new specification [PDF] for OpenCL 1.2.

As always, feedback will be much appreciated.

After many meetings between the many members of the OpenCL task force, a lot of ideas have sprouted. Every 17 or 18 months a new version of OpenCL comes out to give form to all these ideas. You can see totally new ideas coming up that a member has already brought outside in another product. You can also see ideas not appearing at all, because other members voted against them. That last category is very interesting, and hopefully we’ll soon see a lot of forum discussion on what should be in the next version, as it is missing now.

With the release of 1.2 it was also announced that (at least) two task forces will be set up. One of them will target integration in high-level programming languages, which tells me that phase 1 of creating the standard is complete and we can expect to head for OpenCL 2.0. I will discuss these phases in a follow-up, and what you as a user, programmer or customer can expect… and how you can act on it.

Another big announcement was that Altera is starting to support OpenCL for an FPGA product. In another article I will let you know everything there is to know. For now, let’s concentrate on the actual software-side differences in this version and what you can do with them. I have added links to the 1.1 and 1.2 man-pages, so you can look things up.

Continue reading “Differences from OpenCL 1.1 to 1.2”

Basic Concepts: online kernel compiling

Typos are a programmer’s worst nightmare, as they are bad for concentration: the code in your head is not the same as the code on the screen and therefore has little to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an online OpenCL compiler. In short it is just an OpenCL program that takes a kernel as input, and thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source (see the sketch below the list for the idea). I have found two solutions, which both have to be built from source – so a C-compiler is needed.

  • CLCC. It needs the boost-libraries, cmake and make to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.
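The core of such a tool is only a handful of OpenCL API calls: create a program from the source text, try to build it, and print the build log. Below is a minimal sketch of my own in plain C (error handling stripped; a real tool would read the kernel file given on the command line):

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(void)
    {
        /* In a real tool this string would be read from the file passed as argument. */
        const char *source =
            "__kernel void foo(__global float *a) { a[get_global_id(0)] *= 2.0f; }";

        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
        cl_int err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);

        /* The build log is where the compiler reports your typos. */
        size_t log_size;
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
        char *log = malloc(log_size + 1);
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
        log[log_size] = '\0';
        printf("Build %s\n%s\n", err == CL_SUCCESS ? "succeeded" : "failed", log);

        free(log);
        clReleaseProgram(program);
        clReleaseContext(context);
        return 0;
    }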

Continue reading “Basic Concepts: online kernel compiling”

Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL has always been a protection against somebody or some company running away with your code and making the money with it – or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel, but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to give a warning that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not chosen when eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.

Continue reading “Kernels and the GPL. Are we safe and linking?”

Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions

In the series Basic Concepts I try to give an alternative description to what is said everywhere else. This time my eye fell on two convenience notations that were introduced to be nice to devs with, for instance, a C/C++ and/or graphics background. Too often I see them explained starting from the convenience functions, with the “preferred” functions given as a sort of bonus for the cases the old ones cannot handle. Below it is the other way around, and I hope that gives a better understanding. I assume you have already read another definition, so you see it from another view and not for the first time.
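As a taste of the two cases the title refers to – vector elements and type conversions – here is a small kernel sketch of my own (not taken from the article):

    /* Hypothetical demo kernel, only to show the notations side by side. */
    __kernel void demo(__global const float4 *in, __global int4 *out)
    {
        size_t i = get_global_id(0);
        float4 v = in[i];

        /* Vector elements: graphics-style .xyzw and the generic .s0-.sF
           notation name exactly the same components. */
        v.x  = 1.0f;      /* convenience notation */
        v.s1 = 2.0f;      /* generic notation; .s0 is the same element as .x */
        v.hi = v.lo;      /* .lo/.hi/.even/.odd select halves of the vector */

        /* Type conversions: explicit convert_<type>[_sat][_rounding]() functions
           instead of C-style casts, here with saturation and round-to-nearest-even. */
        out[i] = convert_int4_sat_rte(v);
    }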

 

 

Continue reading “Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions”

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL devices under Linux? The bad news is that attempts to get the Radeon as a compute device and the GTX as primary all failed. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics – which is also the reason why it is impossible to use the second GPU for OpenGL. A quick way to check the result is sketched below.
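Once both drivers are in place, a short host program is enough to verify that both vendors’ platforms are visible through the ICD loader. A minimal sanity-check sketch of my own (compile with: gcc list_devices.c -lOpenCL):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        /* List every platform and device the ICD loader can find. */
        cl_platform_id platforms[8];
        cl_uint num_platforms;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof pname, pname, NULL);
            printf("Platform: %s\n", pname);

            cl_device_id devices[8];
            cl_uint num_devices;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &num_devices);
            for (cl_uint d = 0; d < num_devices; ++d) {
                char dname[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof dname, dname, NULL);
                printf("  Device: %s\n", dname);
            }
        }
        return 0;
    }

If both the NVidia and AMD platforms are listed, the OpenCL side of the setup is done.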

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL”

OpenCL Potentials: Investment-industry

This is the second in the series “OpenCL potentials“. I chose this industry because it is the finest example of a field where you are always late, even if you were first – so it always has to be faster if you want to make better analyses. Before I started StreamHPC I worked for an investment company, and one of the things I did was reverse-engineering a few megabytes of code with the primary purpose of updating the documentation. I then made a proof-of-concept to show the data processing could be accelerated by a factor of 250-300 using Java tricks only and no GPGPU. That was the moment I started to understand that real-time data computation was certainly possible, and also that IO is the next bottleneck after computational power. Though I am more interested in other types of research, I do have this background and therefore try to give an overview of this sector and why it matters.
Continue reading “OpenCL Potentials: Investment-industry”

AMD OpenCL coding competition

The AMD OpenCL coding competition seems to be Windows 7 64-bit only. So if you are on another version of Windows, on OSX or (like me) on Linux, you are left behind. Of course StreamHPC supports software that just works anywhere (seriously, how hard is that nowadays?), so here are the instructions for entering the competition when you work with Eclipse CDT. Why it only works with 64-bit Windows I don’t really get (but I understood it was a hint).

I focused on Linux, so it might not work with Windows XP or OSX right away. With a little hacking, I’m sure you can change the instructions to work with e.g. Xcode or any other IDE that can import C++ projects with makefiles. Let me know if it works for you and what you changed.

Continue reading “AMD OpenCL coding competition”

The current state of WebCL

Years ago Microsoft was in court because it claimed Internet Explorer could not be removed from Windows without breaking the system, while competitors claimed it could. Why was this so important? Because (as it seems) the browser would become more important than the OS, and the internet as important as electricity in the office and at home. I was therefore very happy to see the introduction of WebGL, which brings OpenGL to the browser, as this pushes web interfaces as the default for user interfaces. WebCL does the same for OpenCL kernels, meaning that more powerful hardware becomes available to JavaScript. This post is work-in-progress as I try to find more resources! Seen stuff like this? Let me know.

Continue reading “The current state of WebCL”

Dutch: Free knowledge morning on the new generation of processors

Heard somewhere that graphics cards can nowadays be used for heavy computations? Picked up something about vector processors as a complement to scalar processors during a coffee chat? Then it is time to get an overview of the big changes in the processor field, so you can steer your organisation better when it comes to innovation.

See https://streamhpc.com/education/gratis-kennislunch/ for an hour of explanation on location.

Who is this knowledge morning for?

Companies for whom speed is important and who have to process large amounts of data. For example computing centres, R&D departments, financial institutions, developers of medical software, algorithm developers and vision companies. Investors with high-tech companies in their portfolio can also be brought up to date on the current developments free of charge.

You don’t need a technical background, but you won’t be bored if you speak bits & bytes. We ask you to indicate your background, so we can add the right details to the programme.

What is the programme?

In the first hour you will hear how the current processor market has changed compared to a few years ago – and which new software development methods have been introduced. After that you get an overview of the new solutions that are available and how they relate to the existing ones. This gives you enough insight to determine whether they are applicable within your company. The hour is concluded with what StreamHPC can do for you, but also with what you can do yourself.

In the second hour we discuss a few use-cases and there is time for questions. Which use-cases are discussed depends on the backgrounds of the attendees; think of, for example, Monte Carlo, physics, enzyme reactions, matrix computations and neural networks.

When?

Once there are at least 10 registrations, a date will be picked.

If there is immediate interest within your company, StreamHPC can also visit you to give this presentation, adapted to your background. Contact us to arrange this.

PDFs of Monday 19 September

Already the fourth PDF-Monday. It takes quite some time, so in the future I might keep it to 10 – but till then, enjoy! Not sure which one to read? Pick the first one (the rest are in no particular order).

Edit: it is also the last one; follow me on Twitter to see the PDFs I’m reading. The reason is that hardly anyone clicked on the links to the PDFs.

I would like it if you let others know in the comments which PDFs you liked a lot.

Adding Physics to Animated Characters with Oriented Particles (Matthias Müller and Nuttapong Chentanez). Discusses how to accelerate the movement of pieces of cloth attached to bodies. No time to read? There are nice pictures.

Jon Peddie’s analysis of the GPU market.

Hardware/Software Co-Design. Simple Solution to the Matrix Multiplication Problem using CUDA.

CUDA Based Algorithms for Simulating Cardiac Excitation Waves in a Rabbit Ventricle. Bioinformatics.

Real-time implementation of Bayesian models for multimodal perception using CUDA.

GPU performance prediction using parametrized models (Master-thesis by Andreas Resios)

A Parallel Ray Tracing Architecture Suitable for Application-Specific Hardware and GPGPU Implementations (Alexandre S. Nery, Nadia Nedjah, Felipe M.G. Franca, Lech Jozwiak)

Rapid Geocoding of Satellite SAR Images with Refined RPC Model. An ESA-presentation by Lu Zhang, Timo Balz and Mingsheng Liao.

A Parallel Algorithm for Flight Route Planning with CUDA (Master-thesis by Seçkîn Sanci). About the travelling salesman problem and much more.

Color-based High-Speed Recognition of Prints on Extruded Materials. Product-presentation on how to OCR printed text on cables.

Supplementary File of Sequence Homology Search using Fine-Grained Cycle Sharing of Idle GPUs (Fumihiko Ino, Yuma Munekawa, and Kenichi Hagihara). They sped up the BOINC-system (Folding@Home). It is a bit vague what they want to tell, but maybe you’ll find it interesting.

Parallel Position Weight Matrices Algorithms (Mathieu Giraud, Jean-Stéphane Varré). Bioinformatics, DNA.

GPU-based High Performance Wave Propagation Simulation of Ischemia in Anatomically Detailed Ventricle (Lei Zhang, Changqing Gai, Kuanquan Wang, Weigang Lu, Wangmeng Zuo). Computation in medicine. Ischemia is a restriction in blood supply, generally due to factors in the blood vessels, resulting in damage or dysfunction of tissue.

Per-Face Texture Mapping for Realtime Rendering. A Siggraph2011 presentation by Disney and NVidia.

Introduction to Parallel Computing. The CUDA 101 by Victor Eijkhout of University of Texas.

Optimization on the Power Efficiency of GPU and Multicore Processing Element for SIMD Computing. Presentation on what you find out when putting the volt-meter directly on the GPU.

NUDA: Programming Graphics Processors with Extensible Languages. Presentation on NUDA to write less code for GPGPU.

Qt FRAMEWORK: An introduction to a cross platform application and user interface framework. Presentation on the Qt-platform – which has great #OpenCL-support.

Data Assimilation on future computer architectures. The problems projected for 2020.

Current Status of Standards for Augmented Reality (Christine Perey, Timo Engelke and Carl Reed). Not much to do with OpenCL, but it describes an interesting purpose for it.

Parallel Computations of Vortex Core Structures in Superconductors (Master-thesis by Niclas E. Wennerdal).

Program the SAME Here and Over There: Data Parallel Programming Models and Intel Many Integrated Core Architecture. Presentation on how to program the Intel MIC.

Large-Scale Chemical Informatics on GPUs (Imran S. Haque, Vijay S. Pande). Book-chapter on the design and optimization of GPU implementations of two popular chemical similarity techniques: Gaussian shape overlay (GSO) and LINGO.

WebGL, WebCL and Beyond! A presentation by Neil Trevett of NVidia/Khronos.

Biomanycores, open-source parallel code for many-core bioinformatics (Mathieu Giraud, Stéphane Janot, Jean-Frédéric Berthelot, Charles Delte, Laetitia Jourdan , Dominique Lavenier , Hélène Touzet, Jean-Stéphane Varré). A short description on the project http://www.biomanycores.org.

Interest in OpenCL

I have had this blog for over a year now and I want to show you where its visitors come from. Why? So you know where OpenCL is popular and where it is not. I chose an undisclosed period, so you cannot really reverse-engineer how many visitors I have – but the nice thing is that not much changes between a few days and a month. Unfortunately Google Analytics is not really great with maps (Greenland as big as Africa, hard to compare US states to EU countries, cities disappearing in world views, etc.), so I needed to do some quick image editing to make it somewhat clearer.

On the world view you see that most interest comes from three sub-continents: Europe, North America and South-East Asia. Africa is the real absentee here; except for some Arab countries and South Africa, there are only sporadic visits from the other countries. What surprises me is that the Arab countries are among my frequent visitors – this could be a language issue, but I expected about the same number of visitors as from e.g. China. In Latin America interest comes mostly from Brazil.

Continue reading “Interest in OpenCL”

PDFs of Monday 12 September

As sharing my readings became more popular, I decided to put them on my site. I focus on everything that uses vector-processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc). Did I miss something, or do you have a story you want to share? Contact me or comment on this article. If you tell others about the projects you discovered here, I would appreciate it if you mention my website or my twitter @StreamHPC.

The research papers have their authors mentioned; the other links can be presentations or overviews of (mostly) products. I have read all of them, except the long PhD theses (which are on my non-ad-hoc reading list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing how OpenCL/CUDA can give a 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna, José María González-Linares, José Ignacio Benavides, Emilio L. Zapata and Nicolás Guil). Describes two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA, and studies how load balancing and occupancy impact the performance of an application on a GPU. An interesting article, as it shows that you can choose which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). Measures the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic’ puts the kernel in atomic mode, which has a major influence on performance. They also come up with a solution: software-based atomic operations. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD’s APUs. This article measures its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs have a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ortí). On (remote) execution of CUDA software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or if you are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important factor of real-time data-processing is that people get more insight in the matter. As an example they use temperatures in a server-room. As I see more often now, they benchmark CPU, GPU and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabé and Antonio Plaza). A 30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. It is used for classification of satellite images.

TAU Performance System. Product presentation of TAU which does, among other things, parallel profiling and tracing. Supports CUDA and OpenCL. An extensive collection of tools, so worth spending time on. A paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VampirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using a TAU-based (see above) tool TAUcuda this paper describes where to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off-chart to have a paper on the Memory Management Unit of an ARM, but as the ARM processor gets more important, some insight into its memory system is important too.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluí Lérida). This research-group has developed a scheduling-technique, and in this paper they discuss in which situations theirs works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product presentation. They combine X86 CPUs with pre-programmed FPGAs to get high throughput. In short: if you make heavy use of the provided algorithms, this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target group is their own university, it gives nice insight into how things go at other universities and research groups. HPC is getting cheaper and accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil companies on finding new oil fields. These seismic calculations are quite computation-intensive and therefore portable HPC is needed. Know that StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MSc thesis by Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. Includes code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU’s VRAM while sequentially streaming the Foreign-Key-Index through the PCI-E Bus. Very interesting on how to make clever usage of I/O-bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

PDFs of Monday 5 September

Live from le Centre Pompidou in Paris: Monday PDF-day. I have never been inside the building, but it is a large public library where people are queueing to get in – no end to the knowledge-economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background on Neural Networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but it was just thrown onto the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: the FPGA needs the least watts per FLOPS, CUDA is the fastest, and the CPU is the big loser in this comparison. It must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). In my opinion a very important subject, as this can help automate the much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Design). [34 pag PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequence distance computation: bit.ly/n8dMis [PDF].

And while browsing around for PDFs I found the following interesting links:

  • Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
  • Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
  • Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
  • Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
  • LAMA: Library for Accelerated Math Applications for C/C++.

PDFs of Monday 29 August

This is the first PDF-Monday. It started because I used Mondays to read up on what happens around OpenCL, and I like to share that with you. It is a selection of what I find (somewhat) interesting – don’t hesitate to contact me about anything you want to know on accelerated software.

Parallel Programming Models for Real-Time Graphics. A presentation by Aaron Lefohn of Intel. Why a mix of data-, task-, and pipeline-parallel programming works better using hybrid computing (specifically Intel processors with the latest AVX and SSE extensions) than using GPGPU.

The Practical Reality of Heterogeneous Super Computing. A presentation by Rob Farber of NVidia on why discrete GPUs have a great future even when heterogeneous processors hit the market. Nice insights, as you can expect from the author of the latest CUDA book.

Scalable Simulation of 3D Wave Propagation in Semi-Infinite Domains Using the Finite Difference Method (Thales Luis Rodrigues Sabino, Marcelo Zamith, Diego Brandâo, Anselmo Montenegro, Esteban Clua, Maurício Kischinhevksy, Regina C.P. Leal-Toledo, Otton T. Silveira Filho, André Bulcâo). GPU based cluster environment for the development of scalable solvers for a 3D wave propagation problem with finite difference methods. Focuses on scattering sound-waves for finding oil-fields.

Parallel Programming Concepts – GPU Computing (Frank Feinbube). A nice introduction to CUDA and OpenCL. They missed task-parallel programming on hybrid systems with OpenCL, though.

Proposal for High Data Rate Processing and Analysis Initiative (HDRI). Interesting if you want to see a physics project where they have not yet decided whether to use GPGPU or a CPU cluster.

Physis: An Implicitly Parallel Programming Model for Stencil Computations on Large-Scale GPU-Accelerated Supercomputers (Naoya Maruyama, Tatsuo Nomura, Kento Sato and Satoshi Matsuoka). A collection of macros for GPGPU, tested on TSUBAME2.

MPI in terms of OpenCL

OpenCL is a member of a family of host-kernel programming language extensions; others are CUDA, IMPC and DirectCompute/AMP. It is characterised by a separate function or set of functions, referred to as a kernel, which is prepared and launched by the host to run in parallel. Added to that are deeply integrated language extensions for vectors, which give an extra dimension to parallelism.
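To make “prepared and launched by the host” concrete, here is a stripped-down sketch of my own of the host-side steps, assuming a kernel named "foo" has already been compiled into the given program and that a queue and buffer exist:

    #include <CL/cl.h>

    /* Sketch only; error handling omitted. Prepares and launches "foo" over n work-items. */
    void launch_foo(cl_program program, cl_command_queue queue, cl_mem buf, size_t n)
    {
        cl_kernel kernel = clCreateKernel(program, "foo", NULL);    /* prepare */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL,
                               0, NULL, NULL);                      /* launch in parallel */
        clFinish(queue);
        clReleaseKernel(kernel);
    }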

Apart from the vectors, there is much overlap between host-kernel languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused for years on how to make software parallel, they can give you an image of how OpenCL (and the rest of the family) will evolve. It also answers how MPI’s main concept, message-passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

On the right you see bees doing different things, which is easy to parallelise with MPI, but currently doesn’t have the focus of OpenCL (when targeting GPUs). Actually it is very easy to do this with OpenCL too, if the hardware supports it, such as CPUs.

Continue reading “MPI in terms of OpenCL”

Is OpenCL coming to Apple iOS?

Answer: No, or not yet. Apple tested Intel and AMD hardware for OSX, and not portable devices. Sorry for the false rumour; I’ll keep you posted.

Update: It seems that OpenCL is on iOS, but only available to system-libraries and not for apps (directly). That explains part of the responsiveness of the system.

On the thirteenth of August 2011, Apple asked the Khronos Group to test whether 7 unnamed devices are conformant with OpenCL 1.1. As Apple uses OpenCL-conformant hardware by AMD, NVidia and Intel in their desktops, the first conclusion is that they have been testing their iOS devices. A quick look at the list of iOS 5 capable devices gives the following potential candidates:

  • iPhone 3GS
  • iPhone 4
  • iPhone 5
  • iPad
  • iPad 2
  • iPod Touch 4th generation
  • Apple TV
If OpenCL comes to iOS soon (it has already been tested), iOS 5 would be the moment. The processors in iOS 5 devices are all capable of getting a speed-up from OpenCL, so it is not a nonsense feature. It could speed up many features, among them media conversion, security enhancements and manipulation of data streams. Where now the cloud or the desktop has to be used, in the future this can be done on the device.

Continue reading “Is OpenCL coming to Apple iOS?”

Power to the Vector Processor

Reducing energy-consumption is “hot”

I read the article “Nvidia is losing on the HPC front” by The Inquirer, which mixes up the demand for low-power architectures with the other side of the market: the demand for high performance. It made me think that it is apparently not that clear that these are two markets using the same technology. Also, Nvidia has proven it not to be true, since the supercomputer “Nebulae” uses almost half the watts per flop of the #1. How come? I quote The Register from an article that is a year old:

>>When you do the math, as far as Linpack is concerned, Jaguar takes just under 4 watts to deliver a megaflops at a cost of $114 per megaflops for the iron, while Nebulae consumes 2 watts per megaflops at a cost of $39 per megaflops for the system. And there is little doubt that the CUDA parallel computing environment is only going to get better over time and hence more of the theoretical performance of the GPU ends up doing real work. (Nvidia is not there yet. There is still too much overhead on the CPUs as they get hammered fielding memory requests for GPUs on some workloads.)<<

Nvidia is (and should be) very proud. But actually I’m already looking forward to when hybrids become more common. They will really shake up the HPC market (as The Register agrees) by lowering the latency between GPU and CPU and lowering energy consumption. But a bigger market can be found in mobile.

Continue reading “Power to the Vector Processor”