Happy New Year!

Reading Time: 1 minute

About a year ago this site was launched and a half year ago StreamHPC as a company was official for the Chamber of Commerce. It has been a year of hard work, but the reason for this all started after seeing the cover of a book about bore-outs. The result is there with a growing number of visitors from all over the world (from 62 countries since 23-Dec-2010) and new twitter-followers every week. Now some mixed news for 2011:

  • We are soon going to release a few plugins for Eclipse, both free and paid, to simplify your development.
  • 2011 will be the year of hybrid processors (Intel SandyBridge and AMD Fusion), which will make OpenCL much more popular.
  • 2011 is also going to be the year of the smart-phone (prognosis: in 2011 more smart-phones will be sold than PCs). So even more OpenCL-potential.
  • At 31-Dec-2010 we migrated the site to a faster server to reduce waiting-time also online.
  • The book will be released in parts, to avoid more delays.
  • There will be around ten (short) articles published in January. Both developers and managers will be served.
  • Our goal is to expand. We have shown you our vision, but we want to show you more.

In a few words: 2011 is going to be exciting! We wish all our readers, business-partners, friends, family and (new) customers a super-accelerated 2011!

StreamHPC – we accelerate your computations

DirectCompute’s unpopularity

Reading Time: 4 minutes

In the world of GPGPU we have currently 4 players: Khronos OpenCL, NVIDIA CUDA, Microsoft DirectCompute and PathScal ENZO. You probably know CUDA and OpenCL already (or start reading more articles from this blog). ENZO is a 64bit-compiler which serves a small niche-market, and DirectCompute is built on top of CUDA/OpenCL or at least uses the same drivers.

Edit 2011-01-03: I was contacted by Pathscale about my conclusions about ENZO. The reason why not much is out there is that they’re still in closed alpha. Expect more to hear from them about ENZO somewhere in the coming 3 months.

A while ago there was an article introducing OpenCL by David Kanter who claimed on page 4 that DirectCompute will win from CUDA. I quote:

Judging by history though, OpenCL and DirectCompute will eventually come to dominate the landscape, just as OpenGL and DirectX became the standards for graphics.

I twittered that I totally disagreed with him and in this article I will explain why I think that.

Continue reading “DirectCompute’s unpopularity”

OpenCL Fireworks

Reading Time: 3 minutes

I like and appreciate differences in the many cultures on our Earth, but also like to recognise different very old traditions everywhere to feel a sort of ancient bond. As an European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree, each December – and the burning of al those trees in January. Also celebration of New Year falls on different dates, the Chinese new year being the best known (3 February 2011). We – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

Continue reading “OpenCL Fireworks”


OpenCL Potentials: Medical Imaging

Reading Time: 4 minutes

Photo by Eugene MahWhen you ever saw a CT or MRI scanner, you might have noticed the full-sized computer next to it (especially the older ones). There is quite some processing power needed to keep up with the data-stream coming from the scanner, to process the data to a 3D-image and to visualise the data on a 2D-screen. Luckily we have OpenCL to make it even faster; which doctor doesn’t want real-time high-resolution results and which patient doesn’t want to see the results on Apple iPad or Samsung Galaxy Tab?

Architects, bankers and doctors have one thing in common: they get a better feeling for the current subject if they can play with the data. OpenCL makes it possible to process data much faster and thus let the specialist play with it. The interesting part of IT is that it is in every domain now and therefore a new series: OpenCL-potentials.

Continue reading “OpenCL Potentials: Medical Imaging”

OpenCL on the CPU: AVX and SSE

Reading Time: 5 minutes

When AMD came out with CPU-support I was the last one who was enthusiastic about it, comparing it as feeding chicken-food to oxen. Now CUDA has CPU-support too, so what was I missing?

This article is a quick overview on OpenCL on CPU-extensions, but expect more to come when the Hybrid X86-Processors actually hit the market. Besides ARM also IBM already has them; also more about their POWER-architecture in an upcoming article to give them the attention they deserve.

CPU extensions

SSE/MMX started in the 90’s extending the IBM-compatible X86-instruction, being able to do an add and a multiplication in one clock-tick. I still remember the discussion in my student-flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I “found” on the internet made use of SSE-capabilities. Currently we have reached SSE5 (by AMD) and Intel introduced a new extension called AVX. That’s a lot of abbreviations! MMX stands for “MultiMedia Extension”, SSE for “Streaming SIMD Extensions” with SIMD being “Single Instruction Multiple Data” and AVX for “Advanced Vector Extension”. This sounds actually very interesting, since we saw SIMD and Vectors op the GPU too. Let’s go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.

Continue reading “OpenCL on the CPU: AVX and SSE”

Thalesians talk – OpenCL in financial computations

Reading Time: 5 minutes

End of October I had a talk for the Thalesians, a group that organises different kind of talks for people working or interested in the financial market. If you live in London, I would certainly recommend you visit one of their talks. But from a personal perspective I had a difficult task: how to make a very diverse public happy? The talks I gave in the past were for a more homogeneous and known public, and now I did not know at all what the level of OpenCL-programming was of the attendants. I chose to give an overview and reserve time for questions.

After starting with some honest remarks about my understanding of the British accent and that I will kill my business for being honest with them, I spoke about 5 subjects. Some of them you might have read here, but not all. You can download the sheets [PDF] via this link: Vincent.Hindriksen.20101027-Thalesians. The below text is to make the sheets more clear, but certainly is not the complete talk. So if you have the feeling I skipped a lot of text, your feeling is right.

Continue reading “Thalesians talk – OpenCL in financial computations”

Engineering GPGPU into existing software

Reading Time: 3 minutes

At the Thalesian talk about OpenCL I gave in London it was quite hard to find a way to talk about OpenCL for a very diverse public (without falling back to listing code-samples for 50 minutes); some knew just everything about HPC and other only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in common) into existing software. The reason is that we all have built nice, cute little programs which were super-fast, but it’s another story when it must be integrated in some enterprise-level software.


The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice matter (i.e. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive on the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed when the code will be in the production-code too. Programmers just have to understand the cost of delay, but luckily is coming to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as “platform”; meaning that your software is chosen to include all small ideas and scripts your colleagues have come up “which are almost the same as your software does, only little different”. This will turn the platform into something unmanageable. That is a different kind of software-acceptance!

Continue reading “Engineering GPGPU into existing software”

New grown-ups on the block

Reading Time: 5 minutes

Members of the band There is one big reason StreamHPC chose for OpenCL and that is (future) hardware-support. I talked about NVIDIA versus AMD a lot, but knowing others would join soon. AMD is correct when they say the future is fusion: hybrid computing with a single chip holding both CPU- and GPU-cores, sharing the same memory and interconnected at high speed. Merging the technologies would also give possible much higher bandwidths to memory for the CPU. Let us see in short which products from experienced companies will appear on the OpenCL-stage.

Continue reading “New grown-ups on the block”

OpenCL – the battle, part III

Reading Time: 5 minutes

The first two parts described hardware-companies and operating systems, programming languages and software-companies, written about half a year ago. Now we focus on what has driven NVIDIA and ATI/AMD for decades: games.

Disclaimer: this is an opinion-piece on the current market. We are strong supporters of OpenCL and all companies which support it too. Since our advise on specific hardware in a consult will be based on specific demands on the customer, we could advise differently than would be expected on the below article.


Computer games are cool; merely because you choose from so many different kinds. While Tetris will live forever, the latest games also have something to add: realistic physics simulation. And that’s what’s done by GPUs now. Nintendo has shown us that gameplay and good interaction are far more important than video-quality. The wow-factor for photo-realistic real-time rendering is not as it was years ago.
You might know the basics for falling objects: F = m*g (Force = Mass times Gravity-acceleration), and action = – reaction. If you drop some boxes, you can predict falling speed, interaction, rotation and possible change of centre of gravity from a still image as a human being. A computer has to do a lot more to detect collision, but the idea is very doable on a fast CPU. A very well-known open source library for these purposes is Bullet Physics. The nice thing comes, when there is more than just a few boxes, but thousands of them. Or when you walk through water or under a waterfall, see fire and smoke, break wood but bend metal, etc. The accelerometer of the iPod was a game-changer too in the demand for more realism in graphics. For an example of a “physics puzzle game” not using GPGPU see World of Goo (with free demo) – for the rest we talk more about high-end games. Of current game-ready systems PCs (Apple, Linux and Windows) have OpenCL support, Sony PlayStation 3 is now somewhat vague and the Xbox 360 has none.

The picture is from Crysis 3, which does not use OpenCL, as we know it.

Continue reading “OpenCL – the battle, part III”

OpenCL in the Clouds

Reading Time: 4 minutes

Buzz-words are cool; they are loosely defined and are actually formed by the many implementation that use the label. Like Web 2.0 which is cool javascript for the one and interaction for the other. Now we have cloud-computing, which is cluster-computing with “something extra”. More than a year ago clouds were in the data-centre, but now we even have “private clouds”. So how to incorporate GPGPU? A cluster with native nodes to run our OpenCL-code with pre-distributed data is pretty hard to maintain, so what are the other solutions?

Distributed computing

Folding@home now has support for OpenCL to add the power of non-NVIDIA GPUs. While in clusters the server commands the clients what they have to do, here the clients ask the server for jobs. Disadvantage is that the clients are written for a specific job and are not really flexible to take different kind of jobs. There are several solutions for this code-distribution-problem, but still the solution is not suitable for smaller problems and small clusters.

Clusters: MPI

The project SHOC (Scalable HeterOgeneous Computing) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general purpose computing, and the software used to program them. While it is only a benchmark, it can be of great use when designing a cluster. For the rest I only found CUDA MPI-solutions, which are not ported to OpenCL yet.

Also check out Hoopoe, which is a cloud-computing service to run your OpenCL-kernels in their cloud. It seems to be more limited to .NET and have better support for CUDA, but it is a start. In Europe there is a start-up offering a rent-model for OpenCL-computation-time; please contact us if you want to get in contact with them.

Clusters: OpenMP

MOSIX has added a “Many GPU Package” to their cluster management system, so it now allows applications to transparently use cluster-wide OpenCL devices. When “choosing devices” not only the local GPU pops up, but also all GPUs in the cluster.
It works disk-less, in the way no files are copied to the computation-clients and all stays in-memory. Disk-less computations have an advantage when cloud-computer are not fully trusted. Take note that on most cloud-computers the devices need to be virtualised (see next part).

Below is its layered model, VCL being the “Virtual OpenCL Layer”.

They have chosen to base it on OpenMP; while the kernels don’t need to be altered, some OpenMP-code needs to be added. They are very happy to tell it takes much less code to use openMP instead of MPI.

You see a speed-up between 2.19 and 3.29 on 4 nodes is possible. We see comparable cluster-speed-ups in an old cluster-study. The actual speed-up on clusters depends mostly on the amount of data that needs to be transferred.

The project references to a project called remote CUDA, which only works with NVIDIA-GPUs.

Device Virtualisation

Currently there is no good device virtualisation for OpenCL. The gVirtuS-project currently only supports CUDA, but they claim it is easily rewritten to OpenCL. Code needs to be downloaded with a Mercurius-client (comparable to GIT and in repositories of most Linux-distributions):
> hg clone http://osl.uniparthenope.it/hg/projects/gvirtus/gvirtus gvirtus
Or download it here (dated 7-Oct-2010).

Let me know when you ported it to OpenCL! Actually gVirtuS does not do the whole trick since you need to divide the host-devices between the different guest-OSes, but luckily there is an extension which provides sharing of devices, called fission. More about this later.

We can all agree there still needs to be done a lot in this area of virtualised devices to get OpenCL in the cloud. If you can’t wait, you can theoretically use MOSIX locally.


A cloud is the best buzz-word to market a scalable solution to overcome limitations of internet connected personal devices. I personally think the biggest growth will be in personal clouds, so companies will have their own in-house cloud-server (read: clusters); people just want to have a feeling of control, comparable with preference of a daily traffic jam above public transport. But nevertheless shared clouds have potential if it comes to computation-intensive jobs which do not need to be done all year round.

The projects presented here are a start to have OpenCL-power at a larger scale for more demanding cases. Since we can have more power at our fingertips with one desktop-pc stuffed with high-end video-cards than a 4-year-old supercomputer-cluster, there is still time

Please send your comment if I missed a project or method.

Learning both OpenCL and CUDA

Reading Time: 3 minutes

Be sure to read Taking on OpenCL where I’ve put my latest insights – also for CUDA.

The two¹ “camps” OpenCL and CUDA both claim you should first learn their language first, after which the other would be easy to learn. I’m from the OpenCL-camp, so I say you should learn OpenCL first, but with a strong emphasis on hardware-architecture understanding. If I had chosen for CUDA I would have said the opposite, so in other words it does not matter which you do first. But psychology tells us that you probably like the first language more since there is where you discovered the magic; also most people do not like to learn a second language which is much alike and does not add a real difference. Most programmers just want to get the job done and both camps know that. Be aware of that.

NVIDIA is very good in marketing their products, AMD has – to say it modest – a lower budget for GPGPU-marketing. As a programmer you should be aware of this difference.

The possibilities of OpenCL are larger than those of CUDA, because of task-parallel programming and support for far more different architectures. At the other side CUDA is much more user-friendly and has a lot of convenience built-in.

Continue reading “Learning both OpenCL and CUDA”

Waiting for Mobile OpenCL

Reading Time: 2 minutes

People who follow me, know my interest in ARM Cortex CPU & Mali GPU and Imagination Technology’s PowerVR, regarding OpenCL-potential. Here is an overview of what I found out until now and has more open ends than answers. I was very hopeful about access to mobile OpenCL for developers, but now I think it will have to wait until 2011. It seems that the mobile phone makers keep the power to themselves in the form of system-libraries instead of giving direct access. Actually a wise choice, given the fact that a bad kernel could easily crash the phone.

Imagination Technology’s PowerVR SGX product provides the GPU-power for the iPhone 4 and various new Android phones. The company does not provide an OpenCL SDK directly to end-users, but leaves the responsibility to the implementers of their IP. Strangely enough they offer SDKs for OpenGL, and say “no comment” to a pretty direct forum-question. In other words: we don’t know what to expect.

Samsung has the only official implementation for ARM (ARM Cortex-A9 MPCore CPU) for OpenCL. Samsung just released their home-made Linux-based mobile phone OS “Bada”, but there is not a sign of OpenCL in their API. Samsung being the only one who could open an API to developers on their phones officially, does not make me smile yet.

There are many, many voices that Apple iOS 4 has support for OpenCL, but then only in the system-libraries. Given the fact that Apple is a big fan of OpenCL, we can assume this is true. See the web for a lot of articles about this.

ZiiLabs is open about their support for OpenCL in their ZMS-05 processor, and has an early access program. Early access still means waiting.

Qualcomm’s Snapdragon does not mention OpenCL at all, while it powers the more powerful smartphones. It mentions a recent job-post, so you know the drill: wait.

Android has good support for OpenGL ES, but not officially for OpenCL.You might have heard of the particle experiment for Android, which is actually OpenGL-based. It also mentions the Android-version of the bullet engine (broken link) for physics-computations, but also no OpenCL. It doesn’t look like Android “Gingerbread” will have support.

So what did you learn after reading this? That developers still won’t have access to OpenCL-API on a mobile platform, and that you have to wait until next year or enter ZiiLabs’ early access program. More about the good and bad of hiding OpenCL on this blog.

Khronos OpenCL presentation at SIGGRAPH 2010

Reading Time: 4 minutes

Here you find the videos uploaded by Khronos of their presentation about OpenCL. I added the time-line, so you can scroll to the more interesting parts easily. The presentation by Ofer Rosenberg of Intel and Cliff Woolly of NVIDIA were not uploaded (yet). Please note that for non-American people the speech of Affi Munchie is hard to hear; luckily his sheets explain most.

For the first two presentations the sheets can be downloaded from the Khronos-website. The time-line has the sheet-numbers mentioned.

0:00 [sheet 1] Presentation by the president of Khronos and chair of the session: Neill Trevett of NVIDIA.
0:06 [sheet 2] Welcome and a quick overview
1:12 [sheet 3] The prizes for the attendees (not us, online viewers)
1:40 [4] Overview of all members of Khronos. Khronos does not only take care of OpenCL but also the more famous OpenGL and projects like Collada.
2:26 [5] Processor Parallelism. CPUs are getting more parallel and GPUs more programmable. The overlapping area is called Heteregenous Computing and there is where OpenCL pops up.
3:10 [6] OpenCL timeline from version 1.0 to 1.1.
4:44 [7] OpenCL workinggroup with only 30 logos. He mentions missing logos like the one from Apple.
5:18 [8] The Visual Computing Ecosystem, where OpenCL interoperability with other standards are shown. The talk is not complete, so I don;t know if he talks about DirectX.

Continue reading “Khronos OpenCL presentation at SIGGRAPH 2010”

To think about: an OpenCL Wizard

Reading Time: 4 minutes

A part of reaction on my earlier post was “VB Programmers span the whole range from hobbyists to scientists and professions. There is no reason that VB programmers should be left out of the loop in GPU programming. (…) CUDA and OpenCL are fundamentally lower level languages that do not have the same level of user-friendlyness that VB gives, and so they require a bit more care to program. (…)“. By selecting parts, I probably put the emphasis very wrong, but it brought me an idea: “how a Visual Basic Wizzard for OpenCL” would look like.

It is actually very interesting to think how to get GPGPU-power to a programmer who’s used to “and then and then and then”; how do you get parallel processing into the mind of an iterative thinker? Simplification is not easy!

By thinking about an OpenCL-wizard, you start to understand the bottlenecks of, and need for OpenCL initialisation.

Actually it could better be built into an existing framework like OpenMP (or something alike in Python, Java, etc) or the IDE could give hints that the standard-function could be replaced by a GPGPU-version. But we just want a wizard which generates “My First OpenCL Program”, which puts a smile on the face of programmers who use their mouse a lot.

Continue reading “To think about: an OpenCL Wizard”

The rise of the GPGPU-compilers

Reading Time: 3 minutes

Painting "High Rise" made by Huma Mulji
Painting “High Rise” made by Huma Mulji

If you read The Disasters of Visual Designer Tools you’ll find a common thought about easy programming: many programmers don’t learn the hard-to-grasp backgrounds any more, but twiddle around and click a program together. In the early days of BASIC, you could add Assembly-code to speed up calculations; you only needed tot understand registers, cache and other details of the CPU. The people who did that and learnt about hardware, can actually be considered better programmers than the C++ programmer who completely relies on the compiler’s intelligence. So never be happy if the control is taken out of your hands, because it only speeds up the easy parts. An important side-note is that recompiling easy readable code with a future compiler might give faster code than your optimised well-engineered code; it’s a careful trade-off.

Okay, let’s be honest: OpenCL is not easy fun. It is more a kind of readable Assembly than click-and-play programming. But, oh boy, you learn a lot from it! You learn architectures, capabilities of GPUs, special purpose processors and much more. As blogged before, OpenCL probably is going to be the hidden power for non-CPUs wrapped in something like OpenMP.

Continue reading “The rise of the GPGPU-compilers”

OpenCL 1.1 changes compared to 1.0

Reading Time: 4 minutes

This blog-entry is of interest for you, if you don’t want to read the whole new specifications [PDF] for OpenCL 1.1, but just want an overview of the most important changes differences with 1.0.

The news-release sums up the changes for 1.1 like this:

  1. New datatypes including 3-component vectors and additional formats
  2. Handling command from multiple hosts and processing buffers across multiple devices
  3. Operations on regions of a buffer including read, write and copy of 1D, 2D and 3D rectangular regions
  4. Enhanced use of events to drive and control command execution
  5. Additional OpenCL C built-in functions such as integer clamp, shuffle and asynchronous strided copies
  6. Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

Furthermore we can read the update is completely backwards-compatible with version 1.0. The obvious macro‘s CL_VERSION_1_0 and CL_VERSION_1_1 are added to handle the versioning, but what’s more? This blog-post discusses most changes and with some subjective opinions added to it.

Additional Formats

3-component vectors

We only had 2-, 4-, 8 or 16-component vectors, but not 3 which actually was somewhat strange. The functions vload3, vload_half3, vloada_half3 and vstore3, vstore_half3, vstorea_half3 have been added to the family. Watch out, that for the half-functions the offset is calculated somewhat different compared to the even-sized vectors. In version 1.0 you could have chosen for a 4-component vector when using a lot of calculations, or a struct. If you see the new function vec_step below, it seems that it is not more memory-efficient to use this vector instead of a 4-component vector.

RGB with Padding

We have support CL_RGB, CL_RGBa (= RGB with an alpha-channel) and now also RGBx (with padding-channel). The same variants are there for CL_R and CL_RG. Good for graphics-programmers, or for easier reading of 32 bpp BMPs.

Cloud Computing / Multi-user Environments

The support for different hosts gives possibilities for cloud-computing. Side-note: cloud-computing is another word for multi-user environments, with some promotion for big data-centres. All API-functions except clSetKernelArg are thread-safe now; but only when kernels are not shared between hosts; see appendix A.2 for more information. The important part is that you think clearly about how to design your software if you now can assume others can take your resources now too. OpenCL already needed a lot of planning when claiming and releasing resources, so you’re probably already mastering it; now just check more often how much resources are available.

Region-specific Operations

Regions make it possible to split a big buffers to form a queue without having to keeping track of dimensions and offsets during operations, run from host. See clCreateSubBuffer for more information. A real convenience, but watch out when writing to overlapping buffers. The functions clEnqueueCopyBufferRec, clEnqueueReadBufferRect en clEnqueueWriteBufferRect helps synchronising commands to copy, read to or write from a region.

Enhanced Control-room

My favourite description of the host is “the control-room”, since you are not over there on the device but in Houston. The more control and information, the better. The new events are clSetMemObjectDestructorCallback, clCreateUserEvent, clSetUserEventStatus and clSetEventCallback. The first event-listener lets you know when resources a freed, so you can keep track. User-events can be put in the event_wait_list in various functions just like the built-in events; the function will start when all events are CL_COMPLETE. With clSetEventCallback immediate actions-on-events can be programmed; combined with the user-events the programmer got some powerful tools. See the example at clSetUserEventStatus for how to use the user-events.

OpenGL events and Direct3D support

The function clCreateEventFromGLsyncKHR links a CL-event to a GL-event by just giving the name of the OpenGL-event. See gl_sharing for more info.

OpenCL has now support for Direct3D 10, which is great! This might also be a good step to make DirectCompute lighter. See cl_khr_d3d10_sharing for more info. Welcome DirectX-developers! One favour: please be aware that DirectX works on Windows only, not on Apple OSX or iOS, (Embedded) Linux or Symbian. If you use clean calls, it will be more easy to port to other platforms.

Other New Kernel-functions

The following new functions were added to the kernel:

  • get_global_offset: returns the offset of the enqueued kernels.
  • minmag and maxmag: returns the argument with the minimum or maximum distance to zero, falls back to fmin and fmax if distance is equal or an argument is NaN. Example: maxmag(-5, 3) = -5, minmag(-3, 3) = -3.
  • clamp: returns boundary-values if the given number is not between the boundaries.
  • vec_step: returns the number of elements in a scalar or a vector. A scalar returns 1, a vector 2, 4, 8 or 16. If the size is 3, the function returns 4.
  • shuffle and shuffle2: shuffles one or two vectors given another vector with the indices of the new order. Indeed plain old permutations.
  • async_workgroup_strided_copy: buffers between global and local memory on the device. When used correctly, this can overcome some of the hassle when you need to work on global memory objects, but need more speed. Correct usage is described in the reference.

The functions min and max now also work component-wise with a vector as first argument and a scalar as second. Min({2, 4, 6, 8}, 5) will give {2, 4, 5, 5}.


While the many revisions of OpenCL 1.0 were really minor and not a lot attention was paid to them, 1.1 is a big step forward. If you see what has been done to multi-user environments, NVidia and AMD have a lot of work to do with their drivers.

You can read in revision 33 there has been some heated discussion and there was pressure on the decision:

>>Should this extension be KHR or EXT?
PROPOSED: KHR. If this extension is to be approved by Khronos then it should be KHR, otherwise EXT. Not all platforms can support this extension, but that is also true of OpenGL interop.

The part “not all platforms” is very politically written down, since exactly one platform supports this specific extension. I have seen too many of these pressured discussions and I hope Khronos is stronger than i.e. ISO and OpenCL will remain as open as OpenGL.

I’m very happy with the new version, since there is more control with loads of extra events, now multiple hosts are possible, and the forgotten 3-component vector was added. Now let me know in the comments what you think of the new version.

By the way, not all is new. Deprecated are clSetCommandQueueProperty and the __ROUNDING_MODE__ macro.

Let’s enter the Top500 HPC list using GPUs

Reading Time: 2 minutes

The #500 super-computer has only 24 TFlops (2010-06-06): http://www.top500.org/system/9677

update: scroll down to see the best configuration I have found. In other words: a cluster with at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFlops single precision, 1 TFLOPS double precision) would enter the Top500. 25 nodes to get to a theoretic 25TFlops and 5 extra for overcoming the overhead. So for about €60 000,- of hardware anyone can be on the list (and add at least €13 000 if you want to use Windows instead of Linux for some reason). Ok, you pay most for the services and actual building when buying such a cluster, but you get the idea it does not cost you a few millions any more. I’m curious: who is building these kind of clusters? Could you tell me the specs (theoretical TFlops, LinPack TFlops and watts/TFlop) of your (theoretical) cluster, which costs the customer less then €100 000,- in total? Or do you know companies who can do this? I’ll make a list of companies who will be building the clusters of tomorrow, the “Top €100.000,- HPC cluster list”. You can mail me via vincent [at] this domain, or put your answer in a comment.

Update: the hardware shopping-list

Nobody told in the remarks it is easy to build a faster machine than the one described above. So I’ll do it. We want the most flops per box, so here’s the wishlist:

  • A motherboard with as many slots as possible for PCI-E, CPU-sockets and memory-banks. This because the lag between the nodes is high.
  • A CPU with at least 4 cores.
  • Focus on the bandwidth, else we will not be able to use all power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (which for some reason people there love to talk about extreme machines). AMD currently has the graphics cards with the most double precision power, so I chose for their products. I’m looking around for Intel + Nvidia, but currently they are far behind. Is AMD back on stage after being beaten by Intel’s Core-products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3-socket, 6(!) PCI-e slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for the 6 cores and the low price per FLOPS. And to make it a monster, we add 6 times a AMD HD5970 (€599,-) giving 928 x 6 = 3264 DP-GLOPS. If it can handle 16GB DDR3 (€750,-), so we put it in. It needs about 3 Power-supplies of 700 Watt (€100,-). We add 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). Case needs to house the 3 power supplies (€100,-). Cooling is important and I suggest you compete with a wind-tunnel (€500,-). It will cost you €6228,- for 5,6 Double Precision TFLOPS, and 27 TFLOPS single precision. A cluster would be on the HPC500-list for around €38000,- (pure hardware-price, not taking network-devices too much into account, nor the price for man-hours).

Disclaimer: this is the price of a single node, excluding services, maintenance, software-installation, networking, engineering, etc. Please note that the above price is pure for building a single node for yourself, if you have the knowledge to do so.

Nokia Maemo and OpenCL

Reading Time: 3 minutes

Update 21-06-2011: Bumped into a project by Nokia: CLEP, “OpenCL Embedded Profile” for the N900.

Maemo is the Debian based Linux-distribution of Nokia for embedded devices. It is on the gadget N900, so you can be root on your own phone and compile your own kernel. In other words: a great developer’s phone.

Which smartphone to buy when you want to toy around with OpenCL “Embedded Profile”? There is more and more evidence that the next iPhone OS will have support for OpenCL, as should be expected Apple being the trademark-owner of OpenCL. This is good, since the mobile market could make the difference for the technique – competing with CUDA and DirectCompute. “The other ARM Cortex-A8 smartphone”, the Nokia N900 does not support it, while the magic of OpenCL attracts to many developers on the Maemo-forums.

The QT-blog that disclosed coming OpenCL-support for QT, spoke about it too:

>>Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we’ve tested it under Linux, Mac, and Windows). Embedded devices are currently another matter – OpenCL implementations are still very basic in that space. The performance improvements on embedded CPU’s are only slightly better than using ARM/NEON instructions for example. And embedded GPU’s are usually hard-wired for GLSL/ES, lacking many of the features that makes OpenCL really sing. But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, hopefully we can stimulate the embedded vendors to accelerate development by giving them something to test with. Be the first embedded device on the block to get the mandelbrot demo running at 10fps, or 20fps, or 60fps!<<

But checking the whole Nokia QT/Maemo-SDK for something like “opencl.h” or words like “opencl” and “khronos” in .h-files did not return anything interesting. The missing reference in the SDK tells me, we cannot expect any OpenCL-implementation on the N900 soon. So do we have to wait for the Nokia N920, Maemo 6 and QT 4.8? Once I know more, by getting deeper into the SDK, you’re the first to know. But first let me show you the documents which tells us OpenCL is coming to the Maemo-platform.

The Maemo Base Port Document, version 1.1

Exhibit number 1. The introduction tells us that the document describes what hardware-designers should do to get Maemo working on their device:

>>When Maemo is ported to a new chipset and HW environment, the majority of the SW worktakes place in the base layer. However, some adjustments may also be needed in the otherlayers. The porting work as a whole is a combined effort by the chipset vendor and Nokia. Thisdocument describes the deliverables expected from the chipset vendor in such an effort. The requirements in this document are expressed in the form of SW component, interface andfunctional requirements. Note that in many cases more detailed discussions are neededbetween Nokia and the chipset vendor to reach a common understanding about the specificsof the system architecture and the required component versions, functionality and interfaces.<<

So the document describes what the hardware must support, to be able to run Maemo. Let’s then find the magic word “OpenCL”:

>>Graphics Adaptation. The Base Port graphics adaptation interfaces consist of X11, OpenGL ES, and OpenVG interfaces. The OpenCL interface is also included in this group since it typically is used to access the GPU for general-purpose parallel computation.<<

And somewhat below:

>>OpenCL 1. The Base Port should provide an implementation of the OpenCL 1.0 interface for general-purpose parallel programming of heterogeneous systems, especially for the use of GPUs for computation (Khronos group standard).<<

That seems to be pretty clear that Maemo-devices must be able to support OpenCL.


Paper “OpenCL on Embedded devices” by Nokia

Exhibit 2 shows tests of a few simple OpenCL-program on an unnamed device with a TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz POWERVR SGX530 GPU) – which happens to be in the Motorola Droid, Palm Pre, and Nokia N900. So they managed to create a OpenCL-implementation on ARM. If you’re interested in OpenCL for embedded devices, please do read this presentation:


It is a document from august 2009, which shows they actually were trying POWERVR and OpenCL then. Now with QT and Maemo mentioning it, we can be very sure the N900 or the N920 is eventually going to have OpenCL-support.

Qt Hello World

Reading Time: 2 minutes

The earlier blog-post was about how to use Qt Creator with OpenCL. The examples are all around Images, but nowhere a simple Hello World. So here it is: AMD’s infamous OpenCL Hello World in Qt. Thank’s to linuxjunk for glueing the parts together.

Have fun!

Using Qt Creator for OpenCL

Reading Time: 3 minutes

More and more ways are getting available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. Since a while my favourite way to go is Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the the commercial version of Qt Creator costs at least €2995,- a year. I must also warn the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, MAC and the other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt, really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL, by using GIT:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and MAC you need to have the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for the Windows-OS. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>


Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL-headers by the SDK of ATI or NVidia (you’ll need the SDK anyway), or by Khronos. By example, on my laptop (NVIDIA, Ubuntu 32bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/


This should work. On MAC the directory is not CL, but OpenCL – I haven’t tested it if Qt took that into account.

After building , test it by setting a environment-setting “LD_LIBRARY_PATH” to the lib-directory in the plugin, and run the provided example-app ‘clinfo’. By example, on Linux:


cd util/clinfo/


This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/

<location of OpenCL-SDK include-files>

<location of opencl-plugin>/src/opencl/

By example:

LIBS += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/



The following screenshot shows how it could look like:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:location of opencl-plugin>:<location of OpenCL-SDK libraries>:

By example:


As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

<em>printf("OpenCL Platforms:n"); 
</em><em>QList platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) { 
   printf("    Platform ID       : %ldn", long(platform.platformId())); 
   printf("    Profile           : %sn", platform.profile().toLatin1().constData()); 
   printf("    Version           : %sn", platform.version().toLatin1().constData()); 
   printf("    Name              : %sn", platform.name().toLatin1().constData()); 
   printf("    Vendor            : %sn", platform.vendor().toLatin1().constData()); 
   printf("    Extension Suffix  : %sn", platform.extensionSuffix().toLatin1().constData()); </em><em> 
   printf("    Extensions        :n");
} QStringList extns = platform.extensions(); 
foreach (QString ext, extns) printf("        %sn", ext.toLatin1().constData()); printf("n");</em>

If it gives errors during programming (underlined includes, etc), focus on INCLUDEPATH in the project-file. If it complaints when building the application, focus on LIBS. If it complaints when running the successfully built application, focus on LD_LIBRARY_PATH.

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

Our training concepts for GPGPU

Reading Time: 3 minutes

It’s almost time for more nerdy stuff we have in the pipe-line, but we’ll keep for some superficial blah for a moment. We concentrate on training (and consultancy). There is a lot of discussion here about “how to design training-programs about difficult concepts for technical people”, or better: “how to learn yourself something difficult”. At the end of this blog, we’ll show you a list how to learn OpenCL yourself, but before that we want to share how we look at training you.

Disclaimer: this blog item is positive about our own training-program for obvious reasons. We are aware people don’t want (too much) spam, so we’ll keep this kind of blogs to the minimum. If you want to tell the world that your training-program is better, first mail us for our international partner-program. If you want the training, come back on 14 June or mail us.

OpenCL and CUDA are not the easiest programming languages due to incomparable concepts in software-land (You can claim Java is “slightly” different). Can the usual ways of training give you the insights and facts you need to know?

Current programs

Most training-programs are vendor-supported. People who follow us on Twitter, know we are not the best supporters of vendor locked products. So lets get a list of a typical vendor-supported training-programs, I would like to talk about:

  • They have to be difficult, so the student accomplishes something.
  • The exam are expensive to demotivate trial-on-error-students.
  • You get an official certificate, which guarantees a income-raise.
  • Books and trainings focus on facts you must learn.
  • It’s very clear what you must learn and what you can skip.

So in short, you chose you wanted to know the material and put a lot of effort in it. You get back more than just the knowledge.

Say you get the opposite:

  • They are easy to accomplish.
  • Exams is an assignment which you only need to finish. You can try endless times.
  • You don’t get a certificate, but you might get feedback and homework for self-study.
  • You get a list of facts you must learn; the concepts are explained to support this.
  • You are free to pick which subject you like.

That sucks! You cannot brag about your accomplishments and after the training you still cannot do anything with it; it will probably take years to actually finish it. So actually it’s very clear why the programs are like this, or can we learn from this opposing list? Just like with everything else, you never have to just copy what’s available but pick out the good parts.

Learning GPGPU

If you want to learn GPGPU, you have to learn (in short) shader-concepts, OpenCL, CUDA and GPU-architectures. What would be needed to learn it, according to us?

  • A specified list of subjects you can check when understood.
  • An insight story of the underlying concepts to better understand the way stream-computing works. Concepts are the base of everything, to actually make it sound simple.
  • Very practical know-how. Such as how to integrate stream-computing-code into your current software.
  • A difficult assignment that gets you in touch with everything you learned. The training gave you the instruments you need to accomplish this step.

So there’s no exam and no certificate; these are secondary reasons for finishing the course. The focus should be getting the brain wrapped around the concepts and getting experience. As the disclaimer warned you, our training-program has a high focus on getting you up-and-running in one day. And you do get a certificate after your assignment gets approved, so bragging is easy.

If you want to learn stream-computing and you won’t use our training-program, what then?

  • Read our blog (RSS) and follow us on Twitter.
  • Make yourself a list of subjects you think you have to learn. Thinking before doing helps in getting a focus.
  • Buy a book. There are many.
  • Play around with existing examples. Try to break it. Example: what happens if the kernel uses more and more local/private memory.
  • Update the list of subjects; the more extensive, the better. Prioritise.
  • Find yourself an assignment. For example: try to compress or decompress a large JPG using OpenCL. If you succeed, get yourself a harder assignment. Do you want to be good or the best?

If you know OpenCL, CUDA is easy to learn! We will have some blogs which support your quest on learning OpenCL, so just start to dig in today and see you next time.