OpenCL Fireworks

I like and appreciate the differences between the many cultures on our Earth, but I also like to recognise the very old traditions they share, which gives me a sense of an ancient bond. As a European citizen I’m quite familiar with replacing the weekly flowers with a complete tree each December, and with the burning of all those trees in January. New Year is also celebrated on different dates, the Chinese New Year being the best known (3 February 2011). We, the internet-using humans, all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The numbers below are not realistic, but they give a good idea of how it works.

Chain Crackers – serial power

You know real serial power has been delivered when your street is completely covered in red paper and all the young children have gone inside crying (with daddy saying with a big smile that “the kids are just tired”). We like it dangerous: the fuse burns for just 2 seconds, and the first second after that delivers 50 bangs, so only 19950 crackers and 6 minutes 39 seconds to go. Often you’ll hear a few crackers explode at almost the same time; let’s compare this to the mechanism for pre-loading data into the cache.

Now we cut the string of 20000 crackers into 2 parts and build a dual-core processor. We lay out two lanes and light it up again. The batch now takes 3 minutes 20 seconds. To spice things up, we have made interconnections, put some larger crackers in between and scattered some powder here and there. The end result looks quite complex and gives a nice image of a modern CPU with intelligent pre-caching, extensions, etc. Now the waiting time has dropped to 2 minutes 30 seconds, which is not bad for just changing the architecture.
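For readers who like to check the arithmetic, here is a minimal Python sketch of the chain timings above. The helper names `chain_time` and `as_min_sec` are made up for this illustration, and the model simply assumes that every lane burns at the same 50 crackers per second and ignores the fuse.

```python
def chain_time(crackers: int, lanes: int = 1, rate: float = 50.0) -> float:
    """Seconds of banging when `lanes` equal chains burn side by side.

    Assumption: every lane explodes `rate` crackers per second.
    """
    return crackers / lanes / rate


def as_min_sec(seconds: float) -> str:
    """Format a duration in seconds as 'M min S s'."""
    minutes, rest = divmod(round(seconds), 60)
    return f"{minutes} min {rest} s"


single = chain_time(20_000)         # one long chain
dual = chain_time(20_000, lanes=2)  # the chain cut into two lanes
tuned = 150.0                       # the 2 min 30 s quoted in the text

print(as_min_sec(single), "/", as_min_sec(dual))
print(f"speed-up of the tuned layout over one lane: {single / tuned:.2f}x")
```

The single chain comes out at 6 min 40 s in total, which matches the 6 minutes 39 seconds left after the first second. The tuned figure is taken from the text as given; the sketch does not try to model the interconnections.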

OpenCL

We now get the next red roll and cut it into 400 groups of 50 crackers. And the show-stopper: a fuse of 1 meter, which takes 16 seconds to burn. The long fuse is the PCIe bus and the 400 groups are the streaming processors of the graphics card. Since the GPU is slower, we used a slower fuse between the crackers, which only allows 12.5 crackers a second to explode. The job is now done in 16 + 4 = 20 seconds. As long as the total number of crackers is larger than, say, 2000, the extra preparation is worth it.
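The trade-off between the long fuse and the many groups can be sketched in a few lines of Python. This is only a toy model under stated assumptions: `gpu_time` and `cpu_time` are hypothetical helpers, and the CPU side assumes a 2-second fuse plus the tuned dual-core pace of 20000 crackers in 150 seconds.

```python
def gpu_time(crackers: int, groups: int = 400, rate: float = 12.5,
             fuse: float = 16.0) -> float:
    """Long fuse (the PCIe bus) first, then all groups burn at once."""
    return fuse + crackers / groups / rate


def cpu_time(crackers: int, fuse: float = 2.0,
             rate: float = 20_000 / 150) -> float:
    """Assumed CPU model: short fuse, then the tuned dual-core pace."""
    return fuse + crackers / rate


# Compare both set-ups for a short, a medium and the full-length chain.
for n in (1_000, 2_000, 20_000):
    print(f"{n:>6} crackers: GPU {gpu_time(n):5.1f} s, CPU {cpu_time(n):5.1f} s")
```

Under these assumptions the GPU-style roll loses clearly at 1000 crackers, is slightly ahead at 2000, and wins by a wide margin at 20000, which fits the rule of thumb above.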

Now let’s build a Hybrid Processor (a combination of CPU-style and GPU-style architecture like Intel’s upcoming Sandy Bridge and AMD’s Fusion, but also most ARM-based processors and the Cell in the PS3). We remove the long fuse, divide the roll into groups of 8 (to mimic AVX) and make two rows of 1250 groups. This will take 2 + 25 = 27 seconds. That is slower than the GPU version, which needed 20 seconds, but for chains shorter than roughly 12500 crackers this is already the best option. (Exact break-even is at 2+(x/16/50) = 16+(x/400/12.5), but that would spoil the easy reading.)
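For those who do not mind spoiling the easy reading, the break-even equation can be solved directly. The sketch below is a small Python illustration with made-up helper names (`hybrid_time`, `gpu_time`, `break_even`); it assumes 16 lanes at 50 crackers per second for the hybrid and 400 groups at 12.5 crackers per second for the GPU, as in the formula in the text.

```python
def hybrid_time(crackers: float, lanes: int = 16, rate: float = 50.0,
                fuse: float = 2.0) -> float:
    """Short fuse, then 2 rows x 8-wide groups = 16 lanes burning at once."""
    return fuse + crackers / lanes / rate


def gpu_time(crackers: float, groups: int = 400, rate: float = 12.5,
             fuse: float = 16.0) -> float:
    """Long fuse (the PCIe bus), then 400 slower groups burning at once."""
    return fuse + crackers / groups / rate


def break_even(fuse_a: float, throughput_a: float,
               fuse_b: float, throughput_b: float) -> float:
    """Solve fuse_a + x / throughput_a = fuse_b + x / throughput_b for x."""
    return (fuse_b - fuse_a) / (1 / throughput_a - 1 / throughput_b)


x = break_even(2.0, 16 * 50.0, 16.0, 400 * 12.5)
print(f"hybrid: {hybrid_time(20_000):.0f} s, GPU: {gpu_time(20_000):.0f} s")
print(f"break-even near {x:.0f} crackers")
```

With these numbers the break-even lands a little above 13000 crackers, slightly higher than the rounded 12500 used in the text. Below that length the short fuse of the hybrid wins; above it the 400 groups of the GPU make up for the long fuse.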

Future firework creations

We humans like to invent and innovate. Every year I’m flabbergasted by the innovations in both fireworks and technology.

As future Hybrid Processors get larger group sizes, more cores and better integrated GPUs, the break-even point goes up. These new processors make the term “GPGPU” old-fashioned; of course I will name it “StreamHPC”, but other options like “Heterogeneous Parallel Computing” would also cover it. ARM is doing a great job with low-power devices (and also supports OpenCL), NVIDIA will probably have a stronger focus on mobile processors next year, and Intel and AMD will fight over the power market with their hybrid processors, so 2011 will be a very interesting year. OpenCL will surely land on more and more devices. And even if StreamHPC does not help your company speed up its software, we will keep you informed via our site.

Merry Christmas and a Happy New Year!

I wish you all less waiting time next year, so you can have longer weekends and longer holidays.

In memoriam: 10 years after The Fireworks disaster in Enschede, the Netherlands (video).