OpenCL Developer support by NVIDIA, AMD and Intel

There was some guy at Microsoft who understood IT very well while being a businessman: “Developers, developers, developers, developers!”. You saw it again in the mobile market and now with OpenCL. Normally I watch his yearly speech to see which products they have brought into their own ecosystem, but the developers speech is one to watch over and over, because he is so right about this! (I don’t recommend the house remixes, because those stick in your head for weeks.)

Since OpenCL needs to be optimised for each platform, it is important for the companies that developers start developing for their platform first. StreamComputer is developing a few different Eclipse plugins for OpenCL development, so we were curious what was already there. Why not share all findings with you? I will keep this article updated – note that this article does not cover which features are supported by each SDK.

Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

NOTE: there are many contradicting sources out there, so there are bound to be mistakes in this article. Please give me feedback via Twitter, mail or comments, so all the info can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal the show from the Tesla K20 and FirePro S10000.

After two years of waiting Intel finally comes with an accelerator card: the Xeon Phi. Compare it to NVIDIA having skipped the GTX 200 series and now presenting the GTX 500 series. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute card as we know it. Just as you cannot do a 1-to-1 comparison between the AMD GCN architecture and NVIDIA Kepler, neither can be easily compared to the Phi. But this article should give an idea of where it is positioned.

Applied GPGPU-days Amsterdam 2013

December 2013: Videos are not ready yet, but a link will be put here.

Amsterdam, 20 June – Applied GPGPU-days in Amsterdam. Keep your agenda free for this event.

What can you do with GPUs to speed up computations? This year we can see various examples where OpenCL and CUDA have been used. We hope to give you an answer on whether you can use GPUs for your software, research or algorithm.

After the success of last year (fully booked with 66 attendees), we have now reserved a larger location with room for 100 people. The difference from last year is that we focus more on applications and less on technical aspects.

The program has been made public recently:

Title of talk | Company/Institute | Presenter
Introduction to GPGPU and GPU-architectures | StreamHPC | Vincent Hindriksen
Blender Cycles & Tiles: Enhancing user experience | AtMind bv | Monique Dewanchand & Jeroen Bakker
XeonPhi vs K20: The fight of the titans | SURFsara | Evghenii Gaburov
A real-time simulation technique for ship-ship and ship-port interaction | PMH bv | Jo Pinkster
CUDA Accelerated Neural Networks | LIACS | Ana Balevic
Efficient Reconstruction of Biological Networks via Transitive Reduction on GPUs | TU Eindhoven | Anton Wijs
Running Petsc on GPUs with an example from fluid dynamics | SURFsara | Thomas Geenen
Connected Component Labelling, an embarrassingly sequential algorithm | Leeuwarden University | Jaap van de Loosdrecht
Visualizing sound and vibrations using a GPU and a 1024-channel microphone array | TU Eindhoven | Wouter Ouwens
Gravitational N-body simulations on 1 to many GPUs | Leiden observatory | Jeroen Bédorf

A few demos will be shown.

For more information, and to find other events by the platform, see the Platform Parallel webpage.

Tickets are €75,-. If you are from a Dutch university or research institute affiliated with SURF, your ticket has been fully sponsored by SURFsara.

Associated events in the Netherlands

For the technical aspects (GPU-programming techniques, optimisation, etc) we have a special day: the GPU Dev Day 2013. More information on the Platform Parallel webpage. Date and place will be made public in June.

The first Khronos Meetup Benelux will take place just before the Applied GPGPU day, on 19 June in Amsterdam. More information on the meetup-page.

OpenCL – the battle, part I

Part I: the Hardware-companies and Operating Systems

(Part II will be about programming languages and software-companies, part III about the gaming-industry)

OpenCL is the new, but already de-facto standard of stream-computing; but how it got there so fast is somewhat strange. A few years ago there were many companies and research-groups seeing the power of using the GPU, such as:

And the fight is really not over, since we are talking about a big shift in the super-computing industry. Just think of IBM BlueGene, which will lose lots of market to nVidia and AMD. Or Intel, which hasn’t acquired a GPU-creator as AMD did. Who had expected the market to change this drastically? If we’re honest, we could have seen it coming (when looking at the turbulence around PhysX and Havok), but “normally” these new techniques would be introduced slowly.

The fight is about market share. For operating systems it matters because users want to have their movies encoded in 20 minutes, just like their neighbour. For HPC it matters because clusters can be upgraded for a far lower price than was possible the old-fashioned way; here the fight is mostly between Linux HPC and Windows HPC (which still has a very small market share), but it also involves database engines which rely on high-performance hardware and software.
The most is to be gained in the processor market. The extremely large consumer market has been declining since 2004, since most users do not need more than a netbook and have bought a separate gaming computer for the more demanding games. We no longer see only Intel and AMD, but also IBM’s powerful Cell and Power processors, very power-efficient ARM processors, etc. Now that OpenCL could make it more interesting to buy an average processor and a good graphics card, Intel (and AMD) have no choice but to take up the battle with nVidia.

Background: Why Apple made OpenCL

Short answer: pure frustration. All those different implementations would either get a share or fight to be named the standard; Apple wanted to bet on the right horse and therefore took the lead in creating an open standard. Money would be made by updating software and selling more hardware. For that reason Apple’s close partners Intel and nVidia were easily motivated to help develop the standard. Currently Apple’s only (public) reasons for giving away such an expensive and specialised project are publicity and being ahead of the competition. Since it will not be a core business of Apple, it does not need to stay in the lead, but which companies do?

Acquisitions, acquisitions, acquisitions

No time to lose for the big companies, so they must get the knowledge in-house as soon as possible. Below are some examples.

  • Microsoft: Interactive Supercomputing (22-Sept-2009): made Star-P, software which allowed users performing scientific, engineering or analytical computation on array or matrix-based data to use parallel architectures such as multi-core workstations, multi-processor systems, distributed memory clusters or utility/cloud-based environments. This is completely in the field of OpenCL, which Microsoft needs to strengthen its products, such as SQL Server and Windows HPC, as Apple already did.
  • nVidia: Ageia technologies (22-Febr-2008): made specialized PC-cards and software for calculating complicated physics in games. They made the first commercial product aiming at the masses (gamers). PhysX-code could be integrated in nVidia-drivers to be used with modern nVidia-GPUs.
  • AMD: ATI (24-July-2006): graphics chip specialist. Although the price was too high, it saved AMD from being bought out by Intel and could even have kept it ahead (if they had kept running).
  • Intel: Havok (17-Sept-2007): builds games-tools, such as a physics-engine. After Ageia was taken, it was the only good company left to buy; AMD, which had spent all its money on ATI, was too late. Wind River (4-June-2009): a company providing embedded systems, development tools for embedded systems, middleware, and other types of software. Also read this interesting article. Cilk (31-July-2009): offers parallel extensions that are tightly tied into a compiler. RapidMind (19-Aug-2009): created a high-level language Sh, which had an OpenCL-backend. Intel has a lead in CPU-compilers, which it wants to broaden to multi-core- and GPU-compilers. Intel discovered it was in the group of “old fashioned compiler-builders” and had lots to learn in a short time.

If you know more acquisitions of interest, please let us know.


Apple, Intel and NVidia are the winners for 2009 and 2010. They currently have the most knowledge in house and have their marketing-machine running. NVidia has the best insight into new markets.

Microsoft and game-developers are second; they took the first train by joining the OpenCL-consortium and taking it very seriously. At the end of 2010 Microsoft will be at Apple’s level of expertise, so we will see then who has the best novelties. The game-developers, most of whom already have experience with physics-calculations, all got a second chance if they had misjudged the physics-engines. More on gaming in part III.

AMD is currently actually a big loser, since it does not seem to take it all seriously enough. But AMD can afford to be late, since OpenCL makes it easy to switch. We hope the best for AMD, since it has the technology of both CPU and GPU, and many years of experience in both fields. More on the competition between marketing-monster nVidia and silent AMD will be discussed in a blog-item, next week.

Another possible loser is Linux, which has lots to lose in the HPC-market; BSD-based Apple and Windows HPC can actually win market-share now. Expect most code to be given to the community by hardware-manufacturers Intel, AMD and nVidia, but also by universities, which do lots of research on the ever-flexible Linux. In the end it all depends on the OpenCL-adoption of (Linux-specific) programming-languages, which will be discussed in part II.

ARM is a member of the OpenCL-group but does not seem to invest in it; they seem to target another growing market: the low-power mobile devices. We will write on OpenCL and the mobile market later and why ARM currently can be relaxed about OpenCL.

We hope you now have more insight into this new market; please contact us for more specific information and feel free to give your comments. Please stay tuned for part II and III, which will be released in the next few weeks.

First Khronos Chapter meeting in Amsterdam: WebGL/OpenGL


On Thursday 13 February 2014 the first Khronos meetup in Amsterdam will take place. We expect a small group, so the location will be cozy and there will be enough time to talk over a beer. First round is on me, admission is free.

The goal is to learn about open media-standards from Khronos and others. So when OpenCV is discussed, we’ll also talk about OpenVX. The target group is programmers and indie developers who are interested in creating multi-OS and multi-device software.


I am thrilled to announce that Ton Roosendaal of the Blender Foundation will talk about the relationship between his Blender and Khronos OpenGL.

Second, Maarten and Jurjen of ThreeDee Media will talk about WebGL, from a technical and a market view. Is WebGL ready for prime-time?

Then you can show your stuff. For that I’ll bring a good laptop with Windows 8.1 and 64-bit Ubuntu 13.10.

Prepare for Meetup today!

See the Meetup-page for more information. See you there!




OpenCL – the battle, part II

Part II: the software-companies

It is very clear what’s at stake for the hardware-companies; we’ve also discussed the operating systems. But what should the software companies do? For companies which make e.g. encoding-software or databases it is very simple: support OpenCL or be years behind (which marketing can’t fix). For most other software there is a dependency on the programming language, since OpenCL is a very specialised way of programming which (most times) is too different from in-house knowledge and can therefore be too expensive.

This article is somewhat brief, since most of the material will be discussed further in later-to-be-released articles.

Video-encoding and rendering

Why did we easily get 60 frames per second in games, while rendering an image of our own house would take minutes? You had the feeling there was a gap between worlds which needed to be closed. OpenGL/DirectX did a lot (also see our next article about OpenGL, OpenCL and DirectX), but was not able to help us outside games. Apple did a lot for the desktop by integrating hardware-acceleration (later copied by Linux and Windows), but somehow GPU-processed results were not regarded as professional and were maybe seen more as an intermediate result (to see what it would look like).

Elemental Technologies was first with its H.264/AVC encoder; Nero and nVidia joined forces somewhat later. Both are based on CUDA and not OpenCL. Since rendering is close to what we already expected to come out of a GPU, we think this market will very soon catch up by introducing the same products, based on OpenCL.

A few months ago nVidia released its GPU-based ray-tracing engine, OptiX. On Youtube you can find the demo of VRay’s accelerated ray-tracing engine.

We expect a lot of news from the graphics-world, since they already know how to program with shaders. A lot of artists will love the free speed-up, but it’s not breaking news this would be possible.

Programming languages

C and C++ are official bindings of OpenCL. Thereby Objective-C (used e.g. on the iPhone) has native support too.

As we described last week, we think that Oracle/Sun is taking OpenCL more seriously now. Several wrappers exist for Java, but native support is missing; we would suggest writing the OpenCL-part in C or C++ when using Java, even if this breaks the beauty of the multi-platform-language.

It is very clear Microsoft had a better view by being an early adopter, with Visual Studio integration through profilers (created by AMD and nVidia). You already see higher-level implementations, such as the C#-toolkit OpenTK, which includes support that goes beyond the default dll-bindings. Also here, programming parts in native C would be best.

Python is famous for its endless wrappers around anything, so it was to be expected to find an OpenCL-binding. Python has always been the safe choice for scientific programming, because of its enthusiastic community.

Bindings for OpenCL in languages like PHP and Perl are completely absent. Most times this is not a problem, as C-libraries can easily be called.

RapidMind had a product which provided higher-level programming on the GPU, but after its acquisition by Intel, we don’t see the product any more. So we can conclude we just have to wait for native support in languages other than C, C++ and Objective-C, to have better support.


We will cover databases later, when projects are more mature. In short, it is currently being investigated how GPUs can do what Sun’s UltraSPARCs already did. Since the memory-bandwidth is only great when using the onboard-memory, this is not as promising as it looks. Index-searches can be sped up, but these are not the real bottleneck in database-performance. We think it is very important to invest in OpenCL-research in this competitive market.
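To make the bandwidth argument a bit more concrete, here is a back-of-the-envelope sketch in Python. The function name and all three numbers are assumptions picked as round figures for illustration; they are not measurements of any real card or bus. It only compares the time to copy a table to the card with the time to scan it once in on-board memory.

```python
def scan_times(data_gb, host_link_gb_per_s, onboard_gb_per_s):
    """Return (transfer_seconds, scan_seconds) for one pass over the data.

    transfer_seconds: time to copy the data from host memory to the card.
    scan_seconds:     time to read the data once from on-board memory.
    """
    transfer_seconds = data_gb / host_link_gb_per_s
    scan_seconds = data_gb / onboard_gb_per_s
    return transfer_seconds, scan_seconds


if __name__ == "__main__":
    # Hypothetical round numbers, not benchmarks.
    transfer, scan = scan_times(data_gb=16, host_link_gb_per_s=8, onboard_gb_per_s=160)
    print(f"copy to card: {transfer:.1f} s, scan on card: {scan:.1f} s")
    print(f"the copy takes {transfer / scan:.0f}x as long as the scan")
```

With figures like these the copy dominates, so the GPU mainly pays off when the data can stay on the card across many queries.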

Operating Systems

Apple has had good GPU-support since OSX and therefore a good understanding of graphics-cards. Apple started the OpenCL project and has already updated several core libraries of OSX with OpenCL.

Microsoft has built DirectCompute in DirectX 11. This is OpenCL-technology put in a MS-jacket, as we’ve seen the company do many times before. Coming up is an article which discusses the differences.

Linux (Desktop and HPC) does not have great support for OpenCL, but it works well enough to have most large OpenCL/CUDA-upgraded clusters to its name. Due to the flexibility of the OS and the strong competition between nVidia and AMD, a lot of research is done. Nevertheless there are no core-libraries in Linux which support OpenCL. We expect e.g. visualisation-libraries to support OpenCL this year.

Mathematical software

Matlab, Octave, Mathematica, R and Maple will all have a big advantage by using the GPU. Matlab has the most support through external libraries: CUDA, Jacket, gpuMat, etc. A CUDA-version of Mathematica will be released soon. R is still in discussion; Octave has a few partial or abandoned implementations of some libraries. Since there is a lot of money to make by selling these products we can only expect full open-source implementations. Maple refers to its external call routines, so we still have to wait a while until we can have GPU-support.


This short overview gives an idea of where to expect to find OpenCL-powered solutions. When we find more markets in the coming weeks, we’ll update this post.

Clear winners cannot be pinpointed, since the door has just opened. Maybe Nero, since it will now sell more of its encoding-products to owners of nVidia GPUs.

We more than halved the FPGA development time by using OpenCL

A flying FPGA board

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such a project, we could be much more effective than when sticking to traditional methods.

The bonus

It also works on CPU and GPU. It has to be said that the code hasn’t been fully optimised for CPUs and GPUs – this can be done in a separate project. In case a decision has to be made on which hardware to use, our solution has the least risk and the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

  • We outperform traditional FPGA development companies in time-to-market and price.
  • We can discuss problems on hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are fewer bridges.
  • Our software also works on CPUs and GPUs for no additional charge.
  • The latencies of the resulting project are very comparable.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

Valgrind suppression file for AMD64 on Linux

Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly code, Fortran, Ada, etc. I use it to check whether the provided code is ok, before I start porting it to GPU-code. It finds those devils in the details. It has also given me good feedback for finding my own bugs when writing OpenCL-code. Unfortunately it does not work well with optimised libraries, such as the OpenCL-driver from AMD.

You’ll get messages like the ones below, which clutter the output.

==21436== Conditional jump or move depends on uninitialised value(s)
==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/
==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/
==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/
==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/
==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/
==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/
==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/
==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/
==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/
==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/
==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/
==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/
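
As a minimal sketch of how such noise can be silenced, the Python snippet below generates a Memcheck suppression entry that matches on the object file of the top stack frame. The entry layout (name, `Memcheck:Cond`, `obj:` with a wildcard) is Valgrind’s normal suppression format; the function names, the file name `amd.supp` and the glob `/usr/lib/fglrx/*` are assumptions for illustration, so take the real library path from the “(in ...)” part of your own output.

```python
def make_suppression(name, error_kind, object_glob):
    """Return one Memcheck suppression entry as text.

    error_kind is a Memcheck suppression type, e.g. "Cond" for
    "Conditional jump or move depends on uninitialised value(s)".
    object_glob is matched against the object file of the top stack frame.
    """
    return "\n".join([
        "{",
        f"   {name}",
        f"   Memcheck:{error_kind}",
        f"   obj:{object_glob}",
        "}",
    ]) + "\n"


def write_suppression_file(path, object_glob, error_kinds=("Cond",)):
    """Write one entry per error kind to path and return the generated text."""
    text = "".join(
        make_suppression(f"driver_{kind.lower()}", kind, object_glob)
        for kind in error_kinds
    )
    with open(path, "w") as handle:
        handle.write(text)
    return text


if __name__ == "__main__":
    # Hypothetical file name and glob, for illustration only.
    print(write_suppression_file("amd.supp", "/usr/lib/fglrx/*"))
```

The resulting file can be passed to Valgrind with `--suppressions=amd.supp`. Valgrind can also print ready-made entries itself with `--gen-suppressions=all`, which is a good way to check what a generated entry should look like.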

Molybdenite and graphene to the helping hand?

The rabbit in “The Last Mimzy” was very special. What material was it made of?

You might have read about Molybdenite a few months ago. It is more efficient than Graphene, which is in turn more efficient than good old Silicon, most notably energy-wise. Magazine ‘Nature’ had an article on it, which is summarised by PhysOrg, so check it out. The claim is that it is 100 000 times more efficient than Silicon (and more efficient than the already very promising Graphene). This fan-free Silicon-replacer would be a major disaster for the cooling-industry!

But what would change for us? We are now on the edge of moving to ARM (started by the smartphone- and tablet-industry), but is all this needed if the energy-costs drop to prices comparable to the costs of keeping ice-cream cold on the North-Pole (20 years ago)? This technique would give huge potential to Fusion-chips, which now have a long way to go to solve the heat-problem. But since it would take several years (and thus decades in hi-tech years) to get these chips on the market, no assumptions for market-share can be made based on what will happen in a few years.

Low-power ARM and Molybdenite X86

So this is European ARM (and licensees around the world) vs US Intel and AMD. The sarcastic joke a few friends and I make is that the fight of the past 20 to 30 years between the economies of the US and EU is actually about who has the money to hire the most Asians to develop the revolutionising devices. But as long as the US and EU have the feeling that we are the whole equation of the competition, being a massive 12% of the world-population, I won’t be too far behind the facts.

Since batteries don’t evolve as fast as processors, the power-problem needed to be tackled differently. A major reason for choosing ARM is that it uses less energy than X86, just like LCD/TFT is replaced by e-ink and organic LEDs, and memory is non-volatile in portable devices.

In case we get a big reduction for CPU and memory, the efficiency of the architecture is less of a problem. Intel and AMD can then re-enter the market, but with much more powerful devices. Until then ARM-licensees like NVIDIA and ImTec have a better market when it comes to near-future devices. As I expected, more tablet-manufacturers are coming up with docking-stations to replace the PC with a tablet. AMD and Intel have to keep surprising (and probably protect their market) in the coming years to avoid losing to ARM. In other words: it will be exciting to see what the consumer-market looks like in the coming years and which companies deal in it. When thinking about these years, keep in mind what Windows XP has taught us: computers are fast enough for what the average Joe wants to do with them. Hey, I use my laptop for OpenCL and the big screen; for the rest I use my mobile phone.

Hybrid chips

While I did not see it as a serious problem last year, the heat-problem for a GPU+CPU on one chip is quite a challenge. Waiting for the Molybdenite or Graphene chips to mature will be like digging your own grave. Each step forward will result in two new products: one which is more power and/or heat efficient, and one which is more powerful. Since the competition from ARM-companies is heavy, the chances are bigger that the focus will be on more powerful Hybrid CPUs. As I stated above, the losses are in the low-power area. Intel and AMD are very aware of this challenge.

Have you checked the differences between DirectX 10 and 11 games? Just check the discussions, where a growing side argues there is no need to support DirectX 11, because 10 is good enough. Also here, the demand is higher to have the same graphics-quality for less money on more portable devices. Hybrid CPUs will eat the GPU-market for sure.

ARM-processors are hybrid processors. That’s all I’ll say, so you can, in combination with all stated above, formulate your own conclusions. I was very surprised NVIDIA started targeting ARM with their high-end GPUs, but was this really a bad idea?

Device vs Data-centre

Reduction of energy-costs for processors will reduce the cost of head-less servers in the data-centre enormously. Internet costs loads of energy, both the transport and the servers – this will reduce the server-part of the energy-consumption-sum by quite a factor. All positive news.

But if all this comes true, and chips don’t use much energy anymore while mobile internet and other radios take the most, what will happen to the cloud? Will you upload your video to get it processed, or put your mobile in the sun to charge it while waiting a shorter period?

Current developments, future needs

We need arithmetic, media-processing and input/output; we all have that. We need long battery-life, a good screen and a fast way to input our data and commands; we get more of that each day. But heat-production in Silicon limits a lot, so we get the perfect electronic device the moment we can replace Silicon. Getting rid of the heat could give us square chips, with challenges like reinventing the socket and multi-multi-layerness.

So the question to you: will a logo of Intel, AMD, ARM or another company be found in The Last Mimzy sequel (you know, the movie with the molybdenite rabbit)?


Welcome to the webpage of Stream HPC. We’re a company in Europe that works on solving the most difficult HPC problems, with emphasis on scaling to GPUs and clusters. We have built up experience in speeding up software, designing performance oriented architectures, writing maintainable low-level code, selecting the best hardware for the job, and building benchmarks. Above all, we’re a customer oriented company, as we want our clients to feel in control, while we do the heavy lifting.

The company is multi-cultural and designed to be a safe space for everybody on our team – from LGBT+ to Asperger’s, we focus on making our differences our strengths. As you can read in the job self-assessment, we have 4 main strengths:

  • CPU development: algorithms, low-level code, architectures for CPU-based software. This includes clusters.
  • GPU development: algorithms, low-level code, architectures for GPU-based software. This includes graphics programming.
  • Problem-solving: get from full understanding to full exploration quickly.
  • Self-managed teams: we don’t hire managers, but provide frameworks.

Our customers are all around the world, but especially in North-America, West-Europe and East-Asia. We have built a lot of high performance software that runs on anything from edge-computers to super-computers. See “What we do” for examples.

Our offices are in:

  • Amsterdam
  • Budapest
  • Barcelona

If you want to know more, feel free to get in contact.

See this page for Netherlands/Belgium, Hungary or Spain.

What is Khronos as of today?

The Khronos Group is the organization behind APIs like OpenGL, Vulkan and OpenCL. Over one hundred companies are members and decide together what next year’s phone, camera, computer or media device will be capable of.

We’re at the right, near the bottom.

We work most with OpenCL, but you probably noticed we work with OpenGL, Vulkan and SPIR too. Khronos currently has the following APIs and standards:

  • COLLADA, a file-format intended to facilitate interchange of 3D assets
  • EGL, an interface between Khronos rendering APIs such as OpenGL ES or OpenVG and the underlying native platform window system
  • glTF, a file format specification for 3D scenes and models
  • OpenCL, a cross-platform computation API
  • OpenGL, a cross-platform computer graphics API
  • OpenGL ES, a derivative of OpenGL for use on mobile and embedded systems, such as cell phones, portable gaming devices, and more
  • OpenGL SC, a safety critical profile of OpenGL ES designed to meet the needs of the safety-critical market
  • OpenKCam, Advanced Camera Control API
  • OpenKODE, an API for providing abstracted, portable access to operating system resources such as file systems, networks and math libraries
  • OpenMAX, a layered set of three programming interfaces of various abstraction levels, providing access to multimedia functionality
  • OpenML, an API for capturing, transporting, processing, displaying, and synchronizing digital media
  • OpenSL ES, an audio API tuned for embedded systems, standardizing access to features such as 3D positional audio and MIDI playback
  • OpenVG, an API for accelerating processing of 2D vector graphics
  • OpenVX, Hardware acceleration API for Computer Vision applications and libraries
  • OpenWF, APIs for 2D graphics composition and display control
  • OpenXR, an open and royalty-free standard for virtual reality and augmented reality applications and devices
  • SPIR, an intermediate compiler target for OpenCL and Vulkan
  • StreamInput, an API for consistently handling input devices
  • Vulkan, a low-overhead computer graphics API
  • WebCL, a JavaScript binding to OpenCL within a browser
  • WebGL, a JavaScript binding to OpenGL ES within a browser on any platform supporting the OpenGL or OpenGL ES graphics standards

Too few people understand that the organization is unique, as the biggest processor vendors discuss collaborations and how to move the market, while they’re normally the fiercest competitors. Without Khronos it would have been a totally different world.

Happy New Year!

About a year ago this site was launched, and half a year ago StreamHPC was officially registered as a company at the Chamber of Commerce. It has been a year of hard work, but it all started after seeing the cover of a book about bore-outs. The result is there, with a growing number of visitors from all over the world (from 62 countries since 23-Dec-2010) and new Twitter-followers every week. Now some mixed news for 2011:

  • We are soon going to release a few plugins for Eclipse, both free and paid, to simplify your development.
  • 2011 will be the year of hybrid processors (Intel SandyBridge and AMD Fusion), which will make OpenCL much more popular.
  • 2011 is also going to be the year of the smart-phone (prognosis: in 2011 more smart-phones will be sold than PCs). So even more OpenCL-potential.
  • On 31-Dec-2010 we migrated the site to a faster server, to reduce waiting-time online too.
  • The book will be released in parts, to avoid more delays.
  • There will be around ten (short) articles published in January. Both developers and managers will be served.
  • Our goal is to expand. We have shown you our vision, but we want to show you more.

In a few words: 2011 is going to be exciting! We wish all our readers, business-partners, friends, family and (new) customers a super-accelerated 2011!

StreamHPC – we accelerate your computations



OpenCL is growing fast and various architectures now support compute-acceleration. This means that you have a lot of choice to find the right solution for your algorithm.





Possibly in the (near) future

Currently we are looking into:

  • Game Consoles
    • Nintendo Wii U dev – only vague rumours.
    • Sony Playstation 4 Orbis – strong rumours.
  • Movidius – has internal builds, but will only release on customer’s request.
  • Texas Instruments – support on C66x multicore DSPs (PDF source) and on their ARM-chips.
  • ST-Ericsson
If you have more information, let us know.



Useful peripherals

When working with various devices, you might find the tips below useful.



When working with those small cute computers, three things come in handy:

  • An HDMI-switch (or a monitor with more HDMI-inputs).
  • A small keyboard+mouse which uses Bluetooth or only one USB-port. I use the Logitech-keyboard as shown at the right.
  • A network-switch with enough free ports. Even though most boards have WIFI, good wired internet proves itself to be valuable.

nVidia’s CUDA vs OpenCL Marketing

Please read this article about Microsoft and OpenCL before reading on. The requested benchmarking has been done by the people of Unigine, who have some results on the differences between the three big GPGPU-APIs: part I, part II and part III with some typical results.
The following read is not about the technical differences between the 3 APIs, but more about the reason why alternative APIs are being maintained while OpenCL does the trick. Please comment if you think OpenCL doesn’t do the trick.

As was described in the article, it is important for big companies (such as Microsoft and nVidia) to defend their market-share. This protection is not provided through an open standard like OpenCL. As it went with OpenGL – which was sort of replaced by DirectX to gain market-share for Windows – nVidia now does the same with CUDA. First we will sum up the many ways nVidia markets CUDA and then we discuss the situation.

The way nVidia wanted to play the game, was soon very clear: it wanted to market CUDA to be seen as the better alternative for OpenCL. And it is doing a very good job.

The best way to get better acceptance is giving away free information, such as articles and courses. See CUDA’s university courses as an example. Sponsoring also helps a lot, so the first main-stream oriented books about GPGPU discussed CUDA in the first place, and interesting conferences were always supported by CUDA’s owner. Furthermore, loads of money is put into an expensive webpage with very complete information about everything there is to find on GPGPU. nVidia does give a choice, by also having implemented OpenCL in its drivers – it just does not have big pages on how-to-learn-OpenCL.

AMD – having spent their money on buying ATI – could not put this much effort in the “war” and had to go for OpenCL. You have to know, AMD has faster graphics-cards for lower prices than nVidia at the moment; so based on that, they could become the winners on GPGPU (if that was the only thing to do). Intel saw the light too late and is even postponing their high-end GPU, the Larrabee. The only help for them is that Apple demands to have OpenCL on nVidia’s drivers – but for how long? Apple does not want strict dependency on nVidia, since it e.g. also has a mobile market. But what if all Apple-developers create their applications on CUDA?

Most developers – helped by the money and support of nVidia – see that there is just little difference between CUDA and OpenCL, and in case of a changing market they could translate their programs from one to the other. For now, a demand to have a high-end videocard of nVidia can be justified: it is a card which many people actually have or could easily buy within the budget of their current project. The difference between CUDA and OpenCL is comparable with that between C# and Java – the corporate tactics are also the same. Possibly nVidia will have better driver-support for CUDA than for OpenCL, and since CUDA does not work on AMD-cards, the conclusion will be that CUDA is faster. There can then be a situation that AMD and Intel have to buy CUDA-patents, since OpenCL does not have the support.

We hope that OpenCL will stay the main-stream GPGPU-API, so the battle will be on hardware/drivers and support of higher-level programming-languages. We do really appreciate what nVidia already has done for the GPGPU-industry, but we hope they will solely embrace OpenCL for the sake of the long-term market-development.

What we left out of this discussion is Microsoft’s DirectCompute. It will be used by game-developers on the Windows-platform, who just need physics-calculations. When discussing the game-industry, we will tell more about Microsoft’s DirectX-extension.
Also come back and read our upcoming article about FPGAs + OpenCL, a combination from which we expect a lot.

Market Positioning of Graphics and Compute solutions

When compute became possible on GPUs, it was first presented as an extra feature and did not change much in the positioning of the products by AMD/ATI and Nvidia. NVidia started with positioning server-compute (described as “the GPU without a monitor-connector”), and AMD and Intel followed. When the expensive Geforce GTX Titan and Titan Z got introduced, it became clear that NVidia still thinks about positioning: Titan is the bridge between Geforce and Tesla, a Tesla with video-out.

Why is positioning important? It is the difference between “I’d like to buy a compute-card for my desktop, so I can develop algorithms that run as well on the compute-server” and “I’d like to buy a graphics card for doing computations and later run that on a passively cooled graphics card”. The second version might get a “you don’t want to do that”, as graphics terminology is used to refer to compute-goals.

Let’s get to the overview.

| Segment | AMD | NVIDIA | Intel | ARM |
|---|---|---|---|---|
| Desktop User * | A-series APU | – | Iris / Iris Pro | – |
| Laptop User * | A-series APU | – | Iris / Iris Pro | – |
| Mobile User | – | Tegra | Iris | Mali T720 / T4xx |
| Desktop Gamer | Radeon | GeForce | – | – |
| Laptop Gamer | Radeon M | GeForce M | – | – |
| Mobile High-end | – | Tegra K (?) | Iris Pro | Mali T760 / T6xx |
| Desktop Graphics | FirePro W | Quadro | – | – |
| Laptop Graphics | FirePro M | Quadro M | – | – |
| Desktop (DP) Compute | FirePro W | Titan (hdmi) / Tesla (no video-out) | XeonPhi | – |
| Laptop (DP) Compute | FirePro M | Quadro M | XeonPhi | – |
| Server (DP) Compute | FirePro S | Tesla | XeonPhi (active cooling!) | – |
| Cloud | Sky | Grid | – | – |

* = For people who say “I think my computer doesn’t have a GPU”.

My thoughts are that Titan is there to promote compute at the desktop, while Tesla is also promoted for that. AMD has the FirePro W for that, for both graphics professionals and compute professionals, to serve all customers. Intel uses XeonPhi for anything compute and it is all actively cooled.

The table has some empty spots: Nvidia doesn’t have IGP, AMD doesn’t have mobile graphics and Intel doesn’t have a clear message at all (J, N, X, P, K mixed for all types of markets). Mobile GPUs from ARM, Imagination, Qualcomm and others have a clear message to differentiate between high-end and low-end mobile GPUs, whereas NVidia and Intel don’t.

Positioning of the Titan Z

Even though I think that Nvidia made the right move with positioning a GPU for the serious compute hobbyist, they are very unclear in their proposition. AMD is very clear: “Want professional graphics and compute (and play games after work)? Get FirePro W for workstations”, whereas Nvidia says “Want compute? Get a Titan if you want video-output, or Tesla if you don’t”.

See this Geforce-page, where they position it as a gamers-card that competes with the Google Brain Supercomputer and a Mac Pro. In other places (especially benchmarks) it is stressed that it is not meant for gamers, but for compute enthusiasts (who can afford it). See for example this review on

That said, we wouldn’t recommend this product to gamers anyway: two Nvidia GeForce GTX 780 Ti or AMD Radeon R9 290X cards offer roughly similar performance for only a fraction of the money. Only two Titan-Zs in SLI offer significantly higher performance, but the required investment is incredibly high, to the point where we wouldn’t even consider these cards for our Ultimate PC Advice.

As a result, Nvidia stresses that these cards are primarily intended for GPGPU applications in workstations. However, when looking at these benchmarks, we again fail to see a convincing image that justifies the price of these cards.

So NVIDIA’s naming convention is unclear. If TITAN is for the serious and professional compute developer, why use the brand “Geforce”? A Quadro Titan would have made much more sense. Or even “Tesla Workstation”, so developers could get a guarantee that the code would run on the server too.

Differentiating from low-end compute

Radeon and Geforce GPUs are used for low-cost compute-clusters. Both AMD and NVidia prefer to sell their professional cards for that market and have difficulty making clear that game-cards are not designed for compute-only solutions. The one thing they did in the past years is to reserve good double precision computations for their professional cards only. An existing difference was the driver quality between Quadro/FirePro (industry quality) and GeForce/Radeon. I think both companies have to rethink the differentiated driver-strategy, as compute has changed the demands in the market.

I expect more differences between the support-software for different types of users. When would I pay for professional cards?

  1. Double Precision GFLOPS
  2. Hardware differences (ECC, NVIDIA GPUDirect or AMD SDI-link/DirectGMA, faster buses, etc)
  3. Faster support
  4. (Free) Developer Tools
  5. System Configuration Software (click-click and compute works)
  6. Ease of porting algorithms to servers/clusters (up-scaling with fewer bugs)
  7. Ease of porting algorithms to game-cards (simulation-mode for several game-cards)

So the list starts with hardware-specific demands, then shifts to developer support. Let me know in the comments why you would (not) pay for professional cards.

Evolving from gamer-compute to server-compute

GPU-developers are not born, but made (trained or self-educated). Most times they start with OpenCL (or CUDA) on their own PC or laptop.

With Nvidia it would be hobby-compute on Geforce, then serious stuff on Titan, then Tesla or Grid. AMD has a comparable growth-path: hobby-compute on Radeon, then upgrade to FirePro W and then to FirePro S or Sky. With Intel it is Iris or XeonPhi directly, as their positioning is not clear at all when it comes to accelerators.


The positioning of graphics cards and compute cards is finally getting finalised at the high level, but will certainly change a few more times in the year(s) to come. Think of the growing market for home-video editors in 2015, who will probably need a compute-card for video-compression. Nvidia will come with another solution than AMD or Intel, as it has no desktop-CPU.

Do you think it will be possible to have an AMD APU with an NVIDIA accelerator? Do people need to buy an accelerator-box in 2015 that can be attached to their laptop or tablet via network or USB, to do the rendering and other compute-intensive work (a “private compute cloud”)? Or will there always be a market for discrete GPUs? Time will tell.

Thanks for reading. I hope the table makes clear how things are now as of 2014. Suggestions are welcome.

MPI in terms of OpenCL

OpenCL is a member of a family of Host-Kernel programming language extensions. Others are CUDA, IMPC and DirectCompute/AMP. The family is characterised by a separate function or set of functions, referred to as kernels, which are prepared and launched by the host to run in parallel. Added to that are deeply integrated language-extensions for vectors, which give an extra dimension to parallelism.

Apart from the vectors, there is much overlap between Host-Kernel-languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused on how to get software parallel for years now, this could give you an image of how OpenCL (and the rest of the family) will evolve. It also answers how MPI’s main concept, message-passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

At the right you see bees doing different things, which is easy to parallelise with MPI, but currently doesn’t have the focus of OpenCL (when targeting GPUs). But actually it is very easy to do this with OpenCL too, if the hardware supports it, as CPUs do.
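The difference between "bees doing different things" (task parallelism) and the same kernel applied to many items (data parallelism) can be sketched in a few lines. This is only a concept illustration using plain Python threads; it uses neither MPI nor OpenCL, and all function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor


# Three different "bee jobs" (hypothetical names, for illustration only).
def collect_nectar(flowers):
    return sum(flowers)


def build_comb(cells):
    return ["cell"] * cells


def guard_hive(visitors):
    return [v for v in visitors if v != "wasp"]


# One small job that is applied to every element, kernel-style.
def scale(x):
    return 2 * x


with ThreadPoolExecutor(max_workers=3) as pool:
    # Task parallelism: three different jobs submitted side by side.
    nectar = pool.submit(collect_nectar, [3, 5, 2])
    comb = pool.submit(build_comb, 4)
    guard = pool.submit(guard_hive, ["bee", "wasp", "bee"])
    task_results = (nectar.result(), comb.result(), guard.result())

    # Data parallelism: the same job mapped over all elements.
    data_results = list(pool.map(scale, [1, 2, 3, 4]))

print(task_results)
print(data_results)
```

The first half resembles what MPI ranks typically do, the second half resembles what a GPU kernel does over its work-items.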

Scaling mobile GPUs to 1000 GFLOPS

On the 20th of April 2013 there was an interesting discussion between Jan Gray and David Kanter. Jan is a specialist in C++ and FPGAs (twitter, homepage). David is a specialist in CPU and GPU architectures (twitter, homepage). Both know their way around the field of semiconductors. It is always a joy to follow their short discussions when they happen, but there was something about this one that made me want to share it with special attention.

OpenCL on ARM: Growth-expectation of GFLOPS/Watt of mobile GPUs exceeds Moore’s law. That’s incredible!

Jan Gray: .@OpenCLonARM GFLOPS/W more a factor of almost-over Dennard Scaling. But plenty of waste still to quash.

Jan Gray‏: .@openclonarm Scratch Dennard tweet: reduced capacitance of yet smaller devices shd improve GFLOPS/W even as we approach end of Vdd scaling.

David Kanter: @jangray @OpenCLonARM I think some companies would argue Vdd scaling isn’t dead…

Jan Gray: @TheKanter @openclonarm it’s not dead, but slowing, we’ve gone from 5V to 1V (25x power savings) and have maybe several hundred mVs to go.

David Kanter: @jangray I reckon we have at least 400mV, so ~2X; slower than ideal, but still significant

Jan Gray: @TheKanter We agree, I think.

David Kanter: @jangray I suspect that if GPU scaling > Moore’s Law then they are just spending more area or power; like discrete GPUs in the last decade

David Kanter: @jangray also, most positive comment I’ve heard from industry folks on mobile GPU software and drivers is “catastrophically terrible”

Jan Gray: @TheKanter Many ways to reduce power, soup to nuts. For ex HMC DRAM on interposer for lower energy signaling. I’m sure many tricks to come.

In a nutshell, all the reasons they think mobile GPUs can outpace Moore’s law while staying under a certain power-usage.
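The "25x power savings" mentioned in the tweets follows from the usual first-order model in which dynamic power scales with the square of the supply voltage (P = C·V²·f). A minimal sketch, assuming capacitance and frequency stay fixed; the function name is only illustrative:

```python
def dynamic_power_savings(vdd_old, vdd_new):
    """Factor by which dynamic power drops when only Vdd changes.

    Assumes the first-order model P = C * V^2 * f, with C and f held constant.
    """
    if vdd_old <= 0 or vdd_new <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd_old / vdd_new) ** 2


# Going from 5 V to 1 V, as mentioned in the discussion:
savings_5v_to_1v = dynamic_power_savings(5.0, 1.0)
print(savings_5v_to_1v)  # 25.0
```

Under the same model, a further 2x saving corresponds to lowering Vdd by a factor of about √2.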

It needs some background-info, so let’s start with the background of the first tweet, and then explain what has been said.

Let’s enter the Top500 HPC list using GPUs

The #500 super-computer has only 24 TFlops (2010-06-06):

Update: scroll down to see the best configuration I have found.

In other words: a cluster with at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFLOPS single precision, 1 TFLOPS double precision) would enter the Top500: 25 nodes to get to a theoretical 25 TFLOPS and 5 extra to overcome the overhead. So for about €60 000,- of hardware anyone can be on the list (and add at least €13 000,- if you want to use Windows instead of Linux for some reason). OK, you pay most for the services and the actual building when buying such a cluster, but you get the idea: it does not cost you a few million any more.

I’m curious: who is building these kinds of clusters? Could you tell me the specs (theoretical TFLOPS, LinPack TFLOPS and watts/TFLOPS) of your (theoretical) cluster, which costs the customer less than €100 000,- in total? Or do you know companies who can do this? I’ll make a list of companies who will be building the clusters of tomorrow, the “Top €100.000,- HPC cluster list”. You can mail me via vincent [at] this domain, or put your answer in a comment.
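The node count and budget estimated above can be reproduced with a few lines of arithmetic. A small sketch, using only the numbers from the text (1 double precision TFLOPS and about €2000,- per node, 5 extra nodes for overhead); the function and variable names are just for illustration:

```python
import math


def nodes_needed(target_tflops, tflops_per_node, overhead_nodes=0):
    """Nodes required to reach a theoretical target, plus spare nodes for overhead."""
    return math.ceil(target_tflops / tflops_per_node) + overhead_nodes


TARGET_DP_TFLOPS = 25.0    # theoretical target, just above the #500 entry
DP_TFLOPS_PER_NODE = 1.0   # 4 high-end GPUs per node
OVERHEAD_NODES = 5         # extra nodes to overcome the overhead
PRICE_PER_NODE_EUR = 2000  # "almost 2000 euro" per node, rounded up

nodes = nodes_needed(TARGET_DP_TFLOPS, DP_TFLOPS_PER_NODE, OVERHEAD_NODES)
hardware_cost = nodes * PRICE_PER_NODE_EUR

print(nodes)          # 30
print(hardware_cost)  # 60000
```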

Update: the hardware shopping-list

Nobody pointed out in the comments that it is easy to build a faster machine than the one described above. So I’ll do it. We want the most flops per box, so here’s the wishlist:

  • A motherboard with as many slots as possible for PCI-E, CPU-sockets and memory-banks. This is because the latency between the nodes is high.
  • A CPU with at least 4 cores.
  • Focus on the bandwidth, else we will not be able to use all power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (where, for some reason, people love to talk about extreme machines). AMD currently has the graphics cards with the most double precision power, so I chose their products. I’m looking around for Intel + Nvidia, but currently they are far behind. Is AMD back on stage after being beaten by Intel’s Core-products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3-socket, 6(!) PCI-e slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for the 6 cores and the low price per FLOPS. And to make it a monster, we add 6 times an AMD HD5970 (€599,- each), giving 928 x 6 = 5568 DP-GFLOPS. It can handle 16GB DDR3 (€750,-), so we put it in. It needs about 3 power-supplies of 700 Watt (€100,- each). We add a 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). The case needs to house the 3 power supplies (€100,-). Cooling is important and I suggest you compete with a wind-tunnel (€500,-). It will cost you €6228,- for 5,6 double precision TFLOPS and 27 TFLOPS single precision. A cluster would be on the HPC500-list for around €38000,- (pure hardware-price, not taking network-devices too much into account, nor the price for man-hours).

Disclaimer: this is the price of a single node, excluding services, maintenance, software-installation, networking, engineering, etc. Please note that the above price is purely for building a single node yourself, if you have the knowledge to do so.
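For those who want to check the shopping-list, the sum can be written out as below. The prices and quantities are taken from the text; the one assumption is that the €100,- for the power supplies is a price per unit, which is what makes the total come out at €6228,-. Six cards at 928 DP-GFLOPS give 5568 DP-GFLOPS, matching the quoted 5,6 TFLOPS.

```python
# (description, unit price in euro, quantity)
parts = [
    ("GigaByte GA-890FXA-UD7 motherboard", 245, 1),
    ("AMD Phenom II X6 1090T", 289, 1),
    ("AMD HD5970", 599, 6),
    ("16GB DDR3", 750, 1),
    ("700 Watt power supply", 100, 3),  # assumed to be priced per unit
    ("128GB SSD", 350, 1),
    ("2 TB HDD", 100, 1),
    ("Case", 100, 1),
    ("Cooling", 500, 1),
]

DP_GFLOPS_PER_CARD = 928
NUM_CARDS = 6

node_price = sum(price * quantity for _, price, quantity in parts)
node_dp_gflops = DP_GFLOPS_PER_CARD * NUM_CARDS
euro_per_dp_gflops = node_price / node_dp_gflops

print(node_price)                    # 6228
print(node_dp_gflops)                # 5568
print(round(euro_per_dp_gflops, 2))  # 1.12
```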

Targeting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How do you make software that performs on several platforms?”.

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in a lot of Java-code you’ll find hard-coded filename-delimiters in the file-names, which then work on one OS only. Portability is a problem that exists in various aspects of programming. Let’s look at some of the main goals software can have, and which portability-problems they have.

  • Functionality. This is the minimum requirement. Once a function is decided, changing functionality takes a lot of time. Writing code that is very flexible in requirements is hard.
  • User-interface. This is what one sees and which is not too abstract to talk about. For example, porting software to a touch-device requires a lot of rethinking of interaction-principles.
  • API and library usage. To lower development-time, existing and known APIs and libraries are used. This can work out three ways: separation of concerns, less development-time and dependency. The first two being good architectural choices, the latter being a potential hazard. Changing the underlying APIs is not easy.
  • Data-types. Handling video is different from handling video-formats. If the files can be handled in the intermediate form used by the software, then adding new file-types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only corporate operating systems tend to think of their own platform only, but also competing standards. It compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will be the main point of this article.
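To make the hard-coded delimiter problem from the introduction concrete, here is a small sketch. It uses Python instead of Java to keep it short, and the file names are made up for the example. Building the path from its components lets the library pick the delimiter, while a hard-coded Windows-style string is not split at all on a POSIX system.

```python
from pathlib import PurePosixPath, PureWindowsPath

# Hypothetical path components, for illustration only.
components = ("data", "kernels", "scale.cl")

# Portable: the path class inserts the delimiter of its platform.
posix_path = str(PurePosixPath(*components))
windows_path = str(PureWindowsPath(*components))

# Not portable: a delimiter baked into the string.
hard_coded = "data\\kernels\\scale.cl"
parts_on_windows = PureWindowsPath(hard_coded).parts
parts_on_posix = PurePosixPath(hard_coded).parts

print(posix_path)        # data/kernels/scale.cl
print(windows_path)      # data\kernels\scale.cl
print(parts_on_windows)  # ('data', 'kernels', 'scale.cl')
print(parts_on_posix)    # ('data\\kernels\\scale.cl',)
```

In real code `pathlib.Path` would be used, which selects the right flavour for the OS it runs on; the two pure classes are used here only to show both outcomes on any machine.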

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clearer that NVIDIA has the problem in their GPGPU-language too, whereas it was used before to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features only available on the latest Kepler-GPUs.

Win an OpenCL mug!

The first batch is in and you can win one from the second batch!

We’re sending a mug to a random person who subscribes to our newsletter before the end of 17 April 2017 (Central European Time). Yes, that’s a Monday.

Two winners

We’ll pick two winners: one from academia and one from industry. If you select “other” as your background, then share which category you fall into in the last field.

Did you already subscribe and also want to win? I am not forgetting you – more details are in a newsletter next quarter.

More winners, by referring to a friend

If you refer a colleague, a friend or even a stranger to subscribe, you can both win a mug. Just be sure he/she remembers to mention you to me when I ask. Before you ask: the maximum referral-length is 5 (so referral of referral of referral of referral, etc) plus the one who started it.

UPDATE: If you win a mug and were not referred by somebody, you can pick a co-winner yourself. Joy should be shared.

You can also use this link

Meet us in April

The coming month we’re travelling every week. This generates a lot of opportunities to meet the StreamHPC team! For appointments, send an email to

  • Meet us at ParallelCon (6 April 2016, Heidelberg, Germany). Besides the crash course (see below), we also have a talk on Vulkan.
  • Crash Course OpenCL @ ParallelCon (8 April 2016, Heidelberg, Germany). This is part of the conference – you can still buy tickets!
  • Meet us in Toronto (11 April 2016, Toronto, Canada). In Toronto for business, with time for appointments.
  • Meet us at IWOCL (19 April 2016, Vienna, Austria). The event-of-the-year for all OpenCL. So of course we’re there.
  • Meet us in Grenoble (25 April 2016, Grenoble, France). For a training we’re there the whole week. On Thursday and Friday there is time for appointments.

We’re happy to talk business and about technology. Also giving presentations at your company is an option.

