Targeting various architectures in OpenCL and CUDA

“Everything that *is* makes up one single world; but not everything is alike in this world” – Plato

The question we aim to answer in this post is: “How do you make software that performs well on several platforms?”.

Note: This article is not fully finished – I’ll add more information during the coming months. It’s busy here!

Even in a lot of Java code you’ll find hard-coded path separators in file names, which then work on one OS only. Portability is a problem that shows up in various aspects of programming. Let’s look at some of the main goals software can have, and the portability problems that come with them.

  • Functionality. This is the minimum requirement. Once the functionality is decided on, changing it takes a lot of time. Writing code that is flexible with regard to changing requirements is hard.
  • User-interface. This is the visible part, and therefore the least abstract to talk about. For example, porting software to a touch device requires a lot of rethinking of interaction principles.
  • API and library usage. To lower development time, existing and well-known APIs and libraries are used. This can work out in three ways: separation of concerns, less development time and dependency. The first two are good architectural outcomes; the last is a potential hazard, as changing the underlying APIs later is not easy.
  • Data-types. Handling video is different from handling video formats. If files can be handled in the intermediate form used by the software, then adding new file types is relatively easy.
  • OS and platform. Besides many visible specifics, an OS is also a collection of APIs. Not only do corporate operating systems tend to think of their own platform only; competing standards do too. This compares a lot to what is described under APIs.
  • Hardware-performance. Optimizing software for a specific platform makes it harder to port to other platforms. This will be the main point of this article.

OpenCL is known for not being performance-portable, but it is the best we currently have when it comes to writing code with performance as a primary target. The funny thing is that with CUDA 5.0 it has become clear that NVIDIA has this problem in their GPGPU language too, whereas before it was used to differentiate CUDA from OpenCL. Also, CUDA 5.0 has many new features that are only available on the latest Kepler GPUs.

CUDA and compute levels

CUDA has an advantage: currently all CUDA-programmable devices are discrete GPU chips behind a PCI bus. This (current) consistency in architecture gives us the luxury of focusing on compute levels only. Each compute level tells us two things: hardware specifics and functional capabilities. See the table in NVIDIA’s Kepler architecture document.

So, if you have an optimized kernel that uses dynamic parallelism, you must check whether the device has at least compute level 3.5. With cudaGetDeviceProperties you can check the major version (should be at least 3) and the minor version (should be at least 5, if the major version is exactly 3). If you want to use, say, 200 registers per thread in a kernel, you can use this same function to query that capability for the given device instead of relying on the compute level alone.
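As a minimal sketch of such a check (assuming the CUDA runtime API and device 0; error handling omitted):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  /* query device 0 */

    /* At least compute level 3.5: major > 3, or major == 3 with minor >= 5 */
    int has_dyn_par = (prop.major > 3) ||
                      (prop.major == 3 && prop.minor >= 5);
    printf("Compute level %d.%d, dynamic parallelism: %s\n",
           prop.major, prop.minor, has_dyn_par ? "yes" : "no");

    /* Query a limit directly instead of inferring it from the
       compute level alone */
    printf("Registers per block: %d\n", prop.regsPerBlock);
    return 0;
}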

In some cases you might want an alternative way of communicating with the CUDA device, but most times you can simply have several kernels in one .cu file for the various architectures. When the device is a Kepler GK110 or better, you run “kernel_35”. This comes close to the standard way of working in OpenCL. So, if you only develop in CUDA, read the next sections with CUDA in mind.
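A hedged sketch of such a runtime dispatch – kernel_30 and kernel_35 are hypothetical names following the pattern above, and the launch configuration is arbitrary:

#include <cuda_runtime.h>

__global__ void kernel_30(float *data) { /* base version */ }
__global__ void kernel_35(float *data) { /* uses compute-level-3.5 features */ }

void launch_best_kernel(float *d_data) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (prop.major > 3 || (prop.major == 3 && prop.minor >= 5)) {
        kernel_35<<<64, 256>>>(d_data);  /* Kepler GK110 or newer */
    } else {
        kernel_30<<<64, 256>>>(d_data);  /* fallback for older devices */
    }
}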

OpenCL initialization

The long initialization steps of OpenCL are feared, but they are very useful. As we have seen above, selecting the optimal CUDA kernels requires checking the capabilities of the CUDA hardware. With OpenCL you do two more steps before selecting on hardware capabilities with clGetDeviceInfo:

  • Check the device vendor (Intel, NVIDIA, AMD, ZiiLabs, Altera, etc.) with clGetPlatformInfo.
  • Check the architecture type (CPU, GPU or accelerator) with clGetDeviceInfo.

OpenCL devices can be on the same chip as the host, behind a PCI bus, or be the host itself – this results in different optimal data-transfer techniques and thus different kernels (and even several host functions). Based on all the gathered information, the optimal kernel (plus parameters) can be selected.
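A minimal sketch of these two checks (first platform, first device; error handling omitted):

#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id device;
    char vendor[256];
    cl_device_type type;

    clGetPlatformIDs(1, &platform, NULL);
    clGetPlatformInfo(platform, CL_PLATFORM_VENDOR,
                      sizeof(vendor), vendor, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
    clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, NULL);

    printf("Vendor: %s, type: %s\n", vendor,
           (type & CL_DEVICE_TYPE_GPU) ? "GPU" :
           (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "accelerator/other");
    return 0;
}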

A main difference between CUDA and OpenCL is that if you care about both performance and portability, you have a higher chance of ending up with more kernels. Alternatively, you can put the various architecture-dependent optimizations (such as local caching) inside one kernel. I would personally recommend having more kernels (or calling a list of functions from the kernel), as piling up optimizations in one kernel can lower readability enormously.
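As a sketch, selecting one of several kernel variants compiled into the same program could look like this – the kernel names are hypothetical:

#include <CL/cl.h>

/* Pick the kernel variant matching the device type; "mykernel_gpu" and
   "mykernel_cpu" are assumed to be compiled into the same cl_program. */
cl_kernel select_kernel(cl_program program, cl_device_id device) {
    cl_device_type type;
    clGetDeviceInfo(device, CL_DEVICE_TYPE, sizeof(type), &type, NULL);
    const char *name = (type & CL_DEVICE_TYPE_GPU) ? "mykernel_gpu"
                                                   : "mykernel_cpu";
    return clCreateKernel(program, name, NULL);
}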

How do you actually optimize for the different architectures of the different vendors? You can learn best from the SDKs. That’s why I made so much noise about NVIDIA no longer shipping OpenCL examples in their SDK. I’d like my students to be able to optimize their kernels for the latest architectures.

A trick I like to use is to have a simple, unoptimized kernel focused on SIMD principles only: the base kernel. Most times this kernel works quite well on CPUs (AVX). When optimizing the kernel, I can always compare results against it. Another advantage is that in case of an unknown architecture, I can always try this one first.
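As an illustration, such a base kernel could be as plain as this hypothetical element-wise example, written in OpenCL C:

/* Base kernel: one work-item per element, no local memory, no
   vendor-specific tricks. Serves as the correctness reference that
   optimized variants are compared against. */
__kernel void scale_add(__global const float *a,
                        __global const float *b,
                        __global float *out,
                        const float factor)
{
    int i = get_global_id(0);
    out[i] = factor * a[i] + b[i];
}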

Extensions

OpenCL can be extended with vendor-specific functionality. See http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/EXTENSION.html for a list. There is more that must be said, so I’ll extend this paragraph later.
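Whether a device supports a given extension can be checked at runtime via CL_DEVICE_EXTENSIONS; a minimal sketch, here probing for double precision via cl_khr_fp64:

#include <string.h>
#include <CL/cl.h>

/* Returns 1 if the device advertises the given extension */
int has_extension(cl_device_id device, const char *name) {
    char extensions[4096];  /* assumed large enough for the list */
    clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS,
                    sizeof(extensions), extensions, NULL);
    return strstr(extensions, name) != NULL;
}

/* Usage: if (has_extension(device, "cl_khr_fp64")) { ...use doubles... } */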

Automatic optimized initialization

Writing all those different kernels takes a lot of time when you are out of practice. So what if you need results soon? After you have understood that CPU-programming ≠ GPU-programming (we give trainings on this concept), you can start using higher-level languages. According to this JavaCL page, initialization needs just a few steps:

// JavaCL selects the "best" device for us; myKernelSource, u, v,
// inputBuffer and resultsBuffer are assumed to be defined elsewhere
CLContext context = JavaCL.createBestContext();
CLProgram program = context.createProgram(myKernelSource).build();
CLKernel kernel = program.createKernel(
        "myKernel",
        new float[] { u, v },
        context.createIntBuffer(Usage.Input, inputBuffer, true),
        context.createFloatBuffer(Usage.Output, resultsBuffer, false)
);

It is not complete, but you can see it is short, understandable and compact. Higher-level languages such as OpenACC, ScalaCL and Aparapi can take many pains out of low-level programming. They can potentially generate various optimal kernels for various architectures, based on their internal architecture database and the information obtained from the driver.

Auto-tuning

This technique is very important, and I will cover it in much more detail in an upcoming article. Auto-tuning means varying settings such as workgroup sizes and data-block sizes, while measuring the time it takes to transfer and compute pre-defined data-sets.
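A hedged sketch of the core idea (assumes a command queue created with CL_QUEUE_PROFILING_ENABLE, kernel arguments already set, and a global size divisible by each candidate):

#include <CL/cl.h>

/* Try several work-group sizes and return the fastest one */
size_t autotune_local_size(cl_command_queue queue, cl_kernel kernel,
                           size_t global_size) {
    size_t candidates[] = { 32, 64, 128, 256 };
    size_t best = candidates[0];
    cl_ulong best_time = (cl_ulong)-1;

    for (int i = 0; i < 4; i++) {
        cl_event event;
        cl_ulong start, end;
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size,
                               &candidates[i], 0, NULL, &event);
        clWaitForEvents(1, &event);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                                sizeof(start), &start, NULL);
        clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                                sizeof(end), &end, NULL);
        if (end - start < best_time) {  /* times are in nanoseconds */
            best_time = end - start;
            best = candidates[i];
        }
        clReleaseEvent(event);
    }
    return best;
}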

A cleaner future?

In short: no. There are many people promising golden mountains, with software supposedly much better than OpenCL. But if you just check the list above, you’ll understand that these are nothing more than false promises.

But luckily, something is getting better: with (hidden) auto-tuning in higher-level languages on top of OpenCL, we can get closer to writing only the base kernel.