Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU enthusiasts have heard of both OpenCL and CUDA. There are other solutions, but these two have the most potential. The two are very comparable, like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take another look at the differences next year. Earlier this year we discussed this difference with a focus on marketing.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen OpenCL and CUDA code, the biggest difference might seem to be the prefix on the API calls: “cl” for OpenCL versus “cu” or “cuda” for CUDA. But there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, a kernel is also called a kernel in OpenCL; a program there is the object that holds one or more kernels. Personally I like CUDA’s terms “thread” and “per-block memory” more. It is very clear that CUDA targets the GPU only, while in OpenCL it can be any device.
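To make the table a bit more concrete, here is a minimal Python sketch (plain host-side arithmetic, not kernel code) of how the index terms line up. It assumes a one-dimensional launch with no global offset, and the function name global_id is only illustrative. A CUDA thread computes its position as blockIdx.x * blockDim.x + threadIdx.x; an OpenCL work-item gets the same number from get_global_id(0).

```python
# Rough sketch: how CUDA's thread indexing lines up with OpenCL's work-item indexing.
# Assumes a 1D launch and a zero global offset; names here are illustrative only.

def global_id(group_id, group_size, local_id):
    """CUDA:   blockIdx.x * blockDim.x + threadIdx.x
    OpenCL: get_group_id(0) * get_local_size(0) + get_local_id(0),
            which equals get_global_id(0) when no offset is used.
    """
    if not 0 <= local_id < group_size:
        raise ValueError("local_id must lie inside the block / work-group")
    return group_id * group_size + local_id


# 4 blocks (work-groups) of 64 threads (work-items) each
ids = [global_id(g, 64, l) for g in range(4) for l in range(64)]
```

Every thread or work-item ends up with a unique index from 0 to 255, whichever vocabulary you use.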

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present a full benchmark comparing OpenCL and CUDA, but we don’t have enough hardware in-house to do one. The information below is what we’ve found on the net, plus a little based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation). This is mainly because OpenCL is implemented on top of the CUDA architecture; that shouldn’t be a reason in itself, but saying NVidia has simply put more energy into CUDA would also be just a wild guess. On the ATI 4000 series OpenCL is just slow, but on the 5000 series its performance is very comparable to NVidia’s. The specialised streaming processors, NVidia’s Tesla and AMD’s FireStream, are closely matched, while the Playstation 3 unbelievably still wins on some tasks.

The architecture of AMD/ATI hardware is very different from NVidia’s, and that’s why a kernel written with a specific brand or GPU in mind simply performs better than a version that is not optimised for it. So if you do a benchmark, the result really depends on which kernels you use. To be more precise: any benchmark can be written in favour of a specific architecture. For that reason, fine-tuning software to run at maximum speed on current and future(!) hardware for different kinds of datasets is (still) a specialised task. This is also one of the current problems of GPGPU, but kernel optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue: he compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but it was clear that the data transfer to and from the GPU is a huge bottleneck, and Hugh Merz was rather sceptical about GPU computing in 2007. He later extended his benchmark with the PS3 and a Tesla S1070, and then the differences became bigger. Since CPUs are going multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and that CPUs will more and more be programmed like GPUs (massively parallel).
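A back-of-the-envelope sketch of why the transfer matters so much. All figures below are made-up placeholders, not measurements from Hugh Merz’s benchmark, and the function name is hypothetical; the only point is that the copy time has to be added to the GPU side of the comparison.

```python
# Sketch with hypothetical numbers: effective speed-up once transfers are counted.

def effective_speedup(cpu_seconds, gpu_kernel_seconds, bytes_moved, bus_bytes_per_second):
    """Speed-up of the GPU over the CPU when host<->device copies are included."""
    transfer_seconds = bytes_moved / bus_bytes_per_second
    return cpu_seconds / (gpu_kernel_seconds + transfer_seconds)


# Assumed: the CPU takes 1.0 s and the kernel alone takes 0.05 s (20x faster),
# while 800 MB in total cross a bus that sustains 4 GB/s (0.2 s of copying).
kernel_only = 1.0 / 0.05
with_transfer = effective_speedup(1.0, 0.05, 800e6, 4e9)
```

With these assumed figures the 20x kernel shrinks to roughly 4x overall, which is the kind of gap that made people sceptical at the time.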

What we learn from this is 1) that different devices will improve once the demands become clearer, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD work much better with OpenCL, and the next generation might beat all others in many or only in specific areas in 2011. Who knows? IBM’s Cell processor is expected to enter the ring outside the home-brew PS3 render farms, but with what specialised product? NVidia wants to enter the HPC world at the high end, and they might even win it. ARM is developing multi-core CPUs, but will they support OpenCL to reach better FLOPS per watt than competitors?

Which way CUDA and OpenCL will develop is all about the choices manufacturers make.

Homogeneous vs Heterogeneous

For us this is the most important reason for having chosen OpenCL, even though CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path to heterogeneous architectures, just like Systems-on-a-Chip (SoCs) based on the ARM architecture. Watch for our upcoming article about ARM & SoCs.

While searching for more information about this difference, I came across a blog item by RogueWave which claims something different. I think the author either swapped Intel’s architectures with NVidia’s or knew things were going to change. The near future could bring us an x86 chip from NVidia. That would change a lot in the field, so more about this later. NVidia already has an ARM chip in its Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Just as with Java and .NET, which are very comparable, developers on both sides know very well that their favourite feature is missing in the other camp. Most of the time such a feature is just an external library that has been built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in OpenCL you have to create new kernels for every data-type.
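
One common workaround for the missing templates is to generate the kernel source per data type on the host and hand each string to OpenCL’s run-time compiler (clCreateProgramWithSource followed by clBuildProgram). The Python sketch below only does the text substitution; the kernel body, the scale_ naming scheme and the helper name are made up for illustration, and nothing is actually compiled here.

```python
# Sketch: emulate templated kernels by stamping out OpenCL C source per data type.
# The kernel body and the naming scheme are illustrative assumptions.
from string import Template

KERNEL_TEMPLATE = Template("""
__kernel void scale_${T}(__global ${T}* data, const ${T} factor)
{
    size_t i = get_global_id(0);
    data[i] = data[i] * factor;
}
""")


def make_kernels(type_names):
    """Return {kernel name: OpenCL C source} with one entry per data type."""
    return {
        "scale_" + t: KERNEL_TEMPLATE.substitute(T=t)
        for t in type_names
    }


sources = make_kernels(["float", "int", "float4"])
```

In CUDA the same thing would be a single templated kernel; here you end up with one generated kernel per type, each of which still has to be built and looked up by name.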

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

As with most shared standards, there is nothing to win or gain by promoting it. If you promote it, a lot of companies thank you, but the return on investment is lower than when you have your own standard. So OpenCL is simply used because it is available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA rather than with the non-profit organisation Khronos. Eventually CUDA drivers could be ported to IBM’s Cell processors or to ARM, since CUDA is very comparable to OpenCL. It really depends on the profit NVidia would make with such deals, so who can tell what will happen?

We still think OpenCL will eventually win in the consumer markets (desktop and mobile) because it supports more devices, but CUDA will stay a big player in the professional and scientific markets because of the legacy software being built up right now and the friendlier development support. We hope both will continue to exist and help push each other forward, just like OpenGL vs DirectX, NVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell which features eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

11 thoughts on “Difference between CUDA and OpenCL 2010”

RCL 22 April 2010

Hey, OpenCL can also use hardware interpolation when sampling from images (basically, textures) – just add CLK_FILTER_LINEAR flag. It’s perfectly supported in NVidia OpenCL implementation. Current ATI SDK suxx and doesn’t support images at all, but that’s just ATI.
RCL 22 April 2010

And major winning point of CUDA is (limited) C++ support in kernels – you can even use templated kernels! It really helps when you are implementing generic algorithms which have to work with various types (like image processing filters which should support uchar4, ushort4, float4 etc).

Another major win for CUDA is the ability to easily share structures between host and device code – just include the same header in .cu and .cpp files. CUDA runtime gets you started in 5 minutes… In OpenCL, a lot of boilerplate code is needed, and there’s never a guarantee that CPU and GPU (device) structures stay in sync.

On the other hand, major win for OpenCL is online compilation. You can create your kernels programmatically.

I wonder why that wasn’t mentioned in the article.
Vincent Hindriksen 22 April 2010

@RCL, when writing the article, I became aware that I actually did not know enough about CUDA-specific language features to name all the differences. When I was in doubt, I left it out.

Is this filter-mode CLK_FILTER_LINEAR comparable to CUDA’s hardware-interpolation?

Online compilation can be done in CUDA too; it’s referred to as JIT-compilation.

Templating is one I just really forgot! Possibly out of frustration. I’ll add it.
RCL 22 April 2010

Well, I haven’t measured speed of sampling images (read_imageui with float2 coords from sampler with CLK_FILTER_LINEAR) vs fetching in CUDA, but I think they are comparable, as underlying hardware is the same. From functional point of view they are the same – except that support of images is optional in OpenCL. Of 3 OpenCL implementations I know (IBM’s one for Cell, with DEVICE_CPU profile for main PPU and DEVICE_ACCELERATOR for SPUs, ATI’s one and NVidia’s which are both DEVICE_GPU) only one (NVidia’s) supports images at the moment. I expect ATI impl to support images in the future, though.

I overlooked JIT compilation in CUDA – thanks!
RCL 22 April 2010

And by the way, I don’t think CUDA will ever get ported to Cell. The hardware is not really capable of supporting it. Even the OpenCL implementation looks really limited compared to the other ones available (limited to a single thread per SPU for obvious reasons – that is, workgroup size is 1, no images, etc). The way SPUs access main memory makes it unlikely to (efficiently) support images ever.
SM 22 April 2010

There is a lot of development on both sides.
For instance, please look at CLyther for OpenCL, a utility that could far surpass CUDA’s C++ interface in terms of ease of use and extensibility.
Tom Sharpless 30 April 2010

I would like to see both of these initiatives pay more attention to two fundamental needs of image processing and simulation, that are not (yet) met by the typical PC graphics card: 1) big, accurate interpolation filters — 6×6 up to 256×256; 2) seriously fast data I/O — all the way to/from the file system.

I’m a veteran of both the Array Processor and DSP wars, and can vouch that a good programming API always beats mere hardware horsepower in the market. But behind that API we need something better than linear interpolation and write-only frame buffers.
Vincent Hindriksen 5 May 2010

@RCL, did you already order your ATI-hardware to test images? You were right, it wouldn’t take long. The future of CUDA and OpenCL mostly depends on business-decisions, so if NVidia teams up with IBM, then you can expect CUDA to replace OpenCL on Playstations.

@SM, CLyther ( http://clyther.sourceforge.net/ ) is very interesting indeed, but there are many ways to go and the offerings are growing. Did you already take a look at the QT-toolkit ( http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt/ )?

@Tom, isn’t OpenGL capable enough to do interpolations? You can share images between OpenGL and OpenCL, so why duplicate functionality?
OpenCL is not about built-in functionality (take a look at http://dannyruijters.nl/cubicinterpolation/ and have fun building your own algorithm), but OpenCL is a base on which libraries can be built to provide functionality that’s actually usable.
Nadav 7 August 2010

I am not sure why OpenCL is 10% slower. I imagine that with some minor tuning it should perform similarly to CUDA. After all, they are basically identical. It would be possible to write a translator from one to the other.
Hbsggman 27 February 2012

Well, I’m no expert on the hardware technicalities, but so far I have owned both a GTX 480 and now I’m gaming on an MSI 6950 2GB, and it’s blazing fast. I never go under 45 fps on big maps in Battlefield 3, and everything else averages 50 to 60 fps.

The physics looks very impressive. ATI’s whites are clearer, I think, but I don’t know about 3D or multi-desktop. Image for image, ATI has improved a lot; I had 3 NVidia cards before this one ATI, lol.

My next card will probably be ATI, except when NVidia releases Maxwell 😀
- Hbsggman 27 February 2012
  
  At Ultra Setting No AA cause i Prefer FPS over Glichy tweak