OpenCL 2.0 added several new built-in functions that operate at the work-group level, as well as functions that work within sub-groups (also known as warps or wavefronts). These work-group and sub-group functions implement basic parallel patterns for whole work-groups or sub-groups.
The most important ones are the reduce and scan operations. These patterns appear in a lot of OpenCL software and can now be implemented in a much more straightforward way. The promise to developers was that vendors could now deliver better performance while using little or no local memory. However, the promised performance wasn’t there from the beginning.
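For example, an exclusive prefix sum no longer needs a hand-written local-memory implementation; a minimal kernel using the built-ins could look like the sketch below (kernel and buffer names are ours, chosen for illustration; the sub-group built-ins are exposed through the cl_khr_subgroups extension):

```c
// Minimal OpenCL 2.0 kernel showing the built-in scan functions.
// Each work-item writes the exclusive prefix sum of "in" taken over its
// whole work-group and, separately, over its sub-group only.
#pragma OPENCL EXTENSION cl_khr_subgroups : enable

__kernel void scan_example(__global const int *in,
                           __global int *wg_out,
                           __global int *sg_out)
{
    const size_t gid   = get_global_id(0);
    const int    value = in[gid];

    // Exclusive prefix sum across the whole work-group (e.g. 256 items).
    wg_out[gid] = work_group_scan_exclusive_add(value);

    // Exclusive prefix sum across the sub-group only (64 items on GCN GPUs).
    sg_out[gid] = sub_group_scan_exclusive_add(value);
}
```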
Recently, at StreamHPC we worked on improving the performance of certain OpenCL kernels running specifically on AMD GPUs. We needed OpenGL interop and therefore chose the Catalyst drivers. It turned out that the work-group and sub-group functions did not give the expected performance on either Windows or Linux.
Doubling the performance
We therefore wrote low-level code to implement our own sub-group functions. Below we present benchmark results of 4 different exclusive prefix sum (scan) operations running on an AMD Radeon R9 Nano (Fiji), using Catalyst on Linux:
- sub-group scan using local memory (red),
- built-in sub-group function sub_group_scan_exclusive_add (orange),
- our implementation of sub-group scan (green),
- the built-in work-group function work_group_scan_exclusive_add (purple), and
- as a reference a standard buffer copy operation (blue).
While analysing the results it is important to remember that the work-group scan performs a prefix sum over all 256 work-items, whereas the other scan functions work on wavefronts/sub-groups of 64 work-items.
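For reference, the local-memory variant (red) is conceptually similar to the sketch below: a Hillis-Steele exclusive scan over each 64-item wavefront, staged in local memory. The fixed work-group size of 256 and the barriers are assumptions made for this illustration; the benchmarked kernel is tuned beyond this.

```c
// Sketch of an exclusive prefix sum per 64-item wavefront using local memory.
#define WAVE_SIZE 64

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void lmem_wave_scan(__global const int *in, __global int *out)
{
    __local int tmp[256];

    const size_t gid  = get_global_id(0);
    const size_t lid  = get_local_id(0);
    const size_t lane = lid % WAVE_SIZE;   // position inside the wavefront

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Hillis-Steele inclusive scan within each wavefront.
    for (uint offset = 1; offset < WAVE_SIZE; offset <<= 1) {
        int v = (lane >= offset) ? tmp[lid - offset] : 0;
        barrier(CLK_LOCAL_MEM_FENCE);
        tmp[lid] += v;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Shift by one lane to turn the inclusive scan into an exclusive one.
    out[gid] = (lane == 0) ? 0 : tmp[lid - 1];
}
```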
As you can see, our sub-group scan function is 2.3x faster than the built-in sub-group function. Even the method using local memory was faster than the built-in sub-group scan!
After those tests on Linux, we ran the same benchmarks for AMD Radeon R9 Nano (Fiji) on Windows, using the latest AMD drivers – Crimson ReLive Edition 17.3.3:
The built-in sub-group function turned out to be around 20% faster than what we got on Linux. On the other hand, the performance of the kernel that uses local memory dropped by 25%. Our sub-group scan function remained the best, reaching even more GiB/s than on Linux. Most importantly, these results prove that our solution is second to none when it comes to performance on both Linux and Windows.
It is important to add that on the ROCm platform the performance of the built-in sub-group scan function and of our implementation was virtually the same. However, as ROCm does not support OpenGL interop yet, we could not use it for this project.
You can also have this in your code
Do your OpenCL kernels use work-group or sub-group functions? Would your kernels benefit from sub-group operations? Is their performance lower than expected? If your answer is “yes” to at least one of these questions, you should consider contacting us.
This is not limited to scan on AMD Catalyst: we now have the framework to improve any specialised function on several platforms.
With the approach we used, we have the freedom to implement other, more complex operations as sub-group functions, tailoring them to the requirements of the algorithm and of the device, which should result in better performance and lower local memory usage.
The technique can be applied to platforms other than AMD and also to functions other than scan. This makes it possible to create missing functions for unsupported hardware features.
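As a simple illustration of that last point, a missing sub-group operation can be emulated with local memory, so that code written against sub-group semantics still runs on devices or drivers without support for it. The sketch below emulates a 64-wide sub-group reduction; the function name, the fixed sub-group size and the scratch buffer are assumptions made for this example, and a tuned version would use the hardware’s cross-lane instructions instead.

```c
// Fallback "sub-group" reduction built on local memory.
#define SIM_SUBGROUP_SIZE 64

inline int emulated_sub_group_reduce_add(int value, __local int *scratch)
{
    const size_t lid  = get_local_id(0);
    const size_t base = (lid / SIM_SUBGROUP_SIZE) * SIM_SUBGROUP_SIZE;
    const size_t lane = lid - base;

    scratch[lid] = value;
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction inside each simulated sub-group.
    for (uint offset = SIM_SUBGROUP_SIZE / 2; offset > 0; offset >>= 1) {
        if (lane < offset)
            scratch[lid] += scratch[lid + offset];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    return scratch[base];   // every lane returns the sub-group sum
}

__kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void reduce_example(__global const int *in, __global int *out)
{
    __local int scratch[256];
    const size_t gid = get_global_id(0);
    out[gid] = emulated_sub_group_reduce_add(in[gid], scratch);
}
```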