Double the performance on AMD Catalyst by tweaking subgroup operations

With standard OpenCL 2.0, scan operations used less than half of what AMD’s hardware can deliver.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are the reduce and scan operations. These patterns appear in a lot of OpenCL software and can now be implemented in a more straightforward way. The promise to developers was that vendors could now provide better performance while using little or no local memory. However, the promised performance wasn’t there from the beginning.
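For example, an exclusive prefix sum over a work-group or a sub-group boils down to a single built-in call. Below is a minimal sketch; the kernel names and buffer layout are ours, and the sub-group variant additionally requires the cl_khr_subgroups extension:

    // Compile with -cl-std=CL2.0; the sub-group built-in needs cl_khr_subgroups
    #pragma OPENCL EXTENSION cl_khr_subgroups : enable

    // Exclusive prefix sum over a whole work-group, no local memory in user code
    __kernel void scan_workgroup(__global const uint *in, __global uint *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = work_group_scan_exclusive_add(in[gid]);
    }

    // The same pattern at wavefront/sub-group level
    __kernel void scan_subgroup(__global const uint *in, __global uint *out)
    {
        size_t gid = get_global_id(0);
        out[gid] = sub_group_scan_exclusive_add(in[gid]);
    }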

Recently, at StreamHPC we worked on improving the performance of certain OpenCL kernels running specifically on AMD GPUs, where we needed OpenGL-interop and therefore chose the Catalyst drivers. It turned out that the work-group and sub-group functions did not deliver the expected performance on either Windows or Linux.

Doubling the performance

We therefore wrote low-level code to implement our own sub-group functions. Below we present benchmark results of four different exclusive prefix sum (scan) implementations running on an AMD Radeon R9 Nano (Fiji), using Catalyst on Linux:

  • sub-group scan using local memory (red),
  • built-in sub-group function sub_group_scan_exclusive_add (orange),
  • our implementation of sub-group scan (green),
  • the built-in work-group function work_group_scan_exclusive_add (purple), and
  • as a reference a standard buffer copy operation (blue).

While analysing the results, it is important to remember that the work-group scan performs a prefix sum across all 256 work-items, whereas the other scan functions operate on wavefronts/sub-groups of 64 work-items.
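The local-memory variant (red) follows the classic Hillis-Steele pattern sketched below. This is a simplified illustration that assumes one 64-wide wavefront per work-group, not our production kernel:

    #define WAVE_SIZE 64  // assumption: sub-group = 64-wide wavefront (GCN)

    // Exclusive scan through local memory; assumes the work-group holds
    // exactly one wavefront, so lane indexing into tmp does not collide.
    uint wave_scan_exclusive_lmem(uint value, __local uint *tmp)
    {
        uint lane = get_local_id(0);
        tmp[lane] = value;
        barrier(CLK_LOCAL_MEM_FENCE);

        // Hillis-Steele inclusive scan: log2(64) = 6 doubling steps
        for (uint offset = 1; offset < WAVE_SIZE; offset <<= 1) {
            uint t = (lane >= offset) ? tmp[lane - offset] : 0;
            barrier(CLK_LOCAL_MEM_FENCE);  // all reads done before writing
            tmp[lane] += t;
            barrier(CLK_LOCAL_MEM_FENCE); // all writes done before next read
        }

        // Shift by one lane to turn the inclusive result into an exclusive one
        return (lane == 0) ? 0 : tmp[lane - 1];
    }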

Linux, Crimson Edition 15.12

As you can see, our sub-group scan function is 2.3x faster than the built-in sub-group function. Even the local-memory method was faster than the built-in sub-group scan!

After those tests on Linux, we ran the same benchmarks on an AMD Radeon R9 Nano (Fiji) on Windows, using the latest AMD drivers, Crimson ReLive Edition 17.3.3:

Windows, Crimson ReLive Edition 17.3.3

The built-in sub-group function turned out to be around 20% faster than what we got on Linux. On the other hand, the performance of the kernel that uses local memory dropped by 25%. Our sub-group scan function remained the best, reaching even more GiB/s than on Linux. Most importantly, these results prove that our solution is second to none when it comes to performance on both Linux and Windows.

It is important to add that on the ROCm platform, the performance of the built-in sub-group scan function and of our own implementation were virtually the same. However, as ROCm does not offer OpenGL-interop yet, we could not use it here.

You can have this in your code too

Do your OpenCL kernels use work-group or sub-group functions? Would your kernels benefit from using sub-group operations? Is the achieved performance lower than expected? If your answer is “yes” to at least one of these questions, you should consider contacting us.

It is not only scan on AMD Catalyst: we now have the framework to improve any specialised function on several platforms.

With the approach we used, we have the freedom to implement other, more complex operations as sub-group functions, tailoring them to the requirements of the algorithm and the device. This should result in better performance and lower local memory usage.

The technique can be applied to platforms other than AMD and also to functions other than scan. This makes it possible to create missing functions for unsupported hardware features.
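As a hypothetical illustration of that idea, a kernel can pick the built-in where the extension exists and fall back to a hand-written replacement elsewhere. The macro name is ours, and the fallback is the local-memory sketch shown earlier:

    // Use the built-in where cl_khr_subgroups is available, else our fallback
    #ifdef cl_khr_subgroups
    #pragma OPENCL EXTENSION cl_khr_subgroups : enable
    #define SCAN_EXCL_ADD(v, tmp) sub_group_scan_exclusive_add(v)
    #else
    #define SCAN_EXCL_ADD(v, tmp) wave_scan_exclusive_lmem(v, tmp)
    #endif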
