OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.
The most important ones are reduce and scan operations. Those patterns have been used in many OpenCL software and can now be implemented in a more straightforward way. The promise to the developers was that the vendors now can provide better performance using none or very little local memory. However, the promised performance wasn’t there from the beginning.
Recently, at StreamHPC we worked on improving performance of certain OpenCL kernels running specifically on AMD GPUs where we needed OpenGL-interop and thus chose Catalyst-drivers. It turned out that work-group and sub-group functions did not give the expected performance on both Windows and Linux. Continue reading “Double the performance on AMD Catalyst by tweaking subgroup operations”