Basic concepts: Function Qualifiers

Optimisation of one’s thoughts is a complex problem: a lot of interacting processes can be defined, if you think about it.

With OpenCL you have to distinguish the run-time and compile-time of the host C-code from those of the kernel code. It is important to be precise when talking about compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the host software, after the compute devices have been queried. The OpenCL compiler can generate better-optimised code when you give it as much information as possible. One of the ways to do that is using Function Qualifiers. A function qualifier is written as a kernel attribute:

__kernel __attribute__((qualifier(qualification)))  void foo ( …. ) { …. }
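To place this in time: below is a minimal host-side sketch (error handling omitted; build_example and kernel_source are my own illustrative names) of where this “compile-time of the kernel” actually happens. Only after the device has been queried is the OpenCL compiler invoked via clBuildProgram.

#include <CL/cl.h>

/* Minimal sketch: the kernel source is compiled here, at run-time of the
   host program, after the compute device has been queried. */
void build_example(const char *kernel_source)
{
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_program program = clCreateProgramWithSource(context, 1, &kernel_source, NULL, NULL);

    /* This is the compile-time of the kernel: the OpenCL compiler runs now,
       with full knowledge of the target device and of any kernel attributes. */
    clBuildProgram(program, 1, &device, "", NULL, NULL);
}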

There are three such qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them in the official documentation, which has more examples.

vec_type_hint

It tells the compiler which vector type is used in the computation, and thus its width. If you state, for example, that float4 will be used, the auto-vectoriser tries to optimise the code for that width. Intel is very much a fan of this hint. Strangely, AMD (with VLIW in its Radeons) does not push it much. Here’s an example, with clear naming, of a kernel that works on int8:

__kernel __attribute__((vec_type_hint(int8)))  void foo_int8 ( …. ) { …. }

The default assumption is that the computations will be done using int scalars(!). Intel says in its documentation that it uses vectors of four wide as the default to optimise for. In their SDK 1.1 they suggest hinting for float4 or int4, but in 1.5 they say this hint turns off the auto-vectoriser if the width is not 4 × 4 bytes (float4 or int4). AMD says you should use vectors of four wide, but not to use this hint; its compiler simply tries to optimise for four-wide vectors (like Intel). NVidia is silent on this one as well. Nevertheless, it is always wise to use this hint, as you never know what optimisations a given compiler can do with it.

In my experience, this hint does not always manage to auto-vectorise scalar kernels; packing the data into vectors yourself always works better. If you have a good example of a not-too-simple scalar kernel that vectorised well using this hint, please let me know in the comments.
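To illustrate what I mean by packing, here is a minimal sketch (the kernel names are my own) of a scalar kernel next to a hand-packed float4 version that carries the hint:

/* Scalar version: the compiler has to guess whether and how to vectorise. */
__kernel __attribute__((vec_type_hint(float)))
void scale_scalar(__global const float* in, __global float* out, float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}

/* Hand-packed version: the data is already grouped per four floats and the
   hint tells the compiler that float4 is the unit of computation. */
__kernel __attribute__((vec_type_hint(float4)))
void scale_float4(__global const float4* in, __global float4* out, float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;  /* one float4 multiply replaces four scalar ones */
}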

work_group_size_hint and reqd_work_group_size

The compiler cannot see how the kernel will be called, but it could optimise the scheduling if it knew the work-group size in advance. work_group_size_hint suggests the likely dimensions of the work-groups in the NDRange; reqd_work_group_size does the same but is strict: the kernel may only be launched with exactly that size. It is not documented how the compilers handle these two differently, or at all. On GPUs this hint can give a good speed-up, so just try with and without to see how it works for your kernel. Check this example of a kernel that needs a work-group of 64×1×1:

__kernel __attribute__((reqd_work_group_size(64, 1, 1)))  void foo ( …. ) { …. }
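On the host side, the local work size you pass at enqueue time then has to match the attribute exactly, or the launch fails with CL_INVALID_WORK_GROUP_SIZE. Here is a minimal sketch (launch_foo is my own name; error handling omitted), which also shows how to read the attribute back with clGetKernelWorkGroupInfo:

#include <CL/cl.h>

/* Minimal sketch: launching a kernel that was declared with
   reqd_work_group_size(64, 1, 1). n is assumed to be a multiple of 64. */
void launch_foo(cl_command_queue queue, cl_kernel kernel, cl_device_id device, size_t n)
{
    /* The attribute can be read back: this returns {64, 1, 1}, or {0, 0, 0}
       when no reqd_work_group_size was given. */
    size_t compile_wg[3];
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_COMPILE_WORK_GROUP_SIZE,
                             sizeof(compile_wg), compile_wg, NULL);

    size_t global = n;
    size_t local  = 64;   /* must match the attribute exactly */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
}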

Multiple attributes

You can use several attributes when needed. While you might think you can list more attributes separated by spaces, they are officially written as a comma-separated list between the double parentheses. Here’s an example of a kernel that works on float4 and assumes a 1D work-group of 32:

__kernel __attribute__((vec_type_hint(float4), work_group_size_hint(32, 1, 1)))  void foo ( …. ) { …. }

There’s no point in using both work_group_size_hint and reqd_work_group_size together, though.

These hints cannot (officially) be used on functions that are not kernels. I could not find what the documentation says about called functions, but I assume the compiler applies the kernel’s hints to all the functions it calls.

Before we go… Some more food for thought

OpenCL is low-level. To give it room to evolve, less explicit coding is needed: write code that the compiler can optimise. I always tell my trainees to keep a copy of the unoptimised code, because the compilers get better every year. Using vec_type_hint also keeps you focused on packing the data together and thinking like the compiler: can four operations be done at the same time in every step? And if so, can I show the compiler how?

Why work_group_size_hint and reqd_work_group_size are actually needed is a question I still have. I hope this gets resolved in 2.0, so that you simply compile a kernel for a given NDRange and then run it. Hungry for some more explanation? Ask your questions in the comments!

And remember: always keep a copy of your unoptimized scalar kernel. You can never tell what future compilers are capable of.

2 thoughts on “Basic concepts: Function Qualifiers”

  1. Claudio André

    Hi, I’m working on some open-source OpenCL software (http://openwall.info/wiki/john/GPU). Thanks for taking the time to write your good blog.

    Well, in my code none of these qualifiers has shown any good results. “Why work_group_size_hint and reqd_work_group_size are actually needed is a question I have” too.
    And I agree that there is much to evolve: the compilers and drivers are not as good as they have to be. Big kernels are not handled well (this rant applies mostly to AMD).

    My feeling is that I’m suffering much more than I have to.

  2. PolarLights One

    Thanks for the article,

    For Claudio: OpenCL is hard to write, mainly if you want it stable and fast, but it is a “low/middle level” language that lets you control each optimisation. Sure, I suffer too much as well 😛 but expect that with time more tools (debuggers for each device/platform etc.) will come and help us!

    For Vincent: I think that work_group_size_hint, for example, is still necessary even in OpenCL 2…
    Simply because you can play with the work-group size at run-time, depending on some settings of your kernel, the number of elements to process, etc., but work_group_size_hint will give a simple general approach to optimise!
