Basic concepts: Function Qualifiers

Optimisation is a complex problem: a lot of interacting processes come into play, if you think about it.

In OpenCL you deal with both the compile-time and the run-time of the C code. It is very important to be precise when you talk about compile-time of the kernel, as this can be confusing: the kernel is compiled at run-time of the host software, after the compute-devices have been queried. The OpenCL compiler can generate better optimised code when you give it as much information as possible. One of the methods is using Function Qualifiers. A function qualifier is notated as a kernel-attribute:

__kernel __attribute__((qualifier(qualification))) void foo ( …. ) { …. }
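To make the timing concrete, here's a minimal host-side sketch (C, error handling omitted; `context`, `device` and `err` are assumed to exist from earlier set-up code) showing that the kernel source, attributes included, is only compiled once the device is known:

```c
// Host side: the kernel string is compiled at RUN-TIME of the host
// program, after the device has been queried.
const char *src =
    "__kernel __attribute__((vec_type_hint(float4)))\n"
    "void foo(__global float4 *data) { /* ... */ }\n";

cl_program program = clCreateProgramWithSource(context, 1, &src, NULL, &err);
// This is the "compile-time of the kernel": it happens here, at run-time.
err = clBuildProgram(program, 1, &device, "", NULL, NULL);
cl_kernel kernel = clCreateKernel(program, "foo", &err);
```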

There are three qualifiers described in OpenCL 1.x. Let’s walk through them one by one. You can also read about them in the official documentation, which has more examples.


vec_type_hint

This hint tells the compiler which vector type, and which width, is used in the computation. If you say for example that float4 will be used, the vectoriser tries to optimise the code for that type. Intel is very much a fan of this hint. Strangely, AMD (with VLIW in its Radeons) does not push it much. Here’s an example with clear naming of a kernel that works on int8:

__kernel __attribute__((vec_type_hint(int8))) void foo_int8 ( …. ) { …. }

The default assumption is that the computations will be done on int scalars(!). Intel says in its documentation that it uses vectors of width four by default when optimising. In SDK 1.1 they suggest hinting for float4 or int4, but in 1.5 they say this hint turns off the auto-vectoriser if the width is not 4 * 4 bytes (float4 or int4). AMD also says you should use vectors of width four, but not to use this hint; its compiler simply tries to optimise for width-four vectors (like Intel). NVidia is silent on this one. Nevertheless, it is always wise to use this hint, as you never know what kind of optimisations the compiler can do with it.

In my experience, this hint does not always manage to auto-vectorise scalar kernels. Packing the data into vectors yourself always works better. If you have a good example of a not-too-simple scalar kernel that vectorised well using this hint, please let me know in the comments.
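To illustrate what "packing" means, here is a hypothetical pair of kernels (the names are mine, not from any SDK). The scalar version leaves the work to the auto-vectoriser; the packed version does one float4 operation per work-item, so it needs a quarter of the global size and no auto-vectorisation at all:

```c
// Scalar kernel: relies on the auto-vectoriser (nudged via vec_type_hint).
__kernel __attribute__((vec_type_hint(float4)))
void scale_scalar(__global float *data, float factor) {
    int i = get_global_id(0);
    data[i] = data[i] * factor;
}

// Hand-packed version: each work-item handles one float4,
// so the vectorisation is explicit in the data layout.
__kernel __attribute__((vec_type_hint(float4)))
void scale_packed(__global float4 *data, float factor) {
    int i = get_global_id(0);
    data[i] = data[i] * factor;  // one float4 multiply = 4 scalar multiplies
}
```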

work_group_size_hint and reqd_work_group_size

The compiler cannot see how the kernel will be called, but scheduling could be optimised if the work-group size were known in advance. work_group_size_hint suggests the likely work-group dimensions; reqd_work_group_size is similar but strict: the kernel may only be launched with exactly that work-group size. It is not documented how the compilers handle these differently, or at all. On GPUs this hint can give a good speed-up; just try with and without to see how it works for your kernel. Check this example of a kernel that requires a work-group of 64x1x1:

__kernel __attribute__((reqd_work_group_size(64, 1, 1))) void foo ( …. ) { …. }
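On the host side the strictness of reqd_work_group_size is enforced at launch. A minimal sketch (assuming `queue` and `kernel` exist, error handling omitted):

```c
// Host side: the local work-group size MUST match the kernel's
// reqd_work_group_size(64, 1, 1), otherwise clEnqueueNDRangeKernel
// fails with CL_INVALID_WORK_GROUP_SIZE.
size_t global_size[1] = { 1024 };  // a multiple of 64 in this sketch
size_t local_size[1]  = { 64 };

cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                    global_size, local_size,
                                    0, NULL, NULL);
```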

Multiple attributes

You can use several attributes when needed. While you might think you can add more attributes separated by spaces, they are officially notated as a comma-separated list between the double parentheses. Here’s an example of a kernel that works on float4 and expects a 1D work-group of 32:

__kernel __attribute__((vec_type_hint(float4), work_group_size_hint(32, 1, 1))) void foo ( …. ) { …. }

It makes no sense to combine work_group_size_hint and reqd_work_group_size, though: the strict requirement makes the hint redundant.

For functions that are not kernels, these attributes cannot (officially) be used. I could not find what the official documentation says about functions called from a kernel, but I assume the compiler applies the kernel's hints to all the functions it calls.

Before we go… Some more food for thought

OpenCL is low-level. To give it space to evolve, less explicit coding is needed: write code that the compiler can optimise. I always tell my trainees to keep a copy of the unoptimised code, because every year the compilers get better. Using vec_type_hint also keeps you focused on packing the data together and thinking like the compiler: can four operations be done at the same time in every step? And if so, can I show the compiler how?

Why work_group_size_hint and reqd_work_group_size are actually needed is a question I still have. I hope this will be resolved in OpenCL 2.0, so that you just compile a kernel given an NDRange and then run it. Hungry for some more explanation? Ask your questions in the comments!

And remember: always keep a copy of your unoptimized scalar kernel. You can never tell what future compilers are capable of.

