Atomic operations for floats in OpenCL – improved

Atomic floats

Looking around in our code and on our intranet, I found lots of great and unique solutions for squeezing out the most performance. Of course we cannot share all of those, but there are some code snippets that just have to get out. Below is one of them.

In OpenCL, atomic_add and the other built-in atomic arithmetic functions exist only for ints, not for floats. Unfortunately, there are situations where there is no other way to implement the algorithm than with atomics on floats. Back in 2011, Igor Suhorukov shared a solution for getting atomic functions for floats, using atomic_cmpxchg(). Here is his example for atomic add:


[raw]
inline void AtomicAdd_g_f(volatile __global float *source, const float operand) {
    union {
        unsigned int intVal;
        float floatVal;
    } newVal;
    union {
        unsigned int intVal;
        float floatVal;
    } prevVal;
    do {
        prevVal.floatVal = *source;
        newVal.floatVal = prevVal.floatVal + operand;
    } while (atomic_cmpxchg((volatile __global unsigned int *)source, prevVal.intVal,
                            newVal.intVal) != prevVal.intVal);
}
[/raw]

Unfortunately, this implementation is not guaranteed to produce correct results, because OpenCL does not enforce global/local memory consistency across all work-items from all work-groups. In other words, a plain read from the buffer source is not guaranteed to actually read from the specified global buffer; it could, for example, return a value held in a local cache. For more details, see section 3.3, “Memory Model”, subsection “Memory Consistency”, of the OpenCL specification.
To be sure that the implementation of AtomicAdd_g_f is correct, we need to use the value returned by atomic_cmpxchg. That return value is guaranteed to be the value actually stored in global memory, so each retry should start from it rather than from another plain read of *source.

Since our improved version seems to be hidden deep in the code of GROMACS, here’s the code you should use:


[raw]
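/* "inline" replaces the _INLINE_ macro, which is not defined in this snippet. */
inline void atomicAdd_g_f(volatile __global float *addr, float val)
{
    union {
        unsigned int u32;
        float        f32;
    } next, expected, current;
    current.f32 = *addr; /* initial guess only; this plain read may be stale */
    do {
        expected.f32 = current.f32;
        next.f32     = expected.f32 + val;
        /* atomic_cmpxchg returns the value that was actually in memory */
        current.u32  = atomic_cmpxchg( (volatile __global unsigned int *)addr,
                                       expected.u32, next.u32);
    } while( current.u32 != expected.u32 ); /* retry from the returned value */
}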
[/raw]

As mentioned in Suhorukov’s blog post, you can change __global to __local to get a local-memory variant, and implement the other operations likewise by replacing the line that computes next:


[raw]
Atomic_mul_g_f(addr, val): next.f32 = expected.f32 * val;
Atomic_mad_g_f(addr, val1, val2): next.f32 = mad(val1, val2, expected.f32);
Atomic_div_g_f(addr, val): next.f32 = expected.f32 / val;
[/raw]
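
To make the substitution concrete, here is a minimal sketch of a local-memory multiply, obtained by applying both changes (__global to __local, and + to *) to the improved add above. The name atomicMul_l_f is our own choice for illustration and is not taken from GROMACS. The sketch assumes OpenCL 1.1 or later, where atomic_cmpxchg on __local unsigned int is a core built-in. The block at the top is only a host-side shim, assuming GCC or Clang, so that the function can be compiled and checked as plain C; an OpenCL compiler defines __OPENCL_VERSION__ and skips it.

```c
/* Host-only shim (assumption: GCC or Clang). OpenCL compilers define
   __OPENCL_VERSION__ and therefore never see these two defines. */
#ifndef __OPENCL_VERSION__
#define __local
#define atomic_cmpxchg(p, cmp, val) __sync_val_compare_and_swap((p), (cmp), (val))
#endif

/* Hypothetical local-memory variant: atomically performs *addr = *addr * val. */
void atomicMul_l_f(volatile __local float *addr, float val)
{
    union {
        unsigned int u32;
        float        f32;
    } next, expected, current;
    current.f32 = *addr; /* initial guess only */
    do {
        expected.f32 = current.f32;
        next.f32     = expected.f32 * val; /* the only line that differs per operation */
        /* the returned value is what memory really held; reuse it on retry */
        current.u32  = atomic_cmpxchg((volatile __local unsigned int *)addr,
                                      expected.u32, next.u32);
    } while (current.u32 != expected.u32);
}
```

Inside a kernel you would typically call this on a __local float that all work-items of a work-group update, and place a barrier(CLK_LOCAL_MEM_FENCE) before any work-item reads the final value.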

Enjoy!