Using async_work_group_copy() on 2D data

When copying data from global to local memory, you often see code like the snippet below (1D data):

if (get_local_id(0) == 0) {
  for (int i=0; i < N; i++) {
      data_local[i] = data_global[offset+i];
  }
}
barrier(CLK_LOCAL_MEM_FENCE); // make the copied data visible to the whole work-group

This can be replaced with an asynchronous copy using the function async_work_group_copy, which results in cleaner, more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C and C++.

event_t async_work_group_copy ( __local gentype *dst,
                                const __global gentype *src,
                                size_t data_size,   // number of elements, not bytes
                                event_t event );

event_t async_work_group_copy ( __global gentype *dst,
                                const __local gentype *src,
                                size_t data_size,   // number of elements, not bytes
                                event_t event );

The Khronos registry page for async_work_group_copy describes asynchronous copies between global and local memory and a prefetch from global memory. This makes it much easier to hide the latency of the data transfer. In the example below, you effectively get free time to run do_other_stuff() while the copy is in flight, which can result in faster code.

As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.

The copy is executed at work-group level, so there is no need to write code that makes sure it is executed by only one work-item.


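// offset is assumed to be a compile-time constant (for example a #define)
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    event_t event = 0; // 0 lets the first copy create a new event
    const int dataInLocalWidth = (offset*2 + get_local_size(0));
    for (int i=0; i < (offset*2 + get_local_size(1)); i++) {
        // copy one row of the tile, including the border on both sides
        event = async_work_group_copy(
             &dataInLocal[i*dataInLocalWidth],
             &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                 + (get_group_id(0)*get_local_size(0)) - offset],
             dataInLocalWidth,
             event);
    }
    do_other_stuff(); // code that you can execute for free
    wait_group_events(1, &event); // waits until the copies have finished
    // dataInLocal can be used from here on
}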
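
To check the index arithmetic without a GPU, the source index used in the kernel above can be mirrored on the host. The following is a minimal sketch, not part of the original code: TileShape and copy_tile are hypothetical names, the image is assumed to be row-major with width global_w, and the tile is assumed to lie fully inside the image (no border handling).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the kernel's row-by-row tile copy.
// All names here are illustrative; none come from the OpenCL API.
struct TileShape {
    int local_w;  // plays the role of get_local_size(0)
    int local_h;  // plays the role of get_local_size(1)
    int offset;   // border width on each side of the tile
};

// Copies the tile (including its border) for work-group (group_x, group_y)
// out of a row-major image of width global_w.
// Assumption: the tile lies fully inside the image.
std::vector<float> copy_tile(const std::vector<float>& data_in, int global_w,
                             TileShape s, int group_x, int group_y) {
    const int tile_w = s.offset * 2 + s.local_w;
    const int tile_h = s.offset * 2 + s.local_h;
    std::vector<float> tile(static_cast<std::size_t>(tile_w) * tile_h);
    for (int i = 0; i < tile_h; ++i) {
        // Same source index as in the kernel's async_work_group_copy call.
        const int src = (group_y * s.local_h - s.offset + i) * global_w
                      + group_x * s.local_w - s.offset;
        for (int j = 0; j < tile_w; ++j) {
            tile[static_cast<std::size_t>(i) * tile_w + j] = data_in[src + j];
        }
    }
    return tile;
}
```

Comparing the output of such a reference with the contents of the local buffer (copied back to a debug buffer) is one way to catch off-by-one mistakes in the row start or the row width.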


On the host (C++), the most important part:

cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY, sizeof(float) 
          * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x+2*offset) 
          * (lsize_y+2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg> kernel_using_local(cl::Kernel(*program,"using_local", &error));
cl::EnqueueArgs eargs(queue,cl::NullRange ,cl::NDRange(gsize_x, gsize_y), 
          cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);
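
Before launching, it can help to verify that the tile actually fits in local memory. Below is a minimal sketch of such a check. tile_bytes and launch_is_valid are hypothetical helpers, local_mem_limit stands for the value the device reports for CL_DEVICE_LOCAL_MEM_SIZE, and the divisibility test assumes OpenCL 1.x rules, where the global size must be a multiple of the local size.

```cpp
#include <cstddef>

// Hypothetical helper: bytes of local memory needed for one tile plus border.
// Matches the size expression passed to cl::Local in the host code above.
constexpr std::size_t tile_bytes(std::size_t lsize_x, std::size_t lsize_y,
                                 std::size_t offset) {
    return sizeof(float) * (lsize_x + 2 * offset) * (lsize_y + 2 * offset);
}

// Hypothetical pre-launch check.
// local_mem_limit: the value queried via CL_DEVICE_LOCAL_MEM_SIZE.
constexpr bool launch_is_valid(std::size_t gsize_x, std::size_t gsize_y,
                               std::size_t lsize_x, std::size_t lsize_y,
                               std::size_t offset,
                               std::size_t local_mem_limit) {
    return gsize_x % lsize_x == 0 && gsize_y % lsize_y == 0
        && tile_bytes(lsize_x, lsize_y, offset) <= local_mem_limit;
}
```

For example, a 16x16 work-group with offset 2 needs sizeof(float) * 20 * 20 bytes, which is far below a typical limit, but the requirement grows quickly with larger offsets.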

This should work. Some prefer to do the local-memory initialisation inside the kernel, but I prefer not to do this JIT.

This code might not work optimally if you have special tricks for handling the outer border. If you see any improvements, please share them via the comments.
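
One possible way to handle the outer border is to clip each row copy to the image before issuing it. The sketch below only shows the index arithmetic and is not taken from the original code: RowCopy and clip_row are hypothetical names, and start_x is assumed to be the (possibly negative) first column of the tile row.

```cpp
#include <algorithm>

// Hypothetical description of one clipped row copy.
struct RowCopy {
    int src_x;  // first source column to read (inside the image)
    int dst_x;  // column in the local tile row where the first element goes
    int count;  // number of elements to copy; 0 if nothing is inside
};

// Clips the span [start_x, start_x + tile_w) to the image columns [0, global_w).
RowCopy clip_row(int start_x, int tile_w, int global_w) {
    const int first = std::max(start_x, 0);
    const int last = std::min(start_x + tile_w, global_w);  // one past the end
    return RowCopy{first, first - start_x, std::max(last - first, 0)};
}
```

In a kernel, the three values would become the source start, the destination start and the element count of the copy. The skipped border elements (and rows that fall entirely outside the image) would still need to be filled separately, for instance with zeros, and all work-items in the work-group still have to reach each copy call with the same arguments.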

4 thoughts on “Using async_work_group_copy() on 2D data”

  1. Aaron Boxer

    Thanks for this short but informative post.

    Two questions:

    1) how does one know how much computation to do while waiting for asynch copy to complete ?

    2) On AMD GCN device, with workgroup <= 64, do all work items need to encounter the asynch call? Because for memory barriers, they can be left out, since 64 is the size of a half wave.


    • StreamHPC

      1) As much as possible, as memory latency is very high. Depending on other ways of memory latency hiding, the effect is noticeable.

      2) All workitems in the workgroup need to encounter the call – no matter what device it is. Not sure if I understand your question, as AMD’s (full) wavefront size is 64 and it is executed in quarter wavefronts.

      • Aaron Boxer

        Thanks. Yes, you’re right of course, full wavefront size is 64 on GCN.
        I’ve found that memory barriers are not needed for work group size <= 64, since only a single CU is operating at any given time. This may change with the new arch, of course.

        For asynch copy, it makes sense that all work items need to be involved.

      • Aaron Boxer

        I did some testing on GCN Cape Verde architecture, and it doesn’t look like the async call makes any difference (actually a bit slower than regular coalesced read from global to local)

        See thread for more details.
