Using async_work_group_copy() on 2D data

When copying data from global to local memory, you often see code like the snippet below (1D data):
[raw]

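// Only one work-item per work-group performs the copy.
if (get_local_id(0) == 0) {
  for (int i = 0; i < N; i++) {
    data_local[i] = data_global[offset + i];
  }
}
// All work-items wait here until the local data is complete and visible.
barrier(CLK_LOCAL_MEM_FENCE);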

[/raw]
This can be replaced with an asynchronous copy using the function async_work_group_copy, which results in cleaner, more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C, except that the size is given as a number of elements, not bytes.

event_t async_work_group_copy(__local gentype *dst,
	const __global gentype *src,
	size_t data_size,
	event_t event);

event_t async_work_group_copy(__global gentype *dst,
	const __local gentype *src,
	size_t data_size,
	event_t event);

The Khronos registry page for async_work_group_copy describes asynchronous copies between global and local memory, and a prefetch from global memory. This makes it much easier to hide the latency of the data transfer. In the example below, the time spent in do_other_stuff() is effectively free because it overlaps with the copy, which results in faster code.

As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.

The code is executed at work-group level: every work-item in the work-group must reach the call with the same arguments, so there is no need to write code that makes sure it’s only executed by one work-item.

[raw]

// 'offset' (the halo width) is assumed to be a compile-time constant,
// for example passed as a build option (-D offset=...).
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    event_t event = 0; // 0 starts a new event; later copies are added to it
    const int dataInLocalWidth = (offset*2 + get_local_size(0));

    // Copy the tile one row at a time; every row is attached to the same event.
    for (int i=0; i < (offset*2 + get_local_size(1)); i++) {
        event = async_work_group_copy(
             &dataInLocal[i*dataInLocalWidth],
             &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                 + (get_group_id(0)*get_local_size(0)) - offset],
             dataInLocalWidth,
             event);
    }
    do_other_stuff(); // code that you can execute for free
    wait_group_events(1, &event); // waits until the copy has finished.
    use_data(dataInLocal);
}

[/raw]
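To check the index arithmetic without a device, the row-by-row copy can be emulated on the CPU. The sketch below is only an illustration: `copy_tile` and its parameter names are hypothetical and not part of the original code, and it assumes the tile, halo included, lies completely inside the image (so no border handling, just like the kernel).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of the tile copy done by one work-group (hypothetical helper).
// The parameters mirror the OpenCL built-ins used in the kernel:
//   imageWidth      ~ get_global_size(0)
//   groupX, groupY  ~ get_group_id(0), get_group_id(1)
//   lsizeX, lsizeY  ~ get_local_size(0), get_local_size(1)
// Assumption: the tile plus halo is fully inside the image.
std::vector<float> copy_tile(const std::vector<float>& image,
                             std::size_t imageWidth,
                             std::size_t groupX, std::size_t groupY,
                             std::size_t lsizeX, std::size_t lsizeY,
                             std::size_t offset)
{
    const std::size_t tileWidth  = lsizeX + 2 * offset;
    const std::size_t tileHeight = lsizeY + 2 * offset;

    // The halo must not reach past the left or top edge of the image.
    assert(groupX * lsizeX >= offset && groupY * lsizeY >= offset);
    const std::size_t firstCol = groupX * lsizeX - offset;
    const std::size_t firstRow = groupY * lsizeY - offset;

    // Nor past the right or bottom edge.
    assert(firstCol + tileWidth <= imageWidth);
    assert((firstRow + tileHeight) * imageWidth <= image.size());

    std::vector<float> tile(tileWidth * tileHeight);
    for (std::size_t i = 0; i < tileHeight; ++i) {
        // One iteration corresponds to one async_work_group_copy call.
        const float* src = &image[(firstRow + i) * imageWidth + firstCol];
        std::copy(src, src + tileWidth, &tile[i * tileWidth]);
    }
    return tile;
}
```

For an 8x8 image, a 2x2 work-group and offset 1, group (1,1) gets a 4x4 tile whose first element is the image element at row 1, column 1.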

On the host (C++), the most important part:
[raw]

// Global buffer holding the full 2D input, flattened.
cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY, sizeof(float) 
          * gsize_x * gsize_y);
// Local tile: the work-group size plus a halo of 'offset' on every side.
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x+2*offset) 
          * (lsize_y+2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
// The template arguments match the kernel's parameters (buffer, local memory).
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg> kernel_using_local(cl::Kernel(*program,"using_local", &error));
cl::EnqueueArgs eargs(queue,cl::NullRange ,cl::NDRange(gsize_x, gsize_y), 
          cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);

[/raw]
This should work. Some prefer to do the local initialisation inside the kernel, but I prefer not to do this at JIT-compile time.
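Because the local buffer is sized on the host, it is easy to check up front whether the tile fits in local memory. The following is a minimal sketch; `local_tile_bytes` and `tile_fits` are hypothetical helper names, and `localMemBytes` is assumed to be the value the device reports for CL_DEVICE_LOCAL_MEM_SIZE.

```cpp
#include <cassert>
#include <cstddef>

// Bytes of local memory needed for one tile of floats, halo included.
// Mirrors the size expression passed to cl::Local(...) in the host code.
std::size_t local_tile_bytes(std::size_t lsizeX, std::size_t lsizeY,
                             std::size_t offset)
{
    assert(lsizeX > 0 && lsizeY > 0);
    return sizeof(float) * (lsizeX + 2 * offset) * (lsizeY + 2 * offset);
}

// True if the tile fits in the given amount of local memory.
// localMemBytes is assumed to come from CL_DEVICE_LOCAL_MEM_SIZE.
bool tile_fits(std::size_t lsizeX, std::size_t lsizeY, std::size_t offset,
               std::size_t localMemBytes)
{
    return local_tile_bytes(lsizeX, lsizeY, offset) <= localMemBytes;
}
```

A 16x16 work-group with offset 2 needs 20*20 floats. A 64x64 work-group with offset 16 needs 96*96 floats, which no longer fits in 32 KiB of local memory when a float is 4 bytes.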

This code might not work optimally if you have special tricks for handling the outer border. If you see any improvements, please share them via the comments.
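One possible way to treat the outer border, sketched here on the CPU side, is to clamp each tile row to the part that lies inside the image and copy only that part. This is an assumption about how it could be done, not the method used in the kernel above: `RowSegment` and `clamp_row` are hypothetical names, and the local elements that fall outside the image are assumed to be filled separately (for example with zeros). As the clamped range depends only on the work-group's position, all work-items would still pass identical arguments to the copy.

```cpp
#include <algorithm>
#include <cassert>

// In-bounds part of one tile row. All values are in elements, not bytes.
struct RowSegment {
    long srcCol; // first source column that lies inside the image
    long dstCol; // matching column inside the local tile row
    long count;  // elements to copy; 0 if nothing is inside the image
};

// firstCol and row may be negative for work-groups at the left or top border.
RowSegment clamp_row(long firstCol, long row, long tileWidth,
                     long imageWidth, long imageHeight)
{
    assert(tileWidth > 0 && imageWidth > 0 && imageHeight > 0);

    // A row above or below the image contributes nothing.
    if (row < 0 || row >= imageHeight) {
        return {0, 0, 0};
    }

    // Intersect [firstCol, firstCol + tileWidth) with [0, imageWidth).
    const long begin = std::max(firstCol, 0L);
    const long end   = std::min(firstCol + tileWidth, imageWidth);
    if (end <= begin) {
        return {0, 0, 0};
    }
    return {begin, begin - firstCol, end - begin};
}
```

For a tile row that starts one column left of an 8-wide image with a tile width of 4, only three elements are copied, and they land at column 1 of the local row.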

4 thoughts on “Using async_work_group_copy() on 2D data”

Aaron Boxer 5 February 2016

Thanks for this short but informative post.

Two questions:

1) how does one know how much computation to do while waiting for asynch copy to complete ?

2) On AMD GCN device, with workgroup <= 64, do all work items need to encounter the asynch call? Because for memory barriers, they can be left out, since 64 is the size of a half wave.

Cheers,
Aaron
- StreamHPC 11 February 2016
  
  1) As much as possible, as memory latency is very high. Depending on other ways of memory latency hiding, the effect is noticeable.
  
  2) All workitems in the workgroup need to encounter the call – no matter what device it is. Not sure if I understand your question, as AMD’s (full) wavefront size is 64 and it is executed in quarter wavefronts.
  - Aaron Boxer 11 February 2016
    
    Thanks. Yes, you’re right of course, full wavefront size is 64 on GCN.
    I’ve found that memory barriers are not needed for work group size <= 64, since only a single CU is operating at any given time. This may change with the new arch, of course.
    
    For asynch copy, it makes sense that all work items need to be involved.
  - Aaron Boxer 30 July 2016
    
    I did some testing on GCN Cape Verde architecture, and it doesn’t look like the async call makes any difference (actually a bit slower than regular coalesced read from global to local)
    
    See https://community.amd.com/thread/203144 thread for more details.

