When copying data from global to local memory, you often see code like the following (for 1D data):
[raw]
if (get_local_id(0) == 0) {              // let one work-item fill the local buffer
    for (int i = 0; i < N; i++) {
        data_local[i] = data_global[offset + i];
    }
}
barrier(CLK_LOCAL_MEM_FENCE);            // make the data visible to the whole work-group
[/raw]
This can be replaced with an asynchronous copy using the built-in function async_work_group_copy, which results in cleaner and more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C and C++.
[raw]
event_t async_work_group_copy(__local gentype *dst,
                              const __global gentype *src,
                              size_t num_gentypes,   // number of gentype elements, not bytes
                              event_t event);

event_t async_work_group_copy(__global gentype *dst,
                              const __local gentype *src,
                              size_t num_gentypes,
                              event_t event);
[/raw]
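As a minimal sketch of the first variant (a hypothetical kernel copying one float per work-item), the manual loop from above becomes:
[raw]
// Minimal sketch (hypothetical kernel): copy one tile of floats, one element per
// work-item, from global to local memory. All work-items in the work-group must
// reach the call with the same arguments.
kernel void copy_tile(global const float* data_global, local float* data_local) {
    const size_t tile_start = get_group_id(0) * get_local_size(0);
    event_t e = async_work_group_copy(data_local,
                                      &data_global[tile_start],
                                      get_local_size(0),   // number of elements, not bytes
                                      0);                  // no previous event to chain onto
    wait_group_events(1, &e);
    // data_local now holds this work-group's tile
}
[/raw]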
The Khronos registry page for async_work_group_copy describes asynchronous copies between global and local memory, as well as a prefetch from global memory. This makes it much easier to hide the latency of the data transfer. In the 2D kernel further below, you effectively get the do_other_stuff() call for free, which results in faster code.
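The prefetch variant returns no event and only gives a cache hint for a single work-item, so there is nothing to wait on. A hedged sketch (hypothetical kernel and data layout) of how it could be used to overlap fetching the next row with work on the current one:
[raw]
// Sketch only: prefetch() hints the cache for this work-item; it does not change
// functional behaviour and there is no event to wait on.
kernel void with_prefetch(global const float* dataIn, global float* dataOut, int rows) {
    const size_t col = get_global_id(0);
    const size_t width = get_global_size(0);
    for (int row = 0; row < rows; row++) {
        if (row + 1 < rows) {
            prefetch(&dataIn[(row + 1) * width + col], 1);   // hint the next row's element
        }
        dataOut[row * width + col] = 2.0f * dataIn[row * width + col];   // placeholder work
    }
}
[/raw]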
As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.
The code is executed at work-group level, so there is no need to write code that makes sure it is only executed by one work-item.
[raw]
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    // 'offset' is the patch border; it must be known at compile time (e.g. via a -D build option).
    event_t event = 0;   // zero event; all copies below are chained onto it
    const int dataInLocalWidth = (offset*2 + get_local_size(0));
    // Copy the work-group's patch, row by row, from global to local memory.
    for (int i = 0; i < (offset*2 + get_local_size(1)); i++) {
        event = async_work_group_copy(
            &dataInLocal[i*dataInLocalWidth],
            &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                    + (get_group_id(0)*get_local_size(0)) - offset],
            dataInLocalWidth,
            event);
    }
    do_other_stuff();              // code that you can execute for free
    wait_group_events(1, &event);  // waits until the copies have finished
    use_data(dataInLocal);
}
[/raw]
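The kernel leaves do_other_stuff() and use_data() open. As a hedged sketch of what use_data() could look like, here is a simple box filter in which each work-item reads its (2*offset+1) x (2*offset+1) neighbourhood from the local patch; the function name and the filter itself are just placeholders:
[raw]
// Hedged sketch of a possible use_data(): a box filter over each work-item's
// neighbourhood, read from the local patch filled by the async copies above.
// 'offset' is again assumed to be a compile-time constant.
float box_filter_local(const local float* dataInLocal, const int dataInLocalWidth) {
    float sum = 0.0f;
    for (int dy = 0; dy <= 2*offset; dy++) {
        for (int dx = 0; dx <= 2*offset; dx++) {
            sum += dataInLocal[(get_local_id(1) + dy) * dataInLocalWidth
                               + (get_local_id(0) + dx)];
        }
    }
    return sum / ((2*offset + 1) * (2*offset + 1));
}
[/raw]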
On the host (C++), the most important part:
[raw]
cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY,
                     sizeof(float) * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal =
    cl::Local(sizeof(float) * (lsize_x + 2*offset) * (lsize_y + 2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0,
                         sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg>
    kernel_using_local(cl::Kernel(*program, "using_local", &error));
cl::EnqueueArgs eargs(queue, cl::NullRange,
                      cl::NDRange(gsize_x, gsize_y),
                      cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);
[/raw]
This should work. Some prefer to declare the local memory inside the kernel instead, but that requires the size to be known when the kernel is (JIT-)compiled, which I prefer to avoid.
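Note that offset is used as a compile-time constant in the kernel, but is not defined anywhere above. What I assume here (a sketch, not necessarily how you set it up) is that it is injected as a build option when compiling the program:
[raw]
// Sketch, assuming 'offset' is passed to the kernel as a -D define.
// Uses the single-argument build() overload of the C++ wrapper; pass a
// device list as well if your wrapper version requires it.
std::string buildOptions = "-D offset=" + std::to_string(offset);   // needs <string>
program->build(buildOptions.c_str());
[/raw]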
This code might not be optimal if you have special tricks for handling the outer border. If you see any improvements, please share them via the comments.