When copying data from global to local memory, you often see code like the following (for 1D data):
[raw]
if (get_local_id(0) == 0) {
    // one work-item copies the data for the whole work-group
    for (int i = 0; i < N; i++) {
        data_local[i] = data_global[offset+i];
    }
}
barrier(CLK_LOCAL_MEM_FENCE); // make sure all work-items see the copied data
[/raw]
This can be replaced with an asynchronous copy using the function async_work_group_copy, which results in cleaner and more manageable code. The function behaves like an asynchronous version of the memcpy() you know from C and C++.
| event_t async_work_group_copy ( | __local gentype *dst,        |
|                                 | const __global gentype *src, |
|                                 | size_t num_gentypes,         |
|                                 | event_t event )              |
| event_t async_work_group_copy ( | __global gentype *dst,       |
|                                 | const __local gentype *src,  |
|                                 | size_t num_gentypes,         |
|                                 | event_t event )              |
According to the Khronos registry, async_work_group_copy performs an asynchronous copy between global and local memory (a prefetch from global memory is also available). Note that the size argument is the number of elements, not bytes. This makes it much easier to hide the latency of the data transfer: in the example below, the time spent in do_other_stuff() effectively comes for free, which results in faster code.
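Before moving to the 2D case, here is a minimal 1D sketch of the basic pattern, reusing N, offset, data_global and data_local from the first snippet (do_other_stuff() stands for any work that does not depend on the copied data):
[raw]
// Every work-item in the work-group calls async_work_group_copy with the
// same arguments; the copy is then distributed over the work-group.
event_t event = async_work_group_copy(data_local,           // __local destination
                                      &data_global[offset], // __global source
                                      N,                     // number of elements, not bytes
                                      0);                    // 0 = create a new event
do_other_stuff();              // independent work, overlapped with the copy
wait_group_events(1, &event);  // all work-items wait here until the copy is done
[/raw]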
As I could not find good code snippets online, I decided to clean up and share some of my code. Below is a kernel that uses a patch of size (offset*2+1) and works on 2D data, flattened to a float array. You can use it for standard convolution-like kernels.
The copy is executed at work-group level, so there is no need to write code that makes sure it is only executed by one work-item.
[raw]
// 'offset' is assumed to be a compile-time constant, e.g. defined via "-D offset=..." build options.
kernel void using_local(const global float* dataIn, local float* dataInLocal) {
    event_t event = 0; // 0 lets the first copy create a new event
    const int dataInLocalWidth = offset*2 + get_local_size(0);
    // copy the patch row by row; all copies are merged into one event
    for (int i = 0; i < (offset*2 + get_local_size(1)); i++) {
        event = async_work_group_copy(
            &dataInLocal[i*dataInLocalWidth],
            &dataIn[(get_group_id(1)*get_local_size(1) - offset + i) * get_global_size(0)
                    + (get_group_id(0)*get_local_size(0)) - offset],
            dataInLocalWidth,
            event);
    }
    do_other_stuff();              // code that you can execute for free
    wait_group_events(1, &event);  // waits until the copy has finished
    use_data(dataInLocal);
}
[/raw]
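use_data() is just a placeholder. As a hypothetical sketch of how the local patch could be indexed (it would have to be defined before the kernel), each work-item can read its (2*offset+1) x (2*offset+1) neighbourhood around its own position – a simple box filter in this case, with the write to an output buffer left out:
[raw]
// Hypothetical sketch of use_data(): average the (2*offset+1) x (2*offset+1)
// neighbourhood of this work-item from the local patch.
void use_data(local float* dataInLocal) {
    const int dataInLocalWidth = offset*2 + get_local_size(0);
    const int lx = get_local_id(0) + offset; // this work-item's centre in the patch
    const int ly = get_local_id(1) + offset;
    float sum = 0.0f;
    for (int dy = -offset; dy <= offset; dy++) {
        for (int dx = -offset; dx <= offset; dx++) {
            sum += dataInLocal[(ly + dy)*dataInLocalWidth + (lx + dx)];
        }
    }
    float average = sum / ((2*offset + 1)*(2*offset + 1));
    (void)average; // would normally be written to a global output buffer
}
[/raw]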
On the host (C++), these are the most important parts:
[raw]
cl::Buffer cl_dataIn(*context, CL_MEM_READ_ONLY|CL_MEM_HOST_WRITE_ONLY,
                     sizeof(float) * gsize_x * gsize_y);
cl::LocalSpaceArg cl_dataInLocal = cl::Local(sizeof(float) * (lsize_x + 2*offset)
                                             * (lsize_y + 2*offset));
queue.enqueueWriteBuffer(cl_dataIn, CL_TRUE, 0, sizeof(float) * gsize_x * gsize_y, dataIn);
cl::make_kernel<cl::Buffer, cl::LocalSpaceArg>
    kernel_using_local(cl::Kernel(*program, "using_local", &error));
cl::EnqueueArgs eargs(queue, cl::NullRange, cl::NDRange(gsize_x, gsize_y),
                      cl::NDRange(lsize_x, lsize_y));
kernel_using_local(eargs, cl_dataIn, cl_dataInLocal);
[/raw]
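One thing the snippets above do not show is where offset comes from: the kernel relies on it being known at compile time. My assumption is that it is passed as a build option, for example as below (kernelSource and devices are placeholders for your own setup):
[raw]
// Hypothetical: define 'offset' when compiling the kernel (offset = 2 gives a 5x5 patch).
cl::Program program(*context, kernelSource, false, &error);
program.build(devices, "-D offset=2");
[/raw]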
This should work. Some prefer to declare the local memory inside the kernel, but I prefer to set its size from the host instead of fixing it at (JIT) compile time.
This code might not be optimal if you have special tricks for handling the outer border. If you see any improvements, please share them via the comments.

At Intel they have CPUs (Xeon, Ivy Bridge), GPUs (Iris) and accelerators (Xeon Phi). OpenCL enables each of these processors to be used to the fullest, and Intel now promotes it as such. Watch the video below to see their view on why OpenCL makes a difference for Intel's customers.

On 15 – 17 April 2014, a 3-day workshop around HPC is organised. It is free and focuses on bringing industry and academia together.

OpenCL SPIR (Standard Portable Intermediate Representation) is an intermediate representation for OpenCL code, comparable to LLVM IR and HSAIL. It is part of the search for a representation that lets parallel software run well on all kinds of accelerators. LLVM IR itself is too general for this purpose, so SPIR is defined as a subset of it. I will discuss HSAIL later and where it differs from SPIR – SPIR seemed the better place to start introducing these. In my next article I would like to give you an overview of the whole ecosystem around OpenCL (including SPIR and HSAIL), so you get an understanding of what it all means, where we are going, and why.

