OpenCL 1.1 changes compared to 1.0

This blog entry is for you if you don’t want to read the whole new specification [PDF] for OpenCL 1.1, but just want an overview of the most important differences with 1.0.

The news-release sums up the changes for 1.1 like this:

New datatypes including 3-component vectors and additional formats
Handling commands from multiple hosts and processing buffers across multiple devices
Operations on regions of a buffer including read, write and copy of 1D, 2D and 3D rectangular regions
Enhanced use of events to drive and control command execution
Additional OpenCL C built-in functions such as integer clamp, shuffle and asynchronous strided copies
Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

Furthermore, we can read that the update is completely backwards-compatible with version 1.0. The obvious macros CL_VERSION_1_0 and CL_VERSION_1_1 have been added to handle the versioning, but what else? This blog post discusses most of the changes, with some subjective opinions added.

Additional Formats

3-component vectors

We only had 2-, 4-, 8- or 16-component vectors, but not 3, which was actually somewhat strange. In OpenCL 1.1, the functions vload3, vload_half3, vloada_half3 and vstore3, vstore_half3, vstorea_half3 have been added to the family. Watch out: for the aligned half functions (vloada_half3 and vstorea_half3) the offset is calculated differently than for the even-sized vectors, namely in steps of 4 elements instead of 3. In version 1.0 you could have chosen a 4-component vector when doing a lot of calculations, or a struct. Judging by the new function vec_step described below, a 3-component vector is not more memory-efficient than a 4-component vector.
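The offset remark can be made concrete with a small host-side model in plain C. This is only a sketch of the addressing rule as I read it in the 1.1 specification, not OpenCL code: the packed loads step through memory 3 elements at a time, while the aligned half variant steps 4 elements at a time. The names float3_model, model_vload3, model_packed_index and model_aligned_index are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical host-side stand-in for an OpenCL float3 (not the real type). */
typedef struct { float x, y, z; } float3_model;

/* Models vload3(offset, p): three consecutive elements starting at p + offset * 3. */
static float3_model model_vload3(size_t offset, const float *p)
{
    const float *q = p + offset * 3;
    float3_model v = { q[0], q[1], q[2] };
    return v;
}

/* Element index at which vector number `offset` starts for the packed
   functions (vloadn, vload_halfn): the stride is n, also for n == 3. */
static size_t model_packed_index(size_t offset, size_t n)
{
    return offset * n;
}

/* Same for the aligned half functions (vloada_halfn): the stride is n,
   except that a 3-component vector occupies a 4-element slot. */
static size_t model_aligned_index(size_t offset, size_t n)
{
    return offset * (n == 3 ? 4 : n);
}
```

In this model the third packed 3-component vector starts at element 6, while the third aligned one starts at element 8, the same place as the third 4-component vector.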

RGB with Padding

We have support for CL_RGB, CL_RGBA (= RGB with an alpha channel) and now also CL_RGBx (with a padding channel). The same padded variants exist for CL_R and CL_RG: CL_Rx and CL_RGx. Good for graphics programmers, or for easier reading of 32 bpp BMPs.

Cloud Computing / Multi-user Environments

The support for different hosts gives possibilities for cloud computing. Side note: cloud computing is another word for multi-user environments, with some promotion for big data centres. All API functions except clSetKernelArg are now thread-safe; clSetKernelArg itself is only safe as long as concurrent calls operate on different kernel objects. See appendix A.2 for more information. The important part is that you think clearly about how to design your software, now that you have to assume others can take your resources too. OpenCL already needed a lot of planning when claiming and releasing resources, so you have probably mastered that already; now just check more often how many resources are available.

Region-specific Operations

Regions make it possible to split a big buffer to form a queue, without having to keep track of dimensions and offsets during operations run from the host. See clCreateSubBuffer for more information. A real convenience, but watch out when writing to overlapping sub-buffers. The functions clEnqueueCopyBufferRect, clEnqueueReadBufferRect and clEnqueueWriteBufferRect enqueue commands to copy, read from or write to a rectangular region.
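To get a feeling for what the Rect functions address, here is a host-side model in plain C. It is a sketch, assuming the addressing rule from the specification: a region starts at byte origin[2] * slice_pitch + origin[1] * row_pitch + origin[0], where origin[0] is in bytes, origin[1] in rows and origin[2] in slices. The function names model_rect_offset and model_read_rect are hypothetical; the real calls take a command queue, events and separate host pitches, all of which are left out here.

```c
#include <stddef.h>
#include <string.h>

/* Byte offset of the start of a rectangular region in a linear buffer:
   origin[2] * slice_pitch + origin[1] * row_pitch + origin[0]. */
static size_t model_rect_offset(const size_t origin[3],
                                size_t row_pitch, size_t slice_pitch)
{
    return origin[2] * slice_pitch + origin[1] * row_pitch + origin[0];
}

/* Hypothetical model of a rect read: copies region[0] bytes per row,
   region[1] rows and region[2] slices from src into a tightly packed dst. */
static void model_read_rect(const unsigned char *src,
                            const size_t origin[3], const size_t region[3],
                            size_t row_pitch, size_t slice_pitch,
                            unsigned char *dst)
{
    size_t out = 0;
    for (size_t z = 0; z < region[2]; ++z) {
        for (size_t y = 0; y < region[1]; ++y) {
            const size_t row[3] = { origin[0], origin[1] + y, origin[2] + z };
            memcpy(dst + out,
                   src + model_rect_offset(row, row_pitch, slice_pitch),
                   region[0]);
            out += region[0];
        }
    }
}
```

For a 4x4 byte buffer holding 0 to 15 with a row pitch of 4, reading a 2x2 region at origin (1, 1, 0) yields the bytes 5, 6, 9 and 10.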

Enhanced Control-room

My favourite description of the host is “the control-room”, since you are not over there on the device but in Houston. The more control and information, the better. The new functions are clSetMemObjectDestructorCallback, clCreateUserEvent, clSetUserEventStatus and clSetEventCallback. The first registers a callback that lets you know when resources are freed, so you can keep track. User events can be put in the event_wait_list of various functions just like the built-in events; the command will start when all events are CL_COMPLETE. With clSetEventCallback, immediate actions on events can be programmed; combined with the user events, the programmer gets some powerful tools. See the example at clSetUserEventStatus for how to use user events.

OpenGL events and Direct3D support

The function clCreateEventFromGLsyncKHR links a CL-event to a GL-event by just giving the name of the OpenGL-event. See gl_sharing for more info.

OpenCL now has support for Direct3D 10, which is great! This might also be a good step to make DirectCompute lighter. See cl_khr_d3d10_sharing for more info. Welcome DirectX developers! One favour: please be aware that DirectX works on Windows only, not on Apple OS X or iOS, (Embedded) Linux or Symbian. If you use clean calls, it will be easier to port to other platforms.

Other New Kernel-functions

The following new functions were added to the kernel:

get_global_offset: returns the global offset that was passed as global_work_offset to clEnqueueNDRangeKernel when the kernel was enqueued.
minmag and maxmag: return the argument with the minimum or maximum distance to zero, and fall back to fmin and fmax if the distances are equal or an argument is NaN. Example: maxmag(-5, 3) = -5, minmag(-3, 3) = -3.
clamp: returns the nearest boundary value if the given number is not between the boundaries.
vec_step: returns the number of elements in a scalar or a vector. A scalar returns 1, a vector 2, 4, 8 or 16. If the size is 3, the function returns 4.
shuffle and shuffle2: shuffles one or two vectors given another vector with the indices of the new order. Indeed plain old permutations.
async_work_group_strided_copy: copies strided data between global and local memory on the device. When used correctly, this can overcome some of the hassle when you need to work on global memory objects, but need more speed. Correct usage is described in the reference.
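The behaviour of a few of the functions in the list above can be sketched in plain host-side C. These are models of the semantics as described here, not the OpenCL built-ins themselves, and every name starting with model_ is hypothetical. The shuffle model is limited to 4 components and assumes that only the two lowest bits of each mask element are used.

```c
/* Helpers written out by hand so the sketch needs no math library. */
static float model_abs(float v) { return v < 0.0f ? -v : v; }

/* Like fmax/fmin: if one argument is NaN, return the other one. */
static float model_fmax(float x, float y)
{
    if (x != x) return y;
    if (y != y) return x;
    return x > y ? x : y;
}

static float model_fmin(float x, float y)
{
    if (x != x) return y;
    if (y != y) return x;
    return x < y ? x : y;
}

/* maxmag: the argument furthest from zero, otherwise fmax. */
static float model_maxmag(float x, float y)
{
    if (model_abs(x) > model_abs(y)) return x;
    if (model_abs(y) > model_abs(x)) return y;
    return model_fmax(x, y);
}

/* minmag: the argument closest to zero, otherwise fmin. */
static float model_minmag(float x, float y)
{
    if (model_abs(x) < model_abs(y)) return x;
    if (model_abs(y) < model_abs(x)) return y;
    return model_fmin(x, y);
}

/* Integer clamp: the nearest boundary if x lies outside [lo, hi]. */
static int model_clamp(int x, int lo, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* shuffle on a 4-component vector: out[i] = in[mask[i]]. */
static void model_shuffle4(const float in[4], const unsigned mask[4],
                           float out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = in[mask[i] & 3u];
}
```

With these models, a mask of (3, 2, 1, 0) reverses a 4-component vector, and the maxmag and minmag examples from the list give -5 and -3.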

The functions min and max now also work component-wise with a vector as first argument and a scalar as second. For example, min((int4)(2, 4, 6, 8), 5) will give (2, 4, 5, 5).
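Written out as a loop in plain C, the vector-with-scalar form does nothing more than compare every component with the same scalar. This is a host-side sketch with the hypothetical name model_min4_scalar, not the built-in itself.

```c
/* Component-wise min of a 4-component vector x and a scalar y. */
static void model_min4_scalar(const int x[4], int y, int out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = x[i] < y ? x[i] : y;
}
```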

Conclusion

While the many revisions of OpenCL 1.0 were really minor and not a lot of attention was paid to them, 1.1 is a big step forward. If you see what has been done for multi-user environments, NVIDIA and AMD have a lot of work to do on their drivers.

You can read in revision 33 that there has been some heated discussion and that there was pressure on the decision:

>>Should this extension be KHR or EXT?
PROPOSED: KHR. If this extension is to be approved by Khronos then it should be KHR, otherwise EXT. Not all platforms can support this extension, but that is also true of OpenGL interop.
RESOLVED: KHR.<<

The part “not all platforms” is very politically worded, since exactly one platform supports this specific extension. I have seen too many of these pressured discussions and I hope Khronos is stronger than e.g. ISO, and that OpenCL will remain as open as OpenGL.

I’m very happy with the new version, since there is more control with loads of extra events, multiple hosts are now possible, and the forgotten 3-component vector was added. Now let me know in the comments what you think of the new version.

By the way, not all is new. Deprecated are clSetCommandQueueProperty and the __ROUNDING_MODE__ macro.

StreamHPC communications