NVIDIA enables OpenCL 2.0 beta-support

In the release notes for the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they have publicly spoken about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions were never referenced in the release notes and were therefore officially unofficial.

For context: NVIDIA only started supporting OpenCL 1.2 on 3 April 2015, on their GPUs based on Kepler and newer architectures. By then OpenCL 2.0 had already been available for one and a half years (since November 2013), which is now more than three years ago.

Does this mean that you will soon be able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

OpenCL 2.0 support level on NVIDIA GPUs

The release notes state what currently works on Windows with the 378.66 graphics drivers:

New features in OpenCL 2.0 are available in the driver for evaluation purposes only. The following are the features as well as a description of known issues in the driver:

  • Device side enqueue
    • The current implementation is limited to 64-bit platforms only.
    • OpenCL 2.0 allows kernels to be enqueued with global_work_size larger than the compute capability of the NVIDIA GPU. The current implementation supports only combinations of global_work_size and local_work_size that are within the compute capability of the NVIDIA GPU.
      The maximum supported CUDA grid and block size of NVIDIA GPUs is available at http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#computecapabilities. For a given grid dimension, the global_work_size can be determined by CUDA grid size x CUDA block size.
    • For executing kernels (whether from the host or the device), OpenCL 2.0 supports non-uniform ND-ranges where global_work_size does not need to be divisible by the local_work_size. This capability is not yet supported in the NVIDIA driver, and therefore not supported for device side kernel enqueues.
  • Shared virtual memory
    • The current implementation of shared virtual memory is limited to 64-bit platforms only.
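
The two work-size restrictions quoted above can be checked on the host before a kernel is enqueued. The following is a minimal sketch in plain C, not code from NVIDIA: the helper names round_up_global and fits_device_limits are hypothetical, and the limits passed in (maximum grid size and maximum block size for one dimension) are assumptions that have to be looked up for your GPU in the compute-capability table linked above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round global_work_size up to the next multiple of local_work_size,
 * so the ND-range stays uniform. The kernel then needs a bounds check
 * for the padding work-items (hypothetical helper, one dimension). */
static size_t round_up_global(size_t global, size_t local)
{
    assert(local > 0);
    size_t rem = global % local;
    return rem == 0 ? global : global + (local - rem);
}

/* Check one dimension against the rule from the release notes:
 * global_work_size <= CUDA grid size x CUDA block size, where
 * local_work_size plays the role of the CUDA block size.
 * max_grid and max_block are caller-supplied assumptions. */
static int fits_device_limits(size_t global, size_t local,
                              uint64_t max_grid, uint64_t max_block)
{
    if (local == 0 || local > max_block)
        return 0;                       /* block too large (or empty) */
    if (global % local != 0)
        return 0;                       /* non-uniform ND-range */
    uint64_t groups = (uint64_t)global / local;
    return groups <= max_grid;          /* number of work-groups fits the grid */
}
```

For example, round_up_global(1000, 64) gives 1024; the kernel then has to skip the 24 padding work-items itself, for instance by returning early when get_global_id(0) is not below the real element count.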

Note that the driver itself still makes no mention of OpenCL 2.0: it simply reports being an OpenCL 1.2 device. However, when you analyse the files included in the drivers, you will find that they contain various OpenCL 2.0 functions such as clCreatePipe, clSVMAlloc and others. And then you can try to run these functions…
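
Because the driver reports 1.2, a program that only looks at the version string will conclude that no 2.0 feature is available. As a rough sketch of such a check: the OpenCL specification defines the CL_DEVICE_VERSION string as "OpenCL", a space, major.minor, a space and then vendor-specific information. The function names parse_cl_version and reports_at_least below are hypothetical, and the sample strings in the usage note are illustrative only.

```c
#include <assert.h>
#include <stdio.h>

/* Parse "OpenCL <major>.<minor> <vendor info>".
 * Returns 1 if both numbers were read, 0 otherwise. */
static int parse_cl_version(const char *s, int *major, int *minor)
{
    assert(s != NULL && major != NULL && minor != NULL);
    return sscanf(s, "OpenCL %d.%d", major, minor) == 2;
}

/* Returns 1 if the reported version is at least want_major.want_minor. */
static int reports_at_least(const char *s, int want_major, int want_minor)
{
    int major = 0, minor = 0;
    if (!parse_cl_version(s, &major, &minor))
        return 0;                       /* unparsable: treat as unsupported */
    return major > want_major ||
           (major == want_major && minor >= want_minor);
}
```

In a real host program the string would come from clGetDeviceInfo with CL_DEVICE_VERSION. With a string like "OpenCL 1.2 CUDA", reports_at_least(s, 2, 0) returns 0, so on this driver the 2.0 features have to be probed individually instead of being inferred from the version.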

So we ran our own tests on a GTX 1080 to see what is supported and what is not. Here’s a detailed overview of what works:

| OpenCL 2.0 feature | Supported | Notes |
|---|---|---|
| SVM | Yes | Only coarse-grained SVM is supported. Fine-grained SVM (optional feature) is not. |
| Device side enqueue | Partially | Simple OpenCL programs with a device side queue work. We encounter compilation errors in more complex kernels with: enqueue_marker(), barrier(), multiple event objects. Further investigation is required. |
| Work-group functions | Yes | |
| Pipes | No | clCreatePipe and other *Pipe functions are defined in OpenCL64.dll, but using them causes run-time errors. |
| Generic address space | Yes | |
| Non-uniform work-groups | No | The release notes mention that it is not supported. |
| C11 Atomics | Partially | Using atomic_flag_* functions causes a CL_BUILD_ERROR error. |
| Subgroups extension | No | Not mentioned in the release notes. |

Apart from what’s listed in the above table, NVIDIA’s latest Windows drivers also support the following host-side functions: clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties().

Read more about OpenCL in this blog article. Also very handy: the OpenCL 2.0 reference card from Khronos.

Write OpenCL that works on the three big GPUs

The big news is that we can finally write OpenCL 2.0 code and still support recent GPUs from the big three: NVIDIA, AMD and Intel. At least to a certain degree: this holds if you only need coarse-grained SVM, simple device side enqueue, the new work-group functions, the generic address space and C11 atomics (excluding the atomic flag functions). This means that the biggest additions are there. Now let’s hope fine-grained SVM and subgroups will follow soon.
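
One way to keep track of this portable subset in host code is a simple bit mask that a program checks at start-up against the features it needs. This is only a sketch based on our test results above; the enum, the PORTABLE_SUBSET constant and within_portable_subset are hypothetical names, not part of any OpenCL header.

```c
#include <assert.h>

/* Hypothetical feature flags, one bit per OpenCL 2.0 feature we tested. */
enum cl20_feature {
    CL20_SVM_COARSE          = 1u << 0,
    CL20_SVM_FINE            = 1u << 1,
    CL20_DEVICE_ENQUEUE      = 1u << 2,  /* simple cases only */
    CL20_WORKGROUP_FUNCTIONS = 1u << 3,
    CL20_PIPES               = 1u << 4,
    CL20_GENERIC_ADDRESS     = 1u << 5,
    CL20_NONUNIFORM_WG       = 1u << 6,
    CL20_C11_ATOMICS         = 1u << 7,  /* excluding atomic_flag_* */
    CL20_ATOMIC_FLAG         = 1u << 8,
    CL20_SUBGROUPS           = 1u << 9
};

/* The subset that, per our tests, also works on the NVIDIA beta driver. */
static const unsigned PORTABLE_SUBSET =
    CL20_SVM_COARSE | CL20_DEVICE_ENQUEUE | CL20_WORKGROUP_FUNCTIONS |
    CL20_GENERIC_ADDRESS | CL20_C11_ATOMICS;

/* Returns 1 if every needed feature lies inside the portable subset. */
static int within_portable_subset(unsigned needed)
{
    return (needed & ~PORTABLE_SUBSET) == 0;
}
```

A program that needs coarse-grained SVM and the generic address space passes the check; one that also needs pipes does not and should fall back to an OpenCL 1.2 code path on NVIDIA.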

It will take some time to test out what is practical and what is not. We therefore ask you to help out. The SDKs of Intel and AMD have various 2.0 samples: convert these to run on the CPU using AMD’s or Intel’s 2.0 driver (as a sanity check), then try to port them to NVIDIA and share your results in the comments.

We have good hope that NVIDIA will go for full OpenCL 2.0 support, so that all hardware functionality that can currently be programmed via CUDA also becomes accessible in OpenCL.