In the series “Basic Concepts” various basics of GPGPU and OpenCL are discussed. This time we go into a typical one: an error that does not point to the actual problem. It is therefore good to have an overview of all errors with their descriptions.
When you get an out-of-resources error, or a crash when using clEnqueueReadBuffer, you are more or less left in the dark. What does it mean? And how can you solve it?
Typical: one driver crashes/segfaults and another one gives this error.
Officially the error is defined as:
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
Which means that there can be more reasons than the device being out of resources. A better name would have been CL_RESOURCE_ALLOCATION_ERROR. It can be thrown by various functions, but we focus on this one function. It cannot be thrown by clEnqueueWriteBuffer, as that depends on the limits of the host.
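To make the definition a bit more practical, here is a minimal sketch of a lookup that turns the status codes most relevant to clEnqueueReadBuffer into a name plus a short description. The function name cl_error_hint is made up for this example. The numeric values are the ones from the Khronos cl.h header and are only defined here when that header has not been included. The list is deliberately incomplete.

```c
#include <stddef.h>

/* Values as found in the Khronos cl.h header. They are repeated here only
   so this sketch compiles without an OpenCL SDK installed. */
#ifndef CL_SUCCESS
#define CL_SUCCESS                        0
#define CL_MEM_OBJECT_ALLOCATION_FAILURE (-4)
#define CL_OUT_OF_RESOURCES              (-5)
#define CL_OUT_OF_HOST_MEMORY            (-6)
#define CL_INVALID_VALUE                 (-30)
#define CL_INVALID_COMMAND_QUEUE         (-36)
#define CL_INVALID_MEM_OBJECT            (-38)
#endif

/* Hypothetical helper: map a status code to "NAME: short description".
   Only a handful of codes are covered; everything else gets a fallback. */
const char *cl_error_hint(int err)
{
    switch (err) {
    case CL_SUCCESS:
        return "CL_SUCCESS: no error";
    case CL_MEM_OBJECT_ALLOCATION_FAILURE:
        return "CL_MEM_OBJECT_ALLOCATION_FAILURE: could not allocate memory for the buffer's data store";
    case CL_OUT_OF_RESOURCES:
        return "CL_OUT_OF_RESOURCES: failure to allocate resources required by the OpenCL implementation on the device";
    case CL_OUT_OF_HOST_MEMORY:
        return "CL_OUT_OF_HOST_MEMORY: failure to allocate resources required by the OpenCL implementation on the host";
    case CL_INVALID_VALUE:
        return "CL_INVALID_VALUE: (offset, size) is out of bounds or the host pointer is NULL";
    case CL_INVALID_COMMAND_QUEUE:
        return "CL_INVALID_COMMAND_QUEUE: not a valid command-queue";
    case CL_INVALID_MEM_OBJECT:
        return "CL_INVALID_MEM_OBJECT: not a valid buffer object";
    default:
        return "unknown or unlisted OpenCL error code";
    }
}
```

Printing cl_error_hint(err) next to the raw number saves a trip to the header each time a call fails.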
Finding out the cause
The oldest trick of ’em all: try to use the CPU and check what the error is then. CPUs are great for detecting data races (correct on CPU, not on GPU), and CPUs are a bit more stable when you have buggy code, plus they have more RAM. Be sure to install both Intel’s and AMD’s drivers.
Calling clFinish after each OpenCL call helps you pinpoint the line where it actually goes wrong, or gets you an error instead of a crash.
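One way to apply the clFinish trick without cluttering the code is a small checking macro. The sketch below is an assumption about how you might set it up, not part of the OpenCL API: cl_report and CL_CHECK are made-up names. It works on plain integer status codes (CL_SUCCESS is 0), so it compiles without the OpenCL headers; the real calls are shown in the comment.

```c
#include <stdio.h>

/* Hypothetical helper: print the failing expression with its source
   location and hand the status back unchanged. */
int cl_report(int status, const char *expr, const char *file, int line)
{
    if (status != 0) { /* CL_SUCCESS is defined as 0 */
        fprintf(stderr, "%s:%d: %s failed with error %d\n",
                file, line, expr, status);
    }
    return status;
}

/* Wrap any call that returns a status code. */
#define CL_CHECK(expr) cl_report((expr), #expr, __FILE__, __LINE__)

/* Usage while debugging (needs <CL/cl.h>; queue, d_result, h_result and
 * nbytes are placeholders for your own variables):
 *
 *   CL_CHECK(clEnqueueReadBuffer(queue, d_result, CL_TRUE, 0, nbytes,
 *                                h_result, 0, NULL, NULL));
 *   CL_CHECK(clFinish(queue));   // force the error to surface here
 */
```

Because clFinish blocks until all queued commands are done, an error from an earlier kernel or copy shows up at the first CL_CHECK(clFinish(queue)) after it, instead of at some later, unrelated call. Remove the extra clFinish calls again once the bug is found, as they cost performance.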
Then you have the following options:
- 9 out of 10 times you have a pointer problem at the host or are writing out of bounds. So you try to write to an illegal memory location, or try to cram a 35×35 float array into the space of a 10×10×10 float array (buffer overflow). Double-check the host memory sizes, and whether the host pointers are correct.
- You read out of bounds on the device. Double-check the used memory-sizes.
- You might have hit a limit of the driver, such as the 5s timeout if the NVIDIA card is also being used as a display. Rule out that you have used up all memory by using both smaller and larger(!) objects. Also note down memory object sizes over time. Be sure you clean up unused objects. Fragmentation of device memory can also be the reason it eventually goes wrong.
The last one I have not encountered myself, but found on the NVIDIA forums. I recently had this error (type 1), because the code I was working on had no clear naming. When I introduced the standard ‘h_‘ and ‘d_‘ prefixes for all variables, I immediately found the cause.
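The first type of problem can often be caught before the call is even made. Below is a minimal sketch of such a guard, using the sizes from the example in the list. The function name transfer_fits is made up for this illustration, and the h_ and d_ names in the comment are placeholders.

```c
#include <stddef.h>

/* Returns 1 when a transfer of nbytes starting at offset stays inside a
   buffer of buffer_bytes, and 0 otherwise. The comparison is written so
   that offset + nbytes is never computed and cannot wrap around. */
int transfer_fits(size_t offset, size_t nbytes, size_t buffer_bytes)
{
    return offset <= buffer_bytes && nbytes <= buffer_bytes - offset;
}

/* Example with explicit host/device naming:
 *
 *   size_t h_image_bytes  = 35 * 35 * sizeof(float);       // 1225 floats
 *   size_t d_volume_bytes = 10 * 10 * 10 * sizeof(float);  // 1000 floats
 *
 *   if (!transfer_fits(0, h_image_bytes, d_volume_bytes)) {
 *       // do not enqueue the copy: it would overflow the device buffer
 *   }
 */
```

Keeping the byte count of every h_ and d_ object in a variable next to it makes this check a one-liner in front of each read or write.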
I hope this has helped you understand the resource allocation error. If you have found other causes, please share them via the comments and I’ll add them. If you have requests for topics to discuss in this series, let me know via Twitter or the comments.