
In the series “Basic Concepts” various basics of GPGPU and OpenCL are discussed. This time we go into a typical one: when an error does not imply the actual problem. It is therefore good to have an overview of all errors with their descriptions.
When you get an out-of-resources error or when you get a crash when using clEnqueReadBuffer, you are sort of left in the dark. What does it mean? And how can you solve it?
Typical: one driver crashes/segfaults and another one gives this error.
Officially the error is defined as:
CL_OUT_OF_RESOURCES if there is a failure to allocate resources required by the OpenCL implementation on the device.
Which means that there can more reasons than the device being out of resources. A better name would have been CL_RESOURCE_ALLOCATION_ERROR. It can be thrown by various functions, but we focus on this one function. It cannot by thrown by clEnqueWriteBuffer, as that depends on the limits of the host.
Finding out the cause
The oldest trick of ‘m all: try to use the CPU and check what the error is then. CPUs are great to detect data-races (correct on CPU, not on GPU) and CPUs are a bit more stable when you have buggy code plus have more RAM. Be sure to install both Intel’s and AMD’s drivers.
Calling clFinish at each line, helps you pinpoint the actual line it happens or to get an error instead of a crash.
Then you have the following options:
- 9 out of 10 times you have a pointer problem at the host or are writing out of bounds. So you try to write to an illegal memory location, or try to cram in an 35×35 float* into 10x10x10 float* space (buffer-overflow). Double check the host memory-sizes, and if the host-pointers are correct.
- You read out of bounds on the device. Double-check the used memory-sizes.
- You might have hit a limit of the driver, such as the 5s timeout if the NVidia card is also being used as a display. Rule out you have used up all memory by using both smaller and larger(!) objects. Also note down memory object sizes over time. Be sure you clean up non-used objects. Fragmentation of device-memory can also be the problem it eventually goes wrong.
The last one I have not encountered myself, but found on the Nvidia forums. I recently had this error (type 1), because I had introduced clear naming in the code I was working on. When I introduced the standard ‘h_‘ and ‘d_‘ prefixes for all variables, I immediately found the cause.
Hope it has helped you understand the resource allocation error. If you found other reasons, please share via the comments and I’ll add it. If you have requests what to discuss in this series, let me know via Twitter or the comments.
Related Posts
Basic concepts: malloc in the kernel
... one of those good questions, as it gives another view on a basic concept of OpenCL. Simply put: you cannot allocate (local or ...
Our training concepts for GPGPU
... about "how to design training-programs about difficult concepts for technical people", or better: "how to learn yourself ...
Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions
... the series Basic Concepts I try to give an alternative description to what is said everywhere ...
Basic concepts: Function Qualifiers
In the OpenCL-code, you have run-time and compile-time of the C-code. It is very important to make this clear when you talk about compile-time of the ...