Approximate computing means allowing larger errors when performing calculations. While most programmers tend to go the other way (a lower error, for instance by using doubles), this field is interesting for quite a few of us. The reason is that you get more performance, more effective bandwidth and lower power usage in return.
Neural networks do not require high precision, and approximate computing is also very useful for Big Data. Most important is that you actually consider the possibility of trading in precision when designing your OpenCL software. For example, does your window function need to be very precise, or can it tolerate rounding errors? Do you compute in iterative steps, where errors accumulate (more precision needed), or do you calculate each result relative to the starting point (less precision needed)? You can even use relatively more expensive algorithms that compensate with a smaller overall error. Here at StreamHPC we think this through as one of the main optimisation techniques.
Let’s look into what is possible in OpenCL and which hardware supports it.
Half Precision FP16
Half-precision floats give you one big advantage: half the bandwidth. This can double the throughput in many applications, as memory bandwidth is the most common bottleneck in OpenCL software. For FPGAs it also means that less area is used, which frees up area for more parallelism.
Selection of hardware that supports the cl_khr_fp16 extension:
- Qualcomm Adreno
- ARM MALI 7 (not MALI 6)
- NVidia Tegra X1 (not K1)
- NVidia Pascal architecture
- Altera FPGAs
- Xilinx FPGAs
These processors support halfs in both memory and arithmetic, meaning that “a = b + c” works for halfs natively. Note that Intel CPUs support FP16 in hardware, but the drivers do not expose it. Other hardware probably has it hidden as well.
Using clGetDeviceInfo() you can check whether your hardware supports the extension: look for “cl_khr_fp16” in CL_DEVICE_EXTENSIONS, and query CL_DEVICE_HALF_FP_CONFIG for the supported half-precision capabilities. If there is support, you can use “#pragma OPENCL EXTENSION cl_khr_fp16 : enable” in your kernel to turn the feature on, and you get full access to the scalar type ‘half’, the vector types ‘half2’ to ‘half16’ and several pragmas and constants.
You can also use halfs without the extension, but then you need to use vload_half() and vstore_half(), which convert between half values in memory and floats in the kernel. Many (if not all) GPUs have memory support for half, so you can speed up the transfers without the extension.
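One way to do this check on the host is to look for “cl_khr_fp16” as a whole word in the space-separated extension list that clGetDeviceInfo() returns for CL_DEVICE_EXTENSIONS. The sketch below shows only the string matching, in plain C, so it runs without an OpenCL installation. The helper name has_extension() is hypothetical and not part of the OpenCL API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not an OpenCL function): returns true if `name`
 * occurs as a complete, space-separated token in `ext_list`.
 * In a real host program `ext_list` would be the string filled in by
 * clGetDeviceInfo(device, CL_DEVICE_EXTENSIONS, ...). */
bool has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);

    const size_t len = strlen(name);
    if (len == 0) {
        return false;
    }

    const char *p = ext_list;
    while ((p = strstr(p, name)) != NULL) {
        /* Accept the hit only if it is bounded by spaces or string ends,
         * so a longer name that merely contains `name` does not count. */
        const bool starts_token = (p == ext_list) || (p[-1] == ' ');
        const bool ends_token   = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token) {
            return true;
        }
        p += 1;
    }
    return false;
}
```

Calling has_extension(extensions, "cl_khr_fp16") before building the kernel lets you choose between a kernel that uses the half type directly and a fallback that keeps the arithmetic in float.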
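To get a feeling for what storing data as half costs in accuracy, the following host-side sketch emulates in plain C the kind of conversion that vstore_half() and vload_half() perform. It is an illustration, not the OpenCL implementation: the function names float_to_half() and half_to_float() are hypothetical, and round-to-nearest-even is assumed as the rounding mode.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "expects 32-bit IEEE 754 float");

/* Hypothetical helper: float -> 16-bit half pattern, round to nearest even. */
uint16_t float_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    const uint32_t sign = (x >> 16) & 0x8000u;
    const uint32_t mag  = x & 0x7FFFFFFFu;   /* bits without the sign */
    const uint32_t ef   = mag >> 23;          /* biased float exponent */

    if (mag > 0x7F800000u)  return (uint16_t)(sign | 0x7E00u); /* NaN */
    if (mag >= 0x47800000u) return (uint16_t)(sign | 0x7C00u); /* Inf or >= 65536 */
    if (mag < 0x33000000u)  return (uint16_t)sign;             /* below 2^-25: zero */

    uint32_t h, rem, halfway;
    if (ef >= 113) {                          /* result is a normal half */
        h       = ((ef - 112) << 10) | ((mag & 0x7FFFFFu) >> 13);
        rem     = mag & 0x1FFFu;              /* the 13 dropped mantissa bits */
        halfway = 0x1000u;
    } else {                                  /* result is a subnormal half */
        const uint32_t m     = (mag & 0x7FFFFFu) | 0x800000u;
        const uint32_t shift = 126 - ef;      /* 14..24 */
        h       = m >> shift;
        rem     = m & ((1u << shift) - 1u);
        halfway = 1u << (shift - 1);
    }
    /* Round to nearest, ties to even; a carry may move into the exponent. */
    if (rem > halfway || (rem == halfway && (h & 1u))) {
        h++;
    }
    return (uint16_t)(sign | h);
}

/* Hypothetical helper: 16-bit half pattern -> float (always exact). */
float half_to_float(uint16_t h)
{
    const uint32_t sign = ((uint32_t)h & 0x8000u) << 16;
    const uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man        = h & 0x3FFu;
    uint32_t x;

    if (exp == 0x1Fu) {                       /* Inf or NaN */
        x = sign | 0x7F800000u | (man << 13);
    } else if (exp != 0) {                    /* normal */
        x = sign | ((exp + 112u) << 23) | (man << 13);
    } else if (man == 0) {                    /* zero */
        x = sign;
    } else {                                  /* subnormal: normalise first */
        uint32_t e = 113;
        while ((man & 0x400u) == 0) {
            man <<= 1;
            e--;
        }
        x = sign | (e << 23) | ((man & 0x3FFu) << 13);
    }

    float f;
    memcpy(&f, &x, sizeof f);
    return f;
}
```

With these helpers, 0.1f comes back as 0.0999755859375 and 2049.0f comes back as 2048.0f, while a buffer of uint16_t values takes half the bytes of the same buffer in float. Whether that error is acceptable is exactly the trade-off to decide per data set.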
Relaxed math and native functions
The compiler option “-cl-fast-relaxed-math” allows optimisations for floating-point arithmetic that may violate the IEEE 754 standard. You can also use the native functions to do this selectively. Good examples are the trigonometric functions native_sin(), native_cos() and native_tan() on GPUs, where the difference in speed is immense.
The result is that functions take (much) fewer cycles than the IEEE 754 correct ones. When compute is the main bottleneck, this can help out. All hardware has at least a few native functions, but there is no overview of support per hardware, so you just have to try them out.
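As an illustration of the selective approach, the sketch below holds a small kernel that computes a Hann window with native_cos() instead of cos(), together with the option string for relaxing the whole program. The kernel name hann_window and the helpers build_options() and kernel_source_length() are made up for this example, and the kernel assumes n is larger than 1. The source is kept in a C string so that the fragment compiles without an OpenCL installation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* OpenCL C source of a hypothetical kernel. Only the cosine is relaxed:
 * native_cos() is faster, but its accuracy is implementation-defined. */
static const char kernel_source[] =
    "__kernel void hann_window(__global float *w, const uint n)\n"
    "{\n"
    "    const uint i = get_global_id(0);\n"
    "    if (i >= n) return;\n"
    "    const float x = 2.0f * M_PI_F * (float)i / (float)(n - 1);\n"
    "    w[i] = 0.5f - 0.5f * native_cos(x);\n"
    "}\n";

static_assert(sizeof kernel_source > 1, "kernel source must not be empty");

/* Hypothetical helper: option string to pass as the `options` argument
 * of clBuildProgram(). Relaxing the whole program is the coarse switch,
 * the native_ functions in the kernel are the fine-grained one. */
const char *build_options(bool relax_whole_program)
{
    return relax_whole_program ? "-cl-fast-relaxed-math" : "";
}

/* Hypothetical helper: source length without the terminating NUL, as it
 * could be passed in the `lengths` array of clCreateProgramWithSource(). */
size_t kernel_source_length(void)
{
    return sizeof kernel_source - 1;
}
```

In a host program the string would go to clCreateProgramWithSource() and the options to clBuildProgram(). Comparing the output against a build that uses cos() shows how large the error is on your hardware, which is the only reliable way to find out.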