
Below is a (slightly edited) repost of a blog post by David Bucciarelli (homepage, twitter) on the Luxrender forum.
I find micro-kernels an important subject, since they have clear advantages. OpenCL 2.0 offers more possibilities for creating smaller kernels. Making smaller, more focused functions is also considered good software engineering, known as “Separation of Concerns”.
For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:
When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.
OpenCL kernels in “SmallLuxGPU” (a raytracer, originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc., one of the kernels grew to the point of being a “Mega-kernel”.
The major problem with a “Mega-kernel”, aside from the inability of the AMD OpenCL compiler to compile it, is the huge register usage and the very low GPU utilization. Why this happens is well explained in the paper.
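To make the idea concrete, here is a minimal sketch of the micro-kernel pattern, written as a CPU-side simulation in plain C rather than OpenCL. It is not LuxRender's code: the state names, the `PathTask` struct and the `mk_*` functions are all hypothetical, and the "intersection" and "shading" steps are trivial stand-ins. What it shows is the structure: each small kernel handles exactly one path state and returns immediately for any other, while the host launches the kernels one after another until every path is finished.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical path states; the names are illustrative, not LuxRender's. */
typedef enum {
    STATE_GENERATE_RAY,
    STATE_INTERSECT,
    STATE_SHADE,
    STATE_DONE
} PathState;

typedef struct {
    PathState state;
    int depth;       /* bounces taken so far */
    int max_depth;   /* path ends once depth reaches this value */
    float radiance;  /* accumulated stand-in result */
} PathTask;

/* Each "micro-kernel" handles one state and exits early otherwise,
   the way a GPU work-item would skip a task that is not its turn. */
static void mk_generate_ray(PathTask *t) {
    if (t->state != STATE_GENERATE_RAY) return;
    t->depth = 0;
    t->radiance = 0.0f;
    t->state = STATE_INTERSECT;
}

static void mk_intersect(PathTask *t) {
    if (t->state != STATE_INTERSECT) return;
    /* Stand-in for ray traversal: the path "hits" until max_depth. */
    t->state = (t->depth < t->max_depth) ? STATE_SHADE : STATE_DONE;
}

static void mk_shade(PathTask *t) {
    if (t->state != STATE_SHADE) return;
    t->radiance += 1.0f;  /* stand-in for material evaluation */
    t->depth += 1;
    t->state = STATE_INTERSECT;
}

/* Host loop: run every micro-kernel over all tasks, repeat until all
   tasks are done. Returns the number of passes that were needed. */
static int run_wavefront(PathTask *tasks, size_t n) {
    assert(tasks != NULL && n > 0);
    int passes = 0;
    for (;;) {
        size_t done = 0;
        for (size_t i = 0; i < n; ++i) mk_generate_ray(&tasks[i]);
        for (size_t i = 0; i < n; ++i) mk_intersect(&tasks[i]);
        for (size_t i = 0; i < n; ++i) mk_shade(&tasks[i]);
        ++passes;
        for (size_t i = 0; i < n; ++i)
            if (tasks[i].state == STATE_DONE) ++done;
        if (done == n) return passes;
    }
}
```

On a GPU each inner `for` loop would be one kernel launch over all tasks. Because every kernel contains only the code for its own state, the compiler has to allocate registers for that code alone, which is the effect the register numbers below reflect.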
PATHOCL Micro-kernels edition, the results
The number of kernels increases from 2 to 10, the register usage decreases from 196 (!!!) to 3-84, and the GPU utilization rises from a miserable 10% to a healthier 30%-100%.
A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.
It solves the problems with AMD compiler
Micro-kernels not only improve performance but also address the major issues with the AMD OpenCL compiler. For the very first time since the release of the first AMD OpenCL SDK beta, I’m not aware of a scene not running on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:
Try it out yourself
This feature will be extended to BIASPATHOCL and will be available in LuxRender v1.5.
A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.
To run with micro-kernels, use “path.microkernels.enable=1”.
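As a sketch of where that setting might go, assuming it is placed in the render configuration (.cfg) file next to the engine selection. Only the `path.microkernels.enable` property comes from the text above; the `renderengine.type` line and the choice of file are assumptions about a typical SLG-style setup.

```
# Assumed: select the PATHOCL render engine
renderengine.type = PATHOCL
# From the post: switch PATHOCL to the micro-kernel code path
path.microkernels.enable = 1
```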









