Mega-kernel versus Micro-kernels in LuxRender (repost)

LuxRenderer demo-rendering

Below is a (slightly edited) repost of a blog by

I find micro-kernels an important subject, since they have clear advantages. OpenCL 2.0 offers more ways to create smaller kernels. Writing smaller, more focused functions is also considered good software engineering, a principle known as “Separation of Concerns”.


 

For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.
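To make the divergence argument in the abstract concrete, here is a toy cost model in Python. It is not taken from the paper or from LuxRender: the material names, their costs, and the wavefront width of 64 are made-up values for illustration. The model assumes that a wavefront whose lanes hit different materials executes every distinct material branch in turn, so it pays the sum of their costs, while paths grouped by material let each wavefront pay for (mostly) one branch.

```python
import random

# Hypothetical per-material evaluation costs (arbitrary units, illustrative only).
MATERIAL_COST = {"matte": 1, "glass": 4, "metal": 3, "layered_car_paint": 12}
WAVEFRONT_SIZE = 64  # assumed SIMT width


def wavefront_cost(materials):
    """Cost of one wavefront: every distinct branch present is executed serially."""
    return sum(MATERIAL_COST[m] for m in set(materials))


def total_cost(paths):
    """Sum the cost of consecutive wavefront-sized chunks of `paths`."""
    return sum(
        wavefront_cost(paths[i:i + WAVEFRONT_SIZE])
        for i in range(0, len(paths), WAVEFRONT_SIZE)
    )


rng = random.Random(0)
paths = [rng.choice(list(MATERIAL_COST)) for _ in range(WAVEFRONT_SIZE * 100)]

divergent = total_cost(paths)          # megakernel-style: materials mixed per wavefront
coherent = total_cost(sorted(paths))   # wavefront-style: paths grouped by material first

print(f"mixed: {divergent}, grouped: {coherent}")
```

The model deliberately ignores what grouping costs in practice (sorting or compacting the paths, and storing path state in memory between kernels), so it only shows the direction of the effect, not its size.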

OpenCL kernels in “SmallLuxGPU” (a raytracer, originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc., one of the kernels grew to the point of being a “Mega-kernel”.

The major problem with a “Mega-kernel”, aside from the inability of the AMD OpenCL compiler to compile it, is the huge register usage and the very low GPU utilization. Why this happens is well explained in the paper.
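As a rough picture of what splitting a mega-kernel into micro-kernels means, the sketch below models each path as a small state machine and gives every state its own tiny "kernel". It runs on the CPU in plain Python and is only an illustration: the stage names, the path dictionary, and the placeholder shading are hypothetical and do not reflect the actual PATHOCL kernels.

```python
from enum import Enum, auto


class State(Enum):  # hypothetical stage names, not LuxRender's
    GENERATE_RAY = auto()
    INTERSECT = auto()
    SHADE = auto()
    DONE = auto()


def k_generate_ray(path):
    path["depth"] = 0
    path["state"] = State.INTERSECT


def k_intersect(path):
    # Stand-in for ray tracing: pretend the path escapes after `max_depth` bounces.
    path["state"] = State.SHADE if path["depth"] < path["max_depth"] else State.DONE


def k_shade(path):
    path["radiance"] += 0.5 ** path["depth"]  # placeholder for material evaluation
    path["depth"] += 1
    path["state"] = State.INTERSECT


KERNELS = {
    State.GENERATE_RAY: k_generate_ray,
    State.INTERSECT: k_intersect,
    State.SHADE: k_shade,
}


def run(paths):
    """Run each small kernel over the paths currently in its state until all are done."""
    launches = 0
    while any(p["state"] is not State.DONE for p in paths):
        for state, kernel in KERNELS.items():
            batch = [p for p in paths if p["state"] is state]
            if batch:
                launches += 1
                for p in batch:  # on a GPU this loop would be one kernel launch
                    kernel(p)
    return launches


paths = [{"state": State.GENERATE_RAY, "max_depth": d, "radiance": 0.0} for d in (0, 1, 3)]
launches = run(paths)
print(launches, [p["radiance"] for p in paths])
```

Because every kernel does only one job, each one needs far fewer registers than a single kernel that contains all stages at once, at the price of more launches and of keeping the path state in memory between them.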

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decreases from 196 (!!!) to between 3 and 84 per kernel, and the GPU utilization rises from a miserable 10% to a healthier 30%-100%.
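One way to read these numbers is through a simple register-limited occupancy estimate. The sketch below assumes an AMD GCN-class GPU with a budget of 256 vector registers per SIMD lane and at most 10 resident wavefronts per SIMD. Both limits are assumptions about the hardware, not figures from the original post, and the model ignores every other limit (scalar registers, local memory, work-group size).

```python
VGPRS_PER_LANE = 256      # assumed vector-register budget per SIMD lane (GCN-class)
MAX_WAVES_PER_SIMD = 10   # assumed hardware cap on resident wavefronts per SIMD


def occupancy(registers_per_work_item):
    """Fraction of the wavefront slots usable when limited only by registers."""
    waves = min(MAX_WAVES_PER_SIMD, VGPRS_PER_LANE // registers_per_work_item)
    return waves / MAX_WAVES_PER_SIMD


for regs in (196, 84, 3):
    print(f"{regs:3d} registers -> {occupancy(regs):.0%} occupancy")
```

Under these assumptions, 196 registers leave room for a single wavefront (10%), 84 registers for three (30%), and 3 registers for all ten (100%), which lines up with the utilization range quoted above.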

Occupancy increases from 10% to 30% or more
The performance increase is huge on some platforms. On Linux with a FirePro W8100 it is about 3.6 times:
Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with the AMD compiler

Micro-kernels not only improve performance but also address the major issues with the AMD OpenCL compiler. For the very first time since the release of the first AMD OpenCL SDK beta, I’m not aware of any scene that fails to run on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and will be available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.