SSEx, AVX, FMA and other extensions through OpenCL

Felix Fernandez's "More, More, More"This discussion is about a role OpenCL could play in a diversifying processor-market.

Both AMD and Intel have added parallel instruction-sets for their CPUs to accelerate in media-operations. Each time a new instruction-set comes out, code needs to be recompiled to make use of it. But what about support for older processors, without penalties? Intel had some troubles with how to get support for their AVX-instructions, and choose for both their own Array Building Blocks and OpenCL. What I want to discuss here are the possibilities available to make these things easier. Also I want to focus on if a general solution “OpenCL for any future extensions” could hold. I make an assumption that most extensions target mostly parallelisation with media in mind, most notable embedded GPUs on upcoming hybrid processors. I talked about this subject before in “The rise of the GPGPU compiler“.

Virtual machines

Java started in 1996 with the idea that end-point optimisation should be done by compiling intermediate code to the target-platform. The idea still holds and there are many possibilities to optimise intermediate code for SSE4/5, AVX, FMA, XOP, CLMUL and any other extension. Same is of course for dotNET.

Disadvantage is the device-models that are embedded in such compilers, which have not really take specialised instructions into account. So if I have a normal loop, I’m not sure it will work great on processors launched this year. C has pragmas for message-protocols, Java needs extensions. See Neal Gafter’s discussion about concurrent loops from 2006 for a nice discussion.

Smart Compilers

With for instance LLVM and Intel’s fast compilers, a lot can be done to get code optimised for all current processors. A real danger is that too many specialised processors will arrive the coming years; how to get maximum speed at all processors? We already have 32 and 64 bit; 128 bit is really not the only direction there is. Multi-target compilers can be something we should be getting used to, for which no standard is created for yet – only Apple has packed 32 and 64 bits together.

Years ago when CPUs started to have support for the multiply-add operation, a part of the compiled code had to be specially for this type of processor – giving a bigger binary. With any new type of extension, the binary gets bigger. It has to, else the potential of your processor will not be used and sales will drop in favour of cheaper chips. To sell software with support for each new extension, it takes time – in most cases reserved only for major releases.

Because not everybody has Gentoo (A Linux-distribution which compiles each piece of software targeting the user’s computer for maximum optimisation), it takes at least a year to get full use of the processor for most software.


So where does OpenCL fit in this picture? Virtual machines are optimised for threads and platform-targeting compilers are slow in distribution. Since drivers for CPUs are part of the OS-updating system, OpenCL-support in those drivers can get the new extensions utilised soon after market-introduction. The coming year more will be done for automatic optimisation for a broad range of processor-types – more about that later. This focus from the compiler to an OpenCL-library for handling optimal kernel-launching will get an optimum somewhere in between.

The coming time we will see OpenCL is indeed a more stable solution than for instance Intel’s Array Building Blocks, seen from the light of recompiling. If OpenCL can target all kinds of parallel extensions, it will offer the demanded flexibility the market demands in this diversifying processor-market. I used the word ‘demand’, because the consumer (being it an individual or company) who buys a new computer, wants his software to be faster, not potentially faster. What do you think?

Related Posts

    OpenCL on the CPU: AVX and SSE

    ...  a good article, just put it in the the comments. AVX - another quick overview AMD's SSE5 exists out of XOP, CVT16 (together ...