Say you have some data that needs to be used as input for a larger kernel, but needs a little preparation to get it aligned in memory (small kernel and random reads). Unluckily the efficiency of such kernel is very low and there is no speed-up or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA-programmers) once is decided to use accelerators: just use the CPU. Main problem is not the kernel that has been optimised for the GPU, but all supporting code (like the host-code) needs to be rewritten to be able to use the CPU.

# Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256 bit wide vector computer. This mean a double4 (a vector of 4 times a 64-bit float) can be computed in one clock-cycle. So a 4-core CPU of 3.5GHz goes from 3.5 billion instructions to 14 billion when using all 4 cores, and to 56 billion instructions when using vectors. When using a float8, it doubles to 112 billion instructions. Using MAD-instructions (Multiply+Add), this can be doubled to even 224 billion instructions.

Say we have this CPU with 4 core and AVX/SSE, and the below code:

```int* a = ...;
int* b = ...;
for (int i = 0; i < M; i++)
a[i] = b[i]*2;
}```

How do you classify the accelerated version of above code? A parallel computation or a vector-computation? Is it is an operation using an M-wide vector or is it using M threads. The answer is both – vector-computations are a subset of parallel computations, so vector-computations can be run in parallel threads too. This is interesting, as this means the code can run on both the AVX as on the various codes.

If you have written the above code, you’d secretly hope the compiler finds out this automatically runs on all hyper-threaded cores and all vector-extensions it has. To have code made use of the separate cores, you have various options like normal threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enhanced programming languages like OpenCL.

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

# Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support

Intel has just released its OpenCL bit CPU-drivers, version 2013 bèta. It has support for OpenCL 1.1 (not 1.2 as for the CPU) on Intel HD Graphics 4000/2500 of the 3rd generation Core processors (Windows only). The release notes mention support for Windows 7 and 8, but the download-site only mentions windows 8. Support under Linux is limited to 64 bits.

The release notes mention:

• General performance improvements for many OpenCL* kernels running on CPU.
• Preview Tool: Kernel Builder (Windows)
• Preview Feature: support of kernel source code hotspots analysis with the Intel VTuneT Amplifier XE 2011 update 3 or higher.
• The GNU Project Debugger (GDB) debugging support on Linux operating systems.
• New OpenCL 1.2 extensions supported by the CPU device:
• `cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics`
• `cl_khr_fp16`
• `cl_khr_gl_sharing`
• `cl_khr_gl_event`
• `cl_khr_d3d10_sharing`
• `cl_khr_dx9_media_sharing`
• `cl_khr_d3d11_sharing`.
• OpenCL 1.1 extensions that were changed in OpenCL 1.2:
• Device Fission supports both OpenCL 1.1 EXT API’s and also OpenCL* 1.2 fission core features
• Media Sharing support intel 1.1 media sharing extension and also the 1.2 KHR media sharing extension
• Printf extension is aligned with OpenCL 1.2 core feature.

Check the release notes for full information.

The drivers can be found on http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk-2013/. Installation is simple. For Windows there is a installer. If you have Linux, make sure you remove any previous version of Intel’s openCL drivers. If you have a Debian-based Linux, use the command ‘alien’ to convert the rpm to deb, and make sure ‘libnuma1‘ is installed. There are requirements for libc 2.11 or 2.12 – more information on that later as Ubuntu 12.04 has libc6 2.15.

# How expensive is an operation on a CPU?

Programmers know the value of everything and the costs of nothing. I saw this quote a while back and loved it immediately. The quote by Alan Perlis is originally about Perl LISP-programmers, but only highly trained HPC-programmers seem to have obtained this basic knowledge well. In an interview with Andrew Richards of Codeplay I heard it from another perspective: software languages were not developed in a time that cache was 100 times faster than memory. He claimed that it should be exposed to the programmer what is expensive and what isn’t. I agreed again and hence this post.

I think it is very clear that programming languages (and/or IDEs) need to be redesigned to overcome the hardware-changes of the past 5 years. I talked about that in the article “Separation of compute, control and transfer” and “Lots of loops“. But it does not seem to be enough.

So what are the costs of each operation (on CPUs)?

# OpenCL Fireworks

I like and appreciate differences in the many cultures on our Earth, but also like to recognise different very old traditions everywhere to feel a sort of ancient bond. As an European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree, each December – and the burning of al those trees in January. Also celebration of New Year falls on different dates, the Chinese new year being the best known (3 February 2011). We – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

# OpenCL on the CPU: AVX and SSE

When AMD came out with CPU-support I was the last one who was enthusiastic about it, comparing it as feeding chicken-food to oxen. Now CUDA has CPU-support too, so what was I missing?

This article is a quick overview on OpenCL on CPU-extensions, but expect more to come when the Hybrid X86-Processors actually hit the market. Besides ARM also IBM already has them; also more about their POWER-architecture in an upcoming article to give them the attention they deserve.

# CPU extensions

SSE/MMX started in the 90’s extending the IBM-compatible X86-instruction, being able to do an add and a multiplication in one clock-tick. I still remember the discussion in my student-flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I “found” on the internet made use of SSE-capabilities. Currently we have reached SSE5 (by AMD) and Intel introduced a new extension called AVX. That’s a lot of abbreviations! MMX stands for “MultiMedia Extension”, SSE for “Streaming SIMD Extensions” with SIMD being “Single Instruction Multiple Data” and AVX for “Advanced Vector Extension”. This sounds actually very interesting, since we saw SIMD and Vectors op the GPU too. Let’s go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.

# New grown-ups on the block

There is one big reason StreamHPC chose for OpenCL and that is (future) hardware-support. I talked about NVIDIA versus AMD a lot, but knowing others would join soon. AMD is correct when they say the future is fusion: hybrid computing with a single chip holding both CPU- and GPU-cores, sharing the same memory and interconnected at high speed. Merging the technologies would also give possible much higher bandwidths to memory for the CPU. Let us see in short which products from experienced companies will appear on the OpenCL-stage.