There is a lot going on on the path to GPGPU 2.0 – the libraries built on top of OpenCL and/or CUDA. Among the many solutions we see, for example, Microsoft with C++ AMP on top of DirectCompute, NVidia (and others) with OpenACC, and now AccelerEyes (best known for their Matlab extension Jacket and libJacket) with ArrayFire.
I want to show you how easy programming GPUs can be when using such libraries. Know that to use all features, such as complex numbers, multi-GPU support and linear algebra functions, you need to buy the full version; prices start at $2500 for a workstation/server with 2 GPUs.
It comes in two flavours: one for OpenCL (C++) and one for CUDA (C, C++, Fortran). The code for both is the same, so you can easily switch – though you still see references to cuda.h, you can compile most examples from the CUDA version with the OpenCL version after a little editing. Let’s look a little into what it can do.
Getting started
Note. If you use ArrayFire on Linux-64 with AMD, be sure you have at least AMD APP 2.5. Older drivers lock up your computer due to a bug in the AMD driver.
Be sure you have CUDA 4.0 or 4.1 installed for your NVidia GPU, and OpenCL 1.1 installed for your Intel and AMD devices. Check here to see how to get this going.
You can download ArrayFire from http://www.accelereyes.com/download_arrayfire after registering. You get both the 64-bit and 32-bit libraries in one package, and it installed without any problems. You can start right away and go to bin32 or bin64, but if you want to recompile the examples, go to the examples directory and run “make clean && make”.
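In shell form the quick start looks like this (the unpack directory name is hypothetical):

cd arrayfire/examples   # wherever you unpacked the package
make clean && make
cd ../bin64             # or bin32 on 32-bit systems
./helloworld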
The compile line gives an idea of what it uses:
g++ -I../include -L../lib64 -Wl,-rpath=../lib64 -lafcl -LlibOpenCL.so.1 -LlibclAmdBlas.so.1 -LclAmdFft.Runtime.so.1.4.82 ../examples/helloworld.cpp -o ../bin64/helloworld
You can see it bundles AMD’s BLAS library and FFT library.
Let’s test if it installed correctly by running “helloworld”. Results on my current PC:
Arrayfire (OpenCL alpha)
Device0: Barts (in use)
Device1: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Device2: GeForce GTX 560
The remark “(in use)” means that this is the device currently selected by ArrayFire, not that it is being used by another process. If no devices are listed, then you have not correctly installed the OpenCL drivers for your hardware.
Example: BlackScholes
BlackScholes is a very popular demo application for OpenCL and related techniques, as it is slow as a single-threaded CPU version and very speedy on OpenCL-capable devices. I wrote about it before, in case you want to check what it looks like in plain OpenCL.
Benchmarks for Input Data Size = 184000 x 1:
- AMD (Barts GPU): 1.939260 s
- Intel (Core i7-2600 CPU): 0.471669 s (around 0.1686 s in subsequent runs)
- NVidia (GeForce GTX 560), OpenCL: TBD
- NVidia (GeForce GTX 560), CUDA: TBD
You can see the full code in the SDK, so I focus on only a few commands. Everything is in the namespace “af”, but I leave out “af::” before the commands. Under each command I describe what it does.
device(1);
This selects the second device (on this machine the Intel CPU). A nice touch is that the rest of the initialisation is done when needed, so there is no real need to be very precise with it. Skipping this line selects device 0.
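If you want to print the device list from code (as helloworld does), recent ArrayFire releases offer af::info(); that the alpha exposes the same call is an assumption here:

af::info();    // assumption: prints the device list, as in the helloworld output above
af::device(1); // then select the device you want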
int N = 6;
float C1[] = {5.0f, 6.0f, 7.0f, 8.0f, 1.0f, 10.0f}; // different in the real code: 1000 times as long
array array1 = array(C1, N, 1);
This is all that is needed to produce an array of dimensions 6 x 1. If you had used N/2 and 2 as the last two parameters, it would have given a 3 x 2 array using the same data from C1.
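For illustration, the 3 x 2 variant would look like this (array2 is a hypothetical name):

array array2 = array(C1, N/2, 2); // 3 x 2 array built from the same six values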
for (int i = 0; i < iter; i++) {
    black_scholes(Cg, Pg, Sg, Xg, Rg, Vg, Tg);
}
This calls the “kernel” several times with arrays prepared as in the previous step. Below is the kernel.
void black_scholes(array& C, array& P, array& S, array& X,
                   array& R, array& V, const array& T)
{
    array d1_ = log(S / X);
    d1_ = d1_ + (R + (V * V) * 0.5) * T;
    array d1 = d1_ / (V * sqrt(T));
    array d2 = d1 - (V * sqrt(T));
    C = S * cnd(d1) - (X * exp((-R) * T) * cnd(d2));
    P = X * exp((-R) * T) * cnd(-d2) - (S * cnd(-d1));
}
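For reference, the kernel implements the standard closed-form Black-Scholes solution, written here in the kernel’s variable names:

d1 = (log(S/X) + (R + V*V/2) * T) / (V * sqrt(T))
d2 = d1 - V * sqrt(T)
C  = S * N(d1) - X * exp(-R*T) * N(d2)
P  = X * exp(-R*T) * N(-d2) - S * N(-d1)

Here S is the spot price, X the strike price, R the risk-free interest rate, V the volatility, T the time to maturity, and cnd() computes N(·), the cumulative normal distribution.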
Alternatively, the code can be called without a separate function:
for (int i = 0; i < iter; i++) {
    array d1_ = log(Sg / Xg);
    d1_ = d1_ + (Rg + (Vg * Vg) * 0.5) * Tg;
    array d1 = d1_ / (Vg * sqrt(Tg));
    array d2 = d1 - (Vg * sqrt(Tg));
    Cg = Sg * cnd(d1) - (Xg * exp((-Rg) * Tg) * cnd(d2));
    Pg = Xg * exp((-Rg) * Tg) * cnd(-d2) - (Sg * cnd(-d1));
}
The nice thing about ArrayFire is that the array operators are overloaded such that computations are offloaded to the selected compute device. So “log(Sg / Xg)” is translated into optimised OpenCL code and run on the GPU or on AVX/SSE. In this case each element of array Sg is divided by the element at the same location in array Xg, and log is then called on each element of the resulting array.
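To make those semantics concrete, here is a sketch of the scalar CPU equivalent of “log(Sg / Xg)” in plain C++ (the function and its parameter names are hypothetical):

#include <cmath>

// hypothetical host-side equivalent of log(Sg / Xg), element by element:
void log_div_host(const float* S, const float* X, float* result, int N) {
    for (int i = 0; i < N; i++)
        result[i] = std::log(S[i] / X[i]);
}

ArrayFire generates and runs code like this on the selected device in one go, without you writing the explicit loop.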
af::sync();
The sync function forces the compute device to do the work.
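This matters for benchmarking: without the sync, the timer could stop before the device has finished. A minimal sketch using plain C++11 timing (std::chrono is standard C++, not part of ArrayFire), reusing the arrays and black_scholes() from the example above:

#include <chrono>

auto t0 = std::chrono::high_resolution_clock::now();
for (int i = 0; i < iter; i++) {
    black_scholes(Cg, Pg, Sg, Xg, Rg, Vg, Tg);
}
af::sync();  // wait until the device has actually finished the work
auto t1 = std::chrono::high_resolution_clock::now();
double seconds = std::chrono::duration<double>(t1 - t0).count();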
You see: once you understand the ArrayFire array class, you have all the baggage needed to write an OpenCL program.
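To wrap up, a minimal end-to-end sketch of such a program, assuming the arrayfire.h header of the package and the same alpha API used above (the input values are made up for illustration):

#include <arrayfire.h>
using namespace af;

int main() {
    device(0);  // select the first device; skipping this has the same effect

    int N = 6;
    float data[] = {5.0f, 6.0f, 7.0f, 8.0f, 1.0f, 10.0f};
    array A = array(data, N, 1);       // copy host data into a device array

    // overloaded operators: everything below runs on the selected device
    array B = log(A) + sqrt(A) * 0.5;

    af::sync();  // force the device to do the work
    return 0;
}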
Advantages & Disadvantages
I will get more into libraries built on top of OpenCL later, to explain their common (dis)advantages. Any library has a scope for which it works best; the products of AccelerEyes focus mostly on linear algebra and 2D data (arrays), less on 3D data and images.
It is hard to say how much performance the library squeezes out of the GPU(s) for you. Sometimes it works almost as well as manually optimised code and in many cases it is comparable, which is great, as you get results much faster and can focus on your algorithm instead of on coding. As a bonus, the optimisation gets better with each new version.
ArrayFire does not give you the generated kernels, so you depend on their license and libraries. Neither does it provide a fall-back to the CPU. This makes it a good solution for research and in-company deployments, but less so for product development.
I am a big fan of splitting device computations from host code. ArrayFire gives you the possibility to mix them again, which gives less optimisable code. But others will disagree with me.
Currently, at version 0.3, it is still in alpha, so if you want to use it in production code you need to test it extensively. AccelerEyes has built up its good name over many years and is therefore careful before officially taking the library out of alpha/beta.
Learning more
I was happy to discover that ArrayFire is so well documented that you can accelerate your algorithm within a day. Whether ArrayFire is the right choice for you depends on your exact situation and goals. The best thing is to give it a try and put your algorithm to the test. If you want to discuss the best option for your software, feel free to contact us.
StreamHPC has selected ArrayFire as one of the libraries on top of OpenCL with potential, and will therefore start offering consulting services for ArrayFire from April 2012.