OpenCL is a member of a family of host-kernel programming-language extensions; others are CUDA, IMPC and DirectCompute/AMP. The family is defined by a separate function or set of functions, referred to as kernels, which are prepared and launched by the host to run in parallel. Added to that are deeply integrated language extensions for vectors, which give an extra dimension to parallelism.
Apart from the vectors, there is much overlap between host-kernel languages and parallel standards like MPI and OpenMP. As MPI and OpenMP have focused for years on how to make software parallel, they could give you an idea of how OpenCL (and the rest of the family) will evolve. Comparing them also answers how MPI's main concept, message passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.
At the right you see bees doing different things, which is easy to parallelise with MPI, but currently doesn't have the focus of OpenCL (when targeting GPUs). Actually it is very easy to do this with OpenCL too, if the hardware supports it, as CPUs do.
- Message – Event: these are very comparable, but the ideas behind them are different, as explained below.
- Threads – Kernels: kernels run in threads too, so nobody will beat you up for talking about threads in both MPI and OpenCL.
- Program – Host: in OpenCL you have a very clear definition of what is parallel code and what is serial code. Serial code runs in the main program, called the host. This is comparable to master-slave programs in MPI.
Comparing with an example
Check the example at http://www.lam-mpi.org/tutorials/one-step/ezstart.php, which shows a master-slave hello-world program in MPI.
These are the steps in the example and how they would have been done in OpenCL:
- Initialisation, which detects the hardware. In OpenCL this takes several steps, but has the same goal.
- Identifying the current ID. In OpenCL this is quite different, as it is master-slave only (so no “if ID==0” to pick the master) and the ID can have up to three dimensions, as OpenCL is more data-oriented.
- Finding the number of processes. OpenCL can also get this information, but there it matters more for the data size.
- Sending data from master to slaves. OpenCL kernels are started with data and send an event back when ready. Also, data is put in device memory and not sent explicitly to a process: the host throws it at the kernels and only defines what has to be done. MPI is more explicit in this.
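To make the difference in identification concrete, here is a minimal sketch in plain Python. It only emulates the two models and calls neither an MPI nor an OpenCL library; the function names `mpi_style_roles` and `opencl_style_ids` are made up for this illustration.

```python
from itertools import product


def mpi_style_roles(num_processes):
    """Every process runs the same program and picks its role from its rank."""
    return ["master" if rank == 0 else "slave" for rank in range(num_processes)]


def opencl_style_ids(global_size):
    """List the work-item IDs of a 1D, 2D or 3D index space.

    There is no master among them: the master is the host that launched them.
    The enumeration order here is arbitrary and not meant to mirror a device.
    """
    return list(product(*(range(n) for n in global_size)))


if __name__ == "__main__":
    print(mpi_style_roles(3))
    print(len(opencl_style_ids((4, 2, 3))))
```

Calling `opencl_style_ids((4, 2, 3))` yields 24 IDs such as `(3, 1, 2)`. None of them is special, whereas in the MPI-style list rank 0 is singled out by the program itself.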
MPI needs both the sender and the receiver threads to handle messages explicitly, whereas OpenCL doesn't allow messages between different kernels. So if you have computed some data and want to do some other processing on it, with MPI you message another thread, and with OpenCL you need to end the current thread to get back to the host.
MPI is more task-parallel oriented than OpenCL, which is data-parallel by origin. As OpenCL is extending to task-parallelism (on CPUs and upcoming GPUs, for example), its designers could look to MPI, but they don't. That is because kernels are designed to be micro-tasks, while MPI is designed for continuous computations.
So when a stream of data comes in, MPI would start a number of threads that fits the hardware, and each thread would handle a part of the stream. OpenCL would transfer the data to the compute device (CPU, GPU or a specialised device) and have the kernels compute parts independently of the number of cores available; each time a kernel has finished calculating its part, the freed compute core can be used for another computation in the queue.
You see a big difference: with MPI the thread has much power over the data and its processing, while OpenCL kernels don't have that power and everything is arranged from outside (the device full of slaves and the host containing the master).
In MPI the bees communicate what they want others to do, while in OpenCL the bees have to go home to receive a new order (and/or are replaced by another bee). Which method is faster depends on many factors, but the hierarchy of the MPI bees is quite different from that of the OpenCL bees.
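A toy model can show why the answer depends on the workload. The sketch below is plain Python with made-up function names. It compares a fixed split of the stream over the workers (the MPI-style picture above) with a queue of micro-tasks handed to whichever core is free (the OpenCL-style picture). It assumes a non-empty list of task costs and ignores transfer times, launch overhead and message cost, so treat it as an illustration and not as a benchmark.

```python
import heapq


def static_partition_makespan(costs, workers):
    """Split the stream into one contiguous chunk per worker, decided up front."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def work_queue_makespan(costs, workers):
    """Hand each micro-task to the core that becomes free first."""
    free_at = [0] * workers  # time at which each core is free again
    heapq.heapify(free_at)
    for cost in costs:
        start = heapq.heappop(free_at)          # earliest free core
        heapq.heappush(free_at, start + cost)   # busy until start + cost
    return max(free_at)


if __name__ == "__main__":
    uneven = [8, 1, 1, 1, 1, 1, 1, 1, 1]
    print(static_partition_makespan(uneven, 3), work_queue_makespan(uneven, 3))
```

With the uneven costs above and three workers, the fixed split finishes at time 10 and the queue at time 8. With nine equal tasks both finish at the same time, and once communication costs are added the picture can turn around.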
As described above, MPI is more task-parallel. When you check a program to MPI'ify, you look for different parts which can run independently. When trying to OpenCL'ify a program for the GPU, you check which data can be processed in parallel; when targeting the CPU you can work exactly as you would when MPI'ifying. Here comes the advantage: you also get the vector extensions of OpenCL by default. MPI does have vector types, but is not as strongly vector-oriented. In the coming years MPI will probably get more vector-oriented programming models, as more processors will have such extensions. It doesn't matter that MPI has no support for GPUs, as they will be integrated into CPUs; for now you need OpenCL or something similar to get that power.
For big data OpenCL has an advantage, since the program doesn't need to manage the number of threads: it just asks for thousands of threads and gives the responsibility to the device. Drivers (especially Intel's) also try to find ways to vectorize computations.
MPI-bees have one tool each, while vectorized OpenCL-bees have several by default.
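As a rough illustration of that extra dimension, the sketch below emulates in plain Python a kernel whose work-items each handle `width` neighbouring elements, the way a work-item operating on a four-wide vector type would. The function `multiply` and its parameters are made up for this example, and the loops run serially here; on a real device both levels could run in parallel.

```python
def multiply(a, b, width):
    """Element-wise product computed by work-items that are `width` lanes wide.

    Returns the result and the number of work-items that were needed.
    """
    assert len(a) == len(b) and len(a) % width == 0
    out = [0.0] * len(a)
    work_items = len(a) // width
    for gid in range(work_items):      # first dimension: across work-items
        for lane in range(width):      # extra dimension: lanes inside one work-item
            i = gid * width + lane
            out[i] = a[i] * b[i]
    return out, work_items


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
    b = [2.0] * 8
    print(multiply(a, b, 1))
    print(multiply(a, b, 4))
```

Both calls produce the same values, but the four-wide version needs two work-items where the scalar version needs eight.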
If you compare them on speed, OpenCL wins, but only because MPI uses only CPUs while OpenCL uses both GPUs and CPUs. But for scaling MPI is the way, or is it? For that reason researchers try to make all devices in a cluster appear as if they were in one computer. With this set-up the programmer must be more aware that transport times to devices involve more than just the PCIe bus, but that is the same problem MPI has. There are currently two projects that I am aware of which use these principles:
- SnuCL is open source and freely distributable. It seems to make use of OpenMP.
- MOSIX Virtual CL is also open source and is an extension of the existing cluster project MOSIX.
Here is where MPI has a big advantage, as there are many tools which unroll loops and generate MPI code. If you are lazy and encounter a loop in your code, you just add pragmas and see if it works. OpenCL is (still) very explicit: you need to unroll the loop yourself, turning its body into a kernel. OpenCL could use a lot of the results of research on automatic MPI'ifying.
Let me know what you think MPI is more suitable for and when OpenCL is more suitable (e.g. in a MOSIX cluster).
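What doing it yourself amounts to can be sketched as follows, again emulated in plain Python with made-up names (`saxpy_loop`, `saxpy_kernel`, `launch`). The loop disappears from the computation: its body becomes the kernel, the loop counter becomes the work-item ID, and the host decides how many work-items to start. The `launch` helper is only a serial stand-in for enqueueing a one-dimensional range on a device.

```python
def saxpy_loop(alpha, x, y):
    """Serial version: the loop owns the index."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = alpha * x[i] + y[i]
    return out


def saxpy_kernel(gid, alpha, x, y, out):
    """Kernel version: only the loop body is left; gid replaces the loop counter."""
    out[gid] = alpha * x[gid] + y[gid]


def launch(kernel, global_size, *args):
    """Host-side stand-in: run the kernel once per work-item ID (serially here)."""
    for gid in range(global_size):
        kernel(gid, *args)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]
    y = [10.0, 20.0, 30.0]
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out == saxpy_loop(2.0, x, y))
```

The kernel makes no assumption about the order in which IDs are visited, which is what allows a device to run them in parallel.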