MPI in terms of OpenCL

OpenCL is a member of a family of host-kernel programming language extensions; others are CUDA, IMPC and DirectCompute/AMP. Its defining feature is a separate function or set of functions, referred to as the kernel, which are prepared and launched by the host to run in parallel. On top of that come deeply integrated language extensions for vectors, which add an extra dimension of parallelism.

Apart from the vectors, there is a lot of overlap between host-kernel languages and parallel standards like MPI and OpenMP. Since MPI and OpenMP have focused for years on how to make software parallel, they can give you an idea of how OpenCL (and the rest of the family) will evolve. They also show how MPI's main concept, message passing, could be done with OpenCL, and moreover how OpenCL could be integrated into MPI/OpenMP.

At the right you see bees doing different things, which is easy to parallelise with MPI but is currently not the focus of OpenCL (when targeting GPUs). Actually it is very easy to do this with OpenCL too, if the hardware supports it, such as CPUs.

Terminology

  • Message – Event: these are actually quite comparable, but the ideas behind them are different. Explained below.
  • Threads – Kernels: kernels run in threads too, so nobody will beat you up for using the word "threads" for both MPI and OpenCL.
  • Program – Host: in OpenCL there is a very clear division between parallel code and serial code. Serial code runs in the main program, called the host. This is comparable to master-slave programs in MPI.
Sounds comparable, but there are differences.

Comparing with an example

Check the example at http://www.lam-mpi.org/tutorials/one-step/ezstart.php, which shows a master-slave hello-world example in MPI.

These are the steps in the example and how they would have been done in OpenCL:

  • Initialisation, which detects the hardware. In OpenCL this takes several steps, but has the same goal.
  • Identifying the current ID. In OpenCL this is quite different, as it is master-slave only (so no “if ID==0” to pick the master) and the ID can be in three dimensions, as OpenCL is more data-oriented.
  • Finding the number of processes. OpenCL can also get this information, but there it mostly matters for the data size.
  • Sending data from master to slaves. OpenCL kernels are launched with their data and send an event back when ready. Also, data is put into device memory rather than sent explicitly to a process: the host throws it at the kernels and only defines what has to be done – MPI is more explicit in this.
A big difference is that the MPI commands are in-code, while OpenCL keeps the slave code in a separate file. Personally I find this tidier, as integrating (and getting rid of) new OpenCL kernels is much easier. A sketch of how these steps look on the OpenCL side follows below.
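
To make the comparison concrete, here is a minimal sketch of the OpenCL host-side equivalents of those steps. It is my own illustrative code, not a literal translation of the LAM/MPI example: every work-item plays the role of a slave that reports its own ID back to the host. Error checking and resource clean-up are omitted, and the kernel source is inlined as a string for brevity.

    /* Minimal sketch: OpenCL host-side steps mirroring the MPI hello-world.
     * All names (kernel "hello", buffer "out") are illustrative; every cl*
     * call returns a status that real code should check. */
    #include <stdio.h>
    #include <CL/cl.h>

    /* The "slave" code: each work-item records its own global ID,
     * the OpenCL counterpart of a slave replying with its rank. */
    static const char *source =
        "__kernel void hello(__global int *out) {   \n"
        "    size_t id = get_global_id(0);          \n"
        "    out[id] = (int)id;                     \n"
        "}                                          \n";

    int main(void)
    {
        /* 1. Initialisation / hardware detection (several steps in OpenCL). */
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);
        cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
        cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);

        /* Build the separate slave code (here inlined as a string). */
        cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, NULL);
        clBuildProgram(program, 1, &device, NULL, NULL, NULL);
        cl_kernel kernel = clCreateKernel(program, "hello", NULL);

        /* 2./3. There is no "if (rank == 0)": the host is always the master,
         *       and the number of work-items is chosen from the data size. */
        const size_t global_size = 8;

        /* 4. "Sending data to the slaves" = putting a buffer in device memory. */
        cl_mem out = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                    global_size * sizeof(int), NULL, NULL);
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &out);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               0, NULL, NULL);

        /* Read the "replies" back and print them, as the MPI master does. */
        int ids[8];
        clEnqueueReadBuffer(queue, out, CL_TRUE, 0, sizeof(ids), ids, 0, NULL, NULL);
        for (size_t i = 0; i < global_size; ++i)
            printf("Hello from work-item %d\n", ids[i]);
        return 0;
    }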

Messages

In MPI both the sending and the receiving thread have to handle messages explicitly; OpenCL doesn't allow messages between different kernels at all. So when some data has been computed and you want to do further processing, with MPI you message another thread, while with OpenCL you have to end the current kernel to get back to the host. MPI is more task-parallel oriented, whereas OpenCL is more data-parallel in origin. As OpenCL is extending towards task-parallelism (on CPUs and upcoming GPUs, for example), it could look at MPI, but it doesn't. That is because kernels are designed as micro-tasks, while MPI is designed to handle continuous computations.

So when a stream of data comes in, MPI would initiate a number of threads fitting the hardware, and each thread would handle a part of the stream. OpenCL would transfer the data to the compute device (CPU, GPU or a specialised device) and have the kernels compute the parts independently of the number of cores available; each time a kernel finishes its part, the freed compute core can be used for the next computation in the queue. Here you see a big difference: with MPI the thread has a lot of power over the data and its processing, while OpenCL kernels don't have that power and everything is arranged from the outside (the device full of slaves and the host containing the master).
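
A small sketch of that difference, with hypothetical names (produce, consume, queue) and without error checking: in MPI the worker itself pushes its result to another rank, while in OpenCL the host chains a second kernel onto the completion event of the first.

    #include <mpi.h>
    #include <CL/cl.h>

    /* MPI style: a worker explicitly hands its result to another rank,
     * which must post a matching MPI_Recv on its side. */
    void mpi_hand_over(double result, int next_rank)
    {
        MPI_Send(&result, 1, MPI_DOUBLE, next_rank, 0, MPI_COMM_WORLD);
    }

    /* OpenCL style: kernels cannot message each other; the first kernel
     * simply ends, and the host chains the follow-up kernel on its event. */
    void opencl_hand_over(cl_command_queue queue, cl_kernel produce,
                          cl_kernel consume, size_t global)
    {
        cl_event done;
        clEnqueueNDRangeKernel(queue, produce, 1, NULL, &global, NULL,
                               0, NULL, &done);
        clEnqueueNDRangeKernel(queue, consume, 1, NULL, &global, NULL,
                               1, &done, NULL);
    }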

In MPI the bees communicate what they want others to do, while in OpenCL the bees have to go home to receive a new order (and/or are replaced by another bee). Which method is faster depends on many factors, but the hierarchy of the MPI-bees is quite different from that of the OpenCL-bees.

Parallelisation

As described above, MPI is more task-parallel. When you check a program to MPI’ify, you look for different parts which can run independently. When OpenCL’ifying a program for the GPU, you check which data can be processed in parallel; when targeting the CPU you can work exactly the same way as when MPI’ifying. Here comes the advantage: you also get OpenCL's vector extensions by default. MPI does have vector types, but its orientation towards them is not that strong. In the coming years MPI will probably get a more vector-oriented programming model, as more processors will have such extensions. It doesn't matter that MPI has no support for GPUs, as they will be integrated into CPUs – for now you need OpenCL or the like to get that power.
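
As a minimal illustration of those vector extensions (kernel and argument names are just examples, and the kernel would live in a separate .cl file): one work-item processes four floats at once via the built-in float4 type.

    /* Illustrative OpenCL C kernel: a vectorised scale-and-add, where one
     * work-item covers four elements thanks to the float4 type. */
    __kernel void saxpy4(float a,
                         __global const float4 *x,
                         __global float4 *y)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];   /* one statement operates on four floats */
    }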

For big data OpenCL has an advantage, since the program doesn't need to take care of the number of threads: it just asks for thousands of threads and hands the responsibility to the device. Drivers (especially Intel's) also try to find ways to vectorise computations.
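
On the host side that idea could look like the sketch below (illustrative names, no error handling): the global work size simply equals the number of data elements, and the work-group size is left to the driver.

    #include <CL/cl.h>

    /* Sketch: ask for one work-item per data element, even if that means
     * millions of them; nothing here depends on the number of cores. */
    void launch_over_data(cl_command_queue queue, cl_kernel kernel, size_t n_elements)
    {
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n_elements, NULL,
                               0, NULL, NULL);
        /* The local work-group size is left NULL: the driver picks it, and
         * may also vectorise the kernel behind the scenes. */
    }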

MPI-bees have one tool each, while vectorised OpenCL-bees have several by default.

Scaling

If you compare them for speed, OpenCL wins, but only because MPI uses only CPUs while OpenCL uses both GPUs and CPUs. For scaling, however, MPI is the way to go – or is it? To close that gap, researchers try to make all devices in a cluster appear as if they were in one computer. With such a set-up the programmer must be more aware that transport times to devices involve more than just the PCIe bus, but that is the same problem MPI also has. There are currently two projects I am aware of which use these principles:

  • SnuCL is open source and freely distributable. It seems to make use of OpenMP.
  • MOSIX Virtual CL is also open source and is an extension of the existing cluster project MOSIX.
I am not aware of benchmarks between these two.

Automatic parallelisation

Here is where MPI has the big advantage, as there are many tools which unroll loops and generate MPI code. OpenCL is (still) very explicit: when you are lazy and encounter a loop in your code, elsewhere you can just add a pragma and see if it works, but with OpenCL you need to unroll the loop yourself (see the sketch below). OpenCL could use a lot of the results of research on automatic MPI’ifying.
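
A hedged illustration of that difference (function names are mine): with OpenMP-style pragmas the lazy approach is a single line in front of the existing loop, while with OpenCL the loop body becomes a kernel and the loop counter becomes the work-item ID.

    /* The lazy route: one pragma in front of the existing loop. */
    void scale_openmp(float *data, int n, float factor)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            data[i] *= factor;
    }

    /* The OpenCL route, in a separate .cl file: the loop is unrolled by hand,
     * the body stays, and the host enqueues n work-items instead of iterating. */
    __kernel void scale_opencl(__global float *data, float factor)
    {
        size_t i = get_global_id(0);   /* the loop counter becomes the ID */
        data[i] *= factor;
    }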

Let me know for what you think MPI is more suitable and when OpenCL is more suitable (e.g. in a MOSIX cluster).

7 thoughts on “MPI in terms of OpenCL”

  1. Greg Blair

    The premise of this blog “MPI in terms of OpenCL” indicates some confusion over what OpenCL and MPI are.

    MPI is for running processes in parallel on multiple machines such as cluster networks or render farms. OpenCL is for running parallel threads on the many GPU cores installed in a single machine. Multiple GPUs can be installed in a single machine.

    It is not an either-or situation between OpenCL and MPI. Both can be used. If one requires the compute power of a render farm of GPU-equipped nodes, one can use both: start up an MPI job in the MPI network and each MPI instance uses OpenCL to utilize the GPU on that node's cluster machine.

    Think of OpenCL as a big truck. It can carry many items. It is only one machine. The more GPU cores available, the bigger the truck. The bigger the truck, the more things it can carry. A bigger truck requires fewer trips to get the job done.

    Think of MPI as a fleet of trucks. They can carry many items. It is many machines.

    Think of OpenCL and MPI working together as a fleet of big trucks.

    Try implementing a simple program using MPI and OpenCL. Let MPI distribute a unit of work to an MPI node. That node calculates the results using OpenCL and communicates the results back to the master MPI node using MPI. The differences and advantages of each will become clear.

    • Vincent Hindriksen Post author

      Thank you for your reply.

      There were many questions on the difference between the two, and moreover which is better.

      Maybe it is better put as: do you virtualise the OpenCL devices and have them come together via MPI as one big machine, or do you keep the devices invisible and leave the actual computation to the nodes? MOSIX and SnuCL are solutions of the first type, while your description is one of the second type.

    • Issam Wakidi

      Thank you Greg for sharing this great use-case example of MPI + OpenCL. It augments this good article. Both the article and your comment made it all much clearer for me.

  2. John Paulus

    Hello,

    I’m an MPI programmer who has recently taken interest in OpenCL. Thank you for writing this article, I found it very helpful and informative. Nice blog overall too.

    -John

    • Vincent Hindriksen Post author

      Thank you for your compliment. I would appreciate it if you could share your experience with OpenCL as an MPI programmer here in a while.

  3. Giovanni Rosotti

    Hello,

    I have never heard of an automatically parallelizing MPI compiler. As far as I know, automatic parallelization exists only for OpenMP (and even there it performs very badly), which has a FAR easier way of parallelizing. Where have you found an auto-parallel MPI compiler?
