If there is one rule for getting the best performance, it is avoiding data transfers. That is why it matters to have lots of bandwidth and GFLOPS per processor, rather than simply adding up those numbers across processors. Everybody who has worked with MPI knows why: transferring data between processors can totally kill performance. So the more that is packed into one chip, the better the results.

In this short article, I would like to give you a quick overview of the current state of bandwidth and performance. You would think the current generation of accelerators is very close together, but actually it is not.

The devices in the images below are the AMD FirePro S9150 (16GB), NVidia Tesla K80 (1 GPU of the 2, 12GB), NVidia Tesla K40 (12GB), Intel Xeon Phi 7120P (16GB) and Intel Xeon 2699 v3 (18-core CPU). I was in doubt whether to select the K40 or the K80, as I wanted to focus on a single GPU only, so I took both. Dual-GPU cards have an advantage when it comes to power consumption and physical space; neither is taken into consideration in this blog. Efficiency (actual performance compared to the theoretical maximum) is not included either, as it also needs a broad explanation.

Each of these devices runs X86-OpenMP and/or OpenCL.

The numbers

The bandwidth and performance numbers show where things stand: the Xeon Phi and FirePro have the most bandwidth, and the FirePro is a staggering 70% to 100% faster than the rest on double-precision GFLOPS.

Bandwidth per chip: The Xeon Phi reaches 350 GB/s, followed by the FirePro with 320 GB/s and the K40 with 288 GB/s. NVidia's K80 is only at 240 GB/s per GPU, whereas the DDR memory of the CPU gets only 50-60 GB/s.

GFLOPS per chip: The FirePro leaves the competition far behind with 2530 GFLOPS (double precision). The K40 and K80 get 1430 and 1450, followed by the CPU at 1324 and the Xeon Phi at 1208. Note that these are theoretical maximums; real-world applications will achieve less.
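To relate the two charts to the "avoid data transfers" rule, you can divide bandwidth by GFLOPS to see how many bytes each chip can fetch per floating-point operation. The following is only a rough sketch using the theoretical numbers quoted above; the 55 GB/s for the CPU is an assumed midpoint of the 50-60 GB/s range, and the names `chips` and `bytes_per_flop` are made up for this illustration.

```python
# Peak memory bandwidth (GB/s) and theoretical double-precision peak (GFLOPS),
# as quoted in this article.
chips = {
    "Xeon Phi 7120P": (350, 1208),
    "FirePro S9150": (320, 2530),
    "Tesla K40": (288, 1430),
    "Tesla K80 (1 GPU)": (240, 1450),
    # Assumption: midpoint of the 50-60 GB/s quoted for DDR.
    "Xeon 2699 v3": (55, 1324),
}


def bytes_per_flop(bandwidth_gbs, gflops):
    """Bytes the memory system can deliver per theoretical FLOP."""
    return bandwidth_gbs / gflops


balance = {name: bytes_per_flop(bw, gf) for name, (bw, gf) in chips.items()}

# Print from the most to the least bandwidth-rich chip.
for name, b in sorted(balance.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:18s} {b:.3f} bytes/FLOP, or {8 / b:.0f} FLOPs per double loaded")
```

The lower the bytes-per-FLOP figure, the more arithmetic a kernel has to do on each value it loads before the chip's theoretical GFLOPS can come into play at all.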

If you have OpenCL or OpenMP code, you can optimise it for a new device in a short time. Yes, you should have written it in OpenCL or OpenMP, because now the competition can easily outperform you by selecting a better device.

Costs

Lowest prices in the Netherlands, at the moment of writing:

Intel Xeon 2699 v3: € 6,560
Intel Xeon Phi 7120P + 16GB DDR4: € 3,350
NVidia Tesla K80: € 5,500 (€ 2,750 per GPU)
NVidia Tesla K40: € 4,070
AMD FirePro S9150: € 3,500

For some devices (like the K40) a single shop offers the low price, while other shops are at least €200 more expensive.

Note: as the Xeon can have 1TB of memory, the “costs per GB/s” is only half the story. Currently the accelerators have at most 16GB. Soon a 32GB FirePro, the S9170, will be available in the shops, to move up in this space of memory-hungry HPC applications.

Costs per GFLOPS per chip: For raw GFLOPS the FirePro is the cheapest, followed by the K80, the Xeon Phi and then the K40. The Xeon Phi and K40 are about twice as expensive per GFLOPS as the FirePro, and the Xeon is clearly the most expensive at roughly 3.5 times the FirePro's cost per GFLOPS.
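The ranking above can be reproduced from the prices and theoretical GFLOPS listed in this article. A minimal sketch follows, assuming the K80 price is split evenly per GPU as in the price list; the names `devices` and `euro_per_gflops` are chosen only for this example.

```python
# Price (EUR) and theoretical double-precision peak (GFLOPS), as quoted in this article.
devices = {
    "FirePro S9150": (3500, 2530),
    "Tesla K80 (1 GPU)": (2750, 1450),  # half of the 5,500 EUR dual-GPU card
    "Xeon Phi 7120P": (3350, 1208),
    "Tesla K40": (4070, 1430),
    "Xeon 2699 v3": (6560, 1324),
}


def euro_per_gflops(price_eur, gflops):
    """Purchase price divided by theoretical peak performance."""
    return price_eur / gflops


costs = {name: euro_per_gflops(p, g) for name, (p, g) in devices.items()}
ranking = sorted(costs, key=costs.get)  # cheapest per GFLOPS first
baseline = costs["FirePro S9150"]

for name in ranking:
    ratio = costs[name] / baseline
    print(f"{name:18s} {costs[name]:5.2f} EUR/GFLOPS ({ratio:.1f}x FirePro)")
```

Because the GFLOPS are theoretical maximums, the same calculation with measured performance for your own code may give a different order.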

If costs are an issue, it really makes sense to invest some time in making your own Excel sheets for several devices, and to include the costs of power usage.
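The same sheet can be sketched in a few lines of code. The function below adds electricity to the purchase price; note that the wattage, the three-year period and the price per kWh in the example call are hypothetical placeholders, not measurements or quotes from this article, and the simple model assumes the device draws that power around the clock.

```python
def total_cost_eur(price_eur, watts, years, eur_per_kwh):
    """Purchase price plus electricity for running at `watts` around the clock."""
    hours = years * 365 * 24
    energy_kwh = watts / 1000 * hours
    return price_eur + energy_kwh * eur_per_kwh


# Hypothetical inputs: a 3,500 EUR card drawing 235 W for 3 years at 0.20 EUR/kWh.
# Replace them with your own prices, measured power draw and electricity tariff.
example = total_cost_eur(price_eur=3500, watts=235, years=3, eur_per_kwh=0.20)
print(f"{example:.2f} EUR over 3 years")
```

Dividing this total by the measured GFLOPS or GB/s of your own application, instead of the theoretical peak, gives a figure that is closer to what you actually pay for performance.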

Which to choose?

Based on the above numbers, the FirePro is the best choice. But your algorithm might simply work better on one of the others; we can help you by optimising your code and performing meaningful benchmarks.

StreamHPC communications

Performance of 5 accelerators in 4 images
