Keep The Hardware Focus

[Image: "The real Apu"]

If you buy a car, the kind of fuel is usually not the first choice. You first select on engine-properties, looks, interior, brand and of course total cost of ownership; the running costs can be a reason to choose a certain type of fuel, though. In the parallel-computing world it is the other way around: there the fuel (CUDA or OpenCL) is the first decision, and only then is the hardware chosen. I think this is wrong, which is why I talk so much about CUDA-vs-OpenCL, even though I think NVidia is a good choice for a whole list of algorithms.

When we advise during a consult, we want to give the best advice. In case of CUDA, that advice comes down to choosing between a Tesla and the latest GTX based on budget; in case of OpenCL we can give much better advice on hardware. Actually, starting with the technique is the worst thing you can do: focus on the hardware first and then pick the technique that suits it best.

IMPORTANT. The following is for understanding some concepts and limits only! It is purely theoretical, so I don't claim any real-world results. Also not taken into account is how well different processors handle control-instructions (for, while, if, case, etc.), which has quite some influence on actual performance.

GPGPU

Most GPGPU workloads have 5 stages (sketched in code below the list):

  1. Preparing data.
  2. Load prepared data from CPU-memory to GPU-memory.
  3. Do the calculations (via CUDA or OpenCL kernel).
  4. Load data from GPU-memory to CPU-memory.
  5. Post-process data on CPU.
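
To make the stages concrete, here is a minimal sketch of how they map onto OpenCL 1.1 host code. Everything in it (the trivial "square" kernel, the fixed sizes, the absent error-checking) is illustrative only, not code from a real project.

```c
/* The 5 GPGPU stages as OpenCL host code. Build with e.g.: gcc stages.c -lOpenCL */
#include <stdio.h>
#include <CL/cl.h>

static const char *src =
    "__kernel void square(__global float *d) {"
    "    size_t i = get_global_id(0);"
    "    d[i] = d[i] * d[i];"
    "}";

int main(void) {
    enum { N = 1024 };
    float data[N];
    for (int i = 0; i < N; ++i)          /* stage 1: prepare data on the CPU */
        data[i] = (float)i;

    cl_platform_id platform;
    cl_device_id device;
    cl_int err;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof data, NULL, &err);
    /* stage 2: CPU-memory -> GPU-memory */
    clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "square", &err);
    clSetKernelArg(k, 0, sizeof buf, &buf);

    size_t global = N;
    /* stage 3: do the calculations */
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);

    /* stage 4: GPU-memory -> CPU-memory */
    clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof data, data, 0, NULL, NULL);

    /* stage 5: post-process on the CPU */
    printf("data[3] = %.1f\n", data[3]); /* prints 9.0 */
    return 0;
}
```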

Let's focus on streaming media; say we want to mix 8 16-bit stereo audio-streams. This takes a simple MAD-instruction (= 2 FLOP per input byte): sum all streams and divide by 8. Total needed bandwidth is (8 in + 1 out) * 16 bit * 2 channels * 44,000 Hz ≈ 12.7 Mbit/s ≈ 1.58 MB/s, of which 1.41 MB/s is input. Needed processing-power for real-time: about 2.8 MFLOPS. PCIe 2.0 x16 can do 7.0 GB/s (8 GB/s minus transfer overhead), so no sweat here for the bus. If we want to use the full bandwidth, we need more than 14 GFLOPS. A Pentium 4 could do that, so let's put the bar a little higher.
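
For who wants to check the numbers, a quick back-of-the-envelope program; the 7 GB/s effective PCIe-rate and the 2 FLOP per input byte are the assumptions from the text above.

```c
/* Bandwidth and compute budget of the 8-stream mixing example. */
#include <stdio.h>

int main(void) {
    const double rate = 44000.0;         /* Hz     */
    const double channels = 2.0;         /* stereo */
    const double bytes_per_sample = 2.0; /* 16 bit */
    const double in_streams = 8.0, out_streams = 1.0;

    double in_bw  = in_streams  * channels * bytes_per_sample * rate; /* B/s */
    double out_bw = out_streams * channels * bytes_per_sample * rate;
    double flops  = 2.0 * in_bw;         /* 2 FLOP per input byte (MAD) */

    printf("bandwidth: %.2f Mbit/s = %.2f MB/s (input: %.2f MB/s)\n",
           (in_bw + out_bw) * 8.0 / 1e6, (in_bw + out_bw) / 1e6, in_bw / 1e6);
    printf("real-time compute: %.2f MFLOPS\n", flops / 1e6);
    printf("at the full 7 GB/s bus: %.0f GFLOPS needed\n", 2.0 * 7e9 / 1e9);
    return 0;
}
```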

So we also apply a bandpass-filter per channel (2 channels × 8 streams = 16), which takes 16 pairs of 1D Fourier-transforms. Say we want to filter out half the frequencies with a 1000-point Fourier transform; the forward FFT takes 2·1000·log2(1000) ≈ 19.9 kFLOP and the way back 2·500·log2(500) ≈ 9.0 kFLOP [explanation]. At 44 kHz that means 44 windows of 1000 samples per second, so 1.27 MFLOPS per channel (this is just an idea of how compute-intensive audio-streams can get, not to say that this kind of filtering is needed for any purpose). Add the 2.8 MFLOPS we already needed and we arrive at 23.1 MFLOPS in total. Needed bandwidth stays the same, so around 115 GFLOPS is needed when using all of the bus. A GPU (say one of 1 TFLOPS) could theoretically do the computation 10 times faster than a high-end CPU (100 GFLOPS).
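
The same kind of sanity-check for the FFT-numbers, using the usual 2·N·log2(N) FLOP estimate per transform; the 16 channels and 44 windows per second follow the reasoning above.

```c
/* FFT cost of the bandpass-filter example. Build with -lm. */
#include <stdio.h>
#include <math.h>

static double fft_flop(double n) { return 2.0 * n * log2(n); }

int main(void) {
    double forward = fft_flop(1000.0);            /* ~19.9 kFLOP  */
    double back    = fft_flop(500.0);             /* ~ 9.0 kFLOP  */
    double per_channel = (forward + back) * 44.0; /* 44 windows/s */
    double total = 16.0 * per_channel + 2.8e6;    /* + the mixing */

    printf("forward: %.1f kFLOP, back: %.1f kFLOP\n",
           forward / 1e3, back / 1e3);
    printf("per channel: %.2f MFLOPS, total: %.2f MFLOPS\n",
           per_channel / 1e6, total / 1e6);
    return 0;
}
```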

But what did I not mention? The total time it takes to do these calculations. It takes a second to get the data in and out, which makes the GPU about 10% slower than the CPU in this scenario. It is of utmost importance that the GPU is kept busy at the highest speed possible if it is to be applied to data that still has to be transferred.

Streaming data

It takes some figuring out, but if the ratio between computation and data is high enough, you can stream the data and effectively cancel out the transfer-times (except when starting up).

For example, we take a 1.6 GB card (to get round numbers in the rest of the example) and use the above situation. Of that, 0.7 GB is being calculated on and 0.1 GB is being written to, while the other 0.8 GB is meanwhile being loaded or saved (note: in many cases a more optimal form would have the same structure but smaller blocks). The effective transfer-rate of PCIe 2.0 x16 is around 7.0 GB/s, so we have 6.1 GB/s for stage 2 and 0.9 GB/s for stage 4. If the kernel does at least (1000 GFLOPS / 6.1 GB/s) ≈ 163 operations per input byte, we get maximum use of the GPU's power. This bumps into the limit of video-memory bandwidth, which is somewhere around 110–160 GB/s, so private (or local) memory needs to be used. See http://www.codeproject.com/KB/showcase/Memory-Spaces.aspx for more info, as I try to keep the focus on the global transfers only.
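
In code, that streaming budget looks like this; the 0.7:0.1 in/out split and the 1 TFLOPS GPU are the assumptions of this example.

```c
/* Streaming budget: 7.0 GB/s effective PCIe 2.0 x16, split between
   input and output in the same 0.7:0.1 ratio as the buffers. */
#include <stdio.h>

int main(void) {
    const double pcie_bw   = 7.0e9;      /* B/s, effective   */
    const double in_frac   = 0.7 / 0.8;  /* share of traffic */
    const double out_frac  = 0.1 / 0.8;
    const double gpu_flops = 1000e9;     /* 1 TFLOPS         */

    double in_bw  = pcie_bw * in_frac;   /* ~6.1 GB/s        */
    double out_bw = pcie_bw * out_frac;  /* ~0.9 GB/s        */
    printf("in: %.1f GB/s, out: %.1f GB/s\n", in_bw / 1e9, out_bw / 1e9);
    printf("break-even intensity: %.0f operations per input byte\n",
           gpu_flops / in_bw);
    return 0;
}
```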

Hybrid CPU-GPU and SSE

Standard OpenCL on a CPU has 2 stages:

  1. Assign some memory as shared between the CPU and the OpenCL-device.
  2. Do the calculations (OpenCL kernel).

This means that an OpenCL-kernel can jump in whenever needed – no extra cycles are spent on data-transfer, but the memory-access is slower than on dedicated GPUs. This benchmark shows only 16 GB/s on AMD Llano and 24 GB/s on recent Intels. [As I am busy with work right now, I leave it at this and will add more benchmarks later – it seems that almost 30 GB/s is possible.]
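
A sketch of what stage 1 can look like in OpenCL host code: wrap existing host memory in a buffer with CL_MEM_USE_HOST_PTR, so that on a CPU or APU device mapping it is a pointer hand-over instead of a copy. The function name is my own; ctx and q are created as in the first sketch.

```c
#include <CL/cl.h>

/* Give an OpenCL CPU/APU device zero-copy access to existing host memory. */
float *share_with_device(cl_context ctx, cl_command_queue q,
                         float *host_ptr, size_t n, cl_mem *out_buf)
{
    cl_int err;
    *out_buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                              n * sizeof(float), host_ptr, &err);
    /* On a CPU-device this returns a pointer into the same memory;
       nothing crosses a bus. Unmap before running kernels on it. */
    return (float *)clEnqueueMapBuffer(q, *out_buf, CL_TRUE,
                                       CL_MAP_READ | CL_MAP_WRITE,
                                       0, n * sizeof(float),
                                       0, NULL, NULL, &err);
}
```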

That means that data with a lower FLOP-per-byte ratio (as in the examples above) is currently better processed on a (hybrid) CPU. Because I would also need to include the parallelism-factor, I could not easily create a graph. What is clear is that you first need to grab a calculator if you want to know which platform to target (see the sketch below). As the memory-bandwidth and GPU-power of hybrid processors go up with each generation, the number of cases a dedicated GPU can be used for is decreasing.
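
A first-order version of that calculator; the figures are the round numbers from this article, and the model is deliberately crude: each platform runs at the slower of its data-path and its compute.

```c
/* Crude platform-picker: dedicated GPU behind PCIe 2.0 versus a hybrid
   CPU on shared memory, over a range of arithmetic intensities. */
#include <stdio.h>

int main(void) {
    const double pcie_bw = 7.0e9;   /* dedicated GPU: PCIe 2.0 x16, B/s */
    const double host_bw = 24.0e9;  /* hybrid: shared memory, B/s       */
    const double gpu_f   = 1.0e12;  /* GPU peak, FLOPS                  */
    const double cpu_f   = 1.0e11;  /* CPU peak, FLOPS                  */

    for (double fpb = 2.0; fpb <= 512.0; fpb *= 2.0) {  /* FLOP/byte */
        double gpu = pcie_bw < gpu_f / fpb ? pcie_bw : gpu_f / fpb;
        double cpu = host_bw < cpu_f / fpb ? host_bw : cpu_f / fpb;
        printf("%4.0f FLOP/byte: GPU %6.2f GB/s, CPU %6.2f GB/s -> %s\n",
               fpb, gpu / 1e9, cpu / 1e9,
               gpu > cpu ? "dedicated GPU" : "hybrid CPU");
    }
    return 0;
}
```

With these figures the crossover lies around 14 FLOP per byte: below it the hybrid CPU wins on memory-bandwidth, above it the dedicated GPU wins on raw compute.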

Also, for data-control you'd better use the CPU; for massive parallelism, a GPU. Nice to read: at HPC Wire the difficulties of memory-management were discussed – and how future (hybrid) processors could/should be designed.

PCIe 3.0 as the Redemption?

When GPUs get connected via PCIe 3.0, there will be much more bandwidth (and more compression possibilities). From NVidia I have not heard any rumours of PCIe 3.0 support in their future GPUs; there are rumours that the AMD Radeon 7000-series will have PCIe 3.0. I think NVidia has more reason to use a faster bus, as they have to compete with hybrid processors.

The needed number of FLOP per byte gets halved and the line where dedicated GPUs can compete with hybrid processors shifts. Depending on how fast hybrid processors evolve, this could buy dedicated GPUs some time, but their days in the consumer x86 market seem numbered – ARM-processors are already hybrid by design. It is now the price that counts, as the compute-power added by a dedicated GPU relatively decreases. But… NVidia has surprised us more often lately.

A Question for You

What kind of algorithms are still better done on a GPU? Where do you think the turning-point will be for the bandwidth and compute-power of hybrid processors? Let me know in the comments.