10 years ago we had CPUs from Intel and AMD, and GPUs from ATI and NVidia. There were even other players: CPU-maker VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to briefly discuss the most noticeable processors from each of the big three.
The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently.
NVidia
As NVidia doesn't have x86, they mostly focus on GPUs and bet on POWER and ARM for the CPU side. They already sell their Pascal architecture in small numbers.
2017 will be all about their Pascal architecture.
Tesla K80 (Kepler)
- The GPU is not simply 2× a K40 (GK110B GPU); the chip is actually a different one (GK210)
- It is the NVidia GPU with the largest private memory size usable in kernels: 255 registers per thread.
This is the GPU for lazy programmers and for genuinely complex code: kernels can use double the registers before spilling.
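To see why 255 registers per thread matters, here is a back-of-the-envelope occupancy sketch. It assumes the GK210 doubled the register file per multiprocessor to 128K 32-bit registers versus GK110's 64K; real register allocation granularity differs per architecture, so treat this as an estimate, not an exact occupancy calculation:

```python
# Back-of-the-envelope occupancy estimate for a register-hungry kernel
# that uses 255 registers per thread.
def max_threads_per_sm(regfile_regs, regs_per_thread, warp_size=32):
    """Threads per multiprocessor limited by register pressure,
    rounded down to a whole number of warps."""
    threads = regfile_regs // regs_per_thread
    return (threads // warp_size) * warp_size

gk110 = max_threads_per_sm(64 * 1024, 255)   # K40-class chip (assumed 64K regs/SMX)
gk210 = max_threads_per_sm(128 * 1024, 255)  # K80's chip (assumed 128K regs/SMX)

print(gk110, gk210)  # the doubled register file doubles the resident threads
```

With the doubled register file, twice as many threads stay resident per multiprocessor at the same register count, which is exactly the "lazy programmer" benefit described above.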
Pascal P100 (Pascal)
- 20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
- 16 GB HBM2 (720 GB/s).
- NVLink up to 64 GB/s effective (20% of the 80 GB/s raw is protocol overhead), dual-simplex bidirectional (so dedicated wires per direction). Each NVLink offers 16 GB/s up and 16 GB/s down. Compared to 12 GB/s over PCIe 3.0 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal GPUs, not yet between GPU and CPU.
- OpenPOWER support coming, to compete with Intel.
For now it is only available in a $129,000 server containing 8 of them (making the price of each P100 about $15,000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up to speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.
The GPU is perfect for deep learning, on which NVidia is highly focused. The 5 TFLOPS double precision is also very interesting. A server with 8 of these GPUs gives you 80 TFLOPS single precision – double that if you only need half precision.
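The aggregate numbers are simple multiplication; a quick sketch of such an 8-GPU server's peak throughput, using the per-GPU figures listed above:

```python
# Aggregate peak throughput of a server with 8 P100s.
gpus = 8
hp_tflops, sp_tflops, dp_tflops = 20, 10, 5  # per-GPU peak numbers from above

server_sp = gpus * sp_tflops  # single precision
server_hp = gpus * hp_tflops  # half precision
server_dp = gpus * dp_tflops  # double precision

print(server_sp, server_hp, server_dp)  # 80, 160 and 40 TFLOPS
```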
Titan Black (Kepler) and GTX 980 (Maxwell)
- The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
- The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.
These are the two best-selling GPUs from NVidia that are not server-grade. What is interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it is more recent.
Tegra X1
- 0.5 TFLOPS SP (GPU), 1 TFLOPS HP
- 10 Watts
While not well accepted in the car industry (it uses too much power and has no OpenCL support), it is well accepted in the in-car entertainment industry.
AMD
Known for having the strongest OpenCL developers since 2012. With the HSA-capable Fiji GPUs they have now arrived at their third GPGPU architecture, after VLIW and GCN – fully driven by their HSA initiative.
For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).
FirePro S9170 (GCN)
- 32GB GDDR5 global memory
- 2.5 TFLOPS DP, 5 TFLOPS SP
The GPU's processor is the same as that of the FirePro S9150, which has been the largely unknown best DP performer of the past years. That GPU took the number-one spot among air-cooled solutions, only to be surpassed by oil-submersed systems. The S9170 builds on top of this and adds an extra 16 GB of memory.
The S9170 is the GPU with the largest amount of memory, targeting problems that need a lot of memory and are bandwidth-limited – think oil & gas and weather calculations, which currently don't fit on GPUs.
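To get a feel for what 32 GB buys you, here is a rough sketch of the largest cubic double-precision grid that fits in the S9170's memory. The 10% headroom reserved for other buffers is my assumption, not a hardware figure:

```python
# Largest cubic 3D grid of doubles fitting in the S9170's 32 GB,
# with ~10% headroom reserved for other buffers (assumed).
mem_bytes = 32 * 1024**3
usable = int(mem_bytes * 0.9)

side = 0
while (side + 1) ** 3 * 8 <= usable:  # 8 bytes per double-precision value
    side += 1

print(side)  # edge length of the largest cubic grid that fits
```

A grid of roughly 1500³ doubles is far beyond what fits in the 12–16 GB of the competing cards, which is the whole point of this part.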
Radeon Nano and FirePro S9300X2 (Fiji)
- Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
- S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
- Nano 175 Watt, S9300X2 300 Watt
- Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.
The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.
These GPUs bring the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt available right now. The Nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.
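The efficiency claim is easy to check from the numbers above; note that the 111 TFLOPS figure implies eight S9300X2 boards in one server, which is my assumption:

```python
# Perf-per-Watt of the Nano and aggregate throughput of a server full of
# S9300X2 boards, using the numbers listed above.
nano_sp_gflops, nano_watts = 8000, 175
s9300x2_sp_tflops, boards_per_server = 13.9, 8  # 8 boards is an assumption

gflops_per_watt = round(nano_sp_gflops / nano_watts, 1)
server_tflops = round(s9300x2_sp_tflops * boards_per_server, 1)

print(gflops_per_watt, server_tflops)  # ~45.7 GFLOPS/Watt, ~111.2 TFLOPS
```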
AMD Carrizo A10 8890k APU (HSA)
- CPU with built-in GPU
- About one TFLOPS
- TDP of 95 Watt
The fastest HSA-capable processor out there. This means that complex software needing a mix of task-parallelism and data-parallelism runs best on such a processor. Of all CPUs with a built-in GPU, this one has the most TFLOPS on the market.
Intel
After years of "Peter and the Wolf" stories, they finally seem to have the Larrabee they promised years ago. With the acquisition of Altera, new processors are on the horizon.
Their focus is still on customers who use test-driven design and want to "make it run quickly, make it perform later".
Xeon E5-2699 v4
- 55MB cache, 22 cores
- AVX 2.0 (256 bit vector operations)
- DDR4 (60 GB/s)
Not well known, but this CPU is very capable of running complex HPC code for the price of a high-end GPU. It can reach about 0.64 TFLOPS DP peak when fully using all cores and AVX 2.0.
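That 0.64 TFLOPS figure can be reproduced with the standard peak-FLOPS formula. The 1.8 GHz AVX base clock used here is my assumption – Intel clocks down under heavy AVX load, so the exact value varies per workload:

```python
# Standard peak-FLOPS formula for a CPU: cores x clock x SIMD lanes x FMA.
def peak_dp_gflops(cores, ghz, simd_doubles, fma_units):
    # An FMA counts as 2 FLOPs per vector element.
    return cores * ghz * simd_doubles * 2 * fma_units

# E5-2699 v4: 22 cores, assumed ~1.8 GHz AVX base clock,
# 256-bit AVX 2.0 = 4 doubles per vector, 2 FMA units per core.
e5_2699v4 = peak_dp_gflops(cores=22, ghz=1.8, simd_doubles=4, fma_units=2)

print(round(e5_2699v4))  # ~634 GFLOPS, i.e. ~0.64 TFLOPS DP
```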
XeonPhi Knights Landing
- Available in socket and PCI version
- 3 TFLOPS DP, 6 TFLOPS SP
- AVX 512 (512 bit vector operations)
- 16 GB HBM (over 400 GB/s), up to 384 GB DDR4 (60 GB/s).
- Currently (?) not programmable with OpenCL
After years of so-so XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor and can then be improved step by step. The only thing not to like is the lack of benchmarks – the numbers above are all on paper.
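As a sanity check of the on-paper 3 TFLOPS DP: assuming a 72-core part at a 1.3 GHz AVX clock with two 512-bit FMA units per core (both assumptions on my side, as core counts and clocks differ per SKU), the peak works out as follows:

```python
# Peak DP throughput of a hypothetical 72-core Knights Landing at 1.3 GHz.
cores, ghz = 72, 1.3
flops_per_cycle = 8 * 2 * 2  # 8 doubles per 512-bit vector x FMA x 2 units
dp_gflops = cores * ghz * flops_per_cycle

print(round(dp_gflops / 1000, 1))  # ~3.0 TFLOPS DP
```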
Xeon+FPGA
- Task-parallel processor
- Low-latency
The reconfigurable chip that has been promised for over 2 decades.
I'm still researching this upcoming processor, as one of the strengths of an FPGA is its low-latency links to DisplayPort and networking, which seem to go via PCIe on this processor.
Iris GPUs
- CPU with built-in GPU
- 0.7 TFLOPS SP
As these GPUs are included in almost all CPUs that Intel sells, they are the most-sold GPUs.
Selecting the right hardware
Choosing the best hardware has become quite complex, especially when focusing on TCO (Total Cost of Ownership). At StreamHPC we have experience with many of the devices above, but also with various embedded processors that compete with them on a totally different scale. You need to select the right benchmarks to find out which device is right for you – we can help with that.
Many thanks to Carlos Bederián and Nicola Cadenelli for fact-checking this post while it was in draft.