Building a 150 TFLOPS cluster with Accelerators in 2014

You can’t ignore accelerators when designing a new cluster for HPC anymore. Back in 2010 I suggested using GPUs to enter the Top 500 with a budget of only €38k. It now takes roughly ten times that, as almost everybody has started to use accelerators. Getting into the November Top 500 would take a cluster of roughly 150 TFLOPS.

I’d like to give you a list of what you can expect for 2014, and to help you design your HPC cluster with recent hardware. The focus is on OpenCL-capable hardware, as open standards prepare you better for future upgrades. This is therefore also a guess at what we will see in the November Top 500, based on currently available information.

There are currently professional solutions from NVIDIA, AMD, Intel and Altera. I’ve searched the web and asked around for what the upcoming offerings will be. You will find the results below. But information should continue to flow; please add your remarks in the comments, so we get the best information through collaboration.

A note on the comparison: only the double-precision GFLOPS of the accelerators are mentioned. Theoretical GFLOPS cannot be reached in real-world benchmarks, so DGEMM is used as an indication of the maximum realistic GFLOPS. The efficiencies of other benchmarks (like Linpack) are all lower.
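All per-vendor overviews below follow the same three-step calculation. Here is a minimal sketch of it in Python, with placeholder numbers only; the real peak, efficiency and TDP values are filled in per accelerator further down:

```python
# Back-of-envelope sizing for a 150 TFLOPS (DGEMM) cluster.
# The card numbers below are placeholders, not vendor data.
import math

TARGET_TFLOPS = 150.0

def dgemm_tflops(peak_tflops, efficiency):
    """Realistic DGEMM throughput: theoretical peak times efficiency (0..1)."""
    return peak_tflops * efficiency

def cards_needed(peak_tflops, efficiency, target=TARGET_TFLOPS):
    """How many accelerators are needed to reach the target DGEMM throughput."""
    return math.ceil(target / dgemm_tflops(peak_tflops, efficiency))

def gflops_per_watt(peak_tflops, efficiency, tdp_watt):
    """Energy efficiency in GFLOPS/Watt, based on DGEMM throughput and card TDP."""
    return dgemm_tflops(peak_tflops, efficiency) * 1000.0 / tdp_watt

# Example with placeholder values: a 1.5 TFLOPS card at 85% DGEMM efficiency and 250 Watt
print(cards_needed(1.5, 0.85))                    # -> 118 cards
print(round(gflops_per_watt(1.5, 0.85, 250), 1))  # -> 5.1 GFLOPS/Watt
```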

NVIDIA Tesla

NVIDIA Tesla is the current market leader with the Tesla K20 and K20X. At the end of 2013 they announced the K40 (GK110b architecture), which is 10% to 20% faster than the K20X (see table): roughly 10% from higher peak GFLOPS and another 10% from architecture improvements. It’s not a huge difference; the new Maxwell architecture is more promising, but the problem is that high-end Maxwell is not expected this year. There are several rumours about what is going on; the official explanation is that there are problems with the 20 nm process. I’ve had this confirmed by different sources and will, of course, keep you up to date on Twitter.

I could not find good enough information on the K40X. It has also been very quiet around the current architectures at their yearly GTC conference. My expectation is that they want to kick in hard with Maxwell in 2015, and that for 2014 they’ll focus on keeping their current customers happy in other ways. For now, let’s assume the K40X is 10% faster.

[Image: K20 vs K40 comparison]

So, for this year it will be the K40. Here’s an overview:

  • Peak 1.43 DP TFLOPS theoretical (see the sketch below for where this figure comes from)
  • Peak 1.33 DP TFLOPS DGEMM (93% efficiency)
  • 5.65 GFLOPS/Watt DGEMM
  • Needs 113 GPUs to get 150 TFLOPS DGEMM
  • Lowest street price is $4,800, or $542,400 for 113 GPUs.
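As promised in the first bullet, here is where the 1.43 TFLOPS comes from. It follows from the published GK110b specifications, assuming the 745 MHz base clock (boost clocks are higher) and the usual 1/3 double-precision rate:

```python
# Theoretical K40 peak, derived from the GK110b specifications
cuda_cores = 2880       # 15 SMX units x 192 cores each
base_clock_ghz = 0.745  # 745 MHz base clock
flops_per_core = 2      # one fused multiply-add counts as two FLOPs
dp_rate = 1.0 / 3.0     # double precision runs at 1/3 of the single-precision rate

sp_peak_tflops = cuda_cores * base_clock_ghz * flops_per_core / 1000.0  # ~4.29
dp_peak_tflops = sp_peak_tflops * dp_rate                               # ~1.43
print(round(sp_peak_tflops, 2), round(dp_peak_tflops, 2))
```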

AMD FirePro

Just like the Tesla K40 and the Intel Xeon Phi, AMD offers accelerators with a lot of memory. The S10000 and S9000 are their current server offerings, but they are still based on the older architecture. The latest architecture is only available for gamers (e.g. the R9 290X) and workstations (e.g. the W9100). Now, with the recent announcement of the W9100, we have an indication of what a server version of this accelerator would cost and look like. I expect that card to launch soon; I had even expected it to be launched before the W9100.

What is interesting about the W9100 is its high memory bandwidth and large memory. Assuming AMD needs to fit the S9150 into 225 Watt and won’t change the design much in order to launch soon, they would need to under-clock it by roughly 22%. They could probably allow 235 Watt (like the K40), but I want to stay realistic.

|                | FirePro W9100 | FirePro W9000 | FirePro S9150 |
|----------------|---------------|---------------|---------------|
| Shader count   | 2816          | 2048          | 2816          |
| Memory size    | 16 GByte      | 6 GByte       | 16 GByte      |
| Memory type    | GDDR5         | GDDR5         | GDDR5         |
| Interface      | 512-bit       | 384-bit       | 512-bit       |
| Transfer rate  | 320 GByte/s   | 264 GByte/s   | 320 GByte/s   |
| TDP            | 275 Watt      | 274 Watt      | 225 Watt (-22%) |
| Connectors     | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | 6 × MiniDP, 3D-Stereo, Frame-/Genlock | ? |
| Multi-monitor  | yes (6)       | yes (6)       | Don’t care    |
| SP/DP (TFLOPS) | 5.24 / 2.62   | 3.99 / 1.0    | 4.1 / 2.0 (-22%) |
| ECC            | yes           | yes           | yes           |
| OpenCL 2.0     | yes           | no            | yes           |
| Price          | $3999         | $2999         | ?             |

So, what about the successor to the FirePro S9000 with the latest GCN architecture, the S9150? An overview:

  • Peak 2.0 DP TFLOPS theoretical
  • Peak 1.6 DP TFLOPS DGEMM (at 80% efficiency, to be safe)
  • 7.1 GFLOPS/Watt DGEMM
  • Needs 94 GPUs to get 150 TFLOPS DGEMM
  • No prices available yet – AMD mostly prices lower than NVIDIA. At $3999 apiece that would be $375,906 for 94 GPUs.

Update: a DGEMM efficiency of 90% is reached. That gives 1.8 DP TFLOPS DGEMM and 8.0 GFLOPS/Watt DGEMM, so you only need 84 GPUs to get to 150 TFLOPS.
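Running both scenarios through the same arithmetic as in the comparison section, at the assumed 225 Watt TDP, reproduces the figures above:

```python
# S9150 sizing for both DGEMM efficiency scenarios (2.0 DP TFLOPS peak, 225 Watt assumed)
import math

target_tflops = 150.0

for efficiency in (0.80, 0.90):
    dgemm = 2.0 * efficiency                 # per-card DP TFLOPS
    print(math.ceil(target_tflops / dgemm),  # 94 cards at 80%, 84 at 90%
          round(dgemm * 1000 / 225, 1))      # 7.1 and 8.0 GFLOPS/Watt
```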

Intel Xeon Phi

Intel currently offers the 3110, 5110 and 7110 Xeon Phis. In the past months they added the 3120, 5120 and 7120. The 7120 uses 300 Watt, which requires a special case to cool this passively cooled card; I don’t quite understand that choice. I could compare it more fairly to the W9100 and a heavily overclocked K40, or use lowered numbers as I did above with the FirePro, but as you can see below it doesn’t compare well even at 300 Watt.

The OpenCL drivers have been improved this year, which is more promising news. The open question is whether they will launch a new 7130, a 7200, or nothing at all. All the news and rumours point to 2015 and 2016 for more integrated memory and a socket version(!) of the Xeon Phi.

For this year the Xeon Phi 7120 would be their top offering. It compares well with AMD’s W9100 when it comes to memory: 16 GB GDDR5 and 352 GB/s.

  • Peak 1.21 DP TFLOPS theoretical
  • Peak 1.07 DP TFLOPS DGEMM (~88% efficiency)
  • 3.56 GFLOPS/Watt DGEMM
  • Needs 140 Phis to get 150 TFLOPS DGEMM
  • Costs $4129 officially, $578,060 for 140.

Altera FPGAs

With OpenCL it has finally become possible to run SIMD-focused software on FPGAs. OpenCL 2.0 also brings some improvements for FPGAs, making them interesting for mature software that needs low latency or lower power usage. In other words: software that has been designed on GPUs, where measurements show that lower latency would out-compete the GPU-based competition, or where the electricity bill makes the CFO sad. Understand that FPGAs do compete with the three above, but they have their own performance sweet spots, which makes them hard to compare.

I don’t expect a big FPGA entry in this year’s Top 500, but I’m watching FPGA progress closely. Xilinx is also entering this market, but I get little response (if any) to the emails I send them. For next year’s article I hope to include FPGAs as a true competitor. If you need low power or low latency, you’d better take the time this year to research what FPGAs could do for your business.

Conclusion

Open standards

For those who don’t know: I tend to prefer open standards. The main reason is that switching hardware is easier; it gives you room to experiment. AMD, Intel and Altera support OpenCL 1.2 and will start later this year with 2.0, whereas NVIDIA lags more than two years behind and only supports OpenCL 1.1. The results are now very visible: due to the problems with Maxwell, you’ll need to postpone your plans to 2015 if you code in CUDA. There is one way to pressure them, though: port your code to OpenCL, buy Intel or AMD hardware, and then let NVIDIA know you want this flexibility.
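If you want to see where your own hardware stands, a quick check is to ask every device which OpenCL version its driver reports. A minimal sketch using the pyopencl bindings (any OpenCL host API exposes the same information):

```python
# List every OpenCL platform and device with the OpenCL version the driver reports
import pyopencl as cl

for platform in cl.get_platforms():
    print(platform.name, "|", platform.version)
    for device in platform.get_devices():
        # e.g. "OpenCL 1.1 CUDA" on current NVIDIA drivers, "OpenCL 1.2 ..." on AMD and Intel
        print("  ", device.name, "|", device.version)
```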

Green 500

You might have noticed the big differences in GFLOPS/Watt. Where this matters is the Green 500, the list of energy-efficient supercomputers. The ambition for today’s supercomputers is to appear in the top 10 of both lists. If you build an efficient cluster (say 2 CPUs + 4 GPUs per node), you can get to 70 to 80% of the maximum DGEMM performance. Below is the list at 75% (a short sketch after the list reproduces these figures):

  • AMD FirePro – 7.10 GFLOPS/Watt DGEMM -> 5.33 GFLOPS/Watt @ 75%
  • NVIDIA Tesla – 5.65 GFLOPS/Watt DGEMM -> 4.24 GFLOPS/Watt @ 75%
  • Intel Xeon Phi – 3.56 GFLOPS/Watt DGEMM -> 2.67 GFLOPS/Watt @ 75%
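These derated figures are plain multiplication, but a tiny sketch makes the assumption explicit (the 75% node factor is an assumption, not a measured value):

```python
# Card-level GFLOPS/Watt derated to cluster level, assuming the rest of the node
# (CPUs, memory, interconnect) reduces the achievable efficiency to 75%
card_level = {"AMD FirePro": 7.10, "NVIDIA Tesla": 5.65, "Intel Xeon Phi": 3.56}

for name, gpw in card_level.items():
    print(f"{name}: {gpw * 0.75:.2f} GFLOPS/Watt at cluster level")
```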

Currently this list is led by a cluster with K20X GPUs, putting out 4.50 GFLOPS/Watt, which is even 86% of the card-level maximum DGEMM efficiency.

In other words: if the FirePro S9150 gets out in time, the Green 500 could be full of FirePro GPUs.

Update November 2014: here is the Green 500 top 5.

[Image: Green500 top 5, with the AMD FirePro S9150 at spot #1]

The winner

Since there are only three contenders, they are all winners; what matters is the order.

  1. AMD FirePro – 16 GB of fast memory and the clear winner in DGEMM performance. The negative side: CUDA software needs to be ported to OpenCL (we can do that for you).
  2. NVIDIA Tesla – Second to the FirePro in everything (bandwidth, memory size, GFLOPS, price). The negative side: its OpenCL support is outdated.
  3. Intel Xeon Phi – On par with the FirePro when it comes to memory. Nevertheless, the FirePro delivers 50 to 70% more DGEMM performance and twice the GFLOPS/Watt. The negative side: 300 Watt for a server card.

I am happy to see AMD as a clear winner after years of NVIDIA leading the pack. As AMD is the most prominent supporter of OpenCL, this could seriously democratise HPC in times to come.


Need to port CUDA to extremely fast OpenCL? Hire us!

If you order a cluster from AMD instead of NVIDIA, you effectively get our services for free.
