Recently AMD announced their new FirePro GPUs for servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server racks are actively cooled already. AMD's server partners will have products ready in Q1 2013 or even before; SuperMicro, Dell and HP will probably be among the first.
What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs, giving 6.46 up to 12.92 TFLOPS or more of theoretical performance on top of the available CPUs. At StreamHPC we are happy with that, as AMD is a strong OpenCL supporter and FirePro GPUs deliver much more performance than TESLAs. The S9000 also outperforms the unreleased Intel Xeon Phi in single precision and comes close in double precision.
Edit: About the multi-GPU configuration
A multi-GPU card has various advantages, as it uses less power and space, but it should not be treated as a single GPU. Since communication still goes via the PCIe bus, the compute capabilities of two single-GPU cards and one dual-GPU card are not that different. Compute problems are most of the time memory-bound, and that is an important reason GPUs outperform CPUs: they have a very high memory bandwidth. Therefore I put a lot of weight on the memory and cache available per GPU and per core.
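To make "memory-bound" concrete: a kernel that only copies data already saturates the memory bus, so its effective bandwidth shows how close you get to the card's theoretical peak (264 GB/s for the S9000). Below is a minimal OpenCL sketch of my own for such a measurement; the device selection and buffer size are illustrative choices, and error handling is omitted for brevity.

```c
#include <stdio.h>
#include <CL/cl.h>

/* Kernel that only moves data: one read and one write per element. */
static const char *src =
    "__kernel void copy(__global const float *in, __global float *out) {\n"
    "    size_t i = get_global_id(0);\n"
    "    out[i] = in[i];\n"
    "}\n";

int main(void) {
    const size_t n = 64 * 1024 * 1024;  /* 64M floats = 256 MB per buffer */

    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    /* Profiling gives us device-side start/end timestamps in nanoseconds. */
    cl_command_queue q =
        clCreateCommandQueue(ctx, device, CL_QUEUE_PROFILING_ENABLE, NULL);

    /* Contents are irrelevant for a bandwidth test, so buffers stay uninitialised. */
    cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "copy", NULL);
    clSetKernelArg(k, 0, sizeof(cl_mem), &in);
    clSetKernelArg(k, 1, sizeof(cl_mem), &out);

    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof t0, &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,   sizeof t1, &t1, NULL);

    /* bytes moved / nanoseconds elapsed = GB/s */
    double gbps = (2.0 * n * sizeof(float)) / (double)(t1 - t0);
    printf("Effective bandwidth: %.1f GB/s\n", gbps);
    return 0;
}
```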
Performance comparison
At IBC’12 I was at the AMD booth and saw the impressive performance of OpenCL-enabled software by AMD partners using the latest GPUs of the 9000-series. Let's put this into perspective by comparing the TESLA K10 against the FirePro S9000. As the K10 is a dual-GPU card, the table lists numbers per GPU.
Functionality | TESLA K10 | FirePro S9000 |
---|---|---|
GPU-processor count | 2 | 1 |
Architecture | Kepler GK104 | Graphics Core Next |
Memory per GPU-processor | 4 GB GDDR5 ECC | 6 GB GDDR5 ECC |
Memory bandwidth per GPU-processor | 160 GB/s | 264 GB/s |
Performance (single precision, per GPU-proc.) | 2.288 TFLOPS | 3.230 TFLOPS |
Performance (double precision, per GPU-proc.) | 0.095 TFLOPS | 0.806 TFLOPS |
Max power usage per GPU-processor | 150 (?) W (225 W total) | 225 W |
Greenness (SP GFLOPS/W) | 15.25 | 14.35 |
Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
Price per GPU-processor | $1638 | $2500 |
Price per GFLOPS (SP) | $0.72 | $0.77 |
Price per GFLOPS (DP) | $17.24 | $3.10 |
Cooling | Passive | Passive |
Sources for this table are below.
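As a sanity check, the theoretical numbers in the table can be reproduced from the vendor-listed shader counts and clocks (1536 cores at 745 MHz per GK104 GPU, 1792 stream processors at 900 MHz for the S9000), counting a fused multiply-add as two operations:

```latex
\text{peak}_{\text{SP}} = 2 \times N_{\text{cores}} \times f_{\text{clock}}

\text{K10 (per GPU)}:\; 2 \times 1536 \times 0.745\,\text{GHz} \approx 2.29\ \text{TFLOPS};\quad
\text{DP at } \tfrac{1}{24}\text{ rate} \approx 0.095\ \text{TFLOPS}

\text{S9000}:\; 2 \times 1792 \times 0.900\,\text{GHz} \approx 3.23\ \text{TFLOPS};\quad
\text{DP at } \tfrac{1}{4}\text{ rate} \approx 0.806\ \text{TFLOPS}

\text{Price/GFLOPS:}\;
\$1638 / 2288 \approx \$0.72\ \text{(SP)},\;
\$1638 / 95 \approx \$17.24\ \text{(DP)};\quad
\$2500 / 3230 \approx \$0.77\ \text{(SP)},\;
\$2500 / 806 \approx \$3.10\ \text{(DP)}
```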
Conclusion
Tesla K10 is better in:
- GFLOPS/Watt
- price (per GPU and per single-precision GFLOPS)

The lack of a dual-GPU version of the S9000 could also be seen as a disadvantage, as for instance fewer GPUs fit in a 1U server. In a 3U server this is less of a problem.
For the rest the S9000 is the better choice:
- more memory per GPU,
- more compute power,
- higher memory bandwidth,
- more double-precision FLOPS per dollar.
FirePro GPUs also provide DirectGMA: direct access from other PCIe cards, avoiding the CPU, so data from for example FireWire cards can be transported directly to the FirePro card via a DMA channel. TESLA cards don't have this. Note that this is different from pinned memory, which both GPUs support.
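For completeness, here is what pinned memory looks like in plain OpenCL; this path works on both vendors' cards, whereas DirectGMA itself is exposed through AMD-specific extensions and is not shown. The function below is a minimal sketch of my own (error handling omitted), not code from either SDK.

```c
#include <string.h>
#include <CL/cl.h>

/* Upload `n` floats to `device_buf` through a pinned staging buffer.
 * `ctx` and `q` are an existing context and command queue. */
void pinned_upload(cl_context ctx, cl_command_queue q,
                   cl_mem device_buf, const float *data, size_t n)
{
    /* CL_MEM_ALLOC_HOST_PTR asks the driver for host memory it can
     * page-lock, enabling fast DMA between host and GPU. */
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR,
                                   n * sizeof(float), NULL, NULL);

    /* Map the pinned buffer into our address space and fill it. */
    float *host = (float *)clEnqueueMapBuffer(q, pinned, CL_TRUE, CL_MAP_WRITE,
                                              0, n * sizeof(float),
                                              0, NULL, NULL, NULL);
    memcpy(host, data, n * sizeof(float));
    clEnqueueUnmapMemObject(q, pinned, host, 0, NULL, NULL);

    /* DMA from the pinned staging buffer into device memory. */
    clEnqueueCopyBuffer(q, pinned, device_buf, 0, 0, n * sizeof(float),
                        0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(pinned);
}
```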
Note that, due to architecture differences, the actual performance gap between the two professional GPUs may be smaller or larger in specific cases.
Based on the above information we chose S9000s for our next-generation Hadoop servers.
As usual, voice in the comments any points I missed or anything you agree or disagree with. The comments are unmoderated, but be polite.
Sources
NVIDIA TESLA K10:
http://www.nvidia.com/object/tesla-servers.html
http://www.nvidia.com/content/PDF/kepler/Tesla_K10_BD-06280-001_v05.pdf
http://www.amazon.com/s/ref=nb_sb_noss_1?url=search-alias%3Delectronics&field-keywords=nvidia+tesla+k10
AMD FirePro S9000:
http://www.amd.com/us/products/workstation/graphics/firepro-remote-graphics/S9000/Pages/S9000.aspx
http://www.amd.com/us/Documents/FirePro_S9000_Data_Sheet.pdf
http://www.amd.com/us/Documents/SDI-tech-brief.pdf
Vincent:
I guess there is a typo in the table with the K10 specs: Performance (double precision, per GPU) 0.095 TFLOPS. BTW, double-precision performance with Tesla is cheaper than with FirePro.
No typo, see http://www.nvidia.com/object/tesla-servers.html – it is really 95 GFLOPS per GPU, or 0.095 TFLOPS.
Oh yes, yes… this is the K10, SP only :-) Technically you should compare the S9000 vs the K20 as top HPC products.
Will absolutely write about the K20 when it is actually for sale and the *final* specifications are on their homepage (Q4 2012 is promised). The rumoured specifications of the K20 look very good!
The article is misleading with all those PER GPU numbers. All the K10 numbers are actually double what you mention, because you are taking only half the numbers: the bandwidth is 320 GB/s, the performance 4.5 TFLOPS, etc. You are comparing cards, so what matters is the card's performance.
You cannot allocate 8 GB of memory to one processor on the K10 – you need to upload data into each GPU separately. Therefore it is *very* misleading to describe it as if it were one GPU. The advantage of a dual-GPU card is the reduced power usage.
But just wait for the K20 – that is one heck of an answer to the S9000.
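To illustrate the point for readers: in OpenCL the K10 shows up as two devices, each with its own 4 GB, so data has to be transferred per GPU. A minimal sketch of my own (error handling omitted):

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id platform;
    cl_device_id devices[2];
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &platform, NULL);
    /* A dual-GPU card like the K10 shows up as two separate devices. */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 2, devices, &ndev);

    /* One context may span both GPUs, but each GPU still has only its
     * own memory; there is no combined 8 GB allocation. */
    cl_context ctx = clCreateContext(NULL, ndev, devices, NULL, NULL, NULL);

    for (cl_uint i = 0; i < ndev; ++i) {
        cl_ulong mem;
        clGetDeviceInfo(devices[i], CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof mem, &mem, NULL);
        printf("GPU %u: %llu MB global memory\n",
               i, (unsigned long long)(mem >> 20));

        /* Each GPU gets its own queue; in practice, writing through this
         * queue places the data in this GPU's memory, so shared data is
         * uploaded once per device. */
        cl_command_queue q = clCreateCommandQueue(ctx, devices[i], 0, NULL);
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                    (size_t)256 << 20, NULL, NULL);
        /* ... clEnqueueWriteBuffer(q, buf, ...) and kernels per device ... */
        clReleaseMemObject(buf);
        clReleaseCommandQueue(q);
    }
    clReleaseContext(ctx);
    return 0;
}
```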
There is one mistake – the TESLA K10 does support GPUDirect RDMA, which does PCIe P2P communication between two cards. This has been around since the Fermi TESLA C2075.
Thanks for the great feedback! GPUDirect RDMA is new in CUDA 5, and I wrote this article when CUDA 4 was still the latest. I'll soon update the article with a small comparison between the two techniques – if you have suggestions for articles I should read, let me know.