AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Recently AMD announced their new FirePro GPUs for servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server racks are actively cooled already. AMD's server partners will have products ready in Q1 2013 or even before; SuperMicro, Dell and HP will probably be among the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs, giving 6.46 up to 12.92 TFLOPS or more of theoretical performance on top of the available CPUs. At StreamHPC we are happy with that, as AMD is a strong OpenCL supporter and FirePro GPUs give much more performance than TESLAs. The S9000 also outperforms the unreleased Intel Xeon Phi in single precision and comes close in double precision.
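Those totals follow directly from the S9000's 3.23 TFLOPS theoretical single-precision peak per GPU; a quick sketch of the arithmetic:

```python
# Theoretical single-precision peak per FirePro S9000 (TFLOPS)
S9000_SP_TFLOPS = 3.23

# A 1U server holds roughly 2 GPUs, a 3U server 4 or more
for gpus in (2, 4):
    print(f"{gpus} x S9000: {gpus * S9000_SP_TFLOPS:.2f} TFLOPS")
# prints: 2 x S9000: 6.46 TFLOPS
#         4 x S9000: 12.92 TFLOPS
```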

Edit: About the multi-GPU configuration

A multi-GPU card has advantages, as it uses less power and space, but it does not equal one single GPU with the combined specifications. Since communication still goes via the PCIe bus, the compute capabilities of two single-GPU cards and one dual-GPU card do not differ much. Compute problems are most of the time memory-bound, and that is an important reason GPUs outperform CPUs: they have a very high memory bandwidth. Therefore I put a lot of weight on the memory and cache available per GPU and per core.
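To illustrate why memory bandwidth matters so much, a rough roofline-style calculation (using the S9000's theoretical peaks from the table in the next section) gives the arithmetic intensity a kernel needs before it stops being memory-bound:

```python
# Roofline balance point: FLOPs per byte needed to become compute-bound.
# Figures are the S9000's theoretical peaks, taken from the comparison table.
peak_gflops = 3230.0    # single precision, GFLOPS
bandwidth_gbs = 264.0   # memory bandwidth, GB/s

balance = peak_gflops / bandwidth_gbs
print(f"Compute-bound above ~{balance:.1f} FLOPs per byte transferred")
```

Any kernel doing fewer than roughly 12 floating-point operations per byte transferred is limited by memory bandwidth, not by compute, which is the common case.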

Performance comparison

At IBC’12 I was at the AMD booth and saw the impressive performance of OpenCL-enabled software by AMD partners using the latest GPUs in the 9000 series. Let’s put into perspective why this is, by comparing the TESLA K10 against the FirePro S9000. As the K10 is a dual-GPU card, the table lists figures per GPU-processor.

Functionality | TESLA K10 | FirePro S9000
GPU-processor count | 2 | 1
Architecture | Kepler GK104 | Graphics Core Next
Memory per GPU-processor | 4 GB GDDR5 ECC | 6 GB GDDR5 ECC
Memory bandwidth per GPU-processor | 160 GB/s | 264 GB/s
Performance (single precision, per GPU-proc.) | 2.288 TFLOPS | 3.230 TFLOPS
Performance (double precision, per GPU-proc.) | 0.095 TFLOPS | 0.806 TFLOPS
Max power usage per GPU-processor | 150 (?) Watt (225 total) | 225 Watt
Greenness | 15.25 GFLOPS/Watt | 14.35 GFLOPS/Watt
Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16
Price (per GPU-processor) | $1638 | $2500
Price per GFLOPS (SP) | $0.72 | $0.77
Price per GFLOPS (DP) | $17.24 | $3.10
Cooling | Passive | Passive

Sources for this table are below.
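The derived rows of the table (greenness and price per GFLOPS) can be recomputed from the raw per-GPU figures; a small sketch, matching the table up to rounding:

```python
# Raw per-GPU-processor figures from the comparison table
specs = {
    "TESLA K10":     {"sp_gflops": 2288, "dp_gflops": 95,  "watt": 150, "price": 1638},
    "FirePro S9000": {"sp_gflops": 3230, "dp_gflops": 806, "watt": 225, "price": 2500},
}

for name, s in specs.items():
    greenness = s["sp_gflops"] / s["watt"]    # GFLOPS per Watt (SP)
    usd_per_sp = s["price"] / s["sp_gflops"]  # $ per SP GFLOPS
    usd_per_dp = s["price"] / s["dp_gflops"]  # $ per DP GFLOPS
    print(f"{name}: {greenness:.2f} GFLOPS/Watt, "
          f"${usd_per_sp:.2f}/GFLOPS (SP), ${usd_per_dp:.2f}/GFLOPS (DP)")
```

Note the K10's greenness and SP price figures assume the estimated 150 Watt per GPU-processor from the table.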


Tesla K10 is better in:

  • GFLOPS/Watt
  • price

The lack of a dual-GPU option could also be seen as a disadvantage, as for instance fewer GPUs fit in a 1U server. In 3U this is less of a problem.

For the rest the S9000 is the better choice:

  • more memory per GPU-processor,
  • more compute power,
  • higher memory bandwidth,
  • more double-precision FLOPS per dollar.

FirePro GPUs also provide DirectGMA: direct access from other PCIe cards, bypassing the CPU, so data from for example FireWire cards can be transported directly to the FirePro cards via a DMA channel. TESLA cards don’t have this. Notice this is different from pinned memory, which both GPUs support.

Know that, due to architecture differences, the actual performance gap between the two professional GPUs may be smaller or larger in various cases.

Based on the above information we chose S9000s for our next-generation Hadoop servers.

As usual, put in the comments your voice on points I missed or on anything you agree or disagree with. The comments are unmoderated, but be polite.



8 thoughts on “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000”

  1. gpuscience

    I guess there is a typo in table with K10 specs: Performance (double precision, per GPU) 0.095 TFLOPS. BTW, double precision performance with Tesla is cheaper than FirePro.

      • gpuscience

        Oh yes, yes… this is K10, SP only:-) Technically you should compare S9000 vs K20 as a top HPC products.

      • StreamHPC

        Will absolutely write about the K20 when the K20 is actually for sale and the *final* specifications are on their homepage (Q4 2012 is promised). The rumoured specifications of the K20 look very good!

  2. mirrormirror

the article is misleading with all those PER GPU. All the K10 numbers are actually double of what you mention, because you are taking only half the numbers. the BW is 320GB/sec, the performance 4.5 Teraflop/sec etc. You are comparing cards, so what matters is the card’s performance.

    • StreamHPC

      You cannot allocate 8GB of memory to one processor on the K10 – you need to upload data in to each GPU separately. Therefore it is *very* misleading to describe it as if it was one GPU. The advantage of a dual-GPU card is the reduced power-usage.

      But just wait for the K20 – that is one heck of an answer to the S9000.

  3. marks

    There is one mistake – TESLA K10 do support GPUdirectRDMA which does a PCIe P2P communication b/w two cards. This has been around from FERMI TESLA C2075

    • StreamHPC

      Thanks for the great feedback! GPUdirect RDMA is new in CUDA 5, and I wrote this article when CUDA 4 was still the latest. I’ll soon update the article to have a small comparison between the two techniques – if you have suggestions for articles I should read, let me know.

Comments are closed.