Let the competition on large memory GPUs begin!
Some algorithms and continuous batch processes will benefit greatly from the extra memory. When inverting a large matrix or running huge simulations, for example, you need as much memory as possible. Extra memory can also be used to avoid memory-bank conflicts by duplicating data objects (worthwhile only when the data stays in memory long enough to pay back the time it costs to duplicate it).
Another reason for larger memories is double-precision computation (this card delivers a total of 1.48 TFLOPS), which doubles the memory requirements. With accelerators becoming a better fit for HPC (true support for the IEEE-754 double-precision storage format, ECC memory), memory size becomes one of the limits that needs to be solved.
The alternatives are swapping on the GPU or using multi-core CPUs. Swapping is not an option, as it nullifies all the speed-up. A server with 4 × 16-core CPUs is about as expensive as one accelerator, but uses more energy.
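To get a feel for why memory size matters here, a quick back-of-envelope sketch (my own numbers, not from AMD): how large a dense square matrix fits in a given amount of GPU memory, in single vs double precision?

```python
def max_square_matrix_side(mem_bytes: int, bytes_per_element: int) -> int:
    """Largest N such that a dense N x N matrix fits in mem_bytes."""
    return int((mem_bytes / bytes_per_element) ** 0.5)

GIB = 1024 ** 3

# Single precision = 4 bytes/element, double precision = 8 bytes/element.
for label, mem in [("6 GB", 6 * GIB), ("12 GB", 12 * GIB)]:
    n_sp = max_square_matrix_side(mem, 4)
    n_dp = max_square_matrix_side(mem, 8)
    print(f"{label}: up to {n_sp} x {n_sp} in SP, {n_dp} x {n_dp} in DP")
```

Roughly: going from 6GB to 12GB takes the largest double-precision matrix from about 28k × 28k to about 40k × 40k, and switching from SP to DP costs you that same factor of √2 in matrix side.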
AMD seems to have identified this as an important HPC market and has therefore just announced the new S10000 with 12GB of memory. It will be shipped to AMD partners in January and will be on the market in April. Is AMD finally taking the professional HPC market seriously? They now have the first 12GB GPU accelerator built for servers.
Old vs New
There are still a few question marks, unfortunately.
| Functionality | FirePro S10000 6GB | FirePro S10000 12GB |
|---|---|---|
| Architecture | Graphics Core Next | Graphics Core Next |
| Memory per GPU-processor | 3 GB GDDR5 ECC | 6 GB GDDR5 ECC |
| Memory bandwidth per GPU-processor | 240 GB/s | 240 GB/s |
| Performance (single precision, per GPU-proc.) | 2.95 TFLOPS | 2.95 TFLOPS |
| Performance (double precision, per GPU-proc.) | 0.74 TFLOPS | 0.74 TFLOPS |
| Max power usage for whole dual-GPU card | 325 Watt | 325 Watt (?) |
| Greenness for whole dual-GPU card (SP) | 20.35 GFLOPS/Watt | 18.15 GFLOPS/Watt |
| Bus interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price for whole dual-GPU card | $3500 | ? |
| Price per GFLOPS (SP) | $0.60 | ? |
| Price per GFLOPS (DP) | $2.43 | ? |
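For reference, a minimal sketch of how derived figures like "greenness" and price per GFLOPS are computed from the raw specs (using the price and TDP quoted in this post, not official AMD figures):

```python
def gflops_per_watt(tflops_per_gpu: float, gpus: int, watts: float) -> float:
    """'Greenness': sustained GFLOPS of the whole card per Watt."""
    return tflops_per_gpu * gpus * 1000 / watts

def dollars_per_gflops(price: float, tflops_per_gpu: float, gpus: int) -> float:
    """Card price divided by the card's total GFLOPS."""
    return price / (tflops_per_gpu * gpus * 1000)

# FirePro S10000: dual-GPU, 2.95 TFLOPS SP per GPU, 325 W, $3500 (as quoted above)
print(round(gflops_per_watt(2.95, 2, 325), 2))      # SP greenness, GFLOPS/Watt
print(round(dollars_per_gflops(3500, 2.95, 2), 2))  # SP price, $/GFLOPS
```

Note that at 325 W both cards would compute to the same 18.15 GFLOPS/Watt, which is one of the question marks in the table above.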
The biggest differences are the doubling of memory and the passive cooling.
The biggest competitor is the Quadro K6000, which I haven’t discussed at all. That card delivers 5.2 TFLOPS from a single GPU, which can access all 12GB of memory via a 384-bit bus at 288 GB/s (when all cores are used). It is actively cooled, so it’s not really fit for servers (just like the 6GB version of the S10000). The S10000 has a higher combined bandwidth, but each of its GPUs can access only its own half of the 12GB at full speed. So the K6000 has the advantage here.
Intel is planning 12GB and 16GB Xeon Phis. I’m curious about more benchmarks of the new cards, as the 5110P does not have very good results (benchmark 1, benchmark 2). It compares more to a high-end Xeon CPU than to a GPU. I am more enthusiastic about the OpenCL performance on their CPUs.
What’s next on this path?
A few questions I asked myself and tried to find answers to.
Expandable memory, like we have for CPUs? Probably not, as GDDR5 is not designed to be upgradeable.
Unified memory for multiple GPUs? This would solve the disadvantage of multi-die GPU cards, as 2, 4 or more GPUs could share the same memory. A reason to watch the progress of HSA hUMA, which currently specifies unified memory access between GPU and CPU.
24GB of memory or more? I found the graph below, which gives an idea of the cost of GDDR memory, so it’s an option. These prices of course exclude supplementary parts and the R&D costs of making more memory accessible to the GPU cores.
At least we are now going to get this question answered: is the market that needs this amount of memory large enough to be worth serving?
Is there more need for a wider memory bus? Remember that GDDR6 is promised for 2014.
What do you think of a 12GB GPU? Do you think this is the path that distinguishes professional GPUs from desktop GPUs?