Intel’s answer to AMD and NVIDIA: the XEON Phi 5110P

NOTE: there are many contradicting sources out there, so there may be mistakes in this article. Please give me feedback via Twitter, mail or comments, so the information can be completed.

Yes, another post in the answer-to series. At SC12 Intel tries to steal the show from the Tesla K20 and FirePro S10000.

After two years of waiting, Intel finally comes with an accelerator card: the Xeon Phi. Compare it to NVIDIA skipping the GTX 200 series and only now presenting the GTX 500 series. Or maybe even the GTX 600 series – we cannot tell yet.

The Phi is not a compute-card as we know it. Just as you cannot do a 1-to-1 comparison between the AMD GCN architecture and NVIDIA Kepler, neither can be easily compared to the Phi. But this article should give an idea of where it is positioned.

The architecture

It contains 60 cores with a vector-width of 512 bits (8 times 64 bits). This means that per clock-tick each of the 60 cores can do one computation on an 8-wide vector of double precision floats (SIMD). Compare this to an AMD card, which has several hundreds of cores with support for 4-wide vectors of single precision floats (VLIW). At 1.053 GHz this gives 1.053 * 60 * 8 * 2 = 1011 GFLOPS.
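The arithmetic behind that peak number can be written out as a small sketch. The function name and its parameters are only illustrative; the inputs are the figures quoted above (1.053 GHz, 60 cores, 8 double precision lanes, 2 operations per lane per clock-tick).

```python
def peak_gflops(clock_ghz, cores, vector_lanes, flops_per_lane_per_cycle):
    """Theoretical peak in GFLOPS: every core fills every lane on every clock-tick."""
    return clock_ghz * cores * vector_lanes * flops_per_lane_per_cycle


# Figures from the text: 1.053 GHz, 60 cores, 512-bit vectors = 8 doubles,
# and 2 floating point operations per lane per tick (multiply + add).
dp_peak = peak_gflops(1.053, 60, 8, 2)
print(f"Double precision peak: {dp_peak:.0f} GFLOPS")
```

This is a theoretical upper bound only: it assumes that all cores issue a full-width vector instruction on every clock-tick, which real code rarely manages.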

[Image: intel-MIC-vector]

The factor 2 in the calculation above is there because the Phi is capable of doing multiply-add (MAD) operations: a multiply plus an add in one instruction. This means that if you have a multiply-operation, you get an add for free – if not, you get only 0.5 TFLOPS. For more information, check “Fused Multiply-Add” on page two of Differences in floating-point arithmetic between Intel® Xeon® processors and the Intel® Xeon Phi™ coprocessor [PDF].
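To get a feeling for how much the multiply-add matters, here is a rough model. It assumes (my simplification, not an Intel specification) that a core issues one vector arithmetic instruction per clock-tick, that a multiply-add counts as 2 operations and any other instruction as 1. The function name and the `fma_fraction` parameter are made up for this sketch.

```python
def effective_gflops(peak, fma_fraction):
    """Estimated throughput when only part of the vector instructions are multiply-adds.

    peak         -- theoretical peak in GFLOPS, reached with 100% multiply-adds
    fma_fraction -- share of vector instructions that are multiply-adds (0.0 to 1.0)
    """
    if not 0.0 <= fma_fraction <= 1.0:
        raise ValueError("fma_fraction must be between 0 and 1")
    # Without any multiply-add every instruction does 1 operation: half of peak.
    # Each multiply-add contributes one extra operation on top of that.
    return peak / 2 * (1 + fma_fraction)


for fraction in (0.0, 0.5, 1.0):
    print(f"{fraction:4.0%} multiply-adds: {effective_gflops(1011, fraction):7.2f} GFLOPS")
```

So code with only separate multiplies or only separate adds ends up at about half of the peak, and a mix lands somewhere in between.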

Most interesting would be to know how well the scheduler is implemented. If there is one (full) scheduler per core, then the Phi will be much easier to program than an accelerator from AMD or NVIDIA. Do note that the upcoming architectures of the two GPU-vendors are much more advanced on this point.

There is no official information that single precision has double the performance of double precision – what is clear is that Intel focuses on double precision. The Phi has a strong focus on cache sizes (± 1.8 MB L1 and ± 30 MB L2; these appear to be totals over all 60 cores rather than per-core figures, which would mean about 512 KB of L2 per core) and a high memory bandwidth (320 GB/s → ± 5.33 GB/s per core) – both make the accelerator easier to program, so it takes less effort to write code that runs at 70% of peak performance or better.
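What the bandwidth figure means in practice can be sketched with two small helpers. The function names are only illustrative, and the inputs are the numbers used in this article (320 GB/s, 60 cores, 1011 GFLOPS double precision), which may still change.

```python
def bandwidth_per_core(total_gb_s, cores):
    """Memory bandwidth share per core in GB/s, if all cores stream at once."""
    return total_gb_s / cores


def bytes_per_flop(total_gb_s, peak_gflops):
    """Bytes that can be fetched from memory per floating point operation at peak."""
    return total_gb_s / peak_gflops


per_core = bandwidth_per_core(320, 60)
balance = bytes_per_flop(320, 1011)
# A double is 8 bytes, so this many operations are needed per loaded double
# before memory stops being the limiting factor.
flops_per_double = 8 / balance

print(f"{per_core:.2f} GB/s per core")
print(f"{balance:.3f} bytes per flop")
print(f"about {flops_per_double:.0f} operations per double loaded from memory")
```

In other words: code that does only a few operations per loaded value will be limited by memory, not by the vector units. That is where the large caches have to help.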

[Image: cpu-phi]

The Phi is special in more ways. When the Phi was still called Knights Corner, it was mentioned that it is pre-loaded with an embedded Linux. This means it is a computer on its own. You can read more about it here.

Knowing this capability of the Phi, it is strange that it is so strongly positioned to be used alongside a strong CPU. For future releases, too, Intel focuses its system-architecture on putting an Intel Phi next to an (Intel) CPU (see image: blue is CPU, yellow is Phi).

This is a different approach from what is popular with other chip-designers, who try to find ways to put the accelerator on the same die as the CPU. But as the interconnect-war is currently heating up, we cannot draw any conclusion from this. Think of the various ways the 386/486 co-processors could be connected to the motherboard/CPU – this time, too, nothing has been decided yet.

Programming models

Intel chose the safest way to attract as many developers as possible: support all models. This list could be shortened later for the sake of vendor lock-in, but for now we can enjoy it. The image below is taken from this PDF. Of course OpenCL is in it too.

Performance comparison

A comparison to both competitors. There are many, many sources all claiming different things. I will therefore update these tables a lot in the coming time.

Tesla K20:


| Functionality | XEON Phi 5110P | TESLA K20 |
|---|---|---|
| GPU-Processor count | 1 | 1 |
| Architecture | x86 Knights Corner | Kepler GK110 |
| Memory per GPU-processor | up to 8 GB GDDR5. No ECC! | 6 GB GDDR5 (5 GB w/ ECC) |
| Memory bandwidth per GPU-processor | 320 GB/s | 200 GB/s |
| Performance (single precision, per GPU-proc.) | 2.022 TFLOPS | 3.52 TFLOPS |
| Performance (double precision, per GPU-proc.) | 1.011 TFLOPS | 1.17 TFLOPS |
| Max power usage per GPU-processor | 225 Watt | 225 Watt |
| Greenness (SP) | 8.99 GFLOPS/Watt ? | 15.6 GFLOPS/Watt |
| Bus Interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price (per GPU-processor) | $2649 | $3199 |
| Price per GFLOPS (SP) | $1.31 | $0.91 |
| Price per GFLOPS (DP) | $2.62 | $2.73 |
| Cooling | Passive (P = Passive, A = Active) | Passive |

FirePro S9000 – see next article for the S10000:


| Functionality | XEON Phi 5110P | FirePro S9000 |
|---|---|---|
| GPU-Processor count | 1 | 1 |
| Architecture | x86 Knights Corner | Graphics Core Next |
| Memory per GPU-processor | up to 8 GB GDDR5. No ECC! | 6 GB GDDR5 (5 GB w/ ECC) |
| Memory bandwidth per GPU-processor | 320 GB/s | 264 GB/s |
| Performance (single precision, per GPU-proc.) | 2.022 TFLOPS | 3.230 TFLOPS |
| Performance (double precision, per GPU-proc.) | 1.011 TFLOPS | 0.806 TFLOPS |
| Max power usage per GPU-processor | 225 Watt | 225 Watt |
| Greenness (SP) | 8.99 GFLOPS/Watt | 14.36 GFLOPS/Watt |
| Bus Interface | PCIe 3.0 x16 | PCIe 3.0 x16 |
| Price (per GPU-processor) | $2649 | $2500 |
| Price per GFLOPS (SP) | $1.31 | $0.77 |
| Price per GFLOPS (DP) | $2.62 | $3.10 |
| Cooling | Passive (P = Passive, A = Active) | Passive |
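The greenness and price-per-GFLOPS rows are derived from the raw performance, power and price rows, so they can be recomputed whenever a source turns out to be wrong. A small sketch, with the raw numbers copied from the tables above and with dictionary keys and the function name chosen only for this example:

```python
# Raw rows from the tables: peak GFLOPS (SP and DP), max power in Watt, price in USD.
CARDS = {
    "Xeon Phi 5110P": {"sp": 2022, "dp": 1011, "watt": 225, "usd": 2649},
    "Tesla K20": {"sp": 3520, "dp": 1170, "watt": 225, "usd": 3199},
    "FirePro S9000": {"sp": 3230, "dp": 806, "watt": 225, "usd": 2500},
}


def derived_rows(card):
    """Compute the derived table rows for one card from its raw numbers."""
    return {
        "gflops_per_watt_sp": card["sp"] / card["watt"],
        "usd_per_gflops_sp": card["usd"] / card["sp"],
        "usd_per_gflops_dp": card["usd"] / card["dp"],
    }


for name, card in CARDS.items():
    rows = derived_rows(card)
    print(
        f"{name:15s} "
        f"{rows['gflops_per_watt_sp']:6.2f} GFLOPS/Watt (SP)  "
        f"${rows['usd_per_gflops_sp']:.2f}/GFLOPS (SP)  "
        f"${rows['usd_per_gflops_dp']:.2f}/GFLOPS (DP)"
    )
```

Note that these are all based on theoretical peak performance and list prices, so they say little about what a real application will get per Watt or per dollar.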

Sources for the Phi-specifications are below.

As not all information is public, no conclusions can be drawn yet. Follow us on Twitter or LinkedIn to get notified of any update of this article and other interesting information.

Sources

Blogs of people at Intel who are into Phi and OpenCL

 
