CPU Code modernisation – our hidden expertise

You’ve seen the speedups of hundreds to thousands of times. We all know that the lion’s share of those techniques would also work on modern multi-core CPUs, and that the GPU only adds the last 2x to 8x. When it’s 8x, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU?

Now that AMD has launched their 32-core CPU, the answer to that question changes. Not only because of the 32 cores, but also because of the 256-bit vector computations via AVX2. This means that each clock cycle, 32 double4’s can be worked on. A 16-core AVX1 CPU could work on 16 double2’s, which is only a fourth of that performance.
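To make the “double4” idea concrete, here is a minimal sketch of a 256-bit vectorised loop using AVX/FMA intrinsics, where one register holds four doubles and a single instruction updates all four lanes. The function and array names are illustrative only, and the compile flags are an assumption (e.g. g++ -O2 -mavx2 -mfma).

```cpp
// Minimal sketch: one 256-bit AVX register holds four doubles (a "double4"),
// so each instruction works on four elements at once.
#include <immintrin.h>
#include <cstddef>

void scale_add(double* y, const double* x, double a, std::size_t n) {
    const __m256d va = _mm256_set1_pd(a);       // broadcast a into all 4 lanes
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d vx = _mm256_loadu_pd(x + i);    // load 4 doubles
        __m256d vy = _mm256_loadu_pd(y + i);
        vy = _mm256_fmadd_pd(va, vx, vy);       // y = a*x + y, 4 lanes at once
        _mm256_storeu_pd(y + i, vy);
    }
    for (; i < n; ++i)                          // scalar tail for the remainder
        y[i] += a * x[i];
}
```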

Intel reacted immediately by hinting that they will also launch a 32-core Xeon. Meanwhile, IBM is working on launching its quad-threaded 24-core POWER9 CPU, and Cavium is providing 64-bit, 64-core ARM processors, which also need many threads to keep them busy. Not only are core counts increasing: interconnect standards are all pushing for upgrades, and HBM is finding its way outside GPUs.

CPUs have been reborn.

We will discuss the advantages of these CPUs in upcoming blog posts.

Algorithm and CPU optimisations

Performance optimisation (aka “code modernisation”) is often described as “applying tricks”. Specialising in easily readable code that still performs, we know better. With CPUs taking on everything that makes GPUs fast, we can now apply GPU-specific optimisations in the CPU domain as well.
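As one hedged example of what such a GPU-style optimisation can look like on a CPU, here is a sketch of a structure-of-arrays layout (the CPU analogue of coalesced GPU memory access) combined with a data-parallel loop that the compiler can vectorise and OpenMP can spread over all cores. The struct, function and field names are illustrative, not taken from any particular project.

```cpp
// Minimal sketch: structure-of-arrays layout plus a vectorisable,
// thread-parallel loop. Compile with OpenMP enabled (e.g. -fopenmp).
#include <vector>
#include <cstddef>

struct ParticlesSoA {               // one contiguous array per field,
    std::vector<double> x, y, z;    // instead of an array of {x,y,z} structs
};

void advance(ParticlesSoA& p, const ParticlesSoA& v, double dt) {
    const std::size_t n = p.x.size();
    #pragma omp parallel for simd   // threads across cores, SIMD within a core
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += dt * v.x[i];      // unit-stride accesses vectorise cleanly
        p.y[i] += dt * v.y[i];
        p.z[i] += dt * v.z[i];
    }
}
```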

An example of a project where we ended up using the CPU was with Noospheer. The initial goal was to use the GPU, but we found that the CPU was faster for the given algorithm. The 40,000x speedup was achieved on an 8-core CPU with AVX, and the code is expected to run much faster on the CPUs described above.

“Stream is an elite, dependable and unique development outfit. We utilized Stream to achieve a ~40,000x speedup for our quantum simulation software.
If your project demands ultra fast design and robust implementation, work with them.”

Jordan Ash, CEO Noospheer

 

The multi-core CPU is dead.
Long live the multi-multi-core CPU!

In 2012 I wrote about CPUs with embedded GPUs. That was a big change and defined a completely new type of CPU. 20+ cores is not as big a change, but it still defines a new type of processor. This means a split: 4 or 8 cores for desktops, laptops and mobiles, and 20+ cores for shared servers and HPC.

You might ask: why only now? This is due to better (and more accepted) virtualisation techniques and, thanks to GPUs, more code that is optimised for data-parallel workloads. Another reason is Intel’s monopoly in the server market, which is now heavily under attack by ARM, IBM and AMD.

These modern 20+ core CPUs are very welcome, as they make the CPU-GPU gap smaller.

As you might have guessed, programming a 20+ core CPU is different from programming a 4-core CPU. Luckily we have long experience in building scalable software that runs optimally on anything from quad-core CPUs to high-end GPUs. When the CPU is chosen as the only target, the code can be kept in the same language, which is much requested, even if the code has to be largely rewritten to reach the new performance and quality goals.
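One small, hedged illustration of why 20+ cores change the picture: a shared accumulator that is “good enough” on a quad-core serialises on one cache line, while per-thread partial sums (here via an OpenMP reduction) keep scaling. The function and variable names are illustrative.

```cpp
// Minimal sketch: per-thread partial sums avoid 32 cores fighting over
// a single shared accumulator. Compile with OpenMP enabled (e.g. -fopenmp).
#include <cstddef>

double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    // Each thread accumulates privately; partial sums are combined at the end.
    #pragma omp parallel for reduction(+ : sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```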

For more information on our software development services, call +31 854865760.

 


  • pointless

    AVX2 is the same width as AVX – it just adds support for vectors of integers. Ryzen did *not* add AVX512, nor did it provide 512-bits worth of AVX/AVX2 throughput.

    http://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700/8

    The largest Xeons aren’t 32 cores, but rather 28.

    http://ark.intel.com/products/series/125191/Intel-Xeon-Scalable-Processors

    Maybe the next generation will reach 32 cores, but it would actually be unusual if Intel *didn’t* increase the maximum core count from one generation to the next.

    Of course, you could’ve just mentioned Xeon Phi. Knight’s Landing has up to 72 cores, each equipped with dual-AVX512 pipes. It splits the difference between general-purpose programmability of conventional multi-core server CPUs and GPU-level computational throughput. In terms of raw performance, it’s still off by a factor of 2 from top-end GPUs.