An introduction to Grid-processors: Parallella, Kalray and KnuPath

We have been talking about GPUs, FPGAs and CPUs a lot, but there are more processors that can solve specific problems. This time I’d like to give you a quick introduction to grid-processors.

Grid-processors are different from GPUs. Where a multi-core GPU gets its strength from being able to compute lots of data in parallel (SIMD, data-parallelism), a grid-processor is able to have each core do something different (MIMD, task-based parallelism). You could say that a grid-processor is a multi-core CPU, where the number of cores is at least 16, and the cores are only connected to their neighbours. The difference with full-blown CPUs is that the cores are smaller (like those of a GPU) and thus use less power. The companies themselves categorise their processors as DSPs or Digital Signal Processors, but most popular DSPs only have 1 to 8 cores.
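The difference between the two styles can be sketched in a few lines of Python. This is only an illustration of the structure, not of real hardware: the thread pool stands in for the cores, and the function names (`scale`, `checksum`, `find_peak`, `histogram`) are made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def scale(x):
    """The single operation applied to every element (data-parallel style)."""
    return 2 * x


def checksum(values):
    """Task 1: a toy checksum over the whole input."""
    return sum(values) % 251


def find_peak(values):
    """Task 2: the largest value in the input."""
    return max(values)


def histogram(values):
    """Task 3: count how often each value occurs."""
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1
    return counts


def data_parallel(data):
    # SIMD-like: every worker runs the SAME function on a different element.
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(scale, data))


def task_parallel(data):
    # MIMD-like: every worker runs a DIFFERENT function at the same time.
    tasks = {"checksum": checksum, "peak": find_peak, "histogram": histogram}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, data) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}
```

In the first function the work is split by data, in the second by task. A grid-processor is built for the second pattern, with each task living on its own core.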

For context, there are several types of bus-configurations:

  • single bus: like the PCIe-bus in a PC or the iMX6.
  • ring bus: like the Xeon Phi up to Knights Corner, and the Cell processor.
  • star bus: a central communication core with the compute-cores around.
  • full mesh bus: each core is connected to each core.
  • grid bus: all cores are connected to their direct neighbours. Messages hop from core to core.
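To get a feeling for what "hopping from core to core" costs, here is a small sketch that counts hops on a ring and on a grid. It assumes shortest-path routing, a bidirectional ring, a 2D grid without wrap-around links and cores numbered row by row. Real chips may route differently, so treat the numbers as a rough model only.

```python
def ring_hops(src, dst, n_cores):
    """Hops between two cores on a bidirectional ring of n_cores."""
    distance = abs(src - dst)
    return min(distance, n_cores - distance)


def grid_hops(src, dst, width):
    """Hops between two cores on a 2D grid that is `width` cores wide.

    Cores are numbered row by row; a message moves one neighbour at a
    time, so the hop count is the Manhattan distance.
    """
    src_row, src_col = divmod(src, width)
    dst_row, dst_col = divmod(dst, width)
    return abs(src_row - dst_row) + abs(src_col - dst_col)


def average_hops(hop_fn, n_cores):
    """Average hop count over all ordered pairs of distinct cores."""
    pairs = [(s, d) for s in range(n_cores) for d in range(n_cores) if s != d]
    return sum(hop_fn(s, d) for s, d in pairs) / len(pairs)


# Compare a 16-core ring with a 4x4 grid.
ring_avg = average_hops(lambda s, d: ring_hops(s, d, 16), 16)
grid_avg = average_hops(lambda s, d: grid_hops(s, d, 4), 16)
```

Under these assumptions a message on the 16-core ring needs about 4.27 hops on average, against about 2.67 on the 4x4 grid. The worst case is 8 hops on the ring and 6 on the grid (corner to corner).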

Each of them has its advantages and disadvantages. Grid-processors get great performance (per Watt) with:

  • video encoding
  • signal processing
  • cryptography
  • neural networks

This is because task-parallelism is important for those groups of algorithms. Let’s go through three companies that build them and that are well-known or were recently in the news: Parallella, KnuPath and Kalray. If you think I should add another one, do let me know via the contact-page.


Parallella

Maybe the best-known parallel processor is the 16-core Epiphany III on the Parallella board. We have mentioned the processor a few times on our blog and our social media channels. This chip made it very clear that grid-processors are extremely power-efficient with 70 GigaFLOPS per Watt. The company has built a rather large and active community, which makes it the Raspberry Pi of the parallel processors. The board supports OpenCL, or at least 99% of the functions.


The company made a small batch of 64-core processors and now sells the IP to be included in SoCs. I have no idea whether it’s HSA-compliant.

Update Oct-2016: A 1024-core processor has been taped out.


KnuPath

KnuPath was recently in the news, trying to position its processor as a deep-learning and signal processor. The processor is comparable to Kalray’s (see below), but without a mature software stack, as it is brand new. It has 256 cores, and the company has invested a lot in the interconnects between boards, claiming a rack-to-rack latency of 400 ns, which is very comparable to the latency of PCIe.



I wanted to mention this one, because I think we’ll hear more from it later.


Kalray

I find Kalray’s MPPA2 the most mature grid-processor around. It has a mature software-stack, 256 cores and beta support for OpenCL. The chip has been put on various boards, including a networked device for lower latency.

From Kalray I learnt in which fields grid-processors are most efficient:

  • Computer Vision
  • Finance: Monte Carlo Asian Options Pricing benchmark
  • Cryptography
  • Compression: GZIP DEFLATE
  • Machine Learning: Convolutional Neural Networks (CNN)

Why grid-processors are now an option

With OpenCL and other new task/data-parallel programming languages, programmability has improved for all kinds of processors, including GPUs, FPGAs, DSPs and grid-processors. This means that you can safely choose processors other than a CPU, which was the only real choice 10 years ago. The reason is that with the frequency-wall of around 3 GHz, not only did multi-core processors become a better option than the single-core CPU, but so did the processors that used to be exotic. Compiler-advancements with frameworks like LLVM did the rest.

Feel free to contact us for more information.
