OpenCL on Altera FPGAs

On 15 November 2011 Altera announced support for OpenCL. The time between announcing OpenCL support and shipping an actually working SDK always takes longer than expected, so I did not expect anything working on FPGAs before 2013. Good news: the drivers actually work (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share what you should not have missed on these slides, and add some personal notes.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

FPGAs

FPGAs, or Field Programmable Gate Arrays, are processors that are programmed at the hardware level instead of the software level. This means they can be extremely fast in areas like signal analysis, but are less flexible, as they cannot simply load new software. If you have never heard of this technology, watch the video below.

As the video explained, it is easy to do things in parallel. If you have never programmed an FPGA, I suggest you spend a few weekends on it. Just buy a cheap FPGA dev-board and feel the magic of having your own programmed FPGA controlling the doorbell.

Programming FPGAs is a totally different story. It takes far more lines of code than C, and it needs to be clock-signal tested (“what is the current state at each clock cycle?” – see also the pipeline shown below). This results in far longer development times, which can only be justified by the result (extreme speed).

There have been various efforts to make programming FPGAs easier, but C-to-VHDL/Verilog projects only covered the first step (translation) and not the automated testing. For a long time this was not a problem, as there was enough demand in the high-end market for the two biggest companies, Altera and Xilinx. GPGPU was the real problem: suddenly GPUs were competing with FPGAs and could be programmed much more easily.

Altera powered by OpenCL

First a compliment to Altera’s Supervising Principal Engineer, Deshanand Singh. He is able to explain complex technology in simple words, and I understand why he gives many presentations. Below are just a few images from the presentation, but I urge you to read at least the September 2012 keynote.

Why are FPGAs cool?

Deshanand starts by giving his view on where FPGAs stand on the scales of both programmability and parallelism. It would be clearer to put programmability on the x-axis and parallelism on the y-axis.

To substantiate this graph, he shows the example of a FIR filter, which has large amounts of spatial locality and whose computation can be expressed as a feed-forward pipeline. The implementation on an FPGA has a performance-to-power ratio that is thousands of times(!) better than on GPUs and CPUs. Throughout the presentation, all three PDFs show various other advantages of FPGAs, such as memory locality.
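To get a feel for why a FIR filter pipelines so well, here is a minimal sketch in plain Python (my own illustration, not Altera’s code): every output is a weighted sum of the last few inputs, and there is no feedback, so on an FPGA each multiply-add can become its own pipeline stage.

```python
def fir(samples, coeffs):
    """Direct-form FIR filter: y[n] = sum_k c[k] * x[n-k].

    Each tap reads only recent past inputs (spatial locality) and
    there is no feedback path, so every multiply-add can become a
    dedicated pipeline stage producing one output per clock cycle.
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k in range(len(coeffs)):
            if n - k >= 0:
                acc += coeffs[k] * samples[n - k]
        out.append(acc)
    return out

# A 2-tap example: y[n] = 1*x[n] + 2*x[n-1]
print(fir([1.0, 2.0, 3.0], [1.0, 2.0]))  # [1.0, 4.0, 7.0]
```

On an FPGA the two multiplies and the add would run concurrently for different samples, one result per clock, which is where the performance-per-Watt advantage comes from.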

The competition: GPUs

The table below is very clear:

Add the quote “FPGAs have been 3 years away from being standard system components for the last 10 years and will be for the next 10 years” to it and it seems a clear win for GPUs. The message is clear: even with all the advantages of FPGAs, GPUs will take market share.

Just like array-processors and AVX/SSE, Altera has the same problem with the term GPGPU – you cannot do GPGPU on a non-GPU.

OpenCL the answer?

To “avoid wasting space”, Altera chose not to replicate hardware for each thread, but to make use of pipeline parallelism. The image below explains what this is.

Of course, more performance could be gained by using extra paths – an extra path can double the performance. I assume they have not worked this out well enough yet, or chose not to present it.
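The space-versus-throughput trade-off can be made concrete with some back-of-the-envelope arithmetic (my own sketch of the idea, not Altera’s numbers): a pipeline keeps one copy of the datapath and holds a different work-item in each stage every clock cycle, whereas replicating hardware per work-item multiplies the area.

```python
import math

def pipelined_cycles(stages, items):
    """Cycles for ONE pipelined copy of the datapath: after a fill
    latency of `stages` cycles, one result comes out per cycle."""
    return stages + items - 1

def replicated_cycles(stages, items, copies):
    """Cycles when the whole (unpipelined) datapath is replicated
    `copies` times and each copy runs one item start to finish."""
    return math.ceil(items / copies) * stages

# 1000 work-items through a 10-stage computation:
print(pipelined_cycles(10, 1000))      # 1009 cycles, 1x the hardware
print(replicated_cycles(10, 1000, 4))  # 2500 cycles, 4x the hardware
```

This is why pipelining “avoids wasting space”: the single pipeline beats four full copies here. An extra pipeline path roughly halves the item count per path, which is the doubling mentioned above.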

The compiler is quite comparable to other OpenCL compilers, but generates RTL instead of NVIDIA PTX, AMD IL or x86 assembly.

The un-optimiser seems to be needed to be able to reuse the standard Clang front-end. The optimiser does familiar things like loop fusion, auto-vectorisation and branch elimination.
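Loop fusion is the easiest of these to show in a few lines (a generic illustration of the optimisation, not Altera’s actual pass): two loops over the same range are merged into one, so the fused body can become a single, deeper pipeline instead of two separate passes over memory.

```python
def separate(xs):
    # Before fusion: two passes, with an intermediate buffer
    tmp = [x * 2 for x in xs]        # pass 1: scale
    return [t + 1 for t in tmp]      # pass 2: offset

def fused(xs):
    # After fusion: one pass, the loop bodies are merged, removing
    # the intermediate buffer and one trip through memory
    return [x * 2 + 1 for x in xs]

data = [1, 2, 3]
assert separate(data) == fused(data)
print(fused(data))  # [3, 5, 7]
```

For an RTL back-end this matters more than on a CPU: the fused loop maps to one pipeline with one memory read and one write per item, instead of buffering intermediate results off-chip.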

Examples show that with OpenCL the performance per Watt is much higher for the FPGA (Altera Stratix-IV 530) than for the GPU (Tesla C2075). As the FPGA solution is completely limited by external memory bandwidth, the resulting GFLOPS is lower. When the algorithm is not memory-bound and the kernel contains complex math, the FPGA outperforms the GPU by a few percent.

Interesting is that a simple graphics filter was up and running in a day, and optimised within a few days. That seems far faster than programming VHDL or Verilog.

There are still some questions left, but Altera is very open about their progress. The self-proclaimed challenge is that OpenCL’s compute model targets an “abstract machine” that is not an FPGA. This is also a challenge for the OpenCL consortium: to make sure more architectures are well supported. OpenCL also has support for task-parallel computing, which FPGAs could exploit.

What about Xilinx?

Xilinx, also a member of Khronos, has been very quiet about any progress on such a project. Just recently a master’s student, Kavya S Shagrithaya, wrote a thesis on OpenCL for Xilinx FPGAs which seems very complete:

In this thesis, a compilation flow to generate customized application-specific hardware descriptions from OpenCL computation kernels is presented. The flow uses Xilinx AutoESL [AutoPilot] tool to obtain the design specification for compute cores. An architecture provided integrates the cores with memory and host interfaces. The host program in the application is compiled and executed to demonstrate a proof-of-concept implementation towards achieving an end-to-end flow that provides abstraction of hardware at the front-end.

From the “Future Work” we can conclude the following:

  • Only a vector addition program was tested.
  • Only a subset of the OpenCL host API was supported.
  • Memory access has not been optimised.
  • The definitions contained Convey-specific assembly routines to perform the desired operations.
  • The compilation flow does not support all features of the OpenCL language.

Even though it is just a start, it is a nice piece of work, done in a short period from initial understanding to finish. And it was very clever to use AutoESL/AutoPilot instead of reinventing the wheel. I hope to see Kavya continue this work as a graduate research assistant – I did not find a public repository, though.

In another project, Alexey Kravets and Vladimir Platonov got OpenCL working on Xilinx FPGAs in 2011. Except for a reference on Kravets’ LinkedIn page, no traces of this project can be found.

When making a compiler, the hardest work is the optimisation part. So even with the work done by the three people mentioned, Xilinx seems to be a year behind Altera.

What’s next?

Unexpected things can happen now. Once FPGAs can be programmed easily, anyone can build and sell a niche-targeted FPGA product. If this OpenCL-on-FPGAs actually works, we will certainly see more programmable chips. In the coming months more information about Altera’s OpenCL program will be made public, so follow OpenCLonFPGAs on Twitter to get the latest information.

Want to play with OpenCL on Altera yourself? The most recent information is kept on Altera’s OpenCL SDK page in this site’s knowledge base.