Aparapi: OpenCL in Java

Posted by Vincent Hindriksen on 3 August 2011 with 2 Comments

Edit: Aparapi has been open sourced and many issues have already been fixed and improved.

If you have an AMD GPU/APU, you should try Aparapi. This software lets you write OpenCL-code in Java pretty high-level. The idea is that is sort of that it processes the Java intermediate code to search for loops and then create optimised OpenCL-kernels. Just download Aparapi and try the two examples. As the current version is still in alpha, it is not flawless yet. What I think is important when having worked with Aparapi is that you learn how to keep it simple – like you know that you can gain most speed on straight roads and turns slow down.

The Aparapi-team tries to avoid explicit defining of local memory, but it is still possible by using the @Local annotation. Such decisions show the team wants Aparapi to be high-level. It also integrates well with JavaCL and JOCL, so for the kernels you already have created, you can mix. You can also check out a video introducing Aprapi (it is video 15, if #-linking doesn’t work).

Time to create your own project. As not all errors are documented or are solved in the upcoming version, below you will find a list of common errors and how to easily solve them.

Continue reading “Aparapi: OpenCL in Java” →

Xeon Phi Knights Corner compatible workstation motherboards

Posted by Vincent Hindriksen on 1 August 2015 with 6 Comments

Intel has assumed a lot if it comes to XeonPhi’s. One was that you will use it on dual-Xeon servers or workstations and that you already have a professional supplier of motherboards and other computer-parts. We can only guess why they’re not supporting non-professional enthusiasts who got the cheap XeonPhi.

After browsing half the internet to find an overview of motherboards, I eventually emailed Gigabyte, Asus and ASrock for more information for a desktop-motherboard that supports the blue thing. With the information I got, I could populate the below list. Like usual we share our findings with you.

Quote that applies here: “The main reason business grade computer supplies can be sold at a higher price is that the customers don’t know what they’re buying“. When I heard, I did not know why the customer is not well-informed – now I do. Continue reading “Xeon Phi Knights Corner compatible workstation motherboards” →

The most noticeable processors from NVIDIA, AMD and Intel

Posted by Vincent Hindriksen on 6 May 2016 with 1 Comment

AMD-Intel-NVidia 10 years ago we had CPUs from Intel and AMD and GPUs from ATI and NVidia. There was even another CPU-makers VIA, and GPU-makers S3 and Matrox. Things are different now. Below I want to shortly discuss the most noticeable processors from each of the big three.

The reason for this blog-post is that many processors are relatively unknown, and several problems are therefore solved inefficiently.

NVidia

As NVidia doesn’t have X86, they mostly focuses on GPUs and bet on POWER and ARM for CPU. They already sell their Pascal-architecture in small numbers.

2017 will all be about their Pascal-architecture.

Tesla K80 (Kepler)

The GPU is not simply 2 x K40 (GK110B GPUs), the chip is actually different (GK210)
It is the Nvidia GPU with the largest private memory size (used in kernels): 255.

This is the GPU for lazy programmers and for actually complex code: kernels can use double the registers.

Pascal P100 (Pascal)

20 TFLOPS Half Precision (HP), 10 TFLOPS single precision, 5 TFLOPS double precision
16 GB HBM2 (720 GB/s).
NVlink up to 64 GB/s effectively (20% of the 80 GB/s is protocol-overhead), dual simplex bidirectional (so dedicated wires per direction). Each NVLink offers a bidirectional 16 GB/sec up and 16 GB/sec down. Compared to 12 GB/s PCIe3 x16 (24 GB/s cumulative), this is a good speed-up. The support is only available between Pascal-GPUs, and not between the GPU and CPU yet.
OpenPOWER support coming, to compete with Intel.

Now only available in a $129.000 costing server with 8 of these (making the price of each P100 $15.000). It will probably be widely available somewhere in Q1 2017, when HBM2 production is up-to-speed. It is unknown what the price will be then – that depends on how many companies are willing to pay the high price now.

The GPU is perfect for deep learning, which NVidia is highly focused on. The 5 TFLOPS double precision is also very interesting too. A server with 8 GPUs gives you 80 TFLOPS – double that, if you only need Half Precision.

Titan Black (Kepler) and GTX 980 (Maxwell)

The Titan Black has 1.7 TFLOPS DP, 4.5 TFLOPS SP.
The GTX 980 has 0.14 TFLOPS DP, 4.6 TFLOPS SP.

The two best-sold GPUs from NVidia, which are not server-grade. What interesting to note is that the GTX 980 is not always faster than the Titan Black, even though it’s more recent.

Tegra X1

0.5 TFLOPS SP (GPU), 1 TFLOPS HP
10 Watts

While not well-accepted in the car industry (uses too much power and no OpenCL), they are well-accepted in the car-entertainment industry.

AMD

Known for the strongest OpenCL-developers since 2012. With HSA-capable Fiji-GPUs, they now got to their third GPGPU-architecture after “VLIW” and “GCN” – fully driven by their HSA-initiative.

For 2017 they focus on their main advantages: brute Single Precision performance, HBM (they have early access), their new CPU (Zen) and new GPU (Polaris).

FirePro S9170 (GCN)

32GB GDDR5 global memory
2.5 TFLOPS DP, 5 TFLOPS SP

The GPU’s processor is the same as the FirePro S9150, which has been the unknown best DP-performer of the past years. The GPU got the top 1 spot using air-cooled solutions, only to be surpassed by oil-submersed solutions. The S9170 builds on top of this and adds an extra 16GB of memory.

The S9170 is the GPU with the largest amount of memory, solving problems that use a lot of memory and are bandwidth limited – think calculations on oil&gas and weather, which now don’t fit on GPUs.

Radeon Nano and FirePro S9300X2 (Fiji)

Nano: 0.8 TFLOPS DP, 8 TFLOPS SP, no HP-support at the processor (only for data-transfers)
S9300X2: 1.4 TFLOPS DP, 13.9 TFLOPS SP (lower clocked)
Nano 175 Watt, S9300X2 300 Watt
Nano has 4 GB HBM, with a bandwidth up to 512GB/s, S9300X2 has 2x 4GB HBM.

The Nano is the answer to NVidia’s Titans, and the S9300X2 is its server-class version.

These GPUs brings the best SP-GFLOPS/€ and the best SP-GFLOPS/Watt as of now. The nano focuses on VR desktops, whereas the S9300X2 enables you to put up to 111 TFLOPS in one server.

AMD Carrizo A10 8890k APU (HSA)

CPU with built-in GPU
About one TFLOPS
TDP of 95 Watt

The fastest HSA-capable processor out there. This means that complex software that needs a mix of task-parallel and data-parallel software runs best on such processor. This CPU+GPU has the most TFLOPS available on the market.

Intel

After years of “Peter and the wolf” stories, they seem to finally have gotten the Larrabee they promised years ago. With the acquisition of Altera, new processors are at the horizon.

Their focus is still on customers who focus on test-driven design and want to “make it run quickly, make it perform later”.

Xeon E5-2699 v4

55MB cache, 22 cores
AVX 2.0 (256 bit vector operations)
DDR4 (60 GB/s)

Not well-known, but this CPU is very capable to run complex HPC-code for the price of an high-end GPU. It could reach about 0.64 GFLOPS DP peak, when fully using all cores and AVX 2.0.

XeonPhi Knights landing

Available in socket and PCI version
3 TFLOPS DP, 6 TFLOPS SP
AVX 512 (512 bit vector operations)
16 GB HBM (over 400GB/s), up to 348 GB DDR4 (60 GB/s).
Currently (?) not programmable with OpenCL

After years of okish XeonPhis, it seems Intel now has a processor that competes with AMD and NVidia. Existing code (almost) just works on this processor, and can then be improved step-by-step. The only think not to be liked is the lack of benchmarks – so above numbers are all on paper.

Xeon+FPGA

Task-parallel processor
Low-latency

The reconfigurable chip that has been promised for over 2 decades.

I’m still researching this upcoming processor, as one of the strengths of an FPGA is the low-latency links to DisplayPort and networking, which seem to go via PCI on this processor.

Iris GPUs

CPU with built-in GPU
0.7 TFLOPS SP

As these GPUs are included in almost all CPU that Intel sells, these are the most-sold GPUs.

Selecting the right hardware

Choosing the best hardware has become quite complex, especially when focusing on the TCO (Total Costs of Ownership). At StreamHPC we have experience with many of the devices above, but also various embedded hardware that compete with the above processors on a totally different scale. You need to select the right benchmarks to know what your device of choice is – we can help with that.

Accelerating an Excel Sheet with OpenCL

Posted by Vincent Hindriksen on 19 September 2016

excel-opencl One of the world’s most used software is far from performance optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough with really heavy calculations. Another option is to use much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL” →

Starting with Altera OpenCL-on-FPGAs

Posted by Vincent Hindriksen on 3 July 2013 with 2 Comments

Altera has been very busy adding resources and has kicked off the beginning of June with opening up their OpenCL-program for the general public.
Only Stratix V devices are supported, but that could change later.

Below are all pages and PDFs concerning OpenCL I’ve found while searching Altera’s website.

Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms

Altera wanted to know where they could compete with GPUs and CPUs. For a big company their comparisons are quite honest (for instance about their limited access-speed to memory), but they don’t tell everything – like the hours(!) of compilation-time. The idea is that you develop on a GPU and when it’s correct, you port the correctly working software to the FPGA.

If you don’t have any experience working with their FPGAs, best is to ask around.

Medical_OpenCL — Image taken from Altera website.

Continue reading “Starting with Altera OpenCL-on-FPGAs” →

OpenCL 1.2 support on NVIDIA GT 700M series?

Posted by Vincent Hindriksen on 21 June 2013

If the information on the Geforce-website (an official NVIDIA-website) is correct, then the GT 700M series supports OpenCL 1.2.

There were no references of OpenCL 1.2 functions in any driver.

If you have more info, let us know in the comments.

Thanks to Rahul Garg for the tip.

OpenCL at SC14

Posted by Vincent Hindriksen on 12 November 2014 with 5 Comments

During SC14 (SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

Finally I’m attending SC14 myself, and will give two talks for you. Tuesday I’ll be part of a 90 minute session of Khronos, where I’ll talk a bit about GROMACS and selecting the right accelerator for your software. Wednesday I’ll be sharing our experiences from our port of GROMACS to OpenCL. If you meet me, then I can hand you over a leaflet with the decision chart to help select the best device for the job.

Continue reading “OpenCL at SC14” →

OpenCL at SC13

Posted by Vincent Hindriksen on 15 November 2013

Unluckily I am not at SC13, so I’ll enjoy from a distance. Luckily I don’t miss one of the most beautiful 15km runs in the Netherlands. When there is more news, I’ll add to this post – below is mostly taken from the Khronos website and the SC2013 website.

OpenCL Booth

Meet members of the OpenCL workgroup in Booth #4137 to get hot news from the OpenCL experts and an OpenCL reference card. Also learn about next year’s plans for IWOCL (International Workshop on OpenCL). Be sure not to miss the BOF on OpenCL 2.0 (in bold in the schedule).

Schedule

Sunday, 17 November
8:30 – 17:00	Tutorials	Structured Parallel Programming with Patterns	Michael McCool, James Reinders, Arch Robison, Michael Hebenstreit	302
8:30 – 17:00	Tutorials	OpenACC: Productive, Portable Performance on Hybrid Systems Using High-Level Compilers and Tools	Luiz DeRose, Alistair Hart, Heidi Poxon, James Beyer	401
Monday, 18 November
8:30 – 17:00	Tutorials	OpenCL: A Hands-On Introduction	Tim Mattson, Alice Koniges, Simon McIntosh-Smith	403
Tuesday, 19 November
11:00 & 15:00	ACM Student Research Competition Poster Reception	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
17:15 – 19:00	Short presentation	local_malloc: malloc for OpenCL __local memory [Poster]	John Kloosterman	Mile High Pre-Function
Wednesday, 20 November
11:00 & 15:00	Short presentation	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
16:00	Case Study	Accelerating Full Waveform Inversion via OpenCL on AMD GPUs	AcceleWare	AMD booth
17:30 – 19:00	BOF	OpenCL: Version 2.0 and Beyond	Tim Mattson, Ben Bergen, Simon McIntosh-Smith	405/406/407
Thursday, 21 November
11:30 – 12:00	Exhibitor Forum	OpenCL 2.0: Unlocking the Power of Your Heterogeneous Platform	Tim Mattson	501/502
10:30 – 11:00	Papers	General Transformations for GPU Execution of Tree Traversals	Michael Goldfarb, Youngjoon Jo, Milind Kulkarni	205/207
11:00 – 11:30	Papers	A Large-Scale Cross-Architecture Evaluation of Thread-Coarsening	Alberto Magni, Christophe Dubach, Michael F.P. O’Boyle	205/207
11:30 – 12:00	Papers	Semi-Automatic Restructuring of Offloadable Tasks for Many-Core Accelerators	Nishkam Ravi, Yi Yang, Tao Bao, Srimat Chakradhar	205/207
11:30 – 12:00	Short presentation	Introduction to OpenCL on FPGAs	AcceleWare	Altera booth
16:00 – 16:30	Paper	Accelerating Sparse Matrix-Vector Multiplication on GPUs using Bit-Representation-Optimized Schemes	Wai Teng Tang, Wen Jun Tan, Rajarshi Ray, Yi Wen Wong, Weiguang Chen, Shyh-hao Kuo, Rick Siow Mong Goh, Stephen John Turner, Weng-Fai Wong	401/402/403

There are other interesting SC13-events around OpenCL, so be sure to check the schedule carefully.

Khronos Members Exhibiting at SC13

Complete floor plan is available here.

Below links are updated from the company-homepages to special SC13 landing-pages.

Altera Corporation – Booth 4332.
- BittWare will be showcasing their TeraBox High Performance Reconfigurable Computing Platform, which runs OpenCL.
- Acceleware. See schedule above.
AMD – Booth 1113. See this list for a schedule.
ARM – Booth 3141.
- OpenCL on MALI demos at their booth
- AccelerEyes (#310) showcases ArrayFire (with OpenCL-on-ARM backend) running on Mali T604.
- Many more – just look for it.
Khronos OpenCL – Booth 4137.
IBM – Booth 126, 2713. OpenCL on IBM PowerLinux 7R2 and IBM Flex System. Also collaboration with Altera.
Intel – Booth 2501, 2701. OpenCL on their CPUs and GPUs.
NEC Corporation – 3109. Vector supercomputer – Khronos probably hints to OpenCL running on this machine – see for yourself.
NVIDIA – Booth 613. Have fun hearing them say: “Do you use CUDA or are you locked-in to OpenCL?” and variations on this.
Texas Instruments – Booth 3725. OpenCL on DSP demo.
XilinX – To schedule a private appointment (for an OpenCL-demo) visit Xilinx at the Convey booth (#3547) or the Alpha Data booth (#4237).

The floorplan can be downloaded here or here (mirrored on 15-Nov).

At SC13 and saw great OpenCL demos or news?

Share this info and photos in the comments, for others to pick up.

Academic hackatons for Nvidia GPUs

Posted by Vincent Hindriksen on 6 April 2019

Are you working with Nvidia GPUs in your research and wish Nvidia would support you as they used to 5 years ago? This is now done with hackatons, where you get one full week of support, to get your GPU-code improved and your CPU-code ported. Still you have to do it yourself, so it’s not comparable to services we provide.

To start, get your team on a decision to do this. It takes preparation and a clear formulation of what your goals are.

When and where?

It’s already April, so some hackatons have already taken place. For 2019, these are left where you can work on any language, from OpenMP to OpenCL and from OpenACC to CUDA. Python + CUDA-libraries is also no problem, as long as the focus is Nvidia.

Continue reading →

Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference in a with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like Cuda’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it an be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with full comparison, but we don’t have enough hardware in-house to do a full benchmark. Below information is what we’ve found on the net and a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On ATI 4000-series OpenCL is just slow, but gives very comparable to NVidia if compared to the 5000-series. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture AMD/ATI-hardware is very different from NVidia’s and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work a maximum speed in current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but you it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD works much better with OpenCL, the next might beat all others in a many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA en OpenCL will develop.

Homogeneous vs Heterogeneous

For us the most important reason to have chosen for OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s or he knew things were going to change. In the near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Like Java and .NET are very comparable, developers from both side know very well that their favourite feature is missing at the other camp. Most time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in openCL you have to create new kernels for every data-type.

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As it is done with more shared standards, there is no win and no gain to promote it. If you promote it, a lot of companies thank you, but the Rreturn-on-Investments is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

Khronos Releases OpenCL 2.2 With SPIR-V 1.2

Posted by Vincent Hindriksen on 16 May 2017 with 1 Comment

Today Khronos has released OpenCL 2.2 with SPIR-V 1.2.

The most important changes are:

A static subset of the C++14 standard as a kernel language. The OpenCL C++ kernel language includes classes, templates, lambda expressions, function overloads and many other constructs to increase parallel programming productivity through generic and meta-programming.
Access to the C++ language from OpenCL library functions to provide increased safety and reduced undefined behavior while accessing features such as atomics, iterators, images, samplers, pipes, and device queue built-in types and address spaces.
Pipe storage, which are compile time pipes. It’s a device-side type in OpenCL 2.2 that is useful for FPGA implementations to enable efficient device-scope communication between kernels.
Enhanced optimization of generated SPIR-V code. Applications can provide the value of specialization constants at SPIR-V compilation time, a new query can detect non-trivial constructors and destructors of program scope global objects, and user callbacks can be set at program release time.
KhronosGroup/OpenCL-Headers repository has been flattened. From now on, all version of OpenCL headers will be available not at separate branches, but all in master branch in separate directories named opencl10, opencl11 etc. Old branches are not removed, but they may not be updated in the future.
OpenCL specifications are now open source. OpenCL Working Group decided to publish sources of recent OpenCL specifications on GitHub, including just released OpenCL 2.2 and OpenCL C++ specifications. If you find any mistake, you can create an appropriate merge request fixing that problem.

This is what we said about the release:

“We are very excited and happy to see OpenCL C++ kernel language being a part of the OpenCL standard,” said Vincent Hindriksen, founder and managing director of StreamHPC. “It’s a great achievement, and it shows that OpenCL keeps progressing. After developing conformance tests for OpenCL 2.2 and helping finalizing OpenCL C++ specification, we are looking forward to work on first projects with OpenCL 2.2 and the new kernel language. My team believes that using OpenCL C++ instead of OpenCL C will result in improved software quality, reduced maintenance effort and faster time to market. We expect SPIR-V to heavily impact the compiler ecosystem and bring several new OpenCL kernel languages.”

Continue reading “Khronos Releases OpenCL 2.2 With SPIR-V 1.2” →

OpenCL Basics: Running multiple kernels in OpenCL

Posted by Vincent Hindriksen on 15 October 2018

This series “Basic concepts” is based on GPGPU-questions we get via email more than once, or when the question is not clearly explained in the books. For one it is obvious, for the other just what they’re missing.

They say that learning a new technique is best done by playing around with working code and then try to combine it. The idea is that when you have Stackoverflowed and Githubed code together, you’ve created so many bugs by design that you’ll learn a lot if you make it work. When applying this to OpenCL, you quickly get to a situation that you want to run one.cl file and then another.cl file. Almost all beginner’s material discuss a single OpenCL-file, so how to do this elegantly?

Continue reading “OpenCL Basics: Running multiple kernels in OpenCL” →

Altera published their OpenCL-on-FPGA optimization guide

Posted by Vincent Hindriksen on 11 November 2013

Altera has just released their optimisation guide for OpenCL-on-FPGAs. It does not go into the howto’s of OpenCL, but assumes you have knowledge of the technology. Niether does it provide any information on the basics of Altera’s Stratix V or other FPGA.

It is the first public optimisation document, so it is appreciated to send feedback directly. Not aware what OpenCL can do on an FPGA? Watch the below video.

https://www.youtube.com/watch?v=p25CVFMc-dk

Subjects

The following subjects and optimisation tricks are discussed:

FPGA Overview
Pipelines
Good Design Practices
Avoid Pointer Aliasing
Avoid Expensive Functions
Avoid Work-Item ID-Dependent Backward Branching
Aligned Memory Allocation
Ensure 4-Byte Alignment for All Data Structures
Maintain Similar Structures for Vector Type Elements
Optimization of Data Processing Efficiency
Specify a Maximum Work-Group Size or a Required Work-Group Size
Loop Unrolling
Resource Sharing
Kernel Vectorization
Multiple Compute Units
Combination of Compute Unit Replication and Kernel SIMD Vectorization
Resource-Driven Optimization
Floating-Point Operations
Optimization of Memory Access Efficiency
General Guidelines on Optimizing Memory Accesses
Optimize Global Memory Accesses
Perform Kernel Computations Using Constant, Local or Private Memory
Single Work-Item Execution

Carefully compare these with CPU and GPU optimisation guides to be able to write more generic OpenCL code.

Download

You can download the document here.

If you have any question on OpenCL-on-FPGAs, OpenCL, generic optimisations or Altera FPGAs, feel welcomed to contact us.

Call for papers: SYCL workshop, 13-March-2016, Barcelona, Spain

Posted by Vincent Hindriksen on 22 October 2015 with 2 Comments

A high-level language has been on OpenCL’s roadmap since the years, and would be started once the foundations were ready. Therefore with OpenCL 2.0, SYCL was born.

To keep the pace high, a SYCL workshop is being organised. This week the call-for-papers is opened, which you can read below.

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Barcelona, Spain Sunday, 13th March, 2016

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction
layer that builds on the underlying concepts, portability and efficiency of
OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL
enables single source development where C++ template functions can contain both
host and device code to construct complex algorithms that use OpenCL
acceleration, and then re-use them throughout their source code on different
types of data. SYCL has also been designed with resilience from the start, by
featuring, for example, a fall-back mechanism to automatically re-enqueue
kernels on different queues in case of a failure.

The SYCL Workshop aims to gather together SYCL’s users, researchers, educators
and implementors to encourage and grow a community of users behind the SYCL
standard, and related work in C++ for heterogeneous architectures. This will be
a half-day workshop. SYCL’16 will be held in Barcelona, 13 March 2016,
co-located with PPoPP 2016, HPCA 2016, CGO 2016 and LLVM 2016.

Travel Awards

Student authors who present papers in this workshop are eligible to apply for
travel awards. Further details will be announced after notification of
acceptance.

Important Dates

Submissions: 23rd November
Notification: 21st December
Final version: 24th January, 2016
Workshop: Sunday, 13th March, 2016

Submission Guidelines

All submissions must be made electronically through the conference submission
site, at https://easychair.org/conferences/?conf=sycl16.
Submissions may be one of the following:

Extended abstract: Two pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Short Paper: Four to six pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Submissions must be in PDF format and printable on US Letter and A4 sized
paper. All submissions will be peer-reviewed by at least two members of the
program committee. We will aim to give longer presentation slots to papers than
to extended abstracts. Conference papers will not be published, but made
available through the website, alongside the slides used for each presentation.
The aim is to enable authors to get feedback and ideas that can later go into
other publications. We will encourage questions and discussions during the
workshop, to create an open environment for the community to engage with.

Topics of interest include, but are not limited to:

Applications implemented using SYCL

C++ Libraries using SYCL

C++ programming models for OpenCL (C++AMP, Boost.Compute, …)

Other C++ applications using OpenCL

New proposals to the SYCL specification

Integration of SYCL with other programming models

Compilation techniques to optimise SYCL kernels

Performance comparisons between SYCL and other programming models

Implementation of SYCL on novel architectures (FPGA, DSP, …)

Using SYCL in fault-tolerant systems

Reports on SYCL implementations

Debuggers, profilers and tools

Organising Committee

Paul Keir, University of the West of Scotland (UK)
Ruyman Reyes, Codeplay Software Ltd, Edinburgh (UK)

Program Committee

Jens Breitbart, TU Munich
Alastair Donaldson, Imperial College London, UK
Christophe Dubach, University of Edinburgh, UK
Joel Falcou, LRI, Université Paris-Sud, France
Benedict Gaster, University of the West of England, UK
Vincent Hindriksen, StreamHPC, Netherlands
Christopher Jefferson, St. Andrews University, UK
Ronan Keryell, Xilinx, Ireland
Zoltán Porkoláb, ELTE, Hungary
Francisco de Sande, Universidad de La Laguna, Spain
Ana Lucia Varbanescu, University of Amsterdam, Netherlands
Josef Weidendorfer, TU Munich

Yes, we’re in the Program Committee as one of the few non-academics. We’re looking forward to read your proposal!

If you have a blog, feel free to copy the above text and repost it.

Software Development

You have developed software that gives the answers you need but takes too long? Or maybe you need to calculate large data-sets on an hourly base, while the batch takes 2 hours?

What do you do when faster hardware starts to get too costly in terms of maintenance costs? You can buy specialized hardware, but that increases costs and dependence on external knowledge. Or, you can choose to just wait for the results to come in, but you can only do this when the computation is not a core process.

What if you could use off-the-shelf hardware to decrease waiting-time? By using OpenCL-devices, which can be high-end graphics cards or other modern processors, software can be sped up by a factor 2 to 20. Why? Because these devices can do much more in parallel and OpenCL makes it possible to make use of that (unused) potential. A few years ago this was not possible to do in the same way it is done now; that’s probably the main reason you haven’t heard of it.

Solutions

All we offer comes into three solutions: find what is available, make a parallel version of the code, and hand-tune the code for maximum performance.

[pricing_tables]
[pricing_table column=”one_third” title=”Specialised Libraries” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

For many “good enough”
Faster code, the easy way.
Gives high performance for generic problems.

[/pricing_table]
[pricing_table column=”one_third” title=”Parallel Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

Better caching can give more boost than using faster hardware
Software running in parallel is a first step to GPU-computing
Making the software modular when possible

[/pricing_table]
[pricing_table column=”one_third” title=”High Performance Coding” buttontext=”Request a quote »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

The highest performance is guaranteed
Optimized for the targeted hardware

[/pricing_table]
[/pricing_tables]

Services

There are so many possibilities to speed up code, but one is the best. To help you find the right path, we offer various services.

[pricing_tables]
[pricing_table column=”one_third” title=”Code Review” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/our-services/code-review/” buttoncolor=””]

Code-review of GPU-code (OpenCL, CUDA, Aparapi, and more).
Code-review of CPU-code (Java, C, C++ and more).
Report within 1 week if necessary.

[/pricing_table]
[pricing_table column=”one_third” title=”GPU Assessment” buttontext=”More info »” buttonurl=”https://streamhpc.com/consultancy/rapid-opencl-assessment/” buttoncolor=””]

Find parallellizable computations
Give the fitness to run on GPUs
Report within 2 weeks

[/pricing_table]
[pricing_table column=”one_third” title=”Architecture Assessment” buttontext=”Request more info »” buttonurl=”https://streamhpc.com/consultancy/request-more-information/” buttoncolor=””]

Architecture check-up
Data-transport measurements
Report within 2 weeks

[/pricing_table]
[/pricing_tables]

More information

We can make your compute-intensive algorithms much faster and scalable. How do we do it? We can explain it all to you by phone or in person. Send in the form on this page, and we will contact you.

You can also call now: +31 6454 00 456.

We invite you to download our brochures to get an overview of how we can help you widen the bottlenecks in your software.

Handling OpenCL with CMake 3.1 and higher

Posted by Vincent Hindriksen on 25 September 2015 with 2 Comments

There has been quite some “find OpenCL” code for CMake around. If you haven’t heard of CMake, it’s the most useful cross-platform tool to make cross-platform software.

Put this into CMakeLists.txt, changing the names for the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})
link_directories(${OpenCL_LIBRARY})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries (main ${OpenCL_LIBRARY})

Then do the usual:

make a build-directory
cd build
cmake .. (specifying the right Generator)

Adding your own CMake snippets and you’re one happy dev!

Cmake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)

Getting CMake 3.1 or higher

Ubuntu/Debian: Get the PPA.
Other Linux: Get the latest tar.gz and compile.
Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

AMD is back!

Posted by Vincent Hindriksen on 16 June 2016 with 4 Comments

AMD_Logo-and-wordmark-1024x768 For years we haven been complaining on this blog what AMD was lacking and what needed to be improved. And as you might have concluded from the title of this blogpost, there has been a lot of progress.

AMD is back! It will all come together in the beginning of 2017, but you’ll see a lot of progress already the coming weeks and months.

AMD quietly recognised and solved various totally new problems in HPC, becoming the hidden innovator everybody needed.

This blog is to give an overview of how AMD managed to come back and what it took to get to there. Their market cap supports it, as you can see.

amd-market-cap-history — AMD’s market cap is back at 2012 levels (source)

Continue reading “AMD is back!” →

Nokia Maemo and OpenCL

Posted by Vincent Hindriksen on 1 June 2010 with 2 Comments

Update 21-06-2011: Bumped into a project by Nokia: CLEP, “OpenCL Embedded Profile” for the N900.

Maemo is the Debian based Linux-distribution of Nokia for embedded devices. It is on the gadget N900, so you can be root on your own phone and compile your own kernel. In other words: a great developer’s phone.

Which smartphone to buy when you want to toy around with OpenCL “Embedded Profile”? There is more and more evidence that the next iPhone OS will have support for OpenCL, as should be expected Apple being the trademark-owner of OpenCL. This is good, since the mobile market could make the difference for the technique – competing with CUDA and DirectCompute. “The other ARM Cortex-A8 smartphone”, the Nokia N900 does not support it, while the magic of OpenCL attracts to many developers on the Maemo-forums.

The QT-blog that disclosed coming OpenCL-support for QT, spoke about it too:

>>Right now, QtOpenCL works very well with desktop OpenCL implementations, like that from NVIDIA (we’ve tested it under Linux, Mac, and Windows). Embedded devices are currently another matter – OpenCL implementations are still very basic in that space. The performance improvements on embedded CPU’s are only slightly better than using ARM/NEON instructions for example. And embedded GPU’s are usually hard-wired for GLSL/ES, lacking many of the features that makes OpenCL really sing. But like everything in the embedded space, things are likely to change very quickly. By releasing QtOpenCL, hopefully we can stimulate the embedded vendors to accelerate development by giving them something to test with. Be the first embedded device on the block to get the mandelbrot demo running at 10fps, or 20fps, or 60fps!<<

But checking the whole Nokia QT/Maemo-SDK for something like “opencl.h” or words like “opencl” and “khronos” in .h-files did not return anything interesting. The missing reference in the SDK tells me, we cannot expect any OpenCL-implementation on the N900 soon. So do we have to wait for the Nokia N920, Maemo 6 and QT 4.8? Once I know more, by getting deeper into the SDK, you’re the first to know. But first let me show you the documents which tells us OpenCL is coming to the Maemo-platform.

The Maemo Base Port Document, version 1.1

Exhibit number 1. The introduction tells us that the document describes what hardware-designers should do to get Maemo working on their device:

>>When Maemo is ported to a new chipset and HW environment, the majority of the SW worktakes place in the base layer. However, some adjustments may also be needed in the otherlayers. The porting work as a whole is a combined effort by the chipset vendor and Nokia. Thisdocument describes the deliverables expected from the chipset vendor in such an effort. The requirements in this document are expressed in the form of SW component, interface andfunctional requirements. Note that in many cases more detailed discussions are neededbetween Nokia and the chipset vendor to reach a common understanding about the specificsof the system architecture and the required component versions, functionality and interfaces.<<

So the document describes what the hardware must support, to be able to run Maemo. Let’s then find the magic word “OpenCL”:

>>Graphics Adaptation. The Base Port graphics adaptation interfaces consist of X11, OpenGL ES, and OpenVG interfaces. The OpenCL interface is also included in this group since it typically is used to access the GPU for general-purpose parallel computation.<<

And somewhat below:

>>OpenCL 1. The Base Port should provide an implementation of the OpenCL 1.0 interface for general-purpose parallel programming of heterogeneous systems, especially for the use of GPUs for computation (Khronos group standard).<<

That seems to be pretty clear that Maemo-devices must be able to support OpenCL.

http://www.forum.nokia.com/piazza/wiki/images/7/7d/Maemo_Base_Port_v1.1.pdf

Paper “OpenCL on Embedded devices” by Nokia

Exhibit 2 shows tests of a few simple OpenCL-program on an unnamed device with a TI OMAP 3430 (550 MHz ARM Cortex-A8 CPU & 110 MHz POWERVR SGX530 GPU) – which happens to be in the Motorola Droid, Palm Pre, and Nokia N900. So they managed to create a OpenCL-implementation on ARM. If you’re interested in OpenCL for embedded devices, please do read this presentation:

http://www.khronos.org/developers/library/2009-hotchips/Nokia_OpenCL-in-Handheld-Devices.pdf

It is a document from august 2009, which shows they actually were trying POWERVR and OpenCL then. Now with QT and Maemo mentioning it, we can be very sure the N900 or the N920 is eventually going to have OpenCL-support.

Why this new AMD FirePro Cluster is important for OpenCL

Posted by Vincent Hindriksen on 14 November 2014

Then it hit the doormat:

“AMD is proud to collaborate with ASUS, the Frankfurt Institute for Advanced Studies, (FIAS) and GSI to support such important physics and computer science research,” said David Cummings, senior director and general manager, professional graphics, AMD. “This installation reaffirms AMD’s leading role in HPC with the implementation of the AMD FirePro S9150 server GPUs in this three petaFLOPS supercomputer cluster. AMD and ASUS are enabling OpenCL applications for critical science research usage for this cluster. We’re committed to building our HPC leadership position in the industry as a foremost provider of computing applications, tools and technologies.“

You read more here and the official news here.

Why is this important?

It could be that there is more flops for the same price, as AMD hardware is cheaper? Nice, but secondary.

That it runs OpenCL? We like that, but from a broader perspective this is not the most important.

It is important because it creates more diversity in the world of HPC. Currently there are a few XeonPhi clusters and only one big AMD FirePro S10000 cluster. The rest is NVidia Tesla or CPU only. With more AMD clusters the HPC market is democratised. That means that more software will be written in vendor-neutral software like OpenCL (with high-level software/libraries on top), and prices of HPC accelerators will not be kept high.

How to further democratise the HPC world?

We started with porting Gromacs to OpenCL, and we will continue to port large projects to OpenCL. This software will simply run on XeonPhi, Tesla and FirePro with just little porting time, reducing costs in many ways. We can not do it alone, but together we can. Start by telling us which software needs to be ported from OpenMP to OpenCL or OpenMP 4, or from CUDA to OpenCL. And if you are porting open source software to OpenCL, drop us a line for free advice and help with testing the software.

And the best you can do to break the monopoly of CUDA, is to simply buy AMD or Intel hardware. The price difference is enough to buy lots of extra FLOPS and to pay for a complete porting project to OpenCL of a large application.

OpenCL – the battle, part I

Posted by Vincent Hindriksen on 28 January 2010 with 2 Comments

Part I: the Hardware-companies and Operating Systems

(Part II will be about programming languages and software-companies, part III about the gaming-industry)

OpenCL is the new, but already de-facto standard of stream-computing; but how it got there so fast is somewhat strange. A few years ago there were many companies and research-groups seeing the power of using the GPU, such as:

Ageia technologies’ PhysX (acquired by nVidia)
Havok Physics (acquired by Intel)
Stanford University’s Brook
PeakStream (acquired by Google)
ATI’s Close-To-Metal (abandoned, ATI acquired by AMD)
nVidia’s CUDA
AMD’s Brook-extensions Brook+ (completely replaced by OpenCL)
IBM’s InfoSphere

And the fight is really not over, since we are talking about a big shift in the super-computing industry. Just think of IBM BlueGene, which will lose lots of market to nVidia and AMD. Or Intel, who hasn’t acquired a GPU-creator as AMD did. Who had expected the market to change this rigorous? If we’re honest, we could have seen it coming (when looking at the turbulence around PhysX and Havok), but “normally” this new techniques would be introduced slowly.

The fight is about market-shares. For operating-systems, the user wants to have their movies encoded in 20 minutes just like their neighbour. For HPC-computing, since clusters can be updated for a far lower price than was possible with the old-fashioned way; here it is mostly between Linux HPC and windows HPC (which still has a very small market-share), but also database-engines which rely on high-performance hardware/software.
The most to gain is in the processor-market. The extremely large consumer-market is declining since 2004, since most users do not need more than a netbook and have bought a separate gaming-computer for the more demanding games. We don’t only see Intel and AMD anymore, but IBM’s powerful Cell- en Power-processors, very power-efficient ARM-processors, etc. Now OpenCL could make it more interesting to buy an average processor and a good graphics-card, Intel (and AMD) have no choice then to take the battle with nVidia.

Background: Why Apple made OpenCL

Short answer: pure frustration. All those different implementations would or get a share or fight for being named the standard; Apple wanted to bet on the right horse and therefore took the lead in creating an open standard. Money would be made by updating software and selling more hardware. For that reason Apple’s close partners Intel and nVidia were easily motivated to help developing the standard. Currently Apple’s only (public) reasons for giving away such an expensive and specialised project is publicity and to be ahead of the competition. Since it will not be a core-business of Apple, it does not need to stay in lead, but which companies do?

Acquisitions, acquisition, acquisitions

No time to lose for the big companies, so they must get the knowledge in-house as soon as possible. Below are some examples.

Microsoft: Interactive Supercomputing (22-Sept-2009): made Star-P, software which allowed users to perform scientific, engineering or analytical computation on array or matrix-based data to use parallel architectures such as multi-core workstations, multi-processor systems, distributed memory clusters or utility/cloud-based environments. This is completely in the field of OpenCL, which Microsoft needs to strengthen its products as Apple already did, such as SQL-server and Windows HPC.
nVidia: Ageia technologies (22-Febr-2008): made specialized PC-cards and software for calculating complicated physics in games. They made the first commercial product aiming at the masses (gamers). PhysX-code could by integrated in nVidia-drivers to be used with modern nVidia-GPUs.
AMD: ATI (24-juli-2006): graphics chip specialist. Although the price was too high, it saved AMD from being bought out by Intel and even stay ahead (if they had kept running).
Intel: Havok (17-Sept-2007): builds games-tools, such as a physics-engine. After Ageia was captured, the only good company out there to buy; AMD was too late, which spent all its money on ATI. Wind River (4-June-2009): a company providing embedded systems, development tools for embedded systems, middleware, and other types of software. Also read this interesting article. Cilk (31-July-2009): offers parallel extensions that are tightly tied into a compiler. RapidMind (19-Aug-2009): created a high-level language Sh, which had an OpenCL-backend. Intel has a lead in CPU-compilers, which it wants to broaden to multi-core- and GPU-compilers. Intel discovered it was in the group of “old fashioned compiler-builders” and had lots to learn in a short time.

If you know more acquisitions of interest, please let us know.

Winners

Apple, Intel and NVidia are the winners for 2009 and 2010. They have currently the most knowledge in house and have their marketing-machine running. NVidia has the best insight for new markets.

Microsoft and Game-developers are second; they took the first train by joining the OpenCL-consortium and taking it very serious. At the end of 2010 Microsoft will be at Apple’s level of expertise, so we will see then who has the best novelties. The game-developers, of which most already have experience with physics-calculations, all had a second chance when they had misjudged the Physics-engines. More on gaming in part III.

AMD is currently actually a big loser, since it does not seem to take it all seriously enough. But AMD can afford to be late, since OpenCL makes it easy to switch. We hope the best for AMD, since it has the technology of both CPU and GPU, and many years of experience in both fields. More on the competition between marketing-monster nVidia and silent AMD will be discussed in a blog-item, next week.

Another possible loser is Linux, which has lots to lose on HPC-market; OpenBSD-based Apple and Windows HPC can actually win market-share now. Expect most from hardware-manufacturers Intel, AMD and nVidia to give code to the community, but also from universities who do lots of research on the ever-flexible Linux. At the end it all depends on OpenCL-adaptation of (Linux-specific) programming-languages, which will be discussed in part II.

ARM is a member of the OpenCL-group but does not seem to invest in it; they seem to target another growing market: the low-power mobile devices. We will write on OpenCL and the mobile market later and why ARM currently can be relaxed about OpenCL.

We hope you have more insights in this new market; please contact us for more specific information and feel free to give your comments. Please stay tuned for part II and III, which will be released the next few weeks.

We ported GROMACS from CUDA to OpenCL

Posted by Vincent Hindriksen on 1 November 2014 with 9 Comments

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is being used by researches worldwide and is one of the bigger bio-informatics softwares around.

To speed up the computations, GPUs can be used. The big problem is that only NVIDIA GPU could be used, as CUDA was used. To make it possible to use other accelerators, we ported it to OpenCL. It took several months with a small team to get to the alpha-release, and now I’m happy to present it to you.

For who knows us from consultancy (and training) only, might have noticed. This is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There have been several reasons, of which the below four are the most important.

The first reason is that we want to show what we can. Each project was under NDA and we could not demo anything we made for a customer. We chose for a CUDA package to port to OpenCL, as we notice that there is a trend to port CUDA-software to OpenCL (i.e. Adobe software).
The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
Third reason is that we can find new employees. Joining the project is a way to get noticed and could end up in a job-proposal. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software to test out who is smart enough to handle such complexity.
Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. All this experience, we would like to share on SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available on the AMD booth the whole week, so if you’re curious and want to see it live with explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and Cmake, you also need to know how to work with GIT. It builds on both Windows and Linux, and NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build, supported devices to the project planning. Please RTFM, before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

Improve your OpenCL skills. What really applies to the project is this quote:

Tell me and I forget.
Teach me and I remember.
Involve me and I learn.
Make the OpenCL ecosphere better. Every product that has OpenCL support, gives choice to the user what GPU to use (NVIDIA, AMD or Intel)
Make GROMACS better. It is already a large community and OpenCL-knowledge is needed now.
Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you managed to build and run it, look at the bug reports. First focus is to get the failing kernels working – this is top priority to finalise phase 1. After that, the real fun begins in phase 2: add features and optimise for speed on specific devices. Since AMD FirePro is much better in double precision than Nvidia Tesla, it would be interesting to add support for double precision. Also certain parts of the code is done on the CPU, which have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed and send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
- Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
- Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
- Berk Hess. Who helped answering the harder questions and feeding the discussions.
Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
Michael Papili, for helping us with designing a demo for SC14.
Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never been here. Thank you.