Starting with GROMACS and OpenCL

Now that GROMACS has been ported to OpenCL, we would like you to help us make it better. Why? It is very important that more projects get ported to OpenCL, to build critical mass. Using only our spare resources, we can port about one project per year. So the deal is that we do the heavy lifting and, with your help, get the last issues covered. Understand that we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The steps below will take no more than 30 minutes.

Getting the sources

All sources are available on GitHub (our working branch, based on GROMACS 5.0). If you want to help, check out via git: on the command line, via Visual Studio (included in 2013, and available for 2010 and 2012 via a git plugin), Eclipse or your preferred IDE. Otherwise you can simply download the zip file. Note there is also a wiki, where most of this text came from; especially check the “known limitations“. To check out via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working build environment (GCC, Visual Studio) and an OpenCL SDK installed. You also need FFTW. The GROMACS build can compile it for you, but it is also in the Linux repositories, and can be downloaded for Windows. Below is for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other build options. You don’t need them, but they give an idea of what is possible:

  • -DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
  • -DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
  • -DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
  • -DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
  • -DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
  • -DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
  • -DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
  • -DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
  • -DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
  • -DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
  • -DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode
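As a sketch (with a hypothetical install path), several of these options can be combined into one configure line; GMX_GPU and GMX_USE_OPENCL must stay on:

```shell
# Hypothetical example: an MPI-enabled, mdrun-only release build, installed
# under $HOME/gromacs-opencl instead of /usr/local/gromacs.
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON \
  -DGMX_MPI=on -DGMX_BUILD_MDRUN_ONLY=on \
  -DCMAKE_INSTALL_PREFIX=$HOME/gromacs-opencl -DCMAKE_BUILD_TYPE=Release
```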

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then run make, and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X parallel jobs. Below is with 4 jobs (for a 4-core CPU):

make -j 4
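For the optional check mentioned above, GROMACS provides the standard CMake test target (a sketch; as said, success is currently not guaranteed for the OpenCL port):

```shell
# After building, run the regression tests; also accepts -j for parallel jobs.
make check
```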

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the GROMACS website. In all cases, it is very important that you turn on GMX_GPU and GMX_USE_OPENCL. The wiki of the GROMACS OpenCL project also has lots of extra information. Be sure to check it if you want to do more than just the benchmarks below.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

GROMACS needs to know where to find the OpenCL kernels, on both Linux and Windows. Under Linux, type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. On Windows, define the GMX_OCL_FILE_PATH environment variable and set its value to /path_to_gromacs/src/.

Important: if you plan to make changes to the kernels, you need to disable caching in order to be sure you will be using the modified kernels. Set GMX_OCL_NOGENCACHE, and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE=1
export CUDA_CACHE_DISABLE=1
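To avoid forgetting one of the three variables, the whole preparation can go into one small bash snippet (the source path is a placeholder for your own checkout; the value 1 is an assumption that is safe for all three):

```shell
# Placeholder path - point this at your actual GROMACS checkout.
export GMX_OCL_FILE_PATH=/path-to-gromacs/src/
# GROMACS only checks whether this variable is set; 1 is as good as anything.
export GMX_OCL_NOGENCACHE=1
# NVIDIA's compiler cache reads the value, so use 1 to disable it.
export CUDA_CACHE_DISABLE=1
```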

Simple benchmark, CPU-limited (d.poly-ch2)

Then download the archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks and unpack it in the build/bin folder. If you have installed GROMACS machine-wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/ :

cd d.poly-ch2
../gmx grompp
../gmx mdrun

You have now run GROMACS, with results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run from the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.
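For example, to benchmark the first GPU only:

```shell
../gmx mdrun -gpu_id 0   # use only GPU 0 for this run
```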

What’s next to do?

If you have your own experiments, of course test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that GROMACS was optimised for NVIDIA hardware, and we needed to reverse a lot of NVIDIA-specific optimisations to get good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any questions? Just ask them in the comments below, and we’ll help you on your way.

 

We ported GROMACS from CUDA to OpenCL

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is used by researchers worldwide and is one of the bigger bioinformatics packages around.

To speed up the computations, GPUs can be used. The big problem was that only NVIDIA GPUs could be used, as it was written in CUDA. To make it possible to use other accelerators, we ported it to OpenCL. It took a small team several months to get to the alpha release, and now I’m happy to present it to you.

Those who know us from consultancy (and training) only might have noticed: this is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There were several reasons, of which the four below are the most important.

  • The first reason is that we want to show what we can do. Every project was under NDA and we could not demo anything we made for a customer. We chose a CUDA package to port to OpenCL, as we noticed a trend of porting CUDA software to OpenCL (e.g. Adobe software).
  • The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
  • Third reason is that we can find new employees. Joining the project is a way to get noticed, and could end in a job offer. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software for finding out who is smart enough to handle such complexity.
  • Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. We would like to share all this experience at SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available at the AMD booth the whole week, so drop by if you’re curious and want to see it live with explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and CMake, you also need to know how to work with git. It builds on both Windows and Linux, and NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build, supported devices to the project planning. Please RTFM, before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

  1. Improve your OpenCL skills. What really applies to the project is this quote:

    Tell me and I forget.
    Teach me and I remember.
    Involve me and I learn.

  2. Make the OpenCL ecosphere better. Every product that has OpenCL support gives the user the choice of which GPU to use (NVIDIA, AMD or Intel).
  3. Make GROMACS better. It is already a large community and OpenCL-knowledge is needed now.
  4. Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you have managed to build and run it, look at the bug reports. The first focus is to get the failing kernels working – this is the top priority to finalise phase 1. After that, the real fun begins in phase 2: adding features and optimising for speed on specific devices. Since AMD FirePro is much better at double precision than NVIDIA Tesla, it would be interesting to add support for double precision. Also, certain parts of the code are done on the CPU and have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed; send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

  • The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
    • Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
    • Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
    • Berk Hess. Who helped answering the harder questions and feeding the discussions.
  • Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
  • Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
  • Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
  • Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
  • Michael Papili, for helping us with designing a demo for SC14.
  • Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never have gotten here. Thank you.

How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline to HIP’s Module API was attainable, yet cumbersome. Anticipation is driven by the single-source compilation model of HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches”

Install OpenCL on Debian, Ubuntu and Mint orderly

Libraries – can’t have enough

If you read the various manuals on how to compile OpenCL software on Linux, you can get dizzy from all the LD parameters. Also, when installing the SDKs from AMD, Intel and NVIDIA, you get different locations for libraries, header files, etc. Now that GPGPU is old-fashioned and we go for heterogeneous programming, the chances are higher that you will have multiple SDKs on your machine. Even if you want to keep your setup the way it is, reading this article gives you insight into the design behind it all. Note that Intel’s drivers don’t give OpenCL support for their GPUs, but only for their CPUs.

As my mother said when I was young: “actually cleaning up is very simple”. I’m busy creating a PPA for this, but that will take some more time.

First, the idea. For developers, OpenCL consists of five parts:

  • GPUs-only: drivers with OpenCL-support
  • The OpenCL header-files
  • Vendor specific libraries (needed when using -lOpenCL)
  • libOpenCL.so -> a special driver
  • An installable client driver

Currently GPU drivers are always OpenCL-capable, so you only need to take care of the other four parts. These are discussed below.

Please note that certain 64-bit distributions have no lib64, but only ‘lib’ and ‘lib32’. If that is the case for you, use the commands that are mentioned for 32-bit.
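As a concrete sketch for Debian-derivatives (package names as found in Ubuntu’s repositories; your release may differ), the vendor-neutral parts can be installed straight from the package manager:

```shell
# The OpenCL headers and the ICD loader (libOpenCL.so) come from the distro;
# the vendor-specific installable client driver still comes with the GPU driver/SDK.
sudo apt-get install opencl-headers ocl-icd-opencl-dev
# Optional: clinfo lists the platforms and devices the ICD loader can find.
sudo apt-get install clinfo
```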

Continue reading “Install OpenCL on Debian, Ubuntu and Mint orderly”

OpenCL on the CPU: AVX and SSE

When AMD came out with CPU support, I was the last one to be enthusiastic about it, comparing it to feeding chicken feed to oxen. Now CUDA has CPU support too, so what was I missing?

This article is a quick overview of OpenCL on CPU extensions; expect more to come when the hybrid X86 processors actually hit the market. Besides ARM, IBM already has them too; more about their POWER architecture in an upcoming article, to give it the attention it deserves.

CPU extensions

SSE/MMX started in the ’90s, extending the IBM-compatible X86 instruction set to be able to do an add and a multiplication in one clock-tick. I still remember the discussion in my student flat that the MP3s I could produce in only 4 minutes on my 166MHz PC just had to be of worse quality than the ones which were encoded in 15 minutes. No, the encoder I “found” on the internet made use of SSE capabilities. Currently we have reached SSE5 (by AMD) and Intel has introduced a new extension called AVX. That’s a lot of abbreviations! MMX stands for “MultiMedia Extension”, SSE for “Streaming SIMD Extensions” (with SIMD being “Single Instruction Multiple Data”) and AVX for “Advanced Vector Extensions”. This sounds very interesting, since we saw SIMD and vectors on the GPU too. Let’s go into SSE (1 to 4) and AVX – both fully supported on the new CPUs by AMD and Intel.

Continue reading “OpenCL on the CPU: AVX and SSE”

N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles using the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked one yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the first 26 solutions of the N-Queens problem. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background of this interesting puzzle.

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while NVIDIA hardware took half that. I knew it did not use the full potential of the GPU, because bitcoin mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and shows that the hard part of the job is that most of the energy goes into preparations without results.

Continue reading “N-Queens project from over 10 years ago”

NVIDIA ended their support for OpenCL in 2012

If you are looking for the samples in one zip-file, scroll down. The removed OpenCL-PDFs are also available for download.

The sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly used on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since with LLVM CUDA kernels can be loaded in OpenCL and vice versa, this could have brought the two techniques closer together.

What is the cause of this decreased support for OpenCL? Did they suddenly become aware that LLVM would erase any advantage of CUDA over OpenCL, and therefore decrease support for OpenCL? Or did they decide so long ago, as their last OpenCL-conformant product on Windows is from July 2010? We cannot be sure, but we do know NVIDIA does not have an official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support of OpenCL is the absence of the samples in their GPGPU-SDK. NVIDIA removed them without notice or clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Continue reading “NVIDIA ended their support for OpenCL in 2012”

OpenCL Wrappers

Mostly providing simplified host code with more convenient error-checking, but sometimes with quite advanced additions: the wrappers for OpenCL 1.x. As OpenCL is an open standard, these projects are an important part of the evolution of OpenCL.

You won’t find solutions that provide a new programming paradigm or work with pragmas, generating optimised OpenCL. This list is in the making.

C++

Goopax: Goopax is an object-oriented GPGPU programming environment, allowing high performance GPGPU applications to be written directly in C++ in a way that is easy, reliable, and safe.

OCL-Library: Simplified OpenCL in C++. Documentation. By Prof. Tim Warburton and David Medina.

Openclam: possibility to write kernels inside C++ code. Project is not active.

EPGPU: provides expressions in C++. Paper. By Dr. Lawlor.

VexCL: a vector expression template library. Documentation.

Qt-C++. The advantages of Qt in the OpenCL programmer’s hands.

Boost.Compute. An extensive C++ library using Boost. Documentation.

ArrayFire. A wrapper and a library in one.

OpenCLHelper. Easy to run kernels using OpenCL.

SkelCL. A more advanced C++ wrapper, providing various skeleton functions.

HPL, or the Heterogeneous Programming Library, where special Arrays are used to easily communicate between CPU and GPU.

ViennaCL, a wrapper around an OpenCL-library focused on linear algebra.

C, Objective C

C Framework for OpenCL. Rapid development of OpenCL programs in C/C++.

Simple OpenCL. Much simpler host-code.

COPRTHR: STDCL: Simplified programming interface for OpenCL designed to support the most typical use-cases in a style inspired by familiar and traditional UNIX APIs for C programming.

Grand Central Dispatch: integration into Apple’s environment. Documentation [PDF].

GoCL: For combination with Gnome GLib/GObject.

Computing-Language-Utility: C/C++ wrapper by Intel. Documentation included, slides of presentation here.

Delphi/Pascal

Delphi-OpenCL: Delphi/Pascal-bindings for OpenCL. Seems not active.

OpenCLforDelphi: OpenCL 1.2 for Delphi.

Fortran

Howto-article: It describes how to link with c-files. A must-read for Fortran-devs who want to integrate OpenCL-kernels.

FortranCL: OpenCL interface for Fortran 90. Seems to be the only matured wrapper around. FAQ.

Wim’s OpenCL Integration: contains a very simple f95 file ‘oclWrapper.f95’.

Go

GOCL: Go OpenCL bindings

Go-OpenCL: Go OpenCL bindings

Haskell

HopenCL: Haskell-bindings for OpenCL. Paper.

Java

JavaCL: Java bindings for OpenCL.

ClojureCL: OpenCL 2.0 wrapper for Clojure.

ScalaCL: Much more advanced integration as could be done with JavaCL.

JoCL by JogAmp: Java bindings for OpenCL. Good integration with sister projects JoGL and JoAL.

JoCL.org: Java bindings for OpenCL.

The Lightweight Java Game Library (LWJGL): Support for OpenCL 1.0-1.2 plus extensions and OpenGL interop.

Aparapi, a very high level language for enabling OpenCL in Java.

JavaScript

Standardised WebCL-support is coming via the Khronos WebCL project.

Nokia WebCL. Javascript bindings for OpenCL, which works in Firefox.

Samsung WebCL. Javascript bindings for OpenCL, which works in Safari and Chromium on OSX.

Intel Rivertrail: built on top of WebCL.

Julia

JuliaGPU. OpenCL 1.2 bindings for Julia.

Lisp

cl-opencl-3b: Lisp-bindings for OpenCL. not active.

.NET: C#, F#, Visual Basic

OpenCL.NET: .NET bindings for OpenCL 1.1.

Cloo: .NET bindings for OpenCL 1.1. Used in OpenTK, which has good integration with OpenGL, OpenGL|ES and OpenAL.

ManoCL: Not active project for .NET bindings.

FSCL.Compiler: FSharp OpenCL Compiler

Perl

Perl-OpenCL: Perl bindings for OpenCL.

Python

General article

PyOpenCL: Python bindings for OpenCL with convenience wrappers. Documentation.

Cython: C-extension for Python. More info on the extension.

PyCL: not active.

PythonCL: not active.

Clyther: Python bindings for OpenCL. No code yet, but check out the predecessor.

Ruby

Ruby-OpenCL: Ruby bindings for OpenCL. Not active.

Barracuda: seems to be not active.

Rust

Rust-OpenCL: Rust-bindings for OpenCL. Blog-article by author

Math-software

Mathematica

OpenCLLink. OpenCL-bindings in Mathematica 9.

Matlab

There is native support in Matlab.

OpenCL-toolbox. Alternative bindings. Not active. Works with Octave.

R

R-OpenCL. Interface allowing R to use OpenCL.

 

Suggestions?

If you know of similar projects that are not mentioned here, let us know! Even when not active, as that is important information too.

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly – but at least more smoothly than upgrading packages to the required versions on RedHat/CentOS. While the download says “Linux”, suggesting support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it was very possible to get a stable installation of Quartus on Ubuntu. As information on this subject was scattered around the net, and even incomplete, we decided to share our howto in this blog post. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc., as most problems are due to new features and more up-to-date libraries than RedHat/CentOS provides.

Note 1: we did not install the FPGA on the Ubuntu machine, nor did we fully research the potential problems of doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees that it all works on other versions. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

OpenCL Videos of AMD’s AFDS 2012

AFDS was full of talks on OpenCL. Did you miss them, just like me? Then you will be happy that they put many videos on YouTube!

Enjoy watching! As all videos are around 40 minutes, it is best to take a full day to watch them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL usages, and the fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012”

OpenCL at SC14

During SC14 (the SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

Finally I’m attending SC14 myself, and will give two talks. On Tuesday I’ll be part of a 90-minute Khronos session, where I’ll talk a bit about GROMACS and selecting the right accelerator for your software. On Wednesday I’ll be sharing our experiences from our port of GROMACS to OpenCL. If you meet me, I can hand you a leaflet with the decision chart to help select the best device for the job.

Continue reading “OpenCL at SC14”

OpenCL Developer support by NVIDIA, AMD and Intel

There was some guy at Microsoft who understood IT very well while being a businessman: “Developers, developers, developers, developers!”. You saw it again in the mobile market and now with OpenCL. Normally I watch his yearly speech to see which product they have brought to their own ecosphere, but the developers-speech is one to watch over and over because he is so right about this! (I don’t recommend the house-remixes, because those stick in your head for weeks.)

Since OpenCL needs to be optimised for each platform, it is important for these companies that developers start developing for their platform first. StreamComputing is developing a few different Eclipse plugins for OpenCL development, so we were curious what was already out there. Why not share all findings with you? I will keep this article updated – note that this article does not cover which features are supported by each SDK.

Continue reading “OpenCL Developer support by NVIDIA, AMD and Intel”

The OpenCL power: offloading to the CPU (AVX+SSE)

Say you have some data that needs to be used as input for a larger kernel, but needs a little preparation to get it aligned in memory (a small kernel with random reads). Unluckily the efficiency of such a kernel is very low, and there is no speed-up, or even a slowdown. When programming a GPU it is all about trade-offs, but one trade-off is forgotten a lot (especially by CUDA programmers) once the decision is made to use accelerators: just use the CPU. The main problem is not the kernel that has been optimised for the GPU, but that all supporting code (like the host code) needs to be rewritten to be able to use the CPU.

Why use the CPU for vector-computations?

The CPU has support for computing vectors. Each core has a 256-bit wide vector unit. This means a double4 (a vector of 4 64-bit floats) can be computed in one clock-cycle. So a 4-core CPU at 3.5GHz goes from 3.5 billion instructions to 14 billion when using all 4 cores, and to 56 billion instructions when using vectors. When using a float8, it doubles to 112 billion instructions. Using MAD instructions (Multiply+Add), this can be doubled again to 224 billion instructions.

Say we have this CPU with 4 cores and AVX/SSE, and the code below:

int* a = ...;
int* b = ...;
for (int i = 0; i < M; i++) {
   a[i] = b[i] * 2;
}

How do you classify the accelerated version of the above code? A parallel computation or a vector computation? Is it an operation using an M-wide vector, or is it using M threads? The answer is both – vector computations are a subset of parallel computations, so vector computations can be run in parallel threads too. This is interesting, as it means the code can run both on AVX and on the various cores.

If you have written the above code, you’d secretly hope the compiler figures out it can automatically run on all hyper-threaded cores and all the vector extensions available. To have code make use of the separate cores, you have various options like plain threads or OpenMP/MPI. To make use of the vectors (which increases speed dramatically), you need to use vector-enabled programming languages like OpenCL.

To learn more about the difference between vectors and parallel code, read the series on programming theories, read my first article on OpenCL-CPU, look around at this site (over 100 articles and a growing knowledge-section), ask us a direct question, use the comments, or help make this blog tick: request a full training and/or code-review.

Continue reading “The OpenCL power: offloading to the CPU (AVX+SSE)”

What does it mean to work at Stream HPC?

High-performance computing on many-core environments and low-level optimizations are very important concepts in large scientific projects nowadays. Stream HPC is one of the market’s more prominent companies, active mostly in North America and Europe.

As we often get asked what it is like to work at the company, we’d like to give you a little peek into our kitchen.

What we find important

We’re a close-knit group of motivated individuals, who get a kick out of performance optimizations and are experienced in programming GPUs. Every day we have discussions on performance: finding out why certain hardware behaves the way it does under a specific computing load, why certain code is not as fast as theoretically promised, and then locating the bottlenecks by analyzing the device and finding solutions for removing them. As a team we write better code than we ever could as individuals.

Quality is important for everybody on the team, which is a whole step further than “just getting the job done”. This has a simple reason: we cannot speed up code that is of low quality. This is also why we don’t use many tools that automatically do magic, as these often miss significant improvements and don’t improve code quality. We don’t expect AI to fully replace us soon, but once it’s possible we’ll probably be part of that project ourselves.

Computer science in general is evolving at a fast rate, and therefore learning is an important part of the job. Reading papers, finding new articles, discussing future hardware architectures and how they would affect performance: all of this is very important. With every project, we have to gather as much data as possible from scientific publications, interesting blog posts and code repositories, in order to be on the bleeding edge of technology for our project. Why use a hammer to speed up code, when you don’t know which hammer is best to use?

Our team-culture

Personality of the team

We are all kind, focused on structured problem-solving, communicative about wins and struggles, and we put group wins above personal gains. And we’re all gamers. To have good discussions and good disagreements, we seek people who are also open-minded. And we share and appreciate humor! If you want to know more about our culture, click here.

Tailored work environment

As we have all kinds of people on the team, we all need different ways of recharging: one needs a walk, while somebody else needs a quiet place. We help each other with more than just work-related obstacles. We think that a broad approach to our differences helps us understand how to progress to the next professional level the quickest. This is inclusivity in action, and we’re proud of it. Oh, and we have noise-canceling headphones.

Creating a safe place to speak up is critical for us. It helps us learn new skills and do things we have never done before. And this approach also works well for all those who don’t have Asperger’s or ADHD at all, but need to progress without first fitting a certain norm.

Projects we do

Today we work on plenty of exciting projects, and no year has been the same. Below is a page with projects we’re proud of.

https://streamhpc.com/about-us/work-we-do

Style of project handling

We use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – finding what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, so we have questioned and optimized everything that was presented to us as “good practice”. Most notable are our management and documentation styles.

Saying that an engineer hates documentation and being managed because he or she is lazy is simply false. It’s because most management and documentation styles are far from optimal.

Pull-style management means the tasks are written down by the team, based on the project proposal. All these tasks are put into the task list of the project, and each team member picks the tasks that are a good fit. The last resort – pushing tasks that stay behind and have a deadline – has only been needed in a few cases.

All code (in merge requests) is checked by one or two colleagues, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual, and one gets into the task well-prepared. The goal is not just to get the job finished, but to avoid having written the code where a future bug will be found.

All code can contain comments, and Doxygen creates documentation from them automatically, so there is no need to copy functions into a Word document. We introduced log-style documentation, as git history and Doxygen don’t answer why a certain decision was made. By keeping a logbook, a new member of the team can just read these remarks and fully understand why the architecture is the way it is and what its limits are. We’ll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

Where do we fit in your career?

Each job should move you forward, when taken at the right moment. The question is when Stream HPC is the right choice.

As you might have seen, we don’t require a certain education. This is because a career is a sum, and an academic degree can be replaced by various types of experience. The optimum is often both a degree and the right type of experience. This means that for us a senior can be a student, and a junior can have been 20 years in the field.

So what is the “right type of experience”? Let’s talk about those who only have job experience with CPUs. First, being hooked on performance as your primary interest is the main reason to get into HPC and GPGPU. Second, being good at C and C++ programming. Third, knowing algorithms and mathematics really well and being able to apply them quickly. Fourth, being a curious and quick learner, which shows through your having experimented with GPUs. This is exactly what we test and check during the application procedure.

During your job you’ll learn everything around GPU programming, with a balance between theory and practice. Preparation is key in how we work, and this is something you will develop in many circumstances.

Those who left Stream HPC have gone on to very senior roles, from team lead to CTO. With Stream HPC growing in size, the growth opportunities within the company are also increasing.

Make the decision for a new job

Would you like to work for a rapidly growing company of motivated GPU professionals in Europe? We seek motivated, curious, friendly people. If you liked what you read here, do check our open job positions.

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

WebCL is a great technique for having compute power in the browser. After WebGL, which brings high-end graphics to the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL is that they make it possible to lift the standard base from the OS to the browser. If you remember the trial over Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser to work well. Now it is the other way around, and it can be any OS, because the push doesn’t come from below, but from above.

Last year two guys from Lyon (France) got quite some attention, as they wrote a WebCL plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL”

The current state of WebCL

Years ago Microsoft was in court because it claimed Internet Explorer could not be removed from Windows without breaking the system, while competitors claimed it could. Why was this so important? Because (as it turned out) the browser would become more important than the OS, and the internet as important as electricity in the office and at home. I was therefore very happy to see the introduction of WebGL, the browser technology for OpenGL, as it pushes web interfaces as the default for user interfaces. WebCL is a browser plugin to run OpenCL kernels, meaning that more powerful hardware devices become available to JavaScript. This post is a work in progress as I try to find more resources. Seen stuff like this? Let me know.

Continue reading “The current state of WebCL”

Privacy Policy

Who we are

We are a group of companies based in the Netherlands, Hungary and Spain. We help our customers get their code to run fast by optimizing the computations and using accelerators. We have been doing this since 2010.

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymised string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service Privacy Policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Forms

Form data is sent to self-hosted software and is not read by any third party.

Tracking

We use anonymized tracking to find out:

  • Which pages are visited how often
  • Which subjects are popular
  • Which pages are clicked through
  • From which countries or states the visitors are

During a visit/session, you get a random ID.

Cookies

If you leave a comment on our site you may opt in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

Tracking cookies last for 24 hours.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor has visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Who we share your data with

None of the data is shared with any third party. Marketing reports don’t contain any personal data.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

Anonymous tracking data is not thrown away, so that we can find trends over the years.

What rights you have over your data

If you have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where your data is sent

Visitor comments and forms are checked through automated spam detection services: reCAPTCHA and Akismet.

Reporting problems

We are not in the business of monetizing user data, and believe in finding new customers through content.

As software and plugins change after updates, we are sometimes surprised that more is collected than we configured.

If anything is incorrect or not legal, please email to privacy@streamhpc.com. If you have generic questions, go to the contact page or email to info@streamhpc.com.

Molecular Dynamics

Penicillin

StreamHPC has carried out several successful parallel-computing projects in molecular dynamics since 2012. Below are a few examples of our work in bioinformatics, chemistry and meteorology.

GROMACS does soft matter simulations on molecular scale

GROMACS is one of the fastest molecular dynamics software packages on the market. To broaden the user base that can benefit from the processing power of modern GPUs, we ported GROMACS from CUDA to OpenCL and further optimised the code for use with AMD FirePro accelerators. The resulting performance is on a par with that of the original CUDA code, but without the restriction of being bound to specific parallel computing hardware. GROMACS is used worldwide by over 5000 research centers, from simulating molecular docking to examining the hydrogen bonds in a falling water drop. Read more…

For Stanford University, we further optimised part of TeraChem, a general-purpose quantum chemistry package designed to run on NVIDIA GPU architectures. Our work added an extra 70% performance to the already optimised CUDA code.

For the University of Manchester, we developed a high-performance implementation of the UNIFAC group contribution model for their research on atmospheric aerosol particles. Where an OpenMP implementation of the original single-threaded code brought the run time down from 32 to about 10 seconds on a quad-core CPU, we eventually brought it down to 0.062 seconds using OpenCL on a Xeon Phi accelerator – a speedup of 160x over OpenMP. Read more…

OpenCL Books

[infobox type=”information”]

Want a book written by us?

Format: PDF
Digital: 150-200 pages
Price: TBD
Author: The StreamHPC team
OpenCL-version: 2.0

[/infobox]

Below are the books that are available as downloadable PDF or in print. Please contact us if a printed book is missing. Note that I tend to be critical of books, not overwhelmingly positive or talking in superlatives. Most books target C and C++ developers, so be sure you learn the basics of C or C++ first before learning OpenCL. Most important are bit-shifts, pointers and structs, but thinking more in terms of hardware than in getting things done is also needed to get the full potential out of OpenCL. While you get that book on C, also get a book on computer architecture to fully understand the concept of bandwidth. Then you are ready for one of the pearls below. Happy reading!

The books are ordered descending by published date, except the first.

“The OpenCL specifications” by the Khronos Group

2.0 – current version

Format: PDF
File Size: 2.2MB
Digital: 281 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 13 July 2013 (version 2.0, revision 11)
OpenCL-version: 2.0
Homepage: http://www.khronos.org/registry/cl/
Direct link: http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

As a specifications document, you cannot expect a nice piece of prose, but most of the knowledge you need is in it. There are certainly some gaps (especially in clear explanation), but every version is getting better. When studying other sources, always have this document with you as a reference. I printed it as two pages per side (A4).

Read chapters 1 to 3, and keep the rest as a reference. Other books explain the long, long lists of language specifications in a nicer form. Also, most of it you’ll learn better by doing.

1.2, 1.1, 1.0 – previous versions

Format: PDF
File Size: 3.3MB
Digital: 380 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 14 November 2012 (version 1.2, revision 19)
OpenCL-version: 1.2
Homepage: http://www.khronos.org/registry/cl/

As much software is still written against a previous version, it is good to know the differences. The differences between 1.1 and 1.2 are described here.


OpenCL Programming By Example


Format: eBook/pBook
Pages: 304
Price: €4,50 (e), €50,- (p+e)
Publisher: PACKT
Authors: Ravishekhar Banger (AMD), Koushik Bhattacharyya (AMD)
Published Date: December 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-programming-by-example/book

“The OpenCL Programming Book” – Fixstars

Two versions are available: one for OpenCL 1.0 and one for 1.2. To start with the 1.2 version:

Format: eBook
Pages: 325
Price: USD 19.50
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Aki Asahara, Satoshi Miki and Jeongdo Son. Satoru Tagawa (translator)
Published Date: January 2012
OpenCL-version: 1.2
Homepage: http://www.fixstars.com/en/opencl/book/


Format: eBook
Pages: 246
Price: free
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki
Published Date: 31 March 2010
OpenCL-version: 1.0
Homepage: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/

1.0 version: It seems to be translated from Japanese to English, but apart from some small typos and spelling errors the book is very easy to read. The book explains the chapters you could skip in Khronos’ specifications document, but it is certainly not complete, since it discusses OpenCL 1.0 and focuses on the basics. The parts that build up a program step by step are a bit annoying to read, because they repeat the whole program while only a few lines have changed. The book would be more like 180–200 pages if written more compactly.

1.2 version: Thicker, more up to date, and with a promise that there are fewer translation errors.

Heterogeneous Computing with OpenCL, second edition

Format: pBook
Pages: 400 (approx.)
Price: USD 69.95
Publisher: Morgan Kaufmann
Authors: Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa
Published Date: Sept 2011
OpenCL-version: 1.2
Homepage: http://www.elsevierdirect.com/product.jsp?isbn=9780123877666

This is what we all chose OpenCL for: hybrid processors. This book dives into that world completely, so we actually learn a lot of new things about the advantages of having a GPU on your lap.

The new edition upgrades the book to version 1.2, but not much new has been added. So if you have the first edition, there is no need to buy the second.

OpenCL in Action

Format: eBook + pBook
Pages: 475
Price: USD 47.99 (e). USD 59.99 (p+e)
Publisher: Manning Publications
Authors: Matthew Scarpino
Published Date: non-final version updated regularly, target November 2011
OpenCL-version: 1.1
Homepage: http://www.manning.com/scarpino2/

Matthew Scarpino also wrote SWT/JFace in Action and Programming the Cell Processor, is a Linux professional and has much experience in IT. The book targets an audience that wants a more practical guide to learning OpenCL. He runs a blog at http://www.openclblog.com/

It is currently my favourite book, and a must-have for everybody interested in or working with OpenCL.

OpenCL Programming Guide

Format: PDF and/or print
Pages: 648
Price: USD 35.19 (e), USD 43.99 (p), USD 59.39 (p+e)
Publisher: Addison-Wesley Professional
Authors: Aaftab Munshi (Apple, Khronos Group), Benedict Gaster (AMD), Timothy G. Mattson, Dan Ginsburg
Published Date: August 2011
OpenCL-version: 1.1
Homepage: http://my.safaribooksonline.com/9780132488006 and http://www.openclprogrammingguide.com/

Aaftab Munshi is also responsible for the OpenCL specifications, so he probably knows what he’s talking about.

At 648 pages, it is quite a bit bigger than the targeted 480. Currently this is a very good replacement for Fixstars’ book. The disadvantage is that sending the printed book overseas (outside the USA/Canada) is much too expensive; people from the Eurasian continent, Africa and Latin America should just print it locally – we’re looking into better options.

OpenCL Parallel Programming Development Cookbook


Format: pBook + eBook
Pages: 303
Price: USD 26.39 (e) / 54.99 (p+e)
Publisher: Packt Publishing
Author: Raymond Tay
Published Date: August 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-parallel-programming-development-cookbook/book

Introductory book for OpenCL beginners. Examples: histogram, Sobel edge detection, Matrix Multiplication, Sparse Matrix Vector Multiplication, Bitonic sort, Radix sort, n-body.

Programming Massively Parallel Processors

Format: pBook
Pages: 258
Price: USD 46.40
Publisher: Morgan Kaufmann
Authors: David B. Kirk (NVIDIA) and Wen-mei W. Hwu (University of Illinois)
Published Date: 28 January 2010
OpenCL-version: 1.1?
Homepage: http://blogs.nvidia.com/ntersect/2010/01/worlds-first-textbook-on-programming-massively-parallel-processors.html

The book claims to discuss both OpenCL and CUDA, but actually there is just one chapter on OpenCL, and the focus is strongly towards NVIDIA hardware. It is a nice book for people who need to learn to program CUDA-only software/hardware and don’t want a book that’s too hard to understand. There are assignments at the end of each chapter, and important subjects are explained in detail, so you won’t have a hard time with those assignments.

It is not good for people interested in OpenCL-compliant architectures from AMD, ARM and IBM besides NVIDIA’s, but it is one of the best resources for understanding NVIDIA architectures from the viewpoint of a GPGPU programmer. The second edition adds more chapters, for example on MPI and OpenACC, and is less negative about OpenCL than the first edition.

Big Data

Big data is a term for data sets so large or complex that traditional processing applications are inadequate. Challenges include:

  • capture, data-curation & data-management,
  • analysis, search & querying,
  • sharing, storage & transfer,
  • visualization, and
  • information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we focus on optimizing (predictive) analytics and data-handling software, as these tend to be slow. We have solved Big Data problems on two fronts: real-time pre-processing (filtering, structuring, etc.) and analytics (including in-memory search on a GPU).

The 13 application areas where OpenCL and CUDA can be used

Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need for auto-tuning.

Which algorithms map best to which accelerator? In other words: what kinds of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group at Virginia Tech took a close look at which types of algorithms were a good fit for vector processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here at StreamHPC, as it gives a good starting point for investigating new problem spaces.

The document was inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He called these algorithmic methods “dwarves”. With 6 more application areas in which GPUs and other vector-accelerated processors do well, the list was completed.

As a funny side note: in the Brothers Grimm’s “Snow White” there were 7 dwarves, and in Tolkien’s “The Hobbit” there were 13.

Continue reading “The 13 application areas where OpenCL and CUDA can be used”