NVIDIA enables OpenCL 2.0 beta-support

Reading Time: 3 minutes

In the release notes for NVIDIA 378.66 graphics drivers for Windows NVIDIA mentions support for OpenCL 2.0. This has been the first time in 3 years since OpenCL 2.0 has been launched, that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 NVIDIA finally started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. OpenCL 2.0 was already there for one and a half years (November 2013), now more than three years ago.

Does it mean that you will be soon able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

The 8 reasons why our customers had their code written or accelerated by us

Reading Time: 4 minutes

Making software better and faster.

In the past six years we have helped out various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

Master+PhD students, applications for two PRACE summer activities open now

Reading Time: 2 minutes

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Below explains both programmes in detail. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

How many threads can run on a GPU?

Reading Time: 5 minutes

Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: at a minimum around 4 billion can be scheduled, 10’s of thousands can run simultaneously.

If you are used to work with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?”

Funded PhD internships at StreamHPC

Reading Time: 2 minutes

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamHPC.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.


The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2016. When you apply for one of the internships, you must update your profile page including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (€5000 for 3 months) + travel expenses (maximum €500). The main goal is to gain experience. Even if you don’t get a job after the internship, you tap into our network.

Continue reading “Funded PhD internships at StreamHPC”

IWOCL 2017 Toronto call for talks and posters is open

Reading Time: 1 minute

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks-off with a full-day Advanced Hands-On OpenCL tutorial which is followed by two-days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Reading Time: 2 minutes

Some weeks ago we started with implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement of OpenCL 2.2 is C++ kernels, which originally was planned for 2.1. SPIRV 1.1 is another big improvement.

We are very happy to have a part in making OpenCL better! We find OpenCL C++ kernels very important, even if it has its limitations. Thanks to SPIRV 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.

Personally I found this project to also be very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamHPC

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests to create an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamHPC. StreamHPC is a software consultancy company specialized in performance tuned software development for CPU, GPU and FPGA. A large part of their clients hires them for their OpenCL expertise.

Already improvements have been added, bugs splatted and documentation improved. We hope to continue this the coming months!

We’ll be ready in March. Hopefully the first implementations are ready by then, as there is a test suite ready to iron out any bug discovered. Which three OpenCL drivers do you think will be first to have OpenCL 2.2? Intel, AMD, NVidia, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Reading Time: 3 minutes

Always good to have a share out of that curve.

In June we wrote on “AMD is back!“, where this is one of the blog posts with more details in a specific direction. This post is about AMD specifically targeting machine learning with the MI ( = Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

Recently released have been the “Radeon Instinct” series, which purely focus on compute. How the new naming of AMD is organised will be discussed in a separate blog post. Continue reading “AMD gets into Machine Intelligence with “MI” range of hardware and software”

Opinions crossing the table: Khronos for world peace

Reading Time: 3 minutes

Pragmas not being mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lost of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like human languages, as new things get named and described world-wide. This also shows the hard work the development of languages bring, as new feature-requests are a constant. Continue reading “Opinions crossing the table: Khronos for world peace”

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

Reading Time: 7 minutes

quartusTo temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smooth – but at least smoother than upgrading packages to required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for Redhat 6.5 (and CentOS).

Luckily it was very possible to have a stable installation of Quartus on Ubuntu. As information on this subject was squattered around the net and even incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc, as most problems are due to new features and more up-to-date libraries than are provided in RedHat/CentOS.

Note1 : we did not install the FPGA on the Ubuntu-machine and neither fully researched potential problems for doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees if it all works on other version. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

Accelerating an Excel Sheet with OpenCL

Reading Time: 2 minutes

excel-openclOne of the world’s most used software is far from performance optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough with really heavy calculations. Another option is to use much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL”

Online Tutorials are here

Reading Time: 1 minute

46188854 - beautiful smiling female student using online education service. young woman looking in laptop display watching training course and listening it with headphones. modern study technology concept
Online training

We’re going online with our presentations and tutorials. This makes it easy to reach more people and make our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be an OpenCL 101 of two hours for free. Target timezone is East-America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamHPC
    • GPUs and paralellism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamHPC services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test if the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and password “ap“.

[bigbluebutton token=89b561b86fff]

See you soon!

How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)

Reading Time: 3 minutes

Hampstead flooding

How water moves through an area given a certain pace of instream, can be fully simulated. We got a request to make such simulation faster, as it took already too much time to do moderate simulations. As the customer wanted to be able to have more details, larger areas and more alternative situations computed, the current performance did not suffice.

The code was already ported to MPI to scale to 8 cores. This code was used as a base for creating our optimised GPU-code. Using a single GPU we managed to get an 44 to 58 times speedup over single core CPU, which is 5 to 7 times faster than MPI on 8 to 32 CPU cores.

For larger experiments we could increase the performance advantage over MPI-code from 7 times to a total of 35 times, using multiple GPUs.

We solved both the weak-scaling problem and the mapping on GPUs

If you add the 9x speedup of the initial performance-optimisation, the total is over 2600x. What could be done in a year, now can be done in 3.5 hours. This clearly shows the importance of software performance engineering. Most code already had some optimisations applied (just like here) and 5 to 7 times speedup is quite achievable.

Read below for some more details. Continue reading “How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)”

Get ready for conversions of large-scale CUDA software to AMD hardware

Reading Time: 5 minutes

IMG_20160829_172857_croppedIn the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. That 5% is solving ambiguity problems that one gets when CUDA is used on non-NVIDIA GPUs. Once the CUDA-code has been translated successfully, software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP are companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The costs of making CUDA work on AMD hardware is easily paid for by the price difference, when upgrading a GPU-cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

Reading Time: 3 minutes

GLXgearsThe information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new and open source drivers ready, fglrx is now deprecated and will not get updates (or very late). I therefore can get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat has such old kernels, there currently is no issue. Fedora users simply have a problem, as they already go towards 4.8.

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support”

CUDA Compute Capability 6.1 Features in OpenCL 2.0

Reading Time: 2 minutes

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double checking support for AMD Fijij GPUs (like Radeon Nano and FirePro S9300X2) I got curious how much support is still missing in OpenCL. For the support of Fiji it looks like there is 100% support of all features. For OpenCL 2.0 read on.

CUDA features per Compute Capability on Wikipedia

Continue reading “CUDA Compute Capability 6.1 Features in OpenCL 2.0”

Rant: No surprise there’s a shortage of good GPU-developers

Reading Time: 3 minutes

Another Monday, yet another graphics API

We could read here that software is critical for HPC – a market where accelerators/GPUs are used a lot. So all we need to do is to better support all GPU-developers as a whole, not? Unfortunately something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this with Ballmer’s “developers, developers, developers” meme. Tip of the hat to them for acting on the shortage, a shake of the head for how they acted. For .NET is was a success to steal away developers from Java and C/C++, increasing market share of Windows Server, SQL Server and more.

GPU-vendors want that too – growing the cake together they find too slow – best is to start the fight while the cake is tiny. Continue reading “Rant: No surprise there’s a shortage of good GPU-developers”

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

Reading Time: 1 minute

fast-fpgaFrom 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week ( 2 + 2 days of training, one pause day), including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Porting code that uses random numbers

Reading Time: 2 minutes


When we port software to the GPU or FPGA, testability is very important. A part of making the code testable, is getting its functionality fully under control. And you guessed already that run-time generated random numbers takes good attention.

In a selection of past projects random numbers were generated on every run. Statistically the simulations were more correct, but it is impossible to make 100% sure the ported code is functionally correct. This is because there are two variations introduced: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined error-variations are within the given limits, the two code-bases can have unnoticed, different functionality. On top of that, it is hard to have further optimisations under control, as that can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.

Below are some tips we gave to these customers, and I hope they’re useful for you. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 2)

Reading Time: 4 minutes

random_300In the first part of our two-part blog series, we have discussed how parallel computing applications can best use pseudo-random number generators (PRNGs) so as to benefit from parallel computing speedups, without negatively impacting the statistical properties of the random numbers generated. We have argued that index-based PRNGs (e.g., from the Random123 library), which do not maintain any state but instead take an index and a key as input and return the random number corresponding to the index in its random output sequence, provide numerous benefits in parallel programming. This week, we discuss both an application of index-based PRNGs and another important aspect of PRNGs in parallel environments: that of reproducing the same output among different parallel implementations or among a parallel and a serial implementation. We focus on the latter and consider reproducibility for the purpose of verification, which – as we have stated last week – is an important matter for our customers.


Reproducibility may be simple if random numbers are only used in the initialization phase of the application. In this case, it may be sensible to record the set of random numbers generated in the serial code, write it to a file and use this file as input to a table-based approach for random-number generation in the parallel code (as discussed in Part 1 of our blog). Indeed, this is a convenient approach for both us and our customers as we do not have to deal with PRNG implementation internals and our customer can independently verify the correctness of the parallel implementation. We have used this for several of our projects.

More complex scenarios such as stochastic simulations and Monte Carlo applications require a continuous, parallel generation of random numbers. In this case, reproducibility at first seems challenging considering the concurrent and thus unpredictable order in which parallel code may invoke PRNGs. Indeed, it may be impossible to achieve when using traditional, seed-based PRNGs. This is because we may need to access entries in the PRNG’s output sequence in an arbitrary order rather than sequentially. Essentially, we need to define a one-to-one mapping between random numbers generated in the serial code and those used in the parallel version and then be able to ask the PRNG for the first entry, the second entry, and so on, in both implementations. The random-access capability is exactly what index-based PRNGs provide. The main challenge lies in the definition of the mapping.

We may consider two options for defining the one-to-one mapping between random numbers generated in a serial code and those created in a parallel implementation. The summary of our findings is this:

  • Serial-to-parallel mapping: This amounts to an emulation of the serial random number generation in the parallel code. It generally requires minimal or no changes in the serial, original code, but may be increasingly difficult or impossible to define as the complexity of the code grows.
  • Parallel-to-serial mapping: This emulates the parallel random number generation in the serial code. It is usually easier to define than the serial-to-parallel mapping but requires deeper changes in the serial code.

We explain below what we mean by that.

Consider a simple scenario where we have some serial code that generates a new set of random numbers in each iteration of a loop using a traditional seed-based PRNG. A parallel implementation may operate by executing each iteration via a single work item. Using traditional PRNGs, we may equip each work item with its own PRNG seed.

  • Using a serial-to-parallel mapping, we may then record the PRNG state at each iteration of the loop in the serial code and initialize the PRNG state in each work item of the parallel code similarly to a table-based approach – this does not require any changes in the serial code (other than temporarily for recording the random numbers). Alternatively, we may replace the PRNG in the serial code with an index-based one and use a single, static counter for the PRNG, which we increment with each invocation of the PRNG – this is a minimal change in the serial code. We reflect the index-based PRNG in the parallel code and initialize each work item’s counter with w·n, where w is the work item index and n denotes the size of the set of random numbers created per loop. In both cases, the serial and the parallel code should produce identical outputs.
  • For a parallel-to-serial mapping, it is easier to assume that the parallel implementation uses index-based PRNGs. Let’s assume that each work item uses indices of the form w·n+j for the j-th item in the set of random numbers generated. Then the serial code is adapted to use the same PRNG and computes indices as WI(i)·n+j, where WI(i) returns the work item index corresponding to the i-th iteration of the loop. Again, the serial and parallel implementations should then produce identical outputs.

A proper mapping becomes increasingly difficult to define when the code complexity grows. However, it is usually easier to map the parallel random number generation to the serial code. Indeed, consider the case where some random numbers are generated in a data-dependent manner, i.e., only if some condition on the input data is fulfilled. Then it is impossible to give a predefined mapping from the serial to the parallel code. We therefore prefer the serial-to-parallel mapping whenever it is easy to define (as we don’t risk or minimize the risk of introducing bugs in the original code by changing the PRNG generation) but resort to the parallel-to-serial mapping for more difficult cases.

Strengthen our team as a remote worker (freelancer)

Reading Time: 2 minutes

code-jobsIn the past year we’ve been working on more internal projects and therefore we’re seeking strong GPU-coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family and working with advanced technologies. You will be on the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You have a personality for working independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have good internet (for doing remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited for a longer online test. You show your skills on C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick out the right optimisations, code it and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a videochat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!