General articles on technical subjects.

Should SPIRV be supported in CUDA?

Would you like to run CUDA-kernels on the OpenCL framework? Or Python or Rust? SPIRV is the answer! Where source-to-source translations had several limitations, SPIRV 1.1 even supports higher level languages like C++.

SPIRV is the strength of OpenCL and it will only get bigger.

Currently Intel's drivers support SPIRV best, making them the first target for the new SPIRV-frontends. It is unknown which vendor will be next; probably one which (almost) has OpenCL 2.0 drivers already, such as AMD, ARM, Qualcomm or even NVidia.

How interesting is SPIRV really?

So if SPIRV is really that important a reason to choose OpenCL, I thought of framing it towards CUDA-devs on Twitter in a special way:

Should CUDA 9 support SPIRV?

2 people voted no, 29 people voted yes.

So that’s a big yes for SPIRV-support on CUDA. And why not? Don’t we want to program in our own language and not be forced to use C or C++? SPIRV makes it possible to quickly add support for any language out there without official support from the vendor. Let wrappers handle the differences per vendor and let SPIRV be the shared language for GPU-kernels. What do we need OpenCL or CUDA for, if the real work is defined by SPIRV?

What do you think? Leave a comment below on how you see the future of GPGPU with SPIRV in town. Is 2017 the year of SPIRV?

And if you worked on a SPIRV-frontend, get in touch to continue your project on https://github.com/spirv. Yes, it’s empty right now, but you don’t know what’s hidden.

NVIDIA beta-support for OpenCL 2.0 works on Linux too

In the release notes for 378.66 graphics drivers for Windows (February 2017), NVIDIA officially spoke about supporting OpenCL 2.0 for the first time. Unfortunately, this is partial support only and, as NVIDIA said, these new [OpenCL 2.0] features are available for evaluation purposes only.

We did our own tests on a GTX 1080 on Windows and could confirm that for Windows the green team is halfway there. NVIDIA still has to implement pipes, enable non-uniform work-group sizes (this happens when in ND-range global_work_size is not divisible by the local_work_size), and fix a few bugs in device side enqueue.
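As long as a driver lacks non-uniform work-group sizes, the usual OpenCL 1.x workaround is to round the global size up to a multiple of the local size and let the kernel skip the padding work-items. Below is a minimal sketch of that rounding; the helper names `padded_global_size` and `is_uniform` are ours for illustration and are not part of any OpenCL API.

```python
def padded_global_size(global_work_size, local_work_size):
    """Round global_work_size up to the next multiple of local_work_size."""
    if global_work_size <= 0 or local_work_size <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division gives the number of work-groups needed.
    groups = (global_work_size + local_work_size - 1) // local_work_size
    return groups * local_work_size


def is_uniform(global_work_size, local_work_size):
    """True when every work-group has exactly local_work_size items."""
    return global_work_size % local_work_size == 0


if __name__ == "__main__":
    n, local = 1000, 64
    print(is_uniform(n, local), padded_global_size(n, local))
```

With 1000 work-items and a local size of 64 the range is not uniform, so the padded global size becomes 1024. The kernel then needs a guard such as `if (get_global_id(0) >= n) return;` so the 24 extra work-items do nothing.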

Today we decided to test NVIDIA’s latest driver (378.13) for 64-bit Linux and check its support for OpenCL 2.0.

NVIDIA, OpenCL 2.0 and Linux

Just like on Windows, our GTX 1080 reports that it is an OpenCL 1.2 device. This is understandable, since support for OpenCL 2.0 is only in beta stage. In the following table you’ll find an overview of the 2.0 functions supported by this Linux driver.

[table id=8 /]

The host-side functions clSetKernelExecInfo(), clCreateSamplerWithProperties() and clCreateCommandQueueWithProperties() are also present and working.

As you can see, the support for OpenCL 2.0 on Linux is almost exactly the same as on Windows. But in contrast with the Windows drivers, we were able to successfully compile and run several more kernels that use device-side enqueue. This may indicate that the feature is being actively developed and that it will work much better in future drivers, for both Linux and Windows.
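One practical consequence of the device still reporting OpenCL 1.2: software that gates 2.0 features on the reported version will not use them, even where the functions already work. A small sketch of such a version check follows. It assumes the device version string has the `OpenCL <major>.<minor> <vendor-specific>` layout described in the OpenCL specification; the example string is illustrative and not copied from our test output, and the function names are hypothetical.

```python
import re


def parse_cl_version(version_string):
    """Extract (major, minor) from a string like 'OpenCL 1.2 <vendor info>'."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError("unexpected version string: %r" % version_string)
    return int(match.group(1)), int(match.group(2))


def reports_opencl_2(version_string):
    """True when the reported version is 2.0 or higher."""
    return parse_cl_version(version_string) >= (2, 0)


if __name__ == "__main__":
    # Illustrative string; a real one comes from querying the device version.
    print(reports_opencl_2("OpenCL 1.2 CUDA"))
```

For a beta driver like this one, testing for the individual functions you need is more informative than the version number alone.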

What you can do to make it better

As NVIDIA only adds new functionality to its OpenCL driver when requested, it is very important that they receive these requests. So if you or your employer is a paying customer, do keep requesting the features you need. NVIDIA knows that lacking required functionality will be bad for their sales.

NVIDIA enables OpenCL 2.0 beta-support

In the release notes for the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any reference in release notes and were therefore officially unofficial.

You should know that only on 3 April 2015 did NVIDIA finally start supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures. By then OpenCL 2.0 had already been available for one and a half years (since November 2013), which is now more than three years ago.

Does this mean that you will soon be able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

The 8 reasons why our customers had their code written or accelerated by us

Making software better and faster.

In the past six years we have helped out various customers solve their software performance problems. While each project has been very different, there have been 8 reasons to hire us as performance engineers. These can be categorised in three groups:

  • Reduce processing time
    • Meeting timing requirements
    • Increasing user efficiency
    • Increasing the responsiveness
    • Reducing latency
  • Do more in the same time
    • Increasing simulation/data sizes
    • Adding extra functionality
  • Reduce operational costs
    • Reducing the server count
    • Reducing power usage

Let’s go into each of these. Continue reading “The 8 reasons why our customers had their code written or accelerated by us”

Master+PhD students, applications for two PRACE summer activities open now

PRACE is organising two summer activities for Master+PhD students. Both activities are expense-paid programmes and will allow participants to travel and stay at a hosting location and learn about HPC:

  • The 2017 International Summer School on HPC Challenges in Computational Sciences
  • The PRACE Summer of HPC 2017 programme

The main objective of this programme is to enable HiPEAC member companies in Europe to have access to highly skilled and exceptionally motivated research talent. In turn, it offers PhD students from Europe a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems.

Both programmes are explained in detail below. Continue reading “Master+PhD students, applications for two PRACE summer activities open now”

How many threads can run on a GPU?

Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: At a minimum around 4 billion can be scheduled; tens of thousands can run simultaneously.

If you are used to working with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient. There are a few reasons why a GPU has been designed to handle so many threads. Read further…

NOTE: The below description is a (very) simplified model with the purpose to explain the basics. It is far from complete, as it would take a full book-chapter to explain it all. Continue reading “How many threads can run on a GPU?”

Funded PhD internships at StreamHPC

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamHPC.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.

 

The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2016. When you apply for one of the internships, you must update your profile page including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (€5000 for 3 months) + travel expenses (maximum €500). The main goal is to gain experience. Even if you don’t get a job after the internship, you tap into our network.

Continue reading “Funded PhD internships at StreamHPC”

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, which is followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

IWOCL 2017 Call for Submission Now Open – Submit your abstract here. Deadline is beginning of February, so better submit the coming month!

Call for IWOCL 2017 Annual Sponsors is also open. For that contact the IWOCL organisation via this webform.

Every year there have been unique conversations having real influence on the OpenCL standard, and we heard real-life development experience during various talks. If you missed the real technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Some weeks ago we started implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement of OpenCL 2.2 is C++ kernels, which were originally planned for 2.1. SPIRV 1.1 is another big improvement.

We are very happy to have a part in making OpenCL better! We find OpenCL C++ kernels very important, even if they have their limitations. Thanks to SPIRV 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also, upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.

Personally I found this project to also be very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamHPC

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests to create an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamHPC. StreamHPC is a software consultancy company specialized in performance tuned software development for CPU, GPU and FPGA. A large part of their clients hires them for their OpenCL expertise.

Improvements have already been added, bugs splatted and documentation improved. We hope to continue this in the coming months!

We’ll be ready in March. Hopefully the first implementations are ready by then, as there is a test suite ready to iron out any bug discovered. Which three OpenCL drivers do you think will be first to have OpenCL 2.2? Intel, AMD, NVidia, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote “AMD is back!”, and this is one of the follow-up blog posts that goes into more detail in a specific direction. This post is about AMD specifically targeting machine learning with the MI (= Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

The recently released “Radeon Instinct” series focuses purely on compute. How the new naming of AMD is organised will be discussed in a separate blog post. Continue reading “AMD gets into Machine Intelligence with “MI” range of hardware and software”

Opinions crossing the table: Khronos for world peace

Pragmas not being mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lots of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This means that a programming language is just like human languages, as new things get named and described world-wide. This also shows the hard work that the development of languages brings, as new feature-requests are a constant. Continue reading “Opinions crossing the table: Khronos for world peace”

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly, but at least more smoothly than upgrading packages to required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it was very possible to have a stable installation of Quartus on Ubuntu. As information on this subject was scattered around the net and even incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc, as most problems are due to new features and more up-to-date libraries than are provided in RedHat/CentOS.

Note 1: we did not install the FPGA on the Ubuntu machine, nor did we fully research potential problems with doing so. Installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees that it all works on other versions. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

Accelerating an Excel Sheet with OpenCL

One of the world’s most used software packages is far from performance optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements which promise higher speeds, but those have the disadvantage that they’re still not fast enough with really heavy calculations. Another option is to use the much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL”

Online Tutorials are here

Online training

We’re going online with our presentations and tutorials. This makes it easy to reach more people and makes our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be an OpenCL 101 of two hours for free. Target timezone is East-America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamHPC
    • GPUs and parallelism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamHPC services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test if the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and password “ap“.


See you soon!

How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)

Hampstead flooding

How water moves through an area, given a certain inflow rate, can be fully simulated. We got a request to make such a simulation faster, as even moderate simulations already took too much time. As the customer wanted to be able to compute more details, larger areas and more alternative situations, the current performance did not suffice.

The code was already ported to MPI to scale to 8 cores. This code was used as a base for creating our optimised GPU-code. Using a single GPU we managed to get a 44 to 58 times speedup over a single CPU core, which is 5 to 7 times faster than MPI on 8 to 32 CPU cores.

For larger experiments we could increase the performance advantage over MPI-code from 7 times to a total of 35 times, using multiple GPUs.

We solved both the weak-scaling problem and the mapping on GPUs

If you add the 9x speedup of the initial performance-optimisation, the total is over 2600x. What could be done in a year, now can be done in 3.5 hours. This clearly shows the importance of software performance engineering. Most code already had some optimisations applied (just like here) and 5 to 7 times speedup is quite achievable.
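The post does not spell out how these factors combine. One reading that reproduces “over 2600x” multiplies the 9x initial optimisation, the upper-end 58x single-GPU speedup, and the extra factor gained by going from 7x to 35x over MPI with multiple GPUs. The sketch below works through that arithmetic; the choice of factors is our assumption, not a formula from the project.

```python
def combined_speedup(initial=9, single_gpu=58, mpi_single=7, mpi_multi=35):
    """Multiply the separate speedup factors (assumed to be independent)."""
    multi_gpu_factor = mpi_multi / mpi_single  # 35 / 7 = 5
    return initial * single_gpu * multi_gpu_factor


def hours_after_speedup(speedup, original_hours=365 * 24):
    """Runtime in hours of a job that originally took original_hours."""
    return original_hours / speedup


if __name__ == "__main__":
    total = combined_speedup()
    print(total, round(hours_after_speedup(total), 2))
```

Under these assumptions the total is 2610x, and a year of computing shrinks to a little under three and a half hours, in line with the figures quoted above.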

Read below for some more details. Continue reading “How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)”

Get ready for conversions of large-scale CUDA software to AMD hardware

In the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the to-be-ported code-base.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA, such that it works on both AMD and NVIDIA hardware. The remaining 5% concerns ambiguity problems that arise when CUDA is used on non-NVIDIA GPUs. Once the CUDA-code has been translated successfully, software can run on both NVIDIA and AMD hardware without problems.

The target group of HIP is companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server with 4 Tesla K20 GPUs of 3.5 TFLOPS by 3 dual-GPU FirePro S9300X2 GPUs of 11 TFLOPS will give a huge performance boost for a competitive price.

The cost of making CUDA work on AMD hardware is easily paid for by the price difference when upgrading a GPU-cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

The information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new and open source drivers ready, fglrx is now deprecated and will not get updates (or very late). I therefore can get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat have such old kernels that there is currently no issue. Fedora users simply have a problem, as they are already heading towards 4.8.
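The rule above can be turned into a quick pre-upgrade check. The sketch below compares version strings against the limits named in this post (kernel 4.2 and Xorg 1.17); the function names are hypothetical and the version strings in the example are made up for illustration.

```python
import re

MAX_KERNEL = (4, 2)   # newest kernel series fglrx works with, per this post
MAX_XORG = (1, 17)    # newest Xorg server series fglrx works with, per this post


def major_minor(version):
    """Return (major, minor) from strings like '4.2.0-42-generic' or '1.17.2'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("cannot parse version: %r" % version)
    return int(match.group(1)), int(match.group(2))


def fglrx_compatible(kernel_release, xorg_version):
    """True when both the kernel and Xorg are within the fglrx limits."""
    return (major_minor(kernel_release) <= MAX_KERNEL
            and major_minor(xorg_version) <= MAX_XORG)


if __name__ == "__main__":
    print(fglrx_compatible("4.2.0-42-generic", "1.17.2"))
    print(fglrx_compatible("4.4.0-31-generic", "1.18.3"))
```

On a running Linux system `platform.release()` from the Python standard library returns the kernel release string, which can be passed in directly.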

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support”

CUDA Compute Capability 6.1 Features in OpenCL 2.0

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double checking support for AMD Fiji GPUs (like Radeon Nano and FirePro S9300X2) I got curious how much support is still missing in OpenCL. For the support of Fiji it looks like there is 100% support of all features. For OpenCL 2.0 read on.

CUDA features per Compute Capability on Wikipedia

Continue reading “CUDA Compute Capability 6.1 Features in OpenCL 2.0”

Rant: No surprise there’s a shortage of good GPU-developers

Another Monday, yet another graphics API

We could read here that software is critical for HPC, a market where accelerators/GPUs are used a lot. So all we need to do is to better support all GPU-developers as a whole, right? Unfortunately something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this with Ballmer’s “developers, developers, developers” meme. Tip of the hat to them for acting on the shortage, a shake of the head for how they acted. With .NET it was a success in stealing developers away from Java and C/C++, increasing the market share of Windows Server, SQL Server and more.

GPU-vendors want that too – growing the cake together they find too slow – best is to start the fight while the cake is tiny. Continue reading “Rant: No surprise there’s a shortage of good GPU-developers”

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week (2 + 2 days of training, one pause day), including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. A part of making the code testable is getting its functionality fully under control. And you guessed already that random numbers generated at run-time need careful attention.

In a selection of past projects random numbers were generated on every run. Statistically the simulations were more correct, but it was impossible to make 100% sure that the ported code was functionally correct. This is because two variations are introduced: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined error-variations are within the given limits, the two code-bases can have unnoticed, different functionality. On top of that, it is hard to keep further optimisations under control, as these can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.
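A minimal sketch of what “predictable outcomes” can look like in a test: generate the random inputs from a fixed seed, feed the same inputs to the original and the ported code, and compare within a tolerance. The two `sum_of_squares` functions are hypothetical stand-ins for a reference implementation and a port that sums in a different order; the seed and tolerance are arbitrary choices for illustration.

```python
import math
import random


def make_inputs(seed, count):
    """Generate the same pseudo-random inputs on every run for a given seed."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(count)]


def reference_sum_of_squares(values):
    """Stand-in for the original code."""
    return sum(v * v for v in values)


def ported_sum_of_squares(values):
    """Stand-in for the ported code: same maths, different summation order."""
    return math.fsum(v * v for v in reversed(values))


def outputs_match(seed=1234, count=1000, rel_tol=1e-9):
    """Run both versions on identical inputs and compare within rel_tol."""
    values = make_inputs(seed, count)
    return math.isclose(reference_sum_of_squares(values),
                        ported_sum_of_squares(values),
                        rel_tol=rel_tol)


if __name__ == "__main__":
    print(outputs_match())
```

With the inputs pinned down, any remaining difference comes from the code and the hardware only, which is the variation you actually want to inspect during a port.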

Below are some tips we gave to these customers, and I hope they’re useful for you. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”