General articles on technical subjects.

How many threads can run on a GPU?

Blocks of Threads

Q: Say a GPU has 1000 cores, how many threads can efficiently run on a GPU?

A: At a minimum around 4 billion can be scheduled, while tens of thousands can run simultaneously.

If you are used to working with CPUs, you might have expected 1000. Or 2000 with hyper-threading. Handling so many more threads than the number of available cores might sound inefficient, but there are a few reasons why a GPU has been designed to handle so many threads. Read further…
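
To make those numbers concrete, here is a back-of-the-envelope sketch. The hardware figures are assumptions for illustration, not measurements of any specific GPU:

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    // Scheduling limit: a 32-bit work-item index range already allows
    // over 4 billion threads to be *enqueued* at once (an assumption
    // here, matching the claim above).
    const uint64_t schedulable = uint64_t(1) << 32;

    // Residency limit: assumed figures for a ~1000-core GPU.
    const int compute_units    = 16;   // assumption
    const int waves_per_unit   = 40;   // assumption: in-flight warps/wavefronts
    const int threads_per_wave = 32;   // typical SIMD width

    const int resident = compute_units * waves_per_unit * threads_per_wave;

    printf("schedulable threads: %llu (billions)\n",
           (unsigned long long)schedulable);
    printf("resident threads   : %d (tens of thousands)\n", resident);
    // The oversubscription is deliberate: while one wave waits on memory,
    // another wave runs on the same cores, hiding the memory latency.
    return 0;
}
```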

NOTE: The description below is a (very) simplified model, with the purpose of explaining the basics. It is far from complete, as it would take a full book chapter to explain it all. Continue reading “How many threads can run on a GPU?”

Funded PhD internships at StreamHPC

We have several wishes for 2017 and two of them are to make code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available and two are at StreamHPC.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.

 

The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2017. When you apply for one of the internships, you must update your profile page, including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (about €5000 for 3 months) plus travel expenses (maximum €500). The main goal is to gain experience, and even if you don’t get a job after the internship, you tap into our network.

Continue reading “Funded PhD internships at StreamHPC”

IWOCL 2017 Toronto call for talks and posters is open

The fifth International Workshop on OpenCL (IWOCL) will be held on 16-18 May 2017 in Toronto, Canada. The event kicks off with a full-day Advanced Hands-On OpenCL tutorial, followed by two days of conference: keynotes, academic papers, technical presentations, tutorials, poster sessions and table-top demonstrations.

The IWOCL 2017 Call for Submissions is now open – submit your abstract here. The deadline is at the beginning of February, so better submit within the coming month!

The Call for IWOCL 2017 Annual Sponsors is also open. For that, contact the IWOCL organisation via this webform.

Every year there are unique conversations that have real influence on the OpenCL standard, and various talks share real-life development experience. If you miss truly technical talks at certain other GPU conferences, then IWOCL is where you should go.

We have been awarded the Khronos project to upgrade the OpenCL test suite to 2.2!

Some weeks ago we started implementing the Compiler Test Suite for OpenCL 2.2. The biggest improvement in OpenCL 2.2 is C++ kernels, which were originally planned for 2.1. SPIR-V 1.1 is another big improvement.

We are very happy to play a part in making OpenCL better! We find OpenCL C++ kernels very important, even though they have their limitations. Thanks to SPIR-V 1.1 it gets easier to have more (unofficial) kernel languages next to C and C++, and to get SYCL. Also, upgrading from 2.0 to 2.2 is rather easy thanks to the open source libclcxx.
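
To illustrate why C++ kernels matter: in OpenCL C a kernel has to be duplicated per element type (or generated with macros), while a C++ template collapses that into one definition. The sketch below shows the idea in plain C++; actual OpenCL C++ kernel code additionally carries address-space qualifiers, so take it as an analogy rather than spec-accurate kernel syntax:

```cpp
#include <cstddef>

// In OpenCL C (no templates) a function like this has to be written once
// per element type, or generated with macros. With C++ kernels a single
// template covers all types.
template <typename T>
void scale(T* data, std::size_t n, T factor) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= factor;      // one definition serves float, double, ...
}

int main() {
    float  f[4] = {1, 2, 3, 4};
    double d[4] = {1, 2, 3, 4};
    scale(f, 4, 2.0f);          // instantiated for float
    scale(d, 4, 0.5);           // instantiated for double
    return 0;
}
```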

Personally, I found this project also very important for our internal knowledge building, as almost every function would be touched and discussed.

OpenCL 2.2 CTS RFQ has been awarded to StreamHPC

Khronos issued a Request For Quote (RFQ) back in September 2016 to enhance and expand the existing OpenCL 2.1 conformance tests, creating an OpenCL 2.2 test suite to be used to define conformance for OpenCL 2.2 implementations. The contract has been awarded to StreamHPC. StreamHPC is a software consultancy specialised in performance-tuned software development for CPUs, GPUs and FPGAs. A large part of their clients hire them for their OpenCL expertise.

Improvements have already been added, bugs squashed and documentation improved. We hope to continue this in the coming months!

We’ll be ready in March. Hopefully the first implementations will be ready by then, as there will be a test suite ready to iron out any bugs discovered. Which three OpenCL drivers do you think will be the first to have OpenCL 2.2? Intel, AMD, NVidia, ARM, Imagination, Qualcomm, TI, Intel FPGA (Altera), Xilinx, Portable OpenCL or another?

AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote “AMD is back!”, and this is one of the follow-up blog posts with more details in a specific direction: AMD specifically targeting machine learning with the MI (= Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

Recently released is the “Radeon Instinct” series, which focuses purely on compute. How AMD’s new naming is organised will be discussed in a separate blog post. Continue reading “AMD gets into Machine Intelligence with “MI” range of hardware and software”

Opinions crossing the table: Khronos for world peace

Pragmas are not mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming-language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lots of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. In that sense a programming language is just like a human language, where new things get named and described world-wide. It also shows the hard work that developing a language brings, as new feature requests are a constant. Continue reading “Opinions crossing the table: Khronos for world peace”

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smoothly – but at least smoother than upgrading packages to the required versions on RedHat/CentOS. While the download says “Linux”, suggesting support for multiple Linux breeds, there is only official support for RedHat 6.5 (and CentOS).

Luckily it proved quite possible to get a stable installation of Quartus on Ubuntu. As information on this subject was scattered around the net and often incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse and Arch, as most problems are due to new features and more up-to-date libraries than RedHat/CentOS provides.

Note 1: we did not install the FPGA in the Ubuntu machine, nor did we fully research potential problems of doing so – installing the FPGA in an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees that it all works on other versions. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux”

Accelerating an Excel Sheet with OpenCL

One of the world’s most-used pieces of software is far from performance-optimised and there is hardly anything we can do about it. I’m talking about Excel.

There are various engine replacements that promise higher speeds, but those have the disadvantage that they’re still not fast enough for really heavy calculations. Another option is to use the much faster LibreOffice, but companies prefer ribbons over new software. The last option is to offer performance-optimised modules for the problematic parts. We created a demo a few years ago and revived it recently. Continue reading “Accelerating an Excel Sheet with OpenCL”

Online Tutorials are here

Online training

We’re going online with our presentations and tutorials. This makes it easier to reach more people and makes our trainings more flexible.

We’re starting with short introductory trainings, but we have bigger plans. Keep an eye on our events (shared on Twitter, LinkedIn, this blog and the newsletter) to see what the offerings are. And you’re very welcome to join!

On 4 October (new date) there will be a free two-hour OpenCL 101. Target timezones are eastern America and Europe.

Agenda Online OpenCL 101

  • Introductions (20 minutes)
    • StreamHPC
    • GPUs and parallelism
    • OpenCL
  • By example: Getting started with OpenCL (30 minutes)
  • By example: Porting a simple program to OpenCL (30 minutes)
  • Q&A in parallel (30 minutes). Ask us any question, for instance:
    • General OpenCL.
    • OpenCL on GPUs.
    • OpenCL on FPGAs.
    • What algorithms work well with GPUs, CPUs and FPGAs.
    • StreamHPC services.
  • The next steps (5 minutes).
  • Closing words (5 minutes).

Read more here…

Tutorial server

You can already test whether the tutorial server works for you by looking around in our demo room. The tutorial itself will be in another room. Use your own name and the password “ap”.


See you soon!

How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)

Hampstead flooding

How water moves through an area, given a certain pace of inflow, can be fully simulated. We got a request to make such a simulation faster, as moderate simulations already took too much time. As the customer wanted more detail, larger areas and more alternative situations computed, the current performance did not suffice.

The code had already been ported to MPI to scale to 8 cores, and this code was used as the base for creating our optimised GPU code. Using a single GPU we managed to get a 44 to 58 times speedup over a single CPU core, which is 5 to 7 times faster than MPI on 8 to 32 CPU cores.

For larger experiments we could increase the performance advantage over MPI-code from 7 times to a total of 35 times, using multiple GPUs.

We solved both the weak-scaling problem and the mapping on GPUs

If you add the 9x speedup of the initial performance optimisation, the total is over 2600x: what could be done in a year can now be done in 3.5 hours. This clearly shows the importance of software performance engineering. Most code already has some optimisations applied (just like here), and a 5 to 7 times speedup is quite achievable.

Read below for some more details. Continue reading “How we sped up a flooding simulation 35 times (from 32-core CPU to multi-GPU)”

Get ready for conversions of large-scale CUDA software to AMD hardware

In the past years we have been translating several types of software to AMD, targeting OpenCL (and HSA). The main problem was that manual porting limits the size of the code-base that can be ported.

Luckily there is a new tool in town. AMD now offers HIP, which converts over 95% of CUDA code such that it works on both AMD and NVIDIA hardware. The remaining 5% is solving the ambiguity problems that arise when CUDA is used on non-NVIDIA GPUs. Once the CUDA code has been translated successfully, the software can run on both NVIDIA and AMD hardware without problems.
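
To give a feeling for what such a conversion produces, below is a hedged sketch of a vector-add after “hipification”: the kernel itself is unchanged and the cuda* host calls have become hip* calls. Exact macro and API details may differ per HIP version:

```cpp
#include <hip/hip_runtime.h>

// After conversion, __global__ kernels are typically left unchanged.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // cudaMalloc -> hipMalloc, cudaFree -> hipFree, and so on.
    hipMalloc(&a, n * sizeof(float));
    hipMalloc(&b, n * sizeof(float));
    hipMalloc(&c, n * sizeof(float));

    // The CUDA <<<grid, block>>> launch syntax becomes a macro call.
    hipLaunchKernelGGL(vec_add, dim3(n / 256), dim3(256), 0, 0, a, b, c, n);
    hipDeviceSynchronize();

    hipFree(a); hipFree(b); hipFree(c);
    return 0;
}
```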

The target group of HIP is companies with older clusters, who don’t want to pay the premium prices for NVIDIA’s latest offerings. Replacing a single server holding 4 Tesla K20 GPUs of 3.5 TFLOPS each with 3 dual-GPU FirePro S9300X2 cards of 11 TFLOPS each will give a huge performance boost at a competitive price.

The costs of making CUDA work on AMD hardware are easily paid for by the price difference when upgrading a GPU cluster.

Continue reading “Get ready for conversions of large-scale CUDA software to AMD hardware”

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

The information you find everywhere: on Linux, the current “radeon” and “fglrx” drivers are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new, open source drivers ready, fglrx is now deprecated and will not get updates (or only very late ones). I therefore get straight to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not be upgraded when you go to 14.04.5. CentOS/RedHat has such old kernels that there currently is no issue. Fedora users simply have a problem, as they are already moving towards kernel 4.8.

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support”

CUDA Compute Capability 6.1 Features in OpenCL 2.0

On the CUDA page of Wikipedia there is a table with compute capabilities, as shown below. While double-checking support for AMD Fiji GPUs (like the Radeon Nano and FirePro S9300X2) I got curious how much of that support is still missing in OpenCL. For Fiji it looks like all features are fully supported. For OpenCL 2.0, read on.

CUDA features per Compute Capability on Wikipedia

Continue reading “CUDA Compute Capability 6.1 Features in OpenCL 2.0”

Rant: No surprise there’s a shortage of good GPU-developers

Another Monday, yet another graphics API

We could read here that software is critical for HPC – a market where accelerators/GPUs are used a lot. So all we need to do is better support all GPU-developers as a whole, right? Unfortunately something else is happening.

Each big corporation wants to have their own developers, not to be shared with the competition.

Microsoft was quite early in this with Ballmer’s “developers, developers, developers” meme. A tip of the hat to them for acting on the shortage, a shake of the head for how they acted. For .NET it was a success to steal away developers from Java and C/C++, increasing the market share of Windows Server, SQL Server and more.

GPU-vendors want that too – they find growing the cake together too slow – so best is to start the fight while the cake is tiny. Continue reading “Rant: No surprise there’s a shortage of good GPU-developers”

4-day training on OpenCL-on-FPGAs, 24-28 October, Amsterdam

From 24 to 28 October we give a 4-day training on OpenCL-on-FPGAs using Altera hardware. The learning goals are correctly writing OpenCL code for FPGAs, learning to work with Quartus and understanding the important optimisation techniques.

The total costs are €2760 excluding VAT for the whole week (2 + 2 days of training, one pause day), including a tour in Amsterdam on Wednesday.

See the special event-page for more information.

Porting code that uses random numbers


When we port software to the GPU or FPGA, testability is very important. Part of making the code testable is getting its functionality fully under control. And you guessed it already: run-time generated random numbers need special attention.

In a selection of past projects, random numbers were generated anew on every run. Statistically the simulations were more correct that way, but it made it impossible to be 100% sure the ported code was functionally correct. This is because two variations are introduced: one due to the numbers being different and one due to differences in code and hardware.

Even if the combined error variations are within the given limits, the two code-bases can have unnoticed, different functionality. On top of that, it is hard to keep further optimisations under control, as these can lower the precision.

When porting, the stochastic correctness of the simulations is less important. Predictable outcomes should be leading during the port.
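
One way to achieve that predictability is to record the random sequence once and replay it in the ported code, so any difference in output is due to the port and not the numbers. A minimal sketch (the function names are ours, for illustration):

```cpp
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

// Record: run once in the original code, with a fixed seed.
std::vector<float> record(std::size_t count, const char* path) {
    std::mt19937 rng(12345);                        // fixed seed = reproducible
    std::uniform_real_distribution<float> dist(0.f, 1.f);
    std::vector<float> numbers(count);
    for (auto& x : numbers) x = dist(rng);
    FILE* f = std::fopen(path, "wb");               // error checks omitted
    std::fwrite(numbers.data(), sizeof(float), count, f);
    std::fclose(f);
    return numbers;
}

// Replay: the ported (GPU/FPGA) code reads back exactly the same sequence.
std::vector<float> replay(std::size_t count, const char* path) {
    std::vector<float> numbers(count);
    FILE* f = std::fopen(path, "rb");
    std::fread(numbers.data(), sizeof(float), count, f);
    std::fclose(f);
    return numbers;
}

int main() {
    auto original = record(1000, "random.bin");
    auto ported   = replay(1000, "random.bin");
    return original == ported ? 0 : 1;              // sequences must match
}
```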

Below are some tips we gave to these customers, and I hope they’re useful for you too. If you have code to be ported, these preparations make the process quicker and more correct.

If you want to know more about the correctness of RNGs themselves, we discussed earlier this year that generating good random numbers on GPUs is not obvious.

Continue reading “Porting code that uses random numbers”

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 2)

In the first part of our two-part blog series, we have discussed how parallel computing applications can best use pseudo-random number generators (PRNGs) so as to benefit from parallel computing speedups, without negatively impacting the statistical properties of the random numbers generated. We have argued that index-based PRNGs (e.g., from the Random123 library), which do not maintain any state but instead take an index and a key as input and return the random number corresponding to the index in its random output sequence, provide numerous benefits in parallel programming. This week, we discuss both an application of index-based PRNGs and another important aspect of PRNGs in parallel environments: that of reproducing the same output among different parallel implementations or among a parallel and a serial implementation. We focus on the latter and consider reproducibility for the purpose of verification, which – as we have stated last week – is an important matter for our customers.

Reproducibility

Reproducibility may be simple if random numbers are only used in the initialization phase of the application. In this case, it may be sensible to record the set of random numbers generated in the serial code, write it to a file and use this file as input to a table-based approach for random-number generation in the parallel code (as discussed in Part 1 of our blog). Indeed, this is a convenient approach for both us and our customers as we do not have to deal with PRNG implementation internals and our customer can independently verify the correctness of the parallel implementation. We have used this for several of our projects.

More complex scenarios such as stochastic simulations and Monte Carlo applications require a continuous, parallel generation of random numbers. In this case, reproducibility at first seems challenging considering the concurrent and thus unpredictable order in which parallel code may invoke PRNGs. Indeed, it may be impossible to achieve when using traditional, seed-based PRNGs. This is because we may need to access entries in the PRNG’s output sequence in an arbitrary order rather than sequentially. Essentially, we need to define a one-to-one mapping between random numbers generated in the serial code and those used in the parallel version and then be able to ask the PRNG for the first entry, the second entry, and so on, in both implementations. The random-access capability is exactly what index-based PRNGs provide. The main challenge lies in the definition of the mapping.

We may consider two options for defining the one-to-one mapping between random numbers generated in a serial code and those created in a parallel implementation. The summary of our findings is this:

  • Serial-to-parallel mapping: This amounts to an emulation of the serial random number generation in the parallel code. It generally requires minimal or no changes in the serial, original code, but may be increasingly difficult or impossible to define as the complexity of the code grows.
  • Parallel-to-serial mapping: This emulates the parallel random number generation in the serial code. It is usually easier to define than the serial-to-parallel mapping but requires deeper changes in the serial code.

We explain below what we mean by that.

Consider a simple scenario where we have some serial code that generates a new set of random numbers in each iteration of a loop using a traditional seed-based PRNG. A parallel implementation may operate by executing each iteration via a single work item. Using traditional PRNGs, we may equip each work item with its own PRNG seed.

  • Using a serial-to-parallel mapping, we may then record the PRNG state at each iteration of the loop in the serial code and initialize the PRNG state in each work item of the parallel code similarly to a table-based approach – this does not require any changes in the serial code (other than temporarily for recording the random numbers). Alternatively, we may replace the PRNG in the serial code with an index-based one and use a single, static counter for the PRNG, which we increment with each invocation of the PRNG – this is a minimal change in the serial code. We reflect the index-based PRNG in the parallel code and initialize each work item’s counter with w·n, where w is the work item index and n denotes the size of the set of random numbers created per loop. In both cases, the serial and the parallel code should produce identical outputs.
  • For a parallel-to-serial mapping, it is easier to assume that the parallel implementation uses index-based PRNGs. Let’s assume that each work item uses indices of the form w·n+j for the j-th item in the set of random numbers generated. Then the serial code is adapted to use the same PRNG and computes indices as WI(i)·n+j, where WI(i) returns the work item index corresponding to the i-th iteration of the loop. Again, the serial and parallel implementations should then produce identical outputs.

A proper mapping becomes increasingly difficult to define as the code complexity grows. However, it is usually easier to map the parallel random number generation to the serial code. Indeed, consider the case where some random numbers are generated in a data-dependent manner, i.e., only if some condition on the input data is fulfilled. Then it is impossible to give a predefined mapping from the serial to the parallel code. We therefore prefer the serial-to-parallel mapping whenever it is easy to define (as we don’t risk, or at least minimise the risk of, introducing bugs in the original code by changing the PRNG generation), but resort to the parallel-to-serial mapping for more difficult cases.
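
The sketch below illustrates the mapping idea with a toy index-based generator – a simple stateless mixer standing in for a proper index-based PRNG such as Philox from Random123. Whatever order the “work-items” are processed in, index w·n+j always yields the same number:

```cpp
#include <cstdint>
#include <cstdio>

// Toy stateless PRNG: maps (index, key) -> number. A stand-in for a real
// index-based PRNG; do not use this mixer where statistical quality matters.
uint64_t toy_prng(uint64_t index, uint64_t key) {
    uint64_t z = index * 0x9E3779B97F4A7C15ULL + key;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

int main() {
    const uint64_t key = 42;   // acts like the seed
    const int W = 4, N = 3;    // W work-items, N numbers per work-item

    // "Serial" order: iterate the work-items one after another.
    uint64_t serial[W][N];
    for (int w = 0; w < W; ++w)
        for (int j = 0; j < N; ++j)
            serial[w][j] = toy_prng(uint64_t(w) * N + j, key);

    // "Parallel" order: any work-item may run first; the index w*N+j
    // still selects the same entry of the output sequence.
    for (int w = W - 1; w >= 0; --w)   // reversed on purpose
        for (int j = 0; j < N; ++j)
            if (serial[w][j] != toy_prng(uint64_t(w) * N + j, key))
                return 1;              // would signal a mapping bug

    puts("serial and parallel orders reproduce the same numbers");
    return 0;
}
```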

Strengthen our team as a remote worker (freelancer)

In the past year we’ve been working on more internal projects, and we’re therefore seeking strong GPU-coders (good OpenCL experience required) worldwide. This way you can combine staying close to your family with working with advanced technologies. You will be on the newly formed international team.

Do understand that we have extra requirements for freelancers:

  • You have a personality for working independently.
  • You have your own computer with an OpenCL-capable GPU.
  • You have good internet (for doing remote access).

We offer a job in a well-known OpenCL-company with various interesting projects. You can improve your OpenCL skills and work with various hardware (modern GPUs, embedded processors, FPGAs and more).

Our hiring-procedure is as follows:

  • You send a CV and tell us why you are the perfect candidate.
  • After that you are invited for a longer online test, where you show your skills in C/C++ and algorithms. You will receive a PDF with useful feedback. (3 hours)
  • We send you a GPU assignment. You need to pick the right optimisations, code them and explain your decisions in detail. (Hopefully under 30 minutes)
  • If all goes well, you’ll have a videochat on personal and practical matters. You can also ask us anything, to find out if we fit you. (Around 1 hour)
  • If you and the company are a fit, then you’ll go to the technical round. (About 3 hours)
  • Made it to here? Expect a job-offer.

We’re looking forward to your application.

Apply for a job as OpenCL expert (freelancer) now!

Performance can be measured as Throughput, Latency or Processor Utilisation

Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, we might not be able to answer that question: it depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and which metrics are used. I focus on two types of data-utilisation:

  • Transfers. Data-movements through cables, interconnects, etc.
  • Processors. Data-processing, with data in and data out.

Both are important when selecting the right hardware. When we help our customers select the best hardware for their software, an important part of the advice is based on these metrics.

Transfer utilisation: Throughput

How many bytes get processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively, items per second is used when relative speed is discussed. A related word is bandwidth, which describes the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch process, or that results are given continuously. The throughput is the same, but the so-called latency is very different.
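
As a minimal illustration of the metric, this host-side sketch times a large copy and reports GB/s; a real transfer benchmark would add warm-up runs and, for GPU transfers, pinned memory:

```cpp
#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const std::size_t bytes = std::size_t(1) << 28;   // 256 MiB
    std::vector<char> src(bytes, 1), dst(bytes);

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst.data(), src.data(), bytes);       // the "transfer"
    auto t1 = std::chrono::steady_clock::now();

    double seconds = std::chrono::duration<double>(t1 - t0).count();
    printf("throughput: %.2f GB/s\n", bytes / seconds / 1e9);
    return 0;
}
```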

Transfer utilisation: Latency

What is the time between offering the data and getting the results? Or: what is the reaction time? It is measured in time, often nanoseconds (ns, a billionth of a second), microseconds (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, it is still called latency, but more often it’s called “processing time”.

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

  1. Reaction time: hardware/software noticing there is a job
  2. Transport time: it takes time to copy data, especially when we talk GBs
  3. Process time: computing the data also takes time

When latency is most important, we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency of context-switching from and to the GPU is a lot lower than with discrete GPUs).
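
Measured at the host, end-to-end latency is simply the time from offering the data until the result is back. A minimal sketch, where process() is a placeholder for the real react-transport-compute pipeline:

```cpp
#include <chrono>
#include <cstdio>

// Placeholder for the real pipeline: react, transport, process.
int process(int input) { return input * 2; }

int main() {
    auto t0 = std::chrono::steady_clock::now();
    volatile int result = process(21);     // offer data, wait for the result
    auto t1 = std::chrono::steady_clock::now();

    auto us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0);
    printf("result %d, end-to-end latency: %lld us\n",
           (int)result, (long long)us.count());
    return 0;
}
```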

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor is possibly the bottleneck of the system. The metric we use for this balance is “FLOPS per byte”. The less data that is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.


The image below shows how algorithms map onto the roofline model. You see that for many processors you need at least 4 FLOPS per byte to hit the frequency-wall; otherwise you’ll hit the bandwidth-wall.

The roofline model

This is why HBM is so important.
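
A back-of-the-envelope roofline check makes this concrete. The device figures below are assumptions for illustration (an HBM-class device with a balance of 4 FLOPS per byte): compare a kernel’s FLOPS-per-byte against that balance to predict which wall it hits:

```cpp
#include <cstdio>

int main() {
    // Assumed device figures, for illustration only:
    const double peak_flops = 4.0e12;   // 4 TFLOPS
    const double peak_bw    = 1.0e12;   // 1 TB/s (HBM-class bandwidth)
    const double balance    = peak_flops / peak_bw;   // 4 FLOPS per byte

    // Vector add c[i] = a[i] + b[i] on floats:
    // 1 FLOP per element, 12 bytes moved (two reads + one write).
    const double intensity = 1.0 / 12.0;

    printf("machine balance     : %.1f FLOPS/byte\n", balance);
    printf("vector-add intensity: %.3f FLOPS/byte -> %s\n", intensity,
           intensity < balance ? "bandwidth-limited" : "compute-limited");
    return 0;
}
```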

Processor utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but now at system level.

For FPGAs this latency can be very low (tens of nanoseconds) when the data cables are directly connected to the FPGA chip. Such FPGAs sit on a board with e.g. a network port and/or a DisplayPort port.

GPUs depend on how well they’re connected to the CPU. As this is a subject of its own, I’ll discuss it in another post.

Determining the theoretical speed of a system

A request “Make this software as fast as possible” is a lot easier (and cheaper) to solve than “Make this software as fast as possible on hardware X”. This is because there is no single fastest piece of hardware (even though vendors would make us believe so); there is only hardware that is most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.

OpenCL.org internship/externship

Our internship is exactly as the description says: a rather complex homepage which should look good on your CV (if you manage to build it).

Want to help build an important website? OpenCL.org’s components have been designed and partly built, but a lot of work still needs to be done. We’re seeking an intern (or “extern” when not in Amsterdam) who can help us build the site. This internship is not about GPUs!

To complete the tasks, the following is required:

  • Technical expertise:
    • HTML5, CSS
    • PHP
    • Javascript
    • jQuery
    • Node.js
    • Mediawiki
    • XSLT
  • Can-do mentality
  • Able to plan own work
  • Good communication-skills
  • Available for 3 to 6 months

We don’t expect you to know all the tools, so we will guide you in learning new tools and techniques. Write us an “email of interest” to info@streamhpc.com, describing what you can do and what your objectives for an internship would be.

We’re looking forward to seeing your letter!

AMD is back!

For years we have been complaining on this blog about what AMD was lacking and what needed to be improved. And as you might have concluded from the title of this blogpost, there has been a lot of progress.

AMD is back! It will all come together in the beginning of 2017, but you’ll see a lot of progress already in the coming weeks and months.

AMD quietly recognised and solved various totally new problems in HPC, becoming the hidden innovator everybody needed.

This blog post gives an overview of how AMD managed to come back and what it took to get there. Their market cap supports it, as you can see.

AMD’s market cap is back at 2012 levels (source)

Continue reading “AMD is back!”