Funded PhD internships at StreamHPC

We have several wishes for 2017, and two of them are to create code for the open source community. Luckily HiPEAC is interested in more collaboration between academia and industry and therefore funds PhD internships. There are 81 industrial PhD internships available, and two of them are at StreamHPC.

What is this industrial PhD internship, you may ask? From the HiPEAC homepage:

The HiPEAC Industrial PhD Internship Programme offers PhD students a unique opportunity to experience the industrial research environment and to work on R&D projects solving real problems. To date the internship programme has resulted in several joint paper publications, patent applications and many students have been hired by the companies after completion of their PhDs.

The internships cover a 3-month period. Students should indicate when they will be available for an internship during 2017. When you apply for one of the internships, you must update your profile page, including a link to your CV (preferably in PDF format).

Every intern receives €55 per day (about €5,000 for the 3 months) plus travel expenses (maximum €500). The main goal is to gain experience: even if you don’t get a job after the internship, you tap into our network.


IWOCL 2019

On Monday 13 May 2019 at 09:30 the latest edition of IWOCL starts, not counting any spontaneously organised pre-events. It is the biggest OpenCL-focused event, covering everything that makes GPGPU, DSP and FPGA programmers enthusiastic.

New since last year is that it has also become a more interesting place for CUDA developers who want to learn and discuss new GPU-programming techniques, as Nvidia’s GTC has shifted towards AI after years of being mostly about GPGPU.

Since this is the last week of early-bird pricing, now is a good time to buy your ticket and book the trip.


Basic Concepts: online kernel compiling

Typos are a programmer’s worst nightmare, as they are bad for concentration: the code in your head is not the same as the code on the screen, and the time lost on them has little to do with the actual problem solving. Code highlighting in the IDE helps, but it is better to use the actual OpenCL compiler without running your whole software: an online OpenCL compiler. In short, it is just an OpenCL program that takes a variable kernel as input, and thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source – so a C compiler is needed. A minimal sketch of the core of such a tool follows the list below.

  • CLCC. It needs the boost libraries, cmake and make to build. Works on Windows, OSX and Linux (possibly needs some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.
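To make this concrete, here is a minimal sketch in plain C of the core of such a tool, using only the standard OpenCL host API: it reads a kernel file, lets the compiler of the first OpenCL device found try to build it, and prints the build log on failure. Error handling is stripped to the bare minimum.

    /* clcheck.c – build with: gcc clcheck.c -lOpenCL -o clcheck */
    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) { fprintf(stderr, "usage: %s kernel.cl\n", argv[0]); return 1; }

        /* read the kernel source into memory */
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }
        fseek(f, 0, SEEK_END); long size = ftell(f); fseek(f, 0, SEEK_SET);
        char *src = malloc(size + 1);
        fread(src, 1, size, f); src[size] = '\0'; fclose(f);

        /* pick the first platform and device found */
        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);

        /* try to build, and show the compiler's build log on failure */
        cl_program prog = clCreateProgramWithSource(ctx, 1, (const char **)&src, NULL, NULL);
        if (clBuildProgram(prog, 1, &device, "", NULL, NULL) != CL_SUCCESS) {
            size_t len;
            clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
            char *log = malloc(len);
            clGetProgramBuildInfo(prog, device, CL_PROGRAM_BUILD_LOG, len, log, NULL);
            fprintf(stderr, "%s\n", log);
            return 1;
        }
        printf("compiled OK\n");
        return 0;
    }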


Do you have our GPU DNA?

This was the first question, to warm up. Python programmers are often users of GPU libraries, not the builders of those libraries.

In January 2019 I gave a talk about culture in the company, which I want to share with you. It was intended to trigger discussion about what environment fits whom, and examples from other companies were given. The nice part was that it became clearer that the culture of a company like CodePlay is very alike ours, except that they work on different things (compilers). The same goes for departments of larger companies we work with or know well.

Important: all answers are based on what my colleagues replied. Most of us are cat people, for instance, but I wouldn’t say that defines a GPU developer. I hope it still gives you an understanding of our perspective on what defines a GPU dev in just a few minutes, while also giving you more than enough to think about.


PDFs of Monday 16 April

By exception, another PDF-Monday.

OpenCL vs. OpenMP: A Programmability Debate. The one moment OpenCL and the other moment OpenMP produces faster code. From the conclusion: “OpenMP is more productive, while OpenCL is portable for a larger class of devices. Performance-wise, we have found a large variety of ratios between the two solutions, depending on the application, dataset sizes, compilers, and architectures.”

Improving Performance of OpenCL on CPUs. Focusing on how to optimise OpenCL. From the abstract: “First, we present a static analysis and an accompanying optimization to exclude code regions from control-flow to data-flow conversion, which is the commonly used technique to leverage vector instruction sets. Second, we present a novel technique to implement barrier synchronization.”

Variants of Mersenne Twister Suitable for Graphic Processors. Source-code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MTGP/

Accelerating the FDTD method using SSE and GPUs. “The Finite-Difference Time-Domain (FDTD) method is a computational technique for modelling the behaviour of electromagnetic waves in 3D space”. This is a project plan, but it describes the theory quite well.

Tutorials

During our courses/trainings we will teach you the best of what you can find here.

We try to keep the following information as complete as possible, so please contact us if something is missing.

Learning OpenCL


OpenCL Optimisation guides

Not available (yet):

  • Imagination PowerVR
  • Qualcomm Adreno
  • Xilinx FPGAs


University courses

OpenCL-based GPU-programming courses

Architectures

Videos


Cases/Studies


WebCL

WebCL is a standard-to-be for OpenCL in the browser. Currently there are a few implementations, while Khronos is working on the official standard. WebCL is available for Firefox on Linux32, Windows32 and Windows64 (by Nokia), and for Safari on OSX (by Samsung); a Node.js implementation was made by Motorola. Examples made for one implementation will probably not work on another.

Tutorials:

Check Khronos’ WebCL page for more resources.

C/C++

Basic knowledge of C is needed to understand how to write kernels; also, many tutorials are in C++.
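As a first taste, this is the classic example kernel – element-wise vector addition in OpenCL C (a C dialect), where each work item computes one element of the result:

    /* each work item handles one index of the arrays */
    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *c)
    {
        size_t i = get_global_id(0);
        c[i] = a[i] + b[i];
    }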


Basic OpenGL

Getting a grasp of OpenGL has advantages: techniques for faster memory operations in OpenGL have equivalents in OpenCL, which is a good reason to read up on the subject.


AMD gets into Machine Intelligence with “MI” range of hardware and software

Always good to have a share out of that curve.

In June we wrote “AMD is back!“; this is one of the follow-up posts with more details in a specific direction. This post is about AMD specifically targeting machine learning with its MI (= Machine Intelligence) range of hardware and software.

With all the news around AMD’s new processors Ryzen (CPU) and VEGA (GPU), it became apparent that AMD wants a good share of the Deep Learning market.

And they seem to succeed. Here is the current status.

Hardware: 25 TFLOPS @ 16-bit

Recently released is the “Radeon Instinct” series, which focuses purely on compute. How AMD’s new naming is organised will be discussed in a separate blog post.

WebCL Widget for WordPress

See the widget at the right, showing whether your browser and computer support WebCL?

It is available under the GPL 2.0 license and is based on code from WebCL@NokiaResearch (thanks guys for your great Firefox plugin!).

Download it from WordPress.org and unzip it into /wp-content/plugins/. Or (better), search for a new plugin: “WebCL”. Feedback can be given in the comments.

I’d like to hear what features you would like to see in the next version.


AMD OpenCL Presentation as OpenDocument

You remember AMD’s OpenCL University Kit? It was made for universities and completely written in PPTX. (For people at university: PPTX is an undocumented document format which claims to be open and actually works well with the editor/viewer of only one vendor.) So I took the liberty to convert all documents to ODF, so anybody can open them.

Download it here: AMD OpenCL University Kit as ODF.

It has 13 chapters, covering all the basics you need to know for further study. Say “thanks AMD” and enjoy!

The 13 application areas where OpenCL and CUDA can be used

Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need for auto-tuning.

Which algorithms map best to which accelerator? In other words: what kinds of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group at Virginia Tech took a close look at which types of algorithms are a good fit for vector processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here at StreamHPC, as it gives a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He called these algorithmic methods “dwarves”. With six more application areas in which GPUs and other vector-accelerated processors do well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.


Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK

Update: we are very sorry to tell that, due to a deadline in a project, we were forced to cancel Vincent’s talk.

StreamHPC will be at Mosaic3DX in Cambridge, UK, on 30+31 October. The brand-new conference managed to get big names on board, amongst which I’m happy to be. Mosaic3DX describes itself as:

an international event comprising a conference, an exhibition, and opportunities for networking. Our intended audience are users as well as developers of Imaging, Visualisation, and 3D Digital Graphics systems. This includes researchers in Science and Engineering subjects, Digital Artists, as well as Software Developers in different industries.


GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA

Wrong! Contrary to what you see around (on slides like these), AMD and Intel also have support for RDMA.

A while ago I found the slide at the right, claiming that AMD did not have any direct GPU-GPU communication. Several sources told me otherwise, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network cards, SSDs and FPGAs), but not much information is to be found on PCI+SDI. More often the term RDMA is used: Remote Direct Memory Access (wikipedia).

Questions I try to answer:

  • Which server-grade GPUs support direct GPU-GPU communication when using OpenCL?
  • Which other characteristics, besides direct GPU-GPU, GPU-FPGA and GPU-NIC communication, are interesting for OpenCL devs?
  • How do you code such fast communication? (See the sketch below.)

Enjoy reading!
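To give an idea of what such code looks like on AMD hardware, below is a hedged sketch built around the cl_amd_bus_addressable_memory extension. The flow follows our reading of the extension specification – treat the names and details as assumptions to verify against your SDK headers, not as copy-paste-ready code.

    #include <CL/cl.h>
    #include <CL/cl_ext.h>

    /* Function-pointer type for the extension entry point; the signature
     * follows the cl_amd_bus_addressable_memory spec (verify in cl_ext.h). */
    typedef cl_int (*makeResident_fn)(cl_command_queue, cl_uint, cl_mem *, cl_bool,
                                      cl_bus_address_amd *, cl_uint,
                                      const cl_event *, cl_event *);

    /* Sketch: expose a 1 MB OpenCL buffer on the PCIe bus, so a peer device
     * (GPU, FPGA or NIC) can DMA straight into GPU memory. Assumes platform,
     * ctx and queue exist and the device lists the extension. */
    void sketch_directgma(cl_platform_id platform, cl_context ctx, cl_command_queue queue)
    {
        cl_int err;
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_BUS_ADDRESSABLE_AMD, 1 << 20, NULL, &err);

        makeResident_fn makeResident = (makeResident_fn)
            clGetExtensionFunctionAddressForPlatform(platform, "clEnqueueMakeBuffersResidentAMD");

        /* pin the buffer and retrieve its bus address */
        cl_bus_address_amd addr;
        makeResident(queue, 1, &buf, CL_TRUE, &addr, 0, NULL, NULL);

        /* addr.surface_bus_address is now handed to the peer device's driver,
           which uses it as DMA target - no host round-trip involved */
    }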

Work we do

We help our customers get faster, more responsive and/or more precise software. But what does that mean? What is it that we do here in Amsterdam?

Our work is under NDA for a large part, so unfortunately we cannot share all details of all the work we’re very proud of.

Below is a selection of blog posts discussing our demos and Github-links showing our work and open-source software.

Projects

Note that various GPU and CPU projects do contain low-level code like PTX, AMDIL and assembly, but this is minimal and used only by exception.

Using accelerated code, visually exaggerated
  • Porting GROMACS, OpenMM, AMBER and more (active project) to supercomputers running AMD MI100 GPUs. While we were busy, we optimized some code that also runs faster on Nvidia GPUs, so the comparisons between Nvidia and AMD would be fair. If you run one of these on your local supercomputer – you’re welcome.
  • Building the Khronos OpenCL SDK [OpenCL, C, C++]. To be published.
  • Speeding up pyPasWAS 3-5x [C, Python, OpenCL]. We claimed that we could speed up this open-source software to do DNA/RNA/protein sequence alignment and trimming, and so we did.
  • Building multiple libraries for AMD GPUs. Several foundational libraries on the ROCm GitHub were built by us, and we still maintain them.
    • rocRAND [HIP, C++]. The world’s fastest random number generator (or second, depending on Nvidia’s response) is built for AMD GPUs, and it’s open source. With random numbers generated at several hundreds of gigabytes per second, the library makes it possible to speed up existing code numerous times. The code is often faster than Nvidia’s cuRAND and is therefore the preferred library to be used on any high-end GPU.
    • rocThrust – AMD’s optimized version of Thrust [HIP, C++]. Highly optimized for CDNA GPUs. Lots of CUDA software is Thrust-based, and now has no vendor lock-in anymore.
    • hipCUB – AMD’s optimized version of CUB [HIP, C++]. Highly optimized for CDNA GPUs. Now porting CUB-based software to AMD is a lot simpler. Both rocThrust and hipCUB share a library rocPRIM which unites many of the GPU-primitives.
  • Porting Gromacs from CUDA to OpenCL [CUDA, OpenCL, C, C++]. Until we ported the simulation software at the end of 2014, it had been CUDA-only. It took several man-months to manually port all the code. You can now download the source, build it and run it on AMD/Intel hardware – see here for more info. All is open source, so you can see our code.
  • Porting Manchester’s UNIFAC to OpenCL@XeonPhi [OpenCL, C++, MPI]. Even though the XeonPhi Knights Corner is not a very performant accelerator, we managed to get a 160x speedup over multi-threaded CPU code. Most of the speedup is due to clever code optimizations and less due to low-level optimizations.
  • Porting a set of ADSL-algorithms to an embedded special purpose GPU [OpenCL, C, C++]. Allowing central ADSL-routers in large buildings to handle modern ADSL-protocols.
  • Optimizing and extending the main image processing framework of a large photo hosting platform [CUDA, C++].
  • Flooding simulation [OpenCL, C++, MPI]. Software that simulates the flooding of land, which we ported to multi-GPU with OpenCL, getting a 35x speedup over the MPI version.

Demos

  • Cartoonizer. The webcam or video stream is “cartoonized” using several image filters on an FPGA using OpenCL.
  • Android video filter demo. Real-time Android-app, where the webcam stream has several real-time OpenGL filters applied to make it look like an old movie. This was a proof-of-concept to show we could apply our knowledge to Android and OpenGL.
  • Speeding up Excel. A heavy financial algorithm is offloaded to the GPU, resulting in a big speedup. Most Excel sheets are slow because they’re bigger than what Excel was designed for, so unfortunately offloading often does not help anymore by the time Excel gets really too slow.

Do you need a secret weapon too? We like to work together with you, to build fast software together. Get in touch to discuss your needs and goals.

AMD Hawaii power-management fix on Linux

The new Hawaii-based GPUs from AMD (Radeon R9 2xx, FirePro W9100 and FirePro S9150) have a lot of improvements, one being the new OverDrive 6 (AMD’s version of NVIDIA GPU Boost). The problem is that it’s not yet supported in the Linux drivers, so you will get too low performance – this will probably be solved in the next driver version. Luckily there is od6config, made by Jeremi M Gosney.

Do the below steps to get the GPU at normal speed.

  1. Download the zip or tar.gz from http://epixoip.github.io/od6config/ and unpack.
  2. Go to the directory where you unpacked the archive.
  3. run:
    make
  4. run:
    sudo make install
  5. check if it’s needed to fix the power management:
    od6config --get clocks,temp,fan
  6. if the values are too low, run:
    od6config --autofix --set power=10
  7. check if it worked:
    od6config --get clocks,temp,fan

Only OverDrive6 devices are set, devices using OverDrive5 will be ignored.

The PowerTune value of 10 was what we found convenient for us, but you might find better values for your case. There are several more options, which are described on the homepage of od6config. Note that you need to run "od6config --autofix --set power=10" on each reboot.

Remember it’s third-party software, so no guarantees to you and no “you killed my GPU” to us.

The OpenCL event of the year: IWOCL 2014 – Bristol, UK, 12 & 13 May

For the second time, Khronos has supported and organised the International Workshop on OpenCL (IWOCL, pronounced “eye-wok-ul”). Last year the event took place at Georgia Tech in Atlanta, Georgia, United States. This year the event will be held in Europe: at Bristol University in Bristol, England, UK.

IWOCL 2013 Presentations

Last year there was a varied programme:

  • Porting a Commercial Application to OpenCL: A Case Study
  • Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor
  • Parallelization of the Shortest Path Graph Kernel on the GPU
  • OpenCL-based Approach to Heterogeneous Parallel TSP Optimization
  • clMAGMA: High Performance Dense Linear Algebra with OpenCL
  • Multi-Architecture ISA-Level Simulation of OpenCL
  • Optimizing OpenCL Applications on the Intel Xeon Phi

You can see and download these presentations here. This year the organisation aims to offer an equally exciting programme.

Workshop means it’s an active event

It’s all about sharing, but not just by letting you sit and listen. Below you’ll find some of the options.

Present your work

Did you use OpenCL in your software or research? You are very welcome to present your experience and results. IWOCL is the premier forum for the presentation and discussion of new designs, trends, algorithms, programming models, software, tools and ideas for OpenCL.

Abstract Submission Deadline: Friday 31 January, 2014

It can be in the form of:

  • Research paper
  • Technical presentation
  • Workshop or tutorial
  • Poster

(StreamHPC’s Vincent Hindriksen is on the Conference Sessions Committee)

Communicate with the workgroup

Khronos booth at SC13 – some you will see again at IWOCL

The OpenCL workgroup likes to communicate with OpenCL’s users. IWOCL provides a formal channel for community feedback to the Khronos Group’s OpenCL workgroup. This is one of the best moments to be heard, discuss a hack/bug or share a great idea that should be in the next version of OpenCL.

Meet OpenCL developers and enthusiasts

During the breaks, social events and presentations, you can discuss all your ideas and thoughts on on-topic and off-topic subjects, or join existing conversations.

If you are new to compute acceleration, you’ll find many people who are willing to explain what it does and add their personal view.

Test-drive software

We will bring some hardware, on which you can test your kernels. (We’ll put more info about this later!)

Sponsor and present your product

There will be booths available for the sponsors, where you can show your product to the public.

Stay up to date on the event

We will try to keep you up to date as much as possible, and IWOCL has its own channels to keep you informed:

We’ll post a link here once tickets go on sale.

Let others know you plan to be at the event by saying hi in the comments.

Hope to see you there!

Porting Manchester’s UNIFAC to OpenCL@XeonPhi: 160x speedup

Example of modelled versus measured water activity (‘effective’ concentration) for a highly detailed organic chemical representation, based on continental studies using UNIFAC

As we cannot share the performance results of most of our commercial projects because they contain sensitive data, we were happy that Dr. David Topping from the University of Manchester was so kind as to allow us to share the data of the UNIFAC project. The goal of this project was simple: port the UNIFAC algorithm to the Intel XeonPhi using OpenCL. We got a total speedup of 485x: 3.0x for going from single-core to multi-core CPU, 53.9x for algorithmic and low-level improvements plus a new memory-layout design, and 3.0x for using the XeonPhi via OpenCL. To remain fair, we used the 160x speedup over multi-core CPU in the title, not the one over serial code.

Random Numbers in Parallel Computing: Generation and Reproducibility (Part 1)

Random numbers are important elements in stochastic simulations, but they also show up in machine learning and in applications of Monte Carlo methods, such as computational finance, fluid dynamics and molecular dynamics. These are classical fields in high-performance computing, in which StreamHPC has experience.

A common problem when porting traditional software in these fields to parallel computing environments is the generation and reproducibility of random numbers. Questions that arise are:

  • Performance: How can we efficiently obtain random numbers when they are classically generated in a serial fashion?
  • Quality: How can we make sure that random numbers generated in a parallel environment still fulfil statistical randomness requirements?
  • Verification: How can we be sure that the parallel implementation is correct?

We consider verification from the viewpoint of producing identical results among different software implementations. This is often an important matter for our customers, and we have given them guidance on how to address this issue when random numbers are involved.

In this first part of our two-part blog series, we will briefly address some common pitfalls in the generation of random numbers in parallel environments and suggest suitable random-number generation libraries for OpenCL and CUDA. In the second part – on the blog soon – we will discuss solutions for reproducibility in the presence of random numbers.

Generation

Random numbers in computer software are typically obtained via a deterministic pseudo-random number generator (PRNG) algorithm. The output of such an algorithm is not truly random but pseudo-random (i.e., it appears statistically random), though we will simply say “random” from here on. We do not consider truly random numbers, which may be derived from physical phenomena such as radioactive decay, because we want the output of a random number generator to be reproducible.

PRNGs traditionally offered to application developers fail in the parallel setting. One reason is that these algorithms usually only support the sequential generation of random numbers based on some initial (seed) value (e.g., consider the standard C rand() function), so work items on a parallel device would need to block to get exclusive access to the generator, which clearly impacts efficiency.

Some applications may require only a moderate amount of random numbers. In this case, we found it feasible to precompute the required set of random numbers and hold them in global memory. We call this the table-based approach. Other applications may need to efficiently create a huge amount of random numbers. In this case, it may be necessary to equip each work item with its own PRNG seed.

One potential problem with this approach is the use of weak PRNGs such as linear congruential generators (LCGs), which remain popular due to their speed and simplicity. In parallel settings, correlations between output sequences are aggravated and the quality of the application output may be severely affected, so LCGs should not be used at all. Another problem is the use of a small seed or a small internal PRNG state. In this case, we may expect the probability of two work items creating the same random sequence to be quite high. Indeed, if we would randomly seed via srand(), the chance is already 50% that two out of approximately 77,000 work items create entirely the same random number sequence!

So we may either need a PRNG with a larger seed space and internal state, or one with a larger state and some mechanism to slice the PRNG’s output sequence into non-overlapping “substreams”, with one substream per work item. The Mersenne Twister is highly acclaimed, but requires a memory state of approximately 2.5 KB per work item in a parallel setting, and substreams are difficult to implement. While good PRNGs with a small internal state and flexible substream support exist (e.g., MRG32k3a), there are also “index-based” PRNGs, which are often more elaborate to compute but do not maintain any state. Such state-less PRNGs take an arbitrary index and a “key” as input and return a random number corresponding to the index in their random output sequence (which depends on the key chosen). Index-based PRNGs are very useful in parallel computing environments, and we will show how we use them for reproducibility in the second part of this blog.
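To illustrate the index-based idea, here is a minimal C sketch. The mixing function is the well-known SplitMix64 finalizer, which is fine for showing the concept but is not a statistically vetted generator – in production, use a library like Random123 instead.

    #include <stdint.h>

    /* SplitMix64 finalizer: a cheap, well-mixing 64-bit hash */
    static uint64_t mix64(uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

    /* Stateless, index-based generator: the same (key, index) pair always
     * yields the same number in [0,1), no matter which work item computes
     * it or in which order. */
    double indexed_random(uint64_t key, uint64_t index) {
        return (mix64(key ^ mix64(index)) >> 11) * (1.0 / 9007199254740992.0); /* / 2^53 */
    }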

The choice of an appropriate PRNG may not be easy and ultimately depends on the application scenario. Luckily, there is choice! CUDA offers a set of PRNGs via its cuRAND library, and OpenCL applications can benefit from the clRNG library that AMD released last year. Both cuRAND and clRNG offer a state-based interface with substream support. For index-based algorithms, the Random123 library provides high-quality PRNG implementations for both OpenCL and CUDA.

So far, we have discussed how we can safely generate random numbers in the GPU and FPGA context, but we cannot control the order in which parallel, concurrent work items create random numbers. This makes it difficult to verify the parallel implementation since its output may be different from that of the serial, original code. So the question is, in the presence of random numbers, how can we easily verify that our parallel code implements not only a faithful but a correct port of the serial version? This is addressed in part two – continue reading.

Is the CPU slowly turning into a GPU?

It’s all in the plan?

Years ago I was surprised by the fact that CPUs were also programmable with OpenCL – I chose that language solely for the coolness of being able to program GPUs. It was weird at the start, but now I cannot think of a world without OpenCL working on a CPU.

But why is it important? Who cares about the 4 cores of a modern CPU? Let me first go into why CPUs mostly had 2 cores for so long. Simply put, it was very hard to write multi-threaded software that made use of all cores. Software like games did, as it needed all available resources, but even the computations in MS Excel are mostly single-threaded as of now. Multi-threading was perhaps used most for having a non-blocking user interface. Even though OpenMP was standardised 15 years ago, it took many years before the multi-threaded paradigm was used for performance. If you want to read more on this, search the web for “the CPU frequency wall”.

More interesting is what is happening now with CPUs. Both Intel and AMD are releasing CPUs with lots of cores. Intel recently released an 18-core processor (Xeon E5-2699 v3) and AMD has been offering 16-core CPUs for a longer time (Opteron 6300 series). Both have SSE and AVX, which means extra parallelism. If you don’t know what this is about exactly, read my 2011 article on how OpenCL uses SSE and AVX on the CPU.

AVX3.2

Intel now steps forward with AVX3.2 on their Skylake CPUs, while AVX3.1 is in the XeonPhi “Knights Landing” – see this rumoured roadmap.

It is 512 bits wide, which means 16 floats (or 8 doubles) fit in one vector instruction. With 16 cores, this means 256 float operations per clock-tick. Like a GPU.

The disadvantage is like that of the VLIW we had in the pre-GCN generations of AMD GPUs: one needs to fill the vector instructions to get the speed-up. Also the relatively slow DDR3 memory is an issue, but lots of progress is being made there with DDR4 and stacked memory. A small example of such vector code is sketched below.
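As an illustration of what “filling the vector instructions” means, here is a small C sketch with AVX-512 intrinsics, processing 16 floats per instruction. The function is made up for illustration and needs a capable CPU plus a compiler flag like -mavx512f.

    #include <immintrin.h>

    /* c = a + b, 16 floats at a time; the scalar loop handles the tail */
    void vadd512(const float *a, const float *b, float *c, int n) {
        int i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 va = _mm512_loadu_ps(a + i);
            __m512 vb = _mm512_loadu_ps(b + i);
            _mm512_storeu_ps(c + i, _mm512_add_ps(va, vb));
        }
        for (; i < n; ++i)
            c[i] = a[i] + b[i];
    }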


So is the CPU turning into a GPU?

I’d say yes.

With AVX3.2 the CPU gets all the characteristics of a GPU, except the graphics pipeline. That means the CPU part of the CPU+GPU chip is acting more like a GPU. The funny part is that, with its scalar architecture and more complex schedulers, the GPU is slowly turning into a CPU.

In this 2012 article I discussed the marriage between the CPU and GPU. This merger will continue in many ways – a frontier where the HSA Foundation is doing great work now. So from that perspective, the CPU is transforming into a CPU-GPU; and we’ll keep calling it a CPU.

This all strengthens my belief in the future of OpenCL, as that language is prepared for both task-parallel and data-parallel programs – for both CPUs and GPUs, to say it in current terminology.

Heterogeneous Systems Architecture – memory sharing and task dispatching

Want to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read further.

Back in 2012 the goals for HSA were set high: the group tried to design a system where CPU and GPU would work together in an efficient way. In the 2013/2014 time-frame you’ll find lots of articles around the web, including on our blog, describing the capabilities of HSA. Unfortunately, most terminology changed with the 1.0 specification.

In March 2015 the HSA Foundation released the final 1.0 specification. It does not discuss hUMA (Heterogeneous Uniform Memory Access) nor hQ (Heterogeneous Queuing). These two techniques had undergone so many updates that new terminology was introduced.

In this blog post, we’ll present an updated description of the two most important problems tackled by HSA: memory sharing and task dispatching.

We’ll be tuning the below description, so feedback is always welcome – focus is on clarity, not on completeness.

What is an HSA System?

Where the original HSA goals focused more on SoCs with CPU and GPU cores, now any compute core can be used. The reason is that modern SoCs are much more complex than just a CPU and GPU – integrated DSPs and video decoders are found on many processors. HSA thus now (officially) supports truly heterogeneous architectures.

The idea is that any heterogeneous processor can be designed by the principles of HSA. This will bring down design costs and enable more exotic configurations from different vendors.

An interesting fact about the HSA specification is that it only specifies goals, not how they must be implemented. This makes it possible to implement the specification in software instead of hardware, which in turn makes it possible to upgrade older hardware to HSA.

Why is HSA important?

A simple question: “will there be more CPUs with an embedded GPU, or more discrete GPUs?”. A simple answer: “there are already more integrated GPUs than discrete ones”. HSA defines those chips with mixed processors.

CPUs with embedded GPUs used to be not much more than the discrete GPUs with shared memory we know from cheap laptops in the 00’s. When the GPU got integrated, each vendor started to create solutions for inter-processor dispatching (threading extended to heterogeneous computing), coarse-grained sharing (transferring ownership between processor units) and fine-grained sharing (atomics working across all processor units).

The HSA Foundation

Sometimes an industry makes bigger steps by competing and sometimes by collaborating

AMD recognised the need for a standard, as it wanted to avoid the problems seen when 64-bit was introduced into x86, and therefore initiated the HSA Foundation. The founding members are AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. NVidia and Intel are awkwardly absent.

Memory Sharing

HSA uses a relaxed memory model, which has full memory coherence (data guaranteed to be the same for all processes on all cores) and is pageable (subsets can be reserved by programs).

The below write-up is heavily simplified to give an overview how memory sharing is designed under HSA. If you want to know more, read chapter 5 from the HSA book.

Inter-processor memory-pointer sharing – Unified Addressing

The most important part is the unified memory model (previously referred to as “hUMA”), which makes programming the memory interactions in a heterogeneous processor with CPU, GPU and DSP cores comparable to programming a multi-core CPU.

Like other modern memory models, HSA defines various segments, including global, shared and private. A difference is that flat addressing is used. This means that each address pointer is unique: you don’t have an address 0 for private and an address 0 for global. Flat addressing simplifies optimisation work for higher-level languages. Of course you still need to be aware that each segment’s size is limited, and there will be consequences when defining larger memory chunks than fit in the segment.

When you have created a memory object and want the DSP or GPU to continue working on it, you can use the same pointers without any translation.

Inter-processor cache coherency

In HSA-systems global memory is coherent without the need for explicit cache maintenance. This means that local caches are synchronised and/or that caches are shared. For more information, read this blog from ARM.

Fine grained memory – Atomic Operations

HSA allows memory segments to be protected for atomic access. This makes it possible to have multiple threads running on different cores of different processor units, all accessing the same memory in a safe manner.

Small and large consecutive memory segments can be reserved for sharing, from very fine to coarse grained. All threads that have access to that segment are notified when atomic operations are done.

Fine Grained Shared Virtual Memory (HSA compatibility for discrete GPUs)

AMD has made some efforts to extend HSA to discrete GPUs. We’ll see the real advantages with dispatching, but it also helps create cleaner memory management.

The so-called “Fine Grained Shared Virtual Memory” makes it possible to use HSA with discrete GPUs that have HSA support. Because it’s virtual and data is continuously transferred between the GPU and the HSA processor, the performance is of course lower than with real shared memory. You can compare it to NVidia’s Unified Virtual Memory; something similar has also long been planned for OpenCL 2.0.

Dispatching

HSA defines in detail how a task gets into the queue of a worker thread. Below is an overview of how queues, threads and tasks are defined and are named under HSA.

Queueing

Before HSA 1.0 we only spoke of the “Heterogeneous Queue” (hQ). This has now been developed further into “User Mode Queues”. A User Mode Queue holds the list of tasks for a specific (group of) processor cores, resides in shared memory and is allocated at runtime.

Such a task is described in a language called the “Architected Queuing Language” (AQL), and is called an “AQL packet”.

Agents and Kernel Agents

HSA threads run on one or a group of processor cores. These threads are called “Agents” and come in two variations: normal Agents and Kernel Agents. A Kernel Agent is an Agent that has a User Mode Queue and can execute kernels that work on a segment of memory. A normal Agent doesn’t have a queue and can only execute simple tasks.

If a normal agent cannot run kernels, but can run tasks, then what can it actually do? Here are a few examples:

  • Allocate memory, or other tasks only the host can do.
  • Send back (intermediate) data to the host – for example progress indication.

If you compare to OpenCL, an agent is the host (which creates the work) and kernel agents are the kernels (which can issue new threads under OpenCL 2.0).

AQL packages: communicating dispatch tasks

There are different types of AQL (Architected Queuing Language) packets, of which these are the most important:

  • Agent dispatch packet: contains jobs for normal agents.
  • Kernel dispatch packet: contains jobs for kernel agents.
  • Vendor-specific packet: between processors of the same vendor there can be more freedoms.

In most cases, we’ll be talking about kernel dispatch packets.

The Doorbell signal: low latency dispatching

HSA dispatching is extremely fast and power-efficient thanks to the implementation of a “doorbell”. The doorbell of an agent is signalled when a new task is available, making it possible to take immediate action. A problem in OpenCL is the high dispatch time for GPUs without a doorbell – up to the millisecond range, as we have measured. For HSA-enabled GPUs the response time before a kernel starts running is in the microsecond range. A sketch of the dispatch sequence is shown below.
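For the curious, here is a rough C sketch of that dispatch sequence against the HSA 1.0 runtime API, as we read the specification. It assumes hsa_init() was already called, and that agent, kernel_object and kernarg come from agent iteration and a finalized code object; all error checking is omitted.

    #include <hsa.h>

    /* Sketch: put one AQL kernel dispatch packet in a User Mode Queue
     * and ring the doorbell. */
    void dispatch(hsa_agent_t agent, uint64_t kernel_object, void *kernarg)
    {
        hsa_queue_t *q;
        hsa_queue_create(agent, 256, HSA_QUEUE_TYPE_SINGLE,
                         NULL, NULL, UINT32_MAX, UINT32_MAX, &q);

        /* reserve a slot in the queue */
        uint64_t index = hsa_queue_load_write_index_relaxed(q);
        hsa_queue_store_write_index_relaxed(q, index + 1);

        /* fill the packet, which lives in shared memory */
        hsa_kernel_dispatch_packet_t *p =
            (hsa_kernel_dispatch_packet_t *)q->base_address + (index % q->size);
        p->setup = 1 << HSA_KERNEL_DISPATCH_PACKET_SETUP_DIMENSIONS;  /* 1D */
        p->workgroup_size_x = 256; p->workgroup_size_y = 1; p->workgroup_size_z = 1;
        p->grid_size_x = 256 * 1024; p->grid_size_y = 1; p->grid_size_z = 1;
        p->kernel_object = kernel_object;
        p->kernarg_address = kernarg;
        hsa_signal_create(1, 0, NULL, &p->completion_signal);

        /* publish the packet type last, then ring the doorbell;
           the agent can start the kernel within microseconds */
        __atomic_store_n(&p->header,
                         HSA_PACKET_TYPE_KERNEL_DISPATCH << HSA_PACKET_HEADER_TYPE,
                         __ATOMIC_RELEASE);
        hsa_signal_store_relaxed(q->doorbell_signal, index);
    }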

Context switching

Threads can move from one core to another core – the task will be removed from the current queue and added to another queue. This can even happen when the thread is in running state.

StreamHPC’s position

The solution simply works and leads to faster code – we did a large project with it last year.

It seems that almost the whole embedded-processor industry believes in it. AMD (CPU+GPU), ARM (CPU+GPU), Imagination (GPU), MediaTek, Qualcomm (GPU), Samsung and Texas Instruments (DSP) are founders. Companies like Analog Devices, CEVA, Sony, VIA, S3, Marvell and Cadence joined the club later. Important Linux players like Linaro and Canonical are also present.

The system-on-a-chip will only get more traction, and we see HSA as an enabler. Languages like OpenCL and OpenMP can be compiled down to HSA, so it just takes switching the compiler. HSA-capable software can be written in a more efficient manner, as it can now be assumed that memory is shared efficiently and that dispatching new threads is really fast.

Mega-kernel versus Micro-kernels in LuxRender (repost)

LuxRender demo-rendering

Below is a (slightly edited) repost of a blog by

I find micro-kernels an important subject, since they have clear advantages. In OpenCL 2.0 there are more possibilities to create smaller kernels. Making smaller and more focused functions is also considered good software engineering, known as “Separation of Concerns“.

For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.

OpenCL kernels in “SmallLuxGPU” (the raytracer originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge into LuxRender and the introduction of LuxRender materials, textures, light sources, etc., one of the kernels grew to the point of being a “mega-kernel”.

The major problems with mega-kernels, aside from the inability of the AMD OpenCL compiler to compile them, are the huge register usage and the very low GPU utilization. Why this happens is well explained in the paper.

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decreases from 196 (!!!) to 3-84, and the GPU utilization rises from a miserable 10% to a healthier 30%-100%.

Occupancy increases from 10% to 30% or more

The performance increase is huge on some platforms (Linux + FirePro W8100): 3.6 times.

Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with AMD compiler

Micro-kernels not only improve the performance but also address the major issues with the AMD OpenCL compiler. For the very first time since the release of the first AMD OpenCL SDK beta, I’m not aware of any scene not running on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.

Opinions crossing the table: Khronos for world peace

Pragmas are not mentioned in this old image explaining how languages stack up.

At SC16 there was a discussion between programming-language standards for heterogeneous hardware, organised by Khronos. See here for the setup of the session. It was expected to be a heated discussion, but in the end it was a good conversation with lots of learning.

The main message from each language seems to be: “Yes, we’re working on that feature”. This shows that a programming language is just like human languages: new things get named and described world-wide. It also shows the hard work that developing a language brings, as new feature requests are a constant.