AMD OpenCL coding competition

Posted by Vincent Hindriksen on 3 October 2011 with 1 Comment

The AMD OpenCL coding competition seems to be Windows 7 64bit only. So if you are on another version of Windows, OSX or (like me) on Linux, you are left behind. Of course StreamHPC supports software that just works anywhere (seriously, how hard is that nowadays?), so here are the instructions how to enter the competition when you work with Eclipse CDT. The reason why it only works with 64-bit Windows I don’t really get (but I understood it was a hint).

I focused on Linux, so it might not work with Windows XP or OSX rightaway. With little hacking, I’m sure you can change the instructions to work with i.e. Xcode or any other IDE which can import C++-projects with makefiles. Let me know if it works for you and what you changed.

Continue reading “AMD OpenCL coding competition” →

AMD OpenCL Presentation as OpenDocument

Posted by Vincent Hindriksen on 23 May 2011 with 1 Comment

You remember AMD’s OpenCL University Kit? It was for universities and completely written in PPTX. (For people who are on university: PPTX is a undocumented document-form which claims to be open and actually works well with an editor/viewer of only one vendor). So I took the freedom to convert all documents to ODF, so anybody can open them.

Download it here: AMD OpenCL University Kit as ODF.

It has 13 chapters, covering all the basics you need to know for further study. Say “thanks AMD” and enjoy!

WebCL Widget for WordPress

Posted by Vincent Hindriksen on 5 June 2013

See the widget at the right showing if your browser+computer supports WebCL?

It is available under the GPL 2.0 license and based on code from WebCL@NokiaResearch (thanks guys for your great Firefox-plugin!)

Download from WordPress.org and unzip in /wp-content/plugins/. Or (better), search for a new plugin: “WebCL”. Feedback can be given in the comments.

I’d like to get your feedback what features you would like to see in the next version.

Continue reading “WebCL Widget for WordPress” →

Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK

Posted by Vincent Hindriksen on 5 September 2013

Update: we are very sorry to tell that due to a deadline in a project we were forced to cancel Vincent’s talk.

StreamHPC will be at Mosaic3DX in Cambridge, UK, on 30+31 October. The brand new conference managed to get big names on-board, I’m happy to be amongst. Mosaic3DX describes itself as:

an international event comprising a conference, an exhibition, and opportunities for networking. Our intended audience are users as well as developers of Imaging, Visualisation, and 3D Digital Graphics systems. This includes researchers in Science and Engineering subjects, Digital Artists, as well as Software Developers in different industries.

Continue reading “Cancelled: StreamHPC at Mosaic3DX in Cambridge, UK” →

Big Data

Big_Bang_Data_exhibit_at_CCCB_17 Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

capture, data-curation & data-management,
analysis, search & querying,
sharing, storage & transfer,
visualization, and
information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytic and data-handling software, as these tend to be slow. We solved Big Data problems at two aspects: real-time pre-processing (filtering, structuring, etc) and analytics (including in-memory search on a GPU).

Heterogeneous Systems Architecture – memory sharing and task dispatching

Posted by Vincent Hindriksen on 2 June 2016

Want to get an overview of what Heterogeneous Systems Architecture (HSA) does, or want to know what terminology has changed since version 1.0? Read further.

Back in 2012 the goals for HSA were high. The group tried to design a system where CPU and GPU would work together in an efficient way. In the 2013/2014 time-frame you’ll find lots of articles around the web, including on our blog, describing the capabilities of HSA. Unfortunately with the 1.0 specifications most terminologies have been changed.

In March 2015 the HSA Foundation released the final 1.0 specifications. It does not discuss hUMA (Heterogeneous Uniform Memory Access) nor hQ (Heterogeneous Queuing). These two techniques had undergone so many updates, that new terminologies were used.

In this blog post, we’ll present you an updated description of the two most important problems tackled by HSA: memory sharing and task dispatching.

We’ll be tuning the below description, so feedback is always welcome – focus is on clarity, not on completeness.

What is an HSA System?

Where the original HSA goals focused more on SoCs with CPU and GPU cores, now any compute core can be used. The reason was that modern SoCs are much more complex than just a CPU and GPU – integrated DSPs and video-decoder are found on many processors. HSA thus now (officially) supports truly heterogeneous architectures. hsa_mem_arch_3

The idea is that any heterogeneous processor can be designed by the principles of HSA. This will bring down design costs and enable more exotic configurations from different vendors.

And interesting fact about the HSA-specifications is that it only specifies goals, not how it must be implemented. This makes it possible to implement the specifications in software instead of hardware, making it possible to upgrade older hardware to HSA.

Why is HSA important?

A simple question: “will there be more CPUs with embedded GPU or discrete GPUs?”. A simple answer: “there are already more integrated GPUs than discrete ones”. HSA defines those chips with mixed processors.

CPUs with embedded GPUs used to be not much more than the discrete GPUs with shared memory we know from cheap laptops in the 00’s. When the GPU got integrated, each vendor started to create solutions for inter-processor dispatching (threading extended to heterogeneous computing), course-grained sharing (transferring ownership between processor units) and fine grained sharing (atomics working with all processor units).

The HSA Foundation

Sometimes an industry makes bigger steps by competing and sometimes by collaborating

AMD recognised the need for a standard. As AMD wanted to avoid the problems with introducing 64 bit into X86 and therefore initiated the HSA foundation. The founding members are AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung and Texas Instruments. NVidia and Intel are awkwardly absent.

Memory Sharing

HSA uses a relaxed memory model, which has full memory coherence (data guaranteed to be the same for all processes on all cores) and is pageable (subsets can be reserved by programs).

The below write-up is heavily simplified to give an overview how memory sharing is designed under HSA. If you want to know more, read chapter 5 from the HSA book.

Inter-processor memory-pointer sharing – Unified Addressing

The most important part is the unified memory model (previously referred to as “hUMA”), which makes programming the memory-interactions in a heterogeneous processor with CPU-cores, GPU-cores and DPS-cores comparable to a multi-core CPU.

Like other modern memory models, HSA defines various segments, including global, shared and private. A difference is that flat addressing is used. This means that each address pointer is unique: you don’t have an address 0 for private and an address 0 for global. Flat addressing simplifies optimisation operations for higher level languages. Ofcourse you still need to be aware that each segment size is limited and there will be consequences when defining larger memory chunks than is available in the segment.

When you have created a memory object and want the DSP or GPU continue to work on it, then you can use the same pointers without any translations.

Inter-processor cache coherency

In HSA-systems global memory is coherent without the need for explicit cache maintenance. This means that local caches are synchronised and/or that caches are shared. For more information, read this blog from ARM.

Fine grained memory – Atomic Operations

HSA allows protecting memory segments to be atomicly accessed. This makes it possible to have multiple threads running on different cores of different processor units, all accessing the same memory in a safe manner.

Small and large consequtive memory segments can be reserved for sharing, from very fine to coarse grained. All threads that have access to that segement are notified when atomic operations are done.

Fine Grained Shared Virtual Memory (HSA compatibility for discrete GPUs)

AMD has done some efforts to extend HSA to discrete GPUs. We’ll see the real advantages with dispatching, but it also works to create a cleaner memory management.

The so called “Fine Grained Shared Virtual memory” makes it possible use HSA with discrete GPUs that have HSA-support. Because it’s virtual and data is continuously transferred between GPU and the HSA-processor, the performance is ofcourse lower than when using real shared memory. You can compare it to NVidia’s Unified Virtual Memory, and it also has been planned to be in OpenCL 2.0 for a long time.

Dispatching

HSA defines in detail how a task gets into the queue of a worker thread. Below is an overview of how queues, threads and tasks are defined and are named under HSA.

Queueing

Before HSA 1.0 we only spoke of “Heterogeneous Queue” (hQ). This is now further developed to “User Mode Queues”. A User Mode Queue holds the list of tasks for that specific (group of) processor cores, resides in the shared memory and is allocated at runtime.

Such task is described in a language called “Architected Queueing Language” (AQL), and is called an “AQL package”.

Agents and Kernel Agents

HSA threads run on one or a group of processor cores. These threads are called “Agents” and come in two variations: normal Agents and Kernel Agents. A Kernel Agents is an Agent that has a User Mode Queue and can execute kernels that work on a segment of memory. A normal Agent doesn’t have a queue and can only execute simple tasks.

If a normal agent cannot run kernels, but can run tasks, then what can it actually do? Here are a few examples:

Allocate memory, or other tasks only the host can do.
Send back (intermediate) data to the host – for example progress indication.

If you compare to OpenCL, an agent is the host (which creates the work) and kernel agents are the kernels (which can issue new threads under OpenCL 2.0).

AQL packages: communicating dispatch tasks

There are different types of the AQL (Architected Queueing Language) packets, of which these are the most important:

Agent dispatch packet: contains jobs for normal agents.
Kernel dispatch packet: contains jobs for kernel agents.
Vendor-specific packet: between processors of the same vendor there can be more freedoms.

In most cases, we’ll be talking about kernel dispatch packages.

The Doorbell signal: low latency dispatching

HSA dispatching is extremely fast and power-efficient due to the implementation of a “doorbell”. The doorbell of an agent is signalled when a new tasks is available, making it possible to take immediate action. A problem in OpenCL is the high dispatch times for GPUs without a doorbell – up to the millisecond range, as we have measured. For HSA-enabled GPUs the response-time before a kernel starts running is in the microseconds range.

Context switching

Threads can move from one core to another core – the task will be removed from the current queue and added to another queue. This can even happen when the thread is in running state.

StreamHPC’s position

The solution simply works and makes faster code – we have done a large project with it last year.

It seems that almost the whole embedded processor industry believes in it. AMD (CPU+GPU), ARM (CPU+GPU), Imagination (GPU), Mediatek, Qualcomm (GPU), Samsung and Texas Instruments (DSP) are founders. Companies like Analog Devices, CEVA, Sony, VIA, S3, Marvell and Cadence have later joined the club. Important Linux clubs like Linaro and Canonical are also seen.

The system-on-a-chip only will get more traction, and we see HSA as an enabler. Languages like OpenCL and OpenMP can be compiled down to HSA, so it just takes switching the compiler. HSA-capable software can be written in a more efficient manner, as now can be assumed that memory can efficiently be shared and dispatching new threads is really fast.

Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support

Posted by Vincent Hindriksen on 6 September 2016

GLXgears The information you find everywhere: on Linux the current “radeon” and “fglrx” are being replaced by AMDGPU (graphics) and ROCm (compute) for HSA-enabled GPUs. As the whole AMD Linux driver team is seemingly working on getting the new and open source drivers ready, fglrx is now deprecated and will not get updates (or very late). I therefore can get to the point:

When using fglrx on Linux, don’t upgrade to Linux distributions with a kernel later than 4.2 or Xorg server versions beyond 1.17!

For Ubuntu this means no 14.04.5 or 16.04 or later. When you have 14.04.4, the kernel will not upgrade when you go to 14.04.5. CentOS/RedHat has such old kernels, there currently is no issue. Fedora users simply have a problem, as they already go towards 4.8.

Continue reading “Dear Linux-users, during the transition period for FGLRX to AMDGPU/ROCm there’s no kernel 4.4 or Xorg 1.18 support” →

Company History

There are not many companies like Stream HPC. Most others are or a government-institute for the national supercomputer, 1 or 2 freelancers or… actually not experienced with GPUs. So how did it start? How did we get a large team of HPC- and GPU-experts, working for customers worldwide?

Company History

Sorry,You have not added any story yet

Want to know more? Go to our contacts page and ask any question.

Stream Team at ISC

Posted by Vincent Hindriksen on 12 June 2019

This year we’ll be with 4 people at ISC: Vincent, Adel, Anna and Istvan. You can find us at booth G-812, next to Red Hat.

Booth G-812 is manned&womened by Stream HPC

While we got known in the HPC-world for our expertise on OpenCL, we now have many years of experience in CUDA and OpenMP. To get there, we’ve focused a lot on how to improve code quality of existing software, to reduce bugs and increase speedup-potential. Our main expertise remains full control over algorithms in software – the same data simply processed faster.

Why do we have a booth?

We’ll be mostly talking to (new) customers for development of high performance software for the big machines. Also we’ll have a list of our open job positions with us, and we can do the first introductory interview on the spot.

Our slogan for this year is:

There are a lot of supercomputers. Somebody has to program its software

We’ll be sharing our week on Twitter, so you can also see what we find: posters about HPC-programming on CPU and GPU, booths that have nice demos or interesting talks and ofcourse the surprises.

Let’s meet!

If you don’t have an appointment yet, but would like to chat with us, please contact us or drop by at our booth. As we’re with four people, we have high flexibility.

Porting CUDA to OpenCL

OpenCL-in-CUDA-car — OpenCL speed-meter in 1970 Plymouth Cuda car

Why port your CUDA-acelerated software to OpenCL? Simply, to make your software also run on AMD CPU/APU/GPU, Intel CPU/GPU, Altera FPGA, Xilinx FPGA, Imagination PowerVR, ARM MALI, Qualcomm Snapdragon and upcoming architectures.

And as OpenCL is an open standard, supported by many vendors, it has much more security that it will keep existing in the future than any proprietary language.

If you look at the history of GPU-programming you’ll find many frameworks, such as BrookGPU, Close-to-Metal, Brook+, Rapidmind, CUDA and OpenCL. CUDA was the best choice from 2008 to 2013, as OpenCL had to catch up. Now that OpenCL is gaining serious market traction, the demand for porting legacy CUDA-code to OpenCL rises – as we clearly notice here.

We are very experienced in porting legacy CUDA-code to all flavours of OpenCL (CPU, GPU, FPGA, embedded). Ofcourse porting from OpenCL to CUDA is also possible, as well as updating legacy CUDA-code to the latest standards of CUDA 7.0 and later. We can also add several improvements to the architecture; we have made many customers happy with giving them more structured and documented code, while working on the port. Want to see some work we did? We ported molecular dynamics software Gromacs from CUDA to OpenCL.

[button text=”Request a pilot, code-review or more information” url=”https://streamhpc.com/consultancy/request-more-information/” color=”orange” target=”_blank”]

Heterogeneous Systems Architecture (HSA) – the TL;DR

Posted by Vincent Hindriksen on 5 February 2014

HSASolutionStack — Legacy-apps run on HSA-hardware, but less optimal.

The main problem of discrete GPUs is that memory needs to be transferred from CPU-memory to GPU-memory. Luckily we have SoCs (GPU and CPU in one die), but still you need to do in-memory transfers as the two processors cannot access memory outside their own dedicated memory-regions. This is due the general architecture of computers, which did not take accelerators into account. Von Neumann, thanks!

HSA tries to solve this, by redefining the computer-architecture as we know it. AMD founded the HSA-foundation to share the research with other designers of SoCs, as this big change simply cannot be a one-company effort. Starting with 7 founders, it has now been extended to a long list of members.

Here I try to give an overview of what HSA is, not getting into much detail. It’s a TL;DR.

What is Heterogeneous Systems Architecture (HSA)?

It consists mainly of three parts:

new memory-architecture: hUMA,
new task-queueing: hQ, and
an intermediate language: HSAIL.

hsa-overview — HSA enables tasks being sent to CPU, GPU or DSP without bugging the CPU.

The basic idea is to give GPUs and DSPs about the same rights as a CPU in a computer, to enable true heterogeneous computing.

hUMA (Heterogeneous Uniform Memory Access)

HSA changes the way memory is handled by eliminating a hierarchy in processing-units. In a hUMA architecture, the CPU and the GPU (inside the APU) have full access to the entire system memory. This makes it a shared memory system as we know it from multi-core and multi-CPU systems.

HSA-shared-mem-supersimplified — This is the super-simplified version of hUMA: a shared memory system with CPU, GPU and DSP having equal rights to the shared memory.

hQ (Heterogeneous Queuing)

HSA gives more rights to GPUs and DSPs, leveraging work from the CPU. Compared to the Von Neumann architecture, the CPU is not the Central Processing Unit anymore – each processor can be in control and create tasks for itself and the other processors.

heterogeneous-queing — HSA-processors have control over their own and other application task queues.

HSAIL (HSA Intermediate Language)

HSAIL is a sort of virtual target for HSA-hardware. Hardware-vendors focus on getting HSAIL compiled to their processor instruction sets, and developers of high-level languages target HSAIL in their compilers. This is a proven concept of evolving complex hardware-software projects.

It is pretty close to OpenCL SPIR, which has comparable goals. Don’t see them as competitors, but two projects which both need different freedoms and will work along.

What is in it for OpenCL?

OpenCL 2.0 has support for Shared Virtual Memory, Generic Address Space and Recursive Functions. All supported by HSA-hardware.

OpenCL-code can be compiled to SPIR, which compiles to HSAIL, which compiles to HSA-hardware. When the time comes that HSAIL starts supporting legacy hardware, SPIR can be skipped.

HSA is going to be supported in OpenCL 1.2 via new flags – watch this thread.

Final words

Two companies not there: Intel and Nvidia. Why? Because they want to do it themselves. The good news is that HSA is large enough to define the new architecture, making sure we get a standard. The bad news is that the two outsiders will come up with an exception for whatever reason, which gives a need for exceptions in compilers.

You can read more on the website of the HSA-foundation or ask me in the comments below.

CUDA 6 Unified Memory explained

Posted by Vincent Hindriksen on 14 November 2013 with 11 Comments

AMD, ARM-vendors and Intel have been busy unifying CPU and GPU memories for years. It is not easy to design a model where 2 (or more) processors can access memory without dead-locking each other.

NVIDIA just announced CUDA 6 and to my surprise includes “Unified Memory”. Am missing something completely, or did they just pass their competitors as it implies one memory? The answer is in their definition:

Unified Memory — Simplifies programming by enabling applications to access CPU and GPU memory without the need to manually copy data from one to the other, and makes it easier to add support for GPU acceleration in a wide range of programming languages.

The official definition is:

Unified Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data.

See the difference?

The image at the right explains it differently. A) is how UMA is officially defined, and B is how NVIDIA has redefined it.

So NVIDIA’s Unified Memory solution is engineered by marketeers, not by hardware engineers. On Twitter, I seem not to be the only one who had the need to explain that it is different from the terminology the other hardware-designers have been using.

So if it is not unified memory, what is it?

It is intelligent synchronisation between CPU and GPU-memory. The real question is what the difference is between Unified Virtual Addressing (UVA, introduced in CUDA 4) and this new thing.

UVA defines a single Address Space, where CUDA takes care of the synchronisation when the addresses are physically not on the same memory space. The developer has to give ownership to or the CPU or the GPU, so CUDA knows when to sync memories. It does need CudeDeviceSynchronize() to trigger synchronisation (see image).

From AnandTech, which wrote about Unified (virtual) Memory:

This in turn is intended to make CUDA programming more accessible to wider audiences that may not have been interested in doing their own memory management, or even just freeing up existing CUDA developers from having to do it in the future, speeding up code development.

So its to attract new developers, and then later taking care of them being bad programmers? I cannot agree, even if it makes GPU-programming popular – I don’t bike on highways.

From Phoronix, which discussed the changes of NVIDIA Linux driver 331.17:

The new NVIDIA Unified Kernel Memory module is a new kernel module for a Unified Memory feature to be exposed by an upcoming release of NVIDIA’s CUDA. The new module is nvidia-uvm.ko and will allow for a unified memory space between the GPU and system RAM.

So it is UVM 2.0, but without any API-changes. That’s clear then. It simply matters a lot if it’s true or virtual, and I really don’t understand why NVIDIA chose to obfuscate these matters.

In OpenCL this has to be done explicitly with mapping and unmapping pinned memory, but is very comparable to what UVM does. I do think UVM is a cleaner API.

Let me know what you think. If you have additional information, I’m happy to add this.

Is the CPU slowly turning into a GPU?

Posted by Vincent Hindriksen on 6 February 2015

It's all in the plan — It’s all in the plan?

Years ago I was surprised by the fact that CPUs were also programmable with OpenCL – I solely chose that language for the cool of being able to program GPUs. It was weird at start, but cannot think of a world without OpenCL working on a CPU.

But why is it important? Who cares about the 4 cores of a modern CPU? Let me first go into why CPUs have had mostly 2 cores for so long, about 15 years ago. Simply put, it was very hard to program multi-threaded software that made use of all cores. Software like games did, as they needed all the available resources, but even the computations in MS Excel are mostly single-threaded as of now. Multi-threading was maybe used most for having a non-blocking user-interface. Even though OpenMP was standardised 15 years ago, it took many years before the multi-threaded paradigm was used for performance. If you want to read more on this, search the web for “the CPU frequency wall”.

More interesting is what is happening now with CPUs. Both Intel and AMD are releasing CPUs with lost of cores. Intel has recently a 18-core processor (Xeon E5 2699-v3) and AMD was offering 16-core CPUs for a longer time (Opteron 6300 series). Both have SSE and AVX, which means extra parallelism. If you don’t know what this is precisely about, read my 2011-article on how OpenCL uses SSE and AVX on the CPU.

AVX3.2

Intel now steps forward with AVX3.2 on their Skylake CPUs. AVX 3.1 is in XeonPhi “Knight’s Landing” – see this rumoured roadmap.

It is 512-bits wide, which means that 8 times as much vector-data can be computed! With 16 cores, this would mean 128 float operations per clock-tick. Like a GPU.

The disadvantage is alike the VLIW we had in the pre-GCN generation of AMD GPUs: one needs to fill the vector-instructions to get the speed-up. Also the relatively slow DDR3 memory is an issue, but lots of progress is being made there with DDR4 and stacked memory.

So is the CPU turning into a GPU?

I’d say yes.

With AVX3.2 the CPU gets all the characteristics of a GPU, except the graphics pipeline. That means that the CPU-part of the CPU-GPU is acting more like a GPU. The funny part is that with the GPU’s scalar-architecture and more complex schedulers, the GPU is slowly turning into a CPU.

In this 2012-article I discussed the marriage between the CPU and GPU. This merger will continue in many ways – a frontier where the HSA-foundation is doing great work now. So from that perspective, the CPU is transforming into a CPU-GPU; and we’ll keep calling it a CPU.

This all strengthens my believe in the future of OpenCL, as that language is prepared for both task-parallel and data-parallel programs – for both CPUs and GPUs, to say it in current terminology.

OpenCL in simple words

Posted by Vincent Hindriksen on 21 April 2015 with 6 Comments

Our business is largely around making software faster. For that we use OpenCL, but do you know what this programming language is? Why can’t this speeding-up be done using other languages like Java, C#, C++ or Python?

OpenCL the answer to high-level languages, where we were promised superfast software that was very quick to write. After 20 years this was still a promise, as compilers had to guess too much what was intended. OpenCL gives the programmer more control in the places where more control is needed to get high-performing code and leave less guesses for the compiler.

It’s C with some extra power

It’s like normal C with three extra concepts, all with the aim to make the software run faster.

Explicit Data Transfer

In other introductions to OpenCL the data-transfers are mentioned as one of the last parts, but I find this the most important one. Reason: in most cases this is the main bottleneck in performance-targeted code.

When moving your stuff to another house, you pack all in boxes first before loading the truck. Or would you load each item into the truck one-by-one? Transport-costs would be much higher that way.

While it would be great that the fastest data-transfers should be done automatically, it simply doesn’t work like that. This means that designing the data-transfers is an important task when making fast software. OpenCL lets you do this.

Multiple cores

Most people have heard of “cores”, as made famous by Intel. Each core can do a part of a computation and effectively reduce runtime. OpenCL implements this by isolating the code that runs on each core – what goes in and out the protected code is done explicitly. This way the code is really easy to scale up to thousands of cores.

Would you choose the best-in-class to write the multiplication tables from 1 to 20, or have each student write one of them? Even though the slowest student will limit the rest, the total time is still lower.

Where a normal processor has 1, 2, 4 or 8 cores, a graphics processor has hundreds or even thousands of cores. OpenCL-software works on both.

Vectors

Modern processors can do computations on more than one data-item at the same time. They can be described as sub-cores. This means that each core has parallelism on its own.

When reading, do you read one word at once or character by character? Your brains can parse multiple characters at the same time.

OpenCL has support for “vectors” ( bundles of alike data) to be able to program these sub-cores.

It runs on many types of devices

OpenCL is famous for being the standard programming model for a lot of modern processors. There is no other programming language that can do the same. Support is available on:

CPUs; standard processors by Intel, AMD and ARM
GPUs; graphics cards by Intel, AMD and NVIDIA
FPGAs; processors that are programmed on the hardware-level, by Altera and Xilinx.
DSPs; digital signal processors by TI
Mobile graphics processors by ARM, Imagination, Qualcomm, etc.
See the rest of the list here.

This means that code can be ported to new devices in days or weeks instead of having to rewrite everything from scratch.

How does translating to OpenCL work?

When software needs to be faster, the first step is to find out its bottlenecks – these “hot spots” will be ported to OpenCL, while the rest remains the same. Then comes the hardest part: changing the algorithms such that data-transfers are more efficient and all cores are used. The last step is to look into low-level optimisations like the vectors.

Above is a very simplified representation of OpenCL. Still you’ve seen that the language is very unique and powerful. That will change, as its concepts are slowly getting embedded into existing languages – till then OpenCL is the only standard which fully enables all hardware features.

Apple Metal versus Vulkan + OpenCL 2.2

Posted by Vincent Hindriksen on 8 June 2015 with 21 Comments

Update: C++ has been moved from OpenCL 2.1 to 2.2, being released in 2017. Title and text have been updated to reflect this. A reason why Apple released Metal might be the reason that Khronos was too slow in releasing C++ kernels into OpenCL, given the delays.

Apple Metal in one sentence: one queue for both OpenCL and OpenGL, using C++11. They now brought it to OSX. The detail they don’t tell: that’s exactly what the combination of Vulkan + OpenCL ~~2.1~~ 2.2 does. Instead it is compared with OpenCL 1.x + OpenGL 4.x, which it certainly can compete with, as that combination doesn’t have C++11 kernels nor a single queue.

Apple Metal on OSX – a little too late, bringing nothing new to the stage, compared to SPIR and OpenCL ~~2.1~~ 2.2.

The main reason why they can’t compete with the standards, is that there is an urge to create high-level languages and DSLs on top of lower-level languages. What Apple did, was to create just one and leaving out the rest. This means that languages like SYCL and C++AMP (implemented on top of SPIR-V) can’t simply run on OSX, and thus blocking new innovations. To understand why SPIR-V is so important and Apple should adopt for that road, read this article on SPIR-V.

khronos-SPIR-V-flowchart — Metal could compile to SPIR-V, just like OpenCL-C++ does. The rest of the Metal API is just like Vulkan.

Yet another vendor lock-in?

Now Khronos is switching its two most important APIs to the next level, there is a short-term void. This is clearly the right moment for Apple to take the risk and trying to get developers interested in their new language. If they succeed, then we get the well-known “pffff, I have no time to port it to other platforms” and there is a win for Apple’s platforms (they hope).

Apple has always wanted to have a different way of interacting with OpenCL-kernels using Grand Central Dispatch. Porting OpenCL between Linux and Windows is a breeze, but from and to OSX is not. Discussions over the past years with many people from the industry thought me one thing: Apple is like Google, Microsoft and NVidia – they don’t really want standards, but want 100% dedicated developers for their languages.

Yes, now also Apple is on the list of Me-too™ languages for OpenCL. We at StreamHPC can easily translate your code from and too Metal, but we would like it that you can put your investments in more important matters like improving the algorithms and performance.

Still OpenCL support on OSX?

Yes, but only OpenCL 1.2. A way to work around is to use SPIR-to-Metal translators and a wrapper from Vulkan to Metal – this will not make it very convenient though. The way to go, is that everybody starts asking for OpenCL 2.0 support on OSX forums. Metal is a great API, but that doesn’t change the fact it’s obstructing standardisation of likewise great, open standards. If they provide both Metal and Vulkan+OpenCL ~~2.1~~ 2.2 then I am happy – then the developers have the choice.

Metal debuts in “OSX El Capitan”, which is available per today to developers, and this fall to the general public.

Let us do your peer-review

Posted by Vincent Hindriksen on 26 August 2015

There are many research papers that claim enormous speed-ups using an accelerator. From our experience a large part is because of code-modernisations (parallisation & optimisation), which makes the claim look false. That’s why we offer peer-reviews for half our rate for CUDA and OpenCL software. The final costs depend on the size and complexity of the code.

We will profile your CPU and Accelerator code on our machines and review the code. The results are the effect of the code-modernisations and the effect of using the accelerator (GPU, XeonPhi, FPGA). With this we hope that we stimulate the effect of code-modernization gets more research attention over using “miracle hardware”.

Don’t misunderstand: GPUs can still get an average of 8x speedup (or 700% speed improvement) over optimised code, which is still huge! But it’s simply not the 30-100x speed-up claimed in the slide at the right.

Question: do we work with CUDA?

Posted by Vincent Hindriksen on 17 March 2019

Answer: Yes, actually a lot!

The company was built on OpenCL and we are still work with the language a lot – from embedded GPUs and FPGAs to high-end GPUs. Like OpenCL unjustly isn’t associated with clusters full of professional GPUs, we were not associated with CUDA. I can tell many of our customers have found us to build high performance software in CUDA.

Breaking with the past is not easy due to associations that seem to stick. With the name change from StreamComputing to Stream HPC some years ago, we wanted to enforce that break with being “the OpenCL company”. For some time we were much more pragmatic in solving the problems of our customers, which resulted in making software in MPI and CUDA – sometimes an unexpected direction as the customer initially chose OpenCL.

We also started hiring people who only knew CUDA (but expect them to learn OpenCL), as the right algorithm and the right processor is more important. Internships with CUDA, large CUDA-projects, seeking better relations with Nvidia and such – all have been going on for years. And we like it as much as we like OpenCL – both have unique advantages.

So if you have questions about CUDA, don’t be afraid that you hurt us – we’re happy to help you get fast software.

Double the performance on AMD Catalyst by tweaking subgroup operations

Posted by Jakub Szuppe on 22 March 2017

AMD’s hardware was only used for less than half in case of scan operations in standard OpenCL 2.0.

OpenCL 2.0 added several new built-in functions that operate on a work-group level. These include functions that work within sub-groups (also known as warps or wavefronts). The work-group functions perform basic parallel patterns for whole work-groups or sub-groups.

The most important ones are reduce and scan operations. Those patterns have been used in many OpenCL software and can now be implemented in a more straightforward way. The promise to the developers was that the vendors now can provide better performance using none or very little local memory. However, the promised performance wasn’t there from the beginning.

Recently, at StreamHPC we worked on improving performance of certain OpenCL kernels running specifically on AMD GPUs where we needed OpenGL-interop and thus chose Catalyst-drivers. It turned out that work-group and sub-group functions did not give the expected performance on both Windows and Linux. Continue reading “Double the performance on AMD Catalyst by tweaking subgroup operations” →

When to use Artificial Intelligence and when to use Algorithms?

Posted by Vincent Hindriksen on 27 March 2017

The main strength of Artificial Intelligence is it’s easy to understand by anybody. This results in new applications in all industries at a rapid pace. Are there new possibilities generated or have the possibilities always been possible? The answer is both.

In case of totally unknown input, AI is better capable of adapting. A clearly defined algorithm has much more unpredictable behavior in such cases.

If the input is known well, the tables are turned. AI could give unpredictable results and with that introducing hard-to-solve bugs. Algorithms do exactly what it is designed for, also often at a much higher performance.

Making the wrong choice here results often in a much more expensive solution. Continue reading “When to use Artificial Intelligence and when to use Algorithms?” →

The Art of Benchmarking

Posted by Vincent Hindriksen on 24 March 2020

How fast is your software? The simpler the software setup, the easier to answer this question. The more complex the software, the more the answer will “it depends”. But just peek at F1-racing – the answer will depend on the driver and the track.

This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.

There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

The questions to ask

Even when it depends on various variables, answers do can be given. These answers are best be described as ‘insights’ and this blog is about that.

First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.

We see benchmarking as providing insights in contrast with the stopwatch-number. Going back to F1 – being second in the race, means the team wants to probably know these:

What elements build up the race? From weather conditions to corners, and from other cars on the track to driver-responses
How can each of these elements be quantified?
How can each of these elements be measured for both own cars and other cars?
And as you guessed from the high-level result, the stopwatch: how much speedup is required in total and per round?

Continue reading →

Our offices

We’re expanding to more cities, to be closer to talent and our customers. The idea is to have multiple smaller offices instead of a few big ones. The idea for this was a simple set of questions on how work would be in 2030. The lines between offices would be shifting – not all is to be defined by walls. So smaller offices nearby, with the flexibility to temporarily move to another city, would be much more suited for what is expected in 2030.

Each city has one or two senior developer+manager person, who takes lead when the project-complexity demands it.

In HQ the main structure is provided for onboarding, administration, sales and such. All to make sure the different cities only have a few local things to take care off, so the focus can be on building great software and efficiently handling the projects.

EU – NL – Amsterdam

Koningin Wilhelminaplein 1 – 40601, 1062HG, Amsterdam, Netherlands

Amsterdam is the economic center of the Netherlands, a small country with 17 million inhabitants. It’s the home of HPC-companies like Bright Computing and ClusterVision, and has a large IT workforce that also feed the R&D demand of large international companies. As the number of companies settling here is still growing, Amsterdam is even planning to build a complete new city for 40 to 70 thousand people in the harbour area.

There are different sides of the city. When you think of Amsterdam as a tourist, you might think of the Anne Frank House, Gay Parade, Van Gogh Museum, the Red Light District, the canals, windmills and Tulips. If you would consider living here there are about the 180 different nationalities that live in the city, the 22 international schools and two universities, the vibrant night life and the many-villages-make-the-city atmosphere. Locals of all professions are fluent in English and there is a lively expat community.

You don’t need to live in Amsterdam, as there are several cities and villages nearby with all unique identities. As the Dutch infrastructure is of high standard, Amsterdam is easy to reach via train (and car) from several nearby cities and villages. For instance taking the train from Haarlem to the office takes 9 to 13 minutes, Leiden or Utrecht half an hour. Want to live at the sea? Zandvoort to the office is 25 minutes.

Expats (both single and with family) say they found it easy to build up a social life. For Europeans it’s very easy to move to Amsterdam, as there are no real borders in the EU.

EU – HU – Budapest

Radnóti Miklós u. 2, Budapest, 1137, Hungary

Two cities, Buda and Pest with both their own characteristics form the 1,75 million large capital of Hungary and the ninth-largest city in the EU. The country (est. in 895) has almost 10 million inhabitants.

There is more high-tech industry than you might think. Hungary has one of the highest rates of filed patents, the 6th highest ratio of high-tech and medium high-tech output in the total industrial output, the 12th-highest research Foreign Direct Investment inflow, placed 14th in research talent in business enterprise and has the 17th-best overall innovation efficiency ratio in the world.

If you walk in the city, you’ll find no average Hungarian. There is much creativity hidden and there’s a rich beer-culture. There is this unique quiet vibrant atmosphere that makes you immediately feel at home.

EU – ES – Barcelona

Better weather during winter than in Amsterdam and Budapest and a vibrant tech-city. It hosts the famous Barcelona Supercomputing Center, and is strong tech-hub.

Contenders

We’re researching multiple cities for starting a new office. Due to Covid these researches have been delayed a lot.

EU – NL – Utrecht
EU – NL – Eindhoven
EU – PL – Warsaw
EU – FR – Paris
EU – FR – Grenoble
EU – DE – Heidelberg
UK – Bristol

If you live in one of these cities and are good with GPUs, do get in contact. We start with these people:

An experienced developer who can manage projects
Three to four medior/senior developers
A temporary “location starter”
Optionally a sales-person