AMD’s answer to NVIDIA TESLA K10: the FirePro S9000

Posted by Vincent Hindriksen on 13 September 2012 with 8 Comments

Recently AMD announced their new FirePro GPUs to be used in servers: the S9000 (shown at the right) and the S7000. They use passive cooling, as server-racks are actively cooled already. AMD partners for servers will have products ready Q1 2013 or even before. SuperMicro, Dell and HP will probably be one of the first.

What does this mean? We finally get a very good alternative to TESLA: servers with probably 2 (1U) or 4+ (3U) FirePro GPUs giving 6.46 to up to 12.92 TFLOPS or more theoretical extra performance on top of the available CPU. At StreamHPC we are happy with that, as AMD is a strong OpenCL-supporter and FirePro GPUs give much more performance than TESLAs. It also outperforms the unreleased Intel Xeon Phi in single precision and is close in double precision.

Edit: About the multi-GPU configuration

A multi-GPU card has various advantages as it uses less power and space, but does not compare to a single GPU. As the communication goes via the PCI-bus still, the compute-capabilities between two GPU cards and a multi-GPU card is not that different. Compute-problems are most times memory-bound and that is an important factor that GPUs outperform CPUs, as they have a very high memory bandwidth. Therefore I put a lot of weight on memory and cache available per GPU and core.

Continue reading “AMD’s answer to NVIDIA TESLA K10: the FirePro S9000” →

Khronos Invites Press & Game Developers to Sessions @ GDC San Francisco

Posted by Vincent Hindriksen on 28 February 2014

Khronos just sent out the below message to Press and Game Developers. To my understanding, there are many game devs under the readers of this blog, so I’d like you to share the message with you.

JOIN KHRONOS GROUP AT GDC 2014 SAN FRANCISCO
Press Conference, Technology Sessions and Refreshment OasisWe invite you to attend one or more of the Khronos sessions taking place in the Khronos meeting room just off the Moscone show floor. For detailed information on each session, and to register please visit: https://www.khronos.org/news/events/march-meetup-2014.

PRESS CONFERENCE

WHEN: Wednesday March 19 at 10:00 AM (Reception 9:30 AM)

WHERE: Room 262, West Mezzanine Level, (behind Official Press Room)

GUESTS: Members of the Press and Industry by Invitation*

RSVP: Jon Hirshon, Horizon PR jh@horizonpr.com

Members of the press are invited to attend the Khronos Press Conference, held jointly again this year with consortium PCGA (PC Gaming Alliance). Khronos will issue significant news on OpenGL ES, WebCL, OpenCL, and several more Khronos technologies, and PCGA will issue news about 2013 Gaming Market numbers. Updates will be delivered by Khronos and PCGA Executives, with insights made by David Cole of DFC and Jon Peddie of Jon Peddie Research.

DEVELOPER SESSIONS

WHEN: Wednesday March 19 & Thursday March 20; Session Times Below

WHERE: Room 262 – West Mezzanine Level, Moscone Center

GUESTS: All GDC attendees**

RSVP: https://www.khronos.org/news/events/march-meetup-2014

All GDC attendees** are invited to the Khronos Developer Sessions where experts from the Khronos Working Groups will deliver in-depth updates on the latest developments in graphics and media processing. These sessions are packed with information and provide a great opportunity to:

Hear about the latest updates from the gurus that invented these technologies

See leading-edge demos & applications

Put your questions to members of the Khronos working groups

Meet with other community members

SESSION SCHEDULE

Wednesday March 19

3:00 – 4:00 : OpenCL & SPIR

4:00 – 5:00 : OpenVX, Camera and StreamInput

5:00 – 6:00 : OpenGL ES

6:00 – 7:00 : OpenGL

Thursday March 20

3:00 – 3:50 : WebCL

4:00 – 4:50 : Collada and glTF

5:00 – 7:00 : WebGL

SESSION REGISTRATION
For information and to register, visit:https://www.khronos.org/news/events/march-meetup-2014

REFRESHMENT OASIS

WHEN: Wednesday March 19 & Thursday March 20; from 10 AM to 7 PM

WHERE: Room 270, West Mezzanine Level

GUESTS: All GDC attendees**

RSVP: https://www.khronos.org/news/events/march-meetup-2014

We thought “Refreshment Oasis” sounded like a nice way to say “sit down and have a cup of coffee while we keep working!” Khronos is happy to offer a hospitality suite conveniently located next to our primary meeting room (and the official GDC Press room) to showcase Khronos Member technology demos and offer a place for GDC guests, Khronos Members and Marketing staff to meet. You are welcome to just drop by for a chat, or please email Michelle@GoldStandardGroup.org to arrange a meeting with any Work Group Chairs, Khronos Execs or Marketing Team.

We look forward to seeing you at the show!

*Admittance to the Press Conference is open to all GDC registered Press, and to members of industry on a “Seating Available” basis. Space is limited so reserve your seat today.

** Admittance to the KHRONOS sessions is FREE but: (1) all attendees must have a GDC Exhibitor or Conference Pass to gain entry to the Khronos meeting room area (GDC tickets details http://www.gdconf.com) and (2) all attendees MUST REGISTER for the individual Khronos API sessions. We expect demand to be high and space is limited.

With open standards becoming more important in the very diverse computer-game industry, Khronos is also growing. If you are in this industry and want to know (or influence) the landscape for the coming years, you should attend.

The single-core, multi-core and many-core CPU

Posted by Vincent Hindriksen on 3 August 2017

CPUs are now split up in 3 types, depending on the number of cores: single (1), multi (2-8) and many (10+).

I find it more important now to split up into these three types, as the types of problems to be solved by each is very different. Based on the problem-differences I’m even expecting that the number of cores between multi-core CPUs and many-core CPUs will grow.

Below are the three types of CPUs discussed and a small discussion on many-core processors we see around. Continue reading “The single-core, multi-core and many-core CPU” →

Call for speakers: IEEE eScience Conference in Amsterdam

Posted by Vincent Hindriksen on 23 May 2018

We’re in the program committee of the 14th IEEE eScience Conference in Amsterdam, organized by the Netherlands eScience Center. It will be held from 29 October to 1 November 2018, and the deadlines for sending the abstracts is Monday 18 June.

The conference brings together leading international researchers and research software engineers from all disciplines to present and discuss how digital technology impacts scientific practice. eScience promotes innovation in collaborative, computationally- or data-intensive research across all disciplines, throughout the research lifecycle.

Continue reading “Call for speakers: IEEE eScience Conference in Amsterdam” →

Do your (X86) CPU and GPU support OpenCL?

Posted by Vincent Hindriksen on 29 December 2011 with 26 Comments

Does your computer have OpenCL-capable hardware? Read on and find out if your computer is compatible…

If you want to know what other non-PC hardware (phones, tablets, FPGAs, DSPs, etc) is running OpenCL, see the OpenCL SDK page.

For people who only want to run OpenCL-software and have recent hardware, just read this paragraph. If you have recent drivers for your GPU, you can be sure OpenCL is already supported and you can run OpenCL-capable software. NVidia has support for OpenCL 1.1 since drivers 280.13, so if you need OpenCL 1.1, then make sure you have this version or later. If you want to use Intel-processors and you don’t have an AMD GPU installed, you need to download the runtime of Intel OpenCL.

If you want to know if your X86 device is supported, you’ll find answers in this article.

Often it is not clear how OpenCL works on CPUs. If you have a 8 core processor with double threading, then it mostly is understood that 16 pipelines of instructions are possible. OpenCL takes care of this threading, but also uses parallelism provided by SSE and AVX extension. I talked more about this here and here. Meaning that an 8-core processor with AVX can compute 8 times 32 bytes (8*8 floats or 8*4 doubles) in parallel. You could see it as parallelism of parallelism. SSE is designed with multimedia-operations in mind, but has enough to be used with OpenCL. The minimum requirement for OpenCL-on-a-CPU is SSE 4.2, though.

A question I see often is what to do if you have more devices. There is no OpenCL-package for all the available devices, so you then need to install drivers for each device. CPU-drivers are often included in the GPU-drivers.

Read on to find out exactly which processors are supported.

Continue reading “Do your (X86) CPU and GPU support OpenCL?” →

21-23 August: OpenCL Training London

Posted by Vincent Hindriksen on 13 May 2013

From 21 to 23 August StreamHPC will give a 3-day training in OpenCL. Here you will learn how to develop OpenCL-programs.

A separate ticket for only the first day can be bought, as then will be a crash-course into OpenCL. Module basics.

The second and third day will all about parallel-algorithm design, optimisation and error-handling. Module optimisation with several new subjects added.

The last part of the third day is reserved for special subjects, as requested by the attendees. Continue reading “21-23 August: OpenCL Training London” →

OpenCL mini buying guide for X86

Posted by Vincent Hindriksen on 17 January 2011 with 6 Comments

Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

Companies who want to build a cluster, contact us for information. Professional clusters need different hardware than described here.

Continue reading “OpenCL mini buying guide for X86” →

X86-workstation buying guide for OpenCL developers, Q1 2013

Posted by Vincent Hindriksen on 6 January 2013 with 7 Comments

Curved iMac has your back… — Nuno Teixeira designed a large curved monitor in 2008 and assumed it would never be made. For a “few” thousand dollar NEC offers one to you right now. Also Samsung and LG have announced several new curved TVs at CES 2013 (with hdmi-port). We only need a workstation to go with it, where this blog-article might come in handy.

Important: this article was written before Intel “Haswell” and AMD “Richland” architectures came out.

So you want to start developing for OpenCL? When you focus on developing OpenCL for X86, you have these three options: CPUs, GPUs and CPUs with and embedded GPU. This article is for you and represents the current state of hardware – if you want the best hardware for your specific algorithm, the below information is probably not sufficient.

In 2013 we focus on 3 groups: servers/cloud (FirePro, Tesla, XeonPhi), workstations (discussed here), low-power devices (SoCs) and special accelerators (FPGAs and DSPs). This article does not discuss high-end accelerators of a few thousands of Euro, which are laid out in here.

Before reading on, you need to set the goal for your workstation.

If you want to learn the basics of OpenCL-programming, first check if your current machine has OpenCL-support.
If you need more processing power, be sure you select the right hardware for the job. Don’t buy the most expensive hardware (FirePro, Tesla or XeonPhi), but take your time to find out which hardware supports your algorithms best. Feel free to ask us.
If you want to make sure your software works on various types of accelerators, you can choose between:
- swapping PCIe-cards – disadvantage is the drivers-hazzle and time-consumption.
- more accelerators in one machine – disadvantage is that only GPU 1 can do OpenGL/DirectX.
- identical machines with different accelerators – disadvantage is the price.
If you want to focus on multi-GPU development, you need:
- or enough power-supply and the motherboard supports many lanes,
- or buy a videocard with two GPUs.

This article has the goal to help you with buying a good machine for OpenCL-development. Prices are of January 2013. If you think I make the wrong suggestions, please give feedback via the comments.

My contacts at various companies can tell: I want to stay independent no matter what. No deals have been made nor was there any outside influence, except the friendly people of the local computer shops. I was surprised I ended up with suggestion so much AMD hardware, that I felt quite uncomfortable with it – I finally decided to keep to my first conclusions and leave the comments completely open.

Continue reading “X86-workstation buying guide for OpenCL developers, Q1 2013” →

OpenCL Wrappers

Mostly providing simplified kernel-code with more convenient error-checking, but sometimes with quite advanced additions: the wrappers for OpenCL 1.x. As OpenCL is an open standard, these projects are an important part of the evolution of OpenCL.

You won’t find solutions that provide a new programming paradigm or work with pragmas, generating optimised OpenCL. This list is in the making.

C++

Goopax: Goopax is an object-oriented GPGPU programming environment, allowing high performance GPGPU applications to be written directly in C++ in a way that is easy, reliable, and safe.

OCL-Library: Simplified OpenCL in C++. Documentation. By Prof. Tim Warburton and David Medina.

Openclam: possibility to write kernels inside C++ code. Project is not active.

EPGPU: provides expressions in C++. Paper. By Dr. Lawlor.

VexCL: a vector expression template library. Documentation.

Qt-C++. The advantages of Qt in the OpenCL programmer’s hands.

Boost.Compute. An extensive C++ library using Boost. Documentation.

ArrayFire. A wrapper and a library in one.

OpenCLHelper. Easy to run kernels using OpenCL.

SkelCL. A more advanced C++ wrapper, providing various skeleton functions.

HPL, or the Heterogeneous Programming Library, where special Arrays are used to easily communicate between CPU and GPU.

ViennaCL, a wrapper around an OpenCL-library focused on linear algebra.

C, Objective C

C Framework for OpenCL. Rapid development of OpenCL programs in C/C++.

Simple OpenCL. Much simpler host-code.

COPRTHR: STDCL: Simplified programming interface for OpenCL designed to support the most typical use-cases in a style inspired by familiar and traditional UNIX APIs for C programming.

Grand Central Dispatch: integration into Apple’s environment. Documentation [PDF].

GoCL: For combination with Gnome GLib/GObject.

Computing-Language-Utility: C/C++ wrapper by Intel. Documentation included, slides of presentation here.

Delphi/Pascal

Delphi-OpenCL: Delphi/Pascal-bindings for OpenCL. Seems not active.

OpenCLforDelphi: OpenCL 1.2 for Delphi.

Fortran

Howto-article: It describes how to link with c-files. A must-read for Fortran-devs who want to integrate OpenCL-kernels.

FortranCL: OpenCL interface for Fortran 90. Seems to be the only matured wrapper around. FAQ.

Wim’s OpenCL Integration: contains a very simple f95 file ‘oclWrapper.f95’.

Go

GOCL: Go OpenCL bindings

Go-OpenCL: Go OpenCL bindings

Haskell

HopenCL: Haskell-bindings for OpenCL. Paper.

Java

JavaCL: Java bindings for OpenCL.

ClojureCL: OpenCL 2.0 wrapper for Clojure.

ScalaCL: Much more advanced integration as could be done with JavaCL.

JoCL by JogAmp: Java bindings for OpenCL. Good integration with siter-projects JoGL and JoAL.

JoCL.org: Java bindings for OpenCL.

The Lightweight Java Game Library (LWJGL): Support for OpenCL 1.0-1.2 plus extensions and OpenGL interop.

Aparapi, a very high level language for enabling OpenCL in Java.

JavaScript

Standardised WebCL-support is coming via the Khronos WebCL project.

Nokia WebCL. Javascript bindings for OpenCL, which works in Firefox.

Samsung WebCL. Javascript bindings for OpenCL, which works in Safari and Chromium on OSX.

Intel Rivertrail. built on top of WebCL.

Julia

JuliaGPU. OpenCL 1.2 bindings for Julia.

Lisp

cl-opencl-3b: Lisp-bindings for OpenCL. not active.

.NET: C#, F#, Visual Basic

OpenCL.NET: .NET bindings for OpenCL 1.1.

Cloo: .NET bindings for OpenCL 1.1. Used in OpenTK, which has good integration with OpenGL, OpenGL|ES and OpenAL.

ManoCL: Not active project for .NET bindings.

FSCL.Compiler: FSharp OpenCL Compiler

Perl

Perl-OpenCL: Perl bindings for OpenCL.

Python

General article

PyOpenCL: Python bindings for OpenCL with convenience wrappers. Documentation.

Cython: C-extension for Python. More info on the extension.

PyCL: not active.

PythonCL: not active.

Clyther: Python bindings for OpenCL. No code yet, but check out the predecessor.

Ruby

Ruby-OpenCL: Ruby bindings for OpenCL. Not active.

Barracuda: seems to be not active.

Rust

Rust-OpenCL: Rust-bindings for OpenCL. Blog-article by author

Math-software

Mathematica

OpenCLLink. OpenCL-bindings in Mathematica 9.

Matlab

There is native support in Matlab.

OpenCL-toolbox. Alternative bindings. Not active. Works with Octave.

R

R-OpenCL. Interface allowing R to use OpenCL.

Suggestions?

If you know not mentioned alike projects, let us know! Even when not active, as that is important information too.

Problem solving tactic: making black boxes smaller

Posted by Vincent Hindriksen on 6 March 2020

We are a problem solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the more it’s clear that without our problem solving skills, we could not tackle the complexity of a GPU and CPU-clusters. While I normally shield off how we do and how we continuously improve ourselves, it would be good to share a bit more so both new customers and new recruits know what to expect form the team.

https://twitter.com/StreamHPC/status/1235895955569938432

Black boxes will never be transparent

Assumption is the mother of all mistakes
Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should be assuming we assume. Problem is that we want to have full control and make faster decisions, and then assuming fits in all these scary unknowns.

Continue reading →

Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference in a with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like Cuda’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it an be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with full comparison, but we don’t have enough hardware in-house to do a full benchmark. Below information is what we’ve found on the net and a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On ATI 4000-series OpenCL is just slow, but gives very comparable to NVidia if compared to the 5000-series. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture AMD/ATI-hardware is very different from NVidia’s and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work a maximum speed in current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but you it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD works much better with OpenCL, the next might beat all others in a many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA en OpenCL will develop.

Homogeneous vs Heterogeneous

For us the most important reason to have chosen for OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s or he knew things were going to change. In the near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Like Java and .NET are very comparable, developers from both side know very well that their favourite feature is missing at the other camp. Most time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in openCL you have to create new kernels for every data-type.

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As it is done with more shared standards, there is no win and no gain to promote it. If you promote it, a lot of companies thank you, but the Rreturn-on-Investments is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

The 13 application areas where OpenCL and CUDA can be used

Posted by Vincent Hindriksen on 3 June 2013 with 16 Comments

visitekaartje-achter-2013-V — Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need of auto-tuning.

Which algorithms map is best to which accelerator? In other words: What kind of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group from VirginiaTech took a close look at which types of algorithms were a good fit for vector-processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here in StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named “dwarves” these algorithmic methods. With 6 more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.

Continue reading →

Selecting Applications Suitable for Porting to the GPU

Posted by Vincent Hindriksen on 14 March 2018

Assessing software is never comparing apples to apples

The goal of this writing is to explain which applications are suitable to be ported to OpenCL and run on GPU (or multiple GPUs). It is done by showing the main differences between GPU and CPU, and by listing features and characteristics of problems and algorithms, which can make use of highly parallel architecture of GPU and simply run faster on graphic cards. Additionally, there is a list of issues that can decrease potential speed-up.

It does not try to be complete, but tries to focus on the most essential parts of assessing if code is a good candidate for porting to the GPU.

GPU vs CPU

The biggest difference between a GPU and a CPU is how they process tasks, due to different purposes. A CPU has a few (usually 4 or 8, but up to 32) ”fat” cores optimized for sequential serial processing like running an operating system, Microsoft Word, a web browser etc, while a GPU has a thousands of ”thin” cores designed to be very efficient when running hundreds of thousands of alike tasks simultaneously.

A CPU is very good at multi-tasking, whereas a GPU is very good at repetitive tasks. GPUs offer much more raw computational power compared to CPUs, but they would completely fail to run an operating system. Compare this to 4 motor cycles (CPU) of 1 truck (GPU) delivering goods – when the goods have to be delivered to customers throughout the city the motor cycles win, when all goods have to be delivered to a few supermarkets the truck wins.

Most problems need both processors to deliver the best value of system performance, price, and power. The GPU does the heavy lifting (truck brings goods to distribution centers) and the CPU does the flexible part of the job (motor cycles distributing doing deliveries).

Assessing software for GPU-porting fitness

Software that does not meet the performance requirement (time taken / time available), is always a potential candidate for being ported to a GPU. Continue reading “Selecting Applications Suitable for Porting to the GPU” →

NVIDIA’s answer to FirePro S9000: the TESLA K20

Posted by Vincent Hindriksen on 12 November 2012 with 5 Comments

Two months ago I wrote about the FirePro S9000 – AMD’s answer to the K10 – and was already looking forward to this K20. Where in the gaming world, it hardly matters what card you buy to get good gaming performance, in the compute-world it does. AMD presented 3.230 TFLOPS with 6GB of memory, and now we are going to see what the K20 can do.

The K20 is very different from its predecessor, the K10. Biggest difference is the difference between the number of double precision capability and this card being more powerful using a single GPU. To keep power-usage low, it is clocked at only 705MHz compared to 1GHz of the K10. I could not find information of the memory-bandwidth.

ECC is turned on by default, hence you see it presented having 5GB. No information yet if this card also has ECC on the local memories/caches.

Continue reading “NVIDIA’s answer to FirePro S9000: the TESLA K20” →

Why did AMD open source ROCm’s OpenCL driver-stack?

Posted by Vincent Hindriksen on 21 May 2017

AMD open sourced the OpenCL driver stack for ROCm in the beginning of May. With this they kept their promise to open source (almost) everything. The hcc compiler was open sourced earlier, just like the kernel-driver and several other parts.

Why this is a big thing?
There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to the official compiler/driver. So implementations like PortableCL and Intel Beignet play catch-up. AMD’s open source implementations are primary.

It contains:

OpenCL 1.2 compatible language runtime and compiler
OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

For testing the implementation, see Khronos OpenCL CTS framework or Phoronix openbenchmarking.org.

Why is it open sourced?

There are several reasons. AMD wants to stand out in HPC and therefore listened carefully to their customers, while taking good note on where HPC was going. Where open source used to be something not for businesses, it is now simply required to be commercially successful. Below are the most important answers to this question.

Give deeper understanding of how functions are implemented

It is very useful to understand how functions are implemented. For instance the difference between sin() and native_sin() can tell you a lot more on what’s best to use. It does not tell how the functions are implemented on the GPU, but does tell which GPU-functions are called.

Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.

Debug software

When you are working on a large project and have to work with proprietary libraries, this is a typical delay factor. I think every software engineer has this experience that the library does not perform as was documented and work-arounds had to be created. Depending on the project and the library, it could take weeks of delay – only sarcasm can describe these situations, as the legal documents were often a lot better than the software documents. When the library was open source, the debugger could step in and give the “aha” that was needed to progress.

When working with drivers it’s about the same. GPU drivers and compilers are extremely complex and ofcourse your project hits that bug which nobody encountered before. Now all is open source, you can now step into the driver with the debugger. Moreover, the driver can be compiled with a fix instead of work-around.

Get bugs solved quicker

A trace now now include the driver-stack and the line-numbers. Even a suggestion for a fix can be given. This not only improves reproducibility, but reduces the time to get the fix for all steps. When a fix is suggested AMD only needs to test for regression to accept it. This makes the work for tools like CLsmith a lot easier.

Have “unimportant” specific improvements done

Say your software is important and in the spotlight, like Blender or the LuxMark benchmark, then you can expect your software gets attention in optimisations. For the rest of us, we have to hope our special code-constructions are alike one that is targeted. This results in many forums-comments and bug-reports being written, for which the compiler team does not have enough time. This is frustrating for both sides.

Now everybody can have their improvements submitted, giving it does not slow down the focus software ofcourse.

Get the feature set extended

Adding SPIR-V is easy now. The SPIRV-frontend needs to be added to ROCm and the right functions need to be added to the OpenCL driver. Unfortunately there is no support for OpenCL 2.x host-code yet – I understood by lack of demand.

For such extensions the AMD team needs to be consulted first, because this has implications on the test-suite.

Get support for complete new things

It takes a single person to make something completely new – this becomes a whole easier now.

More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-egg. Optimised 128 bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All high performance code is there for you.

Initiate alternative implementations (?)

Not a goal, but forks are coming for sure. For most forks the goals would be like the ones above, to later be merged with the master branch. There are a few forks that go their own direction – for now hard to predict where those will go.

Improve and increase university collaborations

If the software was protected, it was only possible under strict contracts to work on AMD’s compiler infrastructure. In the end it was easier to focus on the open source backends of LLVM than to go through the legal path.

Universities are very important to find unexpected opportunities, integrate the latest research in, bring potential new employees and do research collaborations. Added bonus for the students is that the GPUs might be allowed to used for games too.

Timour Paltashev (Senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

Get better support in more Linux distributions

It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need a binary firmware (which were disassembled and seem to do as advertised), but the discussion is if this is part of the hardware or software to mark it as “libre”.

There are many obstacles to have ROCm complete stack included as the default, but with the current state it makes much more chance.

Performance

Phoronix has done some benchmarks on ROCm 1.4 OpenCL in January on several systems and now ROCm 1.5 OpenCL on a Radeon RX 470. Though the 1.5 benchmarks were more limited, the important conclusion is that the young compiler is now mostly on par with the closed source OpenCL implementation combined with the AMDGPU-drivers. Only Luxmark AMDGPU was (much) better. Same comparison for the old proprietary fgrlx drivers, which was fully optimised and the first goal to get even with. You’ll see that there will be another big step forward with ROCm 1.6 OpenCL.

Get started

You can find the build instructions here. Let us know in the comments what you’re going to do with it!

12-14 June: OpenCL Training Amsterdam

Posted by Vincent Hindriksen on 25 April 2013

From 12 to 14 June StreamHPC will give a 3-day course in OpenCL (was 3 to 5 June). Here you will learn how to develop OpenCL-programs.

A separate ticket for only the first day can be bought, as then will be a crash-course into OpenCL. Module basics.

The second and third day will all about parallel-algorithm design, optimisation and error-handling. Module optimisation with several new subjects added.

The last part of the third day is reserved for special subjects, as requested by the attendees. Continue reading “12-14 June: OpenCL Training Amsterdam” →

Join us at the Dutch eScience Symposium 2019 in Amsterdam

Posted by Vincent Hindriksen on 23 August 2019

Soon there will be another Dutch eScience Symposium 2019 in Amsterdam. We thought it might be a good place to meet and listen to e-science talks. Stream HPC in the end is just making scientific software, so we’re here at the right place. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!

Continue reading →

AMD OpenCL Programming Guide August 2013 is out!

Posted by Vincent Hindriksen on 6 August 2013

AMD has just released an update to their AMD programming guide.

~~Download the guide (PDF) August version~~

Download the guide (PDF) November version

Download TOC (PDF)

For more optimisation guides, see the tutorials page of the knowledge base.

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

1.1 Software Overview
1.1.1 Synchronization

1.2 Hardware Overview for Southern Islands Devices

1.3 Hardware Overview for Evergreen and Northern Islands Devices

1.4 The AMD Accelerated Parallel Processing Implementation of OpenCL Continue reading “AMD OpenCL Programming Guide August 2013 is out!” →

Call for Papers, Presentations, Workshops and Posters for IWOCL in Stanford

Posted by Vincent Hindriksen on 10 January 2015

The IWOCL 2015 call for OpenCL Papers is now open and is looking for submissions from industry and academia relating to the use of OpenCL. Submissions may refer to completed projects or those currently in progress and are invited in the form of:

Research Papers
Technical Presentations
Workshops and Tutorials
Posters

Examples of sessions from 2014 can be found here.

Deadlines at a Glance

Call for submissions OPENS:	Wednesday 19th November, 2014
Call for submissions CLOSES:	Saturday 14th February, 2015 (23:59 AOE)
Notifications:	Within 4 weeks of the final closing date

Selection Criteria

The IWOCL Technical Committee will select submissions based on the following criteria;

Concept of the submission and its relevance and timeliness
Technical Depth
Clarity of the submissions; clearly conveying what your presentation
Research findings and results of your work
Your credentials and expertise in the subject matter

Unpublished Technical Papers

We solicit the submission of unpublished technical papers detailing original research related to OpenCL. All topics related to OpenCL are of interest, including OpenCL applications from any domain (e.g., scientific computing, video games, computer graphics, multimedia, information retrieval, optimization, text processing, data mining, finance, signal and image processing and numerical solvers), OpenCL performance analysis and modeling, OpenCL performance and correctness tools and proposed OpenCL extensions. IWOCL will publish formal proceedings of the accepted papers in The ACM International Conference Series. Please Submit an Abstract which should be between 1 and 4 pages long.

Technical Presentations

We solicit the submission of technical presentations detailing the innovative use of OpenCL. All topics related to OpenCL are of interest, including but not limited to applications, software tools, programming methods, extensions, performance analysis and verification. Please Submit an Abstract which should not exceed 4 pages. The accepted presentations will be published in the online workshop proceedings.

Workshops & Tutorials

IWOCL includes a day of tutorials that provide OpenCL users an opportunity to spend more time exploring a specific OpenCL topic. Tutorial submissions should assume working knowledge of OpenCL by the attendees and can for example cover OpenCL itself, any of the related APIs such as SPIR and SYCL, the use of OpenCL libraries or parallel computing techniques in general using OpenCL. Please Submit an Abstract which should not exceed 4 pages. Please include the preferred length of the tutorial or workshop (e.g. 2, 3 or 4 hours).

Posters

To encourage discussion of the latest developments in the OpenCL community, there will be a poster session running in parallel to the main sessions and open during the breaks and lunch sessions. The abstracts of the accepted posters will be published in the form of short communications in the workshop proceedings, provided that at least one of the authors has registered for the workshop. Please Submit an Abstract which should not exceed 2 pages.

Submit your abstract today

Go to Easychair, log in or register, and click on “New Submission”. Deadline is 14 February.

OpenCL at SC14

Posted by Vincent Hindriksen on 12 November 2014 with 5 Comments

During SC14 (SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

Finally I’m attending SC14 myself, and will give two talks for you. Tuesday I’ll be part of a 90 minute session of Khronos, where I’ll talk a bit about GROMACS and selecting the right accelerator for your software. Wednesday I’ll be sharing our experiences from our port of GROMACS to OpenCL. If you meet me, then I can hand you over a leaflet with the decision chart to help select the best device for the job.

Continue reading “OpenCL at SC14” →

Products using OpenCL on ARM MALI are coming

Posted by Vincent Hindriksen on 26 October 2013

The past year you might not have heard much from OpenCL-on-ARM, besides the Arndale developer-board. You have heard just a small portion of what has been going on.

Yesterday the (Linux) OpenCL-drivers for the Chromebook (which contains an ARM MALI T604) the have been released and several companies will launch products using OpenCL.

Below are a few interviews with companies who have built such products. This will give an idea of what is possible on those low-power devices. To first get an idea of what this MALI T604 GPU can do if it comes to OpenCL, here a video from the 2013-edition of the LEAP-conference we co-organised.

http://www.youtube.com/watch?v=UQXfjvcqiQg

Understand that the whole board takes less than ~11.6 Watts – that is including the CPU, GPU, memory , interconnects, networking, SD-card, power-adapter, etc. Only a small portion of that is the GPU. I don’t know the exact specs as this developer-board was not targeted towards energy-optimisation goals. I do know this is less than the 225 Watts of a discrete GPU alone.

Interviews with ARM partners Continue reading “Products using OpenCL on ARM MALI are coming” →

Edit: About the multi-GPU configuration

C++

C, Objective C

Delphi/Pascal

Fortran

Go

Haskell

Java

JavaScript

Julia

Lisp

.NET: C#, F#, Visual Basic

Perl

Python

Ruby

Rust

Math-software

Mathematica

Matlab

R

Suggestions?

Black boxes will never be transparent

Terminology

Speed-comparison

Homogeneous vs Heterogeneous

Missing language-features

Last words

GPU vs CPU

Assessing software for GPU-porting fitness

Why is it open sourced?

Give deeper understanding of how functions are implemented

Debug software

Get bugs solved quicker

Have “unimportant” specific improvements done

Get the feature set extended

Get support for complete new things

Initiate alternative implementations (?)

Improve and increase university collaborations

Get better support in more Linux distributions

Performance

Get started

Table of Contents

Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing

Deadlines at a Glance

Selection Criteria

Unpublished Technical Papers

Technical Presentations

Workshops & Tutorials

Posters

Submit your abstract today

Interviews with ARM partners Continue reading “Products using OpenCL on ARM MALI are coming” →