Speeding up your data-processing

Using unused processing power

The computer as we know it has changed a lot in recent years. For instance, we can now use the graphics card for non-graphics purposes, which gives the computer a much higher potential. A doubling of processing speed, or more, is the rule rather than the exception. Software that taps into this unused speed gains a huge advantage – and that explains the technique's growing popularity.

The acceleration technique is called OpenCL. It works not only on graphics cards from AMD and NVidia, but also on the latest processors from Intel and AMD, and even on processors in smartphones and tablets. Special processors such as DSPs and FPGAs will get support too. As it is an open standard, support will only grow.

Offered services

StreamHPC has been active since June 2010 as acceleration-specialist and offers the following services:

  • development of extremely fast software,
  • design of (faster) algorithms,
  • accelerating existing software, and
  • training in OpenCL and acceleration techniques.

Not many companies master these specialisms, and StreamHPC enjoys worldwide recognition on top of that. To provide support for large projects, collaborations with other companies have been established.

The preferred way of working is a low hourly rate combined with agreed bonuses for speed-ups.

Target markets

The markets we operate in are bio-informatics, finance, audio and video, high-tech R&D, energy, mobile apps and other industries that target more performance per watt or more performance per second.

WBSO

What we offer suits WBSO projects well (in the Netherlands only). This means that a large part of the costs can be subsidised. Together we can promote new technologies in the Netherlands, as is the goal of this subsidy.

Contact

Call Vincent Hindriksen MSc at +31 6 45400456 or mail to vincent@StreamHPC.nl with all your questions, or request a free demo.

Download the brochure for more information.

DOI: Digital attachments for Scientific Papers

Ever seen a claim in a paper that you disagreed with, or that triggered you, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.

When we want to redo the experiments of a paper, it starts with finding the code and data used. A good start is GitHub or the homepage of the scientist. GitLab, Bitbucket, SourceForge or the personal homepage of one of the researchers could also be places to look. Emailing the authors is only an option if the university homepage mentions an email address – and we're not surprised to get no reaction at all. If all that doesn't work, implementing the pseudo-code and creating our own data might be the only option left – and it's uncertain whether that will support the claims.

So what if scientific papers had an easy way to connect to digital objects like code and data?

Here the DOI comes in.

Continue reading “DOI: Digital attachments for Scientific Papers”

Jobs

We have jobs for people who get bored easily.

DSC_0046_small
We kept space for you

In return, we offer a solution to boredom, since performance engineering is hard.

As the market demand for affordable high performance software grows, StreamHPC continuously looks for great and humble people to join our team(s). With upcoming products, new markets like OpenCL on low-power ARM processors and compute-intensive applications like AI and VR, we expect continuous growth.

We currently have two types of jobs:

  • Software Performance Engineers. The heroes who make software go vroom vroom
  • Growth and Support. The heroes who make the people and the company go vroom vroom

For the second group, you will find the most changes in the job-ads, as we seek people who want to join us for the years to come.

Find our jobs and Apply

To apply, go to our recruitment website.

The hiring process

The procedure for the technical roles is as follows:

  • You send your CV, some public code (preferably C/C++/CUDA/HIP/OpenCL) and your motivational letter/email.
  • We do a quick scan of your CV and letter. (In most cases you’ll get feedback within 2 to 5 days)
  • For those who are left, you do a simple online test. This is to get a grasp of your way of working and thinking, and to prepare you for the longer test. (25 minutes max)
  • You will have a (video) talk with HR (30-45 minutes)
  • After that, you are invited for a longer online test, where you show your skills in C/C++ and algorithms. Be warned: this includes ridiculous puzzles, simply because we actually use those ridiculous things (takes 2–3 hours)
  • You’ll get a technical interview on C++ and GPGPU (2 hours)
  • We’ll send you a conditional job-offer, assuming that the rest will be ok.
  • We now go into the long interview to be absolutely sure we are a fit, and to introduce each other in more detail (takes 3 hours)
  • We check your references. If these check out, you’re hired.

We try to minimize the time it takes you, while also giving you enough chances to prove yourself. As we are sometimes flooded with applications, we filter on simple things like "did not mention CUDA, OpenCL, SYCL, GLSL, HLSL, etc." – be sure to add such experience to your CV or cover letter, even if it was at university or in a personal GitHub project.

Want to know more about the process, how to prepare, etc? Read “our job-application process“.

For growth & support roles, the process is close to the above, but without the coding test.

For management roles, we do test for technical expertise, as we cannot afford to lose that know-how where projects are concerned.

So what are you waiting for? Apply for a job!

Supporting OpenCL on your own hardware

Say you have a device that is extremely good at numerical trigonometry (including integrals, transformations, etc., mainly to support Fourier transforms) by using massive parallelism. You also have an optimised library which takes care of the transfer to the device and the handling of the trigonometric math.

Then you find out that the strength of your company is not the device alone, but also the powerful and easy-to-use library. You also find out that companies are willing to pay for the library if it works with other devices too. From your own helpdesk you hear that most questions are about extending the library with specialised functions. Given this information, you define new customer groups for device-only and library-only – so just by adopting a standard you can increase revenue. Read below which steps you have to take to adopt OpenCL.

Continue reading “Supporting OpenCL on your own hardware”

Heterogeneous Systems Architecture (HSA) – the TL;DR

HSASolutionStack
Legacy apps run on HSA hardware, but less optimally.

The main problem of discrete GPUs is that memory needs to be transferred from CPU-memory to GPU-memory. Luckily we have SoCs (GPU and CPU in one die), but you still need to do in-memory transfers, as the two processors cannot access memory outside their own dedicated memory regions. This is due to the general architecture of computers, which did not take accelerators into account. Von Neumann, thanks!

HSA tries to solve this, by redefining the computer-architecture as we know it. AMD founded the HSA-foundation to share the research with other designers of SoCs, as this big change simply cannot be a one-company effort. Starting with 7 founders, it has now been extended to a long list of members.

Here I try to give an overview of what HSA is, not getting into much detail. It’s a TL;DR.

What is Heterogeneous Systems Architecture (HSA)?

It consists mainly of three parts:

  • new memory-architecture: hUMA,
  • new task-queueing: hQ, and
  • an intermediate language: HSAIL.

hsa-overview
HSA enables tasks to be sent to the CPU, GPU or DSP without bugging the CPU.

The basic idea is to give GPUs and DSPs about the same rights as a CPU in a computer, to enable true heterogeneous computing.

hUMA (Heterogeneous Uniform Memory Access)

HSA changes the way memory is handled by eliminating a hierarchy in processing-units. In a hUMA architecture, the CPU and the GPU (inside the APU) have full access to the entire system memory. This makes it a shared memory system as we know it from multi-core and multi-CPU systems.

HSA-shared-mem-supersimplified
This is the super-simplified version of hUMA: a shared memory system with CPU, GPU and DSP having equal rights to the shared memory.

hQ (Heterogeneous Queuing)

HSA gives more rights to GPUs and DSPs, taking work off the CPU's hands. Compared to the Von Neumann architecture, the CPU is no longer the Central Processing Unit – each processor can be in control and create tasks for itself and the other processors.

heterogeneous-queing
HSA-processors have control over their own and other application task queues.

HSAIL (HSA Intermediate Language)

HSAIL is a sort of virtual target for HSA-hardware. Hardware-vendors focus on getting HSAIL compiled to their processor instruction sets, and developers of high-level languages target HSAIL in their compilers. This is a proven concept of evolving complex hardware-software projects.

It is pretty close to OpenCL SPIR, which has comparable goals. Don't see them as competitors, but as two projects which each need different freedoms and will coexist.

What is in it for OpenCL?

OpenCL 2.0 has support for Shared Virtual Memory, a Generic Address Space and Recursive Functions – all supported by HSA-hardware.
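To make this concrete: with shared virtual memory, host and device work on one allocation without explicit copies. Below is a minimal coarse-grained SVM sketch, assuming a valid OpenCL 2.0 context, queue and kernel already exist (error handling omitted):

    /* Minimal coarse-grained SVM sketch for OpenCL 2.0. */
    #include <CL/cl.h>

    void svm_example(cl_context context, cl_command_queue queue, cl_kernel kernel)
    {
        const size_t n = 1024;

        /* One allocation, visible to both CPU and GPU - no explicit copies. */
        float *data = (float *)clSVMAlloc(context, CL_MEM_READ_WRITE,
                                          n * sizeof(float), 0);

        /* The CPU writes directly into the shared buffer. */
        clEnqueueSVMMap(queue, CL_TRUE, CL_MAP_WRITE, data,
                        n * sizeof(float), 0, NULL, NULL);
        for (size_t i = 0; i < n; ++i)
            data[i] = (float)i;
        clEnqueueSVMUnmap(queue, data, 0, NULL, NULL);

        /* The GPU uses the very same pointer - no clEnqueueWriteBuffer. */
        clSetKernelArgSVMPointer(kernel, 0, data);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(queue);

        clSVMFree(context, data);
    }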

OpenCL-code can be compiled to SPIR, which compiles to HSAIL, which compiles to HSA-hardware. When the time comes that HSAIL starts supporting legacy hardware, SPIR can be skipped.

HSA is going to be supported in OpenCL 1.2 via new flags – watch this thread.

Final words

Two companies are missing: Intel and Nvidia. Why? Because they want to do it themselves. The good news is that HSA is large enough to define the new architecture, making sure we get a standard. The bad news is that the two outsiders will come up with exceptions for whatever reason, which creates a need for exceptions in compilers.

You can read more on the website of the HSA-foundation or ask me in the comments below.

Let’s enter the Top500 HPC list using GPUs

The #500 super-computer has only 24 TFlops (2010-06-06): http://www.top500.org/system/9677

Update: scroll down to see the best configuration I have found. In other words: a cluster of at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFlops single precision, 1 TFlops double precision) would enter the Top500 – 25 nodes to get to a theoretical 25 TFlops, plus 5 extra to overcome the overhead. So for about €60 000,- of hardware anyone can be on the list (add at least €13 000,- if you want to use Windows instead of Linux for some reason). Of course you pay most for the services and the actual building when buying such a cluster, but you get the idea: it does not cost a few millions any more.

I'm curious: who is building these kinds of clusters? Could you tell me the specs (theoretical TFlops, LinPack TFlops and watts/TFlop) of your (theoretical) cluster which costs the customer less than €100 000,- in total? Or do you know companies who can do this? I'll make a list of the companies who will be building the clusters of tomorrow, the "Top €100.000,- HPC cluster list". You can mail me via vincent [at] this domain, or put your answer in a comment.

Update: the hardware shopping-list

Nobody wrote in the comments that it is easy to build a faster machine than the one described above. So I'll do it myself. We want the most flops per box, so here's the wish list:

  • A motherboard with as many slots as possible for PCI-E, CPU-sockets and memory banks, because the latency between nodes is high.
  • A CPU with at least 4 cores.
  • Focus on the bandwidth, else we will not be able to use all power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (where people, for some reason, love to talk about extreme machines). AMD currently has the graphics cards with the most double-precision power, so I chose their products. I'm looking around for Intel + NVIDIA, but currently they are far behind. Is AMD back on stage after being beaten by Intel's Core products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3-socket, 6(!) PCI-e slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for the 6 cores and the low price per FLOPS. And to make it a monster, we add 6 times an AMD HD5970 (€599,-), giving 928 x 6 = 5568 DP-GFLOPS. The board can handle 16GB DDR3 (€750,-), so we put it in. It needs about 3 power supplies of 700 Watt (€100,- each). We add a 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). The case needs to house the 3 power supplies (€100,-). Cooling is important, and I suggest you compete with a wind tunnel (€500,-). It will cost you €6228,- for 5.6 double-precision TFLOPS and 27 single-precision TFLOPS. A cluster of these would be on the HPC500-list for around €38 000,- (pure hardware price, not taking network devices too much into account, nor the price of man-hours).
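If you want to redo this arithmetic with your own parts list, here is a tiny sketch; the prices and FLOPS are the ones listed above, while the LinPack efficiency is my own assumption:

    /* Back-of-the-envelope calculator for the node described above. */
    #include <stdio.h>

    int main(void)
    {
        const double gpu_dp_gflops  = 928.0;   /* AMD HD5970, double precision */
        const int    gpus_per_node  = 6;
        const double node_price_eur = 6228.0;  /* parts only, as listed above */
        const double target_tflops  = 24.0;    /* roughly the current #500 entry */
        const double efficiency     = 0.75;    /* assumed LinPack yield */

        double node_tflops = gpu_dp_gflops * gpus_per_node / 1000.0;
        int nodes = (int)(target_tflops / (node_tflops * efficiency)) + 1;

        printf("%.2f DP TFLOPS per node; %d nodes needed; ~EUR %.0f in parts\n",
               node_tflops, nodes, nodes * node_price_eur);
        return 0;
    }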

Disclaimer: this is the price of a single node, excluding services, maintenance, software-installation, networking, engineering, etc. Please note that the above price is pure for building a single node for yourself, if you have the knowledge to do so.

OpenCL in the Clouds

Buzzwords are cool; they are loosely defined and are actually shaped by the many implementations that use the label. Like Web 2.0, which is cool JavaScript for one person and interaction for another. Now we have cloud computing, which is cluster computing with "something extra". More than a year ago clouds were in the data centre, but now we even have "private clouds". So how to incorporate GPGPU? A cluster with native nodes to run our OpenCL code with pre-distributed data is pretty hard to maintain, so what are the other solutions?

Distributed computing

Folding@home now has OpenCL support to add the power of non-NVIDIA GPUs. While in clusters the server commands the clients what to do, here the clients ask the server for jobs. The disadvantage is that the clients are written for a specific job and are not really flexible enough to take different kinds of jobs. There are several solutions for this code-distribution problem, but the approach is still not suitable for smaller problems and small clusters.

Clusters: MPI

The project SHOC (Scalable HeterOgeneous Computing) is a collection of benchmark programs testing the performance and stability of systems using computing devices with non-traditional architectures for general-purpose computing, and the software used to program them. While it is only a benchmark, it can be of great use when designing a cluster. Beyond that I only found CUDA MPI solutions, which have not been ported to OpenCL yet.

Also check out Hoopoe, which is a cloud-computing service to run your OpenCL-kernels in their cloud. It seems to be more limited to .NET and have better support for CUDA, but it is a start. In Europe there is a start-up offering a rent-model for OpenCL-computation-time; please contact us if you want to get in contact with them.

Clusters: OpenMP

MOSIX has added a "Many GPU Package" to their cluster management system, so it now allows applications to transparently use cluster-wide OpenCL devices. When "choosing devices", not only the local GPU pops up, but also all GPUs in the cluster.
It works disk-less, in the sense that no files are copied to the computation clients and everything stays in memory. Disk-less computation has an advantage when cloud computers are not fully trusted. Take note that on most cloud computers the devices need to be virtualised (see the next part).

Below is its layered model, VCL being the “Virtual OpenCL Layer”.

They have chosen to base it on OpenMP; while the kernels don't need to be altered, some OpenMP code needs to be added. They are very happy to tell that it takes much less code to use OpenMP instead of MPI.

You see that a speed-up between 2.19 and 3.29 on 4 nodes is possible. We see comparable cluster speed-ups in an old cluster study. The actual speed-up on clusters depends mostly on the amount of data that needs to be transferred.

The project refers to a project called remote CUDA, which only works with NVIDIA GPUs.

Device Virtualisation

Currently there is no good device virtualisation for OpenCL. The gVirtuS project currently only supports CUDA, but they claim it is easily rewritten for OpenCL. The code needs to be downloaded with a Mercurial client (comparable to Git and found in the repositories of most Linux distributions):
> hg clone http://osl.uniparthenope.it/hg/projects/gvirtus/gvirtus gvirtus
Or download it here (dated 7-Oct-2010).

Let me know when you have ported it to OpenCL! Actually gVirtuS does not do the whole trick, since you need to divide the host devices between the different guest OSes, but luckily there is an extension which provides sharing of devices, called fission. More about this later.
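Device fission later became core functionality in OpenCL 1.2 as device partitioning. A minimal sketch of splitting a device into two halves, assuming an OpenCL 1.2 implementation (with the older cl_ext_device_fission extension the calls are the EXT-suffixed equivalents):

    /* Split one device into two equal sub-devices (OpenCL 1.2 partitioning). */
    #include <CL/cl.h>

    void split_device(cl_device_id device, cl_uint num_compute_units)
    {
        cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_EQUALLY,
            (cl_device_partition_property)(num_compute_units / 2),
            0  /* list terminator */
        };
        cl_device_id sub[2];
        cl_uint count = 0;

        clCreateSubDevices(device, props, 2, sub, &count);
        /* Each sub-device can now get its own context and queue,
           e.g. one per guest OS in a virtualised setup. */
    }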

We can all agree that a lot still needs to be done in this area of virtualised devices to get OpenCL into the cloud. If you can't wait, you can theoretically use MOSIX locally.

Afterword

A cloud is the best buzzword to market a scalable solution that overcomes the limitations of internet-connected personal devices. I personally think the biggest growth will be in personal clouds, so companies will have their own in-house cloud servers (read: clusters); people just want to have a feeling of control, comparable to preferring a daily traffic jam over public transport. But nevertheless, shared clouds have potential when it comes to computation-intensive jobs which do not need to be done all year round.

The projects presented here are a start for having OpenCL power at a larger scale for more demanding cases. Since one desktop PC stuffed with high-end video cards puts more power at our fingertips than a 4-year-old supercomputer cluster, there is still time.

Please send your comment if I missed a project or method.

Installing and using Portable Computing Language (PoCL)

pocl
PoCL, a perfect companion for Portable Apps?

Update August’13: 0.8 has been released

PoCL stands for Portable Computing Language, and the goal is to make a full and open source implementation of OpenCL 1.2 on top of LLVM.

This article is about installing and using PoCL on 64-bit Ubuntu. If you want to put in some effort to build it on Windows, you will certainly help the project. See also the TODO for version 0.8 if you want to help out (or want to know its current state). Not all functionality is implemented yet, but the project progresses using test-driven development – using the samples in the SDKs as a base.

Backends

They are eager to collaborate, so new backends can be added. From what I've seen, this project is one of the best starting points for new OpenCL drivers. First because of the work that has already been done (implement by example), second because it's an active open source project (continuous post-development), third because of the MIT licence (permits reuse within proprietary software). Here at StreamHPC we keep a close eye on the project.

On a normal desktop it has only one device, and that's the CPU. It has backends for several other types of CPUs (check ./lib/kernel in the source):

  • ARM
  • Cell SPU
  • Powerpc
  • Powerpc64
  • x86_64

The TCE libraries can also be used as a backend. The maturity of each backend differs.
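To verify that PoCL is picked up at all, a minimal enumeration program helps. This is a generic OpenCL listing, assuming the ICD loader (or a direct link against PoCL) finds the driver:

    /* List all OpenCL platforms and their devices; PoCL should show up
       with your CPU as its only device. */
    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint nplat = 0;
        clGetPlatformIDs(8, platforms, &nplat);

        for (cl_uint p = 0; p < nplat; ++p) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME,
                              sizeof(pname), pname, NULL);

            cl_device_id devices[8];
            cl_uint ndev = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 8, devices, &ndev);

            for (cl_uint d = 0; d < ndev; ++d) {
                char dname[256];
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME,
                                sizeof(dname), dname, NULL);
                printf("%s: %s\n", pname, dname);
            }
        }
        return 0;
    }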

More support is coming. For instance Radeon via the R600 project, but PoCL first needs to support LLVM 3.3 for that.

It is also a good starting point for building drivers for your own processor (contact us if you'd like assistance in building such a backend; see Supporting OpenCL on your own hardware for some related info).

Continue reading “Installing and using Portable Computing Language (PoCL)”

OpenCL 2.0 book on Indiegogo

indiegogo-opencl20
Edit: the project unfortunately did not get enough funding on Indiegogo.

Launching a book takes a lot of effort. By using crowdfunding, we hope to get the book published much earlier and at a lower price.

[button text=”Pre-order via Indiegogo – only in August 2013″ url=”http://igg.me/at/opencl20manual” color=”orange” target=”_blank”]

What you’ll get

You will get the first OpenCL 2.0 book on the market, fully updated with the latest function references and power tips. It is also usable for OpenCL 1.1/1.2, helping you write backward-compatible software.

Reference pages for quick access to all OpenCL functions – available online and offline. These have nothing to do with the Khronos reference pages for OpenCL 2.0: the description of each function definition has been completely rewritten and redesigned.

Reference pages of functions

A lot of energy goes into completely revising the original OpenCL reference pages, to create real value for you. This is not just a small upgrade, but an alternative (and more complete) explanation of all the functions. Expect it to contain twice as much information.

Each function will be explained in a clear language with full explanation of background-knowledge and an example. If the function can be used in more contexts, more examples are given.

At a glance you can see what is new per OpenCL version. All functions are also extensively tagged and grouped, so you can easily find similar functions.

Basic concepts and programming theories

Various new additions to the series of basic concepts and the series on programming theories will only be available in the book, not on the blog. These chapters will help you connect the dots and get a better overview of how OpenCL works.

This content is unique and not found anywhere else. It has its foundation in hundreds of articles and research papers, combined with years of experience in the field as a developer and a trainer.

Hardware and Optimisation guide

An explanation of all OpenCL optimisation techniques, including a guide on how to use auto-tuning to find the best configuration for each optimisation.

How well does each optimisation work on the various architectures? The results of mini-benchmarks will give you a complete overview of what helps and what doesn't.

Tools & software

There are various tools out there – both open source and commercial – that make it easier to program more efficiently and faster. The top 10 best OpenCL tools are described, including software not discussed online before.

For all contributors

Reference pages

You get access to the reference pages while I work on them. When finished, you also get a zip file with HTML files for times when you don't have internet access. You will receive updates for all OpenCL 2.0 revisions. You can give feedback at any time, and with that you influence the direction the manual is going.

E-book

At all times you get a progress report with a TOC. When finished, you'll get the book sent as a PDF. After some time for feedback, you'll receive a new version. People who bought the print will receive it with the second version.

All prices include Dutch VAT.

Ways You Can Help

Have you supported this project? Thank you very much for your support!

Please also tell your friends and colleagues, and share it on Twitter, Facebook and LinkedIn!

Problem solving tactic: making black boxes smaller

We are a problem-solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the clearer it becomes that without our problem-solving skills we could not tackle the complexity of GPUs and CPU clusters. While I normally shield off how we work and how we continuously improve ourselves, it is good to share a bit more, so both new customers and new recruits know what to expect from the team.

https://twitter.com/StreamHPC/status/1235895955569938432

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should be assuming that we assume. The problem is that we want full control and fast decisions, and assumptions conveniently cover all those scary unknowns.

Continue reading “Problem solving tactic: making black boxes smaller”

Contact us

Thank you for your interest in our company and services. We will try to answer your question within 24 hours.

There are three ways to get in contact: the contact form on our website, email, or phone.

    See ‘about us‘ for the address and other business-specific information.

    Qualcomm Snapdragon 600 & 800 (Adreno 320 & 330)

    snapdragon-800-mdps

    [infobox type=”information”]

    Need a Snapdragon programmer? Hire us!

    [/infobox]

    There are two Adreno GPUs currently known to have or get OpenCL support: the 320 and the 330, in the Snapdragon 600 and the Snapdragon 800 respectively.

    Qualcomm does not provide a developer's board, but the Sony Xperia Z is known to have OpenCL. Other phones are expected to have the drivers pre-installed too. That is interesting, as new phones with the Adreno 330 will ship soon, such as the LG Optimus G2 LS980, the Sony Xperia Z Ultra and a version of the Samsung Galaxy S4.

    Drivers are still in beta and are known to have bugs (as of April 2013). This discussion is the most interesting one to follow if you want to keep up to date.

    There are plenty of tools available, such as the Snapdragon SDK for Android and these Tools and Resources for the Adreno GPU. In the latter you'll find OpenCL samples you can run too (it is a Windows installer for some vague reason, so Mac and Linux users need to do some extracting). You can start building the code from this project.

    http://www.youtube.com/watch?feature=player_embedded&v=CaS0kpozyMM

    Boards

    The focus is on the more recent Snapdragon 800.

    Inforce IFC6410 – Snapdragon 600

    IFC6410website
    The IFC6410 is a $149 single-board computer with an Adreno 320 and the Qualcomm Snapdragon S4 Pro – APQ8064.

    Datasheet (PDF)

    Order here.

    Bsquare Mobile Development Boards for Snapdragon 800

    Processor: quad-core Krait 400 CPU at up to 2.3GHz per core (Snapdragon 8974), Adreno 330 GPU and Hexagon QDSP6 V5. A few highlights: WiFi n/ac, Bluetooth 4, USB 3.0, NFC, a 1280x720p screen (tablet: 1920x1080p), 2GB 800MHz memory and a 12MP+2MP camera. It all runs on Android 4.2 (Jelly Bean), so no Linaro packages. More info on the Qualcomm MDB page and on this Qualcomm blog.

    Phone form factor: $799 – tablet: $1099. Also check out Bsquare's information page for these products, but be aware that some links point to the wrong PDFs.

    Warning: you cannot call or use your provider’s internet with these devices! The word ‘phone’ only refers to the form factor.

    DragonBoard Snapdragon APQ8060A for Snapdragon 800

    Some highlights: Snapdragon 8074 quad-core processor, 2GB of LPDDR3 RAM, 16GB of eMMC, WiFi, Bluetooth, GPS, HDMI out and a qHD LCD with capacitive multi-touch, Adreno 330.

    It can be ordered via http://mydragonboard.org/db8074/ for $499,-.

    DB8074_annotated_EAP_v1.1

    Sony Xperia Z phones

    OpenCL_Sony
    The Xperia Z1 and Xperia Z Ultra have OpenCL support, and the drivers come pre-loaded. Go here for an introduction to OpenCL on these phones.

    You need the Android NDK to run OpenCL programs on them.

    Sony sees great advantages in using OpenCL on their mobile phones – from the website:

    You can also see that the execution speed is much faster using OpenCL on the GPU when compared to the plain single threaded c-code running on the CPU (tested on Sony Xperia Z1). In addition to the speed benefit, you may also find that you decrease energy consumption by utilizing OpenCL on the GPU compared to using standard programming methods on the CPU.
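    Since the driver is pre-loaded on the phone but libOpenCL.so is not part of the official NDK, a common trick is to load it at runtime with dlopen. A minimal sketch, assuming you bundle the Khronos CL headers with your project and the driver sits in one of the usual system paths:

        /* Load the pre-installed OpenCL driver at runtime on Android (NDK). */
        #include <dlfcn.h>
        #include <stdio.h>
        #include <CL/cl.h>   /* bundle the Khronos headers yourself */

        typedef cl_int (*clGetPlatformIDs_fn)(cl_uint, cl_platform_id *, cl_uint *);

        int main(void)
        {
            /* Try the loader name first, then a common vendor path. */
            void *lib = dlopen("libOpenCL.so", RTLD_NOW);
            if (!lib) lib = dlopen("/system/vendor/lib/libOpenCL.so", RTLD_NOW);
            if (!lib) { printf("no OpenCL driver found\n"); return 1; }

            clGetPlatformIDs_fn getPlatformIDs =
                (clGetPlatformIDs_fn)dlsym(lib, "clGetPlatformIDs");

            cl_uint nplat = 0;
            getPlatformIDs(0, NULL, &nplat);
            printf("OpenCL platforms found: %u\n", nplat);

            dlclose(lib);
            return 0;
        }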

    Commodity and Open Standards – why OpenCL matters

    V-UV
    This article actually discusses the question: is GPGPU a solution for the masses, or is it for niche products? For the latter, open standards matter a lot less, as you will read.

    If you watch the video below on sales & marketing by Victor Antonio, you'll get what is so difficult about open standards: they push all companies using the standard into a focus on becoming the best. Indeed, survival of the fittest may be the basis of (true) capitalism and deliver the best products. The problem is that competing on price alone is not safe for the future of a company.

    http://www.youtube.com/watch?v=SJ5QmW3LfN4

    The key is specialisation, or creating unique value. The video below discusses this. The difference between “a feature” and “unique value” is a discussion on its own, which you really should have with your team about your own products. Continue reading “Commodity and Open Standards – why OpenCL matters”

    Sheets GPGPU-day 2012 online

    GPGPU Day Speakers_small.2
    Photos made by Cyrille Favreau

    Better late than never. It has almost been a year, but finally they're online: the sheets of the GPGPU-day Amsterdam 2012.

    You can find the sheets at http://www.platformparallel.com/nl/gpgpu-day-2012/abstracts/ – don't hotlink the files, but link to this page. The abstracts should introduce the sheets, but if you need more info, just ask in the comments here.

    PDFs from two unmentioned talks:

    I hope you enjoy the sheets. On 20 June the second edition will take place – see you there!

    NVIDIA enables OpenCL 2.0 beta-support

    In the release notes of the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they publicly speak about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any mention in release notes and were therefore officially unofficial.

    You should know that NVIDIA only started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures on 3 April 2015. By then, OpenCL 2.0 had already been out for one and a half years (November 2013) – now more than three years ago.

    Does this mean that you will soon be able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.
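    To see for yourself what a driver reports, you can query the version strings. A minimal sketch, assuming you already obtained a cl_device_id for your NVIDIA GPU:

        /* Print what the driver claims to support; what the 378.66 beta
           reports here tells you how far the OpenCL 2.0 support goes. */
        #include <stdio.h>
        #include <CL/cl.h>

        void print_versions(cl_device_id device)
        {
            char dev_version[128], c_version[128];
            clGetDeviceInfo(device, CL_DEVICE_VERSION,
                            sizeof(dev_version), dev_version, NULL);
            clGetDeviceInfo(device, CL_DEVICE_OPENCL_C_VERSION,
                            sizeof(c_version), c_version, NULL);
            printf("Device: %s, OpenCL C: %s\n", dev_version, c_version);
        }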

    Update: We tested NVIDIA drivers on Linux too. Read it here.

    Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

    Why did AMD open source ROCm’s OpenCL driver-stack?

    AMD open sourced the OpenCL driver stack for ROCm at the beginning of May. With this they kept their promise to open source (almost) everything. The hcc compiler was open sourced earlier, just like the kernel driver and several other parts.

    Why is this a big thing?
    There are indeed several open source OpenCL implementations, but with one big difference: they're secondary to an official compiler/driver. Implementations like PoCL and Intel's Beignet play catch-up; AMD's open source implementation is primary.

    It contains:

    • OpenCL 1.2 compatible language runtime and compiler
    • OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
    • Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

    For testing the implementation, see the Khronos OpenCL CTS framework or openbenchmarking.org from Phoronix.

    Why is it open sourced?

    There are several reasons. AMD wants to stand out in HPC and therefore listened carefully to their customers, while taking good note of where HPC is going. Where open source used to be something not for businesses, it is now simply required to be commercially successful. Below are the most important answers to the question.

    Give deeper understanding of how functions are implemented

    It is very useful to understand how functions are implemented. For instance, the difference between sin() and native_sin() can tell you a lot more about what's best to use. It does not tell how the functions are implemented on the GPU, but it does tell which GPU functions are called.

    Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.
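    As a hypothetical micro-example, the kernel below contrasts the two variants; with the sources open you can now trace exactly which GPU instructions each line becomes:

        /* OpenCL kernel: precise versus native transcendentals. */
        __kernel void compare_sin(__global const float *in, __global float *out)
        {
            size_t i = get_global_id(0);

            float precise = sin(in[i]);         /* full-precision implementation */
            float fast    = native_sin(in[i]);  /* maps to hardware; precision is
                                                   implementation-defined */
            out[i] = precise - fast;            /* inspect the difference yourself */
        }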

    Debug software

    When you are working on a large project and have to work with proprietary libraries, these are a typical delay factor. I think every software engineer has had the experience that a library does not perform as documented and work-arounds had to be created. Depending on the project and the library, this could take weeks of delay – only sarcasm can describe those situations, as the legal documents were often a lot better than the software documentation. When the library was open source, the debugger could step in and give the "aha" that was needed to progress.

    When working with drivers it's about the same. GPU drivers and compilers are extremely complex, and of course your project hits that bug which nobody encountered before. Now that all is open source, you can step into the driver with the debugger. Moreover, the driver can be compiled with a fix instead of a work-around.

    Get bugs solved quicker

    A trace can now include the driver stack and the line numbers. Even a suggestion for a fix can be given. This not only improves reproducibility, but also reduces the time needed to get a fix through all steps. When a fix is suggested, AMD only needs to test for regressions to accept it. This makes the work of tools like CLsmith a lot easier.

    Have “unimportant” specific improvements done

    Say your software is important and in the spotlight, like Blender or the LuxMark benchmark – then you can expect your software to get attention in optimisations. The rest of us have to hope that our special code constructions resemble one that is targeted. The result is many forum comments and bug reports, for which the compiler team does not have enough time. This is frustrating for both sides.

    Now everybody can have their improvements submitted, provided they do not slow down the focus software, of course.

    Get the feature set extended

    Adding SPIR-V is easy now: a SPIR-V frontend needs to be added to ROCm and the right functions need to be added to the OpenCL driver. Unfortunately there is no support for OpenCL 2.x host code yet – due to lack of demand, I understood.

    For such extensions the AMD team needs to be consulted first, because they have implications for the test suite.

    Get support for complete new things

    It takes a single person to make something completely new – and this becomes a whole lot easier now.

    More often the opportunity lies in what is not there yet, and research needs to be done to break the chicken-and-egg problem. Optimised 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All the high-performance code is there for you.

    Initiate alternative implementations (?)

    Not a goal, but forks are coming for sure. For most forks the goals would be like the ones above, to be merged later with the master branch. A few forks will go their own direction – for now it's hard to predict where those will go.

    Improve and increase university collaborations

    While the software was protected, working on AMD's compiler infrastructure was only possible under strict contracts. In the end it was easier to focus on the open source backends of LLVM than to go through the legal path.

    Universities are very important for finding unexpected opportunities, integrating the latest research, bringing in potential new employees and doing research collaborations. An added bonus for the students is that the GPUs might be allowed to be used for games too.

    Timour Paltashev (Senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

    Get better support in more Linux distributions

    It's easier to include open source drivers in Linux distributions. These OpenCL drivers do need binary firmware (which has been disassembled and seems to do as advertised), but the discussion is whether that counts as part of the hardware or the software when marking a distribution as "libre".

    There are many obstacles to getting the complete ROCm stack included as the default, but in its current state it stands a much better chance.

    Performance

    Phoronix benchmarked ROCm 1.4 OpenCL in January on several systems, and now ROCm 1.5 OpenCL on a Radeon RX 470. Though the 1.5 benchmarks were more limited, the important conclusion is that the young compiler is now mostly on par with the closed source OpenCL implementation combined with the AMDGPU drivers – only in LuxMark was AMDGPU (much) better. The same goes for the comparison with the old proprietary fglrx drivers, which were fully optimised and the first goal to get even with. You'll see another big step forward with ROCm 1.6 OpenCL.

    Get started

    You can find the build instructions here. Let us know in the comments what you’re going to do with it!

    Image Processing

    Vd-Sharp Vd-Blur2 Vd-Edge3
    At StreamHPC, there is broad experience in the parallel, high-performance implementation of image filters. We have significantly improved the performance of various image-processing software. For example, we have supported Pixelmator in achieving outstanding processing speeds on large image data, and users frequently praise the software's speed in comparisons with competing products.

    StreamHPC is currently hosting an educational initiative that supports interested individuals in their efforts to port algorithms from the open source GEGL image-processing framework to fast parallel versions based on OpenCL. GEGL is used by the popular image-manipulation software Gimp, as well as other free software. For more information on this project, look at our website OpenCL.org, which we dedicate to spreading knowledge on OpenCL.
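    To give an idea of what such a port looks like, below is a minimal sketch of a brightness/contrast kernel – the typical shape of a GEGL point operation after porting to OpenCL. The parameter names are illustrative, not GEGL's actual API:

        /* Per-pixel brightness/contrast on RGBA float pixels. */
        __kernel void brightness_contrast(__global const float4 *in,
                                          __global       float4 *out,
                                          float brightness,
                                          float contrast)
        {
            size_t i = get_global_id(0);
            float4 p = in[i];
            p.xyz = (p.xyz - 0.5f) * contrast + 0.5f + brightness; /* alpha untouched */
            out[i] = p;
        }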

    Verify your OpenCL and CUDA kernels online for race conditions

    gpuverify
    GPUVerify is a tool for the formal analysis of GPU kernels written in OpenCL and CUDA. The tool can prove that kernels are free from certain types of defect, such as data races. This is quite useful feedback for any GPU programmer.

    Below you find an online version of the tool (please don't break it!). Play around and test your kernels. Be aware that the number of groups is the global work size divided by the local work size.

    For demo-purposes some values have been pre-filled with a simple kernel – press “Check my OpenCL kernel” to find the results. Did you expect this from this kernel? Can you explain the result?
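    If you want something of your own to feed it, here is a deliberately broken example kernel (hypothetical, not the pre-filled one): all work-items in a group write to the same local cell, so GPUVerify should report a write-write race:

        /* Deliberate defect: every work-item in a group writes to shared[0]
           without synchronisation - a write-write data race. */
        __kernel void racy(__global int *out, __local int *shared)
        {
            shared[0] = get_local_id(0);          /* the race happens here */
            barrier(CLK_LOCAL_MEM_FENCE);
            out[get_global_id(0)] = shared[0];
        }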

    After the LEAP-conference I’ll extend this article – till then I’m too time-limited. For now I wanted to share the online version with you, especially with the people who will attend the tutorial at LEAP. Be sure to check out the GPUVerify website and paper to learn more about this fantastic tool! Continue reading “Verify your OpenCL and CUDA kernels online for race conditions”

    OpenCL vs CUDA Misconceptions


    Translation available: Russian/Русский. (Let us know if you have translated this article too… And thank you!)


    Last year I explained the main differences between CUDA and OpenCL. Now I want to clear up some old (and partly false) stories around CUDA versus OpenCL. While it has been claimed too often that one technique is just better, it should also be said that CUDA is better in some aspects, whereas OpenCL is better in others.

    Why did I write this article? I think NVIDIA is visionary in both technology and marketing. But as I've written before, the potential market for dedicated graphics cards is shrinking, and I therefore forecast the end of CUDA on the desktop. Not having this discussion opens the door to closed standards and delays the innovation that can happen on top of OpenCL. The sooner people and companies start choosing a standard that gives equal competitive advantages, the more we can expect from the upcoming hardware.

    Let's stand by what we learnt at school about gathering information: don't put all your eggs in one basket! Gather as many sources and references as possible. Please also read articles which claim (and underpin!) why CUDA has a more promising future than OpenCL. If you can, post comments with links to articles you think others should read too. We appreciate contributions!

    I also found that Google Insights agrees with what I constructed manually.

    Continue reading “OpenCL vs CUDA Misconceptions”

    GPU-related PHD positions at Eindhoven University and Twente University

    We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

    We'd like to bring the following PhD positions to your attention:


    Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

    In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs to have expert knowledge on GPU architectures to optimally gain advantage of those capabilities.

    Continue reading “GPU-related PHD positions at Eindhoven University and Twente University”

    Valgrind suppression file for AMD64 on Linux

    valgrind_amd
    Valgrind is a great tool for finding possible memory leaks in code written in C, C++, Java, Perl, Python, assembly, Fortran, Ada, etc. I use it to check whether provided code is OK before I start porting it to GPU code – it finds those devils in the details. It has also given me good feedback on my own bugs when writing OpenCL code. Unfortunately it does not work well with optimised libraries, such as the OpenCL driver from AMD.

    You'll get problems like the ones below, which clutter the output.

    ==21436== Conditional jump or move depends on uninitialised value(s)
    ==21436==    at 0x6993DF2: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x6C00F92: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x6BF76E5: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x6C048EA: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x6BED941: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x69550D3: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x69A6AA2: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x69A6AEE: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x69A9D07: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x68C5A53: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x68C8D41: ??? (in /usr/lib/fglrx/libamdocl64.so)
    ==21436==    by 0x68C8FB5: ??? (in /usr/lib/fglrx/libamdocl64.so)

    How to fix this cluttering? Continue reading “Valgrind suppression file for AMD64 on Linux”
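
    As a taste of the answer: Valgrind reads extra suppression rules from a file passed with --suppressions=<file>. A minimal entry of the kind such a file contains, matching the report above (the frame object is the AMD driver path from the trace):

        # amd-ocl.supp - hide uninitialised-value reports from the AMD OpenCL driver.
        # Usage: valgrind --suppressions=amd-ocl.supp ./your-program
        {
           amd-opencl-uninitialised-cond
           Memcheck:Cond
           obj:/usr/lib/fglrx/libamdocl64.so
        }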