General articles on technical subjects.

We more than halved the FPGA development time by using OpenCL

Posted by Vincent Hindriksen on 31 October 2015

Over the past year we developed and fine-tuned a project setup for FPGA development that is much faster than any other method, including other high-level languages for making FPGA-based systems.

How we did it

OpenCL makes it easy to use the CPU and GPU and their tools. Our CPU and GPU developers would design software with FPGAs in mind, after which the FPGA developer took over and finalised the project. As we have expertise in the very different phases of such project, we could be much more effective than when sticking to traditional methods.

The bonus

It also works on CPU and GPU. It has to be said, that the code hasn’t been fully optimised for CPUs and GPUs – this can be done in a separate project. In case a decision has to be made on which hardware to use, our solution has the least risk and the most answers.

Our Unique Selling Points

For the FPGA market our USPs are clear:

We outperform traditional FPGA development companies in time-to-market and price.
We can discuss problems on hardware level, software level and algorithm level. This contrasts with traditional FPGA houses, where there are less bridges.
Our software also works on CPUs and GPUs for no additional charge.
The latencies of the resulting project are very comparable.

We’re confident we can make a difference in the FPGA market. If you want more information or want to discuss, feel free to contact us.

OpenCL in the cloud – API beta launching in a month

Posted by Vincent Hindriksen on 31 October 2015

We’re starting the beta phase of our AMD FirePro based OpenCL cloud services in about a month, to test our API. If you need to have your OpenCL based service online and don’t want to pay hundreds to thousands of euros for GPU-hosting, then this is what you need. We have place for a few others.

The instances are chrooted, not virtualised. The API-calls are protected and potentially some extra calls have to be made to fully lock the GPU to your service. The connection is 100MBit duplex.

Payment is per usage, per second, per GPU and per MB of data – we will be fine-tuning the weights together with our first customers. The costs are capped, to make sure our service will remain cheaper than comparable EC2 instances.

Get in contact today, if you are interested.

The magic of clGetKernelWorkGroupInfo

Posted by Vincent Hindriksen on 22 October 2015

It’s not easy to get the available private memory size – actually it’s impossible to get this information directly from the device/drivers, using the OpenCL API. This can only be explained after you dive deep into clGetKernelWorkGroupInfo – the function that tells you how well your kernel fits on the device. It is strange this function is not often discussed.

Memory sizes

CL_KERNEL_LOCAL_MEM_SIZE

Returns the amount of local memory, in bytes, being used by a kernel (per work-group). Use CL_DEVICE_LOCAL_MEM_SIZE to find out the maximum.

CL_KERNEL_PRIVATE_MEM_SIZE

Returns the minimum amount of private memory, in bytes, used by each work-item in the kernel.

Work sizes

CL_KERNEL_GLOBAL_WORK_SIZE

This answers the question “What is the maximum value for global_work_size argument that can be given to clEnqueueNDRangeKernel?”. The result is of type size_t[3].

CL_KERNEL_WORK_GROUP_SIZE

The is the same for local_work_size. The kernel’s resource requirements (register usage etc.) are used, to determine what this work-group size should be.

CL_KERNEL_COMPILE_WORK_GROUP_SIZE

If __attribute__((reqd_work_group_size(X, Y, Z))) is used, then (X, Y, Z) is returned, else (0, 0, 0).

CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

It returns a performance-hint: if the total number of work-items is a multiple of this number, then you’ll get good results. So no more remembering 32 or 64 for specific GPUs, but simply kick in a call to this function.

Combined with clDeviceInfo’s CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS, you can fine-tune your workgroup-size in case you need the group-size to be as large as possible.

Call for papers: SYCL workshop, 13-March-2016, Barcelona, Spain

Posted by Vincent Hindriksen on 22 October 2015 with 2 Comments

A high-level language has been on OpenCL’s roadmap since the years, and would be started once the foundations were ready. Therefore with OpenCL 2.0, SYCL was born.

To keep the pace high, a SYCL workshop is being organised. This week the call-for-papers is opened, which you can read below.

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Barcelona, Spain Sunday, 13th March, 2016

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction
layer that builds on the underlying concepts, portability and efficiency of
OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL
enables single source development where C++ template functions can contain both
host and device code to construct complex algorithms that use OpenCL
acceleration, and then re-use them throughout their source code on different
types of data. SYCL has also been designed with resilience from the start, by
featuring, for example, a fall-back mechanism to automatically re-enqueue
kernels on different queues in case of a failure.

The SYCL Workshop aims to gather together SYCL’s users, researchers, educators
and implementors to encourage and grow a community of users behind the SYCL
standard, and related work in C++ for heterogeneous architectures. This will be
a half-day workshop. SYCL’16 will be held in Barcelona, 13 March 2016,
co-located with PPoPP 2016, HPCA 2016, CGO 2016 and LLVM 2016.

Travel Awards

Student authors who present papers in this workshop are eligible to apply for
travel awards. Further details will be announced after notification of
acceptance.

Important Dates

Submissions: 23rd November
Notification: 21st December
Final version: 24th January, 2016
Workshop: Sunday, 13th March, 2016

Submission Guidelines

All submissions must be made electronically through the conference submission
site, at https://easychair.org/conferences/?conf=sycl16.
Submissions may be one of the following:

Extended abstract: Two pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Short Paper: Four to six pages in standard SIGPLAN two-column conference
format (preprint mode, with page numbers)

Submissions must be in PDF format and printable on US Letter and A4 sized
paper. All submissions will be peer-reviewed by at least two members of the
program committee. We will aim to give longer presentation slots to papers than
to extended abstracts. Conference papers will not be published, but made
available through the website, alongside the slides used for each presentation.
The aim is to enable authors to get feedback and ideas that can later go into
other publications. We will encourage questions and discussions during the
workshop, to create an open environment for the community to engage with.

Topics of interest include, but are not limited to:

Applications implemented using SYCL

C++ Libraries using SYCL

C++ programming models for OpenCL (C++AMP, Boost.Compute, …)

Other C++ applications using OpenCL

New proposals to the SYCL specification

Integration of SYCL with other programming models

Compilation techniques to optimise SYCL kernels

Performance comparisons between SYCL and other programming models

Implementation of SYCL on novel architectures (FPGA, DSP, …)

Using SYCL in fault-tolerant systems

Reports on SYCL implementations

Debuggers, profilers and tools

Organising Committee

Paul Keir, University of the West of Scotland (UK)
Ruyman Reyes, Codeplay Software Ltd, Edinburgh (UK)

Program Committee

Jens Breitbart, TU Munich
Alastair Donaldson, Imperial College London, UK
Christophe Dubach, University of Edinburgh, UK
Joel Falcou, LRI, Université Paris-Sud, France
Benedict Gaster, University of the West of England, UK
Vincent Hindriksen, StreamHPC, Netherlands
Christopher Jefferson, St. Andrews University, UK
Ronan Keryell, Xilinx, Ireland
Zoltán Porkoláb, ELTE, Hungary
Francisco de Sande, Universidad de La Laguna, Spain
Ana Lucia Varbanescu, University of Amsterdam, Netherlands
Josef Weidendorfer, TU Munich

Yes, we’re in the Program Committee as one of the few non-academics. We’re looking forward to read your proposal!

If you have a blog, feel free to copy the above text and repost it.

We’re a member of Khronos now!

Posted by Vincent Hindriksen on 29 September 2015 with 2 Comments

For years we have had a good collaboration with the Khronos group, mainly due our community presence. Now it was time to get into a closer collaboration and become an official Contributor Member (logo not there yet). This effectively means two things:

Instead of complaining on the blog and on twitter, we can now discuss it within the working group. 🙂
If we accidentally find interesting info we now know is under NDA, we won’t share with you anymore. 🙁

Our goal for collaborating with Khronos remains the same: help OpenCL and its community advance. We therefore keep building OpenCL.org, writing articles on OpenCL and organising events in the years to come.

One of our goals of the coming year is to get more vendors on OpenCL 2.0. If you think we should have more goals on our agenda, write them in the comments.

Handling OpenCL with CMake 3.1 and higher

Posted by Vincent Hindriksen on 25 September 2015 with 2 Comments

There has been quite some “find OpenCL” code for CMake around. If you haven’t heard of CMake, it’s the most useful cross-platform tool to make cross-platform software.

Put this into CMakeLists.txt, changing the names for the executable.

#Minimal OpenCL CMakeLists.txt by StreamHPC

cmake_minimum_required (VERSION 3.1)

project(GreatProject)

# Handle OpenCL
find_package(OpenCL REQUIRED)
include_directories(${OpenCL_INCLUDE_DIRS})
link_directories(${OpenCL_LIBRARY})

add_executable (main main.cpp)
target_include_directories (main PUBLIC ${CMAKE_CURRENT_SOURCE_DIR})
target_link_libraries (main ${OpenCL_LIBRARY})

Then do the usual:

make a build-directory
cd build
cmake .. (specifying the right Generator)

Adding your own CMake snippets and you’re one happy dev!

Cmake 3.7

CMake 3.7 makes it even easier! You can do the following:

find_package(OpenCL REQUIRED)
add_executable(test_tgt main.c)
target_link_libraries(test_tgt OpenCL::OpenCL)

This automatically sets up the include paths and target library to link against. No need to use the ${OpenCL_INCLUDE_DIRS} and ${OpenCL_LIBRARIES} any more.

(Thanks Matthäus G. Chajdas for improving this!)

Getting CMake 3.1 or higher

Ubuntu/Debian: Get the PPA.
Other Linux: Get the latest tar.gz and compile.
Windows/OSX: Download the latest exe/dmg from the CMake homepage.

If you have more tips to share, put them in the comments.

How to do Approximation Computing in OpenCL

Posted by Vincent Hindriksen on 18 September 2015 with 1 Comment

Approximation computing is allowing larger errors when performing calculations. While most programmers might go the other way (lower error rate) by using doubles for instance, this field is interesting for quite some of us. The reason is that you can get more performance, more bandwidth space and lower power usage in return.

In Neural Networks high precision is not required, but also Big Data approximation computing is very useful. Most important is that you actually think of the possibility to trade in precision when designing your OpenCL software. For example, does your window function need to be very precise or can there be rounding errors? Or do you do iterative steps (more precision needed), or calculate relatively from the starting point (less precision needed)? You can even use relatively more expensive algorithms that compensate with a smaller overall error. Here at StreamHPC we think this through as one of the main optimisation techniques.

Let’s look into what is possible in OpenCL and which is the hardware support. Continue reading “How to do Approximation Computing in OpenCL” →

Let us do your peer-review

Posted by Vincent Hindriksen on 26 August 2015

There are many research papers that claim enormous speed-ups using an accelerator. From our experience a large part is because of code-modernisations (parallisation & optimisation), which makes the claim look false. That’s why we offer peer-reviews for half our rate for CUDA and OpenCL software. The final costs depend on the size and complexity of the code.

We will profile your CPU and Accelerator code on our machines and review the code. The results are the effect of the code-modernisations and the effect of using the accelerator (GPU, XeonPhi, FPGA). With this we hope that we stimulate the effect of code-modernization gets more research attention over using “miracle hardware”.

Don’t misunderstand: GPUs can still get an average of 8x speedup (or 700% speed improvement) over optimised code, which is still huge! But it’s simply not the 30-100x speed-up claimed in the slide at the right.

OpenCL basics: Multiple OpenCL devices with the ICD.

Posted by Anca Hamuraru on 14 August 2015 with 5 Comments

tesla-xeonphi-firepro — XeonPhi, Tesla, FirePro

Most systems nowadays have more than just one OpenCL device and often from different vendors. How can they all coexist from a programming standpoint? How do they interact?

OpenCL platforms and OpenCL devices

Firstly, please bear with me for a few words about OpenCL devices and OpenCL platforms.

An OpenCL platform usually corresponds to a vendor. This is responsible for providing the OpenCL implementation for its devices. For instance, a machine with an i7-4790 Intel CPU is going to have one OpenCL platform, probably named “Intel OpenCL” and this platform will include two OpenCL devices: one is the Intel CPU itself and the other is the Intel HD Graphics 4600 GPU. This Intel OpenCL platform is providing the OpenCL implementation for the two devices and is responsible for managing them.

Let’s have another example, but this time from outside the Windows ecosystem. A MacBook running OS X and having both the Intel Iris Pro GPU and a dedicated GeForce card will show one single OpenCL platform called “Apple”. The two GPUs and the CPU will appear as devices belonging to this platform. That’s because the “Apple” platform is the one providing the OpenCL implementation for all three devices.

Last but not least, keep in mind that:

An OpenCL platform can have one or several devices.
The same device can have one or several OpenCL implementations from different vendors. In other words, an OpenCL device can belong to more than just one platform.
The OpenCL version of the platform is not necessarily the same with the OpenCL version of the device.

The OpenCL ICD

ICD stands for Installable Client Driver and it refers to a model allowing several OpenCL platforms to coexist. It is actually not a core-functionality, but an extension to OpenCL.

For Windows and Linux the ICD has been available since OpenCL 1.0.
OSX doesn’t have an ICD at all. Apple chose to put all the drivers themselves under one host.
Android did not have the extension under OpenCL 1.1, but people ported its functionality. With OpenCL 2.0 the ICD is also on Android.

How does this model work?

While a machine can have several OpenCL platforms, each with its own driver and OpenCL version, there is always just one ICD Loader. The ICD Loader acts as a supervisor for all installed OpenCL platforms and provides a unique entry point for all OpenCL calls. Based on the platform id, it dispatches the OpenCL host calls to the right driver.

This way you can compile against the ICD (opencl.dll on Windows or libOpenCL.so on Linux), not directly to all the possible drivers. At run-time, an OpenCL application will search for the ICD and load it. The ICD in turn looks in the registry (Windows) or a special directory (Linux) to find the registered OpenCL drivers. Each OpenCL call from your software will be resolved by the ICD, which will further dispatch requests to the selected OpenCL platform.

A few things to keep in mind

The ICD gets installed on your system together with the drivers of the OpenCL devices. Hence, a driver update can also result in an update of the ICD itself. To avoid problems, an OS can decide to handle the OpenCL itself.

Please note that the ICD, the platform and the OpenCL library linked against the application may not necessarily correspond to the same OpenCL version.

I hope this explains how the ICD works. If you have any question or suggestion, just leave a comment. Also check out the Khronos page for the ICD extension. And if you need the sources to build your own ICD (with license that allows you to distribute it with your software), check the OpenCL registry on Khronos.

“Soon we will use only one thousandth of available computer capacity”

Posted by Vincent Hindriksen on 11 August 2015

Professor Henri Bal, who tries to wake up the Netherlands to start going big on parallel programming

At StreamHPC we mostly work for companies in the bigger countries of Europe and North America. We hardly work for companies in the Netherlands. But it seems that after 5 years of sleeping, there is some shaking. Below is a (translated) article with the above quote by Prof. Dr. Ir. Henri Bal, professor at the Computer section at the Vrije University of Amsterdam.

Lack of knowledge of parallel programming will cause a situation where only one thousandth of the capacity of computers will be used. This makes computations unnecessarily slow and inaccurate. That in turn will slow down the development of the Dutch knowledge economy.

Sequential programming, instructing computers to perform calculations in a queue, is now the standard. Computers processors, however, are much more sophisticated and able to perform thousands or even millions of computations simultaneously. But the programming of such many-cores “is still in its infancy, industries that rely heavily on data, can not perform optimally”, claims Ball.

The value of parallel programming, according to Ball, is of enormous importance, for example, meteorology and forensics. “For weather forecasting data from the dense network of computers need to be quickly and accurately processed to have a weather forecast for tomorrow, not after 48 hours,” he says. “In forensics all data should be explored in the first 24 hours after a crime as soon as possible and through pattern recognition all data, for no trace to be lost. The video material of 80,000 security cameras which was manually searched through after the attack on the London Underground in 2005 – with parallel computing methods this can now rapidly be executed by the computer.”

If the Netherlands wants to widen the gap investments are necessary, says Bal. The focus should be on research and teaching. “Investments in research on programming new massively-parallel machines are required to gain knowledge. Thus it must be examined how programs should be written for parallel computing methods and what extent of parallel calculations can be performed automatically. In teaching our future programmers need also to be prepared for the new standards of parallel programming. Only then the Netherlands can make optimal use of the available computer capacity. “

I think my fellow countrymen will be surprised they can find help just around the corner. And if they wait two more years, then 1000x speed-up from sequential programs are indeed becoming possible.

Have you seen similar articles that sequential programming is slowing the knowledge economy?

MediaTek’s partners deliver OpenCL on their phones

Posted by Vincent Hindriksen on 6 August 2015

Several Chinese phones bring OpenCL to millions of users, as MediaTek offers their drivers to all phone vendors who use their (recent) chipsets.

Mediatek said that you just need a phone with one of the below chipsets and you can run your OpenCL-app, as they provide the driver-stack with the hardware to their customers. I’ve added a few phone names, but there is no guarantee OpenCL drivers are actually there. So be on the safe side and don’t buy the cheapest phone, but a more respected China-brand. Contact us if you got a phone with the chipset that doesn’t work – then I’ll contact Mediatek. Share you experience with the chipset in the comments.

In case you want to use the phone for actual use, be sure it supports your 4G frequencies. Also check this Gizchina article on the below chipsets. There are more MediaTek-chipsets that support OpenCL, but not openly – they prefer to focus on their latest 64-bit series.

Important note on conformance: Mediatek is an adopter and does conform for a few processors. Of the ones listed below, only MT6795 is certain to have official support. Continue reading “MediaTek’s partners deliver OpenCL on their phones” →

The knowns and unknowns of the PEZY-SC accelerator at RIKEN

Posted by Vincent Hindriksen on 2 August 2015

The green500 is out and one unknown processor takes the number one position with a huge improvement over last year. It is a new super-computer installed at RIKEN with an incredible 7 GFLOPS/Watt. It is powered by the processor-boards at the right: two Xeons, 4 PEZY-SC 1.4 accelerators and 128GB DRAM, which have a combined performance of about 6.2 TFLOPS. It has been designed for immersive cooling.

The second and third positions are also powered by the PEZY-SC, before we find the winner of last year: the AMD FirePro S9150 and a bit after that the rest (mostly NVidia Tesla). One constant is the CPUs used: Intel XEON is taking most. To my big surprise no ARM64.

From the third to the first PEZY-SC installation there is an improvement of 13%. It seems the first two are the new type, called “bricks”, while the third is the same as last year. Comparing with that super from last year (4.4945 GFLOPS/W) there is an improvement of 42% and 25%. The 13% improvement from the previous version is interesting enough, but the 25% improvement on exactly the same system raised questions. Probably it is due to compiler-optimisations. As the November-version of the Green500 is much more strict, it will be clear if the rules were bent – let’s hope it’s for real!

It supports OpenCL!

When new accelerators support OpenCL, it gets accepted more easily. So it is very interesting the PEZY-SC runs on OpenCL. I asked at ISC and got explained it was a subset of OpenCL, but could not get the finger on which subset, nor could I get access to test it. It does mean that code that would run well on this machine is easy to port. And then I mean the same “easy” Intel uses for explaining the easyness of porting OpenMP software to XeonPhi: PEZI-specific optimisations and writing around the missing functionality would still take effort – the typical stuff we do at StreamHPC.

RIKEN Shoubu

Some information on “Shoubu” (“Iris” in Japanese), the top 1 on the Green 500. According to the Green500 it is 353.8 TFLOPS (based on 50kW, using an actual benchmark). On 25 June RIKEN announced the Shoubu is 2 PFLOPS (theoretical). If the full machine is used for the Green500, then the efficiency was only 18%!

Below are some images of the installation.

Source: http://www.exascaler.co.jp/wp-content/uploads/2015/06/20150625.pdf

An important part is Exascaler’s immersion technology, what I understood is a spin-off of PEZY. I’m very curious what the AMD FirePro S9150 does when it uses immersion-cooling – I think we have to do some frying at the office to find out.

PEZY-SC1.4 and PEZY-SC2

PEZY started with a multi-core processor of 512 cores, the PEZY-1. The PEZY-SC has 1024 cores and has had a few gradual upgrades – currently PEZY-SC 1.4 (“the brick”) is installed.

PEZY-SC Specification:

Logic Cores(PE)	1,024
Core Frequency	733MHz
Peak Performance	Floating Point. Single 3.0TFlops / Double 1.5TFlops
Host Interface	PCI Express GEN3.0 x8Lane x 4Port (x16 bifurcation available) JESD204B Protocol support
DRAM Interface	DDR4, DDR3 combo 64bit x 8Port Max B/W 1533.6GB/s +Ultra WIDE IO SDRAM (2,048bit) x 2Port Max B/W 102.4GB/s
Control CPU	ARM926 dual core
Process Node	28nm
Package	FCBGA 47.5mm x 47.5mm, Ball Pitch 1mm, 2,112pin

Source: http://pezy.co.jp/en/products/pezy-sc.html

Development on PEZY-SC2 is ongoing, which will have a staggering 4096 cores. Ofcourse efficiency has to go up (if the 18% is correct), to make this a good upgrade.

There is no promise on when the PEZY-SC2 will be announced, but it will certainly surprise us again hen it arrives.

Xeon Phi Knights Corner compatible workstation motherboards

Posted by Vincent Hindriksen on 1 August 2015 with 6 Comments

Intel has assumed a lot if it comes to XeonPhi’s. One was that you will use it on dual-Xeon servers or workstations and that you already have a professional supplier of motherboards and other computer-parts. We can only guess why they’re not supporting non-professional enthusiasts who got the cheap XeonPhi.

After browsing half the internet to find an overview of motherboards, I eventually emailed Gigabyte, Asus and ASrock for more information for a desktop-motherboard that supports the blue thing. With the information I got, I could populate the below list. Like usual we share our findings with you.

Quote that applies here: “The main reason business grade computer supplies can be sold at a higher price is that the customers don’t know what they’re buying“. When I heard, I did not know why the customer is not well-informed – now I do. Continue reading “Xeon Phi Knights Corner compatible workstation motherboards” →

Event: Embedded boards comparison

Posted by Vincent Hindriksen on 11 June 2015

A 2012 board — One of the first OpenCL enabled boards from 2012.

Date: 17 September 2015, 17:00
Location: Naritaweg 12B, Amsterdam
Costs: free

Selecting the right hardware for your OpenCL-powerd product is very important. We therefore organise a three hour open house where we you can test, benchmark and discuss many available chipsets that support OpenCL. For each showcased board you can read and hear about the advantages, disadvantages and preferred types of algorithms.

Board with the following chipsets will be showcased:

ARM Mali
Imagination PowerVR
Qualcomm Snapdragon
NVidia Tegra
Freescale i.MX6 / Vivante
Adapteva

Several demo’s and benchmarks are prepared that will continuously run on each board. We will walk around to answer your questions.

During the evening drinks and snacks are available.

Test your own code

There is time to test OpenCL code for free in our labs. Please get in contact, as time that evening is limited.

Registration

Register by filling in the below form. Mention with how many people you will come, if you come by car and if you want to run your own code.

[contact_form]

Nordic/Scandinavian GPGPU-day on 29 September 2015

Posted by Vincent Hindriksen on 11 June 2015

After two events in the Netherlands, the third GPGPU-day will be held in Copenhagen!

Speakers are researchers and companies from the Nordic and Scandinavian countries. If you’re interested in speaking at the conference, get in touch.

Date: 29 September 2015
Location: Copenhagen, DK
Venue: Danish Architecture Centre (route)
Price, early bird: €225
Price, student: €100
Price, normal: €275

The main goal is to connect GPGPU specialist and researchers from the Nordic countries. In the Netherlands, a small country of 17 million, this has helped a lot to interlink 13 research groups. The Scandinavian/Nordic countries have a combined population of 26 million, potentially having around 20 GPU-related research groups. We hope this event will help in starting several collaborations and with that more activity around GPGPU in the north of Europe.

Sponsor packages start at €500 including one ticket.

For an impression, see the below video.

Call for speakers

Are you are doing research using OpenCL or CUDA, and are situated in Denmark, Iceland, Norway, Sweden or Finland, you are invited to speak about your work in front of 60 to 70 people. We expect visitors from Germany and several other European countries too.

Time-slots are 25 minutes, with a maximum of 8 slots available.

Why should you attend?

Meet others in the parallel programming industry and research.
Learn about research that is done with GPUs today.
Find out if you can use GPUs for your own challenges.

See you in Copenhagen!

For more information, see the official page. For tickets, see below.

Apple Metal versus Vulkan + OpenCL 2.2

Posted by Vincent Hindriksen on 8 June 2015 with 21 Comments

Update: C++ has been moved from OpenCL 2.1 to 2.2, being released in 2017. Title and text have been updated to reflect this. A reason why Apple released Metal might be the reason that Khronos was too slow in releasing C++ kernels into OpenCL, given the delays.

Apple Metal in one sentence: one queue for both OpenCL and OpenGL, using C++11. They now brought it to OSX. The detail they don’t tell: that’s exactly what the combination of Vulkan + OpenCL ~~2.1~~ 2.2 does. Instead it is compared with OpenCL 1.x + OpenGL 4.x, which it certainly can compete with, as that combination doesn’t have C++11 kernels nor a single queue.

Apple Metal on OSX – a little too late, bringing nothing new to the stage, compared to SPIR and OpenCL ~~2.1~~ 2.2.

The main reason why they can’t compete with the standards, is that there is an urge to create high-level languages and DSLs on top of lower-level languages. What Apple did, was to create just one and leaving out the rest. This means that languages like SYCL and C++AMP (implemented on top of SPIR-V) can’t simply run on OSX, and thus blocking new innovations. To understand why SPIR-V is so important and Apple should adopt for that road, read this article on SPIR-V.

khronos-SPIR-V-flowchart — Metal could compile to SPIR-V, just like OpenCL-C++ does. The rest of the Metal API is just like Vulkan.

Yet another vendor lock-in?

Now Khronos is switching its two most important APIs to the next level, there is a short-term void. This is clearly the right moment for Apple to take the risk and trying to get developers interested in their new language. If they succeed, then we get the well-known “pffff, I have no time to port it to other platforms” and there is a win for Apple’s platforms (they hope).

Apple has always wanted to have a different way of interacting with OpenCL-kernels using Grand Central Dispatch. Porting OpenCL between Linux and Windows is a breeze, but from and to OSX is not. Discussions over the past years with many people from the industry thought me one thing: Apple is like Google, Microsoft and NVidia – they don’t really want standards, but want 100% dedicated developers for their languages.

Yes, now also Apple is on the list of Me-too™ languages for OpenCL. We at StreamHPC can easily translate your code from and too Metal, but we would like it that you can put your investments in more important matters like improving the algorithms and performance.

Still OpenCL support on OSX?

Yes, but only OpenCL 1.2. A way to work around is to use SPIR-to-Metal translators and a wrapper from Vulkan to Metal – this will not make it very convenient though. The way to go, is that everybody starts asking for OpenCL 2.0 support on OSX forums. Metal is a great API, but that doesn’t change the fact it’s obstructing standardisation of likewise great, open standards. If they provide both Metal and Vulkan+OpenCL ~~2.1~~ 2.2 then I am happy – then the developers have the choice.

Metal debuts in “OSX El Capitan”, which is available per today to developers, and this fall to the general public.

StreamComputing exists 5 years!

Posted by Vincent Hindriksen on 25 May 2015 with 3 Comments

In January 2010 I created the first steps of StreamComputing (redacted: rebranded to StreamHPC in 2017), by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later the freelance-company was registered. Then 5 years later it got turned into a small company with still the strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
My friends, for understanding I needed to work in the weekends and evenings.
My good friend Laura for supporting me during the hard times of 2011 and 2012.
My girlfriend Elena for always being there for me.
My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never even existed. Thank you! Continue reading “StreamComputing exists 5 years!” →

8 reasons why SPIR-V makes a big difference

Posted by Vincent Hindriksen on 21 May 2015 with 3 Comments

From all the news that came out of GDC, I’m most eager to talk about SPIR-V. This intermediate language will make a big difference for the compute-industry. In this article I’d like to explain why. If you need a technical explanation of what SPIR-V is, I suggest you first read gtruc’s article on SPIR-V and then return here to get an overview of the advantages.

Currently there are several shader and c ompute languages, which SPIR-V tries to replace/support. We have GLSL, HLSL for graphics shaders, SPIR (without the V), OpenCL, CUDA and many others for compute shaders.

If you have questions after reading this article, feel free to ask them in a comment or to us directly. Continue reading “8 reasons why SPIR-V makes a big difference” →

Apple’s dragging OpenCL compiler problem

Posted by Vincent Hindriksen on 9 May 2015 with 1 Comment

Remember the times that the OpenCL compilers where not that good as they’re now? Correct source-code being rejected, typos being accepted, long compile times, crashes during compiling and other irritating bugs. These made the work of an OpenCL developer in “the old days” quite tiresome – you needed a lot of persistence and report bugs. Lucky on desktops the drivers have improved a lot.

Apple’s buggy OpenCL compiler

Now to Apple. There have always been complaints about the irritating bugs that were in Apple’s compiler. Recently the Luxrender community started to make more complaints, as the guy responsible for the OSX port decided to quit. This was due to utter frustration: code that worked on every other OS, simply did not work on OSX. Luxrender’s Paolo Ciccone stood up and made this extremely public, by writing an open letter to Apple’s CEO Tim Cook (posted below).

The letter is not specific about the kind of bugs and and therefore asked him via Twitter which were the bugs he was talking about. He explained me that it’s very simple:

@StreamHPC It's not a few speciific bugs. The drivers simply don't work. Whenever we render anything the OCL compiler aborts.

— Paolo Ciccone (@RealityPaolo) May 6, 2015

Here at StreamHPC we could write around those bugs in most cases, but Luxrender has bigger and more complex kernels than we used in our projects – then it’s simply impossible to write around, as the compiler simply crashes. It seems that OSX still has those old compilers, Linux and Windows used to have years ago.

Metal

Metal is the OpenCL-alternative on iOS 8 and up.

If you’re thinking that Metal could be a reason – that language looks very much like OpenCL, as it’s simply OpenCL as Apple would like it to be. Porting between the two languages is therefore quite simple. This also means that with some small fixes a Metel-kernel could be compiled by existing OpenCL-compiler. Ok, there is much more than the compute part, but the message is that more complex Metal wouldn’t be possible using this driver-stack.

If we end up in a situation that Metal comes to OSX and is more stable than OpenCL, only then we can say that Apple tries to block OpenCL in favour of their own APIs.

The letter

I’m really happy that Paolo Ciccone had the guts to publicly complain. This is the letter he wrote:

Dear Mr. Cook.

I’m sorry to bother you but we have tried all other channels and nothing worked.

I’m part of a group of developers of a physically-based renderer called LuxRender. LuxRender has been written to use OpenCL to accelerate its enormous amount of computation necessary to generate photo-realistic scenes. You can see some of the images generated by Lux at http://luxrender.net. Lux is an Open Source program.

Apple has defined OpenCL and we have adopted this API instead of the proprietary CUDA in order to be able to work with all kind of hardware on all major platforms. It made sense for an OSS to use an open standard.

The reason why I’m writing to you is that, after waiting for years, we still have broken GPU drivers on OS X. Scenes that render perfectly well on Windows and even on Linux simply abort on OS X. This is happening with both AMD and nVidia GPUs.

The problem is unsolvable from our side. We need updated, fixed drivers for OS X. The problem is so bad hat our main OS X developer has announced, today, that he is giving up OS X. He simply can’t do his job.

I kindly request that you look into this and give us working AMD and nVidia drivers in an upcoming, possibly soon, update of OS X. We are more than willing to work with your engineers, if you need any kind of specific help in identifying the problem.

Thank you for your attention.

Paolo Ciccone

If you want to help, also post this letter on your blog or in a forum. The more this is shared, the better. Especially Apple’s forum, asking for the official statement.

OpenCL in simple words

Posted by Vincent Hindriksen on 21 April 2015 with 6 Comments

Our business is largely around making software faster. For that we use OpenCL, but do you know what this programming language is? Why can’t this speeding-up be done using other languages like Java, C#, C++ or Python?

OpenCL the answer to high-level languages, where we were promised superfast software that was very quick to write. After 20 years this was still a promise, as compilers had to guess too much what was intended. OpenCL gives the programmer more control in the places where more control is needed to get high-performing code and leave less guesses for the compiler.

It’s C with some extra power

It’s like normal C with three extra concepts, all with the aim to make the software run faster.

Explicit Data Transfer

In other introductions to OpenCL the data-transfers are mentioned as one of the last parts, but I find this the most important one. Reason: in most cases this is the main bottleneck in performance-targeted code.

When moving your stuff to another house, you pack all in boxes first before loading the truck. Or would you load each item into the truck one-by-one? Transport-costs would be much higher that way.

While it would be great that the fastest data-transfers should be done automatically, it simply doesn’t work like that. This means that designing the data-transfers is an important task when making fast software. OpenCL lets you do this.

Multiple cores

Most people have heard of “cores”, as made famous by Intel. Each core can do a part of a computation and effectively reduce runtime. OpenCL implements this by isolating the code that runs on each core – what goes in and out the protected code is done explicitly. This way the code is really easy to scale up to thousands of cores.

Would you choose the best-in-class to write the multiplication tables from 1 to 20, or have each student write one of them? Even though the slowest student will limit the rest, the total time is still lower.

Where a normal processor has 1, 2, 4 or 8 cores, a graphics processor has hundreds or even thousands of cores. OpenCL-software works on both.

Vectors

Modern processors can do computations on more than one data-item at the same time. They can be described as sub-cores. This means that each core has parallelism on its own.

When reading, do you read one word at once or character by character? Your brains can parse multiple characters at the same time.

OpenCL has support for “vectors” ( bundles of alike data) to be able to program these sub-cores.

It runs on many types of devices

OpenCL is famous for being the standard programming model for a lot of modern processors. There is no other programming language that can do the same. Support is available on:

CPUs; standard processors by Intel, AMD and ARM
GPUs; graphics cards by Intel, AMD and NVIDIA
FPGAs; processors that are programmed on the hardware-level, by Altera and Xilinx.
DSPs; digital signal processors by TI
Mobile graphics processors by ARM, Imagination, Qualcomm, etc.
See the rest of the list here.

This means that code can be ported to new devices in days or weeks instead of having to rewrite everything from scratch.

How does translating to OpenCL work?

When software needs to be faster, the first step is to find out its bottlenecks – these “hot spots” will be ported to OpenCL, while the rest remains the same. Then comes the hardest part: changing the algorithms such that data-transfers are more efficient and all cores are used. The last step is to look into low-level optimisations like the vectors.

Above is a very simplified representation of OpenCL. Still you’ve seen that the language is very unique and powerful. That will change, as its concepts are slowly getting embedded into existing languages – till then OpenCL is the only standard which fully enables all hardware features.

GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA

Posted by Vincent Hindriksen on 18 April 2015 with 2 Comments

Wrong! — In contrary to what you see around (on slides like these), AMD and Intel also have support for RDMA.

A while ago I found the slide at the right, claiming that AMD did not have any direct GPU-GPU communication. I found at several sources there was, but it seems not to be a well-known feature. The feature is known as SDI (mostly on network-cards, SSDs and FPGAs), but not much information is found on PCI+SDI. More often RDMA is used: Remote Direct Memory Access (wikipedia).

Questions I try to answer:

Which server-grade GPUs support direct GPU-GPU communication when using OpenCL?
What are other characteristics interesting for OpenCL-devs besides direct communication GPU-GPU, GPU-FPGA, GPU-NIC?
How do you code such fast communication?

Enjoy reading! Continue reading “GPUDirect and DirectGMA – direct GPU-GPU communication via RDMA” →

Category: Technical