Sheets GPGPU-day 2012 online

Photos by Cyrille Favreau

Better late than never. It has been almost a year, but finally they’re online: the sheets of the GPGPU-day Amsterdam 2012.

You can find the sheets at http://www.platformparallel.com/nl/gpgpu-day-2012/abstracts/ – don’t hotlink the files, but link to this page. The abstracts should introduce the sheets, but if you need more info, just ask in the comments here.

PDFs from two unmentioned talks:

I hope you enjoy the sheets. On 20 June the second edition will take place – see you there!


“That is not what programmers want”

“I think you should be more explicit here in step two” (original print)

This post is part of the series Programming Theories, in which we discuss new and old ways of programming.

When discussing the design of programming languages, or the extension of existing ones, the question What concepts can simplify the tasks of the programmer? always triggers lots of interesting debates. After that, when an effective solution is found, the inventors are cheered and a new language is born. Up till this point all seems fine, but then comes the intervention of the status quo: C, C++, Java, C#, PHP, Visual Basic. Those languages want the new feature implemented in the way their programmers expect it. But this would be like trying to implement the advantages of a motorcycle in a car without paying attention to the adjustments the car’s design needs.

I’m in favor of learning new concepts instead of doing new things the old way… but only when the former has proven to be better than the latter. The lukewarm acceptance of e.g. functional languages says a lot about how it goes in reality (with great exceptions like LINQ). That brings a lot of trouble when moving to multi-core. So, how do we get existing languages to change instead of just evolve?

High Level Languages for Multi-Core

Let’s start with a quote from Edsger Dijkstra:

Projects promoting programming in “natural language” are intrinsically doomed to fail.

In other words: a language can be too high level. A programmer needs the language to be able to effectively micro-manage what is being done. We speak of concerns for a reason. Still, the urge to create the highest programming language is strong.

Don’t get me wrong. A high-level language can be very powerful once its concepts are defined from both sides. One side concerns the developer: does the programmer understand the concept and the contract of the command or programming style being offered? The other concerns the machine: can it be effectively programmed to run the command, or could a new machine be made to do just that? This two-sided contract is one of the reasons why natural languages are not fit for programming.

And we have also found out that binary programming is not fit for humans.

The cartoon refers to this gap between what programmers want and what computers want.

Continue reading ““That is not what programmers want””

ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15

According to the latest newsletter of the Mont-Blanc Project, the GPU on a Samsung Exynos 5 is much faster and greener than its CPU: 3.5 times faster at half the energy. They built a supercomputer using 810 Exynos SoCs that can deliver 26 TFLOPS of peak performance. With upcoming mobile GPUs becoming much faster every generation, they have all the expertise to build an even faster and greener ARM supercomputer after this.

The Mont-Blanc compute cards deliver considerably higher performance at 50% lower energy consumption, compared with previous ARM-based developer platforms.

The Mont-Blanc prototype is based on the Samsung Exynos 5 Dual SoC, which integrates a dual-core ARM Cortex-A15 and an on-chip ARM Mali-T604 GPU, and has been featured and market-proven in advanced mobile devices. The dual-core ARM Cortex-A15 delivers twice the performance of the quad-core ARM Cortex-A9 used in the previous generation of ARM-based prototypes, whilst consuming 20% less energy for the same workload. Furthermore, the on-chip ARM Mali-T604 GPU provides 3.5 times higher performance than the dual-core Cortex-A15, whilst consuming half the energy for the same workload.

Each Mont-Blanc compute card integrates one Samsung Exynos 5 Dual SoC, 4 GB of DDR3-1600 DRAM, a microSD slot for local storage and a 1 GbE NIC, all in an 85x56mm card (3.3×2.2 inches). A single Mont-Blanc blade integrates fifteen Mont-Blanc compute cards and a 1 GbE crossbar switch, which is connected to the rest of the system via two 10 GbE links. Nine Mont-Blanc blades fit into a standard BullX 9-blade INCA chassis. A complete Mont-Blanc rack hosts up to six such chassis, providing a total of 1620 ARM Cortex-A15 cores and 810 on-chip ARM Mali-T604 GPU accelerators, delivering 26 TFLOPS of peak performance.

“We are only scratching the surface of the Mont-Blanc potential”, says Alex Ramirez, coordinator of the Mont-Blanc project. “There is still room for improvement in our OpenCL algorithms, and for optimizations, such as executing on both the CPU and GPU simultaneously, or overlapping MPI communication with computation.”

Continue reading “ARM Mali-T604 GPU has 3.5x more performance than dual core Cortex-A15”

Texas Instruments DSP

TI has a fully conformant OpenCL 1.1 implementation.

The table below is taken from http://downloads.ti.com/mctools/esd/docs/opencl/intro.html and shows which DSPs have OpenCL support.

SoC    | System     | Khronos Conformance           | Installation Instructions
AM572  | AM572 EVM  | OpenCL v1.1 Conformant        | Processor SDK for AM57x
DRA75x | DRA75x EVM | OpenCL v1.1 Conformant        | Processor SDK for DRA7x (Enabling OpenCL on DRA75x)
AM571  | AM572 EVM  | OpenCL v1.1 Conformant        | Processor SDK for AM57x
66AK2H | 66AK2H EVM | OpenCL v1.1 Conformant        | Processor SDK for K2H
66AK2L | 66AK2L EVM | Not submitted for conformance | Processor SDK for K2L
66AK2E | 66AK2E EVM | Not submitted for conformance | Processor SDK for K2E
66AK2G | 66AK2G EVM | Not submitted for conformance | Processor SDK for K2G

Theoretical Performance of the C66x

  • Fixed point 16×16 MACs per cycle: 32
  • Fixed point 32×32 MACs per cycle: 8
  • Floating point single precision MACs per cycle: 8
  • Arithmetic floating point operations per cycle: 16 – 2-way SIMD on the .L and .S units plus 4-way SP multiply on the .M unit (e.g. 8 SP operations each for the A and B sides)
  • Load/store width: 2 x 64-bit
  • Vector size (SIMD capability): 128-bit (4 x 32-bit, 8 x 16-bit, or 16 x 8-bit)

GFLOPs

2 FLOPs – 2-way SIMD on .L1 (A side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .L2 (B side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .S1 (A side) such as DADDSP or DSUBSP
2 FLOPs – 2-way SIMD on .S2 (B side) such as DADDSP or DSUBSP
4 FLOPs – 4-way SIMD on .M1 (A side) such as QMPYSP (or CMPYSP, maybe not 4-way SIMD)
4 FLOPs – 4-way SIMD on .M2 (B side) such as QMPYSP (or CMPYSP, maybe not 4-way SIMD)
========================
16 FLOPs total per cycle per C66x CorePac (source)

Boards

A good starter board with OpenCL drivers is the BeagleBoard-X15. It has 2x C66x DSPs and 2x 1.5 GHz ARM Cortex-A15 cores.


NVIDIA enables OpenCL 2.0 beta-support

In the release notes for the NVIDIA 378.66 graphics drivers for Windows, NVIDIA mentions support for OpenCL 2.0. This is the first time in the three years since OpenCL 2.0 was launched that they have publicly spoken about supporting it. Several 2.0 functions had silently been added to the driver on customer request, but these additions never got any mention in release notes and were therefore officially unofficial.

You should know that NVIDIA only started supporting OpenCL 1.2 on their GPUs based on Kepler and newer architectures on 3 April 2015. By then, OpenCL 2.0 had already been out for one and a half years (since November 2013) – now more than three years ago.

Does it mean that you will be soon able to run OpenCL 2.0 kernels on your newly bought Titan X? Yes and no. Read on to find out about the new advantages and the limitations of the beta-support.

Update: We tested NVIDIA drivers on Linux too. Read it here.

Continue reading “NVIDIA enables OpenCL 2.0 beta-support”

Should SPIR-V be supported in CUDA?

Would you like to run CUDA kernels on the OpenCL framework? Or Python or Rust? SPIR-V is the answer! Where source-to-source translations had several limitations, SPIR-V 1.1 even supports higher-level languages like C++.

SPIR-V is a strength of OpenCL, and it will only get bigger.

Currently Intel’s drivers support SPIR-V best, making them the first target for the new SPIR-V frontends. It is unknown which vendor will be next – probably one which (almost) has OpenCL 2.0 drivers already, such as AMD, ARM, Qualcomm or even NVidia.

How interesting is SPIR-V really?

So if SPIR-V really is that important a reason to choose OpenCL, I thought of framing it towards CUDA devs on Twitter in a special way:

Should CUDA 9 support SPIRV?

2 people voted no, 29 people voted yes.

So that’s a big yes for SPIR-V support in CUDA. And why not? Don’t we want to program in our own language and not be forced to use C or C++? SPIR-V makes it possible to quickly add support for any language out there without official support from the vendor. Let wrappers handle the differences per vendor and let SPIR-V be the shared language for GPU kernels. What do we need OpenCL or CUDA for, if the real work is defined in SPIR-V?

What do you think? Leave a comment below on how you see the future of GPGPU with SPIR-V in town. Is 2017 the year of SPIR-V?

And if you have worked on a SPIR-V frontend, get in touch to continue your project on https://github.com/spirv. Yes, it’s empty right now, but you don’t know what’s hidden.

Win an OpenCL mug!

The first batch is in and you can win one from the second batch!

We’re sending a mug to a random person who subscribes to our newsletter before the end of 17 April 2017 (Central European Time). Yes, that’s a Monday.

Two winners

We’ll pick two winners: one from academia and one from industry. If you select “other” as your background, then share which category you fall into in the last field.

Did you already subscribe and also want to win? I am not forgetting you – more details are in a newsletter next quarter.

More winners, by referring to a friend

If you refer a colleague, a friend or even a stranger to subscribe, you can both win a mug. Just be sure they remember to mention you when I ask. Before you ask: the maximum referral chain length is 5 (so referral of referral of referral of referral, etc.) plus the one who started it.

UPDATE: If you win a mug and were not referred by somebody, you can pick a co-winner yourself. Joy should be shared.

You can also use this link http://eepurl.com/bgmWaP.

Why did AMD open source ROCm’s OpenCL driver-stack?

AMD open sourced the OpenCL driver stack for ROCm at the beginning of May. With this they kept their promise to open source (almost) everything. The hcc compiler was open sourced earlier, just like the kernel driver and several other parts.

Why this is a big thing?
There are indeed several open source OpenCL implementations, but with one big difference: they’re secondary to an official compiler/driver. So implementations like pocl and Intel’s Beignet play catch-up. AMD’s open source implementation is primary.

It contains:

  • OpenCL 1.2 compatible language runtime and compiler
  • OpenCL 2.0 compatible kernel language support with OpenCL 1.2 compatible runtime
  • Support for offline compilation right now – in-process/in-memory JIT compilation is to be added.

For testing the implementation, see Khronos OpenCL CTS framework or Phoronix openbenchmarking.org.

Why is it open sourced?

There are several reasons. AMD wants to stand out in HPC and therefore listened carefully to their customers, while taking good note of where HPC was going. Where open source used to be seen as something not for businesses, it is now simply required to be commercially successful. Below are the most important answers to this question.

Give deeper understanding of how functions are implemented

It is very useful to understand how functions are implemented. For instance, the difference between sin() and native_sin() can tell you a lot more about what’s best to use. The source does not show how the functions are implemented in the GPU hardware, but it does tell which GPU functions are called.
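As an illustration, a minimal OpenCL C kernel sketch contrasting the two calls (device code only – it needs a host program to run; the kernel name and buffer layout are made up here):

```c
// OpenCL C device code. sin() must meet the precision the OpenCL spec
// requires; native_sin() may map to a fast hardware instruction whose
// accuracy is implementation-defined.
__kernel void sin_compare(__global const float *in,
                          __global float *precise,
                          __global float *fast)
{
    size_t i = get_global_id(0);
    precise[i] = sin(in[i]);     // precise, possibly a longer call sequence
    fast[i]    = native_sin(in[i]); // fast, implementation-defined accuracy
}
```

With an open source driver stack you can look up exactly which instructions each of these calls lowers to, instead of guessing from benchmarks.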

Learning a new platform has never been so easy. Deep understanding is needed if you want to go beyond “it works”.

Debug software

When you are working on a large project and have to work with proprietary libraries, this is a typical delay factor. Every software engineer has had the experience that a library does not perform as documented and work-arounds had to be created. Depending on the project and the library, it could mean weeks of delay – only sarcasm can describe these situations, as the legal documents were often a lot better than the software documentation. When the library was open source, the debugger could step in and give the “aha” that was needed to progress.

When working with drivers it’s about the same. GPU drivers and compilers are extremely complex, and of course your project hits that bug which nobody encountered before. Now that all is open source, you can step into the driver with the debugger. Moreover, the driver can be compiled with a fix instead of a work-around.

Get bugs solved quicker

A trace can now include the driver stack and the line numbers. Even a suggestion for a fix can be given. This not only improves reproducibility, but reduces the time to get the fix in all steps. When a fix is suggested, AMD only needs to test for regressions to accept it. This makes the work for tools like CLsmith a lot easier.

Have “unimportant” specific improvements done

Say your software is important and in the spotlight, like Blender or the LuxMark benchmark; then you can expect your software to get attention in optimisations. The rest of us have to hope our special code constructions resemble ones that are targeted. This results in many forum comments and bug reports being written, for which the compiler team does not have enough time. This is frustrating for both sides.

Now everybody can have their improvements submitted, provided they do not slow down the focus software, of course.

Get the feature set extended

Adding SPIR-V support is easy now. A SPIR-V frontend needs to be added to ROCm and the right functions need to be added to the OpenCL driver. Unfortunately there is no support for OpenCL 2.x host-code yet – due to a lack of demand, as I understood.

For such extensions the AMD team needs to be consulted first, because they have implications for the test-suite.

Get support for complete new things

It takes a single person to make something completely new – and this becomes a whole lot easier now.

More often there is opportunity in what is not there yet, and research needs to be done to break the chicken-and-egg problem. Optimised 128-bit computing? Easy complex numbers in OpenCL? Native support for Halide as an alternative to OpenCL? All the high performance code is there for you.

Initiate alternative implementations (?)

Not a goal, but forks are coming for sure. For most forks the goals would be like the ones above, to be merged later with the master branch. A few forks will go their own direction – for now it is hard to predict where those will go.

Improve and increase university collaborations

When the software was protected, working on AMD’s compiler infrastructure was only possible under strict contracts. In the end it was easier to focus on the open source back-ends of LLVM than to go through the legal path.

Universities are very important for finding unexpected opportunities, integrating the latest research, bringing in potential new employees and doing research collaborations. An added bonus for the students is that the GPUs might be allowed to be used for games too.

Timour Paltashev (Senior manager, Radeon Technology Group, GPU architecture and global academic connections) can be reached via timour dot paltashev at amd dot com for more info.

Get better support in more Linux distributions

It’s easier to include open source drivers in Linux distributions. These OpenCL drivers do need a binary firmware (which has been disassembled and seems to do as advertised), but the discussion is whether this counts as part of the hardware or the software when marking the stack as “libre”.

There are many obstacles to getting the complete ROCm stack included as the default, but in its current state it stands a much better chance.

Performance

Phoronix benchmarked ROCm 1.4 OpenCL in January on several systems, and now ROCm 1.5 OpenCL on a Radeon RX 470. Though the 1.5 benchmarks were more limited, the important conclusion is that the young compiler is now mostly on par with the closed source OpenCL implementation combined with the AMDGPU drivers – only in LuxMark was AMDGPU (much) better. The same holds in comparison with the old proprietary fglrx drivers, which were fully optimised and the first goal to catch up with. There will be another big step forward with ROCm 1.6 OpenCL.

Get started

You can find the build instructions here. Let us know in the comments what you’re going to do with it!

We accelerated the OpenCL backend of pyPaSWAS sequence aligner

Last year we accelerated the OpenCL code in pyPaSWAS, which is open source software for DNA/RNA/protein sequence alignment and trimming. It has users world-wide in universities, research groups and industry.

Below you’ll find the benchmark results of our acceleration work. You can also test it out yourself, as the code is public. In the readme file you can learn more about the idea behind the software. Lots of background information is described in these two papers:

We chose PaSWAS because we really like bio-informatics and computational chemistry – the science is interesting, the problems are complex and the potential GPU-speedup is real. Other examples of such software we worked on are GROMACS and TeraChem.

Continue reading “We accelerated the OpenCL backend of pyPaSWAS sequence aligner”

IWOCL 2019

On Monday May 13, 2019 at 09:30 the latest edition of IWOCL starts, not counting any pre-events that might be spontaneously organised. This is the biggest OpenCL-focused event, discussing everything that would make any GPGPU, DSP or FPGA programmer enthusiastic.

What’s new since last year is that it’s also a more interesting place for CUDA developers who want to learn and discuss new GPU programming techniques. This is because Nvidia’s GTC has moved towards AI, where it used to be mostly GPGPU for years.

Since this is the last week of early-bird pricing, it’s a good time to think about buying your ticket and booking the trip.

Continue reading “IWOCL 2019”