General articles on technical subjects.

Building the HPC ecosphere in Amsterdam

Posted by Vincent Hindriksen on 17 April 2015

Here in Amsterdam a lot is going on around HPC. Including StreamHPC, we have companies like Vancis, Netherlands eScience Centre, and ClusterVision, the research institute for Dutch HPC, Surf SARA, (hosting the Dutch supercomputer) and the very busy Amsterdam IX.

Here in Amsterdam we’re focused on building up more local companies around big compute and big data. I’d like to give two examples. One is Scyfer, an academic startup specialised in deep learning. They’ve developed algorithms to more efficiently train neural networks and help their customers find answers quicker. The second is Euvision Technologies, who developed unique computer vision solutions. Last year it has been sold to Qualcomm, for tens of millions.

We welcome new companies to Amsterdam, to further build up the HPC-ecosphere. If you have a company and are seeking a good location, contact us to talk about HPC in Amsterdam.There are many opportunities to develop in Europe, and we’re open for partnerships in new markets.

If you want to start your own HPC-related startup, Amsterdam thinks of you! There are three steps to do:

Go to the Venture café on 30 April
Apply for the bootcamp
Become our neighbours
Build your own HPC startup

Ping me, if you want advice on which preparations you need to make, before you can make such big decision. I like to have an open discussion, so please use the comment-area below for what you think of HPC in Amsterdam and building companies.

Four conferences that will interest you

Posted by Vincent Hindriksen on 12 April 2015

(if you get to Palo Alto, Manchester, Karlsruhe and Copenhagen)

We’re supporters of open standards and open discussions. When those thow come together, we melt. Therefore I’d like to share four hot conferences with you: IWOCL (Palo Alto, SF, USA), EMiT (Manchester, UK), ParallelCon (Karlsruhe, Germany), GPGPU-day 2015 (Copenhagen, Denmark).

On all these conferences I’ll be there too and are happy to meet you.

This post was shared first via the newsletter. Subscribe here.

Continue reading “Four conferences that will interest you” →

How to install OpenCL on Windows

Posted by Anca Hamuraru on 16 March 2015 with 19 Comments

Getting your Windows machine ready for OpenCL is rather straightforward. In short, you only need the latest drivers for your OpenCL device(s) and you’re ready to go. Of course, you will need to add an OpenCL SDK in case you want to develop OpenCL applications but that’s equally easy.

Before we start, a few notes:

The steps described herein have been tested on Windows 8.1 only, but should also apply for Windows 7 and Windows 8.
We will not discuss how to write an actual OpenCL program or kernel, but focus on how to get everything installed and ready for OpenCL on a Windows machine. This is because writing efficient OpenCL kernels is almost entirely OS independent.

If you want to know more about OpenCL and you are looking for simple examples to get started, check the Tutorials section on this webpage.

Running an OpenCL application

If you only need to run an OpenCL application without getting into development stuff then most probably everything already works.

If OpenCL applications fail to launch, then you need to have a closer look to the drivers and hardware installed on your machine:

Check that you have a device that supports OpenCL. All graphics cards and CPUs from 2011 and later support OpenCL. If your computer is from 2010 or before, check this page. You can also find a list with OpenCL conformant products on Khronos webpage.
Make sure your OpenCL device driver is up to date, especially if you’re not using the latest and greatest hardware. With certain older devices OpenCL support wasn’t initially included in the drivers.

Here is where you can download drivers manually:

Intel has hidden them a bit, but you can find them here with support for OpenCL 2.0.
AMD’s GPU-drivers include the OpenCL-drivers for CPUs, APUs and GPUs, version 2.0.
NVIDIA’s GPU-drivers mention mostly CUDA, but the drivers for OpenCL ~~1.1~~ 1.2 are there too.

In addition, it is always a good idea to check for any other special requirements that the OpenCL application may have. Look for device type and OpenCL version in particular. For example, the application may run only on OpenCL CPUs, or conversely, on OpenCL GPUs. Or it may require a certain OpenCL version that your device does not support.

A great tool that will allow you to retrieve the details for the OpenCL devices in your system is Caps Viewer.

Developing OpenCL applications

Now it’s time to put the pedal to the metal and start developing some proper OpenCL applications.

The basic steps would be the following:

Make sure you have a machine which supports OpenCL, as described above.
Get the OpenCL headers and libraries included in the OpenCL SDK from your favourite vendor.
Start writing OpenCL code. That’s the difficult part.
Tell the compiler where the OpenCL headers are located.
Tell the linker where to find the OpenCL .lib files.
Build the fabulous application.
Run and prepare to be awed in amazement.

Ok, so let’s have a look into each of these.

OpenCL SDKs

For OpenCL headers and libraries the main options you can choose from are:

NVIDIA – CUDA Toolkit. You can grab the OpenCL samples here.
AMD – ~~AMD APP SDK. Also works with Intel’s CPUs.~~
- Headers and OpenCL.lib are here: https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases
- Samples are here: https://github.com/OpenCL/AMD_APP_samples
- Math libraries are here: https://github.com/clMathLibraries
Intel – the previous Intel SDK for OpenCL is now integrated into Intel’s new tools, such as Intel INDE (which has a free starters edition) or Intel Media Server Studio. Grab any of these in order to have everything ready for building OpenCL code.

As long as you pay attention to the OpenCL version and the OpenCL features supported by your device, you can use the OpenCL headers and libraries from any of these three vendors.

OpenCL headers

Let’s assume that we are developing a 64bit C/C++ application using Visual Studio 2013. To begin with, we need to check how many OpenCL platforms are available in the system:

[raw]

#include<stdio.h>
#include<CL/cl.h>

int main(void)
{
    cl_int err;
    cl_uint numPlatforms;

    err = clGetPlatformIDs(0, NULL, &numPlatforms);
    if (CL_SUCCESS == err)
         printf("\nDetected OpenCL platforms: %d", numPlatforms);
    else
         printf("\nError calling clGetPlatformIDs. Error code: %d", err);

    return 0;
}

[/raw]

We need to specify where the OpenCL headers are located by adding the path to the OpenCL “CL” is in the same location as the other CUDA include files, that is, CUDA_INC_PATH. On a x64 Windows 8.1 machine with CUDA 6.5 the environment variable CUDA_INC_PATH is defined as “C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v6.5\include”

If you’re using the AMD SDK, you need to replace “$(CUDA_INC_PATH)” with “$(AMDAPPSDKROOT)/include” or, for Intel SDK, with “$(INTELOCLSDKROOT)/include“.

OpenCL libraries

Similarly, we need to let the linker know about the OpenCL libraries. Firstly, add OpenCL.lib to the list of Additional Dependencies:

Secondly, specify the OpenCL.lib location in Additional Library Directories:

As in the case of the includes, If you’re using the AMD SDK, replace “$(CUDA_LIB_PATH)” with “$(AMDAPPSDKROOT)/lib/x86_64” , or in the case of Intel with “$(INTELOCLSDKROOT)/lib/x64“.

And you’re good to go! The application should now build and run. Now, just how difficult was it? Happy OpenCL-coding on Windows!

If you have any question or suggestion, just leave a comment.

Internships: the self-driving vehicle – updated

Posted by Vincent Hindriksen on 9 March 2015

UPDATE: We now only offer thesis support (“externs”), for students who want to use OpenCL in their research, but don’t have such support at their university. for the rest the below applies.

From July there are several internships available here at StreamHPC, all around self-driving vehicles (or even self-flying drones). This means that with an interest in AI, embedded programming and sensors, you’re all set.

You can work as an intern for a period from 1 to 6 months, and combine it with your thesis. We will assist you with planning, thesis correction and technical support (especially OpenCL). There are also a few other startups in the building, who you’d like to talk with.

Your time will exist of literature studies, programming, testing, OpenCL-optimisations and playing. We’ll work with bikes and toy-cars, so no big cars that are expensive to crash. Study fields are road-location, obstacles, driving-style detection, etc.

If you want to do an internship purely to gain experience, we can offer you a combination of research and working for real customers.

Some targets:

Create a small test-car full with sensors:
- radar for distance
- multi cameras
- laser
- other sensors, like touch
Programming an embedded board with OpenCL-capability.
Programming pointcloud algorithms in OpenCL.
Defining the location on the road, also in OpenCL. (taken)
Detecting pedestrians, signs.
Have fun creating this.

Please contact us and tell your ideas and plan.

Overview of OpenCL 2.0 hardware support, samples, blogs and drivers

Posted by Vincent Hindriksen on 10 February 2015 with 6 Comments

We were too busy lately to tell you about it: OpenCL 2.0 is getting ready for prime time! As it makes use of the more recent hardware features, it’s therefore more powerful than OpenCL 1.x could ever be.

To get you up to speed, see this list of new OpenCL 2.0 features:

Shared Virtual Memory: host and device kernels can directly share complex, pointer-containing data structures such as trees and linked lists, providing significant programming flexibility and eliminating costly data transfers between host and devices.
Dynamic Parallelism: device kernels can enqueue kernels to the same device with no host interaction, enabling flexible work scheduling paradigms and avoiding the need to transfer execution control and data between the device and host, often significantly offloading host processor bottlenecks.
Generic Address Space: functions can be written without specifying a named address space for arguments, especially useful for those arguments that are declared to be a pointer to a type, eliminating the need for multiple functions to be written for each named address space used in an application.
Improved image support: including sRGB images and 3D image writes, the ability for kernels to read from and write to the same image, and the creation of OpenCL images from a mip-mapped or a multi-sampled OpenGL texture for improved OpenGL interop.
C11 Atomics: a subset of C11 atomics and synchronization operations to enable assignments in one work-item to be visible to other work-items in a work-group, across work-groups executing on a device or for sharing data between the OpenCL device and host.
Pipes: memory objects that store data organized as a FIFO and OpenCL 2.0 provides built-in functions for kernels to read from or write to a pipe, providing straightforward programming of pipe data structures that can be highly optimized by OpenCL implementers.
Android Installable Client Driver Extension: Enables OpenCL implementations to be discovered and loaded as a shared object on Android systems.

I could write many articles about the above subjects, but leave that for later. This article won’t get into these technical details, but more into what’s available from the vendors. So let’s see what toys we were given!

A note: don’t start with OpenCL 2.0 directly, if you don’t know the basic concepts of OpenCL. Continue reading “Overview of OpenCL 2.0 hardware support, samples, blogs and drivers” →

Is the CPU slowly turning into a GPU?

Posted by Vincent Hindriksen on 6 February 2015

It's all in the plan — It’s all in the plan?

Years ago I was surprised by the fact that CPUs were also programmable with OpenCL – I solely chose that language for the cool of being able to program GPUs. It was weird at start, but cannot think of a world without OpenCL working on a CPU.

But why is it important? Who cares about the 4 cores of a modern CPU? Let me first go into why CPUs have had mostly 2 cores for so long, about 15 years ago. Simply put, it was very hard to program multi-threaded software that made use of all cores. Software like games did, as they needed all the available resources, but even the computations in MS Excel are mostly single-threaded as of now. Multi-threading was maybe used most for having a non-blocking user-interface. Even though OpenMP was standardised 15 years ago, it took many years before the multi-threaded paradigm was used for performance. If you want to read more on this, search the web for “the CPU frequency wall”.

More interesting is what is happening now with CPUs. Both Intel and AMD are releasing CPUs with lost of cores. Intel has recently a 18-core processor (Xeon E5 2699-v3) and AMD was offering 16-core CPUs for a longer time (Opteron 6300 series). Both have SSE and AVX, which means extra parallelism. If you don’t know what this is precisely about, read my 2011-article on how OpenCL uses SSE and AVX on the CPU.

AVX3.2

Intel now steps forward with AVX3.2 on their Skylake CPUs. AVX 3.1 is in XeonPhi “Knight’s Landing” – see this rumoured roadmap.

It is 512-bits wide, which means that 8 times as much vector-data can be computed! With 16 cores, this would mean 128 float operations per clock-tick. Like a GPU.

The disadvantage is alike the VLIW we had in the pre-GCN generation of AMD GPUs: one needs to fill the vector-instructions to get the speed-up. Also the relatively slow DDR3 memory is an issue, but lots of progress is being made there with DDR4 and stacked memory.

So is the CPU turning into a GPU?

I’d say yes.

With AVX3.2 the CPU gets all the characteristics of a GPU, except the graphics pipeline. That means that the CPU-part of the CPU-GPU is acting more like a GPU. The funny part is that with the GPU’s scalar-architecture and more complex schedulers, the GPU is slowly turning into a CPU.

In this 2012-article I discussed the marriage between the CPU and GPU. This merger will continue in many ways – a frontier where the HSA-foundation is doing great work now. So from that perspective, the CPU is transforming into a CPU-GPU; and we’ll keep calling it a CPU.

This all strengthens my believe in the future of OpenCL, as that language is prepared for both task-parallel and data-parallel programs – for both CPUs and GPUs, to say it in current terminology.

Using OpenCL 1.2 with 1.1 devices

Posted by Vincent Hindriksen on 5 February 2015 with 4 Comments

Recent code uses OpenCL 1.2 more and more, which is a problem when wanting to use OpenCL 1.1 devices, or even 1.0 devices. Most times those computers have OpenCL 1.1 libraries installed. So what to do?

When you want to make code that runs on both on Nvidia, AMD and Intel, then you have two options:

Use the OpenCL 1.2 library.
Go back to OpenCL 1.1 (from 2010)

Below I’ll explain both option. Understand that the difference between 1.1 and 1.2 is not that big, and that this article is mostly here to get you started to prepare for 2.0.

Use the OpenCL 1.2 library

Not many people seem to know that you can just use OpenCL 1.2 libraries with 1.1 devices. So if you have opencl.dll or libOpenCL.so of version 1.2 and not use the new functions. The advantage is that you can use the 1.2 host-functionalities with the 1.2 devices. Else those were not really usable, except the kernel-functions. To get the version 1.2 of opencl.dll or libOpenCL.so, you need to install the drivers of AMD or Intel or just copy it from a friend.

You can go one step further: we have a wrapper that translates most 1.2 functions to 1.1, to port code to 1.1 more easily. This is a preparation for OpenCL 2.0, to keep partly backwards compatible – versioning is getting more important with each new version of OpenCL.

When distributing your code, you need to distribute the library too. Problem is that it’s not really allowed, and should be installed by the user – and that user installs OpenCL 1.1 when they have NVidia hardware. Khronos distributes the code of the library, but it doesn’t compile easily under Windows. Please read the license.txt well.

Go for OpenCL 1.1 completely

The advantage is that you avoid all kinds of problems, by giving up some progress in the standard. As OpenCL 1.0 libraries are not really out there anymore, you probably won’t need to distribute the library yourself. And it’s just what Nvidia wanted: keeping the competition back. Congrats, Nvidia!

A good reason would

Problems arise when using deprecated functions, or have libraries that have calls to deprecated functions. Those can usually be avoided, but not when coding. For instance the cl.hpp for OpenCL 1.1 (found here) gives loads of warnings when you have the 1.2 library. Luckily this can easily be solved by adding some pragmas (works with GCC 4.6 and higher).

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wdeprecated-declarations"
 --> the contents of cl.hpp
#pragma GCC diagnostic pop

The source of this trick has some more remarks when using GCC 4.5.

A word towards Linux distributions

OSX has OpenCL 1.2 by default for the host. Linux distributions have dependencies on OpenCL 1.1 for Nvidia-drivers, which is not needed – as explained above. They can easily go for a dependency on the OpenCL 1.2 library. The code provided by Khronos could be used.

AMD Hawaii power-management fix on Linux

Posted by Vincent Hindriksen on 4 February 2015 with 1 Comment

The new Hawaii-based GPUs from AMD (Radeon R9 2xx, FirePro W9100 and Firepro S9150) have a lot of improvements, one being a new OverDrive 6 (AMD’s version of NVIDIA GPU Boost). Problem is that it’s not supported yet in the Linux drivers and you will get too low performance – it probably will be solved in the next version. Luckily there is od6config, made by Jeremi M Gosney.

Do the below steps to get the GPU at normal speed.

Download the zip or tar.gz from http://epixoip.github.io/od6config/ and unpack.
Go to the directory where you unpacked the archive.
run:
```
make
```
run:
```
sudo make install
```
check if it’s needed to fix the power management:
```
od6config --get clocks,temp,fan
```
if the values are too low, run:
```
od6config --autofix --set power=10
```
check if it worked:
```
od6config --get clocks,temp,fan
```

Only OverDrive6 devices are set, devices using OverDrive5 will be ignored.

The PowerTune of 10 was what we found convenient for us, but you might find better values for your case. There are several more options, which are on the homepage of 0d6config. You need to run “od6config –autofix –set power=10” on each reboot.

Remember it’s third party software, so no guarantees to you and no “you killed my GPU” to us.

Call for Papers, Presentations, Workshops and Posters for IWOCL in Stanford

Posted by Vincent Hindriksen on 10 January 2015

The IWOCL 2015 call for OpenCL Papers is now open and is looking for submissions from industry and academia relating to the use of OpenCL. Submissions may refer to completed projects or those currently in progress and are invited in the form of:

Research Papers
Technical Presentations
Workshops and Tutorials
Posters

Examples of sessions from 2014 can be found here.

Deadlines at a Glance

Call for submissions OPENS:	Wednesday 19th November, 2014
Call for submissions CLOSES:	Saturday 14th February, 2015 (23:59 AOE)
Notifications:	Within 4 weeks of the final closing date

Selection Criteria

The IWOCL Technical Committee will select submissions based on the following criteria;

Concept of the submission and its relevance and timeliness
Technical Depth
Clarity of the submissions; clearly conveying what your presentation
Research findings and results of your work
Your credentials and expertise in the subject matter

Unpublished Technical Papers

We solicit the submission of unpublished technical papers detailing original research related to OpenCL. All topics related to OpenCL are of interest, including OpenCL applications from any domain (e.g., scientific computing, video games, computer graphics, multimedia, information retrieval, optimization, text processing, data mining, finance, signal and image processing and numerical solvers), OpenCL performance analysis and modeling, OpenCL performance and correctness tools and proposed OpenCL extensions. IWOCL will publish formal proceedings of the accepted papers in The ACM International Conference Series. Please Submit an Abstract which should be between 1 and 4 pages long.

Technical Presentations

We solicit the submission of technical presentations detailing the innovative use of OpenCL. All topics related to OpenCL are of interest, including but not limited to applications, software tools, programming methods, extensions, performance analysis and verification. Please Submit an Abstract which should not exceed 4 pages. The accepted presentations will be published in the online workshop proceedings.

Workshops & Tutorials

IWOCL includes a day of tutorials that provide OpenCL users an opportunity to spend more time exploring a specific OpenCL topic. Tutorial submissions should assume working knowledge of OpenCL by the attendees and can for example cover OpenCL itself, any of the related APIs such as SPIR and SYCL, the use of OpenCL libraries or parallel computing techniques in general using OpenCL. Please Submit an Abstract which should not exceed 4 pages. Please include the preferred length of the tutorial or workshop (e.g. 2, 3 or 4 hours).

Posters

To encourage discussion of the latest developments in the OpenCL community, there will be a poster session running in parallel to the main sessions and open during the breaks and lunch sessions. The abstracts of the accepted posters will be published in the form of short communications in the workshop proceedings, provided that at least one of the authors has registered for the workshop. Please Submit an Abstract which should not exceed 2 pages.

Submit your abstract today

Go to Easychair, log in or register, and click on “New Submission”. Deadline is 14 February.

We sponsor HiPEAC again this year

Posted by Vincent Hindriksen on 17 December 2014

HiPEAC is an academic oriented, 3-day, international conference around HPC, compilers and processors. Last year was in Vienna, this year in Amsterdam – where StreamHPC also is based. That was an extra reason to go for silver sponsorship, besides I find this conference very important.

Compilers have the job to do magic. Last year I had nice feedback on my request to give the developers feedback where in the code the compiler struggles – effectively slapping the guy/gal instead of trying to solve it with more magic. Also learned a lot about compilers in general, listened to GPGPU-talks, discussed about HPC, and most of all: met a lot of very interesting people.

Why you should come too? I give you five reasons:

Learn about compilers and GPU-techniques, in depth.
Have great discussions about the latest and greatest research, before it’s news.
Meet great people who create the compilers you use (or the reverse).
Visit Amsterdam, Netherlands – I can be your guide. Flights are cheap.
Only spend €400 for the full 3 day programme and a unique dinner with 500 people – compare that to SC14 and GTC!

If you are seeking for a job in HPC, compilers and GPGPU, you should really come over. We’re there, but several other sponsors are also looking for new employees too.

See the tracks at HiPEAC, which has a lot more GPU-oriented talks than last year. I selected a few from the list in bold.

Monday

Opening address
William J. Dally, Challenges for Future Computing Systems
Euro-TM: Final Workshop of the Euro-TM COST Action
Session 1. Processor Core Design
CS²: Cryptography and Security in Computing Systems
IMPACT: Polyhedral Compilation Techniques
MCS: Integration of mixed-criticality subsystems on multi-core and manycore processors
EEHCO: Energy Efficiency with Heterogeneous Computing
INA-OCMC: Interconnection Network Architecture: On-Chip, Multi-Chip
WAPCO: Approximate Computing
SoftErr: Mitigation of soft errors: from adding selective redundancy to changing the abstraction stack
Session 2. Data Parallelism, GPUs
James Larus, It’s the End of the World as We Know It (And I Feel Fine)
ENTRE: EXCESS & NANOSTREAMS
SiPhotonics: Exploiting Silicon Photonics for energy-efficient high-performance computing
HetComp: Heterogeneous Computing: Models, Methods, Tools, and Applications
Session 3. Caching
Session 4. I/O, SSDs, Flash Memory
Student poster session / Welcome reception

Tuesday

Don’t forget to meet us at the industrial poster-sessions.

Rudy Lauwereins, New memory technologies and their impact on computer architectures
Thank you HiPEAC
Session 5. Emerging Memory Technologies
EMC²: Mixed Criticality Applications and Implementation Approaches
ADEPT: Energy Efficiency in High-Performance and Embedded Computing
MULTIPROG: Programmability Issues for Heterogeneous Multicores
WRC: Reconfigurable Computing
TISU: Transfer to Industry and Start-ups
HiStencils: High-Performance Stencil Computations
MILS: Architecture and Assurance for Secure Systems
Programmability: Programming Models for Large Scale Heterogeneous Systems
Industrial Poster Session
INNO2015: Innovation actions in Advanced Computing CFP
Session 6. Energy, Power, Performance
DCE: Dynamic Compilation Everywhere
EUROSERVER: Green Computing Node for European Micro-servers
PolyComp: Polyhedral Compilation without Polyhedra
HiPPES4CogApp: High-Performance Predictable Embedded Systems for Cognitive Applications
Industrial Session
Session 7. Memory Optimization
Session 8. Speculation and Transactional Execution
Canal tour / Museum visit / Banquet

Wednesday

Burton J. Smith, Resource Management in PACORA
HiPEAC 2016
Session 9. Resource Management and Interconnects
PARMA-DITAM: Parallel Programming and Run-Time Management Techniques for Many-core Architectures + Design Tools and Architectures for Multi Core Embedded Computing Platforms
ADAPT: Adaptive Self-tuning Computing System
PEGPUM: Power-Efficient GPU and Many-core Computing
HiRES: High-performance and Real-time Embedded Systems
RAPIDO: Rapid Simulation and Performance Evaluation: Methods and Tools
MemTDAC: Memristor Technology, Design, Automation and Computing
DataFlow, Computing in Space: DataFlow SuperComputing
IDEA: Investigating Data Flow modeling for Embedded computing Architectures
TACLe: Timing Analysis on Code-Level
EU Projects Poster Session
Session 10. Compilers
HIP3ES: High Performance Energy Efficient Embedded Systems
HPES: High Performance Embedded Systems
Session 11. Concurrency
Session 12. Methods (Simulation and Modeling)

Hopefully till then!

How to introduce HPC in your enterprise

Posted by Vincent Hindriksen on 10 December 2014

eviljaymz-spare-time — Spare time in IT – © jaymz.eu

The past ten years we have been happy when we got back home from the office. Our home-computer is simply faster, has more software, more memory and does not take over 10 minutes to boot. Office-computers can be that slow, because 90% of the work is typing documents anyway. Meanwhile the office-servers are mostly used for the intranet and backups only. It’s the way of life and it seems we have to accept it.

But what if you have a daily batch that takes 1 hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to service your colleagues faster? Then Office-HPC can be the answer, the type of High Performance Computing that is affordable and in reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand where it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. It’s very important and should be done before anything else. Even though you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Please delay this decision to later, before you end up with the wrong solution.

How to get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return-on-investment is very good, even when only 10 people are using the software in the company. If the waiting period per person per day is reduced with 20 minutes per day, then it’s easy to see that it pays back quickly: that’s 80 hours per person per year. Based on 10 people that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could earn back the investment within a year, easily.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. Often I get questions “how do I speed up this algorithm on this device?”, while the question should be like “How do I speed up this algorithm?”. It takes some time to find out what device fits the algorithm best.

There is too much to discuss in this phase, so I keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer to help people for free, than to read about failed “HPC in the office” projects (and giving others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, for ever? Use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain specific compute-engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL-software, it ports quite easily to the FPGA – therefore I like the combination with AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilin recently also started to support OpenCL on their devices. They have the same reason as Altera: to make development time for FPGA code shorter.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when code-branch prediction is used heavily. And by using OpenCL you can get more performance than when using OpenMP, plus you can port between devices much easier.

Q: I heard I should not use gaming GPUs. Why not?

A: Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming-GPUs suffice, you need the support before you get to a cluster – the free support is mostly community-based and only gives answers to the problems everybody has. Also libraries are often better tuned for professional cards. See it as this: gaming-GPUs come with free games, professional compute-GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

Intel: Go for the XeonPhi’s which end with an “A” (= active cooled)
NVIDIA: For the newly announced K80, there will not be an active cooled version – so take the active cooled K40.
AMD: For the S9150 get a W9100.
Altera: Low-power, so you can use the same device. Do ask your supplier specifically if it applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see performance in a cluster, then it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good investment. It’s the first step towards getting HPC in your office, and better do it well. Make sure you have at least the memory for your CPU as you have on your accelerator, if you want to use all the GPU’s memory. The S9150 has 16GB of memory, so you need 32GB MB to support two cards.

If you make use of an external software development company, you also need to have a good machine to test out the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, better to test with one Accelerator first. If you don’t need to have something like OpenGL/OpenCL-interaction, make sure you use a third GPU for the video-output, as usage can influence the GPU performance.

Program your software using MPI for connecting the two accelerators and be in full control of what is blocking, to be prepared for the cluster.

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say that you have chosen to offer your compute solution via an Excel-plugin, which communicates with the software via an API. Add new users one at a time – make sure (parts of) the results are tested! From here it’s software-development as we know it, and the most unexpected bugs come out of the test-group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you probably will get good results anyway. The moment you have setup the MPI-environment on multiple desktops, you have just setup your minimal test-street. Very important for later, as many enterprises lack a test-street – then it’s better to have it partially shared with your development-environment. I’m pretty sure I get comments on this, but I would really like to have more companies to do larger scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible. Else you need to keep it to one or two. We can give you advice on this in phase 1 already, so you know where to prepare for. Then the most important step comes: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node of 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound, then it’s best to put as many GPUs per node. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but they only offer systems that don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes, you should check for complete different server cases, as cooling requirements are different. For Xeon-only nodes, you can find solutions with 4 CPU-sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

AMD now leads the Green500

Posted by Vincent Hindriksen on 25 November 2014

With SC14 behind us, there are a few things I’d like to share with you. I’d like to start with the biggest win for OpenCL: AMD leading in the most power-efficient GPU-cluster.

A few months ago I wrote a theoretical article on how to build the cheapest and greenest supercomputer to enter the Top500 and Green500. There I showed that AMD would theoretically win on both GFLOPS/costs and GFLOPS/Watt. Last week I learned a large cluster is actually being built in Germany, which now leads the Green500 (GFLOPS/Watt). It is powered by Intel Ivy Bridge CPUs, an FDR Infiniband network and accelerated by air-cooled(!) AMD FirePro S9150 GPUs, as can be seen on the Green 500 report of November. The score: 5.27 GFLOPS per Watt, mostly because of AMD’s surprise act: extremely efficient SGEMM and DGEMM.

The first NVIDIA Tesla-based system on the list is at #3 with 4.45 GFLOPS per Watt for a liquid cooled system. If the AMD FirePro S9150 would be oil or water cooled, the system could go to over 6 GFLOPS per Watt. I’m expecting such system on the Green500 of June. The PEZY-SC (#2 on the list) is a very interesting, unexpected newcomer to the field – I’ll share more with you later, as I heard it supports OpenCL.

The price metric

The cluster at GSI Helmholtz Center has around 1.65 double precision PetaFlops (theoretical). Let’s do the same calculation as with the 150 GFLOPS system using the latest prices, only taking the accelerator part.

640 x AMD FirePro S9150.

2.53 GFLOPS * 640 = 1.62 TFLOPS (I rounded down to 2.0 GFLOPS in the other article)
US$ 3300. Total price: $2.112M. Price per TFLOPS: $1.304M
235 Watt * 640 = 150 kWatt (excluding network, CPU, etc)

640 x NVIDIA Tesla K40x

1.42 GFLOPS * 640 = 0.91 TFLOPS
US$ 3160 (got down a lot due to introduction K80!). Total price: $2.022M. Price per TFLOPS: $2.225M
235 Watt * 640 = 150 kWatt

640 x Intel XeonPhi 7120P

1.21 GFLOPS * 640 = 0.65 TFLOPS
US$ 3450. Total price: 2.208$M. Price per TFLOPS: $3.397M
300 Watt * 640 = 192 kWatt

So it’s pretty clear, why GSI chose AMD: $92M or $209M less costs for the same GFLOPS. Also note that more GFLOPS per accelerator is important to lower overhead.

What to expect from June’s Green500

Next year Nvidia probably comes with Maxwell, which probably will do very well in the Green500. Intel has their new XeonPhi, but it’s a very new architecture and no samples have arrived yet – I would be surprised, as they over-promised for too long now. Besides bringing surprises, Intel’s other strengths are its vast collaborations and strong fanbase – the past years I heard the most ridiculous responses on why such underperforming accelerator was chosen instead of FirePro or Tesla, so it’s certainly aiming for a rampage (based on hope). AMD did not enclose any information on a new version of the S9150 (Something like S9200 or S9250).

Then there are the dual GPUs, which have no advantages but lower energy-usage. The K80 just arrived, but the number don’t add up yet – we’ll have to see when the samples arrive. AMD did not say anything about the next version of the S10000, but probably arrives next year – no ETA. Intel did not do dual-chip cards until now. These systems can be built more compact, as 4 GPUs per system is becoming a standard.

Another important change will be the CPUs with embedded CPU being used in the clusters, where now mostly Intel Xeons rule the world. Intel’s Iris Pro line and AMD new Carrizo APU could certainly get more popular, as more complex code can be accelerated very well by such processors. Also 64-bit ARM-processors we’ll see more – hopefully with GPU. This subject I’ll handle in a separate article, as OpenCL could be a big enabler for easy offloading.

Based on the current information I have available, Nvidia aims for Maxwell based Teslas, AMD with S9150 and the dual-GPU variant, Intel with none (aiming for November 2015). It’ll be exciting to see HPC get to 6+ GFLOPS/Watt as a standard – I find that more important than building the biggest cluster.

OpenCL will help select hardware from that year’s winner, not being locked in to that year’s loser. Meanwhile at StreamHPC we will keep building OpenCL-based software, to help our customers pick that winner.

What does Khronos has more to offer than OpenCL and OpenGL?

Posted by Vincent Hindriksen on 24 November 2014

The OpenCL standard is from the not-for-profit industry consortium Khronos Group. But they do a lot more, like the famous standard OpenGL for graphics. Focus of the group has always been on multimedia and getting the fastest results out of the hardware.

Now open source and open standards are getting more important, collabroations like the Khronos Group, get more attention. At StreamHPC we are very happy with this trend, as the business models are more focused on collaborations and getting things done than on making sure the customer cannot ever leave.

Below is an overview of the most important APIs that Khronos has to offer.

OpenCL related

OpenCL: compute
WebCL: web compute
SPIR/SPIR-V: intermedia language for compute-kernels, like those of OpenCL and OpenGL’s GSLS
SYCL: high-level language for OpenCL

OpenGL related

Vulkan: state-less graphics
OpenGL: graphics
OpenGL ES: embedded graphics
WebGL: web graphics
glTF: runtime asset format for WebGL, OpenGL ES, and OpenGL
OpenGL SC: Graphics for Safety Critical operations
EGL: interface between rendering APIs such as OpenGL ES and the underlying native platform window system, such as X.

Streaming input and output

OpenMAX: interface for multimedia codecs, platforms and hardware
StreamInput: interface for sensors
OpenVX: OpenCV-alternative, built for performance.
OpenKCam: interface for cameras and sensors

Others

COLLADA: 3D model file format
OpenSL ES: embedded audio
OpenVG: vector Graphics

One video called “OpenRoad” to show them all:

Want to learn more? Feel free to ask in the comments, or check out https://www.khronos.org/

Starting with GROMACS and OpenCL

Posted by Vincent Hindriksen on 15 November 2014

Now that GROMACS has been ported to OpenCL, we would like you to help us to make it better. Why? It is very important we get more projects ported to OpenCL, to get more critical mass. If we only used our spare resources, we can port one project per year. So the deal is, that we do the heavy lifting and with your help get all the last issues covered. Understand we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The below steps will take no more than 30 minutes.

Getting the sources

All sources are available on Github (our working branch, bases on GROMACS 5.0). If you want to help, checkout via git (on the command-line, via Visual Studio (included in 2013, 2010 and 2012 via git plugin), Eclipse or your preferred IDE. Else you can simply download the zip-file. Note there is also a wiki, where most of this text came from. Especially check the “known limitations“. To checkout via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working building environment (GCC, Visual Studio), and an OpenCL SDK installed. You also need FFTW. Gromacs installer can build it for you, but it is also in Linux repositories, or can be downloaded here for Windows. Below is for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other options, to build. You don’t need them, but it gives an idea what is possible:

-DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
-DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
-DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
-DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
-DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
-DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
-DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
-DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
-DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
-DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
-DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then make and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (4 core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the Gromacs website. In all cases, it is very important you turn on GMX_GPU and GMX_USE_OPENCL. Also the wiki of the Gromacs OpenCL project has lots of extra information. Be sure to check them, if you want to do more than just the below benchmarks.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

Gromacs needs to know where to find the OpenCL kernels, for both Linux and Windows. Under Linux type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. For Windows define GMX_OCL_FILE_PATH environment variable and set its value to be /path_to_gromacs/src/

Important: if you plan to make changes to the kernels, you need to disable the caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in the build/bin folder. If you have installed it machine wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/ :

cd d.poly-ch2
../gmx grompp
../gmx mdrun

Now you just ran Gromacs and got results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run from the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.

What’s next to do?

If you have your own experiments, ofcourse test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that Gromacs was optimised for NVidia hardware, and we needed to reverse a lot of specific optimisations for good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any question? Just ask them in the comments below, and we’ll help you on your way.

Why this new AMD FirePro Cluster is important for OpenCL

Posted by Vincent Hindriksen on 14 November 2014

Then it hit the doormat:

“AMD is proud to collaborate with ASUS, the Frankfurt Institute for Advanced Studies, (FIAS) and GSI to support such important physics and computer science research,” said David Cummings, senior director and general manager, professional graphics, AMD. “This installation reaffirms AMD’s leading role in HPC with the implementation of the AMD FirePro S9150 server GPUs in this three petaFLOPS supercomputer cluster. AMD and ASUS are enabling OpenCL applications for critical science research usage for this cluster. We’re committed to building our HPC leadership position in the industry as a foremost provider of computing applications, tools and technologies.“

You read more here and the official news here.

Why is this important?

It could be that there is more flops for the same price, as AMD hardware is cheaper? Nice, but secondary.

That it runs OpenCL? We like that, but from a broader perspective this is not the most important.

It is important because it creates more diversity in the world of HPC. Currently there are a few XeonPhi clusters and only one big AMD FirePro S10000 cluster. The rest is NVidia Tesla or CPU only. With more AMD clusters the HPC market is democratised. That means that more software will be written in vendor-neutral software like OpenCL (with high-level software/libraries on top), and prices of HPC accelerators will not be kept high.

How to further democratise the HPC world?

We started with porting Gromacs to OpenCL, and we will continue to port large projects to OpenCL. This software will simply run on XeonPhi, Tesla and FirePro with just little porting time, reducing costs in many ways. We can not do it alone, but together we can. Start by telling us which software needs to be ported from OpenMP to OpenCL or OpenMP 4, or from CUDA to OpenCL. And if you are porting open source software to OpenCL, drop us a line for free advice and help with testing the software.

And the best you can do to break the monopoly of CUDA, is to simply buy AMD or Intel hardware. The price difference is enough to buy lots of extra FLOPS and to pay for a complete porting project to OpenCL of a large application.

OpenCL at SC14

Posted by Vincent Hindriksen on 12 November 2014 with 5 Comments

During SC14 (SuperComputing Conference 2014), OpenCL is again all over New Orleans. Just like last year, I’ve composed an overview based on info from the Khronos website and the SC2014 website.

Finally I’m attending SC14 myself, and will give two talks for you. Tuesday I’ll be part of a 90 minute session of Khronos, where I’ll talk a bit about GROMACS and selecting the right accelerator for your software. Wednesday I’ll be sharing our experiences from our port of GROMACS to OpenCL. If you meet me, then I can hand you over a leaflet with the decision chart to help select the best device for the job.

Continue reading “OpenCL at SC14” →

OpenCL integer rounding in C

Posted by Vincent Hindriksen on 7 November 2014 with 4 Comments

Square_rounding — Square pant rounding can simply be implemented with “*return (NAN);*“.

Getting about the same code in C and OpenCL has lots of advantages, when maximum optimisations and vectors are not needed. One thing I bumped into myself was that rounding in C++ is different, and decided to implement the OpenCL-functions for rounding in C.

The OpenCL-page for rounding describes many, many functions with this line:

destType convert_destType<_sat><_roundingMode>(sourceType)

So for each sourceType-destType combination there is a set of functions: 4 rounding modes and an optional saturation. Easy in Ruby to define each of the functions, but takes a lot more time in C.

The 4 rounding modes are:

Modifier	Rounding Mode Description
`_rte`	Round to nearest even
`_rtz`	Round towards zero
`_rtp`	Round toward positive infinity
`_rtn`	Round toward negative infinity

The below pieces of code should also explain what the functions actually do.

Round to nearest even

This means that the numbers get rounded to the closest number. In case of 3.5 and 4.5, they both round to the even number 4. Thanks for Dithermaster, for pointing out my wrong assumption and clarifying how it should work.

inline int convert_int_rte (float number) {
   int sign = (int)((number > 0) - (number < 0));
   int odd = ((int)number % 2); // odd -> 1, even -> 0
   return ((int)(number-sign*(0.5f-odd)));
}

I’m sure there is a more optimal implementation. You can fix that in Github (see below).

Round to zero

This means that positive numbers are rounded up, negative numbers are rounded down. 1.6 becomes 1, -1.6 also becomes 1.

inline int convert_int_rtz (float number) {
   return ((int)(number));
}

Effectively, this just removes everything behind the point.

Round to positive infinity

1.4 becomes 2, -1.6 becomes 1.

inline int convert_int_rtp (float number) {
   return ((int)ceil(number));
}

Round to negative infinity

1.6 becomes 1, -1.4 becomes 2.

inline int convert_int_rtp (float number) {
   return ((int)floor(number));
}

Saturation

Saturation is another word for “avoiding NaN”. It makes sure that numbers are between INT_MAX and INT_MIN, and that NaN returns 0. If not used, the outcome of the function can be anything (-2147483648 in case of convert_int_rtz(NAN) on my computer). Saturation is more expensive, so therefore it’s optional.

inline float saturate_int(float number) {
  if (isnan(number)) return 0.0f; // check if the number was already NaN
  return (number>MAX_INT ? (float)MAX_INT : number

Effectively the other functions become like:

inline int convert_int__sat_rtz (float number) {
   return ((int)(saturate_int(number)));
}

Doubles, longs and getting started.

Yes, you need to make functions for all of these. But you could ofcourse also check out the project on Github (BSD licence, rudimentary first implementation).

You’re free to make a double-version of it.

Mega-kernel versus Micro-kernels in LuxRender (repost)

Posted by Vincent Hindriksen on 4 November 2014

Below is a (slightly edited) repost of a blog by David Bucciarelli (homepage, twitter) on the Luxrender forum.

I find micro-kernels an important subject, since micro-kernels have clear advantages. In OpenCL 2.0 there are more possibilities to create smaller kernels. Also making smaller and more focused functions is considered good software engineering, defined as “Separation of Concerns“.

For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program

into an equally large GPU kernel is generally not a good approach.

Due to SIMT execution model on GPUs, divergence in control flow

carries substantial performance penalties, as does high register us-

age that lessens the latency-hiding capability that is essential for the

high-latency, high-bandwidth memory system of a GPU. In this pa-

per, we implement a path tracer on a GPU using a wavefront formu-

lation, avoiding these pitfalls that can be especially prominent when

using materials that are expensive to evaluate. We compare our per-

formance against the traditional megakernel approach, and demon-

strate that the wavefront formulation is much better suited for real-

world use cases where multiple complex materials are present in

the scene.

OpenCL kernels in “SmallLuxGPU” (raytracer, originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc. one of the kernels sized up to the point of being a “Mega-kernel”.

The major problem with “Mega-kernel”, aside of the inability of AMD OpenCL compiler to compile them, is the huge register usage and the very low GPU utilization. Why this happens, is well explained in the paper.

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decrease from 196 (!!!) to 3-84 and the GPU utilization rise from a miserable 10% to a more healthy 30%-100%.

Occupancy increases from 10% to 30% or more

The performance increase is huge on some platform (Linux + FirePro W8100), 3.6 times:

Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with AMD compiler

Micro-kernels not only improve the performance but also addressees the major issues with AMD OpenCL compiler. For the very first time since the release of first AMD OpenCL SDK beta, I’m not aware of a scene not running on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.

We ported GROMACS from CUDA to OpenCL

Posted by Vincent Hindriksen on 1 November 2014 with 9 Comments

GROMACS does soft matter simulations on molecular scale. Let it fly.

GROMACS is an important molecular simulation kit, which can do all kinds of “soft matter” simulations like nanotubes, polymer chemistry, zeolites, adsorption studies, proteins, etc. It is being used by researches worldwide and is one of the bigger bio-informatics softwares around.

To speed up the computations, GPUs can be used. The big problem is that only NVIDIA GPU could be used, as CUDA was used. To make it possible to use other accelerators, we ported it to OpenCL. It took several months with a small team to get to the alpha-release, and now I’m happy to present it to you.

For who knows us from consultancy (and training) only, might have noticed. This is our first product!

We promised to keep it under the same open source license and that effectively means we are giving it away for free. Below I’ll explain how to obtain the sources and how to build it, but first I’d like to explain why we did it pro bono.

Why we did it

Indeed, we did not get any money (income or funds) for this. There have been several reasons, of which the below four are the most important.

The first reason is that we want to show what we can. Each project was under NDA and we could not demo anything we made for a customer. We chose for a CUDA package to port to OpenCL, as we notice that there is a trend to port CUDA-software to OpenCL (i.e. Adobe software).
The second reason is that bio-informatics is an interesting industry, where we would like to do more work.
Third reason is that we can find new employees. Joining the project is a way to get noticed and could end up in a job-proposal. The GROMACS project is big and needs unique background knowledge, so it can easily overwhelm people. This makes it perfect software to test out who is smart enough to handle such complexity.
Fourth is gaining experience with handling open source projects and distributed teams.

Therefore I think it’s a very good investment, while giving something (back) to the community.

Presentation of lessons learned during SC14

We just jumped in and went for it. We learned a lot, because it did not go as we expected. All this experience, we would like to share on SuperComputing 2014.

During SC14 I will give a presentation on the OpenCL port of GROMACS and the lessons learned. As AMD was quite happy with this port, they provided me a place to talk about the project:

“Porting GROMACS to OpenCL. Lessons learned”
SC14, New Orleans, AMD’s mini-theatre.
19 November, 15:00 (3:00 pm), 25 minutes

The SC14 demo will be available on the AMD booth the whole week, so if you’re curious and want to see it live with explanation.

If you’d like to talk in person, please send an email to make an appointment for SC14.

Getting the sources and build

It still has rough edges, so a better description would be “we are currently porting GROMACS to OpenCL”, but we’re very close.

As it is work in progress, no binaries are available. So besides knowledge of C, C++ and Cmake, you also need to know how to work with GIT. It builds on both Windows and Linux, and NVIDIA and AMD GPUs are the target platforms for the current phase.

The project is waiting for you on https://github.com/StreamHPC/gromacs.

The wiki has lots of information, from how to build, supported devices to the project planning. Please RTFM, before starting! If something is missing on the wiki, please let us know by simply reporting a new issue.

Help us with the GROMACS OpenCL port

We would like to invite you to join, so we can make the port better than the original. There are several reasons to join:

Improve your OpenCL skills. What really applies to the project is this quote:

Tell me and I forget.
Teach me and I remember.
Involve me and I learn.
Make the OpenCL ecosphere better. Every product that has OpenCL support, gives choice to the user what GPU to use (NVIDIA, AMD or Intel)
Make GROMACS better. It is already a large community and OpenCL-knowledge is needed now.
Get hired by StreamHPC. You’ll be working with us directly, so you’ll get to know our team.

What can you do? There is much you can do. Once you managed to build and run it, look at the bug reports. First focus is to get the failing kernels working – this is top priority to finalise phase 1. After that, the real fun begins in phase 2: add features and optimise for speed on specific devices. Since AMD FirePro is much better in double precision than Nvidia Tesla, it would be interesting to add support for double precision. Also certain parts of the code is done on the CPU, which have real potential to be ported to the GPU.

If things are not clear and obstruct you from starting, don’t get stressed and send an email with any question you have. We’re awaiting your merge request or issue report!

Special thanks

This project wasn’t possible without the help of many people. I’d like to thank them now.

The GROMACS team in Sweden, from the KTH Royal Institute of Technology.
- Szilárd Páll. A highly skilled GPU engineer and PhD student, who pro-actively keeps helping us.
- Mark Abraham. The GROMACS development manager, always quickly answering our various questions and helping us where he could.
- Berk Hess. Who helped answering the harder questions and feeding the discussions.
Anca Hamuraru, the team lead. Works at StreamHPC since June, and helped structure the project with much enthusiasm.
Dimitrios Karkoulis. Has been volunteering on the project since the start in his free time. So special thanks to Dimitrios!
Teemu Virolainen. Works at StreamHPC since October and has shown to be an expert on low-level optimisations.
Our contacts at AMD, for helping us tackle several obstacles. Special thanks go to Benjamin Coquelle, who checked out the project to reproduce problems.
Michael Papili, for helping us with designing a demo for SC14.
Octavian Fulger from Romanian gaming-site wasd.ro, for providing us with hardware for evaluation.

Without these people, the OpenCL port would never been here. Thank you.

OpenCL tutorial videos from Mac Research

Posted by Vincent Hindriksen on 20 October 2014 with 1 Comment

A while ago macresearch.com stopped from existing, as David Gohara pulled the plug. Luckily the sources of a very nice tutorial were not lost, and David gave us permission to share his material.

Even if you don’t have a MAC, then these almost 5 year old materials are very helpful to understand the basics (and more) of OpenCL.

We also have the sources (chapter 4, chapter 6) and the collection of corresponding PDFs for you. All material is copyright David Gahora. If you like his style, also check out his podcasts.

Introduction to OpenCL

OpenCL fundamentals

Building an OpenCL Project

Memory layout and Access

Questions and Answers

Shared Memory Kernel Optimisation

Did you like it? Do you have improvements on the code? Want us to share more material? Let us know in the comments, or contact us directly.

Want to learn more? Look in our knowledge base, or follow one of our trainings.

We’re looking for an intern to do the cool stuff: benchmarking and Linux wizarding

Posted by Vincent Hindriksen on 28 September 2014

We have some embedded devices here, which badly need attention. Some have gotten some private time on the bench, but we did not share anything on the blog yet with our readers. We simply need some extra hands to do this. Because it’s actually cool to do, but admittedly a bit boring when doing several devices, it was the perfect job for an intern. Besides the benchmarking, we have some other Linux-related projects for you. You’ll get an average payment for an internship in the Netherlands (in Dutch: “stagevergoeding”), lunch, a desk and a bunch of devices (aka toys-for-techies).

Like more companies in the Netherlands, we don’t care about how you where born, but who you are as a person. We expect from you that you…

know everything about Linux administration, from servers to embedded devices.
know how to setup a benchmark.
document all what you do, not only the results.
speak and write Dutch and English.
have great humor! (Even if you’re the only one who laughs at your jokes).
study in the EU, or can arrange the paperwork to get to the EU yourself.
have a place to live/crash in or nearby Amsterdam, or don’t mind the daily travelling. You cannot sleep in the office.

Together with your educational institute we’ll discuss the exact learning goals of the internship, and make a plan for a period of 3 to 6 months.

If you are interested, send a mail to jobs@streamhpc.com. If you know somebody who would be interested, please tell that person that we’re waiting for him/her! Also tips&tricks on finding the right person are very welcome.

Category: Technical