Phoronix OpenCL Benchmark 3.0 beta

Posted by Vincent Hindriksen on 14 January 2011 with 1 Comment

So you want OpenCL-benchmarks? Phoronix is a benchmark for OSX and Linux, created by Michael Larabel, Matthew Tippett (http://en.wikipedia.org/wiki/Phoronix_Test_Suite). On Ubuntu Phoronix version 2.8 is in the Ubuntu “app store” (Synaptic), but 3.0 has those nice OpenCL-tests. The tests are based on David Bucciarelli‘s OpenCL demos. Starting to use Phonornix 3.0 (beta 1) is done in 4 easy steps:

Download the latest beta-version from http://www.phoronix-test-suite.com/?k=downloads
Extract. Can be anywehre. I chose /opt/phoronix-test-suite
Install. Just type ./phoronix-test-suite in a terminal
Use.

WARNING: It is beta-software and the following might not work on your machine! If you have problems with this tutorial and want or found a fix, post a reply.

Continue reading “Phoronix OpenCL Benchmark 3.0 beta” →

OpenCL tutorial videos from Mac Research

Posted by Vincent Hindriksen on 20 October 2014 with 1 Comment

A while ago macresearch.com stopped from existing, as David Gohara pulled the plug. Luckily the sources of a very nice tutorial were not lost, and David gave us permission to share his material.

Even if you don’t have a MAC, then these almost 5 year old materials are very helpful to understand the basics (and more) of OpenCL.

We also have the sources (chapter 4, chapter 6) and the collection of corresponding PDFs for you. All material is copyright David Gahora. If you like his style, also check out his podcasts.

Introduction to OpenCL

http://www.youtube.com/watch?v=oc1-y1V1TPQ

OpenCL fundamentals

http://www.youtube.com/watch?v=FrLqSgYyLQI

Building an OpenCL Project

http://www.youtube.com/watch?v=K7QiD74kMvU

Memory layout and Access

http://www.youtube.com/watch?v=oPE3ypaIEv4

Questions and Answers

http://www.youtube.com/watch?v=9rA6DypMsCU

Shared Memory Kernel Optimisation

http://www.youtube.com/watch?v=oFMPWuMso3Y

Did you like it? Do you have improvements on the code? Want us to share more material? Let us know in the comments, or contact us directly.

Want to learn more? Look in our knowledge base, or follow one of our trainings.

OpenCL Videos of AMD’s AFDS 2012

Posted by Vincent Hindriksen on 12 October 2012 with 4 Comments

AFDS was full of talks on OpenCL. You missed them, just like me? Then you will be happy that they put many videos on Youtube!

Enjoy watching! As all videos are around 40 minutes, it is best to take a full day for watching them all. The first part is on openCL itself, second is on tools, third on OpenCL usages, fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012” →

IWOCL 2017 – all the talks

Posted by Vincent Hindriksen on 2 May 2017

An overview of all the tutorials and talks for easy reading.

You can also download the PDF.

Heterogeneous Computing Using Modern C++ with OpenCL Devices – Rod Burns and Ruyman Reyes (Codeplay)

This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free to use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.

SYCL is already able to dispatch to heterogeneous devices and it implements C++17 ParallelSTL, augmenting it with ability to dispatch to GPUs in addition to CPUs. This tutorial will demonstrate how to write parallel SYCL code and how to use the Khronos Group’s experimental Parallel STL implementation. The course outline is as follows

Start with a basic SYCL program that shows how to submit queues in a single task and stream-like object, comparing CPU, SYCL and OpenCL versions
Demonstrate how to access data across host and GPUs using buffers and accessors, the importance of life-time, and basic parallel constructs

Attendees are expected to have programming experience with C++ and a laptop either running Linux or having a VM manager installed such as VirtualBox. The required software will be provided on USB-sticks. This course is suitable for beginners, but is focused on intermediate to advanced parallel programming using C++.

Harnessing the Power of FPGAs with the Intel FPGA SDK for OpenCL- Byron Sinclair, Andrew Ling and Genady Paikin (Intel)

In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs).

You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel® FPGA SDK for OpenCL, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA.

Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.

Unlock Intel GPUs for High Performance Compute, Media and Computer Vision Capabilities with Intel OpenCL Extensions – Jeff Mcallister, Biju George, Adam Herr and Ben Ashbaugh (Intel)

The keys to unlock the full performance potential of Intel GPUs for emerging workloads in general compute, media, computer vision, and machine learning are in the rich suite of Intel OpenCL extensions. These give developers direct access to unique Intel hardware capabilities, which until now have been difficult to master.
This tutorial builds step by step with multiple examples, including:

How to write high performance general compute applications based on the core concept of OpenCL subgroups.
How to use additional subgroup operations described in the Intel subgroups and media block read/write extensions.
Then using the framework of subgroups, we explain the device-side motion estimation extension which leverages the unique Intel GPU media sampler to accelerate motion estimation operations from OpenCL kernels.
Finally we explain the Video Enhancement (VEBOX) extension, which is an OpenCL host level API extension to leverage a powerful media fixed function unit to accelerate many frame level video enchancement operations.

Faster, smarter computer vision with AI and OpenCL – Uri Levy and Jeffrey Mcallister (Intel)

Learn how to use Intel machine learning and computer vision tools to get from concept to market faster for machine learning applications based on OpenCL and OpenVX. Build two example scenarios: autonomous driving with FPGA inference and a smart camera app using Intel Graphics inference. This presentation will show how a unified set of tools can reduce the complexity of developing heterogeneous machine learning apps – from training a model with input images, to creating a custom classifier, to building an optimized traditional computer vision pipeline around the classifier to create a full computer vision application

GPGPU Acceleration using OpenCL for a Spotlight SAR Simulator – Eric Balster, Jon Skeans and David Fan (University of Dayton) Marc Hoffman (US Air Force Research Laboratory)

In this paper, OpenCL is used to target a general purpose graphics processing unit (GPGPU) for acceleration of 2 modules used in a synthetic aperture radar (SAR) simulator. Two of the most computationally complex modules, the Generate Return and Back Projection modules, are targeted to an AMD FirePro M5100 GPGPU. The resulting speedup is 2.5X over multi-threaded C++ implementations of those algorithms running on an 8-core Intel I7 2.8GHz processor, 5X over singlethreaded C++ implementations, and 24X over native MATLAB implementations, on average.

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL – Javier Alejandro Varela and Norbert Wehn (University of Kaiserslautern)

In this work, we exploit OpenCL to efficiently map the nested simulation of complex portfolios with multiple algorithms on heterogeneous computing systems. Code portability and customizations allow us to profile the kernels on different accelerating platforms, such as CPU, Intel’s Xeon Phi and GPU. The combination of OpenCL, a new bit-accurate algorithmic optimization and the extension of an existing numerical interpolation scheme allows us to achieve 1000x speedup compared to the state-of-the-art approach. Our system design minimizes costly host-device transfers and global memory, enabling complex portfolios to be easily scaled.

A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking – Leonardo Solis Vasquez and Andreas Koch (Technische Universität Darmstadt)

This work presents an OpenCL implementation of AutoDock, and a corresponding performance evaluation on two different platforms based on multi-core CPU and GPU accelerators. It shows that OpenCL allows highly efficient docking simulations, achieving speedups of ∼4x and ∼56x over the original serial AutoDock version, as well as energy efficiency gains of ∼2x and ∼6x. respectively. To the best of our knowledge, this work is the first one also considering the energy efficiency of molecular docking programs.

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations – Nuno Fachada and Agostinho Rosa (Instituto Superior Técnico, Portugal)

In this paper we evaluate the feasibility of using CPU-oriented OpenCL for high-performance simulations of agent-based models. We compare a CPU-oriented OpenCL implementation of a reference ABM against a parallel Java version of the same model. We show that there are considerable gains in using CPU-based OpenCL for developing and implementing ABMs, with speedups up to 10x over the parallel Java version on a 10-core hyper-threaded CPU.

Enabling FPGAs as a True Device in the OpenCL Standard – Vincent Mirian and Paul Chow (University Of Toronto)

As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.

Applying Models of Computation to OpenCL Pipes for FPGA Computing – Nachiket Kapre and Hiren Patel (University of Waterloo)

We propose imposing a communication discipline inspired from models of computation (e.g.Ptolemy) such as SDF (synchronous dataflow), bulk synchronous (BSP), or Discrete Event (DE). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We hope to facilitate a preliminary analysis and evaluation of supporting these patterns in OpenCL and quantifying associated FPGA implementation costs.

Accelerating Applications at Cloud Scale using FPGAs – Sarah Siripoke, Fernando Martinez Vallina and Spenser Gilliland (Xilinx)

The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen below. The inherent ability of an FPGA to reconfigure and be workload optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started on a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.

Creating High Performance Applications with Intel’s FPGA OpenCL SDK – Andrew Ling, Utku Aydonat, Davor Capalija, Shane O’Connell and Gordon Chiu (Intel)

After decades of research, High-Level Synthesis has finally caught on as a mainstream design technique for FPGAs. However, achieving performance results that are comparable to designing at a hardware description level still remains a challenge. In this talk, we illustrate how we achieve world class performance results on HPC applications by using OpenCL. Specifically, we show how we achieve 1Tflop of performance on a matrix multiply and over 1.3Tflops on a CNN application, run on Intel’s 20nm Arria 10 FPGA device. Finally, we will describe spatial coding techniques that lead to efficient structures, such as systolic-arrays, to ensure that the FPGA runs efficiently.

Symphony – Task Scheduling and Memory Management in Heterogeneous Computing – Amit Jindal and Wenjia Ruan (Qualcomm Technologies)

Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. There are several programming models and tools that exist targeting partitioning of workload and accessibility of data between CPU and GPU. We have developed and deployed Symphony SDK – a framework that makes workload partitioning, scheduling and memory management ‘simple’ for developers. In this talk, we will introduce Symphony architecture, elaborate how existing OpenCL kernels can be reused with heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.

CUDA-on-CL: A compiler and runtime for running modern CUDA c++11 applications on OpenCL 1.2 devices – Hugh Perkins (ASAPP)

Cuda-on-cl addresses the problem of creating and maintaining OpenCL forks by leaving the reference implementation entirely in NVIDIA CUDA, and writing both a compiler and a runtime component, so that any CUDA c++11 application can in theory be compiled and run directly on any OpenCL 1.2 device. We use Tensorflow framework as a case-study, and demonstrate the ability to run Tensorflow and Eigen kernels directly, with no modification to the original CUDA source-code. Performance studies are also undertaken, and show that the cuda-on-cl program runs at about 25% of the original CUDA-compiled version.

OpenCL in Scientific High Performance Computing—The Good, the Bad, and the Ugly – Matthias Noack (Zuse Institute Berlin)

We present experiences with utilising OpenCL alongside C ++ , MPI, and CMake in two real-world scientific codes. Our targets are a Cray XC40 supercomputer with multi- and many-core (Xeon Phi) CPUs, as well as multiple smaller systems with Nvidia and AMD GPUs. We shed light on practical issues arising in such a scenario, like the interaction between OpenCL and MPI, discuss solutions, and point out current limitations of OpenCL in the domain of scientific HPC from an application developer’s and user’s point of view.

Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices – Andrew Richards, Mehdi Goli and Luke Iwanski (Codeplay)

Codeplay has been working with Google to add SYCL back-end support in TensorFlow, one of the most popular machine learning frameworks, enabling developers to use OpenCL devices with their machine learning applications. SYCL provides an abstraction layer that simplifies parallel development, giving developers access to the computing power of OpenCL devices and reducing the amount of code required. Andrew Richards will talk about how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.

Analyzing and improving performance portability of OpenCL applications via auto-tuning – James Price and Simon McIntosh-Smith (University of Bristol)

In this talk, we present an approach for analyzing performance portability that exploits that black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.

Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms – Biju George and Ben Ashbaugh (Intel)

In this presentation we focus on the application of the wavefront pattern to design efficient GPGPU implementations of video encoding algorithms using OpenCL kernels. We present our experiences in implementing and evaluating four solutions of WPP for inter and intra estimation for AVC on GPUs. We explain the reasoning behind each solution and present the results of our analysis.

Challenges and Opportunities in Native GPU Debugging with OpenCL – Uri Levy (Intel)

In this technical session we’ll present the open architectural design of the debugger and how it fits into the OpenCL JIT compilation flow and the underlying compute technology of the HW with focus on Intel processor graphics. We’ll demonstrate a show case on how to natively work with the debugger to solve functional bugs, as-well-as low-level debugging techniques on SIMD thread level which help to solve complex issues such as misaligned or out of range accesses to local\global memory, stack overflows, Illegal instructions, etc. Finally, we’ll cover the challenges in debugging

Modeling Explicit SIMD Programming with Subgroup Functions – Biju George and Ben Ashbaugh (Intel)

In this presentation, based on our experience in developing publicly released vendor extensions based on subgroups, we explain the advantages of the “explicit SIMD” programming paradigm using OpenCL subgroup and how the subgroups framework can be leveraged to: (1) Model features for performance in OpenCL that are commonly available in programming languages or interfaces based on an “explicit SIMD” programming paradigm such as the AVX intrinsics supported in GCC; and to (2) Model features to expose functionality available in GPU accelerator units that are more conveniently and efficiently exposed using a block API.

GPGPU-day materials – teaser

Posted by Vincent Hindriksen on 2 July 2012

Just a quick teaser. More materials (photos, sheets, videos) are coming soon.

Don’t forget to subscribe to the mailing-list of Platform Parallel Netherlands to hear about more events around parallel programming in the Netherlands.

Click on the icon at bottom-right to watch the video full-screen.

If you have made photos during the day, please send them.

Music by Professor Kliq.

Below is the short version with photos only

Our training concepts for GPGPU

Posted by Vincent Hindriksen on 14 May 2010

It’s almost time for more nerdy stuff we have in the pipe-line, but we’ll keep for some superficial blah for a moment. We concentrate on training (and consultancy). There is a lot of discussion here about “how to design training-programs about difficult concepts for technical people”, or better: “how to learn yourself something difficult”. At the end of this blog, we’ll show you a list how to learn OpenCL yourself, but before that we want to share how we look at training you.

Disclaimer: this blog item is positive about our own training-program for obvious reasons. We are aware people don’t want (too much) spam, so we’ll keep this kind of blogs to the minimum. If you want to tell the world that your training-program is better, first mail us for our international partner-program. If you want the training, come back on 14 June or mail us.

OpenCL and CUDA are not the easiest programming languages due to incomparable concepts in software-land (You can claim Java is “slightly” different). Can the usual ways of training give you the insights and facts you need to know?

Current programs

Most training-programs are vendor-supported. People who follow us on Twitter, know we are not the best supporters of vendor locked products. So lets get a list of a typical vendor-supported training-programs, I would like to talk about:

They have to be difficult, so the student accomplishes something.
The exam are expensive to demotivate trial-on-error-students.
You get an official certificate, which guarantees a income-raise.
Books and trainings focus on facts you must learn.
It’s very clear what you must learn and what you can skip.

So in short, you chose you wanted to know the material and put a lot of effort in it. You get back more than just the knowledge.

Say you get the opposite:

They are easy to accomplish.
Exams is an assignment which you only need to finish. You can try endless times.
You don’t get a certificate, but you might get feedback and homework for self-study.
You get a list of facts you must learn; the concepts are explained to support this.
You are free to pick which subject you like.

That sucks! You cannot brag about your accomplishments and after the training you still cannot do anything with it; it will probably take years to actually finish it. So actually it’s very clear why the programs are like this, or can we learn from this opposing list? Just like with everything else, you never have to just copy what’s available but pick out the good parts.

Learning GPGPU

If you want to learn GPGPU, you have to learn (in short) shader-concepts, OpenCL, CUDA and GPU-architectures. What would be needed to learn it, according to us?

A specified list of subjects you can check when understood.
An insight story of the underlying concepts to better understand the way stream-computing works. Concepts are the base of everything, to actually make it sound simple.
Very practical know-how. Such as how to integrate stream-computing-code into your current software.
A difficult assignment that gets you in touch with everything you learned. The training gave you the instruments you need to accomplish this step.

So there’s no exam and no certificate; these are secondary reasons for finishing the course. The focus should be getting the brain wrapped around the concepts and getting experience. As the disclaimer warned you, our training-program has a high focus on getting you up-and-running in one day. And you do get a certificate after your assignment gets approved, so bragging is easy.

If you want to learn stream-computing and you won’t use our training-program, what then?

Read our blog (RSS) and follow us on Twitter.
Make yourself a list of subjects you think you have to learn. Thinking before doing helps in getting a focus.
Buy a book. There are many.
Play around with existing examples. Try to break it. Example: what happens if the kernel uses more and more local/private memory.
Update the list of subjects; the more extensive, the better. Prioritise.
Find yourself an assignment. For example: try to compress or decompress a large JPG using OpenCL. If you succeed, get yourself a harder assignment. Do you want to be good or the best?

If you know OpenCL, CUDA is easy to learn! We will have some blogs which support your quest on learning OpenCL, so just start to dig in today and see you next time.

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

Posted by Vincent Hindriksen on 3 April 2013 with 2 Comments

WebCL is a great technique to have compute-power in the browser. After WebGL which gives high-end graphics in the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL, is that it makes it possible to lift the standard base from the OS to the browser. If you remember the trial of Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser for working well. Now it is the other way around, but it can be any OS. This is because the push doesn’t come from below, but from above.

Last year two guys from Lyon (South-France) got quite some attention, as they wrote a WebCL-plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL” →

Install OpenCL on Debian, Ubuntu and Mint orderly

Posted by Vincent Hindriksen on 24 June 2011 with 22 Comments

If you read different types of manuals how to compile OpenCL software on Linux, then you can get dizzy of all the LD-parameters. Also when installing the SDKs from AMD, Intel and NVIDIA, you get different locations for libraries, header-files, etc. Now GPGPU is old-fashioned and we go for heterogeneous programming, the chances get higher you will have more SDKs on your machine. Also if you want to keep it the way you have, reading this article gives you insight in what the design is after it all. Note that Intel’s drivers don’t give OpenCL support for their GPUs, but CPUs only.

As my mother said when I was young: “actually cleaning up is very simple”. I’m busy creating a PPA for this, but that will take some more time.

First the idea. For developers OpenCL consists of 5 parts:

GPUs-only: drivers with OpenCL-support
The OpenCL header-files
Vendor specific libraries (needed when using -lOpenCL)
libOpenCL.so -> a special driver
An installable client driver

Currently GPU-drivers are always OpenCL-capable, so you only need to secure 4 steps. These are discussed below.

Please note that in certain 64-bit distributions there is not lib64, but only ‘lib’ and ‘lib32’. If that is the case for you, you can use the commands that are mentioned with 32-bit.

Continue reading “Install OpenCL on Debian, Ubuntu and Mint orderly” →

Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux

Posted by Vincent Hindriksen on 16 October 2016 with 2 Comments

quartus To temporarily increase capacity we put Quartus 16.0.2 on an Ubuntu server, which did not go smooth – but at least smoother than upgrading packages to required versions on RedHat/CentOS. While the download says “Linux” and you’re expecting support for multiple Linux breeds, there is only official support for Redhat 6.5 (and CentOS).

Luckily it was very possible to have a stable installation of Quartus on Ubuntu. As information on this subject was squattered around the net and even incomplete, we decided to share our howto in this blogpost. These tips probably also work for other modern Linux-based operating systems like Fedora, Suse, Arch, etc, as most problems are due to new features and more up-to-date libraries than are provided in RedHat/CentOS.

Note1 : we did not install the FPGA on the Ubuntu-machine and neither fully researched potential problems for doing so – installing the FPGA on an Ubuntu machine is at your own risk. Have your board maker follow this tutorial to test their libraries on Ubuntu.

Note 2: we tested on Ubuntu 14.04. No guarantees if it all works on other version. Let us know in the comments if it works on other versions too. Continue reading “Install (Intel) Altera Quartus 16.0.2 OpenCL on Ubuntu 14.04 Linux” →

Tutorials

During our courses/trainings we will teach you the best of what you can find here.

We try to keep the following information as complete as possible, so please contact us if something is missing.

Learning OpenCL

[list1]

Hands on OpenCL, by Simon McIntosh-Smith and Tom Deakin from the University of Bristol in the UK. It currently is the most up-to-date tutorial on OpenCL, including code for lab-sessions.
Bruno Jurkovski wrote a clear quickstart.
AMD introduction to OpenCL.
MacResearch playlist on Youtube. Code of episode 3 and 6. Zip of PDFs.
CMSoft’s complete OpenCL tutorial.
The Code Project has a series on OpenCL, episodes 1, 2, 3, 4, 5, 6, 7 and 8. By Rob Farber.
Dr.Dobb’s has a series called “CUDA, Supercomputing for the Masses”. It is CUDA-oriented, but you can learn a lot about GPGPU in general and on NVIDIA specific optimisations. Login to their site and then you can access parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20 and 21. Registration is free.
AMD’s university program. This is loads of information!
NVIDIA’s OpenCL pages provide all you need to program on NVIDIA.
Enjalot’s adventures in OpenCL giving the basics in OpenCL and pyOpenCL.
StreamHPC’s basic concepts with various tips&tricks on OpenCL.
KISTI Supercomputing Learning Centre has a beginners course for OpenCL. Material including PDFs and code is available on SF.net.
OpenCL cookbook by Dhruba Bandopadhyay.
Anteru’s introduction to OpenCL, part #1, #2 and #3.

[/list1]

OpenCL Optimisation guides

Intel Xeon and XeonPhi
NVidia (CUDA, but same applies to OpenCL)
AMD GPUs and CPUs
ARM MALI T600
Altera FPGAs

Not available (yet):

Imagination PowerVR
Qualcomm Adreno
Xilinx FPGAs

[infobox type=”information”][widgets_on_pages id=Trainings][/infobox]

University courses

OpenCL-based GPU-programming courses

[list1]

Marcus Bannermann university course made for the university of Erlangen, Germany.
Advanced Parallel Programming is a course on parallel programming by professor John Cavazos of University of Delaware.
Programming for Performance is a course on parallel programming by Jonathan Eyolfson of University of Waterloo.
Manchester OpenCL tutorial wiki. Materials from previous courses and more.
University of Innsbruck GPU-programming using OpenCL by Juan J. Durillo PhD.
University of Waterloo Programming for Performance. Lecture notes and assignments.

Architectures

Berkely University Computer Architecture and Engineering

[/list1]

Videos

[list1]

AMD’s OpenCL introduction. Takes about an hour in total, slides are provided.
Harvard Lectures on GPGPU. One hour each.

[/list1]

Cases/Studies

[list1]

AMD optimisation case study: Diagonal Sparse Matrix Vector Multiplication .
AMD optimisation case study: Simple reductions.

[/list1]

WebCL

WebCL is a new standard-to-be for OpenCL in the browser. Currently there are a few implementations, while Khronos is working on an official standard. WebCL is available on Firefox for Linux32, Windows32 and Windows64 by Nokia. Also available for Safari on OSX by Samsung. A Node.js-implementation is made by Motorola. Examples made for another implementation will probably not work.

Tutorials:

[list1]

[/list1]

Check Khronos’ WebCL page for more resources.

C/C++

Basic knowledge of C is needed to understand how to write kernels. Also many tutorials are in C++.

[list1]

A little C primer.
C++ for Java-programmers.
C for Java-programmers.
http://www.stanford.edu/class/cs101/ for if you have never programmed – don’t think about GPU-programming yet.

[/list1]

Basic OpenGL

Getting a grasp of OpenGL has advantages. Techniques for faster memory-operations in OpenGL have equivalents in OpenCL, giving reason to read on this subject.

[list1]

Discussion about OpenGL Shaders.
OpenGL Samples from getting an empty screen till the famous teapot.

[/list1]

What does it mean to work at Stream HPC?

High-performance computing on many-core environments and low-level optimizations are very important concepts in large scientific projects nowadays. Stream HPC is one of the market’s more prominent companies active in mostly North America and Europe.

As we often get asked how it is to work at the company, we’d like to give you a little peak into our kitchen.

What we find important

We’re a close-knitted group of motivated individuals, who get a kick out of performance optimizations and are experienced in programming GPUs. Every day we have discussions on performance. Finding out why certain hardware behaves in a certain manner when a specific computing load is applied. For instance why certain code is not as fast as theoretically promised, and then finding the bottlenecks by analyzing the device and finding solutions for removing those bottlenecks. As a team we make better code than we could ever do as individuals.

Quality is important for everybody on the team, which is a whole step further than “just getting the job done”. This has a simple reason: we cannot speed up code that is of low quality. This is also why we don’t use many tools that automatically do magic, as these often miss many significant improvements and don’t improve the code quality. We don’t expect AI to dully replace us soon, but once it’s possible we’ll probably be part of that project ourselves.

Computer science in general is evolving at a fast rate and therefore learning, is an important part of the job. Reading papers, finding new articles, discussing future hardware architectures and how they would affect performance, is very important. With every project, we have to gather as much data as possible using scientific publications, interesting blog posts and code repositories in order to be on the bleeding edge of technology for our project. Why use a hammer to speedup code, when you don’t know which hammer to use best?

Our team-culture

Personality of the team

We are all kind, focused on structured problem-solving, communicative about wins and struggles, focus on group-wins above personal gains, and all gamers. To have good discussions and have good disagreements, we seek people who are also open-minded. And we share and appreciate humor! If you want to know more about our culture, click here.

Tailored work environment

As we have all kinds of people in the team, who need different ways of recharging. One needs a walk, while somebody else needs a quiet place. We help each other on more than just work-related obstacles. We think that a broad approach on differences makes us understand how to progress to the next professional level the quickest. This is inclusivity-in-action, we’re proud of. Ow, and we have noise-canceling headphones.

Creating a safe place to speak up is critical for us. This helps us learn new skills and do things we never did before. And this approach helps well with all those who don’t have Asperger or ADHD at all, but need to progress without first fitting a certain norm.

Projects we do

Today we work on plenty of exciting projects and no year has been the same. Below is a page with projects we’re proud of.

https://streamhpc.com/about-us/work-we-do

Style of project handling

We use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, thus we have questioned and optimized everything that was presented as “good practice”. Most notable are the management and documentation style.

Saying an engineer hates documentation and being managed because he/she is lazy is simply false. It’s because most management and documentation styles are far from optimal.

Pull-style management is where the tasks are written down by the team, based on the proposal. All these tasks are put into the task-list of the project, and then each team member picks the tasks that are a good fit. The last resort for the tasks that stay behind and have a deadline (being pushed) was only needed in a few cases.

All code (MR) is checked by one or two colleague, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual and one can get into the task well-prepared. The goal is not to get the job finished, but not having written the code where a future bug has been found.

All types of code can contain comments and Doxygen can create documentation automatically, so there is no need to copy functions into a Word-document. Log-style documentation was introduced, as git history and Doxygen don’t answer why a certain decision has been made. By writing down a logbook, a new member of the team can just read these remarks and fully understand why the architecture is how it is and what the limits are. We’ll discuss this in more detail later.

These type of solutions describe how we work and differ from a corporate environment: no-nonsense and effective.

Where do we fit in your career?

Each job should get you forward, when done at the right moment. Question is when Stream HPC is the right choice.

As you might have seen, we don’t require a certain education. This is because a career is a sum, and an academic study can be replaced by various types of experience. The optimum is often both a study and the right type of experience. This means that for us, a senior can be a student and a junior can have been 20 years in the field.

So what is the “right type of experience”? Let’s talk about those who only have job-experience with CPUs. First, being hooked by performance, as primary interest, would be the first reason to get into HPC and GPGPU. Second, being good at C and C++ programming. Third, knowing algorithms and mathematics really well and can quickly apply them. Fourth, being a curious and quick learner, which shows by you having experimented with GPUs. This is also exactly what we test and check during the application procedure.

During your job you’ll learn anything around GPU-programming with a balance between theory and practice. Preparation is key in how we work, and this you will develop in many circumstances.

Those who left Stream HPC have gotten very senior roles, from team lead to CTO. With Stream HPC growing in size, the growth opportunities within the company are also increasing.

Make the decision for a new job

Would you like to work for a rapidly growing company of motivated GPU professionals in Europe? We seek motivated, curious, friendly people. If you liked what you read here, do check our open job positions.

Avoiding false dependencies in only two steps

Posted by Vincent Hindriksen on 29 September 2012

Let’s approach the concept of programming through looking at the brain, the code and the computer.

The idea of a program lives in the brain of a programmer. The way to get the program to the computer is using a system process called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When over time the difference between the brain-version and the computer-version grows, then we go for a maintenance phase (although this is still this mostly from brain to computer).

When the coding-language or important coding-paradigms change, something completely different happens. In such case the program in the brain is updated or altered. Humans are not good at that, or at least not many textbooks discuss how to change from one model to another.

In this article I want to discuss one of these new coding-paradigm: dependencies in parallel software.
Continue reading “Avoiding false dependencies in only two steps” →

How to introduce HPC in your enterprise

Posted by Vincent Hindriksen on 10 December 2014

eviljaymz-spare-time — Spare time in IT – © jaymz.eu

The past ten years we have been happy when we got back home from the office. Our home-computer is simply faster, has more software, more memory and does not take over 10 minutes to boot. Office-computers can be that slow, because 90% of the work is typing documents anyway. Meanwhile the office-servers are mostly used for the intranet and backups only. It’s the way of life and it seems we have to accept it.

But what if you have a daily batch that takes 1 hour to run and 10 people need to wait for the results to continue their tasks? What if you simply need a bigger server to service your colleagues faster? Then Office-HPC can be the answer, the type of High Performance Computing that is affordable and in reach for most companies with more than 50 employees.

Below you’ll find out what you should do, in a nutshell.

Phase 0: Get familiar with parallel and GPU-computing, and convince your boss

This will take one or two weeks only, as it’s more about understanding the basics.

Understand where it’s all about and what’s important. We offer trainings, but you can also look around in the “knowledge base” in the menu above for lots of free advice. It’s very important and should be done before anything else. Even though you end up with CUDA, learn the basics of OpenCL first. Why? Because after CUDA there is only one answer: using Nvidia hardware. Please delay this decision to later, before you end up with the wrong solution.

How to get your boss to invest in all this? I won’t lie about it: it’s a big investment. Luckily the return-on-investment is very good, even when only 10 people are using the software in the company. If the waiting period per person per day is reduced with 20 minutes per day, then it’s easy to see that it pays back quickly: that’s 80 hours per person per year. Based on 10 people that is already €20K per year. StreamHPC has sped up software to take hours less time to process the daily data – therefore many of our clients could earn back the investment within a year, easily.

Phase 1: Know what device you want to use

Quite often I get customers who have bought an expensive Tesla, FirePro or XeonPhi and then ask me to speed up their software. Often I get questions “how do I speed up this algorithm on this device?”, while the question should be like “How do I speed up this algorithm?”. It takes some time to find out what device fits the algorithm best.

There is too much to discuss in this phase, so I keep it to a short Q&A. Please ask us for advice, as this phase is very important! We prefer to help people for free, than to read about failed “HPC in the office” projects (and giving others the idea that the technology is not ready yet).

Q: What programming language do I use?

Let’s start with the short answer. Is everything to be used within your office only, for ever? Use any language you want: CUDA, OpenCL or one of the many others. If you want the software to run on more devices, use OpenCL or OpenGL shaders. For example when developing with several partners, you cannot stick to CUDA and should use OpenCL – else you force others to make certain investments. But if you have some domain specific compute-engine where you will only share the API in the cloud, you can use CUDA without problems.

Part of the long answer is that it is entangled with the algorithm you want to use. Please take good care of this, and make your decision based on good research – not based on what people have told you without discussing your code first.

Q: FPGAs? Why would I use those?

True, they’re more expensive, but they use much less power (20-30 Watt TDP). They’re famous for low-latency computations. If you already have OpenCL-software, it ports quite easily to the FPGA – therefore I like the combination with AMD FirePro (good OpenCL support) and Altera Stratix V.

Xilin recently also started to support OpenCL on their devices. They have the same reason as Altera: to make development time for FPGA code shorter.

Q: Why do CPUs still exist?

Because they perform pretty well on very irregular algorithms. The latest Xeon CPUs with 16 cores outperform GPUs when code-branch prediction is used heavily. And by using OpenCL you can get more performance than when using OpenMP, plus you can port between devices much easier.

Q: I heard I should not use gaming GPUs. Why not?

A: Professional accelerators come with support and tuned libraries, which explains part of the higher price. So even if gaming-GPUs suffice, you need the support before you get to a cluster – the free support is mostly community-based and only gives answers to the problems everybody has. Also libraries are often better tuned for professional cards. See it as this: gaming-GPUs come with free games, professional compute-GPUs come with free support and libraries.

Q: I can’t have passively cooled server-GPUs in my desktop. What now?

Intel: Go for the XeonPhi’s which end with an “A” (= active cooled)
NVIDIA: For the newly announced K80, there will not be an active cooled version – so take the active cooled K40.
AMD: For the S9150 get a W9100.
Altera: Low-power, so you can use the same device. Do ask your supplier specifically if it applies to the FPGA you have in mind.

Phase 2: Have your office computer upgraded

As the goal is to see performance in a cluster, then it’s better to have at least two accelerators in your computer. This is a big investment, but it’s also a good investment. It’s the first step towards getting HPC in your office, and better do it well. Make sure you have at least the memory for your CPU as you have on your accelerator, if you want to use all the GPU’s memory. The S9150 has 16GB of memory, so you need 32GB MB to support two cards.

If you make use of an external software development company, you also need to have a good machine to test out the software and to understand the code that will be rolled out in your company. Control and understanding of the code is very important when working with consultants!

In case you did not get through phase 1 completely, better to test with one Accelerator first. If you don’t need to have something like OpenGL/OpenCL-interaction, make sure you use a third GPU for the video-output, as usage can influence the GPU performance.

Program your software using MPI for connecting the two accelerators and be in full control of what is blocking, to be prepared for the cluster.

Phase 3: Roll software out in a small group

At this phase it’s time to offer the service to a selected group. Say that you have chosen to offer your compute solution via an Excel-plugin, which communicates with the software via an API. Add new users one at a time – make sure (parts of) the results are tested! From here it’s software-development as we know it, and the most unexpected bugs come out of the test-group.

If you get good results, your colleagues will have some accelerators by now too. If you did phases 0 and 1 well, you probably will get good results anyway. The moment you have setup the MPI-environment on multiple desktops, you have just setup your minimal test-street. Very important for later, as many enterprises lack a test-street – then it’s better to have it partially shared with your development-environment. I’m pretty sure I get comments on this, but I would really like to have more companies to do larger scale tests before the production step.

Phase 4: Get a cluster (or cloud service)

If your algorithm is not CPU-bound, then it’s best to have as many GPUs per CPU as possible. Else you need to keep it to one or two. We can give you advice on this in phase 1 already, so you know where to prepare for. Then the most important step comes: calculate how much hardware you need to support the needs of your enterprise. It is possible that you only need one node of 8 GPUs to support even thousands of users.

Say the algorithm is not CPU-bound, then it’s best to put as many GPUs per node. Personally I like ASUS servers most, as they are very open to all accelerators, unlike others who only offer accelerators from “selected partners”. At SC14 they introduced the ESC8000 E3, which holds 8 accelerators via PCIe3 x16 buses. There are more options available, but they only offer systems that don’t mention support for all vendors – my experience is that you get worse support if you do something special.

For Altera-only nodes, you should check for complete different server cases, as cooling requirements are different. For Xeon-only nodes, you can find solutions with 4 CPU-sockets.

If you are allowed to transport company-data outside the local network and can handle the data-transports over the internet, then a cloud-based service might also be a choice. Feel free to ask us what the options are nowadays.

You’re done

If the users are happy, then probably more software needs to be ported to the accelerators now. So good luck and have fun!

OpenCL Books

[infobox type=”information”]

Want a book written be us?

Format: PDF
Digital: 150-200 pages
Price: TBD
Author: The StreamHPC team
OpenCL-version: 2.0

[/infobox]

Below are the books that are available as downloadable PDF or as a book. Please contact us if a printed book is missing. Note that I tend to be critical to books, and not to be overwhelming positive and talk in superlatives. Most books target C and C++ developers, so be sure you learn the basics of C or C++ first before learning OpenCL. Most important are bit-shifts, pointers and structs, but also thinking more in hardware than in getting-things-done is needed to get the full potential out of OpenCL. While you get that book on C, also get a book on computer-architecture to understand the concepts of bandwidth to its fullest. Then you are ready for one of the pearls below. Happy reading!

The books are ordered descending by published date, except the first.

“The OpenCL specifications” by the Khronos Group

2.0 – current version

Format: PDF
File Size: 2.2MB
Digital: 281 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 13 July 2013 (version 2.0, revision 11)
OpenCL-version: 2.0
Homepage: http://www.khronos.org/registry/cl/ – http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

As a specifications-document, you cannot expect a nice piece of prose, but most of the knowledge you need to know is in it. There are certainly some gaps (especially in clear explanation), but every version is getting better. When studying other sources, always have this document with you as a reference. I printed it as two pages per side (A4).

Read chapters 1 to 3, and leave the rest as a reference. Other books explain the long, long lists of language specifications in a nicer form. Also the most you’ll better learn by doing.

1.2, 1.1, 1.0 – previous versions

Format: PDF
File Size: 3.3MB
Digital: 380 pages
Price: Free
Publisher: Khronos Group
Author: Aaftab Munshi (Editor)
Published Date: 14 November 2012 (version 1.2, revision 19)
OpenCL-version: 1.2
Homepage: http://www.khronos.org/registry/cl/ –

As many software is still written in a previous version, it is good to know the differences. Differences between 1.1 and 1.2 are written here.

OpenCL Programming By Example

Two version available, the 1.0 version and the 1.2 version. To start with the 1.2:

Format: eBook/pBook
Pages: 304
Price: €4,50 (e), €50,- (p+e)
Publisher: PACKT
Authors: Ravishekhar Banger (AMD), Koushik Bhattacharyya (AMD)
Published Date: December 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-programming-by-example/book

“The OpenCL Programming Book” – Fixstars

Two version available, the 1.0 version and the 1.2 version. To start with the 1.2:

Format: eBook
Pages: 325
Price: USD 19.50
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Aki Asahara, Satoshi Miki, Jeongdo Son and Satoshi Miki. Satoru Tagawa (translator)
Published Date: January 2012
OpenCL-version: 1.2
Homepage: http://www.fixstars.com/en/opencl/book/

Format: eBook
Pages: 246
Price: free
Publisher: Fixstars Corporation
Authors: Ryoji Tsuchiyama, Takashi Nakamura, Takuro Iizuka, Akihiro Asahara, Satoshi Miki
Published Date: 31 March 2010
OpenCL-version: 1.0
Homepage: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/

1.0-version: It seems to be translated from Japanese to English, but except some small typos and spelling errors the book is very easy to read. The book explains the chapters you could skip in Khronos’ specifications-document, but certainly is not complete since it discusses OpenCL 1.0 and has a focus on the basics. The parts where that build up a program step-by-step is a bit annoying to read, because they repeat the whole program again while only a few lines have changed. The book would be more like 180-200 pages if written more compact.

1.2-version: Thicker, more up to date and a promise there are less translation-errors.

Heterogeneous Computing with OpenCL, second edition

Format: pBook
Pages: 400 (approx.)
Price: USD 69.95
Publisher: Morgan Kaufmann
Authors: Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa
Published Date: Sept 2011
OpenCL-version: 1.2
Homepage: http://www.elsevierdirect.com/product.jsp?isbn=9780123877666

This is where we all chose OpenCL for: hybrid processors. And this book dives into that world completely, so we actually learn a lot new stuff about the advantages of having a GPU on your lap.

The new edition upgrades the book to version 1.2, but nothing much new has been added. So if you have the first edition, there is no need to buy the second edition.

OpenCL in Action

Format: eBook + pBook
Pages: 475
Price: USD 47.99 (e). USD 59.99 (p+e)
Publisher: Addison-Wesley Professional
Authors: Matthew Scarpino
Published Date: non-final version updated regularly, target November 2011
OpenCL-version: 1.1
Homepage: http://www.manning.com/scarpino2/

Matthew Scarpino also wrote SWT/JFace In Action and Programming the Cell-processor, has a profession in Linux and has much experience in IT. The book seems to target an audience who want a more practical guide to learn OpenCL. He runs a blog at http://www.openclblog.com/

It is currently my favourite book and this one is a must-have for everybody interested or working with OpenCL.

OpenCL Programming Guide

Format: PDF and/or print
Pages: 648
Price: USD 35.19 (e), USD 43.99 (p), USD 59.39 (p+e)
Publisher: Addison-Wesley Professional
Authors: Aaftab Munshi (Apple, Khronos Group), Benedict Gaster (AMD), Timothy G. Mattson, Dan Ginsburg
Published Date: August 2011
OpenCL-version: 1.1
Homepage: http://my.safaribooksonline.com/9780132488006 and http://www.openclprogrammingguide.com/

Aaftab Munshi is also responsible for the OpenCL-specifications, so he probably knows where he’s talking about.

The 648 pages it is quite bigger than the targeted 480. Currently this is a very good replacement for Fixstars’ book. Disadvantage is that sending the printed book overseas (not USA/Canada) is much too expensive and people from the Eurasian continent, Africa and Latin America should just print it locally – looking into that to find better options.

OpenCL Parallel Programming Development Cookbook

Format: pBook + eBook
Pages: 303 pages
Price: USD 26.39 (e) / 54.99 (p+e)
Publisher: PACT publishing
Author: Raymond Tay
Published Date: August 2013
OpenCL-version: 1.2
Homepage: http://www.packtpub.com/opencl-parallel-programming-development-cookbook/book

Introductory book for OpenCL beginners. Examples: histogram, Sobel edge detection, Matrix Multiplication, Sparse Matrix Vector Multiplication, Bitonic sort, Radix sort, n-body.

Programming Massively Parallel Processors

Format: pBook
Pages: 258 pages
Price: USD 46.40
Publisher: Morgan Kaufmann
Authors: David B. Kirk (NVIDIA) and Wen-mei W. Hwu (University of Illinois)
Published Date: 28 January 2010
OpenCL-version: 1.1?
Homepage: http://blogs.nvidia.com/ntersect/2010/01/worlds-first-textbook-on-programming-massively-parallel-processors.html

The book claims to discuss both OpenCL and CUDA, but actually there is just one chapter on OpenCL and the focus is strong towards NVIDIA hardware. It is a nice book for people who need to learn to program CUDA-only software/hardware and don’t want a book that’s too hard to understand. There are assignments at the end of each chapter and important subjects are explained in detail, so you don’t need to have a hard time with those assignments.

It is not good for people interested in OpenCL-compliant architectures from AMD, ARM and IBM besides NVIDIA’s. It is one of the best resources to understand NVIDIA architectures from a view of a GPGPU-programmer. The second edition adds more chapters on for example MPI and OpenACC. It is also less negative about OpenCL than in the first edition.

The 12 latest Twitter Poll Results of 2018

Posted by Vincent Hindriksen on 10 November 2018

Via our Twitter channel we have various polls. Not always have we shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. The first half of the year there were no polls, in case you wanted to know.

As inclusive polls are not focused (and thus difficult to answer), most polls are incomplete by design. Still insights can be given. Or comments given.

Below’s polls have given us insight and we hope they give you insights too how our industry is developing. It’s sorted on date from oldest first.

It was very interesting that the percentage of votes per choice did not change much after 30 votes. Even when it was retweeted by a large account, opinions had the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading →

OpenCL – the battle, part III

Posted by Vincent Hindriksen on 18 October 2010 with 2 Comments

The first two parts described hardware-companies and operating systems, programming languages and software-companies, written about half a year ago. Now we focus on what has driven NVIDIA and ATI/AMD for decades: games.

Disclaimer: this is an opinion-piece on the current market. We are strong supporters of OpenCL and all companies which support it too. Since our advise on specific hardware in a consult will be based on specific demands on the customer, we could advise differently than would be expected on the below article.

Games

Computer games are cool; merely because you choose from so many different kinds. While Tetris will live forever, the latest games also have something to add: realistic physics simulation. And that’s what’s done by GPUs now. Nintendo has shown us that gameplay and good interaction are far more important than video-quality. The wow-factor for photo-realistic real-time rendering is not as it was years ago.
You might know the basics for falling objects: F = m*g (Force = Mass times Gravity-acceleration), and action = – reaction. If you drop some boxes, you can predict falling speed, interaction, rotation and possible change of centre of gravity from a still image as a human being. A computer has to do a lot more to detect collision, but the idea is very doable on a fast CPU. A very well-known open source library for these purposes is Bullet Physics. The nice thing comes, when there is more than just a few boxes, but thousands of them. Or when you walk through water or under a waterfall, see fire and smoke, break wood but bend metal, etc. The accelerometer of the iPod was a game-changer too in the demand for more realism in graphics. For an example of a “physics puzzle game” not using GPGPU see World of Goo (with free demo) – for the rest we talk more about high-end games. Of current game-ready systems PCs (Apple, Linux and Windows) have OpenCL support, Sony PlayStation 3 is now somewhat vague and the Xbox 360 has none.

The picture is from Crysis 3, which does not use OpenCL, as we know it.

Continue reading “OpenCL – the battle, part III” →

Machine Learning

Machine learning is increasingly employed in computing tasks where it is infeasible to design an explicit algorithm due to the high dimensionality of the input space and the overall complexity of the problem. Algorithms for machine learning build up a model from example inputs and continuously refine this model based on some form of feedback over many training steps. Learning is often either supervised or unsupervised, and in both cases is very time-consuming. Using our expertise in parallel programming, we can speed up your machine learning algorithms to significantly increase learning rates and thus the quality of your algorithms. For example, we could help one of our customers by reducing the training times of its artificial neural network to a tenth of the time, which translated to a better quality of the customer’s analysis software.

We can also consult you in whether your algorithm is suitable for high speedups or whether a different algorithm may better benefit from parallelization. Contact us to find the best solution for you.

Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs

Posted by Vincent Hindriksen on 28 July 2011

ZiiLabs has been offering an early access program for OpenCL SDK since last year. This program was very selective in choosing developers and little news has been put on their webpage. Now they are planning to make their Android NDK a standard component, it’s a good time to ask them some questions. GPGPU-consultant Liad Weinberger of Appilo also added a few questions.

The Q&A has been with Tim Lewis, director Marketing and Partner Relations of ZiiLabs, who has taken the time to give some insights in what we can expect around accelerated computations on Android. ZiiLabs has been better known as 3DLabs and has reinvented itself in 2009 (you can read the full history here). Like other companies in the ARM-industry they mostly design chips and let other parties manufacture devices using their schematics, drivers and software. Now to the questions.

Continue reading “Exposing OpenCL on Android: Q&A with Tim Lewis of ZiiLabs” →

A typical week

Primary and secondary tasks

The main focus is programming and solving problems. But that means that everything that obstructs this focus, needs to be gotten out of the way. This is simpler on paper than in reality and therefore there are multiple “faiths” among company, how to do this.

We start with clearly distincting primary and secondary tasks, where the difference is that there needs to be more time spent on the primary tasks in the long term. The last part of the sentence is very important.

What we do every day and week:

Planning
- Write issues
- Make issue estimations
- Prioritize issues
- Bundle issues in epics
- Pick issues for personal weekly milestones
Problem-solving
Coding and math
Learning
- Reading books
- Reading papers
- Watching videos

Why so much emphasis on planning?

The planning-part takes good time, but refrains us from spending too much time on dead ends. And spending time on dead ends is not a primary task at all. Also planning helps with designing better strategies – there is limited time for solving problems and coding software, so doing a full-scope research is not going to work. As there is no way to efficiently build complex code without any time-estimations on the different approaches, planning-skills provide the necessary foundations for becoming a senior coder.

We start as early as possible to train these skills, so also juniors are asked to do all planning-tasks. Initially this takes a good part of the valuable coding-time but quickly goes down and first advantages are seen.

Style of project handling

Tools

We mostly use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, thus we have questioned and optimized everything that was presented as “good practice”.

We continuously look into new tools that can help us improve. Also here the main focus is to reduce the time on secondary tasks, so we can spend more time thinking on problem-solving.

Pull-style project management

The tasks are written down by the team, using the project-doc as input. All these tasks are put into the task-list of the project and estimated. Then each team member picks the tasks that are a good fit. There are always tasks that need to be pushed instead of pulled, but luckily that’s a relatively small part of all work.

All code (MR) is checked by one or two colleagues, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual and one can get into the task well-prepared. The goal is not to get the job finished, but not having written the code where a future bug has been found.

These type of solutions describe how we work and differ from a corporate environment: no-nonsense and effective.

The week

If you’d work here, how would your week look like the first year? Specifically saying the first year, as for more complex projects, different approaches could be chosen.

Monday weekly planning

Together with your team you pick up the issues for the week. The issues should have estimations, or these will be done during that meeting. When your week is filled, you know what to do.

Monday weekly meeting

Every Monday we have a weekly meeting to share with everybody how the other projects are doing.

Mon-Fri: Daily standup

Retrospective of the previous day, and tuning of the day ahead.

Practice:

Tools
C/C++
GPGPU
Scrum

Friday closing

Weekly retrospective, cleaning up, writing notes on issues, etc.

Weekly customer meetings

Here we discuss the progress and anything blocking. The customer shares their progress, and together problems can be solved.

Many projects have a shared (high-level) issue-list, so the progress is continuously synced with the customer and communication is easy.

Learning both OpenCL and CUDA

Posted by Vincent Hindriksen on 23 September 2010 with 2 Comments

Be sure to read Taking on OpenCL where I’ve put my latest insights – also for CUDA.

The two¹ “camps” OpenCL and CUDA both claim you should first learn their language first, after which the other would be easy to learn. I’m from the OpenCL-camp, so I say you should learn OpenCL first, but with a strong emphasis on hardware-architecture understanding. If I had chosen for CUDA I would have said the opposite, so in other words it does not matter which you do first. But psychology tells us that you probably like the first language more since there is where you discovered the magic; also most people do not like to learn a second language which is much alike and does not add a real difference. Most programmers just want to get the job done and both camps know that. Be aware of that.

NVIDIA is very good in marketing their products, AMD has – to say it modest – a lower budget for GPGPU-marketing. As a programmer you should be aware of this difference.

The possibilities of OpenCL are larger than those of CUDA, because of task-parallel programming and support for far more different architectures. At the other side CUDA is much more user-friendly and has a lot of convenience built-in.

Continue reading “Learning both OpenCL and CUDA” →

Mega-kernel versus Micro-kernels in LuxRender (repost)

Posted by Vincent Hindriksen on 4 November 2014

Below is a (slightly edited) repost of a blog by David Bucciarelli (homepage, twitter) on the Luxrender forum.

I find micro-kernels an important subject, since micro-kernels have clear advantages. In OpenCL 2.0 there are more possibilities to create smaller kernels. Also making smaller and more focused functions is considered good software engineering, defined as “Separation of Concerns“.

For a general introduction to the concept of “Mega Vs Micro” kernels, read “Megakernels Considered Harmful: Wavefront Path Tracing on GPUs” by Samuli Laine, Tero Karras, and Timo Aila of NVIDIA. Abstract:

When programming for GPUs, simply porting a large CPU program

into an equally large GPU kernel is generally not a good approach.

Due to SIMT execution model on GPUs, divergence in control flow

carries substantial performance penalties, as does high register us-

age that lessens the latency-hiding capability that is essential for the

high-latency, high-bandwidth memory system of a GPU. In this pa-

per, we implement a path tracer on a GPU using a wavefront formu-

lation, avoiding these pitfalls that can be especially prominent when

using materials that are expensive to evaluate. We compare our per-

formance against the traditional megakernel approach, and demon-

strate that the wavefront formulation is much better suited for real-

world use cases where multiple complex materials are present in

the scene.

OpenCL kernels in “SmallLuxGPU” (raytracer, originally made by David) have followed the micro-kernel approach from the very beginning. However, with the merge with LuxRender and the introduction of LuxRender materials, textures, light sources, etc. one of the kernels sized up to the point of being a “Mega-kernel”.

The major problem with “Mega-kernel”, aside of the inability of AMD OpenCL compiler to compile them, is the huge register usage and the very low GPU utilization. Why this happens, is well explained in the paper.

PATHOCL Micro-kernels edition, the results

The number of kernels increases from 2 to 10, the register usage decrease from 196 (!!!) to 3-84 and the GPU utilization rise from a miserable 10% to a more healthy 30%-100%.

Occupancy increases from 10% to 30% or more

The performance increase is huge on some platform (Linux + FirePro W8100), 3.6 times:

Speed increases from 0.84M to 3.07M samples/sec

A speedup in the 20% to 40% range has been reported on MacOS/Windows + NVIDIA GPUs.

It solves the problems with AMD compiler

Micro-kernels not only improve the performance but also addressees the major issues with AMD OpenCL compiler. For the very first time since the release of first AMD OpenCL SDK beta, I’m not aware of a scene not running on AMD GPUs. This is SATtva’s Mic scene running on GPUs for the first time:

Scene builds correctly on AMD hardware for the first time

Try it out yourself

This feature will be extended to BIASPATHOCL and available in LuxRender v1.5.

A new version of PATHOCL is available in this branch. The sources of micro-kernels are available here.

To run with micro-kernels, use “path.microkernels.enable=1”.