So you want your software to be much faster than the competition?

In four days, your software team learns the techniques needed to write extremely fast software.

Your team will learn how to write optimal code for GPUs and make better use of the existing hardware. They will be able to write faster code immediately after the training: doubling the speed is the minimum to expect, and a 100-fold speed-up is possible. Your customers will notice the difference in speed.

We use advanced, popular techniques like OpenCL and older techniques like cache-flow optimisation. At the end of the training you’ll receive a certificate from StreamHPC.

Want more information? Contact us.

About the training

Location and Time

OpenCL is a rather new subject, and trainers in this field have had little success with fixed locations and dates in recent years. We have therefore chosen flexible dates and initially offer the training in large cities, capitals and technology centres worldwide.

A final date for a city will be picked once there are 5 to 8 attendees, with a maximum of 12. You can specify your preferences for cities and dates in the form below.

Some discounts are available for developing countries.


Day 1: Introduction

Learn about GPU architectures and AVX/SSE, how to program them and why they are faster.

  • Introduction to parallel programming and GPU-programming
  • An overview of parallel architectures
  • The OpenCL model: host-programming and kernel-programming
  • Comparison with NVIDIA’s CUDA and Intel’s Array Building Blocks
  • Data-parallel and task-parallel programming
Lab-session: an image-filter.
Note: since CUDA is very similar to OpenCL, you are free to choose to do the lab-sessions in CUDA.
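
To give a feel for the split between host-programming and kernel-programming, here is a minimal sketch in plain Python. It only emulates the idea and does not call OpenCL or CUDA: the kernel describes the work for a single pixel, and the host launches it once per index. The names box_blur_kernel, enqueue_2d and blur are made up for this illustration, and the 3x3 box blur is just one possible image-filter.

```python
# Pure-Python emulation of the kernel/host split; no GPU runtime is involved.

def box_blur_kernel(gid_x, gid_y, src, dst, width, height):
    """Work done by ONE work-item: average the 3x3 neighbourhood of one pixel."""
    total = 0
    count = 0
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            x, y = gid_x + dx, gid_y + dy
            # Skip neighbours that fall outside the image.
            if 0 <= x < width and 0 <= y < height:
                total += src[y * width + x]
                count += 1
    dst[gid_y * width + gid_x] = total // count


def enqueue_2d(kernel, global_size, *args):
    """Hypothetical stand-in for the host side: run the kernel once per index.

    On a GPU these invocations would run in parallel; here they run in a loop.
    """
    width, height = global_size
    for gid_y in range(height):
        for gid_x in range(width):
            kernel(gid_x, gid_y, *args)


def blur(src, width, height):
    """Host code: allocate the output buffer and launch the kernel."""
    dst = [0] * (width * height)
    enqueue_2d(box_blur_kernel, (width, height), src, dst, width, height)
    return dst
```

Because every pixel is computed independently from read-only input, the same kernel body can be handed to thousands of work-items at once, which is what makes this a data-parallel problem.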

Day 2: Tools and advanced subjects

Learn about parallel-programming tactics, host-programming (transferring data), IDEs and tools.

  • Static kernel analysis
  • Profiling
  • Debugging
  • Data handling and preparation
  • Theoretical backgrounds for faster code
  • Cache flow optimisation
Lab-session: yesterday’s image-filters using a video-stream from a web-cam or file.

Day 3: Optimisation of memory and group-sizes

Learn the concept of “data-transport is expensive, computations are cheap”.
  • Register usage
  • Data-rearrangement
  • Local and private memory
  • Image/texture memory
  • Bank-conflicts
  • Coalescence
  • Prefetching
Lab-session: various small puzzles, which can be solved using the explained techniques.
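
As a taste of the kind of puzzle involved, the sketch below models bank-conflicts in local memory. It assumes 32 banks that are one word wide and a group of 32 work-items accessing memory together; these numbers are assumptions for the illustration, and real hardware differs per vendor and generation. The function name conflict_degree is made up.

```python
from collections import Counter

NUM_BANKS = 32   # assumption: 32 banks, each one word wide
GROUP_SIZE = 32  # assumption: 32 work-items access local memory together


def conflict_degree(stride_words):
    """Worst-case number of work-items that hit the same bank.

    Work-item i reads word i * stride_words (stride_words >= 1), and a word
    lives in bank (word index modulo NUM_BANKS). A result of 1 means no
    conflict; a result of k means the access is serialised into k steps.
    """
    per_bank = Counter()
    for i in range(GROUP_SIZE):
        word = i * stride_words
        per_bank[word % NUM_BANKS] += 1
    return max(per_bank.values())
```

In this model a stride of 1 is conflict-free, a stride of 32 sends every work-item to the same bank, and padding each row by one word (stride 33) removes the conflict again.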

Day 4: Optimisation of algorithms

Learn techniques to help the compiler generate better and faster code.
  • Precision tinkering
  • Vectorisation
  • Manual loop-unrolling
  • Unbranching
Lab-session: like day 3, but now with compute-oriented problems.
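
Two of these techniques can be sketched in a few lines. The Python below only shows the shape of the transformations; Python itself will not get faster from them, and the function names are made up for this illustration. In kernel code the same rewrites are applied so that the compiler can emit select instructions instead of jumps and keep several independent additions in flight.

```python
def threshold_branching(x, t):
    """Keep x if it reaches the threshold t, otherwise return 0."""
    if x >= t:
        return x
    return 0


def threshold_unbranched(x, t):
    """Same result without an if: the comparison becomes a 0/1 factor."""
    keep = int(x >= t)
    return keep * x


def sum_unrolled4(values):
    """Sum a list with the loop body manually unrolled four times."""
    n = len(values)
    main = n - n % 4  # largest multiple of 4 that fits
    a0 = a1 = a2 = a3 = 0
    for i in range(0, main, 4):
        # Four independent accumulators, so the additions do not depend
        # on each other within one iteration.
        a0 += values[i]
        a1 += values[i + 1]
        a2 += values[i + 2]
        a3 += values[i + 3]
    total = a0 + a1 + a2 + a3
    for i in range(main, n):  # leftover elements
        total += values[i]
    return total
```

Note that the unrolled sum adds the elements in a different order, so with floating-point input the last bits of the result may differ from a plain loop.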


By filling in this form, you declare that you intend to follow the course. You can cancel by e-mail or phone at any time.

StreamHPC will keep you up to date about the training at your location(s). When the minimum of 5 attendees has been reached, a final date will be discussed. If you selected more than one location, you have the option to wait for a training in another city.

Put any remarks you have in the message. If you have any questions, mail to

[si-contact-form form=’7′]

Want to know more? Get in contact!

We are the acknowledged experts in OpenCL, CUDA and performance optimisation for CPUs and GPUs. We proudly boast a portfolio of satisfied customers worldwide, and can also help you build high-performance software. E-mail us today.