Our Job-Application Process – with tips and tricks

Job applications can be stressful. You need to invest time in a company you don’t really know, which might reject you – or the job might turn out not to be what you envisioned. Application processes also differ per company, depending on its values and experience. It can be a maze, and hard to understand what is expected.

We’re a relatively small company, but we put more time into our application process than our peers. We have used the feedback from past applicants to improve it bit by bit. Understand that we want you to walk through the application successfully, not drown in the process – so if something on this page is unclear, email us via jobs@streamhpc.com. You would be surprised how many others ask various questions before starting, and statistically it actually increases the chance of being hired.

We designed the process such that the chance you’ll get an offer goes from a few percent in the first step to 90% really quickly. This allows us to spend more time on the people we think stand a good chance.

We want you to succeed!

Seriously, the application process should make sure that people with the right skills are guaranteed to pass.

We wrote this tutorial to help you get through the first rounds successfully – so just by reading this page, you already increase your chances. As a bonus, various tips & tricks are generally applicable, and you can also use them for other job openings.

Round 1: CV scanning

How to improve your CV

A CV gives an overview of what you can do, with experience as proof. It does not need to state “he is exceptional in…” – it just needs to show what you managed to do. If you did a project with a team, mention your role. If you think an unsuccessful project adds no value, you’re wrong – just clearly state what you learned from it.

We do a quick scan of your CV and letter, which means we look for keywords like CUDA, OpenCL, SYCL, GLSL, HLSL, Assembly, etc. Second, we try to assess your seniority in CPU programming, GPU programming and overall project experience. For example: if you have never worked in a team, mostly did C/C++ programming for 15 years and made your first GPU software some months ago, we’d assess you as a solo worker, CPU senior and GPU beginner.

Pro-tip: Embrace the idea that companies simply do CV scanning. But don’t overdo it by listing every keyword that could ever apply, as you will get questions you cannot answer.

These labels are not good or bad; they are just how we think things are. So make sure we can extract them from your CV and connect the dots. For example, if you mention “OpenCL” under skills, but not under any of your experiences, we might still reject you. In that case it might be better to mention it under education as a “best subject” instead of as a skill – just explain it in your email.

If you need a sample CV, just use the one below. In the job descriptions, be specific about what you did, e.g. “increased X by Y amount”.

Share code

To further support your experience, recent GPU code is very helpful for getting through the first filters. Also label it correctly as “university assignment”, “book assignment”, “hobby project”, etc. This helps us assess your code the right way.

Pro-tip: Clean up your code and add comments. This shows how you would work in a professional environment.

For those who send GPU code, we check coding style, efficiency, applied optimizations, etc. We also check whether you used libraries or wrote your own kernels. As the job includes writing those GPU libraries, we’re not looking for people who only use them.

Pro-tip: Split the work on GPU kernels from the usage of libraries, as in the sketch below. This shows you’re capable of writing GPU kernels.
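
For illustration, a minimal, hypothetical sketch of what a cleanly commented, hand-written kernel can look like – kept separate from any library calls, so a reviewer immediately sees kernel-writing skill (the saxpy example and all names are ours, not a requirement):

    // Hypothetical example: a hand-written OpenCL kernel, kept apart from
    // library calls and commented so a reviewer sees at a glance what it does.
    #include <cstdio>

    // saxpy: y = a*x + y, one work-item per element. The source is kept as a
    // string so it can be built at runtime with clCreateProgramWithSource.
    static const char* kSaxpySource = R"CLC(
    __kernel void saxpy(const float a,
                        __global const float* x,
                        __global float* y) {
        size_t i = get_global_id(0);  // global index of this work-item
        y[i] = a * x[i] + y[i];
    }
    )CLC";

    int main() {
      std::puts(kSaxpySource);  // in a real project this feeds the OpenCL compiler
      return 0;
    }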

Write a motivational email

Last but not least, always add a motivational letter. Instead of “see my CV attached”, share why you like working with GPUs and HPC – and preferably also what appeals to you about our company. This explains what drives you, and lets us quickly find out if we’re a match.

We see templated emails (sometimes with funny mistakes), but it is not really necessary to make it that personal. We understand job applications are time-consuming, so a general text suffices. Think of sentences like:

  • What you seek/need: “Things I value in a job are: ….. I hope I can find them in this job”
  • What you value: “I like working with GPUs, since I did …”
  • What you miss: “I remember a university project ….. I want more of that”

Round 2: short coding test

Those who are left do a simple online test in C++ (or C, if you prefer). This test gives us a grasp of your way of working and thinking, and prepares you for the longer test. We found that puzzle solvers are good at the work we do. If you currently have a boring job, it helps to get some practice with creative coding tasks first.

It takes 10–25 minutes, depending on your experience with such tests. We give you 30 minutes, which should be enough for most. If you have never done such a test, check the tips under round 4A.

Pro-tip: Do the sample test first, if you’re new to such tests. This will allow you to experiment freely.

If you fail the test, you get an email with hints and the choice to still do the long test. This lets those who realized they needed to prepare better actually do much better in the next round. The hints are written down under round 4A, so you can prepare in advance.

Reasons people stop here

While we give the choice to continue the process, not everybody takes that opportunity. Here’s some background on their decisions.

  1. Finding out it is too difficult. By this point there is a good understanding of, and context for, what the job is actually about. We have tried all kinds of ways to communicate that the job is difficult, to serve the bored people, but we’ve learned that the texts are understood relative to somebody’s own experience. We believe in growth (watch “Not there yet” for more on the learning mindset), and luckily some people have applied again a few years later.
  2. Thinking the job is too difficult, and thus not even doing round 1. We found that many people don’t even apply because they think they cannot do it, since their friends do rocket science – that’s a missed chance. Get in contact to discuss your doubts.
  3. Not wanting the time pressure. See what is written under round 4 – we are here to help you pass (if your skills are there)! Get in contact to discuss this, so we can look for solutions together.

If you are worried about any of the above, please reach out to us. It is worth doing the short coding test, and we may offer an alternative to the longer one if you have concerns. We want you to have the best chance possible.

There are more reasons not mentioned here. We try to get 100% of the people with the right skills through, so feedback is always welcome.

Round 3: video call

First real contact! Here we double-check everything we have assumed, and also answer all your questions. Make sure you have questions prepared. If they have all been answered by the time of the call, just bring the list and mention that.

Know that it’s a fairly relaxed call, where we just want to get to know you. We don’t want to work through a CV or hear about technical projects at this step. To succeed here, just answer the questions openly and don’t try to give the response you think we want to hear.

Pro-tip: With questions prepared, you signal you’re truly interested in the position.

Here we also discuss your salary expectations. We don’t pay the salaries the financial or energy sector pays, and we need clarity on this.

Round 4: long coding test

After the call you are invited for a long coding test. We have two variations: the online coding test (4A) and the homework test (4B).

If you’re not that good at writing efficient C++ and solving puzzles, be sure to get more experience in C/C++/GPGPU! It can make sense to postpone or pause your application for a few months and take this seriously. If you’re looking for a way to improve, joining an open-source project in C/C++/GPGPU helps a lot with getting through this round.

Round 4A: online test

Here you show your skills in C++ and algorithms, not in GPGPU. On average, this takes 2–3 hours. There is a warm-up assignment and then three bigger assignments, so the time per assignment is 45–60 minutes. Understand that we simply test your C++ and puzzle/reading skills, as we need these skills for the projects we run.

If you did the short test really well (80% or higher), chances are good that you’ll pass this one too. People who did not do well on the first test but studied their mistakes also do well. In all cases, follow the tips under “How to prepare” again, as the pass rate for seriously prepared people is always higher.

Statistics: of all the applicants we invite for the test, 67% actually take it, and 25% get a score of at least 80. So 0.25 / 0.67 ≈ 37% of the people who start the test pass with an 80+ score. We of course try to improve these numbers – take the 63% who did the test for “nothing”: that share used to be well over 90%, but we think we can do better still. One part of that is helping applicants prepare better for the test.

How to prepare for the Codility test

The tips below work for any coding test. If you want a more serious coding job, you should expect tests, so preparation matters. Codility wrote a nice article on how to prepare for the test, including links to sample tests. Start there. Make sure you practice puzzles like you did at university, especially if you are trying to escape a boring job.

The main challenges are working under time pressure and not being able to get hints, which takes some practice to get used to. Of course we would also prefer to skip these tests, and we seriously tried alternatives – we simply found that it is important to know how somebody solves problems on their own.

General tips during the test:

  1. Read all questions carefully. A typical mistake is not understanding the question – it is therefore much better to spend 10 full minutes on understanding the question (and the provided code) than to start coding as soon as possible.
  2. Plan your time. Estimate how much time each assignment will take you. Focus on the ones you have most confidence in, but use a stopwatch to restrict yourself to 30–35 minutes per assignment. This leaves 10 minutes at the end to double-check and test your solutions.
  3. Make sure you try each assignment: (100+100+0)/3 ≈ 67%, so one untouched assignment caps your score.
  4. Test. Start by designing a set of tests – see the sketch after this list. With just a few exceptions, all applicants who got high scores tested their solutions in Codility.
  5. Comment your code, for two reasons. One: if your code fails, comments and tests can give you the benefit of the doubt. Two: explaining the code in words supports your understanding of the problem.
  6. Keep your algorithm books close. We’re not testing your memory, but your skills. Copying code is strictly not allowed, but double-checking an algorithm is.
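
A minimal sketch of tip 4, with a hypothetical assignment (return the maximum element, or 0 for an empty array) – write the edge-case tests first, then fill in the solution:

    #include <algorithm>
    #include <cassert>
    #include <vector>

    // Hypothetical assignment: return the maximum element, or 0 if empty.
    int solve(const std::vector<int>& v) {
      if (v.empty()) return 0;
      return *std::max_element(v.begin(), v.end());
    }

    int main() {
      assert(solve({}) == 0);             // empty input
      assert(solve({42}) == 42);          // single element
      assert(solve({-3, -1, -2}) == -1);  // all-negative input
      return 0;
    }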

Pro-tip: Those who make time for the test within two weeks have a higher chance of reaching the last rounds.

Round 4B: homework test

If you are doing this round for the second time, or if you are not willing to work under time pressure, we have a homework test. This takes 10 to 20 hours, but can be done in your own time. Here you show your skills in C++, algorithms and GPGPU. It shows what somebody’s level really is in the broad sense, which should give you useful feedback for progressing along this career path.

Rounds 5+: The rest

From here on, the chance of getting the job is 90–95%! That is, assuming you did not cheat – but luckily hardly anybody does. The remaining 5–10% covers unique situations we assumed we did not need to test for.

In these rounds we only double-check things, and focus on getting to know you. Note that they take more than half of the total time to be invested. We do:

  • the technical interview on C, C++ and GPGPU (2 hours)
  • the long interview (3 hours)
  • reference-checks

We try to plan it all in one week, which makes it intense but fast.

Pro-tip: Read an applicable book before the technical interview. This helps freshen up your theoretical knowledge.

A last remark: the door versus the room

There are two types of people who apply: door-people and room-people. Door-people want to get through the door, and plan to prove their worth afterwards by working really hard. Room-people focus on the room behind that door, and try to find out how compatible we are with each other. Statistically, we almost only hire room-people. This means that if you focus on checking your own compatibility with the job, and ask us questions about what it is like once you are in, your chances increase a lot.

If you have questions not discussed here – just email us at jobs@streamhpc.com

We have provided various texts to help you get the information we think is useful:

OpenCL Videos of AMD’s AFDS 2012

AFDS was full of talks on OpenCL. Did you miss them, just like me? Then you will be happy to hear that many of the videos have been put on YouTube!

Enjoy watching! As each video is around 40 minutes, it is best to set aside a full day to watch them all. The first part is on OpenCL itself, the second on tools, the third on OpenCL use cases, and the fourth on other subjects.

Continue reading “OpenCL Videos of AMD’s AFDS 2012”

Call for Papers, Presentations, Workshops and Posters for IWOCL in Stanford

The IWOCL 2015 call for OpenCL Papers is now open and is looking for submissions from industry and academia relating to the use of OpenCL. Submissions may refer to completed projects or those currently in progress and are invited in the form of:

  • Research Papers
  • Technical Presentations
  • Workshops and Tutorials
  • Posters

Examples of sessions from 2014 can be found here.

Deadlines at a Glance

Call for submissions OPENS: Wednesday 19th November, 2014
Call for submissions CLOSES: Saturday 14th February, 2015 (23:59 AOE)
Notifications: Within 4 weeks of the final closing date

Selection Criteria

The IWOCL Technical Committee will select submissions based on the following criteria:

  • Concept of the submission and its relevance and timeliness
  • Technical Depth
  • Clarity of the submission, clearly conveying what your presentation will cover
  • Research findings and results of your work
  • Your credentials and expertise in the subject matter

Unpublished Technical Papers

We solicit the submission of unpublished technical papers detailing original research related to OpenCL. All topics related to OpenCL are of interest, including OpenCL applications from any domain (e.g., scientific computing, video games, computer graphics, multimedia, information retrieval, optimization, text processing, data mining, finance, signal and image processing and numerical solvers), OpenCL performance analysis and modeling, OpenCL performance and correctness tools and proposed OpenCL extensions. IWOCL will publish formal proceedings of the accepted papers in The ACM International Conference Series. Please Submit an Abstract which should be between 1 and 4 pages long.

Technical Presentations

We solicit the submission of technical presentations detailing the innovative use of OpenCL. All topics related to OpenCL are of interest, including but not limited to applications, software tools, programming methods, extensions, performance analysis and verification. Please Submit an Abstract which should not exceed 4 pages.  The accepted presentations will be published in the online workshop proceedings.

Workshops & Tutorials

IWOCL includes a day of tutorials that provide OpenCL users an opportunity to spend more time exploring a specific OpenCL topic.  Tutorial submissions should assume working knowledge of OpenCL by the attendees and can for example cover OpenCL itself, any of the related APIs such as SPIR and SYCL, the use of OpenCL libraries or parallel computing techniques in general using OpenCL. Please Submit an Abstract which should not exceed 4 pages. Please include  the preferred length of the tutorial or workshop (e.g. 2, 3 or 4 hours).

Posters

To encourage discussion of the latest developments in the OpenCL community, there will be a poster session running in parallel to the main sessions and open during the breaks and lunch sessions.  The abstracts of the accepted posters will be published in the form of short communications in the workshop proceedings, provided that at least one of the authors has registered for the workshop. Please Submit an Abstract which should not exceed 2 pages.

Submit your abstract today

Go to Easychair, log in or register, and click on “New Submission”. Deadline is 14 February.

The history of the PC from 2000 – 2012

After IBM-compatible clones took over from Apple, Atari and the ZX Spectrum, we simply got used to the idea that a PC is an x86 with MS Windows and Office on it. Around a decade ago Apple fought back with OS X, to which Windows 7 (launched in 2009) was the first real answer. Meanwhile Apple switched to Intel, since IBM was not fast enough with the development of the POWER processor – a huge operation, which at the time seemed a one-time-only step for Apple. SemiAccurate now speaks of Intel being replaced by ARM in Apple’s laptops.

A few weeks ago I asked Computer Science students if they knew ARM. Not even 1% had heard of it, yet many more knew there was a Samsung chip in their smartphone. So what’s going on without us knowing it?

I’ll try to describe the market for a few key years and then place the big names in it. There is a lot going on in the ARM market between, for example, Nvidia, Samsung, Texas Instruments and Imagination Technologies, but I’ll leave that out of the story. Game consoles and servers are also not covered, even though they did have a big influence on the home-PC market.

The picture on the right gives an idea of how fast the markets were expected to grow from a 2006 perspective (click on it for the full report). You can see that the explosive growth of smartphones was not expected; another detail is that the cloud was not foreseen here either.

After reading this, you’ll understand why Nvidia focuses so much on HPC and mobile.

Continue reading “The history of the PC from 2000 – 2012”

Call for papers: SYCL workshop, 13-March-2016, Barcelona, Spain

A high-level language has been on OpenCL’s roadmap since the early years, to be started once the foundations were ready. Therefore, with OpenCL 2.0, SYCL was born.

To keep the pace high, a SYCL workshop is being organised. This week the call for papers opened; you can read it below.

1st SYCL workshop (SYCL’16) – co-located with PPoPP’16

Barcelona, Spain Sunday, 13th March, 2016

SYCL (sɪkəl – as in sickle) is a royalty-free, cross-platform C++ abstraction
layer that builds on the underlying concepts, portability and efficiency of
OpenCL, while adding the ease-of-use and flexibility of C++. For example, SYCL
enables single source development where C++ template functions can contain both
host and device code to construct complex algorithms that use OpenCL
acceleration, and then re-use them throughout their source code on different
types of data. SYCL has also been designed with resilience from the start, by
featuring, for example, a fall-back mechanism to automatically re-enqueue
kernels on different queues in case of a failure.
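
To give a taste of that single-source model, here is a minimal vector-addition sketch in the SYCL 1.2 buffer/accessor style (our own illustration, not part of the call text):

    #include <CL/sycl.hpp>
    #include <iostream>
    #include <vector>

    int main() {
      const size_t n = 1024;
      std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
      {
        cl::sycl::queue q;  // picks a default device (GPU, CPU, ...)
        cl::sycl::buffer<float, 1> bufA(a.data(), cl::sycl::range<1>(n));
        cl::sycl::buffer<float, 1> bufB(b.data(), cl::sycl::range<1>(n));
        cl::sycl::buffer<float, 1> bufC(c.data(), cl::sycl::range<1>(n));
        q.submit([&](cl::sycl::handler& cgh) {
          auto A = bufA.get_access<cl::sycl::access::mode::read>(cgh);
          auto B = bufB.get_access<cl::sycl::access::mode::read>(cgh);
          auto C = bufC.get_access<cl::sycl::access::mode::write>(cgh);
          // Host and device code live in one C++ source file:
          cgh.parallel_for<class vadd>(cl::sycl::range<1>(n),
              [=](cl::sycl::id<1> i) { C[i] = A[i] + B[i]; });
        });
      }  // buffers go out of scope: results are copied back to the host
      std::cout << c[0] << std::endl;  // prints 3
    }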

The SYCL Workshop aims to gather together SYCL’s users, researchers, educators
and implementors to encourage and grow a community of users behind the SYCL
standard, and related work in C++ for heterogeneous architectures. This will be
a half-day workshop. SYCL’16 will be held in Barcelona, 13 March 2016,
co-located with PPoPP 2016, HPCA 2016, CGO 2016 and LLVM 2016.

Travel Awards

Student authors who present papers in this workshop are eligible to apply for
travel awards. Further details will be announced after notification of
acceptance.

Important Dates

Submissions: 23rd November
Notification: 21st December
Final version: 24th January, 2016
Workshop: Sunday, 13th March, 2016

Submission Guidelines

All submissions must be made electronically through the conference submission
site, at https://easychair.org/conferences/?conf=sycl16.
Submissions may be one of the following:

  • Extended abstract: Two pages in standard SIGPLAN two-column conference
    format (preprint mode, with page numbers)
  • Short Paper: Four to six pages in standard SIGPLAN two-column conference
    format (preprint mode, with page numbers)

Submissions must be in PDF format and printable on US Letter and A4 sized
paper. All submissions will be peer-reviewed by at least two members of the
program committee. We will aim to give longer presentation slots to papers than
to extended abstracts. Conference papers will not be published, but made
available through the website, alongside the slides used for each presentation.
The aim is to enable authors to get feedback and ideas that can later go into
other publications. We will encourage questions and discussions during the
workshop, to create an open environment for the community to engage with.

Topics of interest include, but are not limited to:

  • Applications implemented using SYCL
  • C++ Libraries using SYCL
  • C++ programming models for OpenCL (C++AMP, Boost.Compute, …)
  • Other C++ applications using OpenCL
  • New proposals to the SYCL specification
  • Integration of SYCL with other programming models
  • Compilation techniques to optimise SYCL kernels
  • Performance comparisons between SYCL and other programming models
  • Implementation of SYCL on novel architectures (FPGA, DSP, …)
  • Using SYCL in fault-tolerant systems
  • Reports on SYCL implementations
  • Debuggers, profilers and tools

Organising Committee

Paul Keir, University of the West of Scotland (UK)
Ruyman Reyes, Codeplay Software Ltd, Edinburgh (UK)

Program Committee

Jens Breitbart, TU Munich
Alastair Donaldson, Imperial College London, UK
Christophe Dubach, University of Edinburgh, UK
Joel Falcou, LRI, Université Paris-Sud, France
Benedict Gaster, University of the West of England, UK
Vincent Hindriksen, StreamHPC, Netherlands
Christopher Jefferson, St. Andrews University, UK
Ronan Keryell, Xilinx, Ireland
Zoltán Porkoláb, ELTE, Hungary
Francisco de Sande, Universidad de La Laguna, Spain
Ana Lucia Varbanescu, University of Amsterdam, Netherlands
Josef Weidendorfer, TU Munich

Yes, we’re in the Program Committee as one of the few non-academics. We’re looking forward to reading your proposal!

If you have a blog, feel free to copy the above text and repost it.

DOI: Digital attachments for Scientific Papers

Ever saw a claim in a paper that you disagreed with or were triggered by, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.

When we want to redo the experiments of a paper, it starts with finding the code and data used. A good start is GitHub or the scientist’s homepage; GitLab, Bitbucket, SourceForge or the personal homepage of one of the researchers could also be places to look. Emailing the authors is only an option if the university homepage mentions an address – and we’re not surprised to get no reaction at all. If none of that works, implementing the pseudo-code and creating our own data might be the only option left – with no certainty that the outcome will support the claims.

So what if scientific papers had an easy way to connect to digital objects like code and data?

Here the DOI comes in.

Continue reading “DOI: Digital attachments for Scientific Papers”

IWOCL 2017 – all the talks

An overview of all the tutorials and talks for easy reading.

You can also download the PDF.

Heterogeneous Computing Using Modern C++ with OpenCL Devices – Rod Burns and Ruyman Reyes (Codeplay)

This hands-on session will provide an opportunity to get experience with SYCL using ComputeCpp™ Community Edition, a free-to-use implementation of the SYCL 1.2 standard. Attendees will be shown how to set up ComputeCpp and use it to write their own SYCL code to run on supported GPUs and CPUs.

SYCL is already able to dispatch to heterogeneous devices, and it implements C++17 ParallelSTL, augmenting it with the ability to dispatch to GPUs in addition to CPUs. This tutorial will demonstrate how to write parallel SYCL code and how to use the Khronos Group’s experimental Parallel STL implementation. The course outline is as follows:

  • Start with a basic SYCL program that shows how to submit work to queues using a single task and a stream-like object, comparing CPU, SYCL and OpenCL versions
  • Demonstrate how to access data across host and GPUs using buffers and accessors, the importance of life-time, and basic parallel constructs

Attendees are expected to have programming experience with C++ and a laptop either running Linux or having a VM manager installed such as VirtualBox. The required software will be provided on USB-sticks. This course is suitable for beginners, but is focused on intermediate to advanced parallel programming using C++.

Harnessing the Power of FPGAs with the Intel FPGA SDK for OpenCL – Byron Sinclair, Andrew Ling and Genady Paikin (Intel)

In this tutorial, we will introduce you to the reconfigurable hardware architecture and programming of Field Programmable Gate Arrays (FPGAs).

You will learn why FPGAs have become so popular in recent years, and understand the many advantages of using FPGAs in your HPC application. In particular, we will cover architectural features of FPGAs that make them well suited to many complex operations, including matrix multiplications and convolutions. In addition, we will introduce you to programming FPGAs using the Intel® FPGA SDK for OpenCL, and how specific OpenCL coding techniques can lead to efficient circuits implemented on the FPGA.

Finally, we will go over several case studies where FPGAs have shown very competitive performance when programmed using OpenCL, including convolutional neural nets, FFTs, and astronomy de-dispersion algorithms.

Unlock Intel GPUs for High Performance Compute, Media and Computer Vision Capabilities with Intel OpenCL Extensions – Jeff Mcallister, Biju George, Adam Herr and Ben Ashbaugh (Intel)

The keys to unlock the full performance potential of Intel GPUs for emerging workloads in general compute, media, computer vision, and machine learning are in the rich suite of Intel OpenCL extensions. These give developers direct access to unique Intel hardware capabilities, which until now have been difficult to master.
This tutorial builds step by step with multiple examples, including:

  • How to write high performance general compute applications based on the core concept of OpenCL subgroups.
  • How to use additional subgroup operations described in the Intel subgroups and media block read/write extensions.
  • Then using the framework of subgroups, we explain the device-side motion estimation extension which leverages the unique Intel GPU media sampler to accelerate motion estimation operations from OpenCL kernels.
  • Finally we explain the Video Enhancement (VEBOX) extension, which is an OpenCL host-level API extension that leverages a powerful media fixed-function unit to accelerate many frame-level video enhancement operations.

Faster, smarter computer vision with AI and OpenCL – Uri Levy and Jeffrey Mcallister (Intel)

Learn how to use Intel machine learning and computer vision tools to get from concept to market faster for machine learning applications based on OpenCL and OpenVX. Build two example scenarios: autonomous driving with FPGA inference and a smart camera app using Intel Graphics inference. This presentation will show how a unified set of tools can reduce the complexity of developing heterogeneous machine learning apps – from training a model with input images, to creating a custom classifier, to building an optimized traditional computer vision pipeline around the classifier to create a full computer vision application.

GPGPU Acceleration using OpenCL for a Spotlight SAR Simulator – Eric Balster, Jon Skeans and David Fan (University of Dayton) Marc Hoffman (US Air Force Research Laboratory)

In this paper, OpenCL is used to target a general-purpose graphics processing unit (GPGPU) for acceleration of two modules used in a synthetic aperture radar (SAR) simulator. Two of the most computationally complex modules, the Generate Return and Back Projection modules, are targeted to an AMD FirePro M5100 GPGPU. The resulting speedup is 2.5X over multi-threaded C++ implementations of those algorithms running on an 8-core Intel i7 2.8GHz processor, 5X over single-threaded C++ implementations, and 24X over native MATLAB implementations, on average.

Near Real-Time Risk Simulation of Complex Portfolios on Heterogeneous Computing Systems with OpenCL – Javier Alejandro Varela and Norbert Wehn (University of Kaiserslautern)

In this work, we exploit OpenCL to efficiently map the nested simulation of complex portfolios with multiple algorithms on heterogeneous computing systems. Code portability and customizations allow us to profile the kernels on different accelerating platforms, such as CPU, Intel’s Xeon Phi and GPU. The combination of OpenCL, a new bit-accurate algorithmic optimization and the extension of an existing numerical interpolation scheme allows us to achieve 1000x speedup compared to the state-of-the-art approach. Our system design minimizes costly host-device transfers and global memory, enabling complex portfolios to be easily scaled.

A Performance and Energy Evaluation of OpenCL-accelerated Molecular Docking – Leonardo Solis Vasquez and Andreas Koch (Technische Universität Darmstadt)

This work presents an OpenCL implementation of AutoDock, and a corresponding performance evaluation on two different platforms based on multi-core CPU and GPU accelerators. It shows that OpenCL allows highly efficient docking simulations, achieving speedups of ∼4x and ∼56x over the original serial AutoDock version, as well as energy efficiency gains of ∼2x and ∼6x, respectively. To the best of our knowledge, this work is the first to also consider the energy efficiency of molecular docking programs.

Assessing the feasibility of OpenCL CPU implementations for agent-based simulations – Nuno Fachada and Agostinho Rosa (Instituto Superior Técnico, Portugal)

In this paper we evaluate the feasibility of using CPU-oriented OpenCL for high-performance simulations of agent-based models. We compare a CPU-oriented OpenCL implementation of a reference ABM against a parallel Java version of the same model. We show that there are considerable gains in using CPU-based OpenCL for developing and implementing ABMs, with speedups up to 10x over the parallel Java version on a 10-core hyper-threaded CPU.

Enabling FPGAs as a True Device in the OpenCL Standard – Vincent Mirian and Paul Chow (University Of Toronto)

As FPGA capacities continue to increase, the ability to partition and partially reconfigure the FPGA will become even more desirable. The fundamental issue is how FPGAs are currently viewed as devices in the OpenCL model. In this paper, we propose a small change to the OpenCL definition of a device that unlocks the full potential of FPGAs to the programmer.

Applying Models of Computation to OpenCL Pipes for FPGA Computing – Nachiket Kapre and Hiren Patel (University of Waterloo)

We propose imposing a communication discipline inspired by models of computation (e.g. Ptolemy) such as SDF (synchronous dataflow), bulk synchronous (BSP), or Discrete Event (DE). These models offer a restricted subset of communication patterns that enable implementation tradeoffs and deliver performance and resource guarantees. This is useful for OpenCL developers operating within the constraints of the FPGA device. We hope to facilitate a preliminary analysis and evaluation of supporting these patterns in OpenCL and quantifying associated FPGA implementation costs.

Accelerating Applications at Cloud Scale using FPGAs – Sarah Siripoke, Fernando Martinez Vallina and Spenser Gilliland (Xilinx)

The acceptance and success of cloud computing has given application developers access to computing and new customers at a scale never seen before. The inherent ability of an FPGA to reconfigure and be workload-optimized is a great advantage given the fast-moving needs of cloud computing applications. In this talk we will discuss how users can develop, accelerate and deploy accelerated applications in the cloud at scale. You will learn how to get started with a turn-key OpenCL development environment in the cloud using Xilinx FPGAs.

Creating High Performance Applications with Intel’s FPGA OpenCL SDK – Andrew Ling, Utku Aydonat, Davor Capalija, Shane O’Connell and Gordon Chiu (Intel)

After decades of research, High-Level Synthesis has finally caught on as a mainstream design technique for FPGAs. However, achieving performance results that are comparable to designing at a hardware description level still remains a challenge. In this talk, we illustrate how we achieve world-class performance results on HPC applications by using OpenCL. Specifically, we show how we achieve 1 TFlops of performance on a matrix multiply and over 1.3 TFlops on a CNN application, run on Intel’s 20nm Arria 10 FPGA device. Finally, we will describe spatial coding techniques that lead to efficient structures, such as systolic arrays, to ensure that the FPGA runs efficiently.

Symphony – Task Scheduling and Memory Management in Heterogeneous Computing – Amit Jindal and Wenjia Ruan (Qualcomm Technologies)

Task scheduling and memory management are challenges that make Heterogeneous Computing difficult for the masses. There are several programming models and tools that exist targeting partitioning of workload and accessibility of data between CPU and GPU. We have developed and deployed Symphony SDK – a framework that makes workload partitioning, scheduling and memory management ‘simple’ for developers. In this talk, we will introduce Symphony architecture, elaborate how existing OpenCL kernels can be reused with heterogeneous task synchronization, task scheduling, and memory management capabilities of Symphony. We will also share real-world cases where Symphony has provided 2x-6x performance speed-ups.

CUDA-on-CL: A compiler and runtime for running modern CUDA c++11 applications on OpenCL 1.2 devices – Hugh Perkins (ASAPP)

CUDA-on-CL addresses the problem of creating and maintaining OpenCL forks by leaving the reference implementation entirely in NVIDIA CUDA, and writing both a compiler and a runtime component, so that any CUDA C++11 application can in theory be compiled and run directly on any OpenCL 1.2 device. We use the Tensorflow framework as a case study, and demonstrate the ability to run Tensorflow and Eigen kernels directly, with no modification to the original CUDA source code. Performance studies are also undertaken, and show that the CUDA-on-CL program runs at about 25% of the original CUDA-compiled version.

OpenCL in Scientific High Performance Computing—The Good, the Bad, and the Ugly – Matthias Noack (Zuse Institute Berlin)

We present experiences with utilising OpenCL alongside C++, MPI, and CMake in two real-world scientific codes. Our targets are a Cray XC40 supercomputer with multi- and many-core (Xeon Phi) CPUs, as well as multiple smaller systems with Nvidia and AMD GPUs. We shed light on practical issues arising in such a scenario, like the interaction between OpenCL and MPI, discuss solutions, and point out current limitations of OpenCL in the domain of scientific HPC from an application developer’s and user’s point of view.

Accelerated Machine Learning Using TensorFlow and SYCL on OpenCL Devices – Andrew Richards, Mehdi Goli and Luke Iwanski (Codeplay)

Codeplay has been working with Google to add SYCL back-end support in TensorFlow, one of the most popular machine learning frameworks, enabling developers to use OpenCL devices with their machine learning applications. SYCL provides an abstraction layer that simplifies parallel development, giving developers access to the computing power of OpenCL devices and reducing the amount of code required. Andrew Richards will talk about how machine learning applications can harness the power of OpenCL using open standards and how, by using SYCL, TensorFlow can be extended to include customized operations running on OpenCL devices.

Analyzing and improving performance portability of OpenCL applications via auto-tuning – James Price and Simon McIntosh-Smith (University of Bristol)

In this talk, we present an approach for analyzing performance portability that exploits the black-box nature of automatic performance tuning techniques. We demonstrate this approach across a diverse range of GPU and CPU architectures for two simple OpenCL applications. We then discuss the potential for auto-tuning to aid the generation of performance-portable OpenCL kernels by incorporating multi-objective optimization techniques into the tuning process.

Wavefront Parallel Processing on GPUs with an Application to Video Encoding Algorithms – Biju George and Ben Ashbaugh (Intel)

In this presentation we focus on the application of the wavefront pattern to design efficient GPGPU implementations of video encoding algorithms using OpenCL kernels. We present our experiences in implementing and evaluating four solutions of WPP for inter and intra estimation for AVC on GPUs. We explain the reasoning behind each solution and present the results of our analysis.

Challenges and Opportunities in Native GPU Debugging with OpenCL – Uri Levy (Intel)

In this technical session we’ll present the open architectural design of the debugger and how it fits into the OpenCL JIT compilation flow and the underlying compute technology of the hardware, with a focus on Intel processor graphics. We’ll demonstrate a showcase of how to natively work with the debugger to solve functional bugs, as well as low-level debugging techniques at the SIMD thread level which help solve complex issues such as misaligned or out-of-range accesses to local/global memory, stack overflows, illegal instructions, etc. Finally, we’ll cover the challenges in debugging.

Modeling Explicit SIMD Programming with Subgroup Functions – Biju George and Ben Ashbaugh (Intel)

In this presentation, based on our experience in developing publicly released vendor extensions based on subgroups, we explain the advantages of the “explicit SIMD” programming paradigm using OpenCL subgroup and how the subgroups framework can be leveraged to: (1) Model features for performance in OpenCL that are commonly available in programming languages or interfaces based on an “explicit SIMD” programming paradigm such as the AVX intrinsics supported in GCC; and to (2) Model features to expose functionality available in GPU accelerator units that are more conveniently and efficiently exposed using a block API.

Engineering World 2011: OpenCL in the Cloud

At the Sogeti Engineering World 2011 I gave a presentation about OpenCL in the cloud, in Dutch. To increase the relative cool factor, I made sure I had the only Prezi presentation among the standard slide-deck presentations. You can see the result below, but it cannot possibly tell everything I shared during the talk. If you want to know more, just ask, or put a comment under this article. I listen to my readers via Twitter.

The presentation has four parts: an introduction, an explanation of OpenCL, mobile devices, and data centres. The last two form the cloud-computing segment I want to focus on.

Continue reading “Engineering World 2011: OpenCL in the Cloud”

How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding it device code compiled offline through HIP’s Module API was attainable, yet cumbersome. The anticipation is driven by HIP’s single-source compilation model, borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.
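
For context, a minimal sketch of that cumbersome pre-SDK route – the host loads an offline-compiled code object through HIP’s Module API (the file name “kernels.hsaco” and kernel name “vadd” are made up):

    #include <hip/hip_runtime.h>
    #include <cstdio>

    int main() {
      hipModule_t module = nullptr;
      hipFunction_t kernel = nullptr;

      // Load a device code object that was compiled offline.
      if (hipModuleLoad(&module, "kernels.hsaco") != hipSuccess) {
        std::printf("could not load module\n");
        return 1;
      }
      // Look up a kernel by name; a launch would then go through
      // hipModuleLaunchKernel with manually packed arguments.
      if (hipModuleGetFunction(&kernel, module, "vadd") != hipSuccess) {
        std::printf("could not find kernel\n");
        return 1;
      }
      hipModuleUnload(module);
      return 0;
    }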

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches”

The 12 latest Twitter Poll Results of 2018

Via our Twitter channel we run various polls. We haven’t always shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. In case you wanted to know: there were no polls in the first half of the year.

As all-inclusive polls are unfocused (and thus difficult to answer), most polls are incomplete by design. Still, they can provide insights – and invite comments.

The polls below have given us insight, and we hope they also give you insight into how our industry is developing. They are sorted by date, oldest first.

Interestingly, the percentage of votes per choice did not change much after 30 votes. Even when a poll was retweeted by a large account, opinions kept the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading “The 12 latest Twitter Poll Results of 2018”

Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL

WebCL is a great technique for having compute power in the browser. After WebGL, which brings high-end graphics to the browser, this is a logical step on the road towards the browser-only operating system (like Chrome OS, but more will follow).

Another way to look at technologies like WebCL is that they make it possible to lift the standard base from the OS to the browser. If you remember the trial over Microsoft’s integration of Internet Explorer, the focus was on the OS needing the browser to work well. Now it is the other way around, but it can be any OS. This is because the push doesn’t come from below, but from above.

Last year two guys from Lyon (France) got quite some attention, as they wrote a WebCL plugin. Their names: Adrien Plagnol and Frédéric Langlade-Bellone. Below you’ll find a Q&A with them on WebCL. Enjoy! Continue reading “Q&A with Adrien Plagnol and Frédéric Langlade-Bellone on WebCL”

NVIDIA ended their support for OpenCL in 2012

If you are looking for the samples in one zip-file, scroll down. The removed OpenCL-PDFs are also available for download.

The sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly displayed on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since LLVM allows CUDA kernels to be loaded in OpenCL and vice versa, it could have brought the two techniques closer together.

What is the cause of this decreased support for OpenCL? Did they suddenly become aware that LLVM would erase any advantage of CUDA over OpenCL, and therefore scale back their OpenCL support? Or did they decide this long ago, given that their last OpenCL-conformant product on Windows dates from July 2010? We cannot be sure, but we do know NVIDIA has no official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support for OpenCL is the absence of the samples from their GPGPU SDK. NVIDIA removed them without notice or a clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Continue reading “NVIDIA ended their support for OpenCL in 2012”

LEAP-conference call for papers

Building bridges in a new industry

Embedded processors have always focused on low energy use. Now a combination of Moore’s law, the frequency wall and multi-processor developments has made it possible for these processors to compete in completely new market segments – most notably due to impressive advancements in graphics IP.
We are now looking at four groups who are interested in learning from each other:

  • The embedded processor market
  • The FPGA market
  • The HPC and server market
  • The GPGPU market

And answer the question: how can we get more out of low-energy processors by looking at other industries?

The goal of the LEAP conference is to bring these four groups together, creating windows onto each other and paving roads over the newly constructed bridges. This makes it one of a kind. Half of the conference is focused on quality information sharing and the other half on networking. For more information, check the website of the LEAP conference. StreamHPC is a co-organiser.

The call for papers is now open! Update: the programme has been filled!

Continue reading “LEAP-conference call for papers”

GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017

What does 2017 bring in technology? At the start of each year, Gartner shares their vision to give insight into which technologies to invest in. Looking through this year’s list, the most important enabling technologies are the GPU and the Internet of Things (IoT) – see the image below. Whereas the last four trends are IoT-based, the first four would not have been possible without GPUs.

The middle two are more mature technologies, based on many years of progress – and it happens that the GPU has played a big role in getting there. And of course GPUs and IoT are not the only reasons these ten are on this year’s list.

Continue reading “GPUs and Gartner’s Top 10 Strategic Technology Trends For 2017”

OpenCL in the Clouds

Buzzwords are cool; they are loosely defined and are actually shaped by the many implementations that use the label. Take Web 2.0, which is cool JavaScript to one person and interaction to another. Now we have cloud computing, which is cluster computing with “something extra”. More than a year ago clouds lived in the data centre, but now we even have “private clouds”. So how do we incorporate GPGPU? A cluster with native nodes running our OpenCL code on pre-distributed data is pretty hard to maintain, so what are the other solutions?

Distributed computing

Folding@home now has OpenCL support to add the power of non-NVIDIA GPUs. While in clusters the server tells the clients what to do, here the clients ask the server for jobs. The disadvantage is that the clients are written for a specific job and are not really flexible enough to take on different kinds of jobs. There are several solutions to this code-distribution problem, but the approach is still not suitable for smaller problems and small clusters.

Clusters: MPI

The SHOC project (Scalable HeterOgeneous Computing) is a collection of benchmark programs testing the performance and stability of systems that use computing devices with non-traditional architectures for general-purpose computing, and the software used to program them. While it is only a benchmark, it can be of great use when designing a cluster. Beyond that, I only found CUDA MPI solutions, which have not been ported to OpenCL yet.

Also check out Hoopoe, a cloud-computing service to run your OpenCL kernels in their cloud. It seems to be limited to .NET and to have better support for CUDA, but it is a start. In Europe there is a start-up offering a rental model for OpenCL computation time; please contact us if you want to get in touch with them.

Clusters: OpenMP

MOSIX has added a “Many GPU Package” to their cluster management system, so it now allows applications to transparently use cluster-wide OpenCL devices. When “choosing devices”, not only the local GPU pops up, but also all GPUs in the cluster.
It works disk-less, in the sense that no files are copied to the computation clients and everything stays in memory. Disk-less computation is an advantage when cloud computers are not fully trusted. Note that on most cloud computers the devices need to be virtualised (see the next part).

Below is its layered model, VCL being the “Virtual OpenCL Layer”.

They have chosen to base it on OpenMP; while the kernels don’t need to be altered, some OpenMP code needs to be added. They are very happy to report that using OpenMP takes much less code than MPI.

You can see that a speed-up of between 2.19 and 3.29 on 4 nodes is possible. We see comparable cluster speed-ups in an old cluster study. The actual speed-up on clusters depends mostly on the amount of data that needs to be transferred.

The project references a project called remote CUDA, which only works with NVIDIA GPUs.

Device Virtualisation

Currently there is no good device virtualisation for OpenCL. The gVirtuS project currently only supports CUDA, but they claim it is easily rewritten for OpenCL. The code needs to be downloaded with a Mercurial client (comparable to Git, and available in the repositories of most Linux distributions):
> hg clone http://osl.uniparthenope.it/hg/projects/gvirtus/gvirtus gvirtus
Or download it here (dated 7-Oct-2010).

Let me know when you have ported it to OpenCL! Actually, gVirtuS does not do the whole trick, since you need to divide the host devices between the different guest OSes – but luckily there is an extension that provides sharing of devices, called fission. More about this later.

We can all agree that a lot still needs to be done in this area of virtualised devices to get OpenCL into the cloud. If you can’t wait, you can theoretically use MOSIX locally.

Afterword

The cloud is the best buzzword for marketing a scalable solution that overcomes the limitations of internet-connected personal devices. I personally think the biggest growth will be in personal clouds, with companies running their own in-house cloud servers (read: clusters); people just want a feeling of control, comparable to preferring a daily traffic jam over public transport. Nevertheless, shared clouds have potential when it comes to computation-intensive jobs that do not need to run all year round.

The projects presented here are a start towards having OpenCL power at a larger scale for more demanding cases. Since one desktop PC stuffed with high-end video cards puts more power at our fingertips than a 4-year-old supercomputer cluster, there is still time.

Please send your comment if I missed a project or method.

Let’s enter the Top500 HPC list using GPUs

The #500 supercomputer has only 24 TFlops (2010-06-06): http://www.top500.org/system/9677

Update: scroll down to see the best configuration I have found. In other words: a cluster of at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFlops single precision, 1 TFlops double precision) would enter the Top500. That is 25 nodes to reach a theoretical 25 TFlops, plus 5 extra to overcome the overhead. So for about €60 000,- of hardware anyone can be on the list (and add at least €13 000,- if you want to use Windows instead of Linux for some reason). Of course you pay mostly for the services and the actual building when buying such a cluster, but you get the idea: it no longer costs a few millions. I’m curious: who is building these kinds of clusters? Could you tell me the specs (theoretical TFlops, LINPACK TFlops and watts/TFlops) of your (theoretical) cluster that costs the customer less than €100 000,- in total? Or do you know companies who can do this? I’ll make a list of companies who will be building the clusters of tomorrow, the “Top €100 000,- HPC cluster list”. You can mail me via vincent [at] this domain, or put your answer in a comment.

Update: the hardware shopping-list

Nobody said in the comments that it is easy to build a faster machine than the one described above, so I’ll do it myself. We want the most flops per box, so here’s the wish list:

  • A motherboard with as many slots as possible for PCIe, CPU sockets and memory banks, because the latency between nodes is high.
  • A CPU with at least 4 cores.
  • Focus on bandwidth, or we will not be able to use all the compute power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (where, for some reason, people love to talk about extreme machines). AMD currently has the graphics cards with the most double-precision power, so I chose their products. I’m looking around for Intel + Nvidia options, but currently they are far behind. Is AMD back on stage after being beaten by Intel’s Core products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3 socket, 6(!) PCIe slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for its 6 cores and low price per FLOPS. And to make it a monster, we add 6 AMD HD5970s (€599,- each), giving 928 × 6 = 5568 DP-GFLOPS. It can handle 16GB DDR3 (€750,-), so we put that in. It needs about 3 power supplies of 700 Watt (€100,- each). We add a 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). The case needs to house the 3 power supplies (€100,-). Cooling is important, and I suggest you compete with a wind tunnel (€500,-). It will cost you €6228,- for 5.6 TFLOPS double precision and 27 TFLOPS single precision. A cluster of these would be on the Top500 list for around €38 000,- (pure hardware price, not taking network devices much into account, nor the cost of man-hours).
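
As a quick sanity check of the arithmetic (a throwaway sketch using only the numbers above):

    #include <cstdio>

    int main() {
      const double dp_gflops_per_card = 928.0;  // AMD HD5970, double precision
      const int cards_per_node = 6;
      const double node_price_eur = 6228.0;

      const double node_dp_tflops = dp_gflops_per_card * cards_per_node / 1000.0;
      const int nodes = 6;  // enough for ~25 TFlops plus headroom for overhead

      std::printf("%.1f DP TFlops per node; %d nodes = %.1f TFlops for EUR %.0f\n",
                  node_dp_tflops, nodes, nodes * node_dp_tflops,
                  nodes * node_price_eur);
      // Prints: 5.6 DP TFlops per node; 6 nodes = 33.4 TFlops for EUR 37368
      return 0;
    }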

Disclaimer: this is the price of a single node, excluding services, maintenance, software-installation, networking, engineering, etc. Please note that the above price is pure for building a single node for yourself, if you have the knowledge to do so.

N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles to the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked a puzzle yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it's unfinished. As it can be scary to share such an unfinished project, I'll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about their progress and speedup, but then suddenly get totally overwhelmed by the endless paths to further optimization. And of course 90% of the potential optimizations don't work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never gets boring. The personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems put in front of me.

I worked on this backtracking problem as a personal fun project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing, I felt the story was not ready to be shared: I had changed the way I coded, learned many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues got better at GPGPU coding than me – this didn't change in the years after, and I'm the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now only fix the text and links. As the code was written in Aparapi and not in pure OpenCL, it would take considerable effort to make it available – I decided not to do that, to prevent postponing this post even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end if you don't care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. AMD's tools were hardly there at the time, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have in common? They are the solution counts of the N-Queens problem for N = 1 to 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
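To make the puzzle concrete before diving into the GPU side: below is a minimal CPU reference counter in plain C (not the Aparapi code from this project), using the classic backtracking trick where three bitmasks track the attacked columns and diagonals per row:

    #include <stdio.h>

    /* Count N-Queens solutions row by row. 'cols' marks occupied columns,
       'd1'/'d2' the attacked diagonals, shifted one position per row. */
    static unsigned long long solve(int n, unsigned cols, unsigned d1, unsigned d2) {
        unsigned all = (1u << n) - 1;
        if (cols == all) return 1;                /* a queen in every row */
        unsigned long long count = 0;
        unsigned open = all & ~(cols | d1 | d2);  /* free squares in this row */
        while (open) {
            unsigned bit = open & -open;          /* take the lowest free square */
            open -= bit;
            count += solve(n, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
        }
        return count;
    }

    int main(void) {
        for (int n = 1; n <= 12; n++)
            printf("N=%2d: %llu solutions\n", n, solve(n, 0, 0, 0));
        return 0;
    }

Up to roughly N=15 this finishes within seconds on a single core; the GPU fun starts when you flatten the recursion into work-items.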

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half that. I knew it did not use the full potential of the GPU, because the bitcoin-mining I had running in the background only dropped to 55%, not to 0%. 🙂 I just had to find those missing optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you can see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into preparations that don't show up in the end result.

Continue reading “N-Queens project from over 10 years ago”

When Big Data needs OpenCL

Big Data in the previous century was the archive full of ring-binders and folders, which would grow by the same amount each year. Now the definition is that it should grow each year by as much as in all previous years combined.

A few months ago SunGard named 10 Big Data trends transforming financial services. I have used their list as a basis for my own focus: increased computation demands, not specific to this one market. This resulted in 7 general trends where Big Data meets/needs OpenCL.

Since the start of StreamHPC we have sought customers who could not compute through all their data in time. Back then Big Data was still a buzzword catching on, but it best describes this core business of ours.

Continue reading “When Big Data needs OpenCL”

InsideHPC: SuperComputing. Where to from here?

In this video, moderator Bob Feldman hosts a session entitled "Supercomputing: Where to from Here?", recorded at the National HPCC Conference 2011 in Newport.

Panelists:
Dr. Eng Lim Goh, SGI
Bill Feiereisen, Intel
Shmuel Shottan, BlueArc
Steve Lyness, Appro International, Inc.
Marc Hamilton, HP Americas

http://www.youtube.com/watch?v=wI957eRr1kM

Below is a summary of what is said. These are just my notes, so go to the times mentioned to hear the exact answers. Some details that you might find important I did not write down (or missed, as English is not my mother tongue).

Continue reading “InsideHPC: SuperComputing. Where to from here?”

Our training concepts for GPGPU

It's almost time for more nerdy stuff we have in the pipeline, but we'll stick to some superficial blah for a moment. We concentrate on training (and consultancy). There is a lot of discussion here about "how to design training programs about difficult concepts for technical people", or better: "how to teach yourself something difficult". At the end of this blog we'll show you a list for learning OpenCL yourself, but before that we want to share how we look at training you.

Disclaimer: this blog item is positive about our own training program, for obvious reasons. We are aware people don't want (too much) spam, so we'll keep this kind of blog to a minimum. If you want to tell the world that your training program is better, first mail us about our international partner program. If you want the training, come back on 14 June or mail us.

OpenCL and CUDA are not the easiest programming languages, due to concepts that have no real counterpart elsewhere in software-land (you can claim Java is "slightly" different). Can the usual ways of training give you the insights and facts you need to know?

Current programs

Most training programs are vendor-supported. People who follow us on Twitter know we are not the biggest supporters of vendor-locked products. So let's list the traits of a typical vendor-supported training program that I would like to talk about:

  • They have to be difficult, so the student accomplishes something.
  • The exams are expensive, to discourage trial-and-error students.
  • You get an official certificate, which guarantees an income raise.
  • Books and trainings focus on facts you must learn.
  • It’s very clear what you must learn and what you can skip.

So in short: you chose to learn the material and put a lot of effort into it, and you get back more than just the knowledge.

Say you get the opposite:

  • They are easy to accomplish.
  • The exam is an assignment you only need to finish; you can try endless times.
  • You don’t get a certificate, but you might get feedback and homework for self-study.
  • You get a list of facts you must learn; the concepts are explained to support this.
  • You are free to pick which subject you like.

That sucks! You cannot brag about your accomplishments, and after the training you still cannot do anything with it; it will probably take years to actually finish. So it's very clear why the usual programs are the way they are – or can we still learn from this opposing list? As with everything else, you don't have to copy what's available; just pick out the good parts.

Learning GPGPU

If you want to learn GPGPU, you have to learn (in short) shader concepts, OpenCL, CUDA and GPU architectures. What is needed to learn all that, according to us?

  • A specified list of subjects you can check off once understood.
  • An insightful story about the underlying concepts, to better understand the way stream-computing works. Concepts are the base of everything, and actually make it sound simple.
  • Very practical know-how, such as how to integrate stream-computing code into your current software.
  • A difficult assignment that gets you in touch with everything you learned. The training gave you the instruments you need to accomplish this step.

So exams and certificates are secondary reasons for finishing the course. The focus should be on getting your brain wrapped around the concepts and on gaining experience. As the disclaimer warned you, our training program has a strong focus on getting you up and running in one day. And you do get a certificate once your assignment is approved, so bragging remains easy.

If you want to learn stream-computing and you won’t use our training-program, what then?

  • Read our blog (RSS) and follow us on Twitter.
  • Make yourself a list of subjects you think you have to learn. Thinking before doing helps in getting a focus.
  • Buy a book. There are many.
  • Play around with existing examples and try to break them. Example: what happens if a kernel uses more and more local/private memory? (See the sketch after this list.)
  • Update the list of subjects; the more extensive, the better. Prioritise.
  • Find yourself an assignment. For example: try to compress or decompress a large JPG using OpenCL. If you succeed, get yourself a harder assignment. Do you want to be good or the best?
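For the local/private-memory experiment mentioned above, a toy kernel like the following (hypothetical, just for illustration) is enough. Keep raising LOCAL_WORDS, rebuild and rerun: on most GPUs you will first see occupancy (and performance) degrade, and once you pass the device's CL_DEVICE_LOCAL_MEM_SIZE the build or launch will fail.

    // Toy OpenCL kernel to probe local-memory limits; LOCAL_WORDS is ours to vary.
    #define LOCAL_WORDS 1024   // try 1024, 2048, 4096, ... (4 bytes each)

    __kernel void stress_local(__global float *out) {
        __local float scratch[LOCAL_WORDS];
        int lid = get_local_id(0);
        // Each work-item fills part of the local buffer...
        for (int i = lid; i < LOCAL_WORDS; i += get_local_size(0))
            scratch[i] = (float)i;
        barrier(CLK_LOCAL_MEM_FENCE);
        // ...and reads some of it back, so the compiler cannot optimise it away.
        out[get_global_id(0)] = scratch[lid % LOCAL_WORDS];
    }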

If you know OpenCL, CUDA is easy to learn! We will publish some blogs to support your quest to learn OpenCL, so just start digging in today – see you next time.

NVIDIA: mobile phones, tablets and HPC (cloud)

If you want to see what is coming up in the market for consumer technology (PC, mobile and tablet), NVIDIA can tell you the most. The company is very flexible, and shows time after time that it really knows which markets it currently operates in and which it can enter. I sometimes strongly disagree with their marketing, but I watch them closely, as they are in the most important markets for defining the near future: PCs, mobile/tablet and HPC.

You might think I completely ignore interconnects (buses between processors, devices and memory) and memory technologies, as clouds have a large need for high-speed data transport, but the last 20 years have shown this to be a fairly stable market that develops by selling IP to hardware vendors. With Intel's acquisition of Cray's interconnect technology, we have seen this is serious business, so things might indeed change. For this article, though, I want to focus on NVIDIA's choices.