Our Hiring Process

After you apply, we do a quick scan of your resume, which means that we look for keywords like CUDA, OpenCL, SYCL, GLSL, HLSL, Assembly, etc. Then, we try to assess you on seniority in CPU programming, GPU programming and overall project experience. Even if you have experience on personal/hobby projects, make sure you mention them on your resume. If you have the basic keywords we look for, our HR team will reach out to schedule a video call that focuses on cultural fit.

In short, we have 3 interviews and 1 assignment.

Cultural Check

During the video call, we’ll discuss how well you align with our company culture and values. Focusing on learning your motivations and goals, and understanding your current status. If you wanna be successful in this call, be aware of your strengths and soft skills that make you a good fit for the company, not just the technical aspects of the job. Also, make sure you have questions prepared. Know that it’s a fairly relaxed call, where we just want to get to know you in approximately 45 minutes.

Online CodeJudge Test

If you pass the cultural fit check, you’ll be invited to complete an online CodeJudge test to assess your C/C++ skills. The final score is one of many things we are looking at. Therefore, try to show your thinking process and all the problem-solving steps. The duration of the test is 4,5 hours but it is usually completed in 2,5 hours on average.

Tech Talk

Candidates who achieve a high functionality score on the CodeJudge test will be invited to a technical interview with one of our team members. It is an hour-long live coding interview with one of our tech team members. This interview mostly focuses on assessing your GPU programming skills. It is a highly interactive interview in which you will discuss approaches, tools, languages, technologies, and parts of code.

Deep Dive

The final interview is the deep dive where we get to know each other better, and give you a scenario-based problem-solving exercise with our founder so you can demonstrate your problem-solving skills. One of our team members joins to meet you and you will have a chance to ask some questions about the projects and day-to-day operations, and also the other way around.

After we complete the interview process, we ask for your references. While we check the references you provide, we will prepare your offer letter. Once everything is in order, we’ll extend a formal job offer and look forward to welcoming you to our team!

What We Offer

A salary within the market range.
A company laptop and needed equipment for your work.
Flexible work hours for a balanced lifestyle.
A collaborative team culture that values problem-solving, teamwork, and open communication.
Stable, long-term employment with a self-funded company that has over 10 years of experience solving HPC and GPU problems for clients worldwide.
Stream HPC is a flat organization in which you have the freedom and responsibility to do what you are good at and have an impact.
A supportive work environment featuring a 6-month onboarding period, followed by a 1-year contract, and thereafter an indefinite contract.
Specifically for the Amsterdam Office:
- Hybrid working, 2 days in the office.
- NS Business Card to our office located in Lelylaan, Amsterdam
- A minimum of 20 days of vacation.

We have provided various texts to help you get the information we think is useful:

Onboarding process describes the first 6 months of the job.
Self-assessment that tells you where you stand and thus what are the technical chances to get the job.

We’re a small company, but we invest more time in our application process than most. We’ve used feedback from past applicants to improve it step by step. Our goal is to help you through the process, not overwhelm you. If anything on this page isn’t clear, feel free to email us at jobs@streamhpc.com.

We don’t work for the war-industry

Posted by Vincent Hindriksen on 2 November 2018

Last week we emphasized that we don’t work for the war-industry. We did talk to a national army some years ago, but even though the project never started, we would have probably said no. Recently we got a new request, got uncomfortable and did not send a quote for the training.

https://twitter.com/StreamHPC/status/1055121211787763712

This is because we like to think about the next 100 years, and investment in weapons is not something that would solve things for the long term.

To those, who liked the tweet or wanted to, thank you for your support to show us we’re not standing alone here. Continue reading “We don’t work for the war-industry” →

Neil Trevett on OpenCL

Posted by Vincent Hindriksen on 21 April 2012 with 4 Comments

The Khronos Group gave some talks on their technologies in Shanghai China on the 17th of March 2012. Neil Trevett did some interesting remarks on the position of NVidia on OpenCL I would like to share with you. Neil Trevett is both an important member of Khronos and employee of NVidia. To be more precise, he is the Vice President Mobile Content of NVidia and the president of Khronos. I think we can take his comments serious, but we must be very careful as these are mixed with his personal opinions.

Regular readers of the blog have seen I am not enthusiastic at all about NVidia’s marketing, but am a big fan of their hardware. And exactly I am very positive they are bold enough in the industry to position themselves very well with the fast-changing markets of the upcoming years. Having said that, let’s go to the quotes.

All quotes were from this video. Best you can do is to start at 41:50 till 45:35.

http://www.youtube.com/watch?v=_l4QemeMSwQ

At 44:05 he states: “In the mobile I think space CUDA is unlikely to be widely adopted“, and explains: “A party API in the mobile industry doesn’t really meet market needs“. Then continues with his vision on OpenCL: “I think OpenCL in the mobile is going to be fundamental to bring parallel computation to mobile devices” and then “and into the web through WebCL“.

Also interesting at 44:55: “In the end NVidia doesn’t really mind which API is used, CUDA or OpenCL. As long as you are get to use great GPUs“. He ends with a smile, as “great GPUs” refers to NVidia’s of course. 🙂

At 45:10 he puts NVidia’s plans on HPC, before getting back to : “NVidia is going to support both [CUDA and OpenCL] in HPC. In Mobile it’s going to be all OpenCL“.

At 45:23 he repeats his statements: “In the mobile space I expect OpenCL to be the primary tool“.

Continue reading “Neil Trevett on OpenCL” →

Starting with GROMACS and OpenCL

Posted by Vincent Hindriksen on 15 November 2014

Now that GROMACS has been ported to OpenCL, we would like you to help us to make it better. Why? It is very important we get more projects ported to OpenCL, to get more critical mass. If we only used our spare resources, we can port one project per year. So the deal is, that we do the heavy lifting and with your help get all the last issues covered. Understand we did the port using our own resources, as everybody was waiting for others to take a big step forward.

The below steps will take no more than 30 minutes.

Getting the sources

All sources are available on Github (our working branch, bases on GROMACS 5.0). If you want to help, checkout via git (on the command-line, via Visual Studio (included in 2013, 2010 and 2012 via git plugin), Eclipse or your preferred IDE. Else you can simply download the zip-file. Note there is also a wiki, where most of this text came from. Especially check the “known limitations“. To checkout via git, use:

git clone git@github.com:StreamHPC/gromacs.git

Building

You need a fully working building environment (GCC, Visual Studio), and an OpenCL SDK installed. You also need FFTW. Gromacs installer can build it for you, but it is also in Linux repositories, or can be downloaded here for Windows. Below is for Linux, without your own FFTW installed (read on for more options and explanation):

mkdir build
cd build
cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release

There are several other options, to build. You don’t need them, but it gives an idea what is possible:

-DCMAKE_C_COMPILER=xxx equal to the name of the C99 compiler you wish to use (or the environment variable CC)
-DCMAKE_CXX_COMPILER=xxx equal to the name of the C++98 compiler you wish to use (or the environment variable CXX)
-DGMX_MPI=on to build using an MPI wrapper compiler. Needed for multi-GPU.
-DGMX_SIMD=xxx to specify the level of SIMD support of the node on which mdrun will run
-DGMX_BUILD_MDRUN_ONLY=on to build only the mdrun binary, e.g. for compute cluster back-end nodes
-DGMX_DOUBLE=on to run GROMACS in double precision (slower, and not normally useful)
-DCMAKE_PREFIX_PATH=xxx to add a non-standard location for CMake to search for libraries
-DCMAKE_INSTALL_PREFIX=xxx to install GROMACS to a non-standard location (default /usr/local/gromacs)
-DBUILD_SHARED_LIBS=off to turn off the building of shared libraries
-DGMX_FFT_LIBRARY=xxx to select whether to use fftw, mkl or fftpack libraries for FFT support
-DCMAKE_BUILD_TYPE=Debug to build GROMACS in debug mode

It’s very important you use the options GMX_GPU and GMX_USE_OPENCL.

If the OpenCL files cannot be found, you could try to specify them (and let us know, so we can fix this), for example:

cmake .. -DGMX_BUILD_OWN_FFTW=ON -DGMX_GPU=ON -DGMX_USE_OPENCL=ON -DCMAKE_BUILD_TYPE=Release \
  -DOPENCL_INCLUDE_DIR=/usr/include/CL/ -DOPENCL_LIBRARY=/usr/lib/libOpenCL.so

Then make and optionally check the installation (success currently not guaranteed). For make you can use the option “-j X” to launch X threads. Below is with 4 threads (4 core CPU):

make -j 4

If you only want to experiment, and not code, you can install it system-wide:

sudo make install
source /usr/local/gromacs/bin/GMXRC

In case you want to uninstall, that’s easy. Run this from the build-directory:

sudo make uninstall

Building on Windows, special settings and problem solving

See this article on the Gromacs website. In all cases, it is very important you turn on GMX_GPU and GMX_USE_OPENCL. Also the wiki of the Gromacs OpenCL project has lots of extra information. Be sure to check them, if you want to do more than just the below benchmarks.

Run & Benchmark

Let’s torture GPUs! You need to do a few preparations first.

Preparations

Gromacs needs to know where to find the OpenCL kernels, for both Linux and Windows. Under Linux type: export GMX_OCL_FILE_PATH=/path-to-gromacs/src/. For Windows define GMX_OCL_FILE_PATH environment variable and set its value to be /path_to_gromacs/src/

Important: if you plan to make changes to the kernels, you need to disable the caching in order to be sure you will be using the modified kernels: set GMX_OCL_NOGENCACHE and for NVIDIA also CUDA_CACHE_DISABLE:

export GMX_OCL_NOGENCACHE
export CUDA_CACHE_DISABLE

Simple benchmark, CPU-limited (d.poly-ch2)

Then download archive “gmxbench-3.0.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in the build/bin folder. If you have installed it machine wide, you can pick any directory you want. You are now ready to run from /path-to-gromacs/build/bin/ :

cd d.poly-ch2
../gmx grompp
../gmx mdrun

Now you just ran Gromacs and got results like:

Writing final coordinates.

           Core t (s)   Wall t (s)      (%)
 Time:        602.616      326.506    184.6
             (ns/day)   (hour/ns)
Performance:    1.323      18.136

Get impressed by the GPU (adh_cubic_vsites)

This experiment is called “NADP-DEPENDENT ALCOHOL DEHYDROGENASE in water”. Download “ADH_bench_systems.tar.gz” from ftp://ftp.gromacs.org/pub/benchmarks. Unpack it in build/bin.

cd adh_cubic_vsites
../gmx grompp -f pme_verlet_vsites.mdp
../gmx mdrun

If you want to run from the first GPU only, add “-gpu_id 0” as a parameter of mdrun. This is handy if you want to benchmark a specific GPU.

What’s next to do?

If you have your own experiments, ofcourse test them on your AMD devices. Let us know how they perform on “adh_cubic_vsites”! Understand that Gromacs was optimised for NVidia hardware, and we needed to reverse a lot of specific optimisations for good performance on AMD.

We welcome you to solve or report an issue. We are now working on optimisations, which are the most interesting tasks of a porting job. All feedback and help is really appreciated. Do you have any question? Just ask them in the comments below, and we’ll help you on your way.

Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

Posted by Vincent Hindriksen on 12 October 2011 with 6 Comments

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that attempts to get Radeon as a compute device and the GTX as primary all failed. The good news is that the other way around works pretty easy (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.

Continue reading “Installing both NVidia GTX and AMD Radeon on Linux for OpenCL” →

Performance can be measured as Throughput, Latency or Processor Utilisation

Posted by Vincent Hindriksen on 19 July 2016

40225151 - fiber optic cable — Getting data from one point to another can be measured in throughput and latency.

When you ask how fast code is, then we might not be able to answer that question. It depends on the data and the metric.

In this article I’ll give an overview of different ways to describe speed and what metrics are used. I focus on two types of data-utilisations:

Transfers. Data-movements through cables, interconnects, etc.
Processors. Data-processing. with data in and data out.

Both are important to select the right hardware. When we help our customers select the best hardware for their software,an important part of the advice is based on it.

Transfer utilisation: Throughput

How many bytes gets processed per second, minute or hour? Often a metric of GB/s is used, but even MB/day is possible. Alternatively items per second is used, when relative speed is discussed. An alternative word is bandwidth, which described the theoretical maximum instead of the actual bytes being transported.

The typical type of software is a batch-process – think media-processing (audio, video, images), search-jobs and neural networks.

It could be that all answers are computed at the end of the batch-process, or that results are given continuously. The throughput is the same, but the so called latency is very different.

Transfer utilisation: Latency

What is the time between the data-offering and the results? Or what is the reaction time? It is measured in time (often nanoseconds (ns, a billionth of a second), microsecond (μs, a millionth of a second) or milliseconds (ms, a thousandth of a second). When latency gets longer than seconds, its still called latency but more often it’s called “processing time”

This is important in streaming applications – think of applications in broadcasting and networking.

There are three causes for latency:

Reaction time: hardware/software noticing there is a job
Transport time: it takes time to copy data, especially when we talk GBs
Process time: computing the data can

When latency is most important we use FPGAs (see this short presentation on OpenCL-on-FPGAs) or CPUs with embedded GPUs (where the total latency between context-switching from and to the GPU is a lot lower than when discrete GPUs are used).

Processor utilisation: Throughput

Given the current algorithm, how much potential is left on the given hardware?

The algorithm running on the processor possibly is the bottleneck of the system. The metric we use for this balance is “”FLOPS per byte”. This means that the less data is needed per compute operation, the higher the chance that the algorithm is compute-limited. FYI: unless your algorithm is very inefficient, you should be very happy when you’re compute-limited.

resizedimage600300-rooflineai (1)

The below image shows how the above algorithms on the roofline-model. You see that for many processors you need to have at least 4 FLOPS per byte to hit the frequency-wall, else you’ll hit the bandwidth-wall.

roofline

This is why HBM is so important.

Processors utilisation: Latency

How fast can data get in and out of the processor? This sets the minimum latency that can be reached. The metric is the same as for transfers (time), but then on system level.

For FPGAs this latency can be very low (10s of nanoseconds) when data-cables are directly connected to the FPGA-chip. Such FPGAs are on a board with i.e. a network-port and/or a DisplayPort-port.

GPUs depend on how well they’re connected to the CPU. As this is a subject on its own, I’ll discuss in another post.

Determining the theoretical speed of a system

A request “Make this hardware as fast as possible” is a lot easier (and cheaper) to solve than “Make this hardware as fast as possible on hardware X”. This is because there is no one fastest hardware (even though vendors make believe us so), there is only hardware most optimal for a specific algorithm.

When doing code-reviews, we offer free advice on which hardware is best for the target algorithm, for the given budget and required power-envelope. Contact us today to access our knowledge.

Aparapi: OpenCL in Java

Posted by Vincent Hindriksen on 3 August 2011 with 2 Comments

Edit: Aparapi has been open sourced and many issues have already been fixed and improved.

If you have an AMD GPU/APU, you should try Aparapi. This software lets you write OpenCL-code in Java pretty high-level. The idea is that is sort of that it processes the Java intermediate code to search for loops and then create optimised OpenCL-kernels. Just download Aparapi and try the two examples. As the current version is still in alpha, it is not flawless yet. What I think is important when having worked with Aparapi is that you learn how to keep it simple – like you know that you can gain most speed on straight roads and turns slow down.

The Aparapi-team tries to avoid explicit defining of local memory, but it is still possible by using the @Local annotation. Such decisions show the team wants Aparapi to be high-level. It also integrates well with JavaCL and JOCL, so for the kernels you already have created, you can mix. You can also check out a video introducing Aprapi (it is video 15, if #-linking doesn’t work).

Time to create your own project. As not all errors are documented or are solved in the upcoming version, below you will find a list of common errors and how to easily solve them.

Continue reading “Aparapi: OpenCL in Java” →

StreamHPC’s Newsletter

Posted by Vincent Hindriksen on 20 May 2011

Wanting to know what really happens in the world of OpenCL? StreamHPC’s monthly newsletter is the most complete and independent source around the business and techniques around OpenCL. Subscribe, because the written news doesn’t always end up on this blog.

StreamHPC hates spam and will use the subscription-information only for the newsletter.
[mailchimpsf_form]

I hope you enjoy it!

StreamHPC flirts with ARM

Posted by Vincent Hindriksen on 8 March 2012 with 2 Comments

With the launch of twitter-channel @OpenCLonARM we now officially show a strong interest in ARM for compute. And we are not the only ones, as the twitter already has 80 followers (60 in 1.5 day and 12 retweets of the welcome-message).

ARM has made tremendous progress in both technology and market-share. With ARM-64, companies like NVidia (and maybe AMD) in the field, X86 seems to be getting a real competitor. This could happen because since a few years computers are fast enough and are not being replaced by a faster one, but a smaller one (tablet, phone) or extra one. By the rules of the market, current technologies are replaced by the ones that give those other needs. ARM is fast (enough), flexible in design, very cheap, low-power and passively cooled. The biggest obstacle seems to be only getting a standard for a docking-station to connect your mobile, tablet or watch to keyboard, mouse and large screen.

OpenCL is perfect for ARM, as it gives the computation-power to the intensive computations not already covered by hardware-support. In the world of X86 this interests high performance and big data companies, where on ARM this interests also more. Without the need for OpenCL you can already watch HD video, with OpenCL you can encode the video with MP4. This year you will certainly hear more about new possibilities of OpenCL on ARM.

What do you think. Why does Intel not sell IP to ARM-companies as many technologies could be reused? Could Intel be the next ARM as an IP-seller, or will they stay the defender of X86 for many years to come?

streamhpc.com is not affiliated with ARM.

Partner up with StreamHPC for Horizon 2020!

Posted by Vincent Hindriksen on 20 November 2013

For those working in a research-department at a company or university within the EU, Horizon 2020 might be a bit of a familiar sound.

For us this is an important program and a source possibilities for collaboration in the coming years. Our expertise in enabling ultra-fast computations, combined with your expertise can make Europe more competitive. We are interested in applied GPGPU, in the commercialization of tools, in co-developing new software with SMEs and also in universities based on the EU, Switzerland or Israel.

Fields Europe wants to focus on:

Micro- and nano-electronics; photonics
Nanotechnologies
Advanced materials
Biotechnology
Advanced manufacturing and processing

The development of these technologies requires a multi-disciplinary knowledge and a capital-intensive approach.

Each of these industries has opportunities for using GPus and Accelerators.

Events are organised throughout Europe to inform universities and companies about the programme. Are you a Dutch university or company? Check this site of the Dutch government.

Do you want to join StreamHPC?

Posted by Vincent Hindriksen on 4 April 2018

As of this month Stream exists 8 years. 8 full years of helping our customers with fast software.In Chinese numerology 8 is a very lucky number, and we notice that.

Over the years we’ve kept focus on quality and that was a good decision. The only problem is that we don’t have enough time to write on the blog, to organise events or even send the “monthly” newsletter. With over 200 drafts for the blog (subjects that really should be shared), we need extra people to help us out.

Dear developers who are good with C,C++, OpenCL/CUDA and algorithms, please take a look at the following vacancies. I know you are frequenting our blog.

We’re also seeking an all-rounder that supports in daily operations, that includes management, customer contact, team-support, etc.

See below for more details.

We’re looking forward to your application! We accept both remote and Amsterdam-based.

PDFs of Monday 12 September

Posted by Vincent Hindriksen on 12 September 2011

As it got more popular that I shared my readings, I decided to put them on my site. I focus on everything that uses vector-processing (GPUs, heterogeneous computing, CUDA, OpenCL, GPGPU, etc). Did I miss something or you have a story you want to share? Contact me or comment on this article. If you tell others about these projects you discovered here, I would appreciate you mention my website or my twitter @StreamHPC.

The research-papers have their authors mentions; the other links can be presentations or overviews of (mostly) products. I have read all, except the long PhD-theses (which are on my non-ad-hoc reading-list) – drop me any question you have.

Bullet Physics, Autodesk style. AMD and Autodesk on integrating Bullet Physics engine into Maya.

MERCUDA: Real-time GPU-based marine scene simulation. OpenCL has enabled more realistic sea and sky simulation for this product, see page 7.

J.P.Morgan: Using Graphic Processing Units (GPUs) in Pricing and Risk. Two pages describing OpenCL/CUDA can give 10 to 100 times speedup over conventional methods.

Parallelization of the Generalized Hough Transform on GPU (Juan Gómez-Luna1a, José María González-Linaresb, José Ignacio Benavidesa, Emilio L. Zapatab and Nicolás Guil). Describing two parallel methods for the Fast Generalized Hough Transform (Fast GHT) using GPUs, implemented in CUDA. It studies how load balancing and occupancy impact the performance of an application on a GPU. Interesting article as it shows that you can choose in which limits you bump into.

Performance Characterization and Optimization of Atomic Operations on AMD GPUs (Marwa Elteir, Heshan Lin and Wu-chun Feng). Measurement of the impact of using atomic operations on AMD GPUs. It seems that even mentioning ‘atomic’ puts the kernel in atomic mode and has major influence on the performance. They also come up with a solution: software-based atomic operation. Work in progress.

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing (Mayank Daga, Ashwin M. Aji, and Wu-chun Feng). Another one from Virginia Tech, this time on AMD’s APUs. This article measures its performance via a set of micro-benchmarks (e.g., PCIe data transfer), kernel benchmarks (e.g., reduction), and actual applications (e.g., molecular dynamics). Very interesting to see in which cases discrete GPUs have a disadvantage even with more muscle power.

A New Approach to rCUDA (José Duato, Antonio J. Peña, Federico Silla1, Juan C. Fernández, Rafael Mayo, and Enrique S. Quintana-Ort). On (remote) execution of CUDA-software within VMs. Interesting if you want powerful machines in your company to delegate heavy work to, or are interested in clouds.

Parallel Smoothers for Matrix-based Multigrid Methods on Unstructured Meshes Using Multicore CPUs and GPUs (Vincent Heuveline, Dimitar Lukarski, Nico Trost and Jan-Philipp Weiss). Different methods around 8 multi-colored Gauß-Seidel type smoothers using OpenMP and GPUs. Also some words on scalability!

Visualization assisted by parallel processing (B. Lange, H. Rey, X. Vasques, W. Puech and N. Rodriguez). How to use GPGPU for visualising big data. An important factor of real-time data-processing is that people get more insight in the matter. As an example they use temperatures in a server-room. As I see more often now, they benchmark CPU, GPU and hybrids.

A New Tool for Classification of Satellite Images Available from Google Maps: Efficient Implementation in Graphics Processing Units (Sergio Bernabéa and Antonio Plaza). 30 times speed-up with a new parallel implementation of the k-means unsupervised clustering algorithm in CUDA. Ity is used for classification of satellite images.

TAU performance System. Product-presentation of TAU which does, among other things, parallel profiling and tracing. Support for CUDA and OpenCL. Extensive collection of tools, so worth to spend time on. An paper released in March describes TAU and compares it with two other performance measurement systems: PAPI and VamirTrace.

An Experimental Approach to Performance Measurement of Heterogeneous Parallel Applications using CUDA (Allen D. Malony, Scott Biersdorff, Wyatt Spear and Shangkar Mayanglambam). Using a TAU-based (see above) tool TAUcuda this paper describes where to focus on when optimising heterogeneous systems.

Speeding up the MATLAB complex networks package using graphic processors (Zhang Bai-Da, Wu Jun-Jie, Tang Yu-Hua and Li Xin). Free registration required. Their conclusion: “In a word, the combination of GPU hardware and MATLAB software with Jacket Toolbox enables high-performance solutions in normal server”. Another PDF I found was: Parallel High Performance Computing with emphasis on Jacket based computing.

Profile-driven Parallelisation of Sequential Programs (Georgios Tournavitis). PhD-thesis on a new approach for extracting and exploiting multiple forms of coarse-grain parallelism from sequential applications written in C.

OpenCL, Heterogeneous Computing, and the CPU. Presentation by Tim Mattson of Intel on how to use OpenCL with the vector-extensions of Intel-processors.

MMU Simulation in Hardware Simulator Based-on State Transition Models (Zhang Xiuping, Yang Guowu and Zheng Desheng). It seems a bit off-chart to have a paper on the Memory Management Unit of a ARM, but as the ARM-processor gets more important some insights on its memory-system is important.

Multi-Cluster Performance Impact on the Multiple-Job Co-Allocation Scheduling (Héctor Blanco, Eloi Gabaldón, Fernando Guirado and Josep Lluí Lérida). This research-group has developed a scheduling-technique, and in this paper they discuss in which situations theirs works better than existing techniques.

Convey Computers: Putting Personality Into High Performance Computing. Product-presentation. They combine X86-CPUs with pre-programmed FPGAs to get high though-put. In short: if you make heavy usage of the provided algorithms, then this might be an alternative to GPGPU.

High-Performance and High-Throughput Computing. What it means for you and your research. Presentation by Philip Chan of Monash University. Though the target-group is their own university, it gives nice insights on how it goes around on other universities and research-groups. HPC is getting cheaper and accepted in more and more types of research.

Bull: Porting seismic software to the GPU. Presentation for oil-companies on finding new oil-fields. These seismic calculations are quite computation-intensive and therefore portable HPC is needed. Know StreamHPC is also assisting in porting such code to GPUs.

Dymaxion: Optimizing Memory Access Patterns for Heterogeneous Systems (Shuai Che, Jeremy W. Sheaffer and Kevin Skadron). This piece of software allows CUDA-programmers to optimize memory mappings to improve the efficiency of memory accesses on heterogeneous platforms.

Real-time volumetric shadows for dynamic rendering (MsC-thesis of Alexandru Teodor V.L. Voicu). Self-shadowing using the Opacity Shadow Maps algorithm is not fit for real-time processing. This thesis discusses Bounding Opacity Maps, a novel method to overcome this problem. Including code at the end, which you can download here.

Accelerating Foreign-Key Joins using Asymmetric Memory Channels (Holger Pirk, Stefan Manegold and Martin Kersten). Shows how to accelerate Foreign-Key Joins by executing the random table lookups on the GPU’s VRAM while sequentially streaming the Foreign-Key-Index through the PCI-E Bus. Very interesting on how to make clever usage of I/O-bounds.

Come back next Monday for more interesting research papers and product presentations. If you have questions, don’t hesitate to contact StreamHPC.

We’re looking for an intern to do the cool stuff: benchmarking and Linux wizarding

Posted by Vincent Hindriksen on 28 September 2014

We have some embedded devices here, which badly need attention. Some have gotten some private time on the bench, but we did not share anything on the blog yet with our readers. We simply need some extra hands to do this. Because it’s actually cool to do, but admittedly a bit boring when doing several devices, it was the perfect job for an intern. Besides the benchmarking, we have some other Linux-related projects for you. You’ll get an average payment for an internship in the Netherlands (in Dutch: “stagevergoeding”), lunch, a desk and a bunch of devices (aka toys-for-techies).

Like more companies in the Netherlands, we don’t care about how you where born, but who you are as a person. We expect from you that you…

know everything about Linux administration, from servers to embedded devices.
know how to setup a benchmark.
document all what you do, not only the results.
speak and write Dutch and English.
have great humor! (Even if you’re the only one who laughs at your jokes).
study in the EU, or can arrange the paperwork to get to the EU yourself.
have a place to live/crash in or nearby Amsterdam, or don’t mind the daily travelling. You cannot sleep in the office.

Together with your educational institute we’ll discuss the exact learning goals of the internship, and make a plan for a period of 3 to 6 months.

If you are interested, send a mail to jobs@streamhpc.com. If you know somebody who would be interested, please tell that person that we’re waiting for him/her! Also tips&tricks on finding the right person are very welcome.

Updated: OpenCL and CUDA programming training – now online

Posted by Vincent Hindriksen on 27 January 2020

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

Four days of training online
Free code-review after the training, to get feedback on what you created with the new knowledge;
1 month of limited support, so you can avoid StackOverflow;
Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading →

Starting with Altera OpenCL-on-FPGAs

Posted by Vincent Hindriksen on 3 July 2013 with 2 Comments

Altera has been very busy adding resources and has kicked off the beginning of June with opening up their OpenCL-program for the general public.
Only Stratix V devices are supported, but that could change later.

Below are all pages and PDFs concerning OpenCL I’ve found while searching Altera’s website.

Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms

Altera wanted to know where they could compete with GPUs and CPUs. For a big company their comparisons are quite honest (for instance about their limited access-speed to memory), but they don’t tell everything – like the hours(!) of compilation-time. The idea is that you develop on a GPU and when it’s correct, you port the correctly working software to the FPGA.

If you don’t have any experience working with their FPGAs, best is to ask around.

Medical_OpenCL — Image taken from Altera website.

Continue reading “Starting with Altera OpenCL-on-FPGAs” →

28 June: OpenCL course in Utrecht, NL

Posted by Vincent Hindriksen on 26 May 2011

At 28 June 2011 StreamCompting will give a 1-day course on OpenCL in Utrecht. As it is quite new, the priced is reduced. Also if you want to learn CUDA or any other GPGPU-language, this course is also a good option for you. The most important thing about GPGPU are the concepts. In other words the “why” they chose to make GPGPU-languages ike this. In my course you will get it after a one-day training. Most of the day consists of lectures with a short lab-sessions. The training makes use of a unique block-method, so you learn the technique top-down and almost can fill in the spaces yourself. At least 2 years of thorough programming-experience in Java, C++ or Objective C is preferred, because of the level of the subjects. The following is discussed with the big why-question as leading:

[list1]

OpenCL debunked: getting to understand how OpenCL is engineered.
Algoritms: which can be sped-up with GPGPU/OpenCL and which not.
Architectures & Optimalisations: why does one OpenCL-program work better on one architecture and not on another.
Software-engineering: wrapper-languages, code re-use and integration in existing software.
Debugging: not the screenshots, but giving you insight in how the memory-models work.

[/list1]

The lab-sessions are very minimal; you get (fully documented) homework which you can do the subsequent week (with assistance via mail). If you prefer to have extensive lab-sessions, please inform to the possibilities. After the session and the homework you’ll be able to decide on your own what kind of software can be sped up by using OpenCL and which not. Als you will be able to integrate OpenCL into your own software and engineer OpenCL-kernels. Note that the advances you make depend heavily on your seniority in programming. If all attendees are Dutch, it is given in Dutch. Future sessions will be in other cities, so if you prefer to receive training more local or at your company, please ask for the possibilities.

If you want more information, contact us.

Trainings Calendar

Every month StreamHPC offers an OpenCL training in Mathematics or Media-operations. The target is OpenCL 1.2 (AMD, NVidia, Intel) or OpenCL 2.0 on request. All trainings will be given by experienced OpenCL developers/trainers.

If you have any question, just ask. Call +31854865760 or use the contact form.

[columns]
[one_half title=”Trainings overview”]

A training takes 3½ days, from basics on the first morning to special requests on the 4th morning. You’ll have to bring your own laptop, or you login to a compute-server.

The Media-operations module is based on “Heterogeneous Computing with OpenCL, second edition” by Benedict Gaster, Lee Howes, David R. Kaeli, Perhaad Mistry & Dana Schaa.

It covers convolution, video-processing, histogram and mixed particle simulation. Extra subjects are OpenCL-OpenGL interop and code-optimisation.

It is a fit if you work with images, sound and video.

The Mathematics module is based on “OpenCL in Action” by Matthew Scarpino.

It covers reduction, sorting, matrix-operations and signal processing.

If you work on graphs, matrices and data-manipulation, this is for you.

Locations

The default location is Amsterdam. It has a regular and cheap flight-connection to all mayor cities in Europe, good public transport and relatively good hotels and hostels.

Costs

The costs are €1950,- per person (this includes a copy of the book), when at least 5 people have subscribed to the course.

If you want a private, in-house trainings, please contact us for more information. Trainings outside Europe (Russia, Asia, Americas) are possible, but travel-costs will be added when the group is under 6 people.

[/one_half]

[one_half title=”Tranings Agenda”]

[events_list]

[/one_half]
[/columns]

Apply by using the contact form and by filling in the pre-training questionnaire. Alternatively, you can call to +31645400456.

Using Qt Creator for OpenCL

Posted by Vincent Hindriksen on 25 May 2010 with 2 Comments

More and more ways are getting available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. Since a while my favourite way to go is Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: ~~http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt~~ http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the the commercial version of Qt Creator costs at least €2995,- a year. I must also warn the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, MAC and the other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt, really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL, by using GIT:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can download it here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and MAC you need to have the ‘build-essentials’. For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for the Windows-OS. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL-headers by the SDK of ATI or NVidia (you’ll need the SDK anyway), or by Khronos. By example, on my laptop (NVIDIA, Ubuntu 32bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On MAC the directory is not CL, but OpenCL – I haven’t tested it if Qt took that into account.

After building , test it by setting a environment-setting “LD_LIBRARY_PATH” to the lib-directory in the plugin, and run the provided example-app ‘clinfo’. By example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/

<location of OpenCL-SDK include-files>

<location of opencl-plugin>/src/opencl/

By example:

LIBS += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/

/usr/local/cuda/include/

/opt/qt-opencl/src/opencl/

The following screenshot shows how it could look like:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:location of opencl-plugin>:<location of OpenCL-SDK libraries>:

By example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:n"); 
QList platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) { 
   printf("    Platform ID       : %ldn", long(platform.platformId())); 
   printf("    Profile           : %sn", platform.profile().toLatin1().constData()); 
   printf("    Version           : %sn", platform.version().toLatin1().constData()); 
   printf("    Name              : %sn", platform.name().toLatin1().constData()); 
   printf("    Vendor            : %sn", platform.vendor().toLatin1().constData()); 
   printf("    Extension Suffix  : %sn", platform.extensionSuffix().toLatin1().constData());  
   printf("    Extensions        :n");
} QStringList extns = platform.extensions(); 
foreach (QString ext, extns) printf("        %sn", ext.toLatin1().constData()); printf("n");

If it gives errors during programming (underlined includes, etc), focus on INCLUDEPATH in the project-file. If it complaints when building the application, focus on LIBS. If it complaints when running the successfully built application, focus on LD_LIBRARY_PATH.

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.

Apple Metal versus Vulkan + OpenCL 2.2

Posted by Vincent Hindriksen on 8 June 2015 with 21 Comments

Update: C++ has been moved from OpenCL 2.1 to 2.2, being released in 2017. Title and text have been updated to reflect this. A reason why Apple released Metal might be the reason that Khronos was too slow in releasing C++ kernels into OpenCL, given the delays.

Apple Metal in one sentence: one queue for both OpenCL and OpenGL, using C++11. They now brought it to OSX. The detail they don’t tell: that’s exactly what the combination of Vulkan + OpenCL ~~2.1~~ 2.2 does. Instead it is compared with OpenCL 1.x + OpenGL 4.x, which it certainly can compete with, as that combination doesn’t have C++11 kernels nor a single queue.

Apple Metal on OSX – a little too late, bringing nothing new to the stage, compared to SPIR and OpenCL ~~2.1~~ 2.2.

The main reason why they can’t compete with the standards, is that there is an urge to create high-level languages and DSLs on top of lower-level languages. What Apple did, was to create just one and leaving out the rest. This means that languages like SYCL and C++AMP (implemented on top of SPIR-V) can’t simply run on OSX, and thus blocking new innovations. To understand why SPIR-V is so important and Apple should adopt for that road, read this article on SPIR-V.

khronos-SPIR-V-flowchart — Metal could compile to SPIR-V, just like OpenCL-C++ does. The rest of the Metal API is just like Vulkan.

Yet another vendor lock-in?

Now Khronos is switching its two most important APIs to the next level, there is a short-term void. This is clearly the right moment for Apple to take the risk and trying to get developers interested in their new language. If they succeed, then we get the well-known “pffff, I have no time to port it to other platforms” and there is a win for Apple’s platforms (they hope).

Apple has always wanted to have a different way of interacting with OpenCL-kernels using Grand Central Dispatch. Porting OpenCL between Linux and Windows is a breeze, but from and to OSX is not. Discussions over the past years with many people from the industry thought me one thing: Apple is like Google, Microsoft and NVidia – they don’t really want standards, but want 100% dedicated developers for their languages.

Yes, now also Apple is on the list of Me-too™ languages for OpenCL. We at StreamHPC can easily translate your code from and too Metal, but we would like it that you can put your investments in more important matters like improving the algorithms and performance.

Still OpenCL support on OSX?

Yes, but only OpenCL 1.2. A way to work around is to use SPIR-to-Metal translators and a wrapper from Vulkan to Metal – this will not make it very convenient though. The way to go, is that everybody starts asking for OpenCL 2.0 support on OSX forums. Metal is a great API, but that doesn’t change the fact it’s obstructing standardisation of likewise great, open standards. If they provide both Metal and Vulkan+OpenCL ~~2.1~~ 2.2 then I am happy – then the developers have the choice.

Metal debuts in “OSX El Capitan”, which is available per today to developers, and this fall to the general public.

OpenCL on Altera FPGAs

Posted by Vincent Hindriksen on 4 October 2012

On 15 November 2011 Altera announced support for OpenCL. The time between announcements for having/getting OpenCL-support and getting to see actually working SDKs takes always longer than expected, so to get this working on FPGAs I did not expect anything before 2013. Good news: the drivers are actually working (if you can trust the demos at presentations).

There have been three presentations lately:

In this article I share with you what you should not have missed on these sheets, and add some personal notes to it.

Is OpenCL the key that finally makes FPGAs not tomorrow’s but today’s technology?

Continue reading “OpenCL on Altera FPGAs” →

http://www.flickr.com/photos/imabug/2946930401/

OpenCL Potentials: Medical Imaging

Posted by Vincent Hindriksen on 13 December 2010 with 2 Comments

When you ever saw a CT or MRI scanner, you might have noticed the full-sized computer next to it (especially the older ones). There is quite some processing power needed to keep up with the data-stream coming from the scanner, to process the data to a 3D-image and to visualise the data on a 2D-screen. Luckily we have OpenCL to make it even faster; which doctor doesn’t want real-time high-resolution results and which patient doesn’t want to see the results on Apple iPad or Samsung Galaxy Tab?

Architects, bankers and doctors have one thing in common: they get a better feeling for the current subject if they can play with the data. OpenCL makes it possible to process data much faster and thus let the specialist play with it. The interesting part of IT is that it is in every domain now and therefore a new series: OpenCL-potentials.