How to get full CMake support for AMD HIP SDK on Windows – including patches

Posted by Máté Ferenc Nagy-Egri on 1 August 2023

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Compiling device code offline and feeding it to HIP’s Module API was attainable, yet cumbersome. The anticipation is driven by HIP’s single-source compilation model, borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches” →

N-Queens project from over 10 years ago

Posted by Vincent Hindriksen on 10 August 2023

Why should you just delve into porting difficult puzzles to the GPU? To learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you did not pick one, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – till then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions to the N-Queens problem for N = 1 to 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
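To make the sequence concrete, here is a minimal CPU-only sketch in Python of the classic bitmask backtracking approach to counting solutions. It is not the Aparapi/OpenCL code this article is about, and the name `count_n_queens` is only illustrative. It reproduces the first eight numbers of the list above.

```python
def count_n_queens(n: int) -> int:
    """Count all ways to place n non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column; all bits set means every row is filled

    def place(cols: int, diag_l: int, diag_r: int) -> int:
        # cols:   columns already occupied by a queen
        # diag_l: squares in the current row attacked along one diagonal direction
        # diag_r: squares in the current row attacked along the other direction
        if cols == full:
            return 1  # a queen was placed in every row
        count = 0
        free = full & ~(cols | diag_l | diag_r)  # safe squares in this row
        while free:
            bit = free & -free  # pick the lowest free square
            free ^= bit
            count += place(
                cols | bit,
                ((diag_l | bit) << 1) & full,  # diagonals shift by one column per row
                (diag_r | bit) >> 1,
            )
        return count

    return place(0, 0, 0)


if __name__ == "__main__":
    print([count_n_queens(n) for n in range(1, 9)])
    # prints [1, 0, 0, 2, 10, 4, 40, 92]
```

The work grows very quickly with N, which is why the larger entries in the list need a parallel and heavily optimized implementation.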

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half of that. I knew it did not use the full potential of the GPU, because bitcoin-mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and shows that the hard part of the job is that most of the energy goes into resultless preparations.

Continue reading “N-Queens project from over 10 years ago” →

NVIDIA ended their support for OpenCL in 2012

Posted by Vincent Hindriksen on 10 September 2012 with 19 Comments

If you are looking for the samples in one zip-file, scroll down. The removed OpenCL-PDFs are also available for download.

This sentence “NVIDIA’s Industry-Leading Support For OpenCL” was proudly used on NVIDIA’s OpenCL page last year. It seems that NVIDIA saw a great future for OpenCL on their GPUs. But when CUDA began borrowing the idea of using LLVM for compiling kernels, NVIDIA’s support for OpenCL slowly started to fade instead. Since with LLVM CUDA-kernels can be loaded in OpenCL and vice versa, this could have brought the two techniques more together.

What is the cause of this decreased support for OpenCL? Did they suddenly become aware that LLVM would decrease any advantage of CUDA over OpenCL, and therefore decrease their support for OpenCL? Or did they decide so long ago, as their last OpenCL-conformant product on Windows is from July 2010? We cannot be sure, but we do know NVIDIA does not have an official statement on the matter.

The latest action demonstrating NVIDIA’s reduced support of OpenCL is the absence of the samples in their GPGPU-SDK. NVIDIA removed them without notice or clear statement on their position on OpenCL. Therefore we decided to start a petition to get these OpenCL samples back. The only official statement on the removal of the samples was on LinkedIn:

All of our OpenCL code samples are available at http://developer.nvidia.com/opencl, and the latest versions all work on the new Kepler GPUs.
They are released as a separate download because developers using OpenCL don’t need the rest of the CUDA Toolkit, which is getting to be quite large.
Sorry if this caused any alarm, we’re just trying to make life a little easier for OpenCL developers.

Best regards,

Will.

William Ramey
Sr. Product Manager, GPU Computing
NVIDIA Corporation

Continue reading “NVIDIA ended their support for OpenCL in 2012” →

Privacy Policy

Who we are

We are a group of companies, based in the Netherlands, Hungary and Spain. We help our customers get their code to run fast by optimizing the computations and using accelerators. We have been doing this since 2010.

Comments

When visitors leave comments on the site we collect the data shown in the comments form, and also the visitor’s IP address and browser user agent string to help spam detection.

An anonymised string created from your email address (also called a hash) may be provided to the Gravatar service to see if you are using it. The Gravatar service Privacy Policy is available here: https://automattic.com/privacy/. After approval of your comment, your profile picture is visible to the public in the context of your comment.

Forms

Form-data is sent to self-hosted software and is not read by any third party.

Tracking

We use anonymized tracking to find out:

Which pages are visited how often

Which subjects are popular

Which pages are clicked through

From which countries or states the visitors are

During a visit/session, you get a random ID.

Cookies

If you leave a comment on our site you may opt in to saving your name, email address and website in cookies. These are for your convenience so that you do not have to fill in your details again when you leave another comment. These cookies will last for one year.

Tracking cookies last for 24 hours.

Embedded content from other websites

Articles on this site may include embedded content (e.g. videos, images, articles, etc.). Embedded content from other websites behaves in the exact same way as if the visitor has visited the other website.

These websites may collect data about you, use cookies, embed additional third-party tracking, and monitor your interaction with that embedded content, including tracking your interaction with the embedded content if you have an account and are logged in to that website.

Who we share your data with

None of the data is shared with any third party. Marketing reports don’t contain any personal data.

How long we retain your data

If you leave a comment, the comment and its metadata are retained indefinitely. This is so we can recognize and approve any follow-up comments automatically instead of holding them in a moderation queue.

Anonymous tracking data is not thrown away, to find trends over the years.

What rights you have over your data

If you have left comments, you can request to receive an exported file of the personal data we hold about you, including any data you have provided to us. You can also request that we erase any personal data we hold about you. This does not include any data we are obliged to keep for administrative, legal, or security purposes.

Where your data is sent

Visitor comments and forms are checked through an automated spam detection service, ReCAPTCHA and Akismet.

Reporting problems

We are not in the business of monetizing user data, and believe in finding new customers through content.

As software and plugins change after updates, we are sometimes surprised that more is collected than we configured.

If anything is incorrect or not legal, please email to privacy@streamhpc.com. If you have generic questions, go to the contact page or email to info@streamhpc.com.

Big Data

Big data is a term for data so large or complex that traditional processing applications are inadequate. Challenges include:

capture, data-curation & data-management,
analysis, search & querying,
sharing, storage & transfer,
visualization, and
information privacy.

The term often refers simply to the use of predictive analytics or certain other advanced methods to extract value from data, and seldom to a particular size of data set. Accuracy in big data may lead to more confident decision making, and better decisions can result in greater operational efficiency, cost reduction and reduced risk.

At StreamHPC we’re focused on optimizing (predictive) analytic and data-handling software, as these tend to be slow. We have solved Big Data problems in two areas: real-time pre-processing (filtering, structuring, etc.) and analytics (including in-memory search on a GPU).

The 13 application areas where OpenCL and CUDA can be used

Posted by Vincent Hindriksen on 3 June 2013 with 16 Comments

Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need of auto-tuning.

Which algorithms map best to which accelerator? In other words: What kind of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group from VirginiaTech took a close look at which types of algorithms were a good fit for vector-processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here at StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named these algorithmic methods “dwarves”. With 6 more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.

Continue reading “The 13 application areas where OpenCL and CUDA can be used” →

OpenCL Fireworks

Posted by Vincent Hindriksen on 21 December 2010

I like and appreciate differences in the many cultures on our Earth, but also like to recognise very old traditions everywhere, to feel a sort of ancient bond. As a European citizen I’m quite familiar with the replacement of the weekly flowers with a complete tree, each December – and the burning of all those trees in January. Also the celebration of New Year falls on different dates, the Chinese new year being the best known (3 February 2011). We – internet-using humans – all know the power of nicely coloured gunpowder: fireworks!

Let’s try to explain the workings of OpenCL in terms of fireworks. The following data is not realistic, but gives a good idea on how it works.

Continue reading “OpenCL Fireworks” →

Avoiding false dependencies in only two steps

Posted by Vincent Hindriksen on 29 September 2012

Let’s approach the concept of programming through looking at the brain, the code and the computer.

The idea of a program lives in the brain of a programmer. The way to get the program to the computer is using a system process called coding. When the program coded on the computer and the program embedded as an idea in the brain are alike, the programmer is happy. When over time the difference between the brain-version and the computer-version grows, then we go for a maintenance phase (although this is still mostly from brain to computer).

When the coding-language or important coding-paradigms change, something completely different happens. In such a case the program in the brain is updated or altered. Humans are not good at that, or at least not many textbooks discuss how to change from one model to another.

In this article I want to discuss one of these new coding-paradigms: dependencies in parallel software.
Continue reading “Avoiding false dependencies in only two steps” →

5 types of loops you should avoid

Posted by Vincent Hindriksen on 8 April 2012 with 4 Comments

In “Separation of compute, control and transfer” I talked about node-wise programming as a method we should embrace instead of trying to unroll the existing loops. In this article I get into loops and discuss a few types and how they can be run in a parallel form. Dependency is the big variable in each type: the lower the dependency on previous iterations, the better it can be parallelised. Another variable is whether the iteration-dimensions are known before the loop is started.
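As a minimal sketch of the dependency point above (plain Python, with illustrative function names that are not taken from the article), compare a loop whose iterations are independent with one where each iteration needs the result of the previous one:

```python
def scale(values, factor):
    """Independent iterations: each output depends only on its own input,
    so the iterations could run in any order, or all at once in parallel."""
    return [v * factor for v in values]


def running_sum(values):
    """Dependent iterations: each output needs the total of the previous
    iteration, so a naive parallel split over the iterations is not possible."""
    total = 0
    out = []
    for v in values:
        total += v  # loop-carried dependency on the previous iteration
        out.append(total)
    return out


if __name__ == "__main__":
    print(scale([1, 2, 3], 2))        # prints [2, 4, 6]
    print(running_sum([1, 2, 3, 4]))  # prints [1, 3, 6, 10]
```

The first loop maps directly onto one work-item per element. The second needs to be restructured before it can run in parallel form.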

The more you think about it, the more you find that a loop is not a loop.

Continue reading “5 types of loops you should avoid” →

Waiting for Mobile OpenCL – Q1 2011

Posted by Vincent Hindriksen on 14 February 2011 with 3 Comments

About 5 months ago we started waiting for Mobile OpenCL. Meanwhile we had all the news around ARM at CES in January, and of course all those beta-programs made progress. And after a year of having “support“, we actually want to see the words “SDK” and/or “driver“. So who’s leading? Ziilabs, ImTech, Vivante, Qualcomm, FreeScale or newcomer nVIDIA?

Mobile phone manufacturers could have a big problem with the low-level access to the GPU. While most software can be sandboxed in some form, OpenCL can crash the phone. But on the other hand, if the program hasn’t taken down the developer’s test-phone, the chances are low it will take down any other phone. And there are also more low-level access-points to the phone. So let’s check what has happened until now.

Note: this article will be updated if more news comes from MWC ’11.

OpenCL EP

For mobile devices Khronos has specified a profile, which is optimised for (ARM) phones: OpenCL Embedded Profile. Read on for the main differences (taken from a presentation by Nokia).

Main differences

Adapting code for embedded profile
Added macro __EMBEDDED_PROFILE__
CL_PLATFORM_PROFILE capability returns the string EMBEDDED_PROFILE if only the embedded profile is supported
Online compiler is optional
No 64-bit integers
Reduced requirements for constant buffers, object allocation, constant argument count and local memory
Image & floating point support matches OpenGL ES 2.0 texturing
The extensions of full profile can be applied to embedded profile

Continue reading “Waiting for Mobile OpenCL – Q1 2011” →

Difference between CUDA and OpenCL 2010

Posted by Vincent Hindriksen on 22 April 2010 with 11 Comments

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

Most GPGPU-enthusiasts have heard of both OpenCL and CUDA. While there are more solutions, these have the most potential. Both techniques are very comparable, like a BMW and a Mercedes, but there are some differences. Since the technologies will evolve, we’ll take a look at the differences again next year. We’ve discussed this difference with a focus on marketing earlier this year.

Disclaimer: we have a strong focus on OpenCL (but actually for reasons explained in this article).

Terminology

If you have seen kernels of OpenCL and CUDA, you see the biggest difference might be the prefix “cl_” or the prefix “cu_”, but there is also a difference in terminology.

Matt Harvey (developer of Cuda2OpenCL-translator Swan) has summed up the differences in a presentation “Experiences porting from CUDA to OpenCL” (PDF):

CUDA term	OpenCL term
GPU	Device
Multiprocessor	Compute Unit
Scalar core	Processing element
Global memory	Global memory
Shared (per-block) memory	Local memory
Local memory (automatic, or local)	Private memory
kernel	program
block	work-group
thread	work item

As far as I know, the kernel-program is also called a kernel in OpenCL. Personally I like CUDA’s terms “thread” and “per-block memory” more. It is very clear CUDA targets the GPU only, while in OpenCL it can be any device.

Edit 2011-01-15: In a talk by Sami Rosendahl the differences are also discussed.

Speed-comparison

We would like to present you a benchmark between OpenCL and CUDA with a full comparison, but we don’t have enough hardware in-house to do a full benchmark. The information below is what we’ve found on the net, and is a little bit based on our own experience.

On NVidia hardware, OpenCL is up to 10% slower (see Matt Harvey’s presentation); this is mainly because OpenCL is implemented on top of the CUDA-architecture (this shouldn’t be a reason, but to say NVidia has put more energy in CUDA is just a wild guess also). On the ATI 4000-series OpenCL is just slow, but on the 5000-series it gives results very comparable to NVidia. The specialised streaming processors NVidia’s Tesla and AMD’s FireStream really bite each other, while the Playstation 3 unbelievably still wins on some tasks.

The architecture of AMD/ATI-hardware is very different from NVidia’s, and that’s why a kernel written with a specific brand or GPU in mind just performs better than a version which is not optimised. So if you do a benchmark, it really depends on which kernels you use for it. To be more precise: any benchmark can be written in favour of a specific architecture. Fine-tuning the software to work at maximum speed on current and future(!) hardware for different kinds of datasets is (still) a specialised task for that reason. This is also one of the current problems of GPGPU, but kernel-optimisers will get better.

If you like pictures, Hugh Merz comes to the rescue, who compared CUDA-FFT against FFTW (“the fastest FFT in the West”). The page is offline now, but it was clear that the data-transfer from and to the GPU is a huge bottleneck and Hugh Merz was rather sceptical about GPU-computing in 2007. He extended his benchmark with the PS3 and a Tesla-s1070 and now you see bigger differences. Since CPUs go multi-multi-core, you cannot tell how big this gap will be in the future; but you can tell the gap will be bigger and CPUs will more and more be programmed like GPUs (massively parallel).

What we learn from this is 1) that different devices will improve if the demands are more clear, and 2) that it will be all about specialisation, since different manufacturers will hear different demands. The latest GPUs from AMD work much better with OpenCL, the next might beat all others in many or only specific areas in 2011 – who knows? IBM’s Cell-processor is expected to enter the ring outside the home-brew PS3 render-farms, but with what specialised product? NVidia wants to enter high in the HPC-world, and they might even win it. ARM is developing multiple-core CPUs, but will it support OpenCL for a better FLOP/Watt than competitors?

It’s all about the choices manufacturers make, which way CUDA and OpenCL will develop.

Homogeneous vs Heterogeneous

For us this is the most important reason to have chosen OpenCL, even if CUDA is more mature. While CUDA only targets NVidia’s GPUs (homogeneous), OpenCL can target any digital device that has an input and an output (very heterogeneous). AMD/ATI and Intel are both on the path of making architectures that are heterogeneous; just like Systems-on-a-Chip (SoCs) based on an ARM-architecture. Watch for our upcoming article about ARM & SoCs.

While I was searching for more information about this difference, I came across a blog-item by RogueWave, which claims something different. I think they switched Intel’s architectures with NVidia’s, or the author knew things were going to change. The near future could bring us an x86-chip from NVidia. This will change a lot in the field, so more about this later. They already have an ARM-chip in their Tegra mobile processor, so NVidia/CUDA still has some big bullets.

Missing language-features

Java and .NET are very comparable, yet developers from both sides know very well that their favourite feature is missing in the other camp. Most of the time such a feature is an external library, just built in. Or is it taste? Or even a stack of soapboxes?

OpenCL has:

Task-parallel execution mode (to be used on CPUs) – not needed on NVidia’s GPUs.

CUDA has unique features too:

FFT library – so in OpenCL you need to have your own kernels for it.
~~Atomic operations – which make double-write threads easier to implement.~~
Hardware texture interpolation – OpenCL has to fall back to a larger kernel or OpenGL.
Templating – in OpenCL you have to create new kernels for every data-type.
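To illustrate the templating point: a common workaround, not described in this article, is to generate the kernel source text once per data-type on the host. Below is a minimal sketch in Python using only the standard library; the kernel name `add_<type>` and the helper `make_kernel_source` are hypothetical, and the generated strings would still have to be compiled by an OpenCL runtime.

```python
from string import Template

# OpenCL C source with a placeholder for the element type.
# The kernel itself is a hypothetical element-wise addition.
KERNEL_TEMPLATE = Template("""\
__kernel void add_${T}(__global const ${T}* a,
                       __global const ${T}* b,
                       __global ${T}* out)
{
    size_t i = get_global_id(0);
    out[i] = a[i] + b[i];
}
""")


def make_kernel_source(type_name: str) -> str:
    """Return the kernel source specialised for one data-type."""
    return KERNEL_TEMPLATE.substitute(T=type_name)


if __name__ == "__main__":
    # One kernel per data-type, generated from the same template.
    sources = {t: make_kernel_source(t) for t in ("float", "int")}
    print(sources["float"])
```

This keeps a single kernel body to maintain, at the cost of doing the substitution outside the language itself.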

In short CUDA certainly has made a lot of things just easier for the developer, but OpenCL has its potential in support for more than just GPUs. All differences are based on this difference in focus-area.

I’m pretty sure this list is not complete at all, and only explains the type of differences. So please come to the LinkedIn GPGPU Users Group to discuss this.

Last words

THIS ARTICLE IS VERY OUTDATED AND NOW SIMPLY UNTRUE FOR CERTAIN PARTS! NEW ARTICLE COMING UP.

As with other shared standards, there is no win and no gain in promoting it. If you promote it, a lot of companies thank you, but the Return-on-Investment is lower than when you have your own standard. So OpenCL is just used-as-it-is-available, while CUDA is highly promoted; for that reason more people invest in partnerships with NVidia to use CUDA instead of with the non-profit organisation Khronos. And eventually CUDA-drivers can be ported to IBM’s Cell-processors or to ARM, since it is very comparable to OpenCL. It really depends on the profit NVidia will make with such deals, so who can tell what will happen.

We still think OpenCL will win eventually on consumer-markets (desktop and mobile) because of support for more devices, but CUDA will stay a big player in professional and scientific markets because of the legacy software they are currently building up and the more friendly development-support. We hope they will both exist and help each other push forward, just like OpenGL vs DirectX, nVidia vs ATI, Europe vs the USA vs Asia, etc. Time will tell what features will eventually end up in each technology.

Update August 2012: due to higher demand StreamHPC is explicitly offering CUDA to OpenCL porting.

StreamComputing exists 5 years!

Posted by Vincent Hindriksen on 25 May 2015 with 3 Comments

In January 2010 I created the first steps of StreamComputing (redacted: rebranded to StreamHPC in 2017), by registering the website and writing a hello-world article. About 4 months of preparations and paperwork later the freelance-company was registered. Then 5 years later it got turned into a small company with still the strong focus on OpenCL, but with more employees and more customers.

I would like to thank the following people:

My parents and grand-mother for (financially) supporting me, even though they did not always understand why I was taking all those risks.
My friends, for understanding I needed to work in the weekends and evenings.
My good friend Laura for supporting me during the hard times of 2011 and 2012.
My girlfriend Elena for always being there for me.
My colleagues and OpenCL-experts Anca, Teemu and Oscar, who have done the real work the past year.
My customers for believing in OpenCL and trusting StreamComputing.

Without them, the company would never have existed. Thank you!

Continue reading “StreamComputing exists 5 years!” →

Meteorology & Climatology

Whether it is a short-term weather forecast or long-term climate change, predictions and modelling in meteorology and climatology generally happen on supercomputers. StreamHPC focuses on preparing algorithms for execution on GPUs and other accelerator hardware (e.g., Xeon Phi), which make up the smallest elements in the distributed compute architecture of supercomputers. Contact us to discuss what we can do for you.

OpenCL mini buying guide for X86

Posted by Vincent Hindriksen on 17 January 2011 with 6 Comments

Developing with OpenCL is fun, if you like debugging. Having software with support for OpenCL is even more fun, because no debugging is needed. But what would be a good machine? Below is an overview of what kind of hardware you have to think about; it is not in-depth, but gives you enough information to make a decision in your local or online computer store.

Companies who want to build a cluster, contact us for information. Professional clusters need different hardware than described here.

Continue reading “OpenCL mini buying guide for X86” →

PDFs of Monday 5 September

Posted by Vincent Hindriksen on 5 September 2011

Live from le Centre Pompidou in Paris: Monday PDF-day. I have never been inside the building, but it is a large public library where people are queueing to get in – no end to the knowledge-economy in Paris. A great place to read some interesting articles on the subjects I like.

CUDA-accelerated genetic feedforward-ANN training for data mining (Catalin Patulea, Robert Peace and James Green). Since I have some background on Neural Networks, I really liked this article.

Self-proclaimed State-of-the-art in Heterogeneous Computing (Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, and Olaf O. Storaasli). It is from 2010, but just got thrown on the net. I think it is a must-read on Cell, GPU and FPGA architectures, even though (as also remarked by others) Cell is not so state-of-the-art any more.

OpenCL: A Parallel Programming Standard for Heterogeneous Computing Systems (John E. Stone, David Gohara, and Guochun Shi). A basic and clear introduction to my favourite parallel programming language.

Research proposal: Heterogeneity and Reconfigurability as Key Enablers for Energy Efficient Computing. About increasing energy efficiency with GPUs and FPGAs.

Design and Performance of the OP2 Library for Unstructured Mesh Applications. CoreGRID presentation/workshop on OP2, an open-source parallel library for unstructured grid computations.

Design Exploration of Quadrature Methods in Option Pricing (Anson H. T. Tse, David Thomas, and Wayne Luk). Accelerating specific option pricing with CUDA. Conclusion: FPGA has the least Watt per FLOPS, CUDA is the fastest, and CPU is the big loser in this comparison. It must be mentioned that GPUs are easier to program than FPGAs.

Technologies for the future HPC systems. Presentation on how HPC company Bull sees the (near) future.

Accelerating Protein Sequence Search in a Heterogeneous Computing System (Shucai Xiao, Heshan Lin, and Wu-chun Feng). Accelerating the Basic Local Alignment Search Tool (BLAST) on GPUs.

PTask: Operating System Abstractions To Manage GPUs as Compute Devices (Christopher J. Rossbach, Jon Currey, Mark Silberstein, Baishakhi Ray, and Emmett Witchel). MS research on how to abstract GPUs as compute devices. Implemented on Windows 7 and Linux, but code is not available.

PhD thesis by Celina Berg: Building a Foundation for the Future of Software Practices within the Multi-Core Domain. It is about a Rupture-model described at Ch.2.2.2 (PDF-page 59). [total 205 pages].

Workload Balancing on Heterogeneous Systems: A Case Study of Sparse Grid Interpolation (Alin Murarasu, Josef Weidendorfer, and Arndt Bodes). In my opinion a very important subject, as this can help automate much-needed “hardware-fitting”.

Fraunhofer: Efficient AMG on Heterogeneous Systems (Jiri Kraus and Malte Förster). AMG stands for Algebraic MultiGrid method. Paper includes OpenCL and CUDA benchmarks for NVidia hardware.

Enabling Traceability in MDE to Improve Performance of GPU Applications (Antonio Wendell de O. Rodrigues, Vincent Aranega, Anne Etien, Frédéric Guyomarc’h, Jean-Luc Dekeyser). Ongoing work on OpenCL code generation from UML (Model Driven Design). [34-page PDF]

GPU-Accelerated DNA Distance Matrix Computation (Zhi Ying, Xinhua Lin, Simon Chong-Wee See and Minglu Li). DNA sequences distance computation: bit.ly/n8dMis [PDF] #OpenCL #GPGPU #Biology

And while browsing around for PDFs I found the following interesting links:

Say bye to Von Neumann. Or how IBM’s Cognitive Computer Works.
Workshop on HPC and Free Software. 5-7 October 2011, Ourense, Spain. Info via j.anhel@uvigo.es
Basic CUDA course, 10 October, Delft, Netherlands, €200,-.
Par4All: automatic parallelizing and optimizing compiler for C and Fortran sequential programs.
LAMA: Library for Accelerated Math Applications for C/C++.

A typical week

Primary and secondary tasks

The main focus is programming and solving problems. But that means that everything that obstructs this focus needs to be gotten out of the way. This is simpler on paper than in reality, and therefore there are multiple “faiths” among companies on how to do this.

We start by clearly distinguishing primary and secondary tasks, where the difference is that more time needs to be spent on the primary tasks in the long term. The last part of the sentence is very important.

What we do every day and week:

Planning
- Write issues
- Make issue estimations
- Prioritize issues
- Bundle issues in epics
- Pick issues for personal weekly milestones
Problem-solving
Coding and math
Learning
- Reading books
- Reading papers
- Watching videos

Why so much emphasis on planning?

The planning-part takes considerable time, but keeps us from spending too much time on dead ends. And spending time on dead ends is not a primary task at all. Planning also helps with designing better strategies – there is limited time for solving problems and coding software, so doing full-scope research is not going to work. As there is no way to efficiently build complex code without any time-estimations on the different approaches, planning-skills provide the necessary foundations for becoming a senior coder.

We start as early as possible to train these skills, so juniors are also asked to do all planning-tasks. Initially this takes a good part of the valuable coding-time, but this quickly goes down and the first advantages are seen.

Style of project handling

Tools

We mostly use Gitlab and Mattermost to share code and have discussions. This makes it possible to keep good track of each project – searching for what somebody said or coded two years ago is quite easy. Using modern tools has changed the way we work a lot, thus we have questioned and optimized everything that was presented as “good practice”.

We continuously look into new tools that can help us improve. Also here the main focus is to reduce the time on secondary tasks, so we can spend more time thinking on problem-solving.

Pull-style project management

The tasks are written down by the team, using the project-doc as input. All these tasks are put into the task-list of the project and estimated. Then each team member picks the tasks that are a good fit. There are always tasks that need to be pushed instead of pulled, but luckily that’s a relatively small part of all work.

All code (MR) is checked by one or two colleagues, chosen by the one who wrote the code. More important are the discussions in advance, as the group can give more insight than any individual and one can start the task well-prepared. The goal is not just to get the job finished, but to avoid having written the code in which a bug is later found.

All types of code can contain comments and Doxygen can create documentation automatically, so there is no need to copy functions into a Word-document. Log-style documentation was introduced, as git history and Doxygen don’t answer why a certain decision has been made. By writing down a logbook, a new member of the team can just read these remarks and fully understand why the architecture is how it is and what the limits are. We’ll discuss this in more detail later.

These types of solutions describe how we work and how we differ from a corporate environment: no-nonsense and effective.

The week

If you worked here, what would your week look like in the first year? We specifically say the first year, as different approaches could be chosen for more complex projects.

Monday weekly planning

Together with your team you pick up the issues for the week. The issues should have estimations, or these will be done during that meeting. When your week is filled, you know what to do.

Monday weekly meeting

Every Monday we have a weekly meeting to share with everybody how the other projects are doing.

Mon-Fri: Daily standup

Retrospective of the previous day, and tuning of the day ahead.

Practice:

Tools
C/C++
GPGPU
Scrum

Friday closing

Weekly retrospective, cleaning up, writing notes on issues, etc.

Weekly customer meetings

Here we discuss the progress and anything blocking. The customer shares their progress, and together problems can be solved.

Many projects have a shared (high-level) issue-list, so the progress is continuously synced with the customer and communication is easy.

Our Hiring Process

After you apply, we do a quick scan of your resume, which means that we look for keywords like CUDA, OpenCL, SYCL, GLSL, HLSL, Assembly, etc. Then, we try to assess you on seniority in CPU programming, GPU programming and overall project experience. Even if you have experience on personal/hobby projects, make sure you mention them on your resume. If you have the basic keywords we look for, our HR team will reach out to schedule a video call that focuses on cultural fit.

In short, we have 3 interviews and 1 assignment.

Cultural Check

During the video call, we’ll discuss how well you align with our company culture and values. We focus on learning your motivations and goals, and on understanding your current status. If you want to be successful in this call, be aware of your strengths and the soft skills that make you a good fit for the company, not just the technical aspects of the job. Also, make sure you have questions prepared. Know that it’s a fairly relaxed call of approximately 45 minutes, where we just want to get to know you.

Online CodeJudge Test

If you pass the cultural fit check, you’ll be invited to complete an online CodeJudge test to assess your C/C++ skills. The final score is one of many things we are looking at. Therefore, try to show your thinking process and all the problem-solving steps. The duration of the test is 4.5 hours, but it is completed in 2.5 hours on average.

Tech Talk

Candidates who achieve a high functionality score on the CodeJudge test will be invited to a technical interview: an hour-long live coding session with one of our tech team members. This interview mostly focuses on assessing your GPU programming skills. It is a highly interactive interview in which you will discuss approaches, tools, languages, technologies, and parts of code.

Deep Dive

The final interview is the deep dive where we get to know each other better, and give you a scenario-based problem-solving exercise with our founder so you can demonstrate your problem-solving skills. One of our team members joins to meet you and you will have a chance to ask some questions about the projects and day-to-day operations, and also the other way around.

After we complete the interview process, we ask for your references. While we check the references you provide, we will prepare your offer letter. Once everything is in order, we’ll extend a formal job offer and look forward to welcoming you to our team!

What We Offer

A salary within the market range.
A company laptop and needed equipment for your work.
Flexible work hours for a balanced lifestyle.
A collaborative team culture that values problem-solving, teamwork, and open communication.
Stable, long-term employment with a self-funded company that has over 10 years of experience solving HPC and GPU problems for clients worldwide.
Stream HPC is a flat organization in which you have the freedom and responsibility to do what you are good at and have an impact.
A supportive work environment featuring a 6-month onboarding period, followed by a 1-year contract, and thereafter an indefinite contract.
Specifically for the Amsterdam Office:
- Hybrid working, 2 days in the office.
- NS Business Card to our office located in Lelylaan, Amsterdam
- A minimum of 20 days of vacation.

We have provided various texts to help you get the information we think is useful:

Onboarding process describes the first 6 months of the job.
Self-assessment that tells you where you stand, and thus what your chances of getting the job are on the technical side.

We’re a small company, but we invest more time in our application process than most. We’ve used feedback from past applicants to improve it step by step. Our goal is to help you through the process, not overwhelm you. If anything on this page isn’t clear, feel free to email us at jobs@streamhpc.com.

We don’t work for the war-industry

Posted by Vincent Hindriksen on 2 November 2018

Last week we emphasized that we don’t work for the war-industry. We did talk to a national army some years ago, but even though the project never started, we would have probably said no. Recently we got a new request, got uncomfortable and did not send a quote for the training.

https://twitter.com/StreamHPC/status/1055121211787763712

This is because we like to think about the next 100 years, and investment in weapons is not something that would solve things for the long term.

To those who liked the tweet or wanted to: thank you for your support and for showing us we’re not standing alone here. Continue reading “We don’t work for the war-industry” →

The 12 latest Twitter Poll Results of 2018

Posted by Vincent Hindriksen on 10 November 2018

Via our Twitter channel we run various polls. We have not always shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. The first half of the year there were no polls, in case you wanted to know.

As inclusive polls are not focused (and thus difficult to answer), most polls are incomplete by design. Still, insights can be gained, or comments given.

The polls below have given us insight, and we hope they also give you insight into how our industry is developing. They are sorted by date, oldest first.

It was very interesting that the percentage of votes per choice did not change much after 30 votes. Even when it was retweeted by a large account, opinions had the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading “The 12 latest Twitter Poll Results of 2018” →

Company History

There are not many companies like Stream HPC. Most others are either a government institute for the national supercomputer, 1 or 2 freelancers or… actually not experienced with GPUs. So how did it start? How did we get a large team of HPC- and GPU-experts, working for customers worldwide?

2025

June 26

15 Years Celebration

We celebrated 15 years of Stream HPC with 25 people.

2020

November 1

Second office

The first choice was actually in Belgium, because it was closer to Amsterdam. Unfortunately that project did not succeed. By coincidence, we ended up in Budapest, and outgrew the office space in the first year.

August 1

First growth phase

Growth is hard, really hard. And we learned that, well, the hard way. Several decisions would now be made differently, but we adapted and continued. Some examples: (1) Investing in FPGAs too early. OpenCL-on-FPGAs was the next big thing, so based on what we were promised by vendors, we made the same promises to our customers. Many promises did not turn into reality. (2) Hiring the wrong people. Or: hiring people for whom we are the wrong company, as it goes both ways. We now define our culture, because we want people who fit our culture. (3-20) All the other things that are in the books under “early stage growth”.

2014

November 1

The first employee

There was still no stable income. Sales and marketing also took a lot of time, cutting into the time that could be spent on actual work. But slowly we got more traction – more people started to believe in the company’s vision. By the end of the year the first employee was hired, Anca. As the choice was to build a services company first instead of a products company, banks and investors were not even interested in providing financial support. We can now say that for the long term this was the best – we can now fully control our own strategies and invest in our own product development.

2013

November 1

Grandmother becomes investor

A gift of €4000 by Vincent’s grandmother, a landlord who was relaxed with late payments, the trust in the technology by early customers, and late payments to our creditors got us through. Never doing this again!

2011

April 1

What’s a GPU?

We now have a clear idea of what GPUs can do, but in 2010-2014 GPUs were still for graphics only in people’s minds. Selling was difficult – we even had a rejection in which it was stated that a GPU cannot be used for Compute, as it’s a “GGGGGGraphics Processing Unit”.

2010

April 1

A fresh start

From that bore-out the company was born the next year. There were two options: GPGPU (mostly OpenCL, a hobby) or building smart products for public transport. Two domains were bought, and the choice was made during the year. For public transport a proof-of-concept was made, but the choice fell on the really difficult work. Not much money was earned that year, and life was tough. The moment the first project was finished, the little government support had to be paid back, as the invoice was sent 2 weeks too early according to the conditions.

2009

October 13

The bore-out

For some reason it took well over 6 months to port it to .NET. As there was nothing to do in this job of full-time doing nothing, going to “work” became unbearable. As he did the reverse engineering and thus was the only one who understood the code, there was no option to leave the job. This ended in a bore-out: a depression comparable to a burn-out, caused by a lack of work.

April 14

The discovery of the thrill

Stream’s founder Vincent Hindriksen had to maintain a piece of software that was often failing to process the daily reports. After documenting the internals and algorithms of the code by interviewing the key people and some reverse engineering, it was a lot easier to create effective solutions for the bugs within the software. After fixing a handful of bugs, there was simply a lot less to do except reading books and playing online games. So why not rewrite the software in full? Three weeks later it no longer took 2.5 hours to process the data, but 19 seconds. The kick for performance optimization was ignited.

Want to know more? Go to our contacts page and ask any question.

Help us find our future COO

Posted by Vincent Hindriksen on 20 July 2018

*Is this a motto that goes with your personality? Then we want to talk with you.*

About 7 years ago we were still dealing with the usual peaks and lows of consultancy.

I’d like to get your help to find our future COO to help streamline this growth.

You might have seen that there are hardly any new blog posts – now you know why. By helping us find that special person, more time can be put into writing new blog posts again.

If you know the perfect person for this job in Amsterdam, please let them know there is this unique company looking for her or him. Sharing this blog-post would help a lot.

You can find more information in this job-post:

We all know that quality comes with attention to detail, but also that with growth the details are the first to be postponed. We seek help in handling daily operations during our growth. The most important tasks are:

Customer contact. You make sure the communication is regular and smooth with all our customers, making them more engaged and happy with us.

Sales follow up. You take over to discuss the needs of potential customers pre-sales has had contact with.

Team support. You help the development-teams to get even better by helping them to solve their daily and long-term problems.

The job is very broad, but is all around a listening ear and getting things done.

You have studied business administration or similar, and have a can-do attitude. You know how to work with technical people and are a real team-player. You understand how to develop and engage group dynamics.

If you think this job was written for you, then we would like to hear more from you! Send an email to jobs@streamhpc.com with a motivational letter and a list of relevant experience.

Thanks for helping out!

If you got sent here, we hope to hear from you!