N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles to the GPU, to learn GPGPU-languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked a puzzle yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU has that moment when they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions of the N-Queens problem for N = 1 to 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
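
For readers who want to check the start of that sequence themselves, here is a minimal CPU-only sketch of a bitmask backtracking counter in Python. It is not the Aparapi/OpenCL code this article is about, and the function name `count_n_queens` is made up for illustration. It only shows the kind of search that the GPU versions parallelise.

```python
def count_n_queens(n: int) -> int:
    """Count the ways to place n non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column; all bits set means every column is used

    def place(cols: int, diag_l: int, diag_r: int) -> int:
        # cols:   columns that already hold a queen
        # diag_l: squares in the current row attacked along one diagonal direction
        # diag_r: squares in the current row attacked along the other direction
        if cols == full:
            return 1  # a queen in every column, so also in every row
        total = 0
        free = full & ~(cols | diag_l | diag_r)  # safe squares in this row
        while free:
            bit = free & -free  # lowest safe square
            free ^= bit
            total += place(
                cols | bit,
                ((diag_l | bit) << 1) & full,  # diagonal attacks shift one column per row
                (diag_r | bit) >> 1,
            )
        return total

    return place(0, 0, 0)


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, count_n_queens(n))
```

Each recursion level handles one row, so the work grows very quickly with N. That is why the larger values in the list above need much more than a simple loop like this one.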

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia-hardware took half that time. I knew it did not use the full potential of the GPU, because bitcoin-mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and into the hard part of the job: most of the energy goes into resultless preparations.

Continue reading “N-Queens project from over 10 years ago”

The Fastest Payroll System Of The World

At StreamHPC we do several very different types of projects, but this project has been very, very different. In the first place, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but said that they were searching for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database – a speedup of about 250 times. With 100k employees, all calculations can be done in under 2 minutes – the above HR department would have liked that.

Continue reading “The Fastest Payroll System Of The World”

How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Compiling device code offline and feeding it to HIP’s Module API was attainable, yet cumbersome. Anticipation is driven by the single-source compilation model of HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches”

Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work of porting the CUDA code to HIP (described in the previous article), significant progress was made in tackling the low-hanging fruit in the kernels and any potential structural problems outside of the kernels.

Additionally, since the last article, we’ve been in touch with the authors of the original repository. They’ve even invited us to update their repository too. For now it will be on our repository only. We also learnt that the group’s lead, professor John Cavazos, passed away 2 years ago. We hope he would have liked that his work has been revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU-code we can quickly assess what the problems of the algorithm are, and see if we can get to high performance without too much effort. In this blog we show these steps.

Continue reading “Improving FinanceBench for GPUs Part II – low hanging fruit”

The Art of Benchmarking

How fast is your software? The simpler the software setup, the easier it is to answer this question. The more complex the software, the more the answer will be “it depends”. But just peek at F1-racing – the answer will depend on the driver and the track.

This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.

There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

The questions to ask

Even when it depends on various variables, answers can be given. These answers are best described as ‘insights’ and this blog is about that.

First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.

We see benchmarking as providing insights, in contrast with the stopwatch-number. Going back to F1 – being second in the race means the team probably wants to know these:

  • What elements build up the race? From weather conditions to corners, and from other cars on the track to driver-responses
  • How can each of these elements be quantified?
  • How can each of these elements be measured for both own cars and other cars?
  • And as you guessed from the high-level result, the stopwatch: how much speedup is required in total and per round?
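
To make the difference between insights and a single stopwatch-number a bit more concrete, here is a minimal sketch in Python. It times a function several times and reports the spread instead of one number. The helper name `benchmark` and its parameters are hypothetical and not part of any tooling mentioned here. A real setup would also record hardware, input sizes and other elements of the "track".

```python
import statistics
import time


def benchmark(func, repeats=20, warmup=3):
    """Time func several times and summarise the spread of the measurements."""
    for _ in range(warmup):
        func()  # warm-up runs are not measured (caches, lazy initialisation)

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append(time.perf_counter() - start)

    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
        "stdev": statistics.stdev(samples),  # needs at least two samples
    }


if __name__ == "__main__":
    result = benchmark(lambda: sum(range(100_000)))
    for name, seconds in result.items():
        print(f"{name}: {seconds * 1e3:.3f} ms")
```

A large gap between the minimum and the maximum is already an insight: it tells you that something other than the code under test influences the measurement.
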
Continue reading “The Art of Benchmarking”

Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL

Stream HPC is 10 years old on 1 April 2020. Therefore we offer our one day GPGPU crash course for free that whole month.

Now that Corona (and the fear of it) spreads, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

5 years ago…
Continue reading “Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL”

Problem solving tactic: making black boxes smaller

We are a problem solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the more it’s clear that without our problem solving skills, we could not tackle the complexity of GPU and CPU clusters. While I normally shield off how we work and how we continuously improve ourselves, it would be good to share a bit more so both new customers and new recruits know what to expect from the team.

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should be assuming we assume. The problem is that we want to have full control and make faster decisions, and then assuming fills in all these scary unknowns.

Continue reading “Problem solving tactic: making black boxes smaller”

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware and is aimed at those who work with financial code to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years and there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point for that. We have emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online;
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.


Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading “Updated: OpenCL and CUDA programming training – now online”

Join us at the Dutch eScience Symposium 2019 in Amsterdam

Soon there will be another Dutch eScience Symposium 2019 in Amsterdam. We thought it might be a good place to meet and listen to e-science talks. Stream HPC in the end is just making scientific software, so we’re here at the right place. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!

Continue reading “Join us at the Dutch eScience Symposium 2019 in Amsterdam”

We accelerated the OpenCL backend of pyPaSWAS sequence aligner

Last year we accelerated the OpenCL-code in PaSWAS, which is open source software to do DNA/RNA/protein sequence alignment and trimming. It has users world-wide in universities, research groups and industry.

Below you’ll find the benchmark results of our acceleration work. You can also test it out yourself, as the code is public. In the readme-file you can learn more about the idea of the software. Lots of background information is described in these two papers:

We chose PaSWAS because we really like bio-informatics and computational chemistry – the science is interesting, the problems are complex and the potential GPU-speedup is real. Other examples of such software we worked on are GROMACS and TeraChem.

Continue reading “We accelerated the OpenCL backend of pyPaSWAS sequence aligner”

Do you have our GPU DNA?

This is the first question to warm up. Python-programmers are often users of GPU-libraries, not the builders of those libraries.

In January 2019 I gave a talk about culture in the company, which I wanted to share with you. It was intended to trigger discussions on what environment fits somebody, and examples were given of other companies. The nice part was that it became more clear that the culture of a company like CodePlay was very similar, except they are working on different things (compilers). Same for departments of larger companies we work with or know well.

Important: all answers are based on what my colleagues answered. So most of us are cat-people, but I wouldn’t say that defines a GPU-developer. I hope it still gives you an understanding of our perspective on what defines a GPU-dev in just a few minutes, while it also gives you more than enough matter to think about.

Continue reading “Do you have our GPU DNA?”

Stream Team at ISC

This year we’ll be with 4 people at ISC: Vincent, Adel, Anna and Istvan. You can find us at booth G-812, next to Red Hat.

Booth G-812 is manned&womened by Stream HPC

While we got known in the HPC-world for our expertise on OpenCL, we now have many years of experience in CUDA and OpenMP. To get there, we’ve focused a lot on how to improve code quality of existing software, to reduce bugs and increase speedup-potential. Our main expertise remains full control over algorithms in software – the same data simply processed faster.

Why do we have a booth?

We’ll be mostly talking to (new) customers for development of high performance software for the big machines. Also we’ll have a list of our open job positions with us, and we can do the first introductory interview on the spot.

Our slogan for this year is:

There are a lot of supercomputers. Somebody has to program its software

We’ll be sharing our week on Twitter, so you can also see what we find: posters about HPC-programming on CPU and GPU, booths that have nice demos or interesting talks and of course the surprises.

Let’s meet!

If you don’t have an appointment yet, but would like to chat with us, please contact us or drop by at our booth. As we’re with four people, we have high flexibility.

GPU-related PHD positions at Eindhoven University and Twente University

We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

We’d like to bring the following PhD positions to your attention:

Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs to have expert knowledge on GPU architectures to optimally gain advantage of those capabilities.

Continue reading “GPU-related PHD positions at Eindhoven University and Twente University”

Academic hackathons for Nvidia GPUs

Are you working with Nvidia GPUs in your research and wish Nvidia would support you as they used to 5 years ago? This is now done with hackathons, where you get one full week of support, to get your GPU-code improved and your CPU-code ported. Still you have to do it yourself, so it’s not comparable to the services we provide.

To start, get your team on a decision to do this. It takes preparation and a clear formulation of what your goals are.

When and where?

It’s already April, so some hackathons have already taken place. For 2019, these are left where you can work on any language, from OpenMP to OpenCL and from OpenACC to CUDA. Python + CUDA-libraries is also no problem, as long as the focus is Nvidia.

Continue reading “Academic hackathons for Nvidia GPUs”

IWOCL 2019

On Monday May 13, 2019 at 09:30 the latest edition of IWOCL starts, not taking into account any pre-events that might be spontaneously organized. This is the biggest OpenCL-focused event that discusses everything that would make any GPGPU-programmer, DSP-programmer and FPGA-programmer enthusiastic.

What’s new since last year is that it’s actually also a more interesting place for CUDA-developers who like to learn and discuss new GPU-programming techniques. This is because Nvidia’s GTC has moved more to AI, where it used to be mostly GPGPU for years.

Since it’s now the last week of the early-bird pricing, it’s a good time to make you think about buying your ticket and booking the trip.

Continue reading “IWOCL 2019”

Question: do we work with CUDA?

Answer: Yes, actually a lot!

The company was built on OpenCL and we still work with the language a lot – from embedded GPUs and FPGAs to high-end GPUs. Just as OpenCL unjustly isn’t associated with clusters full of professional GPUs, we were not associated with CUDA. I can tell you that many of our customers have found us to build high performance software in CUDA.

Breaking with the past is not easy due to associations that seem to stick. With the name change from StreamComputing to Stream HPC some years ago, we wanted to enforce that break with being “the OpenCL company”. For some time we were much more pragmatic in solving the problems of our customers, which resulted in making software in MPI and CUDA – sometimes an unexpected direction as the customer initially chose OpenCL.

We also started hiring people who only knew CUDA (but expect them to learn OpenCL), as the right algorithm and the right processor are more important. Internships with CUDA, large CUDA-projects, seeking better relations with Nvidia and such – all have been going on for years. And we like it as much as we like OpenCL – both have unique advantages.

So if you have questions about CUDA, don’t be afraid that you hurt us – we’re happy to help you get fast software.

The 12 latest Twitter Poll Results of 2018

Via our Twitter channel we have various polls. Not always have we shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. The first half of the year there were no polls, in case you wanted to know.

As inclusive polls are not focused (and thus difficult to answer), most polls are incomplete by design. Still, insights or comments can be given.

The polls below have given us insight, and we hope they also give you insight into how our industry is developing. They are sorted by date, oldest first.

It was very interesting that the percentage of votes per choice did not change much after 30 votes. Even when it was retweeted by a large account, opinions had the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading “The 12 latest Twitter Poll Results of 2018”

We don’t work for the war-industry

Last week we emphasized that we don’t work for the war-industry. We did talk to a national army some years ago, but even though the project never started, we would have probably said no. Recently we got a new request, got uncomfortable and did not send a quote for the training.

This is because we like to think about the next 100 years, and investment in weapons is not something that would solve things for the long term.

To those who liked the tweet or wanted to, thank you for your support to show us we’re not standing alone here.

Continue reading “We don’t work for the war-industry”

OpenCL Basics: Running multiple kernels in OpenCL

This series “Basic concepts” is based on GPGPU-questions we get via email more than once, or when the question is not clearly explained in the books. For one it is obvious, for the other just what they’re missing.

They say that learning a new technique is best done by playing around with working code and then trying to combine it. The idea is that when you have Stackoverflowed and Githubed code together, you’ve created so many bugs by design that you’ll learn a lot if you make it work. When applying this to OpenCL, you quickly get to a situation where you want to run one .cl file and then another .cl file. Almost all beginner’s material discusses a single OpenCL-file, so how to do this elegantly?

Continue reading “OpenCL Basics: Running multiple kernels in OpenCL”

Start your GPU-career here

GPUs have been our mysterious friends and known enemies for years, as they let us run code in expected and unexpected ways. GPUs have solved problems for many of our customers. GPUs evolve so quickly that they’ll remain important for the years to come.

The problem is that programming GPUs is not an easy task. Where do you learn to program GPUs? We found these to be the main groups:

  • Universities
  • Research centers
  • GPU vendors (AMD, Nvidia, Intel, Qualcomm, ARM)
  • Self-study

This is far from enough. Add to that, that only a very select group learns the craft at a company. We’d like to change that, and we think now is the time for us to be able to deliver on this.

In January our internal training program will start with 4 to 8 developers. The focus is on fully understanding recent GPU-architectures, CUDA and OpenCL. It will consist of lectures, workshops, discussions, paper reading and of course coding for one month. The months after that will have guidance, paper presentations, code reviews and time for self-study. The exact form will differ per person.

The hard side

The current measurable requirements are:

  • EU citizen or already having a work permit
  • Great at C/C++
  • High interest in algorithmic optimisations
  • Any performance improvement focus (e.g. Assembly, clean code) is a plus
  • Any GPU experience (e.g. OpenGL, DirectX, self-study) is a plus
  • High interest in performance
  • Willing to move to Amsterdam
  • Willing to work for Stream HPC for at least 2 years

The soft side

We’re looking for people that fit our culture and we think we can train. This means that the selection is based for a large part on “the spark”. Therefore the application starts with a speed date, and we’re sorry for not finding a better wording for this. This is a 20 minute discussion about what we like and what we don’t. This can be done via phone, Skype or in person, during the evening, in the weekends or during your lunch break.

How to apply

Read about our company culture. Look at the jobs we have open. These describe the requirements after the training. Then write us a motivational letter: explain to us why this is exactly what you want, why you’re capable and why you’re a cultural fit. If you find it hard to write such a letter, then just start with answering the list of requirements. It’s a big bonus to share code (Github, Gitlab, zip-file). Send your email to jobs@streamhpc.com

Other jobs

Feeling more senior? We have other jobs: