Why you should just delve into porting difficult puzzles to the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you have not picked a puzzle yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.
We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.
Introduction in 2023
Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This is also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.
I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I had changed the way I coded, had learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues also got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.
Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take quite some effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end if you don’t care about approaches and just want the code.
Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.
Introduction in 2016
What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions of the N-Queens problem for N = 1 to 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
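To make the sequence above concrete, here is a minimal CPU-only sketch of a backtracking counter. It is not the Aparapi/OpenCL code this article is about; the function name `count_queens` and the bitmask layout are assumptions chosen for illustration. It places one queen per row and tracks attacked columns and diagonals as bits.

```python
def count_queens(n: int) -> int:
    """Count the ways to place n non-attacking queens on an n x n board."""
    full = (1 << n) - 1  # one bit per column, all set

    def place(cols: int, diag_l: int, diag_r: int) -> int:
        # cols:   columns that already hold a queen
        # diag_l: squares in the current row attacked along one diagonal
        # diag_r: squares in the current row attacked along the other diagonal
        if cols == full:  # n queens placed, one in every row
            return 1
        total = 0
        free = full & ~(cols | diag_l | diag_r)  # safe squares in this row
        while free:
            bit = free & -free  # lowest safe square
            free ^= bit
            total += place(
                cols | bit,
                ((diag_l | bit) << 1) & full,  # diagonal shifts one column per row
                (diag_r | bit) >> 1,
            )
        return total

    return place(0, 0, 0)


if __name__ == "__main__":
    # Should reproduce the start of the sequence quoted in the text.
    print([count_queens(n) for n in range(1, 11)])
```

Every recursion level handles one row, so the search tree branches at most N ways per level. That tree is what makes the puzzle a backtracking problem, and its irregular shape is one reason a GPU port is not straightforward.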
An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half that time. I knew it did not use the full potential of the GPU, because bitcoin mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.
This article was written while I programmed (as a journal), so you can see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and shows that the hard part of the job is that most of the energy goes into preparations that yield no result.
Continue reading “N-Queens project from over 10 years ago”










Ever seen a claim in a paper that you disagreed with or got triggered by, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.
When CUDA kept its dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Porting code to AMD hardware no longer takes months, and more and more CUDA software converts to HIP without problems. Even the really large and complex code-bases take only a few weeks at most, and we found that the problems solved along the way also made the CUDA code run faster.


It takes quite some effort to program FPGAs using VHDL or Verilog. For several years now, Intel/Altera has offered OpenCL drivers, with the goal of reducing this effort. OpenCL-on-FPGAs reduced the required effort to a quarter of the time, while also making it easier to alter the specifications during the project. Exactly the latter was very beneficial when creating the demo, as the to-be-solved problem was vaguely defined. The goal was to make a video look like a cartoon using image filters. We soon found out that “cartoonized” is a vague description, and it took several iterations to get the right balance between blur, color-reduction and edge-detection.
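To illustrate how those three filters can interact, here is a small pure-Python sketch on a single grayscale frame stored as a list of rows. It is not the FPGA/OpenCL demo code; all function names, the 3x3 box blur, the simple gradient test and the default values for `levels` and `threshold` are assumptions made for this example.

```python
def box_blur(img):
    """3x3 mean blur; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = sum(vals) // len(vals)
    return out


def quantize(img, levels):
    """Color reduction: map 0..255 grey values onto `levels` flat bands."""
    step = 256 // levels
    return [[min(255, (v // step) * step + step // 2) for v in row] for row in img]


def edges(img, threshold):
    """Edge detection: flag pixels whose local gradient exceeds `threshold`."""
    h, w = len(img), len(img[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(w - 1, x + 1)] - img[y][max(0, x - 1)]
            gy = img[min(h - 1, y + 1)][x] - img[max(0, y - 1)][x]
            out[y][x] = abs(gx) + abs(gy) > threshold
    return out


def cartoonize(img, levels=4, threshold=100):
    """Blur, flatten the shades, then draw black outlines on top."""
    smooth = box_blur(img)
    flat = quantize(smooth, levels)
    outline = edges(smooth, threshold)
    return [[0 if outline[y][x] else flat[y][x] for x in range(len(img[0]))]
            for y in range(len(img))]


if __name__ == "__main__":
    frame = [[0, 0, 0, 255, 255, 255] for _ in range(4)]  # dark left, bright right
    for row in cartoonize(frame):
        print(row)
```

The tunable parts are the blur size, `levels` and `threshold`: a stronger blur gives fewer but softer edges, fewer levels give flatter areas, and a lower threshold draws more outlines. Finding the balance between those is the kind of iteration described above.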

A month ago IWOCL (OpenCL workshop) and DHPCC++ (C++ for GPUs) took place. Meanwhile many slides and posters have been
Most of our projects are around performance optimisation, but we’re cleaning up bugs too. This is because you can only speed up software when certain types of bugs are cleared out. A few months ago, we got a different type of request: whether we could solve bugs in MESA 3D that appear in games.