*Why you should just delve into porting difficult puzzles to the GPU, to learn GPGPU languages like CUDA, HIP, SYCL, Metal or OpenCL.* And if you have not picked one yet, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

## Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens – until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared: I had changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and my colleagues also got better at GPGPU coding than me – this hasn’t changed in the years since, and I’m now the dumbest coder in the room.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

## Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the solution counts of the N-Queens problem for the first 26 board sizes. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
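For reference, a minimal single-threaded counter in plain C (my own sketch, not any of the implementations discussed in this article) reproduces these numbers for small N:

```c
#include <stdlib.h>

/* Count N-Queens solutions by naive row-by-row backtracking.
   cols marks occupied columns; d1/d2 mark occupied diagonals,
   shifted by one position per row (works for N up to ~25 with
   32-bit masks). */
static long solve(int n, int row, unsigned cols, unsigned d1, unsigned d2) {
    if (row == n) return 1;              /* all rows filled: one solution */
    long count = 0;
    for (int c = 0; c < n; c++) {
        unsigned bit = 1u << c;
        if ((cols & bit) || (d1 & bit) || (d2 & bit)) continue;
        count += solve(n, row + 1, cols | bit,
                       (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

long count_queens(int n) { return solve(n, 0, 0, 0, 0); }
```

This is the slow baseline the rest of the article tries to beat; every optimization below prunes or parallelizes this tree walk.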

An existing N-Queens implementation in OpenCL solved N=17 in 2.89 seconds on my GPU, while Nvidia hardware took half that time. I knew it did not use the full potential of the GPU, because bitcoin-mining throughput dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and into a hard part of the job: most of the energy goes into preparations that yield no result.

The first approach tries to create a parallel version of an existing implementation; the second focuses on methods better suited to GPUs. The kernels are not fully optimized, so I hope you can come up with a faster version – I give some hints where further optimization could help.

# The N-queens problem

The problem is described on Wikipedia as: placing N chess queens on an N×N chessboard so that no two queens attack each other. It is an exhaustive search problem: walking a tree and testing whether each candidate holds as a solution. The “nice” thing is that the number of solutions increases irregularly with increasing N (see “Counting solutions” on the Wikipedia page). Many optimizations are possible, all very clever, but harder to port to the GPU.

## Implementations

The forums spoke about multiple fast implementations, of which I researched three.

The first one was software by Jeff Somers, who described his software as: “*My program is a heavily-optimized C program, yet it still takes over a week for an 800 MHz PC to calculate the number of solutions for a 21 x 21 board*“. In other words: perfect for a 1050MHz GPU to do it in less time.

I found other solutions by Kenjiro Taura, but his webpage is offline now. He was very focused on exploring different parallel/distributed methods, and his code could make use of all the cores of an 8-core i7. It was somewhat tricky to get going, but – in case you find his code – the following worked:

```shell
gcc nq.c -lpthread
OMP_NUM_THREADS=8 ./a.out p 17 2
```

Results came in after 28.17 seconds, so only a small speed-up compared to Somers’ algorithm on a single core. Another implementation advertised as fast was made by Mr. Takaken, but I could not compile it with GCC at all – so I left that one.

But if you look around, there are many implementations to learn from:

- http://146.187.134.7/class/cscd435/Parallel_435/index.html
- http://drdobbs.com/architecture-and-design/184406068?pgno=5
- http://www.liacs.nl/~kosters/nqueens/
- http://c2.com/cgi/wiki?EightQueensInManyProgrammingLanguages

*Note that the above list is from 2016! Also, Dr. Dobb’s does not even let you go to different pages – you need to edit the URL to get to another page.*

I based my code on Somers’, as it fully focused on the algorithm, and that is always the first problem to solve. As a bonus, I could focus on my learning goal: porting a difficult puzzle to the GPU. There were more reasons, but you’ll only find those if you try it yourself.

## Searching & Checking

What Mr. Somers already discovered: the more algorithmic steps can be skipped, the less work needs to be done, and the faster we can make it.

This section describes in short some differences and shows where a lot of speed-up can be found.

Checking can be optimized too. When going deeper into the search tree, the previous checks don’t need to be done again (step-wise forward-checking). The first row has N possibilities: a queen on each position. For example, N=4 gives 1000, 0100, 0010 and 0001. If it were a tower (rook) problem and we did not remember the previous rows, then row 2 would have N-1 possibilities, etc., so the whole board would have N·(N-1)·(N-1)·… = N·(N-1)^(N-1) possibilities – plot it logarithmically to see how fast that grows. Of course, we can easily remember the previous rows to find the N! solutions, but we cannot do the same as easily for the diagonals that queens need. We come back to the tower-search later when discussing the second approach, as it turns out to be very usable.

For queens, the first row has N positions to put a queen on, the second row N-2 (when the first queen is at the border) or N-3 (otherwise). The third row has between N-6 (only when N>8) and N-4 possibilities; there are N-3 possibilities when the first two queens shadow positions – i.e. cover a position more than once. Still it’s not easy to find reusable search paths: finding the shadowed positions is as hard as checking validity, as far as I have found. This shadowing makes it a harder problem, and solutions for it can be used for other search problems too.

As the results can be mirrored, we only need to check half of the first row and double the results – with an exception for the middle position on odd-sized boards. You see that Somers coded two runs: one for one side of the board and one for the middle position (when the board size is odd) – in my port I combined them. When doing the middle case, he starts constrained at the second row, so he can still double the results, which speeds up the search.
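To make the mirroring concrete, here is a simplified sketch in plain C (my own illustration, not Somers’ actual code – he additionally constrains the second row in the middle case, a refinement I skip here):

```c
#include <stdlib.h>

/* Plain bitmask backtracking from a given row, with column and
   diagonal masks already set for the rows above. */
static long solve(int n, int row, unsigned cols, unsigned d1, unsigned d2) {
    if (row == n) return 1;
    long count = 0;
    for (int c = 0; c < n; c++) {
        unsigned bit = 1u << c;
        if ((cols & bit) || (d1 & bit) || (d2 & bit)) continue;
        count += solve(n, row + 1, cols | bit,
                       (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return count;
}

/* Left-right mirror symmetry: place the first queen only in the left
   half of row 0 and double the count; for odd N the middle column is
   counted once (its mirror image lands in the same branch). */
long count_queens_mirrored(int n) {
    long total = 0;
    for (int c = 0; c < n / 2; c++) {
        unsigned bit = 1u << c;
        total += 2 * solve(n, 1, bit, bit << 1, bit >> 1);
    }
    if (n % 2) {  /* middle column of the first row, not doubled */
        unsigned bit = 1u << (n / 2);
        total += solve(n, 1, bit, bit << 1, bit >> 1);
    }
    return total;
}
```

The trick roughly halves the search tree for free, which is why almost every fast solver uses it.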

If you read the various acceleration guides by AMD and Nvidia, you notice the focus for GPU optimization is on memory transfers – and this is the tricky part of N-Queens. Making a parallel version of the code was not the real problem; working around the limitations of GPUs really was. The first approach focuses on a parallel version of Somers’ code; the second one takes the GPU limitations much more into account.

# The first approach

## Finding parallelizable code

Because of dependencies on the previous step, it is impossible to completely unroll the loop, so we generate the possible starting points for the first x rows and then start from there. We need a series, so we can call a kernel by id. The problem is that it is a depth-first algorithm, and we need breadth-first – at least for the first rows. I therefore made a variation on the code that combines the odd and even parts. This version fails immediately performance-wise, but optimizing it is for later – I hope that by pumping in threads, the scheduler of the GPU helps me out. I made a different version that runs the separate threads (in serial) by only giving a row where to start. Unrolling and clustering would increase the speed, but we first try without.
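The breadth-first seeding can be sketched like this in plain C (my own illustration with hypothetical names, not the original Aparapi code): enumerate every valid placement on the first `depth` rows; each resulting seed would become one GPU work item that continues depth-first from there.

```c
#include <stdlib.h>

/* One seed = the column/diagonal masks after placing queens on the
   first `depth` rows; a kernel would resume backtracking from it. */
typedef struct { unsigned cols, d1, d2; } Seed;

static void gen_seeds(int n, int row, int depth,
                      unsigned cols, unsigned d1, unsigned d2,
                      Seed *out, int *count) {
    if (row == depth) {
        if (out) out[*count] = (Seed){cols, d1, d2};
        (*count)++;
        return;
    }
    for (int c = 0; c < n; c++) {
        unsigned bit = 1u << c;
        if ((cols & bit) || (d1 & bit) || (d2 & bit)) continue;
        gen_seeds(n, row + 1, depth, cols | bit,
                  (d1 | bit) << 1, (d2 | bit) >> 1, out, count);
    }
}

/* First pass: count the seeds, so the host can size the GPU buffer
   and know how many work items to launch. */
int count_seeds(int n, int depth) {
    int count = 0;
    gen_seeds(n, 0, depth, 0, 0, 0, NULL, &count);
    return count;
}
```

Picking `depth` is the balancing act: too shallow and there are not enough work items to fill the GPU; too deep and the host spends its time enumerating.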

## The results

The least I want is to beat those 2.89 seconds. The OpenCL versions gave results under 9 seconds on an 8-core Intel i7, so less than 4 times slower than a GPU running the other code base – that sounds hopeful. The threaded CPU version (Java) did around 30 seconds, so we do have some performance issues here. But… running OpenCL on the GPU also gave 30 seconds, which is not good at all!

I could tune the software by eliminating the first breadth-first search through better initialization, or by looking into vectors to use AVX. But since that would not improve the GPU times much, I left these optimizations out of this approach. I learned that Somers worked a lot on memory efficiency for a single-threaded solution, and I found its limitations for my cause. By going wide before deep, we effectively created a parallel version which runs very nicely on CPUs (which can handle completely asynchronous workers), but GPUs need more order and structure – time for approach 2.

# The second approach

Irregular code doesn’t work really well on GPU-style devices, as you can read in this extensive report about H264. There are more approaches, but they all have in common that they don’t hold up when moving to massively parallel processors. We now start with three focus points:

- No data-steered control flow, as it makes it hard to keep the kernels synchronous.
- Check the data in parallel, not step by step.
- Focus on calculating rather than caching.

We can use permutations of half-correct solutions: if we make sure that horizontally and vertically there is only one queen per line, we only need to check the diagonals for illegal sharing. It would be faster to permute the diagonals somehow, but I did not see a possibility. Checking the diagonals should be done in parallel too: we can use shift-and-sum, which we need to do twice (once per diagonal direction). To visualize this shift, put 4 queens in the middle 4×4 of a chessboard; leave the first, shift the second 1 position to the left, the third 2 positions, the fourth 3.
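A sketch of that diagonal-only check in plain C (my own illustration; `perm[row]` holds the column of the queen in that row, and rows/columns are conflict-free by construction):

```c
#include <stdbool.h>

/* Queens on a permutation board clash only diagonally: two queens
   share a "/" diagonal when row+col collides, and a "\" diagonal when
   row-col collides. Adding/subtracting the row index is exactly the
   per-row shift described in the text. Assumes n <= 32. */
bool diagonals_ok(const int *perm, int n) {
    bool seen_sum[64] = {false};   /* "/" ids: 0 .. 2n-2 */
    bool seen_dif[64] = {false};   /* "\" ids, offset to be >= 0 */
    for (int row = 0; row < n; row++) {
        int s = row + perm[row];
        int d = row - perm[row] + n - 1;
        if (seen_sum[s] || seen_dif[d]) return false;
        seen_sum[s] = seen_dif[d] = true;
    }
    return true;
}
```

On a GPU each work item would run this check on its own permutation; the loop is fixed-length and branch-light, which fits the "no data-steered control" focus point above.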

The number of permutations is N!, and visually it looks a bit like bubble sort. A starting position with two queens on one diagonal fails immediately, so later we are going to improve this algorithm to skip the obvious ones. The same goes for unnecessary checks of the diagonals.

## Permute / switch rows

The simple method needs a double loop: i (0 ≤ i < N) and j (i < j < N), switching rows i and j. If i > 0 then we of course need to keep track of some history. Before going on, a word on how to divide the problem. AMD prefers workgroup sizes which are a multiple of 64 (per dimension) – called a wavefront. Nvidia works well with multiples of 32 (a warp), so it has an advantage when the workgroups are smaller. This often explains the difference between Nvidia and AMD when an algorithm was first tested on Nvidia. Nvidia has an advantage here, as we cannot easily fill a 64-wide wavefront. How do we make sure the whole wavefront/warp is filled? It is clear we need loops, as we cannot launch N! kernels – for larger N that is not even representable in a uint64 (21! already exceeds it). You may try. 🙂

We have the same problem as in the previous approach: we can only launch from a certain row. Two rows give too few starting positions, so we start with three or four, as we did with approach 1. At http://wordaligned.org/articles/next-permutation a lot is explained about permutations. We have one problem: since many permutations are invalid, a lot of work is thrown away early.
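For reference, the classic lexicographic next-permutation step that the wordaligned.org article explains looks roughly like this in C (my own sketch):

```c
#include <stdbool.h>

static void swap_int(int *a, int *b) { int t = *a; *a = *b; *b = t; }

/* Advance perm[0..n-1] to the next permutation in lexicographic order.
   Returns false when perm was the last one (and leaves it reset to the
   first, i.e. fully ascending). */
bool next_permutation(int *perm, int n) {
    int i = n - 2;
    while (i >= 0 && perm[i] >= perm[i + 1]) i--;   /* pivot before the descending tail */
    int lo = (i < 0) ? 0 : i + 1;
    for (int l = lo, r = n - 1; l < r; l++, r--)    /* reverse the tail to ascending */
        swap_int(&perm[l], &perm[r]);
    if (i < 0) return false;                        /* wrapped around */
    int j = i + 1;
    while (perm[j] <= perm[i]) j++;                 /* smallest tail value > pivot */
    swap_int(&perm[i], &perm[j]);
    return true;
}
```

On the GPU you would not iterate like this per work item; instead each work item needs to jump straight to its k-th permutation by id, which is why the launch-from-a-fixed-prefix scheme above matters.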

## Hamiltonian path

Writing down the rows next to each other, it becomes very clear that there are no solutions for N = 2 or 3. N = 4 gives two solutions: 2,4,1,3 and 3,1,4,2. For 5 and 6 we need to check the diagonals of rows further apart, but it is clear you need to find whole, unique paths where all nodes are visited only once to find possible solutions. If we put it in a truth table, you get a wide diagonal band of 2 or 3 with false and the rest true (symmetrical). This problem is best known as part of the travelling salesman family: the Hamiltonian path. We don’t care about distances (set to 1) or directions, but we do very much care about getting all the separate solutions, as we need to test them for correctness. Creating such a graph is easy, so let us focus on walking it. If all these paths can be found in fewer operations than using permutations AND it can be done in parallel, we have taken a big step forward. Somers’ solution walks paths very orderly while checking the diagonals efficiently, and the permutations only needed to update+check the diagonals of the two switched rows, so we need to see if check-while-walking-the-tree could also be possible here. If the board size is bigger than 6, you see an increase in the number of solutions. The tree (figure) shows that when starting at the edge, you see only one escape; the rest have only one or two solutions. This tells something about the influence inside and outside 6×6; for us larger board sizes are important, and the complexity of such boards makes it hard to find theoretical shortcuts.

## Checks

We keep track of two arrays – one for each diagonal direction – of length 2·N−1, holding the id of the diagonal each queen is on. The two switched rows are shifted +n and −n, after which they are compared to the other rows. Starting from a position that is valid, only the switched rows need to be checked.
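A sketch of this incremental check in plain C (my own illustration, assuming the board was valid before rows `i` and `j` swapped columns):

```c
#include <stdbool.h>

/* Rows i and j of perm[] just swapped columns. All other pairs were
   already diagonal-free before the swap, so only pairs involving the
   two moved rows need rechecking: a "/" clash when row+col collides,
   a "\" clash when row-col collides. */
bool swap_still_valid(const int *perm, int n, int i, int j) {
    for (int r = 0; r < n; r++) {
        if (r == i || r == j) continue;
        int moved[2] = {i, j};
        for (int k = 0; k < 2; k++) {
            int m = moved[k];
            if (r + perm[r] == m + perm[m]) return false;  /* "/" clash */
            if (r - perm[r] == m - perm[m]) return false;  /* "\" clash */
        }
    }
    /* the two moved rows can also clash with each other */
    if (i + perm[i] == j + perm[j]) return false;
    if (i - perm[i] == j - perm[j]) return false;
    return true;
}
```

This drops the per-permutation cost from O(N²) pair checks to O(N), which is what makes the permutation approach competitive at all.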

# Back in 2023: gimme an implementation

So, that was a lot of theory! Where’s the action? Where’s the code? Unfortunately I have no code that has all the optimizations implemented or is in shareable condition. I do hope that the above gives you some insights into the puzzle, to speed up existing implementations.

Also unfortunately, still as of today you can find “new” implementations that are slower than my initial version, even when that code runs on modern hardware. But I assume you want to see what all these optimizations would result in.

Luckily I found an implementation on GitHub by Ole and Tim Pöschl, who had the same starting points as I had and looked into (several of) the same approaches. Do check out their code – “FAF” stands for “fast and fun”, of course.