General articles on technical subjects.

Thinking with iterators in CUDA and HIP

Parallel primitives are the ubiquitous building blocks of GPU programming with CUDA and HIP, to make your life as a programmer easier. Primitives like scans, reductions, and sorts operate in parallel over large data inputs. The basic use case has input and output residing in device memory as an array of values. However, the libraries provided by NVIDIA and AMD allow the use of iterators, which abstract the concept of input and output. An iterator is a type that behaves as a pointer, but overrides part of the dereferencing and arithmetic logic. With creative use of iterators, using the parallel primitives can become simpler and more performant. Assuming basic familiarity with the primitives, in this article we will show two examples of how iterators can be used to create better GPU programs: finding the arguments of the maximum of a function using a zip iterator in HIP and fast reduction with equally-sized segments in CUDA.

Leveraging the zip iterator to find the maximum argument

A reduction with a binary maximum operator finds the maximum element in the input. But what if we are also interested in where this element is located in the input? For the first example, let us find the argument of the maximum of a function with HIP. To make it a bit more interesting, suppose we want to find an index of an integer value that has most “set” bits.

A naive approach might be to first perform a reduction over the entire input, and then find the index that produced that input. However, by making clever use of iterators we can use just a single reduction. This is where the zip iterator comes in, which “zips” two iterators together: the dereferenced type is a tuple of both iterator’s value types. By zipping a counting iterator and an input array, we can enumerate the input elements.

thrust::device_vector<unsigned int> d_in(num_elements);
auto iter = rocprim::make_zip_iterator(
    rocprim::make_tuple(rocprim::make_counting_iterator(0), d_in.data()));

Any function that operates on this iterator is now provided with the index of each value. Below is an implementation of an binary operator that finds a tuple with the greatest number of set value bits.

using tuple_t = rocprim::tuple<int, unsigned int>;

struct arg_max_popc {
    __device__ tuple_t operator()(const tuple_t &lhs, const tuple_t &rhs) {
        // __popc counts the number of set bits
        return __popc(rocprim::get<1>(lhs)) < __popc(rocprim::get<1>(rhs)) ? rhs : lhs;
    }
};

To finish up, we perform the reduction over the zipped iterators using the custom operator, which finds an index and value that has the most set bits.

thrust::device_vector<tuple_t> d_out(1);

rocprim::reduce(
  temporary_storage,
  storage_size,
  iter,           // input (our custom iterator)
  d_out.data(),   // output
  tuple_t(-1, 0), // initial_value
  num_elements,
  arg_max_popc{}  // reduce_op (our custom operator)
);

thrust::host_vector<tuple_t> h_out = d_out;
printf("(index=%d, value=0x%08x)\n", rocprim::get<0>(h_out[0]), rocprim::get<1>(h_out[0]));

Without the use of iterators, implementing such an operation would require an inconvenient approach or a custom kernel. By thinking with iterators, the implementation becomes simple and elegant. In fact, the implementations of (hip)CUB’s ArgMin and ArgMax are based on the same idea.

Speeding up reduction for equally-sized segments

Reductions come in many flavors, but the common ones are the input-wide reduction, the by-key reduction, and the segmented reduction. Briefly stated, input-wide computes the reduction of the whole input, by-key computes a reduction for every run of identical keys associated with the input, and segmented computes a run for a set of provided segments.

For the second example, we will consider a scenario where a reduction needs to be made for equally-sized segments in CUDA. The simplest way to implement this would be to choose the segmented reduction, which looks something like the following:

thrust::device_vector<float> d_in(num_segments * num_elements_per_segment);
thrust::device_vector<float> d_out(num_segments);

// Prepare the segment offsets.
thrust::host_vector<int> h_offsets(num_segments + 1);
for (int i = 0; i < num_segments + 1; ++i) {
  h_offsets[i] = num_elements_per_segment * i;
}
thrust::device_vector<int> d_offsets = h_offsets;

cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, 
  temp_storage_bytes,
  d_in.data(),
  d_out.data(),
  num_segments,
  thrust::raw_pointer_cast(d_offsets.data()),     // d_begin_offsets
  thrust::raw_pointer_cast(d_offsets.data()) + 1, // d_end_offsets
  ::cuda::std::plus{}, // reduction_op
  0.f                  // initial_value
);

While the above implementation works fine, every segment offset introduces additional loads. Especially for small segments, this can add a lot of overhead. With the use of iterators, we can produce the offsets on the fly, without having to store them in memory. To achieve this, we combine two iterators: the counting iterator and the transform iterator.

A counting iterator simply represents a range of sequentially increasing values. If we create a counting iterator starting at zero, the iterator dereferences to zero, and incrementing by five dereferences to five. Since there is no backing storage, no memory loads are needed.

The transform iterator applies a unary operator to another iterator. Suppose the operator is to perform the square, then applying the transform iterator to a counting iterator starting from zero returns the values 0, 1, 4, and 9.

By combining these two basic building blocks we can formulate an iterator that returns the offsets as needed for the segmented reduction for equally-sized segments:

struct multiplier {
    (int segment_num_elements) : segment_num_elements(segment_num_elements) {}

    __host__ __device__ int operator()(int i) const {
        return i * segment_num_elements;
    }

    int segment_num_elements;
};

auto offsets = ::cuda::make_transform_iterator(
  ::cuda::make_counting_iterator(0), (segment_num_elements));

cub::DeviceSegmentedReduce::Reduce(
  d_temp_storage, 
  temp_storage_bytes,
  d_in.data(),
  d_out.data(),
  num_segments,
  offsets,             // d_begin_offsets (our custom iterator)
  offsets + 1,         // d_end_offsets
  ::cuda::std::plus{}, // reduction_op
  0.f                  // initial_value
);

This should increase the performance, but there is an alternative we could consider instead. Since our segments are of equal size, each element can independently infer which segment it belongs to. For any given element index, its segment index is simply given by the element index divided by the number of elements in the segment. Using this information, we can use a by-key reduction instead. To leverage reduce by key, an array of keys must be constructed that maps every index to a segment index. Below is an example with a segment size of three elements.

keys   0 0 0 1 1 1 2 2 2 3 3 3 
values 0 1 2 3 4 5 6 7 8 9 0 1
result     3    12    21    10

A naive method would be to write out the keys in device memory, but similar to before we can produce an iterator that produces the keys on the fly.

struct indexer {
    indexer(int segment_num_elements) : segment_num_elements(segment_num_elements) {}

    __host__ __device__ int operator()(int i) const {
        return i / segment_num_elements;
    }

    int segment_num_elements;
};

auto keys_in = ::cuda::make_transform_iterator(
  ::cuda::make_counting_iterator(0), indexer(segment_num_elements));

The full reformulated reduction is shown below. The snippet also makes use of an iterator that has not yet been mentioned, the discard iterator. The discard iterator is an output iterator that simply does not perform any operation when written to. As we are not interested in the unique output keys nor the number of runs, these results can be discarded.

cub::DeviceReduce::ReduceByKey(
  d_temp_storage, 
  temp_storage_bytes,
  keys_in,                       // d_keys_in (our custom iterator)
  cuda::make_discard_iterator(), // d_unique_out
  d_in.data(),                   // d_values_in
  d_out.data(),                  // d_aggregates_out
  cuda::make_discard_iterator(), // d_num_runs_out
  cuda::std::plus{},             // reduction_op 
  num_elements);

The segmented reduction and by-key reduction have different constraints they need to operate under, and it’s not immediately clear which version will perform the best. In such situations, the best thing to do is just to measure. Below are the results of performing the three methods for reduction with different segment sizes on CUDA 13.0 on an RTX 5080, including a non-segmented reduction as a baseline.

As we assumed, the segmented reduction that uses no iterator performs the worst, and using an iterator makes it strictly better. For small to medium segment sizes, the by-key reduction outperforms the segmented reduce significantly. For very large segment sizes, the segmented reduce performs on par with a non-segmented reduce. From the graph, it seems that the by-key reduction has a constant overhead with respect to a non-segmented reduce, except for tiny segments. The segmented reduction has an overhead that scales more heavily with the segment size, but for huge segments the overhead is fully gone.

Blog Post Author: Nol Moonen, Stream HPC

Published in May 2026

IWOCL 2026

At Stream HPC, we love open standards. They allow developers to write portable applications, encourage industry collaboration, and enable common tooling. Each year Khronos organizes the annual IWOCL conference on open standard compute languages, and this year we are delighted to have our engineers delivering talks on SYCL and OpenCL.

IWOCL 2026 is the 14th iteration of the conference in Heilbronn, Germany. Not far along the river Neckar from Heidelberg where the conference was in 2025. The conference is the ideal venue for both users and implementers to exchange their expertise to drive the community forward together.

We always enjoy attending to share our experiences as GPGPU performance experts. Informing the community of practices we found worked well, or advising on features and tooling that we’d like to see in the future of the ecosystem. Attracting over one hundred attendees from across the world, IWOCL is also an excellent opportunity to meet our industry colleagues face-to-face, who we collaborate with throughout the year as Khronos members and open source contributors.

Bálint Soproni and Ewan Crawford from Stream HPC will be in Heilbronn this year to present their work. The full details of which can be found in the IWOCL 2026 conference program.

Evaluating the AdaptiveCpp Single-Pass (SSCP) SYCL compiler for GROMACS on Modern AMD Accelerators

Authors: Bálint Soproni, Aksel Alpay and Vincent Heuveline, Heidelberg University

Summary: Bálint is presenting the work from Heidelberg University on the performance benefits of using the AdaptiveCpp JIT compiler for large SYCL applications. Traditionally SYCL is compiled using a SMCP (single-source, multiple compiler passes) compiler, but the AdaptiveCpp SSCP (single-source, single compiler pass) JIT compiler offers an alternative approach to SYCL compilation. This short paper shows how the GROMACS molecular dynamics application, which has a mature SYCL backend, benefits from SSCP JIT compilation on modern AMD GPUs.

CLVizulayer: A Tool for Visualising the Directed Acyclic Graph of OpenCL Device Submissions

Authors: Ewan Crawford, Stream HPC

Summary: In this technical talk Ewan will debut the CLVizulayer tool for printing the graph of asynchronous OpenCL device tasks. Designed as an OpenCL ICD Loader layer to enable users to easily collect a Graphviz DOT file that can be graphically rendered to show tasks as nodes and dependencies as edges. This OpenCL vendor agnostic graph can then be used to optimize and debug the application as the presentation will illustrate through a case study.

IWOCL may only be a few weeks away but tickets are still available. We would be delighted to meet you there, please say hello if you’ll be joining us in Germany.

Blog Post Author: Ewan Crawford, Stream HPC

Published in April 2026

N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles using the GPU, to learn GPGPU-languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you did not pick one, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU, has this moment that they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And ofcourse 90% of the potential optimizations don’t work well – it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling, was with N-Queens – till then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the first 26 solutions of the N-Queens problem. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background of this interesting puzzle.

An existing N-Queen implementation in OpenCL took N=17 took 2.89 seconds on my GPU, while Nvidia-hardware took half. I knew it did not use the full potential of the used GPU, because bitcoin-mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations be redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight on how I work and the hard part of the job is that most of the energy goes into resultless preparations.

Continue reading “N-Queens project from over 10 years ago”

How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of the AMD’s ROCm SDK on Windows brings a long awaited rejuvenation of developer tooling for offload APIs. Undoubtedly it’s most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline and given to HIP’s Module API  was attainable, yet cumbersome. Anticipation is driven by the single-source compilation model of HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.

Continue reading “How to get full CMake support for AMD HIP SDK on Windows – including patches”

Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work done in porting the CUDA code to HIP (follow article link here), significant progress was made in tackling the low hanging fruits in the kernels and tackling any potential structural problems outside of the kernel.

Additionally, since the last article, we’ve been in touch with the authors of the original repository. They’ve even invited us to update their repository too. For now it will be on our repository only. We also learnt that the group’s lead, professor John Cavazos, passed away 2 years ago. We hope he would have liked that his work has been revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU-code we can quickly assess what are the problems of the algorithm, and see if we can get to high performance without too much effort. In this blog we show these steps.

Continue reading “Improving FinanceBench for GPUs Part II – low hanging fruit”

The Art of Benchmarking

How fast is your software? The simpler the software setup, the easier to answer this question. The more complex the software, the more the answer will “it depends”. But just peek at F1-racing – the answer will depend on the driver and the track.

This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.

There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

The questions to ask

Even when it depends on various variables, answers do can be given. These answers are best be described as ‘insights’ and this blog is about that.

First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.

We see benchmarking as providing insights in contrast with the stopwatch-number. Going back to F1 – being second in the race, means the team wants to probably know these:

  • What elements build up the race? From weather conditions to corners, and from other cars on the track to driver-responses
  • How can each of these elements be quantified?
  • How can each of these elements be measured for both own cars and other cars?
  • And as you guessed from the high-level result, the stopwatch: how much speedup is required in total and per round?
Continue reading “The Art of Benchmarking”

Problem solving tactic: making black boxes smaller

We are a problem solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the more it’s clear that without our problem solving skills, we could not tackle the complexity of a GPU and CPU-clusters. While I normally shield off how we do and how we continuously improve ourselves, it would be good to share a bit more so both new customers and new recruits know what to expect form the team.

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should be assuming we assume. Problem is that we want to have full control and make faster decisions, and then assuming fits in all these scary unknowns.

Continue reading “Problem solving tactic: making black boxes smaller”

Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Deleware and is aimed at those who work with financial code to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years and there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point for that. We have emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under an permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.

Continue reading “Improving FinanceBench”

Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.

Continue reading “Updated: OpenCL and CUDA programming training – now online”

The 12 latest Twitter Poll Results of 2018

Via our Twitter channel we have various polls. Not always have we shared the full background of these polls, so we’ve taken the polls of the past half year and put them here. The first half of the year there were no polls, in case you wanted to know.

As inclusive polls are not focused (and thus difficult to answer), most polls are incomplete by design. Still insights can be given. Or comments given.

Below’s polls have given us insight and we hope they give you insights too how our industry is developing. It’s sorted on date from oldest first.

It was very interesting that the percentage of votes per choice did not change much after 30 votes. Even when it was retweeted by a large account, opinions had the same distribution.

Is HIP (a clone of CUDA) an option?

Continue reading “The 12 latest Twitter Poll Results of 2018”

OpenCL Basics: Running multiple kernels in OpenCL

This series “Basic concepts” is based on GPGPU-questions we get via email more than once, or when the question is not clearly explained in the books. For one it is obvious, for the other just what they’re missing.

They say that learning a new technique is best done by playing around with working code and then try to combine it. The idea is that when you have Stackoverflowed and Githubed code together, you’ve created so many bugs by design that you’ll learn a lot if you make it work. When applying this to OpenCL, you quickly get to a situation that you want to run one.cl file and then another.cl file. Almost all beginner’s material discuss a single OpenCL-file, so how to do this elegantly?

Continue reading “OpenCL Basics: Running multiple kernels in OpenCL”

How to speed up Excel in 6 steps

After the last post on Excel (“Accelerating an Excel Sheet with OpenCL“), there have been various request and discussions how we do “the miracle”. Short story: we only apply proper engineering tactics. Below I’ll explain how you can also speed up Excel and when you actually have to call us (last step).

A computer can handle 10s of gigabytes per second. Now look how big your Excel-sheet is and how much time it takes. Now you understand that the problem is probably not your computer.

Excel is a special piece of software from a developer’s perspective. An important rule of software engineering is to keep functionality (code) and data separate. Excel mixes these two as no other, which actually goes pretty well in many cases unless the data gets too big or the computations too heavy. In that case you’ve reached Excel’s limits and need to properly solve it.

An Excel-file often does things one-by-one, with a new command in every cell. This prevents any kind of automatic optimizations – besides that, Excel-sheets are very prone to errors.

Below are the steps to go through, of which most you can do yourself!

Continue reading “How to speed up Excel in 6 steps”

Selecting Applications Suitable for Porting to the GPU

Assessing software is never comparing apples to apples

The goal of this writing is to explain which applications are suitable to be ported to OpenCL and run on GPU (or multiple GPUs). It is done by showing the main differences between GPU and CPU, and by listing features and characteristics of problems and algorithms, which can make use of highly parallel architecture of GPU and simply run faster on graphic cards. Additionally, there is a list of issues that can decrease potential speed-up.

It does not try to be complete, but tries to focus on the most essential parts of assessing if code is a good candidate for porting to the GPU.

GPU vs CPU

The biggest difference between a GPU and a CPU is how they process tasks, due to different purposes. A CPU has a few (usually 4 or 8, but up to 32) ”fat” cores optimized for sequential serial processing like running an operating system, Microsoft Word, a web browser etc, while a GPU has a thousands of ”thin” cores designed to be very efficient when running hundreds of thousands of alike tasks simultaneously.

A CPU is very good at multi-tasking, whereas a GPU is very good at repetitive tasks. GPUs offer much more raw computational power compared to CPUs, but they would completely fail to run an operating system. Compare this to 4 motor cycles (CPU) of 1 truck (GPU) delivering goods – when the goods have to be delivered to customers throughout the city the motor cycles win, when all goods have to be delivered to a few supermarkets the truck wins.

Most problems need both processors to deliver the best value of system performance, price, and power. The GPU does the heavy lifting (truck brings goods to distribution centers) and the CPU does the flexible part of the job (motor cycles distributing doing deliveries).

Assessing software for GPU-porting fitness

Software that does not meet the performance requirement (time taken / time available), is always a potential candidate for being ported to a GPU. Continue reading “Selecting Applications Suitable for Porting to the GPU”

DOI: Digital attachments for Scientific Papers

Ever saw a claim on a paper you disagreed with or got triggered by, and then wanted to reproduce the experiment? Good luck finding the code and the data used in the experiments.

When we want to redo experiments of papers, it starts with finding the code and data used. A good start is Github or the homepage of the scientist. Also Gitlab. Bitbucket, SourceForge or the personal homepage of one of the researchers could be a place to look. Emailing the authors is often only an option, if the university homepage mentions such option – we’re not surprised to get no reaction at all. If all that doesn’t work, then implementing the pseudo-code and creating own data might be the only option – not if that will support the claims.

So what if scientific papers had an easy way to connect to digital objects like code and data?

Here the DOI comes in.

Continue reading “DOI: Digital attachments for Scientific Papers”

Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks

When CUDA kept having a dominance over OpenCL, AMD introduced HIP – a programming language that closely resembles CUDA. Now it doesn’t take months to port code to AMD hardware, but more and more CUDA-software converts to HIP without problems. The real large and complex code-bases only take a few weeks max, where we found that solved problems also made the CUDA-code run faster.

The only problem is that CUDA-libraries need to have their HIP-equivalent to be able to port all CUDA-software.

Here is where we come in. We helped AMD make a high-performance Pseudo Random Generator (PRNG) Library, called rocRAND. Random number generation is important in many fields, from finance (Monte Carlo simulations) to Cryptographics, and from procedural generation in games to providing white noise. For some applications it’s enough to have some data, but for large simulations the PRNG is the limiting factor. Continue reading “Learn about AMD’s PRNG library we developed: rocRAND – includes benchmarks”

GPU and FPGA challenge for MSc and PhD students

While going through my email, I found out about the third “HiPEAC Student Heterogeneous Programming Challenge”. Unfortunately the deadline was last week, but just got an email: if you register by this weekend (17 September), you can still join.

EDIT: if you joined, be sure to comment in early November how it was. This would hopefully motivate others to join in next year. Continue reading “GPU and FPGA challenge for MSc and PhD students”

The single-core, multi-core and many-core CPU

Multi-core CPU from 2011

CPUs are now split up in 3 types, depending on the number of cores: single (1), multi (2-8) and many (10+).

I find it more important now to split up into these three types, as the types of problems to be solved by each is very different. Based on the problem-differences I’m even expecting that the number of cores between multi-core CPUs and many-core CPUs will grow.

Below are the three types of CPUs discussed and a small discussion on many-core processors we see around. Continue reading “The single-core, multi-core and many-core CPU”

HPC centre EPCC says: “Better software, better science”

The University of Edinburgh houses the HPC centre EPCC. Neelofer Banglawala wrote about a programme which funds the development and improvement of scientific software, and also discussed about the results.

Many of the 10 most used application codes on ARCHER have been the focus of an eCSE project. Software with more modest user bases have improved user uptake and widened their impact through eCSE-funded work. Furthermore, performance improvements can lead to tens of thousands of pounds of savings in compute time.

Saving tens of thousands of pounds is certainly worth the investment. This also means more users can work on the same supercomputer, thus reducing waiting times. Continue reading “HPC centre EPCC says: “Better software, better science””

Demo: cartoonizer on an Altera Arria 10 FPGA

It takes quite some effort to program FPGAs using VHDL or Verilog. Since several years Intel/Altera has OpenCL-drivers, with the goal to reduce this effort. OpenCL-on-FPGAs reduced the required effort to a quarter of the time, while also making it easier to alter the specifications during the project. Exactly the latter was very beneficiary when creating the demo, as the to-be-solved problem was vaguely defined. The goal was to make a video look like a cartoon using image filters. We soon found out that “cartoonized” is a vague description, and it took several iterations to get the right balance between blur, color-reduction and edge-detection. Continue reading “Demo: cartoonizer on an Altera Arria 10 FPGA”

CPU Code modernisation – our hidden expertise

You’ve seen the speedups possible on GPUs. We secretly know that many of these techniques would also work on modern multi-core CPUs. If after the first optimisations the GPU still gets an 8x speedup, the GPU is the obvious choice. When it’s 2x, would the better choice be a bigger CPU or a bigger GPU? Currently the GPU is chosen more often.

Now AMD, Intel and AMD have 28+ core CPUs, the answer to that question might now lean towards the CPU. With a CPU that has 32 cores and 256bit vector-computations via AVX2, each clock-cycle 32 double4 can be computed. A 16-core AVX1 CPU could work on 16 double2’s, which is only a fourth of that performance. Actual performance compared to peak-performance is comparable to GPUs here. Continue reading “CPU Code modernisation – our hidden expertise”

New training dates for OpenCL on CPUs and GPUs!

OpenCL remains to be a popular programming language for accelerators, from embedded to HPC. Good examples are consumer software and embedded devices. With Vulkan potentially getting OpenCL-support in the future, the supported devices will only increase.

For multicore-CPUs and GPUs we now have monthly training dates for the rest of the year:

Minimum number of participants is two. By request the location and date can be changed.

The first day of the training is the OpenCL Foundations training, which can be booked separately.

For more information call us at +31854865760.