Asynchronous and parallel programming in C++26

For the past decade, every C++ programmer who wanted to do real concurrent work has had the same conversation with themselves. std::async is a toy. std::thread is too low-level. std::future doesn’t compose. So you reach for TBB, or another third-party library, or a thread pool you wrote in 2017, or C++20 coroutines that you have to wire up to an executor you also wrote yourself.

C++26 finally puts an answer in the standard library: std::execution, which is C++’s execution control library, also known as senders/receivers, also known as P2300. It is the largest single addition to the language since modules, and it is going to take the community a few years to digest. This post is a tour. We’ll start with the parallel algorithm policies you may already know, build up to senders/receivers, schedulers, cancellation, and end with a worked example that ties everything together.

What is std::execution

std::execution defines a model for describing asynchronous work and provides a small set of generic algorithms for composing those descriptions into larger ones. The actual execution is delegated to a scheduler, which is a handle to whatever resource happens to run the work: a thread pool, a GPU queue, an I/O reactor, or the calling thread itself.

The three core abstractions are:

  • A scheduler is a lightweight handle that says where work runs.
  • A sender is a lazy description of what work runs. It produces one of three completion signals: a value, an error, or “stopped” (cancelled).
  • A receiver is the callback object that consumes those signals. You almost never write one by hand; the algorithms wire them up for you.

The crucial word is lazy. A sender does nothing until you connect it to a receiver and start it. This is why the model composes: you can build up a description of an entire pipeline at compile time, hand it to the runtime once, and let the runtime decide how to schedule the pieces.
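To make the laziness concrete, here is a toy model in plain C++ that mimics the connect/start shape. The names toy_just, toy_then, recording_receiver and run_lazy_demo are invented for this sketch and are not the std::execution API; the real types also carry error and stopped channels, which are left out here.

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Toy model only: these types mimic the *shape* of senders and receivers.
// They are NOT the std::execution API.

// A receiver that records every value it is given.
struct recording_receiver {
    std::vector<int>* log;
    void set_value(int v) { log->push_back(v); }
};

// Operation state produced by connecting toy_just to a receiver.
template <class Receiver>
struct just_op {
    int value;
    Receiver receiver;
    void start() { receiver.set_value(value); }  // work happens only here
};

// "Sender" that will deliver a fixed value once started.
struct toy_just {
    int value;
    template <class Receiver>
    just_op<Receiver> connect(Receiver r) const { return {value, std::move(r)}; }
};

// Receiver used by toy_then: applies f, then forwards downstream.
template <class F, class Receiver>
struct then_receiver {
    F f;
    Receiver next;
    void set_value(int v) { next.set_value(f(v)); }
};

// "Sender adaptor": wraps an upstream sender and a function.
template <class Sender, class F>
struct toy_then {
    Sender upstream;
    F f;
    template <class Receiver>
    auto connect(Receiver r) const {
        return upstream.connect(then_receiver<F, Receiver>{f, std::move(r)});
    }
};

// Builds a pipeline, checks that nothing ran, then starts it.
inline std::vector<int> run_lazy_demo() {
    std::vector<int> log;
    int calls = 0;
    auto twice = [&calls](int x) { ++calls; return 2 * x; };

    // Composition: only a description is built here.
    toy_then<toy_just, decltype(twice)> pipeline{toy_just{21}, twice};
    assert(calls == 0 && log.empty());

    // Connecting wires the receiver in, but still runs nothing.
    auto op = pipeline.connect(recording_receiver{&log});
    assert(calls == 0 && log.empty());

    op.start();  // now the chain executes
    assert(calls == 1);
    return log;
}
```

Calling run_lazy_demo() returns a log holding the single value 42. The function passed to toy_then runs exactly once, and only after start().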

What is added in C++26

The paper that became std::execution landed at the St. Louis meeting in June 2024. The standard ships with:

  • The sender/receiver concepts and their machinery.
  • Sender factories: just, just_error, just_stopped, schedule, read_env.
  • Sender adaptors: then, upon_error, upon_stopped, let_value, let_error, let_stopped, starts_on, continues_on, on, when_all, split, bulk, into_variant, stopped_as_optional, stopped_as_error.
  • Sender consumers: sync_wait, sync_wait_with_variant.
  • A system scheduler (P2079) so you don’t have to bring your own thread pool just to write Hello, world.
  • An async scope (P3149) for spawning dynamic, but still joined, work.
  • A coroutine task<T> type that interoperates with senders.

What is not in C++26: networking, file I/O, time-based schedulers, and the asynchronous parallel algorithms (P3300). Those are queued for C++29. The reference implementation, NVIDIA’s stdexec, is what you should use today if you want to experiment.

Enabling parallel algorithms with std::execution policies

Before senders, the simplest way to ask the standard library to use multiple cores was to pass an execution policy to a parallel algorithm. These were added in C++17 and they are the entry point most people will reach for first:

#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> data = load();
std::sort(std::execution::par_unseq, data.begin(), data.end());

The four standard policies are seq, par, par_unseq, and unseq (the last was added in C++20). par says the implementation is allowed to use multiple threads. unseq says it is allowed to vectorise. par_unseq says both. seq says neither: it runs sequentially on the calling thread, much like the overload without a policy.

These policies are permissions, not promises. The implementation may decide that your range is too small to be worth distributing, and run it on the calling thread. If you want guarantees, you have to measure.

The relationship to std::execution is mostly nominal. The policies live in the same std::execution namespace, and the upcoming asynchronous parallel algorithms (P3300) will accept the same policy types when they ship. But the parallel algorithms today don’t return senders, don’t compose, and don’t let you choose a scheduler. They are a useful, but limited, self-contained corner of the library.

Everything below this point is the new world.

Asynchronous tasks with senders/receivers

The canonical first example. We build a description, then we run it:

namespace ex = std::execution;

ex::sender auto work = ex::just("Hello, world!")
                     | ex::then([](std::string_view s) { std::print("{}\n", s); });

ex::sync_wait(std::move(work));

just(x) is a sender that completes immediately with the value x. then(f) is a sender adaptor: it takes the value coming out of the upstream sender and passes it to f. The pipe operator is just function composition: just(a) | then(f) is the same as then(just(a), f), a sender that completes with the result of f(a). In other words: wait for value a to be delivered from the sender, and when it is delivered, execute the function f(a). Or in the above example: expect “Hello, world!” to be delivered, and when it is, print it.

Nothing has been executed yet. The variable work holds a typed object describing the computation. sync_wait is what actually starts it and blocks until it finishes. This separation between describing and running is the whole game. You can build pipelines, pass them around, store them, and decide later where and when to run them.

A sender can complete in three ways:

  • set_value(args...): Success, with zero or more values.
  • set_error(err): Failure, with any error type (not just exception_ptr).
  • set_stopped(): Cancellation.

then only fires on the value channel. To handle the others, use upon_error and upon_stopped, which are the same shape but for the other two channels. To chain another sender off the value (the monadic bind), use let_value:

ex::sender auto pipeline =
    ex::just(filename)
  | ex::then(open_file)
  | ex::let_value([](File f) { return read_async_sender(f); })
  | ex::then(parse);

Use then when your continuation is synchronous and returns a value. Use let_value when your continuation is itself asynchronous and returns a sender. Get this distinction wrong and you’ll find yourself with a Sender<Sender<T>> that compiles but does the wrong thing.

When you need to fan out and join, use when_all and split:

auto common = ex::schedule(sched) | ex::then(parse_input) | ex::split();

auto branch_a = common | ex::then(compute_a);
auto branch_b = common | ex::then(compute_b);

auto combined = ex::when_all(branch_a, branch_b)
              | ex::then([](auto a, auto b) { return merge(a, b); });

split is necessary because senders are single-shot by default. Without it, attaching common to two downstream chains would mean running parse_input twice.

Figure 1: The compute graph created by the code sample above.

Each edge in the graph carries exactly one of the three completion signals (value, error, or stopped) to the receiver that follows it.

Mapping tasks to different schedulers

The point of having a scheduler abstraction is that where work runs is a property of the pipeline, not of the work itself. You write the work once and bind it to a scheduler at composition time.

Note: Compile time is when the compiler type-checks and instantiates the sender chain. Composition time is when your code builds the sender description by chaining adaptors (just(x) | then(f) | continues_on(sched)). Runtime / execution time is when sync_wait (or start) actually kicks off the operation state and work flows through the chain: threads run, GPU kernels launch, I/O happens.
The distinction that matters is between composition time and execution time. Both happen “at runtime” in the C++ sense, but the term (“composition time”) is used to name the phase where you describe work without doing it. That’s the core property of laziness in senders: the pipeline is a value you can inspect, store, or pass around before anyone calls start.

There are three primitives for placement:

  • schedule(sched): Produce a sender that completes on sched. This is how you start work on a particular execution context.
  • starts_on(sched, sndr): Run sndr starting on sched, regardless of where the surrounding chain came from.
  • continues_on(sched): At this point in the chain, hop onto sched and continue there.

A typical pattern: read from disk on an I/O scheduler, do CPU work on a worker pool, then write back from the I/O scheduler:

auto pipeline =
    ex::starts_on(io_sched, ex::just(path) | ex::then(read_file))
  | ex::continues_on(cpu_sched)
  | ex::then(process)
  | ex::continues_on(io_sched)
  | ex::then(write_file);

The system scheduler (called std::execution::get_system_scheduler() in earlier revisions of P2079, and get_parallel_scheduler() in the revision adopted for C++26) is the default thread pool that the standard provides so you don’t have to write one. It’s a sensible default for CPU-bound work, but for I/O or for work that needs strict ordering you’ll usually want to construct your own.

A practical warning, courtesy of Mathieu Ropert’s investigation of stdexec: if you build a pipeline through let_value and the inner sender doesn’t carry a scheduler, bulk(par_unseq, ...) may silently run serially on the calling thread. The fix is to add an explicit continues_on(sched) before the bulk:

ex::let_value([&] {
    return ex::just()
         | ex::continues_on(sched)              // re-anchor the scheduler
         | ex::bulk(ex::par_unseq, count, work);
})

This is the kind of thing you only catch with a profiler. Make a habit of being explicit about scheduler placement at every fan-out point.

How to pause/cancel sender chains

Cancellation in the senders/receivers world is cooperative and structured. There is no kill_thread. Instead, the framework propagates a stop request, and the senders that care about it observe a std::stop_token, finish what they’re doing, and complete with set_stopped instead of set_value.

The mechanism (inherited from std::jthread) is std::stop_source and std::stop_token. When you connect a sender to a receiver, every sender in the chain can query the receiver’s environment via get_env(receiver), and that environment is what answers get_stop_token(env). The token is a property of whoever is consuming the chain.

To inject your own stop token into that environment, C++26 gives you exactly the right tool: std::execution::write_env [exec.write.env]. It’s a sender adaptor that wraps a sender so that, when connected to a receiver, the inner receiver sees an environment with whatever extra queries you’ve spliced in. Combined with std::execution::prop [exec.prop], which is a tiny class template that pairs a query tag with a value, it gives you a one-liner for token injection:

namespace ex = std::execution;

std::stop_source src;
auto pipeline = ex::just(42)
              | ex::then([](int x) { return x * 2; });

auto result = ex::sync_wait(
                ex::write_env(
                  std::move(pipeline),
                  ex::prop(ex::get_stop_token, src.get_token())
                )
              );

if (result) std::println("got {}", std::get<0>(*result));
else        std::println("cancelled");

write_env joins the new env onto the outer one rather than replacing it, so sync_wait’s own queries (allocator, scheduler, etc.) still flow through. sync_wait already returns std::optional<std::tuple<Vs...>>, which is engaged (has_value() == true) on value and disengaged (has_value() == false) on stopped, and it throws on error, so there’s no extra boilerplate needed.

Now let’s actually test cancellation. I initially implemented everything from scratch, including an interruptible_sleep_sender that showcases the cancellation feature (which can be found in this godbolt link). Things get much simpler with experimental::execution::timed_thread_context and functions like experimental::execution::schedule_after, which schedule senders to run after a specified delay. So, without defining any structs of our own, we can do this:

// godbolt link: https://godbolt.org/z/nxG8M6oY9
{
    namespace expex = experimental::execution;
    ex::inplace_stop_source src;

    expex::timed_thread_context timer_ctx;
    auto timer = timer_ctx.get_scheduler();

    auto pipeline = ex::just(42)
                  | ex::let_value([timer](int x)
                    {
                        return expex::schedule_after(timer, 5s)
                             | ex::then([x] { return x * 2; });
                    });

    // cancel from another thread:
    std::jthread canceller([&] {
        std::this_thread::sleep_for(200ms);
        std::println("cancelling from thread {}", std::this_thread::get_id());
        src.request_stop();
    });

    auto result = ex::sync_wait(
                    ex::write_env(
                        std::move(pipeline),
                        ex::prop(ex::get_stop_token, src.get_token())
                    )
                );

    if (result) std::println("got {}", std::get<0>(*result));
    else        std::println("cancelled");
}

For longer-running or dynamically spawned work, prefer spawning through an async scope (P3149). The scope’s spawn/spawn_future already builds a receiver whose environment carries a stop token tied to the scope’s lifetime, so you don’t need write_env at all, and calling request_stop() on the scope cancels every task running inside it.

Once a stop token is in the receiver’s environment, whether it was put there with write_env or by an async scope, request_stop() on the corresponding source will be observed all the way through the chain.

For a sender to honor the request, it has to query the token. The standard algorithms like then, bulk, when_all do this for you. when_all is particularly important: if any of its child senders fails or is stopped, it requests stop on its remaining children before completing.

You can pause a chain by writing a sender whose start registers a callback and returns. The chain is “paused” because nothing is running; it resumes when the callback fires and calls set_value on its receiver. This is how I/O senders, timer senders, and task<T> coroutines work. There is no thread sitting blocked. That’s the whole reason structured async exists.
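The “register a callback and return” idea can be sketched without any library at all. In the toy below, toy_event_source stands in for a reactor or timer, and toy_wait_op stands in for an operation state whose start() only registers a continuation. All of these names are hypothetical, and a std::vector plays the role of the receiver.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy stand-in for an I/O reactor or timer: it only stores continuations.
struct toy_event_source {
    std::vector<std::function<void(int)>> waiting;

    void on_ready(std::function<void(int)> cb) {
        waiting.push_back(std::move(cb));
    }

    // Called later by "the outside world" to deliver a result.
    void fire(int value) {
        auto ready = std::move(waiting);
        waiting.clear();  // leave the source in a known empty state
        for (auto& cb : ready) cb(value);
    }
};

// Toy operation state: start() registers a callback and returns at once.
struct toy_wait_op {
    toy_event_source* source;
    std::vector<int>* results;  // plays the role of the receiver

    void start() {
        // The lambda is the continuation; calling it is our "set_value".
        source->on_ready([out = results](int v) { out->push_back(v); });
    }
};

inline std::vector<int> run_pause_demo() {
    toy_event_source source;
    std::vector<int> results;
    toy_wait_op op{&source, &results};

    op.start();                          // returns immediately, nothing blocks
    assert(results.empty());             // the chain is "paused"
    assert(source.waiting.size() == 1);  // one continuation is parked

    source.fire(7);                      // the event arrives, the chain resumes
    return results;
}
```

Between start() and fire() no thread is waiting on anything; the only thing that exists is a stored continuation.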

How to create pause/cancel callback wrappers

Most of the time you’ll consume the cancellation machinery rather than implement it. But if you’re integrating an existing API like a callback-based HTTP client, or a hardware event, you’ll write a small custom sender.

Here’s a sketch of a sender that wraps a one-shot callback API and respects stop requests:

namespace ex = std::execution;

template <typename Receiver>
struct cancellable_op {
    using operation_state_concept = ex::operation_state_t;

    Receiver receiver;
    SomeAPIHandle handle;

    using StopToken = ex::stop_token_of_t<ex::env_of_t<Receiver>>;

    struct on_stop {
        cancellable_op* self;
        void operator()() const noexcept { self->handle.cancel(); }
    };

    // Use the token's OWN callback_type - that way this works whether
    // the env carries a real stop token (std::stop_token,
    // stdexec::inplace_stop_token) or never_stop_token (no-op).
    using stop_cb_t = typename StopToken::template callback_type<on_stop>;
    std::optional<stop_cb_t> stop_cb;

    // C++26 member-function customization (no more tag_invoke).
    void start() & noexcept {
        auto token = ex::get_stop_token(ex::get_env(receiver));
        stop_cb.emplace(std::move(token), on_stop{this});

        handle.submit([this](Result r) {
            stop_cb.reset();
            if (r.cancelled()) {
                ex::set_stopped(std::move(receiver));
            } else if (r.hasError()) {
                ex::set_error(std::move(receiver), r.outcome.error());
            } else {
                ex::set_value(std::move(receiver), *r.outcome);
            }
        });
    }
};

struct cancellable_sender {
    using sender_concept = ex::sender_t;
    using completion_signatures = ex::completion_signatures<
        ex::set_value_t(std::optional<int>),
        ex::set_error_t(Error),
        ex::set_stopped_t()>;

    SomeAPIHandle handle;

    // Member-function connect.
    template <typename Receiver>
    auto connect(Receiver r) && {
        return cancellable_op<Receiver>{std::move(r), std::move(handle)};
    }
};

Three things to notice. First, the stop callback is an RAII object, meaning that its constructor registers, its destructor unregisters. You must reset it before completing the receiver, or you risk the callback firing on a torn-down object. Second, start is noexcept. Senders may be started in contexts where exceptions can’t be propagated; throw and you terminate. Third, the receiver is moved into the operation state at connect time and lives there until completion. There is no allocation per chain element; the whole graph is one nested aggregate, sized at compile time.

This is more than you’ll usually want to write. In practice, prefer the upcoming std::execution::task<T> coroutine, where co_await does all of the above for you.

As always, a godbolt example invoking the (1) value, (2) error and (3) cancel paths can be found here.

Advanced example

Let’s pull it all together. Suppose we have an image processing pipeline that:

  1. Reads a JPEG from disk on an I/O scheduler.
  2. Decodes it on the I/O scheduler (decoding is mostly a stream of read calls plus some CPU; let’s keep it on I/O for simplicity).
  3. Hands the pixel buffer to a GPU worker pipeline, which runs a parallel filter GPU kernel over the rows.
  4. Re-encodes the result to JPEG on the I/O scheduler.
  5. Uploads the encoded bytes to a remote server on a network scheduler.
  6. Is fully cancellable, and the upload will be aborted if a stop is requested mid-flight.

Assume the obvious helper senders exist. We’re not interested in how the JPEG decoder works.

// Godbolt link for this example: https://godbolt.org/z/b5PnnzMqn
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <cstdint>
#include <filesystem>
#include <print>
#include <string>
#include <thread>
#include <vector>

// Hypothetical GPU scheduler header - each vendor (ROCm, CUDA, ...)
// would ship their own that models ex::scheduler, or we create our own
// wrappers for it.
#include <gpu/scheduler.hpp>

namespace ex = stdexec;
using namespace std::literals;

// ── Domain types ────────────────────────────────────────────────────
struct PixelBuffer {
    std::vector<std::uint8_t> pixels;
    int width{}, height{}, channels{3};
    int row_count() const noexcept { return height; }
};
struct JpegBlob     { std::vector<std::uint8_t> bytes; };
struct UploadResult { int status_code{}; std::string etag; };

// ── Synchronous helpers (the pipeline turns these into senders
//    via just/then) ──────────────────────────────────────────────────
JpegBlob    read_file(std::filesystem::path p);              // may throw
PixelBuffer decode_jpeg(JpegBlob&& blob);                    // may throw
JpegBlob    encode_jpeg(PixelBuffer&& buf);                  // may throw

// ── GPU kernel wrapper ──────────────────────────────────────────────
// launch_filter_kernel synchronously enqueues a GPU kernel operating
// on row `row` of `buf`.  Internally it wraps the hipLaunchKernel /
// cuLaunchKernel call.  The bulk scheduler ensures all enqueued
// kernels complete before the bulk sender signals set_value.
// A GPU-aware scheduler maps bulk indices to hardware threads /
// wavefronts, so each call just enqueues — it does not wait.
void launch_filter_kernel(PixelBuffer& buf, int row) noexcept;

// ── upload_to: returns a cancellable sender that uploads `blob` to
//    `url` and completes with UploadResult.  Same stop-callback
//    pattern as the callback-wrapper section above.
ex::sender auto upload_to(std::string url, JpegBlob& blob);

// ── The pipeline ────────────────────────────────────────────────────
// A function template parameterised on schedulers.  The same pipeline
// can be tested with inline_scheduler, profiled with an instrumented
// one, or run against real thread pools.
template <ex::scheduler IoSched,
          ex::scheduler GpuSched,
          ex::scheduler NetSched>
ex::sender auto image_pipeline(
    IoSched  io,
    GpuSched gpu,
    NetSched net,
    std::filesystem::path path,
    std::string           upload_url)
{
    return
        // 1-2. Read and decode on the I/O context.
        ex::starts_on(io,
            ex::just(std::move(path))
          | ex::then(read_file)
          | ex::then(decode_jpeg))

        // 3. GPU filter: hop onto the GPU scheduler, then enqueue
        //    one kernel per row via bulk.  The GPU scheduler maps
        //    bulk indices to hardware threads / wavefronts.
      | ex::continues_on(gpu)
      | ex::let_value([gpu](auto&& buf) {
            const int rows = buf.row_count();
            return ex::just(std::forward<decltype(buf)>(buf))
                 | ex::continues_on(gpu)
                 | ex::bulk(ex::par, rows, // Note the std::execution policy from introduction ;)
                       [](int row, PixelBuffer& b) noexcept {
                           // Each invocation enqueues a GPU kernel.
                           // A GPU-aware scheduler maps bulk indices
                           // to hardware threads / wavefronts.
                           launch_filter_kernel(b, row);
                       });
        })

        // 4. Re-encode on the I/O context.
      | ex::continues_on(io)
      | ex::then(encode_jpeg)

        // 5-6. Upload on the network context.  upload_to returns a
        //      cancellable sender, so a stop request mid-upload will
        //      abort the HTTP call and complete with set_stopped.
      | ex::continues_on(net)
      | ex::let_value([url = std::move(upload_url)](auto&& blob) {
            return upload_to(url, std::forward<decltype(blob)>(blob));
        });
}

int main()
{
    // Three execution contexts, each representing a different resource.
    exec::static_thread_pool io_ctx{2};
    gpu::context             gpu_ctx;    // wraps a GPU device
    exec::static_thread_pool net_ctx{2};

    ex::inplace_stop_source stop;

    // Simulate the user pressing Ctrl-C after 100 ms.
    std::jthread watchdog([&] {
        std::this_thread::sleep_for(100ms);
        stop.request_stop();
    });

    auto result = ex::sync_wait(
        ex::write_env(
            image_pipeline(
                io_ctx.get_scheduler(),
                gpu_ctx.get_scheduler(),
                net_ctx.get_scheduler(),
                "/photos/cat.jpg",
                "https://cdn.example.com/images/cat.jpg"),
            ex::prop(ex::get_stop_token, stop.get_token())));

    if (result) {
        auto& [upload] = *result;
        std::println("done: HTTP {} etag={}", upload.status_code, upload.etag);
    } else {
        std::println("pipeline cancelled");
    }
}

A toy version of the example above, which varies UPLOAD_DURATION so that the pipeline either succeeds or is cancelled, can be found here!

Analysis

Read the pipeline definition from top to bottom and notice what isn’t there. There are no thread handles, no futures, no mutexes, no condition variables. There is no question about who owns what data: the value channel is responsible for moving the data from one stage to the next. The schedulers are passed in, so the same pipeline could be tested by passing in an inline scheduler, or profiled by passing in an instrumented one. The whole graph is a value, built lazily, started once, joinable, cancellable, and statically typed end-to-end.

A few things worth calling out:

  • let_value for dynamic bulk shapes. The row count comes from the decoded PixelBuffer, which is only known at runtime. bulk(rows, f) needs rows at the point where the adaptor is created, so we use let_value to unwrap the buffer, read its dimensions, and re-emit it through just + bulk. This “unwrap, inspect, rewrap” is the standard idiom whenever a downstream adaptor’s parameters depend on the upstream value.
  • GPU via continues_on(gpu) + bulk. A GPU scheduler models the same scheduler concept as a thread pool. continues_on(gpu) hops execution onto the device context, and the subsequent bulk(par, rows, f) maps each index to a hardware thread or wavefront. The callable passed to bulk calls launch_filter_kernel, which enqueues the actual kernel. A vendor-specific GPU scheduler (ROCm’s HIP, CUDA, oneAPI) would schedule work by enqueueing onto the device’s command queue and complete when the queue signals. From the pipeline’s perspective it’s just another scheduler with no special API surface.

Note: bulk(n, f) is defined to forward its input values downstream after all n invocations of f complete, so the PixelBuffer& is passed to step 4.

  • continues_on at every boundary. Without explicit placement, bulk may silently run serially on whatever thread the previous step completed on (as we noted in the scheduler section). The continues_on(gpu) before bulk ensures the filter launches on the GPU, and the continues_on(io) after it hops back to the CPU for the encode.
  • Cancellation is structural. The single write_env + prop(get_stop_token, ...) at the outermost layer is enough. Every sender in the chain (then, bulk, and our custom upload_to) can observe the same token through the receiver’s environment. When the watchdog calls request_stop(), the pipeline completes with set_stopped and sync_wait returns std::nullopt. Even the GPU kernel can be aborted mid-flight if the device supports async cancellation (on AMD hardware this maps to hipStreamDestroy / signal-based abort).

Conclusion

We’ve abstracted concurrency and asynchronous programming without ever caring about threads, mutexes, or condition variables. We’ve created portable (and cancellable) pipelines that define work in terms of what runs where and minimized the risk of data races. That is what std::execution is for.

The API surface is large and the error messages are, today, terrifying. The reference implementation has rough edges that will catch you out, although development is active and the issue list is relatively short. But the underlying model is the right one. After a decade of executor proposals, C++ finally has a single vocabulary for “do this work somewhere, then this work somewhere else, and let me cancel the whole thing.” Whether or not C++26 is the version where you adopt it, it is the version where you should start learning it.

On a closing note, everyone is talking about the C++26 reflection features [P2996]. Reflection unlocks a whole new realm of the language that will take us years to master and understand (maybe even more than 10 years), but this standard also brings a palette of other new features (safety, erroneous behavior, pre- and postconditions and, of course, std::execution) that users will be eager to adopt and use. And to quote the closing statement of Herb Sutter’s post Living in the future: Using C++26 at work: fun times for C++!

Lazy Ranges in C++23 with std::generator

Disclaimer: LLMs were used for proof-reading and grammar check.

C++20 gave us coroutines. The necessary machinery was included (co_yield, co_return, co_await), but the standard library had no concrete coroutine types. You had to write your own promise type, your own iterator, your own bookkeeping. The complexity of this boilerplate was intimidating for most people.

C++23 fixes the most obvious gap: std::generator<T>. It’s a synchronous, pull-based, lazy sequence. You write a function that co_yields values, and the caller iterates over them with a range-for. No allocator gymnastics, no hand-rolled promise types. See appendix for an example of creating your own generator.

Let’s look at this feature in detail, work through some examples, and point out a few caveats.

What std::generator actually is

A std::generator<T> is a view that satisfies std::ranges::input_range, which means it plugs into the entire ranges pipeline. You can combine it with adaptors like views::take, views::filter, and views::transform without ever materializing the full sequence in memory.

The function suspends at each co_yield and resumes when the caller asks for the next element. Values are produced one at a time, on demand. If the caller stops iterating, the remaining values are never computed.

This makes it trivial to express infinite sequences, stateful streams, or anything where computing the entire result set upfront is wasteful or impossible.

Example 1: Fibonacci, forever

Take one of the best-known integer sequences: Fibonacci. It is infinite, so it can never be stored in a std::vector, but it works fine as a generator because nothing is stored.

#include <cstdint>
#include <generator>
#include <iostream>
#include <ranges>

std::generator<std::uint64_t> fibonacci() {
    std::uint64_t a = 0, b = 1;
    while (true) {
        co_yield a;
        auto next = a + b;
        a = b;
        b = next;
    }
}

int main() {
    for (auto n : fibonacci() | std::views::take(20)) {
        std::cout << n << '\n';
    }
}

There’s no sentinel, no size. The generator runs until you stop pulling from it. Compose it with take, filter, drop, or whatever else you need. The generator doesn’t know or care.

A few things worth noticing:

  • The while (true) loop never terminates on its own. That’s fine. When the caller destructs the generator (by leaving the range-for), the coroutine frame is destroyed and the loop just stops.
  • You get on-demand evaluation comparable to a hand-written iterator class, without actually writing one. The main extra cost is the coroutine frame, which is typically heap-allocated.
  • The return type is std::generator<std::uint64_t>. That’s it. No template metaprogramming, no CRTP base class.

Note: Compare how the generated assembly differs between this example and a hand-rolled Fibonacci range iterator coroutine in the appendix!

Example 2: Camera driver class

Fibonacci is clean but not realistic. Here’s something closer to production: a camera device that yields frames lazily, where each yield can succeed or fail. We pair std::generator with std::expected to get a typed error channel without exceptions.

In your camera interface, you could define something like:

using FrameResult = std::expected</*struct*/ Frame, /*enum class*/ FrameError>;

And to facilitate a polling mechanism from the device something like:

virtual std::generator<FrameResult> frames() = 0;

Now creating non-deterministic test cases that yield FrameErrors randomly is a piece of cake! A mock class representing the camera device can be as simple as:

class CameraMock : public CameraI {
public:
    explicit CameraMock(std::uint32_t w, std::uint32_t h)
        : width(w), height(h) {}

    std::generator<FrameResult> frames() override {
        std::random_device rd;
        std::mt19937 rng{rd()};
        std::uniform_int_distribution<int> fault(0, 19);

        while (true) {
            int roll = fault(rng);
            if (roll == 0) {
                co_yield std::unexpected(FrameError::Timeout);
            }
            else if (roll == 1) {
                co_yield std::unexpected(FrameError::CorruptedData);
            }
            else if (roll == 2) {
                co_yield std::unexpected(FrameError::DeviceLost);
            }
            else {
                co_yield getMockFrame(width, height, seq++);
            }
        }
    }

private:
    std::uint32_t width, height;
    std::uint64_t seq{};
};

The generator lives inside a CameraMock class. Each iteration either produces a frame or reports an error and it is up to the caller to decide how to handle it. The consumer doesn’t know anything about the camera’s internal state machine. It just pulls results from a range.

This generator-of-expected pattern turns out to be genuinely useful. The producer can signal errors inline without throwing and the consumer handles them in the same loop that processes success values. No separate error callback, no out-of-band signalling. The control flow reads top to bottom.

You could also compose this with ranges. Want to skip errors and only process valid frames?

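// A lambda is used as the predicate: taking the address of a standard
// library member function such as has_value is not guaranteed to work.
auto good_frames = camera.frames()
                 | std::views::filter([](const CameraI::FrameResult& r) {
                       return r.has_value();
                   });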

Note: this range is not evaluated yet. I like to think of the variable good_frames as an un-invoked lambda: evaluation happens while iterating it. Of course, lambdas capture variables at the time of definition, not invocation, so the comparison isn’t one-to-one.

Whether you should do that depends on whether you need to log the errors. But the option is there, for free, because std::generator is a range.

Or use the monadic interface on std::expected directly. Each element in the generator is already an expected, so you can chain and_then and or_else per-frame without unwrapping anything yourself:

for (auto&& result : camera.frames() | std::views::take(100)) {
    result
        .and_then([&](const Frame& frame) -> CameraI::FrameResult {
            // do something
            ++processed;
            return frame;
        })
        .or_else([&](FrameError e) -> CameraI::FrameResult {
            std::cerr << std::format("[cam] error: {}\n", to_string(e));
            ++errors;
            return std::unexpected(e);
        });
}

and_then runs only when the expected holds a value. or_else runs only on the error path. No if/else, no operator*. The types route execution(!). You can also propagate or convert errors mid-chain if a processing step can also fail. Can an interface get cleaner than that?

The sharp edges

A few things to know before you consider this for production:

Single-pass. std::generator is an input range, not a forward range. You can iterate it exactly once. If you need to make two passes, collect the values into a container first.

Move semantics matter. When you co_yield a value, the generator stores a pointer to the yielded object. For large types like our Frame, use co_yield with a moved variable or a temporary and consume the value before the next iteration advances the coroutine. Holding a reference across iterations is undefined behaviour.

No co_await. std::generator is synchronous. If you need to co_await an async operation inside the coroutine body, you need a different type. Generators yield values; they don’t wait on futures.

HALO: when the heap allocation disappears

Every coroutine needs a frame to store its local variables, parameters, and suspension-point bookkeeping. By default that frame lives on the heap: a new on creation, a delete on destruction. For a generator that yields millions of values in a tight loop, that allocation is noise. But it’s still there, and in latency-sensitive or real-time code it can matter.

HALO stands for Heap Allocation eLision Optimization. If the compiler can prove that:

  • the coroutine’s lifetime is bounded by the caller, and
  • the frame size is known at compile time

then it can allocate the coroutine frame on the caller’s stack (or in registers) instead of the heap. The new/delete pair vanishes entirely. The standard permits this elision but does not guarantee it, so it remains an optimisation. Still, it is more than a “maybe the compiler will do it” hope: it is a well-defined elision that compilers actively implement.

The key factor is visibility. The compiler needs to see both the coroutine body and the call site in the same translation unit, and it needs to prove that the coroutine object doesn’t escape.

// HALO-friendly: coroutine defined in same TU, lifetime bounded by caller
std::generator<int> range(int from, int to) {
    for (int i = from; i < to; ++i)
        co_yield i;
}

int main() {
    auto s = range(1, 10);
    // std::accumulate needs begin() and end() to have the same type, which
    // std::generator does not provide, so fold the range instead (C++23)
    return std::ranges::fold_left(s, 0, std::plus{});
    // s destroyed here: compiler sees the full lifetime
}

// HALO-unfriendly: coroutine defined in another TU
std::generator<int> range(int from, int to); // declaration only

int main() {
    auto s = range(1, 10); // compiler can't see the ramp function
    return std::ranges::fold_left(s, 0, std::plus{});
}

The set of functions the compiler must inline is small and bounded: the coroutine ramp function, get_return_object(), begin(), the constructor/move-constructor/destructor of the generator type, and coroutine_handle<>::destroy. Critically, neither the coroutine body itself nor the algorithm consuming it (std::ranges::fold_left in this case) needs to be inlined. The optimizer only needs to see enough to prove that every path from creation to scope exit calls destroy on the coroutine handle.

In practice this means: keep your generators close to their consumers. Define them in headers or the same file, consume them in tight scopes, and don’t stash them in containers or pass them across API boundaries if you care about the allocation. When HALO kicks in, an std::generator loop compiles down to the same code as a hand-written state machine with no use of heap and without any indirection.

Wrapping up

std::generator isn’t a revolution. It’s the missing piece that makes C++ coroutines usable for the most common case: producing a sequence of values lazily. The machinery was already there; now there’s a standard type that does what everyone was avoiding anyway.

Pair it with std::expected and you get a clean, composable pattern for fallible streams. Pair it with ranges and you get lazy pipelines with zero allocations beyond the coroutine frame. That’s a good trade for a one-line return type change (exaggerating).

Compiler explorer

You can find the above example compiled here:

Appendix

In computer science, it’s all about tradeoffs. Of course there are some everywhere! Notice the code below:

If we enable the comment on line 38, we can make the dereference cheaper. The difference in the produced assembly code is that std::generator stores the yielded value via a pointer in the promise, so operator* must dereference it (mov rax, QWORD PTR [rax]), adding a dependent load.

Also begin() is cheaper: the FibonacciRange returns a small iterator (just a coroutine handle) in two registers (rax+rdx). std::generator::_Iterator is larger or non-trivially constructible, forcing the caller to pass a destination pointer in rdi, which means an extra memory round-trip.

Finally FibonacciRange has a smaller stack footprint: it uses less stack space (variables around [rbp-96]) vs the generator block ([rbp-176]), reflecting a smaller iterator/view type.

This gap can get even larger with -O3, but I’ll leave that to the reader…

GPU Day 2026

At Stream HPC, we enjoy opportunities to connect with the HPC and accelerator community, exchange ideas, and learn from engineers and researchers working across the GPU ecosystem. Later this month, several members of our team will be attending GPU Day 2026 in Budapest, Hungary.

Now in its 16th edition, GPU Day has become an established annual conference focused on massively parallel computing in science and industrial applications. Organized by the Wigner Scientific Computation Laboratory, the event brings together researchers, developers, students, and industry experts to discuss technologies spanning GPUs, compilers, machine learning, visualization, and emerging accelerator platforms.

For Stream HPC, events like GPU Day are a natural fit. We work with clients across a wide range of hardware platforms and software ecosystems, helping optimize and accelerate applications where performance matters. Conferences like this provide an opportunity to share practical experiences from real-world projects and contribute to the broader HPC community.

This year two Stream HPC engineers will present their work at the conference:

1 – Manual AMDGCN Assembly Analysis & Optimization

Presenter: Nara Prasetya

Performance optimization on GPUs often starts with profiling and identifying memory bottlenecks. But sometimes performance limitations originate elsewhere, like in compiler decisions and generated machine code itself.

In this presentation, Nara explores optimization techniques that go beyond conventional profiling workflows. By analyzing AMDGCN assembly directly, reducing register pressure, and investigating the impact of compiler changes, the talk demonstrates how manual low-level analysis can recover performance that would otherwise remain hidden behind generated code.

2 – Evaluating the AdaptiveCpp Single-Pass (SSCP) SYCL compiler for GROMACS on Modern AMD Accelerators

Presenter: Bálint Soproni

SYCL supports multiple implementation strategies, including both Single-Source Multiple Compiler Passes (SMCP) and Single-Source Single Compiler Pass (SSCP) approaches. AdaptiveCpp’s SSCP JIT compiler has previously shown promising performance gains, but its impact on large production applications has remained relatively unexplored.

Bálint presents work evaluating AdaptiveCpp’s SSCP compiler using GROMACS, a widely-used molecular dynamics package with a mature SYCL backend targeting AMD GPUs. Their results show performance improvements of up to 10–25% for certain workload configurations and increased peak throughput across modern AMD accelerators, demonstrating that SSCP advantages can extend beyond smaller benchmark applications into real production workloads.

GPU Day is only a few weeks away and we are looking forward to meeting fellow developers, researchers, and industry colleagues in Budapest. If you’ll be attending, come say hello and we’d be happy to talk GPUs, performance optimization, SYCL, compilers, and HPC.

Thinking with iterators in CUDA and HIP

Parallel primitives are the ubiquitous building blocks of GPU programming with CUDA and HIP, to make your life as a programmer easier. Primitives like scans, reductions, and sorts operate in parallel over large data inputs. The basic use case has input and output residing in device memory as an array of values. However, the libraries provided by NVIDIA and AMD allow the use of iterators, which abstract the concept of input and output. An iterator is a type that behaves as a pointer, but overrides part of the dereferencing and arithmetic logic. With creative use of iterators, using the parallel primitives can become simpler and more performant. Assuming basic familiarity with the primitives, in this article we will show two examples of how iterators can be used to create better GPU programs: finding the arguments of the maximum of a function using a zip iterator in HIP and fast reduction with equally-sized segments in CUDA.

Leveraging the zip iterator to find the maximum argument

A reduction with a binary maximum operator finds the maximum element in the input. But what if we are also interested in where this element is located in the input? For the first example, let us find the argument of the maximum of a function with HIP. To make it a bit more interesting, suppose we want to find the index of the integer value that has the most “set” bits.

A naive approach might be to first perform a reduction over the entire input, and then find the index that produced that result. However, by making clever use of iterators we can use just a single reduction. This is where the zip iterator comes in, which “zips” two iterators together: the dereferenced type is a tuple of both iterators’ value types. By zipping a counting iterator and an input array, we can enumerate the input elements.

thrust::device_vector<unsigned int> d_in(num_elements);
auto iter = rocprim::make_zip_iterator(
    rocprim::make_tuple(rocprim::make_counting_iterator(0), d_in.data()));
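
To show what the single reduction computes, here is a host-side C++ sketch of the same idea. It is not rocPRIM code: the Indexed pair plays the role of the zipped (index, value) tuple, and the names set_bits, more_set_bits and argmax_set_bits are made up for this illustration.

```cpp
#include <bitset>
#include <cstddef>
#include <limits>
#include <utility>
#include <vector>

// (index, value) pair: the host-side analogue of the zipped tuple.
using Indexed = std::pair<std::size_t, unsigned int>;

// Number of set bits in v.
inline std::size_t set_bits(unsigned int v) {
    return std::bitset<std::numeric_limits<unsigned int>::digits>(v).count();
}

// Binary "maximum" over pairs: keep the one with more set bits, and the
// lower index on ties so the result does not depend on evaluation order.
inline Indexed more_set_bits(const Indexed& a, const Indexed& b) {
    const std::size_t pa = set_bits(a.second);
    const std::size_t pb = set_bits(b.second);
    if (pa != pb) return pa > pb ? a : b;
    return a.first < b.first ? a : b;
}

// One pass over the enumerated input; assumes `in` is non-empty.
inline std::size_t argmax_set_bits(const std::vector<unsigned int>& in) {
    Indexed best{0, in[0]};
    for (std::size_t i = 1; i < in.size(); ++i)
        best = more_set_bits(best, Indexed{i, in[i]});
    return best.first;
}
```

Because more_set_bits is associative and commutative, the same operator can be handed to a parallel reduction, where the pairs are combined in an unspecified order.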

IWOCL 2026

At Stream HPC, we love open standards. They allow developers to write portable applications, encourage industry collaboration, and enable common tooling. Each year Khronos organizes the annual IWOCL conference on open standard compute languages, and this year we are delighted to have our engineers delivering talks on SYCL and OpenCL.

IWOCL 2026 is the 14th iteration of the conference and takes place in Heilbronn, Germany, not far along the river Neckar from Heidelberg, where the conference was held in 2025. The conference is the ideal venue for both users and implementers to exchange their expertise and drive the community forward together.

We always enjoy attending to share our experiences as GPGPU performance experts, informing the community of practices we found worked well and advising on features and tooling that we’d like to see in the future of the ecosystem. Attracting over one hundred attendees from across the world, IWOCL is also an excellent opportunity to meet face-to-face the industry colleagues we collaborate with throughout the year as Khronos members and open source contributors.

Bálint Soproni and Ewan Crawford from Stream HPC will be in Heilbronn this year to present their work. Full details can be found in the IWOCL 2026 conference program.

Evaluating the AdaptiveCpp Single-Pass (SSCP) SYCL compiler for GROMACS on Modern AMD Accelerators

Authors: Bálint Soproni, Aksel Alpay and Vincent Heuveline, Heidelberg University

Summary: Bálint is presenting the work from Heidelberg University on the performance benefits of using the AdaptiveCpp JIT compiler for large SYCL applications. Traditionally SYCL is compiled using a SMCP (single-source, multiple compiler passes) compiler, but the AdaptiveCpp SSCP (single-source, single compiler pass) JIT compiler offers an alternative approach to SYCL compilation. This short paper shows how the GROMACS molecular dynamics application, which has a mature SYCL backend, benefits from SSCP JIT compilation on modern AMD GPUs.

CLVizulayer: A Tool for Visualising the Directed Acyclic Graph of OpenCL Device Submissions

Authors: Ewan Crawford, Stream HPC

Summary: In this technical talk Ewan will debut the CLVizulayer tool for printing the graph of asynchronous OpenCL device tasks. It is designed as an OpenCL ICD Loader layer that lets users easily collect a Graphviz DOT file, which can be rendered graphically to show tasks as nodes and dependencies as edges. This vendor-agnostic OpenCL graph can then be used to optimize and debug the application, as the presentation will illustrate through a case study.

IWOCL may only be a few weeks away but tickets are still available. We would be delighted to meet you there, so please say hello if you’ll be joining us in Germany.

Blog Post Author: Ewan Crawford, Stream HPC

Published in April 2026

N-Queens project from over 10 years ago

Why you should just delve into porting difficult puzzles using the GPU, to learn GPGPU-languages like CUDA, HIP, SYCL, Metal or OpenCL. And if you did not pick one, why not N-Queens? N-Queens is a truly fun puzzle to work on, and I am looking forward to learning about better approaches via the comments.

We love it when junior applicants have a personal project to show, even if it’s unfinished. As it can be scary to share such an unfinished project, I’ll go first.

Introduction in 2023

Everybody who starts in GPGPU has this moment where they feel great about the progress and speedup, but then suddenly get totally overwhelmed by the endless paths to more optimizations. And of course 90% of the potential optimizations don’t work well: it takes many years of experience (and mentors in the team) to excel at it. This was also a main reason why I like GPGPU so much: it remains difficult for a long time, and it never bores. My personal project where I had this overwhelmed+underwhelmed feeling was N-Queens; until then I could solve the problems in front of me.

I worked on this backtracking problem as a personal fun-project in the early days of the company (2011?), and decided to blog about it in 2016. But before publishing I thought the story was not ready to be shared, as I changed the way I coded, learned so many more optimization techniques, and (like many programmers) thought the code needed a full rewrite. Meanwhile I had to focus much more on building the company, and also my colleagues got better at GPGPU-coding than me – this didn’t change in the years after, and I’m the dumbest coder in the room now.

Today I decided to just share what I wrote down in 2011 and 2016, and for now focus on fixing the text and links. As the code was written in Aparapi and not pure OpenCL, it would take some good effort to make it available – I decided not to do that, to prevent postponing it even further. Luckily somebody on this same planet had about the same approaches as I had (plus more), and actually finished the implementation – scroll down to the end, if you don’t care about approaches and just want the code.

Note that when I worked on the problem, I used an AMD Radeon GPU and OpenCL. Tools from AMD were hardly there, so you might find a remark that did not age well.

Introduction in 2016

What do 1, 0, 0, 2, 10, 4, 40, 92, 352, 724, 2680, 14200, 73712, 365596, 2279184, 14772512, 95815104, 666090624, 4968057848, 39029188884, 314666222712, 2691008701644, 24233937684440, 227514171973736, 2207893435808352 and 22317699616364044 have to do with each other? They are the numbers of solutions of the N-Queens problem for N = 1 to 26. Even if you are not interested in GPGPU (OpenCL, CUDA), this article should give you some background on this interesting puzzle.
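
For readers who want to reproduce the first few of these numbers, here is a plain sequential backtracking counter in C++. It is a CPU sketch written for this illustration, not the Aparapi/OpenCL code discussed in the article, and the names place and count_solutions are made up.

```cpp
#include <cstdint>

// Place one queen per row. Each mask has one bit per column:
// cols  = columns already occupied,
// diag1 = columns attacked along one diagonal direction in this row,
// diag2 = columns attacked along the other diagonal direction.
std::uint64_t place(std::uint32_t all, std::uint32_t cols,
                    std::uint32_t diag1, std::uint32_t diag2) {
    if (cols == all) return 1;  // every column is taken: one full solution
    std::uint64_t count = 0;
    std::uint32_t avail = all & ~(cols | diag1 | diag2);
    while (avail != 0) {
        const std::uint32_t bit = avail & (~avail + 1u);  // lowest free column
        avail ^= bit;
        // Moving to the next row shifts the diagonal attacks by one column.
        count += place(all, cols | bit, (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

// Number of ways to place n non-attacking queens on an n x n board.
// The 32-bit masks limit this sketch to n < 32; run time is the real limit.
std::uint64_t count_solutions(unsigned n) {
    const std::uint32_t all = (std::uint32_t{1} << n) - 1u;
    return place(all, 0, 0, 0);
}
```

count_solutions(8) gives the familiar 92, but the run time grows quickly with N, which is where the parallel approaches come in.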

An existing N-Queens implementation in OpenCL took 2.89 seconds for N=17 on my GPU, while Nvidia hardware took half that. I knew it did not use the full potential of the GPU, because bitcoin-mining dropped to 55% and not to 0%. 🙂 I only had to find those optimizations by redoing the port from another angle.

This article was written while I programmed (as a journal), so you see which questions I asked myself to get to the solution. I hope this also gives some insight into how I work, and shows that the hard part of the job is that most of the energy goes into resultless preparations.


The Fastest Payroll System Of The World

At StreamHPC we do several very different types of projects, but this project has been very, very different. In the first place, it was nowhere close to scientific simulation or media processing. Our client, Intersoft solutions, asked us to speed up thousands of payroll calculations on a GPU.

They wanted to solve a simple problem, avoiding slow conversations with HR of large companies:

Yes, I can answer your questions.

For that I need to do a test-run.

Please come back tomorrow.

The calculation of 1600 payslips took one hour. This means 10,000 employees would take over 6 hours. Potential customers appreciated the clear advantages of Intersoft’s solution, but told them that they were searching for a faster solution in the first place.

Using our accelerated compute engine, a run with 3300 employees (anonymised, real data) now only takes 20 seconds, including loading and writing all data to the database, a speedup of about 250 times. A run with 100k employees can get all calculations done in under 2 minutes; the above HR department would have liked that.


How to get full CMake support for AMD HIP SDK on Windows – including patches

Written by Máté Ferenc Nagy-Egri and Gergely Mészáros

Disclaimer: if you’ve stumbled across this page in search of fixing up the ROCm SDK’s CMake HIP language support on Windows and care only about the fix, please skip to the end of this post to download the patches. If you wish to learn some things about ROCm and CMake, join us for a ride.

Finally, ROCm on Windows

The recent release of AMD’s ROCm SDK on Windows brings a long-awaited rejuvenation of developer tooling for offload APIs. Undoubtedly its most anticipated feature is a HIP-capable compiler. The runtime component amdhip64.dll has been shipping with AMD Software: Adrenalin Edition for multiple years now, and with some trickery one could consume the HIP host-side API by taking the API headers from GitHub (or a Linux ROCm install) and creating an export lib from the driver DLL. Feeding device code compiled offline to HIP’s Module API was attainable, yet cumbersome. Anticipation is driven by the single-source compilation model of HIP borrowed from CUDA. That is finally available* now!

[*]: That is, if you are using Visual Studio and MSBuild, or legacy HIP compilation atop CMake CXX language support.


Improving FinanceBench for GPUs Part II – low hanging fruit

We found a finance benchmark for GPUs and wanted to show we could speed its algorithms up. Like a lot!

Following the initial work done in porting the CUDA code to HIP (follow article link here), significant progress was made in tackling the low hanging fruits in the kernels and tackling any potential structural problems outside of the kernel.

Additionally, since the last article, we’ve been in touch with the authors of the original repository. They’ve even invited us to update their repository too. For now it will be on our repository only. We also learnt that the group’s lead, professor John Cavazos, passed away 2 years ago. We hope he would have liked that his work has been revived.

Link to the paper is here: https://dl.acm.org/doi/10.1145/2458523.2458536

Scott Grauer-Gray, William Killian, Robert Searles, and John Cavazos. 2013. Accelerating financial applications on the GPU. In Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units (GPGPU-6). Association for Computing Machinery, New York, NY, USA, 127–136. DOI:https://doi.org/10.1145/2458523.2458536

Improving the basics

We could have chosen to rewrite the algorithms from scratch, but first we need to understand the algorithms better. Also, with the existing GPU-code we can quickly assess what are the problems of the algorithm, and see if we can get to high performance without too much effort. In this blog we show these steps.


The Art of Benchmarking

How fast is your software? The simpler the software setup, the easier to answer this question. The more complex the software, the more the answer will be “it depends”. But just peek at F1-racing: the answer will depend on the driver and the track.

This article focuses on the foundations of solid benchmarking, so it helps you to decide which discussions to have with your team. It is not the full book.

There will be multiple blog posts coming in this series, which will be linked at the end of the post when published.

The questions to ask

Even when it depends on various variables, answers can be given. These answers are best described as ‘insights’ and this blog is about that.

First the commercial message, so we can focus on the main subject. As benchmark-design is not always obvious, we help customers to set up a system that plugs into a continuous integration system and gives continuous insights. More about that in an upcoming blog.

We see benchmarking as providing insights, in contrast with the stopwatch number. Going back to F1: being second in the race means the team probably wants to know these:

  • What elements build up the race? From weather conditions to corners, and from other cars on the track to driver-responses
  • How can each of these elements be quantified?
  • How can each of these elements be measured for both own cars and other cars?
  • And as you guessed from the high-level result, the stopwatch: how much speedup is required in total and per round?

Birthday present! Free 1-day Online GPGPU crash course: CUDA / HIP / OpenCL

Stream HPC is 10 years old on 1 April 2020. Therefore we offer our one day GPGPU crash course for free that whole month.

Now that Corona (and the fear of it) is spreading, we had to rethink how to celebrate 10 years. So while there were different plans, we simply had to adapt to the market and world dynamics.

5 years ago…

Problem solving tactic: making black boxes smaller

We are a problem solving company first, specialised in HPC – building software close to the processor. The more projects we finish, the more it’s clear that without our problem solving skills, we could not tackle the complexity of GPU and CPU clusters. While I normally shield off how we work and how we continuously improve ourselves, it would be good to share a bit more so both new customers and new recruits know what to expect from the team.

Black boxes will never be transparent

Assumption is the mother of all mistakes

Eugene Lewis Fordsworthe

A colleague put “Assumptions is the mother of all fuckups” on the wall, because we should be assuming we assume. Problem is that we want to have full control and make faster decisions, and then assuming fits in all these scary unknowns.


Improving FinanceBench

If you’re into computational finance, you might have heard of FinanceBench.

It’s a benchmark developed at the University of Delaware and is aimed at those who work with financial code to see how certain code paths can be targeted for accelerators. It utilizes the original QuantLib software framework and samples to port four existing applications for quantitative finance. It contains codes for Black-Scholes, Monte-Carlo, Bonds, and Repo financial applications which can be run on the CPU and GPU.

The problem is that it has not been maintained for 5 years and there were good improvement opportunities. Even though the paper was already written, we think it can still be of good use within computational finance. As we were seeking a way to make a demo for the financial industry that is not behind an NDA, this looked like the perfect starting point for that. We have emailed all the authors of the library, but unfortunately did not get any reply. As the code is provided under a permissive license, we could luckily go forward.

The first version of the code will be released on Github early next month. Below we discuss some design choices and preliminary results.


Updated: OpenCL and CUDA programming training – now online

Update: due to Corona, the Amsterdam training has been cancelled. We’ll offer the training online on dates that better suit the participants.

As it has been very busy here, we have not done public trainings for a long time. This year we’re going to train future GPU-developers again – online. For now it’s one date, but we’ll add more dates in this blog-post later on.

If you need to learn solid GPU programming, this is the training you should attend. The concepts can be applied to other GPU-languages too, which makes it a good investment for any probable future where GPUs exist.

This is a public training, which means there are attendees from various companies. If you prefer not to be in a public class, get in contact to learn more about our in-company trainings.

It includes:

  • Four days of training online
  • Free code-review after the training, to get feedback on what you created with the new knowledge;
  • 1 month of limited support, so you can avoid StackOverflow;
  • Certificate.

Trainings will be done by employees of Stream HPC, who all have a lot of experience with applying the techniques you are going to learn.

Schedule

Most trainings have around 40% lectures, 50% lab-sessions and 10% discussions.


Join us at the Dutch eScience Symposium 2019 in Amsterdam

Soon there will be another Dutch eScience Symposium 2019 in Amsterdam. We thought it might be a good place to meet and listen to e-science talks. Stream HPC in the end is just making scientific software, so we’re here at the right place. The eScience Center is a government institute that aims to advance eScience in the Netherlands.

Interested? Read on!


We accelerated the OpenCL backend of pyPaSWAS sequence aligner

Last year we accelerated the OpenCL-code in PaSWAS, which is open source software to do DNA/RNA/protein sequence alignment and trimming. It has users world-wide in universities, research groups and industry.

Below you’ll find the benchmark results of our acceleration work. You can also test it yourself, as the code is public. In the readme-file you can learn more about the idea of the software. Lots of background information is described in these two papers:

We chose PaSWAS because we really like bio-informatics and computational chemistry – the science is interesting, the problems are complex and the potential GPU-speedup is real. Other examples of such software we worked on are GROMACS and TeraChem.


Do you have our GPU DNA?

This is the first question to warm up. Python-programmers are often users of GPU-libraries, not the builders of those libraries.

In January 2019 I gave a talk about culture in the company, which I wanted to share with you. It was intended to trigger discussions on what environment fits somebody, and examples were given on other companies. The nice part was that it became more clear that the culture of a company like CodePlay was very alike, except they are working on different things (compilers). Same for departments of larger companies we work with or know well.

Important: all answers are based on what my colleagues answered. So most of us are cat-people, but I wouldn’t say that defines a GPU-developer. I hope it still gives you an understanding of our perspective on what defines a GPU-dev in just a few minutes, while it also gives you more than enough matter to think about.


Stream Team at ISC

This year we’ll be with 4 people at ISC: Vincent, Adel, Anna and Istvan. You can find us at booth G-812, next to Red Hat.

Booth G-812 is manned&womened by Stream HPC

While we got known in the HPC-world for our expertise on OpenCL, we now have many years of experience in CUDA and OpenMP. To get there, we’ve focused a lot on how to improve code quality of existing software, to reduce bugs and increase speedup-potential. Our main expertise remains full control over algorithms in software – the same data simply processed faster.

Why do we have a booth?

We’ll be mostly talking to (new) customers for development of high performance software for the big machines. Also we’ll have a list of our open job positions with us, and we can do the first introductory interview on the spot.

Our slogan for this year is:

There are a lot of supercomputers. Somebody has to program its software

We’ll be sharing our week on Twitter, so you can also see what we find: posters about HPC-programming on CPU and GPU, booths that have nice demos or interesting talks and of course the surprises.

Let’s meet!

If you don’t have an appointment yet, but would like to chat with us, please contact us or drop by at our booth. As we’re with four people, we have high flexibility.

GPU-related PHD positions at Eindhoven University and Twente University

We’re collaborating with a few universities on formal verification of GPU code. The project is called ChEOPS: verified Construction of corrEct and Optimised Parallel Software.

We’d like to bring the following PhD positions to your attention:


Eindhoven University of Technology is seeking two PhD students to work on the ChEOPS project, a collaborative project between the universities of Twente and Eindhoven, funded by the Open Technology Programme of the NWO Applied and Engineering Sciences (TTW) domain.

In the ChEOPS project, research is conducted to make the development and maintenance of software aimed at graphics processing units (GPUs) more insightful and effective in terms of functional correctness and performance. GPUs have an increasingly big impact on industry and academia, due to their great computational capabilities. However, in practice, one usually needs to have expert knowledge on GPU architectures to optimally gain advantage of those capabilities.


Academic hackathons for Nvidia GPUs

Are you working with Nvidia GPUs in your research and wish Nvidia would support you as they used to 5 years ago? This is now done with hackathons, where you get one full week of support, to get your GPU-code improved and your CPU-code ported. Still you have to do it yourself, so it’s not comparable to services we provide.

To start, get your team on a decision to do this. It takes preparation and a clear formulation of what your goals are.

When and where?

It’s already April, so some hackathons have already taken place. For 2019, these are left where you can work on any language, from OpenMP to OpenCL and from OpenACC to CUDA. Python + CUDA-libraries is also no problem, as long as the focus is Nvidia.


IWOCL 2019

On Monday May 13, 2019 at 09:30 the latest edition of IWOCL starts, not taking into account any pre-events that might be spontaneously organized. This is the biggest OpenCL-focused event that discusses everything that would make any GPGPU-programmer, DSP-programmer and FPGA-programmer enthusiastic.

What’s new since last year is that it’s actually also a more interesting place for CUDA-developers who like to learn and discuss new GPU-programming techniques. This is because Nvidia’s GTC has moved more to AI, where it used to be mostly GPGPU for years.

Since it’s now the last week of the early-bird pricing, it’s a good time to think about buying your ticket and booking the trip.
