Asynchronous and parallel programming in C++26

For the past decade, every C++ programmer who wanted to do real concurrent work has had the same conversation with themselves. std::async is a toy. std::thread is too low-level. std::future doesn’t compose. So you reach for TBB, or another third-party library, or a thread pool you wrote in 2017, or C++20 coroutines that you have to wire up to an executor you also wrote yourself.

C++26 finally puts an answer in the standard library: std::execution, which is C++’s execution control library, also known as senders/receivers, also known as P2300. It is the largest single addition to the standard since modules, and it is going to take the community a few years to digest. This post is a tour. We’ll start with the parallel algorithm policies you may already know, build up to senders/receivers, schedulers, and cancellation, and end with a worked example that ties everything together.

What is std::execution

std::execution defines a model for describing asynchronous work and provides a small set of generic algorithms for composing those descriptions into larger ones. The actual execution is delegated to a scheduler, which is a handle to whatever resource happens to run the work: a thread pool, a GPU queue, an I/O reactor, or the calling thread itself.

The three core abstractions are:

  • A scheduler is a lightweight handle that says where work runs.
  • A sender is a lazy description of what work runs. It produces one of three completion signals: a value, an error, or “stopped” (cancelled).
  • A receiver is the callback object that consumes those signals. You almost never write one by hand; the algorithms wire them up for you.

The crucial word is lazy. A sender does nothing until you connect it to a receiver and start it. This is why the model composes: you can build up a description of an entire pipeline, with its type fully checked at compile time, hand it to the runtime once, and let the runtime decide how to schedule the pieces.
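To make the laziness concrete, here is a deliberately tiny, hand-rolled model of the idea. It is a sketch only: the namespace toy and everything in it (just, then, sync_wait, the receiver and operation types) are made up for this illustration and merely borrow the names of the real facilities. There is only a value channel, no schedulers, and no error handling.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Toy model only: the names mirror std::execution, but this is NOT the real API.
namespace toy {

// Operation state for just(): holds the value and the receiver until start().
template <class T, class Receiver>
struct just_op {
    T value;
    Receiver rcvr;
    void start() { rcvr.set_value(std::move(value)); }
};

// Sender that completes with a stored value once it has been started.
template <class T>
struct just_sender {
    T value;
    template <class Receiver>
    just_op<T, Receiver> connect(Receiver r) {
        return {std::move(value), std::move(r)};
    }
};

// Receiver used by then(): applies f, then forwards the result downstream.
template <class F, class Receiver>
struct then_receiver {
    F f;
    Receiver rcvr;
    template <class V>
    void set_value(V&& v) { rcvr.set_value(f(std::forward<V>(v))); }
};

// Sender adaptor: remembers an upstream sender and a function, nothing more.
template <class Upstream, class F>
struct then_sender {
    Upstream upstream;
    F f;
    template <class Receiver>
    auto connect(Receiver r) {
        return upstream.connect(
            then_receiver<F, Receiver>{std::move(f), std::move(r)});
    }
};

template <class T>
just_sender<T> just(T v) { return {std::move(v)}; }

template <class S, class F>
then_sender<S, F> then(S s, F f) { return {std::move(s), std::move(f)}; }

// Receiver that stores the final value into a caller-owned slot.
template <class T>
struct store_receiver {
    std::optional<T>* slot;
    void set_value(T v) { slot->emplace(std::move(v)); }
};

// Consumer: connect the sender to a receiver, start it, return the result.
template <class T, class Sender>
std::optional<T> sync_wait(Sender s) {
    std::optional<T> result;
    auto op = s.connect(store_receiver<T>{&result});
    op.start();  // the only point where the function passed to then() runs
    return result;
}

}  // namespace toy
```

Building toy::then(toy::just(20), f) only stores 20 and f inside a struct. The function f is not called until toy::sync_wait connects the chain to a receiver and calls start, which is the same separation between describing and running that the real library relies on.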

What is added in C++26

The paper that became std::execution landed at the St. Louis meeting in June 2024. The standard ships with:

  • The sender/receiver concepts and their machinery.
  • Sender factories: just, just_error, just_stopped, schedule, read_env.
  • Sender adaptors: then, upon_error, upon_stopped, let_value, let_error, let_stopped, starts_on, continues_on, on, when_all, split, bulk, into_variant, stopped_as_optional, stopped_as_error.
  • Sender consumers: sync_wait, sync_wait_with_variant.
  • A system scheduler (P2079) so you don’t have to bring your own thread pool just to write Hello, world.
  • An async scope (P3149) for spawning dynamic, but still joined, work.
  • A coroutine task<T> type that interoperates with senders.

What is not in C++26: networking, file I/O, time-based schedulers, and the asynchronous parallel algorithms (P3300). Those are queued for C++29. The reference implementation, NVIDIA’s stdexec, is what you should use today if you want to experiment.

Enabling parallel algorithms with std::execution policies

Before senders, the simplest way to ask the standard library to use multiple cores was to pass an execution policy to a parallel algorithm. These were added in C++17 and they are the entry point most people will reach for first:

#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> data = load();
std::sort(std::execution::par_unseq, data.begin(), data.end());

The four standard policies are seq, par, par_unseq, and unseq (the last one arrived in C++20). par says the implementation is allowed to use multiple threads. unseq says it is allowed to vectorise. par_unseq says both. seq says neither; it behaves much like the overload without a policy.

These policies are permissions, not promises. The implementation may decide that your range is too small to be worth distributing, and run it on the calling thread. If you want guarantees, you have to measure.

The relationship to std::execution is mostly nominal. The policies live in the same std::execution namespace, and the upcoming asynchronous parallel algorithms (P3300) will accept the same policy types when they ship. But the parallel algorithms today don’t return senders, don’t compose, and don’t let you choose a scheduler. They are a useful, but limited, self-contained corner of the library.

Everything below this point is the new world.

Asynchronous tasks with senders/receivers

The canonical first example. We build a description, then we run it:

namespace ex = std::execution;

ex::sender auto work = ex::just("Hello, world!")
                     | ex::then([](std::string_view s) { std::print("{}\n", s); });

ex::sync_wait(std::move(work));

just(x) is a sender that completes immediately with the value x. then(f) is a sender adaptor: it takes the value coming out of the upstream sender and passes it to f. The pipe operator is just function composition. just(a) | then(f) is the same as then(just(a), f), which, once started, ends up calling f(a). In other words: wait for value a to be delivered from the sender, and when it is delivered, execute the function f(a). Or in the above example: expect “Hello, world!” to be delivered, and when it is, print it.

Until sync_wait is called, nothing has been executed. The variable work holds a typed object describing the computation. sync_wait is what actually starts it and blocks until it finishes. This separation between describing and running is the whole game. You can build pipelines, pass them around, store them, and decide later where and when to run them.

A sender can complete in three ways:

  • set_value(args...): Success, with zero or more values.
  • set_error(err): Failure, with any error type (not just exception_ptr).
  • set_stopped(): Cancellation.

then only fires on the value channel. To handle the others, use upon_error and upon_stopped, which are the same shape but for the other two channels. To chain another sender off the value (the monadic bind), use let_value:

ex::sender auto pipeline =
    ex::just(filename)
  | ex::then(open_file)
  | ex::let_value([](File f) { return read_async_sender(f); })
  | ex::then(parse);

Use then when your continuation is synchronous and returns a value. Use let_value when your continuation is itself asynchronous and returns a sender. Get this distinction wrong and you’ll find yourself with a Sender<Sender<T>> that compiles but does the wrong thing.
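If the then versus let_value distinction feels abstract, an analogy with std::optional may help. This is only an analogy, not sender code: std::optional stands in for "a sender of T", and map_like_then, bind_like_let_value and half_if_even are names invented for this sketch.

```cpp
#include <cassert>
#include <optional>
#include <type_traits>
#include <utility>

// Analogy only: std::optional<T> plays the role of "a sender of T".

// Like then: whatever f returns gets wrapped in one more layer.
template <class T, class F>
auto map_like_then(std::optional<T> in, F f)
    -> std::optional<decltype(f(*in))> {
    if (!in) return std::nullopt;
    return f(*in);
}

// Like let_value: f already returns the wrapper, so nothing is re-wrapped.
template <class T, class F>
auto bind_like_let_value(std::optional<T> in, F f) -> decltype(f(*in)) {
    if (!in) return std::nullopt;
    return f(*in);
}

// A continuation that itself returns the wrapper
// (think: "a continuation that returns a sender").
inline std::optional<int> half_if_even(int x) {
    if (x % 2 != 0) return std::nullopt;
    return x / 2;
}
```

Passing half_if_even to map_like_then compiles, but the result is a std::optional<std::optional<int>>: the outer layer is always engaged and the real answer is hidden one level down. bind_like_let_value returns a plain std::optional<int>. That is roughly the shape of the mistake described above when a sender-returning continuation is handed to then instead of let_value.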

When you need to fan out and join, use when_all and split:

auto common = ex::schedule(sched) | ex::then(parse_input) | ex::split();

auto branch_a = common | ex::then(compute_a);
auto branch_b = common | ex::then(compute_b);

auto combined = ex::when_all(branch_a, branch_b)
              | ex::then([](auto a, auto b) { return merge(a, b); });

split is necessary because senders are single-shot by default. Without it, attaching common to two downstream chains would mean running parse_input twice.

Figure 1: The compute graph created by the code sample above.

Each edge in the graph carries exactly one of the three completion signals (value(s), error, or stopped) to the receiver that follows it.

Mapping tasks to different schedulers

The point of having a scheduler abstraction is that where work runs is a property of the pipeline, not of the work itself. You write the work once and bind it to a scheduler at composition time.

Note: Compile time is when the compiler type-checks and instantiates the sender chain. Composition time is when your code builds the sender description by chaining adaptors (just(x) | then(f) | continues_on(sched)). Runtime / execution time is when sync_wait (or start) actually kicks off the operation state and work flows through the chain: threads run, GPU kernels launch, I/O happens.
The distinction that matters is between composition time and execution time. Both happen “at runtime” in the C++ sense, but the term (“composition time”) is used to name the phase where you describe work without doing it. That’s the core property of laziness in senders: the pipeline is a value you can inspect, store, or pass around before anyone calls start.

There are three primitives for placement:

  • schedule(sched): Produce a sender that completes on sched. This is how you start work on a particular execution context.
  • starts_on(sched, sndr): Run sndr starting on sched, regardless of where the surrounding chain came from.
  • continues_on(sched): At this point in the chain, hop onto sched and continue there.

A typical pattern: read from disk on an I/O scheduler, do CPU work on a worker pool, then write back from the I/O scheduler:

auto pipeline =
    ex::starts_on(io_sched, ex::just(path) | ex::then(read_file))
  | ex::continues_on(cpu_sched)
  | ex::then(process)
  | ex::continues_on(io_sched)
  | ex::then(write_file);

The system scheduler (std::execution::get_system_scheduler() in earlier revisions of P2079, renamed to get_parallel_scheduler() in the adopted wording) is the default thread pool that the standard provides so you don’t have to write one. It’s a sensible default for CPU-bound work, but for I/O or for work that needs strict ordering you’ll usually want to construct your own.

A practical warning, courtesy of Mathieu Ropert’s investigation of stdexec: if you build a pipeline through let_value and the inner sender doesn’t carry a scheduler, bulk(par_unseq, ...) may silently run serially on the calling thread. The fix is to add an explicit continues_on(sched) before the bulk:

ex::let_value([&] {
    return ex::just()
         | ex::continues_on(sched)              // re-anchor the scheduler
         | ex::bulk(ex::par_unseq, count, work);
})

This is the kind of thing you only catch with a profiler. Make a habit of being explicit about scheduler placement at every fan-out point.

How to pause/cancel sender chains

Cancellation in the senders/receivers world is cooperative and structured. There is no kill_thread. Instead, the framework propagates a stop request, and the senders that care about it observe a std::stop_token, finish what they’re doing, and complete with set_stopped instead of set_value.

The mechanism (inherited from std::jthread) is std::stop_source and std::stop_token. When you connect a sender to a receiver, every sender in the chain can query the receiver’s environment via get_env(receiver), and that environment is what answers get_stop_token(env). The token is a property of whoever is consuming the chain.

To put your own stop token into that environment, C++26 gives you exactly the right tool: std::execution::write_env [exec.write.env]. It’s a sender adaptor that wraps a sender so that, when connected to a receiver, the inner receiver sees an environment with whatever extra queries you’ve spliced in. Combined with std::execution::prop [exec.prop], which is a tiny class template that pairs a query tag with a value, it gives you a one-liner for token injection:

namespace ex = std::execution;

std::stop_source src;
auto pipeline = ex::just(42)
              | ex::then([](int x) { return x * 2; });

// In the standard wording, get_stop_token lives in namespace std,
// not in std::execution.
auto result = ex::sync_wait(
                ex::write_env(
                  std::move(pipeline),
                  ex::prop(std::get_stop_token, src.get_token())
                )
              );

if (result) std::println("got {}", std::get<0>(*result));
else        std::println("cancelled");

write_env joins the new env onto the outer one rather than replacing it, so sync_wait’s own queries (allocator, scheduler, etc.) still flow through. sync_wait already returns std::optional<std::tuple<Vs...>>, which is engaged (has_value() == true) on value and disengaged (has_value() == false) on stopped; on error, sync_wait throws. So there’s no extra boilerplate needed.
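The way the three completion channels collapse into that return type can be sketched in a few lines of ordinary C++. This is an illustration of the mapping, not how sync_wait is implemented: stopped_tag, completion and as_sync_wait_result are hypothetical names, and the error channel is assumed to carry a std::exception_ptr.

```cpp
#include <cassert>
#include <exception>
#include <optional>
#include <stdexcept>
#include <tuple>
#include <utility>
#include <variant>

struct stopped_tag {};  // hypothetical marker for the "stopped" channel

// Hypothetical stand-in for a finished operation: values, an error, or stopped.
template <class... Vs>
using completion =
    std::variant<std::tuple<Vs...>, std::exception_ptr, stopped_tag>;

// Maps the three channels onto the shape described above:
//   value   -> engaged optional holding the tuple of values
//   error   -> exception rethrown to the caller
//   stopped -> disengaged optional
template <class... Vs>
std::optional<std::tuple<Vs...>> as_sync_wait_result(completion<Vs...> c) {
    if (auto* values = std::get_if<0>(&c)) return std::move(*values);
    if (auto* error = std::get_if<1>(&c)) std::rethrow_exception(*error);
    return std::nullopt;
}
```

This is why the if (result) ... else ... check in the snippet above is enough to tell success from cancellation, while failures surface as ordinary exceptions at the call site.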

Now let’s actually test cancellation. I initially implemented everything from scratch, including an interruptible_sleep_sender that showcases the cancellation feature (which can be found in this godbolt link). What makes things much simpler, though, is the experimental::execution::timed_thread_context together with functions like experimental::execution::schedule_after, which schedules work to run after a specified delay. So, without needing to define our own structs, we can do this:

// godbolt link: https://godbolt.org/z/nxG8M6oY9
{
    namespace expex = experimental::execution;
    ex::inplace_stop_source src;

    expex::timed_thread_context timer_ctx;
    auto timer = timer_ctx.get_scheduler();

    auto pipeline = ex::just(42)
                  | ex::let_value([timer](int x)
                    {
                        return expex::schedule_after(timer, 5s)
                             | ex::then([x] { return x * 2; });
                    });

    // cancel from another thread:
    std::jthread canceller([&] {
        std::this_thread::sleep_for(200ms);
        std::println("cancelling from thread {}", std::this_thread::get_id());
        src.request_stop();
    });

    auto result = ex::sync_wait(
                    ex::write_env(
                        std::move(pipeline),
                        ex::prop(ex::get_stop_token, src.get_token())
                    )
                );

    if (result) std::println("got {}", std::get<0>(*result));
    else        std::println("cancelled");
}

For longer-running or dynamically spawned work, prefer spawning through an async_scope (P3149, see the advanced example below). The scope’s spawn/spawn_future already builds a receiver whose environment carries a stop token tied to the scope’s lifetime, so you don’t need write_env at all and calling request_stop() on the scope cancels every task running inside it.

Once a stop token is in the receiver’s environment, regardless of whether it is put there with write_env or from an async_scope, then request_stop() on the corresponding source will be observed all the way through the chain.

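For a sender to honor the request, something has to query the token. Adaptors such as then and bulk pass the receiver’s environment, and with it the token, through to the senders they wrap, so the operations that actually wait (timers, I/O, your own custom senders) can check it. when_all is particularly important: if any of its child senders fails or is stopped, it requests stop on its remaining children before completing.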

You can pause a chain by writing a sender whose start registers a callback and returns. The chain is “paused” because nothing is running; it resumes when the callback fires and calls set_value on its receiver. This is how I/O senders, timer senders, and task<T> coroutines work. There is no thread sitting blocked. That’s the whole reason structured async exists.
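Here is a minimal sketch of that "register a callback and return" shape, written without the real library so that it stays small. manual_event_source is a hypothetical stand-in for a reactor or timer, and record_receiver and wait_op are likewise invented for the illustration.

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an I/O reactor or timer: it stores callbacks
// until someone fires them.
struct manual_event_source {
    std::vector<std::function<void()>> waiters;

    void on_ready(std::function<void()> cb) { waiters.push_back(std::move(cb)); }

    void fire() {
        auto ready = std::move(waiters);
        waiters.clear();
        for (auto& cb : ready) cb();
    }
};

// Receiver that records the delivered value in a caller-owned slot.
struct record_receiver {
    std::optional<int>* slot;
    void set_value(int v) { slot->emplace(v); }
};

// Operation state: start() registers a callback and returns immediately.
// The operation must stay at a stable address until it completes, because
// the callback captures `this`.
template <class Receiver>
struct wait_op {
    manual_event_source* source;
    int value;
    Receiver rcvr;

    void start() {
        source->on_ready([this] { rcvr.set_value(value); });
    }
};
```

After start() returns, no thread is blocked and no value has been delivered; the chain is simply parked inside the event source. Calling fire() later resumes it by invoking set_value on the receiver.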

How to create pause/cancel callback wrappers

Most of the time you’ll consume the cancellation machinery rather than implement it. But if you’re integrating an existing API like a callback-based HTTP client, or a hardware event, you’ll write a small custom sender.

Here’s a sketch of a sender that wraps a one-shot callback API and respects stop requests:

namespace ex = std::execution;

template <typename Receiver>
struct cancellable_op {
    using operation_state_concept = ex::operation_state_t;

    Receiver receiver;
    SomeAPIHandle handle;

    using StopToken = std::stop_token_of_t<ex::env_of_t<Receiver>>;

    struct on_stop {
        cancellable_op* self;
        void operator()() const noexcept { self->handle.cancel(); }
    };

    // Use the token's OWN callback_type - that way this works whether
    // the env carries a real stop token (std::stop_token,
    // std::inplace_stop_token) or never_stop_token (no-op).
    using stop_cb_t = typename StopToken::template callback_type<on_stop>;
    std::optional<stop_cb_t> stop_cb;

    // C++26 member-function customization (no more tag_invoke).
    void start() & noexcept {
        auto token = std::get_stop_token(ex::get_env(receiver));
        stop_cb.emplace(std::move(token), on_stop{this});

        handle.submit([this](Result r) {
            stop_cb.reset();
            if (r.cancelled()) {
                ex::set_stopped(std::move(receiver));
            } else if (r.hasError()) {
                ex::set_error(std::move(receiver), r.outcome.error());
            } else {
                ex::set_value(std::move(receiver), *r.outcome);
            }
        });
    }
};

struct cancellable_sender {
    using sender_concept = ex::sender_t;
    using completion_signatures = ex::completion_signatures<
        ex::set_value_t(std::optional<int>),
        ex::set_error_t(Error),
        ex::set_stopped_t()>;

    SomeAPIHandle handle;

    // Member-function connect.
    template <typename Receiver>
    auto connect(Receiver r) && {
        return cancellable_op<Receiver>{std::move(r), std::move(handle)};
    }
};

Three things to notice. First, the stop callback is an RAII object, meaning that its constructor registers, its destructor unregisters. You must reset it before completing the receiver, or you risk the callback firing on a torn-down object. Second, start is noexcept. Senders may be started in contexts where exceptions can’t be propagated; throw and you terminate. Third, the receiver is moved into the operation state at connect time and lives there until completion. There is no allocation per chain element; the whole graph is one nested aggregate, sized at compile time.

This is more than you’ll usually want to write. In practice, prefer the upcoming std::execution::task<T> coroutine, where co_await does all of the above for you.

As always, a godbolt example invoking the (1) value, (2) error and (3) cancel paths can be found here.

Advanced example

Let’s pull it all together. Suppose we have an image processing pipeline that:

  1. Reads a JPEG from disk on an I/O scheduler.
  2. Decodes it on the I/O scheduler (decoding is mostly a stream of read calls plus some CPU; let’s keep it on I/O for simplicity).
  3. Hands the pixel buffer to a GPU worker pipeline, which runs a parallel filter GPU kernel over the rows.
  4. Re-encodes the result to JPEG on the I/O scheduler.
  5. Uploads the encoded bytes to a remote server on a network scheduler.
  6. Is fully cancellable, and the upload will be aborted if a stop is requested mid-flight.

Assume the obvious helper senders exist. We’re not interested in how the JPEG decoder works.

// Godbolt link for this example: https://godbolt.org/z/b5PnnzMqn
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <chrono>
#include <cstdint>
#include <filesystem>
#include <print>
#include <string>
#include <thread>
#include <vector>

// Hypothetical GPU scheduler header - each vendor (ROCm, CUDA, ...)
// would ship their own that models ex::scheduler, or we create our own
// wrappers for it.
#include <gpu/scheduler.hpp>

namespace ex = stdexec;
using namespace std::literals;

// ── Domain types ────────────────────────────────────────────────────
struct PixelBuffer {
    std::vector<std::uint8_t> pixels;
    int width{}, height{}, channels{3};
    int row_count() const noexcept { return height; }
};
struct JpegBlob     { std::vector<std::uint8_t> bytes; };
struct UploadResult { int status_code{}; std::string etag; };

// ── Synchronous helpers (the pipeline turns these into senders
//    via just/then) ──────────────────────────────────────────────────
JpegBlob    read_file(std::filesystem::path p);              // may throw
PixelBuffer decode_jpeg(JpegBlob&& blob);                    // may throw
JpegBlob    encode_jpeg(PixelBuffer&& buf);                  // may throw

// ── GPU kernel wrapper ──────────────────────────────────────────────
// launch_filter_kernel synchronously enqueues a GPU kernel operating
// on row `row` of `buf`.  Internally it wraps the hipLaunchKernel /
// cuLaunchKernel call.  The bulk scheduler ensures all enqueued
// kernels complete before the bulk sender signals set_value.
// A GPU-aware scheduler maps bulk indices to hardware threads /
// wavefronts, so each call just enqueues — it does not wait.
void launch_filter_kernel(PixelBuffer& buf, int row) noexcept;

// ── upload_to: returns a cancellable sender that uploads `blob` to
//    `url` and completes with UploadResult.  Same stop-callback
//    pattern as the callback-wrapper section above.
ex::sender auto upload_to(std::string url, JpegBlob& blob);

// ── The pipeline ────────────────────────────────────────────────────
// A function template parameterised on schedulers.  The same pipeline
// can be tested with inline_scheduler, profiled with an instrumented
// one, or run against real thread pools.
template <ex::scheduler IoSched,
          ex::scheduler GpuSched,
          ex::scheduler NetSched>
ex::sender auto image_pipeline(
    IoSched  io,
    GpuSched gpu,
    NetSched net,
    std::filesystem::path path,
    std::string           upload_url)
{
    return
        // 1-2. Read and decode on the I/O context.
        ex::starts_on(io,
            ex::just(std::move(path))
          | ex::then(read_file)
          | ex::then(decode_jpeg))

        // 3. GPU filter: hop onto the GPU scheduler, then enqueue
        //    one kernel per row via bulk.  The GPU scheduler maps
        //    bulk indices to hardware threads / wavefronts.
      | ex::continues_on(gpu)
      | ex::let_value([gpu](auto&& buf) {
            const int rows = buf.row_count();
            return ex::just(std::forward<decltype(buf)>(buf))
                 | ex::continues_on(gpu)
                 | ex::bulk(ex::par, rows, // Note the std::execution policy from introduction ;)
                       [](int row, PixelBuffer& b) noexcept {
                           // Each invocation enqueues a GPU kernel.
                           // A GPU-aware scheduler maps bulk indices
                           // to hardware threads / wavefronts.
                           launch_filter_kernel(b, row);
                       });
        })

        // 4. Re-encode on the I/O context.
      | ex::continues_on(io)
      | ex::then(encode_jpeg)

        // 5-6. Upload on the network context.  upload_to returns a
        //      cancellable sender, so a stop request mid-upload will
        //      abort the HTTP call and complete with set_stopped.
      | ex::continues_on(net)
      | ex::let_value([url = std::move(upload_url)](auto&& blob) {
            return upload_to(url, std::forward<decltype(blob)>(blob));
        });
}

int main()
{
    // Three execution contexts, each representing a different resource.
    exec::static_thread_pool io_ctx{2};
    gpu::context             gpu_ctx;    // wraps a GPU device
    exec::static_thread_pool net_ctx{2};

    ex::inplace_stop_source stop;

    // Simulate the user pressing Ctrl-C after 100 ms.
    std::jthread watchdog([&] {
        std::this_thread::sleep_for(100ms);
        stop.request_stop();
    });

    auto result = ex::sync_wait(
        ex::write_env(
            image_pipeline(
                io_ctx.get_scheduler(),
                gpu_ctx.get_scheduler(),
                net_ctx.get_scheduler(),
                "/photos/cat.jpg",
                "https://cdn.example.com/images/cat.jpg"),
            ex::prop(ex::get_stop_token, stop.get_token())));

    if (result) {
        auto& [upload] = *result;
        std::println("done: HTTP {} etag={}", upload.status_code, upload.etag);
    } else {
        std::println("pipeline cancelled");
    }
}

A toy version of the example above, which artificially varies UPLOAD_DURATION so that the pipeline either succeeds or gets cancelled, can be found here!

Analysis

Read the pipeline definition from top to bottom and notice what isn’t there. There are no thread handles, no futures, no mutexes, no condition variables. There is no question about who owns what data: the value channel is responsible for moving the data through the pipeline. The schedulers are passed in, so the same pipeline could be tested by passing in an inline scheduler, or profiled by passing in an instrumented one. The whole graph is a value, built lazily, started once, joinable, cancellable, and statically typed end-to-end.

A few things worth calling out:

  • let_value for dynamic bulk shapes. The row count comes from the decoded PixelBuffer, which is only known at runtime. bulk(rows, f) needs rows at the point where the adaptor is created, so we use let_value to unwrap the buffer, read its dimensions, and re-emit it through just + bulk. This “unwrap, inspect, rewrap” is the standard idiom whenever a downstream adaptor’s parameters depend on the upstream value.
  • GPU via continues_on(gpu) + bulk. A GPU scheduler models the same scheduler concept as a thread pool. continues_on(gpu) hops execution onto the device context, and the subsequent bulk(par, rows, f) maps each index to a hardware thread or wavefront. The callable passed to bulk (a lambda that calls launch_filter_kernel) enqueues the actual kernel. A vendor-specific GPU scheduler (ROCm’s HIP, CUDA, oneAPI) would schedule work by enqueueing onto the device’s command queue and complete when the queue signals. From the pipeline’s perspective it’s just another scheduler with no special API surface.

Note: bulk(n, f) is defined to forward its input values downstream after all n invocations of f complete, so the PixelBuffer& is passed to step 4.

  • continues_on at every boundary. Without explicit placement, bulk may silently run serially on whatever thread the previous step completed on (as we noted in the scheduler section). The continues_on(gpu) before bulk ensures the filter launches on the GPU, and the continues_on(io) after it hops back to the CPU for the encode.
  • Cancellation is structural. The single write_env + prop(get_stop_token, ...) at the outermost layer is enough. Every sender in the chain (then, bulk, when_all, and our custom upload_to) observes the same token through the receiver’s environment. When the watchdog calls request_stop(), the pipeline completes with set_stopped and sync_wait returns std::nullopt. Even the GPU kernel can be aborted mid-flight if the device supports async cancellation (on AMD hardware this maps to hipStreamDestroy / signal-based abort).

Conclusion

We’ve abstracted concurrency and asynchronous programming without ever caring about threads, mutexes, or condition variables. We’ve created portable (and cancellable) pipelines that define work in terms of what runs where and minimized the risk of data races. That is what std::execution is for.

The API surface is large and the error messages are, today, terrifying. The reference implementation has rough edges that will catch you out, although development is active and the issue list is relatively short. But the underlying model is the right one. After a decade of executor proposals, C++ finally has a single vocabulary for “do this work somewhere, then this work somewhere else, and let me cancel the whole thing.” Whether or not C++26 is the version where you adopt it, it is the version where you should start learning it.

On a closing note, everyone is talking about the C++26 reflection [P2996] features. Reflection unlocks a whole new realm of the language that will take us years to master and understand (maybe even more than 10 years), but this standard also brings a palette of other new features (safety, erroneous behavior, pre- and postconditions and, of course, std::execution) that users will be eager to adopt. And to quote the closing statement of Herb Sutter’s post Living in the future: Using C++26 at work: fun times for C++!

NVIDIA: mobile phones, tablets and HPC (cloud)

If you want to see what is coming up in the consumer-technology market (PC, mobile and tablet), then NVIDIA can tell you the most. The company is very flexible, and shows time after time that it really knows which markets it currently operates in and which it can enter. I sometimes strongly disagree with their marketing, but watch them closely, as they are in the most important markets for defining the near future: PCs, mobile/tablet and HPC.
You might think I completely miss interconnects (buses between processors, devices and memory) and memory-technologies as clouds have a large need for high-speed data-transport, but the last 20 years have shown that this is a quite stable developing market based on IP-selling to the hardware-vendors. With the acquisition of Cray’s interconnect technology, we have seen this is serious business for Intel, so things might change indeed. For this article I want to focus on NVIDIA’s choices.

Let’s enter the Top500 HPC list using GPUs

The #500 super-computer has only 24 TFlops (2010-06-06): http://www.top500.org/system/9677

Update: scroll down to see the best configuration I have found. In other words: a cluster with at least 30 nodes with 4 high-end GPUs each (costing almost €2000,- per node and giving roughly 5 TFLOPS single precision, 1 TFLOPS double precision) would enter the Top500. That is 25 nodes to get to a theoretical 25 TFLOPS and 5 extra to overcome the overhead. So for about €60 000,- of hardware anyone can be on the list (and add at least €13 000 if you want to use Windows instead of Linux for some reason). OK, you pay most for the services and the actual building when buying such a cluster, but you get the idea: it does not cost you a few million any more. I’m curious: who is building these kinds of clusters? Could you tell me the specs (theoretical TFLOPS, LinPack TFLOPS and watts/TFLOP) of your (theoretical) cluster, which costs the customer less than €100 000,- in total? Or do you know companies who can do this? I’ll make a list of companies who will be building the clusters of tomorrow, the “Top €100.000,- HPC cluster list”. You can mail me via vincent [at] this domain, or put your answer in a comment.

Update: the hardware shopping-list

Nobody pointed out in the comments that it is easy to build a faster machine than the one described above. So I’ll do it. We want the most FLOPS per box, so here’s the wishlist:

  • A motherboard with as many PCI-E slots, CPU sockets and memory banks as possible. This is because the latency between the nodes is high.
  • A CPU with at least 4 cores.
  • Focus on the bandwidth, else we will not be able to use all power.
  • Focus on price per GFLOPS.

The following is what I found in local computer stores (where, for some reason, people love to talk about extreme machines). AMD currently has the graphics cards with the most double-precision power, so I chose their products. I’m looking around for Intel + NVIDIA, but currently they are far behind. Is AMD back on stage after being beaten by Intel’s Core products for so many years?

The GigaByte GA-890FXA-UD7 (€245,-) has 1 AM3 socket, 6(!) PCI-E slots and supports up to 16GB of memory. We want some power, so we use the AMD Phenom II X6 1090T (€289,-), which I chose for the 6 cores and the low price per FLOPS. And to make it a monster, we add 6 AMD HD5970 cards (€599,- each), giving 928 x 6 = 5568 DP-GFLOPS. The board can handle 16GB of DDR3 (€750,-), so we put that in. It needs about 3 power supplies of 700 Watt (€100,- each). We add a 128GB SSD (€350,-) for working data and a big 2 TB HDD (€100,-). The case needs to house the 3 power supplies (€100,-). Cooling is important and I suggest you compete with a wind tunnel (€500,-). It will cost you €6228,- for 5,6 double-precision TFLOPS and 27 TFLOPS single precision. A cluster would be on the HPC500 list for around €38000,- (pure hardware price, not taking network devices too much into account, nor the price of man-hours).
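The shopping list above is easy to tally in a few lines of code. This is only a sketch of the arithmetic: the part struct and the function names are invented for it, and it assumes that the €599,- and €100,- prices for the graphics cards and power supplies are per unit, which is the reading that reproduces the €6228,- total.

```cpp
#include <cassert>

struct part {
    const char* name;
    int unit_price_eur;
    int quantity;
};

// Prices and quantities as listed in the text above.
constexpr part node_parts[] = {
    {"GigaByte GA-890FXA-UD7 motherboard", 245, 1},
    {"AMD Phenom II X6 1090T", 289, 1},
    {"AMD HD5970", 599, 6},
    {"16GB DDR3", 750, 1},
    {"700 W power supply", 100, 3},
    {"128GB SSD", 350, 1},
    {"2 TB HDD", 100, 1},
    {"Case", 100, 1},
    {"Cooling", 500, 1},
};

// Total hardware price of one node in euros.
constexpr int node_cost_eur() {
    int total = 0;
    for (const part& p : node_parts) total += p.unit_price_eur * p.quantity;
    return total;
}

constexpr int gpus_per_node = 6;
constexpr int dp_gflops_per_gpu = 928;  // per-card figure from the text

// Theoretical peak double-precision GFLOPS of one node (GPUs only).
constexpr int node_dp_gflops() { return gpus_per_node * dp_gflops_per_gpu; }
```

node_cost_eur() comes out at 6228 and node_dp_gflops() at 5568, that is, roughly 5,6 double-precision TFLOPS per node.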

Disclaimer: this is the price of a single node, excluding services, maintenance, software installation, networking, engineering, etc. Please note that the above price is purely for building a single node yourself, if you have the knowledge to do so.