Asynchronous and parallel programming in C++26

For the past decade, every C++ programmer who wanted to do real concurrent work has had the same conversation with themselves. std::async is a toy. std::thread is too low-level. std::future doesn’t compose. So you reach for TBB, or another third-party library, or a thread pool you wrote in 2017, or C++20 coroutines that you have to wire up to an executor you also wrote yourself.

C++26 finally puts an answer in the standard library: std::execution, which is C++’s execution control library, also known as senders/receivers, also known as P2300. It is one of the largest additions to the standard since modules, and it is going to take the community a few years to digest. This post is a tour. We’ll start with the parallel algorithm policies you may already know, build up to senders/receivers, schedulers, and cancellation, and end with a worked example that ties everything together.

What is std::execution

std::execution defines a model for describing asynchronous work and provides a small set of generic algorithms for composing those descriptions into larger ones. The actual execution is delegated to a scheduler, which is a handle to whatever resource happens to run the work: a thread pool, a GPU queue, an I/O reactor, or the calling thread itself.

The three core abstractions are:

  • A scheduler is a lightweight handle that says where work runs.
  • A sender is a lazy description of what work runs. It produces one of three completion signals: a value, an error, or “stopped” (cancelled).
  • A receiver is the callback object that consumes those signals. You almost never write one by hand; the algorithms wire them up for you.

The crucial word is lazy. A sender does nothing until you connect it to a receiver and start it. This is why the model composes: you can build up a description of an entire pipeline at compile time, hand it to the runtime once, and let the runtime decide how to schedule the pieces.
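To make the laziness concrete, here is a toy model in plain C++17. It is a sketch, not std::execution: toy_just, toy_then, make_then, and lazy_demo are hypothetical names invented for this illustration, and the "receiver" is just a callable. The only point it makes is that building the pipeline stores the pieces, and nothing runs until start is called.

```cpp
#include <cassert>
#include <utility>

// Toy stand-ins (hypothetical names, NOT the std::execution API).
// A "sender" here is any object with start(receiver); nothing runs
// until start is called.
template <class T>
struct toy_just {
    T value;
    template <class Receiver>
    void start(Receiver&& r) { std::forward<Receiver>(r)(std::move(value)); }
};

template <class Upstream, class F>
struct toy_then {
    Upstream upstream;
    F fn;
    template <class Receiver>
    void start(Receiver&& r) {
        // Only now is the upstream started and fn applied to its value.
        upstream.start([&](auto&& v) { r(fn(std::forward<decltype(v)>(v))); });
    }
};

// Building the pipeline only stores the pieces.
template <class Upstream, class F>
toy_then<Upstream, F> make_then(Upstream up, F fn) {
    return {std::move(up), std::move(fn)};
}

// Returns how often the continuation ran before and after starting.
inline std::pair<int, int> lazy_demo() {
    int calls = 0;
    auto pipeline = make_then(toy_just<int>{20},
                              [&calls](int x) { ++calls; return x + 1; });
    const int before = calls;            // pipeline built, nothing executed
    int result = 0;
    pipeline.start([&result](int v) { result = v; });
    assert(result == 21);
    return {before, calls};
}
```

In this sketch, lazy_demo() reports zero calls before start and one call after it. The real library adds the connect step, operation states, and the error and stopped channels on top of the same idea.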

What is added in C++26

The paper that became std::execution landed at the St. Louis meeting in June 2024. The standard ships with:

  • The sender/receiver concepts and their machinery.
  • Sender factories: just, just_error, just_stopped, schedule, read_env.
  • Sender adaptors: then, upon_error, upon_stopped, let_value, let_error, let_stopped, starts_on, continues_on, on, when_all, split, bulk, into_variant, stopped_as_optional, stopped_as_error.
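  • Sender consumers: sync_wait, sync_wait_with_variant (declared in namespace std::this_thread rather than std::execution).
  • A system scheduler (P2079, named parallel_scheduler in the adopted wording) so you don’t have to bring your own thread pool just to write Hello, world.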
  • An async scope (P3149) for spawning dynamic, but still joined, work.
  • A coroutine task<T> type that interoperates with senders.

What is not in C++26: networking, file I/O, time-based schedulers, and the asynchronous parallel algorithms (P3300). Those are queued for C++29. The reference implementation, NVIDIA’s stdexec, is what you should use today if you want to experiment.

Enabling parallel algorithms with std::execution policies

Before senders, the simplest way to ask the standard library to use multiple cores was to pass an execution policy to a parallel algorithm. These were added in C++17 and they are the entry point most people will reach for first:

#include <algorithm>
#include <execution>
#include <vector>

std::vector<float> data = load();
std::sort(std::execution::par_unseq, data.begin(), data.end());

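The four standard policies are seq, par, par_unseq, and unseq (the first three arrived in C++17, unseq in C++20). par says the implementation is allowed to use multiple threads. unseq says it is allowed to vectorise. par_unseq says both. seq says neither; it runs like the overload without a policy, except that an exception escaping from an element access function calls std::terminate, as it does under every standard policy.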

These policies are permissions, not promises. The implementation may decide that your range is too small to be worth distributing, and run it on the calling thread. If you want guarantees, you have to measure.

The relationship to std::execution is mostly nominal. The policies live in the same std::execution namespace, and the upcoming asynchronous parallel algorithms (P3300) will accept the same policy types when they ship. But the parallel algorithms today don’t return senders, don’t compose, and don’t let you choose a scheduler. They are a useful, but limited, self-contained corner of the library.

Everything below this point is the new world.

Asynchronous tasks with senders/receivers

The canonical first example. We build a description, then we run it:

namespace ex = std::execution;

ex::sender auto work = ex::just("Hello, world!")
                     | ex::then([](std::string_view s) { std::print("{}\n", s); });

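// In the standard, sync_wait lives in std::this_thread (stdexec spells it stdexec::sync_wait).
std::this_thread::sync_wait(std::move(work));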

just(x) is a sender that completes immediately with the value x. then(f) is a sender adaptor: it takes the value coming out of the upstream sender and passes it to f. The pipe operator is just function composition: just(a) | then(f) is the same as then(just(a), f), a sender that calls f(a) once it has been started. In other words: wait for value a to be delivered from the sender, and when it is delivered, execute the function f(a). Or in the above example: expect “Hello, world!” to be delivered, and when it is, print it.

Until sync_wait is called, nothing has been executed. The variable work holds a typed object describing the computation. sync_wait is what actually starts it and blocks until it finishes. This separation between describing and running is the whole game. You can build pipelines, pass them around, store them, and decide later where and when to run them.

A sender can complete in three ways:

  • set_value(args...): Success, with zero or more values.
  • set_error(err): Failure, with any error type (not just exception_ptr).
  • set_stopped(): Cancellation.
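The three channels can be pictured with a toy receiver in plain C++17. This is a sketch under stated assumptions: recording_receiver, then_receiver, outcome, and complete_with are hypothetical names, the value type is fixed to int, and the error type to std::string for brevity. It shows a then-like wrapper that transforms only the value channel and passes the other two through untouched.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Toy model (hypothetical names, not the std::execution API): a receiver
// exposes one member per completion channel.
struct recording_receiver {
    std::string* log;
    void set_value(int v)           { *log += "value:" + std::to_string(v); }
    void set_error(std::string err) { *log += "error:" + err; }
    void set_stopped()              { *log += "stopped"; }
};

// A then-like wrapper: transforms the value channel, forwards the rest.
template <class Downstream, class F>
struct then_receiver {
    Downstream next;
    F fn;
    void set_value(int v)           { next.set_value(fn(v)); }
    void set_error(std::string err) { next.set_error(std::move(err)); }
    void set_stopped()              { next.set_stopped(); }
};

enum class outcome { value, error, stopped };

// Completes the wrapped receiver on exactly one channel and returns the log.
inline std::string complete_with(outcome o) {
    std::string log;
    auto twice = [](int x) { return x * 2; };
    then_receiver<recording_receiver, decltype(twice)> r{{&log}, twice};
    switch (o) {
        case outcome::value:   r.set_value(21);   break;
        case outcome::error:   r.set_error("io"); break;
        case outcome::stopped: r.set_stopped();   break;
    }
    return log;
}
```

Only the value path goes through the doubling function; an error or a stop request reaches the final receiver exactly as it was sent.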

then only fires on the value channel. To handle the others, use upon_error and upon_stopped, which are the same shape but for the other two channels. To chain another sender off the value (the monadic bind), use let_value:

ex::sender auto pipeline =
    ex::just(filename)
  | ex::then(open_file)
  | ex::let_value([](File f) { return read_async_sender(f); })
  | ex::then(parse);

Use then when your continuation is synchronous and returns a value. Use let_value when your continuation is itself asynchronous and returns a sender. Get this distinction wrong and you’ll find yourself with a Sender<Sender<T>> that compiles but does the wrong thing.
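The same trap exists with any wrapper type, so it can be shown without senders at all. The sketch below is an analogy only: std::optional stands in for "a sender of T", and map_opt, bind_opt, parse_digit, and demo_bind are hypothetical helpers written for this illustration. map_opt plays the role of then, bind_opt the role of let_value.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <type_traits>

// Analogy only: std::optional stands in for "a sender of T".
// map_opt wraps whatever f returns (like then).
template <class T, class F>
auto map_opt(const std::optional<T>& in, F f)
    -> std::optional<decltype(f(*in))> {
    if (!in) return std::nullopt;
    return f(*in);
}

// bind_opt expects f to return an optional already (like let_value).
template <class T, class F>
auto bind_opt(const std::optional<T>& in, F f) -> decltype(f(*in)) {
    if (!in) return std::nullopt;
    return f(*in);      // no extra layer of wrapping
}

// A continuation that itself produces a wrapped result.
inline std::optional<int> parse_digit(const std::string& s) {
    if (s.size() == 1 && s[0] >= '0' && s[0] <= '9') return s[0] - '0';
    return std::nullopt;
}

inline int demo_bind() {
    std::optional<std::string> input{"7"};
    // With map_opt the result type is doubly wrapped: it compiles,
    // but it is rarely what you meant.
    static_assert(std::is_same_v<decltype(map_opt(input, parse_digit)),
                                 std::optional<std::optional<int>>>);
    // bind_opt flattens it, as let_value does for senders.
    return bind_opt(input, parse_digit).value_or(-1);
}
```

The static_assert is the optional-flavoured version of the Sender<Sender<T>> problem: the mapping form nests the wrapper, the binding form flattens it.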

When you need to fan out and join, use when_all and split:

auto common = ex::schedule(sched) | ex::then(parse_input) | ex::split();

auto branch_a = common | ex::then(compute_a);
auto branch_b = common | ex::then(compute_b);

auto combined = ex::when_all(branch_a, branch_b)
              | ex::then([](auto a, auto b) { return merge(a, b); });

split is necessary because senders are single-shot by default. Without it, attaching common to two downstream chains would mean running parse_input twice.

Figure 1: The compute graph created by the code sample above.

Each edge of the graph carries exactly one completion signal (a value, an error, or stopped) to the receiver downstream of it.

Mapping tasks to different schedulers

The point of having a scheduler abstraction is that where work runs is a property of the pipeline, not of the work itself. You write the work once and bind it to a scheduler at composition time.

Note: Compile time is when the compiler type-checks and instantiates the sender chain. Composition time is when your code builds the sender description by chaining adaptors (just(x) | then(f) | continues_on(sched)). Runtime / execution time is when sync_wait (or start) actually kicks off the operation state and work flows through the chain: threads run, GPU kernels launch, I/O happens.
The distinction that matters is between composition time and execution time. Both happen “at runtime” in the C++ sense, but the term (“composition time”) is used to name the phase where you describe work without doing it. That’s the core property of laziness in senders: the pipeline is a value you can inspect, store, or pass around before anyone calls start.

There are three primitives for placement:

  • schedule(sched): Produce a sender that completes on sched. This is how you start work on a particular execution context.
  • starts_on(sched, sndr): Run sndr starting on sched, regardless of where the surrounding chain came from.
  • continues_on(sched): At this point in the chain, hop onto sched and continue there.
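What "hopping" between contexts means can be sketched with a toy in plain C++17. The run_queue type and placement_demo function below are hypothetical and single-threaded: a queue stands in for an execution context, and draining it stands in for one of that context’s threads picking up work. It is not how a real scheduler is implemented, only a picture of work being posted to one context and continuing on another.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>

// Toy execution context (hypothetical): work posted here runs only when
// someone drains the queue, which stands in for "a thread of this context".
struct run_queue {
    std::string name;
    std::deque<std::function<void()>> tasks;
    void post(std::function<void()> task) { tasks.push_back(std::move(task)); }
    void drain(std::string& trace) {
        while (!tasks.empty()) {
            auto task = std::move(tasks.front());
            tasks.pop_front();
            trace += name + ";";             // record where the task ran
            task();
        }
    }
};

// Start on `io`, then hop to `cpu` for the second step.
inline std::string placement_demo() {
    run_queue io{"io", {}}, cpu{"cpu", {}};
    std::string trace;
    int value = 0;
    io.post([&] {
        value = 20;                          // step 1 runs on io
        cpu.post([&] { value += 1; });       // hop, in the spirit of continues_on(cpu)
    });
    io.drain(trace);
    cpu.drain(trace);
    assert(value == 21);
    return trace;
}
```

The trace records one task on io followed by one on cpu. The work itself never names a thread; the placement is decided by which queue each step is posted to.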

A typical pattern: read from disk on an I/O scheduler, do CPU work on a worker pool, then write back from the I/O scheduler:

auto pipeline =
    ex::starts_on(io_sched, ex::just(path) | ex::then(read_file))
  | ex::continues_on(cpu_sched)
  | ex::then(process)
  | ex::continues_on(io_sched)
  | ex::then(write_file);

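The system scheduler from P2079 (spelled std::execution::parallel_scheduler and obtained with std::execution::get_parallel_scheduler() in the adopted wording; earlier revisions of the paper used get_system_scheduler()) is the default thread pool that the standard provides so you don’t have to write one. It’s a sensible default for CPU-bound work, but for I/O or for work that needs strict ordering you’ll usually want to construct your own.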

A practical warning, courtesy of Mathieu Ropert’s investigation of stdexec: if you build a pipeline through let_value and the inner sender doesn’t carry a scheduler, bulk(par_unseq, ...) may silently run serially on the calling thread. The fix is to add an explicit continues_on(sched) before the bulk:

ex::let_value([&] {
    return ex::just()
         | ex::continues_on(sched)              // re-anchor the scheduler
         | ex::bulk(ex::par_unseq, count, work);
})

This is the kind of thing you only catch with a profiler. Make a habit of being explicit about scheduler placement at every fan-out point.

How to pause/cancel sender chains

Cancellation in the senders/receivers world is cooperative and structured. There is no kill_thread. Instead, the framework propagates a stop request, and the senders that care about it observe a std::stop_token, finish what they’re doing, and complete with set_stopped instead of set_value.

The mechanism (inherited from std::jthread) is std::stop_source and std::stop_token. When you connect a sender to a receiver, every sender in the chain can query the receiver’s environment via get_env(receiver), and that environment is what answers get_stop_token(env). The token is a property of whoever is consuming the chain.

To inject a stop token of your own from outside the chain, C++26 gives you exactly the right tool: std::execution::write_env [exec.write.env]. It’s a sender adaptor that wraps a sender so that, when connected to a receiver, the inner receiver sees an environment with whatever extra queries you’ve spliced in. Combined with std::execution::prop [exec.prop], which is a tiny class template that pairs a query tag with a value, it gives you a one-liner for token injection:

namespace ex = std::execution;

std::stop_source src;
auto pipeline = ex::just(42)
              | ex::then([](int x) { return x * 2; });

auto result = std::this_thread::sync_wait(   // sync_wait is declared in std::this_thread
                ex::write_env(
                  std::move(pipeline),
                  ex::prop(ex::get_stop_token, src.get_token())
                )
              );

if (result) std::println("got {}", std::get<0>(*result));
else        std::println("cancelled");

write_env joins the new env onto the outer one rather than replacing it, so sync_wait’s own queries (allocator, scheduler, etc.) still flow through. sync_wait already returns std::optional<std::tuple<Vs...>>: it is engaged (has_value() == true) on a value completion, disengaged (has_value() == false) on stopped, and it throws on error, so there’s no extra boilerplate needed.

Now let’s actually test cancellation. I initially implemented everything from scratch, including an interruptible_sleep_sender that showcases the cancellation feature (which can be found in this godbolt link). What makes things much simpler is stdexec’s experimental::execution::timed_thread_context together with functions like experimental::execution::schedule_after, which schedules a sender to run after a specified delay. So, without the need to define our own structs, we can do this:

// godbolt link: https://godbolt.org/z/nxG8M6oY9
{
    namespace expex = experimental::execution;
    ex::inplace_stop_source src;

    expex::timed_thread_context timer_ctx;
    auto timer = timer_ctx.get_scheduler();

    auto pipeline = ex::just(42)
                  | ex::let_value([timer](int x)
                    {
                        return expex::schedule_after(timer, 5s)
                             | ex::then([x] { return x * 2; });
                    });

    // cancel from another thread:
    std::jthread canceller([&] {
        std::this_thread::sleep_for(200ms);
        std::println("cancelling from thread {}", std::this_thread::get_id());
        src.request_stop();
    });

    auto result = ex::sync_wait(
                    ex::write_env(
                        std::move(pipeline),
                        ex::prop(ex::get_stop_token, src.get_token())
                    )
                );

    if (result) std::println("got {}", std::get<0>(*result));
    else        std::println("cancelled");
}

For longer-running or dynamically spawned work, prefer spawning through an async_scope (P3149, see the advanced example below). The scope’s spawn/spawn_future already builds a receiver whose environment carries a stop token tied to the scope’s lifetime, so you don’t need write_env at all and calling request_stop() on the scope cancels every task running inside it.

Once a stop token is in the receiver’s environment, regardless of whether it is put there with write_env or from an async_scope, then request_stop() on the corresponding source will be observed all the way through the chain.

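For a sender to honor the request, something in the chain has to query the token. Pure adaptors such as then and bulk pass the receiver’s environment upstream, so the token reaches every sender that actually checks it, and the standard algorithms that manage child operations handle stop for you. when_all is particularly important: if any of its child senders fails or is stopped, it requests stop on its remaining children before completing.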

You can pause a chain by writing a sender whose start registers a callback and returns. The chain is “paused” because nothing is running; it resumes when the callback fires and calls set_value on its receiver. This is how I/O senders, timer senders, and task<T> coroutines work. There is no thread sitting blocked. That’s the whole reason structured async exists.
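A toy version of that pattern in plain C++17, as a sketch only: event_source, wait_for_event_op, and pause_demo are hypothetical names, and a std::function stands in for the receiver. The operation’s start registers a callback and returns; the completion happens later, when the event source fires.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Toy event source (hypothetical): stores callbacks until fire() is called.
struct event_source {
    std::vector<std::function<void(int)>> pending;
    void on_event(std::function<void(int)> cb) { pending.push_back(std::move(cb)); }
    void fire(int payload) {
        auto ready = std::move(pending);
        pending.clear();
        for (auto& cb : ready) cb(payload);
    }
};

// Toy operation: start() registers a callback and returns immediately.
struct wait_for_event_op {
    event_source* source;
    std::function<void(int)> set_value;    // stands in for the receiver
    void start() {
        source->on_event([this](int v) { set_value(v); });
    }
};

inline bool pause_demo() {
    event_source events;
    int received = -1;
    wait_for_event_op op{&events, [&received](int v) { received = v; }};
    op.start();
    const bool paused = (received == -1);  // start returned, nothing completed
    events.fire(42);                       // the "I/O" finishes later
    return paused && received == 42;
}
```

Between op.start() and events.fire(42) no code belonging to the chain is running, which is exactly the "paused" state described above. As in the real model, the operation object has to stay alive and at a stable address until the callback has fired.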

How to create pause/cancel callback wrappers

Most of the time you’ll consume the cancellation machinery rather than implement it. But if you’re integrating an existing API like a callback-based HTTP client, or a hardware event, you’ll write a small custom sender.

Here’s a sketch of a sender that wraps a one-shot callback API and respects stop requests:

namespace ex = std::execution;

template <typename Receiver>
struct cancellable_op {
    using operation_state_concept = ex::operation_state_t;

    Receiver receiver;
    SomeAPIHandle handle;

    using StopToken = ex::stop_token_of_t<ex::env_of_t<Receiver>>;

    struct on_stop {
        cancellable_op* self;
        void operator()() const noexcept { self->handle.cancel(); }
    };

    // Use the token's OWN callback_type - that way this works whether
    // the env carries a real stop token (std::stop_token,
    // stdexec::inplace_stop_token) or never_stop_token (no-op).
    using stop_cb_t = typename StopToken::template callback_type<on_stop>;
    std::optional<stop_cb_t> stop_cb;

    // C++26 member-function customization (no more tag_invoke).
    void start() & noexcept {
        auto token = ex::get_stop_token(ex::get_env(receiver));
        stop_cb.emplace(std::move(token), on_stop{this});

        handle.submit([this](Result r) {
            stop_cb.reset();
            if (r.cancelled()) {
                ex::set_stopped(std::move(receiver));
            } else if (r.hasError()) {
                ex::set_error(std::move(receiver), r.outcome.error());
            } else {
                ex::set_value(std::move(receiver), *r.outcome);
            }
        });
    }
};

struct cancellable_sender {
    using sender_concept = ex::sender_t;
    using completion_signatures = ex::completion_signatures<
        ex::set_value_t(std::optional<int>),
        ex::set_error_t(Error),
        ex::set_stopped_t()>;

    SomeAPIHandle handle;

    // Member-function connect.
    template <typename Receiver>
    auto connect(Receiver r) && {
        return cancellable_op<Receiver>{std::move(r), std::move(handle)};
    }
};

Three things to notice. First, the stop callback is an RAII object, meaning that its constructor registers, its destructor unregisters. You must reset it before completing the receiver, or you risk the callback firing on a torn-down object. Second, start is noexcept. Senders may be started in contexts where exceptions can’t be propagated; throw and you terminate. Third, the receiver is moved into the operation state at connect time and lives there until completion. There is no allocation per chain element; the whole graph is one nested aggregate, sized at compile time.

This is more than you’ll usually want to write. In practice, prefer the upcoming std::execution::task<T> coroutine, where co_await does all of the above for you.

As always, a godbolt example invoking the (1) value, (2) error and (3) cancel paths can be found here.

Advanced example

Let’s pull it all together. Suppose we have an image processing pipeline that:

  1. Reads a JPEG from disk on an I/O scheduler.
  2. Decodes it on the I/O scheduler (decoding is mostly a stream of read calls plus some CPU; let’s keep it on I/O for simplicity).
  3. Hands the pixel buffer to a GPU worker pipeline, which runs a parallel filter GPU kernel over the rows.
  4. Re-encodes the result to JPEG on the I/O scheduler.
  5. Uploads the encoded bytes to a remote server on a network scheduler.
  6. Is fully cancellable, and the upload will be aborted if a stop is requested mid-flight.

Assume the obvious helper senders exist. We’re not interested in how the JPEG decoder works.

// Godbolt link for this example: https://godbolt.org/z/b5PnnzMqn
#include <exec/static_thread_pool.hpp>
#include <stdexec/execution.hpp>
#include <cstdint>      // std::uint8_t
#include <filesystem>
#include <print>
#include <string>       // std::string
#include <thread>
#include <vector>

// Hypothetical GPU scheduler header - each vendor (ROCm, CUDA, ...)
// would ship their own that models ex::scheduler, or we create our own
// wrappers for it.
#include <gpu/scheduler.hpp>

namespace ex = stdexec;
using namespace std::literals;

// ── Domain types ────────────────────────────────────────────────────
struct PixelBuffer {
    std::vector<std::uint8_t> pixels;
    int width{}, height{}, channels{3};
    int row_count() const noexcept { return height; }
};
struct JpegBlob     { std::vector<std::uint8_t> bytes; };
struct UploadResult { int status_code{}; std::string etag; };

// ── Synchronous helpers (the pipeline turns these into senders
//    via just/then) ──────────────────────────────────────────────────
JpegBlob    read_file(std::filesystem::path p);              // may throw
PixelBuffer decode_jpeg(JpegBlob&& blob);                    // may throw
JpegBlob    encode_jpeg(PixelBuffer&& buf);                  // may throw

// ── GPU kernel wrapper ──────────────────────────────────────────────
// launch_filter_kernel synchronously enqueues a GPU kernel operating
// on row `row` of `buf`.  Internally it wraps the hipLaunchKernel /
// cuLaunchKernel call.  The bulk scheduler ensures all enqueued
// kernels complete before the bulk sender signals set_value.
// A GPU-aware scheduler maps bulk indices to hardware threads /
// wavefronts, so each call just enqueues — it does not wait.
void launch_filter_kernel(PixelBuffer& buf, int row) noexcept;

// ── upload_to: returns a cancellable sender that uploads `blob` to
//    `url` and completes with UploadResult.  Same stop-callback
//    pattern as the callback-wrapper section above.
ex::sender auto upload_to(std::string url, JpegBlob& blob);

// ── The pipeline ────────────────────────────────────────────────────
// A function template parameterised on schedulers.  The same pipeline
// can be tested with inline_scheduler, profiled with an instrumented
// one, or run against real thread pools.
template <ex::scheduler IoSched,
          ex::scheduler GpuSched,
          ex::scheduler NetSched>
ex::sender auto image_pipeline(
    IoSched  io,
    GpuSched gpu,
    NetSched net,
    std::filesystem::path path,
    std::string           upload_url)
{
    return
        // 1-2. Read and decode on the I/O context.
        ex::starts_on(io,
            ex::just(std::move(path))
          | ex::then(read_file)
          | ex::then(decode_jpeg))

        // 3. GPU filter: hop onto the GPU scheduler, then enqueue
        //    one kernel per row via bulk.  The GPU scheduler maps
        //    bulk indices to hardware threads / wavefronts.
      | ex::continues_on(gpu)
      | ex::let_value([gpu](auto&& buf) {
            const int rows = buf.row_count();
            return ex::just(std::forward<decltype(buf)>(buf))
                 | ex::continues_on(gpu)
                 | ex::bulk(ex::par, rows, // Note the std::execution policy from introduction ;)
                       [](int row, PixelBuffer& b) noexcept {
                           // Each invocation enqueues a GPU kernel.
                           // A GPU-aware scheduler maps bulk indices
                           // to hardware threads / wavefronts.
                           launch_filter_kernel(b, row);
                       });
        })

        // 4. Re-encode on the I/O context.
      | ex::continues_on(io)
      | ex::then(encode_jpeg)

        // 5-6. Upload on the network context.  upload_to returns a
        //      cancellable sender, so a stop request mid-upload will
        //      abort the HTTP call and complete with set_stopped.
      | ex::continues_on(net)
      | ex::let_value([url = std::move(upload_url)](auto&& blob) {
            return upload_to(url, std::forward<decltype(blob)>(blob));
        });
}

int main()
{
    // Three execution contexts, each representing a different resource.
    exec::static_thread_pool io_ctx{2};
    gpu::context             gpu_ctx;    // wraps a GPU device
    exec::static_thread_pool net_ctx{2};

    ex::inplace_stop_source stop;

    // Simulate the user pressing Ctrl-C after 100 ms.
    std::jthread watchdog([&] {
        std::this_thread::sleep_for(100ms);
        stop.request_stop();
    });

    auto result = ex::sync_wait(
        ex::write_env(
            image_pipeline(
                io_ctx.get_scheduler(),
                gpu_ctx.get_scheduler(),
                net_ctx.get_scheduler(),
                "/photos/cat.jpg",
                "https://cdn.example.com/images/cat.jpg"),
            ex::prop(ex::get_stop_token, stop.get_token())));

    if (result) {
        auto& [upload] = *result;
        std::println("done: HTTP {} etag={}", upload.status_code, upload.etag);
    } else {
        std::println("pipeline cancelled");
    }
}

A toy version of the example above, which varies UPLOAD_DURATION so that the pipeline either succeeds or is cancelled, can be found here!

Analysis

Read the pipeline definition from top to bottom and notice what isn’t there. There are no thread handles, no futures, no mutexes, no condition variables. There is no question about who owns which data: the value channel moves each result from one step to the next. The schedulers are passed in, so the same pipeline could be tested by passing in an inline scheduler, or profiled by passing in an instrumented one. The whole graph is a value, built lazily, started once, joinable, cancellable, and statically typed end-to-end.

A few things worth calling out:

  • let_value for dynamic bulk shapes. The row count comes from the decoded PixelBuffer, which is only known at runtime. bulk(par, rows, f) needs rows at the point where the adaptor is created, so we use let_value to unwrap the buffer, read its dimensions, and re-emit it through just + bulk. This “unwrap, inspect, rewrap” is the standard idiom whenever a downstream adaptor’s parameters depend on the upstream value.
  • GPU via continues_on(gpu) + bulk. A GPU scheduler models the same scheduler concept as a thread pool. continues_on(gpu) hops execution onto the device context, and the subsequent bulk(par, rows, f) maps each index to a hardware thread or wavefront. The callable passed to bulk calls launch_filter_kernel, which enqueues the actual kernel. A vendor-specific GPU scheduler (ROCm’s HIP, CUDA, oneAPI) schedules work by enqueueing onto the device’s command queue and completes when the queue signals. From the pipeline’s perspective it’s just another scheduler with no special API surface. Note that bulk is defined to forward its input values downstream after all n invocations of f complete, which is how the PixelBuffer reaches step 4.
  • continues_on at every boundary. Without explicit placement, bulk may silently run serially on whatever thread the previous step completed on (as we noted in the scheduler section). The continues_on(gpu) before bulk ensures the filter launches on the GPU, and the continues_on(io) after it hops back to the CPU for the encode.
  • Cancellation is structural. The single write_env + prop(get_stop_token, ...) at the outermost layer is enough. Every sender in the chain (then, bulk, and our custom upload_to) sees the same token through the receiver’s environment. When the watchdog calls request_stop(), the pipeline completes with set_stopped and sync_wait returns std::nullopt. Even the GPU kernel can be aborted mid-flight if the device supports async cancellation (on AMD hardware this maps to hipStreamDestroy / signal-based abort).

Conclusion

We’ve abstracted concurrency and asynchronous programming without ever caring about threads, mutexes, or condition variables. We’ve created portable (and cancellable) pipelines that define work in terms of what runs where and minimized the risk of data races. That is what std::execution is for.

The API surface is large and the error messages are, today, terrifying. The reference implementation has rough edges that will catch you out, although development is active and the issue list is relatively short. But the underlying model is the right one. After a decade of executor proposals, C++ finally has a single vocabulary for “do this work somewhere, then this work somewhere else, and let me cancel the whole thing.” Whether or not C++26 is the version where you adopt it, it is the version where you should start learning it.

On a closing note, everyone is talking about the C++26 reflection features [P2996]. Reflection unlocks a whole new realm of the language that will take us years to master and understand (maybe even more than 10 years), but this standard also brings a palette of other new features (safety, erroneous behavior, contracts with pre- and postconditions and, of course, std::execution) that users will be eager to adopt and use. And to quote the closing statement of Herb Sutter’s post “Living in the future: Using C++26 at work”: fun times for C++!

The 13 application areas where OpenCL and CUDA can be used

visitekaartje-achter-2013-V
Did you find your specialism in the list? The formula is the easiest introduction to GPGPU I could think of, including the need of auto-tuning.

Which algorithms map is best to which accelerator? In other words: What kind of algorithms are faster when using accelerators and OpenCL/CUDA?

Professor Wu Feng and his group from VirginiaTech took a close look at which types of algorithms were a good fit for vector-processors. This resulted in a document: “The 13 (computational) dwarves of OpenCL” (2011). It became an important document here in StreamHPC, as it gave a good starting point for investigating new problem spaces.

The document is inspired by Phil Colella, who identified seven numerical methods that are important for science and engineering. He named “dwarves” these algorithmic methods. With 6 more application areas in which GPUs and other vector-accelerated processors did well, the list was completed.

As a funny side-note, in Brothers Grimm’s “Snow White” there were 7 dwarves and in Tolkien’s “The Hobbit” there were 13.

Continue reading “The 13 application areas where OpenCL and CUDA can be used”

Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support

Screenshot from Intel’s “God Rays” demo

This article is still work-in-progress

Intel has just released its OpenCL bit CPU-drivers, version 2013 bèta. It has support for OpenCL 1.1 (not 1.2 as for the CPU) on Intel HD Graphics 4000/2500 of the 3rd generation Core processors (Windows only). The release notes mention support for Windows 7 and 8, but the download-site only mentions windows 8. Support under Linux is limited to 64 bits.

The release notes mention:

  • General performance improvements for many OpenCL* kernels running on CPU.
  • Preview Tool: Kernel Builder (Windows)
  • Preview Feature: support of  kernel source code hotspots analysis with the Intel VTuneT Amplifier XE 2011 update 3 or higher.
  • The GNU Project Debugger (GDB) debugging support on Linux operating systems.
  • New OpenCL 1.2 extensions supported by the CPU device:
    • cl_khr_int64_base_atomics and cl_khr_int64_extended_atomics
    • cl_khr_fp16
    • cl_khr_gl_sharing
    • cl_khr_gl_event
    • cl_khr_d3d10_sharing
    • cl_khr_dx9_media_sharing
    • cl_khr_d3d11_sharing.
  • OpenCL 1.1 extensions that were changed in OpenCL 1.2:
    • Device Fission supports both OpenCL 1.1 EXT API’s and also OpenCL* 1.2 fission core features
    • Media Sharing support intel 1.1 media sharing extension and also the 1.2 KHR media sharing extension
    • Printf extension is aligned with OpenCL 1.2 core feature.

Check the release notes for full information.

The drivers can be found on http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk-2013/. Installation is simple. For Windows there is a installer. If you have Linux, make sure you remove any previous version of Intel’s openCL drivers. If you have a Debian-based Linux, use the command ‘alien’ to convert the rpm to deb, and make sure ‘libnuma1‘ is installed. There are requirements for libc 2.11 or 2.12 – more information on that later as Ubuntu 12.04 has libc6 2.15.

Continue reading “Intel OpenCL CPU-drivers 2013 beta with OpenCL 1.2 support”

How expensive is an operation on a CPU?

Programmers know the value of everything and the costs of nothing. I saw this quote a while back and loved it immediately. The quote by Alan Perlis is originally about Perl LISP-programmers, but only highly trained HPC-programmers seem to have obtained this basic knowledge well. In an interview with Andrew Richards of Codeplay I heard it from another perspective: software languages were not developed in a time that cache was 100 times faster than memory. He claimed that it should be exposed to the programmer what is expensive and what isn’t. I agreed again and hence this post.

I think it is very clear that programming languages (and/or IDEs) need to be redesigned to overcome the hardware-changes of the past 5 years. I talked about that in the article “Separation of compute, control and transfer” and “Lots of loops“. But it does not seem to be enough.

So what are the costs of each operation (on CPUs)?

This article is just to help you on your way, and most of all: to make you aware. Note it is incomplete and probably not valid for all kinds of CPUs.

Continue reading “How expensive is an operation on a CPU?”

Basic Concepts: online kernel compiling

Typos are a programmers worst nightmare, as they are bad for concentration. The code in your head is not the same as the code on the screen and therefore doesn’t have much to do with the actual problem solving. Code highlighting in the IDE helps, but better is to use the actual OpenCL compiler without running your whole software: an Online OpenCL Compiler. In short is just an OpenCL-program with a variable kernel as input, and thus uses the compilers of Intel, AMD, NVidia or whatever you have installed to try to compile the source. I have found two solutions, which both have to be built from source – so a C-compiler is needed.

  • CLCC. It needs the boost-libraries, cmake and make to build. Works on Windows, OSX and Linux (needs possibly some fixes, see below).
  • OnlineCLC. Needs waf to build. Seems to be Linux-only.


Kernels and the GPL. Are we safe and linking?

Disclaimer: I am not a lawyer and below is my humble opinion only. The post is for insights only, not for legal matters.

The GPL was always a protection against somebody or some company running away with your code and making money with it, or at least a way to force improvements back into the community. For unprepared companies this caused quite some stress when they were forced to give their software away. Now we have host-kernel languages such as OpenCL, CUDA, DirectCompute and RenderScript, which don’t really link a kernel, but load it and launch it. As the GPL is quite complicated when it comes to mixing with commercial code, I want to give a warning that the GPL might not be prepared for this.

If your software is dual-licensed, you cannot assume the GPL is not chosen when eventually used in commercial software. Read below why not.

I hope we can have a discussion here, so we get to the bottom of this.


Basic Concepts: OpenCL Convenience Methods for Vector Elements and Type Conversions

In the series Basic Concepts I try to give an alternative description to what is said everywhere else. This time my eye fell on the alternative convenience methods in two cases, which were introduced to be nice to devs with e.g. C/C++ and/or graphics backgrounds. But too often I see them explained starting from the convenience functions, with the “preferred” functions given as a sort of bonus that works for the cases where the old functions don’t get it done. Below it is the other way around, and I hope that gives a better understanding. I assume you have already read another explanation, so this is a second view rather than a first introduction.


Installing both NVidia GTX and AMD Radeon on Linux for OpenCL

August 2012: article has been completely rewritten and updated. For driver-specific issues, please refer to this article.

Want to have both your GTX and Radeon working as OpenCL-devices under Linux? The bad news is that attempts to get the Radeon as a compute device and the GTX as primary all failed. The good news is that the other way around works pretty easily (with some luck). You need to install both drivers and watch out that libglx.so isn’t overwritten by NVidia’s driver, as we won’t use that GPU for graphics – this is also the reason why it is impossible to use the second GPU for OpenGL.


AMD OpenCL coding competition

The AMD OpenCL coding competition seems to be Windows 7 64bit only. So if you are on another version of Windows, OSX or (like me) on Linux, you are left behind. Of course StreamHPC supports software that just works anywhere (seriously, how hard is that nowadays?), so here are the instructions how to enter the competition when you work with Eclipse CDT. The reason why it only works with 64-bit Windows I don’t really get (but I understood it was a hint).

I focused on Linux, so it might not work with Windows XP or OSX right away. With a little hacking, I’m sure you can change the instructions to work with e.g. Xcode or any other IDE which can import C++-projects with makefiles. Let me know if it works for you and what you changed.


Qt Creator OpenCL Syntax Highlighting

With highlighting for Gedit, I was happy to give you the convenience of a nice editor to work on OpenCL-files. But it seems that one of the most popular IDEs for C++-programming is Qt Creator. So you receive another free syntax highlighter. You need at least Qt Creator 2.1.0.

The people of Qt have written everything you need to know about their Syntax highlighting, which was enough help to create this file. You see that they use the system of Kate, so logically this file works with this editor too.

In this article there is all you need to know to use Qt Creator with OpenCL.

Installing

First download the file to your computer.

Under Windows and OSX you need to copy this file to the directory share\qtcreator\generic-highlighter in the Qt installation dir (e.g. c:\Qt\qtcreator-2.2.1\share\qtcreator\generic-highlighter). Under Linux copy this file to ~/.kde/share/apps/katepart/syntax or to /usr/share/kde4/apps/katepart/syntax (all users). That’s all, have fun!

Install OpenCL on Debian, Ubuntu and Mint orderly

Libraries – can’t have enough

If you read different manuals on how to compile OpenCL software on Linux, you can get dizzy from all the LD-parameters. Also, when installing the SDKs from AMD, Intel and NVIDIA, you get different locations for libraries, header-files, etc. Now that GPGPU is old-fashioned and we go for heterogeneous programming, the chances get higher that you will have more SDKs on your machine. Even if you want to keep your setup the way it is, reading this article gives you insight into the design behind it all. Note that Intel’s drivers don’t give OpenCL support for their GPUs, but for CPUs only.

As my mother said when I was young: “actually cleaning up is very simple”. I’m busy creating a PPA for this, but that will take some more time.

First the idea. For developers OpenCL consists of 5 parts:

  • GPUs-only: drivers with OpenCL-support
  • The OpenCL header-files
  • Vendor specific libraries (needed when using -lOpenCL)
  • libOpenCL.so -> a special driver
  • An installable client driver

Currently GPU-drivers are always OpenCL-capable, so you only need to secure 4 steps. These are discussed below.

Please note that in certain 64-bit distributions there is not lib64, but only ‘lib’ and ‘lib32’. If that is the case for you, you can use the commands that are mentioned with 32-bit.
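To check which installable client drivers are registered on a machine, a small sketch like the following can help. It assumes the conventional Linux location /etc/OpenCL/vendors, where each .icd file holds the name or path of a vendor library on its first line; the function name `listIcdFiles` is made up for this illustration and is not part of any SDK.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

namespace fs = std::filesystem;

// Return (icd file name, first line of that file) for every *.icd file in
// vendorsDir. The default directory is an assumption about a typical Linux
// setup; pass another path if your distribution differs.
std::vector<std::pair<std::string, std::string>>
listIcdFiles(const fs::path& vendorsDir = "/etc/OpenCL/vendors") {
    assert(!vendorsDir.empty());
    std::vector<std::pair<std::string, std::string>> found;

    std::error_code ec;
    if (!fs::is_directory(vendorsDir, ec))
        return found;  // no ICD directory: nothing registered

    for (const fs::directory_entry& entry : fs::directory_iterator(vendorsDir, ec)) {
        if (entry.path().extension() != ".icd")
            continue;
        std::ifstream in(entry.path());
        std::string library;
        std::getline(in, library);  // vendor library name or path
        found.emplace_back(entry.path().filename().string(), library);
    }
    std::sort(found.begin(), found.end());
    return found;
}
```

An empty result on a machine with SDKs installed is a hint that the registration step, not the library itself, is what is missing.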


Intel’s OpenCL SDK examples for GCC

Update August 2012: There is a new post for the latest Linux examples.

Note: these patches won’t work anymore! You can learn from the patches how to fix the latest SDK-code for GCC and Linux/OSX.

Code-examples are not bundled with the Linux OpenCL SDK 1.1 beta. Their focus is primarily Windows, so VisualStudio seems to be a logical target. I just prefer GCC/LLVM which you can get to work with all OSes. After some time trying to find the alternatives for MS-specific calls, I think I managed. Since ShallowWater uses DirectX and is quite extensive, I did not create a patch for that one – sorry for that.

I had a lot of trouble getting the BMP-export to work, because serialisation of the struct added an extra short. Feedback (such as a correct BMP-export of a file) is very welcome, since I am not sure the colours are correct. For the rest: most warnings are removed and it just works – tested with g++ (Ubuntu/Linaro 4.5.2-8ubuntu4) 4.5.2 on 64 bit (llvm-g++-4.2 seems to work too, but not fully tested).
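The “extra short” is what you typically get from struct padding. The sketch below is an illustration only: it assumes the struct in question resembles the standard 14-byte BMP file header, and the names are mine, not those used in the SDK samples. Writing the fields one by one avoids depending on how the compiler lays out the struct.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive in-memory layout of the BMP file header. On common ABIs the compiler
// inserts 2 padding bytes after `type` to align `size`, so sizeof is
// typically 16 instead of the 14 bytes the file format expects.
struct BmpFileHeaderNaive {
    std::uint16_t type;       // 'BM'
    std::uint32_t size;       // total file size in bytes
    std::uint16_t reserved1;
    std::uint16_t reserved2;
    std::uint32_t offBits;    // offset of the pixel data
};

// Append an unsigned integer to `out` in little-endian byte order.
template <typename T>
void appendLE(std::vector<unsigned char>& out, T value) {
    for (std::size_t i = 0; i < sizeof(T); ++i)
        out.push_back(static_cast<unsigned char>((value >> (8 * i)) & 0xFFu));
}

// Serialise the header field by field, so no padding can sneak in.
std::vector<unsigned char> serializeBmpFileHeader(std::uint32_t fileSize,
                                                  std::uint32_t pixelOffset) {
    std::vector<unsigned char> out;
    appendLE<std::uint16_t>(out, 0x4D42);  // bytes 'B', 'M'
    appendLE<std::uint32_t>(out, fileSize);
    appendLE<std::uint16_t>(out, 0);
    appendLE<std::uint16_t>(out, 0);
    appendLE<std::uint32_t>(out, pixelOffset);
    assert(out.size() == 14);  // exact on-disk size of the file header
    return out;
}
```

Compiler-specific packing pragmas would be the other way out, but the field-by-field version behaves the same on GCC, LLVM and Visual Studio.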

THE PATCHES ARE PROVIDED AS IS – NO WARRANTIES!


ImageJ and OpenCL

For a customer I’m writing a plugin for ImageJ, a toolkit for image-processing and analysis in Java. Rick Lentz has written an OpenCL-plugin using JOCL. In the tutorial step 1 is installing the great OS Ubuntu, but that would not be the fastest way to get it going, and since JOCL is multi-platform this step should be skippable. Furthermore I rewrote most of the code, so it is a little more convenient to use.

In this blog-post I’ll explain how to get it up and running within 10 minutes with the provided information.


Engineering GPGPU into existing software

At the Thalesian talk about OpenCL I gave in London it was quite hard to find a way to talk about OpenCL to a very diverse audience (without falling back to listing code-samples for 50 minutes); some knew just about everything about HPC and others had only heard of CUDA and/or OpenCL. One of the subjects I chose to talk about was how to integrate OpenCL (or GPGPU in general) into existing software. The reason is that we all have built nice, cute little programs which were super-fast, but it’s another story when they must be integrated into some enterprise-level software.

Readiness

The most important step is making your software ready. Software engineering can be very hectic; managing this in a nice manner (e.g. PRINCE2) just doesn’t fit in a deadline-mined schedule. We all know it costs less time and money when looking at the total picture, but time is just against us.

Let’s exaggerate. New ideas, new updates of algorithms, new tactics and methods arrive at the wrong moment, Murphy-wise. It has to be done yesterday, so testing is only allowed when the code will be in the production-code too. Programmers just have to understand the cost of delay, but luckily someone is coming to the rescue and says: “It is my responsibility”. And after a year of stress your software is the best in the company and gets labelled as “platform”, meaning that your software is chosen to include all the small ideas and scripts your colleagues have come up with “which are almost the same as your software does, only little different”. This will turn the platform into something unmanageable. That is a different kind of software-acceptance!


Qt Hello World

The earlier blog-post was about how to use Qt Creator with OpenCL. The examples all revolve around images, but there is no simple Hello World anywhere. So here it is: AMD’s infamous OpenCL Hello World in Qt. Thanks to linuxjunk for gluing the parts together.



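#include <cstdio>
// Header from the QtOpenCL plugin (found via src/opencl on the INCLUDEPATH).
#include "qclcontext.h"

int main(int argc, char *argv[]) {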

    // Define the kernel. Take a good look what it does.
    QByteArray prog(
    "#pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable\n"
    "__constant char hw[] = \"Hello World from Qt!\"; \n"
    "__kernel void hello(__global char * out) {\n"
    "  size_t tid = get_global_id(0); \n"
    "  out[tid] = hw[tid]; \n"
    "}\n"
    );

     // Get a context on the first available GPU.
     QCLContext context;
     if (!context.create(QCLDevice::GPU))
         qFatal("Could not create OpenCL context");

     // Allocate 100 bytes of memory on the Host.
     size_t mem_size = 100;
     char* outH = new char[mem_size];
     // Allocate buffer on the Device.
     QCLBuffer outCL = context.createBufferHost(outH, sizeof(char) * mem_size,
                                                QCLMemoryObject::WriteOnly);

     // Compile program against device
      QCLProgram program = context.buildProgramFromSourceCode(prog);

     // Create a kernel object, tell it we are using the kernel called "hello".
     QCLKernel kernel = program.createKernel("hello");

     // Set the necessary global memory. In this case it is the buffer-size.
     kernel.setGlobalWorkSize(outCL.size(), 1);

     // Turn on profiling.
     QCLCommandQueue queue = context.commandQueue();
     queue.setProfilingEnabled(true);

     // Queue the kernel up to run.
     // Give it an argument which is the memory we allocated above.
     QCLEvent event = kernel(outCL);

     // Use the event object above to block until processing has completed.
     event.waitForFinished();

     // Timing only works with profiling on. runTime is unsigned.
     printf(" time '%u'\n", event.runTime());

     // Read the results out of the shared memory area.
     outCL.read(outH, mem_size);

     // Write to screen.
     printf(" result = '%s'", outH);
}

Have fun!

Using Qt Creator for OpenCL

More and more ways are becoming available to bring easy OpenCL to you. Most of the convenience libraries are wrappers for other languages, so it seems that C and C++ programmers have the hardest time. For a while now my favourite way to go has been Qt: it is multi-platform, has a good IDE, is very extensive, has good multi-core and OpenGL-support and… has an extension for OpenCL: http://labs.trolltech.com/blogs/2010/04/07/using-opencl-with-qt (also at http://blog.qt.digia.com/blog/2010/04/07/using-opencl-with-qt/).

Other multi-platform choices are Anjuta, CodeLite, Netbeans and Eclipse. I will discuss them later, but wanted to give Qt an advantage because it also simplifies your OpenCL-development. While it is great for learning OpenCL-concepts, please know that the commercial version of Qt Creator costs at least €2995,- a year. I must also warn that the plugin is still in beta.

streamhpc.com is not affiliated with Qt.

Getting it all

Qt Creator is available in most Linux-repositories: install packages ‘qtcreator’ and ‘qt4-qmake’. For Windows, Mac and the other Linux-distributions there are installers available: http://qt.nokia.com/downloads. People who are not familiar with Qt really should take a look around on http://qt.nokia.com/.

You can get the source for the plugin QtOpenCL, by using GIT:

git clone http://git.gitorious.org/qt-labs/opencl.git QtOpenCL

See http://qt.gitorious.org/qt-labs/opencl for more information about the status of the project.

You can also download a snapshot of the plugin here: https://dl.dropbox.com/u/1118267/QtOpenCL_20110117.zip (version 17 January 2011)

Building the plugin

For Linux and Mac you need to have the basic build tools (on Debian-based systems, the package ‘build-essential’). For Windows it might be a lot harder, since you need make, gcc and a lot of other build-tools which are not easily packaged for the Windows-OS. If you’ve made a win32-binary and/or a Windows-specific how-to, let me know.

You might have seen that people have problems building the plugin. The trick is to use the options -qmake and -I (capital i) with the configure-script:

./configure -qmake <location of qmake 4.6 or higher> -I<location of directory CL with OpenCL-headers>

make

Notice the spaces. The program qmake is provided by Qt (package ‘qt4-qmake’), the OpenCL-headers by the SDK of ATI or NVidia (you’ll need the SDK anyway), or by Khronos. For example, on my laptop (NVIDIA, Ubuntu 32bit, with Qt 4.7):

./configure -qmake /usr/bin/qmake-qt4 -I/opt/NVIDIA_GPU_Computing_SDK_3.2/OpenCL/common/inc/

make

This should work. On Mac the directory is not CL, but OpenCL – I haven’t tested whether Qt took that into account.

After building, test it by setting the environment variable “LD_LIBRARY_PATH” to the lib-directory in the plugin, and run the provided example-app ‘clinfo’. For example, on Linux:

export LD_LIBRARY_PATH=`pwd`/lib:$LD_LIBRARY_PATH

cd util/clinfo/

./clinfo

This should give you information about your OpenCL-setup. If you need further help, please go to the Qt forums.

Configuring Qt Creator

Now it’s time to make a new project with support for OpenCL. This has to be done in two steps.

First make a project and edit the .pro-file by adding the following:

LIBS     += -L<location of opencl-plugin>/lib -L<location of OpenCL-SDK libraries> -lOpenCL -lQtOpenCL

INCLUDEPATH += <location of opencl-plugin>/lib/ \
               <location of OpenCL-SDK include-files> \
               <location of opencl-plugin>/src/opencl/

By example:

LIBS     += -L/opt/qt-opencl/lib -L/usr/local/cuda/lib -lOpenCL -lQtOpenCL

INCLUDEPATH += /opt/qt-opencl/lib/ \
               /usr/local/cuda/include/ \
               /opt/qt-opencl/src/opencl/

The following screenshot shows what it could look like:

Second we edit (or add) the LD_LIBRARY_PATH in the project-settings (click on ‘Projects’ as seen in screenshot):

/usr/lib/qtcreator:<location of opencl-plugin>/lib:<location of OpenCL-SDK libraries>:

By example:

/usr/lib/qtcreator:/opt/qt-opencl/lib:/usr/local/cuda/lib:

As you see, we now also need to have the Qt-creator-libraries and SDK-libraries included.

The following screenshot shows the edit-field for the project-environment:

Testing your setup

Just add something from the clinfo-source to your project:

printf("OpenCL Platforms:\n");
QList<QCLPlatform> platforms = QCLPlatform::platforms();
foreach (QCLPlatform platform, platforms) {
    printf("    Platform ID       : %ld\n", long(platform.platformId()));
    printf("    Profile           : %s\n", platform.profile().toLatin1().constData());
    printf("    Version           : %s\n", platform.version().toLatin1().constData());
    printf("    Name              : %s\n", platform.name().toLatin1().constData());
    printf("    Vendor            : %s\n", platform.vendor().toLatin1().constData());
    printf("    Extension Suffix  : %s\n", platform.extensionSuffix().toLatin1().constData());
    printf("    Extensions        :\n");
    // List the extensions while `platform` is still in scope.
    QStringList extns = platform.extensions();
    foreach (QString ext, extns)
        printf("        %s\n", ext.toLatin1().constData());
    printf("\n");
}

If it gives errors during programming (underlined includes, etc), focus on INCLUDEPATH in the project-file. If it complains when building the application, focus on LIBS. If it complains when running the successfully built application, focus on LD_LIBRARY_PATH.
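For the last case, it can help to see which directories in LD_LIBRARY_PATH actually contain the library the loader is complaining about. Below is a minimal sketch of such a check; the helper names are made up for this example, and the file names you pass in (for instance libQtOpenCL.so or libOpenCL.so) are assumptions about how the libraries are called on your system.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

// Split a colon-separated list such as LD_LIBRARY_PATH, skipping empty entries.
std::vector<std::string> splitPathList(const std::string& list) {
    std::vector<std::string> dirs;
    std::size_t start = 0;
    while (start <= list.size()) {
        std::size_t end = list.find(':', start);
        if (end == std::string::npos)
            end = list.size();
        if (end > start)
            dirs.push_back(list.substr(start, end - start));
        start = end + 1;
    }
    return dirs;
}

// Return the directories from `pathList` that contain a file named `libraryFile`.
std::vector<std::string> dirsContaining(const std::string& pathList,
                                        const std::string& libraryFile) {
    std::vector<std::string> hits;
    for (const std::string& dir : splitPathList(pathList)) {
        std::error_code ec;  // unreadable directories simply count as "not found"
        if (std::filesystem::exists(std::filesystem::path(dir) / libraryFile, ec))
            hits.push_back(dir);
    }
    return hits;
}

// Same check against the LD_LIBRARY_PATH of the current process.
std::vector<std::string> dirsInLdLibraryPathContaining(const std::string& libraryFile) {
    assert(!libraryFile.empty());
    const char* env = std::getenv("LD_LIBRARY_PATH");
    return dirsContaining(env ? env : "", libraryFile);
}
```

If `dirsInLdLibraryPathContaining("libQtOpenCL.so")` comes back empty when called from inside your application, the project-environment setting described above is the first thing to look at.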

Ok, it is maybe not that easy to get it running, but I promise it gets easier after this. Check out our Hello World, the provided examples and http://doc.qt.nokia.com/opencl-snapshot/ to start building.