Kernel development + C++26 reflection = 🧡

Posted by Chris Tsiaousis on 8 July 2026

Anyone who has written a non-trivial GPU kernel has hit the same wall. You define a struct in C++ and pass it to a kernel. Six months later, somebody adds a member, forgets to update the matching layout on the device side, and a 200-line shader reads garbage. The compiler doesn’t help. The runtime doesn’t help. You find out from a customer.

The traditional answer is more macros. BOOST_PP_SEQ_FOR_EACH, X-macros, FOREACH_FIELD. They work, but they make the source unreadable and they cost nothing to misuse. The other answer is a code generator, which means a build-time dependency, a generated header nobody reads, and a debugging experience that involves three files at once.

C++26 ships P2996, static reflection, which was voted into the working draft at the Sofia meeting in June 2025. For most of the C++ world this is an enabler for serialization, ORMs, and CLI parsers. For people writing GPU code, it is also one of the most practical things to happen to the toolchain in years. The reason is simple. GPU programming is about layouts, bindings, kernel variants, and metadata. All four of those are exactly what reflection is good at.

This post is a tour through six concrete things you can do once your compiler speaks reflection:

validating that host and device struct layouts agree at compile time,
generating SoA types from AoS definitions,
packing push constants without hand-computed offsets,
building kernel-variant dispatch tables from enum definitions,
inspecting memory layout and detecting padding, and
attaching binding/memory-space metadata directly to declarations via annotations.

The examples use HIP syntax but the patterns work for CUDA, SYCL, Vulkan, or anything else where you push a C++ struct across an API boundary into device-side code.

A two-minute reflection primer

If you’ve already read a reflection tutorial, skip this. If not, the entire API fits on a postcard.

You reflect a type, value, or template with the reflection operator ^^, which produces a std::meta::info. You can splice an info back into ordinary C++ with the splicer [: r :]. Also, you iterate over reflections at compile time with template for. Everything in std::meta is consteval.

struct Particle { float x, y, z; float mass; };

template <typename T>
consteval auto field_names()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto members = std::define_static_array(
                                        std::meta::nonstatic_data_members_of(^^T, ctx)
                                    );
    std::array<std::string_view, members.size()> out{};
    std::size_t i = 0;
    template for (constexpr auto m : members)
    {
        out[i++] = std::meta::identifier_of(m);
    }
    return out;
}

static_assert(field_names<Particle>()[3] == "mass");

nonstatic_data_members_of(^^T, ctx) returns a vector<info> of the data members in declaration order, filtered by what the access context ctx is allowed to see. template for is the new compile-time loop that instantiates its body once per element, so m is a constexpr reflection inside each iteration. identifier_of(m) gives you the spelling as a string_view. The whole thing is evaluated in a consteval context, so the result is baked into the binary.

That’s enough vocabulary for everything that follows. The metafunctions worth remembering are members_of, nonstatic_data_members_of, enumerators_of, parameters_of, bases_of, identifier_of, type_of, offset_of, size_of, alignment_of, and annotations_of. The synthesis side has define_aggregate for building new types and data_member_spec for describing the members you want to inject.

The crucial property is that splicing turns a reflection back into a real language entity. obj.[:m:] accesses a member. typename [: t :] names a type. That round-trip between “the meta universe” and “the regular universe”, to borrow Lemire’s phrasing in his equality-check post, is the whole engine.

Automatic struct-to-device layout validation

The classic GPU bug. You have a host struct, you memcpy it into a constant buffer, and your kernel reads it back through a struct with the same name but slightly different layout. Maybe device code aligned a double3 differently. Maybe somebody removed a float pad that was load-bearing. The kernel reads the wrong offsets and produces plausible-looking garbage.

The pattern is straightforward: write a consteval function that compares two types field by field and fails the build if they don’t agree.

template <typename Host, typename Device>
consteval bool layouts_match()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto h = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^Host, ctx)
                              );
    static constexpr auto d = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^Device, ctx)
                              );

    if (h.size() != d.size()) return false;
    if (sizeof(Host) != sizeof(Device)) return false;
    if (alignof(Host) != alignof(Device)) return false;

    for (std::size_t i = 0; i < h.size(); ++i)
    {
        if (std::meta::identifier_of(h[i]) !=
            std::meta::identifier_of(d[i])) return false;
        if (std::meta::offset_of(h[i]) !=
            std::meta::offset_of(d[i])) return false;
        if (std::meta::size_of(h[i]) !=
            std::meta::size_of(d[i])) return false;
    }
    return true;
}

// In the header that defines both:
struct LaunchParams        { int n; float dt; float3 origin; uint32_t flags; };
struct LaunchParams_device { int n; float dt; float3 origin; uint32_t flags; };

static_assert(layouts_match<LaunchParams, LaunchParams_device>(),
              "Host and device layouts diverged. Check field order, "
              "padding, and alignment.");

A few things to notice. The check is consteval, so it runs at compile time with zero runtime cost. The static_assert triggers in the translation unit that includes both definitions, which means the build breaks at the point of mismatch, not at the kernel launch site three hundred files away. The error message is yours, written in English, and it points at the kind of problem rather than at a template instantiation stack.

In practice the device struct is usually the same type as the host struct, in which case this check is redundant. The interesting case is when the kernel is written in a vendor IR (SPIR-V, AMDGCN assembly) or in a separate compilation unit with different target-specific types. There the two structs really are different, and the static_assert is the cheapest insurance you’ll ever buy.

When the host and device structs are nominally the same type, which is the more common case, a related check is still valuable as a padding guard. Because it queries the layout the compiler actually produced, it holds across compiler versions and target flags:

template <typename T>
consteval bool is_packed()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto m = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^T, ctx)
                              );
    std::size_t expected = 0;
    template for (constexpr auto f : m)
    {
        if (std::meta::offset_of(f).bytes != expected) return false;
        expected += std::meta::size_of(f);
    }
    return expected == sizeof(T);
}

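struct PushConstants { uint32_t a; uint64_t b; uint32_t c; };
// Fires as written: with 8-byte alignment for uint64_t, four bytes of padding
// follow `a` and four more follow `c`.
static_assert(is_packed<PushConstants>(),
              "PushConstants has internal padding. Reorder members or "
              "add explicit pad fields.");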

This catches the case where somebody adds a uint64_t after a uint32_t and the compiler silently inserts four bytes. On a discrete GPU those four bytes are a real DMA transfer.
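Without a reflection-capable compiler, the arithmetic that is_packed automates can still be spelled out by hand. The following is a minimal sketch in plain C++17, not part of the reflection API: member_bytes and padding_bytes are hypothetical helpers, and the member sizes have to be listed manually, which is exactly the maintenance burden reflection removes.

```cpp
#include <cstddef>
#include <cstdint>

// Same member order as the example above: a 4-byte field followed by an
// 8-byte field usually forces the compiler to insert alignment padding.
struct PushConstants { std::uint32_t a; std::uint64_t b; std::uint32_t c; };

// Reordered so the widest member comes first; no gaps are needed.
struct PushConstantsReordered { std::uint64_t b; std::uint32_t a; std::uint32_t c; };

// Hypothetical helper: sum of the member sizes, listed by hand.
constexpr std::size_t member_bytes =
    sizeof(std::uint32_t) + sizeof(std::uint64_t) + sizeof(std::uint32_t);

// Hypothetical helper: bytes in T that belong to no member.
template <typename T>
constexpr std::size_t padding_bytes() { return sizeof(T) - member_bytes; }

static_assert(padding_bytes<PushConstantsReordered>() == 0,
              "the reordered struct is expected to be tightly packed");
```

The amount of padding in the original ordering depends on the target ABI, so the sketch asserts only on the reordered struct.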

Generating SoA structs from AoS definitions

Almost every GPU kernel that takes a list of “things” wants the data in structure of arrays form rather than array of structures. You declared std::vector<Particle> because that’s what made sense for the rest of your code, but the kernel wants four pointers: float* x, float* y, float* z, float* mass. The traditional workaround is a hand-written shadow struct that gets out of sync the moment somebody adds a member to Particle.

Reflection lets you generate the SoA type from the AoS type. The standard synthesis function is define_aggregate, which takes a forward declaration of a class and a list of data_member_spec descriptions and finishes the definition in place.

template <typename Aos>
struct soa_of_helper {
    struct type;
    consteval {
        constexpr auto ctx = std::meta::access_context::current();
        std::vector<std::meta::info> specs;
        template for (constexpr auto m : std::define_static_array(
                                             std::meta::nonstatic_data_members_of(^^Aos, ctx)
                                         ))
        {
            // Build a pointer-to-T member with the same name as the AoS field.
            auto member_type = std::meta::add_pointer(std::meta::type_of(m));
            specs.push_back(std::meta::data_member_spec(member_type, {
                .name = std::meta::identifier_of(m)
            }));
        }
        std::meta::define_aggregate(^^type, specs);
    }
};

template <typename Aos>
using soa_of = typename soa_of_helper<Aos>::type;

struct Particle { float x, y, z; float mass; };

// Generated: struct { float* x; float* y; float* z; float* mass; };
using ParticleSoA = soa_of<Particle>;

__global__ void update(ParticleSoA p, float dt, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    p.x[i] += dt;  // placeholder update
    // A member added to Particle (say, vx) shows up here as p.vx with no other change.
    // ... etc.
}

The consteval { ... } block is a consteval block, introduced by the companion proposal P3289. It’s a piece of compile-time imperative code that runs at the point where it appears in the program, which is what lets you call define_aggregate to complete the forward-declared type. The body builds a vector of data_member_spec, each one consisting of “a pointer to the original field type, named the same as the original field.” Once the consteval block has been evaluated, type is a complete aggregate and can be used like any other struct.

Add a member to Particle, recompile, and ParticleSoA grows a matching pointer. The kernel that takes ParticleSoA by value sees the new field immediately. There is no generator script, no shadow header, no #define FOREACH_PARTICLE_FIELD(X) list to keep in sync.

The same machinery handles the inverse direction. Given an SoA struct of pointers, you can synthesize an indexing helper that takes an int i and returns a populated AoS struct, with each member loaded by index. The point is not the specific gadget. The point is that the relationship “field X of the AoS type corresponds to a pointer to X in the SoA type” stops being a comment in a README and becomes a property the compiler enforces.
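For a sense of what such an indexing helper does, here is a hand-written sketch for Particle in plain C++. It is only an illustration: ParticleSoA is written out manually as a stand-in for soa_of<Particle>, and load_particle and store_particle are hypothetical names. A reflection-based version would generate the per-member lines instead of listing them.

```cpp
#include <cassert>
#include <cstddef>

struct Particle { float x, y, z; float mass; };

// Hand-written stand-in for the generated soa_of<Particle>.
struct ParticleSoA { float* x; float* y; float* z; float* mass; };

// Hypothetical helper: gather element i of every column into one AoS value.
inline Particle load_particle(const ParticleSoA& soa, std::size_t i)
{
    assert(soa.x && soa.y && soa.z && soa.mass);  // all columns must be bound
    return Particle{ soa.x[i], soa.y[i], soa.z[i], soa.mass[i] };
}

// Hypothetical helper: scatter an AoS value back into element i of each column.
inline void store_particle(const ParticleSoA& soa, std::size_t i, const Particle& p)
{
    assert(soa.x && soa.y && soa.z && soa.mass);
    soa.x[i]    = p.x;
    soa.y[i]    = p.y;
    soa.z[i]    = p.z;
    soa.mass[i] = p.mass;
}
```

Every line in the two function bodies names a member explicitly, so each one is a place where the hand-written version can drift from Particle.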

Compile-time kernel argument binding and push-constant packing

A HIP or CUDA kernel launch is just a function call with extra ceremony. The arguments are copied into a small parameter buffer that the driver hands to the GPU. For Vulkan compute the analogous thing is a push constant (D3D12 calls them root constants): a small chunk of memory the application packs on the host and the shader unpacks on the device. In both cases the contract is “the bytes in this buffer, at these offsets, with these types.”

Hand-written push-constant packing looks like this, and it is exactly as awful as it sounds:

std::byte buf[128];
std::memcpy(buf + 0, &transform, sizeof(transform));
std::memcpy(buf + 64, &lighting, sizeof(lighting));
std::memcpy(buf + 96, &debug_flag, sizeof(debug_flag));
// hope you got the offsets right

With reflection you describe the layout with a regular struct and generate the packer from it:

template <typename Layout>
void pack_push_constants(std::byte* dst, const Layout& src)
{
    constexpr auto ctx = std::meta::access_context::current();
    template for (constexpr auto m : std::define_static_array(
                                         std::meta::nonstatic_data_members_of(^^Layout, ctx)
                                     ))
    {
        constexpr std::size_t off = std::meta::offset_of(m).bytes;
        constexpr std::size_t sz  = std::meta::size_of(m);
        std::memcpy(dst + off, &src.[:m:], sz);
    }
}

struct DrawPC {
    glm::mat4 transform;
    glm::vec4 tint;
    uint32_t  material_id;
};

DrawPC pc{...};
pack_push_constants(scratch, pc);
vkCmdPushConstants(cmd, layout, stage, 0, sizeof(DrawPC), scratch);

Three things to notice. First, the offsets and sizes are constants. They are known at compile time because they come from std::meta::offset_of and std::meta::size_of, both of which are consteval functions. When the struct has no internal padding, a good optimizer can merge the loop into a single memcpy of sizeof(Layout) bytes, which is exactly what you’d write by hand. Second, the access src.[:m:] is the splicer doing real work: it expands to src.transform, src.tint, src.material_id across the three iterations of the template for. Third, the function is generic over Layout. One function template covers every push constant struct in your codebase.
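To see what the expansion amounts to, the sketch below writes out the three copies by hand in plain C++17 and checks them against a whole-object copy. The names are assumptions for illustration: DrawPCPlain mirrors DrawPC with plain arrays in place of the glm types, and pack_by_hand and matches_whole_copy are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for DrawPC, using plain arrays instead of glm types.
struct DrawPCPlain {
    float         transform[16];
    float         tint[4];
    std::uint32_t material_id;
};

// One memcpy per member, each with a compile-time offset and size.
inline void pack_by_hand(std::byte* dst, const DrawPCPlain& src)
{
    assert(dst != nullptr);
    std::memcpy(dst + offsetof(DrawPCPlain, transform),
                &src.transform, sizeof(src.transform));
    std::memcpy(dst + offsetof(DrawPCPlain, tint),
                &src.tint, sizeof(src.tint));
    std::memcpy(dst + offsetof(DrawPCPlain, material_id),
                &src.material_id, sizeof(src.material_id));
}

// No padding: the member sizes add up to the size of the whole struct.
static_assert(sizeof(DrawPCPlain) ==
              20 * sizeof(float) + sizeof(std::uint32_t));

// Hypothetical check: member-wise packing matches a whole-object copy.
inline bool matches_whole_copy(const DrawPCPlain& src)
{
    std::byte by_member[sizeof(DrawPCPlain)] = {};
    std::byte whole[sizeof(DrawPCPlain)] = {};
    pack_by_hand(by_member, src);
    std::memcpy(whole, &src, sizeof(src));
    return std::memcmp(by_member, whole, sizeof(whole)) == 0;
}
```

The equivalence holds only because this struct is tightly packed. With padding between members, the member-wise version leaves the gap bytes in the destination untouched.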

You can take this further. If your backend cares about ordering within the buffer (some APIs do), you can sort the members at compile time before emitting, and assert that the user-facing struct matches the chosen order:

template <typename Layout>
consteval auto layout_descriptor()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto m = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^Layout, ctx)
                              );
    std::array<std::pair<std::string_view, std::size_t>, m.size()> out{};
    for (std::size_t i = 0; i < m.size(); ++i)
    {
        out[i] = { std::meta::identifier_of(m[i]),
                   std::meta::offset_of(m[i]).bytes };
    }

    // Sort by offset ascending so the descriptor reflects buffer order,
    // regardless of declaration order in the struct.
    std::ranges::sort(out,
              [](const auto& a, const auto& b) { return a.second < b.second; });
    return out;
}

The returned array is a constexpr value. You can use it to drive runtime validation (compare against the layout the driver reports), or you can dump it into a static_assert that diffs your struct against a reference layout the shader author committed.
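The static_assert diff can be sketched in plain C++17 once the descriptor is an ordinary constexpr array. In the sketch below, same_layout is a hypothetical helper, host_layout is a hand-written stand-in for the result of layout_descriptor<DrawPC>(), and the offsets in shader_reference are illustrative values rather than output from a real shader compiler.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <utility>

using layout_entry = std::pair<std::string_view, std::size_t>;

// Hypothetical helper: true when both descriptors list the same names at the
// same byte offsets, in the same order.
template <std::size_t N, std::size_t M>
constexpr bool same_layout(const std::array<layout_entry, N>& actual,
                           const std::array<layout_entry, M>& reference)
{
    if (N != M) return false;
    for (std::size_t i = 0; i < N; ++i)
    {
        if (actual[i].first  != reference[i].first)  return false;
        if (actual[i].second != reference[i].second) return false;
    }
    return true;
}

// Reference layout as a shader author might commit it (illustrative values).
constexpr std::array<layout_entry, 3> shader_reference{{
    {"transform", 0}, {"tint", 64}, {"material_id", 80}
}};

// Stand-in for layout_descriptor<DrawPC>(), written out by hand here.
constexpr std::array<layout_entry, 3> host_layout{{
    {"transform", 0}, {"tint", 64}, {"material_id", 80}
}};

static_assert(same_layout(host_layout, shader_reference),
              "Host struct no longer matches the committed shader layout.");
```

Comparing names as well as offsets means a renamed or swapped field fails the build even when the byte positions happen to line up.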

Automatic enum-to-dispatch-table for kernel variants

GPU kernels live and die by specialization. A reduction kernel for float is different from a reduction kernel for __bf16. A GEMM with M=128, N=128, K=32 is a different binary from M=64, N=64, K=128. You write the kernel as a template and instantiate it thousands of times. Then you need a dispatch table that maps “the runtime tile size the user asked for” to “the specific instantiation.”

The boilerplate version is a switch:

switch (cfg.tile) {
    case TileSize::T64:  return launch<TileSize::T64>(args);
    case TileSize::T128: return launch<TileSize::T128>(args);
    case TileSize::T256: return launch<TileSize::T256>(args);
    default: std::unreachable();
}

This works. It also breaks every time you add a new tile size and forget to add the case. With reflection, the switch writes itself from the enum definition:

enum class TileSize { T64 = 64, T128 = 128, T256 = 256, T512 = 512 };

template <typename Enum, typename Args>
void dispatch(Enum value, const Args& args)
{
    template for (constexpr auto e : std::define_static_array(
                                         std::meta::enumerators_of(^^Enum)
                                     ))
    {
        if (value == [:e:])
        {
            launch<([:e:])>(args);
            return;
        }
    }
    std::unreachable();
}

// Call site:
dispatch(cfg.tile, args);

enumerators_of(^^TileSize) returns reflections for the four enumerators. [:e:] splices each one as a constant expression, which is exactly what launch<...> needs as a non-type template argument. Each iteration of template for produces a separate if-branch with its own template instantiation, so the result is an if-chain that an optimizer will typically compile to the same code as the hand-written switch.

Add T1024 to the enum, recompile, and the dispatch picks it up automatically. You’ll get a hard error in the same translation unit if launch<TileSize::T1024> fails to instantiate, which is the right time to learn about it.
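For comparison, roughly the same if-chain can be produced in C++17 with a fold expression, at the cost of listing the enumerators by hand. This is a sketch under assumptions, not the post's dispatch function: launch_stub stands in for a real kernel launcher, and it is constexpr only so the selection can be checked at compile time.

```cpp
enum class TileSize { T64 = 64, T128 = 128, T256 = 256, T512 = 512 };

// Stand-in for a templated kernel launcher; returns the tile width so the
// selected instantiation is observable.
template <TileSize Tile>
constexpr int launch_stub() { return static_cast<int>(Tile); }

// Tries each listed enumerator in order. The fold over || stops at the first
// match, mirroring the early return in the template-for version.
template <TileSize... Tiles>
constexpr int dispatch_listed(TileSize value)
{
    int result = 0;
    const bool matched =
        ((value == Tiles ? (result = launch_stub<Tiles>(), true) : false) || ...);
    return matched ? result : -1;  // -1 signals "no enumerator matched"
}

constexpr int dispatch_tile(TileSize value)
{
    // The enumerator list is repeated by hand: the part reflection removes.
    return dispatch_listed<TileSize::T64, TileSize::T128,
                           TileSize::T256, TileSize::T512>(value);
}
```

Forgetting to extend the list in dispatch_tile after adding an enumerator compiles cleanly and fails only at run time, which is the failure mode enumerators_of closes.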

For two-dimensional dispatch (tile size and data type, say) you nest the template for and the combinatorial explosion is now the compiler’s problem instead of yours:

template <typename Tile, typename Dtype, typename Args>
void dispatch2d(Tile t, Dtype d, const Args& args)
{
    template for (constexpr auto te : std::define_static_array(
                                          std::meta::enumerators_of(^^Tile)
                                      ))
    {
        if (t != [:te:]) continue;
        template for (constexpr auto de : std::define_static_array(
                                              std::meta::enumerators_of(^^Dtype)
                                          ))
        {
            if (d == [:de:])
            {
                launch<([:te:]), ([:de:])>(args);
                return;
            }
        }
    }
    std::unreachable();
}

GPU memory layout inspector and padding detector

The host and device disagree on alignment more often than they agree on it. A bool is one byte on the host and four bytes on most shading languages. A double is sometimes 8-byte aligned, sometimes 16. The std140 and std430 layouts have their own opinions. Even within HIP, __align__(16) decorations on device structs can move offsets without changing the source.
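The effect of an alignment decoration is easy to reproduce on the host with standard alignas. The sketch below uses two hypothetical structs, LightNatural and LightAligned, with identical member lists. The 16-byte boundary imitates the alignment std140 gives a vec3.

```cpp
#include <cstddef>

// Natural layout: the 3-float vector follows the scalar immediately.
struct LightNatural {
    float intensity;
    float direction[3];
};

// Same members, but the vector is forced onto a 16-byte boundary.
struct LightAligned {
    float intensity;
    alignas(16) float direction[3];
};

static_assert(offsetof(LightNatural, direction) == sizeof(float));
static_assert(offsetof(LightAligned, direction) == 16);
static_assert(sizeof(LightAligned) == 32,
              "size is rounded up to a multiple of the 16-byte alignment");
```

Twelve bytes of padding appear after intensity and four more at the end, none of which is visible in the member list.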

A reflection-driven layout report is a small, useful tool. Stick it in a unit test and dump the layout of every kernel-facing struct. Compare two builds of the same struct, or compare against a checked-in golden file.

struct field_layout {
    std::string_view name;
    std::string_view type;
    std::size_t offset;
    std::size_t size;
    std::size_t align;
};

template <typename T>
consteval auto inspect()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto m = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^T, ctx)
                              );
    std::array<field_layout, m.size()> out{};
    for (std::size_t i = 0; i < m.size(); ++i)
    {
        out[i] = {
            .name   = std::meta::identifier_of(m[i]),
            .type   = std::meta::display_string_of(std::meta::type_of(m[i])),
            .offset = std::meta::offset_of(m[i]).bytes,
            .size   = std::meta::size_of(m[i]),
            .align  = std::meta::alignment_of(std::meta::type_of(m[i]))
        };
    }
    return out;
}

template <typename T>
void print_layout(const char* label)
{
    constexpr auto layout = inspect<T>();
    std::println("=== {} (size={}, align={}) ===",
                 label, sizeof(T), alignof(T));
    std::size_t expected = 0;
    for (auto f : layout)
    {
        if (f.offset != expected)
        {
            std::println("  -- {} bytes of padding --", f.offset - expected);
        }
        std::println("  +{:>4}  {:<20} {} (size={}, align={})",
                     f.offset, f.name, f.type, f.size, f.align);
        expected = f.offset + f.size;
    }
    if (expected != sizeof(T))
    {
        std::println("  -- {} bytes of trailing padding --",
                     sizeof(T) - expected);
    }
}

=== MaterialParams (size=96, align=16) ===
  +   0  base_color           float [4] (size=16, align=4)
  +  16  emissive             float [3] (size=12, align=4)
  +  28  emissive_strength    float (size=4, align=4)
  +  32  roughness            float (size=4, align=4)
  +  36  metallic             float (size=4, align=4)
  +  40  ao                   float (size=4, align=4)
  +  44  normal_scale         float (size=4, align=4)
  +  48  albedo_tex_idx       unsigned int (size=4, align=4)
  +  52  normal_tex_idx       unsigned int (size=4, align=4)
  +  56  roughness_tex_idx    unsigned int (size=4, align=4)
  +  60  ao_tex_idx           unsigned int (size=4, align=4)
  +  64  emissive_tex_idx     unsigned int (size=4, align=4)
  +  68  flags                unsigned int (size=4, align=4)
  +  72  alpha_cutoff         float (size=4, align=4)
  +  76  clearcoat            float (size=4, align=4)
  +  80  dummy_test_pad       char [9] (size=9, align=1)
  -- 7 bytes of trailing padding --

Running print_layout<MaterialParams>("MaterialParams") produces a compact report you can paste into a PR description, diff against last week’s build, or hand to a shader author who is convinced the bug is on the C++ side. The padding detection makes the cost of a poorly-ordered struct visible. Visible is the first step to fixable.

Shader and kernel metadata via annotations

The most underrated feature in C++26 is P3394, annotations for reflection. The syntax is [[=value]] on a declaration, where value is a constant of any structural type. Reflection queries can read the attached annotations back. This is the missing piece that turns reflection from a “compute things about types” tool into a “carry intent from the source into the compile-time program” tool.

The GPU use case is screaming for it. A kernel argument struct has all sorts of metadata that today lives in comments or in a parallel YAML file: which register set, which binding slot, which memory space, whether a buffer is read-only, whether a sampler is anisotropic. Annotations let you put it on the declaration and pull it back out at codegen time.

// Annotation vocabulary
namespace gpu::annotations {
    struct binding   { uint32_t set, slot; };
    struct readonly  {};
    struct storage   {};
    struct uniform   {};
}

// Domain structs
struct Particle { float x, y, z; float mass; };
struct DebugCounter { uint32_t draws, culled, overdraw_samples; uint32_t _pad; };

struct DrawResources {
    [[=gpu::annotations::binding(0, 0), =gpu::annotations::uniform{}]]
    glm::mat4*    camera;

    [[=gpu::annotations::binding(0, 1), =gpu::annotations::storage{}, =gpu::annotations::readonly{}]]
    Particle*     particles;

    [[=gpu::annotations::binding(0, 2), =gpu::annotations::storage{}]]
    DebugCounter* counters;
};

The reflection side reads them back. std::meta::annotations_of(m) returns the list of annotations on a member, and std::meta::extract<T> recovers the typed value:

// Annotation helper
template <typename Annotation, std::meta::info m>
consteval auto find_annotation() -> std::optional<Annotation>
{
    static constexpr auto annotations = std::define_static_array(
                                            std::meta::annotations_of(m)
                                        );
    template for (constexpr auto a : annotations)
    {
        if (std::meta::remove_const(std::meta::type_of(a)) == ^^Annotation)
        {
            return std::meta::extract<Annotation>(a);
        }
    }
    return std::nullopt;
}

// Descriptor builder
template <typename Resources>
consteval auto build_descriptor_set()
{
    constexpr auto ctx = std::meta::access_context::current();
    static constexpr auto m = std::define_static_array(
                                  std::meta::nonstatic_data_members_of(^^Resources, ctx)
                              );
    std::array<VkDescriptorSetLayoutBinding, m.size()> out{};
    std::size_t i = 0;
    template for (constexpr auto member : m)
    {
        constexpr auto bind = find_annotation<gpu::annotations::binding, member>();
        static_assert(bind.has_value(),
                      "Every resource must declare a gpu::annotations::binding.");
        out[i].binding = bind->slot;
        out[i].descriptorType =
            find_annotation<gpu::annotations::storage, member>().has_value()
                ? VK_DESCRIPTOR_TYPE_STORAGE_BUFFER
                : VK_DESCRIPTOR_TYPE_UNIFORM_BUFFER;
        ++i;
    }
    return out;
}

constexpr auto draw_bindings = build_descriptor_set<DrawResources>();

The whole descriptor-set layout is computed at compile time, from declarations that read like documentation. A new resource means adding a member with annotations. The descriptor set updates itself. The static_assert ensures that nobody adds a resource without saying where it goes.
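Because the result is an ordinary constexpr array, further checks can be layered on top of it in plain C++. One possible addition, sketched here under assumptions, is a guard against two resources claiming the same slot. binding_entry is a simplified stand-in for VkDescriptorSetLayoutBinding, bindings_unique is a hypothetical helper, and draw_slots is written out by hand in place of the generated draw_bindings.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Stand-in for VkDescriptorSetLayoutBinding with only the field used here.
struct binding_entry { std::uint32_t binding; };

// Hypothetical helper: true when no two entries share a binding slot.
template <std::size_t N>
constexpr bool bindings_unique(const std::array<binding_entry, N>& entries)
{
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = i + 1; j < N; ++j)
            if (entries[i].binding == entries[j].binding) return false;
    return true;
}

// Slots 0, 1, 2 as declared on DrawResources; listed by hand in this sketch.
constexpr std::array<binding_entry, 3> draw_slots{{ {0}, {1}, {2} }};

static_assert(bindings_unique(draw_slots),
              "Two resources share a binding slot.");
```

The quadratic scan is fine at compile time for the handful of bindings a descriptor set typically holds.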

The same trick works for any “out of band” piece of information you’d otherwise encode in naming conventions or stringly-typed registration calls:

[[=workgroup_size{256, 1, 1}]] on a kernel function template, read by the launch wrapper to derive grid dimensions.
[[=numa{0}]] or [[=on_device{1}]] on a buffer member, read by the allocator to pick the right pool.
[[=docstring{"Energy of the system in joules."}]] on a field, read by the layout-report generator above so the dumped output explains itself.

The Qt people have been particularly vocal about this. Their QRangeModel post shows the same pattern in a totally different domain: meta-properties for UI binding, declared on the data type, harvested by the framework at compile time. The GPU case is structurally identical.

Conclusion

Six examples, one shape. You take metadata that used to live in a code generator, a macro list, a comment, or a wiki page, and you put it on the declaration where it belongs. The compiler reads it back, generates the code that used to be hand-written boilerplate, and breaks the build when the declaration and the kernel disagree.

Current support

The tooling story has moved fast. GCC 16 is the first mainline compiler to ship P2996, and as of writing it covers the core language pieces (reflection operator, splicers, annotations from P3394, and expansion statements, i.e. template for, from P1306) plus the <meta> library, all marked partial because a handful of corners are still being filled in. Bloomberg’s clang fork is still useful as a cross-check and was the implementation everything was prototyped against. Mainline Clang and MSVC have nothing yet, which means cross-compiler portability is, for the moment, theoretical. Error messages when a consteval block fails halfway through still test your patience. Linker diagnostics for define_aggregate are, in 2026, an acquired taste.

The bigger picture

Having said that, the underlying model is correct and the GPU use case is one of the most compelling targets for it. Layouts, bindings, kernel variants, and metadata are exactly the things reflection was designed to manipulate, and they happen to be exactly the things GPU programmers spend half their time getting wrong by hand. Whether you adopt reflection on the day your toolchain ships it or wait until C++29 fills in the gaps (code injection along the lines of P2237, for example), it is worth learning now.

The shift this represents is bigger than any one feature. GPU library authors have spent the last decade papering over the lack of language-level metadata with code generators, embedded DSLs, and Python build steps. Reflection does not retire any of those tools overnight, but it shrinks the surface area they need to cover. The parts of those systems that really do amount to “describe this struct to the runtime” can now be plain C++, sitting next to the kernel that uses them. That is a much smaller, much friendlier build, and it is the first time in a long while that the host side of a GPU codebase gets to feel like a modern language instead of a preprocessor exhibit.

RDNA and CDNA: similarities and differences

Posted by Simeon Atanasov on 24 June 2026

In 2019 AMD announced the Radeon™ RX 5700 XT, a GPU built on its then brand-new RDNA architecture, which aimed to improve on the older GCN-based cards. One year later, AMD announced another GPU architecture, CDNA, with the release of the Radeon™ Instinct MI100. Today, AMD still maintains two separate product stacks, with RDNA 4 at the forefront of its consumer GPU releases and the MI355X as CDNA 4’s flagship chip. This raises an interesting question: what are the differences between RDNA-based and CDNA-based cards? This article aims to present not only the architectural differences, but also practical examples of different behavior on the two platforms.

A shared DNA

Before we look at where the two product stacks differ, let us first look at what they have in common. It is no coincidence that the two architectures have similar names, as they share a common ancestor. Both instruction set architectures (ISAs) are based on the earlier Graphics Core Next (GCN), the driving force behind several years of AMD graphics accelerators. GCN cards gained a reputation for great compute performance compared to some of NVIDIA’s offerings at the time. The RX 480, for instance, was very popular with cryptocurrency miners, as it offered a large amount of VRAM and great compute performance for its price.
