RDNA and CDNA: similarities and differences

In 2019 AMD announced the Radeon™ RX 5700 XT, a GPU that sported its brand-new at the time architecture named RDNA. It aimed to provide upgrades compared to the older GCN-based cards. Then one year later, AMD announced another GPU architecture — CDNA, with the release of the Radeon™ Instinct MI100. And in the current day, AMD still maintains two separate product stacks, with RDNA 4 at the forefront of their consumer GPU releases, and the MI355X as CDNA 4’s flagship chip. This then raises an interesting question — what are the differences between RDNA and CDNA-based cards. This article aims to present not only the architectural differences, but also to give practical examples of different behavior on the two platforms.

A shared DNA

Before we look at where the two product stacks differ, let us first check the similarities between the two. It is no coincidence that both architectures have similar names, as they do share a common ancestor. Both cards’ instruction set architectures (ISA) are based on the previous Graphics Core Next (or GCN for short) — the driving force behind several years of AMD graphics accelerators. GCN cards gained a reputation for having great compute performance compared to some of NVIDIA’s offerings at the time. It is no coincidence that, for instance, the RX 480 was very popular with cryptocurrency miners, as it offered a high amount of VRAM and great compute performance for the price it was offered for.

And the GCN gaming cards performed well compared to the competition, with good examples like the HD 7970, the R9 290 and the budget king at the time — RX 480. Yet as time went by, problems were outlined — latency and utilization. Due to the inner workings of the GCN architecture, it takes more cycles to execute an instruction, and it requires more threads to saturate the ALU usage to 100% — something that can be unfortunately tricky to do. Yet for compute, the cards continued having the reputation as good compute accelerators.

Different goals

When looking at the technical differences between the two architecture series, the most important factor to consider is the use case for the cards sporting them. RDNA is used on consumer cards and it puts an emphasis on gaming. Such an architecture should reduce latency, in order to let games run with a stable 1% low framerate, which translates to a smoother experience. On the other hand, CDNA cards are designed purely for computational workloads — physics simulations, scientific workloads, machine learning, AI, etc. In those use cases, AMD puts emphasis on increasing the number of floating point operations per second, or FLOPS. In such an architecture, the focus is put on keeping the silicon fed with data.

Knowing that, we can actually start the comparison. The primary differences are how a wavefront is executed, and how large it is. Apart from that, there are also differences in the presence or absence of hardware features, which is also reflected in the instruction set. The table below summarizes these differences, and we will look at several of them in subsections afterwards.

	RDNA	CDNA
Wavefront implementation	Based on SIMD32	Based on SIMD16
Wavefront size	Wave32, Wave64	Wave64
Cycles per instruction	1	4
Matrix multiply add implementation	WMMA instructions (since RDNA 3)	MFMA instructions (since CDNA 1)
Other instruction set differences	more limited Data-Parallel Primitives (DPP)	DPP and SDWA instruction modifiers
Other features	Video encoding and decoding, hardware for graphics workloads	Video decoding

Wavefront size and execution

The most fundamental difference between RDNA and CDNA hardware is the wavefront — the logical unit in which the execution of code is split. Since their advent, CDNA cards do 64 threads per wavefront. In contrast, RDNA changes that to 32, with the option to run in 64-lane wavefront mode (referred to as either Wave32 or Wave64). Let us take a look at what this looks like in more detail.

CDNA essentially builds on top of GCN’s compute units by expanding them with Matrix Core Engines. Other than that, it is similar to the AMD GPUs of the past — 4 SIMD16 units, where an instruction is executed over an entire wavefront in 4 cycles. The benefit of this strategy is that it allows for a high degree of interleaving calculations. As long as the GPU is kept well fed with data and branching is kept to a minimum, computational performance is very respectable.

In comparison, RDNA uses at its foundation SIMD32 units. Additionally, instructions are issued once per cycle (compared to once every 4 cycles on CDNA/GCN). The meaning of this is a faster completion time of executions!

Let us look at an example:

1  v_add_f32 v4 v0 v1
2  v_fmac_f32 v2 v0 v1
3  v_mul_f32 v2 v4 2.0
4  v_sub_f32 v5 v4 v1

These instructions work on both CDNA-based cards and RDNA-based ones — that’s the common DNA between them, as we said earlier. Importantly, we have one dependency — between instructions 1 and 3 (register v4) and also between 1 and 4 (again, register v4). Let us now see how the execution would look like:

A diagram comparing the execution per cycle on CDNA and RDNA Wave32 — *Figure 1: Comparison of CDNA’s Wave64 and RDNA’s Wave32*

On CDNA, one wavefront of size 64 is assigned to one SIMD16. And each instruction is executed in 4 cycles, for the 4 chunks of 16 threads in the wavefront, marked with shades of orange per each cycle of the instructions.

Compare that to RDNA’s Wave32 execution. There, instructions are issued once per cycle (except for transcendentals like exp, rcp, and trigonometric calculations, all of which are issued once every 4 cycles). Additionally, instructions have a latency — a certain number of cycles the GPU needs to run for, in order to make a result available in the output destination. In the example, we see the RDNA-based GPU needs 5 cycles per instruction to make a result available in the output register. For the purposes of consistency, on CDNA (and GCN) it’s 1 cycle (but that’s at quarter the rate due to the way of executing a wavefront on these cards).

Now we can take a look at the dependency in the example above. You can see that the third instruction cannot start until cycle 5, when v_add_f32 is complete, due to the use of v4. So the SIMD will unfortunately have to wait 3 cycles, before it can continue execution. Yet, the execution is done faster than in CDNA!

But in the table you saw that RDNA does both Wave32 and Wave64. This means that it can simulate a 64-size wavefront by executing an instruction first on the lower half of lanes, and then on the upper half. If we take the same code example above in Wave64, then we get an execution that looks like this:

Comparison of 4 assembly instructions executed in RDNA's Wave32 and Wave64 modes — *Figure 2: Comparison of RDNA’s Wave32 and RDNA’s Wave64*

Importantly, compare the mul instruction that has a dependency in v4 with add — there see something interesting. The delay that in Wave32 was 3 cycles, now got reduced to 1 cycle! And if we try other examples, you will see that overall the time the SIMD waits for a dependency gets significantly reduced. Yet, the total execution time was prolonged by 1 cycle, which is still faster than CDNA! Thus, Wave64 is good when a program’s goal is to increase the utilization.

This then raises an important question — what is the benefit of CDNA’s way of executing wavefronts? To answer this, we need to look at how SIMDs are bundled together on the two architectures.

In CDNA, four SIMD16s are bundled into a compute unit (CU for short). One CU contains the 4 SIMD units, scalar ALU, cache and the local data share (LDS) — chunk of memory that kernels within a block of threads can have shared access to. RDNA, on the other hand, contains workgroup processors (or WGP for short). It essentially bundles two CUs under the same block of LDS and cache. Consequently, both a CU and a WGP both contain 4 SIMDs, but a CU has 4 SIMD16s, and a WGP has 4 SIMD32s. Refer to the diagram from AMD’s presentation on RDNA:

A comparison between CDNA and RDNA illustrating how CDNA needs many more threads to occupy a compute unit compared to RDNA — *Figure 3: Comparison of 2 compute units (GCN, functionally equivalent to CDNA) and a single workgroup processor (RDNA). Source:* https://gpuopen.com/download/RDNA_Architecture_public.pdf

To make the proper comparison between the two architectures, we need to see how many threads are needed to occupy either a CDNA compute unit or an RDNA workgroup processor. For CDNA, taking 4 wavefronts — one per each SIMD, each wave having 64 threads, yields 256 threads needed to fill the CU’s occupancy to 100%. Compare that with RDNA, where you have 4 SIMDs each taking a wavefront of size 32, which yields 128 threads.

This means that you need a smaller number of threads to utilize the hardware efficiently. Yet, CDNA’s architecture is good for computational workloads, namely when you can distribute your computational task such that you can utilize the CU to its maximum. This is also related to how GCN-era cards had a reputation for being really good at computational workloads, a key example being a card like the RX 480 which was highly sought after by cryptocurrency miners.

We must make a certain distinction: what we explained here relates more closely to RDNA 1 and 2 versus CDNA 1-4 as opposed to RDNA 3/4 versus CDNA 1-4. RDNA 3 made an important change by introducing dual issue. What dual issue means is that there is a set of instructions which can be bundled together in a single instruction word, as long as they are independent of each other (if you have heard of VLIW from back in the day, congratulations — you recognize this as VLIW2 reinvented). Doing dual issue strives to increase instruction throughput, though it complicates compiler design. Such compilers have to find and instructions independent of each other and group them — a notoriously difficult task.

Giving the examples in this section with dual issue in mind would have complicated the diagrams, thus looking at dual issue is better done separately. Yet, the scheduling principles that were observed here, mostly apply for RDNA 3 and 4 as well.

Accelerating matrix multiply add operations

Both architectures sport hardware acceleration for matrix multiply add (MMA) operations, though their implementation differs.

Starting with RDNA 3, Radeon™ GPUs have included hardware for matrix multiply add (MMA) operations, namely the WMMA (wave matrix multiply accumulate) instructions. In contrast, already since its first iteration, CDNA has hardware for this purpose in the form of MFMA (matrix fused multiply add) instructions.

The common factors between them are the computation task they achieve and the way they are called from the kernel. In both cases, they accelerate the calculation of:

$D = AB + C$

with A, B, C and D all being matrices. A has a size of MxK, B — KxN. C and D thus have a size of M x N, with K being the common dimension between A and B.

The other similarity is how the instructions are invoked. In both cases, the data from the input matrices gets arranged in some layout, an intrinsic is called, after which the data is populated in the output matrix with the respective layout. Where the two implementations differ is in the sizes of M, N, K, and how the data is arranged in registers.

Let us look at the following example that takes two A and B matrices, each being 16×16 FP16 elements, accumulating and outputting in C being a 16×16 FP32 matrix. The example is composed from blogs on MFMA and WMMA published by AMD[1] [2], combined with host-side code needed to prepare the data, and compare the output using a CPU reference implementation. It defines two kernels, each named matmul, but one meant to be run on CDNA hardware, and the other — RDNA 3 hardware in Wave32 mode. You can compile it with hipcc demo.cpp -o matmul or amdclang++ -x hip demo.cpp --offload-arch=gfx??? -o matmul_demo (architectures here could be for instance gfx90a for CDNA 2 or gfx1100 for RDNA 3). If you want to see the assembly code, add the --save-temps flag and inspect the .s files.

 // Wave Matrix Multiply Accumulate (WMMA) and Matrix Fused Multiply Add (MFMA),
 // both cases using HIP compiler intrinsics
 // The examples do a matrix multiplication of two 16x16, fp16 matrices,
 // and stores them into a 16x16 fp32 result matrix
 
 #include <iostream>
 #include <hip/hip_runtime.h>
 #include <hip/hip_fp16.h>
 
 using namespace std;
 
 // Use half16 as an alias of the internal clang vector type of 16 fp16 values
 typedef _Float16 half16  __attribute__((ext_vector_type(16)));
 typedef float    float8  __attribute__((ext_vector_type(8)));
 typedef _Float16 half4   __attribute__((ext_vector_type(4)));
 typedef float    _float4 __attribute__((ext_vector_type(4)));
 // Note that there is a float4 defined in hip_runtime.h,
 // but it is not based on the ext_vector_type of clang, which does not help our use case
 
 
 #ifdef __GFX9__
 __global__ void matmul(__half* a, __half* b, float* c)
 {
     const int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
     const int lIdx = threadIdx.x;
 
     // a_frag will store 4 consecutive elements of a row
     // b_frag will store 4 consecutive elements of a column
 
     half4 a_frag;
     half4 b_frag;
     // initialize c fragment to 0
     _float4 c_frag = {};
 
     a_frag = *reinterpret_cast<const half4*>(a + 4 * (threadIdx.x / 16) + 16 * (threadIdx.x % 16));
 
     for (int i = 0; i < 4; i++)
     {
         b_frag[i] = b[i * 16 + lIdx % 16 + (lIdx / 16) * 64];
     }
 
     // Calling the MFMA intrinsic
     c_frag = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, c_frag, 0, 0, 0);
 
     // Write output data from c_frag to c
     for (int i = 0; i < 4; i++)
     {
         c[i * 16 + lIdx % 16 + (lIdx / 16) * 64] = c_frag[i];
     }
 
 }
 
 #elif defined(__GFX11__)
 
 __global__ void matmul(__half* a, __half* b, float* c)
 {
     const int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
     const int lIdx = threadIdx.x;
 
     // a and b fragments are stored in 8 VGPRs each, in packed format, so 16 elements each for a and b
     // a_frag will store one column of the 16x16 matrix A tile
     // b_frag will store one row of the 16x16 matrix B tile
     half16 a_frag;
     half16 b_frag;
     // initialize c fragment to 0
     float8 c_frag = {};
     float8 c_frag_hi = {};
 
 
     // lane is (0-31) mod 16 instead of 0-31 due to replication needed in RDNA 3's WMMA implementation
     const int lane = lIdx % 16; // then lanes 16-31 will
 
     for (int i = 0; i < 16; i++)
     {
         b_frag[i] = b[16 * i + lane];
     }
 
     for (int i = 0; i < 16; i++)
     {
         a_frag[i] = a[16 * lane + i];
     }
 
     // call the WMMA intrinsic with OPSEL set to "false"
     c_frag = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_frag, b_frag, c_frag);
 
     // Write output data from c_frag to c
     for (int i = 0; i < 8; i++)
     {
         const int r = i * 2 + (lIdx / 16);
         c[16 * r + lane] = c_frag[i];
     }
 }
 #endif
 
 int main(int argc, char* argv[])
 
 {
     __half a[16 * 16] = {};
     __half b[16 * 16] = {};
     float c[16 * 16] = {};
     __half *a_gpu, *b_gpu;
     float *c_gpu;
     hipMalloc(&a_gpu, 16*16 * sizeof(__half));
     hipMalloc(&b_gpu, 16*16 * sizeof(__half));
     hipMalloc(&c_gpu, 16*16 * sizeof(float));
 
     // fill in some data into matrices A and B
     for (int i = 0; i < 16; ++i)
     {
         for (int j = 0; j < 16; ++j)
         {
             a[i * 16 + j] = (__half)(i * 16 + j); // so this is row-major
             b[i * 16 + j] = (__half) (i * 16 + j); // for both A and B
         }
     }
 
     // Print out A
     printf("A:\n");
     for (int i = 0; i < 16; ++i)
     {
         for (int j = 0; j < 16; ++j)
         {
             printf("%f ", (float)a[i * 16 + j]);
         }
         printf("\n");
     }
 
 
     // Print out B
    
     printf("B:\n");
     for (int i = 0; i < 16; ++i)
     {
         for (int j = 0; j < 16; ++j)
         {
             printf("%f ", (float)b[i * 16 + j]);
         }
         printf("\n");
     }
 
     hipMemcpy(a_gpu, a, (16*16) * sizeof(__half), hipMemcpyHostToDevice);
     hipMemcpy(b_gpu, b, (16*16) * sizeof(__half), hipMemcpyHostToDevice);
     hipMemcpy(c_gpu, c, (16*16) * sizeof(float), hipMemcpyHostToDevice);
 
     int deviceId;
     auto err = hipGetDevice(&deviceId);
     int wavefrontSize;
     err = hipDeviceGetAttribute(&wavefrontSize, hipDeviceAttributeWarpSize, deviceId);
 
     matmul<<<dim3(1), dim3(wavefrontSize, 1, 1), 0, 0>>>(a_gpu, b_gpu, c_gpu);
     // grid has size 1x1x1, the block has wavefrontSizex1x1 threads
     // for both MFMA and WMMA
 
     hipMemcpy(c, c_gpu, (16 * 16) * sizeof(float), hipMemcpyDeviceToHost);
 
     hipFree(a_gpu);
     hipFree(b_gpu);
     hipFree(c_gpu);
 
     printf("C:\n");
     for (int i = 0; i < 16; ++i)
     {
         for (int j = 0; j < 16; ++j)
         {
             printf("%f ", c[i * 16 + j]);
         }
         printf("\n");
     }
 
     // Compare with a computation on the CPU
 
     float C_cpu[16 * 16] = {};
     float err_sum = 0;
 
     printf("The same calculation on a CPU:\n");
     for (int i = 0; i < 16; ++i)
     {
         for (int j = 0; j < 16; ++j)
         {
             float accum = 0;
             for (int k = 0; k < 16; ++k)
             {
                 accum += (float)a[i * 16 + k] * (float)b[k * 16 + j];
             }
             C_cpu[i * 16 + j] = accum;
             printf("%f ", C_cpu[i * 16 + j]);
             err_sum += c[i * 16 + j] - C_cpu[i * 16 + j];
         }
         printf("\n");
     }
 
     printf("\nCumulative diff between the CPU and GPU calculation: \e[91m%.2f\n\e[0m", err_sum);
 
     return 0;
 }

To go in depth, let us compare more closely the two implementations of the matmul kernel, and let us first look at the WMMA implementation:

__global__ void matmul(__half* a, __half* b, float* c)
 {
     const int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
     const int lIdx = threadIdx.x;
 
     // a and b fragments are stored in 8 VGPRs each, in packed format, so 16 elements each for a and b
     // a_frag will store one column of the 16x16 matrix A tile
     // b_frag will store one row of the 16x16 matrix B tile
     half16 a_frag;
     half16 b_frag;
     // initialize c fragment to 0
     float8 c_frag = {};
     float8 c_frag_hi = {};
 
 
     // lane is (0-31) mod 16 instead of 0-31 due to replication needed in RDNA 3's WMMA implementation
     const int lane = lIdx % 16; // then lanes 16-31 will
 
     for (int i = 0; i < 16; i++)
     {
         b_frag[i] = b[16 * i + lane];
     }
 
     for (int i = 0; i < 16; i++)
     {
         a_frag[i] = a[16 * lane + i];
     }
 
     // call the WMMA intrinsic with OPSEL set to "false"
     c_frag = __builtin_amdgcn_wmma_f32_16x16x16_f16_w32(a_frag, b_frag, c_frag);
 
     // Write output data from c_frag to c
     for (int i = 0; i < 8; i++)
     {
         const int r = i * 2 + (lIdx / 16);
         c[16 * r + lane] = c_frag[i];
     }
 }

What we see is multiple registers assigned (in the vectors a_frag and b_frag), each taking a row/column of A and B respectively. Afterwards, the intrinsic is being called, which returns a vector of 8 values (in c_frag), which are then populated in the c matrix. This happens for each thread in a wavefront (seen in the kernel invocation in the full source code). The benefit of WMMA is the consistent form they take — usually A and B are 16×16 operands as input, with only the K dimension varying among the available set of intrinsics. Though, there are important distinctions to be made.

Firstly, the RDNA 3 and RDNA 4 have different builtin functions to call from the kernels. This is due to RDNA 3’s requirement for duplicating data in Wave32 mode when using WMMA. Moreover, if you opt to run the intrinsic in Wave64 mode, you need to, first, call __builtin_amdgcn_wmma_f32_16x16x16_f16_w64, and second, quadruple the data. RDNA 4 does not suffer from the data duplication, but your kernel would have to look differently, especially when loading data into a_frag and b_frag.

Let us now contrast this with MFMA in CDNA:

__global__ void matmul(__half* a, __half* b, float* c)
 {
     const int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
     const int lIdx = threadIdx.x;
 
     // a_frag will store 4 consecutive elements of a row
     // b_frag will store 4 consecutive elements of a column
 
     half4 a_frag;
     half4 b_frag;
     // initialize c fragment to 0
     _float4 c_frag = {};
 
     a_frag = *reinterpret_cast<const half4*>(a + 4 * (threadIdx.x / 16) + 16 * (threadIdx.x % 16));
 
     for (int i = 0; i < 4; i++)
     {
         b_frag[i] = b[i * 16 + lIdx % 16 + (lIdx / 16) * 64];
     }
 
     // Calling the MFMA intrinsic
     c_frag = __builtin_amdgcn_mfma_f32_16x16x16f16(a_frag, b_frag, c_frag, 0, 0, 0);
 
     // Write output data from c_frag to c
     for (int i = 0; i < 4; i++)
     {
         c[i * 16 + lIdx % 16 + (lIdx / 16) * 64] = c_frag[i];
     }
 
 }

Looking closely shows that the data layout is completely different compared to the WMMA intrinsic! Here, a_frag and b_frag are loaded with 4 consecutive values of the A and B matrix respectively, compared to the row/column arrangement found in the WMMA example. The most likely explanation is due to the hardware differences — the very different ways of executing wavefronts on CDNA versus RDNA would be exemplified exactly in operations that iterate over an entire wavefront. An additional side effect is the available layouts for MFMA intrinsics, which are quite varied compared to WMMA. Yet, both forms of matrix multiplication acceleration would yield similar results.

Other differences in the instruction set

There are still other points where RDNA and CDNA differ. That includes hardware that is present or absent, and further differences in instructions.

RDNA is tailored for graphics workloads, and though nowadays we have GPUs that can be programmed in many different ways with incredible speed, there are some graphics operations that are still faster if done in silicon. They relate to rasterizing, texturing, ray tracing, among others. This means that RDNA will house that hardware within each WGP. Moreover, RDNA cards have a separate area on the chip for video encoding and decoding.

In contrast, CDNA cards only include the circuitry for decoding video, primarily due to machine learning workloads on video, and omit all geometry-oriented logic. Instead, they replace the graphics and encoding hardware with more cache.

Finally, there are some cases where the ISA differs between the two architecture in parts that are outside the missing hardware on either side. There are instructions with different names in the two architectures (compare for instance RDNA 4’s GLOBAL_LOAD_B128 with GLOBAL_LOAD_DWORDX4 in CDNA 4’s, both of which load 128 bits, respectively 4 double words of 32 bits, into vector registers). And last but not least, instruction modifiers like the Data-Parallel Primitives (DPP) differ in capabilities between RDNA and CDNA hardware, and the Sub-Dword Addressing (SDWA) modifier is not to be found on RDNA. A difference that can be further looked into in the future.

Conclusion

In this article we looked at an overview of how the graphics-focused RDNA microarchitecture differs from the compute-focused CDNA architecture. We looked at their common roots represented by GCN, and considered the use cases that influenced each architecture’s design.

We saw how CDNA carries the torch of GCN’s legacy by pruning the graphics hardware and focusing on its strength — big computational workloads. Of course, as long as the compute units are well-fed with wavefronts. In contrast, we observed how RDNA improves performance by focusing on smaller latency.

Following the fundamental differences of compute units and workgroup processors, we looked at how the primitives for matrix fused multiply add operations differ, with the help of a code sample. Afterwards, we briefly discussed other points where the instruction sets of RDNA and CDNA diverge.

Finally, we hope to have shown that the two product stacks of GPUs have differences far bigger than what a single letter change in the name would suggest. Thus, properly investigating the behaviours of both families of architectures, considerations for improving performance, how each evolved with time, and hardware capabilities that were gained or lost, would require further exploration.

StreamHPC communications