.. meta::
   :description: This chapter describes the hardware implementation of AMD GPUs supported by HIP.
   :keywords: AMD, ROCm, HIP, hardware, GPU, architecture, compute unit, VALU, SALU, cache, memory hierarchy, CDNA, RDNA, GCN

.. _hardware_implementation:

*******************************************************************************
Hardware implementation
*******************************************************************************

This chapter describes the hardware architecture of AMD GPUs supported by HIP,
focusing on the internal organization and operation of GPU hardware components.
Understanding these hardware details helps you optimize GPU applications and
achieve maximum performance.

Overall GPU architecture
========================

AMD GPUs consist of interconnected blocks of digital circuits that work together
to execute complex parallel computing tasks. The architecture is organized
hierarchically to enable massive parallelism while managing resources efficiently.

Command processor and control
-----------------------------

The command processor (CP) serves as the primary interface between the CPU and
GPU, receiving and distributing commands for execution. The CP consists of two
main components:

* **Command processor fetcher (CPF)**: Fetches commands from memory and passes
  them to the CPC for processing.
* **Command processor packet processor (CPC)**: A microcontroller that decodes the
  fetched commands and dispatches kernels to the workgroup processors for
  scheduling.

The command processor handles several types of operations:

* Kernel launches, which are forwarded to asynchronous compute engines (ACEs)
* Memory transfers, which are delegated to direct memory access (DMA) engines
* Synchronization operations and memory fences

**DMA engines** handle memory transfers between CPU and GPU memory without CPU
involvement after initialization. Most GPUs contain two DMA engines, enabling
concurrent bidirectional transfers to better utilize PCIe bandwidth. The DMA
engines fetch data in small chunks. Each engine processes one copy command at a
time, so two transfers run in parallel only when they are assigned to different
engines.

**Asynchronous compute engines (ACEs)** break down kernels into workgroups for
distribution to shader processor input (SPI) blocks. Multiple ACEs enable
concurrent kernel execution, with each ACE capable of dispatching one kernel
at a time. ACEs process commands from different queues asynchronously, enabling
overlap between different kernel executions and memory operations.

Hierarchical organization
-------------------------

The GPU organizes compute resources in a three-level hierarchy that enables
modular design and resource sharing:

1. **Shader engines (SE)**: Top-level organizational units containing multiple
   shader arrays and shared resources
2. **Shader arrays**: Groups of compute units sharing instruction and scalar
   caches
3. **Compute units (CU)**: Basic execution units containing the ALUs and
   registers for thread execution

.. figure:: ../data/understand/hardware_implementation/selayout.png
   :align: center
   :alt: Diagram showing the hierarchical organization of compute units grouped
         into shader engines on AMD GPUs
   :width: 800

   Hierarchical organization of compute units into shader engines

This hierarchical design allows different GPU configurations using the same
underlying architecture. For example, the R9 Fury X contains 16 shader arrays
with four CUs each, while the RX 480 contains 12 shader arrays with three CUs
each, but both are built on the same gfx803 architecture.
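
As a quick check of the example above, the total CU count is the product of the
number of shader arrays and the CUs per array. The helper name
``total_compute_units`` is illustrative only and is not part of any HIP API.

```python
def total_compute_units(shader_arrays: int, cus_per_array: int) -> int:
    """Return the CU count for a GPU built from identical shader arrays."""
    return shader_arrays * cus_per_array


# Configurations quoted in the text above.
fury_x_cus = total_compute_units(shader_arrays=16, cus_per_array=4)
rx_480_cus = total_compute_units(shader_arrays=12, cus_per_array=3)
print(fury_x_cus, rx_480_cus)  # 64 36
```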

Shader engine components
========================

Shader engines group multiple compute units together with shared resources that
improve efficiency and reduce redundancy. Each shader engine contains several
key components shared across its compute units.

Workgroup manager (SPI)
-----------------------

The workgroup manager, also called the shader processor input (SPI), bridges
the command processor and compute units. After the CP processes a kernel
dispatch, the SPI:

* Receives workgroups from the ACEs
* Schedules workgroups onto available compute units
* Initializes registers with kernel parameters
* Ensures all wavefronts of a workgroup execute on the same CU for
  synchronization
* Monitors resource availability and queues workgroups when resources are
  exhausted

The SPI tracks four critical resources that limit concurrent execution:

* Wavefront slots (execution contexts)
* Vector general-purpose registers (VGPRs)
* Scalar general-purpose registers (SGPRs)
* Local data share (LDS) memory

Workgroup-to-CU mapping is non-deterministic and based on available resources.
You should not assume any specific mapping pattern, as the same kernel launched
multiple times can have different workgroup distributions.
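
The resource check described above can be sketched as a small Python model. This
is a simplified illustration, not the SPI's actual algorithm: the class
``CuResources``, the function ``try_place_workgroup``, and all capacities are
hypothetical, and the model ignores how registers are split across SIMDs.

```python
from dataclasses import dataclass


@dataclass
class CuResources:
    """Free resources on one CU. The capacities used below are placeholders."""
    wave_slots: int
    vgprs: int
    sgprs: int
    lds_bytes: int


def try_place_workgroup(free: CuResources, waves: int, vgprs_per_wave: int,
                        sgprs_per_wave: int, lds_bytes: int) -> bool:
    """Reserve resources for a whole workgroup, or change nothing if it does not fit."""
    need_vgprs = waves * vgprs_per_wave
    need_sgprs = waves * sgprs_per_wave
    fits = (waves <= free.wave_slots and need_vgprs <= free.vgprs
            and need_sgprs <= free.sgprs and lds_bytes <= free.lds_bytes)
    if fits:
        # All wavefronts of the workgroup are placed on the same CU.
        free.wave_slots -= waves
        free.vgprs -= need_vgprs
        free.sgprs -= need_sgprs
        free.lds_bytes -= lds_bytes
    return fits


# Hypothetical CU capacities and a workgroup of 4 wavefronts using 32 KiB of LDS.
cu = CuResources(wave_slots=40, vgprs=2048, sgprs=3200, lds_bytes=64 * 1024)
placed = [try_place_workgroup(cu, waves=4, vgprs_per_wave=128,
                              sgprs_per_wave=64, lds_bytes=32 * 1024)
          for _ in range(3)]
print(placed)  # [True, True, False]: LDS runs out before wavefront slots do
```

In this example the third workgroup has to wait even though 32 wavefront slots
are still free, which shows how a single exhausted resource limits concurrency.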

Scalar L1 data cache (sL1D)
---------------------------

The scalar L1 data cache serves scalar memory operations from multiple CUs
within a shader array. The sL1D is shared between CUs (three CUs on Graphics
Core Next (GCN) GPUs and the MI100, two CUs on the MI200 series) and caches
data that is uniform across a wavefront, including:

* Kernel arguments and pointers
* Grid and block dimensions
* Constants accessed uniformly across threads
* Data from ``__constant__`` memory when accessed uniformly

Unlike the vector L1 cache, the sL1D doesn't use a "hit-on-miss" approach,
meaning subsequent requests to the same pending cache line count as duplicated
misses rather than hits.

L1 instruction cache (L1I)
--------------------------

The L1 instruction cache is a read-only cache shared between multiple CUs in a
shader array. Like the sL1D, it is backed by the L2 cache and doesn't use the
"hit-on-miss" approach. The L1I stores kernel instructions fetched by the
compute units, reducing instruction fetch latency and L2 cache pressure.

Compute unit architecture
=========================

The compute unit (CU) is the fundamental execution block of AMD GPUs.
It's responsible for executing kernels through its various specialized components
and pipelines.

.. figure:: ../data/understand/hardware_implementation/gcn_compute_unit.png
   :align: center
   :alt: Detailed diagram of an AMD CDNA compute unit showing internal
         components and data flow
   :width: 800

   Internal architecture of an AMD CDNA compute unit

Sequencer and scheduling
------------------------

The instruction sequencer (SQ) serves as the control center of each compute
unit, managing instruction flow through the execution pipelines. The sequencer
maintains wavefront state and coordinates instruction execution across different
functional units.

**Wavefront organization**: The sequencer organizes active wavefronts into four
pools, each containing slots for up to ten wavefronts (eight on the CDNA2 MI200
series). Each slot includes:

* Wavefront-level registers (program counter, execution mask, and others)
* Instruction buffer for prefetched instructions
* State information for scheduling decisions

This organization theoretically allows up to 40 concurrent wavefronts per CU
(32 on the MI200 series), though actual occupancy is typically limited by
register and LDS usage.
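
The following Python sketch estimates how VGPR usage limits occupancy, using the
register file sizes quoted in this chapter (256 to 512 KiB per CU, split across
four SIMDs). It is a simplified model under stated assumptions: the function
name ``max_waves_per_simd`` is hypothetical, registers are assumed to be 32 bits
wide per lane, and allocation granularity, SGPR limits, and LDS limits are
ignored.

```python
def max_waves_per_simd(vgprs_per_thread: int, vgpr_file_kib_per_cu: int = 256,
                       slots_per_simd: int = 10, simds_per_cu: int = 4,
                       wave_size: int = 64) -> int:
    """Estimate resident wavefronts per SIMD when only slots and VGPRs matter."""
    if vgprs_per_thread <= 0:
        raise ValueError("vgprs_per_thread must be positive")
    bytes_per_simd = vgpr_file_kib_per_cu * 1024 // simds_per_cu
    # One 32-bit register for a whole wavefront occupies wave_size * 4 bytes.
    regs_per_simd = bytes_per_simd // (wave_size * 4)
    return min(slots_per_simd, regs_per_simd // vgprs_per_thread)


for vgprs in (24, 64, 128):
    per_simd = max_waves_per_simd(vgprs)
    print(f"{vgprs} VGPRs/thread -> {per_simd} waves/SIMD, {4 * per_simd} waves/CU")
```

With a 256 KiB register file, a kernel using 24 VGPRs per thread can still fill
all ten slots, while a kernel using 128 VGPRs per thread is limited to two
wavefronts per SIMD.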

**Instruction fetching**: The fetch arbiter selects one wavefront per cycle to
fetch instructions from memory, prioritizing the oldest wavefronts. Each CU can
fetch up to 32 bytes (4-8 instructions) per cycle.

**Instruction issuing**: The issue arbiter determines which instructions execute
each cycle, selecting wavefronts from one pool per cycle in round-robin fashion.
The arbiter can issue multiple instructions per cycle to different execution
units, with a theoretical maximum of five instructions per cycle:

* One VALU instruction
* One vector memory operation
* One SALU/scalar memory operation
* One LDS operation
* One branch operation

Instructions always issue at wavefront granularity, with all threads in the
wavefront executing the same instruction in lockstep. Context switching between
wavefronts occurs every cycle with zero overhead, as all wavefront contexts
remain resident on the CU.
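
The per-cycle issue limit can be illustrated with a toy Python model. It assumes
that each wavefront in the selected pool offers at most one instruction and that
older wavefronts win ties. Both the oldest-first tie-break and the names
``ISSUE_CATEGORIES`` and ``select_issue`` are assumptions made for illustration;
the real arbiter also considers dependencies and unit availability.

```python
ISSUE_CATEGORIES = ("valu", "vmem", "scalar", "lds", "branch")


def select_issue(pool):
    """Pick at most one instruction per category from (wave_id, category) pairs.

    `pool` is ordered oldest wavefront first.
    """
    issued = {}
    for wave_id, category in pool:
        if category not in ISSUE_CATEGORIES:
            raise ValueError(f"unknown instruction category: {category}")
        if category not in issued:  # each execution unit accepts one instruction
            issued[category] = wave_id
    return issued


pool = [(0, "valu"), (1, "valu"), (2, "vmem"), (3, "scalar"),
        (4, "lds"), (5, "branch"), (6, "vmem")]
issued = select_issue(pool)
print(issued)  # wavefronts 1 and 6 wait because their units are already taken
```

Reaching the maximum of five instructions therefore requires the wavefronts in
the selected pool to present a mix of instruction types in the same cycle.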

Execution pipelines
-------------------

Each compute unit contains multiple specialized execution pipelines that process
different types of instructions in parallel, enabling efficient utilization of
the hardware resources.

Vector arithmetic logic unit (VALU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The VALU executes vector instructions across entire wavefronts, with each thread
potentially operating on different data. The VALU consists of:

* **Four SIMD processors**: Each containing 16 single-precision ALUs (or
  equivalent), for 64 total ALUs per CU
* **Vector register files**: 256-512 KiB of VGPR storage split across the four
  SIMDs
* **Instruction buffers**: Storage for up to 8-10 wavefronts per SIMD

On architectures with 64-thread wavefronts and 16-lane-wide SIMD units,
executing one instruction takes four cycles (one cycle per 16 threads). The
four-SIMD design ensures full utilization when sufficient wavefronts are
available: the sequencer issues a vector instruction to a different SIMD each
cycle, so each SIMD receives a new instruction every four cycles, just as it
finishes the previous one.
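
The cycle count above follows directly from the wavefront size and the SIMD
width. The short Python sketch below reproduces that arithmetic. The function
names are illustrative, and the model ignores instruction latency and
dependencies.

```python
import math


def cycles_per_vector_instruction(wave_size: int = 64, simd_width: int = 16) -> int:
    """Cycles a SIMD needs to process one instruction for a whole wavefront."""
    return math.ceil(wave_size / simd_width)


def peak_lane_ops_per_cycle(simds_per_cu: int = 4, simd_width: int = 16) -> int:
    """Lane operations a CU completes per cycle when every SIMD is busy."""
    return simds_per_cu * simd_width


print(cycles_per_vector_instruction())        # 4 cycles for wave64 on a 16-wide SIMD
print(cycles_per_vector_instruction(32, 32))  # 1 cycle for wave32 on a 32-wide SIMD
print(peak_lane_ops_per_cycle())              # 64 lane operations per cycle per CU
```

Because the four-cycle execution time matches the number of SIMDs, issuing to
one SIMD per cycle is enough to keep all four busy.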

The VALU serves as the primary arithmetic engine, executing the majority of
computation in GPU kernels. Data flows into these pipelines, undergoes arithmetic
transformation, and exits as results, with the goal of maximizing the number of
such transformations per clock cycle.

For CDNA architectures with matrix operations, the VALU also dispatches
matrix fused multiply-add (MFMA) instructions to specialized matrix units.

Scalar arithmetic logic unit (SALU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The SALU executes instructions uniformly across all threads in a wavefront,
handling operations like:

* Control flow (branches, loops)
* Address calculations
* Loading kernel arguments and constants
* Managing wavefront-uniform values

The SALU includes:

* A scalar processor for arithmetic and logic operations
* 12.5 KiB of SGPR storage per CU
* A scalar memory (SMEM) unit for memory operations

Scalar operations reduce pressure on vector units and registers by handling
uniform computations efficiently.

Vector memory unit (VMEM)
^^^^^^^^^^^^^^^^^^^^^^^^^

The VMEM unit handles all vector memory operations, including loads, stores,
and atomic operations. Each thread supplies its own address and data, though
the hardware optimizes access through memory coalescing when threads access
nearby addresses. The VMEM unit connects to the vector L1 cache and implements
both address generation and coalescing logic.

Branch unit
^^^^^^^^^^^

The branch unit executes jumps and branches for control flow changes affecting
entire wavefronts. Note that the branch unit handles wavefront-level control
flow, not execution mask updates for thread divergence, which are handled
through predication.

Special function unit (SFU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^

The special function units accelerate certain arithmetic operations that are too
complex or costly to implement purely within the standard vector ALUs.

SFUs are responsible for executing transcendental and reciprocal mathematical
functions such as ``exp``, ``log``, ``sin``, ``cos``, ``rcp`` (reciprocal), and
``rsqrt`` (reciprocal square root). These are heavily used in scientific,
physics, and machine learning workloads, particularly in activation functions
like GELU, sigmoid, or softmax.

Each compute unit includes a set of specialized pipelines, or transcendental
function units (TFUs), that handle these operations with dedicated hardware.
While their throughput is lower than that of the primary SIMD pipelines, they
enable these functions to execute efficiently without consuming general ALU
bandwidth.

From the compiler's perspective, these operations map to specific AMDGPU ISA
instructions, such as:

* ``v_exp_f32``: compute the base-2 exponential
* ``v_log_f32``: compute the base-2 logarithm
* ``v_sin_f32``, ``v_cos_f32``: compute sine or cosine
* ``v_rcp_f32``, ``v_rsq_f32``: compute reciprocal or reciprocal square root

In CDNA3-based GPUs (like MI300), SFU throughput and latency have been tuned for
deep learning primitives. For instance, exponentiation (``exp``) and logarithm
(``log``) functions are now pipelined to complete in a few cycles per lane,
allowing vectorized activation functions in large-scale matrix workloads to
execute without significant stalls.

For programmers targeting ROCm or HIP, these SFU-accelerated operations are
typically accessed through math intrinsics such as ``__expf``, ``__logf``, or
``__sinf``, which the compiler lowers to the corresponding AMDGPU ISA
instructions at compile time, adding a scaling step where the hardware
instruction uses a different base than the intrinsic.
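
The AMDGPU ISA defines ``v_exp_f32`` and ``v_log_f32`` as base-2 operations, so
a natural exponential or logarithm needs one extra multiplication. The Python
sketch below shows only the mathematical identities involved. It is not the
instruction sequence the compiler emits, and the function names are
illustrative.

```python
import math

LOG2_E = math.log2(math.e)  # scale factor that converts exp(x) into exp2(x * log2(e))
LN_2 = math.log(2.0)        # scale factor that converts log2(x) into ln(x)


def exp_via_exp2(x: float) -> float:
    """Natural exponential built from a base-2 exponential."""
    return 2.0 ** (x * LOG2_E)


def log_via_log2(x: float) -> float:
    """Natural logarithm built from a base-2 logarithm."""
    return math.log2(x) * LN_2


print(exp_via_exp2(1.5), math.exp(1.5))
print(log_via_log2(10.0), math.log(10.0))
```

Each pair of printed values agrees to within floating-point rounding.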

Load/store unit (LSU)
^^^^^^^^^^^^^^^^^^^^^

The load/store units handle the transfer of data between the compute units and
the GPU's memory subsystems. They are responsible for issuing, tracking, and
retiring memory operations (including loads from and stores to global memory,
local shared memory, and caches) for thousands of concurrent threads.

Each compute unit includes a set of LSUs tightly integrated with its vector and
scalar pipelines. These units handle memory instructions generated by active
wavefronts, such as ``buffer_load``, ``buffer_store``, and ``flat_load_dword``,
and route them through the GPU's hierarchical memory system.

The LSU's responsibilities include:

* Managing vector memory accesses for SIMD instructions
* Coordinating local data share (LDS) reads and writes
* Accessing the L0/L1 caches and forwarding requests to the L2 cache and HBM
* Handling synchronization and atomic operations between threads and workgroups

LSUs manage thousands of outstanding memory requests per GPU, dynamically
scheduling them to hide memory latency. While arithmetic pipelines continue
executing other wavefronts, the LSUs maintain queues of pending transactions
and reorder responses as data returns from memory.

On modern accelerators like MI300X, these LSUs achieve terabytes-per-second of
aggregate memory bandwidth, coordinating thousands of active threads performing
memory-intensive operations such as tensor loading, matrix tiling, and gradient
updates.

Matrix fused multiply-add (MFMA)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

CDNA architectures (MI100 and newer) include specialized matrix acceleration
units for high-throughput matrix operations. These units execute independently
from other VALU operations, allowing overlap between matrix and vector
computations. MFMA units support various data types including ``INT8``, ``FP16``,
``BF16``, and ``FP32``, with different throughput characteristics for each.

Matrix cores are GPU execution units that perform large-scale matrix operations
in a single instruction. In AMD architectures, these units are formally known as
MFMA (matrix fused multiply-add) units. They are the core hardware blocks
responsible for accelerating deep learning, HPC, and dense linear-algebra
workloads on modern Instinct GPUs.

Operating on entire tiles of matrices per instruction allows MFMA units to deliver
far greater arithmetic throughput and energy efficiency than scalar or vector ALUs.
Rather than fetching and decoding thousands of per-element multiply-add instructions,
each MFMA instruction processes an entire matrix fragment, drastically reducing
power per operation and increasing overall throughput.

An example MFMA instruction from the AMDGPU ISA is:

.. code-block:: none

   v_mfma_f32_16x16x4f16 v[0:15], v[16:17], v[18:19], v[0:15]

This instruction performs a matrix multiplication and accumulation D = A × B + C,
where the fragments A, B, and C are stored in VGPRs. Each lane supplies four
FP16 values (two VGPRs) for A and for B, and holds 16 FP32 values (16 VGPRs)
for C and D. The suffix ``16x16x4f16`` indicates a tile size of 16×16, with an
inner dimension of 4, operating on half-precision (FP16) inputs and accumulating
into 32-bit floating-point outputs.
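
To give a sense of how much work one such instruction represents, the Python
sketch below counts floating-point operations for a single M×N×K
multiply-accumulate tile and compares it with one vector FMA executed by a
64-thread wavefront. The ``blocks`` parameter is an assumption that models
variants processing several independent tiles per instruction. The comparison is
per instruction, not per cycle, because an MFMA instruction occupies the matrix
unit for multiple cycles.

```python
def mfma_flops(m: int, n: int, k: int, blocks: int = 1) -> int:
    """Floating-point operations in D = A x B + C for `blocks` M x N x K tiles.

    Each of the M * N outputs needs K multiplies and K adds.
    """
    return 2 * m * n * k * blocks


def wavefront_fma_flops(wave_size: int = 64) -> int:
    """Floating-point operations in one vector FMA across a wavefront."""
    return 2 * wave_size


tile = mfma_flops(16, 16, 4)
print(tile)                           # 2048 operations for one 16x16x4 tile
print(tile // wavefront_fma_flops())  # equivalent to 16 wavefront-wide FMA instructions
```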

Since their introduction in CDNA1, and further expanded in CDNA2 and CDNA3, AMD's
matrix cores have become the primary source of peak floating-point performance in
datacenter GPUs. For example, an MI300X accelerator achieves its multi-petaFLOP
throughput primarily through MFMA units.

The MFMA units use both standard VGPRs and additional accumulation VGPRs
(AGPRs) on supported architectures, providing up to 512 KiB of combined
register storage per CU.

Local data share (LDS)
----------------------

The local data share provides fast on-CU scratchpad memory for communication
between threads in a workgroup.

.. figure:: ../data/understand/hardware_implementation/lds.svg
   :align: center
   :alt: Diagram showing the organization of local data share with banks and
         connections to SIMD units
   :width: 800

   Local data share organization and SIMD connections

**Organization**: The LDS contains 32 banks, each 4 bytes wide, providing
128 bytes/cycle of total bandwidth. Banks can be accessed independently each
cycle for reads, writes, or atomic operations. The SIMDs connect to the LDS in
pairs, with each pair sharing a 64-byte bidirectional port.

**Access patterns**: A single wavefront can achieve up to 64 bytes/cycle
throughput (16 lanes per cycle). The actual bandwidth depends on data size and
access patterns:

* 4-byte values: 8 cycles for 64 threads (50% of peak bandwidth)
* 16-byte values: 20 cycles for 64 threads (80% of peak bandwidth)

**Conflict resolution**: The LDS includes hardware to detect and resolve bank
conflicts when multiple threads access different addresses in the same bank.
Conflicts are resolved by serializing accesses across multiple cycles. Address
conflicts (multiple threads atomically updating the same address) are similarly
serialized. Broadcasting from the same address to multiple threads is handled
efficiently without conflicts.
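
A minimal Python model of bank conflicts is sketched below. It assumes that
consecutive 4-byte words map to consecutive banks and that the threads passed in
are serviced together. The function ``conflict_passes`` is hypothetical and
returns the number of serialized passes, which is the largest number of distinct
words that fall into a single bank.

```python
from collections import defaultdict

NUM_BANKS = 32
BANK_WIDTH_BYTES = 4


def conflict_passes(byte_addresses) -> int:
    """Serialized passes needed to serve the given per-thread LDS addresses."""
    words_per_bank = defaultdict(set)
    for address in byte_addresses:
        word = address // BANK_WIDTH_BYTES
        # Threads reading the same word share one broadcast access.
        words_per_bank[word % NUM_BANKS].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


threads = range(32)
print(conflict_passes(4 * t for t in threads))    # unit stride: 1 pass
print(conflict_passes(8 * t for t in threads))    # stride of 2 words: 2 passes
print(conflict_passes(128 * t for t in threads))  # stride of 32 words: 32 passes
print(conflict_passes(0 for _ in threads))        # broadcast: 1 pass
```

A stride equal to a multiple of the bank count is the worst case. Padding each
row of a shared array by one word is a common way to break that pattern.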

Vector L1 cache
---------------

Each CU contains a dedicated vector L1 data cache (vL1D) serving vector memory
operations. Key characteristics include:

* Write-through design (writes go directly to L2)
* Optimization for high-bandwidth streaming access patterns
* Coherence with other CUs maintained through software management
* Typical size of 16 KiB per CU

The vector cache tags are checked for all vector memory operations, with misses
forwarded to the L2 cache. The write-through design simplifies coherence at the
cost of write bandwidth.

Memory hierarchy and system
===========================

The GPU memory system provides the bandwidth and capacity needed for massive
parallel computation while managing data coherence and access efficiency.

Memory organization
-------------------

.. figure:: ../data/understand/hardware_implementation/cdna2_gcd.png
   :alt: Block diagram showing four compute engines with L2 cache, memory
         controllers, and Infinity Fabric interconnect on CDNA2
   :width: 800

   CDNA2 Graphics Compute Die organization showing memory subsystem

AMD GPUs typically use high-bandwidth memory (HBM) for data-intensive workloads,
providing significantly higher bandwidth than traditional GDDR memory at the
cost of slightly higher latency. The memory system includes:

* **Memory channels**: Multiple independent memory controllers (typically 8-16)
* **L2 cache banks**: Distributed cache banks serving as the coherence point
* **Infinity Fabric**: High-speed interconnect for data routing

L2 cache architecture
---------------------

The L2 cache serves as the coherence point for all GPU memory accesses and is
shared by all compute units. The L2 consists of multiple independent channels
(32 on CDNA GPUs at 256-byte interleaving) that operate in parallel.
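
The effect of interleaving can be sketched with a simple modulo mapping in
Python. This is an illustrative assumption: real hardware may hash additional
address bits, and the function ``l2_channel`` is not part of any API.

```python
NUM_CHANNELS = 32
INTERLEAVE_BYTES = 256


def l2_channel(address: int) -> int:
    """Channel serving `address` under plain modulo interleaving."""
    return (address // INTERLEAVE_BYTES) % NUM_CHANNELS


# A contiguous 8 KiB region spreads over every channel.
contiguous = {l2_channel(a) for a in range(0, 8192, INTERLEAVE_BYTES)}
# Accesses strided by 8 KiB all land on the same channel.
strided = {l2_channel(i * 8192) for i in range(16)}
print(len(contiguous), len(strided))  # 32 1
```

Under this model, large power-of-two strides concentrate traffic on a few
channels, while contiguous accesses use all channels in parallel.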

.. figure:: ../data/understand/hardware_implementation/l2perf_model.png
   :align: center
   :alt: Diagram showing L2 cache to Infinity Fabric transaction flow with
         request categorization and routing
   :width: 800

   L2 cache to Infinity Fabric transaction flow

**Key characteristics**:

* **Channel organization**: Each channel handles a portion of the address space,
  with addresses interleaved across channels for load balancing.
* **Hit-on-miss behavior**: If a request arrives for a pending cache line fill,
  it counts as a hit, improving the effective hit rate.
* **Write coalescing**: Multiple writes to the same cache line are combined.
* **Atomic operation support**: Atomics execute directly in the L2 cache for
  coherence.

**L2-Fabric interface**: Requests missing in L2 are routed through Infinity
Fabric to the appropriate memory location, which could be:

* Local HBM on the same GPU
* Remote GPU memory (in multi-GPU systems)
* System memory (CPU DRAM)

The interface categorizes requests by type (read/write), size (32B/64B), and
destination for optimal routing.

Memory coherence
----------------

GPU memory coherence differs significantly from CPU designs to optimize for
throughput over latency:

**Write-through L1 caches**: All writes update both L1 and L2, ensuring L2
always has the latest data. This eliminates the need for complex coherence
protocols between L1 caches but requires higher write bandwidth.

**Software-managed coherence**: Coherence between CUs requires explicit
synchronization through:

* Memory fences for ordering
* Cache invalidation instructions
* Atomic operations (executed at L2 level)
* Kernel boundaries (implicit synchronization)

**Write combining**: To handle partial cache line updates from different CUs,
the GPU uses write masks indicating which bytes to update. This prevents false
sharing issues while maintaining correctness.
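
The byte-mask idea can be sketched in a few lines of Python. The example uses an
8-byte line for brevity, and the function ``apply_masked_write`` is a
hypothetical illustration of the principle rather than a description of the
hardware datapath.

```python
def apply_masked_write(line: bytearray, data: bytes, mask: int) -> None:
    """Update only the bytes of `line` whose mask bit is set (bit i covers byte i)."""
    for i in range(len(line)):
        if (mask >> i) & 1:
            line[i] = data[i]


line = bytearray(8)  # shared cache line, initially zero
# Two CUs write disjoint halves of the same line.
apply_masked_write(line, bytes([0xAA] * 8), mask=0x0F)  # CU A owns bytes 0-3
apply_masked_write(line, bytes([0xBB] * 8), mask=0xF0)  # CU B owns bytes 4-7
print(line.hex())  # aaaaaaaabbbbbbbb
```

Because each write touches only its own bytes, the result is the same whichever
write arrives first, and neither CU overwrites the other's data.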

Memory coalescing
-----------------

Memory coalescing combines memory accesses from multiple threads into fewer
transactions, significantly improving bandwidth utilization. The coalescing
hardware in the VMEM unit analyzes addresses from all threads in a wavefront
and groups them into the minimum number of cache line requests.

**Coalesced access pattern**: When consecutive threads access consecutive memory
addresses, the hardware can combine all 64 thread requests into as few as 4-8
cache line requests (depending on data size and alignment).

**Non-coalesced access pattern**: When threads access widely separated addresses,
each thread can generate a separate memory transaction, reducing effective
bandwidth by 16x or more.
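
The difference between the two patterns can be quantified by counting the cache
lines a wavefront touches. The Python sketch below assumes 64-byte cache lines
and a hypothetical helper ``cache_lines_touched``. It models only the address
arithmetic, not the actual coalescing hardware.

```python
def cache_lines_touched(byte_addresses, access_bytes: int = 4,
                        line_bytes: int = 64) -> int:
    """Number of distinct cache lines covered by the given per-thread accesses."""
    lines = set()
    for address in byte_addresses:
        first = address // line_bytes
        last = (address + access_bytes - 1) // line_bytes
        lines.update(range(first, last + 1))
    return len(lines)


wavefront = range(64)
coalesced = cache_lines_touched(4 * t for t in wavefront)                  # consecutive floats
wide = cache_lines_touched((8 * t for t in wavefront), access_bytes=8)     # consecutive doubles
strided = cache_lines_touched(256 * t for t in wavefront)                  # 256-byte stride
print(coalesced, wide, strided)  # 4 8 64
```

In this model the strided pattern needs 16 times as many cache line requests as
the coalesced one to deliver the same amount of useful data.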

To achieve optimal memory performance:

* Ensure consecutive threads access consecutive memory addresses
* Align data structures to cache line boundaries (64 B or 128 B)
* Use structure-of-arrays rather than array-of-structures layouts
* Consider padding shared (LDS) arrays to avoid bank conflicts

Architecture variants
=====================

AMD supports multiple GPU architecture families optimized for different use
cases while maintaining HIP compatibility.

Graphics Core Next (GCN)
------------------------

GCN represents the foundational architecture for modern AMD GPUs, establishing
key design principles still used today:

* 64-thread wavefronts
* Four SIMD units per CU with 16 lanes each
* Scalar unit for wavefront-uniform operations
* LDS with 32 banks

Multiple GCN generations (GCN1-5) introduced incremental improvements in
process technology, clock speeds, and instruction set features while maintaining
the core architectural philosophy.

.. _cdna_architecture:

CDNA architecture
-----------------

CDNA (Compute DNA) specializes in high-performance computing and machine
learning workloads. Building on GCN principles, CDNA adds significant compute
enhancements:

.. figure:: ../data/understand/hardware_implementation/cdna3_cu.png
   :alt: Block diagram showing CDNA3 compute unit with matrix core unit, shader
         cores, L1 cache, and local data share
   :width: 800

   CDNA3 compute unit with matrix acceleration

**Matrix Core Unit**: Specialized hardware for matrix multiply-accumulate
operations, providing up to 16 times more throughput than vector units for supported
operations. Matrix cores support multiple precisions (``INT8``, ``FP16``, ``BF16``, ``FP32``)
with varying performance characteristics.

**Accumulation VGPRs (AGPRs)**: Additional register file space (up to 256 KB)
dedicated to matrix accumulation, doubling the available register storage for
matrix operations. Data movement between VGPRs and AGPRs uses specialized
instructions (``v_accvgpr_*``).

**Enhanced memory bandwidth**: CDNA GPUs typically use HBM2, HBM2e, or HBM3
memory, providing multiple terabytes per second of bandwidth on high-end models
(for example, 3.2 TB/s on MI250X and 5.3 TB/s on MI300X).

**Multi-die designs**: CDNA2 (MI250) and CDNA3 (MI300) use chiplet
architectures with multiple dies connected through high-speed links, scaling
to higher compute and memory capacities.
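
The multiply-accumulate operation that the matrix cores accelerate can be
written as a plain reference loop, ``D = A * B + C``. The sketch below is
host-side C++ meant only to pin down the arithmetic. The function name
``multiply_accumulate`` and the row-major ``std::vector`` layout are
assumptions made for illustration, and the code does not model the
reduced-precision input formats or the tile shapes that the hardware
instructions use.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of matrix multiply-accumulate: D = A * B + C.
// A is m x k, B is k x n, C and D are m x n, all stored row-major.
std::vector<float> multiply_accumulate(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       const std::vector<float>& c,
                                       std::size_t m, std::size_t n,
                                       std::size_t k) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    std::vector<float> d = c;  // start from the accumulator input C
    for (std::size_t i = 0; i < m; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t p = 0; p < k; ++p)
                d[i * n + j] += a[i * k + p] * b[p * n + j];
    return d;
}
```

Each output element needs ``k`` multiply-add steps, which is the work a matrix
core performs for a whole tile in one instruction instead of one element at a
time.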

.. _rdna_architecture:

RDNA architecture
-----------------

RDNA optimizes for graphics and lower-latency compute workloads through
fundamental architectural changes:

.. figure:: ../data/understand/hardware_implementation/rdna3_cu.png
   :alt: Block diagram showing RDNA3 work group processor with dual compute
         units, shared caches, and 32-wide SIMD units
   :width: 800

   RDNA3 work group processor architecture

**Wave32 execution**: Primary execution mode uses 32-thread wavefronts,
reducing divergence penalties and register pressure. Wave64 mode is available
for backward compatibility.

**Dual compute units**: The work group processor (WGP) replaces standalone CUs,
containing two closely coupled compute units sharing resources:

* Each CU has two 32-wide SIMD units (vs. four 16-wide in GCN)
* A Wave32 instruction issues in a single cycle on a 32-wide SIMD
* Reduced instruction latency improves responsiveness

**Three-level cache hierarchy**:

* **L0 cache**: Per-CU cache (equivalent to GCN's L1)
* **L1 cache**: Shared by the WGPs in a shader array (new intermediate level)
* **L2 cache**: Global cache shared across all WGPs

**128-byte cache lines**: Doubled from 64 bytes in GCN, aligning with Wave32
access patterns (32 threads × 4 bytes = 128 bytes).
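
The cache-line arithmetic above can be checked with a small host-side C++
sketch that counts how many distinct cache lines one wavefront-wide access
touches. The function name ``cache_lines_touched`` and its parameters are
hypothetical, and the model looks only at addresses; it does not describe how
the hardware actually groups or schedules requests.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Number of distinct cache lines touched when each of `threads` work-items
// reads `elem_bytes` bytes, with consecutive work-items `stride_elems`
// elements apart, starting at byte address `base`.
std::size_t cache_lines_touched(std::uint64_t base, std::size_t threads,
                                std::size_t elem_bytes,
                                std::size_t stride_elems,
                                std::size_t line_bytes) {
    assert(line_bytes > 0 && elem_bytes > 0);
    std::set<std::uint64_t> lines;
    for (std::size_t t = 0; t < threads; ++t) {
        std::uint64_t first = base + t * stride_elems * elem_bytes;
        std::uint64_t last = first + elem_bytes - 1;
        // A misaligned element can straddle two cache lines.
        for (std::uint64_t line = first / line_bytes;
             line <= last / line_bytes; ++line)
            lines.insert(line);
    }
    return lines.size();
}
```

With 32 work-items reading consecutive 4-byte values from an aligned base,
``cache_lines_touched(0, 32, 4, 1, 128)`` is one line. Starting the same access
at byte 64, or using a stride of two elements, touches two lines.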

These RDNA optimizations target gaming workloads where latency matters more
than pure throughput, though the architecture remains capable of general
compute tasks.

Performance considerations
==========================

Understanding hardware characteristics helps you optimize GPU applications for
maximum performance.

Occupancy and resource limits
-----------------------------

Occupancy measures the ratio of active wavefronts to maximum possible
wavefronts on a CU. Higher occupancy generally improves latency hiding but
is limited by:

* **Register usage**: Each wavefront requires VGPRs and SGPRs from finite pools
* **LDS allocation**: Shared memory used per workgroup
* **Wavefront slots**: Fixed number of execution contexts per CU
* **Workgroup size**: Smaller workgroups can waste resources

Balancing these resources is critical for achieving optimal occupancy. Tools
like ``rocprof`` can help analyze occupancy and identify limiting factors.
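
How these limits interact can be sketched as a "tightest limit wins"
calculation. The host-side C++ below is a simplified model, not the allocator
the hardware or runtime uses: the struct and function names are hypothetical,
SGPRs and register allocation granularity are left out, and all limit values
must be supplied by the caller from the documentation for the target GPU.

```cpp
#include <algorithm>
#include <cassert>

// Illustrative per-CU limits; real values differ between architectures.
struct CuLimits {
    int simds_per_cu;        // SIMD units in one CU
    int max_waves_per_simd;  // wavefront slots per SIMD
    int vgprs_per_simd;      // VGPRs available per lane on one SIMD
    int lds_bytes_per_cu;    // LDS capacity of one CU
};

struct KernelUsage {
    int vgprs_per_wave;       // VGPRs per work-item used by the kernel
    int lds_bytes_per_group;  // LDS bytes allocated per workgroup
    int waves_per_group;      // wavefronts per workgroup
};

// Estimated resident wavefronts per CU: the tightest limit wins.
int waves_per_cu(const CuLimits& cu, const KernelUsage& k) {
    assert(k.vgprs_per_wave > 0 && k.waves_per_group > 0);
    int slot_limit = cu.simds_per_cu * cu.max_waves_per_simd;
    int vgpr_limit = cu.simds_per_cu * (cu.vgprs_per_simd / k.vgprs_per_wave);
    int lds_limit = slot_limit;  // a kernel without LDS is not LDS-limited
    if (k.lds_bytes_per_group > 0)
        lds_limit = (cu.lds_bytes_per_cu / k.lds_bytes_per_group) *
                    k.waves_per_group;
    int waves = std::min({slot_limit, vgpr_limit, lds_limit});
    // Only whole workgroups can be resident.
    return (waves / k.waves_per_group) * k.waves_per_group;
}

// Occupancy as the fraction of wavefront slots that are filled.
double occupancy(const CuLimits& cu, const KernelUsage& k) {
    return static_cast<double>(waves_per_cu(cu, k)) /
           (cu.simds_per_cu * cu.max_waves_per_simd);
}
```

With placeholder limits of 4 SIMDs, 10 wavefront slots per SIMD, 256 VGPRs per
lane, and 64 KiB of LDS, a kernel that uses 64 VGPRs and 16 KiB of LDS per
four-wavefront workgroup gets 16 resident wavefronts in this model, which is
40% occupancy.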

Latency hiding through multithreading
-------------------------------------

GPUs hide memory and instruction latency through massive hardware
multithreading rather than complex CPU techniques like out-of-order execution
or speculation. With sufficient wavefronts:

* Memory latency is hidden by executing other wavefronts during waits
* Pipeline latencies are covered by round-robin wavefront scheduling
* No context switch overhead as all contexts remain resident

The hardware can switch between wavefronts every cycle, maintaining high ALU
utilization even with long-latency operations in flight.
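
A rough way to reason about how many wavefronts are "sufficient" is to ask how
much independent ALU work is needed to fill one memory wait. The host-side C++
sketch below assumes every wavefront alternates a fixed amount of ALU work with
one memory wait. The function name ``waves_to_hide_latency`` is hypothetical,
and the cycle counts passed in are placeholders rather than measured hardware
latencies.

```cpp
#include <cassert>

// Wavefronts needed on one SIMD so that, while one wavefront waits
// `latency_cycles` for memory, the others supply enough ALU work to fill
// the gap. Each wavefront is assumed to issue `work_cycles` of ALU work
// between consecutive memory waits.
int waves_to_hide_latency(int latency_cycles, int work_cycles) {
    assert(latency_cycles >= 0 && work_cycles > 0);
    // Ceiling division: other wavefronts needed to cover the wait.
    int others = (latency_cycles + work_cycles - 1) / work_cycles;
    return 1 + others;
}
```

For example, a 400-cycle wait with 50 cycles of ALU work per wavefront needs
nine wavefronts in this model, while doubling the work per wavefront to 100
cycles reduces that to five. This is why kernels with more arithmetic per
memory access tolerate lower occupancy.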

Memory bandwidth utilization
----------------------------

Effective memory bandwidth depends on access patterns:

* **Coalesced access**: Can achieve 70-90% of peak bandwidth
* **Random access**: Might achieve only 5-15% of peak bandwidth
* **Bank conflicts**: Can serialize LDS access, reducing throughput
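
The bank-conflict case can be made concrete with a small host-side C++ model.
It assumes 4-byte-wide banks, that work-items reading the same word are served
by a broadcast, and that all work-items in the access are checked against each
other as one group. Those assumptions, and the function name
``lds_conflict_degree``, are illustrative and do not describe the exact
hardware behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>

// Worst-case number of serialized passes for one LDS access in which
// work-item t reads the 4-byte word at index t * stride_words.
// Work-items reading the same word do not conflict (broadcast assumed).
std::size_t lds_conflict_degree(std::size_t threads, std::size_t stride_words,
                                std::size_t banks) {
    assert(banks > 0);
    // For every bank, collect the distinct words requested from it.
    std::map<std::size_t, std::set<std::size_t>> words_per_bank;
    for (std::size_t t = 0; t < threads; ++t) {
        std::size_t word = t * stride_words;
        words_per_bank[word % banks].insert(word);
    }
    std::size_t degree = 0;
    for (const auto& entry : words_per_bank)
        degree = std::max(degree, entry.second.size());
    return degree;
}
```

With 32 banks, a stride of one word is conflict-free, a stride of two gives a
two-way conflict, and a stride of 32 sends every work-item to the same bank.
Padding the stride to 33 words spreads the accesses across all banks again,
which is the usual reason for padding LDS arrays.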

Memory-bound kernels should focus on:

* Maximizing coalescing through proper data layout
* Prefetching and data reuse in LDS
* Balancing computation with memory access
* Using appropriate cache policies
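
One common way to judge the balance between computation and memory access is
the roofline estimate, which this chapter does not define but which fits the
guidance above. The host-side C++ sketch below uses hypothetical function names
and expects peak compute, peak bandwidth, and arithmetic intensity (operations
per byte moved) from the caller; none of the values are tied to a specific GPU.

```cpp
#include <algorithm>
#include <cassert>

// Roofline estimate: attainable throughput is capped either by peak compute
// or by memory bandwidth multiplied by arithmetic intensity (FLOP per byte).
double attainable_gflops(double peak_gflops, double bandwidth_gbs,
                         double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && flops_per_byte >= 0.0);
    return std::min(peak_gflops, bandwidth_gbs * flops_per_byte);
}

// A kernel is memory-bound when its intensity is below the ridge point,
// the intensity at which the two caps are equal.
bool is_memory_bound(double peak_gflops, double bandwidth_gbs,
                     double flops_per_byte) {
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return flops_per_byte < peak_gflops / bandwidth_gbs;
}
```

A kernel that performs 0.25 operations per byte on a device with a ridge point
of 10 operations per byte is far into the memory-bound region, so data reuse in
LDS or registers is likely to help more than reducing arithmetic.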

Hardware-specific optimizations
-------------------------------

Different AMD GPU architectures benefit from tailored optimizations:

**For GCN/CDNA**:

* Optimize for 64-thread wavefront granularity
* Leverage matrix cores for applicable algorithms (CDNA only)
* Consider AGPR usage for register spilling

**For RDNA**:

* Design for 32-thread wavefront execution
* Utilize improved divergence handling
* Take advantage of the additional cache level

**Architecture-agnostic**:

* Minimize divergent control flow
* Ensure memory access coalescing
* Balance resource usage for occupancy
* Overlap computation with memory access
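
The wavefront-granularity points above come down to rounding: a workgroup
always occupies a whole number of wavefronts, so sizes that are not a multiple
of the wavefront size leave lanes idle. The host-side C++ sketch below shows
that arithmetic. The helper names ``waves_per_group`` and ``idle_lanes`` are
hypothetical.

```cpp
#include <cassert>

// Wavefronts needed to hold a workgroup of `group_size` work-items.
int waves_per_group(int group_size, int wavefront_size) {
    assert(group_size > 0 && wavefront_size > 0);
    return (group_size + wavefront_size - 1) / wavefront_size;
}

// Lanes left idle in the last, partially filled wavefront of the workgroup.
int idle_lanes(int group_size, int wavefront_size) {
    return waves_per_group(group_size, wavefront_size) * wavefront_size -
           group_size;
}
```

A workgroup of 96 work-items wastes 32 lanes with 64-thread wavefronts but none
with 32-thread wavefronts. In HIP code, the wavefront size of the current
device is available at run time through the ``warpSize`` field of
``hipDeviceProp_t``, so workgroup sizes do not need to be hard-coded per
architecture.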

Summary
=======

AMD GPU hardware architecture provides massive parallelism through hierarchical
organization of compute resources, specialized execution units, and a
sophisticated memory system. Understanding these hardware details, from the
command processor through shader engines to individual compute units and the
memory hierarchy, enables you to write more efficient GPU applications.

Key hardware concepts for optimization include:

* Workgroup scheduling and resource management by the SPI
* Instruction scheduling and wavefront execution in compute units
* Memory coalescing and cache behavior
* Architecture-specific features (matrix cores, Wave32/64 modes)
* Resource limits affecting occupancy

For details on mapping parallel algorithms to this hardware, see the
:ref:`programming_model` chapter. For specific optimization techniques, consult
the performance optimization guides in the ROCm documentation.