.. meta::
   :description: This chapter describes the hardware implementation of AMD GPUs supported by HIP.
   :keywords: AMD, ROCm, HIP, hardware, GPU, architecture, compute unit, VALU, SALU, cache, memory hierarchy, CDNA, RDNA, GCN

.. _hardware_implementation:

*******************************************************************************
Hardware implementation
*******************************************************************************

This chapter describes the hardware architecture of AMD GPUs supported by HIP,
focusing on the internal organization and operation of GPU hardware components.
Understanding these hardware details helps you optimize GPU applications and
achieve maximum performance.

Overall GPU architecture
========================

AMD GPUs consist of interconnected blocks of digital circuits that work together
to execute complex parallel computing tasks. The architecture is organized
hierarchically to enable massive parallelism while managing resources efficiently.

Command processor and control
-----------------------------

The command processor (CP) serves as the primary interface between the CPU and
GPU, receiving and distributing commands for execution. The CP consists of two
main components:

* **Command processor fetcher (CPF)**: Fetches commands from memory and passes
  them to the CPC for processing.
* **Command processor packet processor (CPC)**: A microcontroller that decodes the
  fetched commands and dispatches kernels to the workgroup processors for
  scheduling.

The command processor handles several types of operations:

* Kernel launches, which are forwarded to asynchronous compute engines (ACEs)
* Memory transfers, which are delegated to direct memory access (DMA) engines
* Synchronization operations and memory fences

**DMA engines** handle memory transfers between CPU and GPU memory without CPU
involvement after initialization. Most GPUs contain two DMA engines, enabling
concurrent bidirectional transfers to better utilize PCIe bandwidth. The DMA
engines fetch data in small chunks. Different engines can process transfers in
parallel, but a single engine handles only one copy command at a time.

**Asynchronous compute engines (ACEs)** break down kernels into workgroups for
distribution to shader processor input (SPI) blocks. Multiple ACEs enable
concurrent kernel execution, with each ACE capable of dispatching one kernel
at a time. ACEs process commands from different queues asynchronously, enabling
overlap between different kernel executions and memory operations.

Hierarchical organization
-------------------------

The GPU organizes compute resources in a three-level hierarchy that enables
modular design and resource sharing:

1. **Shader engines (SE)**: Top-level organizational units containing multiple
   shader arrays and shared resources
2. **Shader arrays**: Groups of compute units sharing instruction and scalar
   caches
3. **Compute units (CU)**: Basic execution units containing the ALUs and
   registers for thread execution

.. figure:: ../data/understand/hardware_implementation/selayout.png
   :align: center
   :alt: Diagram showing the hierarchical organization of compute units grouped
         into shader engines on AMD GPUs
   :width: 800

   Hierarchical organization of compute units into shader engines

This hierarchical design allows different GPU configurations using the same
underlying architecture. For example, the R9 Fury X contains 16 shader arrays
with four CUs each (64 CUs), while the RX 480 contains 12 shader arrays with
three CUs each (36 CUs), yet both target the same gfx803 instruction set
architecture.
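
The CU counts in this example follow directly from the hierarchy. The following
is a minimal host-side sketch, assuming every shader array holds the same number
of CUs. The ``GpuLayout`` struct and ``total_compute_units`` function are
illustrative names and not part of any HIP API.

.. code-block:: cpp

   #include <cassert>

   // Hypothetical description of a GPU configuration (not a HIP type).
   struct GpuLayout {
       int shader_arrays;  // total shader arrays across all shader engines
       int cus_per_array;  // compute units in each shader array
   };

   // Total CUs, assuming every shader array is fully and equally populated.
   int total_compute_units(const GpuLayout& layout) {
       assert(layout.shader_arrays > 0 && layout.cus_per_array > 0);
       return layout.shader_arrays * layout.cus_per_array;
   }

With the figures quoted above, ``total_compute_units({16, 4})`` gives 64 and
``total_compute_units({12, 3})`` gives 36.
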

Shader engine components
========================

Shader engines group multiple compute units together with shared resources that
improve efficiency and reduce redundancy. Each shader engine contains several
key components shared across its compute units.

Workgroup manager (SPI)
-----------------------

The workgroup manager, also called the shader processor input (SPI), bridges
the command processor and compute units. After the CP processes a kernel
dispatch, the SPI:

* Receives workgroups from the ACEs
* Schedules workgroups onto available compute units
* Initializes registers with kernel parameters
* Ensures all wavefronts of a workgroup execute on the same CU for
  synchronization
* Monitors resource availability and queues workgroups when resources are
  exhausted

The SPI tracks four critical resources that limit concurrent execution:

* Wavefront slots (execution contexts)
* Vector general-purpose registers (VGPRs)
* Scalar general-purpose registers (SGPRs)
* Local data share (LDS) memory

Workgroup-to-CU mapping is non-deterministic and based on available resources.
You should not assume any specific mapping pattern, as the same kernel launched
multiple times can have different workgroup distributions.

Scalar L1 data cache (sL1D)
---------------------------

The scalar L1 data cache serves scalar memory operations from multiple CUs
within a shader array. The sL1D is shared between CUs (three CUs on Graphics
Core Next (GCN) GPUs and the MI100, two CUs on the MI200 series) and caches
data that is uniform across a wavefront, including:

* Kernel arguments and pointers
* Grid and block dimensions
* Constants accessed uniformly across threads
* Data from ``__constant__`` memory when accessed uniformly

Unlike the vector L1 cache, the sL1D doesn't use a "hit-on-miss" approach,
meaning subsequent requests to the same pending cache line count as duplicated
misses rather than hits.

L1 instruction cache (L1I)
--------------------------

The L1 instruction cache is a read-only cache shared between multiple CUs in a
shader array. Like the sL1D, it is backed by the L2 cache and doesn't use the
"hit-on-miss" approach. The L1I stores kernel instructions fetched by the
compute units, reducing instruction fetch latency and L2 cache pressure.

Compute unit architecture
=========================

The compute unit (CU) is the fundamental execution block of AMD GPUs.
It's responsible for executing kernels through its various specialized components
and pipelines.

.. figure:: ../data/understand/hardware_implementation/gcn_compute_unit.png
   :align: center
   :alt: Detailed diagram of an AMD CDNA compute unit showing internal
         components and data flow
   :width: 800

   Internal architecture of an AMD CDNA compute unit

Sequencer and scheduling
------------------------

The instruction sequencer (SQ) serves as the control center of each compute
unit, managing instruction flow through the execution pipelines. The sequencer
maintains wavefront state and coordinates instruction execution across different
functional units.

**Wavefront organization**: The sequencer organizes active wavefronts into four
pools, each containing slots for up to ten wavefronts (eight on the CDNA2 MI200
series). Each slot includes:

* Wavefront-level registers (program counter, execution mask, and others)
* Instruction buffer for prefetched instructions
* State information for scheduling decisions

This organization theoretically allows up to 40 concurrent wavefronts per CU
(four pools of ten slots; 32 on the MI200 series), though actual occupancy is
typically limited by register and LDS usage.

**Instruction fetching**: The fetch arbiter selects one wavefront per cycle to
fetch instructions from memory, prioritizing the oldest wavefronts. Each CU can
fetch up to 32 bytes (4-8 instructions) per cycle.

**Instruction issuing**: The issue arbiter determines which instructions execute
each cycle, selecting wavefronts from one pool per cycle in round-robin fashion.
The arbiter can issue multiple instructions per cycle to different execution
units, with a theoretical maximum of five instructions per cycle:

* One VALU instruction
* One vector memory operation
* One SALU/scalar memory operation
* One LDS operation
* One branch operation

Instructions always issue at wavefront granularity, with all threads in the
wavefront executing the same instruction in lockstep. Context switching between
wavefronts occurs every cycle with zero overhead, as all wavefront contexts
remain resident on the CU.

Execution pipelines
-------------------

Each compute unit contains multiple specialized execution pipelines that process
different types of instructions in parallel, enabling efficient utilization of
the hardware resources.

Vector arithmetic logic unit (VALU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The VALU executes vector instructions across entire wavefronts, with each thread
potentially operating on different data. The VALU consists of:

* **Four SIMD processors**: Each containing 16 single-precision ALUs (or
  equivalent), for 64 total ALUs per CU
* **Vector register files**: 256-512 KiB of VGPR storage split across the four
  SIMDs
* **Instruction buffers**: Storage for up to 8-10 wavefronts per SIMD

On architectures with 64-thread wavefronts and 16-lane-wide SIMD units,
executing one instruction takes four cycles (one cycle per 16 threads). The
four-SIMD design keeps the CU fully utilized when sufficient wavefronts are
available, because the sequencer can issue a new instruction to a different SIMD
each cycle and returns to each SIMD every four cycles.
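
The cycle count above is simply the wavefront size divided by the SIMD width.
The following sketch captures that arithmetic. The function name is
illustrative, and the model assumes the wavefront size is an exact multiple of
the lane count and ignores pipeline latency.

.. code-block:: cpp

   #include <cassert>

   // Cycles one SIMD needs to pass a wavefront-wide instruction through its
   // lanes. Assumes wavefront_size is a multiple of simd_lanes.
   int cycles_per_instruction(int wavefront_size, int simd_lanes) {
       assert(simd_lanes > 0 && wavefront_size % simd_lanes == 0);
       return wavefront_size / simd_lanes;
   }

For a 64-thread wavefront on a 16-lane SIMD this returns 4. For a 32-thread
wavefront on a 32-lane SIMD it returns 1.
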

The VALU serves as the primary arithmetic engine, executing the majority of
computation in GPU kernels. Data flows into these pipelines, undergoes arithmetic
transformation, and exits as results, with the goal of maximizing the number of
such transformations per clock cycle.

For CDNA architectures with matrix operations, the VALU also dispatches
matrix fused multiply-add (MFMA) instructions to specialized matrix units.

Scalar arithmetic logic unit (SALU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The SALU executes instructions uniformly across all threads in a wavefront,
handling operations like:

* Control flow (branches, loops)
* Address calculations
* Loading kernel arguments and constants
* Managing wavefront-uniform values

The SALU includes:

* A scalar processor for arithmetic and logic operations
* 12.5 KiB of SGPR storage per CU
* A scalar memory (SMEM) unit for memory operations

Scalar operations reduce pressure on vector units and registers by handling
uniform computations efficiently.

Vector memory unit (VMEM)
^^^^^^^^^^^^^^^^^^^^^^^^^

The VMEM unit handles all vector memory operations, including loads, stores,
and atomic operations. Each thread supplies its own address and data, though
the hardware optimizes access through memory coalescing when threads access
nearby addresses. The VMEM unit connects to the vector L1 cache and implements
both address generation and coalescing logic.

Branch unit
^^^^^^^^^^^

The branch unit executes jumps and branches for control flow changes affecting
entire wavefronts. Note that the branch unit handles wavefront-level control
flow, not execution mask updates for thread divergence, which are handled
through predication.

Special function unit (SFU)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The special function units accelerate certain arithmetic operations that are too
complex or costly to implement purely within the standard vector ALUs.

SFUs are responsible for executing transcendental and reciprocal mathematical
functions, such as ``exp``, ``log``, ``sin``, ``cos``, ``rcp`` (reciprocal),
and ``rsqrt`` (reciprocal square root). These are heavily used in
scientific, physics, and machine learning workloads, particularly in activation
functions like GELU, sigmoid, or softmax.

Each compute unit includes a set of specialized pipelines or transcendental
function units (TFUs) that handle these operations with dedicated hardware.
While their throughput is lower than that of the primary SIMD pipelines, they
enable these functions to execute efficiently without consuming general ALU
bandwidth.

From the compiler's perspective, these operations map to specific AMDGPU ISA
instructions, such as:

* ``v_exp_f32``: compute the base-2 exponential
* ``v_log_f32``: compute the base-2 logarithm
* ``v_sin_f32``, ``v_cos_f32``: compute sine or cosine
* ``v_rsq_f32``, ``v_rcp_f32``: compute reciprocal square root or reciprocal

Natural-base ``exp`` and ``log`` are derived from the base-2 instructions by
scaling the argument or the result with a constant.

In CDNA3-based GPUs (like MI300), SFU throughput and latency have been tuned for
deep learning primitives. For instance, exponentiation (``exp``) and logarithm
(``log``) functions are now pipelined to complete in a few cycles per lane,
allowing vectorized activation functions in large-scale matrix workloads to
execute without significant stalls.

For programmers targeting ROCm or HIP, these SFU-accelerated operations are
typically accessed through math intrinsics such as ``__expf``, ``__logf``, or
``__sinf``, which the compiler lowers to the corresponding AMDGPU ISA instructions
at compile time.
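
The AMD ISA guides describe ``v_exp_f32`` and ``v_log_f32`` as base-2
operations, so natural-base functions need a constant scale factor. The
following host-side sketch illustrates that identity only. ``hw_exp2`` and
``hw_log2`` are hypothetical stand-ins implemented with the C++ standard
library, and the code does not claim to reproduce the compiler's actual
lowering.

.. code-block:: cpp

   #include <cassert>
   #include <cmath>

   constexpr float kLog2E = 1.4426950408889634f;  // log2(e)
   constexpr float kLn2 = 0.6931471805599453f;    // ln(2)

   // Stand-ins for base-2 hardware primitives (assumption: plain std calls).
   inline float hw_exp2(float x) { return std::exp2(x); }
   inline float hw_log2(float x) { return std::log2(x); }

   // e^x rewritten as 2^(x * log2(e)).
   inline float exp_via_exp2(float x) { return hw_exp2(x * kLog2E); }

   // ln(x) rewritten as log2(x) * ln(2); defined for positive x only.
   inline float log_via_log2(float x) {
       assert(x > 0.0f);
       return hw_log2(x) * kLn2;
   }

Both helpers agree with ``std::exp`` and ``std::log`` to within
single-precision rounding for moderate inputs.
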

Load/store unit (LSU)
^^^^^^^^^^^^^^^^^^^^^

The load/store units handle the transfer of data between the compute units and
the GPU's memory subsystems. They are responsible for issuing, tracking, and
retiring memory operations (including loads from and stores to global memory,
local shared memory, and caches) for thousands of concurrent threads.

Each compute unit includes a set of LSUs tightly integrated with its vector and
scalar pipelines. These units handle memory instructions generated by active
wavefronts, such as ``buffer_load``, ``buffer_store``, and ``flat_load_dword``,
and route them through the GPU's hierarchical memory system.

The LSU's responsibilities include:

* Managing vector memory accesses for SIMD instructions
* Coordinating local data share (LDS) reads and writes
* Accessing the L0/L1 caches and forwarding requests to the L2 cache and HBM
* Handling synchronization and atomic operations between threads and workgroups

LSUs manage thousands of outstanding memory requests per GPU, dynamically
scheduling them to hide memory latency. While arithmetic pipelines continue
executing other wavefronts, the LSUs maintain queues of pending transactions
and reorder responses as data returns from memory.

On modern accelerators like MI300X, these LSUs achieve terabytes-per-second of
aggregate memory bandwidth, coordinating thousands of active threads performing
memory-intensive operations such as tensor loading, matrix tiling, and gradient
updates.

Matrix fused multiply-add (MFMA)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

CDNA architectures (MI100 and newer) include specialized matrix acceleration
units for high-throughput matrix operations. These units execute independently
from other VALU operations, allowing overlap between matrix and vector
computations. MFMA units support various data types including ``INT8``, ``FP16``,
``BF16``, and ``FP32``, with different throughput characteristics for each.

Matrix cores are GPU execution units that perform large-scale matrix operations
in a single instruction. In AMD architectures, these units are formally known as
MFMA (matrix fused multiply-add) units. They are the core hardware blocks
responsible for accelerating deep learning, HPC, and dense linear-algebra
workloads on modern Instinct GPUs.

Operating on entire tiles of matrices per instruction allows MFMA units to deliver
far greater arithmetic throughput and energy efficiency than scalar or vector ALUs.
Rather than fetching and decoding thousands of per-element multiply-add instructions,
each MFMA instruction processes an entire matrix fragment, drastically reducing
power per operation and increasing overall throughput.

An example MFMA instruction from the AMDGPU ISA is:

.. code-block:: none

   v_mfma_f32_16x16x4f16 v[0:15], v[16:17], v[18:19], v[0:15]

This instruction performs a matrix multiplication and accumulation D=A×B+C,
where the fragments A, B, and C are stored in VGPRs. The suffix ``16x16x4f16``
indicates a tile size of 16×16, with an inner dimension of 4, operating on
half-precision (FP16) inputs and accumulating into 32-bit floating-point outputs.
Each lane supplies four FP16 values for A and four for B (two VGPRs each), while
the C and D fragments occupy 16 VGPRs per lane.
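
To make the D=A×B+C semantics concrete, the following scalar reference computes
the same result for a single tile on the host. It is a sketch only: it uses
``float`` for all operands instead of FP16 inputs, processes one 16×16×4 tile,
and says nothing about how the hardware distributes fragments across lanes.
``tile_fma`` is an illustrative name.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>
   #include <vector>

   // Scalar reference for d = a x b + c with row-major storage.
   // a is m x k, b is k x n, c and d are m x n.
   std::vector<float> tile_fma(const std::vector<float>& a,
                               const std::vector<float>& b,
                               const std::vector<float>& c,
                               int m, int n, int k) {
       assert(a.size() == static_cast<std::size_t>(m) * k);
       assert(b.size() == static_cast<std::size_t>(k) * n);
       assert(c.size() == static_cast<std::size_t>(m) * n);
       std::vector<float> d = c;  // start from the accumulator input
       for (int i = 0; i < m; ++i)
           for (int j = 0; j < n; ++j)
               for (int kk = 0; kk < k; ++kk)
                   d[i * n + j] += a[i * k + kk] * b[kk * n + j];
       return d;
   }

Calling ``tile_fma(a, b, c, 16, 16, 4)`` performs 16 × 16 × 4 = 1024
multiply-adds, which is the work a single tile of this shape represents.
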

Since their introduction in CDNA1, and further expanded in CDNA2 and CDNA3, AMD's
matrix cores have become the primary source of peak floating-point performance in
datacenter GPUs. For example, an MI300X accelerator achieves its multi-petaFLOP
throughput primarily through MFMA units.

The MFMA units use both standard VGPRs and additional accumulation VGPRs
(AGPRs) on supported architectures, providing up to 512 KiB of combined
register storage per CU.

Local data share (LDS)
----------------------

The local data share provides fast on-CU scratchpad memory for communication
between threads in a workgroup.

.. figure:: ../data/understand/hardware_implementation/lds.svg
   :align: center
   :alt: Diagram showing the organization of local data share with banks and
         connections to SIMD units
   :width: 800

   Local data share organization and SIMD connections

**Organization**: The LDS contains 32 banks, each 4 bytes wide, providing
128 bytes/cycle total bandwidth. Banks can be accessed independently each cycle
for reads, writes, or atomic operations. The SIMDs connect to the LDS in pairs,
with each pair sharing a 64-byte bidirectional port.

**Access patterns**: A single wavefront can achieve up to 64 bytes/cycle
throughput (16 lanes per cycle). The actual bandwidth depends on data size and
access patterns:

* 4-byte values: 8 cycles for 64 threads (50% peak bandwidth)
* 16-byte values: 20 cycles for 64 threads (80% peak bandwidth)

**Conflict resolution**: The LDS includes hardware to detect and resolve bank
conflicts when multiple threads access different addresses in the same bank.
Conflicts are resolved by serializing accesses across multiple cycles. Address
conflicts (multiple threads atomically updating the same address) are similarly
serialized. Broadcasting from the same address to multiple threads is handled
efficiently without conflicts.
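
The effect of an access pattern on bank conflicts can be estimated on the host.
The following sketch assumes the common mapping in which consecutive 4-byte
words rotate through the 32 banks (bank = word index modulo 32). It treats the
given addresses as one group accessed together and does not model how the
hardware splits a wavefront into per-cycle groups. ``conflict_degree`` is an
illustrative name.

.. code-block:: cpp

   #include <algorithm>
   #include <cassert>
   #include <cstddef>
   #include <cstdint>
   #include <map>
   #include <set>
   #include <vector>

   constexpr std::uint32_t kBanks = 32;     // bank count from the text
   constexpr std::uint32_t kBankWidth = 4;  // bytes per bank

   // Number of serialized passes for one group of LDS byte addresses:
   // the largest count of distinct 4-byte words mapping to the same bank.
   int conflict_degree(const std::vector<std::uint32_t>& byte_addrs) {
       assert(!byte_addrs.empty());
       std::map<std::uint32_t, std::set<std::uint32_t>> words_per_bank;
       for (std::uint32_t addr : byte_addrs) {
           const std::uint32_t word = addr / kBankWidth;
           // Identical words collapse into one entry (broadcast, no conflict).
           words_per_bank[word % kBanks].insert(word);
       }
       std::size_t worst = 0;
       for (const auto& entry : words_per_bank)
           worst = std::max(worst, entry.second.size());
       return static_cast<int>(worst);
   }

Under this model, 32 threads reading consecutive 4-byte words yield a degree
of 1, a stride of 128 bytes yields 32 (every access hits bank 0), and all
threads reading one address yield 1.
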

Vector L1 cache
---------------

Each CU contains a dedicated vector L1 data cache (vL1D) serving vector memory
operations. Key characteristics include:

* Write-through design (writes go directly to L2)
* Optimization for high-bandwidth streaming access patterns
* Coherent with other CUs through software management
* Typical size of 16 KiB per CU

The vector cache tags are checked for all vector memory operations, with misses
forwarded to the L2 cache. The write-through design simplifies coherence at the
cost of write bandwidth.

Memory hierarchy and system
===========================

The GPU memory system provides the bandwidth and capacity needed for massive
parallel computation while managing data coherence and access efficiency.

Memory organization
-------------------

.. figure:: ../data/understand/hardware_implementation/cdna2_gcd.png
   :alt: Block diagram showing four compute engines with L2 cache, memory
         controllers, and Infinity Fabric interconnect on CDNA2
   :width: 800

   CDNA2 Graphics Compute Die organization showing memory subsystem

AMD GPUs typically use high-bandwidth memory (HBM) for data-intensive workloads,
providing significantly higher bandwidth than traditional GDDR memory at the
cost of slightly higher latency. The memory system includes:

* **Memory channels**: Multiple independent memory controllers (typically 8-16)
* **L2 cache banks**: Distributed cache banks serving as the coherence point
* **Infinity Fabric**: High-speed interconnect for data routing

L2 cache architecture
---------------------

The L2 cache serves as the coherence point for all GPU memory accesses and is
shared by all compute units. The L2 consists of multiple independent channels
(32 on CDNA GPUs at 256-byte interleaving) that operate in parallel.

.. figure:: ../data/understand/hardware_implementation/l2perf_model.png
   :align: center
   :alt: Diagram showing L2 cache to Infinity Fabric transaction flow with
         request categorization and routing
   :width: 800

   L2 cache to Infinity Fabric transaction flow

**Key characteristics**:

* **Channel organization**: Each channel handles a portion of the address space,
  with addresses interleaved across channels for load balancing.
* **Hit-on-miss behavior**: If a request arrives for a pending cache line fill,
  it counts as a hit, improving the effective hit rate.
* **Write coalescing**: Multiple writes to the same cache line are combined.
* **Atomic operation support**: Atomics execute directly in the L2 cache for
  coherence.
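
Channel interleaving can be pictured with a simple modulo mapping. The sketch
below assumes consecutive 256-byte blocks rotate through 32 channels, matching
the figures quoted for CDNA GPUs. Real hardware may hash additional address
bits, so treat this as an illustration of interleaving rather than the actual
mapping. ``l2_channel`` is an illustrative name.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>

   // Simplified channel selection: block index modulo channel count.
   // Assumption: no address hashing, which real hardware may apply.
   unsigned l2_channel(std::uint64_t addr,
                       std::uint64_t interleave_bytes = 256,
                       std::uint64_t channels = 32) {
       assert(interleave_bytes > 0 && channels > 0);
       return static_cast<unsigned>((addr / interleave_bytes) % channels);
   }

Under this model, a contiguous 8 KiB buffer (32 × 256 bytes) touches every
channel exactly once, whereas a stride of 8 KiB between accesses keeps hitting
the same channel.
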

**L2-Fabric interface**: Requests missing in L2 are routed through Infinity
Fabric to the appropriate memory location, which could be:

* Local HBM on the same GPU
* Remote GPU memory (in multi-GPU systems)
* System memory (CPU DRAM)

The interface categorizes requests by type (read/write), size (32B/64B), and
destination for optimal routing.

Memory coherence
----------------

GPU memory coherence differs significantly from CPU designs to optimize for
throughput over latency:

**Write-through L1 caches**: All writes update both L1 and L2, ensuring L2
always has the latest data. This eliminates the need for complex coherence
protocols between L1 caches but requires higher write bandwidth.

**Software-managed coherence**: Coherence between CUs requires explicit
synchronization through:

* Memory fences for ordering
* Cache invalidation instructions
* Atomic operations (executed at L2 level)
* Kernel boundaries (implicit synchronization)

**Write combining**: To handle partial cache line updates from different CUs,
the GPU uses write masks indicating which bytes to update. This prevents false
sharing issues while maintaining correctness.
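
The idea behind byte-level write masks can be sketched in a few lines. The
model below assumes a 64-byte cache line and a 64-bit mask with one bit per
byte. The types and function names are hypothetical, chosen for illustration,
and do not describe the actual hardware interface.

.. code-block:: cpp

   #include <array>
   #include <cassert>
   #include <cstdint>

   constexpr int kLineBytes = 64;  // assumed cache line size for this sketch
   using CacheLine = std::array<std::uint8_t, kLineBytes>;

   // A partial write: bit i of mask set means byte i of data is valid.
   struct MaskedWrite {
       std::uint64_t mask;
       CacheLine data;
   };

   // Build a write that sets `length` bytes starting at `offset` to `value`.
   MaskedWrite make_write(int offset, int length, std::uint8_t value) {
       assert(offset >= 0 && length >= 0 && offset + length <= kLineBytes);
       MaskedWrite w{0, {}};
       for (int i = offset; i < offset + length; ++i) {
           w.mask |= std::uint64_t{1} << i;
           w.data[i] = value;
       }
       return w;
   }

   // Update only the masked bytes; all other bytes keep their old value.
   void apply_write(CacheLine& line, const MaskedWrite& w) {
       for (int i = 0; i < kLineBytes; ++i)
           if ((w.mask >> i) & 1u) line[i] = w.data[i];
   }

Two writers that touch disjoint bytes of the same line can be applied in either
order with the same result, because neither overwrites bytes outside its mask.
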

Memory coalescing
-----------------

Memory coalescing combines memory accesses from multiple threads into fewer
transactions, significantly improving bandwidth utilization. The coalescing
hardware in the VMEM unit analyzes addresses from all threads in a wavefront
and groups them into the minimum number of cache line requests.

**Coalesced access pattern**: When consecutive threads access consecutive memory
addresses, the hardware can combine all 64 thread requests into as few as 4-8
cache line requests (depending on data size and alignment).

**Non-coalesced access pattern**: When threads access widely separated addresses,
each thread can generate a separate memory transaction, reducing effective
bandwidth by 16x or more.
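
The request counts above can be reproduced by counting the distinct cache lines
that a wavefront touches. The following host-side sketch assumes 64-byte cache
lines for the examples and one access per thread. Function names are
illustrative, and the model ignores any additional splitting the hardware may
perform.

.. code-block:: cpp

   #include <cassert>
   #include <cstdint>
   #include <set>
   #include <vector>

   // Count distinct cache lines covered when each thread accesses
   // bytes_per_thread bytes starting at its own address.
   int cache_lines_touched(const std::vector<std::uint64_t>& addrs,
                           std::uint64_t bytes_per_thread,
                           std::uint64_t line_bytes) {
       assert(bytes_per_thread > 0 && line_bytes > 0);
       std::set<std::uint64_t> lines;
       for (std::uint64_t addr : addrs) {
           const std::uint64_t first = addr / line_bytes;
           const std::uint64_t last = (addr + bytes_per_thread - 1) / line_bytes;
           for (std::uint64_t line = first; line <= last; ++line)
               lines.insert(line);
       }
       return static_cast<int>(lines.size());
   }

   // Addresses for `threads` threads spaced stride_bytes apart, from zero.
   std::vector<std::uint64_t> strided_addresses(int threads,
                                                std::uint64_t stride_bytes) {
       std::vector<std::uint64_t> addrs;
       for (int t = 0; t < threads; ++t)
           addrs.push_back(static_cast<std::uint64_t>(t) * stride_bytes);
       return addrs;
   }

With 64 threads and 64-byte lines, consecutive 4-byte accesses touch 4 lines and
consecutive 8-byte accesses touch 8. A 4 KiB stride touches 64 lines, which is
16 times as many requests as the coalesced 4-byte case.
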

To achieve optimal memory performance:

* Ensure consecutive threads access consecutive memory addresses
* Align data structures to cache line boundaries (64B or 128B)
* Use structure-of-arrays rather than array-of-structures layouts
* Consider padding to avoid bank conflicts
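
The layout recommendation comes down to the byte stride between the values that
consecutive threads read. The sketch below compares the two layouts for a
hypothetical particle record. The type names are illustrative, and the numbers
assume a 4-byte ``float`` with no padding in the structure.

.. code-block:: cpp

   #include <cassert>
   #include <cstddef>

   // Array-of-structures: one particle's fields are adjacent in memory.
   struct ParticleAoS { float x, y, z, mass; };

   // Structure-of-arrays: each field is contiguous across all particles.
   struct ParticlesSoA { float* x; float* y; float* z; float* mass; };

   // Byte distance between the x values read by two consecutive threads.
   constexpr std::size_t aos_stride_x() { return sizeof(ParticleAoS); }
   constexpr std::size_t soa_stride_x() { return sizeof(float); }

   // Bytes spanned when `threads` consecutive threads each read one x value.
   std::size_t bytes_spanned(std::size_t threads, std::size_t stride) {
       assert(threads > 0);
       return (threads - 1) * stride + sizeof(float);
   }

For a 64-thread wavefront, the SoA layout spans 256 bytes when reading ``x``,
while the AoS layout spans 1012 bytes for the same 256 bytes of useful data.
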

Architecture variants
=====================

AMD supports multiple GPU architecture families optimized for different use
cases while maintaining HIP compatibility.

Graphics Core Next (GCN)
------------------------

GCN represents the foundational architecture for modern AMD GPUs, establishing
key design principles still used today:

* 64-thread wavefronts
* Four SIMD units per CU with 16 lanes each
* Scalar unit for wavefront-uniform operations
* LDS with 32 banks

Multiple GCN generations (GCN1-5) introduced incremental improvements in
process technology, clock speeds, and instruction set features while maintaining
the core architectural philosophy.

.. _cdna_architecture:

CDNA architecture
-----------------

CDNA (Compute DNA) specializes in high-performance computing and machine
learning workloads. Building on GCN principles, CDNA adds significant compute
enhancements:

.. figure:: ../data/understand/hardware_implementation/cdna3_cu.png
   :alt: Block diagram showing CDNA3 compute unit with matrix core unit, shader
         cores, L1 cache, and local data share
   :width: 800

   CDNA3 compute unit with matrix acceleration

**Matrix Core Unit**: Specialized hardware for matrix multiply-accumulate
operations, providing up to 16 times more throughput than vector units for supported
operations. Matrix cores support multiple precisions (``INT8``, ``FP16``, ``BF16``, ``FP32``)
with varying performance characteristics.

**Accumulation VGPRs (AGPRs)**: Additional register file space (up to 256 KiB)
dedicated to matrix accumulation, doubling the available register storage for
matrix operations. Data movement between VGPRs and AGPRs uses specialized
instructions (``v_accvgpr_*``).

**Enhanced memory bandwidth**: CDNA GPUs typically use HBM2/HBM2e/HBM3 memory,
providing up to 3.2 TB/s bandwidth on high-end models.

**Multi-die designs**: CDNA2 (MI250) and CDNA3 (MI300) use chiplet
architectures with multiple dies connected through high-speed links, scaling
to higher compute and memory capacities.

.. _rdna_architecture:

RDNA architecture
-----------------

RDNA optimizes for graphics and lower-latency compute workloads through
fundamental architectural changes:

.. figure:: ../data/understand/hardware_implementation/rdna3_cu.png
   :alt: Block diagram showing RDNA3 work group processor with dual compute
         units, shared caches, and 32-wide SIMD units
   :width: 800

   RDNA3 work group processor architecture

**Wave32 execution**: Primary execution mode uses 32-thread wavefronts,
reducing divergence penalties and register pressure. Wave64 mode is available
for backward compatibility.

**Dual compute units**: The work group processor (WGP) replaces standalone CUs,
containing two closely coupled compute units sharing resources:

* Each CU has two 32-wide SIMD units (vs. four 16-wide in GCN)
* Wavefronts execute in a single cycle on 32-wide SIMDs
* Reduced instruction latency improves responsiveness

**Three-level cache hierarchy**:

* **L0 cache**: Per-CU cache (equivalent to GCN's L1)
* **L1 cache**: Shared between CUs in a WGP (new intermediate level)
* **L2 cache**: Global cache shared across all WGPs

**128-byte cache lines**: Doubled from 64 bytes in GCN, aligning with Wave32
access patterns (32 threads × 4 bytes = 128 bytes).

These RDNA optimizations target gaming workloads where latency matters more
than pure throughput, though the architecture remains capable for general
compute tasks.

Performance considerations
==========================

Understanding hardware characteristics helps you optimize GPU applications for
maximum performance.

Occupancy and resource limits
-----------------------------

Occupancy measures the ratio of active wavefronts to maximum possible
wavefronts on a CU. Higher occupancy generally improves latency hiding but
is limited by:

* **Register usage**: Each wavefront requires VGPRs and SGPRs from finite pools
* **LDS allocation**: Shared memory used per workgroup
* **Wavefront slots**: Fixed number of execution contexts per CU
* **Workgroup size**: Smaller workgroups can waste resources

Balancing these resources is critical for achieving optimal occupancy. Tools
like ``rocprof`` can help analyze occupancy and identify limiting factors.
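
The register limit can be estimated with figures from this chapter. 256 KiB of
VGPR storage per CU split across four SIMDs gives 64 KiB per SIMD, and one VGPR
for a 64-thread wavefront occupies 64 × 4 = 256 bytes, so each SIMD holds 256
wavefront-wide VGPRs. The sketch below is a simplified model with illustrative
function names. It ignores register allocation granularity, SGPRs, and LDS,
all of which can lower the real limit.

.. code-block:: cpp

   #include <algorithm>
   #include <cassert>

   // Wavefronts per SIMD permitted by VGPR usage alone, capped by the
   // number of wavefront slots.
   int waves_per_simd(int vgprs_per_wave, int vgprs_per_simd, int max_wave_slots) {
       assert(vgprs_per_wave > 0 && vgprs_per_simd > 0 && max_wave_slots > 0);
       return std::min(max_wave_slots, vgprs_per_simd / vgprs_per_wave);
   }

   // Occupancy as a fraction of the available wavefront slots.
   double occupancy(int active_waves, int max_wave_slots) {
       assert(max_wave_slots > 0);
       return static_cast<double>(active_waves) / max_wave_slots;
   }

Under these assumptions, a kernel using 64 VGPRs fits four wavefronts per SIMD
out of ten slots, while a kernel using 24 VGPRs is limited by the ten slots
rather than by registers.
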

Latency hiding through multithreading
-------------------------------------

GPUs hide memory and instruction latency through massive hardware
multithreading rather than complex CPU techniques like out-of-order execution
or speculation. With sufficient wavefronts:

* Memory latency is hidden by executing other wavefronts during waits
* Pipeline latencies are covered by round-robin wavefront scheduling
* No context switch overhead as all contexts remain resident

The hardware can switch between wavefronts every cycle, maintaining high ALU
utilization even with long-latency operations in flight.

Memory bandwidth utilization
----------------------------

Effective memory bandwidth depends on access patterns:

* **Coalesced access**: Can achieve 70-90% of peak bandwidth
* **Random access**: Might achieve only 5-15% of peak bandwidth
* **Bank conflicts**: Can serialize LDS access, reducing throughput

Memory-bound kernels should focus on:

* Maximizing coalescing through proper data layout
* Prefetching and data reuse in LDS
* Balancing computation with memory access
* Using appropriate cache policies

Hardware-specific optimizations
-------------------------------

Different AMD GPU architectures benefit from tailored optimizations:

**For GCN/CDNA**:

* Optimize for 64-thread wavefront granularity
* Leverage matrix cores for applicable algorithms
* Consider AGPR usage for register spilling

**For RDNA**:

* Design for 32-thread wavefront execution
* Utilize improved divergence handling
* Take advantage of additional cache level

**Architecture-agnostic**:

* Minimize divergent control flow
* Ensure memory access coalescing
* Balance resource usage for occupancy
* Overlap computation with memory access

Summary
=======

AMD GPU hardware architecture provides massive parallelism through hierarchical
organization of compute resources, specialized execution units, and a
sophisticated memory system. Understanding these hardware details, from the
command processor through shader engines to individual compute units and the
memory hierarchy, enables you to write more efficient GPU applications.

Key hardware concepts for optimization include:

* Workgroup scheduling and resource management by the SPI
* Instruction scheduling and wavefront execution in compute units
* Memory coalescing and cache behavior
* Architecture-specific features (matrix cores, Wave32/64 modes)
* Resource limits affecting occupancy

For details on mapping parallel algorithms to this hardware, see the
:ref:`programming_model` chapter. For specific optimization techniques, consult
the performance optimization guides in the ROCm documentation.