ファイル
rocm-systems/docs/understand/programming_model.rst
T
Julia Jiang fee13d6a3f SWDEV-451170 - Merge PR#3432 to amd-staging
Change-Id: I210bced4e626fc2ac464b80172132481b882cf63
2024-03-19 12:00:21 -04:00

159 行
7.6 KiB
ReStructuredText

.. meta::
:description: This chapter explains the HIP programming model, the contract
between the programmer and the compiler/runtime executing the
code, how it maps to the hardware.
:keywords: AMD, ROCm, HIP, CUDA, API design
*******************************************************************************
Programming Model
*******************************************************************************
The HIP programming model makes it as easy as reasonably possible to map
data-parallel C/C++ algorithms and map them to massively parallel, wide SIMD
architectures, such as GPUs. As a consequence, one needs a basic understanding
of the underlying device architecture to make efficient use of HIP and GPGPU
(General Purpose Graphics Processing Unit) programming in general.
RDNA & CDNA Architecture Summary
===============================================================================
Most GPU architectures, much like RDNA and CDNA have a hierarchical structure.
The inner-most piece is some Single Instruction Multiple Data (SIMD) enabled
vector Arithmetic Logical Unit (ALU). Most recent GPUs beside the vector ALU
also house some matrix ALU for accelerating algorithms of well defined shapes.
Some number of vector and matrix ALUs comprise a larger block, often referred
to as a Compute Unit (OpenCL, AMD block diagrams) but is referred to as Multi
Processor in HIP terms.
.. _rdna3_cu:
.. figure:: ../data/understand/programming_model/rdna3_cu.png
:alt: Block diagram showing components mostly duplicated on the upper and
lower halves of the image with some spanning over both parts (hinting
at them being shared). Both the top and the bottom have two sets of
identical hardware blocks. a "Scheduler" having "Vector GPR"s with an
associated Vector ALU with the following noted capabilities:
Float/INT/Matrix SIMD32, Float/Matrix SIMD32, Transcendental SIMD8, AI
MATRIX Accelerator, DPFP (1). Both top and bottom have a single "Ray
Accelerator", "Texture Filters" and "LD/ST/Tex Addr" and L0 blocks.
Shared among top and bottom are "Scalar Cache", "Shader Instruction
Cache" and Shared Memory.
Block Diagram of an RDNA3 Compute Unit.
.. _cdna3_cu:
.. figure:: ../data/understand/programming_model/cdna3_cu.png
:alt: Block diagram showing components: "Scheduler", "Local Data Share",
"Matrix Core Unit", "Shader Core" and "L1 Cache" along with some
unnamed blocks.
Block Diagram of an CDNA3 Compute Unit.
For hardware implementation's sake, some number of Multi Processors are grouped
together into a Shader Engine or Compute Engine, typically sharing some fixed
function units or memory subsystem resources.
.. _cdna2_gcd:
.. figure:: ../data/understand/programming_model/cdna2_gcd.png
:alt: Block diagram showing four "Compute Engine"s each with 28 "CU"s
(Compute Unit) inside. These four Compute Engines share one block of
"L2 Cache and Controllers". Around them are four "Memory Controller"s,
each having a "Memory Phy" block next to them. To the top and bottom of
all these are eight blocks of "Infinity Fabric Link" with one of the
eight reading "Infinity Fabric or PCIe". Two sole "VCN" blocks sit in
top corners. At the very bottom spans a colored seaction reading
"Infinity Fabric" along with its logo.
Block Diagram of a CDNA2 Graphics Compute Die.
Single Instruction Multiple Threads
===============================================================================
The SIMT programming model behind the HIP device-side execution is a
middle-ground between SMT (Simultaneous Multi-Threading) programming known from
multi-core CPUs, and SIMD (Single Instruction, Multiple Data) programming
mostly known from exploiting relevant instruction sets on CPUs (eg.
SSE/AVX/Neon).
A HIP device compiler maps our SIMT code written in HIP C++ to an inherently
SIMD architecture (like GPUs) not by exploiting data parallelism within a
single instance of a kernel and spreading identical instructions over the SIMD
engines at hand, but by scalarizing the entire kernel and issuing the scalar
instructions of multiple kernel instances to each of the SIMD engine lanes.
Consider the following kernel
.. code:: cu
__global__ void k(float4* a, const float4* b)
{
int tid = threadIdx.x;
int bid = blockIdx.x;
int dim = blockDim.x;
a[tid] += (tid + bid - dim) * b[tid];
}
The incoming four-vector of floating-point values ``a`` is multiplied by a
scalar and then multiplied element-wise by another four-vector. On modern
SIMD-capable architectures the four-vector ops are expected to compile to a
single SIMD instruction. GPU execution of this kernel however will typically
look the following:
.. _simt:
.. figure:: ../data/understand/programming_model/simt.svg
:alt: Two large arrows pointing downward with blocks inside and ellipses
between the arrows. Inside the arrows the same series of blocks with
the following texts inside from top to bottom: "ADD", "DIV", "FMA",
"FMA", "FMA" and "FMA".
Instruction flow of the sample SIMT program.
In HIP, lanes of a SIMD architecture are fed by mapping threads of a SIMT
execution, one thread down each lane of a SIMD engine. Execution parallelism
isn't exploited from the width of the built-in vector types, but via the thread
id constants ``threadIdx.x``, ``blockIdx.x``, etc. For more details, refer to
:ref:`inherent_thread_model`.
Heterogenous Programming
===============================================================================
The HIP programming model assumes two execution contexts. One is referred to as
*host* while compute kernels execute on a *device*. These contexts have
different capabilities, therefor slightly different rules apply. The *host*
execution is defined by the C++ abstract machine, while *device* execution
follows the HIP model, primarily defined by SIMT. These execution contexts in
code are signified by the ``__host__`` and ``__device__`` decorators. There are
a few key differences between the two:
* The C++ abstract machine assumes a unified memory address space, meaning that
one can always access any given address in memory (assuming the absence of
data races). HIP however introduces several memory namespaces, an address
from one means nothing in another. Moreover not all address spaces are
accessible from all contexts.
If one were to look at {ref}`cdna2_gcd` and inside the {ref}`cdna3_cu`,
every Compute Unit has an instance of storage backing the namespace
``__shared__``. Even if the host were to have access to these regions of
memory, the performance benefits of the segmented memory subsystem are
supported by the inability of meaningful asynchronous accesss from the host.
* Not all C++ language features map cleanly to typical device architectures,
some are very expensive (meaning: slow) to implement on GPU devices, therefor
they are forbidden in device contexts to avoid users tapping into features
unexpectedly decimating their program's performance. Offload devices targeted
by HIP aren't general purpose devices, at least not in the sense a CPU is.
HIP focuses on data parallel computations and as such caters to throughput
optimized architectures, such as GPUs or accelerators derived from GPU
architectures.
* Asynchrony is at the forefront of the HIP API. Computations launched by HIP
execute asynchronously on the device and it is the user's responsibility to
synchronize their data dispatch/fetch with computations on the device. HIP
does perform implicit synchronization on occasions, but unlike some APIs
(OpenCL, SYCL) by and large places the onus of synchronization on the user.