SWDEV-490062 - Update documentation

Change-Id: Ib5297fdda2e05795b3b20436cc1de962e310b08b


[ROCm/hip commit: 3d60bd3a64]
This commit is contained in:
Istvan Kiss
2024-10-21 16:50:09 +02:00
committed by Istvan Kiss
parent 862e713a5c
commit 122cecc59d
108 changed files with 14036 additions and 2987 deletions
+213 -199
@@ -1,125 +1,127 @@
.. meta::
:description: This chapter describes a set of best practices designed to help developers optimize the performance of HIP-capable GPU architectures.
:description: This chapter describes a set of best practices designed to help
developers optimize the performance of HIP-capable GPU architectures.
:keywords: AMD, ROCm, HIP, CUDA, performance, guidelines
*******************************************************************************
Performance Guidelines
Performance guidelines
*******************************************************************************
The AMD HIP Performance Guidelines are a set of best practices designed to help
developers optimize the performance of AMD GPUs. They cover established
parallelization and optimization techniques, coding metaphors, and idioms that
can greatly simplify programming for HIP-capable GPU architectures.
The AMD HIP performance guidelines are a set of best practices designed to help
you optimize application performance on AMD GPUs. The guidelines discuss
established parallelization and optimization techniques to improve
application performance on HIP-capable GPU architectures.
By following four main cornerstones, we can exploit the performance
optimization potential of HIP.
Here are the four main cornerstones to help you exploit HIP's performance
optimization potential:
- parallel execution
- memory usage optimization
- optimization for maximum throughput
- minimizing memory thrashing
- Parallel execution
- Memory bandwidth usage optimization
- Maximum throughput optimization
- Memory thrashing minimization
In the following chapters, we will show you their benefits and how to use them
effectively.
This document discusses the usage and benefits of these cornerstones in detail.
.. _parallel execution:
Parallel execution
==================
================================================================================
For optimal use, the application should reveal and efficiently imply as much
parallelism as possible to keep all system components active.
For optimal use and to keep all system components busy, the application must
expose as much parallelism as possible. Parallelism can be exploited at the
application level, device level, and multiprocessor level.
Application level
-----------------
--------------------------------------------------------------------------------
The application should optimize parallel execution across the host and devices
using asynchronous calls and streams. Workloads should be assigned based on
efficiency: serial to the host, parallel to the devices.
To enable parallel execution of the application across the host and devices, use
asynchronous calls and streams. Assign workloads based on efficiency: serial
workloads to the host and parallel workloads to the devices.
For parallel workloads, when threads need to synchronize to share data, if they
belong to the same block, they should use ``__syncthreads()`` (see:
:ref:`synchronization functions`) within the same kernel invocation. If they
belong to different blocks, they must use global memory with two separate
kernel invocations. The latter should be minimized as it adds overhead.
For parallel workloads, when threads belonging to the same block need to
synchronize to share data, use :cpp:func:`__syncthreads()` (see:
:ref:`synchronization functions`) within the same kernel invocation. For threads
belonging to different blocks, use global memory with two separate
kernel invocations. It is recommended to avoid the latter approach as it adds
overhead.
Device level
------------
--------------------------------------------------------------------------------
Device-level optimization primarily involves maximizing parallel execution
across the multiprocessors of the device. This can be achieved by executing
multiple kernels concurrently on a device. The management of these kernels is
facilitated by streams, which allow for the overlapping of computation and data
transfers, enhancing performance. The aim is to keep all multiprocessors busy
by executing enough kernels concurrently. However, launching too many kernels
can lead to resource contention, so a balance must be found for optimal
performance. This approach helps in achieving maximum utilization of the
resources of the device.
Device level optimization primarily involves maximizing parallel execution
across the multiprocessors on the device. You can achieve device level
optimization by executing multiple kernels concurrently on a device. Streams
manage these kernels and allow computation to overlap with data transfers,
which enhances performance. This approach aims at keeping all multiprocessors
busy by executing enough kernels concurrently. However, launching too many
kernels can lead to resource contention, so a balance must be found for optimal
performance. Device level optimization helps achieve maximum utilization of the
device resources.
Multiprocessor level
--------------------
--------------------------------------------------------------------------------
Multiprocessor-level optimization involves maximizing parallel execution within
each multiprocessor on a device. Each multiprocessor can execute a number of
threads concurrently, and the total number of threads that can run in parallel
is determined by the number of concurrent threads each multiprocessor can
handle.
Multiprocessor level optimization involves maximizing parallel execution within
each multiprocessor on a device. The key to multiprocessor level optimization
is to efficiently utilize the various functional units within a multiprocessor.
For example, ensure a sufficient number of resident warps so that, at every
clock cycle, an instruction from some warp is ready for execution. This
instruction can either be another independent instruction of the same warp,
which exploits
:ref:`instruction level optimization <instruction optimization>`, or more
commonly an instruction of another warp, which exploits thread-level parallelism.
The key to multiprocessor-level optimization is to efficiently utilize the
various functional units within a multiprocessor. This can be achieved by
ensuring a sufficient number of resident warps, as at every instruction issue
time, a warp scheduler selects an instruction that is ready to execute. This
instruction can be another independent instruction of the same warp, exploiting
:ref:`instruction optimization`, or more commonly an instruction of another warp,
exploiting thread-level parallelism.
In comparison, device-level optimization focuses on the device as a whole,
aiming to keep all multiprocessors busy by executing enough kernels
concurrently. Both levels of optimization are crucial for achieving maximum
performance. They work together to ensure efficient utilization of the
resources of the GPU, from the individual multiprocessors to the device as a
whole.
On the other hand, device level optimization focuses on the device as a whole,
aiming at keeping all multiprocessors busy by executing enough kernels
concurrently. Both multiprocessor and device levels of optimization are crucial
for achieving maximum performance. They work together to ensure efficient
utilization of the GPU resources, ranging from individual multiprocessors to the
device as a whole.
.. _memory optimization:
Memory optimization
===================
Memory throughput optimization
================================================================================
The first step in maximizing memory throughput is to minimize low-bandwidth
data transfers. This involves reducing data transfers between the host and the
device, as these have lower bandwidth than transfers between global memory and
the device.
data transfers between the host and the device.
Additionally, data transfers between global memory and the device should be
minimized by maximizing the use of on-chip memory: shared memory and caches.
Shared memory acts as a user-managed cache, where the application explicitly
allocates and accesses it. A common programming pattern is to stage data from
device memory into shared memory. This involves each thread of a block loading
data from device memory to shared memory, synchronizing with all other threads
of the block, processing the data in shared memory, synchronizing again if
necessary, and writing the results back to device global memory.
Additionally, maximize the use of on-chip memory, that is, shared memory and
caches, and minimize transfers with global memory. Shared memory acts as a
user-managed cache explicitly allocated and accessed by the application. A
common programming pattern is to stage data from device memory into shared
memory. The staging of data from the device to shared memory involves the
following steps:
1. Each thread of a block loads data from device memory into shared memory.
2. Each thread synchronizes with all other threads of the block.
3. The threads process the data stored in shared memory.
4. The threads synchronize again if necessary.
5. The threads write the results back to device global memory.
For some applications, a traditional hardware-managed cache is more appropriate
to exploit data locality. On devices of certain compute capabilities, the same
on-chip memory is used for both L1 and shared memory, and the amount dedicated
to each is configurable for each kernel call.
for exploiting data locality.
Finally, the throughput of memory accesses by a kernel can vary significantly
depending on the access pattern for each type of memory. Therefore, the next
step in maximizing memory throughput is to organize memory accesses as
optimally as possible. This is especially important for global memory accesses,
as global memory bandwidth is low compared to available on-chip bandwidths and
arithmetic instruction throughput. Thus, non-optimal global memory accesses
generally have a high impact on performance.
Finally, the throughput of memory accesses by a kernel can vary
significantly depending on the access pattern. Therefore, the next step in
maximizing memory throughput is to organize memory accesses as optimally as
possible. This is especially important for global memory accesses, as global
memory bandwidth is low compared to available on-chip bandwidths and arithmetic
instruction throughput. Thus, non-optimal global memory accesses generally have
a high impact on performance.
The memory throughput optimization techniques are further discussed in detail in
the following sections.
Data Transfer
-------------
.. _data transfer:
Applications should aim to minimize data transfers between the host and the
device. This can be achieved by moving more computations from the host to the
device, even if it means running kernels that do not fully utilize the
parallelism for device. Intermediate data structures can be created, used,
and discarded in device memory without being mapped or copied to host memory.
Data transfer
--------------------------------------------------------------------------------
To minimize data transfers between the host and the device, applications should
move more computations from the host to the device, even at the cost of running
kernels that don't fully utilize the parallelism of the device. Intermediate
data structures should be created, used, and discarded in device memory without
being mapped or copied to host memory.
Batching small transfers into a single large transfer can improve performance
due to the overhead associated with each transfer. On systems with a front-side
@@ -129,173 +131,185 @@ When using mapped page-locked memory, there is no need to allocate device
memory or explicitly copy data between device and host memory. Data transfers
occur implicitly each time the kernel accesses the mapped memory. For optimal
performance, these memory accesses should be coalesced, similar to global
memory accesses.
memory accesses. The process where threads in a warp access sequential memory
locations is known as coalesced memory access, which can enhance memory data
transfer efficiency.
On integrated systems where device and host memory are physically the same,
any copy operation between host and device memory is unnecessary, and mapped
page-locked memory should be used instead. Applications can check if a device
is integrated by querying the integrated device property.
On integrated systems where device and host memory are physically the same, no
copy operation between host and device memory is required and hence mapped
page-locked memory should be used instead. To check if the device is integrated,
applications can query the integrated device property.
.. _device memory access:
Device Memory Access
--------------------
Device memory access
---------------------
Memory access instructions may be repeated due to the spread of memory
Memory access instructions might be repeated due to the spread of memory
addresses across warp threads. The effect on throughput varies with memory
type; throughput generally decreases as addresses become more scattered,
especially in global memory.
Device memory is accessed via 32-, 64-, or 128-byte transactions that must be
naturally aligned. Maximizing memory throughput involves coalescing memory
accesses of threads within a warp into minimal transactions, following optimal
access patterns, using properly sized and aligned data types, and padding data
when necessary.
naturally aligned.
Maximizing memory throughput involves:
Global memory instructions support reading or writing data of specific sizes
(1, 2, 4, 8, or 16 bytes) that are naturally aligned. If the size and alignment
requirements are not met, it leads to multiple instructions, reducing
performance. Therefore, using data types that meet these requirements, ensuring
alignment for structures, and maintaining alignment for all values or arrays is
crucial for correct results and optimal performance.
- Coalescing memory accesses of threads within a warp into minimal transactions.
- Following optimal access patterns.
- Using properly sized and aligned data types.
- Padding data when necessary.
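The effect of coalescing can be estimated on the host before touching device code. The following is a minimal sketch in plain C++ (not HIP device code), assuming a warp of 64 threads, 128-byte aligned transactions, and 4-byte elements. The constants and the helper name ``segments_touched`` are hypothetical and chosen only for illustration; query the real device properties in production code.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed model parameters, not queried from a real device.
constexpr std::size_t kWarpSize = 64;           // threads per warp
constexpr std::size_t kTransactionBytes = 128;  // bytes per aligned transaction

// Counts the aligned segments touched when thread t of one warp reads the
// 4-byte element at byte address base_bytes + t * stride_elems * 4.
// Addresses never decrease with t, so counting segment changes is enough.
constexpr std::size_t segments_touched(std::uint64_t base_bytes,
                                       std::size_t stride_elems) {
    std::size_t count = 0;
    std::uint64_t last = ~std::uint64_t{0};  // sentinel: no segment seen yet
    for (std::size_t t = 0; t < kWarpSize; ++t) {
        const std::uint64_t addr = base_bytes + t * stride_elems * 4;
        const std::uint64_t segment = addr / kTransactionBytes;
        if (segment != last) {
            ++count;
            last = segment;
        }
    }
    return count;
}
```

Under these assumptions, a unit stride from an aligned base touches 2 segments, the same access from a misaligned base touches 3, and a stride of 32 elements touches one segment per thread.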
Global memory instructions support reading or writing data of specific sizes (1,
2, 4, 8, or 16 bytes) that are naturally aligned. Not meeting the size and
alignment requirements leads to multiple instructions, which reduces
performance. Therefore, for correct results and optimal performance:
- Use data types that meet these requirements.
- Ensure alignment for structures.
- Maintain alignment for all values or arrays.
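As an illustration of the size and alignment requirements, the sketch below uses standard C++ ``alignas`` to turn a 12-byte structure into a 16-byte, 16-byte-aligned one. The type names ``Vec3`` and ``Vec4`` and both helper functions are hypothetical, and the layout checks assume a typical ABI with 4-byte ``float``.

```cpp
#include <cstddef>
#include <cstdint>

// 12 bytes of payload with 4-byte alignment: not one of the supported sizes.
struct Vec3 { float x, y, z; };

// Padded to 16 bytes and aligned to 16 bytes, matching a 16-byte access.
struct alignas(16) Vec4 { float x, y, z, w; };

static_assert(sizeof(Vec3) == 12 && alignof(Vec3) == 4, "unexpected Vec3 layout");
static_assert(sizeof(Vec4) == 16 && alignof(Vec4) == 16, "unexpected Vec4 layout");

// True for the access sizes listed above: 1, 2, 4, 8, or 16 bytes.
constexpr bool is_supported_access(std::size_t size) {
    return size == 1 || size == 2 || size == 4 || size == 8 || size == 16;
}

// True when an access of the given size starts on a multiple of that size.
constexpr bool is_naturally_aligned(std::uint64_t addr, std::size_t size) {
    return is_supported_access(size) && addr % size == 0;
}
```

With this layout, an array of ``Vec4`` keeps every element naturally aligned as long as the array base is 16-byte aligned, whereas elements of a ``Vec3`` array start at offsets 0, 12, 24, and so on.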
Threads often access 2D arrays at an address calculated as
``BaseAddress + xIndex + width * yIndex``. For efficient memory access, the
array and thread block widths should be multiples of the warp size. If the
array width is not a multiple of the warp size, it is usually more efficient to
allocate it with a width rounded up to the nearest multiple and pad the rows
accordingly.
allocate the array with a width rounded up to the nearest multiple and pad the
rows accordingly.
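The row padding described above can be sketched in plain C++ as follows. The helper names ``padded_width`` and ``element_index`` are hypothetical, and the warp size is passed in as a parameter because it differs between architectures.

```cpp
#include <cstddef>

// Rounds width up to the nearest multiple of warp_size (warp_size > 0).
constexpr std::size_t padded_width(std::size_t width, std::size_t warp_size) {
    return (width + warp_size - 1) / warp_size * warp_size;
}

// Element offset of (x, y) in a row-major array whose rows are `pitch`
// elements apart, matching BaseAddress + xIndex + width * yIndex.
constexpr std::size_t element_index(std::size_t x, std::size_t y,
                                    std::size_t pitch) {
    return x + pitch * y;
}
```

For example, a 100-element-wide array with an assumed warp size of 64 would be allocated with a row pitch of 128 elements, and the kernel would index it with that pitch instead of the logical width.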
Local memory is used for certain automatic variables, such as arrays with
non-constant indices, large structures or arrays, and any variable when the
non-constant indices, large structures or arrays, and any variable if the
kernel uses more registers than available. Local memory resides in device
memory, leading to high latency and low bandwidth similar to global memory
accesses. However, it is organized for consecutive 32-bit words to be accessed
by consecutive thread IDs, allowing full coalescing when all threads in a warp
access the same relative address.
memory, which leads to high latency and low bandwidth, similar to global memory
accesses. However, the local memory is organized for consecutive 32-bit words to
be accessed by consecutive thread IDs, which allows full coalescing when all
threads in a warp access the same relative address.
Shared memory, located on-chip, provides higher bandwidth and lower latency
than local or global memory. It is divided into banks that can be
simultaneously accessed, boosting bandwidth. However, bank conflicts, where two
addresses fall in the same bank, lead to serialized access and decreased
throughput. Therefore, understanding how memory addresses map to banks and
scheduling requests to minimize conflicts is crucial for optimal performance.
Shared memory is located on-chip and provides higher bandwidth and lower latency
than local or global memory. It is divided into banks that can be simultaneously
accessed, which boosts bandwidth. However, bank conflicts, where two addresses
fall in the same bank, lead to serialized access and decreased throughput.
Therefore, understanding how memory addresses map to banks and scheduling
requests to minimize conflicts is crucial for optimal performance.
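The mapping from addresses to banks can be modeled on the host. The sketch below assumes 32 banks, each one 32-bit word wide, and 32 threads accessing shared memory together. These numbers are assumptions for illustration and depend on the actual architecture, and ``bank_of`` and ``conflict_degree`` are hypothetical helper names.

```cpp
#include <cstddef>

// Assumed model parameters, not taken from a specific device.
constexpr std::size_t kBanks = 32;    // number of shared memory banks
constexpr std::size_t kThreads = 32;  // threads issuing the access together

// Bank that holds a given 32-bit word index under the assumed layout.
constexpr std::size_t bank_of(std::size_t word_index) {
    return word_index % kBanks;
}

// Largest number of threads that hit the same bank when thread t accesses
// word t * stride_words. stride_words must be at least 1, so that different
// threads access different words. A result of 1 means no bank conflict.
constexpr std::size_t conflict_degree(std::size_t stride_words) {
    std::size_t hits[kBanks] = {};
    std::size_t worst = 0;
    for (std::size_t t = 0; t < kThreads; ++t) {
        const std::size_t bank = bank_of(t * stride_words);
        ++hits[bank];
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

In this model, a stride of 32 words sends every thread to the same bank, while a stride of 33 words spreads the threads over all banks. This is why padding a shared memory tile by one extra column is a common way to avoid conflicts.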
Constant memory is in device memory and cached in the constant cache. Requests
are split based on different memory addresses, affecting throughput, and are
serviced at the throughput of the constant cache for cache hits, or the
throughput of the device memory otherwise.
Constant memory is in the device memory and cached in the constant cache.
Requests are split based on different memory addresses and are serviced based
either on the throughput of the constant cache for cache hits or on the
throughput of the device memory otherwise. This splitting of requests affects
throughput.
Texture and surface memory are stored in device memory and cached in texture
cache. This setup optimizes 2D spatial locality, leading to better performance
for threads reading close 2D addresses. Reading device memory through texture
or surface fetching can be advantageous, offering higher bandwidth for local
texture fetches or surface reads, offloading addressing calculations,
allowing data broadcasting, and optional conversion of 8-bit and 16-bit integer
input data to 32-bit floating-point values on-the-fly.
Texture and surface memory are stored in the device memory and cached in the
texture cache. This setup optimizes 2D spatial locality, which leads to better
performance for threads reading close 2D addresses.
Reading device memory through texture or surface fetching provides the following
advantages:
- Higher bandwidth for local texture fetches or surface reads.
- Offloading addressing calculation.
- Data broadcasting.
- Optional conversion of 8-bit and 16-bit integer input data to 32-bit
floating-point values on the fly.
.. _instruction optimization:
Optimization for maximum instruction throughput
===============================================
================================================================================
To maximize instruction throughput:
- minimize low throughput arithmetic instructions
- minimize divergent warps inflicted by control flow instructions
- minimize the number of instruction as possible
- maximize instruction parallelism
- Minimize low throughput arithmetic instructions.
- Minimize divergent warps inflicted by flow control instructions.
- Maximize instruction parallelism.
These techniques are discussed in detail in the following sections.
Arithmetic instructions
-----------------------
--------------------------------------------------------------------------------
The type and complexity of arithmetic operations can significantly impact the
performance of your application. The following hints can help you maximize
arithmetic instruction throughput.
Using efficient operations: Some arithmetic operations are more costly than
others. For example, multiplication is typically faster than division, and
integer operations are usually faster than floating-point operations,
especially with double-precision.
Use efficient operations: Some arithmetic operations are costlier than others.
For example, multiplication is typically faster than division, and integer
operations are usually faster than floating-point operations, especially with
double precision.
Minimizing low-throughput instructions: This might involve trading precision
for speed when it does not affect the final result. For instance, consider
using single-precision arithmetic instead of double-precision.
Minimize low-throughput instructions: This might involve trading precision for
speed when it does not affect the final result. For instance, consider using
single-precision arithmetic instead of double-precision.
Leverage intrinsic functions: Intrinsic functions are pre-defined functions
Leverage intrinsic functions: Intrinsic functions are predefined functions
available in HIP that can often be executed faster than equivalent arithmetic
operations (subject to some input or accuracy restrictions). They can help
optimize performance by replacing more complex arithmetic operations.
Avoiding divergent warps: Divergent warps occur when threads within the same
warp follow different execution paths. This can happen due to conditional
statements that lead to different arithmetic operations being performed by
different threads. Divergent warps can significantly reduce instruction
throughput, so try to structure your code to minimize divergence.
Optimize memory access: The memory access efficiency can impact the speed of
arithmetic operations. See: :ref:`device memory access`.
Optimizing memory access: The efficiency of memory access can impact the speed
of arithmetic operations. Coalesced memory access, where threads in a warp
access consecutive memory locations, can improve memory throughput and thus
the speed of arithmetic operations.
Maximizing instruction parallelism: Some GPU architectures can issue
independent instructions simultaneously, for example an integer and a
floating-point instruction, or two operations with independent inputs and
outputs. This is mostly the compiler's job, but expressing parallelism
explicitly in the code can improve instruction throughput.
.. _control flow instructions:
Control flow instructions
-------------------------
--------------------------------------------------------------------------------
Flow control instructions (``if``, ``else``, ``for``, ``do``, ``while``,
Control flow instructions (``if``, ``else``, ``for``, ``do``, ``while``,
``break``, ``continue``, ``switch``) can impact instruction throughput by
causing threads within a warp to diverge and follow different execution paths.
To optimize performance, control conditions should be written to minimize
divergent warps. For example, when the control condition depends on
(``threadIdx`` / ``warpSize``), no warp diverges. The compiler may optimize
loops or short if or switch blocks using branch predication, preventing warp
divergence. With branch predication, instructions associated with a false
predicate are scheduled but not executed, avoiding unnecessary operations.
To optimize performance, write control conditions to minimize divergent warps.
For example, when the control condition depends only on
``threadIdx.x / warpSize``, no warp diverges, because all threads of a warp
evaluate the condition to the same value. The compiler might optimize loops or
short ``if`` or ``switch`` blocks using branch predication, which prevents warp
divergence. With branch predication, instructions associated with a false
predicate are scheduled but not executed, which avoids unnecessary operations.
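To see why a condition based on the warp index does not diverge, the following host-side sketch in plain C++ counts how many warps of a block contain threads that disagree on a branch condition. It assumes a warp size of 64 and a one-dimensional block. The two example conditions and the helper ``divergent_warps`` are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kWarpSize = 64;  // assumed warp size

// Condition that alternates between neighboring threads.
constexpr bool per_thread_condition(std::size_t tid) {
    return tid % 2 == 0;
}

// Condition that depends only on the warp index, tid / kWarpSize.
constexpr bool per_warp_condition(std::size_t tid) {
    return (tid / kWarpSize) % 2 == 0;
}

// Counts the warps in a block of block_size threads in which at least one
// thread evaluates the condition differently from the warp's first thread.
constexpr std::size_t divergent_warps(std::size_t block_size,
                                      bool (*condition)(std::size_t)) {
    std::size_t divergent = 0;
    for (std::size_t first = 0; first < block_size; first += kWarpSize) {
        const bool expected = condition(first);
        for (std::size_t tid = first;
             tid < first + kWarpSize && tid < block_size; ++tid) {
            if (condition(tid) != expected) {
                ++divergent;
                break;
            }
        }
    }
    return divergent;
}
```

For a block of 256 threads, the per-thread condition makes all four warps diverge, while the per-warp condition makes none diverge.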
Avoiding divergent warps
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Warps diverge when threads within the same warp follow different execution paths.
This is caused by conditional statements that lead to different arithmetic
operations being performed by different threads. Divergent warps can
significantly reduce instruction throughput, so it is advisable to structure
your code to minimize divergence.
Synchronization
---------------
--------------------------------------------------------------------------------
Synchronization ensures that all threads within a block have completed their
Synchronization ensures that all threads within a block complete their
computations and memory accesses before moving forward, which is critical when
threads are dependent on the results of other threads. However,
synchronization can also lead to performance overhead, as it requires threads
to wait, potentially leading to idle GPU resources.
threads depend on the results of other threads. However, synchronization can
also cause performance overhead, as it requires threads to wait, which might
leave GPU resources idle.
``__syncthreads()`` is used to synchronize all threads in a block, ensuring
that all threads have reached the same point in the code and that shared memory
is visible to all threads after the point of synchronization.
To synchronize all threads in a block, use :cpp:func:`__syncthreads()`.
:cpp:func:`__syncthreads()` ensures that all threads have reached the same
point in the code and that shared memory is visible to all threads after that
point.
An alternative way to synchronize is using streams. Different streams can
execute commands out of order with respect to one another or concurrently. This
allows for more fine-grained control over the execution order of commands,
which can be beneficial in certain scenarios.
An alternative way to synchronize is to use streams. Commands in different
streams can execute out of order with respect to one another or concurrently.
Streams therefore allow more fine-grained control over the execution order of
commands, which can be beneficial in certain scenarios.
Minimizing memory thrashing
===========================
================================================================================
Applications frequently allocating and freeing memory may experience slower
allocation calls over time. This is expected as memory is released back to the
operating system. To optimize performance in such scenarios, consider some
recommendations:
Applications frequently allocating and freeing memory might experience slower
allocation calls over time as memory is released back to the operating system.
To optimize performance in such scenarios, follow these guidelines:
- avoid allocating all available memory with ``hipMalloc`` / ``hipHostMalloc``,
as this immediately reserves memory and can block other applications from
using it. This could strain the operating system schedulers or even prevent
other applications from running on the same GPU.
- aim to allocate memory in suitably sized blocks early in the lifecycle of the
application and deallocate only when the application no longer needs it.
Minimize the number of ``hipMalloc`` and ``hipFree`` calls in your
application, particularly in areas critical to performance.
- if an application is unable to allocate sufficient device memory, consider
resorting to other memory types such as ``hipHostMalloc`` or
``hipMallocManaged``. While these may not offer the same performance, they
can allow the application to continue running.
- For supported platforms, ``hipMallocManaged`` allows for oversubscription.
With the right memory advise policies, it can maintain most, if not all, of
the performance of ``hipMalloc``. ``hipMallocManaged`` does not require an
allocation to be resident until it is needed or prefetched, easing the load
on the operating system schedulers and facilitating multi-tenant scenarios.
- Avoid allocating all available memory with :cpp:func:`hipMalloc` or
:cpp:func:`hipHostMalloc`, as this immediately reserves memory and might
prevent other applications from using it. This behavior could strain the
operating system schedulers or prevent other applications from running on the
same GPU.
- Try to allocate memory in suitably sized blocks early in the application's
lifecycle and deallocate only when the application no longer needs it.
Minimize the number of :cpp:func:`hipMalloc` and :cpp:func:`hipFree` calls in
your application, particularly in performance-critical areas.
- Consider resorting to other memory types such as :cpp:func:`hipHostMalloc` or
:cpp:func:`hipMallocManaged`, if an application can't allocate sufficient
device memory. While the other memory types might not offer similar
performance, they allow the application to continue running.
- For supported platforms, use :cpp:func:`hipMallocManaged`, as it allows
oversubscription. With the right policies, :cpp:func:`hipMallocManaged` can
maintain most, if not all, :cpp:func:`hipMalloc` performance.
:cpp:func:`hipMallocManaged` doesn't require an allocation to be resident
until it is needed or prefetched, which eases the load on the operating
system's schedulers and facilitates multitenant scenarios.