ca0f3a6b5a
Signed-off-by: Jan Stephan <jan.stephan@amd.com>
98 líneas
4.3 KiB
ReStructuredText
98 líneas
4.3 KiB
ReStructuredText
.. meta::
|
|
:description: This chapter describes how to use multiple devices on one host.
|
|
:keywords: ROCm, HIP, multi-device, multiple, GPUs, devices
|
|
|
|
.. _multi-device:
|
|
|
|
*******************************************************************************
|
|
Multi-device management
|
|
*******************************************************************************
|
|
|
|
Device enumeration
|
|
===============================================================================
|
|
|
|
Device enumeration involves identifying all the available GPUs connected to the
|
|
host system. A single host machine can have multiple GPUs, each with its own
|
|
unique identifier. By listing these devices, you can decide which GPU to use
|
|
for computation. The host queries the system to count and list all connected
|
|
GPUs that support the chosen ``HIP_PLATFORM``, ensuring that the application
|
|
can leverage the full computational power available. Typically, applications
|
|
list devices and their properties for deployment planning, and also make
|
|
dynamic selections during runtime to ensure optimal performance.
|
|
|
|
If the application does not define a specific GPU, device 0 is selected.
|
|
|
|
.. literalinclude:: ../../tools/example_codes/device_enumeration.cpp
|
|
:start-after: // [sphinx-start]
|
|
:end-before: // [sphinx-end]
|
|
:language: cpp
|
|
|
|
.. _multi_device_selection:
|
|
|
|
Device selection
|
|
===============================================================================
|
|
|
|
Once you have enumerated the available GPUs, the next step is to select a
|
|
specific device for computation. This involves setting the active GPU that will
|
|
execute subsequent operations. This step is crucial in multi-GPU systems where
|
|
different GPUs might have different capabilities or workloads. By selecting the
|
|
appropriate device, you ensure that the computational tasks are directed to the
|
|
correct GPU, optimizing performance and resource utilization.
|
|
|
|
.. literalinclude:: ../../tools/example_codes/device_selection.hip
|
|
:start-after: // [sphinx-start]
|
|
:end-before: // [sphinx-end]
|
|
:language: cpp
|
|
|
|
Stream and event behavior
|
|
===============================================================================
|
|
|
|
In a multi-device system, streams and events are essential for efficient
|
|
parallel computation and synchronization. Streams enable asynchronous task
|
|
execution, allowing multiple devices to process data concurrently without
|
|
blocking one another. Events provide a mechanism for synchronizing operations
|
|
across streams and devices, ensuring that tasks on one device are completed
|
|
before dependent tasks on another device begin. This coordination prevents race
|
|
conditions and optimizes data flow in multi-GPU systems. Together, streams and
|
|
events maximize performance by enabling parallel execution, load balancing, and
|
|
effective resource utilization across heterogeneous hardware.
|
|
|
|
.. literalinclude:: ../../tools/example_codes/multi_device_synchronization.hip
|
|
:start-after: // [sphinx-start]
|
|
:end-before: // [sphinx-end]
|
|
:language: cpp
|
|
|
|
Peer-to-peer memory access
|
|
===============================================================================
|
|
|
|
In multi-GPU systems, peer-to-peer memory access enables one GPU to directly
|
|
read or write to the memory of another GPU. This capability reduces data
|
|
transfer times by allowing GPUs to communicate directly without involving the
|
|
host. Enabling peer-to-peer access can significantly improve the performance of
|
|
applications that require frequent data exchange between GPUs, as it eliminates
|
|
the need to transfer data through the host memory.
|
|
|
|
By adding peer-to-peer access to the example referenced in
|
|
:ref:`multi_device_selection`, data can be efficiently copied between devices.
|
|
If peer-to-peer access is not activated, the call to :cpp:func:`hipMemcpy`
|
|
still works but internally uses a staging buffer in host memory, which incurs a
|
|
performance penalty.
|
|
|
|
.. tab-set::
|
|
|
|
.. tab-item:: with peer-to-peer
|
|
|
|
.. literalinclude:: ../../tools/example_codes/p2p_memory_access.hip
|
|
:start-after: // [sphinx-start]
|
|
:end-before: // [sphinx-end]
|
|
:emphasize-lines: 45-51, 65-69
|
|
:language: cpp
|
|
|
|
.. tab-item:: without peer-to-peer
|
|
|
|
.. literalinclude:: ../../tools/example_codes/p2p_memory_access_host_staging.hip
|
|
:start-after: // [sphinx-start]
|
|
:end-before: // [sphinx-end]
|
|
:emphasize-lines: 57-59
|
|
:language: cpp
|