Add documentation for NPS4 and CPX partition modes (#1555)
[ROCm/rccl commit: 28ab8603d2]
@@ -30,6 +30,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
environment variable `RCCL_DISABLE_RAIL_TREES=1`.
* Additional debug information about how the trees are built can be logged to the GRAPH logging subsys by setting
`RCCL_OUTPUT_TREES=1`.
* Added documentation about the performance benefits of the NPS4 and CPX partition modes on the MI300X.

## RCCL 2.21.5 for ROCm 6.3.1

Binary file not shown. Size after: 114 KiB.
Binary file not shown. Size after: 107 KiB.
@@ -82,15 +82,24 @@ set the HSA environment variable as follows:
This feature requires GPUs that support peer-to-peer access along with
proper large BAR addressing support.

Improving performance on the MI300X accelerator when using fewer than 8 GPUs
============================================================================
Improving performance on the MI300X
===================================

On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
in a fully-connected topology. For collective operations, this can achieve good performance when
all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, however, this can only achieve a fraction
of the potential bandwidth on the system.
However, if your workload warrants using fewer than 8 MI300X accelerators on a system,
you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
This section outlines ways to improve RCCL performance on MI300X systems,
including guidelines for systems with fewer than eight GPUs and the most efficient
GPU partition modes.

Configuration with fewer than eight GPUs
----------------------------------------

On a system with eight MI300X accelerators, each pair of accelerators is
connected with dedicated Infinity Fabric™ links in a fully connected topology.
For collective operations, this topology performs well when all eight
accelerators (and all Infinity Fabric links) are used. When fewer than eight
GPUs are used, only a fraction of the potential bandwidth on the system can be
achieved. If your workload warrants using fewer than eight MI300X accelerators
on a system, you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to
increase the number of channels. For example:

.. code-block:: shell

@@ -98,6 +107,150 @@ you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number
Increasing the number of channels can benefit performance, but it also increases
GPU utilization for collective operations.
Additionally, RCCL pre-defines a higher number of channels when only 2 or
4 accelerators are in use on a 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
and 24 channels for four MI300X accelerators.
Additionally, RCCL pre-defines a higher number of channels when only two or four
accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32
channels with two MI300X accelerators and 24 channels with four MI300X
accelerators.

.. _nps4_cpx_mi300_rccl:

NPS4 and CPX partition modes
----------------------------

The term compute partitioning modes, or Modular Chiplet Platform (MCP), refers to the
logical partitioning of XCDs into devices in the ROCm stack. The names are
derived from the number of logical partitions that are created out of the eight
XCDs. In the default mode, SPX (Single Partition X-celerator), all eight XCDs are
viewed as a single logical compute element, meaning that the :doc:`amd-smi <amdsmi:index>`
utility shows a single MI300X device. In CPX (Core Partitioned X-celerator)
mode, each XCD appears as a separate logical GPU, so :doc:`amd-smi <amdsmi:index>`
shows eight separate GPUs per MI300X. CPX mode can be viewed as
having explicit scheduling privileges for each individual compute element (XCD).

While compute partitioning modes change the space on which you can assign work
to compute units, the memory partitioning modes (known as Non-Uniform Memory
Access (NUMA) Per Socket (NPS)) change the number of NUMA domains that a device
exposes. In other words, they change the number of HBM stacks that are
accessible to a compute unit, and therefore the size of its memory space. However,
for the MI300X, the number of memory partitions must be less than or equal to
the number of compute partitions. NPS4 (viewing pairs of HBM stacks as
separate elements), for example, is only available in CPX mode (viewing each
XCD as a separate element).

- Compute partition modes

  - In SPX mode, workgroups launched to the device are distributed
    round-robin to the XCDs in the device, meaning that the programmer has no
    explicit control over which XCD a workgroup is assigned to.

  - In CPX mode, workgroups are launched to a single XCD, meaning the
    programmer has explicit control over work placement onto the XCDs.

- Memory partition modes

  - In NPS1 mode (compatible with CPX and SPX), the entire memory is accessible
    to all XCDs.

  - In NPS4 mode (compatible with CPX), each quadrant of the memory is
    directly visible to the logical devices in its quadrant. An XCD can still
    access all portions of memory through multi-GPU programming techniques.

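The logical device counts implied by the compute partition modes can be worked out with simple arithmetic. The following is a minimal sketch, assuming a node with eight MI300X accelerators and eight XCDs per accelerator; the variable names are illustrative and are not part of any ROCm tool.

```shell
#!/usr/bin/env bash
# Illustrative arithmetic only; all variable names are hypothetical.
accelerators=8        # assumed: MI300X accelerators (OAMs) in the node
xcds_per_accel=8      # XCDs per MI300X, as described above

spx_devices=$(( accelerators * 1 ))               # SPX: one logical GPU per accelerator
cpx_devices=$(( accelerators * xcds_per_accel ))  # CPX: one logical GPU per XCD

echo "SPX logical GPUs: ${spx_devices}"
echo "CPX logical GPUs: ${cpx_devices}"
```

Under these assumptions, SPX mode exposes 8 logical GPUs on the node and CPX mode exposes 64.
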
On the MI300X, CPX and NPS4 modes can be set using the following
:doc:`amdsmi:index` commands.

.. code-block:: shell

   amd-smi set --gpu all --compute-partition CPX
   amd-smi set --gpu all --memory-partition NPS4

RCCL performance with CPX and NPS4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To run RCCL allreduce on 64 GPUs with CPX+NPS4 mode on the MI300X, use this
example:

.. code-block:: shell

   # -b/-e: minimum/maximum message size, -f: size multiplication factor,
   # -g: GPUs per thread
   mpirun -np 64 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

To run RCCL allreduce on 8 GPUs in the same OAM with CPX+NPS4 mode on the
MI300X, use this example:

.. code-block:: shell

   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

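The device list passed to ``ROCR_VISIBLE_DEVICES`` can also be generated rather than typed by hand. The following is a minimal sketch, not part of RCCL or ROCm: ``oam_devices`` is a hypothetical helper, and it assumes that in CPX mode each OAM exposes eight logical GPUs that are enumerated contiguously (so that indices 0 to 7 belong to the first OAM, as in the example above). Verify the enumeration on your system before relying on it for other OAMs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the logical GPU indices of one OAM in CPX mode.
# Assumes 8 logical GPUs per OAM, numbered contiguously starting at OAM 0.
oam_devices() {
  local oam=$1 per_oam=8 first list i
  first=$(( oam * per_oam ))
  list=$first
  for (( i = first + 1; i < first + per_oam; i++ )); do
    list+=",$i"
  done
  echo "$list"
}

# Select the first OAM.
export ROCR_VISIBLE_DEVICES="$(oam_devices 0)"
echo "$ROCR_VISIBLE_DEVICES"
```
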
RCCL delivers improved allreduce performance in CPX mode for TP=8 (8 GPUs in
the same OAM) on the MI300X when the following environment variables are set:

.. code-block:: shell

   export HIP_FORCE_DEV_KERNARG=1
   export RCCL_MSCCLPP_THRESHOLD=1073741824

   export MSCCLPP_READ_ALLRED=1
   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7

   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 32 -e 1G -f 2 -g 1 -G 2 -w 20 -n 50

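The value ``1073741824`` used for ``RCCL_MSCCLPP_THRESHOLD`` equals 1024 × 1024 × 1024. As a small sketch, the value can be computed with shell arithmetic instead of being hard-coded, which makes the intended size easier to read. The variable name ``one_gib`` is illustrative, and reading the value as a byte count (1 GiB) is an assumption.

```shell
#!/usr/bin/env bash
# Compute the threshold with shell arithmetic instead of hard-coding it.
one_gib=$(( 1024 * 1024 * 1024 ))
echo "${one_gib}"                            # prints 1073741824
export RCCL_MSCCLPP_THRESHOLD="${one_gib}"
```
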
The following figures show the benchmark results for in-place (where the output
buffer is used as the input buffer) and out-of-place allreduce bus bandwidth.

.. figure:: ../data/how-to/rccl-usage-tips/in-place_allreduce.png
   :alt: In-place allreduce benchmark results
   :align: center

.. figure:: ../data/how-to/rccl-usage-tips/out-of-place_allreduce.png
   :alt: Out-of-place allreduce benchmark results
   :align: center

A significant performance improvement is achievable with optimized CPX mode,
which peaks at ~340 GB/s with a single OAM. The difference in bus bandwidth
between the unoptimized and optimized modes increases as the buffer size grows.

Using RCCL and CPX in PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The PyTorch all_reduce benchmark is used to reproduce the performance reported
by RCCL-Tests with the RCCL and CPX optimizations.

.. note::

   To use RCCL with CPX mode in PyTorch, check the RCCL version used by PyTorch.

   For a virtualenv with a .whl-based PyTorch setup (such as nightly/rocm6.2),
   the library is located at
   ``<path-to-your-venv>/lib/<python-version>/site-packages/torch/lib/librccl.so``.
   This is the version of RCCL that is packaged as part of ROCm version 6.2.

   RCCL for CPX mode was enabled in ROCm 6.3.0. To use the CPX features, replace
   the existing ``librccl.so`` with one from ROCm 6.3.0 or newer, or from a local
   build of the RCCL develop branch.

To test the effects of RCCL on PyTorch, the `stas00 all reduce benchmark <https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py>`_
is used. The following command runs a single OAM allreduce
benchmark:

.. code-block:: shell

   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py

For better performance, the ``HIP_FORCE_DEV_KERNARG``, ``RCCL_MSCCLPP_THRESHOLD``,
``MSCCLPP_READ_ALLRED``, and ``TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK``
environment variables are set during the benchmark in the following manner:

.. code-block:: shell

   export TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1
   export HIP_FORCE_DEV_KERNARG=1
   export RCCL_MSCCLPP_THRESHOLD=$((2*1024*1024*1024))
   export MSCCLPP_READ_ALLRED=1
   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py

With the default settings, the PyTorch allreduce benchmark peaks at a bus
bandwidth of ~170 GB/s on a single OAM with ROCm 6.2.4, while the optimized run
for CPX on a single OAM peaks at ~315 GB/s.
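
To put the two peak figures in proportion, the ratio can be computed directly. This is only a rough sketch based on the approximate values quoted above; the variable names are illustrative, and the result is not a measured speedup for any particular workload.

```shell
#!/usr/bin/env bash
# Approximate peak bus bandwidth figures quoted above, in GB/s.
default_bw=170
optimized_bw=315

# awk handles the floating-point division that shell arithmetic lacks.
speedup=$(awk -v a="$optimized_bw" -v b="$default_bw" 'BEGIN { printf "%.2f", a / b }')
echo "Approximate speedup: ${speedup}x"
```

With these figures, the optimized CPX run reaches roughly 1.85 times the default peak bus bandwidth.
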