Add documentation for NPS4 and CPX partition modes (#1555)

[ROCm/rccl commit: 28ab8603d2]
This commit is contained in:
Istvan Kiss
2025-03-31 17:25:25 +02:00
Committed by GitHub
Parent 1a2eca1756
Commit 858fa4e65d
4 changed files with 165 additions and 11 deletions
+1
@@ -30,6 +30,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
environment variable `RCCL_DISABLE_RAIL_TREES=1`.
* Additional debug information about how the trees are built can be logged to the GRAPH logging subsys by setting
`RCCL_OUTPUT_TREES=1`.
* Added documentation about the performance benefits of the NPS4 and CPX partition modes on the MI300X.
## RCCL 2.21.5 for ROCm 6.3.1
Binary file not shown. Size: 114 KiB

Binary file not shown. Size: 107 KiB

+164 -11
@@ -82,15 +82,24 @@ set the HSA environment variable as follows:
This feature requires GPUs that support peer-to-peer access along with
proper large BAR addressing support.
Improving performance on the MI300X accelerator when using fewer than 8 GPUs
============================================================================
Improving performance on the MI300X
===================================
On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
in a fully-connected topology. For collective operations, this can achieve good performance when
all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, however, this can only achieve a fraction
of the potential bandwidth on the system.
However, if your workload warrants using fewer than 8 MI300X accelerators on a system,
you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
This section outlines ways to improve RCCL performance on MI300X systems,
including guidelines for systems with fewer than eight GPUs and the most efficient
GPU partition modes.
Configuration with fewer than eight GPUs
----------------------------------------
On a system with eight MI300X accelerators, each pair of accelerators is
connected with dedicated Infinity Fabric™ links in a fully connected topology.
For collective operations, this can achieve good performance when all eight
accelerators (and all Infinity Fabric links) are used. When fewer than eight
GPUs are used, however, only a fraction of the potential bandwidth on the
system is achievable. If your workload nonetheless warrants using fewer than
eight MI300X accelerators on a system, you can set the run-time variable
``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
.. code-block:: shell
@@ -98,6 +107,150 @@ you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number
Increasing the number of channels can benefit performance, but it also increases
GPU utilization for collective operations.
Additionally, RCCL pre-defines a higher number of channels when only 2 or
4 accelerators are in use on a 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
and 24 channels for four MI300X accelerators.
Additionally, RCCL pre-defines a higher number of channels when only two or four
accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32
channels with two MI300X accelerators and 24 channels with four MI300X
accelerators.
.. _nps4_cpx_mi300_rccl:
NPS4 and CPX partition modes
----------------------------
The term compute partitioning modes, or Modular Chiplet Platform (MCP), refers to the
logical partitioning of XCDs into devices in the ROCm stack. The names are
derived from the number of logical partitions that are created out of the eight
XCDs. In the default mode, SPX (Single Partition X-celerator), all eight XCDs are
viewed as a single logical compute element, meaning that the :doc:`amd-smi <amdsmi:index>`
utility will show a single MI300X device. In CPX (Core Partitioned X-celerator)
mode, each XCD appears as a separate logical GPU, for example, as eight separate
GPUs in :doc:`amd-smi <amdsmi:index>` per MI300X. CPX mode can be viewed as
having explicit scheduling privileges for each individual compute element (XCD).
While compute partitioning modes change the space on which you can assign work
to compute units, the memory partitioning modes (known as Non-Uniform Memory
Access (NUMA) Per Socket (NPS)) change the number of NUMA domains that a device
exposes. In other words, they change the number of HBM stacks that are
accessible to a compute unit, and therefore the size of its memory space. However,
for the MI300X, the number of memory partitions must be less than or equal to
the number of compute partitions. NPS4 (viewing pairs of HBM stacks as a
disparate element), for example, is only enabled when in CPX mode (viewing each
XCD as a disparate element).
- Compute partition modes
- In SPX mode, workgroups launched to the device are distributed
round-robin to the XCDs in the device, meaning that the programmer cannot
have explicit control over which XCD a workgroup is assigned to.
- In CPX mode, workgroups are launched to a single XCD, meaning the
programmer has explicit control over work placement onto the XCDs.
- Memory partition modes
- In NPS1 mode (compatible with CPX and SPX), the entire memory is accessible
to all XCDs.
- In NPS4 mode (compatible with CPX), each quadrant of the memory is
directly visible to the logical devices in its quadrant. An XCD can still
access all portions of memory through multi-GPU programming techniques.
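As a rough illustration of the device counts implied above, the following
sketch computes how many logical GPUs a node exposes in each compute partition
mode. It assumes eight XCDs per MI300X, as described in this section. The
``logical_gpus`` function is a hypothetical helper written for this example and
is not part of any ROCm tool.

.. code-block:: shell

   #!/usr/bin/env bash
   # Hypothetical helper: number of logical GPUs a node exposes.
   #   $1 = number of MI300X accelerators in the node
   #   $2 = compute partition mode (SPX or CPX)
   logical_gpus() {
     local accelerators=$1 mode=$2
     local xcds_per_accelerator=8   # assumption taken from the text above
     case "$mode" in
       SPX) echo "$accelerators" ;;                              # one device per accelerator
       CPX) echo $(( accelerators * xcds_per_accelerator )) ;;   # one device per XCD
       *)   echo "unknown mode: $mode" >&2; return 1 ;;
     esac
   }

   logical_gpus 8 SPX   # prints 8
   logical_gpus 8 CPX   # prints 64

The CPX result for an eight-accelerator node matches the 64 ranks used in the
64-GPU ``mpirun`` example in this section.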
CPX compute partitioning and NPS4 memory partitioning can be enabled on the
MI300X using the following :doc:`amdsmi:index` commands.

.. code-block:: shell

   amd-smi set --gpu all --compute-partition CPX
   amd-smi set --gpu all --memory-partition NPS4

RCCL performance with CPX and NPS4
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To run RCCL allreduce on 64 GPUs with CPX+NPS4 mode on the MI300X, use this
example:

.. code-block:: shell

   mpirun -np 64 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

To run RCCL allreduce on 8 GPUs in the same OAM with CPX+NPS4 mode on the
MI300X, use this example:

.. code-block:: shell

   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1

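The device list in the previous example can also be generated rather than typed
out. The sketch below builds a comma-separated list of eight consecutive device
indices. The ``oam_devices`` helper is hypothetical, and it assumes that the
logical GPUs belonging to one OAM are enumerated contiguously, which should be
verified on the target system before relying on it.

.. code-block:: shell

   #!/usr/bin/env bash
   # Hypothetical helper: device indices for the logical GPUs of one OAM.
   #   $1 = zero-based OAM index
   # Assumes eight logical GPUs per OAM (CPX mode) with contiguous numbering.
   oam_devices() {
     local oam=$1 per_oam=8
     local first=$(( oam * per_oam ))
     local last=$(( first + per_oam - 1 ))
     # Join the index range with commas.
     seq "$first" "$last" | paste -sd, -
   }

   export ROCR_VISIBLE_DEVICES="$(oam_devices 0)"
   echo "$ROCR_VISIBLE_DEVICES"   # prints 0,1,2,3,4,5,6,7
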
RCCL delivers improved allreduce performance in CPX mode for TP=8 (eight GPUs
in the same OAM) on the MI300X when run with the following settings:

.. code-block:: shell

   export HIP_FORCE_DEV_KERNARG=1
   export RCCL_MSCCLPP_THRESHOLD=1073741824
   export MSCCLPP_READ_ALLRED=1
   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 32 -e 1G -f 2 -g 1 -G 2 -w 20 -n 50

Here are the benchmark results for in-place (where the output buffer is used as
the input buffer) and out-of-place allreduce bus bandwidth.

.. figure:: ../data/how-to/rccl-usage-tips/in-place_allreduce.png
   :alt: In-place allreduce benchmark results
   :align: center

.. figure:: ../data/how-to/rccl-usage-tips/out-of-place_allreduce.png
   :alt: Out-of-place allreduce benchmark results
   :align: center

A significant performance improvement is achievable with optimized CPX mode,
which peaks at ~340 GB/s with a single OAM. The difference in bus bandwidth
between the unoptimized and optimized modes increases as the buffer size grows.
Using RCCL and CPX in PyTorch
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The PyTorch all_reduce benchmark is used to reproduce the performance reported
by RCCL-Tests with the RCCL and CPX optimizations.
.. note::
To use RCCL with CPX mode in PyTorch, check the RCCL version used by PyTorch.
For a virtualenv with a .whl-based PyTorch setup (such as nightly/rocm6.2),
the library is located at
``<path-to-your-venv>/lib/<python-version>/site-packages/torch/lib/librccl.so``.
This is the version of RCCL that is packaged as part of ROCm version 6.2.
RCCL for CPX mode was enabled in ROCm 6.3.0. To use the CPX features, replace
the existing ``librccl.so`` with one from ROCm 6.3.0 or newer or from a local
build of the RCCL develop branch.
To test the effects of RCCL on PyTorch, the `stas00 all reduce benchmark <https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py>`_
was used. The following command runs a single-OAM allreduce benchmark:

.. code-block:: shell

   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py

For better performance, the ``TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK``,
``HIP_FORCE_DEV_KERNARG``, ``RCCL_MSCCLPP_THRESHOLD``, and ``MSCCLPP_READ_ALLRED``
environment variables are set during the benchmark as follows:

.. code-block:: shell

   export TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1
   export HIP_FORCE_DEV_KERNARG=1
   export RCCL_MSCCLPP_THRESHOLD=$((2*1024*1024*1024))
   export MSCCLPP_READ_ALLRED=1
   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py

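The two optimized configurations in this section set ``RCCL_MSCCLPP_THRESHOLD``
to different values, one written as a literal and one as shell arithmetic. The
snippet below only evaluates both so the byte counts can be compared side by
side. The variable names are illustrative, and the snippet makes no claim about
how RCCL interprets the threshold.

.. code-block:: shell

   #!/usr/bin/env bash
   # Values copied from the two examples in this section.
   rccl_tests_threshold=1073741824             # literal used in the RCCL-Tests run
   pytorch_threshold=$((2*1024*1024*1024))     # arithmetic used in the PyTorch run

   gib=$((1024*1024*1024))                     # bytes per GiB
   echo "RCCL-Tests run: $((rccl_tests_threshold / gib)) GiB ($rccl_tests_threshold bytes)"
   echo "PyTorch run: $((pytorch_threshold / gib)) GiB ($pytorch_threshold bytes)"
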
The default allreduce PyTorch benchmark peak bus bandwidth performance is
~170 GB/s on a single OAM with ROCm 6.2.4, while the optimized run for CPX on a
single OAM peaks at ~315 GB/s.
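For a quick sense of scale, the ratio between the two peak figures quoted above
can be computed directly. This is simple arithmetic on the approximate numbers
from the text, not a new measurement, and the variable names are illustrative.

.. code-block:: shell

   #!/usr/bin/env bash
   # Approximate peak bus bandwidth figures (GB/s) quoted in the text.
   default_busbw=170     # default PyTorch allreduce benchmark, ROCm 6.2.4
   optimized_busbw=315   # optimized CPX run on a single OAM

   # Shell arithmetic is integer-only, so use awk for the division.
   speedup=$(awk -v a="$optimized_busbw" -v b="$default_busbw" \
     'BEGIN { printf "%.2f", a / b }')
   echo "speedup: ${speedup}x"   # prints speedup: 1.85x
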