Add documentation for NPS4 and CPX partition modes (#1555)

[ROCm/rccl commit: 28ab8603d2]
2025-03-31 17:25:25 +02:00
@@ -30,6 +30,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
  environment variable `RCCL_DISABLE_RAIL_TREES=1`.
 * Additional debug information about how the trees are built can be logged to the GRAPH logging subsys by setting
  `RCCL_OUTPUT_TREES=1`.
+* Added documentation about the performance benefits of the NPS4 and CPX partition modes on the MI300X.

 ## RCCL 2.21.5 for ROCm 6.3.1

@@ -82,15 +82,24 @@ set the HSA environment variable as follows:
 This feature requires GPUs that support peer-to-peer access along with
 proper large BAR addressing support.

-Improving performance on the MI300X accelerator when using fewer than 8 GPUs
-============================================================================
+Improving performance on the MI300X
+===================================

-On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
-in a fully-connected topology. For collective operations, this can achieve good performance when
-all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, however, this can only achieve a fraction
-of the potential bandwidth on the system.
-However, if your workload warrants using fewer than 8 MI300X accelerators on a system,
-you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
+This section outlines ways to improve RCCL performance on MI300X systems,
+including guidelines for systems with fewer than eight GPUs and the most efficient
+GPU partition modes.
+
+Configuration with fewer than eight GPUs
+----------------------------------------
+
+On a system with eight MI300X accelerators, each pair of accelerators is
+connected with dedicated Infinity Fabric™ links in a fully connected topology.
+For collective operations, this can achieve good performance when all eight
+accelerators (and all Infinity Fabric links) are used. When fewer than eight
+GPUs are used, however, only a fraction of the potential bandwidth on the
+system can be achieved. If your workload warrants using fewer than
+eight MI300X accelerators on a system, you can set the run-time variable
+``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:

 .. code-block:: shell

@@ -98,6 +107,150 @@ you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number

 Increasing the number of channels can benefit performance, but it also increases
 GPU utilization for collective operations.
-Additionally, RCCL pre-defines a higher number of channels when only 2 or
-4 accelerators are in use on a 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
-and 24 channels for four MI300X accelerators.
+Additionally, RCCL pre-defines a higher number of channels when only two or four
+accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32
+channels with two MI300X accelerators and 24 channels with four MI300X
+accelerators.
+
+.. _nps4_cpx_mi300_rccl:
+
+NPS4 and CPX partition modes
+----------------------------
+
+Compute partitioning modes, also called Modular Chiplet Platform (MCP) modes,
+refer to the logical partitioning of XCDs into devices in the ROCm stack. The names are
+derived from the number of logical partitions that are created out of the eight
+XCDs. In the default mode, SPX (Single Partition X-celerator), all eight XCDs are
+viewed as a single logical compute element, meaning that the :doc:`amd-smi <amdsmi:index>`
+utility will show a single MI300X device. In CPX (Core Partitioned X-celerator)
+mode, each XCD appears as a separate logical GPU, so each MI300X shows up as
+eight separate GPUs in :doc:`amd-smi <amdsmi:index>`. CPX mode can be viewed as
+having explicit scheduling privileges for each individual compute element (XCD).
+
+While compute partitioning modes change the space on which you can assign work
+to compute units, the memory partitioning modes (known as Non-Uniform Memory
+Access (NUMA) Per Socket (NPS)) change the number of NUMA domains that a device
+exposes. In other words, they change the number of HBM stacks that are
+accessible to a compute unit, and therefore the size of its memory space. On
+the MI300X, the number of memory partitions must be less than or equal to
+the number of compute partitions. For example, NPS4 (which treats each pair of
+HBM stacks as a separate element) can only be enabled in CPX mode (which treats
+each XCD as a separate element).
+
+- Compute partition modes 
+
+  - In SPX mode, workgroups launched to the device are distributed
+    round-robin to the XCDs in the device, meaning that the programmer cannot
+    have explicit control over which XCD a workgroup is assigned to.
+
+  - In CPX mode, workgroups are launched to a single XCD, meaning the
+    programmer has explicit control over work placement onto the XCDs.
+  
+- Memory partition modes 
+
+  - In NPS1 mode (compatible with CPX and SPX), the entire memory is accessible
+    to all XCDs.
+
+  - In NPS4 mode (compatible with CPX), each quadrant of the memory is
+    directly visible to the logical devices in that quadrant. An XCD can still
+    access all portions of memory through multi-GPU programming techniques.
+
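The compatibility rules listed above can be sketched as a small shell check. This is only an illustration: the function name ``is_supported_partition_combo`` is hypothetical, it covers only the modes named in this section, and it does not query the hardware.

```shell
# Return success (0) if the compute/memory partition pair is one of the
# combinations this section lists as compatible, failure (1) otherwise.
# Hypothetical helper for illustration; it does not query the hardware.
is_supported_partition_combo() {
    compute_mode="$1"
    memory_mode="$2"
    case "${compute_mode}+${memory_mode}" in
        SPX+NPS1|CPX+NPS1|CPX+NPS4) return 0 ;;
        *) return 1 ;;
    esac
}

if is_supported_partition_combo CPX NPS4; then
    echo "CPX with NPS4: supported"
fi
if ! is_supported_partition_combo SPX NPS4; then
    echo "SPX with NPS4: not supported"
fi
```

A check like this could run before the partition modes are changed, so that a script fails early on a combination the section describes as unavailable.
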
+CPX and NPS4 modes can be enabled on the MI300X using the following
+:doc:`amd-smi <amdsmi:index>` commands:
+
+.. code-block:: shell
+
+   amd-smi set --gpu all --compute-partition CPX
+   amd-smi set --gpu all --memory-partition NPS4
+
+RCCL performance with CPX and NPS4
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To run RCCL allreduce on 64 GPUs with CPX+NPS4 mode on the MI300X, use this
+example:
+
+.. code-block:: shell
+
+   mpirun -np 64 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
+
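The process count of 64 follows from the partitioning described earlier: eight MI300X accelerators, each exposing eight XCDs as logical GPUs in CPX mode. A minimal sketch of that arithmetic, assuming one MPI rank per logical GPU (the variable names are illustrative):

```shell
# Values taken from this section: a system with eight MI300X accelerators,
# each exposing eight XCDs as separate logical GPUs in CPX mode.
accelerators=8
xcds_per_accelerator=8

# One MPI rank per logical GPU (the example above passes -g 1).
ranks=$((accelerators * xcds_per_accelerator))
echo "$ranks"
```
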
+To run RCCL allreduce on 8 GPUs in the same OAM with CPX+NPS4 mode on the
+MI300X, use this example:
+
+.. code-block:: shell
+
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
+
+RCCL delivers improved allreduce performance in CPX mode for TP=8 (8 GPUs in
+the same OAM) on the MI300X when run with the following settings:
+
+.. code-block:: shell
+
+   export HIP_FORCE_DEV_KERNARG=1
+   export RCCL_MSCCLPP_THRESHOLD=1073741824
+
+   export MSCCLPP_READ_ALLRED=1
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 32 -e 1G -f 2 -g 1 -G 2 -w 20 -n 50
+
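The literal ``1073741824`` used for ``RCCL_MSCCLPP_THRESHOLD`` is 1 GiB in bytes. As a small sketch, the same value can be written as shell arithmetic, which is easier to read and adjust (the variable name ``one_gib`` is illustrative):

```shell
# 1 GiB expressed as shell arithmetic; this equals the literal 1073741824
# used for RCCL_MSCCLPP_THRESHOLD in the example above.
one_gib=$((1024 * 1024 * 1024))
echo "$one_gib"
```
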
+Here are the benchmark results for in-place (where the output buffer is used as
+the input buffer) and out-of-place allreduce bus bandwidth.
+
+.. figure:: ../data/how-to/rccl-usage-tips/in-place_allreduce.png
+    :alt: In-place allreduce benchmark results
+    :align: center
+
+.. figure:: ../data/how-to/rccl-usage-tips/out-of-place_allreduce.png
+    :alt: Out-of-place allreduce benchmark results
+    :align: center
+
+A significant performance improvement is achievable with optimized CPX mode,
+which peaks at ~340 GB/s with a single OAM. The difference in bus bandwidth
+between the unoptimized and optimized modes increases as the buffer size grows.
+
+Using RCCL and CPX in PyTorch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The PyTorch all_reduce benchmark is used to reproduce the performance reported
+by RCCL-Tests with the RCCL and CPX optimizations.
+
+.. note::
+
+   To use RCCL with CPX mode in PyTorch, check the RCCL version used by PyTorch.
+
+   For a virtualenv with a .whl-based PyTorch setup (such as nightly/rocm6.2),
+   the library is located at
+   ``<path-to-your-venv>/lib/<python-version>/site-packages/torch/lib/librccl.so``.
+   This is the version of RCCL that is packaged as part of ROCm version 6.2.
+
+   CPX mode support in RCCL was enabled in ROCm 6.3.0. To use the CPX features,
+   replace the existing ``librccl.so`` with one from ROCm 6.3.0 or newer, or
+   with one from a local build of the RCCL develop branch.
+
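To locate the ``librccl.so`` that a virtualenv-based PyTorch installation bundles, a search along the following lines may help. It is a sketch that assumes the ``site-packages/torch/lib`` layout described in the note; the function name ``find_torch_rccl`` is hypothetical.

```shell
# Hypothetical helper: list every librccl.so bundled with PyTorch under a
# virtualenv root, matching the site-packages/torch/lib layout from the note.
find_torch_rccl() {
    venv_root="$1"
    find "$venv_root" -name 'librccl.so' -path '*/site-packages/torch/lib/*'
}

# Example: search the active virtualenv, if one is activated.
if [ -n "${VIRTUAL_ENV:-}" ]; then
    find_torch_rccl "$VIRTUAL_ENV"
fi
```

Back up the file that this reports before replacing it with a newer build, so the original library can be restored.
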
+To test the effects of RCCL on PyTorch, the `stas00 all reduce benchmark <https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py>`_
+was used. The following command runs a single-OAM allreduce
+benchmark:
+
+.. code-block:: shell
+
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000  --rdzv_backend c10d all_reduce_bench.py
+
+For better performance, set the ``HIP_FORCE_DEV_KERNARG``, ``RCCL_MSCCLPP_THRESHOLD``,
+``MSCCLPP_READ_ALLRED``, and ``TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK``
+environment variables before running the benchmark, as follows:
+
+.. code-block:: shell
+
+   export TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1
+   export HIP_FORCE_DEV_KERNARG=1
+   export RCCL_MSCCLPP_THRESHOLD=$((2*1024*1024*1024))
+   export MSCCLPP_READ_ALLRED=1
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000  --rdzv_backend c10d all_reduce_bench.py
+
+The default PyTorch allreduce benchmark peaks at a bus bandwidth of
+~170 GB/s on a single OAM with ROCm 6.2.4, while the optimized run for CPX on a
+single OAM peaks at ~315 GB/s.
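
The two peak figures quoted above can be turned into a relative speedup. Both are approximate, so the ratio is only a rough guide; the variable names are illustrative.

```shell
# Ratio of the two approximate peak bus bandwidths quoted above (GB/s).
default_peak=170
optimized_peak=315
speedup=$(awk -v d="$default_peak" -v o="$optimized_peak" \
    'BEGIN { printf "%.2f", o / d }')
echo "$speedup"
```
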