diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md
index abb5f75832..167eda57c7 100644
--- a/projects/rccl/CHANGELOG.md
+++ b/projects/rccl/CHANGELOG.md
@@ -30,6 +30,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
 environment variable `RCCL_DISABLE_RAIL_TREES=1`.
 * Additional debug information about how the trees are built can be logged to the GRAPH logging subsys by setting `RCCL_OUTPUT_TREES=1`.
+* Added documentation about the NPS4 and CPX partition modes performance benefits on the MI300X.
 
 ## RCCL 2.21.5 for ROCm 6.3.1
 
diff --git a/projects/rccl/docs/data/how-to/rccl-usage-tips/in-place_allreduce.png b/projects/rccl/docs/data/how-to/rccl-usage-tips/in-place_allreduce.png
new file mode 100644
index 0000000000..3b177ccbf4
Binary files /dev/null and b/projects/rccl/docs/data/how-to/rccl-usage-tips/in-place_allreduce.png differ
diff --git a/projects/rccl/docs/data/how-to/rccl-usage-tips/out-of-place_allreduce.png b/projects/rccl/docs/data/how-to/rccl-usage-tips/out-of-place_allreduce.png
new file mode 100644
index 0000000000..3352f93097
Binary files /dev/null and b/projects/rccl/docs/data/how-to/rccl-usage-tips/out-of-place_allreduce.png differ
diff --git a/projects/rccl/docs/how-to/rccl-usage-tips.rst b/projects/rccl/docs/how-to/rccl-usage-tips.rst
index 59abd32c62..852a1c33b0 100644
--- a/projects/rccl/docs/how-to/rccl-usage-tips.rst
+++ b/projects/rccl/docs/how-to/rccl-usage-tips.rst
@@ -82,15 +82,24 @@ set the HSA environment variable as follows:
 This feature requires GPUs that support peer-to-peer access along with proper large BAR addressing support.
-Improving performance on the MI300X accelerator when using fewer than 8 GPUs
-============================================================================
+Improving performance on the MI300X
+===================================
 
-On a system with 8\*MI300X accelerators, each pair of accelerators is connected with dedicated XGMI links
-in a fully-connected topology. For collective operations, this can achieve good performance when
-all 8 accelerators (and all XGMI links) are used. When fewer than 8 GPUs are used, however, this can only achieve a fraction
-of the potential bandwidth on the system.
-However, if your workload warrants using fewer than 8 MI300X accelerators on a system,
-you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
+This section outlines ways to improve RCCL performance on MI300X systems,
+including guidelines for systems with fewer than eight GPUs and the most efficient
+GPU partition modes.
+
+Configuration with fewer than eight GPUs
+----------------------------------------
+
+On a system with eight MI300X accelerators, each pair of accelerators is
+connected with dedicated Infinity Fabric™ links in a fully connected topology.
+For collective operations, this can achieve good performance when all eight
+accelerators (and all Infinity Fabric links) are used. When fewer than eight
+GPUs are used, however, this can only achieve a fraction of the potential
+bandwidth on the system. However, if your workload warrants using fewer than
+eight MI300X accelerators on a system, you can set the run-time variable
+``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
 
 .. code-block:: shell
 
@@ -98,6 +107,150 @@ you can set the run-time variable ``NCCL_MIN_NCHANNELS`` to increase the number
 Increasing the number of channels can benefit performance, but it also increases GPU utilization for collective operations.
-Additionally, RCCL pre-defines a higher number of channels when only 2 or
-4 accelerators are in use on a 8\*MI300X system. In this situation, RCCL uses 32 channels with two MI300X accelerators
-and 24 channels for four MI300X accelerators.
\ No newline at end of file
+Additionally, RCCL pre-defines a higher number of channels when only two or four
+accelerators are in use on an 8\*MI300X system. In this situation, RCCL uses 32
+channels with two MI300X accelerators and 24 channels for four MI300X
+accelerators.
+
+.. _nps4_cpx_mi300_rccl:
+
+NPS4 and CPX partition modes
+----------------------------
+
+The term compute partitioning modes, or Modular Chiplet Platform (MCP), refers to the
+logical partitioning of XCDs into devices in the ROCm stack. The names are
+derived from the number of logical partitions that are created out of the eight
+XCDs. In the default mode, SPX (Single Partition X-celerator), all eight XCDs are
+viewed as a single logical compute element, meaning that the :doc:`amd-smi `
+utility will show a single MI300X device. In CPX (Core Partitioned X-celerator)
+mode, each XCD appears as a separate logical GPU, for example, as eight separate
+GPUs in :doc:`amd-smi ` per MI300X. CPX mode can be viewed as
+having explicit scheduling privileges for each individual compute element (XCD).
+
+While compute partitioning modes change the space on which you can assign work
+to compute units, the memory partitioning modes (known as Non-Uniform Memory
+Access (NUMA) Per Socket (NPS)) change the number of NUMA domains that a device
+exposes. In other words, it changes the number of HBM stacks which are
+accessible to a compute unit, and therefore the size of its memory space. However,
+for the MI300X, the number of memory partitions must be less than or equal to
+the number of compute partitions. NPS4 (viewing pairs of HBM stacks as a
+disparate element), for example, is only enabled when in CPX mode (viewing each
+XCD as a disparate element).
+
+- Compute partition modes
+
+  - In SPX mode, workgroups launched to the device are distributed
+    round-robin to the XCDs in the device, meaning that the programmer cannot
+    have explicit control over which XCD a workgroup is assigned to.
+
+  - In CPX mode, workgroups are launched to a single XCD, meaning the
+    programmer has explicit control over work placement onto the XCDs.
+
+- Memory partition modes
+
+  - In NPS1 mode (compatible with CPX and SPX), the entire memory is accessible
+    to all XCDs.
+
+  - In NPS4 mode (compatible with CPX), each quadrant of the memory is
+    directly visible to the logical devices in its quadrant. An XCD can still
+    access all portions of memory through multi-GPU programming techniques.
+
+The MI300 CPX mode can be accessed using the following :doc:`amdsmi:index`
+commands.
+
+.. code-block:: shell
+
+   amd-smi set --gpu all --compute-partition CPX
+   amd-smi set --gpu all --memory-partition NPS4
+
+RCCL performance with CPX and NPS4
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To run RCCL allreduce on 64 GPUs with CPX+NPS4 mode on the MI300X, use this
+example:
+
+.. code-block:: shell
+
+   mpirun -np 64 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
+
+To run RCCL allreduce on 8 GPUs in the same OAM with CPX+NPS4 mode on the
+MI300X, use this example:
+
+.. code-block:: shell
+
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
+
+RCCL delivers improved allreduce performance in CPX mode for TP=8 (8 GPUs in
+the same OAM) on the MI300X.
+
+.. 
code-block:: shell
+
+   export HIP_FORCE_DEV_KERNARG=1
+   export RCCL_MSCCLPP_THRESHOLD=1073741824
+
+   export MSCCLPP_READ_ALLRED=1
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+
+   mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 32 -e 1G -f 2 -g 1 -G 2 -w 20 -n 50
+
+Here are the benchmark results for in-place (where the output buffer is used as
+the input buffer) and out-of-place allreduce bus bandwidth.
+
+.. figure:: ../data/how-to/rccl-usage-tips/in-place_allreduce.png
+   :alt: In-place allreduce benchmark results
+   :align: center
+
+.. figure:: ../data/how-to/rccl-usage-tips/out-of-place_allreduce.png
+   :alt: Out-of-place allreduce benchmark results
+   :align: center
+
+A significant performance improvement is achievable with optimized CPX mode,
+which peaks at ~340 GB/s with a single OAM. The difference in bus bandwidth
+between the unoptimized and optimized modes increases as the buffer size grows.
+
+Using RCCL and CPX in PyTorch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The PyTorch all_reduce benchmark is used to reproduce the performance reported
+by RCCL-Tests with the RCCL and CPX optimizations.
+
+.. note::
+
+   To use RCCL with CPX mode in PyTorch, check the RCCL version used by PyTorch.
+
+   For a virtualenv with a .whl-based PyTorch setup (such as nightly/rocm6.2),
+   this would be in
+   ``/lib//site-packages/torch/lib/librccl.so``.
+   This is the version of RCCL that is packaged as part of ROCm version 6.2.
+
+   RCCL for CPX mode was enabled in ROCm 6.3.0. To use the CPX features, replace
+   the existing ``librccl.so`` with one from ROCm 6.3.0 or newer or from a local
+   build of the RCCL develop branch.
+
+To test the effects of RCCL on PyTorch, the `stas00 all reduce benchmark `_
+was used. The following command is used to run a single OAM allreduce
+benchmark:
+
+.. 
code-block:: shell
+
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py
+
+For better performance, the ``HIP_FORCE_DEV_KERNARG``, ``RCCL_MSCCLPP_THRESHOLD``,
+and ``TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`` environment variables are
+set during the benchmark in the following manner:
+
+.. code-block:: shell
+
+   export TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1
+   export HIP_FORCE_DEV_KERNARG=1
+   export RCCL_MSCCLPP_THRESHOLD=$((2*1024*1024*1024))
+   export MSCCLPP_READ_ALLRED=1
+   export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+   python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py
+
+The default allreduce PyTorch benchmark peak bus bandwidth performance is
+~170 GB/s on a single OAM with ROCm 6.2.4, while the optimized run for CPX on a
+single OAM peaks at ~315 GB/s.
\ No newline at end of file
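
Review note: the two optimized examples in this patch write ``RCCL_MSCCLPP_THRESHOLD`` in different forms, a literal ``1073741824`` for the rccl-tests run and ``$((2*1024*1024*1024))`` for the PyTorch run. The following is a minimal sketch, assuming a Bash-compatible shell with 64-bit arithmetic, of how both byte counts can be derived from a GiB count so the two values are easy to compare. The helper name `gib_to_bytes` and the two variable names are hypothetical and are not part of RCCL, ROCm, or this patch.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RCCL or ROCm): convert a whole number of
# GiB into bytes using shell integer arithmetic.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Value used by the rccl-tests example in this patch (1 GiB).
RCCL_TESTS_THRESHOLD=$(gib_to_bytes 1)

# Value used by the PyTorch example in this patch (2 GiB).
PYTORCH_THRESHOLD=$(gib_to_bytes 2)

echo "rccl-tests threshold: ${RCCL_TESTS_THRESHOLD}"
echo "PyTorch threshold:    ${PYTORCH_THRESHOLD}"
```

The sketch only prints the two values; exporting one of them as ``RCCL_MSCCLPP_THRESHOLD`` is left to the benchmark commands shown in the patch.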