858fa4e65d
[ROCm/rccl commit: 28ab8603d2]
256 lines
11 KiB
ReStructuredText
256 lines
11 KiB
ReStructuredText
.. meta::
|
|
:description: Usage tips for the RCCL library of collective communication primitives
|
|
:keywords: RCCL, ROCm, library, API, peer-to-peer, transport
|
|
|
|
.. _rccl-usage-tips:
|
|
|
|
|
|
*****************************************
|
|
RCCL usage tips
|
|
*****************************************
|
|
|
|
This topic describes some of the more common RCCL extensions, such as NPKit and MSCCL, and provides tips on how to
|
|
configure and customize the application.
|
|
|
|
NPKit
|
|
=====
|
|
|
|
RCCL integrates `NPKit <https://github.com/microsoft/npkit>`_, a profiler framework that
|
|
enables the collection of fine-grained trace events in RCCL components, especially in giant collective GPU kernels.
|
|
See the `NPKit sample workflow for RCCL <https://github.com/microsoft/NPKit/tree/main/rccl_samples>`_ for
|
|
a fully-automated usage example. It also provides useful templates for the following manual instructions.
|
|
|
|
To manually build RCCL with NPKit enabled, pass ``-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"`` to the ``cmake`` command.
|
|
All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix ``ENABLE_NPKIT_``.
|
|
These switches control the information that is collected.
|
|
|
|
.. note::
|
|
|
|
NPKit only supports the collection of non-overlapped events on the GPU.
|
|
The ``-DNPKIT_FLAGS`` settings must follow this rule.
|
|
|
|
To manually run RCCL with NPKit enabled, set the environment variable ``NPKIT_DUMP_DIR``
|
|
to the NPKit event dump directory. NPKit only supports one GPU per process.
|
|
To manually analyze the NPKit dump results, use `npkit_trace_generator.py <https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py>`_.
|
|
|
|
MSCCL/MSCCL++
|
|
=============
|
|
|
|
RCCL integrates `MSCCL <https://github.com/microsoft/msccl>`_ and `MSCCL++ <https://github.com/microsoft/mscclpp>`_ to
|
|
leverage these highly efficient GPU-GPU communication primitives for collective operations.
|
|
Microsoft Corporation collaborated with AMD for this project.
|
|
|
|
MSCCL uses XMLs for different collective algorithms on different architectures.
|
|
RCCL collectives can leverage these algorithms after the user provides the corresponding XML.
|
|
The XML files contain sequences of send-recv and reduction operations for the kernel to run.
|
|
|
|
MSCCL is enabled by default on the AMD Instinct™ MI300X accelerator. On other platforms, users might have to enable it
|
|
using the setting ``RCCL_MSCCL_FORCE_ENABLE=1``. By default, MSCCL is only used if every rank belongs
|
|
to a unique process. To disable this restriction for multi-threaded or single-threaded configurations,
|
|
use the setting ``RCCL_MSCCL_ENABLE_SINGLE_PROCESS=1``.
|
|
|
|
RCCL allreduce and allgather collectives can leverage the efficient MSCCL++ communication kernels
|
|
for certain message sizes. MSCCL++ support is available whenever MSCCL support is available.
|
|
To run a RCCL workload with MSCCL++ support, set the following RCCL environment variable:
|
|
|
|
.. code-block:: shell
|
|
|
|
RCCL_MSCCLPP_ENABLE=1
|
|
|
|
To set the message size threshold for using MSCCL++, use the environment variable ``RCCL_MSCCLPP_THRESHOLD``,
|
|
which has a default value of 1MB. After ``RCCL_MSCCLPP_THRESHOLD`` has been set,
|
|
RCCL invokes MSCCL++ kernels for all message sizes less than or equal to the specified threshold.
|
|
|
|
The following restrictions apply when using MSCCL++. If these restrictions are not met,
|
|
operations fall back to using MSCCL or RCCL.
|
|
|
|
* The message size must be a non-zero multiple of 32 bytes
|
|
* It does not support ``hipMallocManaged`` buffers
|
|
* Allreduce only supports the ``float16``, ``int32``, ``uint32``, ``float32``, and ``bfloat16`` data types
|
|
* Allreduce only supports the sum operation
|
|
|
|
Enabling peer-to-peer transport
|
|
===============================
|
|
|
|
To enable peer-to-peer access on machines with PCIe-connected GPUs,
|
|
set the HSA environment variable as follows:
|
|
|
|
.. code-block:: shell
|
|
|
|
HSA_FORCE_FINE_GRAIN_PCIE=1
|
|
|
|
This feature requires GPUs that support peer-to-peer access along with
|
|
proper large BAR addressing support.
|
|
|
|
Improving performance on the MI300X
|
|
===================================
|
|
|
|
This section outlines ways to improve RCCL performance on MI300X systems,
|
|
including guidelines for systems with fewer than eight GPUs and the most efficient
|
|
GPU partition modes.
|
|
|
|
Configuration with fewer than eight GPUs
|
|
----------------------------------------
|
|
|
|
On a system with eight MI300X accelerators, each pair of accelerators is
|
|
connected with dedicated Infinity Fabric™ links in a fully connected topology.
|
|
For collective operations, this can achieve good performance when all eight
|
|
accelerators (and all Infinity Fabric links) are used. When fewer than eight
|
|
GPUs are used, however, this can only achieve a fraction of the potential
|
|
bandwidth on the system. However, if your workload warrants using fewer than
|
|
eight MI300X accelerators on a system, you can set the run-time variable
|
|
``NCCL_MIN_NCHANNELS`` to increase the number of channels. For example:
|
|
|
|
.. code-block:: shell
|
|
|
|
export NCCL_MIN_NCHANNELS=32
|
|
|
|
Increasing the number of channels can benefit performance, but it also increases
|
|
GPU utilization for collective operations.
|
|
Additionally, RCCL pre-defines a higher number of channels when only two or four
|
|
accelerators are in use on a 8\*MI300X system. In this situation, RCCL uses 32
|
|
channels with two MI300X accelerators and 24 channels for four MI300X
|
|
accelerators.
|
|
|
|
.. _nps4_cpx_mi300_rccl:
|
|
|
|
NPS4 and CPX partition modes
|
|
----------------------------
|
|
|
|
The term compute partitioning modes, or Modular Chiplet Platform (MCP), refers to the
|
|
logical partitioning of XCDs into devices in the ROCm stack. The names are
|
|
derived from the number of logical partitions that are created out of the eight
|
|
XCDs. In the default mode, SPX (Single Partition X-celerator), all eight XCDs are
|
|
viewed as a single logical compute element, meaning that the :doc:`amd-smi <amdsmi:index>`
|
|
utility will show a single MI300X device. In CPX (Core Partitioned X-celerator)
|
|
mode, each XCD appears as a separate logical GPU, for example, as eight separate
|
|
GPUs in :doc:`amd-smi <amdsmi:index>` per MI300X. CPX mode can be viewed as
|
|
having explicit scheduling privileges for each individual compute element (XCD).
|
|
|
|
While compute partitioning modes change the space on which you can assign work
|
|
to compute units, the memory partitioning modes (known as Non-Uniform Memory
|
|
Access (NUMA) Per Socket (NPS)) change the number of NUMA domains that a device
|
|
exposes. In other words, it changes the number of HBM stacks which are
|
|
accessible to a compute unit, and therefore the size of its memory space. However,
|
|
for the MI300X, the number of memory partitions must be less than or equal to
|
|
the number of compute partitions. NPS4 (viewing pairs of HBM stacks as a
|
|
disparate element), for example, is only enabled when in CPX mode (viewing each
|
|
XCD as a disparate element).
|
|
|
|
- Compute partition modes
|
|
|
|
- In SPX mode, workgroups launched to the device are distributed
|
|
round-robin to the XCDs in the device, meaning that the programmer cannot
|
|
have explicit control over which XCD a workgroup is assigned to.
|
|
|
|
- In CPX mode, workgroups are launched to a single XCD, meaning the
|
|
programmer has explicit control over work placement onto the XCDs.
|
|
|
|
- Memory partition modes
|
|
|
|
- In NPS1 mode (compatible with CPX and SPX), the entire memory is accessible
|
|
to all XCDs.
|
|
|
|
- In NPS4 mode (compatible with CPX), each memory quadrant of the memory is
|
|
directly visible to the logical devices in its quadrant. An XCD can still
|
|
access all portions of memory through multi-GPU programming techniques.
|
|
|
|
The MI300 CPX mode can be accessed using the following :doc:`amdsmi:index`
|
|
commands.
|
|
|
|
.. code-block:: shell
|
|
|
|
amd-smi set --gpu all --compute-partition CPX
|
|
amd-smi set --gpu all --memory-partition NPS4
|
|
|
|
RCCL performance with CPX and NPS4
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
To run RCCL allreduce on 64 GPUs with CPX+NPS4 mode on the MI300X, use this
|
|
example:
|
|
|
|
.. code-block:: shell
|
|
|
|
mpirun -np 64 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
|
|
|
|
To run RCCL allreduce on 8 GPUs in the same OAM with CPX+NPS4 mode on the
|
|
MI300X, use this example:
|
|
|
|
.. code-block:: shell
|
|
|
|
export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
|
|
|
mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 8 -e 1G -f 2 -g 1
|
|
|
|
RCCL delivers improved allreduce performance in CPX mode for TP=8 (8 GPUs in
|
|
the same OAM) on the MI300X.
|
|
|
|
.. code-block:: shell
|
|
|
|
export HIP_FORCE_DEV_KERNARG=1
|
|
export RCCL_MSCCLPP_THRESHOLD=1073741824
|
|
|
|
export MSCCLPP_READ_ALLRED=1
|
|
export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
|
|
|
mpirun -np 8 --bind-to numa rccl-tests/build/all_reduce_perf -b 32 -e 1G -f 2 -g 1 -G 2 -w 20 -n 50
|
|
|
|
Here are the benchmark results for in-place (where the output buffer is used as
|
|
the input buffer) and out-of-place allreduce bus bandwidth.
|
|
|
|
.. figure:: ../data/how-to/rccl-usage-tips/in-place_allreduce.png
|
|
:alt: In-place allreduce benchmark results
|
|
:align: center
|
|
|
|
.. figure:: ../data/how-to/rccl-usage-tips/out-of-place_allreduce.png
|
|
:alt: Out-of-place allreduce benchmark results
|
|
:align: center
|
|
|
|
A significant performance improvement is achievable with optimized CPX mode,
|
|
which peaks at ~340 GB/s with a single OAM. The difference in bus bandwidth
|
|
between the unoptimized and optimized modes increases as the buffer size grows.
|
|
|
|
Using RCCL and CPX in PyTorch
|
|
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
|
|
The PyTorch all_reduce benchmark is used to reproduce the performance reported
|
|
by RCCL-Tests with the RCCL and CPX optimizations.
|
|
|
|
.. note::
|
|
|
|
To use RCCL with CPX mode in PyTorch, check the RCCL version used by PyTorch.
|
|
|
|
For a virtualenv with a .whl-based PyTorch setup (such as nightly/rocm6.2),
|
|
this would be in
|
|
``<path-to-your-venv>/lib/<python-version>/site-packages/torch/lib/librccl.so``
|
|
This is the version of RCCL that is packaged as part of ROCm version 6.2.
|
|
|
|
RCCL for CPX mode was enabled in ROCm 6.3.0. To use the CPX features, replace
|
|
the existing ``librccl.so`` with one from ROCm 6.3.0 or newer or from a local
|
|
build of the RCCL develop branch.
|
|
|
|
To test the effects of RCCL on PyTorch, the `stas00 all reduce benchmark <https://github.com/stas00/ml-engineering/blob/master/network/benchmarks/all_reduce_bench.py>`_
|
|
was used. The following command is used to run a single OAM allreduce
|
|
benchmark:
|
|
|
|
.. code-block:: shell
|
|
|
|
export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
|
python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py
|
|
|
|
For better performance, the ``HIP_FORCE_DEV_KERNARG``, ``RCCL_MSCCLPP_THRESHOLD``,
|
|
and ``TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK`` environment variables are
|
|
set during the benchmark in the following manner:
|
|
|
|
.. code-block:: shell
|
|
|
|
export TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK=1
|
|
export HIP_FORCE_DEV_KERNARG=1
|
|
export RCCL_MSCCLPP_THRESHOLD=$((2*1024*1024*1024))
|
|
export MSCCLPP_READ_ALLRED=1
|
|
export ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
|
|
python -u -m torch.distributed.run --nproc_per_node=8 --rdzv_endpoint localhost:6000 --rdzv_backend c10d all_reduce_bench.py
|
|
|
|
The default allreduce PyTorch benchmark peak bus bandwidth performance is
|
|
~170 GB/s on a single OAM with ROCm 6.2.4, while the optimized run for CPX on a
|
|
single OAM peaks at ~315 GB/s. |