
# RCCL

ROCm Communication Collectives Library

## Introduction

RCCL (pronounced "Rickle") is a stand-alone library of standard collective communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, gather, scatter, and all-to-all. There is also initial support for direct GPU-to-GPU send and receive operations. It has been optimized to achieve high bandwidth on platforms using PCIe and xGMI, as well as over networks using InfiniBand Verbs or TCP/IP sockets. RCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.

The collective operations are implemented using ring and tree algorithms and have been optimized for throughput and latency. For best performance, small operations can be either batched into larger operations or aggregated through the API.

## Requirements

1. ROCm supported GPUs
2. ROCm stack installed on the system (HIP runtime & HIP-Clang)

## Quickstart RCCL Build

RCCL directly depends on the HIP runtime and the HIP-Clang compiler, which are part of the ROCm software stack.
For ROCm installation instructions, see https://github.com/ROCm/ROCm.

The root of this repository has a helper script, `install.sh`, to build and install RCCL with a single command. It hard-codes configurations that could otherwise be specified by invoking cmake directly, but it is a great way to get started quickly and can serve as an example of how to build and install RCCL.

### To build the library using the install script:

```shell
./install.sh
```

For more information on the build options and flags accepted by the install script, run `./install.sh --help`:
```shell
./install.sh --help
RCCL build & installation helper script
 Options:
       --address-sanitizer     Build with address sanitizer enabled
    -d|--dependencies          Install RCCL dependencies
       --debug                 Build debug library
       --enable_backtrace      Build with custom backtrace support
       --disable-colltrace     Build without collective trace
       --disable-msccl-kernel  Build without MSCCL kernels
       --disable-mscclpp       Build without MSCCL++ support
    -f|--fast                  Quick-build RCCL (local gpu arch only, no backtrace, and collective trace support)
    -h|--help                  Prints this help message
    -i|--install               Install RCCL library (see --prefix argument below)
    -j|--jobs                  Specify how many parallel compilation jobs to run ($nproc by default)
    -l|--local_gpu_only        Only compile for local GPU architecture
       --amdgpu_targets        Only compile for specified GPU architecture(s). For multiple targets, separate by ';' (builds for all supported GPU architectures by default)
       --no_clean              Don't delete files if they already exist
       --npkit-enable          Compile with npkit enabled
       --openmp-test-enable    Enable OpenMP in rccl unit tests
       --roctx-enable          Compile with roctx enabled (example usage: rocprof --roctx-trace ./rccl-program)
    -p|--package_build         Build RCCL package
       --prefix                Specify custom directory to install RCCL to (default: `/opt/rocm`)
       --rm-legacy-include-dir Remove legacy include dir Packaging added for file/folder reorg backward compatibility
       --run_tests_all         Run all rccl unit tests (must be built already)
    -r|--run_tests_quick       Run small subset of rccl unit tests (must be built already)
       --static                Build RCCL as a static library instead of shared library
    -t|--tests_build           Build rccl unit tests, but do not run
       --time-trace            Plot the build time of RCCL (requires `ninja-build` package installed on the system)
       --verbose               Show compile commands
```
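
As a sketch of combining these options, the snippet below builds an `--amdgpu_targets` value from a list of architectures and shows how to pass it to the script. The helper name `join_targets` and the architecture names `gfx90a` and `gfx942` are illustrative assumptions; substitute the architectures of your own GPUs. The value must be quoted, because the shell would otherwise treat `;` as a command separator.

```shell
#!/usr/bin/env bash
# Hypothetical helper: join architecture names with ';', the separator --amdgpu_targets expects.
join_targets() {
  local IFS=';'          # "$*" joins the arguments using the first character of IFS
  printf '%s\n' "$*"
}

targets="$(join_targets gfx90a gfx942)"   # example architectures only
echo "$targets"                            # prints: gfx90a;gfx942

# Usage from the repository root (keep the quotes: an unquoted ';' would end the command):
#   ./install.sh -i --amdgpu_targets "$targets"
```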

## Manual build

### To build the library using CMake:

```shell
$ git clone https://github.com/ROCm/rccl.git
$ cd rccl
$ mkdir build
$ cd build
$ cmake ..
$ make -j 16      # Or some other suitable number of parallel jobs
```
You may choose an installation path of your own by passing `CMAKE_INSTALL_PREFIX` to cmake, then running `make install` to install there. For example:
```shell
$ cmake -DCMAKE_INSTALL_PREFIX=$PWD/rccl-install ..
```
Note: ensure that rocm-cmake is installed (for example, `apt install rocm-cmake`).

### To build and install the RCCL package:

Assuming you have already cloned this repository and built the library as shown in the previous section:

```shell
$ cd rccl/build
$ make package
$ sudo dpkg -i *.deb
```

Installing the RCCL package requires sudo/root access because it creates a directory called "rccl" under /opt/rocm/. This step is optional: RCCL can also be used directly by adding the directory containing librccl.so to the library search path.
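
Using RCCL without the package can be sketched as follows, assuming a Linux system where the dynamic loader honors `LD_LIBRARY_PATH`. `use_rccl_from` is a hypothetical helper, and the `rccl/build` path in the usage line is an assumption about where `librccl.so` ended up; for an install under a custom `CMAKE_INSTALL_PREFIX`, the library would typically sit under `<prefix>/lib`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: put the directory containing librccl.so on the loader's search path.
use_rccl_from() {
  local dir="$1"
  if [ ! -e "$dir/librccl.so" ]; then
    echo "librccl.so not found in $dir" >&2
    return 1
  fi
  # Prepend, adding ':' only if LD_LIBRARY_PATH is already non-empty.
  export LD_LIBRARY_PATH="$dir${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}

# Usage, assuming the library was built in rccl/build (adjust to your layout):
#   use_rccl_from "$PWD/rccl/build"
```

Programs started from the same shell afterwards load that copy of `librccl.so` at run time; linking against it at build time still needs the usual `-L` flag pointing at the same directory.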

## Enabling peer-to-peer transport

To enable peer-to-peer access on machines with PCIe-connected GPUs, set the HSA environment variable `HSA_FORCE_FINE_GRAIN_PCIE=1`. In addition, the GPUs must support peer-to-peer access, and the system must have proper large BAR addressing support.
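
A minimal sketch of setting the variable, either for the whole shell session or for a single command. `run_with_fine_grain_pcie` and `./my_rccl_app` are hypothetical names, and setting the variable does not replace the hardware and BAR requirements above.

```shell
#!/usr/bin/env bash
# Option 1: export it so every process started from this shell inherits it.
export HSA_FORCE_FINE_GRAIN_PCIE=1

# Option 2 (hypothetical helper): set it only for one command and its children.
run_with_fine_grain_pcie() {
  HSA_FORCE_FINE_GRAIN_PCIE=1 "$@"
}

# Usage:
#   run_with_fine_grain_pcie ./my_rccl_app
```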

## Tests

RCCL includes unit tests implemented with the Googletest framework. The rccl unit tests require Googletest 1.10 or higher to build and run properly (it is installed with the `-d` option to `install.sh`).
To run the rccl unit tests, go to the build folder, then to its test subfolder, and execute the appropriate rccl unit test executable(s).
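
The steps above can be wrapped in a small function, sketched below under the assumption that the build directory is `./build` (override it with `BUILD_DIR`). `run_unit_tests` is a hypothetical helper; `--gtest_list_tests` is a standard Googletest flag that prints test names without running them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run rccl-UnitTests from the test subfolder of the build directory.
# BUILD_DIR defaulting to ./build is an assumption; point it at your cmake build directory.
run_unit_tests() {
  local test_dir="${BUILD_DIR:-./build}/test"
  if [ ! -x "$test_dir/rccl-UnitTests" ]; then
    echo "rccl-UnitTests not found in $test_dir (build it first, e.g. ./install.sh -t)" >&2
    return 1
  fi
  # Run inside the test folder; any arguments are passed through to Googletest.
  (cd "$test_dir" && ./rccl-UnitTests "$@")
}

# Usage:
#   run_unit_tests --gtest_list_tests   # list test names without running them
#   run_unit_tests                      # run every test
```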

rccl unit test names have the format:

    CollectiveCall.[Type of test]

Filter the rccl unit tests with environment variables and the `--gtest_filter` command-line flag, for example:

```shell
UT_DATATYPES=ncclBfloat16 UT_REDOPS=prod ./rccl-UnitTests --gtest_filter="AllReduce.C*"
```
runs only the AllReduce correctness tests with the bfloat16 datatype and the `prod` reduction operation. A list of available filtering environment variables appears at the top of every run. See "Running a Subset of the Tests" at https://chromium.googlesource.com/external/github.com/google/googletest/+/HEAD/googletest/docs/advanced.md for more information on forming more advanced filters.
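
To select the same kind of tests for several collectives at once, Googletest accepts multiple patterns separated by `:` in `--gtest_filter`. The sketch below assembles such a filter from the `C*` (correctness) pattern used in the example above. `correctness_filter` is a hypothetical helper, and collective names other than `AllReduce` are assumptions to verify with `./rccl-UnitTests --gtest_list_tests`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: one "<Collective>.C*" pattern per argument, joined with ':'.
correctness_filter() {
  local patterns=() c
  for c in "$@"; do
    patterns+=("$c.C*")
  done
  local IFS=':'
  printf '%s\n' "${patterns[*]}"
}

filter="$(correctness_filter AllReduce AllGather)"   # AllGather is an assumed test-suite name
echo "$filter"                                        # prints: AllReduce.C*:AllGather.C*

# Usage, from the test subfolder of the build directory:
#   UT_DATATYPES=ncclBfloat16 ./rccl-UnitTests --gtest_filter="$filter"
```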


There are also other performance and error-checking tests for RCCL.  These are maintained separately at https://github.com/ROCm/rccl-tests.
See the rccl-tests README for more information on how to build and run those tests.

## NPKit

RCCL integrates [NPKit](https://github.com/microsoft/npkit), a profiler framework that enables collecting fine-grained trace events in RCCL components, especially in giant collective GPU kernels.

Please check [NPKit sample workflow for RCCL](https://github.com/microsoft/NPKit/tree/main/rccl_samples) as a fully automated usage example. It also provides good templates for the following manual instructions.

To manually build RCCL with NPKit enabled, pass `-DNPKIT_FLAGS="-DENABLE_NPKIT -DENABLE_NPKIT_...(other NPKit compile-time switches)"` to the cmake command. All NPKit compile-time switches are declared in the RCCL code base as macros with the prefix `ENABLE_NPKIT_`, and they control which information is collected. Note that NPKit currently only supports collecting non-overlapping events on the GPU, so the switches chosen in `-DNPKIT_FLAGS` must follow this rule.
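
Because every switch is a macro with the `ENABLE_NPKIT_` prefix, the available switches can be listed straight from the source tree and turned into a `-DNPKIT_FLAGS` value. The sketch below does both. `list_npkit_switches` and `npkit_flags` are hypothetical helpers, the `../src` path in the usage lines assumes the sources live under `src`, and choosing a set of non-overlapping switches is still up to you.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each distinct ENABLE_NPKIT_* macro name found under a directory.
list_npkit_switches() {
  # -r recurse, -h omit file names, -o print only the match, -E extended regex
  grep -rhoE 'ENABLE_NPKIT_[A-Za-z0-9_]+' "${1:-.}" | sort -u
}

# Hypothetical helper: build an NPKIT_FLAGS value from the chosen switch names.
npkit_flags() {
  local flags="-DENABLE_NPKIT" s   # the base switch is always included
  for s in "$@"; do
    flags="$flags -D$s"
  done
  printf '%s\n' "$flags"
}

# Usage from the build directory (assumes the sources live under ../src):
#   list_npkit_switches ../src
#   cmake -DNPKIT_FLAGS="$(npkit_flags <switches you picked>)" ..
```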

To manually run RCCL with NPKit enabled, set the environment variable `NPKIT_DUMP_DIR` to the NPKit event dump directory. Note that NPKit currently only supports one GPU per process.
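
A minimal sketch of such a run, assuming a hypothetical program `./my_rccl_app` that drives one GPU per process. `run_with_npkit` is a hypothetical helper; creating the dump directory beforehand is a precaution rather than a documented requirement.

```shell
#!/usr/bin/env bash
# Hypothetical helper: create the dump directory, then run one command with NPKIT_DUMP_DIR set.
run_with_npkit() {
  local dump_dir="$1"
  shift
  mkdir -p "$dump_dir" || return 1   # precaution; not stated as required by RCCL
  NPKIT_DUMP_DIR="$dump_dir" "$@"
}

# Usage (./my_rccl_app is a hypothetical program using one GPU per process):
#   run_with_npkit /tmp/npkit_dump ./my_rccl_app
```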

To manually analyze NPKit dump results, use [npkit_trace_generator.py](https://github.com/microsoft/NPKit/blob/main/rccl_samples/npkit_trace_generator.py).

## MSCCL/MSCCL++
RCCL integrates [MSCCL](https://github.com/microsoft/msccl) and [MSCCL++](https://github.com/microsoft/mscclpp) to leverage their highly efficient GPU-GPU communication primitives for collective operations. Thanks to Microsoft Corporation for collaborating with us on this project.

MSCCL uses XML files to describe collective algorithms for different architectures. RCCL collectives can use those algorithms once the user provides the corresponding XML file. Each XML file contains the sequence of send-recv and reduction operations to be executed by the kernel. On MI300X, MSCCL is enabled by default. On other platforms, users may have to enable it by setting `RCCL_MSCCL_FORCE_ENABLE=1`.

RCCL allreduce and allgather collectives can also use the efficient MSCCL++ communication kernels for certain message sizes. MSCCL++ support is available whenever MSCCL support is available. To run an RCCL workload with MSCCL++ support, set the RCCL environment variable `RCCL_ENABLE_MSCCLPP=1`. The message size threshold for using MSCCL++ can be set with the environment variable `RCCL_MSCCLPP_THRESHOLD` (default: 1MB); RCCL invokes MSCCL++ kernels for all message sizes less than or equal to this threshold.
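
The threshold rule described above can be modeled in a few lines of shell, which may help when choosing a value. This is a simplified illustration, not RCCL's selection code, which may apply further conditions; it assumes `RCCL_MSCCLPP_THRESHOLD` is expressed in bytes and reads the 1MB default as 1048576 bytes. `would_use_mscclpp` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Simplified model of the documented MSCCL++ rule (NOT RCCL's actual selection logic).
# Assumes the threshold is in bytes and that the "1MB" default means 1048576 bytes.
would_use_mscclpp() {
  local collective="$1" bytes="$2"
  local threshold="${RCCL_MSCCLPP_THRESHOLD:-1048576}"
  [ "${RCCL_ENABLE_MSCCLPP:-0}" = "1" ] || return 1   # MSCCL++ must be switched on
  case "$collective" in
    allreduce|allgather) ;;                            # only these collectives qualify
    *) return 1 ;;
  esac
  [ "$bytes" -le "$threshold" ]                        # at or below the threshold
}

# Example settings: enable MSCCL++ and raise the threshold to 2 MiB (assumed to be bytes).
export RCCL_ENABLE_MSCCLPP=1
export RCCL_MSCCLPP_THRESHOLD=$((2 * 1024 * 1024))
```

With these settings, `would_use_mscclpp allreduce 2097152` succeeds and `would_use_mscclpp allreduce 4194304` fails, mirroring the "less than or equal to" wording.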

## Library and API Documentation

Please refer to the [RCCL Documentation Site](https://rocm.docs.amd.com/projects/rccl/en/latest/) for current documentation.

### How to build documentation

Run the steps below to build documentation locally.

```shell
cd docs
pip3 install -r sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```

## Copyright

All source code and accompanying documentation is copyright (c) 2015-2022, NVIDIA CORPORATION. All rights reserved.

All modifications are copyright (c) 2019-2022 Advanced Micro Devices, Inc. All rights reserved.