2024-09-23 14:27:58 -04:00
# Changelog for RCCL
2020-11-05 13:38:30 -07:00
Full documentation for RCCL is available at [https://rccl.readthedocs.io ](https://rccl.readthedocs.io )
2022-01-10 11:34:31 -07:00
2024-10-23 13:49:40 -06:00
## RCCL 2.21.5 for ROCm 6.3.0
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
### Added
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
* MSCCL++ integration for specific contexts
* Performance collection to rccl_replayer
* Tuner Plugin example for MI300
* Tuning table for large number of nodes
* Support for amdclang++
* New Rome model
### Changed
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
* Compatibility with NCCL 2.21.5
* Increased channel count for MI300X multi-node
* Enabled MSCCL for single-process multi-threaded contexts
* Enabled gfx12
* Enabled CPX mode for MI300X
* Enabled tracing with rocprof
* Improved version reporting
* Enabled GDRDMA for Linux kernel 6.4.0+
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
### Resolved issues
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
* Fixed model matching with PXN enable
## RCCL 2.20.5 for ROCm 6.2.1
### Fixed
- GDR support flag now set with DMABUF
### Known issues
- On systems running Linux kernel 6.8.0, such as Ubuntu 24.04, Direct Memory Access (DMA) transfers between the GPU and NIC are disabled and impacts multi-node RCCL performance.
- This issue was reproduced with RCCL 2.20.5 (ROCm 6.2.0 and 6.2.1) on systems with Broadcom Thor-2 NICs and affects other systems with RoCE networks using Linux 6.8.0 or newer.
- Older RCCL versions are also impacted.
- This issue will be addressed in a future ROCm release.
2024-09-23 14:27:58 -04:00
2024-10-23 13:49:40 -06:00
## RCCL 2.20.5 for ROCm 6.2.0
2024-04-15 12:03:57 -06:00
### Changed
2024-04-24 09:07:49 -06:00
- Compatibility with NCCL 2.20.5
- Compatibility with NCCL 2.19.4
- Performance tuning for some collective operations on MI300
- Enabled NVTX code in RCCL
- Replaced rccl_bfloat16 with hip_bfloat16
- NPKit updates:
- Removed warm-up iteration removal by default, need to opt in now
- Doubled the size of buffers to accommodate for more channels
- Modified rings to be rail-optimized topology friendly
- Replaced ROCmSoftwarePlatform links with ROCm links
2024-04-15 12:03:57 -06:00
### Added
2024-04-24 09:07:49 -06:00
- Support for fp8 and rccl_bfloat8
- Support for using HIP contiguous memory
- Implemented ROC-TX for host-side profiling
- Enabled static build
- Added new rome model
- Added fp16 and fp8 cases to unit tests
- New unit test for main kernel stack size
- New -n option for topo_expl to override # of nodes
- Improved debug messages of memory allocations
2024-04-15 12:03:57 -06:00
### Fixed
2024-04-24 09:07:49 -06:00
- Bug when configuring RCCL for only LL128 protocol
- Scratch memory allocation after API change for MSCCL
2024-04-15 12:03:57 -06:00
2024-04-24 09:07:49 -06:00
## RCCL 2.18.6 for ROCm 6.1.0
2024-04-15 12:03:57 -06:00
### Changed
- Compatibility with NCCL 2.18.6
## RCCL 2.18.3 for ROCm 6.0.0
### Changed
- Compatibility with NCCL 2.18.3
## RCCL 2.17.1-1 for ROCm 5.7.0
### Changed
- Compatibility with NCCL 2.17.1-1
- Performance tuning for some collective operations
### Added
- Minor improvements to MSCCL codepath
- NCCL_NCHANNELS_PER_PEER support
- Improved compilation performance
- Support for gfx94x
### Fixed
- Potential race-condition during ncclSocketClose()
2024-04-11 15:04:40 -06:00
## RCCL 2.16.2 for ROCm 5.6.0
2022-12-19 08:36:57 -08:00
### Changed
2023-02-04 01:43:38 +00:00
- Compatibility with NCCL 2.16.2
### Fixed
2023-02-18 08:50:31 -08:00
- Remove workaround and use indirect function call
2023-02-04 01:43:38 +00:00
2024-04-11 15:04:40 -06:00
## RCCL 2.15.5 for ROCm 5.5.0
2023-02-04 01:43:38 +00:00
### Changed
2022-12-19 08:36:57 -08:00
- Compatibility with NCCL 2.15.5
- Unit test executable renamed to rccl-UnitTests
### Added
- HW-topology aware binary tree implementation
- Experimental support for MSCCL
- New unit tests for hipGraph support
- NPKit integration
### Fixed
- rocm-smi ID conversion
- Support for HIP_VISIBLE_DEVICES for unit tests
- Support for p2p transfers to non (HIP) visible devices
### Removed
2024-03-25 16:29:13 -06:00
- Removed TransferBench from tools. Exists in standalone repo: https://github.com/ROCm/TransferBench
2022-12-19 08:36:57 -08:00
2022-11-23 09:40:19 -07:00
## RCCL-2.13.4 for ROCm 5.4.0
### Changed
- Compatibility with NCCL 2.13.4
- Improvements to RCCL when running with hipGraphs
- RCCL_ENABLE_HIPGRAPH environment variable is no longer necessary to enable hipGraph support
- Minor latency improvements
### Fixed
- Resolved potential memory access error due to asynchronous memset
## RCCL-2.12.10 for ROCm 5.3.0
### Changed
- Improvements to LL128 algorithms
### Added
- Adding initial hipGraph support via opt-in environment variable RCCL_ENABLE_HIPGRAPH
- Integrating with NPKit (https://github.com/microsoft/NPKit) profiling code
## RCCL-2.12.10 for ROCm 5.2.3
2022-03-21 15:04:14 -06:00
### Added
2022-05-20 09:56:38 -06:00
- Compatibility with NCCL 2.12.10
2022-03-21 15:04:14 -06:00
- Packages for test and benchmark executables on all supported OSes using CPack.
2022-05-20 09:56:38 -06:00
- Adding custom signal handler - opt-in with RCCL_ENABLE_SIGNALHANDLER=1
- Additional details provided if Binary File Descriptor library (BFD) is pre-installed
2022-07-08 07:15:52 +08:00
- Adding support for reusing ports in NET/IB channels
- Opt-in with NCCL_IB_SOCK_CLIENT_PORT_REUSE=1 and NCCL_IB_SOCK_SERVER_PORT_REUSE=1
- When "Call to bind failed : Address already in use" error happens in large-scale AlltoAll
(e.g., >=64 MI200 nodes), users are suggested to opt-in either one or both of the options
to resolve the massive port usage issue
2022-11-23 09:40:19 -07:00
- Avoid using NCCL_IB_SOCK_SERVER_PORT_REUSE when NCCL_NCHANNELS_PER_NET_PEER is tuned >1
2022-05-20 09:56:38 -06:00
### Removed
- Removed experimental clique-based kernels
2022-11-23 09:40:19 -07:00
## RCCL-2.11.4 for ROCm 5.2.0
### Changed
- Unit testing framework rework
- Minor bug fixes
### Known issues
- Managed memory is not currently supported for clique-based kernels
2022-05-20 09:56:38 -06:00
## RCCL-2.11.4 for ROCm 5.1.0
### Added
- Compatibility with NCCL 2.11.4
### Known issues
- Managed memory is not currently supported for clique-based kernels
2022-03-21 15:04:14 -06:00
2022-01-10 11:34:31 -07:00
## RCCL-2.10.3 for ROCm 5.0.0
2021-06-03 10:56:02 -06:00
### Added
2022-01-10 11:34:31 -07:00
- Compatibility with NCCL 2.10.3
### Known issues
- Managed memory is not currently supported for clique-based kernels
## RCCL-2.9.9 for ROCm 4.5.0
2021-07-27 10:38:36 -06:00
### Changed
- Packaging split into a runtime package called rccl and a development package called rccl-devel. The development package depends on runtime. The runtime package suggests the development package for all supported OSes except CentOS 7 to aid in the transition. The suggests feature in packaging is introduced as a deprecated feature and will be removed in a future rocm release.
2022-01-10 11:34:31 -07:00
### Added
- Compatibility with NCCL 2.9.9
### Known issues
- Managed memory is not currently supported for clique-based kernels
2020-11-05 13:38:30 -07:00
2021-06-03 10:56:02 -06:00
## [RCCL-2.8.4 for ROCm 4.3.0]
### Added
- Ability to select the number of channels to use for clique-based all reduce (RCCL_CLIQUE_ALLREDUCE_NCHANNELS). This can be adjusted to tune for performance when computation kernels are being executed in parallel.
2021-05-06 09:50:07 -06:00
### Optimizations
- Additional tuning for clique-based kernel AllReduce performance (still requires opt in with RCCL_ENABLE_CLIQUE=1)
2021-06-03 10:56:02 -06:00
- Modification of default values for number of channels / byte limits for clique-based all reduce based on device architecture
2021-05-06 09:50:07 -06:00
### Changed
- Replaced RCCL_FORCE_ENABLE_CLIQUE to RCCL_CLIQUE_IGNORE_TOPO
- Clique-based kernels can now be enabled on topologies where all active GPUs are XGMI-connected
- Topologies not normally supported by clique-based kernels require RCCL_CLIQUE_IGNORE_TOPO=1
2021-06-08 13:00:40 -06:00
### Fixed
- Install script '-r' flag invoked alone no longer incorrectly deletes any existing builds.
2021-06-03 10:56:02 -06:00
### Known issues
- Managed memory is not currently supported for clique-based kernels
## [RCCL-2.8.4 for ROCm 4.2.0]
### Added
- Compatibility with NCCL 2.8.4
### Optimizations
- Additional tuning for clique-based kernels
- Enabling GPU direct RDMA read from GPU
- Fixing potential memory leak issue when re-creating multiple communicators within same process
- Improved topology detection
### Known issues
- None
2021-05-06 09:50:07 -06:00
2021-03-25 21:44:24 -06:00
## [RCCL-2.7.8 for ROCm 4.1.0]
2020-11-10 15:44:10 -07:00
### Added
- Experimental support for clique-based kernels (opt in with RCCL_ENABLE_CLIQUE=1)
- Clique-based kernels may offer better performance for smaller input sizes
- Clique-based kernels are currently only enabled for AllReduce under a certain byte limit (controlled via RCCL_CLIQUE_ALLREDUCE_BYTE_LIMIT)
2020-11-05 13:38:30 -07:00
### Optimizations
- Performance improvements for Rome-based systems
2020-11-10 15:44:10 -07:00
### Known issues
- Clique-based kernels are currently experimental and have not been fully tested on all topologies. By default, clique-based kernels are disabled if the detected topology is not supported (override with RCCL_FORCE_ENABLE_CLIQUE)
- Clique-based kernels may hang if there are differences between environment variables set across ranks.
- Clique-based kernels may fail if the input / output device pointers are not the base device pointers returned by hipMalloc.
2020-11-05 13:38:30 -07:00
## [RCCL-2.7.8 for ROCm 3.9.0]
### Added
- Adding support for alltoallv RCCL kernel
### Optimizations
- Modifications to topology based on XGMI links
### Known issues
- None
## [RCCL-2.7.6 for ROCm 3.8.0]
### Added
- Support for static library builds
### Known issues
- None
## [RCCL-2.7.6 for ROCm 3.7.0]
### Added
- Updated to RCCL API version of 2.7.6
- Added gather, scatter and all-to-all collectives
## [RCCL-2.7.0 for ROCm 3.6.0]
### Added
- Updated to RCCL API version of 2.6.4
## [RCCL-2.7.0 for ROCm 3.5.0]
### Added
- Compatibility with NCCL 2.6
- Network interface improvements with API v3
### Optimizations
- Fixing issues and built time improvements for hip-clang
- Network topology detection
- Improved CPU type detection
- Infiniband adaptive routing support
### Changed
- Switched to hip-clang as default compiler
### Deprecated
- Deprecated hcc build