From c5cdee4fa5a2ebe814beee359ba3d3aaf8b6dcd6 Mon Sep 17 00:00:00 2001
From: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
Date: Tue, 28 Oct 2025 13:41:22 -0600
Subject: [PATCH] Updated Changelog with 7.1.1 and 7.2.0 stub sections (#2008)

* Missing ROCm 7.0 & 7.1.0 Changelog entries (#1976)
* Update CHANGELOG.md
* Update CHANGELOG.md
* Apply suggestions from code review
Co-authored-by: Jeffrey Novotny
* Update CHANGELOG.md
* Update CHANGELOG.md
* Update CHANGELOG.md
---------
Co-authored-by: Jeffrey Novotny
* Added ROCm 7.2.0 section.
* Update CHANGELOG.md
* Apply suggestion from @corey-derochie-amd
---------
Co-authored-by: Jeffrey Novotny
[ROCm/rccl commit: 561ad2fe057d05fecb652f0bbfe351693fd979b3]
---
 projects/rccl/CHANGELOG.md | 38 ++++++++++++++++++++++++++++----------
 1 file changed, 28 insertions(+), 10 deletions(-)

diff --git a/projects/rccl/CHANGELOG.md b/projects/rccl/CHANGELOG.md
index f4169c11ba..238f819442 100644
--- a/projects/rccl/CHANGELOG.md
+++ b/projects/rccl/CHANGELOG.md
@@ -2,24 +2,35 @@
 Full documentation for RCCL is available at [https://rccl.readthedocs.io](https://rccl.readthedocs.io)
-## Unreleased - RCCL 2.27.7 for ROCm 7.1.0
+## Unreleased - RCCL 2.27.7 for ROCm 7.2.0
+
+## Unreleased - RCCL 2.27.7 for ROCm 7.1.1
+
+### Resolved Issues
+
+* Fixed crash when using the librccl-profiler plugin with the all-to-all collective after the 2.27 update.
+
+## RCCL 2.27.7 for ROCm 7.1.0
 ### Added
-* `RCCL_FORCE_ENABLE_DMABUF` added as a debugging feature if the user wants to explicitly enable DMABUF and forego system/kernel checks.
+* Added `RCCL_FORCE_ENABLE_DMABUF` as a debugging feature if the user wants to explicitly enable DMABUF and forego system/kernel checks.
 * Added `RCCL_P2P_BATCH_THRESHOLD` to set the message size limit for batching P2P operations. This mainly affects small message performance for alltoall at a large scale but also applies to alltoallv.
 * Added `RCCL_P2P_BATCH_ENABLE` to enable batching P2P operations to receive performance gains for smaller messages up to 4MB for alltoall when the workload requires it. This is to avoid performance dips for larger messages.
-* added `RCCL_CHANNEL_TUNING_ENABLE` to enable channel tuning that overrides RCCL's internal adjustments based on threadThreshold.
+* Added `RCCL_CHANNEL_TUNING_ENABLE` to enable channel tuning that overrides RCCL's internal adjustments based on `threadThreshold`.
 ### Changed
 * The MSCCL++ feature is now disabled by default. The `--disable-mscclpp` build flag is replaced with `--enable-mscclpp` in the `rccl/install.sh` script.
-* Compatibility with NCCL 2.27.7
+* Compatibility with NCCL 2.27.7.
-### Resolved issues
-* Improve small message performance for alltoall by enabling and optimizing batched P2P operations.
+### Optimized
+* Enabled and optimized batched P2P operations to improve small message performance for AllToAll and AllGather.
+* Optimized channel count selection to improve efficiency for small to medium message sizes in ReduceScatter.
+* Changed code inlining to improve latency for small message sizes for AllReduce, AllGather, and ReduceScatter.
 ### Known issues
 * Symmetric memory kernels are currently disabled due to ongoing CUMEM enablement work.
+* When running this version of RCCL using ROCm versions earlier than 6.4.0, the user must set the environment flag `HSA_NO_SCRATCH_RECLAIM=1`.
 ## RCCL 2.26.6 for ROCm 7.0.0
@@ -29,6 +40,7 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
 * Fixed unit test failures in tests ending with `ManagedMem` and `ManagedMemGraph` suffixes.
 * Suboptimal algorithmic switching point for AllReduce on MI300x.
 * Fixed the known issue "When splitting a communicator using `ncclCommSplit` in some GPU configurations, MSCCL initialization can cause a segmentation fault." with a design change to use `comm` instead of `rank` for `mscclStatus`. The Global map for `comm` to `mscclStatus` is still not thread safe but should be explicitly handled by mutexes for read writes. This is tested for correctness, but there is a plan to use a thread-safe map data structure in upcoming changes.
+* Fixed broken functionality within the LL protocol on gfx950 by disabling inlining of LLGenericOp kernels.
 ### Added
@@ -47,10 +59,16 @@ Full documentation for RCCL is available at [https://rccl.readthedocs.io](https:
 ### Changed
-* Compatibility with NCCL 2.23.4
-* Compatibility with NCCL 2.24.3
-* Compatibility with NCCL 2.25.1
-* Compatibility with NCCL 2.26.6
+* Compatibility with NCCL 2.23.4.
+* Compatibility with NCCL 2.24.3.
+* Compatibility with NCCL 2.25.1.
+* Compatibility with NCCL 2.26.6.
+
+### Optimized
+* Improved the performance of the `FP8` Sum operation by upcasting to `FP16`.
+
+### Known Issues
+* When running this version of RCCL using ROCm versions earlier than 6.4.0, the user must set the environment flag `HSA_NO_SCRATCH_RECLAIM=1`.
 ## RCCL 2.22.3 for ROCm 6.4.2