68 Commits

Author SHA1 Message Date
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
Mustafa Abduljabbar 2621e0254e [Device] WarpSpeed enablement and single node CU and perf opt for MI350 (#2073)
[ROCm/rccl commit: d009ab144e]
2025-12-11 19:04:35 -05:00
AbandiGa 7f7c8d14f6 Disable bfloat16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>

[ROCm/rccl commit: 277b6e9bac]
2025-11-13 14:55:09 -06:00
Ghadeer Ahmed H Alabandi 5b66480595 [NET] Enable capping the number of QPs created for send/recv colls (#1998)
[ROCm/rccl commit: 45991fadad]
2025-11-07 00:47:01 +00:00
Arm Patinyasakdikul 54194a17c3 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

New ERROR message class will print the message in all debug level,
including none.

Change some of the fatal error message to be in ERROR instead of WARN.

Added new error handler function to print out more meaningful error
message in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separated class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 1ce83d5cc0]
2025-10-30 16:14:20 -05:00
isaki001 9bccbcd619 P2p batching hang-fix (#2011)
* prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default

[ROCm/rccl commit: 641c0eb51c]
2025-10-30 13:32:01 -05:00
corey-derochie-amd c5cdee4fa5 Updated Changelog with 7.1.1 and 7.2.0 stub sections (#2008)
* Missing ROCm 7.0 & 7.1.0 Changelog entries (#1976)

* Update CHANGELOG.md

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Added ROCm 7.2.0 section.

* Update CHANGELOG.md

* Apply suggestion from @corey-derochie-amd

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 561ad2fe05]
2025-10-28 13:41:22 -06:00
mberenjk 96c62b091d Add support for additional paths in RCCL DMABUF kernel configuration loading (#1825)
* Adding more path to the kernel load and an environment variable to force enable DMABUF

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: b58f234539]
2025-10-20 13:35:22 -07:00
isaki001 6d151d4e21 gfx950 channel tuning for ReduceScatter and AllGather (#1940)
* add channel thresholds to override channel-count adjustments

[ROCm/rccl commit: 0f99fd84a3]
2025-10-14 09:50:44 -05:00
BertanDogancay 2a4e4308b0 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 3f94267f21]
2025-10-06 18:36:49 -04:00
Mustafa Abduljabbar a075779dcd Use batched P2P to enhance alltoall small message performance (#1902)
* Batch P2P operations (2 per CU/channel) and update channel-part mapping

- Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs

- P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes

* Address single node regression and channel per net peer

* Add batching threshold

* Add enable switch for batching

* Update CHANGELOG.md

* Add minor comment change

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: c1e1f2faeb]
2025-09-22 16:25:10 -04:00
Mustafa Abduljabbar 1a7ab8dfc8 Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md

[ROCm/rccl commit: 7ccc6f268f]
2025-09-03 08:54:13 -04:00
BertanDogancay 881327184e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 08a7be231b]
2025-08-28 15:46:28 -05:00
Avinash 832c5b1f13 [build] Disable MSCCL++ compilation by default (#1879)
* Enable MSCCLPP on request

* Updating docs and README

* Updates to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Updates to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update CHANGELOG.md

Github didn't take the edit to my suggestion properly.

---------

Co-authored-by: amd <amd@super3.amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: a0ec15bafe]
2025-08-28 08:52:12 -06:00
Mustafa Abduljabbar f37f290134 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 277747c199]
2025-08-26 15:03:54 -04:00
Nusrat Islam 1af94eee8d Add direct allgather algorithm (#1868)
* add direct allgather algorithm

* minor fix

* add debug print for memory allocation tracker

* add message size threshold for direct allgather

* scatter transfers across ranks

* update changelog

* minor fix

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* enable direct AG when pxn is ON on MI300X or MI350

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 5e7937effb]
2025-08-25 07:55:10 -05:00
Mustafa Abduljabbar b3a0cc5e96 Add optional bf16 software-triggered pipelining for reduceCopyPacks (#1758)
- Introduced double-buffering to reduce copy overhead and overlap BF16 arithmetic with data prefetching.
- Aimed to improve performance of reduction-based collectives by up to 10%.
- Implemented based on recommendations from Guennadi Riguer (AMD)
- Added --force-reduce-pipeline option to install.sh to activate this optimization for BF16 reductions.
- Feature is disabled by default to prevent regressions with large messages until auto-tuning logic is upstreamed.
---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>

[ROCm/rccl commit: 0ce20e7e07]
2025-07-25 10:57:05 -04:00
Wenkai Du caff9764d3 Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 4cadf3597c.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: 9a4213356d]
2025-07-23 09:04:17 -07:00
corey-derochie-amd 37ab47fab4 Updated CHANGELOG for LL128 support for gfx942 in 7.0 (#1719)
* Updated CHANGELOG for LL128 support for gfx942 in 7.0

Also ported 6.4.2 section

* Removed unnecessary note from 7.0

[ROCm/rccl commit: e73db11819]
2025-06-23 08:50:12 -06:00
BertanDogancay c0c9312e38 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: aaf023976a]
2025-06-20 07:54:49 -05:00
Nilesh M Negi 4cadf3597c [DEVICE] Adding ability to choose unroll factor at runtime (#1734)
* Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR
* [BUILD] Add support for user-defined UNROLL for debugging
* Update CHANGELOG.md
* Fix COLLTRACE errors in CI
* Add debug statements for unroll and resolve warnings
* Incorporate UNROLL into ONLY_FUNCS for debugging

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 9d72be7b2f]
2025-06-11 00:07:59 -05:00
Nilesh M Negi 7abc3160e7 [BUILD] Enable LL128 on gfx950 (#1731)
* [BUILD] Enable LL128 on gfx950
* Modify comment in src/rccl_wrap.cc
* Update CHANGELOG

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: ef5b4ff630]
2025-06-09 00:25:54 -05:00
Pedram Alizadeh 1ace5d05ed Reapplying PR #1641 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1713)
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"

This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.

* Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8

[ROCm/rccl commit: 3f7c08648f]
2025-06-04 13:22:11 -04:00
Avinash a50ff2c3d3 SPLITCOMM design fix in src/misc/msccl (#1715)
* Fix TOC-TOU in mcclInit

* Improving vector resize thread safety

* Initial commit rank to comm change

* Removing unwanted include header changes

* Updated CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: e94b360246]
2025-06-01 21:00:38 -05:00
Nilesh M Negi 19ed482121 Re-apply unroll=1 and 112 channels for gfx950 (#1706)
* Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
This reverts commit a6972c0d09.

* Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
This reverts commit 1a2eca1756.

[ROCm/rccl commit: 12517a957e]
2025-05-28 14:58:10 -05:00
corey-derochie-amd 22120c6303 Fixed errors in the CHANGELOG for ROCm 7.0 (#1702)
* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix

[ROCm/rccl commit: 7b633d5844]
2025-05-23 15:47:59 -05:00
PedramAlizadeh a99f960742 Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 951ed9cde1.


[ROCm/rccl commit: 7f878baef0]
2025-05-21 20:21:27 -05:00
Arm Patinyasakdikul 1313bccaca CHANGELOG.md: Add UT failures as known issue for 6.4.1. (#1698)
* CHANGELOG.md: Add UT failures as known issue for 6.4.1.

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 1710c27e77]
2025-05-19 10:40:50 -05:00
Arm Patinyasakdikul 3e16753c71 Added known issue for 6.4.1 release to CHANGELOG.md. (#1697)
[ROCm/rccl commit: e602497789]
2025-05-16 08:17:48 -05:00
Arm Patinyasakdikul 4b5ff98d65 Change GPU references to gfx950. (#1695)
[ROCm/rccl commit: f306c00671]
2025-05-15 10:32:46 -05:00
Mustafa Abduljabbar 951ed9cde1 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)
* Update LL128 elems per thread

* Precompute ix[g] in LL128 prim

* Make Threadthreshold part of tuning models

* Ignore channel tuning when channels are env controlled

* Tune LL128 max limit for AG

* Tune LL128 max limit for RS

* Retune AR LL128 limits due to changes

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 00c1eb098c]
2025-05-14 14:35:54 -05:00
Mustafa Abduljabbar 128b0e7074 Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support

[ROCm/rccl commit: d665547eef]
2025-05-13 17:07:03 -05:00
gilbertlee-amd 6e57154001 Fix when more than 64 channels are used for multi-collective group calls (#1688)
* Fix when more than 64 channels are used for multi-collective group calls

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 9ef45df8f7]
2025-05-12 18:05:57 -05:00
Nilesh M Negi a6972c0d09 Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 210f90ae0f.

* Update Changelog

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 329e13efff]
2025-04-30 23:33:08 -05:00
Mustafa Abduljabbar a85cfaa680 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML


[ROCm/rccl commit: aa7991dfc8]
2025-04-24 09:02:03 -04:00
Istvan Kiss 858fa4e65d Add documentation for NPS4 and CPX partition modes (#1555)
[ROCm/rccl commit: 28ab8603d2]
2025-03-31 09:25:25 -06:00
Nilesh M Negi 1a2eca1756 Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)"

This reverts commit cf17cff5b6.

* [DOC] Update Changelog

* [DOC] Update CHANGELOG

[ROCm/rccl commit: b17338d164]
2025-03-28 17:57:06 -05:00
gilbertlee-amd 4ca7e6873e Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES


[ROCm/rccl commit: ddc5d58b93]
2025-02-20 15:18:29 -07:00
Mustafa Abduljabbar f58025185e Add IB verbs logging and enable traces through install.sh (#1511)
* Add IB Verbs logging

* Simplify tracing and undo debug.h changes

* Update debug.h

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

* Exchange remote comm device index

[ROCm/rccl commit: dc75209dd7]
2025-01-31 12:35:39 -05:00
BertanDogancay 1b000665df Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 36343be84f]
2025-01-23 12:08:46 -06:00
corey-derochie-amd ebacc24598 Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options (#1418)
* Added RCCL env params to control setting the SO_REUSEADDR and SO_LINGER socket options. This can allow control over the number of file descriptors created during bootstrapping.

* Casted the linger value to `int` sooner to avoid a scope of unknown typed-ness.

* Added CHANGELOG entry for this feature.

[ROCm/rccl commit: 2e35417fe5]
2025-01-14 10:26:04 -07:00
Jeffrey Novotny cc9209f770 Update rccl changelog for 6.3.1 (#1433)
* Update rccl changelog for 6.3.1

* Fix version number

* Correct RCCL release version

* Added details to 6.3.0 changelog

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: e42f10a361]
2024-11-26 08:46:37 -05:00
corey-derochie-amd 1c700083b2 Update CHANGELOG to match release branches 6.2 and 6.3 (#1391)
* [CHANGELOG] Add Known issues for ROCm 6.2.1

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Updated 6.2.1 known issues to match the content in develop.

* Updated CHANGELOG for ROCm 6.3 release. (#1380)

* Updated CHANGELOG for ROCm 6.3 release.

* Update CHANGELOG to new format.

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 6ed513e1b9]
2024-10-23 13:49:40 -06:00
Sandra Polifroni 53478f138e Updated the information for 6.2.1 in the changelog so that it reflects what's in the 6.2.1 release notes
[ROCm/rccl commit: 7f87b0cd85]
2024-09-23 14:27:58 -04:00
Bertan Dogancay ed152c5b89 Update CHANGELOG.md for RCCL 2.20.5 (#1150)
[ROCm/rccl commit: dcc75797a1]
2024-04-24 09:07:49 -06:00
corey-derochie-amd 34fb1007a7 Updated CHANGELOG for next release (#1146)
* Updated CHANGELOG to release for ROCm 6.1.0 (#1142)

* Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 (#1141)

* Update CHANGELOG.md for ROCm release 5.5

(cherry picked from commit 83342e865445b233319466d4a620c1166ecaf181)

* Update CHANGELOG.md for ROCm 5.7.0

(cherry picked from commit a7c3b8dcb5cd0654f0a39cb3be4fdf7e8c820577)

* Added ROCm 6.0 and 6.1 CHANGELOG notes.

---------

Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
(cherry picked from commit 28a2b09304)

* Updated CHANGELOG to release for ROCm 6.1.0

* Removed empty sections from CHANGELOG in latest releases.

(cherry picked from commit 164c9553717f2c3bce86a372764ea73030dd5f72)

* Reverted ROCm 6.1.0 block to "Unreleased"

[ROCm/rccl commit: a14137c062]
2024-04-15 16:29:40 -06:00
gilbertlee-amd 422a7ffcbb Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)

[ROCm/rccl commit: 4cb62f999a]
2024-04-15 12:03:57 -06:00
corey-derochie-amd 28a2b09304 Fixed missing CHANGELOG notes from ROCm 5.5 through unreleased 6.1 (#1141)
* Update CHANGELOG.md for ROCm release 5.5

(cherry picked from commit 83342e865445b233319466d4a620c1166ecaf181)

* Update CHANGELOG.md for ROCm 5.7.0

(cherry picked from commit a7c3b8dcb5cd0654f0a39cb3be4fdf7e8c820577)

* Added ROCm 6.0 and 6.1 CHANGELOG notes.

---------

Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>

[ROCm/rccl commit: 3361abe786]
2024-04-11 15:04:40 -06:00
corey-derochie-amd 62a6a07d49 Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. (#1125)
[ROCm/rccl commit: 503a472a25]
2024-03-25 16:29:13 -06:00
Wenkai Du 393d0ba7f8 Add back __syncthreads() in barrier and adjust stack size (#688)
[ROCm/rccl commit: 1c166046a2]
2023-02-18 08:50:31 -08:00