Commit graph

2015 commits

Author SHA1 Message Date
Atul Kulkarni 63aa3bb537 Remove legacy Shm and P2p tests (#2089)
These tests will be replaced by MPI tests.

[ROCm/rccl commit: 0d797d1f6c]
2025-12-05 16:53:28 -06:00
Atul Kulkarni 86a4dd95f6 Remove static to non-static conversion used in tests (#2084)
* Remove coll_reg tests which are unsupported

* removed static to non-static conversion feature

[ROCm/rccl commit: 7ec8e73e12]
2025-12-04 18:03:14 -06:00
Atul Kulkarni a364ada6e7 Add missing header in alloc.h (#2086)
[ROCm/rccl commit: 892d258319]
2025-12-04 11:26:19 -06:00
Atul Kulkarni 0ced7aede8 Fix rccl test suite to use hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2082)
[ROCm/rccl commit: cc6e259a02]
2025-12-04 10:02:06 -06:00
Atul Kulkarni e4aef19511 Added new unit tests for AllReduce with Bias API (#2036)
* Added new unit tests for AllReduce with Bias API

* Address review comments

[ROCm/rccl commit: 7c12b0b76b]
2025-12-03 17:37:34 -06:00
Wenkai Du 3e650467fa Use one side stream per process (#2063)
* Use one side stream per process

* Handle multiple GPUs per process

* Reset stream when not found

* Address review comments

* Fix missing mutex initializer

[ROCm/rccl commit: 185e78a8f0]
2025-12-02 10:03:15 -08:00
corey-derochie-amd 8e3f60e080 Add copyright to src/device/symmetric/all_reduce.cuh (#2080)
[ROCm/rccl commit: 4acd0f64ea]
2025-11-27 14:29:21 -07:00
isaki001 cf11e2f39f add back missing proxy-counter updates (#2052)
[ROCm/rccl commit: da183596cd]
2025-11-25 15:22:34 -06:00
Kapil S. Pawar 566671910a [RcclReplayer] JSON <-> BIN log format conversion tool (#2056)
* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer

[ROCm/rccl commit: 5fd86021a8]
2025-11-24 11:51:36 -06:00
Nilesh M Negi 8c928e60f9 [AzureCI] Increase timeout of per PR and nightly pipeline to 240 mins (#2074)
[ROCm/rccl commit: db52690c2a]
2025-11-24 10:55:36 -06:00
AbandiGa d6087d0d62 Fix rcclNetP2pPolicy issue (#2072)
* fix rcclNetP2pPolicy issue

* change the comment to ncclNetIb

[ROCm/rccl commit: b14e32c46e]
2025-11-21 18:28:10 -06:00
Nilesh M Negi 3026698c40 [AzureCI] Increase UT timeout to 3 hours (#2070)
[ROCm/rccl commit: fca0015962]
2025-11-21 09:18:49 -06:00
Matt Williams 7456dc7d17 Fix ToC in API Library page (#2053)
* Add intro and remove ToC

[ROCm/rccl commit: 3495baa6b2]
2025-11-20 09:35:15 -05:00
Kapil S. Pawar 3f12f2f735 [AzureCI] Add RcclReplayer to CI (#2048)
[ROCm/rccl commit: ab1eb6e70e]
2025-11-18 12:21:24 -06:00
Kapil S. Pawar acb0d614a5 Functional Tests for Ext-Profiler Plugin (#2007)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Added tests for ext-profiler

* Updated readme

* Updated readme

[ROCm/rccl commit: c7f400dbff]
2025-11-18 11:20:39 -06:00
Arm Patinyasakdikul f81bb04bff Added install.sh flag to suppress warnings. (#2054)
[ROCm/rccl commit: 461e61d10e]
2025-11-17 00:35:06 -06:00
Pedram Alizadeh 3d2fc04b45 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0

[ROCm/rccl commit: fb67e5b467]
2025-11-13 15:56:18 -05:00
AbandiGa 7f7c8d14f6 Disable bfloat16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>

[ROCm/rccl commit: 277b6e9bac]
2025-11-13 14:55:09 -06:00
isaki001 9a81823515 Post thread-block size increase tuning (#2042)
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank

* don't override direct allgather threshold if set to -1

* restore 2-node AR simple at earlier message sizes than higher multi-node AR

* extend range of LL for single-node RS on gfx950

* update algo/proto for multi-node allreduce on gfx942

* set single-node AR on gfx950 to Tree LL for KB message sizes

* decrease threshold for single node Tree for gfx950 AR

[ROCm/rccl commit: 0d09f86608]
2025-11-13 14:51:04 -06:00
Bertan Dogancay 48f37be1e3 [Launch] Move cudaEventRecord call to capturing stream only (#2050)
[ROCm/rccl commit: 83ffc82fa7]
2025-11-13 08:38:09 -06:00
gilbertlee-amd 22d9a038a2 [GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031)
[ROCm/rccl commit: 46b032b760]
2025-11-12 19:34:27 -06:00
nawrinsu cac8dc67fd Add tuner config file (2,4,8 nodes) for gfx950 (#2012)
* Add tuner config file (2,4,8 nodes) for gfx950

* remove alltoall

* Added comment regarding allgather direct

[ROCm/rccl commit: c488c5307e]
2025-11-12 09:16:36 -08:00
Kapil S. Pawar c4d7680749 Added Functional Tests for CSV Tuner Plugin (#1968)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Updated readme

* Updated readme

* Updated readme

[ROCm/rccl commit: c8da880dc7]
2025-11-11 10:11:19 -06:00
Dingming Wu 0d3fba9a22 Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027)
[ROCm/rccl commit: b811645688]
2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea 3da73a7526 Fix compilation when enabling indirect function calls (#1994)
Fix compilation when enabling indirect function calls.

[ROCm/rccl commit: 1678bb9ae7]
2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar b12399898d Reduce LL threshold for a2a (#2032)
[ROCm/rccl commit: 52f9526bd6]
2025-11-10 19:14:23 -05:00
Kapil S. Pawar 6bbc4b5d48 [RcclReplayer] Compile without the need for RCCL to be compiled (#2039)
[ROCm/rccl commit: acdafac49f]
2025-11-10 15:38:48 -06:00
Dingming Wu 23870ceccd Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023)
* Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x
Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that.

* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 05f914c997]
2025-11-10 11:54:35 -06:00
Dingming Wu c601f9b3f8 Increment opCount for intra-node comms as well (#2024)
* Enhance logging in NCCL initialization
It's convenient to log comms obj and default channels together for debugging

* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)

* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.

* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.

* Remove duplicate INFO logging in init.cc

[ROCm/rccl commit: b00ee4c83c]
2025-11-10 11:23:49 -06:00
Bertan Dogancay b955a7df40 [GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030)
[ROCm/rccl commit: b1e680adc0]
2025-11-07 15:15:25 -05:00
Bertan Dogancay 524453baea [Launch] Enable Implicit order launch with serial mode (#2033)
[ROCm/rccl commit: a9bb7e9807]
2025-11-07 13:29:53 -05:00
Avinash 5ca67dc803 Empty kernel test enhancements [tools] (#1999)
* Initial commit

* Improvements-1

* Initial commit for PR

* Updates warning, run.sh, decoupled loops

* Forcing seq cst for CPU timing

[ROCm/rccl commit: 85baa0d113]
2025-11-07 12:28:06 -06:00
Ghadeer Ahmed H Alabandi 5b66480595 [NET] Enable capping the number of QPs created for send/recv colls (#1998)
[ROCm/rccl commit: 45991fadad]
2025-11-07 00:47:01 +00:00
alex-breslow-amd bd614458c3 [gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017)
* Internal benchmarking shows nice single-node performance uplift for MI300A and MI350

[ROCm/rccl commit: 56e0b4e445]
2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul 25005c1cce proxy: handle progressOps return code properly. (#2029)
[ROCm/rccl commit: d6a53d2022]
2025-11-04 09:09:50 -06:00
Aravind Ravikumar 4babb01f4d Add S3 upload support for Perf and test reports by run ID and architecture (#2020)
* Commits to enable scp report copy

* Added Post report upload step

* Added extra arg for fetch artifacts

* Moved to a specific commit

* Add write permissions to s3

* Added comment for TheRock sha commit date

---------

Co-authored-by: arravikum <arravikum@amd.com>

[ROCm/rccl commit: 07f8f6d6c6]
2025-11-03 19:09:34 -05:00
nawrinsu 6d22ce9b1a Fix protocol and channel override when tuner is used (#1985)
* Fix protocol and channel override when tuner is used

* Added comment

* Fix README for basic tuner implementation

[ROCm/rccl commit: 166268d715]
2025-11-03 13:56:34 -08:00
Ahmed Khan acf64be514 Remove duplicate MaxEmptyLinesToKeep from clang-format (#2016)
[ROCm/rccl commit: caffd013f6]
2025-11-02 21:44:27 -06:00
Nilesh M Negi dd625edf56 Revert "[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006)" (#2021)
This reverts commit 40f3faead0.

[ROCm/rccl commit: 62ab7a22d7]
2025-10-31 10:04:12 -05:00
David DeBonis 3e750f0f57 Single-node AllGather and ReduceScatter Optimization (#2019)
* Single-node performance tuning

* Normalizing value to individual rank

[ROCm/rccl commit: 63d5846452]
2025-10-31 08:59:46 -06:00
Arm Patinyasakdikul 54194a17c3 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

New ERROR message class will print the message at all debug levels,
including none.

Change some of the fatal error messages to be in ERROR instead of WARN.

Added new error handler function to print out more meaningful error
message in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 1ce83d5cc0]
2025-10-30 16:14:20 -05:00
Arm Patinyasakdikul 03e92dc942 Added copyrights for Palamida scan 7.2. (#2018)
[ROCm/rccl commit: 84fdcab68a]
2025-10-30 13:33:20 -05:00
isaki001 9bccbcd619 P2p batching hang-fix (#2011)
* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32 nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default

[ROCm/rccl commit: 641c0eb51c]
2025-10-30 13:32:01 -05:00
isaki001 678366f5e2 gfx950 multi-node tuning for LL/LL128 (#1953)
* increased LL threshold for gfx950 AR to 256KB

* AG/RS proto threshold update

[ROCm/rccl commit: 72996e4d9f]
2025-10-30 12:08:12 -05:00
Bertan Dogancay 40f3faead0 [GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006)
[ROCm/rccl commit: bed7cdf863]
2025-10-30 11:45:53 -04:00
Nilesh M Negi 03d37f6305 Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003)
[ROCm/rccl commit: 8444b3c6e9]
2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar eb0b1387b7 [Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978)
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to keep the barrier_generic logic from using the fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function

[ROCm/rccl commit: 12f51ba8bf]
2025-10-29 23:24:32 -05:00
Bertan Dogancay 4c7afea115 [Tools/Replayer] Fix prohibited calls during capture mode (#1938)
[ROCm/rccl commit: b703ffdfa4]
2025-10-29 12:19:32 -04:00
corey-derochie-amd c5cdee4fa5 Updated Changelog with 7.1.1 and 7.2.0 stub sections (#2008)
* Missing ROCm 7.0 & 7.1.0 Changelog entries (#1976)

* Update CHANGELOG.md

* Update CHANGELOG.md

* Apply suggestions from code review

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

* Update CHANGELOG.md

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Added ROCm 7.2.0 section.

* Update CHANGELOG.md

* Apply suggestion from @corey-derochie-amd

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 561ad2fe05]
2025-10-28 13:41:22 -06:00
Atul Kulkarni f2287e8f97 Removed RCCL_EXPOSE_STATIC duplicate definition. (#1988)
[ROCm/rccl commit: cc867dbaf2]
2025-10-28 13:01:48 -05:00