1035 Commits

Author SHA1 Message Date
Atul Kulkarni 892d258319 Add missing header in alloc.h (#2086) 2025-12-04 11:26:19 -06:00
Wenkai Du 185e78a8f0 Use one side stream per process (#2063)
* Use one side stream per process

* Handle multiple GPUs per process

* Reset stream when not found

* Address review comments

* Fix missing mutex initializer
2025-12-02 10:03:15 -08:00
corey-derochie-amd 4acd0f64ea Add copyright to src/device/symmetric/all_reduce.cuh (#2080) 2025-11-27 14:29:21 -07:00
isaki001 da183596cd add back missing proxy-counter updates (#2052) 2025-11-25 15:22:34 -06:00
AbandiGa b14e32c46e Fix rcclNetP2pPolicy issue (#2072)
* fix rcclNetP2pPolicy issue

* change the comment to ncclNetIb
2025-11-21 18:28:10 -06:00
Matt Williams 3495baa6b2 Fix ToC in API Library page (#2053)
* Add intro and remove ToC
2025-11-20 09:35:15 -05:00
Pedram Alizadeh fb67e5b467 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0
2025-11-13 15:56:18 -05:00
AbandiGa 277b6e9bac Disable BFloat16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>
2025-11-13 14:55:09 -06:00
isaki001 0d09f86608 Post thread-block size increase tuning (#2042)
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank

* don't override direct allgather threshold if set to -1

* restore 2-node AR simple at earlier message sizes than higher multi-node AR

* extend range of LL for single-node RS on gfx950

* update algo/proto for multi-node allreduce on gfx942

* set single-node AR on gfx950 to Tree LL for KB message sizes

* decrease threshold for single node Tree for gfx950 AR
2025-11-13 14:51:04 -06:00
Bertan Dogancay 83ffc82fa7 [Launch] Move cudaEventRecord call to capturing stream only (#2050) 2025-11-13 08:38:09 -06:00
gilbertlee-amd 46b032b760 [GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031) 2025-11-12 19:34:27 -06:00
Dingming Wu b811645688 Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027) 2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea 1678bb9ae7 Fix compilation when enabling indirect function calls (#1994)
Fix compilation when enabling indirect function calls.
2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar 52f9526bd6 Reduce LL threshold for a2a (#2032) 2025-11-10 19:14:23 -05:00
Kapil S. Pawar acdafac49f [RcclReplayer] Compile without the need for RCCL to be compiled (#2039) 2025-11-10 15:38:48 -06:00
Dingming Wu 05f914c997 Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023)
* Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x
Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that.

* Update src/init.cc to use ERROR instead of WARN
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2025-11-10 11:54:35 -06:00
Dingming Wu b00ee4c83c Increment opCount for intra-node comms as well (#2024)
* Enhance logging in NCCL initialization
It's convenient to log comms obj and default channels together for debugging

* Add opCount to collDevWork and update increment logic
Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms)

* Clarify opCount increment logic in enqueue.cc
Updated comment to clarify incrementing opCount for intranode communications.

* Refactor NCCL_INIT logging format
Updated logging format for NCCL_INIT to improve clarity.

* Remove duplicate INFO logging in init.cc
2025-11-10 11:23:49 -06:00
Bertan Dogancay b1e680adc0 [GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030) 2025-11-07 15:15:25 -05:00
Bertan Dogancay a9bb7e9807 [Launch] Enable Implicit order launch with serial mode (#2033) 2025-11-07 13:29:53 -05:00
Ghadeer Ahmed H Alabandi 45991fadad [NET] Enable capping the number of QPs created for send/recv colls (#1998) 2025-11-07 00:47:01 +00:00
alex-breslow-amd 56e0b4e445 [gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017)
* Internal benchmarking shows nice single-node performance uplift for MI300A and MI350
2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul d6a53d2022 proxy: handle progressOps return code properly. (#2029) 2025-11-04 09:09:50 -06:00
nawrinsu 166268d715 Fix protocol and channel override when tuner is used (#1985)
* Fix protocol and channel override when tuner is used

* Added comment

* Fix README for basic tuner implementation
2025-11-03 13:56:34 -08:00
Nilesh M Negi 62ab7a22d7 Revert "[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006)" (#2021)
This reverts commit bed7cdf863.
2025-10-31 10:04:12 -05:00
David DeBonis 63d5846452 Single-node AllGather and ReduceScatter Optimization (#2019)
* Single-node performance tuning

* Normalizing value to individual rank
2025-10-31 08:59:46 -06:00
Arm Patinyasakdikul 1ce83d5cc0 Added ERROR message class to handle fatal error messages. (#2002)
* Added ERROR message class to handle fatal error messages.

New ERROR message class prints the message at all debug levels,
including none.

Change some of the fatal error messages to use ERROR instead of WARN.

Added a new error handler function to print more meaningful error
messages in the future.

* Added CHANGELOG entry.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Change to no longer reuse NONE as ERROR. ERROR is now a separate class.

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-10-30 16:14:20 -05:00
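
The behavior described in the commit above (an ERROR class that prints at every debug level, including none, and no longer reuses NONE) can be illustrated with a minimal sketch. The enum names and the `shouldPrint` helper below are hypothetical and chosen for illustration only; they are not RCCL's actual identifiers or implementation.

```cpp
#include <cassert>

// Hypothetical debug levels, ordered from least to most verbose.
enum class DebugLevel { None = 0, Warn = 1, Info = 2, Trace = 3 };

// Hypothetical message classes. Error is its own class and does not reuse None.
enum class MsgClass { Error, Warn, Info, Trace };

// Returns true when a message of class `cls` should be printed at `level`.
bool shouldPrint(MsgClass cls, DebugLevel level) {
  switch (cls) {
    case MsgClass::Error: return true;  // printed at every level, including None
    case MsgClass::Warn:  return level >= DebugLevel::Warn;
    case MsgClass::Info:  return level >= DebugLevel::Info;
    case MsgClass::Trace: return level >= DebugLevel::Trace;
  }
  return false;
}

// Usage sketch: a fatal error is shown even when debugging is off,
// while a warning at the same level is suppressed.
void errorClassDemo() {
  assert(shouldPrint(MsgClass::Error, DebugLevel::None));
  assert(!shouldPrint(MsgClass::Warn, DebugLevel::None));
}
```

Keeping Error as a separate class, rather than overloading the "none" level, means the level check for the other classes stays a simple ordered comparison.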
Arm Patinyasakdikul 84fdcab68a Added copyrights for Palamida scan 7.2. (#2018) 2025-10-30 13:33:20 -05:00
isaki001 641c0eb51c P2p batching hang-fix (#2011)
* prevent batching when send/recv bytes don't match, restore bit reversal for channel to part mapping, prevent batching beyond 32 nodes

* correct computation for channel to part mapping

* update changelog

* disabling p2p-batching by default
2025-10-30 13:32:01 -05:00
isaki001 72996e4d9f gfx950 multi-node tuning for LL/LL128 (#1953)
* increased LL threshold for gfx950 AR to 256KB

* AG/RS proto threshold update
2025-10-30 12:08:12 -05:00
Bertan Dogancay bed7cdf863 [GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006) 2025-10-30 11:45:53 -04:00
Nilesh M Negi 8444b3c6e9 Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003) 2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar 12f51ba8bf [Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978)
* Add initial commit to increase tb size to 512
* Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used
Adding a constant to the barrier_generic logic to keep it from using the fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X
* Adjust nthreads for LL
* Opt threads for reduce_scatter upper small range
* Add macro for single node
* Restrict MSCCL to 256 threads to prevent mem access fault
* Support pre-MI350 compatibility
* Partially refactor threadblock size override
* Use const macros instead of numerals
* opt out of unused function
2025-10-29 23:24:32 -05:00
alex-breslow-amd e69b11eba5 Remove nontemporality from stores, put in casts to global address space (#1982)
* Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores
* Removes nontemporality from some key stores for gfx950.
2025-10-28 10:34:48 -07:00
mberenjk b58f234539 Add support for additional paths in RCCL DMABUF kernel configuration loading (#1825)
* Adding more paths to the kernel load and an environment variable to force-enable DMABUF

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-10-20 13:35:22 -07:00
Nilesh M Negi c35bc721ad Fix ncclDevFuncId for AllReduceWithBias (#1980) 2025-10-17 09:28:57 -05:00
Arm Patinyasakdikul 58eca5d7f8 Disable graph mode memory registration and UBR as unsupported feature. (#1977) 2025-10-17 09:18:39 -05:00
Rahul Vaidya 624f68b2b2 [Profiler plugin] Fix segfault issue with profiler plugin (#1973)
* Fix profiler plugin segfault by correctly setting p2p->func

* Look for librccl-profiler.so instead of libnccl-profiler.so

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
Co-authored-by: Yongjie Qiu <Yongjie.Qiu@amd.com>
2025-10-16 16:33:18 -05:00
alex-breslow-amd 154350baaf MSCCL: Unland PR1788 + Fix for MSCCL Data Corruption (#1960)
- Earlier fix PR1788 is no longer necessary after ROCr fix and pre-ROCr fix workaround
- Inserts an s_waitcnt vmcnt(0), which fixes a data corruption issue in MSCCL
2025-10-15 10:32:25 -07:00
gilbertlee-amd fedddb452c Enabling gdrcopy option for gfx950 (#1955) 2025-10-15 10:55:25 -06:00
alex-breslow-amd c70f5b4621 [gfx950] Make bypassing __threadfence the default for multinode. (#1947)
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950.  Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
2025-10-15 09:15:36 -07:00
isaki001 0f99fd84a3 gfx950 channel tuning for ReduceScatter and AllGather (#1940)
* add channel thresholds to override channel-count adjustments
2025-10-14 09:50:44 -05:00
mberenjk e738c03e39 fixing the ar_with_bias test issue when running rccl-tests (#1912)
* fixing the AR_With_Bias issue when running rccl-tests
2025-10-13 13:58:21 -07:00
Arm Patinyasakdikul ff75860d73 Fix unroll factor display bug. (#1969) 2025-10-10 15:35:06 -05:00
Surya Periaswamy 5bd5079de1 MSCCL++ fix split path null deref (#1959)
* Add speriaswamy-amd to CODEOWNERS
* MSCCL++: fix split path null deref; key maps by parent ncclUniqueId
* removed no-op
2025-10-09 14:08:38 -05:00
Rahul Vaidya 6b200ee6c5 Fix LL128 proto selection to respect user setting (#1822) 2025-10-09 14:08:03 -05:00
Nusrat Islam d22a39e954 Update direct AG and single node LL threshold (#1944)
* update AG direct and single node LL threshold

* update thresholds based on MI350 experimental results

* disable using LL for direct AG

* enable direct AG for lower GPU counts

* direct AG single node tuning

* fix in-place buffer allocation for AG unit test

* whitespace fix

* gate direct AG for gfx950 and gfx942

---------

Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
2025-10-09 10:48:50 -05:00
Artem Kuzmitckii 00a42c80f3 Reverse logic of context tracking enablement from #1927 (#1971)
In this commit, context tracking is disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both CDNA and RDNA.
Original PR https://github.com/ROCm/rccl/pull/1927
2025-10-09 10:24:09 +02:00
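
An opt-in flag of the kind described above (off by default, enabled by setting `RCCL_ENABLE_CONTEXT_TRACKING=1`) could be read roughly as follows. This is a sketch under assumptions: the helper name `envFlagEnabled` is hypothetical, and the log does not show how RCCL itself parses the variable.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: returns true only when the environment variable
// `name` is set to exactly "1". An unset variable or any other value
// leaves the feature disabled, which gives the off-by-default behavior.
bool envFlagEnabled(const char* name) {
  assert(name != nullptr);
  const char* value = std::getenv(name);
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Usage sketch: decide once at startup whether to enable the feature.
bool contextTrackingRequested() {
  return envFlagEnabled("RCCL_ENABLE_CONTEXT_TRACKING");
}
```

Treating only the literal value "1" as enabled is a deliberately strict choice for the sketch; a real parser might accept other truthy values.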
BertanDogancay 3f94267f21 Merge remote-tracking branch 'nccl/master' into develop 2025-10-06 18:36:49 -04:00
Nilesh M Negi 342ec086e3 Revert "changes for hugepages backed host buffer for larger allocations (#1841)" (#1951)
This reverts commit 65b69bf318.
2025-10-02 23:43:09 -05:00
amd-jiali 5978d2f9ab Change the hipRuntimeVersion message from WARN so that it always shows up (#1911)
Authored-by: Jiali Li <jialili@amd.com>
2025-10-02 11:32:32 -05:00