1071 Commits

Author SHA1 Message Date
Pedram Alizadeh c19441b2b9 Reducing the p2pnChannels to 32 (from 64) for send/recv based collectives on multi-node MI350 (2 and 4 nodes) (#2977) 2026-01-30 18:23:09 -05:00
systems-assistant[bot] 1211790607 Direct Reduce Scatter Implementation (#2765)
* Add new implementation of direct send/recv reduce scatter

* Resolved conflicts

* Add multiple channels support to the reduction kernel of direct reduce scatter and adjust offset into buffer to utilize multiple channels.

* Resolve validation issue when number of elements is not divisible by number of channels, leaving elements unaccounted for in reduction.

* fix proxy hang

* set maxSrcs to 64 in reduceCopy

* optimize multi-channel code

* fix validation issue in single node MI300

* Tune the message size range for 2, 4, and 8 nodes

* Move Direct RS into separate kernel

* Add Copyright

* resolve review comments

* resolve review comments

* fix merge build issue

* revert move Direct RS into separate kernel

* address review comments

* address review comments

---------

Co-authored-by: KawtharShafie <kawtharshafie@gmail.com>
Co-authored-by: Ghadeer Alabandi <abandiga@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
2026-01-30 09:27:27 -06:00
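
The validation issue mentioned in the commit above (element count not divisible by channel count) comes down to how a buffer is partitioned. The following is a minimal sketch of one way to split elements so that none are left out. It is not the RCCL implementation; `ChannelSlice` and `splitAcrossChannels` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper type, not RCCL code: one channel's slice of the buffer.
struct ChannelSlice {
  std::size_t offset;  // index of the first element handled by this channel
  std::size_t count;   // number of elements handled by this channel
};

// Split `count` elements over `nChannels` channels so that every element is
// covered exactly once, even when count % nChannels != 0.
// The first (count % nChannels) channels each take one extra element.
std::vector<ChannelSlice> splitAcrossChannels(std::size_t count,
                                              std::size_t nChannels) {
  std::vector<ChannelSlice> slices;
  if (nChannels == 0) return slices;
  const std::size_t base = count / nChannels;
  const std::size_t rem = count % nChannels;
  std::size_t offset = 0;
  for (std::size_t c = 0; c < nChannels; ++c) {
    const std::size_t n = base + (c < rem ? 1 : 0);
    slices.push_back({offset, n});
    offset += n;
  }
  return slices;
}
```

A plain `count / nChannels` per channel would drop the last `count % nChannels` elements, which is the kind of gap the commit message describes.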
systems-assistant[bot] 055909d335 Set default max channels to 48 for MI350 multi-node (#2759)
* make 48 the default max channels for MI350

* address review comments

---------

Co-authored-by: Ghadeer Alabandi <abandiga@gmail.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
2026-01-30 09:22:42 -06:00
systems-assistant[bot] 58c203e252 Fix channel overuse for 1 rank comms (#2760)
* Fix channel overuse for 1 rank comms

* limit channels when warpSpeed is enabled but not used

* enable std::min check against # of CUs for maxChannels computation when warpSpeed is enabled

---------

Co-authored-by: Mustafa Abduljabbar <muabdulj@amd.com>
Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
2026-01-29 12:13:46 -06:00
systems-assistant[bot] 3a479a25ad 8 bytes mem leak fix (#2764)
* 8 bytes mem leak fix

* Adding a missing free()

* Clean up commented lines

* Add strdup fail check, memory ownership info

* Add strdup fail check, memory ownership info

---------

Co-authored-by: PJAvinash <avinashindian2.0@gmail.com>
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
Co-authored-by: Avinash <44542533+PJAvinash@users.noreply.github.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
2026-01-27 08:29:16 -07:00
mberenjk 6743f00777 applying the changes from net_ib.cc to rocm_net_ib.cc to ensure DMABUF-disabled configurations are respected. (#2152)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 3d4813d991]
2026-01-21 12:11:56 -08:00
mberenjk 7069fc936f Adding a check to respect DMABUF being disabled by the user (#2076)
Co-authored-by: Marzieh Berenjkoub <mberenjk@.amd.com>

[ROCm/rccl commit: 9a443f3054]
2026-01-21 11:08:12 -08:00
Nilesh M Negi 244047310e [DEVICE] Switch to amd-smi from rocm-smi (#1759)
* Use amd-smi instead of rocm-smi for ROCM_VERSION >= 7.11.0

[ROCm/rccl commit: cd745b1f4b]
2026-01-21 09:05:47 -06:00
prasanna-amd 520f309bb1 fix potential segfaults due to use after malloc fails (#2137)
* fix potential segfaults

* replace NULL with nullptr

---------

Co-authored-by: Prasannakumar Murugesan <prmuruge@amd.com>

[ROCm/rccl commit: 4a32ec2501]
2026-01-20 14:11:29 -08:00
prasanna-amd bb47eee7cc fix bug in reduce kernel bfloat16 for ROCm >= 6.0 (#2139)
Co-authored-by: Prasannakumar Murugesan <prmuruge@amd.com>
As part of an earlier commit, bfloat16 handling in the reduce kernel for FuncMinMax fell into the generic/default template, which is used when there is no SPECIALIZE_REDUCE for a particular type. This generic template does a bitwise integer comparison, which broke bfloat16 ops.
Change the else-if statement to an else statement so that it covers both ROCm versions < 6.0 and >= 6.0 (with ROCm > 6.0, device.h already typedefs __hip_bfloat16 to hip_bfloat16, so no special case is needed here).

[ROCm/rccl commit: fa366ac03f]
2026-01-20 14:07:20 -08:00
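
To illustrate why a bitwise integer comparison gives wrong min/max results for bfloat16, here is a small host-side sketch. It does not use the HIP types or the RCCL reduce templates; the helper names are hypothetical, bfloat16 is modelled as the upper 16 bits of a float32, and the integer comparison is assumed to be unsigned (a signed comparison fails too, for pairs of negative values).

```cpp
#include <cstdint>
#include <cstring>

// Model bfloat16 as the upper 16 bits of an IEEE-754 float32
// (simple truncation; rounding is ignored in this sketch).
static uint16_t floatToBf16Bits(float f) {
  uint32_t u;
  std::memcpy(&u, &f, sizeof(u));
  return static_cast<uint16_t>(u >> 16);
}

static float bf16BitsToFloat(uint16_t b) {
  uint32_t u = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}

// Incorrect for bfloat16: compares the raw bit patterns as unsigned integers,
// so any negative value (sign bit set) looks larger than any positive one.
static uint16_t minBitwise(uint16_t a, uint16_t b) { return a < b ? a : b; }

// Correct: decode to float and compare the numeric values.
static uint16_t minAsFloat(uint16_t a, uint16_t b) {
  return bf16BitsToFloat(a) < bf16BitsToFloat(b) ? a : b;
}
```

With -1.0 and 2.0 as inputs, `minBitwise` picks the bit pattern of 2.0 while `minAsFloat` picks -1.0.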
Mythreya Kuricheti 73df3f12b3 use message instead of warning for nccl.h C++ check (#2128)
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 0dc31b1a4a]
2026-01-20 14:21:38 -07:00
Nusrat Islam 96f6029a1b revert memcpy use for direct AG (#2146)
Co-authored-by: Islam <nusislam@amd.com>

[ROCm/rccl commit: f3c5156bbf]
2026-01-20 13:58:28 -06:00
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
Nusrat Islam eb347a0dd3 GDA support for alltoall via rocshmem integration (#2099)
* ROCSHMEM linking/building to match MSCCL++ style

* add rocSHMEM as a submodule

* Move rocSHMEM submodule to ext-src/rocSHMEM

* Adding submodule support proper, as well as a patch for rocshmem

* Cleaning up INCLUDE_DIR vs INCLUDE_DIRS mixup

* updating patch file

* Pointing rocshmem submodule to edgars fixup patch

* Adding IBVERBS link to the submodule build

* More IBVERBS patching

* pin rocshmem submodule to b534423

* Adding IPC support in rocSHMEM build

* updating rocshmem submodule to resolve CQ errors

* Updating submodule to include recent a2a optimizations

* invoke rocshmem alltoall from rccl

* Updating submodule to CQ error number hang

* Updating submodule to include a2a improvements and bug fixes

* Updating submodule to point to Yiltan's fork and doorbell ring removal commit

* Updating hash to correspond with submodule change

* Updating to no-ctx wg call and updating submodule

* copy-in/copy-out using multiple CUs

* Updating rocSHMEM submodule to include doorbell improvements

* updating gitmodule to point to upstream

* code cleanup and adjust threshold

* guard rocshmem a2a invocation

* Only build with rocshmem when specified

* code cleanup

* address review comments

* Removing debugging failure case

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* whitespace fix

* Adding rocshmem compile guard

* Removing unnecessary comment

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* remove commented lines

* address review comments

* cleanup

---------

Signed-off-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k12-27.cs-aus.dcgpu>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-03.cs-aus.dcgpu>

[ROCm/rccl commit: 27648b0900]
2026-01-09 14:04:54 -06:00
Wenkai Du 87eec6427e Fix broken build due to ncclCudaCalloc change (#2135)
[ROCm/rccl commit: 11e0f4445e]
2026-01-09 09:22:00 -08:00
Dingming Wu 4e15dc142c Update device.h for hip_bfloat16 inclusion guard (#2107)
* Update device.h for hip_bfloat16 inclusion guard

Prevents other files in ROCm from including the old hip/hip_bfloat16.h, which is guarded by _HIP_INCLUDE_HIP_AMD_DETAIL_HIP_BFLOAT16_H_ and _HIP_BFLOAT16_H_

* Update device.h to handle old hip_bfloat16.h

Added a workaround for old hip_bfloat16.h header usage.

[ROCm/rccl commit: 8e4dbfdf37]
2026-01-09 09:45:47 -05:00
Karthikeyan Arumugam 94499918b3 Add check for P2pPolicy for rocm-ib (#2122)
[ROCm/rccl commit: d0d00c33ee]
2026-01-09 11:33:05 +00:00
Wenkai Du 07453ebfaf Improve RCCL kernel coll trace (#2061)
[ROCm/rccl commit: 1d22c87167]
2026-01-08 16:07:18 -08:00
Wenkai Du 721c624de8 Remove iommu warning in KVM env (#2112)
* Remove iommu warning in KVM env

* Fix for review comments

[ROCm/rccl commit: de931f4c53]
2026-01-08 13:55:40 -08:00
Mustafa Abduljabbar 5bba932529 [WarpSpeed] Improve handling for auto and manual modes (#2125)
* Force ring in WarpSpeed manual mode and log event

* Skip usage for non-ring in WarpSpeed auto mode

* Enable WarpSpeed when its CU count is set

[ROCm/rccl commit: 93fdcb160c]
2026-01-06 10:21:49 -05:00
Nusrat Islam 49d9f8cc27 use memcpy for local copies (#2121)
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>

[ROCm/rccl commit: b4a86ef680]
2026-01-06 09:00:57 -06:00
Avinash de23e1db6d Navi4 LL enablement and tuning (#2095)
* LL enablement for gfx1201

* Single node LL/Simple tuning

* multinode algo/proto default choice

* First iteration of Table tuning

* gfx924 tuning table correction

* Addressing PR comments and prefix match fix


[ROCm/rccl commit: 9545ae04b2]
2026-01-05 10:17:12 -06:00
Nusrat Islam 57f81914d8 gfx950: restrict maxChannels to 48 for multi-node collectives (#2116)
* gfx950: restrict maxChannels to 48 for multi-node collectives

* change env name for reduced CU config

---------

Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>

[ROCm/rccl commit: f756aa9add]
2025-12-31 09:28:19 -06:00
amd-jiali 7d25ecc65c Add an environment variable to allow the user to explicitly turn off direct AllGather (#2119)
Co-authored-by: Jiali Li <jialili@amd.com>

[ROCm/rccl commit: 935208ad09]
2025-12-29 16:43:40 -08:00
Avinash 2585ae8815 Virtual device enablement (Minimal changes) (#2110)
* minimal changes

* Setting Default tuning table

* Add warnings for NIC merge across PCIe root complexes, NUMA

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 6f62165369]
2025-12-25 15:06:33 -06:00
Karthikeyan Arumugam bb599d8ed7 Add support for AMD AINIC within RCCL default internal network plugin. (#2078)
* Added support for AMD ROCm net-ib alongside vanilla net-ib, with auto-generation to detect conflicts early during NCCL sync and enable future customizations.
* Integrated AMD AINIC support in RCCL for out-of-the-box usage, leveraging performance improvements by default, channel pinning for optimal pipeline performance, and extended support for 32B in-line CTS messages.
* Implemented internal derivation of AINIC-specific flags when RCCL AINIC environment parameter is set, and checks before initializing AINIC net-ib methods.
* Included snapshot of auto-generated ROCm net-ib file (src/transport/net_ib_rocm.cc) for reference.
* Fixed typos in RCCL param API (RCCL_AINIC_ROCE) and dlclose.
* Updated plugin loading logic:
  * Load internal ROCmIB plugin only when NCCL_NET_PLUGIN is not set.
  * Load default internal net-ib only when not AINIC and no external plugin env is set.

[ROCm/rccl commit: 9f4651f20f]
2025-12-23 10:33:10 -05:00
alexander-sannikov 8bc2e81e9a Tuning: use constant value for CorrectionFactor tables
[ROCm/rccl commit: 50568dc93d]
2025-12-18 18:55:03 +00:00
alexander-sannikov 1b00f1a895 Tuning: fixed out-of-bound access
[ROCm/rccl commit: dea50b5e11]
2025-12-18 18:55:03 +00:00
Atul Kulkarni c64c23fbee Removes default visibility in debug mode and updates unit tests for alt_rsmi impl (#2091)
* Update unit tests for alt_rsmi impl

- Create distinct test executable for alt_rsmi testing
- Updated alt_rsmi tests to use public methods
- Compiles alt_rsmi.cc with ARSMI_TEST_BUILD
- Enables external linkage of internal variables
- Only for AltRsmiTests.cpp that manipulates internals
- Clean separation for test behavior

* Address review comments

* restore hidden symbol visibility

[ROCm/rccl commit: 74690ea705]
2025-12-17 10:27:00 -08:00
Mustafa Abduljabbar d15a2c6b65 Keep P2P self-copy for batched ops to prevent >32N hang. (#2108)
[ROCm/rccl commit: 596567ff95]
2025-12-16 11:56:39 -05:00
isaki001 ddfff6b705 Remove node-count and threshold restrictions from p2p-batching (#2077)
* remove node-count and threshold restrictions from p2p-batching

* remove batching threshold usage, fix typo for using batching-enablement flag

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

[ROCm/rccl commit: 7c1049d2a4]
2025-12-15 19:55:46 -05:00
Mustafa Abduljabbar 88652b53d0 Add fix for WarpSpeed auto mode (#2104)
[ROCm/rccl commit: 5787c960fc]
2025-12-12 17:56:52 -05:00
Mustafa Abduljabbar 2621e0254e [Device] WarpSpeed enablement and single node CU and perf opt for MI350 (#2073)
[ROCm/rccl commit: d009ab144e]
2025-12-11 19:04:35 -05:00
Ahmed Khan f17357d0d4 Add ncclCommDump API (#2068)
* Add ncclCommDump API

* remove trailing whitespace changes

* Add more proxy trace timestamps

* Add facebook_rccl namespace before proxyTrace timestamp call

* Clean up ProxyTrace construction

* Move updateProxyOpCounter to member function

* Move setProxyOpTimestamp to member function

* Move addNewProxyOp to member function

* Make internal methods private

* Make ProxyTrace thread safe

* Fix unit tests

* Fix overwritten ProxyTrace DONE setting in net.cc

[ROCm/rccl commit: 08dd75712f]
2025-12-11 15:02:35 -07:00
Mustafa Abduljabbar 085752d6e5 Add WAIT_PEER NPKIT event (#2100)
[ROCm/rccl commit: 2cf6a9bb19]
2025-12-11 11:18:41 -05:00
Atul Kulkarni a364ada6e7 Add missing header in alloc.h (#2086)
[ROCm/rccl commit: 892d258319]
2025-12-04 11:26:19 -06:00
Wenkai Du 3e650467fa Use one side stream per process (#2063)
* Use one side stream per process

* Handle multiple GPUs per process

* Reset stream when not found

* Address review comments

* Fix missing mutex initializer

[ROCm/rccl commit: 185e78a8f0]
2025-12-02 10:03:15 -08:00
corey-derochie-amd 8e3f60e080 Add copyright to src/device/symmetric/all_reduce.cuh (#2080)
[ROCm/rccl commit: 4acd0f64ea]
2025-11-27 14:29:21 -07:00
isaki001 cf11e2f39f add back missing proxy-counter updates (#2052)
[ROCm/rccl commit: da183596cd]
2025-11-25 15:22:34 -06:00
AbandiGa d6087d0d62 Fix rcclNetP2pPolicy issue (#2072)
* fix rcclNetP2pPolicy issue

* change the comment to ncclNetIb

[ROCm/rccl commit: b14e32c46e]
2025-11-21 18:28:10 -06:00
Matt Williams 7456dc7d17 Fix ToC in API Library page (#2053)
* Add intro and remove ToC

[ROCm/rccl commit: 3495baa6b2]
2025-11-20 09:35:15 -05:00
Pedram Alizadeh 3d2fc04b45 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0

[ROCm/rccl commit: fb67e5b467]
2025-11-13 15:56:18 -05:00
AbandiGa 7f7c8d14f6 Disable Bfloat16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>

[ROCm/rccl commit: 277b6e9bac]
2025-11-13 14:55:09 -06:00
isaki001 9a81823515 Post thread-block size increase tuning (#2042)
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank

* don't override direct allgather threshold if set to -1

* restore 2-node AR simple at earlier message sizes than higher multi-node AR

* extend range of LL for single-node RS on gfx950

* update algo/proto for multi-node allreduce on gfx942

* set single-node AR on gfx950 to Tree LL for KB message sizes

* decrease threshold for single node Tree for gfx950 AR

[ROCm/rccl commit: 0d09f86608]
2025-11-13 14:51:04 -06:00
Bertan Dogancay 48f37be1e3 [Launch] Move cudaEventRecord call to capturing stream only (#2050)
[ROCm/rccl commit: 83ffc82fa7]
2025-11-13 08:38:09 -06:00
gilbertlee-amd 22d9a038a2 [GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031)
[ROCm/rccl commit: 46b032b760]
2025-11-12 19:34:27 -06:00
Dingming Wu 0d3fba9a22 Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027)
[ROCm/rccl commit: b811645688]
2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea 3da73a7526 Fix compilation when enabling indirect function calls (#1994)
Fix compilation when enabling indirect function calls.

[ROCm/rccl commit: 1678bb9ae7]
2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar b12399898d Reduce LL threshold for a2a (#2032)
[ROCm/rccl commit: 52f9526bd6]
2025-11-10 19:14:23 -05:00
Kapil S. Pawar 6bbc4b5d48 [RcclReplayer] Compile without the need for RCCL to be compiled (#2039)
[ROCm/rccl commit: acdafac49f]
2025-11-10 15:38:48 -06:00