Commit graph

887 commits

Author SHA1 Message Date
alex-breslow-amd 2f6b20c00a Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681) for Single Node on Some GFX9 Systems
Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.
2025-05-29 16:17:35 -07:00
Nilesh M Negi 12517a957e Re-apply unroll=1 and 112 channels for gfx950 (#1706)
* Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
This reverts commit 329e13efff.

* Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
This reverts commit b17338d164.
2025-05-28 14:58:10 -05:00
PedramAlizadeh 7f878baef0 Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 00c1eb098c.
2025-05-21 20:21:27 -05:00
corey-derochie-amd 170acf3bda Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546)
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"

This reverts commit 824b81c034.

* [UT] Modify max stack size to 496

* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION

* addressing the ci failure

* Adding the device tag

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar 00c1eb098c [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)
* Update LL128 elems per thread

* Precompute ix[g] in LL128 prim

* Make Threadthreshold part of tuning models

* Ignore channel tuning when channels are env controlled

* Tune LL128 max limit for AG

* Tune LL128 max limit for RS

* Retune AR LL128 limits due to changes

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-14 14:35:54 -05:00
Dingming Wu 51f87fbb43 Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683)
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()

For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.

* check hip runtime ver via hipRuntimeGetVersion

* move the detection to ncclinit func

* correct rocm version integer

* update warning message

* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
2025-05-14 10:12:45 -05:00
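The check described in the commit above (warn when ROCm is older than 6.4 and HSA_NO_SCRATCH_RECLAIM is not set to 1) could be sketched roughly as follows. This is a minimal illustration, not RCCL's actual code: the function names `envFlagIsOne` and `needsNoScratchReclaimWarning` are hypothetical, and the version is passed in as plain major/minor integers rather than queried from the HIP runtime.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: true only when the variable is present and equals "1".
bool envFlagIsOne(const char* value) {
  return value != nullptr && std::string(value) == "1";
}

// Hypothetical helper: decides whether a warning should be logged.
// Assumption taken from the commit message: ROCm releases older than 6.4
// need HSA_NO_SCRATCH_RECLAIM=1 for the LL128 protocol.
bool needsNoScratchReclaimWarning(int rocmMajor, int rocmMinor,
                                  const char* envValue) {
  assert(rocmMajor >= 0 && rocmMinor >= 0);
  const bool olderThan64 =
      rocmMajor < 6 || (rocmMajor == 6 && rocmMinor < 4);
  return olderThan64 && !envFlagIsOne(envValue);
}
```

A caller would typically pass the result of `std::getenv("HSA_NO_SCRATCH_RECLAIM")` as `envValue` after the environment has been initialized, and obtain the version numbers from whatever runtime query is available.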
mberenjk 1cefcee51f moving the thread_fence to apply before atomic fetch (#1672)
* applying thread_fence only on warp 0 before atomic fetch

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 10:10:05 -05:00
gilbertlee-amd 9ef45df8f7 Fix when more than 64 channels are used for multi-collective group calls (#1688)
* Fix when more than 64 channels are used for multi-collective group calls

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-12 18:05:57 -05:00
Avinash 5f6805b4f4 RCCL Multinode DMA Buffer crash fix (#1682)
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.

* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABuff or Peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled
2025-05-08 19:17:39 -05:00
Avinash c54a0c085a collective trace improvements for debugging (#1661) 2025-05-07 13:37:31 -05:00
Bertan Dogancay 590ad6acc2 Merge pull request #1662 from BertanDogancay/2.25
[SYNC] 2.25.1-1
2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar f3f3336468 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank
2025-05-05 15:26:29 -04:00
Bertan Dogancay acfac55516 [Graph] Try using P2P by default (#1670) 2025-05-02 11:54:30 -04:00
Nilesh M Negi 329e13efff Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 307bc10781.

* Update Changelog

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-04-30 23:33:08 -05:00
BertanDogancay cb6e23ae67 Merge remote-tracking branch 'nccl/master' into develop 2025-04-30 13:31:41 -05:00
Tim dc0c5f9153 minor fix for empty scope (group) (#1666) 2025-04-30 13:29:13 -04:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 82afb2bcfe Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
Bertan Dogancay ac8ec4c08c Fix NPKit for SendRecv (#1651) 2025-04-21 12:34:47 -04:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar 52bfdf05dc Address nested designator compiler warning issue (#1633) 2025-04-18 17:09:50 -04:00
Nusrat Islam f20c33effd Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments
2025-04-18 08:54:32 -05:00
AbandiGa 7a84c5dbb0 added copyright (#1635) 2025-04-14 09:46:18 -05:00
Dingming Wu 1786c0268b Adding #include <dlfcn.h> in profiler.cc to pass build (#1632)
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```
2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul 52654e2301 Add device synchronization before destroying proxy thread. (#1631)
This commit ensures that the GPU finishes all kernels before destroying
the communicator thread.
2025-04-10 10:44:16 -05:00
Pedram Alizadeh e40ff4f84a all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627)
* Enabling LL128 by default on MI300

* Add missing CUDACHECK

* Adjust BW correction factors to fix the Tree->Ring switching point

* Refactor and add ll128 AR logarithmic factor to tuning models

* Move RCCL tuning changes to a separate file 

* Use enum for tunable indexing

* Use explicit indexing in tuning models to avoid mismatch issues

* Place rcclGetSizePerRank in a function

* Remove HIP ifdef for rccl-only call

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-04-10 11:43:54 -04:00
Mustafa Abduljabbar 4be06f04d8 Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models (#1618)
* Enable LL/LL128 cutoff points in tuning models

* Initializing ll/ll128 model cutoffs for MI300

* Use RCCL_LL_LIMITS_UNDEFINED

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
2025-04-02 16:26:23 -04:00
Nilesh M Negi b17338d164 Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)"

This reverts commit 1df73e209e.

* [DOC] Update Changelog

* [DOC] Update CHANGELOG
2025-03-28 17:57:06 -05:00
Bertan Dogancay 532f54c244 Merge pull request #1559 from BertanDogancay/2.23
[SYNC] 2.23.4-1
2025-03-28 17:06:56 -04:00
Nilesh M Negi 307bc10781 [SRC] Enable unroll=1 for gfx950 (#1602)
* [SRC] Enable unroll=1 for gfx950

* Fix typo from rebase in generate.py

* Support for unroll=1 and gfx90a when building for all GPU targets

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-27 18:21:35 -05:00
BertanDogancay 0b2062c560 Merge remote-tracking branch 'nccl/master' into develop 2025-03-27 12:53:04 -05:00
isaki001 9dc23d9265 Disable mscclpp (#1614)
* disable mscclpp by default
2025-03-25 15:21:16 -05:00
gilbertlee-amd 626dc50ab5 Removing the experimental clique kernel files (#1610) 2025-03-20 18:10:01 -06:00
Wenkai Du 90ad586d94 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock
2025-03-20 16:11:43 -07:00
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Wenkai Du bd0092e8f1 GDRCOPY support: Off by default (#1605) 2025-03-18 08:17:01 -07:00
Avinash ccb0820743 Memory leak fix when numIBDevices = 0 (#1429)
* Initial commit for testing

* Fix memory leak in checkOptions

* Fix memory leak in checkOption

* x

* Delete cmake-3.28.2-linux-x86_64.sh

* gcn changes

* gcn memleak fixes

* gcn leak fix

* memory leak fixes for parseRome4P2H and ncclTopoAddGPU

* Keeping only necessary file for fixes

Deleting temporary scripts I created for debugging and testing

* changing to GCN_ARCH_NAME_LEN

* Added sanity check directory

* refactoring scripts

* Updated to sanity checks folder

* Initial fixes

* changes in tools

* pointing RCCL lib build to debug version

* Removed second pthread_detach

* Removing sanity checks

* Keeping only code changes

* addressing memory leaks in ncclIbinit

---------

Co-authored-by: Chao Chen <cchen104@amd.com>
2025-03-17 11:21:19 -05:00
Mustafa Abduljabbar f67b2cc908 Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 (#1604)
* Add reduce_scatter LL and LL128 thresholds

* Always honor user choice for protocol
2025-03-17 11:21:01 -04:00
Wenkai Du 245c2de909 Enable LL128 on gfx942 (#1549) 2025-03-16 15:10:05 -07:00
Nilesh M Negi 1df73e209e [GRAPH] Increase default nChannels to 112 for gfx950 (#1596)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-14 14:47:03 -07:00
Wenkai Du 4237caad69 Limit P2P channels per peer to not exceeding max channels (#1594)
* Limit P2P channels per peer to not exceeding max channels

* [UT] test single GPU cases for all collectives

* [UT] fix out of range root value
2025-03-11 09:32:09 -07:00
Tim f6c6d451a9 mscclpp compatibility check for ubr (#1573) 2025-03-09 22:10:47 -04:00
Nusrat Islam ac823818aa misc/msccl: force use of mscclpp (#1581) 2025-03-04 12:48:59 -06:00
Bertan Dogancay d88cca3098 [Transport] Fix IntraNet (#1582) 2025-03-04 13:30:36 -05:00
Wenkai Du f957c4fe22 NPKit: enable reduce scatter profiling (#1580) 2025-03-04 10:03:56 -08:00
Nusrat Islam 23c0b7bd84 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging
2025-02-28 17:16:07 -06:00
Wenkai Du 60c1264d27 Improve RDMA flushing by write dummy payload with RO=0 (#1570)
* Improve RDMA flushing by write dummy payload with RO=0

* Rename env var for disabling this change to RCCL_GDR_FLUSH_GPU_MEM_NO_RELAXED_ORDERING
2025-02-27 16:20:32 -08:00
Nilesh M Negi 02b8e504f0 [BUILD] Fix generate.py for gfx950 (#1575)
* [BUILD] Fix generate.py for gfx950

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

* [BUILD] Cleaner way to check for gfx targets

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>

---------

Signed-off-by: nilenegi <Nilesh.Negi@amd.com>
2025-02-26 18:43:37 -06:00
Bertan Dogancay 85eb1f16bc Use bit reversal based mapping for multi-node (#1572) 2025-02-26 09:48:03 -05:00
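The log does not say how the commit above applies bit reversal to multi-node mapping. As general background only, a bit-reversal permutation reorders indices by reversing their binary representation. The sketch below shows that generic technique; the names `reverseBits` and `bitReversalOrder` are hypothetical and are not taken from RCCL.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reverse the lowest `bits` bits of `value` (generic technique, not RCCL code).
uint32_t reverseBits(uint32_t value, unsigned bits) {
  assert(bits <= 32);
  uint32_t out = 0;
  for (unsigned i = 0; i < bits; ++i) {
    // Shift the result left and append bit i of the input.
    out = (out << 1) | ((value >> i) & 1u);
  }
  return out;
}

// Build the bit-reversal permutation of the indices 0 .. 2^bits - 1.
std::vector<uint32_t> bitReversalOrder(unsigned bits) {
  assert(bits < 32);
  std::vector<uint32_t> order(1u << bits);
  for (uint32_t i = 0; i < order.size(); ++i) {
    order[i] = reverseBits(i, bits);
  }
  return order;
}
```

For eight indices (`bits = 3`) the resulting order is 0, 4, 2, 6, 1, 5, 3, 7, so neighboring positions in the output are far apart in the original numbering.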
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00