Commit graph

1756 Commits

Author SHA1 Message Date
vstojilj 2ac44cfe4e SWDEV-536040 - Include <thread> header (#1724) 2025-06-06 10:28:11 -06:00
Arm Patinyasakdikul ec6efa9b26 Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0. (#1720)
* Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0.

* Create ncclShmemScratchWarpSize on host side for enqueue.cc.

* Update src/enqueue.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* address comments

* fix number of threads

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-06 07:34:43 -05:00
Arm Patinyasakdikul d5b5f6b159 Increase default WORK_FIFO size to accommodate larger alltoall. (#1722) 2025-06-05 09:02:45 -05:00
Pedram Alizadeh 3f7c08648f Reapplying PR #1641 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1713)
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"

This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.

* Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8
2025-06-04 13:22:11 -04:00
Avinash e94b360246 SPLITCOMM design fix in src/misc/msccl (#1715)
* Fix TOC-TOU in mcclInit

* Improving vector resize thread safety

* Initial commit rank to comm change

* Removing unwanted include header changes

* Updated CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-06-01 21:00:38 -05:00
alex-breslow-amd 2f6b20c00a Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681) for Single Node on Some GFX9 Systems
Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.
2025-05-29 16:17:35 -07:00
Nilesh M Negi 12517a957e Re-apply unroll=1 and 112 channels for gfx950 (#1706)
* Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
This reverts commit 329e13efff.

* Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
This reverts commit b17338d164.
2025-05-28 14:58:10 -05:00
corey-derochie-amd 7b633d5844 Fixed errors in the CHANGELOG for ROCm 7.0 (#1702)
* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix
2025-05-23 15:47:59 -05:00
akolliasAMD aabd181fe4 remove user from code owner file (#1709) 2025-05-23 15:45:15 -05:00
Arm Patinyasakdikul c07445d5b4 Test: bump max stacksize once again to match current expectation. 2025-05-23 11:18:25 -05:00
alex-breslow-amd f5b44acb1b Make offload-compress the default (#1704)
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress gpu code at link and compile time
2025-05-22 22:33:25 -05:00
Nilesh M Negi 948d2b6a68 [DOCKER] Fix RCCL and RCCL-Tests build for stg1 base images (#1699)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-05-22 20:46:01 -05:00
Arm Patinyasakdikul 523e0893e4 Test: Change max stack size to 520 to accommodate new ROCm changes. 2025-05-21 20:21:27 -05:00
PedramAlizadeh 7f878baef0 Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 00c1eb098c.
2025-05-21 20:21:27 -05:00
isaki001 66ef428714 fix improper patch reverse order (#1696) 2025-05-19 12:29:21 -05:00
Arm Patinyasakdikul 1710c27e77 CHANGELOG.md: Add UT failures as known issue for 6.4.1. (#1698)
* CHANGELOG.md: Add UT failures as known issue for 6.4.1.

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-19 10:40:50 -05:00
Arm Patinyasakdikul e602497789 Added known issue for 6.4.1 release to CHANGELOG.md. (#1697) 2025-05-16 08:17:48 -05:00
Sam Wu e5bf7bc5b1 Remove call to junit from math ci (#1691) 2025-05-15 14:45:49 -06:00
Arm Patinyasakdikul f306c00671 Change GPU references to gfx950. (#1695) 2025-05-15 10:32:46 -05:00
corey-derochie-amd 170acf3bda Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546)
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"

This reverts commit 824b81c034.

* [UT] Modify max stack size to 496

* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION

* addressing the ci failure

* Adding the device tag

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar 00c1eb098c [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)
* Update LL128 elems per thread

* Precompute ix[g] in LL128 prim

* Make Threadthreshold part of tuning models

* Ignore channel tuning when channels are env controlled

* Tune LL128 max limit for AG

* Tune LL128 max limit for RS

* Retune AR LL128 limits due to changes

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-14 14:35:54 -05:00
Dingming Wu 51f87fbb43 Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683)
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()

For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.

* check hip runtime ver via hipRuntimeGetVersion

* move the detection to ncclinit func

* correct rocm version integer

* update warning message

* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
2025-05-14 10:12:45 -05:00
mberenjk 1cefcee51f moving the thread_fence to apply before atomic fetch (#1672)
* applying thread_fence only on warp 0 before atomic fetch

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 10:10:05 -05:00
Mustafa Abduljabbar d665547eef Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support
2025-05-13 17:07:03 -05:00
Nikhil-Nunna a72a1939d1 Updated Codeowners (#1692) 2025-05-12 18:58:39 -05:00
gilbertlee-amd 9ef45df8f7 Fix when more than 64 channels are used for multi-collective group calls (#1688)
* Fix when more than 64 channels are used for multi-collective group calls

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-12 18:05:57 -05:00
Avinash 5f6805b4f4 RCCL Multinode DMA Buffer crash fix (#1682)
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.

* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABuff or Peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled
2025-05-08 19:17:39 -05:00
mberenjk e70003736e Write JSON file to /tmp directory to avoid incorrect write access in recorderTest (#1680)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-07 13:58:27 -05:00
Avinash c54a0c085a collective trace improvements for debugging (#1661) 2025-05-07 13:37:31 -05:00
Bertan Dogancay 590ad6acc2 Merge pull request #1662 from BertanDogancay/2.25
[SYNC] 2.25.1-1
2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar fdad89690b Add missing MACRO to topo_expl (#1677)
* Fix header compatibility
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar f3f3336468 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank
2025-05-05 15:26:29 -04:00
Siu Chi Chan 9525c5b2ef rccl-UnitTests - link to dl library (#1673) 2025-05-02 21:20:22 -05:00
Bertan Dogancay acfac55516 [Graph] Try using P2P by default (#1670) 2025-05-02 11:54:30 -04:00
Nilesh M Negi 329e13efff Revert "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 307bc10781.

* Update Changelog

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-04-30 23:33:08 -05:00
deeksha-amd 2486838465 Added new tests for improving the code coverage (#1656)
Signed-off-by: Deeksha Goplani <deeksha.goplani@amd.com>
2025-04-30 18:01:11 -05:00
isaki001 8145c4f3b8 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase
2025-04-30 16:42:28 -05:00
BertanDogancay cb6e23ae67 Merge remote-tracking branch 'nccl/master' into develop 2025-04-30 13:31:41 -05:00
Tim dc0c5f9153 minor fix for empty scope (group) (#1666) 2025-04-30 13:29:13 -04:00
Richard Barnes 7961624167 Enable -Wall (#1644) 2025-04-24 10:45:46 -07:00
Bertan Dogancay f8067a76dc Merge pull request #1645 from corey-derochie-amd/nccl-2.24
NCCL Sync 2.24.3-1
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar aa7991dfc8 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML
2025-04-24 09:02:03 -04:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 82afb2bcfe Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
Tim 45e1c3f3e2 reverting change to RcclReplayer (#1657) 2025-04-23 15:36:46 -04:00
Jeffrey Novotny df778b4ea1 Fix broken link to RCCL Replayer GitHub info (#1655) 2025-04-23 14:17:31 -04:00
gilbertlee-amd ee85a70bb4 Adding UT_DEBUG_PAUSE to unit tests (#1653) 2025-04-21 21:15:07 -06:00
Bertan Dogancay ac8ec4c08c Fix NPKit for SendRecv (#1651) 2025-04-21 12:34:47 -04:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar 52bfdf05dc Address nested designator compiler warning issue (#1633) 2025-04-18 17:09:50 -04:00