1660 Commits

Author SHA1 Message Date
Dingming Wu d34a38ccfc Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.

[ROCm/rccl commit: 020dcf0a7c]
2025-06-25 23:01:34 -05:00
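
The proxyTrace commit above describes keeping a fixed number of ops in host memory and dumping their status on a crash or hang. The following is a minimal sketch of that general idea, not the actual RCCL implementation: every name here (`OpRecord`, `OpTrace`, the recorded fields) is hypothetical, and the eviction policy (overwrite the oldest entry) is an assumption.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical record of one send/recv op; the fields are illustrative only.
enum class OpKind { Send, Recv };

struct OpRecord {
  std::uint64_t id;
  OpKind kind;
  int peer;
  int stepsPosted;
  int stepsDone;
};

// Fixed-capacity trace: once N records are stored, the oldest is overwritten.
template <std::size_t N>
class OpTrace {
 public:
  void record(const OpRecord& r) {
    slots_[next_ % N] = r;
    ++next_;
  }

  // Number of records currently held (at most N).
  std::size_t size() const { return next_ < N ? next_ : N; }

  // Render the held records, oldest first, one per line.
  std::string dump() const {
    std::ostringstream out;
    const std::size_t n = size();
    const std::size_t start = next_ - n;
    for (std::size_t i = 0; i < n; ++i) {
      const OpRecord& r = slots_[(start + i) % N];
      out << "op " << r.id << ' '
          << (r.kind == OpKind::Send ? "send" : "recv")
          << " peer=" << r.peer
          << " posted=" << r.stepsPosted
          << " done=" << r.stepsDone << '\n';
    }
    return out.str();
  }

 private:
  std::array<OpRecord, N> slots_{};
  std::size_t next_ = 0;
};
```

In a real tool, `dump()` would presumably be called from a crash or hang handler. That wiring is omitted here because the commit message does not describe it.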
Nilesh M Negi 0c41f27b10 [BUILD] Move NPKit flags from install.sh to CMakeLists.txt (#1741)
[ROCm/rccl commit: 568777a9bf]
2025-06-23 21:51:49 -05:00
corey-derochie-amd 37ab47fab4 Updated CHANGELOG for LL128 support for gfx942 in 7.0 (#1719)
* Updated CHANGELOG for LL128 support for gfx942 in 7.0

Also ported 6.4.2 section

* Removed unnecessary note from 7.0

[ROCm/rccl commit: e73db11819]
2025-06-23 08:50:12 -06:00
jonatluu 590fb2798b Remove File reorganization backward compatibility (rccl) (#1753)
[ROCm/rccl commit: 709140204a]
2025-06-22 17:18:26 -05:00
Grant Pinkert 1d68693a2e Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.

[ROCm/rccl commit: 2482d1475f]
2025-06-22 05:54:44 -05:00
Bertan Dogancay eaa770a017 [NPKit] Create dump dir regardless of default or user provided path (#1757)
[ROCm/rccl commit: 675b495a00]
2025-06-21 21:18:20 -05:00
BertanDogancay c0c9312e38 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: aaf023976a]
2025-06-20 07:54:49 -05:00
Joseph Macaranas b37518663a [Azure CI] rccl nightly pipeline that runs on slurm (#1723)
* [Azure CI] rccl nightly pipeline that runs on slurm
- Login node will be set up as a self-hosted agent on Azure Pipelines.
- Login node will run this job nightly.
- Login node will check out the latest develop source, run build and test through sbatch calls, and wait for the jobs to complete. When the jobs are complete, it prints out the logs.

[ROCm/rccl commit: 12315c259a]
2025-06-19 10:41:40 -05:00
Nilesh M Negi 7c422271a8 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 92a5d225d9]
2025-06-17 01:29:55 -05:00
Sarat Kamisetty e359e834f5 generic net plugin ctxt that is extensible for use in multiple APIs (#1735)
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>

[ROCm/rccl commit: fa0422f174]
2025-06-16 14:48:08 -07:00
Bertan Dogancay 9fa53cc454 [NPKit] Use default output directory when env var is not set (#1747)
[ROCm/rccl commit: 39211c6b41]
2025-06-16 15:26:53 -04:00
Mustafa Abduljabbar 3e5dc99aa6 Fix topo_explorer compatibility and capture WarpSize (#1743)
[ROCm/rccl commit: fb4ad82d0d]
2025-06-16 08:18:35 -04:00
Tim 7051f217a7 replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated the recorder format, especially the binary format, to match the replayer's needs

[ROCm/rccl commit: ba97c9c18b]
2025-06-13 15:05:34 -04:00
Richard Barnes 2c0cc20a76 Enable -Wdeprecated-copy-with-user-provided-copy (#1643)
[ROCm/rccl commit: 4486d091b8]
2025-06-13 08:23:31 -07:00
Arm Patinyasakdikul 7f7f1cede3 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.

[ROCm/rccl commit: 6c37ae9470]
2025-06-12 09:58:01 -05:00
corey-derochie-amd 2e7aa3556e Deprecated MSCCL API functions (#1740)
[ROCm/rccl commit: 03fba66e71]
2025-06-11 17:52:09 -06:00
Nusrat Islam 99813a3288 msccl: adjust msccl threshold for bf16 (#1736)
* msccl: adjust msccl threshold for bf16

* Update src/collectives.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 75c3c8215c]
2025-06-11 09:09:57 -05:00
Arm Patinyasakdikul 69f7167b74 Fixed erroneous parenthesis. (#1739)
[ROCm/rccl commit: 600ace7f19]
2025-06-11 09:08:00 -05:00
Nilesh M Negi 4cadf3597c [DEVICE] Adding ability to choose unroll factor at runtime (#1734)
* Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR
* [BUILD] Add support for user-defined UNROLL for debugging
* Update CHANGELOG.md
* Fix COLLTRACE errors in CI
* Add debug statements for unroll and resolve warnings
* Incorporate UNROLL into ONLY_FUNCS for debugging

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 9d72be7b2f]
2025-06-11 00:07:59 -05:00
Atul Kulkarni 4cd71722f2 Added new ENABLE_CODE_COVERAGE option. (#1664)
Modified install.sh script to add this new option

[ROCm/rccl commit: 682ed36fe6]
2025-06-10 12:12:36 -05:00
Nilesh M Negi b797b62f6b [DEVICE] Use threadfence on gfx950 for LL protocol (#1686)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: b926203c05]
2025-06-09 01:26:07 -05:00
Nilesh M Negi 7abc3160e7 [BUILD] Enable LL128 on gfx950 (#1731)
* [BUILD] Enable LL128 on gfx950
* Modify comment in src/rccl_wrap.cc
* Update CHANGELOG

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: ef5b4ff630]
2025-06-09 00:25:54 -05:00
vstojilj bbe7422279 SWDEV-536040 - Include <thread> header (#1724)
[ROCm/rccl commit: 2ac44cfe4e]
2025-06-06 10:28:11 -06:00
Arm Patinyasakdikul f65777536f Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0. (#1720)
* Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0.

* Create ncclShmemScratchWarpSize on host side for enqueue.cc.

* Update src/enqueue.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* address comments

* fix number of threads

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: ec6efa9b26]
2025-06-06 07:34:43 -05:00
Arm Patinyasakdikul 8dd9747504 Increase default WORK_FIFO size to accommodate larger alltoall. (#1722)
[ROCm/rccl commit: d5b5f6b159]
2025-06-05 09:02:45 -05:00
Pedram Alizadeh 1ace5d05ed Reapplying PR #1641 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1713)
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"

This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.

* Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8

[ROCm/rccl commit: 3f7c08648f]
2025-06-04 13:22:11 -04:00
Avinash a50ff2c3d3 SPLITCOMM design fix in src/misc/msccl (#1715)
* Fix TOC-TOU in mcclInit

* Improving vector resize thread safety

* Initial commit rank to comm change

* Removing unwanted include header changes

* Updated CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: e94b360246]
2025-06-01 21:00:38 -05:00
alex-breslow-amd 4277b5aa88 Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681) for Single Node on Some GFX9 Systems
Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.

[ROCm/rccl commit: 2f6b20c00a]
2025-05-29 16:17:35 -07:00
Nilesh M Negi 19ed482121 Re-apply unroll=1 and 112 channels for gfx950 (#1706)
* Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
This reverts commit a6972c0d09.

* Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
This reverts commit 1a2eca1756.

[ROCm/rccl commit: 12517a957e]
2025-05-28 14:58:10 -05:00
corey-derochie-amd 22120c6303 Fixed errors in the CHANGELOG for ROCm 7.0 (#1702)
* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix

[ROCm/rccl commit: 7b633d5844]
2025-05-23 15:47:59 -05:00
akolliasAMD 6e2f75d424 remove user from code owner file (#1709)
[ROCm/rccl commit: aabd181fe4]
2025-05-23 15:45:15 -05:00
Arm Patinyasakdikul 59597ad8a7 Test: bump max stacksize once again to match current expectation.
[ROCm/rccl commit: c07445d5b4]
2025-05-23 11:18:25 -05:00
alex-breslow-amd 056ca0edfa Make offload-compress the default (#1704)
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress gpu code at link and compile time

[ROCm/rccl commit: f5b44acb1b]
2025-05-22 22:33:25 -05:00
Nilesh M Negi 7803531f46 [DOCKER] Fix RCCL and RCCL-Tests build for stg1 base images (#1699)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 948d2b6a68]
2025-05-22 20:46:01 -05:00
Arm Patinyasakdikul 2cb65ba466 Test: Change max stack size to 520 to accommodate new ROCm changes.
[ROCm/rccl commit: 523e0893e4]
2025-05-21 20:21:27 -05:00
PedramAlizadeh a99f960742 Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 951ed9cde1.


[ROCm/rccl commit: 7f878baef0]
2025-05-21 20:21:27 -05:00
Giuseppe Congiu 285e2e41c8 NCCL 2.26.6-1
Fix profiler_v2 compatibility layer
 * Removing trafficBytes in profiler_v3 breaks casting to ncclProfilerEventDescr_v2_t
   in the compatibility layer for profiler_v2 interface. This patch fixes the issue
   by making the conversion between the two descriptors explicit.


[ROCm/rccl commit: 8171af656b]
2025-05-20 04:04:41 -07:00
isaki001 89bc9131aa fix improper patch reverse order (#1696)
[ROCm/rccl commit: 66ef428714]
2025-05-19 12:29:21 -05:00
Arm Patinyasakdikul 1313bccaca CHANGELOG.md: Add UT failures as known issue for 6.4.1. (#1698)
* CHANGELOG.md: Add UT failures as known issue for 6.4.1.

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 1710c27e77]
2025-05-19 10:40:50 -05:00
Arm Patinyasakdikul 3e16753c71 Added known issue for 6.4.1 release to CHANGELOG.md. (#1697)
[ROCm/rccl commit: e602497789]
2025-05-16 08:17:48 -05:00
Sam Wu 0db42fb854 Remove call to junit from math ci (#1691)
[ROCm/rccl commit: e5bf7bc5b1]
2025-05-15 14:45:49 -06:00
Arm Patinyasakdikul 4b5ff98d65 Change GPU references to gfx950. (#1695)
[ROCm/rccl commit: f306c00671]
2025-05-15 10:32:46 -05:00
corey-derochie-amd 65d67dce7a Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546)
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"

This reverts commit 30eecfdb25.

* [UT] Modify max stack size to 496

* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION

* addressing the ci failure

* Adding the device tag

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 170acf3bda]
2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar 951ed9cde1 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)
* Update LL128 elems per thread

* Precompute ix[g] in LL128 prim

* Make Threadthreshold part of tuning models

* Ignore channel tuning when channels are env controlled

* Tune LL128 max limit for AG

* Tune LL128 max limit for RS

* Retune AR LL128 limits due to changes

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 00c1eb098c]
2025-05-14 14:35:54 -05:00
Dingming Wu 3731cae1b7 Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683)
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()

 For ROCm versions older than 6.4, HSA_NO_SCRATCH_RECLAIM=1 must be set to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect at runtime whether it is set.

* check hip runtime ver via hipRuntimeGetVersion

* move the detection to ncclinit func

* correct rocm version integer

* update warning message

* avoid unnecessary info msg on hsa_no_scratch_reclaim detection

[ROCm/rccl commit: 51f87fbb43]
2025-05-14 10:12:45 -05:00
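
The HSA_NO_SCRATCH_RECLAIM commit above checks an environment variable against the ROCm version. The following is a minimal sketch of that kind of check, not the code from the commit: both helper names are hypothetical, and passing the version as separate major/minor integers is an assumption made to keep the example independent of how the HIP runtime encodes its version.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: true when HSA_NO_SCRATCH_RECLAIM is set to "1"
// in the process environment.
inline bool scratchReclaimDisabled() {
  const char* value = std::getenv("HSA_NO_SCRATCH_RECLAIM");
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical helper: a warning is warranted only when the ROCm version
// is older than 6.4 and the variable has not been set.
inline bool shouldWarnAboutScratchReclaim(int rocmMajor, int rocmMinor,
                                          bool envSet) {
  const bool olderThan64 =
      rocmMajor < 6 || (rocmMajor == 6 && rocmMinor < 4);
  return olderThan64 && !envSet;
}
```

A caller would combine the two, for example `shouldWarnAboutScratchReclaim(major, minor, scratchReclaimDisabled())`, and log a warning when the result is true.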
mberenjk 08c0b8b0fc moving the thread_fence to apply before atomic fetch (#1672)
* applying thread_fence only on warp 0 before atomic fetch

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 1cefcee51f]
2025-05-14 10:10:05 -05:00
Mustafa Abduljabbar 128b0e7074 Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support

[ROCm/rccl commit: d665547eef]
2025-05-13 17:07:03 -05:00
Nikhil-Nunna ad657d957a Updated Codeowners (#1692)
[ROCm/rccl commit: a72a1939d1]
2025-05-12 18:58:39 -05:00
gilbertlee-amd 6e57154001 Fix when more than 64 channels are used for multi-collective group calls (#1688)
* Fix when more than 64 channels are used for multi-collective group calls

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 9ef45df8f7]
2025-05-12 18:05:57 -05:00
Avinash 6d6dd8434a RCCL Multinode DMA Buffer crash fix (#1682)
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.

* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABuff or Peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate  flushEnabled

[ROCm/rccl commit: 5f6805b4f4]
2025-05-08 19:17:39 -05:00