Commit Graph

1781 Commits

Author SHA1 Message Date
jonatluu 709140204a Remove File reorganization backward compatibility (rccl) (#1753) 2025-06-22 17:18:26 -05:00
Grant Pinkert 2482d1475f Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.
2025-06-22 05:54:44 -05:00
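The hang described in the commit above comes from waiting on a child process with no upper bound. The sketch below shows the same idea in Python. It is not the CMake change itself, which is not reproduced in this log. It runs a tool with a timeout and turns a hang, a failed launch, or a non-zero exit into an explicit error message. The helper name `run_with_timeout` and the 5-second default are assumptions made for illustration.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return (ok, message) instead of blocking forever."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left hanging.
        return False, f"{cmd[0]} timed out after {timeout_s} s"
    except OSError as exc:
        return False, f"could not start {cmd[0]}: {exc}"
    if result.returncode != 0:
        return False, f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
    return True, result.stdout


if __name__ == "__main__":
    # A child that sleeps longer than the timeout stands in for a hung tool.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, timeout_s=1.0))
```

In CMake, `execute_process` accepts a `TIMEOUT` argument that serves the same purpose. Whether the commit uses that argument or another mechanism is not stated in the message.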
Bertan Dogancay 675b495a00 [NPKit] Create dump dir regardless of default or user provided path (#1757) 2025-06-21 21:18:20 -05:00
Bertan Dogancay 0c1795c64b Merge pull request #1721 from BertanDogancay/2.26-sync
[SYNC] 2.26.6-1
2025-06-20 09:57:09 -04:00
BertanDogancay aaf023976a Merge remote-tracking branch 'nccl/master' into develop 2025-06-20 07:54:49 -05:00
Joseph Macaranas 12315c259a [Azure CI] rccl nightly pipeline that runs on slurm (#1723)
* [Azure CI] rccl nightly pipeline that runs on slurm
- Login node will be set up as a self-hosted agent on Azure Pipelines.
- Login node will run this job nightly.
- Login node will check out the latest develop source, run build and test through sbatch calls, and then wait for the jobs to complete. When the jobs are complete, it prints out the logs.
2025-06-19 10:41:40 -05:00
Nilesh M Negi 92a5d225d9 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
Sarat Kamisetty fa0422f174 generic net plugin ctxt that is extensible for use in multiple APIs (#1735)
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>
2025-06-16 14:48:08 -07:00
Bertan Dogancay 39211c6b41 [NPKit] Use default output directory when env var is not set (#1747) 2025-06-16 15:26:53 -04:00
Mustafa Abduljabbar fb4ad82d0d Fix topo_explorer compatibility and capture WarpSize (#1743) 2025-06-16 08:18:35 -04:00
Tim ba97c9c18b replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated format of recorder, especially in binary department, according to replayer's need
2025-06-13 15:05:34 -04:00
Richard Barnes 4486d091b8 Enable -Wdeprecated-copy-with-user-provided-copy (#1643) 2025-06-13 08:23:31 -07:00
Arm Patinyasakdikul 6c37ae9470 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.
2025-06-12 09:58:01 -05:00
corey-derochie-amd 03fba66e71 Deprecated MSCCL API functions (#1740) 2025-06-11 17:52:09 -06:00
Nusrat Islam 75c3c8215c msccl: adjust msccl threshold for bf16 (#1736)
* msccl: adjust msccl threshold for bf16

* Update src/collectives.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-11 09:09:57 -05:00
Arm Patinyasakdikul 600ace7f19 Fixed erroneous parenthesis. (#1739) 2025-06-11 09:08:00 -05:00
Nilesh M Negi 9d72be7b2f [DEVICE] Adding ability to choose unroll factor at runtime (#1734)
* Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR
* [BUILD] Add support for user-defined UNROLL for debugging
* Update CHANGELOG.md
* Fix COLLTRACE errors in CI
* Add debug statements for unroll and resolve warnings
* Incorporate UNROLL into ONLY_FUNCS for debugging

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-06-11 00:07:59 -05:00
Atul Kulkarni 682ed36fe6 Added new ENABLE_CODE_COVERAGE option. (#1664)
Modified install.sh script to add this new option
2025-06-10 12:12:36 -05:00
Nilesh M Negi b926203c05 [DEVICE] Use threadfence on gfx950 for LL protocol (#1686)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-09 01:26:07 -05:00
Nilesh M Negi ef5b4ff630 [BUILD] Enable LL128 on gfx950 (#1731)
* [BUILD] Enable LL128 on gfx950
* Modify comment in src/rccl_wrap.cc
* Update CHANGELOG

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-09 00:25:54 -05:00
vstojilj 2ac44cfe4e SWDEV-536040 - Include <thread> header (#1724) 2025-06-06 10:28:11 -06:00
Arm Patinyasakdikul ec6efa9b26 Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0. (#1720)
* Remove 'warpSize' compiler constant as it is deprecated in ROCm 7.0.

* Create ncclShmemScratchWarpSize on host side for enqueue.cc.

* Update src/enqueue.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* address comments

* fix number of threads

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-06 07:34:43 -05:00
Arm Patinyasakdikul d5b5f6b159 Increase default WORK_FIFO size to accommodate larger alltoall. (#1722) 2025-06-05 09:02:45 -05:00
Pedram Alizadeh 3f7c08648f Reapplying PR #1641 [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1713)
* Reapply "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"

This reverts commit 943ad6f7820739385a0b54e81f823d0df1dbf71c.

* Decreasing NCCL_LL128_SHMEM_ELEMS_PER_THREAD from 16 to 8
2025-06-04 13:22:11 -04:00
Avinash e94b360246 SPLITCOMM design fix in src/misc/msccl (#1715)
* Fix TOC-TOU in mcclInit

* Improving vector resize thread safety

* Initial commit rank to comm change

* Removing unwanted include header changes

* Updated CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-06-01 21:00:38 -05:00
alex-breslow-amd 2f6b20c00a Use One Slice per Basic Primitive for AllReduce, ReduceScatter, AllGather (#1681) for Single Node on Some GFX9 Systems
Using a single slice rather than the typical two provides about 5% speedup (sometimes more or less) on some GFX9 systems for single node.
2025-05-29 16:17:35 -07:00
Nilesh M Negi 12517a957e Re-apply unroll=1 and 112 channels for gfx950 (#1706)
* Reapply "[SRC] Enable unroll=1 for gfx950 (#1602)" (#1667)
This reverts commit 329e13efff.

* Reapply "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)" (#1620)
This reverts commit b17338d164.
2025-05-28 14:58:10 -05:00
corey-derochie-amd 7b633d5844 Fixed errors in the CHANGELOG for ROCm 7.0 (#1702)
* Updated 6.5 release to be 7.0
* Corrected the RCCL version for 6.4.1
* Moved items to the correct releases
* Added NCCL 2.25.1 compatibility item
* Fixed wording
* Added entry for `ManagedMem` and `ManagedMemGraph` test fix
2025-05-23 15:47:59 -05:00
akolliasAMD aabd181fe4 remove user from code owner file (#1709) 2025-05-23 15:45:15 -05:00
Arm Patinyasakdikul c07445d5b4 Test: bump max stacksize once again to match current expectation. 2025-05-23 11:18:25 -05:00
alex-breslow-amd f5b44acb1b Make offload-compress the default (#1704)
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress gpu code at link and compile time
2025-05-22 22:33:25 -05:00
Nilesh M Negi 948d2b6a68 [DOCKER] Fix RCCL and RCCL-Tests build for stg1 base images (#1699)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-05-22 20:46:01 -05:00
Arm Patinyasakdikul 523e0893e4 Test: Change max stack size to 520 to accommodate new ROCm changes. 2025-05-21 20:21:27 -05:00
PedramAlizadeh 7f878baef0 Revert "[AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)"
This reverts commit 00c1eb098c.
2025-05-21 20:21:27 -05:00
Giuseppe Congiu 8171af656b NCCL 2.26.6-1
Fix profiler_v2 compatibility layer
 * Removing trafficBytes in profiler_v3 breaks casting to ncclProfilerEventDescr_v2_t
   in the compatibility layer for profiler_v2 interface. This patch fixes the issue
   by making the conversion between the two descriptors explicit.
2025-05-20 04:04:41 -07:00
isaki001 66ef428714 fix improper patch reverse order (#1696) 2025-05-19 12:29:21 -05:00
Arm Patinyasakdikul 1710c27e77 CHANGELOG.md: Add UT failures as known issue for 6.4.1. (#1698)
* CHANGELOG.md: Add UT failures as known issue for 6.4.1.

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-19 10:40:50 -05:00
Arm Patinyasakdikul e602497789 Added known issue for 6.4.1 release to CHANGELOG.md. (#1697) 2025-05-16 08:17:48 -05:00
Sam Wu e5bf7bc5b1 Remove call to junit from math ci (#1691) 2025-05-15 14:45:49 -06:00
Arm Patinyasakdikul f306c00671 Change GPU references to gfx950. (#1695) 2025-05-15 10:32:46 -05:00
corey-derochie-amd 170acf3bda Switched to using the hip_fp8 header instead of rccl_float8, resolving compatibility issues. (#1546)
* Revert "Revert "replacing rccl_float8 with hip_fp8 and address compatibility …"

This reverts commit 824b81c034.

* [UT] Modify max stack size to 496

* adding a check for OCP type and replacing ROCM_VERSION with HIP_VERSION

* addressing the ci failure

* Adding the device tag

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 15:33:03 -05:00
Mustafa Abduljabbar 00c1eb098c [AG and RS channel tuning] Add thread work threshold to tuning models and precompute reg index in LL128 (#1641)
* Update LL128 elems per thread

* Precompute ix[g] in LL128 prim

* Make Threadthreshold part of tuning models

* Ignore channel tuning when channels are env controlled

* Tune LL128 max limit for AG

* Tune LL128 max limit for RS

* Retune AR LL128 limits due to changes

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-14 14:35:54 -05:00
Dingming Wu 51f87fbb43 Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv() (#1683)
* Detect if HSA_NO_SCRATCH_RECLAIM is set after initEnv()

For ROCm older than 6.4, we need to set HSA_NO_SCRATCH_RECLAIM=1 to use the LL128 protocol.
This environment variable is set outside of RCCL, so add logging to detect whether it is set at runtime.

* check hip runtime ver via hipRuntimeGetVersion

* move the detection to ncclinit func

* correct rocm version integer

* update warning message

* avoid unnecessary info msg on hsa_no_scratch_reclaim detection
2025-05-14 10:12:45 -05:00
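The check described in the commit above can be sketched as follows. This is an illustration in Python, not RCCL's C++ code. The function name `ll128_scratch_warning` and the `(major, minor)` version tuple are assumptions. According to the commit message, RCCL obtains the runtime version via `hipRuntimeGetVersion`.

```python
import os


def ll128_scratch_warning(rocm_version, env=None):
    """Return a warning if LL128 needs HSA_NO_SCRATCH_RECLAIM=1 but it is unset.

    rocm_version is a hypothetical (major, minor) tuple supplied by the caller.
    """
    env = os.environ if env is None else env
    if rocm_version >= (6, 4):
        # Per the commit message, only ROCm older than 6.4 needs the variable.
        return None
    if env.get("HSA_NO_SCRATCH_RECLAIM") == "1":
        return None
    return "HSA_NO_SCRATCH_RECLAIM=1 is not set; LL128 needs it on ROCm older than 6.4"
```

Returning `None` when nothing is wrong mirrors the last bullet of the commit, which avoids emitting an unnecessary info message.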
mberenjk 1cefcee51f moving the thread_fence to apply before atomic fetch (#1672)
* applying thread_fence only on warp 0 before atomic fetch

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-14 10:10:05 -05:00
Mustafa Abduljabbar d665547eef Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support
2025-05-13 17:07:03 -05:00
Nikhil-Nunna a72a1939d1 Updated Codeowners (#1692) 2025-05-12 18:58:39 -05:00
gilbertlee-amd 9ef45df8f7 Fix when more than 64 channels are used for multi-collective group calls (#1688)
* Fix when more than 64 channels are used for multi-collective group calls

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-05-12 18:05:57 -05:00
Avinash 5f6805b4f4 RCCL Multinode DMA Buffer crash fix (#1682)
This commit handles DMABUF initialization and calls the appropriate handling function. This fixes a crash on operating systems that have no peermem support and rely only on DMABUF.

* Initial test commit
* Handling Dmabuf_fd opening and closing
* Cleanup
* Use DMABuff or Peermem as needed
* Using user input for ibDmaBufSupportInitOnce
* Revert all changes to rocmwrap.cc
* Revert all changes to rocmwrap.cc
* Changing to func definition braces
* Reverting line removal in utils.h
* useDmaBuf to calculate flushEnabled
2025-05-08 19:17:39 -05:00
mberenjk e70003736e Write JSON file to /tmp directory to avoid incorrect write access in recorderTest (#1680)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-05-07 13:58:27 -05:00
Avinash c54a0c085a collective trace improvements for debugging (#1661) 2025-05-07 13:37:31 -05:00