Commit graph

1962 commits

Author SHA1 Message Date
Kapil S. Pawar 912d53caba Fix segmentation fault related to ext-profiler plugin (#1986) 2025-10-23 09:26:35 -05:00
Joseph Macaranas c2e71e83d1 [External CI] Add references to rocm-systems super repo (#1935)
- In order to trigger downstream jobs to verify projects that consume rccl, references to those repos are required.
2025-10-22 16:07:05 -04:00
Aravind Ravikumar 506c2e9878 Adding reservation time for salloc in CI (#1992)
Co-authored-by: arravikum <arravikum@amd.com>
2025-10-22 10:00:01 -04:00
ehsanhosseinzadehKhaligh aec4f0a659 Updating npkit_trace_generator.py to check npkit directory (#1891)
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-10-22 02:51:16 -05:00
Afzal Patel 724680f87c add roctracer and rocm-core include directories (#1970) 2025-10-21 13:53:57 -04:00
Sourav Chakraborty 57286d5df3 Fix incorrect benchmark name in JitterBench script (#1983) 2025-10-21 12:52:20 -05:00
Sourav Chakraborty 5b345d105c Fix build failure in rccl_prim_test (#1984)
Added missing header in rccl_prim_test
2025-10-21 12:51:14 -05:00
mberenjk b58f234539 Add support for additional paths in RCCL DMABUF kernel configuration loading (#1825)
* Adding more paths to the kernel load and an environment variable to force enable DMABUF

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-10-20 13:35:22 -07:00
Mythreya Kuricheti 9ae5956ca5 [rocprofiler-sdk] Update codeowner for api-trace.h (#1974)
Feedback from #1933
2025-10-20 10:43:42 -06:00
Nilesh M Negi 34d469864b [FORMAT] Add .clang-format for C++ code (#1404) 2025-10-20 10:54:03 -05:00
JC b1589a5786 [CI] Enable ccache w/ namespace for external use (#1966)
* Enable ccache w/ namespace for external use

* Remove TheRock from setup_tools.py command line

* Bump TheRock commit to use health_status.py

Resolves https://github.com/ROCm/rccl/pull/1966/files/c6d2e8ce5c14a2c94bfb47e21d3e2d466f25c9b4#r2420734710

* Bump TheRock to older commit with health_status.py

* Add git safe directory for working directory

* Move install python deps

* Remove pip freeze
2025-10-20 08:44:42 -07:00
Nilesh M Negi c35bc721ad Fix ncclDevFuncId for AllReduceWithBias (#1980) 2025-10-17 09:28:57 -05:00
Arm Patinyasakdikul 58eca5d7f8 Disable graph mode memory registration and UBR as unsupported feature. (#1977) 2025-10-17 09:18:39 -05:00
Arm Patinyasakdikul 9806f5e9dd Fix git version fetching logic. (#1981) 2025-10-17 09:17:49 -05:00
Rahul Vaidya 624f68b2b2 [Profiler plugin] Fix segfault issue with profiler plugin (#1973)
* Fix profiler plugin segfault by correctly setting p2p->func

* Look for librccl-profiler.so instead of libnccl-profiler.so

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>

---------

Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
Co-authored-by: Yongjie Qiu <Yongjie.Qiu@amd.com>
2025-10-16 16:33:18 -05:00
alex-breslow-amd 154350baaf MSCCL: Unland PR1788 + Fix for MSCCL Data Corruption (#1960)
- Earlier fix PR1788 is no longer necessary after ROCr fix and pre-ROCr fix workaround
- Inserts an s_waitcnt vmcnt(0), which fixes a data corruption issue in MSCCL
2025-10-15 10:32:25 -07:00
gilbertlee-amd fedddb452c Enabling gdrcopy option for gfx950 (#1955) 2025-10-15 10:55:25 -06:00
alex-breslow-amd c70f5b4621 [gfx950] Make bypassing __threadfence the default for multinode. (#1947)
* Gate based on ROCM version, safe for ROCm 7.0.2 and beyond.
* Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950.  Thanks Nilesh.
* Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.
2025-10-15 09:15:36 -07:00
isaki001 0f99fd84a3 gfx950 channel tuning for ReduceScatter and AllGather (#1940)
* add channel thresholds to override channel-count adjustments
2025-10-14 09:50:44 -05:00
mberenjk e738c03e39 fixing the ar_with_bias test issue when running rccl-tests (#1912)
* fixing the AR_With_Bias issue when running rccl-tests
2025-10-13 13:58:21 -07:00
alex-breslow-amd ff209e5b19 Dump compiler-determined GPU kernel resource usage (#1965)
Adds --kernel-resource-use flag to install.sh to allow dumping per-GPU kernel resource use at compile time (e.g., VGPRs, LDS, SGPRs, scratch, etc.)
2025-10-13 11:24:42 -05:00
Geo Min 97f2665da2 fixing group id (#1975) 2025-10-10 16:40:44 -07:00
Mythreya Kuricheti 3000f0e837 [rocprofiler-sdk] Add codeowner for api-trace.h (#1933) 2025-10-10 16:29:17 -05:00
Arm Patinyasakdikul ff75860d73 Fix unroll factor display bug. (#1969) 2025-10-10 15:35:06 -05:00
Surya Periaswamy 5bd5079de1 MSCCL++ fix split path null deref (#1959)
* Add speriaswamy-amd to CODEOWNERS
* MSCCL++: fix split path null deref; key maps by parent ncclUniqueId
* removed no-op
2025-10-09 14:08:38 -05:00
Rahul Vaidya 6b200ee6c5 Fix LL128 proto selection to respect user setting (#1822) 2025-10-09 14:08:03 -05:00
Nusrat Islam d22a39e954 Update direct AG and single node LL threshold (#1944)
* update AG direct and single node LL threshold

* update thresholds based on MI350 experimental results

* disable using LL for direct AG

* enable direct AG for lower GPU counts

* direct AG single node tuning

* fix in-place buffer allocation for AG unit test

* whitespace fix

* gate direct AG for gfx950 and gfx942

---------

Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>
2025-10-09 10:48:50 -05:00
Artem Kuzmitckii 00a42c80f3 Reverse logic of context tracking enablement from #1927 (#1971)
In this commit it is disabled by default and can be enabled via
`RCCL_ENABLE_CONTEXT_TRACKING=1` for both (CDNA, RDNA)
Original PR https://github.com/ROCm/rccl/pull/1927
2025-10-09 10:24:09 +02:00
Arm Patinyasakdikul cede6d0134 Revert "Change to use -O0 instead of -O1 in debug build. (#1949)" (#1957)
This reverts commit feee02ca61.
2025-10-08 10:01:45 -05:00
Aravind Ravikumar 1858a31c41 Enable Presubmit CI Gating for develop Branch (TheRock CI for RCCL) (#1954)
* Trigger CI run on pull request

* Enabling CI run on different PR types

---------

Co-authored-by: arravikum <arravikum@amd.com>
2025-10-07 09:11:50 -04:00
corey-derochie-amd b1fbf535da [SYNC] 2.27.7 (#1928)
Merge pull request #1928 from corey-derochie-amd/2.27.7-sync
2025-10-06 16:47:50 -06:00
BertanDogancay 3f94267f21 Merge remote-tracking branch 'nccl/master' into develop 2025-10-06 18:36:49 -04:00
Arm Patinyasakdikul feee02ca61 Change to use -O0 instead of -O1 in debug build. (#1949)
* Change to use -O0 instead of -O1 in debug build.

* Use -O1 for device code to avoid linking issue in debug build.
2025-10-03 16:05:01 -05:00
Nilesh M Negi 342ec086e3 Revert "changes for hugepages backed host buffer for larger allocations (#1841)" (#1951)
This reverts commit 65b69bf318.
2025-10-02 23:43:09 -05:00
amd-jiali 5978d2f9ab Print out the hipRuntimeVersion message from WARN to always show up (#1911)
Authored-by: Jiali Li <jialili@amd.com>
2025-10-02 11:32:32 -05:00
dependabot[bot] 42ce371e3d Bump rocm-docs-core from 1.22.0 to 1.26.0 in /docs/sphinx (#1952)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.22.0 to 1.26.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.26.0/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.22.0...v1.26.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.26.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-10-02 11:33:14 -04:00
Istvan Kiss 3776129011 Add reference to supported data types section (#1893) 2025-10-01 12:36:14 +02:00
David DeBonis d23d18f423 Adding usage tip for ignore cpu affinity (#1948)
* Adding usage tip for ignore cpu affinity

* Update docs/how-to/rccl-usage-tips.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update docs/how-to/rccl-usage-tips.rst

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-29 10:11:21 -06:00
Bhuvan Mital 65b69bf318 changes for hugepages backed host buffer for larger allocations (#1841) 2025-09-28 00:40:22 -05:00
Artem Kuzmitckii 07925ec027 Revert disabling of context tracking for Radeon (#1927)
* Revert disabling of context tracking for Radeon

Original commit 6fc228e2
 `Disable context tracking for the current version. (#1839)`

* Add env variable for disabling of context tracking for Radeon

`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking

* Update docs/how-to/rccl-usage-tips.rst

Fix grammar, thanks @amd-jnovotny

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING

* Revert changes in includes and rename util function

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-27 15:19:50 -04:00
alex-breslow-amd 45166f6586 Gate code by rocm_version (#1945) 2025-09-26 13:28:41 -07:00
Mustafa Abduljabbar 0dd2b2f65e Fix extra token typo (#1943) 2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar 7a329bbd94 Expose symbols for RCCL algo/proto/channels selection functions (#1923)
* Unhide symbols for algo/proto functions

* Add all_gather direct usage detection
2025-09-25 18:58:30 -04:00
Larry Meadows cb14fccdcc - LL Protocol: Add missing fences for gfx950, this fixes the hang issue (#1932)
- Remove asm flat_store_dwordx4, not needed
2025-09-25 14:07:07 -07:00
Sai Enduri 01d16d4139 Enable multi node rccl tests on MI350x slurm cluster. (#1900)
* Add tests on slurm cluster

* Integrate slurm.

* Add flags.

* Added dynamic selection of runners for tests and cleanup for slurm reservation

* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"

This reverts commit d5350ff6e4f563ddd56ad81e4bc2a393ed55ba00.

* Refactor so tests run on both architectures.

* continue on error

* fail fast false on matrix

* remove scancel

* skip all single node tests

* fix pattern matching for pytest

* switch to always skip github job

* Update to latest allocation.

* Clean up workflows and update docker image.

* Updated container image published from PR #1517

* Switch back to TheRock main branch sha.

---------

Co-authored-by: arravikum <arravikum@amd.com>
2025-09-23 22:00:26 -07:00
corey-derochie-amd d86cf78810 Moved new functions to the bottom of the function table to maintain backward compatibility (#1931)
* Moved new functions to the bottom of the function table to maintain backward compatibility

* Added ordering fixes to api_trace.cc
2025-09-23 13:30:27 -06:00
alex-breslow-amd 8d6e21285c Implement disassembling library into assembly with source code (#1714)
- Add --dump-asm to install.sh to dump assembly from RCCL library
2025-09-23 10:11:32 -07:00
Mustafa Abduljabbar c1e1f2faeb Use batched P2P to enhance alltoall small message performance (#1902)
* Batch P2P operations (2 per CU/channel) and update channel-part mapping

- Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs

- P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes

* Address single node regression and channel per net peer

* Add batching threshold

* Add enable switch for batching

* Update CHANGELOG.md

* Add minor comment change

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-09-22 16:25:10 -04:00
Tim ba44f170ad Update RCCL Replayer README.md (#1870)
* Update Replayer README.md
2025-09-19 17:57:48 -04:00
corey-derochie-amd 9b04b2a42f Added an implementation of ncclSymGetKernelPtr for when GENERATE_SYM_KERNELS is not defined, as it is normally generated code. (#1925) 2025-09-19 07:52:33 -06:00