BertanDogancay
2a4e4308b0
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 3f94267f21 ]
2025-10-06 18:36:49 -04:00
Arm Patinyasakdikul
5f16e69d8e
Change to use -O0 instead of -O1 in debug build. ( #1949 )
...
* Change to use -O0 instead of -O1 in debug build.
* Use -O1 for device code to avoid linking issue in debug build.
[ROCm/rccl commit: feee02ca61 ]
2025-10-03 16:05:01 -05:00
Nilesh M Negi
6ade586fdc
Revert "changes for hugepages backed host buffer for larger allocations ( #1841 )" ( #1951 )
...
This reverts commit 3169352cad .
[ROCm/rccl commit: 342ec086e3 ]
2025-10-02 23:43:09 -05:00
amd-jiali
917973d9e9
Print out the hipRuntimeVersion message from WARN to always show up ( #1911 )
...
Authored-by: Jiali Li <jialili@amd.com >
[ROCm/rccl commit: 5978d2f9ab ]
2025-10-02 11:32:32 -05:00
dependabot[bot]
dfd4f19978
Bump rocm-docs-core from 1.22.0 to 1.26.0 in /docs/sphinx ( #1952 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.22.0 to 1.26.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.26.0/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.22.0...v1.26.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-version: 1.26.0
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 42ce371e3d ]
2025-10-02 11:33:14 -04:00
Istvan Kiss
118dc600ca
Add reference to supported data types section ( #1893 )
...
[ROCm/rccl commit: 3776129011 ]
2025-10-01 12:36:14 +02:00
David DeBonis
32b3a82956
Adding usage tip for ignore cpu affinity ( #1948 )
...
* Adding usage tip for ignore cpu affinity
* Update docs/how-to/rccl-usage-tips.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docs/how-to/rccl-usage-tips.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
[ROCm/rccl commit: d23d18f423 ]
2025-09-29 10:11:21 -06:00
Bhuvan Mital
3169352cad
changes for hugepages backed host buffer for larger allocations ( #1841 )
...
[ROCm/rccl commit: 65b69bf318 ]
2025-09-28 00:40:22 -05:00
Artem Kuzmitckii
722b0cd579
Revert disabling of context tracking for Radeon ( #1927 )
...
* Revert disabling of context tracking for Radeon
Original commit df3b7e47
`Disable context tracking for the current version. (#1839 )`
* Add env variable for disabling of context tracking for Radeon
`export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking
* Update docs/how-to/rccl-usage-tips.rst
Fix grammar, thanks @amd-jnovotny
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING
* Revert changes in includes and rename util function
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
[ROCm/rccl commit: 07925ec027 ]
2025-09-27 15:19:50 -04:00
alex-breslow-amd
6423f5b024
Gate code by rocm_version ( #1945 )
...
[ROCm/rccl commit: 45166f6586 ]
2025-09-26 13:28:41 -07:00
Mustafa Abduljabbar
22a24cc61a
Fix extra token typo ( #1943 )
...
[ROCm/rccl commit: 0dd2b2f65e ]
2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar
d646b0e49b
Expose symbols for RCCL algo/proto/channels selection functions ( #1923 )
...
* Unhide symbols for algo/proto functions
* Add all_gather direct usage detection
[ROCm/rccl commit: 7a329bbd94 ]
2025-09-25 18:58:30 -04:00
Larry Meadows
a8bf65a298
- LL Protocol: Add missing fences for gfx950, this fixes the hang issue ( #1932 )
...
- Remove asm flat_store_dwordx4, not needed
[ROCm/rccl commit: cb14fccdcc ]
2025-09-25 14:07:07 -07:00
Sai Enduri
15628819e2
Enable multi node rccl tests on MI350x slurm cluster. ( #1900 )
...
* Add tests on slurm cluster
* Integrate slurm.
* Add flags.
* Added dynamic selection of runners for tests and cleanup for slurm reservation
* Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation"
This reverts commit fdd5a6cc968c764d3d1039f0897fb11f11422928.
* Refactor so tests run on both architectures.
* continue on error
* fail fast false on matrix
* remove scancel
* skip all single node tests
* fix pattern matching for pytest
* switch to always skip github job
* Update to latest allocation.
* Clean up workflows and update docker image.
* Updated container image published from PR #1517
* Switch back to TheRock main branch sha.
---------
Co-authored-by: arravikum <arravikum@amd.com >
[ROCm/rccl commit: 01d16d4139 ]
2025-09-23 22:00:26 -07:00
corey-derochie-amd
c6fccec835
Moved new functions to the bottom of the function table to maintain backward compatibility ( #1931 )
...
* Moved new functions to the bottom of the function table to maintain backward compatibility
* Added ordering fixes to api_trace.cc
[ROCm/rccl commit: d86cf78810 ]
2025-09-23 13:30:27 -06:00
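Note on the entry above: the reason appending to a function table preserves backward compatibility can be sketched in C. This is a minimal illustration under stated assumptions. The struct and member names (api_table_v1, all_reduce_fn, new_fn, and so on) are hypothetical and are not taken from RCCL's actual function table or from api_trace.cc. The sketch only compares member offsets between an old layout and two candidate new layouts.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* offsetof */

/* Hypothetical table as seen by a client built against the old release. */
typedef struct {
    int (*init_fn)(void);
    int (*all_reduce_fn)(int value);
} api_table_v1;

/* New release with the new entry appended at the bottom. */
typedef struct {
    int (*init_fn)(void);
    int (*all_reduce_fn)(int value);
    int (*new_fn)(void);
} api_table_appended;

/* New release with the new entry inserted in the middle (layout to avoid). */
typedef struct {
    int (*init_fn)(void);
    int (*new_fn)(void);
    int (*all_reduce_fn)(int value);
} api_table_inserted;

/* Returns 1 if the old member keeps its old offset when the entry is appended. */
static int appended_keeps_offsets(void) {
    return offsetof(api_table_appended, all_reduce_fn) ==
           offsetof(api_table_v1, all_reduce_fn);
}

/* Returns 1 if the old member keeps its old offset when the entry is inserted. */
static int inserted_keeps_offsets(void) {
    return offsetof(api_table_inserted, all_reduce_fn) ==
           offsetof(api_table_v1, all_reduce_fn);
}

/* Self-check: appending must never move an existing entry. */
static void check_append_is_safe(void) {
    assert(appended_keeps_offsets());
}
```

A client compiled against the old layout reads all_reduce_fn at a fixed offset. Appending leaves that offset untouched, while inserting shifts it, so the old client would pick up the wrong function pointer.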
alex-breslow-amd
8c8c6886bc
Implement disassembling library into assembly with source code ( #1714 )
...
- Add --dump-asm to install.sh to dump assembly from the RCCL library
[ROCm/rccl commit: 8d6e21285c ]
2025-09-23 10:11:32 -07:00
Mustafa Abduljabbar
a075779dcd
Use batched P2P to enhance alltoall small message performance ( #1902 )
...
* Batch P2P operations (2 per CU/channel) and update channel-part mapping
- Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs
- P2P batching is only used for more than 2 nodes, to avoid aggregating intra-node traffic when it is dominant for fewer than 2 nodes
* Address single node regression and channel per net peer
* Add batching threshold
* Add enable switch for batching
* Update CHANGELOG.md
* Add minor comment change
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
[ROCm/rccl commit: c1e1f2faeb ]
2025-09-22 16:25:10 -04:00
Tim
bb2e12c831
Update RCCL Replayer README.md ( #1870 )
...
* Update Replayer README.md
[ROCm/rccl commit: ba44f170ad ]
2025-09-19 17:57:48 -04:00
corey-derochie-amd
375a2a009b
Added an implementation of ncclSymGetKernelPtr for when GENERATE_SYM_KERNELS is not defined, as it is normally generated code. ( #1925 )
...
[ROCm/rccl commit: 9b04b2a42f ]
2025-09-19 07:52:33 -06:00
corey-derochie-amd
ebcc75f118
Moved latency_profiler license into subdirs and updated NOTICES. ( #1918 )
...
[ROCm/rccl commit: ed095cad35 ]
2025-09-18 12:54:39 -06:00
Atul Kulkarni
980392b279
Updated tests based on NCCL 2.27.3-1 sync ( #1892 )
...
[ROCm/rccl commit: 9839d1c7c8 ]
2025-09-18 09:56:09 -05:00
Venkateshwar Reddy Kandula
1a0657f347
Update RCCL_API_TRACE_VERSION_PATCH to 2 due to NCCL API sync ( #1916 )
...
[ROCm/rccl commit: 0cc896910e ]
2025-09-18 07:36:50 -06:00
Surya Periaswamy
ebbcb16cca
Add speriaswamy-amd to CODEOWNERS ( #1921 )
...
[ROCm/rccl commit: 389f794d9a ]
2025-09-18 07:15:21 -05:00
Nilesh M Negi
d354caecb9
[INIT] Use rocm-smi API instead of CLI for querying FW version ( #1920 )
...
[ROCm/rccl commit: da06c69cb8 ]
2025-09-17 19:17:19 -05:00
nawrinsu
266067920f
Add nawrinsu to CODEOWNERS ( #1917 )
...
[ROCm/rccl commit: 0b03bb718a ]
2025-09-16 23:40:51 -05:00
Laura Promberger
b9be197d53
Bump minimum cmake version to 3.16 to enable cmake 4 ( #1909 )
...
The minimum required cmake version in test/CMakeLists.txt is bumped from 2.8
to 3.16. This aligns with the version used in CMakeLists.txt and will
enable building with cmake 4.
[ROCm/rccl commit: 0f6fec1553 ]
2025-09-16 23:10:22 -05:00
Weile
6ddae6ec42
add weilewei to CODEOWNERS ( #1915 )
...
[ROCm/rccl commit: f64b1f409f ]
2025-09-16 10:14:18 -07:00
Karthik Ganesan
8f4703c0cc
Update prims_simple.h to keep header file access to rccl_metadata.h uniform ( #1906 )
...
Header files in device/ folder are directly referenced in the code base except here.
[ROCm/rccl commit: 740dfd1efd ]
2025-09-16 08:58:50 -05:00
Kapil S. Pawar
a8f84f32a4
Added new tests for rccl_wrap - rcclOverrideProtocol, rcclOverrideAlgorithm ( #1895 )
...
* Added new unit tests for rccl_wrap
[ROCm/rccl commit: 86a6d06e40 ]
2025-09-15 18:00:26 -05:00
Bertan Dogancay
546b37e35a
[BUILD] Stop generating sym kernels by default ( #1907 )
...
* Stop generating sym kernels by default
[ROCm/rccl commit: 93d86dd8e3 ]
2025-09-15 12:19:35 -04:00
ycui1984
1b8a616247
[MIT] Add MIT license file ( #1908 )
...
[ROCm/rccl commit: da8abb2651 ]
2025-09-12 13:37:44 -05:00
Arm Patinyasakdikul
99699b10a2
Fix issue where staging/mainline build commit hash doesn't match the actual RCCL commit. ( #1910 )
...
[ROCm/rccl commit: f21fbdfc18 ]
2025-09-11 16:13:21 -05:00
mberenjk
84833f8472
disabling msccl for fp8 datatype ( #1888 )
...
* disabling msccl for fp8 datatype
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
[ROCm/rccl commit: ada4e12360 ]
2025-09-11 13:09:34 -05:00
Wenkai Du
94b53a03ac
Treat PIX and PXB as same GDR distance ( #1894 )
...
[ROCm/rccl commit: de9ebd8a8b ]
2025-09-11 10:44:10 -05:00
isaki001
9fa7a738da
add reduce/broadcast algo/proto selection table for multi-node gfx940 ( #1889 )
...
[ROCm/rccl commit: 9c36439354 ]
2025-09-10 14:25:23 -05:00
Wenkai Du
2577b33de8
Enable LL128 and use same tuning table for gfx942 4 NICs ( #1898 )
...
[ROCm/rccl commit: c2bccf9156 ]
2025-09-10 11:11:15 -04:00
Kapil S. Pawar
80aa4daa4d
Added new tests for rccl_wrap - rcclSetPipelining ( #1890 )
...
* Added tests for rcclSetPipelining
* Added conditions to skip the test
* Updated message size
[ROCm/rccl commit: f418a4c6d0 ]
2025-09-05 09:29:11 -05:00
Mustafa Abduljabbar
26495be59c
Use add_unroll.sh in topo_expl makefile ( #1886 )
...
[ROCm/rccl commit: 6e45eaf75e ]
2025-09-03 09:43:16 -04:00
Mustafa Abduljabbar
1a7ab8dfc8
Force enable proto and/or algo after model selection ( #1799 )
...
* Force enable proto or algo
* Remove inc nccl_common.h
* Move logic and add error checks
* Fix topo_expl compatibility
* Allow algo/proto overrides
* Remove extra function decl
* Clarify warning message
* Move algo/proto overrides into separate functions
* Update CHANGELOG.md
[ROCm/rccl commit: 7ccc6f268f ]
2025-09-03 08:54:13 -04:00
ycui1984
1999f2eba8
[rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 ( #1867 )
...
* [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0
* [rocm_regression] Check firmware version
* [rocm_regression] Resolve review comments
* [rocm_regression] Move hsa env checking into init once func
* [rocm_regression] Prevent hot fix version in firmware
* [rocm_regression] Improve unit tests
[ROCm/rccl commit: 361d596229 ]
2025-08-29 11:18:23 -05:00
Bertan Dogancay
21278d2073
Merge pull request #1880 from rahulvaidya20/2.27.3-1
...
[SYNC] 2.27.3-1
[ROCm/rccl commit: 9afc15625f ]
2025-08-29 12:10:12 -04:00
BertanDogancay
881327184e
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 08a7be231b ]
2025-08-28 15:46:28 -05:00
Avinash
832c5b1f13
[build] Disable MSCCL++ compilation by default ( #1879 )
...
* Enable MSCCLPP on request
* Updating docs and README
* Updates to CHANGELOG.md
* Update CHANGELOG.md
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Updates to CHANGELOG.md
* Update CHANGELOG.md
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com >
* Update CHANGELOG.md
GitHub didn't take the edit to my suggestion properly.
---------
Co-authored-by: amd <amd@super3.amd.com >
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com >
[ROCm/rccl commit: a0ec15bafe ]
2025-08-28 08:52:12 -06:00
Nilesh M Negi
ddb1c9bd8b
[AzureCI] Switch to ROCm 6.4.1 and add rccl-tests ( #1782 )
...
* Use ROCm 6.4.1 for testing
* Extend RCCL-Tests to multi-node
* Add HSA_NO_SCRATCH_RECLAIM to UT runs
* Limit to single-node rccl-tests for now
[ROCm/rccl commit: d73cee7588 ]
2025-08-27 21:07:53 -05:00
jonatluu
8526a86978
fix lintian warning package-contains-timestamped-gzip ( #1865 )
...
* fix lintian warning package-contains-timestamped-gzip
* fix lintian warning
[ROCm/rccl commit: 4699bff790 ]
2025-08-27 13:29:07 -04:00
Geo Min
6db483845d
[TheRock CI] Adding single node tests for RCCL ( #1876 )
...
* Add single-node testing
* Adding single node test
* Adding quotes
* fix typo
* Adding test flag
* No MPI
* Adding openmpi install
* Adding comment
* PR comments
* Missing proj
* Adding half
* Adding rocr runtime
* Adding them all
* new sha
* Fixing script
* Removing confusing skip test case
* Adding docs
* Update .github/workflows/therock-test-packages-single-node.yml
Co-authored-by: Marius Brehler <marius.brehler@amd.com >
---------
Co-authored-by: Marius Brehler <marius.brehler@amd.com >
[ROCm/rccl commit: f404624d9e ]
2025-08-27 08:13:10 -07:00
Nusrat Islam
fde5d7a8be
Device allocation tracker ( #1878 )
...
* alloc: add memory allocation tracker
* alloc: add tracker for ncclCuMemAlloc() APIs
* alloc: add null pointer check during free
[ROCm/rccl commit: df448862c3 ]
2025-08-27 09:30:51 -05:00
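Note on the entry above: the general idea of an allocation tracker with a null-pointer check on free can be sketched in C. This is an illustration only, under stated assumptions. It uses host malloc/free as a stand-in for device allocation, and the names alloc_tracker, tracked_alloc, and tracked_free are hypothetical. It does not reproduce RCCL's tracker or the ncclCuMemAlloc() code path.

```c
#include <assert.h>  /* assert */
#include <stddef.h>  /* size_t, max_align_t */
#include <stdlib.h>  /* malloc, free */

/* Hypothetical tracker state: running totals of live allocations. */
typedef struct {
    size_t live_bytes;
    size_t live_allocs;
} alloc_tracker;

/* Header stored in front of each block so free can recover its size. */
typedef union {
    size_t size;
    max_align_t align;  /* keeps the user pointer suitably aligned */
} alloc_header;

/* Allocates size bytes and records them in the tracker; NULL on failure. */
static void *tracked_alloc(alloc_tracker *t, size_t size) {
    alloc_header *h = malloc(sizeof(alloc_header) + size);
    if (h == NULL) {
        return NULL;
    }
    h->size = size;
    t->live_bytes += size;
    t->live_allocs += 1;
    return h + 1;  /* user memory starts right after the header */
}

/* Releases a tracked block; a NULL pointer is ignored and not counted. */
static void tracked_free(alloc_tracker *t, void *ptr) {
    if (ptr == NULL) {
        return;  /* null pointer check during free */
    }
    alloc_header *h = (alloc_header *)ptr - 1;
    assert(t->live_allocs > 0 && t->live_bytes >= h->size);
    t->live_bytes -= h->size;
    t->live_allocs -= 1;
    free(h);
}
```

Without the null check, freeing a NULL pointer would read a header that does not exist and corrupt the counters. With it, the call is a no-op, matching the behavior of free(NULL).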
Kapil S. Pawar
3d889cc189
Code coverage tests for param.cc ( #1872 )
...
* Added code coverage unit tests for param.cc
* Updated ParamTests.cpp and removed ParamTestsConfFile.txt
* Updated ParamTests.cpp
* Removed NCCL_LOG_INFO and added sample config file
---------
Co-authored-by: Pawar <kpawar@ctr2-alola-ctrl-01.amd.com >
[ROCm/rccl commit: c9becd89cd ]
2025-08-27 09:30:37 -05:00
ishkool
f500628ef2
Code coverage tests for net_socket.cc ( #1840 )
...
* Code coverage UTs for net_socket.cc
* Addressed review comments
---------
Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com >
[ROCm/rccl commit: c288fbf1b2 ]
2025-08-27 09:24:21 -05:00
Marius Brehler
5277457f21
Bump TheRock version used for testing ( #1885 )
...
[ROCm/rccl commit: 221205ebd4 ]
2025-08-27 16:22:27 +02:00