rocm-systems

Author	SHA1	Message	Date
Arm Patinyasakdikul	ff75860d73	Fix unroll factor display bug. (#1969 )	2025-10-10 15:35:06 -05:00
Surya Periaswamy	5bd5079de1	MSCCL++ fix split path null deref (#1959 ) * Add speriaswamy-amd to CODEOWNERS * MSCCL++: fix split path null deref; key maps by parent ncclUniqueId * removed no-op	2025-10-09 14:08:38 -05:00
Rahul Vaidya	6b200ee6c5	Fix LL128 proto selection to respect user setting (#1822 )	2025-10-09 14:08:03 -05:00
Nusrat Islam	d22a39e954	Update direct AG and single node LL threshold (#1944 ) * update AG direct and single node LL threshold * update thresholds based on MI350 experimental results * disable using LL for direct AG * enable direct AG for lower GPU counts * direct AG single node tuning * fix in-place buffer allocation for AG unit test * whitespace fix * gate direct AG for gfx950 and gfx942 --------- Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>	2025-10-09 10:48:50 -05:00
Artem Kuzmitckii	00a42c80f3	Reverse logic of context tracking enablement from #1927 (#1971 ) In this commit, context tracking is disabled by default and can be enabled via `RCCL_ENABLE_CONTEXT_TRACKING=1` on both CDNA and RDNA. Original PR: https://github.com/ROCm/rccl/pull/1927	2025-10-09 10:24:09 +02:00
Arm Patinyasakdikul	cede6d0134	Revert "Change to use -O0 instead of -O1 in debug build. (#1949 )" (#1957 ) This reverts commit `feee02ca61`.	2025-10-08 10:01:45 -05:00
Aravind Ravikumar	1858a31c41	Enable Presubmit CI Gating for develop Branch (TheRock CI for RCCL) (#1954 ) * Trigger CI run on pull request * Enabling CI run on different PR types --------- Co-authored-by: arravikum <arravikum@amd.com>	2025-10-07 09:11:50 -04:00
corey-derochie-amd	b1fbf535da	[SYNC] 2.27.7 (#1928 ) Merge pull request #1928 from corey-derochie-amd/2.27.7-sync	2025-10-06 16:47:50 -06:00
BertanDogancay	3f94267f21	Merge remote-tracking branch 'nccl/master' into develop	2025-10-06 18:36:49 -04:00
Arm Patinyasakdikul	feee02ca61	Change to use -O0 instead of -O1 in debug build. (#1949 ) * Change to use -O0 instead of -O1 in debug build. * Use -O1 for device code to avoid linking issue in debug build.	2025-10-03 16:05:01 -05:00
Nilesh M Negi	342ec086e3	Revert "changes for hugepages backed host buffer for larger allocations (#1841 )" (#1951 ) This reverts commit `65b69bf318`.	2025-10-02 23:43:09 -05:00
amd-jiali	5978d2f9ab	Print out the hipRuntimeVersion message from WARN to always show up (#1911 ) Authored-by: Jiali Li <jialili@amd.com>	2025-10-02 11:32:32 -05:00
dependabot[bot]	42ce371e3d	Bump rocm-docs-core from 1.22.0 to 1.26.0 in /docs/sphinx (#1952 ) Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.22.0 to 1.26.0. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/v1.26.0/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.22.0...v1.26.0) --- updated-dependencies: - dependency-name: rocm-docs-core dependency-version: 1.26.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>	2025-10-02 11:33:14 -04:00
Istvan Kiss	3776129011	Add reference to supported data types section (#1893 )	2025-10-01 12:36:14 +02:00
David DeBonis	d23d18f423	Adding usage tip for ignore cpu affinity (#1948 ) * Adding usage tip for ignore cpu affinity * Update docs/how-to/rccl-usage-tips.rst Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Update docs/how-to/rccl-usage-tips.rst Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-29 10:11:21 -06:00
Bhuvan Mital	65b69bf318	changes for hugepages backed host buffer for larger allocations (#1841 )	2025-09-28 00:40:22 -05:00
Artem Kuzmitckii	07925ec027	Revert disabling of context tracking for Radeon (#1927 ) * Revert disabling of context tracking for Radeon Original commit `6fc228e2` `Disable context tracking for the current version. (#1839)` * Add env variable for disabling of context tracking for Radeon `export NCCL_DISABLE_CONTEXT_TRACKING=1` to force disable of context tracking * Update docs/how-to/rccl-usage-tips.rst Fix grammar, thanks @amd-jnovotny Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Rename NCCL_DISABLE_CONTEXT_TRACKING -> RCCL_DISABLE_CONTEXT_TRACKING * Revert changes in includes and rename util function --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-27 15:19:50 -04:00
alex-breslow-amd	45166f6586	Gate code by rocm_version (#1945 )	2025-09-26 13:28:41 -07:00
Mustafa Abduljabbar	0dd2b2f65e	Fix extra token typo (#1943 )	2025-09-26 11:18:43 -04:00
Mustafa Abduljabbar	7a329bbd94	Expose symbols for RCCL algo/proto/channels selection functions (#1923 ) * Unhide symbols for algo/proto functions * Add all_gather direct usage detection	2025-09-25 18:58:30 -04:00
Larry Meadows	cb14fccdcc	- LL Protocol: Add missing fences for gfx950, this fixes the hang issue (#1932 ) - Remove asm flat_store_dwordx4, not needed	2025-09-25 14:07:07 -07:00
Sai Enduri	01d16d4139	Enable multi node rccl tests on MI350x slurm cluster. (#1900 ) * Add tests on slurm cluster * Integrate slurm. * Add flags. * Added dynamic selection of runners for tests and cleanup for slurm reservation * Revert "Added dynamic selection of runners for tests and cleanup for slurm reservation" This reverts commit d5350ff6e4f563ddd56ad81e4bc2a393ed55ba00. * Refactor so tests run on both architectures. * continue on error * fail fast false on matrix * remove scancel * skip all single node tests * fix pattern matching for pytest * switch to always skip github job * Update to latest allocation. * Clean up workflows and update docker image. * Updated container image published from PR #1517 * Switch back to TheRock main branch sha. --------- Co-authored-by: arravikum <arravikum@amd.com>	2025-09-23 22:00:26 -07:00
corey-derochie-amd	d86cf78810	Moved new functions to the bottom of the function table to maintain backward compatibility (#1931 ) * Moved new functions to the bottom of the function table to maintain backward compatibility * Added ordering fixes to api_trace.cc	2025-09-23 13:30:27 -06:00
alex-breslow-amd	8d6e21285c	Implement disassembling library into assembly with source code (#1714 ) - Add --dump-asm to install.sh dump assembly from RCCL library	2025-09-23 10:11:32 -07:00
Mustafa Abduljabbar	c1e1f2faeb	Use batched P2P to enhance alltoall small message performance (#1902 ) * Batch P2P operations (2 per CU/channel) and update channel-part mapping - Revert bitreversal and fix channel mapping to be compatible with P2P batching and avoid hangs - P2P batching is only used for more than 2 nodes to avoid aggregating intra-node traffic when it is dominant for less than 2 nodes * Address single node regression and channel per net peer * Add batching threshold * Add enable switch for batching * Update CHANGELOG.md * Add minor comment change * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-09-22 16:25:10 -04:00
Tim	ba44f170ad	Update RCCL Replayer README.md (#1870 ) * Update Replayer README.md	2025-09-19 17:57:48 -04:00
corey-derochie-amd	9b04b2a42f	Added an implementation of `ncclSymGetKernelPtr` for when `GENERATE_SYM_KERNELS` is not defined, as it is normally generated code. (#1925 )	2025-09-19 07:52:33 -06:00
corey-derochie-amd	ed095cad35	Moved latency_profiler license into subdirs and updated NOTICES. (#1918 )	2025-09-18 12:54:39 -06:00
Atul Kulkarni	9839d1c7c8	Updated tests based on NCCL 2.27.3-1 sync (#1892 )	2025-09-18 09:56:09 -05:00
Venkateshwar Reddy Kandula	0cc896910e	Update RCCL_API_TRACE_VERSION_PATCH to 2 due to NCCL API sync (#1916 )	2025-09-18 07:36:50 -06:00
Surya Periaswamy	389f794d9a	Add speriaswamy-amd to CODEOWNERS (#1921 )	2025-09-18 07:15:21 -05:00
Nilesh M Negi	da06c69cb8	[INIT] Use rocm-smi API instead of CLI for querying FW version (#1920 )	2025-09-17 19:17:19 -05:00
nawrinsu	0b03bb718a	Add nawrinsu to CODEOWNERS (#1917 )	2025-09-16 23:40:51 -05:00
Laura Promberger	0f6fec1553	Bump minimum cmake version to 3.16 to enable cmake 4 (#1909 ) Minimum required cmake version of test/CMakeList.txt is bumped from 2.8 to 3.16. This aligns with the version used in CMakeList.txt and will enable building with cmake 4.	2025-09-16 23:10:22 -05:00
Weile	f64b1f409f	add weilewei to CODEOWNERS (#1915 )	2025-09-16 10:14:18 -07:00
Karthik Ganesan	740dfd1efd	Update prims_simple.h to keep header file access to rccl_metadata.h uniform (#1906 ) Header files in device/ folder are directly referenced in the code base except here.	2025-09-16 08:58:50 -05:00
Kapil S. Pawar	86a6d06e40	Added new tests for rccl_wrap - rcclOverrideProtocol, rcclOverrideAlgorithm (#1895 ) * Added new unit tests for rccl_wrap	2025-09-15 18:00:26 -05:00
Bertan Dogancay	93d86dd8e3	[BUILD] Stop generating sym kernels by default (#1907 ) * Stop generating sym kernels by default	2025-09-15 12:19:35 -04:00
ycui1984	da8abb2651	[MIT] Add MIT license file (#1908 )	2025-09-12 13:37:44 -05:00
Arm Patinyasakdikul	f21fbdfc18	Fix issue where staging/mainline build commit hash doesn't match the actual RCCL commit. (#1910 )	2025-09-11 16:13:21 -05:00
mberenjk	ada4e12360	disabling msccl for fp8 datatype (#1888 ) * disabling msccl for fp8 datatype --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-09-11 13:09:34 -05:00
Wenkai Du	de9ebd8a8b	Treat PIX and PXB as same GDR distance (#1894 )	2025-09-11 10:44:10 -05:00
isaki001	9c36439354	add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889 )	2025-09-10 14:25:23 -05:00
Wenkai Du	c2bccf9156	Enable LL128 and use same tuning table for gfx942 4 NICs (#1898 )	2025-09-10 11:11:15 -04:00
Kapil S. Pawar	f418a4c6d0	Added new tests for rccl_wrap - rcclSetPipelining (#1890 ) * Added tests for rcclSetPipelining * Added conditions to skip the test * Updated message size	2025-09-05 09:29:11 -05:00
Mustafa Abduljabbar	6e45eaf75e	Use add_unroll.sh in topo_expl makefile (#1886 )	2025-09-03 09:43:16 -04:00
Mustafa Abduljabbar	7ccc6f268f	Force enable proto and/or algo after model selection (#1799 ) * Force enable proto or algo * Remove inc nccl_common.h * Move logic and add error checks * Fix topo_expl compatibility * Allow algo/proto overrides * Remove extra function decl * Clarify warning message * Move algo/proto overrides into separate functions * Update CHANGELOG.md	2025-09-03 08:54:13 -04:00
ycui1984	361d596229	[rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm>=6.4.0 (#1867 ) * [rocm_regression] Return errors when HSA_NO_SCRATCH_RECLAIM=1 even for rocm >= 6.4.0 * [rocm_regression] Check firmware version * [rocm_regression] Resolve review comments * [rocm_regression] Move hsa env checking into init once func * [rocm_regression] Prevent hot fix version in firmware * [rocm_regression] Improve unit tests	2025-08-29 11:18:23 -05:00
Bertan Dogancay	9afc15625f	Merge pull request #1880 from rahulvaidya20/2.27.3-1 [SYNC] 2.27.3-1	2025-08-29 12:10:12 -04:00
BertanDogancay	08a7be231b	Merge remote-tracking branch 'nccl/master' into develop	2025-08-28 15:46:28 -05:00

1939 Commits