Avinash
c81ea25407
collective trace improvements for debugging ( #1661 )
...
[ROCm/rccl commit: c54a0c085a ]
2025-05-07 13:37:31 -05:00
Bertan Dogancay
c75ebd9147
Merge pull request #1662 from BertanDogancay/2.25
...
[SYNC] 2.25.1-1
[ROCm/rccl commit: 590ad6acc2 ]
2025-05-06 09:39:09 -04:00
Mustafa Abduljabbar
750bd73047
Add missing MACRO to topo_expl ( #1677 )
...
* Fix header compatibility
[ROCm/rccl commit: fdad89690b ]
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar
ab4a3eb0c1
Fix topo explorer's compatibility with NCCL 2.24 ( #1671 )
...
* Fix build issues
* Fix failure to find path remote rank
[ROCm/rccl commit: f3f3336468 ]
2025-05-05 15:26:29 -04:00
Siu Chi Chan
be0761502d
rccl-UnitTests - link to dl library ( #1673 )
...
[ROCm/rccl commit: 9525c5b2ef ]
2025-05-02 21:20:22 -05:00
Bertan Dogancay
b435c75068
[Graph] Try using P2P by default ( #1670 )
...
[ROCm/rccl commit: acfac55516 ]
2025-05-02 11:54:30 -04:00
Nilesh M Negi
a6972c0d09
Revert "[SRC] Enable unroll=1 for gfx950 ( #1602 )" ( #1667 )
...
* Revert "[SRC] Enable unroll=1 for gfx950 (#1602)"
This reverts commit 210f90ae0f.
* Update Changelog
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 329e13efff ]
2025-04-30 23:33:08 -05:00
deeksha-amd
5580cb7574
Added new tests for improving the code coverage ( #1656 )
...
Signed-off-by: Deeksha Goplani <deeksha.goplani@amd.com >
[ROCm/rccl commit: 2486838465 ]
2025-04-30 18:01:11 -05:00
isaki001
de76d7f649
Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 ( #1607 )
...
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024
* add compilation flag for enabling/disabling clipping in mscclpp
* change flag name for consistency, set flag to OFF
* add compilation flag in rccl for enabling clipping in mscclpp
* set 1024 threads for mscclpp allreduce8 only for bfloat16
* fix improper description for ENABLE_MSCCLPP_CLIP flag
* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"
This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.
* update clip remove-clip.patch for rebase
[ROCm/rccl commit: 8145c4f3b8 ]
2025-04-30 16:42:28 -05:00
BertanDogancay
064062ef70
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: cb6e23ae67 ]
2025-04-30 13:31:41 -05:00
Tim
33c390b36e
minor fix for empty scope (group) ( #1666 )
...
[ROCm/rccl commit: dc0c5f9153 ]
2025-04-30 13:29:13 -04:00
Richard Barnes
f2d30a163b
Enable -Wall ( #1644 )
...
[ROCm/rccl commit: 7961624167 ]
2025-04-24 10:45:46 -07:00
Bertan Dogancay
eb50c947eb
Merge pull request #1645 from corey-derochie-amd/nccl-2.24
...
NCCL Sync 2.24.3-1
[ROCm/rccl commit: f8067a76dc ]
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar
a85cfaa680
[AllGather MSCCL] Multinode and single node support up to certain send count ( #1650 )
...
* Add multinode and singlenode allgather XML
[ROCm/rccl commit: aa7991dfc8 ]
2025-04-24 09:02:03 -04:00
BertanDogancay
d045d0ca23
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: a6bf9bfc9e ]
2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar
07620c7efd
Expose production tuning table in topo_explorer using internal RCCL/NCCL logic ( #1628 )
...
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
[ROCm/rccl commit: 82afb2bcfe ]
2025-04-23 15:44:56 -04:00
Tim
38f91fa2c8
reverting change to RcclReplayer ( #1657 )
...
[ROCm/rccl commit: 45e1c3f3e2 ]
2025-04-23 15:36:46 -04:00
Jeffrey Novotny
fb1fdef8e2
Fix broken link to RCCL Replayer GitHub info ( #1655 )
...
[ROCm/rccl commit: df778b4ea1 ]
2025-04-23 14:17:31 -04:00
gilbertlee-amd
8023be9355
Adding UT_DEBUG_PAUSE to unit tests ( #1653 )
...
[ROCm/rccl commit: ee85a70bb4 ]
2025-04-21 21:15:07 -06:00
Bertan Dogancay
aac829125c
Fix NPKit for SendRecv ( #1651 )
...
[ROCm/rccl commit: ac8ec4c08c ]
2025-04-21 12:34:47 -04:00
Tim
58ee618194
RCCL Replayer update ( #1603 )
...
RCCL recorder with the suggested change and UT
[ROCm/rccl commit: 9a55ff60a9 ]
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar
de2b66921a
Address nested designator compiler warning issue ( #1633 )
...
[ROCm/rccl commit: 52bfdf05dc ]
2025-04-18 17:09:50 -04:00
Nusrat Islam
691e98940c
Fix MSCCLPP accuracy issue for allreduce7 ( #1634 )
...
* ext-src: fix a graph-mode bug in allreduce7
* change MSCCLPP threshold to 16MB
* ext-src: change message size threshold for allreduce7
* ext-src: address review comments
[ROCm/rccl commit: f20c33effd ]
2025-04-18 08:54:32 -05:00
AbandiGa
acf0bc1c6e
added copyright ( #1635 )
...
[ROCm/rccl commit: 7a84c5dbb0 ]
2025-04-14 09:46:18 -05:00
Nilesh M Negi
708c053b21
Update Dockerfile to use CMake-based build ( #1630 )
...
* [DOCKER] Update Dockerfile to switch to CMake build
* Fix typo in Dockerfile.ubuntu
* Add README to docker sub-dir
* Update Dockerfile and README
* Modify markdown headings in docker/README
* Update docs
* Fix typo in docs
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docker/README
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
[ROCm/rccl commit: bd1a5b38b6 ]
2025-04-10 11:40:10 -05:00
Dingming Wu
63c6180130
Adding #include <dlfcn.h> in profiler.cc to pass build ( #1632 )
...
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```
[ROCm/rccl commit: 1786c0268b ]
2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul
f29d59aa00
Add device synchronization before destroying proxy thread. ( #1631 )
...
This commit ensures that the GPU finishes all kernels before destroying the
communicator thread.
[ROCm/rccl commit: 52654e2301 ]
2025-04-10 10:44:16 -05:00
Pedram Alizadeh
93ac2ea61e
all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 ( #1627 )
...
* Enabling LL128 by default on MI300
* Add missing CUDACHECK
* Adjust BW correction factors to fix the Tree->Ring switching point
* Refactor and add ll128 AR logarithmic factor to tuning models
* Move RCCL tuning changes to a separate file
* Use enum for tunable indexing
* Use explicit indexing in tuning models to avoid mismatch issues
* Place rcclGetSizePerRank in a function
* Remove HIP ifdef for rccl-only call
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com >
[ROCm/rccl commit: e40ff4f84a ]
2025-04-10 11:43:54 -04:00
Pedram Alizadeh
b225281747
single-node AR msccl algorithm tuning for MI300 ( #1629 )
...
[ROCm/rccl commit: 5b36b68d06 ]
2025-04-10 10:42:28 -04:00
dependabot[bot]
464d69963d
Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx ( #1625 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-version: 1.18.2
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b6d97a6176 ]
2025-04-09 17:00:31 -06:00
Nikhil-Nunna
ee9da06c80
Fetching RCCL version from shared object file. ( #1569 )
...
* added rccl version using rccl-tests
* Added function to get rccl version from rccl-tests
* removed whitespace
* Added rccl version
* Updated readme and fixed formatting
* removed debug prints
[ROCm/rccl commit: 3dc0478722 ]
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar
2f4cd5718e
Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models ( #1618 )
...
* Enable LL/LL128 cutoff points in tuning models
* Initializing ll/ll128 model cutoffs for MI300
* Use RCCL_LL_LIMITS_UNDEFINED
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com >
[ROCm/rccl commit: 4be06f04d8 ]
2025-04-02 16:26:23 -04:00
Mustafa Abduljabbar
0a81478bd9
Fix topo explorer's nccl 2.23 compatibility ( #1623 )
...
* Fix compiler issues due to broken compatibility
* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm
[ROCm/rccl commit: aace4e27f8 ]
2025-04-02 09:47:29 -04:00
dependabot[bot]
25dafc0c82
Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx ( #1599 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: ffe255d285 ]
2025-04-01 17:08:53 -06:00
Nusrat Islam
f599690ce3
ext-src: fix mscclpp correctness issue ( #1615 )
...
* ext-src: fix mscclpp correctness issue
* ext-src: remove white-space warnings
[ROCm/rccl commit: 4a29bba3c6 ]
2025-04-01 15:02:16 -05:00
Istvan Kiss
858fa4e65d
Add documentation for NPS4 and CPX partition modes ( #1555 )
...
[ROCm/rccl commit: 28ab8603d2 ]
2025-03-31 09:25:25 -06:00
Nilesh M Negi
1a2eca1756
Revert "[GRAPH] Increase default nChannels to 112 for gfx950 ( #1596 )" ( #1620 )
...
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596)"
This reverts commit cf17cff5b6.
* [DOC] Update Changelog
* [DOC] Update CHANGELOG
[ROCm/rccl commit: b17338d164 ]
2025-03-28 17:57:06 -05:00
Bertan Dogancay
b737d8c222
Merge pull request #1559 from BertanDogancay/2.23
...
[SYNC] 2.23.4-1
[ROCm/rccl commit: 532f54c244 ]
2025-03-28 17:06:56 -04:00
Nilesh M Negi
a7ec191754
[TEST] Switch to googletest release 1.12.0 ( #1621 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 0e2c461c6c ]
2025-03-28 12:39:42 -05:00
Nilesh M Negi
210f90ae0f
[SRC] Enable unroll=1 for gfx950 ( #1602 )
...
* [SRC] Enable unroll=1 for gfx950
* Fix typo from rebase in generate.py
* Support for unroll=1 and gfx90a when building for all GPU targets
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 307bc10781 ]
2025-03-27 18:21:35 -05:00
BertanDogancay
8ed27fde74
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 0b2062c560 ]
2025-03-27 12:53:04 -05:00
isaki001
f0c853438c
Disable mscclpp ( #1614 )
...
* disable mscclpp by default
[ROCm/rccl commit: 9dc23d9265 ]
2025-03-25 15:21:16 -05:00
Nilesh M Negi
290ca7deca
[CI] Fix warnings from pytest in Azure CI ( #1617 )
...
* [CI] Fix warnings from pytest
* [CI] Append to LD_LIBRARY_PATH instead of overwrite
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 565f0ade60 ]
2025-03-22 23:22:53 -05:00
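
The "append instead of overwrite" change above can be illustrated with a minimal Python sketch. The helper name `append_to_path_var` and the example directories are hypothetical and are not taken from the RCCL CI scripts; the sketch only shows the general idea of extending a path-style variable while keeping the entries that were already there.

```python
import os


def append_to_path_var(env, name, new_dir):
    """Return a copy of env with new_dir appended to the path-style variable name.

    Existing entries are preserved; new_dir is added only if it is not
    already present. The input mapping is left unmodified.
    """
    updated = dict(env)
    current = updated.get(name, "")
    # Drop empty segments so an unset or empty variable does not yield a leading separator.
    parts = [p for p in current.split(os.pathsep) if p]
    if new_dir not in parts:
        parts.append(new_dir)
    updated[name] = os.pathsep.join(parts)
    return updated


# Hypothetical usage: extend the current environment before launching a test process.
child_env = append_to_path_var(os.environ, "LD_LIBRARY_PATH", "/path/to/build")
```

Overwriting the variable would discard any library directories configured earlier in the job, which is the failure mode the appended form avoids.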
Nilesh M Negi
8cfbc0fbd1
[UT] Increase stack size for StandaloneTests to 480 ( #1616 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: d6b987a53f ]
2025-03-21 21:33:32 -05:00
dependabot[bot]
68672b9b3c
Bump jinja2 from 3.1.5 to 3.1.6 in /docs/sphinx ( #1591 )
...
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6)
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: fe3f3aae17 ]
2025-03-21 17:17:32 -06:00
dependabot[bot]
83c59c859a
Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx ( #1543 )
...
Bumps [cryptography](https://github.com/pyca/cryptography) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst)
- [Commits](https://github.com/pyca/cryptography/compare/43.0.1...44.0.1)
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: dd61df75ab ]
2025-03-21 17:17:19 -06:00
gilbertlee-amd
4f67522420
Removing the experimental clique kernel files ( #1610 )
...
[ROCm/rccl commit: 626dc50ab5 ]
2025-03-20 18:10:01 -06:00
Wenkai Du
e86b217182
Add fault injection of starting warps with random variations ( #1593 )
...
* Add fault injection of starting warps with random variations
This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in CMake.
* Remove manually introduced bug for demo purpose
* Use only one thread per warp for checking wall clock
[ROCm/rccl commit: 90ad586d94 ]
2025-03-20 16:11:43 -07:00
gilbertlee-amd
12c1fe8fdf
Pseudo-randomly adding zero-byte sends in AllToAllv unit test ( #1597 )
...
[ROCm/rccl commit: 9a4e49ff1a ]
2025-03-20 17:00:48 -06:00
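
As a rough illustration of the idea in the AllToAllv commit above, the following Python sketch builds a send-count matrix in which some entries are pseudo-randomly set to zero. It is not the RCCL unit test (which is not shown in this log); the function names, the `zero_prob` parameter, and the seed handling are all assumptions made for the example.

```python
import random


def make_send_counts(num_ranks, base_count, zero_prob, seed):
    """Build a num_ranks x num_ranks matrix of send counts.

    Entry [src][dst] is either base_count or 0. Zeros are chosen
    pseudo-randomly, so a fixed seed always reproduces the same pattern.
    """
    rng = random.Random(seed)  # private generator; global random state is untouched
    return [
        [0 if rng.random() < zero_prob else base_count for _dst in range(num_ranks)]
        for _src in range(num_ranks)
    ]


def recv_counts_from(send_counts):
    """Receive counts are the transpose: rank r receives send_counts[src][r] from each src."""
    return [list(col) for col in zip(*send_counts)]
```

Seeding the generator keeps the zero-byte pattern reproducible across runs, which matters when a failure has to be replayed.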
amd-jmacaran
72454ece3e
[Azure CI] Minor environment setup fixes
...
- Add extra deletion to ensure source workspace is clean for the job.
- pytest expects function names to start with test_
[ROCm/rccl commit: 805448261d ]
2025-03-20 15:36:10 -04:00
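
The naming point in the Azure CI commit above refers to pytest's default collection rules, which pick up functions whose names begin with `test`. A minimal sketch follows; both function names are hypothetical and do not come from the RCCL CI scripts.

```python
def check_workspace_is_clean(entries):
    """Helper: returns True when no leftover entries remain.

    pytest does not collect this function by default because its name
    does not start with "test".
    """
    return len(entries) == 0


def test_workspace_is_clean():
    """Collected by pytest under default settings because of the test_ prefix."""
    assert check_workspace_is_clean([])
```

A check written under a name like the helper's would be silently skipped during collection, so renaming it with the `test_` prefix is what makes it run.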