Tim
38f91fa2c8
reverting change to RcclReplayer ( #1657 )
...
[ROCm/rccl commit: 45e1c3f3e2 ]
2025-04-23 15:36:46 -04:00
Jeffrey Novotny
fb1fdef8e2
Fix broken link to RCCL Replayer GitHub info ( #1655 )
...
[ROCm/rccl commit: df778b4ea1 ]
2025-04-23 14:17:31 -04:00
gilbertlee-amd
8023be9355
Adding UT_DEBUG_PAUSE to unit tests ( #1653 )
...
[ROCm/rccl commit: ee85a70bb4 ]
2025-04-21 21:15:07 -06:00
Bertan Dogancay
aac829125c
Fix NPKit for SendRecv ( #1651 )
...
[ROCm/rccl commit: ac8ec4c08c ]
2025-04-21 12:34:47 -04:00
Tim
58ee618194
RCCL Replayer update ( #1603 )
...
RCCL recorder w/ suggested change and UT
[ROCm/rccl commit: 9a55ff60a9 ]
2025-04-19 00:21:27 -04:00
Mustafa Abduljabbar
de2b66921a
Address nested designator compiler warning issue ( #1633 )
...
[ROCm/rccl commit: 52bfdf05dc ]
2025-04-18 17:09:50 -04:00
Nusrat Islam
691e98940c
Fix MSCCLPP accuracy issue for allreduce7 ( #1634 )
...
* ext-src: fix a graph-mode bug in allreduce7
* change MSCCLPP threshold to 16MB
* ext-src: change message size threshold for allreduce7
* ext-src: address review comments
[ROCm/rccl commit: f20c33effd ]
2025-04-18 08:54:32 -05:00
AbandiGa
acf0bc1c6e
added copyright ( #1635 )
...
[ROCm/rccl commit: 7a84c5dbb0 ]
2025-04-14 09:46:18 -05:00
Nilesh M Negi
708c053b21
Update Dockerfile to use CMake-based build ( #1630 )
...
* [DOCKER] Update Dockerfile to switch to CMake build
* Fix typo in Dockerfile.ubuntu
* Add README to docker sub-dir
* Update Dockerfile and README
* Modify markdown headings in docker/README
* Update docs
* Fix typo in docs
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docs/install/docker-install.rst
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
* Update docker/README
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
[ROCm/rccl commit: bd1a5b38b6 ]
2025-04-10 11:40:10 -05:00
Dingming Wu
63c6180130
Adding #include <dlfcn.h> in profiler.cc to pass build ( #1632 )
...
Without the header, the build fails with the following errors:
```
error: use of undeclared identifier 'RTLD_NOW'
error: use of undeclared identifier 'RTLD_LOCAL'
error: use of undeclared identifier 'dlerror'
error: use of undeclared identifier 'dlsym'
error: use of undeclared identifier 'dlclose'
```
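For context, every identifier in the errors above is declared by the POSIX header `<dlfcn.h>`. The following is a minimal sketch of that API in use, not the code from profiler.cc: the helper name `hasSymbol` and its parameters are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <dlfcn.h>  // declares dlopen, dlsym, dlclose, dlerror, RTLD_NOW, RTLD_LOCAL

// Hypothetical helper (not taken from profiler.cc): reports whether
// `symbolName` can be resolved through `libraryPath`. Passing nullptr as
// the path asks dlopen for a handle to the running program itself.
bool hasSymbol(const char* libraryPath, const char* symbolName) {
  assert(symbolName != nullptr);

  // RTLD_NOW resolves all symbols immediately; RTLD_LOCAL keeps them out of
  // the global namespace used to resolve later-loaded libraries.
  void* handle = dlopen(libraryPath, RTLD_NOW | RTLD_LOCAL);
  if (handle == nullptr) {
    return false;  // dlerror() would describe why the load failed
  }

  dlerror();  // clear any stale error state before the lookup
  (void)dlsym(handle, symbolName);
  const bool found = (dlerror() == nullptr);  // no error means the lookup succeeded

  dlclose(handle);  // release the reference taken by dlopen
  return found;
}
```

Calling `hasSymbol(nullptr, "malloc")` checks the symbols visible to the current process. On older glibc versions the program may also need to link against `libdl`.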
[ROCm/rccl commit: 1786c0268b ]
2025-04-10 08:48:18 -07:00
Arm Patinyasakdikul
f29d59aa00
Add device synchronization before destroying proxy thread. ( #1631 )
...
This commit ensures that the GPU finishes all kernels before the
communicator thread is destroyed.
[ROCm/rccl commit: 52654e2301 ]
2025-04-10 10:44:16 -05:00
Pedram Alizadeh
93ac2ea61e
all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 ( #1627 )
...
* Enabling LL128 by default on MI300
* Add missing CUDACHECK
* Adjust BW correction factors to fix the Tree->Ring switching point
* Refactor and add ll128 AR logarithmic factor to tuning models
* Move RCCL tuning changes to a separate file
* Use enum for tunable indexing
* Use explicit indexing in tuning models to avoid mismatch issues
* Place rcclGetSizePerRank in a function
* Remove HIP ifdef for rccl-only call
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com >
[ROCm/rccl commit: e40ff4f84a ]
2025-04-10 11:43:54 -04:00
Pedram Alizadeh
b225281747
single-node AR msccl algorithm tuning for MI300 ( #1629 )
...
[ROCm/rccl commit: 5b36b68d06 ]
2025-04-10 10:42:28 -04:00
dependabot[bot]
464d69963d
Bump rocm-docs-core from 1.18.1 to 1.18.2 in /docs/sphinx ( #1625 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.18.1 to 1.18.2.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.1...v1.18.2 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-version: 1.18.2
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b6d97a6176 ]
2025-04-09 17:00:31 -06:00
Nikhil-Nunna
ee9da06c80
Fetching RCCL version from shared object file. ( #1569 )
...
* added rccl version using rccl-tests
* Added function to get rccl version from rccl-tests
* removed whitespace
* Added rccl version
* Updated readme and fixed formatting
* removed debug prints
[ROCm/rccl commit: 3dc0478722 ]
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar
2f4cd5718e
Add AllGather LL128 multi-node tuning and include LL cutoff points in tuning models ( #1618 )
...
* Enable LL/LL128 cutoff points in tuning models
* Initializing ll/ll128 model cutoffs for MI300
* Use RCCL_LL_LIMITS_UNDEFINED
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com >
[ROCm/rccl commit: 4be06f04d8 ]
2025-04-02 16:26:23 -04:00
Mustafa Abduljabbar
0a81478bd9
Fix topo explorer's nccl 2.23 compatibility ( #1623 )
...
* Fix compiler issues due to broken compatibility
* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm
[ROCm/rccl commit: aace4e27f8 ]
2025-04-02 09:47:29 -04:00
dependabot[bot]
25dafc0c82
Bump rocm-docs-core from 1.17.0 to 1.18.1 in /docs/sphinx ( #1599 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.17.0 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.18.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: ffe255d285 ]
2025-04-01 17:08:53 -06:00
Nusrat Islam
f599690ce3
ext-src: fix mscclpp correctness issue ( #1615 )
...
* ext-src: fix mscclpp correctness issue
* ext-src: remove white-space warnings
[ROCm/rccl commit: 4a29bba3c6 ]
2025-04-01 15:02:16 -05:00
Istvan Kiss
858fa4e65d
Add documentation for NPS4 and CPX partition modes ( #1555 )
...
[ROCm/rccl commit: 28ab8603d2 ]
2025-03-31 09:25:25 -06:00
Nilesh M Negi
1a2eca1756
Revert "[GRAPH] Increase default nChannels to 112 for gfx950 ( #1596 )" ( #1620 )
...
* Revert "[GRAPH] Increase default nChannels to 112 for gfx950 (#1596 )"
This reverts commit cf17cff5b6 .
* [DOC] Update Changelog
* [DOC] Update CHANGELOG
[ROCm/rccl commit: b17338d164 ]
2025-03-28 17:57:06 -05:00
Bertan Dogancay
b737d8c222
Merge pull request #1559 from BertanDogancay/2.23
...
[SYNC] 2.23.4-1
[ROCm/rccl commit: 532f54c244 ]
2025-03-28 17:06:56 -04:00
Nilesh M Negi
a7ec191754
[TEST] Switch to googletest release 1.12.0 ( #1621 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 0e2c461c6c ]
2025-03-28 12:39:42 -05:00
Nilesh M Negi
210f90ae0f
[SRC] Enable unroll=1 for gfx950 ( #1602 )
...
* [SRC] Enable unroll=1 for gfx950
* Fix typo from rebase in generate.py
* Support for unroll=1 and gfx90a when building for all GPU targets
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 307bc10781 ]
2025-03-27 18:21:35 -05:00
BertanDogancay
8ed27fde74
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 0b2062c560 ]
2025-03-27 12:53:04 -05:00
isaki001
f0c853438c
Disable mscclpp ( #1614 )
...
* disable mscclpp by default
[ROCm/rccl commit: 9dc23d9265 ]
2025-03-25 15:21:16 -05:00
Nilesh M Negi
290ca7deca
[CI] Fix warnings from pytest in Azure CI ( #1617 )
...
* [CI] Fix warnings from pytest
* [CI] Append to LD_LIBRARY_PATH instead of overwrite
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 565f0ade60 ]
2025-03-22 23:22:53 -05:00
Nilesh M Negi
8cfbc0fbd1
[UT] Increase stack size for StandaloneTests to 480 ( #1616 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: d6b987a53f ]
2025-03-21 21:33:32 -05:00
dependabot[bot]
68672b9b3c
Bump jinja2 from 3.1.5 to 3.1.6 in /docs/sphinx ( #1591 )
...
Bumps [jinja2](https://github.com/pallets/jinja ) from 3.1.5 to 3.1.6.
- [Release notes](https://github.com/pallets/jinja/releases )
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst )
- [Commits](https://github.com/pallets/jinja/compare/3.1.5...3.1.6 )
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: fe3f3aae17 ]
2025-03-21 17:17:32 -06:00
dependabot[bot]
83c59c859a
Bump cryptography from 43.0.1 to 44.0.1 in /docs/sphinx ( #1543 )
...
Bumps [cryptography](https://github.com/pyca/cryptography ) from 43.0.1 to 44.0.1.
- [Changelog](https://github.com/pyca/cryptography/blob/main/CHANGELOG.rst )
- [Commits](https://github.com/pyca/cryptography/compare/43.0.1...44.0.1 )
---
updated-dependencies:
- dependency-name: cryptography
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: dd61df75ab ]
2025-03-21 17:17:19 -06:00
gilbertlee-amd
4f67522420
Removing the experimental clique kernel files ( #1610 )
...
[ROCm/rccl commit: 626dc50ab5 ]
2025-03-20 18:10:01 -06:00
Wenkai Du
e86b217182
Add fault injection of starting warps with random variations ( #1593 )
...
* Add fault injection of starting warps with random variations
This is done by inserting random delays after __syncthreads().
The feature can be turned off by setting FAULT_INJECTION=OFF in CMake.
* Remove manually introduced bug for demo purpose
* Use only one thread per warp for checking wall clock
[ROCm/rccl commit: 90ad586d94 ]
2025-03-20 16:11:43 -07:00
gilbertlee-amd
12c1fe8fdf
Pseudo-randomly adding zero-byte sends in AllToAllv unit test ( #1597 )
...
[ROCm/rccl commit: 9a4e49ff1a ]
2025-03-20 17:00:48 -06:00
amd-jmacaran
72454ece3e
[Azure CI] Minor environment setup fixes
...
- Add extra deletion to ensure source workspace is clean for the job.
- pytest expects function names to start with test_
[ROCm/rccl commit: 805448261d ]
2025-03-20 15:36:10 -04:00
corey-derochie-amd
e95578ef4c
removed gfx940 and gfx941 ( #1606 )
...
* removed gfx940 and gfx941
* removed gfx940 and gfx941
* Update "gfx94" to "gfx942" in init.cc
* Updated remaining "gfx94" updates to "gfx942"
* Update filenames and variables from gfx940 to gfx942
---------
Co-authored-by: akolliasAMD <akollias@amd.com >
[ROCm/rccl commit: 6505639cf4 ]
2025-03-20 09:34:53 -06:00
Joseph Macaranas
a95b2c9fb7
Introduce framework to run Azure Pipeline jobs ( #1608 )
...
- PR runs targeting develop will require approval from at least one person in a predefined list.
- Nightly runs are separated and will not require approval.
- Sample pytest script is provided for expanding test coverage.
[ROCm/rccl commit: 6403afff4a ]
2025-03-18 17:29:44 -04:00
Wenkai Du
c6f4c8d17a
GDRCOPY support: Off by default ( #1605 )
...
[ROCm/rccl commit: bd0092e8f1 ]
2025-03-18 08:17:01 -07:00
Avinash
5a25b110af
Memory leak fix when numIBDevices = 0 ( #1429 )
...
* Initial commit for testing
* Fix memory leak in checkOptions
* Fix memory leak in checkOption
* x
* Delete cmake-3.28.2-linux-x86_64.sh
* gcn changes
* gcn memleak fixes
* gcn leak fix
* memory leak fixes for parseRome4P2H and ncclTopoAddGPU
* Keeping only necessary file for fixes
Deleting temporary scripts I created for debugging and testing
* changing to GCN_ARCH_NAME_LEN
* Added sanity check directory
* refactoring scripts
* Updated to sanity checks folder
* Initial fixes
* changes in tools
* pointing RCCL lib build to debug version
* Removed second pthread_detach
* Removing sanity checks
* Keeping only code changes
* addressing memory leaks in ncclIbinit
---------
Co-authored-by: Chao Chen <cchen104@amd.com >
[ROCm/rccl commit: ccb0820743 ]
2025-03-17 11:21:19 -05:00
mberenjk
a3a598efb3
Skipping AllReduce test on more than 8 ranks for FP8 type on Hyabusa ( #1598 )
...
* Skipping AllReduce FP8 test on 9 to 16 ranks (gfx90a) as it's using Tree algorithm not RING
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
[ROCm/rccl commit: 5f691aaf65 ]
2025-03-17 10:22:49 -05:00
Mustafa Abduljabbar
74d6537141
Multi-node reduce_scatter improved auto-selection for LL and LL128 on gfx942 ( #1604 )
...
* Add reduce_scatter LL and LL128 thresholds
* Always honor user choice for protocol
[ROCm/rccl commit: f67b2cc908 ]
2025-03-17 11:21:01 -04:00
Wenkai Du
0bd40f5a87
Enable LL128 on gfx942 ( #1549 )
...
[ROCm/rccl commit: 245c2de909 ]
2025-03-16 15:10:05 -07:00
Nilesh M Negi
cf17cff5b6
[GRAPH] Increase default nChannels to 112 for gfx950 ( #1596 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 1df73e209e ]
2025-03-14 14:47:03 -07:00
Wenkai Du
afd04a5117
Limit P2P channels per peer to not exceed max channels ( #1594 )
...
* Limit P2P channels per peer to not exceed max channels
* [UT] test single GPU cases for all collectives
* [UT] fix out of range root value
[ROCm/rccl commit: 4237caad69 ]
2025-03-11 09:32:09 -07:00
Tim
a79fa36b77
mscclpp compatibility check for ubr ( #1573 )
...
[ROCm/rccl commit: f6c6d451a9 ]
2025-03-09 22:10:47 -04:00
isaki001
14be1c9a7a
fix the size of the recv buffer in AllGather UBR test ( #1564 )
...
[ROCm/rccl commit: 59c55842f1 ]
2025-03-05 11:42:15 -06:00
Nusrat Islam
e7c90e0a46
misc/msccl: force use of mscclpp ( #1581 )
...
[ROCm/rccl commit: ac823818aa ]
2025-03-04 12:48:59 -06:00
Bertan Dogancay
d1247bbf2a
[Transport] Fix IntraNet ( #1582 )
...
[ROCm/rccl commit: d88cca3098 ]
2025-03-04 13:30:36 -05:00
Wenkai Du
086fa823db
NPKit: enable reduce scatter profiling ( #1580 )
...
[ROCm/rccl commit: f957c4fe22 ]
2025-03-04 10:03:56 -08:00
Nilesh M Negi
751370bb70
[BUILD] Enable multiple GPU targets in MSCCLPP ( #1574 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 063c6cfc11 ]
2025-03-01 22:28:42 -06:00
dependabot[bot]
977d04cb9a
Bump rocm-docs-core from 1.15.0 to 1.17.0 in /docs/sphinx ( #1558 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.15.0 to 1.17.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.15.0...v1.17.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: a6b2ca224e ]
2025-02-28 16:32:38 -07:00