Rahul Vaidya
f60367f1c3
Improved version reporting in NCCL_DEBUG=VERSION ( #1232 )
...
* Improved version reporting in NCCL_DEBUG=VERSION.
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Version reporting changes
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Versioning changes: Initialized char arrays to null and fixed typo.
---------
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
[ROCm/rccl commit: c755b9cf93 ]
2024-07-12 08:14:29 -05:00
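The commit above concerns what is printed when NCCL_DEBUG=VERSION is set. A minimal sketch of enabling that setting for a child process follows. The helper name `version_debug_env` is hypothetical, and no claim is made here about the exact text RCCL prints.

```python
import os


def version_debug_env(base_env=None):
    """Return a copy of an environment with NCCL_DEBUG set to VERSION.

    The helper name is hypothetical; only the NCCL_DEBUG=VERSION setting
    comes from the commit title.
    """
    # Copy so the caller's mapping (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "VERSION"
    return env


env = version_debug_env({"PATH": "/usr/bin"})
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching an application that uses RCCL.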
akolliasAMD
942555fd21
gfx12 initial enablement ( #1219 )
...
[ROCm/rccl commit: 63e4d76e23 ]
2024-07-10 13:32:09 -06:00
akolliasAMD
9644767ead
cleaned codeowners file ( #1247 )
...
[ROCm/rccl commit: 7e78641dc1 ]
2024-07-09 10:31:23 -06:00
dependabot[bot]
4a75b7efc4
Bump certifi from 2024.2.2 to 2024.7.4 in /docs/sphinx ( #1241 )
...
Bumps [certifi](https://github.com/certifi/python-certifi ) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04 )
---
updated-dependencies:
- dependency-name: certifi
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 71e0f551e7 ]
2024-07-08 11:49:35 -06:00
dependabot[bot]
e73afaecaa
Bump rocm-docs-core from 1.4.1 to 1.5.0 in /docs/sphinx ( #1240 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.1...v1.5.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 799a8b5e59 ]
2024-07-08 10:59:32 -06:00
corey-derochie-amd
37bf54b8f8
Enable multi-threading for MSCCL ( #1203 )
...
MSCCL can now run in a multi-threaded configuration. To exercise this in the unit tests, this change adds the ENABLE_OPENMP compile definition and the --openmp-test-enable flag to the unit test build script. To activate the mode, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Jenkins is now set to use this mode.
[ROCm/rccl commit: 0c36d571ea ]
2024-07-04 09:34:38 -06:00
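The activation step described in the commit above can be sketched as follows, assuming the unit tests were already built with the --openmp-test-enable flag. The helper name `multithreaded_ut_env` is hypothetical; only the two variable names and their values come from the commit message.

```python
import os


def multithreaded_ut_env(base_env=None):
    """Return a copy of an environment with the multi-threaded test mode on.

    UT_MULTITHREADED and UT_PROCESS_MASK are taken from the commit message;
    the helper itself is an illustrative assumption.
    """
    # Copy so the caller's mapping (or os.environ) is not modified.
    env = dict(os.environ if base_env is None else base_env)
    env["UT_MULTITHREADED"] = "1"
    env["UT_PROCESS_MASK"] = "1"
    return env


env = multithreaded_ut_env({"PATH": "/usr/bin"})
```

Environment values are strings, so the flags are stored as "1" rather than the integer 1. The mapping can then be handed to `subprocess.run(..., env=env)` when starting the unit test binary.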
Wenkai Du
b5bc883f61
Checking kernel header files only when missing sysfs entry ( #1239 )
...
[ROCm/rccl commit: 45f3fbc52f ]
2024-07-03 15:53:15 -07:00
dependabot[bot]
a4fba80661
Bump rocm-docs-core from 1.4.0 to 1.4.1 in /docs/sphinx ( #1233 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.0...v1.4.1 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: aeaaacad26 ]
2024-07-03 09:32:49 -06:00
Nilesh M Negi
8a7dd0e590
[GRAPH] Use channel shuffling only for IB systems ( #1228 )
...
* [GRAPH] Use channel shuffling only for IB systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
* [GRAPH] Define channels=48 for gfx94 RoCE systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
* [GRAPH] Increase channels for RoCE gfx94 systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 5be3b713ef ]
2024-07-02 12:20:40 -05:00
Wenkai Du
b3e9d2d61b
NPKit: separate time stamps for GPU access from different blocks ( #1229 )
...
Avoids races in GPU memory access.
[ROCm/rccl commit: 9d8f68b4ee ]
2024-06-28 08:00:22 -07:00
Nusrat Islam
c24b1ff14f
graph: fix minNchannels for multi-node overwrite ( #1230 )
...
[ROCm/rccl commit: b09ea29d66 ]
2024-06-26 16:56:10 -05:00
Wenkai Du
bb6fab3d8e
Revert "Changing channel stride for MI300X multinode ( #1196 )" ( #1224 )
...
This reverts the channel stride change in commit a009e43f3c.
[ROCm/rccl commit: ad31d93f3d ]
2024-06-25 14:03:30 -07:00
saurabhAMD
de7ea612d7
Unit Tests for testing channels ( #1222 )
...
[ROCm/rccl commit: e170f41ddd ]
2024-06-25 10:10:10 -05:00
Jack Taylor
ebcac26530
Add pytorch rccl/intra node all-reduce benchmark ( #1221 )
...
* Add gpt-fast pytorch all reduce benchmark script
* Update readme instructions
* Minor changes
[ROCm/rccl commit: 5f2b88bc28 ]
2024-06-25 08:04:38 -07:00
Nusrat Islam
3c7bbd3243
Merge pull request #1223 from nusislam/minNchannels-multinode
...
graph: fix minNchannels for multi-node
[ROCm/rccl commit: 9f2514e5c8 ]
2024-06-25 10:03:35 -05:00
Wenkai Du
2e6c26a36c
Fix DMABUF support ( #1218 )
...
* Fix DMABUF support
* Reduce log output by moving dmabuf allocation details to TRACE
* Enable peer memory GDR support if ib_umem_get_peer is in kernel
[ROCm/rccl commit: 5d7078e383 ]
2024-06-25 08:00:15 -07:00
Nusrat Islam
bf164da244
graph: fix minNchannels for multi-node
...
Multi-node RCCL was not setting the minNchannels value correctly. This
PR fixes the bug.
[ROCm/rccl commit: 05df0f8cea ]
2024-06-24 16:42:44 -05:00
dependabot[bot]
1fb4fd2e94
Bump urllib3 from 2.2.1 to 2.2.2 in /docs/sphinx ( #1215 )
...
Bumps [urllib3](https://github.com/urllib3/urllib3 ) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases )
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst )
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2 )
---
updated-dependencies:
- dependency-name: urllib3
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 1ddb02c010 ]
2024-06-18 14:52:24 -06:00
dependabot[bot]
ebbaccd32a
Bump rocm-docs-core from 1.1.3 to 1.4.0 in /docs/sphinx ( #1213 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.1.3 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.1.3...v1.4.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 53dcfcc5e0 ]
2024-06-17 09:32:27 -06:00
Sam Wu
17eb7a3c6b
Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core ( #1190 )
...
* Add doc team as owners of RTD config
* Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core
[ROCm/rccl commit: 9f01acc030 ]
2024-06-14 12:12:22 -06:00
saurabhAMD
09c4d50e50
Merge pull request #1211 from saurabhAMD/channel
...
enable UT to test with channels greater than 64
[ROCm/rccl commit: 959545dce2 ]
2024-06-13 14:38:38 -05:00
saurabhAMD
44064a612c
enable UT to test with channels greater than 64
...
[ROCm/rccl commit: 392a73fdef ]
2024-06-13 13:54:08 -05:00
Paul Emberson
f7fb3392fb
fix initOnceFunc setting incorrect result code ( #1205 )
...
Addresses DMA-BUF support check unexpectedly failing
[ROCm/rccl commit: 435756af02 ]
2024-06-07 16:47:19 -07:00
Nusrat Islam
7c029672a8
Merge pull request #1200 from nusislam/multi-node-256-fix
...
graph: fix multi-node channel count
[ROCm/rccl commit: 9660e2e2dc ]
2024-06-07 14:34:20 -05:00
Nilesh M Negi
9b88e59ea4
Fix min_nchannels bug for gfx94* nranks=4 ( #1202 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: d9661c17e6 ]
2024-06-07 14:31:28 -05:00
gilbertlee-amd
efe851cfff
Disabling NUMA matching for model 79 for some VM configs ( #1204 )
...
[ROCm/rccl commit: 9b94a1052f ]
2024-06-06 17:15:04 -06:00
Nusrat Islam
3cbf67715f
graph: restrict maxChannels to 64 for multi-node and RCCL_ENABLE_INTRANET=1
...
[ROCm/rccl commit: 526cce9bf4 ]
2024-06-06 10:58:41 -05:00
Nusrat Islam
bf35178250
graph: fix multi-node minChannel count
...
[ROCm/rccl commit: 6ab20a7c6b ]
2024-06-06 10:56:39 -05:00
Wenkai Du
470302a776
Allow multiple parameters during selective function generation ( #1201 )
...
* Allow multiple parameters during selective function generation
* Remove debug print
* Add examples into Generator.cmake
[ROCm/rccl commit: 9fcd7b55e1 ]
2024-06-06 07:07:24 -07:00
Nusrat Islam
99fca5fa3a
Merge pull request #1184 from nusislam/256-channel-2
...
add 256 channels support
[ROCm/rccl commit: 955347bab4 ]
2024-06-04 08:25:34 -05:00
Nusrat Islam
62cbb3bcdd
set MIN_NCHANNEL limit to 64 for multi-node
...
[ROCm/rccl commit: 9746d8ca3f ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
b34fd115a1
doubling debug buffer size with increased channels
...
[ROCm/rccl commit: 0634c5c8e1 ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
48821ad0d7
set MAXCHANNELS to 128
...
[ROCm/rccl commit: ef442f8f92 ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
2447161fb4
graph: restrict MAXCHANNELS for certain platforms
...
[ROCm/rccl commit: 9f654f6cf5 ]
2024-06-03 13:05:01 -05:00
Nusrat Islam
ed2f96bc6a
device: update the logic for channelId assignment
...
[ROCm/rccl commit: 48859a97b1 ]
2024-06-03 13:03:18 -05:00
Nusrat Islam
b0b2aa1166
add 256 channels support
...
[ROCm/rccl commit: 506f16c506 ]
2024-06-03 13:03:18 -05:00
akolliasAMD
1934d0f377
fixed typo on BFD linkage ( #1192 )
...
[ROCm/rccl commit: 6475da2ed9 ]
2024-06-03 10:05:47 -06:00
gilbertlee-amd
a009e43f3c
Changing channel stride for MI300X multinode ( #1196 )
...
* Shuffling MI300X multi-node channels
* Updating tree channel logic
[ROCm/rccl commit: 0948eecbba ]
2024-06-03 10:00:55 -06:00
srawat
072c378bc0
doc organization ( #1197 )
...
* doc organization
* removing what is rccl file
* Update index.rst
[ROCm/rccl commit: 3301cdf59a ]
2024-06-03 18:38:45 +05:30
ClementLinCF
4f56aa5f8c
Optimize NCHANNELS and MSCCL config for gfx942 80CUs ( #1195 )
...
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs
Set NCCL_MIN_NCHANNELS and the MSCCL config appropriately,
potentially improving communication performance on the MI300X with 80 CUs
* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml
* Change the factor of gfx94 and update msccl config
[ROCm/rccl commit: cab25f919e ]
2024-06-01 07:07:46 -07:00
Nilesh M Negi
7ca67f1cb9
[BUILD] Update install.sh for RCCL build ( #1191 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 5aaf7121d9 ]
2024-05-31 17:58:34 -05:00
Nilesh M Negi
8518ef4afb
[MSCCL]: Move scratch buffer debug msgs to TRACE ( #1189 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 1249a6c3fd ]
2024-05-31 17:54:23 -05:00
gilbertlee-amd
b35e2b8c4b
Addressing possible out-of-bounds mem access during channel duplication ( #1193 )
...
[ROCm/rccl commit: 354e0b29a6 ]
2024-05-30 14:02:14 -06:00
Wenkai Du
a724f1ebb7
Add ring simple chunk size tuning ( #1180 )
...
* Add ring simple chunk size tuning
* Modify the tuning table to improve broadcast performance for 8MB to 32MB on single-node MI300X after ring simple chunk size tuning
* Modify the tuning table to improve reduce performance for 1MB to 4MB on single-node MI300X after ring simple chunk size tuning
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com >
[ROCm/rccl commit: 73221b4230 ]
2024-05-29 07:59:47 -07:00
Wenkai Du
ae6c372406
Make WSL detection thread safe ( #1178 )
...
* Make WSL detection thread safe
* Change static to beginning
* Switch to use atomics
[ROCm/rccl commit: 8f099b1adb ]
2024-05-28 17:23:50 -07:00
Edgar Gabriel
7f4bec7682
Merge pull request #1175 from edgargabriel/topic/alt_rsmi
...
add alternative to rocm_smi_lib
[ROCm/rccl commit: a78c4f5e88 ]
2024-05-28 09:36:55 -05:00
Joseph Macaranas
23fd9cfd80
Merge pull request #1188 from ROCm/amd/jmacaran/externalCIEnablement
...
Enable external CI pipeline triggers
[ROCm/rccl commit: e4c10e4438 ]
2024-05-27 15:52:49 -04:00
amd-jmacaran
0fde09b879
Enable external CI pipeline triggers
...
[ROCm/rccl commit: 125f841c5f ]
2024-05-23 16:29:05 -04:00
dependabot[bot]
e8a2977f05
--- ( #1183 )
...
updated-dependencies:
- dependency-name: requests
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b06e617bea ]
2024-05-21 09:34:41 -06:00
Wenkai Du
bc29c89d9d
Report error when collective is not enabled in build ( #1177 )
...
* Report error when collective is not enabled in build
* Fix typo
[ROCm/rccl commit: eeea3b693b ]
2024-05-16 10:11:12 -07:00