Tim
9fdecceefb
Adding core binding in info ( #1212 )
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com >
[ROCm/rccl commit: 4200964202 ]
2024-08-08 11:36:24 -04:00
Nilesh M Negi
3e52d15ced
[README] Tips on using less than 8 MI300 GPUs ( #1270 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: a2474846f5 ]
2024-08-06 11:12:09 -05:00
Nilesh M Negi
35f4a405f0
[BUILD] Update gfxTargets for ASAN build ( #1242 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 4f31ab85ea ]
2024-08-06 10:53:51 -05:00
Ziyue Yang
30e3db969f
Fix number of loops in p2p-latency-test ( #1286 )
...
[ROCm/rccl commit: 145a13235a ]
2024-08-05 13:35:56 -07:00
Nilesh M Negi
713ed3341d
[BUILD] Disable MSCCLPP build by default ( #1283 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: cb2e0615d7 ]
2024-08-02 23:17:51 -05:00
Tim
3261e2a5fd
Adding User Buffer Registration support for Unit test ( #1199 )
...
* Adding UBR support for UT SendRecv
Signed-off-by: Tim Hu <timhu102@amd.com >
* Update test/common/TestBedChild.cpp
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com >
---------
Signed-off-by: Tim Hu <timhu102@amd.com >
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com >
[ROCm/rccl commit: a4793286c7 ]
2024-07-30 13:39:25 -04:00
Wenkai Du
27b7998d13
Restore number of parallel linking jobs ( #1278 )
...
* Restore number of parallel linking jobs
* Dynamically adjust number of linker jobs with limit of 16 jobs max
* Fix typo
* Add cgroup v1 support
[ROCm/rccl commit: ca5341d419 ]
2024-07-30 08:04:14 -07:00
Pedram Alizadeh
562eb08978
Adding tuner plugin example for MI300 ( #1274 )
...
[ROCm/rccl commit: b005c13292 ]
2024-07-29 15:43:36 -04:00
Richard Barnes
92d874be50
Remove unused but set variable from all_reduce.h ( #1258 )
...
Allows `-Wunused-but-set-variable` to pass
[ROCm/rccl commit: d09b152aa0 ]
2024-07-29 08:11:24 -07:00
Richard Barnes
3d208c8eb9
Remove unused but set variable from prims_ll128.h ( #1257 )
...
Allows `-Wunused-but-set-variable` to pass
[ROCm/rccl commit: 86a4ad6e8b ]
2024-07-29 08:11:01 -07:00
Richard Barnes
780324296c
Remove unused but set variable from prims_ll.h ( #1256 )
...
Allows `-Wunused-but-set-variable` to pass
[ROCm/rccl commit: 7ad432ee23 ]
2024-07-29 08:10:38 -07:00
akolliasAMD
37c44d531b
gfx12 Disable ll protocol ( #1268 )
...
[ROCm/rccl commit: c246e25f8e ]
2024-07-26 08:59:55 -06:00
Sam Wu
0aa81c3194
Double compile timeout for extended ci to 400 min ( #1277 )
...
[ROCm/rccl commit: 05dca6def9 ]
2024-07-26 09:59:36 -05:00
Benjamin Kitor
d2df042c36
topo_expl: Update channel masks for >64 channels ( #1279 )
...
[ROCm/rccl commit: 4bc118336a ]
2024-07-25 17:27:34 -07:00
Joseph Macaranas
496e98a73f
Merge pull request #1262 from ROCm/amd/jmacaran/externalCImainline
...
External CI: Add triggers for mainline branch
[ROCm/rccl commit: 00cd4dae1e ]
2024-07-22 20:41:01 -04:00
corey-derochie-amd
94910b8f80
Fix bug where the first collective call was using MSCCL instead of MSCCL++ ( #1260 )
...
[ROCm/rccl commit: 69135976d6 ]
2024-07-22 15:46:47 -06:00
saurabhAMD
24e1ed5288
Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing ( #1265 )
...
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
[ROCm/rccl commit: cf311b71ee ]
2024-07-22 10:21:29 -05:00
corey-derochie-amd
f2b2372056
Only initialize MSCCL++ when runtime-enabled. ( #1266 )
...
[ROCm/rccl commit: b31b4082dd ]
2024-07-22 00:41:31 -06:00
mberenjk
863b213fd2
adding rocprof and pytorch parser scripts ( #1214 )
...
* adding rocprof parser script
* adding the support for multiple json files
* adding pytorch profiler script
* remove filtering from pytorch log
* addressing the comments and adding the feature to parse all kernels
* completing the report for torch profiler
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
[ROCm/rccl commit: 519843d2cf ]
2024-07-19 14:51:28 -05:00
Nusrat Islam
df63c9772f
Enable CPX mode for MI300X ( #1259 )
...
* graph: enable cpx mode for MI300X
* graph: tune limits for cpx and cleanup
[ROCm/rccl commit: 6f331b0d43 ]
2024-07-19 11:30:37 -05:00
Wenkai Du
54e4899607
Template unroll for RCCL kernels ( #1250 )
...
* Template unroll for RCCL kernels
* Adding unroll template arg during CMake hipification
* Reduce linking parallel jobs to avoid OOM in CI
* Workaround issues with UT tests
SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking
* CI: do not use -j 16 when building
* CI: use -j 8 when building
* Only reduce parallel linking job for CI extended
* Restore original jenkins command. Change parallel linking jobs in cmake
* Disable MSCCLPP
---------
Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com >
[ROCm/rccl commit: 89349f2ce4 ]
2024-07-19 08:15:59 -07:00
Nilesh M Negi
73e17b3e70
Consistent channel shuffling for MI300X multi-node ( #1255 )
...
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"
This reverts commit 8a7dd0e590.
* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"
This reverts commit bb6fab3d8e.
[ROCm/rccl commit: a1ef217b32 ]
2024-07-18 10:18:09 -05:00
amd-jmacaran
1ade35ac64
External CI: Add triggers for mainline branch
...
[ROCm/rccl commit: 346fee4c83 ]
2024-07-17 23:16:49 -04:00
Nilesh M Negi
13134c6c64
[GRAPH] Disable MSCCL override of no. of channels ( #1187 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 67e867271f ]
2024-07-15 10:45:21 -05:00
corey-derochie-amd
da08e5ed1e
Only enable MSCCL++ AllReduce for message sizes that are multiples of 32 ( #1253 )
...
* Only enable MSCCL++ AllReduce for message sizes that are multiples of 32. MSCCL++ does not handle other sizes.
* Sanitized MSCCL++ logging.
[ROCm/rccl commit: 9cbb3da224 ]
2024-07-12 17:04:23 -07:00
corey-derochie-amd
b8542c2477
Integrated RCCL with MSCCL++ for small message sizes ( #1231 )
...
[ROCm/rccl commit: 6dc47eecd7 ]
2024-07-12 15:32:58 -06:00
Rahul Vaidya
f60367f1c3
Improved version reporting in NCCL_DEBUG=VERSION ( #1232 )
...
* Improved version reporting in NCCL_DEBUG=VERSION.
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Version reporting changes
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
* Versioning changes: Initialized char arrays to null and fixed typo.
---------
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
[ROCm/rccl commit: c755b9cf93 ]
2024-07-12 08:14:29 -05:00
akolliasAMD
942555fd21
gfx12 initial enablement ( #1219 )
...
[ROCm/rccl commit: 63e4d76e23 ]
2024-07-10 13:32:09 -06:00
akolliasAMD
9644767ead
cleaned codeowners file ( #1247 )
...
[ROCm/rccl commit: 7e78641dc1 ]
2024-07-09 10:31:23 -06:00
dependabot[bot]
4a75b7efc4
Bump certifi from 2024.2.2 to 2024.7.4 in /docs/sphinx ( #1241 )
...
Bumps [certifi](https://github.com/certifi/python-certifi) from 2024.2.2 to 2024.7.4.
- [Commits](https://github.com/certifi/python-certifi/compare/2024.02.02...2024.07.04)
---
updated-dependencies:
- dependency-name: certifi
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 71e0f551e7 ]
2024-07-08 11:49:35 -06:00
dependabot[bot]
e73afaecaa
Bump rocm-docs-core from 1.4.1 to 1.5.0 in /docs/sphinx ( #1240 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.1 to 1.5.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.1...v1.5.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 799a8b5e59 ]
2024-07-08 10:59:32 -06:00
corey-derochie-amd
37bf54b8f8
Enable multi-threading for MSCCL ( #1203 )
...
MSCCL can now run in a multi-threaded configuration. To test in the unit tests, added the ENABLE_OPENMP compile definition flag and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
[ROCm/rccl commit: 0c36d571ea ]
2024-07-04 09:34:38 -06:00
Wenkai Du
b5bc883f61
Checking kernel header files only when missing sysfs entry ( #1239 )
...
[ROCm/rccl commit: 45f3fbc52f ]
2024-07-03 15:53:15 -07:00
dependabot[bot]
a4fba80661
Bump rocm-docs-core from 1.4.0 to 1.4.1 in /docs/sphinx ( #1233 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.4.0 to 1.4.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.4.0...v1.4.1)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-patch
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: aeaaacad26 ]
2024-07-03 09:32:49 -06:00
Nilesh M Negi
8a7dd0e590
[GRAPH] Use channel shuffling only for IB systems ( #1228 )
...
* [GRAPH] Use channel shuffling only for IB systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
* [GRAPH] Define channels=48 for gfx94 RoCE systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
* [GRAPH] Increase channels for RoCE gfx94 systems
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com >
[ROCm/rccl commit: 5be3b713ef ]
2024-07-02 12:20:40 -05:00
Wenkai Du
b3e9d2d61b
NPKit: separate time stamps for GPU access from different blocks ( #1229 )
...
To avoid races in memory access in GPU
[ROCm/rccl commit: 9d8f68b4ee ]
2024-06-28 08:00:22 -07:00
Nusrat Islam
c24b1ff14f
graph: fix minNchannels for multi-node overwrite ( #1230 )
...
[ROCm/rccl commit: b09ea29d66 ]
2024-06-26 16:56:10 -05:00
Wenkai Du
bb6fab3d8e
Revert "Changing channel stride for MI300X multinode ( #1196 )" ( #1224 )
...
This reverts the channel stride change in commit a009e43f3c
[ROCm/rccl commit: ad31d93f3d ]
2024-06-25 14:03:30 -07:00
saurabhAMD
de7ea612d7
Unit Tests for testing channels ( #1222 )
...
[ROCm/rccl commit: e170f41ddd ]
2024-06-25 10:10:10 -05:00
Jack Taylor
ebcac26530
Add pytorch rccl/intra node all-reduce benchmark ( #1221 )
...
* Add gpt-fast pytorch all reduce benchmark script
* Update readme instructions
* Minor changes
[ROCm/rccl commit: 5f2b88bc28 ]
2024-06-25 08:04:38 -07:00
Nusrat Islam
3c7bbd3243
Merge pull request #1223 from nusislam/minNchannels-multinode
...
graph: fix minNchannels for multi-node
[ROCm/rccl commit: 9f2514e5c8 ]
2024-06-25 10:03:35 -05:00
Wenkai Du
2e6c26a36c
Fix DMABUF support ( #1218 )
...
* Fix DMABUF support
* Reduce log output by moving dmabuf allocation details to TRACE
* Enable peer memory GDR support if ib_umem_get_peer is in kernel
[ROCm/rccl commit: 5d7078e383 ]
2024-06-25 08:00:15 -07:00
Nusrat Islam
bf164da244
graph: fix minNchannels for multi-node
...
Multi-node RCCL was not setting the minNchannels value correctly. This
PR fixes the bug.
[ROCm/rccl commit: 05df0f8cea ]
2024-06-24 16:42:44 -05:00
dependabot[bot]
1fb4fd2e94
Bump urllib3 from 2.2.1 to 2.2.2 in /docs/sphinx ( #1215 )
...
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2)
---
updated-dependencies:
- dependency-name: urllib3
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 1ddb02c010 ]
2024-06-18 14:52:24 -06:00
dependabot[bot]
ebbaccd32a
Bump rocm-docs-core from 1.1.3 to 1.4.0 in /docs/sphinx ( #1213 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.1.3 to 1.4.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.1.3...v1.4.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 53dcfcc5e0 ]
2024-06-17 09:32:27 -06:00
Sam Wu
17eb7a3c6b
Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core ( #1190 )
...
* Add doc team as owners of RTD config
* Update Read the Docs configuration to use Python 3.10 and latest rocm-docs-core
[ROCm/rccl commit: 9f01acc030 ]
2024-06-14 12:12:22 -06:00
saurabhAMD
09c4d50e50
Merge pull request #1211 from saurabhAMD/channel
...
enable UT to test with channels greater than 64
[ROCm/rccl commit: 959545dce2 ]
2024-06-13 14:38:38 -05:00
saurabhAMD
44064a612c
enable UT to test with channels greater than 64
...
[ROCm/rccl commit: 392a73fdef ]
2024-06-13 13:54:08 -05:00
Paul Emberson
f7fb3392fb
fix initOnceFunc setting incorrect result code ( #1205 )
...
Addresses DMA-BUF support check unexpectedly failing
[ROCm/rccl commit: 435756af02 ]
2024-06-07 16:47:19 -07:00
Nusrat Islam
7c029672a8
Merge pull request #1200 from nusislam/multi-node-256-fix
...
graph: fix multi-node channel count
[ROCm/rccl commit: 9660e2e2dc ]
2024-06-07 14:34:20 -05:00