gilbertlee-amd
efe851cfff
Disabling NUMA matching for model 79 for some VM configs ( #1204 )
...
[ROCm/rccl commit: 9b94a1052f ]
2024-06-06 17:15:04 -06:00
Wenkai Du
470302a776
Allow multiple parameters during selective function generation ( #1201 )
...
* Allow multiple parameters during selective function generation
* Remove debug print
* Add examples into Generator.cmake
[ROCm/rccl commit: 9fcd7b55e1 ]
2024-06-06 07:07:24 -07:00
Nusrat Islam
62cbb3bcdd
set MIN_NCHANNEL limit to 64 for multi-node
...
[ROCm/rccl commit: 9746d8ca3f ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
b34fd115a1
doubling debug buffer size with increased channels
...
[ROCm/rccl commit: 0634c5c8e1 ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
48821ad0d7
set MAXCHANNELS to 128
...
[ROCm/rccl commit: ef442f8f92 ]
2024-06-03 13:05:05 -05:00
Nusrat Islam
2447161fb4
graph: restrict MAXCHANNELS for certain platforms
...
[ROCm/rccl commit: 9f654f6cf5 ]
2024-06-03 13:05:01 -05:00
Nusrat Islam
ed2f96bc6a
device: update the logic for channelId assignment
...
[ROCm/rccl commit: 48859a97b1 ]
2024-06-03 13:03:18 -05:00
Nusrat Islam
b0b2aa1166
add 256 channels support
...
[ROCm/rccl commit: 506f16c506 ]
2024-06-03 13:03:18 -05:00
akolliasAMD
1934d0f377
fixed typo on BFD linkage ( #1192 )
...
[ROCm/rccl commit: 6475da2ed9 ]
2024-06-03 10:05:47 -06:00
gilbertlee-amd
a009e43f3c
Changing channel stride for MI300X multinode ( #1196 )
...
* Shuffling MI300X multi-node channels
* Updating tree channel logic
[ROCm/rccl commit: 0948eecbba ]
2024-06-03 10:00:55 -06:00
srawat
072c378bc0
doc organization ( #1197 )
...
* doc organization
* removing what is rccl file
* Update index.rst
[ROCm/rccl commit: 3301cdf59a ]
2024-06-03 18:38:45 +05:30
ClementLinCF
4f56aa5f8c
Optimize NCHANNELS and MSCCL config for gfx942 80CUs ( #1195 )
...
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs
Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs
* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml
* Change the factor of gfx94 and update msccl config
[ROCm/rccl commit: cab25f919e ]
2024-06-01 07:07:46 -07:00
Nilesh M Negi
7ca67f1cb9
[BUILD] Update install.sh for RCCL build ( #1191 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 5aaf7121d9 ]
2024-05-31 17:58:34 -05:00
Nilesh M Negi
8518ef4afb
[MSCCL]: Move scratch buffer debug msgs to TRACE ( #1189 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 1249a6c3fd ]
2024-05-31 17:54:23 -05:00
gilbertlee-amd
b35e2b8c4b
Addressing possible out-of-bounds mem access during channel duplication ( #1193 )
...
[ROCm/rccl commit: 354e0b29a6 ]
2024-05-30 14:02:14 -06:00
Wenkai Du
a724f1ebb7
Add ring simple chunk size tuning ( #1180 )
...
* Add ring simple chunk size tuning
* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning
* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning
---------
Co-authored-by: PedramAlizadeh <pmohamma@amd.com>
[ROCm/rccl commit: 73221b4230 ]
2024-05-29 07:59:47 -07:00
Wenkai Du
ae6c372406
Make WSL detection thread safe ( #1178 )
...
* Make WSL detection thread safe
* Change static to beginning
* Switch to use atomics
[ROCm/rccl commit: 8f099b1adb ]
2024-05-28 17:23:50 -07:00
Edgar Gabriel
7f4bec7682
Merge pull request #1175 from edgargabriel/topic/alt_rsmi
...
add alternative to rocm_smi_lib
[ROCm/rccl commit: a78c4f5e88 ]
2024-05-28 09:36:55 -05:00
amd-jmacaran
0fde09b879
Enable external CI pipeline triggers
...
[ROCm/rccl commit: 125f841c5f ]
2024-05-23 16:29:05 -04:00
dependabot[bot]
e8a2977f05
--- ( #1183 )
...
updated-dependencies:
- dependency-name: requests
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: b06e617bea ]
2024-05-21 09:34:41 -06:00
Wenkai Du
bc29c89d9d
Report error when collective is not enabled in build ( #1177 )
...
* Report error when collective is not enabled in build
* Fix typo
[ROCm/rccl commit: eeea3b693b ]
2024-05-16 10:11:12 -07:00
Edgar Gabriel
c31ac0d17b
add alternative to rocm_smi_lib
...
[ROCm/rccl commit: 9ad913bfa8 ]
2024-05-14 13:51:41 -07:00
gilbertlee-amd
33a9e1c29f
RCCL Replayer - multi communicator support ( #1176 )
...
[ROCm/rccl commit: 52fa5d1178 ]
2024-05-13 10:56:32 -06:00
Wenkai Du
984f87a5d5
Support WSL2 ( #1173 )
...
[ROCm/rccl commit: ecafc1969c ]
2024-05-10 07:31:12 -07:00
Tim
b7c743b3a0
Upload npkit_trace_analysis.py ( #1152 )
...
script for parsing json trace, generating heatmap, throughput series, etc.
[ROCm/rccl commit: f078db5998 ]
2024-05-09 16:27:49 -04:00
mberenjk
42d9f2458a
Changing the failure print to reflect a bad installation of MSCCL ( #1160 )
...
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
[ROCm/rccl commit: 87d01e6bf5 ]
2024-05-08 16:56:26 -05:00
Wenkai Du
dd5d66d01a
Use rocm-smi thread only mutex when available ( #1169 )
...
[ROCm/rccl commit: a64aab5f63 ]
2024-05-08 14:32:24 -07:00
Pedram Alizadeh
43d875b422
modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X ( #1172 )
...
[ROCm/rccl commit: 73acf3eeec ]
2024-05-08 15:49:33 -04:00
mberenjk
00c0f3d67f
Adding ASAN changes to address memory leak issue ( #1170 )
...
Co-authored-by: akolliasAMD <akollias@amd.com>
[ROCm/rccl commit: 408278209d ]
2024-05-08 09:16:00 -05:00
Wenkai Du
63d8de3bd3
Add compiler warning for uninitialized variable and fix ( #1163 )
...
* Add compiler warning for uninitialized variable and fix
* Add -Wsometimes-uninitialized
* Convert warning to error
[ROCm/rccl commit: b18784d8b8 ]
2024-05-08 07:00:25 -07:00
Wenkai Du
eab9978e48
Use normal permute path when one NIC per GPU ( #1171 )
...
[ROCm/rccl commit: f679db6ff6 ]
2024-05-08 06:59:57 -07:00
Wenkai Du
0ff5fc0bad
npkit: add broadcast trace ( #1166 )
...
[ROCm/rccl commit: a0cef69110 ]
2024-05-07 14:00:16 -07:00
Pak Nin Lui
df3d462dd9
Merge pull request #1167 from paklui/dmabuf
...
fix typo for DMABUF_ENABLE
[ROCm/rccl commit: 92a4fc6204 ]
2024-05-07 08:48:44 -07:00
dependabot[bot]
0d025525ad
Bump jinja2 from 3.1.3 to 3.1.4 in /docs/sphinx ( #1168 )
...
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.3...3.1.4)
---
updated-dependencies:
- dependency-name: jinja2
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: eb562e7b22 ]
2024-05-06 15:35:34 -06:00
paklui
dd8e937948
fix typo for DMABUF_ENABLE
...
[ROCm/rccl commit: 140b7dd40f ]
2024-05-06 13:27:50 -07:00
Wenkai Du
c782aba364
Bypass NVIDIA Ampere related tuning ( #1165 )
...
[ROCm/rccl commit: b513c3970a ]
2024-05-03 17:57:16 -07:00
Wenkai Du
7c811a7582
Fix ignore NUMA not being observed for NICs during model matching ( #1164 )
...
[ROCm/rccl commit: bb58b1c258 ]
2024-05-03 16:42:07 -07:00
Wenkai Du
9638535690
Fix build error when roctracer-dev package is not installed ( #1161 )
...
[ROCm/rccl commit: 6f5a8ce1fb ]
2024-05-01 13:55:09 -07:00
Wenkai Du
3906e992f8
MSCCL: add support for out-of-place all reduce ( #1156 )
...
[ROCm/rccl commit: 4e1b8c1cbb ]
2024-04-28 19:49:09 -07:00
Wenkai Du
703014e960
Add back tree simple chunk size tuning ( #1157 )
...
[ROCm/rccl commit: cd6e840e0b ]
2024-04-28 19:48:53 -07:00
Nilesh M Negi
b99b89e7a2
[GRAPH] Reduce NCCL_TOPO_MAX_NODES to 64 ( #1153 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: b90436d292 ]
2024-04-27 23:41:11 -05:00
Tim
afeaa17475
Merge pull request #1158 from AtlantaPepsi/NPKit_fix
...
Prevent segfault from npkit-enabled rccl build
[ROCm/rccl commit: cc39e91c6f ]
2024-04-26 12:44:04 -04:00
AtlantaPepsi
8cf28704ce
prevent segfault from npkit-enabled rccl build
...
Signed-off-by: AtlantaPepsi <timhu102@amd.com>
[ROCm/rccl commit: 67246649ac ]
2024-04-26 10:54:27 -05:00
Wenkai Du
3c94f98688
Revert "Use relaxed atomics for LL on GFX11 ( #859 )" ( #1148 )
...
This reverts commit 5983f0e371.
Use inline asm for 128b load on GFX11 for better performance.
[ROCm/rccl commit: f330b82985 ]
2024-04-26 07:49:55 -07:00
Bertan Dogancay
dea5e83940
[UT] Start supporting multiple group calls and graphs ( #1151 )
...
* Start supporting multiple group calls UT
[ROCm/rccl commit: 0ec41f1386 ]
2024-04-25 11:11:16 -06:00
Shilei Tian
9a203f439c
SWDEV-455705: Fix a UB that could lead to miscompilation ( #1155 )
...
[ROCm/rccl commit: efe99057b0 ]
2024-04-25 10:10:01 -07:00
Wenkai Du
e494f29235
Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ ( #1154 )
...
[ROCm/rccl commit: 9e0c9b4ed8 ]
2024-04-25 07:19:18 -07:00
Bertan Dogancay
ed152c5b89
Update CHANGELOG.md for RCCL 2.20.5 ( #1150 )
...
[ROCm/rccl commit: dcc75797a1 ]
2024-04-24 09:07:49 -06:00
BertanDogancay
36f9492cda
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: e1a835910e ]
2024-04-23 13:34:00 -07:00
Wenkai Du
35f8d269f8
Use hipExtMallocWithFlags to allocate host memory on APU ( #1149 )
...
Also use SM60 as CUDA compatibility level.
[ROCm/rccl commit: 220066197a ]
2024-04-17 16:56:38 -07:00