Commit Graph

1412 Commits

Author SHA1 Message Date
gilbertlee-amd efe851cfff Disabling NUMA matching for model 79 for some VM configs (#1204)
[ROCm/rccl commit: 9b94a1052f]
2024-06-06 17:15:04 -06:00
Wenkai Du 470302a776 Allow multiple parameters during selective function generation (#1201)
* Allow multiple parameters during selective function generation

* Remove debug print

* Add examples into Generator.cmake

[ROCm/rccl commit: 9fcd7b55e1]
2024-06-06 07:07:24 -07:00
Nusrat Islam 99fca5fa3a Merge pull request #1184 from nusislam/256-channel-2
add 256 channels support

[ROCm/rccl commit: 955347bab4]
2024-06-04 08:25:34 -05:00
Nusrat Islam 62cbb3bcdd set MIN_NCHANNEL limit to 64 for multi-node
[ROCm/rccl commit: 9746d8ca3f]
2024-06-03 13:05:05 -05:00
Nusrat Islam b34fd115a1 doubling debug buffer size with increased channels
[ROCm/rccl commit: 0634c5c8e1]
2024-06-03 13:05:05 -05:00
Nusrat Islam 48821ad0d7 set MAXCHANNELS to 128
[ROCm/rccl commit: ef442f8f92]
2024-06-03 13:05:05 -05:00
Nusrat Islam 2447161fb4 graph: restrict MAXCHANNELS for certain platforms
[ROCm/rccl commit: 9f654f6cf5]
2024-06-03 13:05:01 -05:00
Nusrat Islam ed2f96bc6a device: update the logic for channelId assignment
[ROCm/rccl commit: 48859a97b1]
2024-06-03 13:03:18 -05:00
Nusrat Islam b0b2aa1166 add 256 channels support
[ROCm/rccl commit: 506f16c506]
2024-06-03 13:03:18 -05:00
akolliasAMD 1934d0f377 fixed typo on BFD linkage (#1192)
[ROCm/rccl commit: 6475da2ed9]
2024-06-03 10:05:47 -06:00
gilbertlee-amd a009e43f3c Changing channel stride for MI300X multinode (#1196)
* Shuffling MI300X multi-node channels
* Updating tree channel logic

[ROCm/rccl commit: 0948eecbba]
2024-06-03 10:00:55 -06:00
srawat 072c378bc0 doc organization (#1197)
* doc organization

* removing what is rccl file

* Update index.rst

[ROCm/rccl commit: 3301cdf59a]
2024-06-03 18:38:45 +05:30
ClementLinCF 4f56aa5f8c Optimize NCHANNELS and MSCCL config for gfx942 80CUs (#1195)
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs

Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs

* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml

* Change the factor of gfx94 and update msccl config

[ROCm/rccl commit: cab25f919e]
2024-06-01 07:07:46 -07:00
Nilesh M Negi 7ca67f1cb9 [BUILD] Update install.sh for RCCL build (#1191)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 5aaf7121d9]
2024-05-31 17:58:34 -05:00
Nilesh M Negi 8518ef4afb [MSCCL]: Move scratch buffer debug msgs to TRACE (#1189)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 1249a6c3fd]
2024-05-31 17:54:23 -05:00
gilbertlee-amd b35e2b8c4b Addressing possible out-of-bounds mem access during channel duplication (#1193)
[ROCm/rccl commit: 354e0b29a6]
2024-05-30 14:02:14 -06:00
Wenkai Du a724f1ebb7 Add ring simple chunk size tuning (#1180)
* Add ring simple chunk size tuning

* modifying the tuning table to improve the performance of broadcast for 8MB to 32MB for single-node MI300X after ring simple chunk size tuning

* modifying the tuning table to improve the performance of reduce for 1MB to 4MB for single-node MI300X after ring simple chunk size tuning

---------

Co-authored-by: PedramAlizadeh <pmohamma@amd.com>

[ROCm/rccl commit: 73221b4230]
2024-05-29 07:59:47 -07:00
Wenkai Du ae6c372406 Make WSL detection thread safe (#1178)
* Make WSL detection thread safe

* Change static to beginning

* Switch to use atomics

[ROCm/rccl commit: 8f099b1adb]
2024-05-28 17:23:50 -07:00
Edgar Gabriel 7f4bec7682 Merge pull request #1175 from edgargabriel/topic/alt_rsmi
add alternative to rocm_smi_lib

[ROCm/rccl commit: a78c4f5e88]
2024-05-28 09:36:55 -05:00
Joseph Macaranas 23fd9cfd80 Merge pull request #1188 from ROCm/amd/jmacaran/externalCIEnablement
Enable external CI pipeline triggers

[ROCm/rccl commit: e4c10e4438]
2024-05-27 15:52:49 -04:00
amd-jmacaran 0fde09b879 Enable external CI pipeline triggers
[ROCm/rccl commit: 125f841c5f]
2024-05-23 16:29:05 -04:00
dependabot[bot] e8a2977f05 --- (#1183)
updated-dependencies:
- dependency-name: requests
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: b06e617bea]
2024-05-21 09:34:41 -06:00
Wenkai Du bc29c89d9d Report error when collective is not enabled in build (#1177)
* Report error when collective is not enabled in build

* Fix typo

[ROCm/rccl commit: eeea3b693b]
2024-05-16 10:11:12 -07:00
Edgar Gabriel c31ac0d17b add alternative to rocm_smi_lib
[ROCm/rccl commit: 9ad913bfa8]
2024-05-14 13:51:41 -07:00
gilbertlee-amd 33a9e1c29f RCCL Replayer - multi communicator support (#1176)
[ROCm/rccl commit: 52fa5d1178]
2024-05-13 10:56:32 -06:00
Wenkai Du 984f87a5d5 Support WSL2 (#1173)
[ROCm/rccl commit: ecafc1969c]
2024-05-10 07:31:12 -07:00
Tim b7c743b3a0 Upload npkit_trace_analysis.py (#1152)
script for parsing json trace, generating heatmap, throughput series, etc.

[ROCm/rccl commit: f078db5998]
2024-05-09 16:27:49 -04:00
mberenjk 42d9f2458a Changing the failure print to reflect a bad installation of MSCCL (#1160)
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 87d01e6bf5]
2024-05-08 16:56:26 -05:00
Wenkai Du dd5d66d01a Use rocm-smi thread only mutex when available (#1169)
[ROCm/rccl commit: a64aab5f63]
2024-05-08 14:32:24 -07:00
Pedram Alizadeh 43d875b422 modifying the tuning table to improve the performance of broadcast for 1MB to 64MB for single-node MI300X (#1172)
[ROCm/rccl commit: 73acf3eeec]
2024-05-08 15:49:33 -04:00
mberenjk 00c0f3d67f Adding ASAN changes to address memory leak issue" (#1170)
Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 408278209d]
2024-05-08 09:16:00 -05:00
Wenkai Du 63d8de3bd3 Add compiler warning for uninitialized variable and fix (#1163)
* Add compiler warning for uninitialized variable and fix

* Add -Wsometimes-uninitialized

* Convert warning to error

[ROCm/rccl commit: b18784d8b8]
2024-05-08 07:00:25 -07:00
Wenkai Du eab9978e48 Use normal permute path when one NIC per GPU (#1171)
[ROCm/rccl commit: f679db6ff6]
2024-05-08 06:59:57 -07:00
Wenkai Du 0ff5fc0bad npkit: add broadcast trace (#1166)
[ROCm/rccl commit: a0cef69110]
2024-05-07 14:00:16 -07:00
Pak Nin Lui df3d462dd9 Merge pull request #1167 from paklui/dmabuf
fix typo for DMABUF_ENABLE

[ROCm/rccl commit: 92a4fc6204]
2024-05-07 08:48:44 -07:00
dependabot[bot] 0d025525ad Bump jinja2 from 3.1.3 to 3.1.4 in /docs/sphinx (#1168)
Bumps [jinja2](https://github.com/pallets/jinja) from 3.1.3 to 3.1.4.
- [Release notes](https://github.com/pallets/jinja/releases)
- [Changelog](https://github.com/pallets/jinja/blob/main/CHANGES.rst)
- [Commits](https://github.com/pallets/jinja/compare/3.1.3...3.1.4)

---
updated-dependencies:
- dependency-name: jinja2
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/rccl commit: eb562e7b22]
2024-05-06 15:35:34 -06:00
paklui dd8e937948 fix typo for DMABUF_ENABLE
[ROCm/rccl commit: 140b7dd40f]
2024-05-06 13:27:50 -07:00
Wenkai Du c782aba364 Bypass NVIDIA Ampere related tuning (#1165)
[ROCm/rccl commit: b513c3970a]
2024-05-03 17:57:16 -07:00
Wenkai Du 7c811a7582 Fix ignore NUMA not being observed for NICs during model matching (#1164)
[ROCm/rccl commit: bb58b1c258]
2024-05-03 16:42:07 -07:00
Wenkai Du 9638535690 Fix build error when roctracer-dev package is not installed (#1161)
[ROCm/rccl commit: 6f5a8ce1fb]
2024-05-01 13:55:09 -07:00
Wenkai Du 3906e992f8 MSCCL: add support for out-of-place all reduce (#1156)
[ROCm/rccl commit: 4e1b8c1cbb]
2024-04-28 19:49:09 -07:00
Wenkai Du 703014e960 Add back tree simple chunk size tuning (#1157)
[ROCm/rccl commit: cd6e840e0b]
2024-04-28 19:48:53 -07:00
Nilesh M Negi b99b89e7a2 [GRAPH] Reduce NCCL_TOPO_MAX_NODES to 64 (#1153)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: b90436d292]
2024-04-27 23:41:11 -05:00
Tim afeaa17475 Merge pull request #1158 from AtlantaPepsi/NPKit_fix
Prevent segfault from npkit-enabled rccl build

[ROCm/rccl commit: cc39e91c6f]
2024-04-26 12:44:04 -04:00
AtlantaPepsi 8cf28704ce prevent segfault from npkit-enabled rccl build
Signed-off-by: AtlantaPepsi <timhu102@amd.com>


[ROCm/rccl commit: 67246649ac]
2024-04-26 10:54:27 -05:00
Wenkai Du 3c94f98688 Revert "Use relaxed atomics for LL on GFX11 (#859)" (#1148)
This reverts commit 5983f0e371.

Use inline asm for 128b load on GFX11 for better performance.

[ROCm/rccl commit: f330b82985]
2024-04-26 07:49:55 -07:00
Bertan Dogancay dea5e83940 [UT] Start supporting multiple group calls and graphs (#1151)
* Start supporting multiple group calls UT

[ROCm/rccl commit: 0ec41f1386]
2024-04-25 11:11:16 -06:00
Shilei Tian 9a203f439c SWDEV-455705: Fix an UB that could lead to miscompilation (#1155)
[ROCm/rccl commit: efe99057b0]
2024-04-25 10:10:01 -07:00
Wenkai Du e494f29235 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154)
[ROCm/rccl commit: 9e0c9b4ed8]
2024-04-25 07:19:18 -07:00
Bertan Dogancay ed152c5b89 Update CHANGELOG.md for RCCL 2.20.5 (#1150)
[ROCm/rccl commit: dcc75797a1]
2024-04-24 09:07:49 -06:00