23 Commits

Author SHA1 Message Date
Mustafa Abduljabbar 128b0e7074 Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support

[ROCm/rccl commit: d665547eef]
2025-05-13 17:07:03 -05:00
Mustafa Abduljabbar a85cfaa680 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML


[ROCm/rccl commit: aa7991dfc8]
2025-04-24 09:02:03 -04:00
Pedram Alizadeh b225281747 single-node AR msccl algorithm tuning for MI300 (#1629)
[ROCm/rccl commit: 5b36b68d06]
2025-04-10 10:42:28 -04:00
Wenkai Du 5f8571dcbc msccl: disable 1-shot xmls (#1375)
MSCCL 1-shot xmls may cause different output values on different ranks.
Disabling them for now to avoid undefined behavior in applications.

[ROCm/rccl commit: 62d10fdc25]
2024-10-14 15:10:53 -07:00
Wenkai Du 9ad1fe571b Temporarily disable MSCCL all gather XMLs due to UT failure (#1373)
[ROCm/rccl commit: a680e329e6]
2024-10-12 08:43:16 -07:00
ClementLinCF 4f56aa5f8c Optimize NCHANNELS and MSCCL config for gfx942 80CUs (#1195)
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs

Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs

* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml

* Change the factor of gfx94 and update msccl config

[ROCm/rccl commit: cab25f919e]
2024-06-01 07:07:46 -07:00
Wenkai Du 3906e992f8 MSCCL: add support for out-of-place all reduce (#1156)
[ROCm/rccl commit: 4e1b8c1cbb]
2024-04-28 19:49:09 -07:00
Pedram Alizadeh 61f89d680d msccl algorithms tuning for alltoall on MI300 (#1120)
Co-authored-by: PedramAlizadeh <amd@pmohamma.com>

[ROCm/rccl commit: c2fc1d6809]
2024-03-21 20:35:29 -04:00
Pedram Alizadeh 17b9546da9 msccl algorithms tuning for allgather on MI300 (#1110)
[ROCm/rccl commit: 50f22e8317]
2024-03-14 12:18:26 -04:00
Pedram Alizadeh bf48d1bc4d msccl algorithms tuning for allreduce on MI300 (#1088)
[ROCm/rccl commit: 5a0f9990a9]
2024-02-21 11:31:56 -05:00
Ziyue Yang e3d45f9de4 Improve MSCCL algorithms (#1023)
[ROCm/rccl commit: 0a53077c9c]
2024-01-03 14:51:34 -08:00
Ziyue Yang 62299668bd Tune MSCCL all-reduce algorithm (#1009)
[ROCm/rccl commit: bb144dcd50]
2023-12-08 17:47:02 -06:00
Wen-Heng (Jack) Chung 0266febb31 Let 320KB message size uses LL protocol. (#1006)
[ROCm/rccl commit: 8e8323252a]
2023-12-06 18:14:31 -06:00
Ziyue Yang cef45b8311 Fix mscclAlgoHandle not initialized issue (#995)
[ROCm/rccl commit: e44e112a17]
2023-12-01 07:58:01 -08:00
Ziyue Yang f0c47d085e Move MSCCL algorithm loading to initialization to workaround HIP graph conflict (#982)
* MSCCL: pre-specify channels and pre-load algorithms

* add mutex

* fix bug

* clean include

* disable all-gathers temporarily

[ROCm/rccl commit: 4bb0b4a380]
2023-11-30 09:47:20 -08:00
Ziyue Yang 2351578d5b Optimize MSCCL all-gather algorithms for gfx942 (#964)
[ROCm/rccl commit: 7ae95db5b8]
2023-11-15 08:18:59 -08:00
akolliasAMD 691df735a3 Revert "Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)" (#939)
This reverts commit 769f00db5c.

[ROCm/rccl commit: 9f02ee8dea]
2023-10-30 23:52:58 -06:00
Wen-Heng (Jack) Chung 769f00db5c Introduce allgather for MSCCL on 8 sockets up to 320KB. (#931)
[ROCm/rccl commit: bfb8642450]
2023-10-24 18:41:12 -05:00
Wen-Heng (Jack) Chung 89a8493ef8 Introduce allgather MSCCL XML specification for MI250X up to 320KB. (#930)
[ROCm/rccl commit: 3f9ffe4788]
2023-10-24 18:35:55 -05:00
Wen-Heng (Jack) Chung fc2a13c077 Introduce 1-shot allreduce for MI250X Hayabusa. (#929)
[ROCm/rccl commit: 72d5fbddfd]
2023-10-24 16:31:18 -05:00
Wen-Heng (Jack) Chung 49e52e7269 Introduce 1pass allreduce. Tailor it for very small message sizes <= 20KB. (#919)
[ROCm/rccl commit: 341926c60a]
2023-10-16 16:31:08 -05:00
Wenkai Du af04103d72 Add MSCCL xml files (#861)
[ROCm/rccl commit: aeca1af374]
2023-08-23 14:12:34 -07:00
Ziyue Yang f7f669e7f0 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message

[ROCm/rccl commit: e3b2342f39]
2023-03-14 14:34:25 -07:00