6 Commits

Author SHA1 Message Date
Nusrat Islam 53c927678b Tune allreduce performance in CPX mode (single OAM) (#1508)
[ROCm/rccl commit: 7ac82248de]
2025-01-29 08:58:48 -06:00
isaki001 25150b1f20 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package

[ROCm/rccl commit: d89432e8c8]
2025-01-20 08:06:43 -06:00
Nusrat Islam b19a809788 ext-src: tune TP=8 case on MI308 CPX mode (#1446)
Tune the number of blocks for hierarchical mscclpp allreduce.

[ROCm/rccl commit: 42b6831a39]
2024-12-06 08:16:39 -06:00
Nusrat Islam e1c20e7f24 ext-src: Improved allreduce performance in cpx mode for MI308 (#1393)
To get the improved performance for TP=4, the user needs to use
RCCL_MSCCL_FORCE_ENABLE=1 and MSCCLPP_READ_ALLRED=1. For TP=8, the
user should use MSCCLPP_HIERARCHICAL_ALLRED=1.

[ROCm/rccl commit: 0fb3b5eba9]
2024-10-30 08:30:15 -05:00
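
The settings named in commit e1c20e7f24 can be captured in a small launcher helper. This is only a sketch: the function names `cpx_allreduce_env` and `launch_env` are hypothetical and not part of RCCL or MSCCLPP, and the mapping covers only what the commit message states (TP=4 and TP=8). Whether TP=8 also needs `RCCL_MSCCL_FORCE_ENABLE` is not stated in the message, so it is not set here.

```python
import os


def cpx_allreduce_env(tp_size):
    """Return the environment variables the commit message names for a TP size.

    Hypothetical helper: only TP=4 and TP=8 are described in the commit
    message, so any other size returns an empty mapping.
    """
    if tp_size == 4:
        return {"RCCL_MSCCL_FORCE_ENABLE": "1", "MSCCLPP_READ_ALLRED": "1"}
    if tp_size == 8:
        return {"MSCCLPP_HIERARCHICAL_ALLRED": "1"}
    return {}


def launch_env(tp_size, base=None):
    """Copy a base environment (default: os.environ) and add the settings."""
    env = dict(os.environ if base is None else base)
    env.update(cpx_allreduce_env(tp_size))
    return env
```

The resulting mapping could be passed as `env=` to `subprocess.run` when starting a job; the job command itself is outside what the commit describes.
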
Nusrat Islam 5545392913 ext-src: Fix compiler warnings for MSCCLPP integration (#1368)
[ROCm/rccl commit: 6160603d4c]
2024-10-10 08:20:02 -05:00
Nusrat Islam f61053dcba Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

To enable mscclpp in cpx mode, the patch cpx.patch needs to be applied.
The patch can be removed once the build is done, which helps the next
build.

* Use read-based mscclpp allreduce from rccl

By default, MSCCLPP uses remote writes in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote reads. To use it, set the environment variable
MSCCLPP_READ_ALLRED=1.

[ROCm/rccl commit: 4d68751ce1]
2024-10-08 14:42:12 -05:00