Commit Graph

6 Commits

Author SHA1 Message Date
Nusrat Islam 7ac82248de Tune allreduce performance in CPX mode (single OAM) (#1508) 2025-01-29 08:58:48 -06:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
Nusrat Islam 42b6831a39 ext-src: tune TP=8 case on MI308 CPX mode (#1446)
Tune the number of blocks for hierarchical mscclpp allreduce.
2024-12-06 08:16:39 -06:00
Nusrat Islam 0fb3b5eba9 ext-src: Improved allreduce performance in cpx mode for MI308 (#1393)
To get the improved performance for TP=4, set
RCCL_MSCCL_FORCE_ENABLE=1 and MSCCLPP_READ_ALLRED=1. For TP=8,
set MSCCLPP_HIERARCHICAL_ALLRED=1.
2024-10-30 08:30:15 -05:00
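The settings named in this commit message can be sketched as a small launcher helper. This is only an illustration: the three environment variable names come from the commit message, while the function name `mscclpp_allreduce_env`, the `tp_size` parameter, and the example command are hypothetical and not part of RCCL or MSCCL++.

```python
import os


def mscclpp_allreduce_env(tp_size, base_env=None):
    """Return a copy of the environment with the variables from the commit message.

    Hypothetical helper: only the variable names are taken from the commit
    message; everything else is an assumption made for illustration.
    """
    env = dict(os.environ if base_env is None else base_env)
    if tp_size == 4:
        # TP=4: force-enable MSCCL and select the read-based allreduce.
        env["RCCL_MSCCL_FORCE_ENABLE"] = "1"
        env["MSCCLPP_READ_ALLRED"] = "1"
    elif tp_size == 8:
        # TP=8: select the hierarchical allreduce.
        env["MSCCLPP_HIERARCHICAL_ALLRED"] = "1"
    # Other TP sizes are not mentioned in the commit message, so the
    # environment is returned unchanged for them.
    return env


if __name__ == "__main__":
    # Print only the variables this helper adds for each case.
    for tp in (4, 8):
        added = mscclpp_allreduce_env(tp, base_env={})
        print(tp, sorted(added.items()))
```

The returned dictionary can be passed as `env=` to `subprocess.run` when launching a workload (for example, a hypothetical `["./my_benchmark"]` command), which keeps the settings scoped to that one process instead of the whole shell session.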
Nusrat Islam 6160603d4c ext-src: Fix compiler warnings for MSCCLPP integration (#1368) 2024-10-10 08:20:02 -05:00
Nusrat Islam 4d68751ce1 Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

To enable mscclpp in cpx mode, a patch (cpx.patch) needs to be applied.
The patch can be removed once the build is done, which helps the
next build.

* Use read-based mscclpp allreduce from rccl

MSCCLPP by default uses remote write in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote read. Users enable it by setting the environment variable
MSCCLPP_READ_ALLRED=1.
2024-10-08 14:42:12 -05:00
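The difference between the remote-write and remote-read approaches mentioned in this commit can be illustrated with a toy model. This is not the MSCCL++ kernel and says nothing about its performance; it only shows the direction of data movement, with plain Python lists standing in for per-rank GPU buffers. The function names `allreduce_write` and `allreduce_read` are hypothetical.

```python
def allreduce_write(inputs):
    """Toy write-based allreduce: every rank pushes its data into its peers' inboxes."""
    n = len(inputs)
    # inbox[dst][src] holds the copy that rank `src` wrote into rank `dst`.
    inbox = [[None] * n for _ in range(n)]
    for src in range(n):
        for dst in range(n):
            inbox[dst][src] = list(inputs[src])  # "remote write" by src
    # Each rank then reduces only what landed in its own inbox.
    return [[sum(col) for col in zip(*inbox[rank])] for rank in range(n)]


def allreduce_read(inputs):
    """Toy read-based allreduce: every rank pulls data directly from its peers."""
    n = len(inputs)
    outputs = []
    for rank in range(n):
        acc = [0] * len(inputs[rank])
        for peer in range(n):
            remote = inputs[peer]  # "remote read" by rank
            acc = [a + b for a, b in zip(acc, remote)]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6]]
    print(allreduce_write(data))
    print(allreduce_read(data))
```

Both functions produce the same reduced result on every rank; what changes is which side initiates each transfer.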