4 Commits

Author SHA1 Message Date
isaki001 d2b5ba80a7 Update MSCCL++ register/deregister (#1523)
* erase handle key from mscclpp communicator during deregistration

* remove check on buffer size being a multiple of 32 from registration/deregistration routines since these checks are applied during enqueue

* add check for greater than zero buffer size in mscclpp registration

[ROCm/rccl commit: 19105206f6]
2025-02-04 09:09:56 -06:00
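
The three changes listed for d2b5ba80a7 can be sketched as follows. This is a minimal illustration only, not the RCCL or MSCCL++ code: `BufferRegistry`, `RegResult`, and every method name are hypothetical, and using the buffer pointer as the handle is an assumption made to keep the example short. It shows registration rejecting a zero-sized buffer, registration accepting a size that is not a multiple of 32 (that check is assumed to run later, at enqueue), and deregistration erasing the handle key.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>

// Hypothetical result codes; not the ncclResult_t values used by RCCL.
enum class RegResult { Success, InvalidArgument, NotRegistered };

// Hypothetical stand-in for the per-communicator handle bookkeeping.
class BufferRegistry {
 public:
  // Registers a buffer and returns an opaque handle through handle_out.
  // Only a size greater than zero is required here; alignment checks
  // (such as "multiple of 32") are assumed to be applied at enqueue time.
  RegResult Register(void* buff, std::size_t size, void** handle_out) {
    if (buff == nullptr || handle_out == nullptr || size == 0) {
      return RegResult::InvalidArgument;
    }
    handles_[buff] = size;  // assumption: the buffer pointer doubles as the handle
    *handle_out = buff;
    return RegResult::Success;
  }

  // Deregisters a handle and erases its key, so that a stale handle is
  // reported as NotRegistered on any later call.
  RegResult Deregister(void* handle) {
    auto it = handles_.find(handle);
    if (it == handles_.end()) {
      return RegResult::NotRegistered;
    }
    handles_.erase(it);
    assert(handles_.count(handle) == 0);
    return RegResult::Success;
  }

  bool IsRegistered(void* handle) const {
    return handles_.count(handle) != 0;
  }

 private:
  std::unordered_map<void*, std::size_t> handles_;
};
```

In this sketch, registering a 48-byte buffer succeeds, and deregistering the same handle twice returns `NotRegistered` the second time because the key has already been erased.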
isaki001 835c708a92 fix scatter_perf crash (#1493)
* fix scatter_perf crash

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update src/misc/msccl/msccl_lifecycle.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* More buffsRegisteredNonGraphMode spelling fixes.

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: ff130cce7a]
2025-01-21 09:24:32 -06:00
isaki001 25150b1f20 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package

[ROCm/rccl commit: d89432e8c8]
2025-01-20 08:06:43 -06:00
Nusrat Islam cf907dbf61 Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during deRegister call, don't fall back to rccl deRegister which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: e9b6bbca8a]
2025-01-14 08:20:24 -06:00