86 Commits

Author SHA1 Message Date
systems-assistant[bot] f05be9efb3 AICOMRCCL-82 AICOMRCCL-85 Switched MSCCLPP.cmake to use targets (#2774)
* Initial refactoring work, including using build targets, and settable MSCCLPP_ROOT, MSCCLPP_SOURCE, MSCCLPP_APPLY_PATCHES.

* Another large refactor of MSCCLPP cmake to make all portions targets with appropriate dependencies. This should include all paths to the final target: starting with a full mscclpp install, starting with custom mscclpp and/or json source code, or from submodules + optional patches.

* Update whitespace Findmscclpp_nccl_static.cmake

---------

Co-authored-by: Corey Derochie <corey.derochie@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2026-01-26 23:12:16 -07:00
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
Nusrat Islam eb347a0dd3 GDA support for alltoall via rocshmem integration (#2099)
* ROCSHMEM linking/building to match MSCCL++ style

* add rocSHMEM as a submodule

* Move rocSHMEM submodule to ext-src/rocSHMEM

* Adding submodule support proper, as well as a patch for rocshmem

* Cleaning up INCLUDE_DIR vs INCLUDE_DIRS mixup

* updating patch file

* Pointing rocshmem submodule to edgars fixup patch

* Adding IBVERBS link to the submodule build

* More IBVERBS patching

* pin rocshmem submodule to b534423

* Adding IPC support in rocSHMEM build

* updating rocshmem submodule to resolve CQ errors

* Updating submodule to include recent a2a optimizations

* invoke rocshmem alltoall from rccl

* Updating submodule to CQ error number hang

* Updating submodule to include a2a improvements and bug fixes

* Updating submodule to point to Yiltan's fork and doorbell ring removal commit

* Updating hash to correspond with submodule change

* Updating to no-ctx wg call and updating submodule

* copy-in/copy-out using multiple CUs

* Updating rocSHMEM submodule to include doorbell improvs

* updating gitmodule to point to upstream

* code cleanup and adjust threshold

* guard rocshmem a2a invocation

* Only build with rocshmem when specified

* code cleanup

* address review comments

* Removing debugging failure case

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* whitespace fix

* Adding rocshmem compile guard

* Removing unnecessary comment

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* remove commented lines

* address review comments

* cleanup

---------

Signed-off-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k12-27.cs-aus.dcgpu>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-03.cs-aus.dcgpu>

[ROCm/rccl commit: 27648b0900]
2026-01-09 14:04:54 -06:00
Karthikeyan Arumugam bb599d8ed7 Add support for AMD AINIC within RCCL default internal network plugin. (#2078)
* Added support for AMD ROCm net-ib alongside vanilla net-ib, with auto-generation to detect conflicts early during NCCL sync and enable future customizations.
* Integrated AMD AINIC support in RCCL for out-of-the-box usage, leveraging performance improvements by default, channel pinning for optimal pipeline performance, and extended support for 32B in-line CTS messages.
* Implemented internal derivation of AINIC-specific flags when RCCL AINIC environment parameter is set, and checks before initializing AINIC net-ib methods.
* Included snapshot of auto-generated ROCm net-ib file (src/transport/net_ib_rocm.cc) for reference.
* Fixed typos in RCCL param API (RCCL_AINIC_ROCE) and dlclose.
* Updated plugin loading logic:
* Load internal ROCmIB plugin only when NCCL_NET_PLUGIN is not set.
* Load default internal net-ib only when not AINIC and no external plugin env is set.

[ROCm/rccl commit: 9f4651f20f]
2025-12-23 10:33:10 -05:00
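The plugin loading rules listed in this commit can be read as a simple precedence order. The sketch below is one possible reading in Python, not the RCCL code: the function name, the `ainic_enabled` parameter, and the returned labels are hypothetical, and only `NCCL_NET_PLUGIN` comes from the commit message.

```python
def choose_net_backend(env, ainic_enabled):
    """Pick a network backend label from the environment (illustrative only)."""
    if env.get("NCCL_NET_PLUGIN"):
        # An explicitly requested external plugin takes precedence, so neither
        # internal backend is loaded in this case.
        return "external:" + env["NCCL_NET_PLUGIN"]
    if ainic_enabled:
        # Assumption: the internal ROCm IB path is the one used for AINIC.
        return "internal-rocm-ib"
    # Not AINIC and no external plugin requested: default internal net-ib.
    return "internal-net-ib"


print(choose_net_backend({}, ainic_enabled=False))
print(choose_net_backend({}, ainic_enabled=True))
print(choose_net_backend({"NCCL_NET_PLUGIN": "example"}, ainic_enabled=True))
```

Passing the environment as a plain dictionary keeps the decision easy to test without touching the real process environment.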
Arm Patinyasakdikul b14fec8dbc Fix git version fetching logic. (#1981)
[ROCm/rccl commit: 9806f5e9dd]
2025-10-17 09:17:49 -05:00
Arm Patinyasakdikul 99699b10a2 Fix issue where staging/mainline build commit hash doesn't match the actual RCCL commit. (#1910)
[ROCm/rccl commit: f21fbdfc18]
2025-09-11 16:13:21 -05:00
Mustafa Abduljabbar f37f290134 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

[ROCm/rccl commit: 277747c199]
2025-08-26 15:03:54 -04:00
mberenjk c76a4492f1 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>


[ROCm/rccl commit: c61152baa4]
2025-08-14 15:58:54 -05:00
Rahul Vaidya d65eb0b021 Fix RHEL10 packaging for rcclras and rccl-UnitTests (#1831)
Signed-off-by: ravaidya <ravaidya@amd.com>

[ROCm/rccl commit: 0adc5edc74]
2025-07-31 11:00:49 -05:00
Nilesh M Negi 41c985462c [BUILD] Use fmt-header instead of libfmt (#1791)
[ROCm/rccl commit: 6b4ad0fd74]
2025-07-10 17:19:53 -05:00
Atul Kulkarni 16aadd67cf Enable Google Test's GMOCK feature (#1773)
[ROCm/rccl commit: a28d5cb986]
2025-07-09 17:25:44 -05:00
Marius Brehler 5d753cb871 Set GTEST_BOTH_LIBRARIES appropriately (#1669)
If `find_package()` succeeds in finding GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set and thus
`rccl-UnitTests` fails when trying to link unknown symbols.

[ROCm/rccl commit: dac0e528a0]
2025-07-05 20:38:31 -05:00
Nilesh M Negi 23618f9e65 [MSCCLPP] Disable format checks in MSCCLPP by default (#1781)
[ROCm/rccl commit: 9e99c18f6e]
2025-07-02 09:11:42 -05:00
Nilesh M Negi fd0d9ac44c [BUILD] Fix packaging for RAS (#1784)
[ROCm/rccl commit: 3e51c41dcb]
2025-07-01 16:37:14 -05:00
Nilesh M Negi d88d033aba [RAS] Add support for RAS client (#1748)
Enable RAS client binary `rcclras`

[ROCm/rccl commit: 8d3a5542fb]
2025-06-29 18:53:16 -05:00
Dingming Wu d34a38ccfc Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.

[ROCm/rccl commit: 020dcf0a7c]
2025-06-25 23:01:34 -05:00
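The bounded-history idea described for proxyTrace can be illustrated with a small Python sketch. This is not the RCCL data structure: the class name, record fields, and capacity are hypothetical, and a real implementation would dump from a crash or hang handler.

```python
from collections import deque


class ProxyTraceSketch:
    """Hypothetical bounded trace of send/recv ops (not the RCCL structure)."""

    def __init__(self, capacity=4):
        # A deque with maxlen drops the oldest entry once capacity is reached,
        # so memory use stays fixed no matter how many ops are recorded.
        self.ops = deque(maxlen=capacity)

    def record(self, op_id, kind, status):
        self.ops.append({"op": op_id, "kind": kind, "status": status})

    def dump(self):
        # Return one line per retained op, oldest first.
        return [
            f"op={e['op']} kind={e['kind']} status={e['status']}"
            for e in self.ops
        ]


trace = ProxyTraceSketch(capacity=2)
trace.record(1, "send", "posted")
trace.record(2, "recv", "posted")
trace.record(3, "send", "done")
print(trace.dump())
```

With a capacity of 2, the third record evicts the first, so the dump only covers the two most recent ops.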
Grant Pinkert 1d68693a2e Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.

[ROCm/rccl commit: 2482d1475f]
2025-06-22 05:54:44 -05:00
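The failure mode in this commit (a child process that never returns, blocking the caller indefinitely) and the timeout remedy can be sketched in Python. This is an analogy, not the CMake change itself: `run_with_timeout` is a hypothetical helper, and the hanging command is a stand-in for a tool such as `roc-obj-ls`. CMake's `execute_process` offers a `TIMEOUT` option for the same purpose.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, message) instead of blocking forever."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left
        # running, and the caller gets an explicit error to report.
        return False, f"timed out after {timeout_s}s: {' '.join(cmd)}"
    if result.returncode != 0:
        return False, f"exit code {result.returncode}: {result.stderr.strip()}"
    return True, result.stdout


# Stand-in for a tool that never returns (hypothetical).
hanging_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
ok, message = run_with_timeout(hanging_cmd, timeout_s=0.5)
print(ok, message)
```

Returning a message alongside the status mirrors the commit's second goal: a failed step should say what went wrong rather than stall silently.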
Nilesh M Negi 7c422271a8 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 92a5d225d9]
2025-06-17 01:29:55 -05:00
isaki001 89bc9131aa fix improper patch reverse order (#1696)
[ROCm/rccl commit: 66ef428714]
2025-05-19 12:29:21 -05:00
isaki001 de76d7f649 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase

[ROCm/rccl commit: 8145c4f3b8]
2025-04-30 16:42:28 -05:00
BertanDogancay d045d0ca23 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a6bf9bfc9e]
2025-04-23 20:47:43 -07:00
Nusrat Islam 691e98940c Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments

[ROCm/rccl commit: f20c33effd]
2025-04-18 08:54:32 -05:00
Nusrat Islam f599690ce3 ext-src: fix mscclpp correctness issue (#1615)
* ext-src: fix mscclpp correctness issue

* ext-src: remove white-space warnings

[ROCm/rccl commit: 4a29bba3c6]
2025-04-01 15:02:16 -05:00
Nilesh M Negi a7ec191754 [TEST] Switch to googletest release 1.12.0 (#1621)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 0e2c461c6c]
2025-03-28 12:39:42 -05:00
Wenkai Du e86b217182 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off by FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock

[ROCm/rccl commit: 90ad586d94]
2025-03-20 16:11:43 -07:00
corey-derochie-amd e95578ef4c removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
Nilesh M Negi 751370bb70 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 063c6cfc11]
2025-03-01 22:28:42 -06:00
Nusrat Islam f70f406463 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging

[ROCm/rccl commit: 23c0b7bd84]
2025-02-28 17:16:07 -06:00
Pedram Alizadeh acf5822a6c enable building rccl for gfx950 (#1571)
[ROCm/rccl commit: f268553ee4]
2025-02-25 16:13:48 -05:00
Nusrat Islam 3fbafef948 ext-src: tuning for allreduce8 kernel (#1560)
This PR tunes the number of threadblocks used for larger (>1MB)
message sizes.

[ROCm/rccl commit: fdf75fd2c1]
2025-02-21 19:34:38 -06:00
Nusrat Islam 4a5ab6cf75 ext-src: fix mscclpp allreduce for non-multiple of 128 message sizes (#1556)
[ROCm/rccl commit: 83f8b191ff]
2025-02-21 11:58:10 -06:00
BertanDogancay 1b000665df Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 36343be84f]
2025-01-23 12:08:46 -06:00
corey-derochie-amd e3a29f5eab Changed working dir for the submodule command and extended it to the json repo (#1495)
This allows it to work when the sub repos don't exist.

[ROCm/rccl commit: b6377e0b8c]
2025-01-23 09:34:25 -07:00
isaki001 25150b1f20 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package

[ROCm/rccl commit: d89432e8c8]
2025-01-20 08:06:43 -06:00
Nusrat Islam cf907dbf61 Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during the deRegister call, do not fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: e9b6bbca8a]
2025-01-14 08:20:24 -06:00
Nilesh M Negi b9e7e3024b [MSCCLPP] IBVerbs: Check if IBV_ACCESS_RELAXED_ORDERING exists (#1483)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: f0eae84663]
2025-01-08 08:38:51 -06:00
akolliasAMD c65d4ab18f changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440)
[ROCm/rccl commit: 45c1c1a781]
2024-12-12 12:09:30 -07:00
corey-derochie-amd ad1384bea1 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.

[ROCm/rccl commit: 1c45962273]
2024-11-04 09:46:42 -07:00
Nilesh M Negi 912e9f4b61 [BUILD] Simplify CMake args for building MSCCLPP (#1363)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 364a6c2130]
2024-10-09 23:52:04 -05:00
Nusrat Islam f61053dcba Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

To enable mscclpp in cpx mode, a patch cpx.patch needs to be applied.
This patch can be removed after building is done. This helps with the
build process the following time.

* Use read-based mscclpp allreduce from rccl

MSCCLPP by default uses remote write in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote read. To use it, set the environment variable
MSCCLPP_READ_ALLRED=1.

[ROCm/rccl commit: 4d68751ce1]
2024-10-08 14:42:12 -05:00
Bertan Dogancay 974c13cd62 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash

[ROCm/rccl commit: 2dd10c8f17]
2024-10-03 10:21:19 -04:00
Nusrat Islam 1f7945286c Enable MSCCLPP use in CPX mode (#1355)
This PR enables the use of MSCCLPP in CPX mode for 8 GPUs.

[ROCm/rccl commit: d13f9c44f5]
2024-10-02 11:52:04 -05:00
corey-derochie-amd c8f4dedfd1 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330)
[ROCm/rccl commit: b3b0ffdbf3]
2024-09-11 16:54:26 -06:00
corey-derochie-amd 9ffd893c5a Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Restricted MSCCL++ AllGather to non-zero message sizes that are multiples of 32 bytes.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

[ROCm/rccl commit: 736a705875]
2024-09-11 09:55:16 -06:00
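Two of the rules described in this commit lend themselves to a short illustration: the AllGather message-size restriction and the default for `RCCL_MSCCLPP_ENABLE`. The Python sketch below is a minimal reading of the commit message, not RCCL code. The function names and the `built_with_mscclpp` parameter (standing in for the `ENABLE_MSCCLPP` build definition) are hypothetical, and treating the "set to 1 without build support" case as disabled is an assumption; the commit only says a warning is logged.

```python
import os


def allgather_size_ok(nbytes):
    # Restriction from the commit message: non-zero and a multiple of 32 bytes.
    return nbytes > 0 and nbytes % 32 == 0


def mscclpp_runtime_enabled(built_with_mscclpp, env=None):
    """Return (enabled, warning) following the described default rule."""
    env = os.environ if env is None else env
    # Default is 1 when the build defines ENABLE_MSCCLPP, otherwise 0.
    default = "1" if built_with_mscclpp else "0"
    requested = env.get("RCCL_MSCCLPP_ENABLE", default) == "1"
    if requested and not built_with_mscclpp:
        # Assumption: without build support the feature stays off.
        return False, "RCCL_MSCCLPP_ENABLE=1 but ENABLE_MSCCLPP is not defined"
    return requested, None


print(allgather_size_ok(64), allgather_size_ok(48), allgather_size_ok(0))
print(mscclpp_runtime_enabled(True, {}))
print(mscclpp_runtime_enabled(False, {"RCCL_MSCCLPP_ENABLE": "1"}))
```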
corey-derochie-amd f2b2372056 Only initialize MSCCL++ when runtime-enabled. (#1266)
[ROCm/rccl commit: b31b4082dd]
2024-07-22 00:41:31 -06:00
Wenkai Du 54e4899607 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>

[ROCm/rccl commit: 89349f2ce4]
2024-07-19 08:15:59 -07:00
corey-derochie-amd b8542c2477 Integrated RCCL with MSCCL++ for small message sizes (#1231)
[ROCm/rccl commit: 6dc47eecd7]
2024-07-12 15:32:58 -06:00
Wenkai Du 470302a776 Allow multiple parameters during selective function generation (#1201)
* Allow multiple parameters during selective function generation

* Remove debug print

* Add examples into Generator.cmake

[ROCm/rccl commit: 9fcd7b55e1]
2024-06-06 07:07:24 -07:00
Bertan Dogancay 8ddb74e3b1 Add unique files to source list (#1144)
[ROCm/rccl commit: 3caad91f32]
2024-04-15 09:46:53 -06:00
mberenjk da835cff9c replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>

[ROCm/rccl commit: 428837ffe4]
2024-04-11 11:30:37 -05:00