Commit graph

85 Commits

Author SHA1 Message Date
Marzieh Berenjkoub 858b4e76eb Merge remote-tracking branch 'nccl/master' into develop 2026-01-20 13:04:02 -06:00
Nusrat Islam 27648b0900 GDA support for alltoall via rocshmem integration (#2099)
* ROCSHMEM linking/building to match MSCCL++ style

* add rocSHMEM as a submodule

* Move rocSHMEM submodule to ext-src/rocSHMEM

* Adding submodule support proper, as well as a patch for rocshmem

* Cleaning up INCLUDE_DIR vs INCLUDE_DIRS mixup

* updating patch file

* Pointing rocshmem submodule to edgars fixup patch

* Adding IBVERBS link to the submodule build

* More IBVERBS patching

* pin rocshmem submodule to b534423

* Adding IPC support in rocSHMEM build

* updating rocshmem submodule to resolve CQ errors

* Updating submodule to include recent a2a optimizations

* invoke rocshmem alltoall from rccl

* Updating submodule to CQ error number hang

* Updating submodule to include a2a improvements and bug fixes

* Updating submodule to point to Yiltan's fork and doorbell ring removal commit

* Updating hash to correspond with submodule change

* Updating to no-ctx wg call and updating submodule

* copy-in/copy-out using multiple CUs

* Updating rocSHMEM submodule to include doorbell improvs

* updating gitmodule to point to upstream

* code cleanup and adjust threshold

* guard rocshmem a2a invocation

* Only build with rocshmem when specified

* code cleanup

* address review comments

* Removing debugging failure case

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* whitespace fix

* Adding rocshmem compile guard

* Removing unnecessary comment

Signed-off-by: Thomas Huber <thomas.huber@amd.com>

* remove commented lines

* address review comments

* cleanup

---------

Signed-off-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Thomas Huber <thomas.huber@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k12-27.cs-aus.dcgpu>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>
Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-03.cs-aus.dcgpu>
2026-01-09 14:04:54 -06:00
Karthikeyan Arumugam 9f4651f20f Add support for AMD AINIC within RCCL default internal network plugin. (#2078)
* Added support for AMD ROCm net-ib alongside vanilla net-ib, with auto-generation to detect conflicts early during NCCL sync and enable future customizations.
* Integrated AMD AINIC support in RCCL for out-of-the-box usage, leveraging performance improvements by default, channel pinning for optimal pipeline performance, and extended support for 32B in-line CTS messages.
* Implemented internal derivation of AINIC-specific flags when RCCL AINIC environment parameter is set, and checks before initializing AINIC net-ib methods.
* Included snapshot of auto-generated ROCm net-ib file (src/transport/net_ib_rocm.cc) for reference.
* Fixed typos in RCCL param API (RCCL_AINIC_ROCE) and dlclose.
* Updated plugin loading logic:
* Load internal ROCmIB plugin only when NCCL_NET_PLUGIN is not set.
* Load default internal net-ib only when not AINIC and no external plugin env is set.
2025-12-23 10:33:10 -05:00
Arm Patinyasakdikul 9806f5e9dd Fix git version fetching logic. (#1981) 2025-10-17 09:17:49 -05:00
Arm Patinyasakdikul f21fbdfc18 Fix issue where staging/mainline build commit hash doesn't match the actual RCCL commit. (#1910) 2025-09-11 16:13:21 -05:00
Mustafa Abduljabbar 277747c199 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-26 15:03:54 -04:00
mberenjk c61152baa4 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-08-14 15:58:54 -05:00
Rahul Vaidya 0adc5edc74 Fix RHEL10 packaging for rcclras and rccl-UnitTests (#1831)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-07-31 11:00:49 -05:00
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Atul Kulkarni a28d5cb986 Enable Google Test's GMOCK feature (#1773) 2025-07-09 17:25:44 -05:00
Marius Brehler dac0e528a0 Set GTEST_BOTH_LIBRARIES appropriately (#1669)
If `find_package()` succeeds in finding GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set, and thus
`rccl-UnitTests` fails when trying to link unknown symbols.
2025-07-05 20:38:31 -05:00
Nilesh M Negi 9e99c18f6e [MSCCLPP] Disable format checks in MSCCLPP by default (#1781) 2025-07-02 09:11:42 -05:00
Nilesh M Negi 3e51c41dcb [BUILD] Fix packaging for RAS (#1784) 2025-07-01 16:37:14 -05:00
Nilesh M Negi 8d3a5542fb [RAS] Add support for RAS client (#1748)
Enable RAS client binary `rcclras`
2025-06-29 18:53:16 -05:00
Dingming Wu 020dcf0a7c Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
2025-06-25 23:01:34 -05:00
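The proxyTrace description above (a fixed number of active ops kept in host memory, dumped on a crash or hang) can be pictured as a small ring buffer. The following is only a sketch under assumptions: `OpTrace`, `TracedOp`, `OpState`, and the dump format are hypothetical names invented for illustration and are not taken from the RCCL sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical op states; the real proxy tracks its own set of events.
enum class OpState { Posted, InFlight, Done };

struct TracedOp {
  int opId = -1;  // -1 marks an unused slot
  OpState state = OpState::Posted;
};

// Keeps at most Capacity ops in host memory; older ops are overwritten.
template <std::size_t Capacity>
class OpTrace {
 public:
  // Records a new op, overwriting the oldest slot once the buffer is full.
  void record(int opId) {
    assert(opId >= 0);
    slots_[next_ % Capacity] = TracedOp{opId, OpState::Posted};
    ++next_;
  }

  // Updates the state of an op if it is still retained; returns false otherwise.
  bool update(int opId, OpState state) {
    for (auto& op : slots_) {
      if (op.opId == opId) {
        op.state = state;
        return true;
      }
    }
    return false;
  }

  // Renders one line per retained op, oldest first.
  std::string dump() const {
    std::ostringstream out;
    const std::size_t count = next_ < Capacity ? next_ : Capacity;
    for (std::size_t i = next_ - count; i < next_; ++i) {
      const TracedOp& op = slots_[i % Capacity];
      out << "op " << op.opId << ' ' << name(op.state) << '\n';
    }
    return out.str();
  }

 private:
  static const char* name(OpState s) {
    switch (s) {
      case OpState::Posted: return "posted";
      case OpState::InFlight: return "in-flight";
      case OpState::Done: return "done";
    }
    return "unknown";
  }

  std::array<TracedOp, Capacity> slots_{};
  std::size_t next_ = 0;  // total number of ops recorded so far
};
```

In this sketch a crash or hang handler would call `dump()` and write the result to a log. The fixed capacity bounds host memory use at the cost of losing the oldest ops.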
Grant Pinkert 2482d1475f Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.
2025-06-22 05:54:44 -05:00
Nilesh M Negi 92a5d225d9 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
isaki001 66ef428714 fix improper patch reverse order (#1696) 2025-05-19 12:29:21 -05:00
isaki001 8145c4f3b8 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase
2025-04-30 16:42:28 -05:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Nusrat Islam f20c33effd Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments
2025-04-18 08:54:32 -05:00
Nusrat Islam 4a29bba3c6 ext-src: fix mscclpp correctness issue (#1615)
* ext-src: fix mscclpp correctness issue

* ext-src: remove white-space warnings
2025-04-01 15:02:16 -05:00
Nilesh M Negi 0e2c461c6c [TEST] Switch to googletest release 1.12.0 (#1621)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-28 12:39:42 -05:00
Wenkai Du 90ad586d94 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off by FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock
2025-03-20 16:11:43 -07:00
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Nilesh M Negi 063c6cfc11 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-01 22:28:42 -06:00
Nusrat Islam 23c0b7bd84 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging
2025-02-28 17:16:07 -06:00
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
Nusrat Islam fdf75fd2c1 ext-src: tuning for allreduce8 kernel (#1560)
This PR tunes the number of threadblocks used for larger (>1MB)
message sizes.
2025-02-21 19:34:38 -06:00
Nusrat Islam 83f8b191ff ext-src: fix mscclpp allreduce for non-multiple of 128 message sizes (#1556) 2025-02-21 11:58:10 -06:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
corey-derochie-amd b6377e0b8c Changed working dir for the submodule command and extended it to the json repo (#1495)
This allows it to work when the sub repos don't exist.
2025-01-23 09:34:25 -07:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
Nusrat Islam e9b6bbca8a Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* in case of no mscclpp registration, during deRegister call, don't fall back to rccl deRegister which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-14 08:20:24 -06:00
Nilesh M Negi f0eae84663 [MSCCLPP] IBVerbs: Check if IBV_ACCESS_RELAXED_ORDERING exists (#1483)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-01-08 08:38:51 -06:00
akolliasAMD 45c1c1a781 changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440) 2024-12-12 12:09:30 -07:00
corey-derochie-amd 1c45962273 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.
2024-11-04 09:46:42 -07:00
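The two source-level fixes named in the commit above (variable-length arrays replaced by `std::vector`, and function-local `#define` replaced by `constexpr int`) might look roughly like this. The functions `sumFirstN` and `alignUp` are hypothetical examples written for illustration, not code from the RCCL tree.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: sums 1..n using a runtime-sized scratch buffer.
int sumFirstN(int n) {
  assert(n >= 0);
  // Before: `int scratch[n];` is a variable-length array, which is a
  // compiler extension in C++ and produces a warning.
  std::vector<int> scratch(static_cast<std::size_t>(n));
  std::iota(scratch.begin(), scratch.end(), 1);  // fills 1, 2, ..., n
  return std::accumulate(scratch.begin(), scratch.end(), 0);
}

// Hypothetical helper: rounds a byte count up to a fixed alignment.
int alignUp(int bytes) {
  // Before: `#define ALIGNMENT 16` inside the function body. A macro ignores
  // scope, so it stays defined after the closing brace and can clash with
  // another definition of the same name.
  constexpr int kAlignment = 16;
  return (bytes + kAlignment - 1) / kAlignment * kAlignment;
}
```

The `constexpr` constant is still a compile-time value, so the generated code should be equivalent. The difference is that the name is confined to the function.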
Nilesh M Negi 364a6c2130 [BUILD] Simplify CMake args for building MSCCLPP (#1363)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-09 23:52:04 -05:00
Nusrat Islam 4d68751ce1 Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

To enable mscclpp in cpx mode, a patch cpx.patch needs to be applied.
This patch can be removed after building is done. This helps with the
build process the following time.

* Use read-based mscclpp allreduce from rccl

MSCCLPP by default uses remote write in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote read. It needs the users to use an environment variable
MSCCLPP_READ_ALLRED=1.
2024-10-08 14:42:12 -05:00
Bertan Dogancay 2dd10c8f17 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash
2024-10-03 10:21:19 -04:00
Nusrat Islam d13f9c44f5 Enable MSCCLPP use in CPX mode (#1355)
This PR enables the use of MSCCLPP in CPX mode for 8 GPUs.
2024-10-02 11:52:04 -05:00
corey-derochie-amd b3b0ffdbf3 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330) 2024-09-11 16:54:26 -06:00
corey-derochie-amd 736a705875 Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2024-09-11 09:55:16 -06:00
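Two rules from the commit above lend themselves to a short sketch: the default for `RCCL_MSCCLPP_ENABLE` (1 when `ENABLE_MSCCLPP` is defined, 0 otherwise, with a warning when the variable is set to 1 in a build without it) and the AllGather restriction to non-zero message sizes that are multiples of 32 bytes. The function names, the `EnableDecision` struct, and the way the environment value is parsed are assumptions made for illustration. The sketch also assumes that a request to enable MSCCL++ in a build without it leaves the feature disabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical result type: whether MSCCL++ is used and whether to log a warning.
struct EnableDecision {
  bool enabled;
  bool warn;
};

// envValue: raw value of RCCL_MSCCLPP_ENABLE, or nullptr when it is unset.
// builtWithMscclpp: true when the library was compiled with ENABLE_MSCCLPP.
EnableDecision resolveMscclppEnable(const char* envValue, bool builtWithMscclpp) {
  // Unset variable: fall back to the build-time default.
  const bool requested =
      envValue != nullptr ? std::strcmp(envValue, "1") == 0 : builtWithMscclpp;
  // Warn only when the user explicitly asks for a feature the build lacks.
  const bool warn = envValue != nullptr && requested && !builtWithMscclpp;
  return EnableDecision{requested && builtWithMscclpp, warn};
}

// AllGather restriction from the commit: non-zero multiple of 32 bytes.
bool allGatherSizeEligible(std::size_t bytes) {
  return bytes != 0 && bytes % 32 == 0;
}
```

A caller could pass `std::getenv("RCCL_MSCCLPP_ENABLE")` as `envValue`. Keeping the lookup outside the function makes the decision logic easy to test.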
corey-derochie-amd b31b4082dd Only initialize MSCCL++ when runtime-enabled. (#1266) 2024-07-22 00:41:31 -06:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Wenkai Du 9fcd7b55e1 Allow multiple parameters during selective function generation (#1201)
* Allow multiple parameters during selective function generation

* Remove debug print

* Add examples into Generator.cmake
2024-06-06 07:07:24 -07:00
Bertan Dogancay 3caad91f32 Add unique files to source list (#1144) 2024-04-15 09:46:53 -06:00
mberenjk 428837ffe4 replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-04-11 11:30:37 -05:00
corey-derochie-amd 503a472a25 Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. (#1125) 2024-03-25 16:29:13 -06:00