Commit graph

81 commits

Author SHA1 Message Date
Arm Patinyasakdikul f21fbdfc18 Fix issue where staging/mainline build commit hash doesn't match the actual RCCL commit. (#1910) 2025-09-11 16:13:21 -05:00
Mustafa Abduljabbar 277747c199 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-26 15:03:54 -04:00
mberenjk c61152baa4 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-08-14 15:58:54 -05:00
Rahul Vaidya 0adc5edc74 Fix RHEL10 packaging for rcclras and rccl-UnitTests (#1831)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-07-31 11:00:49 -05:00
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Atul Kulkarni a28d5cb986 Enable Google Test's GMOCK feature (#1773) 2025-07-09 17:25:44 -05:00
Marius Brehler dac0e528a0 Set GTEST_BOTH_LIBRARIES appropriately (#1669)
If `find_package()` succeeds in finding GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set, and thus
`rccl-UnitTests` fails while trying to link unknown symbols.
2025-07-05 20:38:31 -05:00
Nilesh M Negi 9e99c18f6e [MSCCLPP] Disable format checks in MSCCLPP by default (#1781) 2025-07-02 09:11:42 -05:00
Nilesh M Negi 3e51c41dcb [BUILD] Fix packaging for RAS (#1784) 2025-07-01 16:37:14 -05:00
Nilesh M Negi 8d3a5542fb [RAS] Add support for RAS client (#1748)
Enable RAS client binary `rcclras`
2025-06-29 18:53:16 -05:00
Dingming Wu 020dcf0a7c Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
2025-06-25 23:01:34 -05:00
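The idea in this commit message, a fixed-size record of active ops whose status is dumped on a crash or hang, can be sketched roughly as below. This is a minimal Python illustration and not the RCCL implementation: the names `OpRecord` and `ProxyTraceSketch`, the record fields, and the policy of reusing the oldest slot when the table is full are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class OpRecord:
    """Hypothetical per-op status record (field names are illustrative)."""
    op_id: int
    kind: str          # "send" or "recv"
    peer: int
    status: str = "posted"


class ProxyTraceSketch:
    """Fixed-capacity table of recent ops; the oldest slot is reused when full (an assumption)."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_id = 0

    def start(self, kind, peer):
        """Record a new op and return its id."""
        op_id = self.next_id
        self.slots[op_id % self.capacity] = OpRecord(op_id, kind, peer)
        self.next_id += 1
        return op_id

    def update(self, op_id, status):
        """Update an op's status; ops whose slot was already reused are ignored."""
        rec = self.slots[op_id % self.capacity]
        if rec is not None and rec.op_id == op_id:
            rec.status = status

    def dump(self):
        """Return one line per tracked op, oldest first."""
        live = sorted((r for r in self.slots if r is not None), key=lambda r: r.op_id)
        return [f"op {r.op_id} {r.kind} peer={r.peer} status={r.status}" for r in live]
```

In a real tool, something like `dump()` would be called from a crash or hang handler; here it simply returns the lines so they can be inspected.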
Grant Pinkert 2482d1475f Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.
2025-06-22 05:54:44 -05:00
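The failure mode described here, a child process that never returns and stalls the caller, is usually handled by bounding the wait. In CMake, `execute_process` accepts a `TIMEOUT` argument for this; the commit message does not show the exact change, so the Python sketch below only illustrates the same pattern with `subprocess.run`. The helper name `run_with_timeout` and the 30-second default are hypothetical.

```python
import subprocess


def run_with_timeout(cmd, timeout_s=30.0):
    """Run cmd and return (ok, message) instead of blocking forever on a hung tool."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising, so nothing is left running.
        return False, f"{cmd[0]} did not finish within {timeout_s} s"
    except FileNotFoundError:
        return False, f"{cmd[0]} not found"
    if proc.returncode != 0:
        return False, f"{cmd[0]} exited with code {proc.returncode}: {proc.stderr.strip()}"
    return True, proc.stdout
```

Returning an explicit message for each failure path mirrors the "better error reporting" goal: the caller can tell a timeout apart from a missing executable or a non-zero exit.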
Nilesh M Negi 92a5d225d9 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
isaki001 66ef428714 fix improper patch reverse order (#1696) 2025-05-19 12:29:21 -05:00
isaki001 8145c4f3b8 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase
2025-04-30 16:42:28 -05:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Nusrat Islam f20c33effd Fix MSCCLPP accuracy issue for allreduce7 (#1634)
* ext-src: fix a graph-mode bug in allreduce7

* change MSCCLPP threshold to 16MB

* ext-src: change message size threshold for allreduce7

* ext-src: address review comments
2025-04-18 08:54:32 -05:00
Nusrat Islam 4a29bba3c6 ext-src: fix mscclpp correctness issue (#1615)
* ext-src: fix mscclpp correctness issue

* ext-src: remove white-space warnings
2025-04-01 15:02:16 -05:00
Nilesh M Negi 0e2c461c6c [TEST] Switch to googletest release 1.12.0 (#1621)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-28 12:39:42 -05:00
Wenkai Du 90ad586d94 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off with FAULT_INJECTION=OFF in CMake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock
2025-03-20 16:11:43 -07:00
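The commit above perturbs timing by delaying warps by random amounts after a synchronization point. A host-side analogue, using Python threads and a barrier in place of warps and `__syncthreads()`, might look like the sketch below. It is only an illustration of the technique, not the device code; `run_workers` and its parameters are hypothetical, and the `fault_injection` argument stands in for the build-time `FAULT_INJECTION` switch.

```python
import random
import threading
import time


def run_workers(num_workers=4, fault_injection=True, max_delay_s=0.01, seed=0):
    """Release workers from a barrier, optionally delaying each by a random amount."""
    rng = random.Random(seed)
    # Draw the delays up front so a given seed always produces the same skew.
    delays = [rng.uniform(0.0, max_delay_s) if fault_injection else 0.0
              for _ in range(num_workers)]
    barrier = threading.Barrier(num_workers)
    start_order = []
    lock = threading.Lock()

    def worker(idx):
        barrier.wait()            # synchronization point shared by all workers
        time.sleep(delays[idx])   # injected skew; zero when the feature is off
        with lock:
            start_order.append(idx)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return delays, start_order
```

Varying the seed changes which worker proceeds first, which is the point of the exercise: code that silently depends on a particular start order is more likely to fail visibly.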
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Nilesh M Negi 063c6cfc11 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-01 22:28:42 -06:00
Nusrat Islam 23c0b7bd84 misc/msccl: Read graph capture status for every collective call (#1576)
* misc/msccl: read graphCaptureStatus for every collective call

* fix a bug in checking whether UBR is enabled in MSCCLPP

* cmake: Fix patch reversal order

* misc/msccl: add logging
2025-02-28 17:16:07 -06:00
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
Nusrat Islam fdf75fd2c1 ext-src: tuning for allreduce8 kernel (#1560)
This PR tunes the number of threadblocks used for larger (>1MB)
message sizes.
2025-02-21 19:34:38 -06:00
Nusrat Islam 83f8b191ff ext-src: fix mscclpp allreduce for non-multiple of 128 message sizes (#1556) 2025-02-21 11:58:10 -06:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
corey-derochie-amd b6377e0b8c Changed working dir for the submodule command and extended it to the json repo (#1495)
This allows it to work when the sub repos don't exist.
2025-01-23 09:34:25 -07:00
isaki001 d89432e8c8 update mscclpp (#1488)
* update commit hash for mscclpp submodule

* update mscclpp submodule

* remove print messages in cmake

* add back some print messages, update MSCCLPP CMAKE_ARGS

* enable MSCCL++ patches regardless of finding mscclpp_nccl package
2025-01-20 08:06:43 -06:00
Nusrat Islam e9b6bbca8a Add MSCCLPP user buffer registration APIs and integrate with RCCL (#1477)
* ext-src: add MSCCLPP memory registration APIs

* update mem-reg patch with mscclpp helper routine to check if buffer is registered

* RCCL integration of MSCCL++ user-buffer registration APIs

* only include mscclpp_nccl header if ENABLE_MSCCLPP is defined

* ext-src: update mscclpp mem-reg patch

* add helper routine to patch

* check handle before MSCCL++ deregister

* fix typo to replace send buff with recv buff

* if there is no mscclpp registration, then during the deRegister call, don't fall back to rccl deRegister, which will return an error

* Apply suggestions from code review

Whitespace suggestions and reducing diffs to avoid future merge conflicts

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* rename helper functions and change their return type

* set RCCL user-buffer registration to occur if attempting MSCCL++ registration with a buffer in managed memory

---------

Co-authored-by: isaki001 <Ioannis.Sakiotis@amd.com>
Co-authored-by: isaki001 <36317038+isaki001@users.noreply.github.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-01-14 08:20:24 -06:00
Nilesh M Negi f0eae84663 [MSCCLPP] IBVerbs: Check if IBV_ACCESS_RELAXED_ORDERING exists (#1483)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-01-08 08:38:51 -06:00
akolliasAMD 45c1c1a781 changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440) 2024-12-12 12:09:30 -07:00
corey-derochie-amd 1c45962273 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.
2024-11-04 09:46:42 -07:00
Nilesh M Negi 364a6c2130 [BUILD] Simplify CMake args for building MSCCLPP (#1363)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-10-09 23:52:04 -05:00
Nusrat Islam 4d68751ce1 Add a custom allreduce algorithm in MSCCLPP for cpx mode (#1362)
* cmake: remove mscclpp patch after build is complete

Enabling mscclpp in cpx mode requires applying a patch, cpx.patch.
The patch can be removed once the build is done, which helps the
next build.

* Use read-based mscclpp allreduce from rccl

MSCCLPP by default uses remote write in the allreduce kernel for
large (> 1MB) messages. This PR adds an allreduce kernel that uses
remote read. Users enable it by setting the environment variable
MSCCLPP_READ_ALLRED=1.
2024-10-08 14:42:12 -05:00
Bertan Dogancay 2dd10c8f17 [BUILD] Move code generation to python from CMake (#1360)
* Use generate.py for func generation

* Convert AddUnroll.cmake to bash
2024-10-03 10:21:19 -04:00
Nusrat Islam d13f9c44f5 Enable MSCCLPP use in CPX mode (#1355)
This PR enables the use of MSCCLPP in CPX mode for 8 GPUs.
2024-10-02 11:52:04 -05:00
corey-derochie-amd b3b0ffdbf3 Added nlohmann/json:v3.11.3 as a submodule in ext-src and passed its path into the mscclpp build to avoid downloading the package at build time. (#1330) 2024-09-11 16:54:26 -06:00
corey-derochie-amd 736a705875 Re-enabled MSCCL++ (#1325)
* Added restrictions around calling MSCCL++ collectives (#1281)

* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.

* Renamed and refactored some mscclpp types.

* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.

* Disable MSCCL++ when using managed memory buffers as it isn't supported.

* Added datatype and op constraints for MSCCL++ AllReduce.

* Added documentation on MSCCL++ restrictions to the README.

* [BUILD] Support custom CMake flags in MSCCLPP (#1275)

* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] CMake flags to support build-id in MSCCLPP

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* [BUILD] Fix CMake warnings in MSCCLPP build

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>

* Link to libmscclpp_nccl statically (#1282)

* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.

* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.

* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.

* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)

* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)

* Include mscclpp as a git submodule (#1314)

* Added the desired mscclpp commit as a git submodule.

* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.

* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.

* Enabled MSCCL++ feature build.

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2024-09-11 09:55:16 -06:00
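Two rules stated in this commit message lend themselves to a small sketch: the default for `RCCL_MSCCLPP_ENABLE` follows whether `ENABLE_MSCCLPP` was defined at build time, with a warning when the variable is set to 1 on a build without it, and MSCCL++ AllGather is restricted to non-zero message sizes that are multiples of 32 bytes. The Python below is an illustration under assumptions, not RCCL code: the function names are hypothetical, only the string "1" is treated as enabled, the feature is treated as disabled after the warning, and the size is taken as a single byte count.

```python
import os
import warnings


def mscclpp_runtime_enabled(compiled_with_mscclpp, env=None):
    """Resolve the runtime switch; the default follows the build flag."""
    env = os.environ if env is None else env
    default = "1" if compiled_with_mscclpp else "0"
    requested = env.get("RCCL_MSCCLPP_ENABLE", default) == "1"
    if requested and not compiled_with_mscclpp:
        # The commit only mentions a log warning; returning False is an assumption.
        warnings.warn("RCCL_MSCCLPP_ENABLE=1 but the library was built without ENABLE_MSCCLPP")
        return False
    return requested


def allgather_size_ok(nbytes):
    """AllGather restriction: the size must be a non-zero multiple of 32 bytes."""
    return nbytes > 0 and nbytes % 32 == 0
```

Passing `env` explicitly keeps the function easy to exercise without touching the real process environment.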
corey-derochie-amd b31b4082dd Only initialize MSCCL++ when runtime-enabled. (#1266) 2024-07-22 00:41:31 -06:00
Wenkai Du 89349f2ce4 Template unroll for RCCL kernels (#1250)
* Template unroll for RCCL kernels

* Adding unroll template arg during CMake hipification

* Reduce linking parallel jobs to avoid OOM in CI

* Workaround issues with UT tests

SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking

* CI: do not use -j 16 when building

* CI: use -j 8 when building

* Only reduce parallel linking job for CI extended

* Restore original jenkins command. Change parallel linking jobs in cmake

* Disable MSCCLPP

---------

Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
2024-07-19 08:15:59 -07:00
corey-derochie-amd 6dc47eecd7 Integrated RCCL with MSCCL++ for small message sizes (#1231) 2024-07-12 15:32:58 -06:00
Wenkai Du 9fcd7b55e1 Allow multiple parameters during selective function generation (#1201)
* Allow multiple parameters during selective function generation

* Remove debug print

* Add examples into Generator.cmake
2024-06-06 07:07:24 -07:00
Bertan Dogancay 3caad91f32 Add unique files to source list (#1144) 2024-04-15 09:46:53 -06:00
mberenjk 428837ffe4 replacing rccl_bfloat16 with hip_bfloat16 (#1126)
Co-authored-by: mberenjk <mberenjk@amd.com>
2024-04-11 11:30:37 -05:00
corey-derochie-amd 503a472a25 Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. (#1125) 2024-03-25 16:29:13 -06:00
Andy li 6777e65c1d Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctness issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* update according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo
2024-03-08 15:17:53 -08:00
Bertan Dogancay a279e7f32d Fix bug when configuring for only LL128 (#1097) 2024-03-01 18:09:39 -07:00
BertanDogancay 76f83f95ab Merge remote-tracking branch 'rccl/develop' into 2.19.4 2024-02-15 13:37:14 -08:00
Bertan Dogancay dc2d486ba0 Add stack size UT (#1081)
* Add stack size UT
2024-02-12 17:56:15 -07:00