Commit Graph

225 Commits

Author SHA1 Message Date
Atul Kulkarni 81ec6bff4c Added new unit tests for src/transport/p2p.cc (#1774) 2025-07-25 12:57:57 -05:00
Mustafa Abduljabbar 0ce20e7e07 Add optional bf16 software-triggered pipelining for reduceCopyPacks (#1758)
- Introduced double-buffering to reduce copy overhead and overlap BF16 arithmetic with data prefetching.
- Aimed to improve performance of reduction-based collectives by up to 10%.
- Implemented based on recommendations from Guennadi Riguer (AMD)
- Added --force-reduce-pipeline option to install.sh to activate this optimization for BF16 reductions.
- Feature is disabled by default to prevent regressions with large messages until auto-tuning logic is upstreamed.
---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
2025-07-25 10:57:05 -04:00
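The double-buffering idea named in commit 0ce20e7e07 can be illustrated with a host-side sketch. This is not the reduceCopyPacks kernel: the function name `reduceDoubleBuffered`, the chunk size `kChunk`, and the use of `float` in place of BF16 are all assumptions made for illustration. It only shows the loop shape, where the next chunk is staged into one buffer before the current chunk is reduced from the other. On a GPU the loads can be issued asynchronously, which is where any overlap would actually come from.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical chunk size, chosen only to keep the example small.
constexpr std::size_t kChunk = 4;

// Hypothetical host-side analogue of a double-buffered element-wise reduction.
std::vector<float> reduceDoubleBuffered(const std::vector<float>& a,
                                        const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  const std::size_t chunks = (n + kChunk - 1) / kChunk;
  std::vector<float> out(n);

  // Two staging slots per input: one is consumed while the other is filled.
  std::array<float, kChunk> stageA[2] = {};
  std::array<float, kChunk> stageB[2] = {};
  std::size_t len[2] = {0, 0};

  // Copies one chunk of both inputs into the given slot and returns its length.
  auto load = [&](std::size_t chunk, std::size_t slot) {
    const std::size_t begin = chunk * kChunk;
    const std::size_t count = std::min(kChunk, n - begin);
    std::copy_n(a.data() + begin, count, stageA[slot].begin());
    std::copy_n(b.data() + begin, count, stageB[slot].begin());
    return count;
  };

  if (chunks > 0) len[0] = load(0, 0);  // prologue: stage the first chunk
  for (std::size_t c = 0; c < chunks; ++c) {
    const std::size_t cur = c % 2;
    const std::size_t nxt = 1 - cur;
    if (c + 1 < chunks) len[nxt] = load(c + 1, nxt);  // prefetch the next chunk
    for (std::size_t i = 0; i < len[cur]; ++i) {      // reduce the current chunk
      out[c * kChunk + i] = stageA[cur][i] + stageB[cur][i];
    }
  }
  return out;
}
```

The prologue load plus the "prefetch, then compute" ordering inside the loop is the part that corresponds to software pipelining; everything else is scaffolding.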
Atul Kulkarni 1c3d1b3842 Added new unit tests for src/transport/shm.cc (#1689) 2025-07-25 05:54:42 -05:00
Wenkai Du 9a4213356d Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 9d72be7b2f.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-07-23 09:04:17 -07:00
alex-breslow-amd 11fabf1de1 Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce (#1766)
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
2025-07-22 07:15:15 -07:00
Atul Kulkarni 275fdd43c1 Code coverage improvements (#1665)
* Increased max stack size to 640

* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
2025-07-17 11:20:49 -05:00
isaki001 ef6a54ba34 Fix typo in NPKit build that prevents NET_TEST event (#1807) 2025-07-16 09:08:06 -05:00
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Nilesh M Negi 9e99c18f6e [MSCCLPP] Disable format checks in MSCCLPP by default (#1781) 2025-07-02 09:11:42 -05:00
Wenkai Du 4640ab19b3 Add support for extended fine grained system memory pool (#1770)
* Add support for extended fine-grained system memory pool
* Use hipHostRegisterUncached
* Add "sc0 sc1" flags for LL store on gfx950
* Update after HIP flag is changed to hipExtHostRegisterUncached
2025-07-01 16:38:49 -05:00
Nilesh M Negi 3e51c41dcb [BUILD] Fix packaging for RAS (#1784) 2025-07-01 16:37:14 -05:00
Nilesh M Negi 8d3a5542fb [RAS] Add support for RAS client (#1748)
Enable RAS client binary `rcclras`
2025-06-29 18:53:16 -05:00
Dingming Wu 020dcf0a7c Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
2025-06-25 23:01:34 -05:00
Nilesh M Negi 568777a9bf [BUILD] Move NPKit flags from install.sh to CMakeLists.txt (#1741) 2025-06-23 21:51:49 -05:00
jonatluu 709140204a Remove File reorganization backward compatibility (rccl) (#1753) 2025-06-22 17:18:26 -05:00
BertanDogancay aaf023976a Merge remote-tracking branch 'nccl/master' into develop 2025-06-20 07:54:49 -05:00
Nilesh M Negi 92a5d225d9 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
Richard Barnes 4486d091b8 Enable -Wdeprecated-copy-with-user-provided-copy (#1643) 2025-06-13 08:23:31 -07:00
Nilesh M Negi 9d72be7b2f [DEVICE] Adding ability to choose unroll factor at runtime (#1734)
* Adding runtime unroll factor selection via RCCL_UNROLL_FACTOR
* [BUILD] Add support for user-defined UNROLL for debugging
* Update CHANGELOG.md
* Fix COLLTRACE errors in CI
* Add debug statements for unroll and resolve warnings
* Incorporate UNROLL into ONLY_FUNCS for debugging

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-06-11 00:07:59 -05:00
Atul Kulkarni 682ed36fe6 Added new ENABLE_CODE_COVERAGE option. (#1664)
Modified install.sh script to add this new option
2025-06-10 12:12:36 -05:00
alex-breslow-amd f5b44acb1b Make offload-compress the default (#1704)
* Make offload-compress the default
* Add guard for --offload-compress since it was introduced in ROCm 6.2
* Address some of Nilesh's feedback.
* Reorganize for code cleanliness
* Improve comment
* Compress gpu code at link and compile time
2025-05-22 22:33:25 -05:00
Bertan Dogancay 590ad6acc2 Merge pull request #1662 from BertanDogancay/2.25
[SYNC] 2.25.1-1
2025-05-06 09:39:09 -04:00
isaki001 8145c4f3b8 Add Compilation Flag for enabling/disabling clipping, and tune number of blocks for mscclpp allreduce8 (#1607)
* mscclpp patch apply clip patch and set allreduce8 blocks from 512 to 1024

* add compilation flag for enabling/disabling clipping in mscclpp

* change flag name for consistency, set flag to OFF

* add compilation flag in rccl for enabling clipping in mscclpp

* set 1024 threads for mscclpp allreduce8 only for bfloat16

* fix improper description for ENABLE_MSCCLPP_CLIP flag

* Revert "Merge branch 'clip-patch' of https://github.com/isaki001/rccl into clip-patch"

This reverts commit 6e31857a9db98314b8a748eb024f2c3699ebe2d5, reversing
changes made to 193f4caa8ffa78b4e056893212fd8344aa14e937.

* update clip remove-clip.patch for rebase
2025-04-30 16:42:28 -05:00
BertanDogancay cb6e23ae67 Merge remote-tracking branch 'nccl/master' into develop 2025-04-30 13:31:41 -05:00
Richard Barnes 7961624167 Enable -Wall (#1644) 2025-04-24 10:45:46 -07:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
Pedram Alizadeh e40ff4f84a all_reduce LL/LL128 and Ring/Tree multi-node tuning for MI300 (#1627)
* Enabling LL128 by default on MI300

* Add missing CUDACHECK

* Adjust BW correction factors to fix the Tree->Ring switching point

* Refactor and add ll128 AR logarithmic factor to tuning models

* Move RCCL tuning changes to a separate file 

* Use enum for tunable indexing

* Use explicit indexing in tuning models to avoid mismatch issues

* Place rcclGetSizePerRank in a function

* Remove HIP ifdef for rccl-only call

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-04-10 11:43:54 -04:00
BertanDogancay 0b2062c560 Merge remote-tracking branch 'nccl/master' into develop 2025-03-27 12:53:04 -05:00
gilbertlee-amd 626dc50ab5 Removing the experimental clique kernel files (#1610) 2025-03-20 18:10:01 -06:00
Wenkai Du 90ad586d94 Add fault injection of starting warps with random variations (#1593)
* Add fault injection of starting warps with random variations

This is done by inserting random delays after __syncthreads().
The feature can be turned off by FAULT_INJECTION=OFF in cmake.

* Remove manually introduced bug for demo purpose

* Use only one thread per warp for checking wall clock
2025-03-20 16:11:43 -07:00
Nilesh M Negi 063c6cfc11 [BUILD] Enable multiple GPU targets in MSCCLPP (#1574)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-03-01 22:28:42 -06:00
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
Nilesh M Negi daaa6e155f [BUILD] MSCCLPP: Fix OS check for CentOS (#1568)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <corey.derochie@amd.com>
2025-02-25 13:03:04 -06:00
Sohaib Nadeem 2f1c0bb213 Remove COMPILING_TARGETS from CMakeLists.txt (#1533)
COMPILING_TARGETS is not actually used for --offload-arch option,
instead GPU_TARGETS is being used implicitly when we call
find_package(hip REQUIRED) (See hip-config-amd.cmake).
2025-02-16 21:46:37 -06:00
rahulc1984 92ac136db5 Make rccl version detection robust. (#1517)
* Accept an EXPLICIT_ROCM_VERSION and use that vs inspecting the environment if provided.
* Use CMake's built in file reading support vs execute_process (without error checking) to avoid silent but deadly later failures.
* Properly quote some comparisons to avoid syntax errors if they happen to have an empty string.
* Guard against ROCM_PATH being an empty string, avoiding stray path extensions to root directories, etc.

Co-authored-by: Stella Laurenzo <stellaraccident@gmail.com>
2025-02-11 10:48:22 -07:00
corey-derochie-amd 42ab425037 Switched from cmake_host_system_information feature to a manual parse (#1518)
* Switched cmake_host_system_information feature to a manual parse to remain cmake 3.5 compliant.

* Updating minimum cmake to 3.16 to conform with the rest of ROCm. This change still applies.
2025-02-11 08:51:39 -07:00
Bertan Dogancay 5804603632 [BUILD] Fix unsupported arguments in generator (#1519)
* Fix unsupported arguments in generator

* Get ROCM_PATH as env variable
2025-02-03 14:51:55 -05:00
Jeffrey E Erickson 7af21dd996 modify max memory to use free (#1513) 2025-02-03 09:35:02 -06:00
Bertan Dogancay 35fe9e06f3 [Profiler] Enable ROCTX during build by default (#1506)
* Enable ROCTX during build by default

* Check for roctx support in cmake
2025-01-29 11:29:46 -05:00
corey-derochie-amd bd0f5cccbe Disabled MSCCL++ feature except when building on Ubuntu or CentOS host systems (#1505)
* Added condition for MSCCL++ to only build on an Ubuntu host system.

* Added CentOS to the supported OS list
2025-01-29 08:54:09 -07:00
BertanDogancay 36343be84f Merge remote-tracking branch 'nccl/master' into develop 2025-01-23 12:08:46 -06:00
Nilesh M Negi fd03b5b6a5 [BUILD] Fix ASAN build if GPU targets has xnack+ (#1474)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2024-12-26 12:13:36 -06:00
akolliasAMD 45c1c1a781 changed the CMake option from AMDGPU_TARGETS to GPU_TARGETS (#1440) 2024-12-12 12:09:30 -07:00
Shilei Tian 7386fac64a Improve the handling of CMake deduplication (#1450)
Certain CMake functions deduplicate arguments by default. For example, if we
have two `target_link_options` calls, one with `-Xoffload-linker -opt-A` and
another with `-Xoffload-linker -opt-B`, the final link command would be
`-Xoffload-linker -opt-A -opt-B`, which is not what we want.
2024-12-11 13:48:18 -08:00
Shilei Tian 8e9fcf111a Check -parallel-jobs before use (#1451)
`-parallel-jobs` is not always available, such as upstream LLVM.
2024-12-11 11:40:49 -06:00
akolliasAMD 2284101624 removing unused gfx targets (#1411) 2024-11-06 08:50:08 -07:00
corey-derochie-amd 1c45962273 Hide or fix all build warnings (#1331)
* Changing C-strings to be const.

* Changed variable-length arrays to std::vector to avoid warnings. VLA is a compiler extension.

* Changed `#define` inside functions into `constexpr int` to preserve scoping and avoid macro redefinition warnings.

* Disabled warnings for modifying `CMAKE_CXX_FLAGS` caused by `check_symbol_exists`, which temporarily modifies the flag to do a compile check.

* Fixed VLA in rccl UT.
2024-11-04 09:46:42 -07:00
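Two of the warning fixes listed in commit 1c45962273 (variable-length arrays replaced by `std::vector`, and function-local `#define` replaced by `constexpr int`) follow a common C++ pattern. The sketch below is not taken from the RCCL sources; the function `sumRanks` and the constant `kScale` are hypothetical names used only to show the before/after shape.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns kScale times the sum of 0, 1, ..., n-1.
int sumRanks(int n) {
  assert(n >= 0);

  // Before (compiler extension, warns under -Wvla):  int ranks[n];
  // After: standard C++ container sized at run time.
  std::vector<int> ranks(static_cast<std::size_t>(n));
  std::iota(ranks.begin(), ranks.end(), 0);  // fill with 0, 1, ..., n-1

  // Before:  #define SCALE 2   (visible past the function, may be redefined)
  // After: typed constant scoped to this function.
  constexpr int kScale = 2;

  return kScale * std::accumulate(ranks.begin(), ranks.end(), 0);
}
```

The `constexpr` form keeps the constant out of the preprocessor, so a second function can declare its own `kScale` without a redefinition warning.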
corey-derochie-amd 6db2644766 Set minimum ROCm version for MSCCLPP to 6.2 (#1401)
* Added ROCm version check around setting `ENABLE_MSCCLPP` flag.
2024-10-30 16:48:54 -06:00
Bertan Dogancay 373f113524 Dynamically select unroll factor to build for when targeting local arch (#1371)
* Dynamically select unroll factor to build for when targeting local arch only
2024-10-21 10:53:11 -04:00