Nilesh M Negi
74adb64dfb
[BUILD] Fix UT packaging on Debian OS ( #1848 )
...
[ROCm/rccl commit: 5036d0e713 ]
2025-08-11 09:43:26 -05:00
Rahul Vaidya
70a5f2f317
Fix rccl-UnitTests packaging on Debian systems ( #1846 )
...
Signed-off-by: ravaidya <ravaidya@amd.com >
[ROCm/rccl commit: cbbc713b03 ]
2025-08-08 12:28:56 -05:00
isaki001
52d33058bb
enable more events for LL128 NPKIT trace collection ( #1827 )
...
[ROCm/rccl commit: 74d82a8145 ]
2025-08-07 11:19:36 -05:00
awelling2801
c5b4e1bc78
Created coverage tests for rccl_wrap ( #1694 )
...
* Created coverage tests for rccl_wrap
RCCL_EXPOSE_STATIC off by default
Coverage tests for rccl_wrap.cc
* Remove RCCL_EXPOSE_STATIC dependency
* Removed Rcclwrap.RcclGetAlgoInfoTest
* Remove comments
* Corrected RCCL_EXPOSE_STATIC definition logic
---------
Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com >
Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com >
[ROCm/rccl commit: 82bea39280 ]
2025-08-06 14:48:00 -05:00
Avinash
f34d760613
Compiler warnings fix 2 ( #1801 )
...
* Changes to device code
* Changes to src/misc
* Changes to graph
* src/include changes
* src/transport changes
* changes in init, enqueue, proxy
* Changes to CMakeLists.txt
* Additional changes to device code
* Additional changes to net.cc
* adding 'compiler warning' tag to ease upstream merge
* typo correction
* Addressing comments
* Additional changes for new commits
[ROCm/rccl commit: 3f8cac388e ]
2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul
df3b7e477f
Disable context tracking for the current version. ( #1839 )
...
[ROCm/rccl commit: 6fc228e247 ]
2025-08-04 10:48:00 -05:00
Atul Kulkarni
35283394ed
Add unit tests for graph/xml.cc & graph/xml.h ( #1833 )
...
* Added new binary for executing unit tests
Added new unit tests for argcheck.cc and alt_rsmi.cc files
Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
* Added new unit tests for src/transport/shm.cc
* Added new unit tests for graph/xml.cc
[ROCm/rccl commit: 0e7d7da55d ]
2025-08-01 14:20:27 -05:00
Atul Kulkarni
e550ba1e3b
Update help text in README ( #1837 )
...
[ROCm/rccl commit: e2c9f2feab ]
2025-08-01 14:19:27 -05:00
awelling2801
0d34963b35
Added tests for coll_reg ( #1700 )
...
Changes to coll_reg
Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com >
[ROCm/rccl commit: 5ecc1b7ede ]
2025-07-31 13:49:23 -05:00
dependabot[bot]
b6639c85f4
Bump urllib3 from 2.2.2 to 2.5.0 in /docs/sphinx ( #1751 )
...
Bumps [urllib3](https://github.com/urllib3/urllib3 ) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases )
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst )
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0 )
---
updated-dependencies:
- dependency-name: urllib3
dependency-version: 2.5.0
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 32e95963dc ]
2025-07-31 11:25:45 -06:00
dependabot[bot]
e31001e378
Bump rocm-docs-core from 1.18.2 to 1.22.0 in /docs/sphinx ( #1836 )
...
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core ) from 1.18.2 to 1.22.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases )
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md )
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.22.0 )
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-version: 1.22.0
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: 1acc3eb6c1 ]
2025-07-31 11:15:01 -06:00
awelling2801
839fcb54b5
Added tests for transport.cc ( #1725 )
...
Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com >
[ROCm/rccl commit: 7320752bf3 ]
2025-07-31 11:04:28 -05:00
Rahul Vaidya
d65eb0b021
Fix RHEL10 packaging for rcclras and rccl-UnitTests ( #1831 )
...
Signed-off-by: ravaidya <ravaidya@amd.com >
[ROCm/rccl commit: 0adc5edc74 ]
2025-07-31 11:00:49 -05:00
Nilesh M Negi
be810f10f3
[DEVICE] Add unroll=2 for gfx950 multi-node ( #1824 )
...
[ROCm/rccl commit: bd55f876e9 ]
2025-07-31 02:35:26 -05:00
ycui1984
39c508b80d
Add collective latency profiler ( #1785 )
...
* [LatencyProfiler] Initial commit
* [LatencyProfiler] Add unit tests
* [LatencyProfiler] add more
* [LatencyProfiler] Pass unit tests
* [LatencyProfiler] Add hooks to integrate with meta internal tools
* [LatencyProfiler] Restore install.sh
* [LatencyProfiler] Resolved comments: (1) add proper license; (2) use proper namespace
* [LatencyProfiler] Add header
[ROCm/rccl commit: 874cd657ef ]
2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar
cafd7a5126
Optimize alltoall for 64 GPUs and above for gfx942 ( #1828 )
...
Add pxn and p2p net chunksize mi300x tuning
[ROCm/rccl commit: 4ce3df8d3a ]
2025-07-30 15:14:43 -04:00
mberenjk
cca5172260
Upcast FP8 to Half (FP16) for Sum Operation ( #1775 )
...
* adding hadd and hadd2 support using builtin functions.
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
[ROCm/rccl commit: c84ee3d298 ]
2025-07-29 11:33:06 -05:00
awelling2801
da2bb8a578
Added tests for Ipcsocket ( #1690 )
...
Co-authored-by: Welling <awelling@ctr2-alola-ctrl-01.amd.com >
[ROCm/rccl commit: 9843adaab2 ]
2025-07-29 10:03:28 -05:00
awelling2801
88dcaaddc5
Code coverage improvements for alloc.h ( #1676 )
...
* Added tests for alloc.h
* Added tests for ZeroElementCopy and MemcpyNullSrcOrDstPointer
---------
Co-authored-by: Welling <awelling@ctr2-alola-ctrl-01.amd.com >
[ROCm/rccl commit: e118aadc14 ]
2025-07-29 09:19:57 -05:00
peizhang56
5c02be7b51
Add Unit Test for bitops.h ( #1821 )
...
* Add Unit Test for bitops.h
* Change the style
* Fix the code review comments
* Add more test cases
[ROCm/rccl commit: fe182d6546 ]
2025-07-28 11:25:15 -05:00
Atul Kulkarni
de0d446e03
Added new unit tests for src/transport/p2p.cc ( #1774 )
...
[ROCm/rccl commit: 81ec6bff4c ]
2025-07-25 12:57:57 -05:00
Sarat Kamisetty
1719aa67be
passing down NET_OPTIONAL_RECV_COMPLETION hint to n/w plugin to enable optimizations ( #1752 )
...
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com >
[ROCm/rccl commit: 783c073a03 ]
2025-07-25 10:26:58 -05:00
Mustafa Abduljabbar
b3a0cc5e96
Add optional bf16 software-triggered pipelining for reduceCopyPacks ( #1758 )
...
- Introduced double-buffering to reduce copy overhead and overlap BF16 arithmetic with data prefetching.
- Aimed to improve performance of reduction-based collectives by up to 10%.
- Implemented based on recommendations from Guennadi Riguer (AMD)
- Added --force-reduce-pipeline option to install.sh to activate this optimization for BF16 reductions.
- Feature is disabled by default to prevent regressions with large messages until auto-tuning logic is upstreamed.
---------
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com >
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com >
[ROCm/rccl commit: 0ce20e7e07 ]
2025-07-25 10:57:05 -04:00
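The double-buffering idea described in the reduceCopyPacks commit above ( #1758 ) can be illustrated with a small host-side sketch. This is not the RCCL device code: `pipelinedSum` and `kChunk` are hypothetical names, plain `float` stands in for BF16, and the sketch only shows the ordering (issue the load of the next chunk into the spare buffer, then do arithmetic on the chunk loaded one step earlier). On a CPU the two steps still run sequentially; the overlap the commit aims for depends on the GPU having the next load in flight while the current arithmetic executes.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative chunk size; an assumption, not a value taken from RCCL.
constexpr std::size_t kChunk = 4;

// Hypothetical helper: returns a[i] + b[i], staging `b` through two buffers
// so the load of chunk c+1 is issued before the arithmetic on chunk c.
std::vector<float> pipelinedSum(const std::vector<float>& a,
                                const std::vector<float>& b) {
  assert(a.size() == b.size());
  const std::size_t n = a.size();
  const std::size_t chunks = (n + kChunk - 1) / kChunk;
  std::vector<float> out(n);
  std::array<std::array<float, kChunk>, 2> stage{};  // the double buffer

  // Copies chunk `c` of `b` into the given staging buffer.
  auto load = [&](std::size_t c, std::array<float, kChunk>& buf) {
    const std::size_t begin = c * kChunk;
    const std::size_t len = std::min(kChunk, n - begin);
    std::copy_n(b.begin() + begin, len, buf.begin());
  };

  if (chunks > 0) load(0, stage[0]);  // prologue: stage the first chunk

  for (std::size_t c = 0; c < chunks; ++c) {
    // Stage the next chunk into the buffer that is not being read.
    if (c + 1 < chunks) load(c + 1, stage[(c + 1) % 2]);

    // Reduce the chunk that was staged on the previous step.
    const std::size_t begin = c * kChunk;
    const std::size_t len = std::min(kChunk, n - begin);
    const auto& cur = stage[c % 2];
    for (std::size_t i = 0; i < len; ++i) {
      out[begin + i] = a[begin + i] + cur[i];
    }
  }
  return out;
}
```

Two buffers are enough because each chunk is consumed exactly one step after it is staged, so the buffer being overwritten is never the one being read.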
Atul Kulkarni
bd53bdf447
Added new unit tests for src/transport/shm.cc ( #1689 )
...
[ROCm/rccl commit: 1c3d1b3842 ]
2025-07-25 05:54:42 -05:00
Arm Patinyasakdikul
866058c6d9
Fix segfault when libibverbs returns 0 device. ( #1820 )
...
Fix: SWDEV-543816
[ROCm/rccl commit: 3c9c22bb52 ]
2025-07-23 15:18:52 -05:00
Wenkai Du
caff9764d3
Support fused all reduce and elementwise operations ( #1729 )
...
* Support fused all reduce and elementwise operations
Add additional "acc" parameter to RCCL Replayer logs
Add flag which indicates availability of new API
* Fix Recorder json parsing
* Remove unreachable code
* Remove extra acc pointer check
* .
* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734 )"
This reverts commit 4cadf3597c .
* Use noinline to reduce kernels linking time
* Don't use noinline for gfx942 and gfx950 to avoid perf regression
---------
Co-authored-by: AtlantaPepsi <timhu102@amd.com >
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com >
[ROCm/rccl commit: 9a4213356d ]
2025-07-23 09:04:17 -07:00
alex-breslow-amd
cbb648505a
Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce ( #1766 )
...
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
[ROCm/rccl commit: 11fabf1de1 ]
2025-07-22 07:15:15 -07:00
Rahul Vaidya
bd63518944
Add datatype validation for MSCCLPP AllGather ( #1816 )
...
Signed-off-by: rahulvaidya20 <ravaidya@amd.com >
[ROCm/rccl commit: c28d3d26a3 ]
2025-07-21 11:50:45 -05:00
Atul Kulkarni
c94fb7c58e
Code coverage improvements ( #1665 )
...
* Increased max stack size to 640
* Added new binary for executing unit tests
Added new unit tests for argcheck.cc and alt_rsmi.cc files
Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
[ROCm/rccl commit: 275fdd43c1 ]
2025-07-17 11:20:49 -05:00
isaki001
af4ce678b5
Fix typo in NPKit build that prevents NET_TEST event ( #1807 )
...
[ROCm/rccl commit: ef6a54ba34 ]
2025-07-16 09:08:06 -05:00
Nilesh M Negi
2c0c02b211
[GRAPH] Match maxChannels for gfx942 CUs ( #1302 )
...
[ROCm/rccl commit: 6632183efe ]
2025-07-16 09:07:02 -05:00
Wenkai Du
670966f86b
Fix inline compilation issue with LL ( #1806 )
...
[ROCm/rccl commit: 106024b0db ]
2025-07-15 08:39:18 -07:00
isaki001
a20e65cfc0
gfx950 updated on LL thresholds for allreduce/allgather, update treeCorrection ( #1803 )
...
* change LL thresholds for allreduce/allgather and update treeCorrectionFactor
* update allGather LL cutoff
* adjust allgather LL/LL128 thresholds
[ROCm/rccl commit: 8d0f1a1cef ]
2025-07-15 09:10:19 -05:00
dependabot[bot]
c447d779b9
Bump requests from 2.32.2 to 2.32.4 in /docs/sphinx ( #1738 )
...
Bumps [requests](https://github.com/psf/requests ) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases )
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md )
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4 )
---
updated-dependencies:
- dependency-name: requests
dependency-version: 2.32.4
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: aafbdad2ab ]
2025-07-14 10:30:37 -06:00
dependabot[bot]
01b3922075
Bump tornado from 6.4.2 to 6.5.1 in /docs/sphinx ( #1710 )
...
Bumps [tornado](https://github.com/tornadoweb/tornado ) from 6.4.2 to 6.5.1.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst )
- [Commits](https://github.com/tornadoweb/tornado/compare/v6.4.2...v6.5.1 )
---
updated-dependencies:
- dependency-name: tornado
dependency-version: 6.5.1
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com >
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
[ROCm/rccl commit: d4d021c726 ]
2025-07-14 10:25:01 -06:00
Wenkai Du
e2ad96bb96
Disable P2P net option by default ( #1793 )
...
[ROCm/rccl commit: 708ad75f7a ]
2025-07-14 08:55:39 -07:00
Nikhil-Nunna
bf4031276c
topo_explorer initial readme ( #1797 )
...
* topo_explorer initial readme
* topo_explorer readme update
* topo_explorer readme update
* Added sample output to README
* Update README.md
* Update README.md
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com >
[ROCm/rccl commit: 7abc7538ea ]
2025-07-11 11:28:20 -05:00
Jobbins
be9a573cb0
[rccl] Remove .jenkins folder ( #1754 )
...
[ROCm/rccl commit: 7ebd31097c ]
2025-07-11 11:24:06 -05:00
Bertan Dogancay
d4aafe31fa
[GEN] Fix typo in IFC code gen ( #1796 )
...
[ROCm/rccl commit: 7158adb57f ]
2025-07-11 09:19:39 -04:00
Nilesh M Negi
41c985462c
[BUILD] Use fmt-header instead of libfmt ( #1791 )
...
[ROCm/rccl commit: 6b4ad0fd74 ]
2025-07-10 17:19:53 -05:00
Nilesh M Negi
86dd6f262b
[TOOLS] Update p2p-latency-test for gfx950 ( #1730 )
...
[ROCm/rccl commit: f839e4edef ]
2025-07-10 12:13:29 -05:00
Nilesh M Negi
ba31e4e846
[INIT] Fix fallback for unsupported user-specified runtime unroll factor ( #1780 )
...
* [INIT] Fix fallback for unsupported user-specified runtime unroll factor
* Add CollTrace guard
* Move `commSetUnrollFactor()` to rccl_wrap.cc
* Modify comments in the device-code generator script
[ROCm/rccl commit: 2c099fe29a ]
2025-07-10 10:56:18 -05:00
Nilesh M Negi
1050eb13ac
[DEVICE] Fix validation errors for multi-node LL with gfx950 non-coherent system memory ( #1795 )
...
[ROCm/rccl commit: 68d6f99e0f ]
2025-07-10 09:05:46 -05:00
Mustafa Abduljabbar
caeaaa284c
Fix AllReduce regression due to previous max range increase for LL64/LL128 ( #1787 )
...
* Adjust tuning factor impacting more than 2 nodes
* Scale max LL128 size for > 2 nodes
* Retune max LL128 range for N > 2
[ROCm/rccl commit: 058264b3f3 ]
2025-07-09 19:17:10 -05:00
Atul Kulkarni
16aadd67cf
Enable Google Test's GMOCK feature ( #1773 )
...
[ROCm/rccl commit: a28d5cb986 ]
2025-07-09 17:25:44 -05:00
mberenjk
1623fcc7a1
Improving build time by removing the gfx11xx and host code from rccl_float8.h ( #1789 )
...
* removing extra build time by removing the gfx11xx arch from using hip_fp8
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com >
[ROCm/rccl commit: 697bee4ee8 ]
2025-07-09 14:03:47 -05:00
Bertan Dogancay
b1470b4e50
[GRAPH] Pass rank instead of busId due to a change in an internal function signature ( #1792 )
...
[ROCm/rccl commit: 9c89573580 ]
2025-07-08 08:45:54 -04:00
Marius Brehler
5d753cb871
Set GTEST_BOTH_LIBRARIES appropriately ( #1669 )
...
If `find_package()` succeeds in finding GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set and thus
`rccl-UnitTests` fails while trying to link unknown symbols.
[ROCm/rccl commit: dac0e528a0 ]
2025-07-05 20:38:31 -05:00
Bertan Dogancay
471fc6bff2
[DEVICE] Enable PAT algo for RCCL 1ppn ( #1756 )
...
* Enable PAT algo for RCCL 1ppn
[ROCm/rccl commit: e96c8473a1 ]
2025-07-04 13:45:18 -04:00
Rakesh Roy
82a822b646
Fix chrono build error ( #1790 )
...
[ROCm/rccl commit: dd3b1d816c ]
2025-07-04 08:27:30 -05:00