2056 Commits

Author SHA1 Message Date
Rahul Vaidya ee9ed3ef87 [BUILD] Fix UT packaging on Debian family OS (#1854)
* Fix UT packaging on Debian family OSes

Signed-off-by: ravaidya <ravaidya@amd.com>

* Split OR condition when performing Debian checks

Signed-off-by: ravaidya <ravaidya@amd.com>

---------

Signed-off-by: ravaidya <ravaidya@amd.com>
2025-08-11 17:03:16 -05:00
Chris Sosa 53977821b5 Add CI Badge for tracking CI status in prep for gating changes (#1851)
This PR is intended to move RCCL to gating changes on CI failures. Right now, only build/unittests run consistently per PR. We should eventually add all single- and multi-node test status badges once those tests are running in presubmit and continuously on develop.
2025-08-11 14:02:46 -07:00
Nilesh M Negi 5036d0e713 [BUILD] Fix UT packaging on Debian OS (#1848) 2025-08-11 09:43:26 -05:00
Rahul Vaidya cbbc713b03 Fix rccl-UnitTests packaging on Debian systems (#1846)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-08-08 12:28:56 -05:00
isaki001 74d82a8145 enable more events for LL128 NPKIT trace collection (#1827) 2025-08-07 11:19:36 -05:00
awelling2801 82bea39280 Created coverage tests for rccl_wrap (#1694)
* Created coverage tests for rccl_wrap

RCCL_EXPOSE_STATIC off by default

Coverage tests for rccl_wrap.cc

* Remove RCCL_EXPOSE_STATIC dependency

* Removed Rcclwrap.RcclGetAlgoInfoTest

* Remove comments

* Corrected RCCL_EXPOSE_STATIC definition logic

---------

Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com>
2025-08-06 14:48:00 -05:00
Avinash 3f8cac388e Compiler warnings fix 2 (#1801)
* Changes to device code

* Changes to src/misc

* Changes to graph

* src/include changes

* src/transport changes

* changes in init, enqueue, proxy

* Changes to CMakeLists.txt

* Additional changes to device code

* Additional changes to net.cc

* adding 'compiler warning' tag to ease upstream merge

* typo correction

* Addressing comments

* Additional changes for new commits
2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul 6fc228e247 Disable context tracking for the current version. (#1839) 2025-08-04 10:48:00 -05:00
Atul Kulkarni 0e7d7da55d Add unit tests for graph/xml.cc & graph/xml.h (#1833)
* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.

* Added new unit tests for src/transport/shm.cc

* Added new unit tests for graph/xml.cc
2025-08-01 14:20:27 -05:00
Atul Kulkarni e2c9f2feab Update help text in README (#1837) 2025-08-01 14:19:27 -05:00
awelling2801 5ecc1b7ede Added tests for coll_reg (#1700)
Changes to coll_reg

Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
2025-07-31 13:49:23 -05:00
dependabot[bot] 32e95963dc Bump urllib3 from 2.2.2 to 2.5.0 in /docs/sphinx (#1751)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-31 11:25:45 -06:00
dependabot[bot] 1acc3eb6c1 Bump rocm-docs-core from 1.18.2 to 1.22.0 in /docs/sphinx (#1836)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.22.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.22.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.22.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-31 11:15:01 -06:00
awelling2801 7320752bf3 Added tests for transport.cc (#1725)
Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
2025-07-31 11:04:28 -05:00
Rahul Vaidya 0adc5edc74 Fix RHEL10 packaging for rcclras and rccl-UnitTests (#1831)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-07-31 11:00:49 -05:00
Nilesh M Negi bd55f876e9 [DEVICE] Add unroll=2 for gfx950 multi-node (#1824) 2025-07-31 02:35:26 -05:00
ycui1984 874cd657ef Add collective latency profiler (#1785)
* [LatencyProfiler] Initial commit

* [LatencyProfiler] Add unit tests

* [LatencyProfiler] add more

* [LatencyProfiler] Pass unit tests

* [LatencyProfiler] Add hooks to integrate with meta internal tools

* [LatencyProfiler] Restore install.sh

* [LatencyProfiler] Resolved comments: (1) add proper license; (2) use proper namespace

* [LatencyProfiler] Add header
2025-07-30 14:59:28 -07:00
Mustafa Abduljabbar 4ce3df8d3a Optimize alltoall for 64 GPUs and above for gfx942 (#1828)
Add pxn and p2p net chunksize mi300x tuning
2025-07-30 15:14:43 -04:00
mberenjk c84ee3d298 Upcast FP8 to Half (FP16) for Sum Operation (#1775)
* adding hadd and hadd2 support using builtin functions.

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-29 11:33:06 -05:00
awelling2801 9843adaab2 Added tests for Ipcsocket (#1690)
Co-authored-by: Welling <awelling@ctr2-alola-ctrl-01.amd.com>
2025-07-29 10:03:28 -05:00
awelling2801 e118aadc14 Code coverage improvements for alloc.h (#1676)
* Added tests for alloc.h

* Added tests for ZeroElementCopy and MemcpyNullSrcOrDstPointer

---------

Co-authored-by: Welling <awelling@ctr2-alola-ctrl-01.amd.com>
2025-07-29 09:19:57 -05:00
peizhang56 fe182d6546 Add Unit Test for bitops.h (#1821)
* Add Unit Test for bitops.h

* Change the style

* Fix the code review comments

* Add more test cases
2025-07-28 11:25:15 -05:00
Atul Kulkarni 81ec6bff4c Added new unit tests for src/transport/p2p.cc (#1774) 2025-07-25 12:57:57 -05:00
Sarat Kamisetty 783c073a03 passing down NET_OPTIONAL_RECV_COMPLETION hint to n/w plugin to enable optimizations (#1752)
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>
2025-07-25 10:26:58 -05:00
Mustafa Abduljabbar 0ce20e7e07 Add optional bf16 software-triggered pipelining for reduceCopyPacks (#1758)
- Introduced double-buffering to reduce copy overhead and overlap BF16 arithmetic with data prefetching.
- Aimed to improve performance of reduction-based collectives by up to 10%.
- Implemented based on recommendations from Guennadi Riguer (AMD)
- Added --force-reduce-pipeline option to install.sh to activate this optimization for BF16 reductions.
- Feature is disabled by default to prevent regressions with large messages until auto-tuning logic is upstreamed.
---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: Pedram Alizadeh <pmohamma@amd.com>
2025-07-25 10:57:05 -04:00
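
The double-buffering idea in the commit above can be illustrated with a minimal Python sketch. It is not the RCCL device code: the function name `pipelined_reduce` and the chunked-list inputs are hypothetical, and Python runs the steps sequentially, so the sketch only shows the loop structure (load chunk i+1 into the spare buffer before reducing chunk i), not a real overlap of arithmetic with prefetching.

```python
def pipelined_reduce(chunks_a, chunks_b):
    """Sum two chunked inputs, fetching chunk i+1 before reducing chunk i.

    Hypothetical illustration of double buffering; not RCCL's implementation.
    """
    if len(chunks_a) != len(chunks_b):
        raise ValueError("inputs must have the same number of chunks")
    results = []
    if not chunks_a:
        return results
    # Two buffer slots: one is being reduced while the other is being filled.
    buffers = [None, None]
    # Prologue: load the first chunk pair before entering the loop.
    buffers[0] = (list(chunks_a[0]), list(chunks_b[0]))
    for i in range(len(chunks_a)):
        cur = i % 2
        nxt = (i + 1) % 2
        # Prefetch the next chunk pair into the spare slot, if there is one.
        if i + 1 < len(chunks_a):
            buffers[nxt] = (list(chunks_a[i + 1]), list(chunks_b[i + 1]))
        # Reduce the chunk pair that was loaded in the previous iteration.
        a, b = buffers[cur]
        results.append([x + y for x, y in zip(a, b)])
    return results
```

On a GPU the prefetch and the reduction would be issued so that memory latency hides behind the arithmetic; here the two slots simply alternate roles on each iteration.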
Atul Kulkarni 1c3d1b3842 Added new unit tests for src/transport/shm.cc (#1689) 2025-07-25 05:54:42 -05:00
Kamil Iskra 593de54e52 NCCL 2.27.7-1
Prevent initialization failures in certain configurations when attempting
to load fp8-specific symmetric multicast kernels on GPUs older than
Blackwell.
2025-07-24 10:39:53 -07:00
Arm Patinyasakdikul 3c9c22bb52 Fix segfault when libibverbs returns 0 devices. (#1820)
Fix: SWDEV-543816
2025-07-23 15:18:52 -05:00
Wenkai Du 9a4213356d Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 9d72be7b2f.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
2025-07-23 09:04:17 -07:00
alex-breslow-amd 11fabf1de1 Cheaper threadfence for gfx942 in postPeer [1/N]: enable for single node allreduce (#1766)
Boosts single node bfloat16 allreduce performance by up to 20% for some data sizes and provides gating with the RCCL_GFX942_CHEAP_FENCE_OFF environment variable
2025-07-22 07:15:15 -07:00
Rahul Vaidya c28d3d26a3 Add datatype validation for MSCCLPP AllGather (#1816)
Signed-off-by: rahulvaidya20 <ravaidya@amd.com>
2025-07-21 11:50:45 -05:00
Stephen Sachs 0d1ece2b43 Exclude ongoing issues from auto-closing logic
- Added a check to skip issues labeled "ongoing" in the close-old-issues script
- Adjusted the condition to compare both creation and update dates against six months ago
2025-07-17 21:50:05 +02:00
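
The closing condition described in the commit above might look roughly like the following sketch. The actual close-old-issues script is not shown in this log, so the function name `should_close`, the dictionary keys, and the 182-day approximation of "six months" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "six months" is approximated as 182 days.
SIX_MONTHS = timedelta(days=182)


def should_close(issue, now):
    """Return True if an issue is eligible for auto-closing.

    `issue` is a hypothetical dict with "labels" (list of str) and
    timezone-aware "created_at" / "updated_at" datetimes.
    """
    # Issues labeled "ongoing" are always skipped.
    if "ongoing" in issue["labels"]:
        return False
    cutoff = now - SIX_MONTHS
    # Both the creation date and the last update must be older than the cutoff.
    return issue["created_at"] < cutoff and issue["updated_at"] < cutoff
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the rule easy to test.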
Atul Kulkarni 275fdd43c1 Code coverage improvements (#1665)
* Increased max stack size to 640

* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.
2025-07-17 11:20:49 -05:00
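
The commit above mentions a bash script that converts static functions and variables to non-static ones so unit tests can reach them. That script is not reproduced here; the following is a Python sketch of the same idea under simplifying assumptions: it only strips a `static` keyword that starts a line (after optional indentation), and the name `expose_statics` is hypothetical.

```python
import re

# Matches a leading `static` keyword (after optional indentation) followed by
# whitespace. Identifiers such as `static_cast` or `staticValue` do not match.
_STATIC_PREFIX = re.compile(r"^([ \t]*)static[ \t]+", flags=re.MULTILINE)


def expose_statics(source):
    """Return `source` with line-leading `static` qualifiers removed.

    Simplified illustration only; the real script may apply different rules.
    """
    return _STATIC_PREFIX.sub(r"\1", source)
```

For example, `expose_statics("static int helper(int x);\n")` yields `"int helper(int x);\n"`. A real tool would also have to handle name collisions between translation units once internal linkage is removed, which this sketch ignores.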
Stephen Sachs bfedf2629e Add issue templates and GitHub action to remove stale issues
We add three issue types (issue, question, RFE) and some predefined
questions to speed up the debugging process.

We also add a custom action which closes all issues created more than 6
months ago which have not been updated for more than a month.
2025-07-16 17:56:12 +02:00
isaki001 ef6a54ba34 Fix typo in NPKit build that prevents NET_TEST event (#1807) 2025-07-16 09:08:06 -05:00
Nilesh M Negi 6632183efe [GRAPH] Match maxChannels for gfx942 CUs (#1302) 2025-07-16 09:07:02 -05:00
Wenkai Du 106024b0db Fix inline compilation issue with LL (#1806) 2025-07-15 08:39:18 -07:00
isaki001 8d0f1a1cef gfx950 updated on LL thresholds for allreduce/allgather, update treeCorrection (#1803)
* change LL thresholds for allreduce/allgather and update treeCorrectionFactor

* update allGather LL cutoff

* adjust allgather LL/LL128 thresholds
2025-07-15 09:10:19 -05:00
dependabot[bot] aafbdad2ab Bump requests from 2.32.2 to 2.32.4 in /docs/sphinx (#1738)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:30:37 -06:00
dependabot[bot] d4d021c726 Bump tornado from 6.4.2 to 6.5.1 in /docs/sphinx (#1710)
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.4.2 to 6.5.1.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](https://github.com/tornadoweb/tornado/compare/v6.4.2...v6.5.1)

---
updated-dependencies:
- dependency-name: tornado
  dependency-version: 6.5.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:25:01 -06:00
Wenkai Du 708ad75f7a Disable P2P net option by default (#1793) 2025-07-14 08:55:39 -07:00
Nikhil-Nunna 7abc7538ea topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-07-11 11:28:20 -05:00
Jobbins 7ebd31097c [rccl] Remove .jenkins folder (#1754) 2025-07-11 11:24:06 -05:00
Kamil Iskra 7c12c627c6 NCCL 2.27.6-1
Improve support for DirectNIC (CX8)
* Add support for XDR speed detection.
* When DirectNIC is enabled, report only the RDMA interfaces.

Extend the P2C (PXN over C2C) support to send/receive operations.

Support compilation with GCC 14 (Issues #1743, #1751).

Fix the unloading of network plugins that also provide tuner capability.

Fix the change of the current device across the calls to ncclCommDestroy()
and ncclCommAbort().

A note for users on MNNVL systems: please ensure an adequate stack size for
NCCL threads.  While the default Linux stack size limit of 8192 KB is known
to be sufficient, we've seen crashes if the limit is changed to
"unlimited", as it causes the glibc library to unexpectedly *decrease* the
stack size of NCCL's background threads to just 2048 KB.  Use "ulimit -s"
in bash to print the current limit; if needed, reset it to 8192 KB using
"ulimit -s 8192" (one also needs to ensure that the new setting is
propagated to other nodes when launching a multi-node NCCL job).
2025-07-11 07:32:13 -07:00
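
The stack-size note in the release message above can be checked programmatically before launching a job. The sketch below uses Python's standard `resource` module (available on Unix-like systems) to read the soft stack limit; the helper names are hypothetical, and the 8192 KB figure is taken from the note itself.

```python
import resource


def stack_limit_kb():
    """Return the soft stack limit in KB, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return None
    # getrlimit reports bytes; `ulimit -s` reports units of 1024 bytes.
    return soft // 1024


def describe_stack_limit(limit_kb):
    """Turn a limit (KB or None) into a short human-readable message."""
    if limit_kb is None:
        return "unlimited: consider 'ulimit -s 8192' before launching"
    return f"{limit_kb} KB"
```

Calling `describe_stack_limit(stack_limit_kb())` on each node is one way to confirm that the setting was propagated, as the note recommends for multi-node jobs.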
Bertan Dogancay 7158adb57f [GEN] Fix typo in IFC code gen (#1796) 2025-07-11 09:19:39 -04:00
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Nilesh M Negi f839e4edef [TOOLS] Update p2p-latency-test for gfx950 (#1730) 2025-07-10 12:13:29 -05:00
Nilesh M Negi 2c099fe29a [INIT] Fix fallback for unsupported user-specified runtime unroll factor (#1780)
* [INIT] Fix fallback for unsupported user-specified runtime unroll factor
* Add CollTrace guard
* Move `commSetUnrollFactor()` to rccl_wrap.cc
* Modify comments in the device-code generator script
2025-07-10 10:56:18 -05:00
Nilesh M Negi 68d6f99e0f [DEVICE] Fix validation errors for multi-node LL with gfx950 non-coherent system memory (#1795) 2025-07-10 09:05:46 -05:00
Mustafa Abduljabbar 058264b3f3 Fix AllReduce regression due to previous max range increase for LL64/LL128 (#1787)
* Adjust tuning factor impacting more than 2 nodes
* Scale max LL128 size for > 2 nodes
* Retune max LL128 range for N > 2
2025-07-09 19:17:10 -05:00