Commit graph

1818 commits

Author SHA1 Message Date
Nilesh M Negi 6632183efe [GRAPH] Match maxChannels for gfx942 CUs (#1302) 2025-07-16 09:07:02 -05:00
Wenkai Du 106024b0db Fix inline compilation issue with LL (#1806) 2025-07-15 08:39:18 -07:00
isaki001 8d0f1a1cef gfx950 updated on LL thresholds for allreduce/allgather, update treeCorrection (#1803)
* change LL thresholds for allreduce/allgather and update treeCorrectionFactor

* update allGather LL cutoff

* adjust allgather LL/LL128 thresholds
2025-07-15 09:10:19 -05:00
dependabot[bot] aafbdad2ab Bump requests from 2.32.2 to 2.32.4 in /docs/sphinx (#1738)
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)

---
updated-dependencies:
- dependency-name: requests
  dependency-version: 2.32.4
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:30:37 -06:00
dependabot[bot] d4d021c726 Bump tornado from 6.4.2 to 6.5.1 in /docs/sphinx (#1710)
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.4.2 to 6.5.1.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](https://github.com/tornadoweb/tornado/compare/v6.4.2...v6.5.1)

---
updated-dependencies:
- dependency-name: tornado
  dependency-version: 6.5.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:25:01 -06:00
Wenkai Du 708ad75f7a Disable P2P net option by default (#1793) 2025-07-14 08:55:39 -07:00
Nikhil-Nunna 7abc7538ea topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-07-11 11:28:20 -05:00
Jobbins 7ebd31097c [rccl] Remove .jenkins folder (#1754) 2025-07-11 11:24:06 -05:00
Bertan Dogancay 7158adb57f [GEN] Fix typo in IFC code gen (#1796) 2025-07-11 09:19:39 -04:00
Nilesh M Negi 6b4ad0fd74 [BUILD] Use fmt-header instead of libfmt (#1791) 2025-07-10 17:19:53 -05:00
Nilesh M Negi f839e4edef [TOOLS] Update p2p-latency-test for gfx950 (#1730) 2025-07-10 12:13:29 -05:00
Nilesh M Negi 2c099fe29a [INIT] Fix fallback for unsupported user-specified runtime unroll factor (#1780)
* [INIT] Fix fallback for unsupported user-specified runtime unroll factor
* Add CollTrace guard
* Move `commSetUnrollFactor()` to rccl_wrap.cc
* Modify comments in the device-code generator script
2025-07-10 10:56:18 -05:00
Nilesh M Negi 68d6f99e0f [DEVICE] Fix validation errors for multi-node LL with gfx950 non-coherent system memory (#1795) 2025-07-10 09:05:46 -05:00
Mustafa Abduljabbar 058264b3f3 Fix AllReduce regression due to previous max range increase for LL64/LL128 (#1787)
* Adjust tuning factor impacting more than 2 nodes
* Scale max LL128 size for > 2 nodes
* Retune max LL128 range for N > 2
2025-07-09 19:17:10 -05:00
Atul Kulkarni a28d5cb986 Enable Google Test's GMOCK feature (#1773) 2025-07-09 17:25:44 -05:00
mberenjk 697bee4ee8 Improving build time by removing the gfx11xx and host code from rccl_float8.h (#1789)
* removing extra build time by removing the gfx11xx arch from using hip_fp8

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-09 14:03:47 -05:00
Bertan Dogancay 9c89573580 [GRAPH] Pass rank instead of busId due to a change in an internal function signature (#1792) 2025-07-08 08:45:54 -04:00
Marius Brehler dac0e528a0 Set GTEST_BOTH_LIBRARIES appropriately (#1669)
If `find_package()` succeeds in finding GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set and thus
`rccl-UnitTests` fails when trying to link unknown symbols.
2025-07-05 20:38:31 -05:00
Bertan Dogancay e96c8473a1 [DEVICE] Enable PAT algo for RCCL 1ppn (#1756)
* Enable PAT algo for RCCL 1ppn
2025-07-04 13:45:18 -04:00
Rakesh Roy dd3b1d816c Fix chrono build error (#1790) 2025-07-04 08:27:30 -05:00
Wenkai Du ae9642d4bc msccl: use special send for LL on gfx950 (#1788) 2025-07-03 04:16:18 -05:00
ryanhankins 9d35581d5e Adding #include <dlfcn.h> in nccl_net.h to pass build (#1786) 2025-07-02 19:21:53 -05:00
Nilesh M Negi 9e99c18f6e [MSCCLPP] Disable format checks in MSCCLPP by default (#1781) 2025-07-02 09:11:42 -05:00
Wenkai Du 4640ab19b3 Add support for extended fine grained system memory pool (#1770)
* Add support for extended fine-grained system memory pool
* Use hipHostRegisterUncached
* Add "sc0 sc1" flags for LL store on gfx950
* Update after HIP flag is changed to hipExtHostRegisterUncached
2025-07-01 16:38:49 -05:00
Nilesh M Negi 3e51c41dcb [BUILD] Fix packaging for RAS (#1784) 2025-07-01 16:37:14 -05:00
Nilesh M Negi 8d3a5542fb [RAS] Add support for RAS client (#1748)
Enable RAS client binary `rcclras`
2025-06-29 18:53:16 -05:00
isaki001 75d22b47cb added tuning table for gfx950 (#1779)
* added tuning table for mi350

* remove erroneous string
2025-06-29 15:45:39 -05:00
Bertan Dogancay 358dc1bc84 Switch to linear channel mapping for 2 nodes (#1777) 2025-06-28 09:10:18 -05:00
Arm Patinyasakdikul 35024ca1cb [topo-expl] update header file location. (#1769) 2025-06-27 15:29:37 -05:00
gilbertlee-amd 16101e654f Fixing HelloRccl include path to RCCL, fixing some warnings (#1778) 2025-06-27 09:12:59 -06:00
Arm Patinyasakdikul 265b1b3775 add warning if workFIFO is not available after multiple retries. (#1772)
* add warning if workFIFO is not available after multiple retries.

* Update src/enqueue.cc

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-26 19:49:52 -05:00
Arm Patinyasakdikul 71c788d4d7 Update plugin to look for librccl-net.so. (#1768) 2025-06-26 16:59:38 -05:00
mberenjk 5fb9d8f828 changing the HIP-VERSION to 6.3 to avoid using hip_fp8 for older ROCm versions (#1764)
Co-authored-by: Marzieh Berenjkoub <mberenjk@.amd.com>
2025-06-26 11:15:01 -05:00
Mustafa Abduljabbar 7e2ac00980 Revert LL64 cutoff points based on internal tuning (#1771) 2025-06-26 11:59:42 -04:00
Dingming Wu 020dcf0a7c Add proxyTrace (#1732)
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
2025-06-25 23:01:34 -05:00
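The idea in the proxyTrace commit above (020dcf0a7c), a bounded set of active ops whose status can be dumped on a crash or hang, can be illustrated with a minimal sketch. This is not RCCL's implementation: the class `ProxyTraceSketch`, its methods, and the status strings are hypothetical names chosen for illustration, and the real feature lives in C++ host code.

```python
from collections import OrderedDict


class ProxyTraceSketch:
    """Hypothetical bounded tracker for in-flight send/recv ops (illustration only)."""

    def __init__(self, capacity):
        self.capacity = capacity      # fixed upper bound on tracked ops
        self.ops = OrderedDict()      # op_id -> latest status, oldest first

    def update(self, op_id, status):
        # Record the newest status and mark the op as most recently touched.
        if op_id in self.ops:
            self.ops.move_to_end(op_id)
        self.ops[op_id] = status
        # Evict the oldest entries so memory use stays fixed.
        while len(self.ops) > self.capacity:
            self.ops.popitem(last=False)

    def complete(self, op_id):
        # Finished ops no longer need tracking.
        self.ops.pop(op_id, None)

    def dump(self):
        # What a crash or hang handler would print: one line per active op.
        return [f"op {op_id}: {status}" for op_id, status in self.ops.items()]
```

A crash or hang handler would call `dump()` and log the returned lines; the eviction policy (drop the least recently updated op) is an assumption of this sketch, not something the commit message states.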
Nilesh M Negi 568777a9bf [BUILD] Move NPKit flags from install.sh to CMakeLists.txt (#1741) 2025-06-23 21:51:49 -05:00
corey-derochie-amd e73db11819 Updated CHANGELOG for LL128 support for gfx942 in 7.0 (#1719)
* Updated CHANGELOG for LL128 support for gfx942 in 7.0

Also ported 6.4.2 section

* Removed unnecessary note from 7.0
2025-06-23 08:50:12 -06:00
jonatluu 709140204a Remove File reorganization backward compatibility (rccl) (#1753) 2025-06-22 17:18:26 -05:00
Grant Pinkert 2482d1475f Fix continuous build hang on extract_metadata.cmake (#1668)
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.
2025-06-22 05:54:44 -05:00
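The fix described in commit 2482d1475f above (bound the wait on an external tool and report why it failed) follows a general pattern that can be sketched outside CMake. The example below is a Python analogue, not the contents of extract_metadata.cmake: the helper `run_with_timeout` and its return shape are hypothetical, and only the standard-library `subprocess` API is assumed.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd, returning (ok, message) instead of blocking forever.

    Hypothetical helper for illustration; not part of RCCL's build scripts.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this exception is raised.
        return (False, f"timed out after {timeout_s}s: {' '.join(cmd)}")
    if result.returncode != 0:
        # Surface the exit code and stderr rather than failing silently.
        return (False, f"exit code {result.returncode}: {result.stderr.strip()}")
    return (True, result.stdout)
```

For example, `run_with_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 1)` returns a failure tuple after about one second instead of waiting a full minute.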
Bertan Dogancay 675b495a00 [NPKit] Create dump dir regardless of default or user provided path (#1757) 2025-06-21 21:18:20 -05:00
Bertan Dogancay 0c1795c64b Merge pull request #1721 from BertanDogancay/2.26-sync
[SYNC] 2.26.6-1
2025-06-20 09:57:09 -04:00
BertanDogancay aaf023976a Merge remote-tracking branch 'nccl/master' into develop 2025-06-20 07:54:49 -05:00
Joseph Macaranas 12315c259a [Azure CI] rccl nightly pipeline that runs on slurm (#1723)
* [Azure CI] rccl nightly pipeline that runs on slurm
- Login node will be set up as a self-hosted agent on Azure Pipelines.
- Login node will run this job nightly.
- Login node will checkout the latest develop source, and then run build and test through sbatch calls, and then waiting for the jobs to complete. When the jobs are complete, print out the logs.
2025-06-19 10:41:40 -05:00
Nilesh M Negi 92a5d225d9 [MSCCLPP] Disable MSCCLPP Executor (#1744)
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
Sarat Kamisetty fa0422f174 generic net plugin ctxt that is extensible for use in multiple APIs (#1735)
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>
2025-06-16 14:48:08 -07:00
Bertan Dogancay 39211c6b41 [NPKit] Use default output directory when env var is not set (#1747) 2025-06-16 15:26:53 -04:00
Mustafa Abduljabbar fb4ad82d0d Fix topo_explorer compatibility and capture WarpSize (#1743) 2025-06-16 08:18:35 -04:00
Tim ba97c9c18b replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated format of recorder, especially in binary department, according to replayer's need
2025-06-13 15:05:34 -04:00
Richard Barnes 4486d091b8 Enable -Wdeprecated-copy-with-user-provided-copy (#1643) 2025-06-13 08:23:31 -07:00
Arm Patinyasakdikul 6c37ae9470 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.
2025-06-12 09:58:01 -05:00