Nilesh M Negi
6632183efe
[GRAPH] Match maxChannels for gfx942 CUs ( #1302 )
2025-07-16 09:07:02 -05:00
Wenkai Du
106024b0db
Fix inline compilation issue with LL ( #1806 )
2025-07-15 08:39:18 -07:00
isaki001
8d0f1a1cef
gfx950 updated on LL thresholds for allreduce/allgather, update treeCorrection ( #1803 )
...
* change LL thresholds for allreduce/allgather and update treeCorrectionFactor
* update allGather LL cutoff
* adjust allgather LL/LL128 thresholds
2025-07-15 09:10:19 -05:00
dependabot[bot]
aafbdad2ab
Bump requests from 2.32.2 to 2.32.4 in /docs/sphinx ( #1738 )
...
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
- [Release notes](https://github.com/psf/requests/releases)
- [Changelog](https://github.com/psf/requests/blob/main/HISTORY.md)
- [Commits](https://github.com/psf/requests/compare/v2.32.2...v2.32.4)
---
updated-dependencies:
- dependency-name: requests
dependency-version: 2.32.4
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:30:37 -06:00
dependabot[bot]
d4d021c726
Bump tornado from 6.4.2 to 6.5.1 in /docs/sphinx ( #1710 )
...
Bumps [tornado](https://github.com/tornadoweb/tornado) from 6.4.2 to 6.5.1.
- [Changelog](https://github.com/tornadoweb/tornado/blob/master/docs/releases.rst)
- [Commits](https://github.com/tornadoweb/tornado/compare/v6.4.2...v6.5.1)
---
updated-dependencies:
- dependency-name: tornado
dependency-version: 6.5.1
dependency-type: indirect
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-14 10:25:01 -06:00
Wenkai Du
708ad75f7a
Disable P2P net option by default ( #1793 )
2025-07-14 08:55:39 -07:00
Nikhil-Nunna
7abc7538ea
topo_explorer initial readme ( #1797 )
...
* topo_explorer initial readme
* topo_explorer readme update
* topo_explorer readme update
* Added sample output to README
* Update README.md
* Update README.md
---------
Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-07-11 11:28:20 -05:00
Jobbins
7ebd31097c
[rccl] Remove .jenkins folder ( #1754 )
2025-07-11 11:24:06 -05:00
Bertan Dogancay
7158adb57f
[GEN] Fix typo in IFC code gen ( #1796 )
2025-07-11 09:19:39 -04:00
Nilesh M Negi
6b4ad0fd74
[BUILD] Use fmt-header instead of libfmt ( #1791 )
2025-07-10 17:19:53 -05:00
Nilesh M Negi
f839e4edef
[TOOLS] Update p2p-latency-test for gfx950 ( #1730 )
2025-07-10 12:13:29 -05:00
Nilesh M Negi
2c099fe29a
[INIT] Fix fallback for unsupported user-specified runtime unroll factor ( #1780 )
...
* [INIT] Fix fallback for unsupported user-specified runtime unroll factor
* Add CollTrace guard
* Move `commSetUnrollFactor()` to rccl_wrap.cc
* Modify comments in the device-code generator script
2025-07-10 10:56:18 -05:00
Nilesh M Negi
68d6f99e0f
[DEVICE] Fix validation errors for multi-node LL with gfx950 non-coherent system memory ( #1795 )
2025-07-10 09:05:46 -05:00
Mustafa Abduljabbar
058264b3f3
Fix AllReduce regression due to previous max range increase for LL64/LL128 ( #1787 )
...
* Adjust tuning factor impacting more than 2 nodes
* Scale max LL128 size for > 2 nodes
* Retune max LL128 range for N > 2
2025-07-09 19:17:10 -05:00
Atul Kulkarni
a28d5cb986
Enable Google Test's GMOCK feature ( #1773 )
2025-07-09 17:25:44 -05:00
mberenjk
697bee4ee8
Improving build time by removing the gfx11xx and host code from rccl_float8.h ( #1789 )
...
* removing extra build time by removing the gfx11xx arch from using hip_fp8
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-07-09 14:03:47 -05:00
Bertan Dogancay
9c89573580
[GRAPH] Pass rank instead of busId due to a change in an internal function signature ( #1792 )
2025-07-08 08:45:54 -04:00
Marius Brehler
dac0e528a0
Set GTEST_BOTH_LIBRARIES appropriately ( #1669 )
...
If `find_package()` finds GTest and `INSTALL_DEPENDENCIES`
is set to OFF, `GTEST_BOTH_LIBRARIES` is not set, so linking
`rccl-UnitTests` fails with unknown symbols.
2025-07-05 20:38:31 -05:00
Bertan Dogancay
e96c8473a1
[DEVICE] Enable PAT algo for RCCL 1ppn ( #1756 )
...
* Enable PAT algo for RCCL 1ppn
2025-07-04 13:45:18 -04:00
Rakesh Roy
dd3b1d816c
Fix chrono build error ( #1790 )
2025-07-04 08:27:30 -05:00
Wenkai Du
ae9642d4bc
msccl: use special send for LL on gfx950 ( #1788 )
2025-07-03 04:16:18 -05:00
ryanhankins
9d35581d5e
Adding #include <dlfcn.h> in nccl_net.h to pass build ( #1786 )
2025-07-02 19:21:53 -05:00
Nilesh M Negi
9e99c18f6e
[MSCCLPP] Disable format checks in MSCCLPP by default ( #1781 )
2025-07-02 09:11:42 -05:00
Wenkai Du
4640ab19b3
Add support for extended fine grained system memory pool ( #1770 )
...
* Add support for extended fine-grained system memory pool
* Use hipHostRegisterUncached
* Add "sc0 sc1" flags for LL store on gfx950
* Update after HIP flag is changed to hipExtHostRegisterUncached
2025-07-01 16:38:49 -05:00
Nilesh M Negi
3e51c41dcb
[BUILD] Fix packaging for RAS ( #1784 )
2025-07-01 16:37:14 -05:00
Nilesh M Negi
8d3a5542fb
[RAS] Add support for RAS client ( #1748 )
...
Enable RAS client binary `rcclras`
2025-06-29 18:53:16 -05:00
isaki001
75d22b47cb
added tuning table for gfx950 ( #1779 )
...
* added tuning table for mi350
* remove erroneous string
2025-06-29 15:45:39 -05:00
Bertan Dogancay
358dc1bc84
Switch to linear channel mapping for 2 nodes ( #1777 )
2025-06-28 09:10:18 -05:00
Arm Patinyasakdikul
35024ca1cb
[topo-expl] update header file location. ( #1769 )
2025-06-27 15:29:37 -05:00
gilbertlee-amd
16101e654f
Fixing HelloRccl include path to RCCL, fixing some warnings ( #1778 )
2025-06-27 09:12:59 -06:00
Arm Patinyasakdikul
265b1b3775
add warning if workFIFO is not available after multiple retries. ( #1772 )
...
* add warning if workFIFO is not available after multiple retries.
* Update src/enqueue.cc
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
---------
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-06-26 19:49:52 -05:00
Arm Patinyasakdikul
71c788d4d7
Update plugin to look for librccl-net.so. ( #1768 )
2025-06-26 16:59:38 -05:00
mberenjk
5fb9d8f828
changing the HIP-VERSION to 6.3 to avoid using hip_fp8 for older ROCm versions ( #1764 )
...
Co-authored-by: Marzieh Berenjkoub <mberenjk@.amd.com>
2025-06-26 11:15:01 -05:00
Mustafa Abduljabbar
7e2ac00980
Revert LL64 cutoff points based on internal tuning ( #1771 )
2025-06-26 11:59:42 -04:00
Dingming Wu
020dcf0a7c
Add proxyTrace ( #1732 )
...
This feature tracks the proxy events and status of each send/recv op. ProxyTrace keeps a fixed number of active ops in host mem and dumps the status of each op when the program crashes or hangs.
2025-06-25 23:01:34 -05:00
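The bookkeeping described in the proxyTrace commit above (a fixed number of active ops kept in host memory, with a status dump on demand) can be illustrated with a minimal Python sketch. This is not RCCL's implementation: the `OpTracker` class, its method names, and the evict-oldest policy are all hypothetical, chosen only to show the bounded-tracking idea.

```python
from collections import OrderedDict


class OpTracker:
    """Hypothetical fixed-capacity tracker of active ops (not RCCL's API)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.ops = OrderedDict()  # op_id -> latest status string

    def update(self, op_id, status):
        """Record the latest status of an op, evicting the oldest if full."""
        if op_id in self.ops:
            # Refresh an op that is already tracked.
            self.ops.move_to_end(op_id)
        elif len(self.ops) >= self.capacity:
            # Assumed policy: drop the least recently updated op.
            self.ops.popitem(last=False)
        self.ops[op_id] = status

    def complete(self, op_id):
        """Stop tracking an op that has finished."""
        self.ops.pop(op_id, None)

    def dump(self):
        """Return one line per op that is still active."""
        return [f"op {op_id}: {status}" for op_id, status in self.ops.items()]
```

In a real tool, `dump()` would be called from whatever hook detects a crash or hang; that wiring is left out here because the commit message does not describe it.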
Nilesh M Negi
568777a9bf
[BUILD] Move NPKit flags from install.sh to CMakeLists.txt ( #1741 )
2025-06-23 21:51:49 -05:00
corey-derochie-amd
e73db11819
Updated CHANGELOG for LL128 support for gfx942 in 7.0 ( #1719 )
...
* Updated CHANGELOG for LL128 support for gfx942 in 7.0
Also ported 6.4.2 section
* Removed unnecessary note from 7.0
2025-06-23 08:50:12 -06:00
jonatluu
709140204a
Remove File reorganization backward compatibility (rccl) ( #1753 )
2025-06-22 17:18:26 -05:00
Grant Pinkert
2482d1475f
Fix continuous build hang on extract_metadata.cmake ( #1668 )
...
When the `roc-obj-ls` executable fails, it sometimes does not return. Since the `execute_process` command will wait until the executable finishes, this means that in some cases, the build will hang indefinitely. There is no error message, and no indication that anything is wrong. This commit fixes that by introducing timeouts into the code and better error reporting.
2025-06-22 05:54:44 -05:00
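The failure mode described in the extract_metadata.cmake commit above (a child process that never returns, so the caller waits forever with no message) and its remedy (a timeout plus explicit error reporting) can be sketched in Python. This is an illustration of the general technique only, not the commit's CMake code; the helper name `run_with_timeout` is hypothetical. For reference, CMake's `execute_process` does accept a `TIMEOUT` argument, though the commit text here does not say exactly how the timeout was added.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (ok, message) instead of blocking indefinitely."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # The child is killed by subprocess.run before this is raised.
        return False, f"{cmd[0]} did not finish within {timeout_s}s"
    except FileNotFoundError:
        return False, f"{cmd[0]} not found"
    if result.returncode != 0:
        return False, (
            f"{cmd[0]} exited with code {result.returncode}: "
            f"{result.stderr.strip()}"
        )
    return True, result.stdout


if __name__ == "__main__":
    # Simulate a tool that hangs: the call returns after about one second.
    ok, msg = run_with_timeout(
        [sys.executable, "-c", "import time; time.sleep(60)"], 1
    )
    print(ok, msg)
```

The point of returning a message in every failure branch is that a hung or failing tool produces a visible diagnostic rather than a silent stall.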
Bertan Dogancay
675b495a00
[NPKit] Create dump dir regardless of default or user provided path ( #1757 )
2025-06-21 21:18:20 -05:00
Bertan Dogancay
0c1795c64b
Merge pull request #1721 from BertanDogancay/2.26-sync
...
[SYNC] 2.26.6-1
2025-06-20 09:57:09 -04:00
BertanDogancay
aaf023976a
Merge remote-tracking branch 'nccl/master' into develop
2025-06-20 07:54:49 -05:00
Joseph Macaranas
12315c259a
[Azure CI] rccl nightly pipeline that runs on slurm ( #1723 )
...
* [Azure CI] rccl nightly pipeline that runs on slurm
- Login node will be set up as a self-hosted agent on Azure Pipelines.
- Login node will run this job nightly.
- Login node will check out the latest develop source, run build and test through sbatch calls, and wait for the jobs to complete. When the jobs are complete, it prints out the logs.
2025-06-19 10:41:40 -05:00
Nilesh M Negi
92a5d225d9
[MSCCLPP] Disable MSCCLPP Executor ( #1744 )
...
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
2025-06-17 01:29:55 -05:00
Sarat Kamisetty
fa0422f174
generic net plugin ctxt that is extensible for use in multiple APIs ( #1735 )
...
Co-authored-by: Sarat Kamisetty <sakamiset@amd.com>
2025-06-16 14:48:08 -07:00
Bertan Dogancay
39211c6b41
[NPKit] Use default output directory when env var is not set ( #1747 )
2025-06-16 15:26:53 -04:00
Mustafa Abduljabbar
fb4ad82d0d
Fix topo_explorer compatibility and capture WarpSize ( #1743 )
2025-06-16 08:18:35 -04:00
Tim
ba97c9c18b
replayer update v0 ( #1733 )
...
* First version of new replayer, with comments on future TODOs
* plus minor fixes for UT
* Updated format of recorder, especially in binary department, according to replayer's need
2025-06-13 15:05:34 -04:00
Richard Barnes
4486d091b8
Enable -Wdeprecated-copy-with-user-provided-copy ( #1643 )
2025-06-13 08:23:31 -07:00
Arm Patinyasakdikul
6c37ae9470
Added missing copyright message. ( #1742 )
...
* Added missing copyright message.
* addressed comments.
2025-06-12 09:58:01 -05:00