1885 Commits

Author SHA1 Message Date
BertanDogancay 08a7be231b Merge remote-tracking branch 'nccl/master' into develop 2025-08-28 15:46:28 -05:00
Avinash a0ec15bafe [build] Disable MSCCL++ compilation by default (#1879)
* Enable MSCCLPP on request

* Updating docs and README

* Updates to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* Updates to CHANGELOG.md

* Update CHANGELOG.md

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

* Update CHANGELOG.md

Github didn't take the edit to my suggestion properly.

---------

Co-authored-by: amd <amd@super3.amd.com>
Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-08-28 08:52:12 -06:00
Nilesh M Negi d73cee7588 [AzureCI] Switch to ROCm 6.4.1 and add rccl-tests (#1782)
* Use ROCm 6.4.1 for testing
* Extend RCCL-Tests to multi-node
* Add HSA_NO_SCRATCH_RECLAIM to UT runs
* Limit to single-node rccl-tests for now
2025-08-27 21:07:53 -05:00
jonatluu 4699bff790 fix lintian warning package-contains-timestamped-gzip (#1865)
* fix lintian warning package-contains-timestamped-gzip

* fix lintian warning
2025-08-27 13:29:07 -04:00
Geo Min f404624d9e [TheRock CI] Adding single node tests for RCCL (#1876)
* Add single-node testing

* Adding single node test

* Adding quotes

* fix typo

* Adding test flag

* No MPI

* Adding openmpi install

* Adding comment

* PR comments

* Missing proj

* Adding half

* Adding rocr runtime

* Adding them all

* new sha

* Fixing script

* Removing confusing skip test case

* Adding docs

* Update .github/workflows/therock-test-packages-single-node.yml

Co-authored-by: Marius Brehler <marius.brehler@amd.com>

---------

Co-authored-by: Marius Brehler <marius.brehler@amd.com>
2025-08-27 08:13:10 -07:00
Nusrat Islam df448862c3 Device allocation tracker (#1878)
* alloc: add memory allocation tracker

* alloc: add tracker for ncclCuMemAlloc() APIs

* alloc: add null pointer check during free
2025-08-27 09:30:51 -05:00
Kapil S. Pawar c9becd89cd Code coverage tests for param.cc (#1872)
* Added code coverage unit tests for param.cc

* Updated ParamTests.cpp and removed ParamTestsConfFile.txt

* Updated ParamTests.cpp

* Removed NCCL_LOG_INFO and added sample config file

---------

Co-authored-by: Pawar <kpawar@ctr2-alola-ctrl-01.amd.com>
2025-08-27 09:30:37 -05:00
ishkool c288fbf1b2 Code coverage tests for net_socket.cc (#1840)
* Code coverage UTs for net_socket.cc

* Addressed review comments

---------

Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com>
2025-08-27 09:24:21 -05:00
Marius Brehler 221205ebd4 Bump TheRock version used for testing (#1885) 2025-08-27 16:22:27 +02:00
Mustafa Abduljabbar 277747c199 [Device] Add dynamic fetch/reduce pipelining for reduction collectives - Simple protocol (#1861)
* Support pipelining codegen and template specialization

* Support ReduceCopy pipelining for AllReduce, ReduceScatter, and Reduce (currently enabled for bfloat16)

* Remove need for FUNC_INDEX_TOTAL

* Add pipeline field to device function key construction logic

* Avoid unneeded codegen for LL/LL64 kernels

* Modify conditions and add pipeline dtypes env

* Optimize selection for both gfx942 and gfx950

* Increase pipeline bitfield width

* Use __forceinline__ for all device functions

* Realign reduceCopy with original form

* Add opt-out option to enable perf debugs

* Remove force-reduce-pipelining option from README

* Update CHANGELOG.md

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-26 15:03:54 -04:00
Nusrat Islam b882af9ffd fixup: remove extra semicolon (#1881) 2025-08-26 10:57:25 -05:00
Jeffrey Novotny 64f8e01b76 Docs: Fix formatting for Docker guide (#1882)
* Docs: Fix formatting for Docker guide

* Incorporate feedback
2025-08-26 10:18:32 -04:00
Mustafa Abduljabbar dfad51e3c9 Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl

* Fix dependencies and fetch fmt from sources

* Remove third_party folder in make clean

* Add empty target when fmt is found

* Add MI350 example

* Update README.md

---------

Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
2025-08-26 10:11:38 -04:00
Nusrat Islam 5e7937effb Add direct allgather algorithm (#1868)
* add direct allgather algorithm

* minor fix

* add debug print for memory allocation tracker

* add message size threshold for direct allgather

* scatter transfers across ranks

* update changelog

* minor fix

* Update CHANGELOG.md

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>

* enable direct AG when pxn is ON on MI300X or MI350

---------

Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>
2025-08-25 07:55:10 -05:00
corey-derochie-amd b88c134874 Changed TestBedChild to avoid hang if the call fails (#1875)
Changed `TestBedChild` protocol to send the result code before the return value to avoid hanging if the call fails. Switched `TestBedChild::GetUniqueId` to use this.
2025-08-23 00:17:34 -05:00
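The ordering change described above (result code first, return value second) can be sketched as follows. This is a minimal illustration over a POSIX pipe, not the actual `TestBedChild` code: the names `sendReply` and `recvReply`, the `int32_t` status, and the `uint64_t` payload are all hypothetical. The point is that the reader learns whether a payload follows before it tries to read one, so a failed call cannot leave it blocked.

```cpp
#include <cassert>
#include <cstdint>
#include <unistd.h>

// Hypothetical child-side helper: the status is always written first and the
// payload follows only when the call succeeded (status == 0).
bool sendReply(int fd, std::int32_t status, std::uint64_t payload) {
  if (write(fd, &status, sizeof(status)) != static_cast<ssize_t>(sizeof(status)))
    return false;
  if (status != 0)
    return true;  // failure: nothing else is sent
  return write(fd, &payload, sizeof(payload)) == static_cast<ssize_t>(sizeof(payload));
}

// Hypothetical parent-side helper: read the status, and read the payload only
// when the status says one was sent. Returns the status, or -1 on I/O error.
std::int32_t recvReply(int fd, std::uint64_t* payload) {
  assert(payload != nullptr);
  std::int32_t status = -1;
  if (read(fd, &status, sizeof(status)) != static_cast<ssize_t>(sizeof(status)))
    return -1;
  if (status != 0)
    return status;  // no payload follows, so do not block waiting for it
  if (read(fd, payload, sizeof(*payload)) != static_cast<ssize_t>(sizeof(*payload)))
    return -1;
  return 0;
}
```

With the opposite order (payload first), a parent waiting on the payload of a failed call would have nothing to read and would hang, which is the failure mode the commit message describes.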
Nilesh M Negi bf6660ee4e [BUILD] Populate host_table entries only for 1 unroll (#1871) 2025-08-23 00:15:38 -05:00
awelling2801 a1a65c65c4 Added new tests for rccl_wrap - rcclUpdateThreadThreshold (#1855)
* Added tests for rccl_wrap - rcclUpdateThreadThreshold

* Skipped tests gtest_skip added

* Added tests for new functions rcclSetP2pNetChunkSize and rcclSetPxn

---------

Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com>
2025-08-21 16:39:53 -05:00
Marius Brehler 5ae5eb9440 Add a badge for TheRock CI (#1874)
Adds a badge for TheRock CI and moves the existing badge to the top.
2025-08-21 21:54:37 +02:00
Geo Min f9a957bbab [TheRock CI] Adding TheRock RCCL tests (#1873)
* First commit for rccl multi node test workflow

* Adding workflow dispatch

* Added branch based pull trigger

* Changed typo in branch name

* Add input variables to push

* Removed input variables to push

* Added self hosted runner for Vultr cloud

* Skipping build and only running test

* Changed test runner label name

* Made changes to executable paths in test script

* Made changes to run

* Made changes to cd into cvs dir

* This is a dummy commit

* Added cmake options

* Modified build options

* Committing build changes

* Adding rccl and rccl-tests

* Re-ordering rccl and rccl-tests

* adding --global command

* modified cmake command

* modified script paths

* Testing OIDC for rccl repo

* Testing OIDC for rccl repo

* Testing build and upload workflow

* use default env variable for AMDGPU families on push workflow trigger

* Adding cleanup and correct role

* Adding additional yml files

* Fixing typo

* Adding new sha

* Adding correct gpu target

* Adding back venv bin activate

* Adding workflow dispatch for tests

* Testing

* Adding cat

* Adding cat

* Adding rocm dir change

* Adding checkout

* cat with sudo

* rccl checkout

* correcting branch

* removing sudo

* trying to adjust correct path

* Adding output dir path

* Use docker container with pre-installed MPI

* Adding back build steps

* Fixing SHA

* Adding exclusion logic

* Adding test

* Adding CI check

* Removing testing

* Limit to build only rccl, rccl-tests and required dependencies

* Adding test

* Removing test

* Removing quote

* Reverting test

* PR comments

---------

Co-authored-by: arravikum <arravikum@amd.com>
Co-authored-by: Marius Brehler <marius.brehler@amd.com>
2025-08-20 15:07:23 -07:00
Arm Patinyasakdikul 28a83c3ea6 Removing "Could not find any local path from gpu X to net." warning (#1866)
* Removing "Could not find any local path from gpu X to net." warning to avoid confusion.
2025-08-20 16:52:35 -05:00
Arm Patinyasakdikul 9d3acffa5f Test: delete child object to address memory leak. (#1863) 2025-08-20 10:15:03 -05:00
Arm Patinyasakdikul fb882e80f6 Remove noinline attribute from reduceCopyPacks and (#1864)
reduceCopyPacksWithBias.
2025-08-19 20:24:31 -05:00
Atul Kulkarni 231449c896 Added new code owners (#1869) 2025-08-19 16:32:25 -05:00
Mustafa Abduljabbar c1b3cd8911 Have ncclDevFuncId use 64-Bit keyed map with field packing (#1857)
- Updated ncclDevFuncId to use a hash-based lookup with std::unordered_map.
- Keys are now 64-bit integers, which pack coll, algo, proto, devRedOp, and type fields.
- Improved flexibility and maintainability by moving away from row-based indexing.
- Added error handling for missing keys in the hash map.
- Aligned key generation logic with generate.py and updated generate.py.
2025-08-19 16:41:19 -04:00
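The packed-key lookup described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the RCCL implementation: the 8-bit field widths, the field order, the `FuncFields` struct, and the names `packKey` and `lookupFuncId` are hypothetical, and the real layout has to match whatever `generate.py` emits.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical layout: five 8-bit fields packed into the low 40 bits.
struct FuncFields {
  std::uint32_t coll, algo, proto, devRedOp, type;
};

constexpr unsigned kFieldBits = 8;
constexpr std::uint64_t kFieldMask = (1ULL << kFieldBits) - 1;

// Pack the fields into a single 64-bit key, with coll in the lowest byte.
inline std::uint64_t packKey(const FuncFields& f) {
  // Each field must fit in its slot, otherwise two different keys could collide.
  assert(f.coll <= kFieldMask && f.algo <= kFieldMask && f.proto <= kFieldMask &&
         f.devRedOp <= kFieldMask && f.type <= kFieldMask);
  return (static_cast<std::uint64_t>(f.coll) << (0 * kFieldBits)) |
         (static_cast<std::uint64_t>(f.algo) << (1 * kFieldBits)) |
         (static_cast<std::uint64_t>(f.proto) << (2 * kFieldBits)) |
         (static_cast<std::uint64_t>(f.devRedOp) << (3 * kFieldBits)) |
         (static_cast<std::uint64_t>(f.type) << (4 * kFieldBits));
}

// Look up a function id by packed key; -1 signals a missing key so the caller
// can report an error instead of indexing out of range.
inline int lookupFuncId(const std::unordered_map<std::uint64_t, int>& table,
                        const FuncFields& f) {
  const auto it = table.find(packKey(f));
  return it == table.end() ? -1 : it->second;
}
```

Compared with row-based indexing, a keyed map lets a new field be added by widening the key rather than renumbering every row, which is presumably the maintainability gain the commit refers to.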
Nusrat Islam 6ade5065b4 device: optimize threadfence for ll64 protocol (#1858)
* device: optimize threadfence for ll64 protocol

* device: use __atomic_signal_fence()

---------

Co-authored-by: Nusrat Islam <nusislam@useocpslog-003.amd.com>
2025-08-18 09:16:41 -05:00
ishkool 876f985e0f Code Coverage: Proxy.cc tests (#1818)
* Proxy.cc tests

* Update ProxyTest.cpp

Cleaned up the code.

* Update ProxyTests.cpp

Bring back deleting dynamically allocated memory
2025-08-15 19:06:32 -05:00
Atul Kulkarni 84f3cc6a02 Added new unit tests for src/enqueue.cc (#1853) 2025-08-15 18:26:26 -05:00
ishkool 6453273aa6 Code Coverage Unit Tests for comm.h (#1783)
* File containing test for comm.h

* Update CommTest.cpp

Added gtest API for assert

* Update CommTest.cpp

Adding copyright

* Update CommTest.cpp

Removing info and tested as not required.

* Update and rename CommTest.cpp to CommTests.cpp

* Update CMakeLists.txt
2025-08-15 17:44:24 -05:00
Nilesh M Negi c3b8de4ec8 [DEVICE] Use noinline for LLGenericOp only on gfx950 (#1849) 2025-08-15 15:15:02 -05:00
isaki001 44121db890 [TUNING] gfx950 16N tuning (#1835)
* change gfx950 algo/proto selection for multinode allreduce, allgather, reduceScatter
* gfx950 tuning: enable tuning for broadcast, allreduce starts LL128 earlier and switches to ring earlier, change LL128 start for allgather and reduceScatter
* lower LL128 threshold
* update reduceScatter LL128 min to match LL max for consistency
* enable multinode PXN and increase chunksize for gfx950
* change LL128 start to 128KB, adjust ring-start according to node-count
* disable code-path for fused-AR on LL128 for gfx950
* use LL128 starting from 1KB for multinode allgather on gfx950
* start LL128 earlier for multinode reduceScatter on gfx950
* start LL128 earlier for multinode broadcast on gfx950
* set multinode allreduce to start simple on 64MB for gfx950
* start LL128 from 1KB for multinode broadcast on gfx950
* setting multinode AR to use tree instead of ring at 16MB, 64MB, 128MB
* set multinode broadcast to use LL for up to 256KB depending on node-count for gfx950
* adjust algo for 32MB multinode allreduce on gfx950
* make 32MB tree LL128 for multinode AR on gfx950
* make sure ring is not picked on 2N allreduce on small sizes
2025-08-15 15:12:45 -05:00
alex-breslow-amd 1aa2570b48 Disable the __threadfence on the sender side of the simple protocol when possible. (#1830)
Leverages the traits of extended-scope fine-grain memory to get rid of a device-scope acquire-release fence.  This improves throughput for single node workloads on gfx942 and gfx950 for some input sizes (e.g., ~32 MiB to about 256 MiB) when using the simple protocol.  Multinode workloads on MI300X see a smaller but statistically significant uplift for some message sizes.  Runtime disablement is supported via setting the environment variable RCCL_GFX942_CHEAP_FENCE_ON to 0.
2025-08-15 07:54:54 -07:00
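The runtime opt-out mentioned above could be read along these lines. Only the variable name `RCCL_GFX942_CHEAP_FENCE_ON` and the meaning of the value 0 come from the commit message. The helper names and the choice to treat every other value (or an unset variable) as enabled are assumptions made for illustration; RCCL's own parameter handling may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: the optimization stays on unless the value is exactly "0".
// A null pointer (variable not set) is treated as "enabled".
inline bool cheapFenceEnabled(const char* value) {
  return !(value != nullptr && std::strcmp(value, "0") == 0);
}

// Hypothetical wrapper that consults the environment variable named in the commit.
inline bool cheapFenceEnabledFromEnv() {
  return cheapFenceEnabled(std::getenv("RCCL_GFX942_CHEAP_FENCE_ON"));
}
```

Under these assumptions, launching a job with `RCCL_GFX942_CHEAP_FENCE_ON=0` in the environment makes `cheapFenceEnabledFromEnv()` return false, which is useful when comparing performance with and without the fence optimization.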
mberenjk c61152baa4 Added useAcc as a template parameter to address the performance regression (#1856)
* Added useAcc as a template parameter to address the 2% performance regression in allreduceWithBias
---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2025-08-14 15:58:54 -05:00
Adel Johar aaf8613b76 Docs: Add environment variables reference page 2025-08-14 09:55:28 +02:00
Karthikeyan Arumugam 6d41e5ba99 Add cstring header explicitly as it is removed from HIP (#1859) 2025-08-13 15:14:22 -07:00
Rahul Vaidya ee9ed3ef87 [BUILD] Fix UT packaging on Debian family OS (#1854)
* Fix UT packaging on Debian family OSes

Signed-off-by: ravaidya <ravaidya@amd.com>

* Split OR condition when performing Debian checks

Signed-off-by: ravaidya <ravaidya@amd.com>

---------

Signed-off-by: ravaidya <ravaidya@amd.com>
2025-08-11 17:03:16 -05:00
Chris Sosa 53977821b5 Add CI Badge for tracking CI status in prep for gating changes (#1851)
This PR is intended to move RCCL to gating changes on CI failures. Right now, only build/unittests run per PR consistently. We should eventually add all single and multi-node test status badges once those tests are running in presubmit and continuously on develop
2025-08-11 14:02:46 -07:00
Nilesh M Negi 5036d0e713 [BUILD] Fix UT packaging on Debian OS (#1848) 2025-08-11 09:43:26 -05:00
Rahul Vaidya cbbc713b03 Fix rccl-UnitTests packaging on Debian systems (#1846)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-08-08 12:28:56 -05:00
isaki001 74d82a8145 enable more events for LL128 NPKIT trace collection (#1827) 2025-08-07 11:19:36 -05:00
awelling2801 82bea39280 Created coverage tests for rccl_wrap (#1694)
* Created coverage tests for rccl_wrap

RCCL_EXPOSE_STATIC off by default

Coverage tests for rccl_wrap.cc

* Remove RCCL_EXPOSE_STATIC dependency

* Removed Rcclwrap.RcclGetAlgoInfoTest

* Remove comments

* Corrected RCCL_EXPOSE_STATIC definition logic

---------

Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
Co-authored-by: Atul Kulkarni <atul.kulkarni@amd.com>
2025-08-06 14:48:00 -05:00
Avinash 3f8cac388e Compiler warnings fix 2 (#1801)
* Changes to device code

* Changes to src/misc

* Changes to graph

* src/include changes

* src/transport changes

* changes in init, enqueue, proxy

* Changes to CMakeLists.txt

* Additional changes to device code

* Additional changes to net.cc

* adding 'compiler warning' tag to ease upstream merge

* typo correction

* Addressing comments

* Additional changes for new commits
2025-08-05 17:36:23 -05:00
Arm Patinyasakdikul 6fc228e247 Disable context tracking for the current version. (#1839) 2025-08-04 10:48:00 -05:00
Atul Kulkarni 0e7d7da55d Add unit tests for graph/xml.cc & graph/xml.h (#1833)
* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.

* Added new unit tests for src/transport/shm.cc

* Added new unit tests for graph/xml.cc
2025-08-01 14:20:27 -05:00
Atul Kulkarni e2c9f2feab Update help text in README (#1837) 2025-08-01 14:19:27 -05:00
awelling2801 5ecc1b7ede Added tests for coll_reg (#1700)
Changes to coll_reg

Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
2025-07-31 13:49:23 -05:00
dependabot[bot] 32e95963dc Bump urllib3 from 2.2.2 to 2.5.0 in /docs/sphinx (#1751)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.2 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.2...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-31 11:25:45 -06:00
dependabot[bot] 1acc3eb6c1 Bump rocm-docs-core from 1.18.2 to 1.22.0 in /docs/sphinx (#1836)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.18.2 to 1.22.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.18.2...v1.22.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-version: 1.22.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-07-31 11:15:01 -06:00
awelling2801 7320752bf3 Added tests for transport.cc (#1725)
Co-authored-by: Welling <awelling@ctr2-alola-login-01.amd.com>
2025-07-31 11:04:28 -05:00
Rahul Vaidya 0adc5edc74 Fix RHEL10 packaging for rcclras and rccl-UnitTests (#1831)
Signed-off-by: ravaidya <ravaidya@amd.com>
2025-07-31 11:00:49 -05:00
Nilesh M Negi bd55f876e9 [DEVICE] Add unroll=2 for gfx950 multi-node (#1824) 2025-07-31 02:35:26 -05:00