Commit Graph

2041 Commits

Author SHA1 Message Date
Avinash 9545ae04b2 Navi4 LL enablement and tuning (#2095)
* LL enablement for gfx1201

* Single node LL/Simple tuning

* multinode algo/proto default choice

* First iteration of Table tuning

* gfx942 tuning table correction

* Addressing PR comments and prefix match fix
2026-01-05 10:17:12 -06:00
Nusrat Islam f756aa9add gfx950: restrict maxChannels to 48 for multi-node collectives (#2116)
* gfx950: restrict maxChannels to 48 for multi-node collectives

* change env name for reduced CU config

---------

Co-authored-by: Nusrat Islam <nusislam@dell300x-ccs-aus-k13-09.cs-aus.dcgpu>
Co-authored-by: Islam <nusislam@amd.com>
2025-12-31 09:28:19 -06:00
amd-jiali 935208ad09 Add an environment variable to allow users to explicitly turn off direct AllGather (#2119)
Co-authored-by: Jiali Li <jialili@amd.com>
2025-12-29 16:43:40 -08:00
Avinash 6f62165369 Virtual device enablement (Minimal changes) (#2110)
* minimal changes

* Setting Default tuning table

* Add warnings for NIC merge across PCIe Root complexes, NUMA

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
2025-12-25 15:06:33 -06:00
Corey Derochie f942810959 Updated troubleshooting-rccl.rst to change rocm-smi to amd-smi (#2028)
* Updated troubleshooting-rccl.rst to change rocm-smi to amd-smi

* Added `amd-smi static --driver`

* Update docs/how-to/troubleshooting-rccl.rst

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>

---------

Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
2025-12-23 21:22:11 +05:30
Karthikeyan Arumugam 9f4651f20f Add support for AMD AINIC within RCCL default internal network plugin. (#2078)
* Added support for AMD ROCm net-ib alongside vanilla net-ib, with auto-generation to detect conflicts early during NCCL sync and enable future customizations.
* Integrated AMD AINIC support in RCCL for out-of-the-box usage, leveraging performance improvements by default, channel pinning for optimal pipeline performance, and extended support for 32B in-line CTS messages.
* Implemented internal derivation of AINIC-specific flags when RCCL AINIC environment parameter is set, and checks before initializing AINIC net-ib methods.
* Included snapshot of auto-generated ROCm net-ib file (src/transport/net_ib_rocm.cc) for reference.
* Fixed typos in RCCL param API (RCCL_AINIC_ROCE) and dlclose.
* Updated plugin loading logic:
  - Load internal ROCmIB plugin only when NCCL_NET_PLUGIN is not set.
  - Load default internal net-ib only when not AINIC and no external plugin env is set.
2025-12-23 10:33:10 -05:00
Geo Min 4f474a7389 Revert "Adding org var and dynamic runner selection (#2106)" (#2114)
This reverts commit 2e193aed68.
2025-12-19 12:53:09 -08:00
alexander-sannikov 50568dc93d Tuning: use constant value for CorrectionFactor tables 2025-12-18 18:55:03 +00:00
alexander-sannikov dea50b5e11 Tuning: fixed out-of-bound access 2025-12-18 18:55:03 +00:00
Atul Kulkarni 313b98281c Revert: Restore default symbol visibility for tests in debug mode (#2111) 2025-12-18 11:20:12 -06:00
Pedram Alizadeh f0e7e8745f Adding tuning conf file for CU reduction for AR, AG, and RS with under-subscribed number of GPUs per node (#2102) 2025-12-17 16:58:54 -05:00
Atul Kulkarni 74690ea705 Removes default visibility in debug mode and updates unit tests for alt_rsmi impl (#2091)
* Update unit tests for alt_rsmi impl

- Create distinct test executable for alt_rsmi testing
- Updated alt_rsmi tests to use public methods
- Compiles alt_rsmi.cc with ARSMI_TEST_BUILD
- Enables external linkage of internal variables
- Only for AltRsmiTests.cpp that manipulates internals
- Clean separation for test behavior

* Address review comments

* restore hidden symbol visibility
2025-12-17 10:27:00 -08:00
Geo Min 2e193aed68 Adding org var and dynamic runner selection (#2106) 2025-12-16 10:41:57 -08:00
Mustafa Abduljabbar 596567ff95 Keep P2P self-copy for batched ops to prevent >32N hang. (#2108) 2025-12-16 11:56:39 -05:00
isaki001 7c1049d2a4 Remove node-count and threshold restrictions from p2p-batching (#2077)
* remove node-count and threshold restrictions from p2p-batching

* remove batching threshold usage, fix typo for using batching-enablement flag

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-12-15 19:55:46 -05:00
Mustafa Abduljabbar 5787c960fc Add fix for WarpSpeed auto mode (#2104) 2025-12-12 17:56:52 -05:00
Mustafa Abduljabbar d009ab144e [Device] WarpSpeed enablement and single node CU and perf opt for MI350 (#2073) 2025-12-11 19:04:35 -05:00
Ahmed Khan 08dd75712f Add ncclCommDump API (#2068)
* Add ncclCommDump API

* remove trailing whitespace changes

* Add more proxy trace timestamps

* Add facebook_rccl namespace before proxyTrace timestamp call

* Clean up ProxyTrace construction

* Move updateProxyOpCounter to member function

* Move setProxyOpTimestamp to member function

* Move addNewProxyOp to member function

* Make internal methods private

* Make ProxyTrace thread safe

* Fix unit tests

* Fix overwritten ProxyTrace DONE setting in net.cc
2025-12-11 15:02:35 -07:00
Thomas Huber 1f2f9f33ba Update gfx950 tuner conf to include broadcast (#2065)
Signed-off-by: Thomas Huber <thomas.huber@amd.com>
2025-12-11 14:36:03 -05:00
Mustafa Abduljabbar 2cf6a9bb19 Add WAIT_PEER NPKIT event (#2100) 2025-12-11 11:18:41 -05:00
Geo Min 5384a8abb2 Correct runner name (#2098) 2025-12-10 11:44:48 -08:00
corey-derochie-amd 18e9ad913b Fixed unit-test env var list parsing and improved filtered test run speed (#1626)
* Fixed parsing of env var lists which were overwriting the mutable env var string and polluting future parses.

* Fixed all tests to obey UT_DATATYPES and UT_REDOPS filters.

* Allow tests to bail early via `GTEST_SKIP` if UT_DATATYPES or UT_REDOPS filters give a test size of zero. This allows tests to run much faster with filters on.

* Wrapped the support checks in helper functions on `TestBed`.
2025-12-10 10:06:44 -07:00
Geo Min 6af9087b0c [ci] Bumping TheRock CI commit hash (#2097)
* Bumping TheRock CI commit hash

* fixing artifact group
2025-12-09 16:25:57 -08:00
Atul Kulkarni 7e10267dfd Added a Process Isolated Test Runner (#1993)
* Added single process isolation support to execute tests

* Address review comments

* Update README

* Removed requirement of explicit call to clear method

* Added macros for simplified usage

* Updated tests to use process isolation framework

* Adjust summary output format for isolated tests

* Updated rccl_wrap tests

* Used process isolation in AllocTests

* Used process isolation and fixed failing tests

* Modified test output, added signal handling

Updated macros to handle lambdas

* Convert argcheck tests to isolated tests

* Convert proxy tests to isolated tests

* Remove non-supported test

* Fixed file descriptor handling and clearing env vars for tests
2025-12-08 10:36:05 -06:00
Atul Kulkarni 29e1567b95 Enable MPI support to execute MPI specific unit/functional tests (#1996)
* Added MPI support to execute unit/functional tests

Update node and process validation
Updated node detection count and modified validation method
Update validation logic to include max procs and nodes

* Address review comments

* Fix warnings

* Added a new NET transport test and clean up

* Added MPI test logging mechanism

* Decoupled GTest framework

* Added Net IB functional tests

* Updated with resource guards

* Added NET IB tests and refactored code

* Update P2pWorkflow test

* Update documentation

* Add MPI_TESTS_ENABLED guard to the file

* Fix Shm and NetIB tests

* Applied refactoring and cleanup

* Replaced BufferGuard with AutoGuard

* Modified test debug logging

* Use macro to reduce NcclTypeTraits code duplication

- Replace repetitive template specializations with a single
  DEFINE_NCCL_TYPE_TRAIT macro
- Use stringification operator (#) to auto-generate type name strings
- Add #undef to keep macro from polluting namespace
- Makes adding new type mappings trivial

* Unify buffer initialization with generic pattern function

- Remove initializeBufferWithCustomPattern
- Make initializeBufferWithPattern generic with PatternFunc template param
- Now single function handles all patterns via lambda injection
- Updated all test files to use lambdas for pattern generation
- Pattern logic now visible at call site (self-documenting)

* Unify buffer verification with pluggable pattern function

- Remove verifyBufferWithCustomCheck
- Make verifyBufferData generic with PatternFunc template param
- Single function handles all verification patterns via lambda injection
- Updated all test files to use lambdas
- Better defaults: num_samples=0 means verify all elements
- Pattern logic now visible at call site (self-documenting)

* Docs: Add DeviceBufferHelpers section to MPITestRunner.md

- Document new refactored buffer initialization/verification API
- Explain pluggable pattern functions with lambda examples
- Show type mapping and automatic float/int comparison
- Include migration guide from old API to new unified functions
- Demonstrate best practices with real-world examples
- Reference recent refactoring commits (macro-based type traits)

* Docs: Update documentation and examples

- Update on DeviceBufferHelpers
- Update examples using DeviceBufferHelpers methods, e.g. data verification

* Address review comment.

- Replace manual pattern generation loop with initializeBufferWithPattern call
- Use downloadBuffer to get host copy instead of manual hipMemcpy

* Remove non-existent dependency

* Remove duplicate testcase

* Code cleanup in test files

* Moved common constants to base class
2025-12-06 16:05:37 -06:00
Atul Kulkarni 8ad446b271 Remove legacy AltRsmi tests (#2090)
These tests will be replaced by new tests.
2025-12-05 16:53:55 -06:00
Atul Kulkarni 0d797d1f6c Remove legacy Shm and P2p tests (#2089)
These tests will be replaced by MPI tests.
2025-12-05 16:53:28 -06:00
Atul Kulkarni 7ec8e73e12 Remove static to non-static conversion used in tests (#2084)
* Remove coll_reg tests which are unsupported

* removed static to non-static conversion feature
2025-12-04 18:03:14 -06:00
Atul Kulkarni 892d258319 Add missing header in alloc.h (#2086) 2025-12-04 11:26:19 -06:00
Atul Kulkarni cc6e259a02 Fix rccl test suite to use hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2082) 2025-12-04 10:02:06 -06:00
Atul Kulkarni 7c12b0b76b Added new unit tests for AllReduce with Bias API (#2036)
* Added new unit tests for AllReduce with Bias API

* Address review comments
2025-12-03 17:37:34 -06:00
Wenkai Du 185e78a8f0 Use one side stream per process (#2063)
* Use one side stream per process

* Handle multiple GPUs per process

* Reset stream when not found

* Address review comments

* Fix missing mutex initializer
2025-12-02 10:03:15 -08:00
corey-derochie-amd 4acd0f64ea Add copyright to src/device/symmetric/all_reduce.cuh (#2080) 2025-11-27 14:29:21 -07:00
isaki001 da183596cd add back missing proxy-counter updates (#2052) 2025-11-25 15:22:34 -06:00
Kapil S. Pawar 5fd86021a8 [RcclReplayer] JSON <-> BIN log format conversion tool (#2056)
* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer
2025-11-24 11:51:36 -06:00
Nilesh M Negi db52690c2a [AzureCI] Increase timeout of per PR and nightly pipeline to 240 mins (#2074) 2025-11-24 10:55:36 -06:00
AbandiGa b14e32c46e Fix rcclNetP2pPolicy issue (#2072)
* fix rcclNetP2pPolicy issue

* change the comment to ncclNetIb
2025-11-21 18:28:10 -06:00
Nilesh M Negi fca0015962 [AzureCI] Increase UT timeout to 3 hours (#2070) 2025-11-21 09:18:49 -06:00
Matt Williams 3495baa6b2 Fix ToC in API Library page (#2053)
* Add intro and remove ToC
2025-11-20 09:35:15 -05:00
Kapil S. Pawar ab1eb6e70e [AzureCI] Add RcclReplayer to CI (#2048) 2025-11-18 12:21:24 -06:00
Kapil S. Pawar c7f400dbff Functional Tests for Ext-Profiler Plugin (#2007)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Added tests for ext-profiler

* Updated readme

* Updated readme
2025-11-18 11:20:39 -06:00
Arm Patinyasakdikul 461e61d10e Added install.sh flag to suppress warnings. (#2054) 2025-11-17 00:35:06 -06:00
Pedram Alizadeh fb67e5b467 Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037)
* Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic

* Switching to hip_bf16.h from ROCm 6.0.0
2025-11-13 15:56:18 -05:00
AbandiGa 277b6e9bac Disable bfloat16 pipelining for reduction collectives for gfx950 (#2047)
* disable bf16 reduce_copy pipelining for gfx950

* edit CHANGELOG

* Combine unroll and pipeline local arch calculation into single function

* fix multi-node error and disable for gfx950 even if it's not a local build

* removed has_gfx950

* disable pipelining for gfx950 in rcclSetPipelining

---------

Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com>
Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>
2025-11-13 14:55:09 -06:00
isaki001 0d09f86608 Post thread-block size increase tuning (#2042)
* for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank

* don't override direct allgather threshold if set to -1

* restore 2-node AR simple at earlier message sizes than higher multi-node AR

* extend range of LL for single-node RS on gfx950

* update algo/proto for multi-node allreduce on gfx942

* set single-node AR on gfx950 to Tree LL for KB message sizes

* decrease threshold for single node Tree for gfx950 AR
2025-11-13 14:51:04 -06:00
Bertan Dogancay 83ffc82fa7 [Launch] Move cudaEventRecord call to capturing stream only (#2050) 2025-11-13 08:38:09 -06:00
gilbertlee-amd 46b032b760 [GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031) 2025-11-12 19:34:27 -06:00
nawrinsu c488c5307e Add tuner config file (2,4,8 nodes) for gfx950 (#2012)
* Add tuner config file (2,4,8 nodes) for gfx950

* remove alltoall

* Added comment regarding allgather direct
2025-11-12 09:16:36 -08:00
Kapil S. Pawar c8da880dc7 Added Functional Tests for CSV Tuner Plugin (#1968)
* Add functional tests for CSV Tuner Plugin

* Updated directory structure

* Updated and renamed directories

* Updated csv conf files

* Updated readme

* Updated readme

* Updated readme
2025-11-11 10:11:19 -06:00
Dingming Wu b811645688 Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027) 2025-11-11 09:46:51 -06:00