312 Commits

Author SHA1 Message Date
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
Atul Kulkarni 30d36661c2 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to their own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 0c2c61d2f1]
2026-01-08 10:04:41 -06:00
Atul Kulkarni 86a4dd95f6 Remove static to non-static conversion used in tests (#2084)
* Remove coll_reg tests which are unsupported

* removed static to non-static conversion feature

[ROCm/rccl commit: 7ec8e73e12]
2025-12-04 18:03:14 -06:00
Kapil S. Pawar 566671910a [RcclReplayer] JSON <-> BIN log format conversion tool (#2056)
* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer

[ROCm/rccl commit: 5fd86021a8]
2025-11-24 11:51:36 -06:00
Kapil S. Pawar 6bbc4b5d48 [RcclReplayer] Compile without the need for RCCL to be compiled (#2039)
[ROCm/rccl commit: acdafac49f]
2025-11-10 15:38:48 -06:00
Avinash 5ca67dc803 Empty kernel test enhancements [tools] (#1999)
* Initial commit

* Improvements-1

* Initial commit for PR

* Updates warning, run.sh, decoupled loops

* Forcing seq cst for CPU timing

[ROCm/rccl commit: 85baa0d113]
2025-11-07 12:28:06 -06:00
Bertan Dogancay 4c7afea115 [Tools/Replayer] Fix prohibited calls during capture mode (#1938)
[ROCm/rccl commit: b703ffdfa4]
2025-10-29 12:19:32 -04:00
ehsanhosseinzadehKhaligh f5b45c549d Updating npkit_trace_generator.py to check npkit directory (#1891)
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: aec4f0a659]
2025-10-22 02:51:16 -05:00
Sourav Chakraborty a3a5631f53 Fix incorrect benchmark name in JitterBench script (#1983)
[ROCm/rccl commit: 57286d5df3]
2025-10-21 12:52:20 -05:00
Sourav Chakraborty 046af13751 Fix build failure in rccl_prim_test (#1984)
Added missing header in rccl_prim_test

[ROCm/rccl commit: 5b345d105c]
2025-10-21 12:51:14 -05:00
Tim bb2e12c831 Update RCCL Replayer README.md (#1870)
* Update Replayer README.md

[ROCm/rccl commit: ba44f170ad]
2025-09-19 17:57:48 -04:00
isaki001 9fa7a738da add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889)
[ROCm/rccl commit: 9c36439354]
2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar 26495be59c Use add_unroll.sh in topo_expl makefile (#1886)
[ROCm/rccl commit: 6e45eaf75e]
2025-09-03 09:43:16 -04:00
Mustafa Abduljabbar 1a7ab8dfc8 Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md

[ROCm/rccl commit: 7ccc6f268f]
2025-09-03 08:54:13 -04:00
BertanDogancay 881327184e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 08a7be231b]
2025-08-28 15:46:28 -05:00
Mustafa Abduljabbar b33b5755f6 Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl

* Fix dependencies and fetch fmt from sources

* Remove third_party folder in make clean

* Add empty target when fmt is found

* Add MI350 example

* Update README.md

---------

Co-authored-by: isaki001 <ioannissakiotis@gmail.com>

[ROCm/rccl commit: dfad51e3c9]
2025-08-26 10:11:38 -04:00
Atul Kulkarni de0d446e03 Added new unit tests for src/transport/p2p.cc (#1774)
[ROCm/rccl commit: 81ec6bff4c]
2025-07-25 12:57:57 -05:00
Wenkai Du caff9764d3 Support fused all reduce and elementwise operations (#1729)
* Support fused all reduce and elementwise operations

Add additional "acc" parameter to RCCL Replayer logs

Add flag which indicates availability of new API

* Fix Recorder json parsing

* Remove unreachable code

* Remove extra acc pointer check

* .

* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"

This reverts commit 4cadf3597c.

* Use noinline to reduce kernels linking time

* Don't use noinline for gfx942 and gfx950 to avoid perf regression

---------

Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: 9a4213356d]
2025-07-23 09:04:17 -07:00
Atul Kulkarni c94fb7c58e Code coverage improvements (#1665)
* Increased max stack size to 640

* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.

[ROCm/rccl commit: 275fdd43c1]
2025-07-17 11:20:49 -05:00
Nikhil-Nunna bf4031276c topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

[ROCm/rccl commit: 7abc7538ea]
2025-07-11 11:28:20 -05:00
Nilesh M Negi 86dd6f262b [TOOLS] Update p2p-latency-test for gfx950 (#1730)
[ROCm/rccl commit: f839e4edef]
2025-07-10 12:13:29 -05:00
Arm Patinyasakdikul 4d71cae249 [topo-expl] update header file location. (#1769)
[ROCm/rccl commit: 35024ca1cb]
2025-06-27 15:29:37 -05:00
gilbertlee-amd 23e5680038 Fixing HelloRccl include path to RCCL, fixing some warnings (#1778)
[ROCm/rccl commit: 16101e654f]
2025-06-27 09:12:59 -06:00
Mustafa Abduljabbar 3e5dc99aa6 Fix topo_explorer compatibility and capture WarpSize (#1743)
[ROCm/rccl commit: fb4ad82d0d]
2025-06-16 08:18:35 -04:00
Tim 7051f217a7 replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated format of recorder, especially in binary department, according to replayer's need

[ROCm/rccl commit: ba97c9c18b]
2025-06-13 15:05:34 -04:00
Arm Patinyasakdikul 7f7f1cede3 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.

[ROCm/rccl commit: 6c37ae9470]
2025-06-12 09:58:01 -05:00
Mustafa Abduljabbar 128b0e7074 Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support

[ROCm/rccl commit: d665547eef]
2025-05-13 17:07:03 -05:00
Mustafa Abduljabbar 750bd73047 Add missing MACRO to topo_expl (#1677)
* Fix header compatibility

[ROCm/rccl commit: fdad89690b]
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar ab4a3eb0c1 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank

[ROCm/rccl commit: f3f3336468]
2025-05-05 15:26:29 -04:00
Bertan Dogancay eb50c947eb Merge pull request #1645 from corey-derochie-amd/nccl-2.24
NCCL Sync 2.24.3-1

[ROCm/rccl commit: f8067a76dc]
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar a85cfaa680 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML

[ROCm/rccl commit: aa7991dfc8]
2025-04-24 09:02:03 -04:00
BertanDogancay d045d0ca23 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a6bf9bfc9e]
2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 07620c7efd Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool

[ROCm/rccl commit: 82afb2bcfe]
2025-04-23 15:44:56 -04:00
Tim 38f91fa2c8 reverting change to RcclReplayer (#1657)
[ROCm/rccl commit: 45e1c3f3e2]
2025-04-23 15:36:46 -04:00
Tim 58ee618194 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT

[ROCm/rccl commit: 9a55ff60a9]
2025-04-19 00:21:27 -04:00
AbandiGa acf0bc1c6e added copyright (#1635)
[ROCm/rccl commit: 7a84c5dbb0]
2025-04-14 09:46:18 -05:00
Pedram Alizadeh b225281747 single-node AR msccl algorithm tuning for MI300 (#1629)
[ROCm/rccl commit: 5b36b68d06]
2025-04-10 10:42:28 -04:00
Nikhil-Nunna ee9da06c80 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints

[ROCm/rccl commit: 3dc0478722]
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar 0a81478bd9 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm

[ROCm/rccl commit: aace4e27f8]
2025-04-02 09:47:29 -04:00
gilbertlee-amd 4f67522420 Removing the experimental clique kernel files (#1610)
[ROCm/rccl commit: 626dc50ab5]
2025-03-20 18:10:01 -06:00
corey-derochie-amd e95578ef4c removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
Pedram Alizadeh acf5822a6c enable building rccl for gfx950 (#1571)
[ROCm/rccl commit: f268553ee4]
2025-02-25 16:13:48 -05:00
gilbertlee-amd 4ca7e6873e Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES

[ROCm/rccl commit: ddc5d58b93]
2025-02-20 15:18:29 -07:00
Nilesh M Negi 5f3134db50 [TOOLS] Update rcclDiagnostics script (#1557)
* [TOOLS] Update rcclDiagnostics script

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Fix typo in valid_marketing_names list

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 159587be5c]
2025-02-20 16:11:05 -06:00
Nikhil-Nunna f228a50646 Env conf debug (#1534)
* Initial Script ready for review

* Added RCCL-tests and RCCL versions

* Added output folder and README

* Base format built

* Added ROCm version

* Added function to center titles and Vram information

* Added HIP version

* Cleaned formatting

* UCX version and MPI version

* Added NUMA balancing

* Added rocminfo

* Removed notes

* Changed regex for broadcom Nic

* Removed note by the ACS info

* Added Hostname to summary and details

* Print summary to terminal

* Added argparse

* Added flags and readme

* Added GPU ID

* fixed spelling

* renamed script again

* Added file descriptor and locked mem checks

* Added file descriptor and locked mem checks

* Removed extra spaces from summary table

* printing output file location

* Removed sudo in code and ACS flag

[ROCm/rccl commit: 4ba94d6662]
2025-02-14 17:31:18 -06:00
gilbertlee-amd 94545f827c Updating topology explorer (#1536)
[ROCm/rccl commit: 6cb0599e38]
2025-02-07 08:44:04 -07:00
Benjamin Kitor fe806d5427 Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P

[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
gilbertlee-amd 71809e8bc0 Updating RCCL Replayer README (#1408)
[ROCm/rccl commit: cb1027de97]
2024-11-05 08:06:11 -07:00
Bertan Dogancay fcb0b2da3f [Replayer] Add validation (#1387)
* Add validation to rccl_replayer

[ROCm/rccl commit: cfecce790f]
2024-10-22 10:41:08 -04:00
Wenkai Du 5f8571dcbc msccl: disable 1-shot xmls (#1375)
MSCCL 1-shot xmls may cause different output values on different ranks.
Disabling them for now to avoid undefined behavior in applications.

[ROCm/rccl commit: 62d10fdc25]
2024-10-14 15:10:53 -07:00