57 commits

Author SHA1 Message Date
Atul Kulkarni 30d36661c2 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to its own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 0c2c61d2f1]
2026-01-08 10:04:41 -06:00
Atul Kulkarni 86a4dd95f6 Remove static to non-static conversion used in tests (#2084)
* Remove coll_reg tests which are unsupported

* removed static to non-static conversion feature

[ROCm/rccl commit: 7ec8e73e12]
2025-12-04 18:03:14 -06:00
ehsanhosseinzadehKhaligh f5b45c549d Updating npkit_trace_generator.py to check npkit directory (#1891)
* create dir regardless of default or user-provided path if it doesn't exist
* Fix npkit_dump_dir on npkit_trace_generator.py

---------

Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>

[ROCm/rccl commit: aec4f0a659]
2025-10-22 02:51:16 -05:00
Atul Kulkarni de0d446e03 Added new unit tests for src/transport/p2p.cc (#1774)
[ROCm/rccl commit: 81ec6bff4c]
2025-07-25 12:57:57 -05:00
Atul Kulkarni c94fb7c58e Code coverage improvements (#1665)
* Increased max stack size to 640

* Added new binary for executing unit tests

Added new unit tests for argcheck.cc and alt_rsmi.cc files

Modified the method to execute unit tests to cover static methods
by using a bash script to convert static to non-static functions
and variables on the fly restricted to debug build type.

[ROCm/rccl commit: 275fdd43c1]
2025-07-17 11:20:49 -05:00
AbandiGa acf0bc1c6e added copyright (#1635)
[ROCm/rccl commit: 7a84c5dbb0]
2025-04-14 09:46:18 -05:00
Nikhil-Nunna ee9da06c80 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints

[ROCm/rccl commit: 3dc0478722]
2025-04-08 14:09:39 -05:00
Nilesh M Negi 5f3134db50 [TOOLS] Update rcclDiagnostics script (#1557)
* [TOOLS] Update rcclDiagnostics script

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Fix typo in valid_marketing_names list

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 159587be5c]
2025-02-20 16:11:05 -06:00
Nikhil-Nunna f228a50646 Env conf debug (#1534)
* Initial Script ready for review

* Added RCCL-tests and RCCL versions

* Added output folder and README

* Base format built

* Added ROCm version

* Added function to center titles and Vram information

* Added HIP version

* Cleaned formatting

* UCX version and MPI version

* Added NUMA balancing

* Added rocminfo

* Removed notes

* Changed regex for broadcom Nic

* Removed note by the ACS info

* Added Hostname to summary and details

* Print summary to terminal

* Added argparse

* Added flags and readme

* Added GPU ID

* fixed spelling

* renamed script again

* Added file descriptor and locked mem checks

* Added file descriptor and locked mem checks

* Removed extra spaces from summary table

* printing output file location

* Removed sudo in code and ACS flag

[ROCm/rccl commit: 4ba94d6662]
2025-02-14 17:31:18 -06:00
Wenkai Du 74aa13afbe Add another Rome model (#1354)
[ROCm/rccl commit: e453f1ced9]
2024-10-01 17:41:27 -05:00
Wenkai Du 157cc5f6ba Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names

[ROCm/rccl commit: 532b70afb6]
2024-08-23 08:45:43 +08:00
mberenjk 863b213fd2 adding rocprof and pytorch parser scripts (#1214)
* adding rocprof parser script

* adding the support for multiple json files

* adding pytorch profiler script

* remove filtering from pytorch log

* addressing the comments and adding the feature to parse all kernels

* completing the report for torch profiler

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>

[ROCm/rccl commit: 519843d2cf]
2024-07-19 14:51:28 -05:00
Jack Taylor ebcac26530 Add pytorch rccl/intra node all-reduce benchmark (#1221)
* Add gpt-fast pytorch all reduce benchmark script

* Update readme instructions

* Minor changes

[ROCm/rccl commit: 5f2b88bc28]
2024-06-25 08:04:38 -07:00
Tim b7c743b3a0 Upload npkit_trace_analysis.py (#1152)
script for parsing json trace, generating heatmap, throughput series, etc.

[ROCm/rccl commit: f078db5998]
2024-05-09 16:27:49 -04:00
gilbertlee-amd 422a7ffcbb Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)

[ROCm/rccl commit: 4cb62f999a]
2024-04-15 12:03:57 -06:00
corey-derochie-amd 62a6a07d49 Replaced ROCmSoftwarePlatform and RadeonOpenCompute links with ROCm links. (#1125)
[ROCm/rccl commit: 503a472a25]
2024-03-25 16:29:13 -06:00
akolliasAMD 5d44815d95 Npkit updates (#1084)
* removed warmup runs to be an opt in

[ROCm/rccl commit: 16d7f372b7]
2024-02-15 07:48:45 -07:00
akolliasAMD 81cc39899b npkit trace script now syncs the on average difference per rank (#981)
[ROCm/rccl commit: c71bae1608]
2023-11-28 11:03:55 -07:00
Pedram Alizadeh 279da575be Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)
[ROCm/rccl commit: 3f6c2b9b32]
2023-09-27 12:44:36 -04:00
Wenkai Du 0c31452135 Add new model support (#847)
* Add new model support

* Update new rings

[ROCm/rccl commit: 7044599575]
2023-08-10 17:14:51 -07:00
Wenkai Du dfda1d6fab Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)

[ROCm/rccl commit: a7fcd58a97]
2023-07-21 07:31:27 -07:00
akolliasAMD 8438fd9e42 Wall clock update and npkit trace script Update (#771)
* changed builtin clock to wall_clock64
* updated npkit_trace_generator to the new version of npkit

[ROCm/rccl commit: 9cdac774ea]
2023-06-07 17:47:10 -06:00
gilbertlee-amd 05b8523a7b Updating NOTICES.txt and LICENSE.txt (#770)
[ROCm/rccl commit: 20b567caac]
2023-06-07 09:45:03 -06:00
akolliasAMD fd64cb242d added time results on npkit generator (#749)
[ROCm/rccl commit: 2b1efa9e9a]
2023-05-30 12:57:25 -06:00
akolliasAMD a26ca6d766 added modified npkit_trace_generator.py to scripts (#738)
* added modified npkit_trace_generator.py to scripts

[ROCm/rccl commit: c88475462b]
2023-05-09 10:11:35 -06:00
Wenkai Du 5becf1669f Add another Rome model (#553)
* Add another Rome model

* Add option to force enable intranet on single node

* Limit p2p channels to number of ranks

* Refine p2p channels handling

[ROCm/rccl commit: ef499c4810]
2022-05-31 11:31:30 -07:00
Wenkai Du b30b8becea Refine and add new Rome models (#548)
[ROCm/rccl commit: 283dc86a73]
2022-05-17 08:23:59 -07:00
Wenkai Du 011447e4dc Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du f8023f2e07 Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
Wenkai Du db1e628ba3 Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models

* Add missing file

* Models update

[ROCm/rccl commit: cd17cf6dce]
2022-03-21 10:54:40 -07:00
Wenkai Du 93c7526fdc Add another Rome model (#483)
[ROCm/rccl commit: f8d0775a6f]
2022-01-05 09:26:31 -08:00
Wenkai Du fd98ee84b4 Update Rome model matching (#461)
* Update Rome model matching

* Add another Rome model

* Automatically setup NET GDR level from model

[ROCm/rccl commit: 0331e39f81]
2021-11-05 08:53:47 -07:00
Wenkai Du b587b55c2e Add more Rome models (#434)
* Add more Rome models

* Update models and tuning

* Update tuning

[ROCm/rccl commit: 2249a1d9d3]
2021-10-12 08:23:20 -07:00
Wenkai Du d377c4dcc6 Add another Rome model (#431)
[ROCm/rccl commit: e0053311c0]
2021-10-06 08:17:12 -07:00
Wenkai Du b9508a6aba Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit de0c586bad.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit fa690c47a0.

[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du 57518da006 Add gfx908 VM model (#418)
[ROCm/rccl commit: 5f15ed6e3e]
2021-08-10 08:55:11 -07:00
Wenkai Du 4b31e521e9 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16

[ROCm/rccl commit: 961922ea02]
2021-06-03 19:45:18 -07:00
Wenkai Du aa95cc6102 Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
Wenkai Du 0f4d497edc Add gfx90a target (#344)
* Add gfx90a target

* Support gfx90a topology

Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>

[ROCm/rccl commit: 1fe031402a]
2021-04-14 09:29:00 -06:00
Wenkai Du 065bde98d8 collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
Wenkai Du 287ed0f18a Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet

[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du fe8923ebba Add gfx908 Rome 4 NICs model
[ROCm/rccl commit: 6dfdfef98f]
2021-02-06 00:19:47 +00:00
Wenkai Du 4ea285c527 Fix Rome PCIe 2 node topology generation (#310)
[ROCm/rccl commit: 373a108516]
2020-12-15 17:16:17 -08:00
Wenkai Du b68ff1ebba Add Rome model and improve search (#305)
[ROCm/rccl commit: 975b14dffa]
2020-11-17 14:55:06 -08:00
Wenkai Du c0c64d970a Add more Rome models (#292)
[ROCm/rccl commit: dfa3c41ede]
2020-10-30 21:26:04 -07:00
Wenkai Du 8b120c0508 Update Rome single node models (#277)
[ROCm/rccl commit: 33babcb5e2]
2020-10-13 13:33:09 -07:00
Wenkai Du 41260bb948 Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models

* Remove unused opCount in p2p transport

[ROCm/rccl commit: ae008fd2db]
2020-10-07 13:37:36 -07:00
lijietang f6b08ca547 Add rccl bw test script in tools (#255)
[ROCm/rccl commit: bbe233f8c1]
2020-09-11 16:59:03 +08:00
Wenkai Du 03bb6bcb54 Increase minimal channels for gfx908 (#259)
[ROCm/rccl commit: c5cbece6d0]
2020-08-26 11:40:11 -07:00
Wenkai Du 5f49a0e088 Add NPS4 support on some models (#256)
* Add NPS4 support on some models

* Add XML models

[ROCm/rccl commit: 391bbf3f1e]
2020-08-19 11:03:20 -07:00