Commit graph

293 commits

Author SHA1 Message Date
Nikhil-Nunna 7abc7538ea topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>
2025-07-11 11:28:20 -05:00
Nilesh M Negi f839e4edef [TOOLS] Update p2p-latency-test for gfx950 (#1730) 2025-07-10 12:13:29 -05:00
Arm Patinyasakdikul 35024ca1cb [topo-expl] update header file location. (#1769) 2025-06-27 15:29:37 -05:00
gilbertlee-amd 16101e654f Fixing HelloRccl include path to RCCL, fixing some warnings (#1778) 2025-06-27 09:12:59 -06:00
Mustafa Abduljabbar fb4ad82d0d Fix topo_explorer compatibility and capture WarpSize (#1743) 2025-06-16 08:18:35 -04:00
Tim ba97c9c18b replayer update v0 (#1733)
* First version of new replayer, with comments on future TODOs

* plus minor fixes for UT

* Updated format of recorder, especially in binary department, according to replayer's need
2025-06-13 15:05:34 -04:00
Arm Patinyasakdikul 6c37ae9470 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.
2025-06-12 09:58:01 -05:00
Mustafa Abduljabbar d665547eef Remove MSCCL single node AllGather XMLs (#1693)
* Remove MSCCL single node XMLs

* Remove comment on MSCCL AG single node support
2025-05-13 17:07:03 -05:00
Mustafa Abduljabbar fdad89690b Add missing MACRO to topo_expl (#1677)
* Fix header compatibility
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar f3f3336468 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank
2025-05-05 15:26:29 -04:00
Bertan Dogancay f8067a76dc Merge pull request #1645 from corey-derochie-amd/nccl-2.24
NCCL Sync 2.24.3-1
2025-04-24 10:08:58 -04:00
Mustafa Abduljabbar aa7991dfc8 [AllGather MSCCL] Multinode and single node support up to certain send count (#1650)
* Add multinode and singlenode allgather XML
2025-04-24 09:02:03 -04:00
BertanDogancay a6bf9bfc9e Merge remote-tracking branch 'nccl/master' into develop 2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 82afb2bcfe Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
Tim 45e1c3f3e2 reverting change to RcclReplayer (#1657) 2025-04-23 15:36:46 -04:00
Tim 9a55ff60a9 RCCL Replayer update (#1603)
RCCL recorder w/ suggested change and UT
2025-04-19 00:21:27 -04:00
AbandiGa 7a84c5dbb0 added copyright (#1635) 2025-04-14 09:46:18 -05:00
Pedram Alizadeh 5b36b68d06 single-node AR msccl algorithm tuning for MI300 (#1629) 2025-04-10 10:42:28 -04:00
Nikhil-Nunna 3dc0478722 Fetching RCCL version from shared object file. (#1569)
* added rccl version using rccl-tests

* Added function to get rccl version from rccl-tests

* removed whitespace

* Added rccl version

* Updated readme and fixed formatting

* removed debug prints
2025-04-08 14:09:39 -05:00
Mustafa Abduljabbar aace4e27f8 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm
2025-04-02 09:47:29 -04:00
gilbertlee-amd 626dc50ab5 Removing the experimental clique kernel files (#1610) 2025-03-20 18:10:01 -06:00
corey-derochie-amd 6505639cf4 removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
Pedram Alizadeh f268553ee4 enable building rccl for gfx950 (#1571) 2025-02-25 16:13:48 -05:00
gilbertlee-amd ddc5d58b93 Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES
2025-02-20 15:18:29 -07:00
Nilesh M Negi 159587be5c [TOOLS] Update rcclDiagnostics script (#1557)
* [TOOLS] Update rcclDiagnostics script

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>

* Fix typo in valid_marketing_names list

Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: corey-derochie-amd <161367113+corey-derochie-amd@users.noreply.github.com>
2025-02-20 16:11:05 -06:00
Nikhil-Nunna 4ba94d6662 Env conf debug (#1534)
* Initial Script ready for review

* Added RCCL-tests and RCCL versions

* Added output folder and README

* Base format built

* Added ROCm version

* Added function to center titles and Vram information

* Added HIP version

* Cleaned formatting

* UCX version and MPI version

* Added NUMA balancing

* Added rocminfo

* Removed notes

* Changed regex for broadcom Nic

* Removed note by the ACS info

* Added Hostname to summary and details

* Print summary to terminal

* Added argparse

* Added flags and readme

* Added GPU ID

* fixed spelling

* renamed script again

* Added file descriptor and locked mem checks

* Added file descriptor and locked mem checks

* Removed extra spaces from summary table

* printing output file location

* Removed sudo in code and ACS flag
2025-02-14 17:31:18 -06:00
gilbertlee-amd 6cb0599e38 Updating topology explorer (#1536) 2025-02-07 08:44:04 -07:00
Benjamin Kitor a05329bd0d Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
gilbertlee-amd cb1027de97 Updating RCCL Replayer README (#1408) 2024-11-05 08:06:11 -07:00
Bertan Dogancay cfecce790f [Replayer] Add validation (#1387)
* Add validation to rccl_replayer
2024-10-22 10:41:08 -04:00
Wenkai Du 62d10fdc25 msccl: disable 1-shot xmls (#1375)
MSCCL 1-shot xmls may cause different output values on different ranks.
Disabling them for now to avoid undefined behavior in applications.
2024-10-14 15:10:53 -07:00
Wenkai Du a680e329e6 Temporarily disable MSCCL all gather XMLs due to UT failure (#1373) 2024-10-12 08:43:16 -07:00
BertanDogancay 84081064a0 Merge remote-tracking branch 'nccl/master' into develop 2024-10-02 09:31:25 -05:00
Wenkai Du e453f1ced9 Add another Rome model (#1354) 2024-10-01 17:41:27 -05:00
Wenkai Du 1a48e19b18 topo_expl: update sm fields in topo xml files (#1310) 2024-08-29 12:03:51 -07:00
Edgar Gabriel bba3559334 Merge pull request #1299 from edgargabriel/topic/remove-multirank-examples
Remove MultiRank examples
2024-08-23 08:32:16 -05:00
Wenkai Du 532b70afb6 Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
Edgar Gabriel 8953a26bcd Remove MultiRank examples
Remove the MultiRank examples. The feature was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In addition, the location in tools was suboptimal.
2024-08-14 14:11:16 -07:00
Ziyue Yang 145a13235a Fix number of loops in p2p-latency-test (#1286) 2024-08-05 13:35:56 -07:00
Benjamin Kitor 4bc118336a topo_expl: Update channel masks for >64 channels (#1279) 2024-07-25 17:27:34 -07:00
saurabhAMD cf311b71ee Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing (#1265)
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing

* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
2024-07-22 10:21:29 -05:00
mberenjk 519843d2cf adding rocprof and pytorch parser scripts (#1214)
* adding rocprof parser script

* adding the support for multiple json files

* adding pytorch profiler script

* remove filtering from pytorch log

* addressing the comments and adding the feature to parse all kernels

* completing the report for torch profiler

---------

Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
2024-07-19 14:51:28 -05:00
Jack Taylor 5f2b88bc28 Add pytorch rccl/intra node all-reduce benchmark (#1221)
* Add gpt-fast pytorch all reduce benchmark script

* Update readme instructions

* Minor changes
2024-06-25 08:04:38 -07:00
Nusrat Islam 0634c5c8e1 doubling debug buffer size with increased channels 2024-06-03 13:05:05 -05:00
ClementLinCF cab25f919e Optimize NCHANNELS and MSCCL config for gfx942 80CUs (#1195)
* Optimize NCHANNELS and MSCCL config for gfx942 80CUs

Set appropriately for different NCCL_MIN_NCHANNELS and MSCCL config,
potentially improving communication perf on the MI300x 80CUs

* Delete tools/msccl-algorithms/allreduce_1step_mccl_8_2_16777216_LL.xml

* Change the factor of gfx94 and update msccl config
2024-06-01 07:07:46 -07:00
gilbertlee-amd 52fa5d1178 RCCL Replayer - multi communicator support (#1176) 2024-05-13 10:56:32 -06:00
Tim f078db5998 Upload npkit_trace_analysis.py (#1152)
script for parsing json trace, generating heatmap, throughput series, etc.
2024-05-09 16:27:49 -04:00
Wenkai Du 4e1b8c1cbb MSCCL: add support for out-of-place all reduce (#1156) 2024-04-28 19:49:09 -07:00
Wenkai Du 9e0c9b4ed8 Replace __HIP_PLATFORM_HCC__ with __HIP_PLATFORM_AMD__ (#1154) 2024-04-25 07:19:18 -07:00
gilbertlee-amd 4cb62f999a Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00