66 Commits

Author SHA1 Message Date
isaki001 9fa7a738da add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889)
[ROCm/rccl commit: 9c36439354]
2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar b33b5755f6 Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl

* Fix dependencies and fetch fmt from sources

* Remove third_party folder in make clean

* Add empty target when fmt is found

* Add MI350 example

* Update README.md

---------

Co-authored-by: isaki001 <ioannissakiotis@gmail.com>

[ROCm/rccl commit: dfad51e3c9]
2025-08-26 10:11:38 -04:00
Mustafa Abduljabbar 3e5dc99aa6 Fix topo_explorer compatibility and capture WarpSize (#1743)
[ROCm/rccl commit: fb4ad82d0d]
2025-06-16 08:18:35 -04:00
Mustafa Abduljabbar ab4a3eb0c1 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank


[ROCm/rccl commit: f3f3336468]
2025-05-05 15:26:29 -04:00
Mustafa Abduljabbar 07620c7efd Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool

[ROCm/rccl commit: 82afb2bcfe]
2025-04-23 15:44:56 -04:00
corey-derochie-amd e95578ef4c removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
gilbertlee-amd 4ca7e6873e Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES


[ROCm/rccl commit: ddc5d58b93]
2025-02-20 15:18:29 -07:00
gilbertlee-amd 94545f827c Updating topology explorer (#1536)
[ROCm/rccl commit: 6cb0599e38]
2025-02-07 08:44:04 -07:00
Benjamin Kitor fe806d5427 Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P

[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
Wenkai Du 74aa13afbe Add another Rome model (#1354)
[ROCm/rccl commit: e453f1ced9]
2024-10-01 17:41:27 -05:00
Wenkai Du 157cc5f6ba Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names

[ROCm/rccl commit: 532b70afb6]
2024-08-23 08:45:43 +08:00
gilbertlee-amd 422a7ffcbb Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)

[ROCm/rccl commit: 4cb62f999a]
2024-04-15 12:03:57 -06:00
gilbertlee-amd 62b9f0d3a7 [topo_expl] Adding -n option to override number of nodes (#1134)
[ROCm/rccl commit: 93982533d7]
2024-04-04 15:11:47 -06:00
Wenkai Du 058886cb20 Add another Rome model (#1095)
[ROCm/rccl commit: df98a6957d]
2024-02-28 10:46:05 -08:00
Wenkai Du 874998033f Add new GPU model (#1080)
[ROCm/rccl commit: 74f9e5db64]
2024-02-23 12:19:42 -08:00
Wenkai Du dcf623f2ec Add special handling of gfx940 (#976)
* Add special handling of gfx940

* Update ring base

[ROCm/rccl commit: 50b2dd9fd7]
2023-11-22 15:07:36 -08:00
Wenkai Du 0c31452135 Add new model support (#847)
* Add new model support

* Update new rings

[ROCm/rccl commit: 7044599575]
2023-08-10 17:14:51 -07:00
Wenkai Du dfda1d6fab Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)

[ROCm/rccl commit: a7fcd58a97]
2023-07-21 07:31:27 -07:00
Wenkai Du f98715baea Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: abd0615351]
2023-06-26 22:51:56 +00:00
Wenkai Du 36e5e02e46 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 4f0e223db4]
2022-10-20 15:41:29 +00:00
Wenkai Du 7874a99c75 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a79d9e3586]
2022-09-09 16:05:38 +00:00
Wenkai Du 5becf1669f Add another Rome model (#553)
* Add another Rome model

* Add option to force enable intranet on single node

* Limit p2p channels to number of ranks

* Refine p2p channels handling

[ROCm/rccl commit: ef499c4810]
2022-05-31 11:31:30 -07:00
Wenkai Du b30b8becea Refine and add new Rome models (#548)
[ROCm/rccl commit: 283dc86a73]
2022-05-17 08:23:59 -07:00
Wenkai Du 95b30d9762 topo_expl: fix build and add tuning support (#539)
[ROCm/rccl commit: 063da25563]
2022-04-26 15:40:07 -07:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 011447e4dc Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du f8023f2e07 Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
Wenkai Du db1e628ba3 Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models

* Add missing file

* Models update

[ROCm/rccl commit: cd17cf6dce]
2022-03-21 10:54:40 -07:00
Wenkai Du 02a94fc552 topo_expl: update for 2.11.4 (#490)
* topo_expl: update for 2.11.4

* topo_expl: revert a few logging changes

[ROCm/rccl commit: 369c021992]
2022-01-13 13:33:07 -08:00
Wenkai Du 93c7526fdc Add another Rome model (#483)
[ROCm/rccl commit: f8d0775a6f]
2022-01-05 09:26:31 -08:00
Wenkai Du fd98ee84b4 Update Rome model matching (#461)
* Update Rome model matching

* Add another Rome model

* Automatically setup NET GDR level from model

[ROCm/rccl commit: 0331e39f81]
2021-11-05 08:53:47 -07:00
Wenkai Du b587b55c2e Add more Rome models (#434)
* Add more Rome models

* Update models and tuning

* Update tuning

[ROCm/rccl commit: 2249a1d9d3]
2021-10-12 08:23:20 -07:00
Wenkai Du d377c4dcc6 Add another Rome model (#431)
[ROCm/rccl commit: e0053311c0]
2021-10-06 08:17:12 -07:00
Wenkai Du b9508a6aba Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit de0c586bad.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit fa690c47a0.

[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du 57518da006 Add gfx908 VM model (#418)
[ROCm/rccl commit: 5f15ed6e3e]
2021-08-10 08:55:11 -07:00
Wenkai Du 8fbeb14175 topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib

* Use a minimum of 4 channels for gfx908

[ROCm/rccl commit: 135d47d125]
2021-07-27 08:30:08 -07:00
Wenkai Du 4b31e521e9 Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP

* Extend number of NICs to 16

[ROCm/rccl commit: 961922ea02]
2021-06-03 19:45:18 -07:00
Wenkai Du aa95cc6102 Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
Wenkai Du a76bebf8b6 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a4ea1fed5b]
2021-05-05 16:01:01 -07:00
Wenkai Du 0f4d497edc Add gfx90a target (#344)
* Add gfx90a target

* Support gfx90a topology

Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>

[ROCm/rccl commit: 1fe031402a]
2021-04-14 09:29:00 -06:00
Wenkai Du 065bde98d8 collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
Wenkai Du 287ed0f18a Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet

[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du fe8923ebba Add gfx908 Rome 4 NICs model
[ROCm/rccl commit: 6dfdfef98f]
2021-02-06 00:19:47 +00:00
Wenkai Du ae5779702a Merge remote-tracking branch 'origin/develop' into 2.8.3
[ROCm/rccl commit: ab1e7a0318]
2021-02-04 20:02:34 -05:00
Wenkai Du adff98765c Merge remote-tracking branch 'nccl/master' into no-target-id
[ROCm/rccl commit: d469947641]
2021-01-14 19:27:53 -05:00
Wenkai Du 4ea285c527 Fix Rome PCIe 2 node topology generation (#310)
[ROCm/rccl commit: 373a108516]
2020-12-15 17:16:17 -08:00
Wenkai Du b68ff1ebba Add Rome model and improve search (#305)
[ROCm/rccl commit: 975b14dffa]
2020-11-17 14:55:06 -08:00
Wenkai Du c0c64d970a Add more Rome models (#292)
[ROCm/rccl commit: dfa3c41ede]
2020-10-30 21:26:04 -07:00
Wenkai Du 8b120c0508 Update Rome single node models (#277)
[ROCm/rccl commit: 33babcb5e2]
2020-10-13 13:33:09 -07:00
Wenkai Du 41260bb948 Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models

* Remove unused opCount in p2p transport

[ROCm/rccl commit: ae008fd2db]
2020-10-07 13:37:36 -07:00