isaki001
9c36439354
add reduce/broadcast algo/proto selection table for multi-node gfx940 ( #1889 )
2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar
dfad51e3c9
Support gfx950 in topo_expl and resolve dependency on FMT ( #1829 )
...
* Support gfx950 in topo_expl
* Fix dependencies and fetch fmt from sources
* Remove third_party folder in make clean
* Add empty target when fmt is found
* Add MI350 example
* Update README.md
---------
Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
2025-08-26 10:11:38 -04:00
Mustafa Abduljabbar
fb4ad82d0d
Fix topo_explorer compatibility and capture WarpSize ( #1743 )
2025-06-16 08:18:35 -04:00
Mustafa Abduljabbar
f3f3336468
Fix topo explorer's compatibility with NCCL 2.24 ( #1671 )
...
* Fix build issues
* Fix failure to find path remote rank
2025-05-05 15:26:29 -04:00
Mustafa Abduljabbar
82afb2bcfe
Expose production tuning table in topo_explorer using internal RCCL/NCCL logic ( #1628 )
...
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool
2025-04-23 15:44:56 -04:00
corey-derochie-amd
6505639cf4
removed gfx940 and gfx941 ( #1606 )
...
* removed gfx940 and gfx941
* removed gfx940 and gfx941
* Update "gfx94" to "gfx942" in init.cc
* Updated remaining "gfx94" updates to "gfx942"
* Update filenames and variables from gfx940 to gfx942
---------
Co-authored-by: akolliasAMD <akollias@amd.com>
2025-03-20 09:34:53 -06:00
gilbertlee-amd
ddc5d58b93
Rail optimized trees ( #1540 )
...
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES
2025-02-20 15:18:29 -07:00
gilbertlee-amd
6cb0599e38
Updating topology explorer ( #1536 )
2025-02-07 08:44:04 -07:00
Benjamin Kitor
a05329bd0d
Add Topologies for 16-GPU gfx942 SuperNode ( #1417 )
...
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P
2024-12-03 13:12:03 -08:00
Wenkai Du
e453f1ced9
Add another Rome model ( #1354 )
2024-10-01 17:41:27 -05:00
Wenkai Du
532b70afb6
Add new Rome model ( #1304 )
...
* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistent net names
2024-08-23 08:45:43 +08:00
gilbertlee-amd
4cb62f999a
Rail optimization for rings ( #1140 )
...
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
2024-04-15 12:03:57 -06:00
gilbertlee-amd
93982533d7
[topo_expl] Adding -n option to override number of nodes ( #1134 )
2024-04-04 15:11:47 -06:00
Wenkai Du
df98a6957d
Add another Rome model ( #1095 )
2024-02-28 10:46:05 -08:00
Wenkai Du
74f9e5db64
Add new GPU model ( #1080 )
2024-02-23 12:19:42 -08:00
Wenkai Du
50b2dd9fd7
Add special handling of gfx940 ( #976 )
...
* Add special handling of gfx940
* Update ring base
2023-11-22 15:07:36 -08:00
Wenkai Du
7044599575
Add new model support ( #847 )
...
* Add new model support
* Update new rings
2023-08-10 17:14:51 -07:00
Wenkai Du
a7fcd58a97
Enable gfx94x ( #808 ) ( #816 )
...
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)
2023-07-21 07:31:27 -07:00
Wenkai Du
abd0615351
Merge remote-tracking branch 'nccl/master' into develop
2023-06-26 22:51:56 +00:00
Wenkai Du
4f0e223db4
Merge remote-tracking branch 'nccl/master' into develop
2022-10-20 15:41:29 +00:00
Wenkai Du
a79d9e3586
Merge remote-tracking branch 'nccl/master' into develop
2022-09-09 16:05:38 +00:00
Wenkai Du
ef499c4810
Add another Rome model ( #553 )
...
* Add another Rome model
* Add option to force enable intranet on single node
* Limit p2p channels to number of ranks
* Refine p2p channels handling
2022-05-31 11:31:30 -07:00
Wenkai Du
283dc86a73
Refine and add new Rome models ( #548 )
2022-05-17 08:23:59 -07:00
Wenkai Du
063da25563
topo_expl: fix build and add tuning support ( #539 )
2022-04-26 15:40:07 -07:00
Wenkai Du
d28e1cb44f
Merge remote-tracking branch 'nccl/master' into develop
2022-04-18 11:15:25 -07:00
Wenkai Du
2151c79d14
Add new Rome model ( #536 )
2022-04-13 11:45:40 -07:00
Wenkai Du
ba4c165bf3
Add new Rome model ( #535 )
2022-04-12 13:27:32 -07:00
Wenkai Du
cd17cf6dce
Update Rome model matching and add new models ( #516 )
...
* Update Rome model matching and add new models
* Add missing file
* Models update
2022-03-21 10:54:40 -07:00
Wenkai Du
369c021992
topo_expl: update for 2.11.4 ( #490 )
...
* topo_expl: update for 2.11.4
* topo_expl: revert a few logging changes
2022-01-13 13:33:07 -08:00
Wenkai Du
f8d0775a6f
Add another Rome model ( #483 )
2022-01-05 09:26:31 -08:00
Wenkai Du
0331e39f81
Update Rome model matching ( #461 )
...
* Update Rome model matching
* Add another Rome model
* Automatically setup NET GDR level from model
2021-11-05 08:53:47 -07:00
Wenkai Du
2249a1d9d3
Add more Rome models ( #434 )
...
* Add more Rome models
* Update models and tuning
* Update tuning
2021-10-12 08:23:20 -07:00
Wenkai Du
e0053311c0
Add another Rome model ( #431 )
2021-10-06 08:17:12 -07:00
Wenkai Du
5c8380ff5b
Implement NIC identification and remapping ( #420 )
...
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit 2d0ed8dff6.
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit caf5c9992a.
2021-08-24 09:42:04 -07:00
Wenkai Du
5f15ed6e3e
Add gfx908 VM model ( #418 )
2021-08-10 08:55:11 -07:00
Wenkai Du
135d47d125
topo_expl: fix build after switching to rocm-smi-lib ( #405 )
...
* topo_expl: fix build after switching to rocm-smi-lib
* Use a minimum of 4 channels for gfx908
2021-07-27 08:30:08 -07:00
Wenkai Du
961922ea02
Add option to enable multiple SAT in SHARP ( #380 )
...
* Add option to enable multiple SAT in SHARP
* Extend number of NICs to 16
2021-06-03 19:45:18 -07:00
Wenkai Du
4c83adb75c
Update Rome models matching ( #376 )
2021-05-25 10:12:40 -07:00
Wenkai Du
a4ea1fed5b
Merge remote-tracking branch 'nccl/master' into develop
2021-05-05 16:01:01 -07:00
Wenkai Du
1fe031402a
Add gfx90a target ( #344 )
...
* Add gfx90a target
* Support gfx90a topology
Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>
2021-04-14 09:29:00 -06:00
Wenkai Du
d87dc7c2e8
collnet: support multiple NICs ( #335 )
2021-03-25 20:59:32 -07:00
Wenkai Du
1d6244b18d
Enable collnet in RCCL ( #333 )
...
* Enable CollNet and use different number of channels
* topo_expl: enable collnet
2021-03-19 12:58:13 -07:00
Wenkai Du
6dfdfef98f
Add gfx908 Rome 4 NICs model
2021-02-06 00:19:47 +00:00
Wenkai Du
ab1e7a0318
Merge remote-tracking branch 'origin/develop' into 2.8.3
2021-02-04 20:02:34 -05:00
Wenkai Du
d469947641
Merge remote-tracking branch 'nccl/master' into no-target-id
2021-01-14 19:27:53 -05:00
Wenkai Du
373a108516
Fix Rome PCIe 2 node topology generation ( #310 )
2020-12-15 17:16:17 -08:00
Wenkai Du
975b14dffa
Add Rome model and improve search ( #305 )
2020-11-17 14:55:06 -08:00
Wenkai Du
dfa3c41ede
Add more Rome models ( #292 )
2020-10-30 21:26:04 -07:00
Wenkai Du
33babcb5e2
Update Rome single node models ( #277 )
2020-10-13 13:33:09 -07:00
Wenkai Du
ae008fd2db
Rework Rome detection and add multiple network ports models ( #274 )
...
* Rework Rome detection and add multiple network ports models
* Remove unused opCount in p2p transport
2020-10-07 13:37:36 -07:00