Mustafa Abduljabbar
b33b5755f6
Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl
* Fix dependencies and fetch fmt from sources
* Remove third_party folder in make clean
* Add empty target when fmt is found
* Add MI350 example
* Update README.md
---------
Co-authored-by: isaki001 <ioannissakiotis@gmail.com>
[ROCm/rccl commit: dfad51e3c9]
2025-08-26 10:11:38 -04:00
corey-derochie-amd
e95578ef4c
removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941
* removed gfx940 and gfx941
* Update "gfx94" to "gfx942" in init.cc
* Updated remaining "gfx94" updates to "gfx942"
* Update filenames and variables from gfx940 to gfx942
---------
Co-authored-by: akolliasAMD <akollias@amd.com>
[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
Benjamin Kitor
fe806d5427
Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P
[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
Wenkai Du
27e0569eed
topo_expl: update sm fields in topo xml files (#1310)
[ROCm/rccl commit: 1a48e19b18]
2024-08-29 12:03:51 -07:00
corey-derochie-amd
19897f8d90
Fixes the copyright comment block on each of topo_expl/models/*.xml. The format was not valid XML. (#1124)
[ROCm/rccl commit: 9eefc68cb5]
2024-03-25 16:21:17 -06:00
Wenkai Du
df1d9b2415
topo_expl: 2.19 update
[ROCm/rccl commit: d1575a1622]
2024-01-31 16:11:14 -06:00
gilbertlee-amd
0ca30fb88a
Updating files for missing licenses (#637)
[ROCm/rccl commit: ebb8b5bf63]
2022-10-14 13:49:16 -06:00
Wenkai Du
5becf1669f
Add another Rome model (#553)
* Add another Rome model
* Add option to force enable intranet on single node
* Limit p2p channels to number of ranks
* Refine p2p channels handling
[ROCm/rccl commit: ef499c4810]
2022-05-31 11:31:30 -07:00
Wenkai Du
2c125ce6ed
Update Rome model (#552)
[ROCm/rccl commit: c5b77121f0]
2022-05-26 09:59:23 -07:00
Wenkai Du
b30b8becea
Refine and add new Rome models (#548)
[ROCm/rccl commit: 283dc86a73]
2022-05-17 08:23:59 -07:00
Wenkai Du
011447e4dc
Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du
f8023f2e07
Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
Wenkai Du
db1e628ba3
Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models
* Add missing file
* Models update
[ROCm/rccl commit: cd17cf6dce]
2022-03-21 10:54:40 -07:00
Wenkai Du
93c7526fdc
Add another Rome model (#483)
[ROCm/rccl commit: f8d0775a6f]
2022-01-05 09:26:31 -08:00
Wenkai Du
fd98ee84b4
Update Rome model matching (#461)
* Update Rome model matching
* Add another Rome model
* Automatically setup NET GDR level from model
[ROCm/rccl commit: 0331e39f81]
2021-11-05 08:53:47 -07:00
Wenkai Du
15143b1cfb
Query XGMI link count through rocm_smi_lib API (#442)
[ROCm/rccl commit: 14a184eb67]
2021-10-26 10:30:20 -07:00
Wenkai Du
b587b55c2e
Add more Rome models (#434)
* Add more Rome models
* Update models and tuning
* Update tuning
[ROCm/rccl commit: 2249a1d9d3]
2021-10-12 08:23:20 -07:00
Wenkai Du
d377c4dcc6
Add another Rome model (#431)
[ROCm/rccl commit: e0053311c0]
2021-10-06 08:17:12 -07:00
Wenkai Du
91eca0d7d2
Trim NICs when all GPUs are connected by XGMI (#430)
* Trim NICs when all GPUs are connected by XGMI
* Only enable clique with maximum of 2 hops
[ROCm/rccl commit: 29c729d8b6]
2021-10-05 18:27:43 -07:00
Wenkai Du
b9508a6aba
Implement NIC identification and remapping (#420)
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit de0c586bad.
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit fa690c47a0.
[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du
57518da006
Add gfx908 VM model (#418)
[ROCm/rccl commit: 5f15ed6e3e]
2021-08-10 08:55:11 -07:00
Wenkai Du
aa95cc6102
Update Rome models matching (#376)
[ROCm/rccl commit: 4c83adb75c]
2021-05-25 10:12:40 -07:00
Wenkai Du
0f4d497edc
Add gfx90a target (#344)
* Add gfx90a target
* Support gfx90a topology
Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>
[ROCm/rccl commit: 1fe031402a]
2021-04-14 09:29:00 -06:00
Wenkai Du
065bde98d8
collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
Wenkai Du
287ed0f18a
Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels
* topo_expl: enable collnet
[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du
fe8923ebba
Add gfx908 Rome 4 NICs model
[ROCm/rccl commit: 6dfdfef98f]
2021-02-06 00:19:47 +00:00
Wenkai Du
4ea285c527
Fix Rome PCIe 2 node topology generation (#310)
[ROCm/rccl commit: 373a108516]
2020-12-15 17:16:17 -08:00
Wenkai Du
b68ff1ebba
Add Rome model and improve search (#305)
[ROCm/rccl commit: 975b14dffa]
2020-11-17 14:55:06 -08:00
Wenkai Du
c0c64d970a
Add more Rome models (#292)
[ROCm/rccl commit: dfa3c41ede]
2020-10-30 21:26:04 -07:00
Wenkai Du
8b120c0508
Update Rome single node models (#277)
[ROCm/rccl commit: 33babcb5e2]
2020-10-13 13:33:09 -07:00
Wenkai Du
41260bb948
Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models
* Remove unused opCount in p2p transport
[ROCm/rccl commit: ae008fd2db]
2020-10-07 13:37:36 -07:00
Wenkai Du
03bb6bcb54
Increase minimal channels for gfx908 (#259)
[ROCm/rccl commit: c5cbece6d0]
2020-08-26 11:40:11 -07:00
Wenkai Du
5f49a0e088
Add NPS4 support on some models (#256)
* Add NPS4 support on some models
* Add XML models
[ROCm/rccl commit: 391bbf3f1e]
2020-08-19 11:03:20 -07:00
Wenkai Du
3d5fb8142e
Add another Rome model (#249)
* Add another Rome model
* Add gfx908 4P3L models and support
* Revert "Use cached value for detecting GDR support only once"
This reverts commit 0108a1219d.
* Skip using ibverb for GPU direct RDMA detection
* Fine tune one Rome model
[ROCm/rccl commit: a51e4071e3]
2020-08-17 10:51:02 -07:00
Wenkai Du
f242a2f0b0
Collect gcnArch and hipDeviceArch_t in XML (#252)
[ROCm/rccl commit: 7e3d8a31cc]
2020-08-12 15:48:38 -07:00
Wenkai Du
c9815aaa36
Add more Rome 4P2H models
[ROCm/rccl commit: 09ef75656a]
2020-08-06 18:20:02 +00:00
Wenkai Du
487f93b83f
Topology tuning for 4P2H on Rome (#242)
* Topology tuning for 4P2H on Rome
* Use ncclTopoIdToIndex
[ROCm/rccl commit: e7a10aa0e4]
2020-07-27 11:53:57 -07:00
Wenkai Du
f604fc774e
Add 8P6L multi-node models (#239)
[ROCm/rccl commit: d5f90e19b5]
2020-07-21 14:10:36 -07:00
Wenkai Du
27519fd019
Give preference to path with more XGMI connections
[ROCm/rccl commit: b3c9852634]
2020-05-14 15:33:16 -07:00
Wenkai Du
8852e54181
topo_expl: update to 2.6
[ROCm/rccl commit: 6f54b23503]
2020-04-01 13:37:08 -07:00