128 commits

Author SHA1 Message Date
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
isaki001 9fa7a738da add reduce/broadcast algo/proto selection table for multi-node gfx940 (#1889)
[ROCm/rccl commit: 9c36439354]
2025-09-10 14:25:23 -05:00
Mustafa Abduljabbar 26495be59c Use add_unroll.sh in topo_expl makefile (#1886)
[ROCm/rccl commit: 6e45eaf75e]
2025-09-03 09:43:16 -04:00
Mustafa Abduljabbar 1a7ab8dfc8 Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md

[ROCm/rccl commit: 7ccc6f268f]
2025-09-03 08:54:13 -04:00
BertanDogancay 881327184e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 08a7be231b]
2025-08-28 15:46:28 -05:00
Mustafa Abduljabbar b33b5755f6 Support gfx950 in topo_expl and resolve dependency on FMT (#1829)
* Support gfx950 in topo_expl

* Fix dependencies and fetch fmt from sources

* Remove third_party folder in make clean

* Add empty target when fmt is found

* Add MI350 example

* Update README.md

---------

Co-authored-by: isaki001 <ioannissakiotis@gmail.com>

[ROCm/rccl commit: dfad51e3c9]
2025-08-26 10:11:38 -04:00
Nikhil-Nunna bf4031276c topo_explorer initial readme (#1797)
* topo_explorer initial readme

* topo_explorer readme update

* topo_explorer readme update

* Added sample output to README

* Update README.md

* Update README.md

---------

Co-authored-by: Mustafa Abduljabbar <mustafa.abduljabbar@amd.com>

[ROCm/rccl commit: 7abc7538ea]
2025-07-11 11:28:20 -05:00
Arm Patinyasakdikul 4d71cae249 [topo-expl] update header file location. (#1769)
[ROCm/rccl commit: 35024ca1cb]
2025-06-27 15:29:37 -05:00
Mustafa Abduljabbar 3e5dc99aa6 Fix topo_explorer compatibility and capture WarpSize (#1743)
[ROCm/rccl commit: fb4ad82d0d]
2025-06-16 08:18:35 -04:00
Arm Patinyasakdikul 7f7f1cede3 Added missing copyright message. (#1742)
* Added missing copyright message.

* addressed comments.

[ROCm/rccl commit: 6c37ae9470]
2025-06-12 09:58:01 -05:00
Mustafa Abduljabbar 750bd73047 Add missing MACRO to topo_expl (#1677)
* Fix header compatibility

[ROCm/rccl commit: fdad89690b]
2025-05-05 15:58:57 -04:00
Mustafa Abduljabbar ab4a3eb0c1 Fix topo explorer's compatibility with NCCL 2.24 (#1671)
* Fix build issues

* Fix failure to find path remote rank


[ROCm/rccl commit: f3f3336468]
2025-05-05 15:26:29 -04:00
BertanDogancay d045d0ca23 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a6bf9bfc9e]
2025-04-23 20:47:43 -07:00
Mustafa Abduljabbar 07620c7efd Expose production tuning table in topo_explorer using internal RCCL/NCCL logic (#1628)
* Internal RCCL/NCCL functionality exposed when RCCL_EXPOSE_STATIC is enabled
* Algo/protocol/max channels can be obtained with the new RCCL API
* Introduce rccl_static and rccl_static_inline macros to work around invisible functions in core source files like enqueue.cc
* Add usage example in topo-explorer tool

[ROCm/rccl commit: 82afb2bcfe]
2025-04-23 15:44:56 -04:00
Mustafa Abduljabbar 0a81478bd9 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm

[ROCm/rccl commit: aace4e27f8]
2025-04-02 09:47:29 -04:00
gilbertlee-amd 4f67522420 Removing the experimental clique kernel files (#1610)
[ROCm/rccl commit: 626dc50ab5]
2025-03-20 18:10:01 -06:00
corey-derochie-amd e95578ef4c removed gfx940 and gfx941 (#1606)
* removed gfx940 and gfx941

* removed gfx940 and gfx941

* Update "gfx94" to "gfx942" in init.cc

* Updated remaining "gfx94" updates to "gfx942"

* Update filenames and variables from gfx940 to gfx942

---------

Co-authored-by: akolliasAMD <akollias@amd.com>

[ROCm/rccl commit: 6505639cf4]
2025-03-20 09:34:53 -06:00
gilbertlee-amd 4ca7e6873e Rail optimized trees (#1540)
* Allow disabling rail-optimized trees via RCCL_DISABLE_RAIL_TREES, Graphviz-friendly output via RCCL_OUTPUT_TREES


[ROCm/rccl commit: ddc5d58b93]
2025-02-20 15:18:29 -07:00
gilbertlee-amd 94545f827c Updating topology explorer (#1536)
[ROCm/rccl commit: 6cb0599e38]
2025-02-07 08:44:04 -07:00
Benjamin Kitor fe806d5427 Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P

[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
BertanDogancay 9059445acb Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 84081064a0]
2024-10-02 09:31:25 -05:00
Wenkai Du 74aa13afbe Add another Rome model (#1354)
[ROCm/rccl commit: e453f1ced9]
2024-10-01 17:41:27 -05:00
Wenkai Du 27e0569eed topo_expl: update sm fields in topo xml files (#1310)
[ROCm/rccl commit: 1a48e19b18]
2024-08-29 12:03:51 -07:00
Wenkai Du 157cc5f6ba Add new Rome model (#1304)
* Add another rome model and override

* Fix bug

* Fix typo

* Add ring

* Update ring

* Fix model matching

* Clean up

* Clean up

* Reverse rings for NCCL_RINGS input

* Only reverse NCCL_RINGS for ring graph

* Fix mapping issue when using NCCL_RINGS

* Add NCCL_RINGS_REMAP to handle inconsistent net names

[ROCm/rccl commit: 532b70afb6]
2024-08-23 08:45:43 +08:00
Benjamin Kitor d2df042c36 topo_expl: Update channel masks for >64 channels (#1279)
[ROCm/rccl commit: 4bc118336a]
2024-07-25 17:27:34 -07:00
Nusrat Islam b34fd115a1 doubling debug buffer size with increased channels
[ROCm/rccl commit: 0634c5c8e1]
2024-06-03 13:05:05 -05:00
gilbertlee-amd 422a7ffcbb Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)

[ROCm/rccl commit: 4cb62f999a]
2024-04-15 12:03:57 -06:00
gilbertlee-amd 62b9f0d3a7 [topo_expl] Adding -n option to override number of nodes (#1134)
[ROCm/rccl commit: 93982533d7]
2024-04-04 15:11:47 -06:00
corey-derochie-amd 19897f8d90 Fixes the copyright comment block on each of topo_expl/models/*.xml. The format was not valid XML. (#1124)
[ROCm/rccl commit: 9eefc68cb5]
2024-03-25 16:21:17 -06:00
Andy li e373bd44bf Enable fp8 support (#1101)
* initial checkin

* resolve cr comments

* resolve the build issue

* fix the data correctness issue

* update fp8 header file and update the unit test for fp8 support

* remove fp16 from fp8 headers

* fix ut issue and catch up the latest code from develop

* update according to cr comments

* update ut according to cr comments

* update num floats for each SumPostDiv from 4 to 6

* update fp8 header file name

* fix the typo

[ROCm/rccl commit: 6777e65c1d]
2024-03-08 15:17:53 -08:00
Wenkai Du c2eff3ecd9 topo_expl: 2.19.4 update and fix build error (#1098)
[ROCm/rccl commit: d2224fd3e1]
2024-03-07 08:52:50 -08:00
Wenkai Du 058886cb20 Add another Rome model (#1095)
[ROCm/rccl commit: df98a6957d]
2024-02-28 10:46:05 -08:00
Wenkai Du 874998033f Add new GPU model (#1080)
[ROCm/rccl commit: 74f9e5db64]
2024-02-23 12:19:42 -08:00
Wenkai Du df1d9b2415 topo_expl: 2.19 update
[ROCm/rccl commit: d1575a1622]
2024-01-31 16:11:14 -06:00
Wenkai Du 366cd12bed topo-expl: fix broken build (#1048)
[ROCm/rccl commit: 600b44fee5]
2024-01-17 08:59:03 -08:00
Wenkai Du cd7a346297 Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035)
[ROCm/rccl commit: f7e39fced2]
2024-01-08 08:14:33 -08:00
Wenkai Du dcf623f2ec Add special handling of gfx940 (#976)
* Add special handling of gfx940

* Update ring base

[ROCm/rccl commit: 50b2dd9fd7]
2023-11-22 15:07:36 -08:00
akolliasAMD 8685535346 Fixed topo_expl (#891)
[ROCm/rccl commit: 762a42859e]
2023-09-13 12:05:35 -06:00
Audrey MP 2e3d45a53a Gcn arch name (#886)
We use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. This change includes a few helper functions as drop-in replacements for the functionality we previously used gcnArch for: sometimes to enable flags, and sometimes to set frequencies.

[ROCm/rccl commit: e58ec78d35]
2023-09-12 15:34:40 -04:00
akolliasAMD 56129830a6 NCCL_TREES variable and rome model fixes (#856)
[ROCm/rccl commit: d33cd5a233]
2023-08-21 10:35:37 -06:00
Wenkai Du 0c31452135 Add new model support (#847)
* Add new model support

* Update new rings

[ROCm/rccl commit: 7044599575]
2023-08-10 17:14:51 -07:00
Wenkai Du dfda1d6fab Enable gfx94x (#808) (#816)
(cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)

[ROCm/rccl commit: a7fcd58a97]
2023-07-21 07:31:27 -07:00
Wenkai Du f98715baea Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: abd0615351]
2023-06-26 22:51:56 +00:00
Wenkai Du 90cbef7042 Add NCCL_NCHANNELS_PER_PEER override (#767)
Also fix topo_expl build issue

[ROCm/rccl commit: 3af90902c8]
2023-06-06 08:41:38 -07:00
Ziyue Yang f7f669e7f0 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message

[ROCm/rccl commit: e3b2342f39]
2023-03-14 14:34:25 -07:00
Wenkai Du c76bc214c8 Merge remote-tracking branch 'nccl/master' into HEAD
[ROCm/rccl commit: e1cb45ff22]
2023-02-04 01:44:43 +00:00
Wenkai Du 7be2c55b32 topo_expl: fix broken build by adding hipify steps (#670)
[ROCm/rccl commit: a0dd8e0b84]
2023-01-06 07:29:40 -08:00
Wenkai Du ffecb74b1e Update tuning table and fix topo_expl
[ROCm/rccl commit: 94ad7f6f51]
2022-11-07 18:24:24 +00:00
Wenkai Du 36e5e02e46 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 4f0e223db4]
2022-10-20 15:41:29 +00:00
Wenkai Du 7fe0b0161f topo_expl: fix compilation error (#639)
[ROCm/rccl commit: fc554a2428]
2022-10-19 14:19:50 -07:00