64 Commits

Author SHA1 Message Date
Marzieh Berenjkoub d7293281f3 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 858b4e76eb]
2026-01-20 13:04:02 -06:00
Mustafa Abduljabbar 1a7ab8dfc8 Force enable proto and/or algo after model selection (#1799)
* Force enable proto or algo

* Remove inc nccl_common.h

* Move logic and add error checks

* Fix topo_expl compatibility

* Allow algo/proto overrides

* Remove extra function decl

* Clarify warning message

* Move algo/proto overrides into separate functions

* Update CHANGELOG.md

[ROCm/rccl commit: 7ccc6f268f]
2025-09-03 08:54:13 -04:00
BertanDogancay 881327184e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 08a7be231b]
2025-08-28 15:46:28 -05:00
Mustafa Abduljabbar 0a81478bd9 Fix topo explorer's nccl 2.23 compatibility (#1623)
* Fix compiler issues due to broken compatibility 

* Fix segfault and pass rank instead of busid and add a pointer to cover a new algorithm

[ROCm/rccl commit: aace4e27f8]
2025-04-02 09:47:29 -04:00
gilbertlee-amd 4f67522420 Removing the experimental clique kernel files (#1610)
[ROCm/rccl commit: 626dc50ab5]
2025-03-20 18:10:01 -06:00
gilbertlee-amd 94545f827c Updating topology explorer (#1536)
[ROCm/rccl commit: 6cb0599e38]
2025-02-07 08:44:04 -07:00
Benjamin Kitor fe806d5427 Add Topologies for 16-GPU gfx942 SuperNode (#1417)
* Add Topologies for 16-GPU gfx942 SuperNode

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl

* Fix bug w/ 1H16P

[ROCm/rccl commit: a05329bd0d]
2024-12-03 13:12:03 -08:00
BertanDogancay 9059445acb Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 84081064a0]
2024-10-02 09:31:25 -05:00
Benjamin Kitor d2df042c36 topo_expl: Update channel masks for >64 channels (#1279)
[ROCm/rccl commit: 4bc118336a]
2024-07-25 17:27:34 -07:00
Nusrat Islam b34fd115a1 doubling debug buffer size with increased channels
[ROCm/rccl commit: 0634c5c8e1]
2024-06-03 13:05:05 -05:00
gilbertlee-amd 422a7ffcbb Rail optimization for rings (#1140)
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)

[ROCm/rccl commit: 4cb62f999a]
2024-04-15 12:03:57 -06:00
Wenkai Du c2eff3ecd9 topo_expl: 2.19.4 update and fix build error (#1098)
[ROCm/rccl commit: d2224fd3e1]
2024-03-07 08:52:50 -08:00
Wenkai Du df1d9b2415 topo_expl: 2.19 update
[ROCm/rccl commit: d1575a1622]
2024-01-31 16:11:14 -06:00
Wenkai Du 366cd12bed topo-expl: fix broken build (#1048)
[ROCm/rccl commit: 600b44fee5]
2024-01-17 08:59:03 -08:00
Wenkai Du cd7a346297 Doubling buffer size to fix NCCL INFO corruption with increased channels (#1035)
[ROCm/rccl commit: f7e39fced2]
2024-01-08 08:14:33 -08:00
akolliasAMD 8685535346 Fixed topo_expl (#891)
[ROCm/rccl commit: 762a42859e]
2023-09-13 12:05:35 -06:00
Audrey MP 2e3d45a53a Gcn arch name (#886)
We use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. The change includes a few helper functions as drop-in replacements for the functionality we previously used gcnArch for: sometimes to enable flags, and sometimes to set frequencies.

[ROCm/rccl commit: e58ec78d35]
2023-09-12 15:34:40 -04:00
akolliasAMD 56129830a6 NCCL_TREES variable and rome model fixes (#856)
[ROCm/rccl commit: d33cd5a233]
2023-08-21 10:35:37 -06:00
Wenkai Du f98715baea Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: abd0615351]
2023-06-26 22:51:56 +00:00
Wenkai Du 90cbef7042 Add NCCL_NCHANNELS_PER_PEER override (#767)
Also fix topo_expl build issue

[ROCm/rccl commit: 3af90902c8]
2023-06-06 08:41:38 -07:00
Ziyue Yang f7f669e7f0 MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing

* Use external scheduler

* Fix cmake error

* Address comments

* Fix thread safe issue

* Make MSCCL lifecycle APIs thread safe

* Make MSCCL internal scheduler aware of topology hint

* Revise error message

[ROCm/rccl commit: e3b2342f39]
2023-03-14 14:34:25 -07:00
Wenkai Du c76bc214c8 Merge remote-tracking branch 'nccl/master' into HEAD
[ROCm/rccl commit: e1cb45ff22]
2023-02-04 01:44:43 +00:00
Wenkai Du ffecb74b1e Update tuning table and fix topo_expl
[ROCm/rccl commit: 94ad7f6f51]
2022-11-07 18:24:24 +00:00
Wenkai Du 36e5e02e46 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 4f0e223db4]
2022-10-20 15:41:29 +00:00
Wenkai Du 7fe0b0161f topo_expl: fix compilation error (#639)
[ROCm/rccl commit: fc554a2428]
2022-10-19 14:19:50 -07:00
Wenkai Du 7874a99c75 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a79d9e3586]
2022-09-09 16:05:38 +00:00
akolliasAMD 22dc8bd246 Added creation of new tree and added switch for using treesplit for specific cases (#551)
[ROCm/rccl commit: 98f0809a39]
2022-05-25 18:55:14 -04:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 3332cdff07 Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing

[ROCm/rccl commit: bbe780ca6c]
2022-03-31 17:09:21 -07:00
Ziyue Yang dfa9b9e958 Add Pivot AllToAll algorithm for Rome model (#503)
* add a2a pivot interface

* remove debug info

* address comments

* fix bug

* remove custom script

* address comments

* fix bug

[ROCm/rccl commit: b569c0a1db]
2022-02-20 21:09:47 -08:00
Wenkai Du 4d43d9ce22 Update Rome models (#491)
[ROCm/rccl commit: 598c6fdded]
2022-01-14 10:03:30 -08:00
Wenkai Du 02a94fc552 topo_expl: update for 2.11.4 (#490)
* topo_expl: update for 2.11.4

* topo_expl: revert a few logging changes

[ROCm/rccl commit: 369c021992]
2022-01-13 13:33:07 -08:00
Wenkai Du fd98ee84b4 Update Rome model matching (#461)
* Update Rome model matching

* Add another Rome model

* Automatically setup NET GDR level from model

[ROCm/rccl commit: 0331e39f81]
2021-11-05 08:53:47 -07:00
Wenkai Du b587b55c2e Add more Rome models (#434)
* Add more Rome models

* Update models and tuning

* Update tuning

[ROCm/rccl commit: 2249a1d9d3]
2021-10-12 08:23:20 -07:00
Wenkai Du 91eca0d7d2 Trim NICs when all GPUs are connected by XGMI (#430)
* Trim NICs when all GPUs are connected by XGMI

* Only enable clique with maximum of 2 hops

[ROCm/rccl commit: 29c729d8b6]
2021-10-05 18:27:43 -07:00
Wenkai Du 4fd7a14087 Merge remote-tracking branch 'origin/develop' into 2.10.3
[ROCm/rccl commit: d5f93649ff]
2021-08-24 09:49:47 -07:00
Wenkai Du b9508a6aba Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit de0c586bad.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit fa690c47a0.

[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du 4b89e98675 Merge remote-tracking branch 'nccl/master' into 2.10.3
[ROCm/rccl commit: bf2339f93e]
2021-07-30 16:23:14 -07:00
Wenkai Du faea6ead5c Query XGMI links from xml and adjust gfx906 channel usage (#410)
[ROCm/rccl commit: 818cdb16a8]
2021-07-27 17:32:41 -07:00
Wenkai Du 8fbeb14175 topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib

* Use a minimum of 4 channels for gfx908

[ROCm/rccl commit: 135d47d125]
2021-07-27 08:30:08 -07:00
Wenkai Du 90ae176437 Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398)
[ROCm/rccl commit: fa6d7e9a63]
2021-06-22 08:41:49 -07:00
Wenkai Du 5bebcb0015 Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p

[ROCm/rccl commit: b815a2800f]
2021-06-09 13:24:26 -07:00
Wenkai Du c8a432dc25 Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Check graph for null pointer

[ROCm/rccl commit: a3a8c2d56b]
2021-06-08 07:37:59 -07:00
Wenkai Du cdf2780687 topo_expl: update to 2.9.9
[ROCm/rccl commit: 13dc80ee14]
2021-05-26 09:24:34 -07:00
Wenkai Du a76bebf8b6 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a4ea1fed5b]
2021-05-05 16:01:01 -07:00
Wenkai Du b4a7fa7011 Cleanup number of channels calculation (#340)
[ROCm/rccl commit: e26ad2995e]
2021-04-05 17:51:56 -07:00
Wenkai Du 8927d8bf17 Fix incorrect net counting (#339)
* Fix incorrect net counting

* Add comments

[ROCm/rccl commit: 17491c918e]
2021-04-05 12:21:57 -07:00
Wenkai Du 065bde98d8 collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
Wenkai Du 287ed0f18a Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet

[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du b7253710ca Revert "Port alltoall[v]" (#325)
This reverts commit 2c49121171.

[ROCm/rccl commit: 8e180cf087]
2021-03-06 13:59:31 -08:00