Marzieh Berenjkoub
d7293281f3
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 858b4e76eb ]
2026-01-20 13:04:02 -06:00
Mustafa Abduljabbar
1a7ab8dfc8
Force enable proto and/or algo after model selection ( #1799 )
...
* Force enable proto or algo
* Remove inc nccl_common.h
* Move logic and add error checks
* Fix topo_expl compatibility
* Allow algo/proto overrides
* Remove extra function decl
* Clarify warning message
* Move algo/proto overrides into separate functions
* Update CHANGELOG.md
[ROCm/rccl commit: 7ccc6f268f ]
2025-09-03 08:54:13 -04:00
BertanDogancay
881327184e
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 08a7be231b ]
2025-08-28 15:46:28 -05:00
Mustafa Abduljabbar
0a81478bd9
Fix topo explorer's nccl 2.23 compatibility ( #1623 )
...
* Fix compiler issues due to broken compatibility
* Fix segfault, pass rank instead of busid, and add a pointer to cover a new algorithm
[ROCm/rccl commit: aace4e27f8 ]
2025-04-02 09:47:29 -04:00
gilbertlee-amd
4f67522420
Removing the experimental clique kernel files ( #1610 )
...
[ROCm/rccl commit: 626dc50ab5 ]
2025-03-20 18:10:01 -06:00
gilbertlee-amd
94545f827c
Updating topology explorer ( #1536 )
...
[ROCm/rccl commit: 6cb0599e38 ]
2025-02-07 08:44:04 -07:00
Benjamin Kitor
fe806d5427
Add Topologies for 16-GPU gfx942 SuperNode ( #1417 )
...
* Add Topologies for 16-GPU gfx942 SuperNode
- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
* Fix bug w/ 1H16P
[ROCm/rccl commit: a05329bd0d ]
2024-12-03 13:12:03 -08:00
BertanDogancay
9059445acb
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 84081064a0 ]
2024-10-02 09:31:25 -05:00
Benjamin Kitor
d2df042c36
topo_expl: Update channel masks for >64 channels ( #1279 )
...
[ROCm/rccl commit: 4bc118336a ]
2024-07-25 17:27:34 -07:00
Nusrat Islam
b34fd115a1
doubling debug buffer size with increased channels
...
[ROCm/rccl commit: 0634c5c8e1 ]
2024-06-03 13:05:05 -05:00
gilbertlee-amd
422a7ffcbb
Rail optimization for rings ( #1140 )
...
- Modifies the ring creation algorithm to be friendlier to rail-optimized topologies (should not affect classic fabric topologies)
[ROCm/rccl commit: 4cb62f999a ]
2024-04-15 12:03:57 -06:00
Wenkai Du
c2eff3ecd9
topo_expl: 2.19.4 update and fix build error ( #1098 )
...
[ROCm/rccl commit: d2224fd3e1 ]
2024-03-07 08:52:50 -08:00
Wenkai Du
df1d9b2415
topo_expl: 2.19 update
...
[ROCm/rccl commit: d1575a1622 ]
2024-01-31 16:11:14 -06:00
Wenkai Du
366cd12bed
topo-expl: fix broken build ( #1048 )
...
[ROCm/rccl commit: 600b44fee5 ]
2024-01-17 08:59:03 -08:00
Wenkai Du
cd7a346297
Doubling buffer size to fix NCCL INFO corruption with increased channels ( #1035 )
...
[ROCm/rccl commit: f7e39fced2 ]
2024-01-08 08:14:33 -08:00
akolliasAMD
8685535346
Fixed topo_expl ( #891 )
...
[ROCm/rccl commit: 762a42859e ]
2023-09-13 12:05:35 -06:00
Audrey MP
2e3d45a53a
Gcn arch name ( #886 )
...
We use CMake to determine whether we're compiling against a version of ROCm that supports gcnArchName, and handle architecture checking accordingly. The change includes a few helper functions as drop-ins for the functionality we previously used gcnArch for: sometimes to enable flags, and sometimes to set frequencies.
[ROCm/rccl commit: e58ec78d35 ]
2023-09-12 15:34:40 -04:00
akolliasAMD
56129830a6
NCCL_TREES variable and rome model fixes ( #856 )
...
[ROCm/rccl commit: d33cd5a233 ]
2023-08-21 10:35:37 -06:00
Wenkai Du
f98715baea
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: abd0615351 ]
2023-06-26 22:51:56 +00:00
Wenkai Du
90cbef7042
Add NCCL_NCHANNELS_PER_PEER override ( #767 )
...
Also fix topo_expl build issue
[ROCm/rccl commit: 3af90902c8 ]
2023-06-06 08:41:38 -07:00
Ziyue Yang
f7f669e7f0
MSCCL: Improve executor and integrate scheduler ( #694 )
...
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safe issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
[ROCm/rccl commit: e3b2342f39 ]
2023-03-14 14:34:25 -07:00
Wenkai Du
c76bc214c8
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: e1cb45ff22 ]
2023-02-04 01:44:43 +00:00
Wenkai Du
ffecb74b1e
Update tuning table and fix topo_expl
...
[ROCm/rccl commit: 94ad7f6f51 ]
2022-11-07 18:24:24 +00:00
Wenkai Du
36e5e02e46
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: 4f0e223db4 ]
2022-10-20 15:41:29 +00:00
Wenkai Du
7fe0b0161f
topo_expl: fix compilation error ( #639 )
...
[ROCm/rccl commit: fc554a2428 ]
2022-10-19 14:19:50 -07:00
Wenkai Du
7874a99c75
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: a79d9e3586 ]
2022-09-09 16:05:38 +00:00
akolliasAMD
22dc8bd246
Added creation of new tree and added switch for using treesplit for specific cases ( #551 )
...
[ROCm/rccl commit: 98f0809a39 ]
2022-05-25 18:55:14 -04:00
Wenkai Du
67e7e6507e
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: d28e1cb44f ]
2022-04-18 11:15:25 -07:00
Wenkai Du
3332cdff07
Support multiple tuning tables ( #522 )
...
* Support multiple tuning tables
* [UnitTests] Skip managed memory testing
[ROCm/rccl commit: bbe780ca6c ]
2022-03-31 17:09:21 -07:00
Ziyue Yang
dfa9b9e958
Add Pivot AllToAll algorithm for Rome model ( #503 )
...
* add a2a pivot interface
* remove debug info
* address comments
* fix bug
* remove custom script
* address comments
* fix bug
[ROCm/rccl commit: b569c0a1db ]
2022-02-20 21:09:47 -08:00
Wenkai Du
4d43d9ce22
Update Rome models ( #491 )
...
[ROCm/rccl commit: 598c6fdded ]
2022-01-14 10:03:30 -08:00
Wenkai Du
02a94fc552
topo_expl: update for 2.11.4 ( #490 )
...
* topo_expl: update for 2.11.4
* topo_expl: revert a few logging changes
[ROCm/rccl commit: 369c021992 ]
2022-01-13 13:33:07 -08:00
Wenkai Du
fd98ee84b4
Update Rome model matching ( #461 )
...
* Update Rome model matching
* Add another Rome model
* Automatically setup NET GDR level from model
[ROCm/rccl commit: 0331e39f81 ]
2021-11-05 08:53:47 -07:00
Wenkai Du
b587b55c2e
Add more Rome models ( #434 )
...
* Add more Rome models
* Update models and tuning
* Update tuning
[ROCm/rccl commit: 2249a1d9d3 ]
2021-10-12 08:23:20 -07:00
Wenkai Du
91eca0d7d2
Trim NICs when all GPUs are connected by XGMI ( #430 )
...
* Trim NICs when all GPUs are connected by XGMI
* Only enable clique with maximum of 2 hops
[ROCm/rccl commit: 29c729d8b6 ]
2021-10-05 18:27:43 -07:00
Wenkai Du
4fd7a14087
Merge remote-tracking branch 'origin/develop' into 2.10.3
...
[ROCm/rccl commit: d5f93649ff ]
2021-08-24 09:49:47 -07:00
Wenkai Du
b9508a6aba
Implement NIC identification and remapping ( #420 )
...
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413 )"
This reverts commit de0c586bad .
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361 )"
This reverts commit fa690c47a0 .
[ROCm/rccl commit: 5c8380ff5b ]
2021-08-24 09:42:04 -07:00
Wenkai Du
4b89e98675
Merge remote-tracking branch 'nccl/master' into 2.10.3
...
[ROCm/rccl commit: bf2339f93e ]
2021-07-30 16:23:14 -07:00
Wenkai Du
faea6ead5c
Query XGMI links from xml and adjust gfx906 channel usage ( #410 )
...
[ROCm/rccl commit: 818cdb16a8 ]
2021-07-27 17:32:41 -07:00
Wenkai Du
8fbeb14175
topo_expl: fix build after switching to rocm-smi-lib ( #405 )
...
* topo_expl: fix build after switching to rocm-smi-lib
* Use a minimum of 4 channels for gfx908
[ROCm/rccl commit: 135d47d125 ]
2021-07-27 08:30:08 -07:00
Wenkai Du
90ae176437
Fixes for NCCL_MAX_NCHANNELS and topo_expl ( #398 )
...
[ROCm/rccl commit: fa6d7e9a63 ]
2021-06-22 08:41:49 -07:00
Wenkai Du
5bebcb0015
Setup collectives threshold for enabling intranet ( #387 )
...
* Setup collectives threshold for enabling intranet
* Use separate operation counters for coll and p2p
[ROCm/rccl commit: b815a2800f ]
2021-06-09 13:24:26 -07:00
Wenkai Du
c8a432dc25
Allow intranode use of network connection ( #383 )
...
* Allow intranode use of network connection
* Check graph for null pointer
[ROCm/rccl commit: a3a8c2d56b ]
2021-06-08 07:37:59 -07:00
Wenkai Du
cdf2780687
topo_expl: update to 2.9.9
...
[ROCm/rccl commit: 13dc80ee14 ]
2021-05-26 09:24:34 -07:00
Wenkai Du
a76bebf8b6
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: a4ea1fed5b ]
2021-05-05 16:01:01 -07:00
Wenkai Du
b4a7fa7011
Cleanup number of channels calculation ( #340 )
...
[ROCm/rccl commit: e26ad2995e ]
2021-04-05 17:51:56 -07:00
Wenkai Du
8927d8bf17
Fix incorrect net counting ( #339 )
...
* Fix incorrect net counting
* Add comments
[ROCm/rccl commit: 17491c918e ]
2021-04-05 12:21:57 -07:00
Wenkai Du
065bde98d8
collnet: support multiple NICs ( #335 )
...
[ROCm/rccl commit: d87dc7c2e8 ]
2021-03-25 20:59:32 -07:00
Wenkai Du
287ed0f18a
Enable collnet in RCCL ( #333 )
...
* Enable CollNet and use different number of channels
* topo_expl: enable collnet
[ROCm/rccl commit: 1d6244b18d ]
2021-03-19 12:58:13 -07:00
Wenkai Du
b7253710ca
Revert "Port alltoall[v]" ( #325 )
...
This reverts commit 2c49121171 .
[ROCm/rccl commit: 8e180cf087 ]
2021-03-06 13:59:31 -08:00