Commit Graph

42 Commits

Author SHA1 Message Date
Wenkai Du ffecb74b1e Update tuning table and fix topo_expl
[ROCm/rccl commit: 94ad7f6f51]
2022-11-07 18:24:24 +00:00
Wenkai Du 36e5e02e46 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: 4f0e223db4]
2022-10-20 15:41:29 +00:00
Wenkai Du 7fe0b0161f topo_expl: fix compilation error (#639)
[ROCm/rccl commit: fc554a2428]
2022-10-19 14:19:50 -07:00
Wenkai Du 7874a99c75 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a79d9e3586]
2022-09-09 16:05:38 +00:00
akolliasAMD 22dc8bd246 Added creation of new tree and added switch for using treesplit for specific cases (#551)
[ROCm/rccl commit: 98f0809a39]
2022-05-25 18:55:14 -04:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 3332cdff07 Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing

[ROCm/rccl commit: bbe780ca6c]
2022-03-31 17:09:21 -07:00
Ziyue Yang dfa9b9e958 Add Pivot AllToAll algorithm for Rome model (#503)
* add a2a pivot interface

* remove debug info

* address comments

* fix bug

* remove custom script

* address comments

* fix bug

[ROCm/rccl commit: b569c0a1db]
2022-02-20 21:09:47 -08:00
Wenkai Du 4d43d9ce22 Update Rome models (#491)
[ROCm/rccl commit: 598c6fdded]
2022-01-14 10:03:30 -08:00
Wenkai Du 02a94fc552 topo_expl: update for 2.11.4 (#490)
* topo_expl: update for 2.11.4

* topo_expl: revert a few logging changes

[ROCm/rccl commit: 369c021992]
2022-01-13 13:33:07 -08:00
Wenkai Du fd98ee84b4 Update Rome model matching (#461)
* Update Rome model matching

* Add another Rome model

* Automatically setup NET GDR level from model

[ROCm/rccl commit: 0331e39f81]
2021-11-05 08:53:47 -07:00
Wenkai Du b587b55c2e Add more Rome models (#434)
* Add more Rome models

* Update models and tuning

* Update tuning

[ROCm/rccl commit: 2249a1d9d3]
2021-10-12 08:23:20 -07:00
Wenkai Du 91eca0d7d2 Trim NICs when all GPUs are connected by XGMI (#430)
* Trim NICs when all GPUs are connected by XGMI

* Only enable clique with maximum of 2 hops

[ROCm/rccl commit: 29c729d8b6]
2021-10-05 18:27:43 -07:00
Wenkai Du 4fd7a14087 Merge remote-tracking branch 'origin/develop' into 2.10.3
[ROCm/rccl commit: d5f93649ff]
2021-08-24 09:49:47 -07:00
Wenkai Du b9508a6aba Implement NIC identification and remapping (#420)
* Add 1H16P GPU model

* Implement NIC identification and remapping

* Revert "Sort IB devices based on device name (#413)"

This reverts commit de0c586bad.

* Fix permute and check order

* Correction on IB speed reporting

* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"

This reverts commit fa690c47a0.

[ROCm/rccl commit: 5c8380ff5b]
2021-08-24 09:42:04 -07:00
Wenkai Du 4b89e98675 Merge remote-tracking branch 'nccl/master' into 2.10.3
[ROCm/rccl commit: bf2339f93e]
2021-07-30 16:23:14 -07:00
Wenkai Du faea6ead5c Query XGMI links from xml and adjust gfx906 channel usage (#410)
[ROCm/rccl commit: 818cdb16a8]
2021-07-27 17:32:41 -07:00
Wenkai Du 8fbeb14175 topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib

* Use minimal of 4 channels for gfx908

[ROCm/rccl commit: 135d47d125]
2021-07-27 08:30:08 -07:00
Wenkai Du 90ae176437 Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398)
[ROCm/rccl commit: fa6d7e9a63]
2021-06-22 08:41:49 -07:00
Wenkai Du 5bebcb0015 Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet

* Use separate operation counters for coll and p2p

[ROCm/rccl commit: b815a2800f]
2021-06-09 13:24:26 -07:00
Wenkai Du c8a432dc25 Allow intranode use of network connection (#383)
* Allow intranode use of network connection

* Checking for graph for null pointer

[ROCm/rccl commit: a3a8c2d56b]
2021-06-08 07:37:59 -07:00
Wenkai Du cdf2780687 topo_expl: update to 2.9.9
[ROCm/rccl commit: 13dc80ee14]
2021-05-26 09:24:34 -07:00
Wenkai Du a76bebf8b6 Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: a4ea1fed5b]
2021-05-05 16:01:01 -07:00
Wenkai Du b4a7fa7011 Cleanup number of channels calculation (#340)
[ROCm/rccl commit: e26ad2995e]
2021-04-05 17:51:56 -07:00
Wenkai Du 8927d8bf17 Fix incorrect net counting (#339)
* Fix incorrect net counting

* Add comments

[ROCm/rccl commit: 17491c918e]
2021-04-05 12:21:57 -07:00
Wenkai Du 065bde98d8 collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
Wenkai Du 287ed0f18a Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet

[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du b7253710ca Revert "Port alltoall[v]" (#325)
This reverts commit 2c49121171.

[ROCm/rccl commit: 8e180cf087]
2021-03-06 13:59:31 -08:00
Wenkai Du bcf4ecb0e3 Enable local sendrecv over network if GDR is available on all GPUs (#324)
[ROCm/rccl commit: c018edf0f2]
2021-03-05 19:59:41 -08:00
Wenkai Du 6c3ccc2192 Add support to another Rome model
[ROCm/rccl commit: 95f178324c]
2021-02-18 02:00:31 +00:00
Wenkai Du d4382de267 Improve collective trace
[ROCm/rccl commit: 2ddbe6646b]
2021-01-14 19:28:01 -05:00
Wenkai Du 2c49121171 Port alltoall[v]
[ROCm/rccl commit: f4d5d3d620]
2021-01-14 19:28:01 -05:00
Wenkai Du adff98765c Merge remote-tracking branch 'nccl/master' into no-target-id
[ROCm/rccl commit: d469947641]
2021-01-14 19:27:53 -05:00
Wenkai Du 41260bb948 Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models

* Remove unused opCount in p2p transport

[ROCm/rccl commit: ae008fd2db]
2020-10-07 13:37:36 -07:00
Wenkai Du dbde26e681 Add Alltoallv RCCL kernel implementation (#269)
* Add alltoallv API and implementation

* Extend Rome P2P channel limit to multinode and alltoall kernels

* topo_expl: fix compilation and sync up with main

* gtest: use RCCL alltoallv API

* Code review changes

[ROCm/rccl commit: b871ea3c0c]
2020-09-30 16:25:36 -07:00
Wenkai Du 03bb6bcb54 Increase minimal channels for gfx908 (#259)
[ROCm/rccl commit: c5cbece6d0]
2020-08-26 11:40:11 -07:00
Wenkai Du 3e2c9054cd Change default channels duplication for chordal ring (#233)
[ROCm/rccl commit: ab787c767e]
2020-07-14 15:16:50 -07:00
Wenkai Du e8da2a0da6 topo_expl: fix broken build (#224)
[ROCm/rccl commit: a6be82f5ab]
2020-06-30 11:11:23 -07:00
Wenkai Du 15cd2aff3c Add gather, scatter and alltoall collectives
Introducing 3 new APIs:
ncclResult_t  ncclGather(const void* sendbuff, void* recvbuff, size_t sendcount,
    ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream);
ncclResult_t  ncclScatter(const void* sendbuff, void* recvbuff,
    size_t recvcount, ncclDataType_t datatype, int root, ncclComm_t comm,
    hipStream_t stream);
ncclResult_t  ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream);

Only out-of-place operation is supported.
Preprocessor symbol RCCL_GATHER_SCATTER=1 indicates API availability.
By default the APIs launch the RCCL kernel implementation, which can be disabled by setting
RCCL_ALLTOALL_KERNEL_DISABLE=1. The APIs then fall back to a wrapper around ncclSend and ncclRecv.
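
The buffer layout implied by the three signatures above can be sketched with a small host-side model. This is not RCCL code and calls no RCCL function. The helper names (modelGather, modelScatter, modelAllToAll) are hypothetical, and the layout is an assumption: it follows the conventional MPI-style ordering, where the chunk for rank r sits at offset r * count, which the commit message does not state explicitly. Each inner vector stands in for one rank's out-of-place buffer.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference model only (assumed layout, hypothetical names).
// Buffers[r] is the buffer owned by rank r.
using Buffers = std::vector<std::vector<int>>;

// Gather: the root's receive buffer holds rank r's sendcount elements
// at offset r * sendcount.
std::vector<int> modelGather(const Buffers& send, std::size_t sendcount) {
    const std::size_t nranks = send.size();
    std::vector<int> rootRecv(nranks * sendcount);
    for (std::size_t r = 0; r < nranks; ++r)
        for (std::size_t i = 0; i < sendcount; ++i)
            rootRecv[r * sendcount + i] = send[r][i];
    return rootRecv;
}

// Scatter: rank r receives the recvcount elements that start at
// offset r * recvcount in the root's send buffer.
Buffers modelScatter(const std::vector<int>& rootSend, std::size_t nranks,
                     std::size_t recvcount) {
    Buffers recv(nranks, std::vector<int>(recvcount));
    for (std::size_t r = 0; r < nranks; ++r)
        for (std::size_t i = 0; i < recvcount; ++i)
            recv[r][i] = rootSend[r * recvcount + i];
    return recv;
}

// AllToAll: chunk p of rank r's receive buffer is chunk r of rank p's
// send buffer; every chunk has `count` elements.
Buffers modelAllToAll(const Buffers& send, std::size_t count) {
    const std::size_t nranks = send.size();
    Buffers recv(nranks, std::vector<int>(nranks * count));
    for (std::size_t r = 0; r < nranks; ++r)
        for (std::size_t p = 0; p < nranks; ++p)
            for (std::size_t i = 0; i < count; ++i)
                recv[r][p * count + i] = send[p][r * count + i];
    return recv;
}
```

A model like this can serve as the expected-result oracle in a unit test: run the real collective on device buffers, copy the results back, and compare them against the model's output for the same inputs.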


[ROCm/rccl commit: e80e29573c]
2020-06-09 17:44:08 -07:00
Wenkai Du 69eb70ce43 topo_expl: update to 2.7
[ROCm/rccl commit: 71ec3e09df]
2020-06-09 17:40:24 -07:00
Wenkai Du 8852e54181 topo_expl: update to 2.6
[ROCm/rccl commit: 6f54b23503]
2020-04-01 13:37:08 -07:00
Wenkai Du 00f421ccbd Add topology explorer
[ROCm/rccl commit: 55f8e2dec7]
2020-02-19 14:42:06 -08:00