Ziyue Yang
e3b2342f39
MSCCL: Improve executor and integrate scheduler (#694)
* MSCCL: improve executor and add scheduler for testing
* Use external scheduler
* Fix cmake error
* Address comments
* Fix thread safe issue
* Make MSCCL lifecycle APIs thread safe
* Make MSCCL internal scheduler aware of topology hint
* Revise error message
2023-03-14 14:34:25 -07:00
Wenkai Du
e1cb45ff22
Merge remote-tracking branch 'nccl/master' into HEAD
2023-02-04 01:44:43 +00:00
Wenkai Du
a0dd8e0b84
topo_expl: fix broken build by adding hipify steps (#670)
2023-01-06 07:29:40 -08:00
Wenkai Du
94ad7f6f51
Update tuning table and fix topo_expl
2022-11-07 18:24:24 +00:00
Wenkai Du
4f0e223db4
Merge remote-tracking branch 'nccl/master' into develop
2022-10-20 15:41:29 +00:00
Wenkai Du
fc554a2428
topo_expl: fix compilation error (#639)
2022-10-19 14:19:50 -07:00
gilbertlee-amd
ebb8b5bf63
Updating files for missing licenses (#637)
2022-10-14 13:49:16 -06:00
Wenkai Du
a79d9e3586
Merge remote-tracking branch 'nccl/master' into develop
2022-09-09 16:05:38 +00:00
arvindcheru
2cb2f9493a
HIP Path default updated to ROCM_PATH (reorg path) (#592)
Updated default path for hip to ROCM_PATH (/opt/rocm instead of /opt/rocm/hip) as per new/current structure.
2022-08-04 13:38:41 -04:00
Edgar
0336ffdf70
Introduce multi-rank support per device.
This is a single commit of the source code changes required to
introduce support for multiple ranks per device.
A new interface (ncclCommRankInitMulti) has to be used to make use of
this new feature.
2022-06-10 14:23:12 +00:00
Wenkai Du
ef499c4810
Add another Rome model (#553)
* Add another Rome model
* Add option to force enable intranet on single node
* Limit p2p channels to number of ranks
* Refine p2p channels handling
2022-05-31 11:31:30 -07:00
Wenkai Du
c5b77121f0
Update Rome model (#552)
2022-05-26 09:59:23 -07:00
akolliasAMD
98f0809a39
Added creation of new tree and added switch for using treesplit for specific cases (#551)
2022-05-25 18:55:14 -04:00
Wenkai Du
283dc86a73
Refine and add new Rome models (#548)
2022-05-17 08:23:59 -07:00
Wenkai Du
063da25563
topo_expl: fix build and add tuning support (#539)
2022-04-26 15:40:07 -07:00
Wenkai Du
d28e1cb44f
Merge remote-tracking branch 'nccl/master' into develop
2022-04-18 11:15:25 -07:00
Wenkai Du
2151c79d14
Add new Rome model (#536)
2022-04-13 11:45:40 -07:00
Wenkai Du
ba4c165bf3
Add new Rome model (#535)
2022-04-12 13:27:32 -07:00
Wenkai Du
bbe780ca6c
Support multiple tuning tables (#522)
* Support multiple tuning tables
* [UnitTests] Skip managed memory testing
2022-03-31 17:09:21 -07:00
Wenkai Du
cd17cf6dce
Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models
* Add missing file
* Models update
2022-03-21 10:54:40 -07:00
Ziyue Yang
b569c0a1db
Add Pivot AllToAll algorithm for Rome model (#503)
* add a2a pivot interface
* remove debug info
* address comments
* fix bug
* remove custom script
* address comments
* fix bug
2022-02-20 21:09:47 -08:00
Wenkai Du
598c6fdded
Update Rome models (#491)
2022-01-14 10:03:30 -08:00
Wenkai Du
369c021992
topo_expl: update for 2.11.4 (#490)
* topo_expl: update for 2.11.4
* topo_expl: revert a few logging changes
2022-01-13 13:33:07 -08:00
Wenkai Du
f8d0775a6f
Add another Rome model (#483)
2022-01-05 09:26:31 -08:00
Wenkai Du
0331e39f81
Update Rome model matching (#461)
* Update Rome model matching
* Add another Rome model
* Automatically setup NET GDR level from model
2021-11-05 08:53:47 -07:00
Wenkai Du
14a184eb67
Query XGMI link count through rocm_smi_lib API (#442)
2021-10-26 10:30:20 -07:00
Wenkai Du
2249a1d9d3
Add more Rome models (#434)
* Add more Rome models
* Update models and tuning
* Update tuning
2021-10-12 08:23:20 -07:00
Wenkai Du
e0053311c0
Add another Rome model (#431)
2021-10-06 08:17:12 -07:00
Wenkai Du
29c729d8b6
Trim NICs when all GPUs are connected by XGMI (#430)
* Trim NICs when all GPUs are connected by XGMI
* Only enable clique with maximum of 2 hops
2021-10-05 18:27:43 -07:00
Wenkai Du
d5f93649ff
Merge remote-tracking branch 'origin/develop' into 2.10.3
2021-08-24 09:49:47 -07:00
Wenkai Du
5c8380ff5b
Implement NIC identification and remapping (#420)
* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit 2d0ed8dff6.
* Fix permute and check order
* Correction on IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit caf5c9992a.
2021-08-24 09:42:04 -07:00
Wenkai Du
5f15ed6e3e
Add gfx908 VM model (#418)
2021-08-10 08:55:11 -07:00
Wenkai Du
bf2339f93e
Merge remote-tracking branch 'nccl/master' into 2.10.3
2021-07-30 16:23:14 -07:00
Wenkai Du
818cdb16a8
Query XGMI links from xml and adjust gfx906 channel usage (#410)
2021-07-27 17:32:41 -07:00
Wenkai Du
135d47d125
topo_expl: fix build after switching to rocm-smi-lib (#405)
* topo_expl: fix build after switching to rocm-smi-lib
* Use minimal of 4 channels for gfx908
2021-07-27 08:30:08 -07:00
Wenkai Du
fa6d7e9a63
Fixes for NCCL_MAX_NCHANNELS and topo_expl (#398)
2021-06-22 08:41:49 -07:00
Wenkai Du
b815a2800f
Setup collectives threshold for enabling intranet (#387)
* Setup collectives threshold for enabling intranet
* Use separate operation counters for coll and p2p
2021-06-09 13:24:26 -07:00
Wenkai Du
a3a8c2d56b
Allow intranode use of network connection (#383)
* Allow intranode use of network connection
* Checking for graph for null pointer
2021-06-08 07:37:59 -07:00
Wenkai Du
961922ea02
Add option to enable multiple SAT in SHARP (#380)
* Add option to enable multiple SAT in SHARP
* Extend number of NICs to 16
2021-06-03 19:45:18 -07:00
Wenkai Du
13dc80ee14
topo_expl: update to 2.9.9
2021-05-26 09:24:34 -07:00
Wenkai Du
4c83adb75c
Update Rome models matching (#376)
2021-05-25 10:12:40 -07:00
Wenkai Du
a4ea1fed5b
Merge remote-tracking branch 'nccl/master' into develop
2021-05-05 16:01:01 -07:00
Wenkai Du
1fe031402a
Add gfx90a target (#344)
* Add gfx90a target
* Support gfx90a topology
Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>
2021-04-14 09:29:00 -06:00
Wenkai Du
e26ad2995e
Cleanup number of channels calculation (#340)
2021-04-05 17:51:56 -07:00
Wenkai Du
17491c918e
Fix incorrect net counting (#339)
* Fix incorrect net counting
* Add comments
2021-04-05 12:21:57 -07:00
Wenkai Du
1d2946ee4b
Rework network port trimming code (#338)
* Rework network port trimming code
* Move Rome related changes to separate source files
2021-03-31 10:25:59 -07:00
Wenkai Du
d87dc7c2e8
collnet: support multiple NICs (#335)
2021-03-25 20:59:32 -07:00
Wenkai Du
1d6244b18d
Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels
* topo_expl: enable collnet
2021-03-19 12:58:13 -07:00
Wenkai Du
8e180cf087
Revert "Port alltoall[v]" (#325)
This reverts commit f4d5d3d620.
2021-03-06 13:59:31 -08:00
Wenkai Du
c018edf0f2
Enable local sendrecv over network if GDR is available on all GPUs (#324)
2021-03-05 19:59:41 -08:00