gilbertlee-amd
caba0a63d2
Fixing clique-topology detection (#342)
...
* Fixing clique-topology detection
* Fix to enable multi-process clique-based kernels
2021-04-07 11:29:44 -06:00
Wenkai Du
e26ad2995e
Cleanup number of channels calculation (#340)
2021-04-05 17:51:56 -07:00
Wenkai Du
17491c918e
Fix incorrect net counting (#339)
...
* Fix incorrect net counting
* Add comments
2021-04-05 12:21:57 -07:00
Wenkai Du
1d2946ee4b
Rework network port trimming code (#338)
...
* Rework network port trimming code
* Move Rome related changes to separate source files
2021-03-31 10:25:59 -07:00
Wenkai Du
0c78553ee0
Check fine grained memory on peer GPU before enabling P2P (#337)
2021-03-30 09:06:39 -07:00
Wenkai Du
d87dc7c2e8
collnet: support multiple NICs (#335)
2021-03-25 20:59:32 -07:00
Stanley Tsang
289db2a636
Fixing message queue leak. (#331)
2021-03-25 19:11:43 -06:00
Wenkai Du
0fbb9510a5
Remove HDP workaround for ROCm 4.2 HIP (#334)
2021-03-23 20:11:37 -07:00
Wenkai Du
1d6244b18d
Enable collnet in RCCL (#333)
...
* Enable CollNet and use different number of channels
* topo_expl: enable collnet
2021-03-19 12:58:13 -07:00
Wenkai Du
b46260260a
Sort GPUs by HIP device ID (#329)
...
* Sort GPUs by HIP device ID
* Remove extra space
2021-03-16 16:51:32 -07:00
Wenkai Du
f60b76c67a
Add GPU memory usage tracker (#326)
2021-03-06 20:32:30 -08:00
Wenkai Du
8e180cf087
Revert "Port alltoall[v]" (#325)
...
This reverts commit f4d5d3d620.
2021-03-06 13:59:31 -08:00
Wenkai Du
c018edf0f2
Enable local sendrecv over network if GDR is available on all GPUs (#324)
2021-03-05 19:59:41 -08:00
gilbertlee-amd
f4a9b9acba
Adding pthread_join / pthread_detach to clean up pthreads to avoid leaks (#322)
2021-02-26 16:29:55 -07:00
Wenkai Du
e820a943e9
Update tuning parameters for XGMI and NET
2021-02-23 21:41:26 +00:00
Wenkai Du
ec8d89b1dd
Match NBIO only when GPUs and NICs are directly connected to CPU
2021-02-22 18:52:29 -05:00
Stanley Tsang
45f5255f7c
Fixing cache deletion for CliqueManager; updating copyright
2021-02-19 22:22:46 +00:00
Wenkai Du
95f178324c
Add support to another Rome model
2021-02-18 02:00:31 +00:00
Wenkai Du
c985358e11
Merge remote-tracking branch 'nccl/master' into 2.8.3
2021-02-15 18:44:47 -05:00
Wenkai Du
bf8eb40705
Move HDP flush to CPU
2021-02-12 18:06:19 +00:00
Sylvain Jeaugey
911d61f214
2.8.4-1
...
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.
2021-02-09 15:36:48 -08:00
Wenkai Du
9cc3b56166
Fix GDRDMA read and remove unused files
2021-02-09 01:34:39 +00:00
Stanley Tsang
d00b7d17bd
Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16)
...
* Fixing temp file creation/deletion for Clique kernel mode.
* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs
* GroupCall MP UT properly quits when too many devices specified
* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
2021-02-05 16:49:25 -08:00
Wenkai Du
ab1e7a0318
Merge remote-tracking branch 'origin/develop' into 2.8.3
2021-02-04 20:02:34 -05:00
gilbertlee-amd
1990ffd76a
Tuning some clique-based kernel parameters (#315)
2021-02-03 20:00:08 -07:00
Wenkai Du
5f97122442
Enable GPU direct RDMA read from GPU
2021-02-03 02:48:30 +00:00
gilbertlee-amd
3e62ceddc5
Clique kernel support (#295) (#15)
...
* Adding experimental clique-based kernels (opt-in only)
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2021-01-28 09:45:01 -07:00
Wenkai Du
41e47a36e7
Use less unroll for clique kernels (#313)
2021-01-15 17:48:10 -08:00
Wenkai Du
2ddbe6646b
Improve collective trace
2021-01-14 19:28:01 -05:00
Wenkai Du
f4d5d3d620
Port alltoall[v]
2021-01-14 19:28:01 -05:00
Wenkai Du
105db19a11
Do not allow GPU as intermediate
2021-01-14 19:28:01 -05:00
Wenkai Du
e055229e56
Revert "Changes to topology based on XGMI (#272)"
...
This reverts commit 01bd2573db.
2021-01-14 19:28:01 -05:00
Wenkai Du
d469947641
Merge remote-tracking branch 'nccl/master' into no-target-id
2021-01-14 19:27:53 -05:00
Jonas Zhou
3996562690
x86: Add CPU detection for Zhaoxin processors
...
Signed-off-by: Jonas Zhou <JonasZhou@zhaoxin.com>
2020-12-17 11:15:18 -08:00
Wenkai Du
373a108516
Fix Rome PCIe 2 node topology generation (#310)
2020-12-15 17:16:17 -08:00
Wenkai Du
975b14dffa
Add Rome model and improve search (#305)
2020-11-17 14:55:06 -08:00
Sylvain Jeaugey
920dbe5b35
2.8.3-1
...
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using less GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.
2020-11-17 11:08:52 -08:00
Wenkai Du
554729079d
Use device's link width and speed if port doesn't report (#304)
2020-11-13 17:58:04 -08:00
Stanley Tsang
2958f7eace
Fixing IPC handle leak (#302)
2020-11-13 10:32:42 -07:00
gilbertlee-amd
c8d08a7c2f
Adding RCCL_CLIQUE_DEBUG to help debug experimental clique feature (#300)
2020-11-13 09:07:11 -07:00
Wenkai Du
4e68229c8b
Skip unused peer connection in scatter and gather (#301)
2020-11-12 15:47:34 -08:00
gilbertlee-amd
41bcfb8878
Clique kernel support (#295)
...
* Adding experimental clique-based kernels (opt-in only)
Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2020-11-10 15:44:10 -07:00
Wenkai Du
2e8b3a0857
Use ncclSend/ncclRecv for alltoall type of collectives as default (#297)
2020-11-09 11:23:17 -08:00
Wenkai Du
709b7e4880
Improve GPU direct RDMA handling on Rome (#294)
2020-11-03 14:29:08 -08:00
Wenkai Du
dfa3c41ede
Add more Rome models (#292)
2020-10-30 21:26:04 -07:00
xietingwew
084207e685
fix proxyArgs for trace log
2020-10-21 09:18:40 -07:00
Wenkai Du
dcad0ef7cb
Fix incorrect pointer checking for scatter and gather (#285)
2020-10-19 13:27:09 -07:00
Wenkai Du
c835d8263a
Merge remote-tracking branch 'nccl/master' into nccl_sync
2020-10-15 18:42:38 -04:00
gilbertlee-amd
84a2541e01
Revert "Initial support for clique-based kernels (#276)" (#280)
...
This reverts commit 2b8184808d.
2020-10-15 11:30:18 -07:00
Sylvain Jeaugey
0e14394c5f
Fix affinity move
2020-10-13 16:58:05 -07:00