Commit Graph

66 Commits

Author SHA1 Message Date
Wenkai Du a4ea1fed5b Merge remote-tracking branch 'nccl/master' into develop 2021-05-05 16:01:01 -07:00
Wenkai Du ed237dcaa7 Use better name for kernel collective trace enable (#357)
"NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL" enables collectives API
trace. Adding "RCCL_KERNEL_COLL_TRACE_ENABLE=1" enables kernel traces.
2021-04-26 08:35:53 -07:00
Wenkai Du 9cc9c3360b Control collective trace from kernel separately (#356) 2021-04-23 16:36:19 -07:00
Wenkai Du 415c7cd3d1 Tune number of channels for gfx90a (#349) 2021-04-19 15:27:01 -07:00
Sylvain Jeaugey a46ea10583 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.
2021-04-12 16:00:46 -07:00
TomSang 87f12cbb86 Add detection of cooperative multi device launch attribute (#345) 2021-04-11 13:29:24 -07:00
gilbertlee-amd caba0a63d2 Fixing clique-topology detection (#342)
* Fixing clique-topology detection
* Fix to enable multi-process clique-based kernels
2021-04-07 11:29:44 -06:00
Wenkai Du e26ad2995e Cleanup number of channels calculation (#340) 2021-04-05 17:51:56 -07:00
Wenkai Du 17491c918e Fix incorrect net counting (#339)
* Fix incorrect net counting

* Add comments
2021-04-05 12:21:57 -07:00
Wenkai Du d87dc7c2e8 collnet: support multiple NICs (#335) 2021-03-25 20:59:32 -07:00
Wenkai Du 1d6244b18d Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet
2021-03-19 12:58:13 -07:00
Wenkai Du f60b76c67a Add GPU memory usage tracker (#326) 2021-03-06 20:32:30 -08:00
Wenkai Du 8e180cf087 Revert "Port alltoall[v]" (#325)
This reverts commit f4d5d3d620.
2021-03-06 13:59:31 -08:00
Wenkai Du c018edf0f2 Enable local sendrecv over network if GDR is available on all GPUs (#324) 2021-03-05 19:59:41 -08:00
Wenkai Du c985358e11 Merge remote-tracking branch 'nccl/master' into 2.8.3 2021-02-15 18:44:47 -05:00
Sylvain Jeaugey 911d61f214 2.8.4-1
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.
2021-02-09 15:36:48 -08:00
gilbertlee-amd 3e62ceddc5 Clique kernel support (#295) (#15)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2021-01-28 09:45:01 -07:00
Wenkai Du 2ddbe6646b Improve collective trace 2021-01-14 19:28:01 -05:00
Wenkai Du f4d5d3d620 Port alltoall[v] 2021-01-14 19:28:01 -05:00
Wenkai Du d469947641 Merge remote-tracking branch 'nccl/master' into no-target-id 2021-01-14 19:27:53 -05:00
Sylvain Jeaugey 920dbe5b35 2.8.3-1
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using less GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.
2020-11-17 11:08:52 -08:00
Wenkai Du 2e8b3a0857 Use ncclSend/ncclRecv for alltoall type of collectives as default (#297) 2020-11-09 11:23:17 -08:00
Wenkai Du c835d8263a Merge remote-tracking branch 'nccl/master' into nccl_sync 2020-10-15 18:42:38 -04:00
gilbertlee-amd 84a2541e01 Revert "Initial support for clique-based kernels (#276)" (#280)
This reverts commit 2b8184808d.
2020-10-15 11:30:18 -07:00
Sylvain Jeaugey 0e14394c5f Fix affinity move 2020-10-13 16:58:05 -07:00
Sylvain Jeaugey c6dbdb0084 Make sure proxy threads inherit the CPU affinity. 2020-10-13 16:37:52 -07:00
gilbertlee-amd 2b8184808d Initial support for clique-based kernels (#276)
* Initial support for clique-based kernels
2020-10-13 11:22:04 -06:00
Wenkai Du ae008fd2db Rework Rome detection and add multiple network ports models (#274)
* Rework Rome detection and add multiple network ports models

* Remove unused opCount in p2p transport
2020-10-07 13:37:36 -07:00
Wenkai Du b871ea3c0c Add Alltoallv RCCL kernel implementation (#269)
* Add alltoallv API and implementation

* Extend Rome P2P channel limit to multinode and alltoall kernels

* topo_expl: fix compilation and sync up with main

* gtest: use RCCL alltoallv API

* Code review changes
2020-09-30 16:25:36 -07:00
Wenkai Du 44fcde7835 Ensure all ranks on same send/receive or alltoall kernel path (#271) 2020-09-24 08:25:04 -07:00
Wenkai Du d871fceb54 Change network plugin name to librccl-net.so (#266) 2020-09-18 13:23:30 -07:00
Wenkai Du 42955f5f4f Limit P2P channels on Rome 2020-09-17 17:20:32 -07:00
Wenkai Du c5cbece6d0 Increase minimal channels for gfx908 (#259) 2020-08-26 11:40:11 -07:00
Wenkai Du 7e3f841fab Merge remote-tracking branch 'nccl/master' into 2.7.8 2020-08-10 16:11:00 +00:00
David Addison 033d799524 2.7.8-1
Fix collective mismatch error when using ncclSend/ncclRecv
2020-07-27 16:34:09 -07:00
Wenkai Du ab787c767e Change default channels duplication for chordal ring (#233) 2020-07-14 15:16:50 -07:00
Wenkai Du 7484e53ff7 Rework network proxy profiling 2020-06-13 03:13:58 +00:00
Wenkai Du 812543104d Add network proxy profiling support 2020-06-09 17:44:15 -07:00
Wenkai Du c9aa11928a Calculate and use total wait cycles for RCCL profiling 2020-06-09 17:44:15 -07:00
Wenkai Du e80e29573c Add gather, scatter and alltoall collectives
Introducing 3 new APIs:
ncclResult_t  ncclGather(const void* sendbuff, void* recvbuff, size_t sendcount,
    ncclDataType_t datatype, int root, ncclComm_t comm, hipStream_t stream);
ncclResult_t  ncclScatter(const void* sendbuff, void* recvbuff,
    size_t recvcount, ncclDataType_t datatype, int root, ncclComm_t comm,
    hipStream_t stream);
ncclResult_t  ncclAllToAll(const void* sendbuff, void* recvbuff, size_t count,
    ncclDataType_t datatype, ncclComm_t comm, hipStream_t stream);

Only out of place operation is supported.
Preprocessor symbol RCCL_GATHER_SCATTER=1 indicates API availibility.
By default the APIs launche RCCL kernel implementation, which can be disabled by
RCCL_ALLTOALL_KERNEL_DISABLE=1. Then the APIs use wrapper around ncclSend and ncclRecv.
2020-06-09 17:44:08 -07:00
Wenkai Du 26a0fd2517 Merge remote-tracking branch 'nccl/master' into develop 2020-06-09 17:40:11 -07:00
Sylvain Jeaugey 5949d96f36 2.7.3-1
Add support for A100 GPU and related platforms.
Add support for CUDA 11.
Add support for send/receive operations (beta).
2020-06-08 09:31:44 -07:00
Wenkai Du e41ab173cf Report HIP version in logs 2020-05-20 18:15:32 +00:00
Wenkai Du fa36fd9ef9 Merge remote-tracking branch 'nccl/master' into v2.6.4_merge 2020-04-01 13:35:12 -07:00
Sylvain Jeaugey b221128eca 2.6.4-1
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.
2020-03-20 14:58:36 -07:00
Wenkai Du 8e73a2ad60 Merge remote-tracking branch 'remotes/nccl/master' 2020-02-27 12:53:03 -08:00
Stanley Tsang 20fa04d9b6 Updating copyright notices for 2020. 2020-01-29 15:28:08 -08:00
Wenkai Du 1e55645d97 Misc fixes and improvements for 2.5.6
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Remove explicit channel duplication with implicit by using half of link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
Christian Sigg 3899f6e0f2 Fix clang build (#274)
The attribute is called `optnone`, not `noopt`.
2019-12-09 09:31:13 -08:00
Sylvain Jeaugey aa15dfb29c Fix clang compilation 2019-12-06 09:55:54 -08:00