2.8.3-1

Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one-hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending to or
receiving from a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using fewer GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.
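
The notes above do not show NCCL's kernel code. As a minimal sketch, assuming a classic shift (pairwise) schedule rather than NCCL's actual one, the plain host C below illustrates two of the listed ideas: ordering alltoall peers so that, for each offset k, rank r sends to (r + k) mod n and receives from (r - k) mod n, and spreading those offsets across 8 warps (the "8x" figure above) so that each warp serves one peer. All names here (WARPS_PER_ROUND, send_peer, recv_peer, num_rounds, warp_offset) are hypothetical, not NCCL APIs, and this is not a CUDA kernel.

```c
#include <assert.h>

/* Hypothetical: number of warps that each serve one peer at the same time,
 * matching the "8x" send/recv parallelism mentioned in the notes. */
#define WARPS_PER_ROUND 8

/* Shift schedule: for offset k (1 <= k < nranks), rank r sends to r + k. */
static int send_peer(int rank, int nranks, int offset) {
  assert(offset >= 1 && offset < nranks);
  return (rank + offset) % nranks;
}

/* Matching receive: rank r receives from r - k, the rank that sends to it. */
static int recv_peer(int rank, int nranks, int offset) {
  assert(offset >= 1 && offset < nranks);
  return (rank - offset + nranks) % nranks;
}

/* Rounds needed so that each of the nranks - 1 peers gets a warp once. */
static int num_rounds(int nranks) {
  return (nranks - 1 + WARPS_PER_ROUND - 1) / WARPS_PER_ROUND;
}

/* Offset served by `warp` in `round`, or 0 if that warp is idle this round. */
static int warp_offset(int round, int warp, int nranks) {
  int offset = round * WARPS_PER_ROUND + warp + 1;
  return offset < nranks ? offset : 0;
}
```

For example, with 10 ranks, rank 3 needs two rounds: in round 0, warps 0 to 7 take offsets 1 to 8 (sending to ranks 4 to 9, then 0 and 1), and in round 1 warp 0 takes offset 9 (sending to rank 2, receiving from rank 4) while the other warps idle. The sketch ignores node boundaries, so it does not capture the intra/inter-node balancing the notes describe.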

Sylvain Jeaugey

2020-09-04 14:35:05 -07:00

Parent 084207e685

Commit 920dbe5b35

90 changed files with 11172 additions and 3209 deletions

									
makefiles/version.mk (+2, -2)
    @@ -1,6 +1,6 @@
     ##### version
     NCCL_MAJOR   := 2
    -NCCL_MINOR   := 7
    -NCCL_PATCH   := 8
    +NCCL_MINOR   := 8
    +NCCL_PATCH   := 3
     NCCL_SUFFIX  :=
     PKG_REVISION := 1
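
This change bumps the release from 2.7.8 to 2.8.3. Code that must compare releases usually works with a single integer version code. As a hedged sketch, assuming the encoding that the NCCL_VERSION(X,Y,Z) macro in nccl.h used for 2.x releases up to 2.8 (major * 1000 + minor * 100 + patch; later headers changed it, so check the nccl.h you build against), the helper below turns these makefile fields into that integer. nccl_version_code is a hypothetical name, not an NCCL function.

```c
#include <assert.h>

/* Hypothetical helper: encodes major.minor.patch the way NCCL 2.x headers
 * up to 2.8 are assumed to: major * 1000 + minor * 100 + patch. */
static int nccl_version_code(int major, int minor, int patch) {
  /* The encoding is only unambiguous for single-digit minors and patch < 100. */
  assert(minor >= 0 && minor <= 9);
  assert(patch >= 0 && patch <= 99);
  return major * 1000 + minor * 100 + patch;
}
```

With the values from this diff, the old release encodes to 2708 and the new one to 2803. At run time, ncclGetVersion(int *version) reports the code of the loaded library, which can be compared against such a value.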