Improve support for DirectNIC (CX8) * Add support for XDR speed detection. * When DirectNIC is enabled, report only the RDMA interfaces. Extend the P2C (PXN over C2C) support to send/receive operations. Support compilation with GCC 14 (Issues #1743, #1751). Fix the unloading of network plugins that also provide tuner capability. Fix the change of the current device across the calls to ncclCommDestroy() and ncclCommAbort(). A note for users on MNNVL systems: please ensure an adequate stack size for NCCL threads. While the default Linux stack size limit of 8192 KB is known to be sufficient, we've seen crashes if the limit is changed to "unlimited", as it causes the glibc library to unexpectedly *decrease* the stack size of NCCL's background threads to just 2048 KB. Use "ulimit -s" in bash to print the current limit; if needed, reset it to 8192 KB using "ulimit -s 8192" (one also needs to ensure that the new setting is propagated to other nodes when launching a multi-node NCCL job).
NCCL
Optimized primitives for inter-GPU communication.
Introduction
NCCL (pronounced "Nickel") is a stand-alone library of standard communication routines for GPUs, implementing all-reduce, all-gather, reduce, broadcast, reduce-scatter, as well as any send/receive based communication pattern. It has been optimized to achieve high bandwidth on platforms using PCIe, NVLink, NVswitch, as well as networking using InfiniBand Verbs or TCP/IP sockets. NCCL supports an arbitrary number of GPUs installed in a single node or across multiple nodes, and can be used in either single- or multi-process (e.g., MPI) applications.
For more information on NCCL usage, please refer to the NCCL documentation.
Build
Note: the official and tested builds of NCCL can be downloaded from: https://developer.nvidia.com/nccl. You can skip the following build steps if you choose to use the official builds.
To build the library :
$ cd nccl
$ make -j src.build
If CUDA is not installed in the default /usr/local/cuda path, you can define the CUDA path with :
$ make src.build CUDA_HOME=<path to cuda install>
NCCL will be compiled and installed in build/ unless BUILDDIR is set.
By default, NCCL is compiled for all supported architectures. To accelerate the compilation and reduce the binary size, consider redefining NVCC_GENCODE (defined in makefiles/common.mk) to only include the architecture of the target platform :
$ make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"
Install
To install NCCL on the system, create a package then install it as root.
Debian/Ubuntu :
$ # Install tools to create debian packages
$ sudo apt install build-essential devscripts debhelper fakeroot
$ # Build NCCL deb package
$ make pkg.debian.build
$ ls build/pkg/deb/
RedHat/CentOS :
$ # Install tools to create rpm packages
$ sudo yum install rpm-build rpmdevtools
$ # Build NCCL rpm package
$ make pkg.redhat.build
$ ls build/pkg/rpm/
OS-agnostic tarball :
$ make pkg.txz.build
$ ls build/pkg/txz/
Tests
Tests for NCCL are maintained separately at https://github.com/nvidia/nccl-tests.
$ git clone https://github.com/NVIDIA/nccl-tests.git
$ cd nccl-tests
$ make
$ ./build/all_reduce_perf -b 8 -e 256M -f 2 -g <ngpus>
Copyright
All source code and accompanying documentation is copyright (c) 2015-2020, NVIDIA CORPORATION. All rights reserved.