* Fixing temp file creation/deletion for Clique kernel mode.
* Refactoring of MP unit tests; includes bug fixes and general support for any number of GPUs
* GroupCall MP UT properly quits when too many devices are specified
* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
* gtest: add scatter to combined calls and use loops
* gtest: run validation inside loop
* gtest: revert small element count to 2520
* gtest: fix memory leak in validation
(cherry picked from commit b0853ccd51)
* Fix combined call UT
* Fix memory leak
* Fix alltoallv test
* Adding CPU-based execution, fixing typos, adding fine-grained memory
* Exposing sampling factor when generating range of data sizes
* Refactoring how Links are launched, now once per thread
* Documentation updates
* Changing default timing mechanism, adjusting CPU bandwidth calc, adding flag to use combined timing
* Adding support for smaller transfers (byte size must be multiple of 4 instead of 128)
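
As an illustration of the NCCL_COMM_ID item above, here is a minimal sketch of setting the variable only when the user has not already exported it. This is not the actual unit-test code: the helper name EnsureNcclCommId and the fallback address 127.0.0.1:29500 are hypothetical. It relies only on POSIX setenv(), whose third argument controls whether an existing value is overwritten.

```cpp
#include <stdlib.h>   // setenv, getenv (POSIX)
#include <cassert>
#include <string>

// Hypothetical helper (an assumption, not taken from the test sources):
// sets NCCL_COMM_ID to `fallback` only if it is not already set, and
// returns the value in effect afterwards.
inline std::string EnsureNcclCommId(const std::string& fallback = "127.0.0.1:29500")
{
    // The fallback must be a non-empty "<ip>:<port>" string.
    assert(!fallback.empty());

    // overwrite == 0 leaves a value exported by the user untouched.
    setenv("NCCL_COMM_ID", fallback.c_str(), 0);

    const char* value = getenv("NCCL_COMM_ID");
    return value ? std::string(value) : std::string();
}
```

Calling the helper once, before any process creates a communicator, lets every process inherit the same value while still honoring an explicit setting from the environment.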
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379: topology injection failing when using fewer GPUs than
described in the XML.
Fix #394: protocol mismatch causing hangs or crashes when using
one GPU per node.