* Add 1H16P GPU model
* Implement NIC identification and remapping
* Revert "Sort IB devices based on device name (#413)"
This reverts commit 2d0ed8dff6.
* Fix permute and check order
* Correct IB speed reporting
* Revert "Allow user to link layer with RCCL_IB_HCA_SKIP_LINK_LAYER (#361)"
This reverts commit caf5c9992a.
Add support for bfloat16.
Add ncclAvg reduction operation.
Improve performance for aggregated operations.
Improve performance for the tree algorithm.
Improve network error reporting.
Add NCCL_NET parameter to force a specific network.
Add NCCL_IB_QPS_PER_CONNECTION parameter to split IB traffic onto multiple queue pairs.
Fix topology detection error in WSL2.
Fix proxy memory elements affinity (improve alltoall performance).
Fix graph search on cubemesh topologies.
Fix hang in cubemesh during NVB connections.
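
The two parameters added in the notes above (NCCL_NET and NCCL_IB_QPS_PER_CONNECTION) are read from the process environment, so they would normally be exported in the launch script. As a minimal sketch, assuming the application wants to set them itself before creating a communicator, a helper along these lines could be used. The function name `configure_nccl_network` and its validation rules are hypothetical and not part of NCCL or RCCL; "IB" is used here as the network name on the assumption that the built-in InfiniBand transport is wanted.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not an NCCL/RCCL API): export the two parameters
 * into the process environment. It must run before the communicator is
 * created so that the library sees the values during initialization. */
static int configure_nccl_network(const char *net_name, int qps_per_connection)
{
    char qps[16];

    /* Reject an empty network name or a non-positive queue pair count. */
    if (net_name == NULL || strlen(net_name) == 0 || qps_per_connection < 1)
        return -1;

    snprintf(qps, sizeof qps, "%d", qps_per_connection);

    /* Force a specific network implementation. */
    if (setenv("NCCL_NET", net_name, 1) != 0)
        return -1;

    /* Split IB traffic for each connection across several queue pairs. */
    if (setenv("NCCL_IB_QPS_PER_CONNECTION", qps, 1) != 0)
        return -1;

    return 0;
}
```

Calling `configure_nccl_network("IB", 4)` before communicator creation would correspond to launching with `NCCL_NET=IB NCCL_IB_QPS_PER_CONNECTION=4` in the environment; the value 4 is only illustrative.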
* Select sendrecv path based on collective data size
* Add comments on packing and unpacking group field
* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests
* Re-enabling mp unit tests
* Fixing shared memory leak and other bugs related to shared mem for MP unit tests
* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed
* Further tightening up unlinks
* Moving test check macros to separate header file
* Tightening up shared memory unlinking for clique kernels, add munmap for host barrier for MP unit tests
* Updating new MP unit test
* Fixing mqueue bug
* Fixing memory leak in MP unit tests
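
The shared memory fixes listed above come down to pairing every mapping with a `munmap` and every named segment with an unlink. The sketch below illustrates that pairing with the standard POSIX calls. It is not the code from these commits: the helper names `shm_segment_create` and `shm_segment_destroy` are hypothetical, and the real unit tests may structure their cleanup differently.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: create a named POSIX shared memory segment of
 * `size` bytes and map it. Returns the mapping, or NULL on failure.
 * On every failure path the name is unlinked so nothing is leaked. */
static void *shm_segment_create(const char *name, size_t size)
{
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        shm_unlink(name);
        return NULL;
    }

    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (ptr == MAP_FAILED) {
        shm_unlink(name);
        return NULL;
    }
    return ptr;
}

/* Hypothetical helper: release both the mapping and the name.
 * Both steps are attempted even if the first one fails. */
static int shm_segment_destroy(const char *name, void *ptr, size_t size)
{
    int rc = 0;
    if (munmap(ptr, size) != 0)
        rc = -1;
    if (shm_unlink(name) != 0)
        rc = -1;
    return rc;
}
```

In a multi-process test, each rank would call `munmap` on its own mapping, while only one rank needs to unlink the name; the segment is freed once the name is gone and the last mapping is released.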
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
Fix memory leaks.
Fix crash in bootstrap error case.
Fix Collnet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.
* Fixing message queue leak.
* Using POSIX implementation of Message Queues
* Adding unlink to msgqueue
* MsgQueue update
* Adding timeout check to msgqueue broadcast; tightening up system checks
* Removing unnecessary code
* Removing extra argument from print
* Adding explicit msg queue close call to all other ranks
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.