* Select sendrecv path based on collective data size
* Add comments on packing and unpacking group field
* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests
* Re-enabling mp unit tests
* Fixing shared memory leak and other bugs related to shared mem for MP unit tests
* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed
* Further tightening up unlinks
* Moving test check macros to separate header file
* Tightening up shared memory unlinking for clique kernels, add munmap for host barrier for MP unit tests
* Updating new MP unit test
* Fixing mqueue bug
* Fixing memory leak in MP unit tests
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)
* 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. (Issue #439)
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.
* Clique tuning upgrade (#352) (#19)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
Co-authored-by: Sylvain Jeaugey <sjeaugey@nvidia.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Fix memory leaks.
Fix crash in bootstrap error case.
Fix CollNet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.
* Fixing message queue leak.
* Using POSIX implementation of Message Queues
* Adding unlink to msgqueue
* MsgQueue update
* Adding timeout check to msgqueue broadcast; tightening up system checks
* Removing unnecessary code
* Removing extra argument from print
* Adding explicit msg queue close call to all other ranks