* Select sendrecv path based on collective data size
* Add comments on packing and unpacking group field
* Toggling RCCL_P2P_NET_DISABLE in combined calls unit tests
[ROCm/rccl commit: 6dcae8a459]
* Re-enabling MP unit tests
* Fixing shared memory leak and other shared-memory bugs in MP unit tests
* Revert 43bfbfc97bf9edbae1f386d461439091618ff8ed
* Further tightening up unlinks
* Moving test check macros to separate header file
* Tightening up shared memory unlinking for clique kernels; adding munmap for host barrier in MP unit tests
* Updating new MP unit test
* Fixing mqueue bug
* Fixing memory leak in MP unit tests
[ROCm/rccl commit: 0b2bfdd6d8]
Fix crash when setting NCCL_MAX_P2P_NCHANNELS below nchannels.
Fix hang during sendrecv dynamic NVB connection establishment on
cubemesh topologies.
Add environment variable to only use SHARP on communicators beyond
a given number of ranks.
Add debug subsystem to trace memory allocations.
Fix compilation with TRACE=1. (Issue #505)
[ROCm/rccl commit: 3fec2fa5ee]
* 2.9.6-1
Add support for CUDA graphs.
Fuse BCM Gen4 switches to avoid suboptimal performance on some platforms. Issue #439.
Fix bootstrap issue caused by connection reordering.
Fix CPU locking block.
Improve CollNet algorithm.
Improve performance on DGX A100 for communicators with only one GPU per node.
* Clique tuning upgrade (#352) (#19)
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
Co-authored-by: Sylvain Jeaugey <sjeaugey@nvidia.com>
Co-authored-by: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
[ROCm/rccl commit: 6021329af0]
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
[ROCm/rccl commit: e796b1645c]
Fix memory leaks.
Fix crash in bootstrap error case.
Fix CollNet clean-up issue.
Make PCI switch vendor/device optional for XML injection.
Add support for nvidia-peermem module.
[ROCm/rccl commit: ca8485b0d0]
* Enabling clique for any XGMI-connected topology, adding tuning
* Updating CHANGELOG for clique tuning
* Re-working clique barrier system to work on multi-process / multi-gpu
[ROCm/rccl commit: 9d7232c091]
* Fixing message queue leak.
* Using POSIX implementation of Message Queues
* Adding unlink to msgqueue
* MsgQueue update
* Adding timeout check to msgqueue broadcast; tightening up system checks
* Removing unnecessary code
* Removing extra argument from print
* Adding explicit msg queue close call to all other ranks
[ROCm/rccl commit: 70597789d0]