This PR tunes the number of threadblocks used for larger (>1MB) message sizes. [ROCm/rccl commit: fdf75fd2c1]