e1c20e7f24
To get the improved performance for TP=4, set
RCCL_MSCCL_FORCE_ENABLE=1 and MSCCLPP_READ_ALLRED=1. For TP=8, set
MSCCLPP_HIERARCHICAL_ALLRED=1 instead.
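
As a rough sketch of the settings above, the following Python helper maps a TP size to the environment variables named in this message. The variable names and values come from the message itself; the helper names (`allreduce_env`, `launch_env`) are hypothetical and not part of RCCL or MSCCL++, and returning nothing for other TP sizes is an assumption, since the message only covers TP=4 and TP=8.

```python
import os

# Variable names and values are taken from the commit message;
# this table and the helpers below are illustrative only.
_ALLREDUCE_ENV = {
    4: {"RCCL_MSCCL_FORCE_ENABLE": "1", "MSCCLPP_READ_ALLRED": "1"},
    8: {"MSCCLPP_HIERARCHICAL_ALLRED": "1"},
}


def allreduce_env(tp_size: int) -> dict:
    """Return the extra environment variables for a given TP size."""
    # Assumption: TP sizes the message does not mention get no extra variables.
    return dict(_ALLREDUCE_ENV.get(tp_size, {}))


def launch_env(tp_size: int) -> dict:
    """Copy the current environment and add the variables for tp_size."""
    env = os.environ.copy()
    env.update(allreduce_env(tp_size))
    return env
```

The result of `launch_env(4)` could be passed as the `env` argument of `subprocess.run` when starting a TP=4 job, which keeps the settings scoped to that process rather than the whole shell.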
[ROCm/rccl commit: 0fb3b5eba9]
json @ 9cca280a4d
mscclpp @ 1e82dd444f