0fb3b5eba9
To get the improved performance with tensor parallelism of 4 (TP=4), set RCCL_MSCCL_FORCE_ENABLE=1 and MSCCLPP_READ_ALLRED=1 in the environment. For TP=8, set MSCCLPP_HIERARCHICAL_ALLRED=1.
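As a minimal sketch of how these settings might be applied from a launcher script, the helper below builds an environment for a given TP size. The variable names and values are the ones listed above; the helper name `allreduce_env`, the `ALLREDUCE_ENV_BY_TP` mapping, and the choice to leave other TP sizes unchanged are illustrative assumptions, not part of RCCL or MSCCL++.

```python
import os

# Environment variables per TP size, as listed in the note above.
# The mapping and helper are hypothetical conveniences, not a library API.
ALLREDUCE_ENV_BY_TP = {
    4: {"RCCL_MSCCL_FORCE_ENABLE": "1", "MSCCLPP_READ_ALLRED": "1"},
    8: {"MSCCLPP_HIERARCHICAL_ALLRED": "1"},
}


def allreduce_env(tp_size, base_env=None):
    """Return a copy of base_env (default: os.environ) with the TP-specific variables set.

    TP sizes not covered by the note are returned unchanged (an assumption).
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(ALLREDUCE_ENV_BY_TP.get(tp_size, {}))
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run` when starting the workload, so the variables are in place before the communication library initializes.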