* Make sure the target device is used for MSCCL
* Enable single process mode by default to use MSCCL in MT
* Create a per-rank state when GPUs share a thread

[ROCm/rccl commit: 03a3ef3c34]