* Make sure the target device is used for MSCCL
* Enable single-process mode by default so MSCCL can be used in multi-threaded (MT) configurations
* Create a per-rank state when GPUs share a thread
[ROCm/rccl commit: 03a3ef3c34]
* Add all NCCL APIs to api_support to enable RCCL tracing by rocprofv3
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
[ROCm/rccl commit: db840f024e]
MSCCL can now run in a multi-threaded configuration. To exercise this in the unit tests, this change adds the ENABLE_OPENMP compile definition and the --openmp-test-enable flag to the unit test build script. To activate the mode, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Jenkins is now set to use this mode.
[ROCm/rccl commit: 0c36d571ea]
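As a rough sketch of how the multi-threaded unit-test mode described above might be switched on from a shell: the two environment variables are the ones named in the commit message, while RCCL_UT_BIN is a hypothetical placeholder for the path to a unit-test executable built with --openmp-test-enable. The commit message does not give that path, so it is left for the reader to supply.

```shell
# Variables named in the commit message; both are set to 1 to activate
# the multi-threaded unit-test mode.
export UT_MULTITHREADED=1
export UT_PROCESS_MASK=1

# Assumption: RCCL_UT_BIN is a hypothetical placeholder, not something the
# commit defines. Point it at the unit-test executable from your own build.
if [ -n "${RCCL_UT_BIN:-}" ] && [ -x "${RCCL_UT_BIN}" ]; then
    # Run the tests with the two variables inherited from this shell.
    "${RCCL_UT_BIN}"
else
    # No executable supplied: just report the settings that would be used.
    echo "UT_MULTITHREADED=${UT_MULTITHREADED} UT_PROCESS_MASK=${UT_PROCESS_MASK}"
fi
```

Exporting the variables, rather than prefixing a single command with them, keeps the setting in effect for every test invocation in the same shell session.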