* Make sure the target device is used for MSCCL
* Enable single process mode by default to use MSCCL in MT
* Create a per-rank state when GPUs share a thread
* Moved call to `mscclpp_ncclGetUniqueId` into `ncclCommInitRankFunc` to avoid setting up transport early in environments where MSCCL++ isn't valid.
* Checking `mscclEnabled` for the process and the topology to gate MSCCL++.
* Allowed `mscclForceEnable` to enable MSCCL++.
* Added restrictions around calling MSCCL++ collectives (#1281)
* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.
* Renamed and refactored some mscclpp types.
* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.
* Disable MSCCL++ when using managed memory buffers as it isn't supported.
* Added datatype and op constraints for MSCCL++ AllReduce.
* Added documentation on MSCCL++ restrictions to the README.
* [BUILD] Support custom CMake flags in MSCCLPP (#1275)
* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] CMake flags to support build-id in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] Fix CMake warnings in MSCCLPP build
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
* Link to libmscclpp_nccl statically (#1282)
* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.
* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.
* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.
* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)
* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)
* Include mscclpp as a git submodule (#1314)
* Added the desired mscclpp commit as a git submodule.
* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.
* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.
* Enabled MSCCL++ feature build.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
* Unit Tests for RCCL in CPX mode
* override pow2gpus set by cpx mode by user argument
* Adding comment for UT_POW2_GPUS
* Additional comment on why using pow2gpus for cpx mode.
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.
* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.
* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistant net names
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
remove the MultiRank examples, the features was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In additional, the locdation in tools was suboptimal.