* Added restrictions around calling MSCCL++ collectives (#1281)
* Added restriction to non-zero 32-byte multiple message sizes to MSCCL++ AllGather.
* Renamed and refactored some mscclpp types.
* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.
* Disable MSCCL++ when using managed memory buffers as it isn't supported.
* Added datatype and op constraints for MSCCL++ AllReduce.
* Added documentation on MSCCL++ restrictions to the README.
* [BUILD] Support custom CMake flags in MSCCLPP (#1275)
* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] CMake flags to support build-id in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] Fix CMake warnings in MSCCLPP build
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
* Link to libmscclpp_nccl statically (#1282)
* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.
* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.
* `nm` isn't always available in docker containers due to being stripped down. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.
* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)
* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)
* Include mscclpp as a git submodule (#1314)
* Added the desired mscclpp commit as a git submodule.
* Added step to automatically checkout the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.
* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.
* Enabled MSCCL++ feature build.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
[ROCm/rccl commit: 736a705875]
* Unit Tests for RCCL in CPX mode
* override pow2gpus set by cpx mode by user argument
* Adding comment for UT_POW2_GPUS
* Additional comment on why using pow2gpus for cpx mode.
[ROCm/rccl commit: 289a80c4e9]
* Added `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime flag to return to the original MSCCL enablement behaviour except when explicitly enabling for multi-thread.
* Added documentation for the new `RCCL_MSCCL_ENABLE_SINGLE_PROCESS` runtime env var.
[ROCm/rccl commit: e056fe8f7e]
* Add another rome model and override
* Fix bug
* Fix typo
* Add ring
* Update ring
* Fix model matching
* Clean up
* Clean up
* Reverse rings for NCCL_RINGS input
* Only reverse NCCL_RINGS for ring graph
* Fix mapping issue when using NCCL_RINGS
* Add NCCL_RINGS_REMAP to handle inconsistant net names
[ROCm/rccl commit: 532b70afb6]
* adding all nccl apis to api_support to enable rccl tracing by rocprofv3
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Co-authored-by: Jonathan R. Madsen <jonathanrmadsen@gmail.com>
[ROCm/rccl commit: db840f024e]
remove the MultiRank examples, the features was never released (because
it didn't work reliably), and it might just cause confusion if somebody
sees it. In additional, the locdation in tools was suboptimal.
[ROCm/rccl commit: 8953a26bcd]
* Restore number of parallel linking jobs
* Dynamically adjust number of linker jobs with limit of 16 jobs max
* Fix typo
* Add cgroup v1 support
[ROCm/rccl commit: ca5341d419]
* Adding performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
* Performance collection feature in rccl_replayer, and updating MSCCL logging and replayer parsing
[ROCm/rccl commit: cf311b71ee]
* adding rocprof parser script
* adding the support for multiple json files
* adding pytorch profiler script
* remove filtering from pytorch log
* adding the addressing the comments and add the feature to parse all kernels
* completing the report for torch profiler
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
[ROCm/rccl commit: 519843d2cf]
* Template unroll for RCCL kernels
* Adding unroll template arg during CMake hipification
* Reduce linking parallel jobs to avoid OOM in CI
* Workaround issues with UT tests
SWDEV-469533: register spill fix is needed for mainline build
LWPCOMMLIBS-369: cannot enable 112 channels with 80 CUs
Use -parallel-jobs=8 for linking
* CI: do not use -j 16 when building
* CI: use -j 8 when building
* Only reduce parallel linking job for CI extended
* Restore original jenkins command. Change parallel linking jobs in cmake
* Disable MSCCLPP
---------
Co-authored-by: gilbertlee-amd <gilbert.lee@amd.com>
[ROCm/rccl commit: 89349f2ce4]
* Revert "[GRAPH] Use channel shuffling only for IB systems (#1228)"
This reverts commit 8a7dd0e590.
* Revert "Revert "Changing channel stride for MI300X multinode (#1196)" (#1224)"
This reverts commit bb6fab3d8e.
[ROCm/rccl commit: a1ef217b32]