* Refactor RCCL install guide into several pages
* Changes from code review and new docker guide
* Add missing entries to ToC
* Minor fixes
* Fix help strings
* Edits after review and remove extra white space
* Added restrictions around calling MSCCL++ collectives (#1281)
* Restricted MSCCL++ AllGather to message sizes that are non-zero multiples of 32 bytes.
* Renamed and refactored some mscclpp types.
* Only transmit the MSCCL++ unique id for non-split comm init. For splitting comm, it has already been transmitted. Instead, save the MSCCL++ communicator in child communicators when calling `ncclCommSplit`. Only destroy MSCCL++ communicators when no RCCL communicators remain that use it. Also improved trace logging.
* Disable MSCCL++ when using managed memory buffers as it isn't supported.
* Added datatype and op constraints for MSCCL++ AllReduce.
* Added documentation on MSCCL++ restrictions to the README.
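The AllGather size restriction above can be illustrated with a small shell check. This is a minimal sketch, not RCCL code: the function name `is_valid_mscclpp_allgather_size` is hypothetical, and it assumes the restriction applies to the message size expressed in bytes.

```shell
# Hypothetical helper (not part of RCCL): succeeds when a message size in bytes
# is a non-zero multiple of 32, the restriction described for MSCCL++ AllGather.
is_valid_mscclpp_allgather_size() {
    size_bytes="$1"
    [ "$size_bytes" -gt 0 ] && [ $((size_bytes % 32)) -eq 0 ]
}

# Try a few sample sizes and report which ones satisfy the restriction.
for size in 0 32 48 64; do
    if is_valid_mscclpp_allgather_size "$size"; then
        echo "$size bytes: eligible"
    else
        echo "$size bytes: not eligible"
    fi
done
```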
* [BUILD] Support custom CMake flags in MSCCLPP (#1275)
* [BUILD] Support custom CMAKE_PREFIX_PATH in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] CMake flags to support build-id in MSCCLPP
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* [BUILD] Fix CMake warnings in MSCCLPP build
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
* Wrapped all cmake arguments passed to mscclpp to remove empty arguments and properly format them.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Corey Derochie <corey.derochie@amd.com>
* Link to libmscclpp_nccl statically (#1282)
* Switched mscclpp_nccl to static linking. Added a build step to rename the NCCL API functions.
* Undid separation of building libmscclpp_nccl from building librccl with MSCCL++ integration. With a static build, it's either fully enabled or fully disabled.
* `nm` isn't always available in stripped-down docker containers. Removed use of `nm` in `cmake` and hard-coded the output into mscclpp_nccl_syms.txt.
* Removed IBVerbs dependency for integrating with MSCCL++ (#1313)
* Renamed `RCCL_ENABLE_MSCCLPP` to `RCCL_MSCCLPP_ENABLE` to conform to MSCCL. Set `RCCL_MSCCLPP_ENABLE` to 1 by default if `ENABLE_MSCCLPP` is defined, or 0 otherwise. Added a log warning if `RCCL_MSCCLPP_ENABLE` is set to 1 but `ENABLE_MSCCLPP` is not defined. (#1294)
* Include mscclpp as a git submodule (#1314)
* Added the desired mscclpp commit as a git submodule.
* Added a step to automatically check out the mscclpp submodule if it isn't already present, in case the user forgot to clone recursively.
* Added instruction to README to clone using --recurse-submodules to get the mscclpp submodule.
* Enabled MSCCL++ feature build.
---------
Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>
Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>
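For the submodule change above, fetching mscclpp at clone time avoids relying on the build's automatic checkout. The sketch below is illustrative only: the helper name `submodule_missing` and the path `ext-src/mscclpp` are assumptions, and the build's actual detection logic may differ.

```shell
# Hypothetical helper: succeeds when a submodule directory is absent or empty,
# which is how an uninitialized git submodule typically looks on disk.
submodule_missing() {
    dir="$1"
    [ ! -d "$dir" ] || [ -z "$(ls -A "$dir" 2>/dev/null)" ]
}

# Example use from the repository root (the path is an assumption):
#   if submodule_missing ext-src/mscclpp; then
#       git submodule update --init --recursive
#   fi
```

A fresh clone can skip this step by passing `--recurse-submodules` to `git clone`, as the README now instructs.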
MSCCL can now run in a multi-threaded configuration. To exercise this in the unit tests, added the ENABLE_OPENMP compile definition and the --openmp-test-enable flag to the unit test build script. To activate, set the environment variables UT_MULTITHREADED=1 and UT_PROCESS_MASK=1. Set Jenkins to use this mode.
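Enabling the multi-threaded mode described above might look like the following. Only the two environment variables and the `--openmp-test-enable` flag come from this commit; the function name is hypothetical, and the commented commands (script name and test binary path) are assumptions.

```shell
# Hypothetical convenience wrapper: exports the variables named in this commit.
enable_multithreaded_ut() {
    export UT_MULTITHREADED=1
    export UT_PROCESS_MASK=1
}

enable_multithreaded_ut
echo "UT_MULTITHREADED=${UT_MULTITHREADED} UT_PROCESS_MASK=${UT_PROCESS_MASK}"

# Assumed workflow, shown for orientation only:
#   ./install.sh --openmp-test-enable   # build script assumed to be install.sh
#   <path to unit test binary>          # launched with the variables above set
```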
The header files will be in /opt/rocm-xxx/include/rccl.
Libraries and cmake files will be in the /opt/rocm-xxx/lib folder.
Added wrappers for header files using rocm-cmake functions for backward compatibility.
* Fixing cmake_install_prefix search to include /opt/rocm-xxxx
* Replacing all hard-coded references to /opt/rocm with ROCM_PATH
* Setting ROCM_PATH CMake variable in install script
* Initial commit of all_reduce_only support
* Working AllReduce only build
* Removing printfs and restoring release build
* Restore P2P index
* Updates to build_allreduce_only mode.
* Cleaning up macro ifdefs
* Update install.sh
install.sh hard-codes paths such as /opt/rocm/bin/hipcc for rocm_path and default_path=/opt/rocm.
This works only with a standalone ROCm install. With a side-by-side install, the build fails with the error below.
Could we use ROCM_PATH=$ROCM_PATH instead of the default_path variable name, and
ROCM_BIN_PATH=$ROCM_PATH/bin, so that rocm_path can be replaced with ROCM_BIN_PATH?
That way, ROCM_PATH can be exported as an environment variable as needed before running the script.
I have tried this locally and it works. ROCM_PATH is the variable name we commonly use.
If that is OK, I can submit a PR for it.
Error when a side-by-side install is used for the driver:
# ./install.sh -dtr 2>&1 | tee /dockerx/6519_rccl-test.log
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
Could not find compiler set in environment variable CXX:
/opt/rocm/bin/hipcc.
Call Stack (most recent call first):
CMakeLists.txt:12 (project)
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
See also "/root/driver/rccl/build/release/CMakeFiles/CMakeOutput.log".
* Update install.sh
Removed ROCM_PATH=$ROCM_PATH
* Update install.sh
Set default value if external value is not supplied.
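The behaviour these install.sh updates converge on, honouring an externally supplied ROCM_PATH and otherwise falling back to /opt/rocm, can be sketched with shell parameter expansion. This is an illustration under assumptions, not the script's actual code: the function name `resolve_rocm_path` and the lowercase variable names are hypothetical.

```shell
# Sketch only: use the caller's ROCM_PATH when it is set and non-empty,
# otherwise fall back to the default /opt/rocm location.
resolve_rocm_path() {
    printf '%s\n' "${ROCM_PATH:-/opt/rocm}"
}

rocm_path="$(resolve_rocm_path)"
rocm_bin_path="${rocm_path}/bin"
echo "Using hipcc from ${rocm_bin_path}/hipcc"
```

With a side-by-side install, exporting ROCM_PATH to the versioned install directory before running the script would then point the compiler lookup at that tree instead of /opt/rocm.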
* Fixing temp file creation/deletion for Clique kernel mode.
* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs
* GroupCall MP UT properly quits when too many devices specified
* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
* Adding the ability to force install dependencies (namely gtest); gtest library installation fix for centos
* Removing potentially unnecessary dependencies from install script
* Adding static library building option.
* Disabling running tests for static build
* Removing static packaging in CI
Co-authored-by: Saad Rahim <saad.rahim@amd.com>
* Making hip-clang the default compiler; documentation update
* Adding back --hip-clang to install.sh as a silent option for CI
* Documentation updates for NCCL 2.7
* Restoring deleted line in install script