* Update install.sh
Install.sh having hard code like /opt/rocm/bin/hipcc for rocm_path and default_path=/opt/rocm
This will work only when we have standalone rocm installed. If anyone has installed, side-by-side, they will face below error.
Can we keep like ROCM_PATH=$ROCM_PATH instead of “default_path” as variable name and
ROCM_BIN_PATH=$ROCM_PATH/bin ,rocm_path can be replaced with ROCM_BIN_PATH.
This way, we will have option to export ROCM_PATH as env variable as per need and use the script.
I have also tried locally, it’s working. ROCM_PATH is common variable name, we are having.
If you are ok, I can also submit the PR for the same.
Error when side-by-side install is done for driver.
# ./install.sh -dtr 2>&1 | tee /dockerx/6519_rccl-test.log
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
Could not find compiler set in environment variable CXX:
/opt/rocm/bin/hipcc.
Call Stack (most recent call first):
CMakeLists.txt:12 (project)
CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
See also "/root/driver/rccl/build/release/CMakeFiles/CMakeOutput.log".
* Update install.sh
Removed ROCM_PATH=$ROCM_PATH
* Update install.sh
Set default value if external value is not supplied.
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.
* Fixing temp file creation/deletion for Clique kernel mode.
* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs
* GroupCall MP UT properly quits when too many devices specified
* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
* gtest: add scatter to combined calls and use loops
* gtest: run validation inside loop
* gtest: revert small element count to 2520
* gtest: fix memory leak in validation
(cherry picked from commit b0853ccd51)
* Fix combined call UT
* Fix memory leak
* Fix alltoallv test
* Adding CPU based execution, fixing typos, adding Fine-grained mem
* Exposing sampling factor when generating range of data sizes
* Refactoring how Links are launched, now once per thread
* Documentation updates
* Changing default timing mechanism, adjusting CPU bandwidth calc, adding flag to use combined timing
* Adding support for smaller transfers (byte size must be multiple of 4 instead of 128)
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix#379 : topology injection failing when using less GPUs than
described in the XML.
Fix#394 : protocol mismatch causing hangs or crashes when using
one GPU per node.
* gtest: add scatter to combined calls and use loops
* gtest: run validation inside loop
* gtest: revert small element count to 2520
* gtest: fix memory leak in validation