555 Commits

Author SHA1 Message Date
Wenkai Du c985358e11 Merge remote-tracking branch 'nccl/master' into 2.8.3 2021-02-15 18:44:47 -05:00
Wenkai Du 3a1aebd742 Merge remote-tracking branch 'rccl/develop' into 2.8.3 2021-02-15 13:17:38 -05:00
Wenkai Du bf8eb40705 Move HDP flush to CPU 2021-02-12 18:06:19 +00:00
pramenku e9f7908592 Update install.sh (#317)
* Update install.sh

install.sh hard-codes /opt/rocm/bin/hipcc for rocm_path and sets default_path=/opt/rocm.
This works only with a standalone ROCm install. Anyone with a side-by-side install will hit the error below.

Could we use ROCM_PATH=$ROCM_PATH instead of "default_path" as the variable name, and
ROCM_BIN_PATH=$ROCM_PATH/bin, so that rocm_path can be replaced with ROCM_BIN_PATH?

That way, ROCM_PATH can be exported as an environment variable as needed before running the script.
I have tried this locally and it works. ROCM_PATH is the variable name we commonly use.

If you are OK with this, I can submit a PR for it.


Error when side-by-side install is done for driver.
# ./install.sh -dtr 2>&1 | tee /dockerx/6519_rccl-test.log
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
Could not find compiler set in environment variable CXX:
/opt/rocm/bin/hipcc.
Call Stack (most recent call first):
CMakeLists.txt:12 (project)

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
See also "/root/driver/rccl/build/release/CMakeFiles/CMakeOutput.log".

* Update install.sh

Removed ROCM_PATH=$ROCM_PATH

* Update install.sh

Set default value if external value is not supplied.
2021-02-12 08:44:30 -08:00
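
The last step of the install.sh change above (e9f7908592), "Set default value if external value is not supplied", can be sketched with shell parameter expansion. This is a minimal sketch, not the actual contents of install.sh: the helper name resolve_rocm_path is hypothetical, ROCM_PATH and ROCM_BIN_PATH are taken from the commit message, and exporting CXX is an assumption based on the CMake error quoted in that message.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names follow the commit message, not the real install.sh.

# Print the caller's ROCM_PATH if it is set and non-empty,
# otherwise fall back to the standalone default /opt/rocm.
resolve_rocm_path() {
    printf '%s\n' "${ROCM_PATH:-/opt/rocm}"
}

ROCM_PATH="$(resolve_rocm_path)"
ROCM_BIN_PATH="${ROCM_PATH}/bin"

# Assumption: the build picks up its compiler from CXX, as the quoted
# CMake error suggests, so point it at hipcc under the selected install.
export CXX="${ROCM_BIN_PATH}/hipcc"
echo "Using compiler: ${CXX}"
```

With a side-by-side install, exporting ROCM_PATH before calling the script (for example ROCM_PATH=/opt/rocm-4.0.0, a made-up path) would select that tree. Leaving it unset keeps the /opt/rocm default.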
Stanley Tsang 6b7b312fb9 Fixed temp file creation/deletion with clique mode (#316) 2021-02-12 08:44:10 -08:00
Sylvain Jeaugey 911d61f214 2.8.4-1
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.
2021-02-09 15:36:48 -08:00
Gilbert Lee f1a9ce3fa5 Using GTEST_SKIP() to skip unit tests that have insufficient devices. Skipping out earlier 2021-02-09 03:54:04 +00:00
Wenkai Du 9cc3b56166 Fix GDRDMA read and remove unused files 2021-02-09 01:34:39 +00:00
Stanley Tsang d00b7d17bd Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16)
* Fixing temp file creation/deletion for Clique kernel mode.

* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs

* GroupCall MP UT properly quits when too many devices specified

* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script
2021-02-05 16:49:25 -08:00
Wenkai Du 6dfdfef98f Add gfx908 Rome 4 NICs model 2021-02-06 00:19:47 +00:00
Gilbert Lee f372c53d52 [TransferBench] Fixing some merge issues 2021-02-05 16:46:20 +00:00
Wenkai Du ab1e7a0318 Merge remote-tracking branch 'origin/develop' into 2.8.3 2021-02-04 20:02:34 -05:00
Gilbert Lee 2f541508c5 [topo_expl] Updating for 2.8.3 2021-02-04 19:08:42 +00:00
Gilbert Lee 9aac1ed38f [ib-test] Update for 2.8.3 2021-02-04 19:05:03 +00:00
Gilbert Lee 9ce203dd0a [TransferBench] Updating for 2.8.3 2021-02-04 18:58:25 +00:00
gilbertlee-amd 1990ffd76a Tuning some clique-based kernel parameters (#315) 2021-02-03 20:00:08 -07:00
Wenkai Du 5f97122442 Enable GPU direct RDMA read from GPU 2021-02-03 02:48:30 +00:00
gilbertlee-amd 62e0447e9a [TransferBench] Restore some previous fixes - memory leak, PCIe address (#314) 2021-02-01 09:48:09 -07:00
Gilbert Lee 01a998b17c Removing in-place tests from Combined calls (no support for send/recv) 2021-01-28 20:09:03 +00:00
gilbertlee-amd 3e62ceddc5 Clique kernel support (#295) (#15)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2021-01-28 09:45:01 -07:00
Wenkai Du 41e47a36e7 Use less unroll for clique kernels (#313) 2021-01-15 17:48:10 -08:00
Stanley Tsang d3fa257682 Adding multiprocess unit tests (#312)
Adding multiprocess unit tests for collectives.  

To run, NCCL_COMM_ID=$HOSTNAME:12345 build/release/test/UnitTestsMultiProcess
2021-01-15 16:34:36 -07:00
Wenkai Du 2ddbe6646b Improve collective trace 2021-01-14 19:28:01 -05:00
Wenkai Du b33a2cac8b gtest: add scatter to combined calls and use loops (#303)
* gtest: add scatter to combined calls and use loops

* gtest: run validation inside loop

* gtest: revert small element count to 2520

* gtest: fix memory leak in validation

(cherry picked from commit b0853ccd51)

* Fix combined call UT

* Fix memory leak

* Fix alltoallv test
2021-01-14 19:28:01 -05:00
Wenkai Du f4d5d3d620 Port alltoall[v] 2021-01-14 19:28:01 -05:00
Wenkai Du 105db19a11 Do not allow GPU as intermediate 2021-01-14 19:28:01 -05:00
Wenkai Du e055229e56 Revert "Changes to topology based on XGMI (#272)"
This reverts commit 01bd2573db.
2021-01-14 19:28:01 -05:00
Wenkai Du d469947641 Merge remote-tracking branch 'nccl/master' into no-target-id 2021-01-14 19:27:53 -05:00
Jonas Zhou 3996562690 x86: Add CPU detection for Zhaoxin processors
Signed-off-by: Jonas Zhou <JonasZhou@zhaoxin.com>
2020-12-17 11:15:18 -08:00
Wenkai Du 373a108516 Fix Rome PCIe 2 node topology generation (#310) 2020-12-15 17:16:17 -08:00
gilbertlee-amd 41c35dad48 [TransferBench] Fixing bug with fine-grained memory allocation (#311)
* Fixing bug with fine-grained memory
2020-12-15 17:37:31 -07:00
gilbertlee-amd ae0c4092c7 [TransferBench] Adding ability to perform CPU-executed copies, various upgrades (#309)
* Adding CPU based execution, fixing typos, adding Fine-grained mem
* Exposing sampling factor when generating range of data sizes
* Refactoring how Links are launched, now once per thread
* Documentation updates
2020-12-11 10:21:14 -07:00
gilbertlee-amd b80ae551b1 [TransferBench] Support multiple of 4 byte sizes, changing default GPU timing mechanism (#307)
* Changing default timing mechanism, adjusting CPU bandwidth calc, adding flag to use combined timing
* Adding support for smaller transfers (byte size must be multiple of 4 instead of 128)
2020-12-04 14:57:13 -07:00
Wenkai Du 882d52ad7e Adding backward compatibility for target-id syntax for AMDGPU_TARGETS (#306) 2020-12-04 13:55:56 -08:00
Wenkai Du 975b14dffa Add Rome model and improve search (#305) 2020-11-17 14:55:06 -08:00
Sylvain Jeaugey 920dbe5b35 2.8.3-1
Optimization for Tree allreduce on A100.
Improve aggregation performance.
Use shared buffers for inter-node send/recv.
Add NVTX profiling hooks.
Accelerate alltoall connections by merging communication for all
channels.
Add support for one hop communication through NVLink, for faster
send/recv communication on cubemesh topologies like DGX-1.
Improve alltoall scheduling to better balance intra/inter node
communication.
Increase send/recv parallelism by 8x, each warp sending or
receiving to a different peer.
Net: move to v4.
Net: make flush operation asynchronous to accelerate alltoall.
Net: define maximum number of requests.
Fix hang when using LL128 protocol after 2^31 steps.
Fix #379 : topology injection failing when using fewer GPUs than
described in the XML.
Fix #394 : protocol mismatch causing hangs or crashes when using
one GPU per node.
2020-11-17 11:08:52 -08:00
Wenkai Du 1943bac646 Merge remote-tracking branch 'origin/master' into develop 2020-11-16 12:16:53 -05:00
Wenkai Du 554729079d Use device's link width and speed if port doesn't report (#304) 2020-11-13 17:58:04 -08:00
Wenkai Du b0853ccd51 gtest: add scatter to combined calls and use loops (#303)
* gtest: add scatter to combined calls and use loops

* gtest: run validation inside loop

* gtest: revert small element count to 2520

* gtest: fix memory leak in validation
2020-11-13 17:57:44 -08:00
Stanley Tsang 2958f7eace Fixing IPC handle leak (#302) 2020-11-13 10:32:42 -07:00
gilbertlee-amd c8d08a7c2f Adding RCCL_CLIQUE_DEBUG to help debug experimental clique feature (#300) 2020-11-13 09:07:11 -07:00
Wenkai Du 4e68229c8b Skip unused peer connection in scatter and gather (#301) 2020-11-12 15:47:34 -08:00
Colin Smith 377b43470b Merge pull request #299 from ROCmSoftwarePlatform/develop
Enable target id build
2020-11-10 15:47:42 -07:00
gilbertlee-amd 41bcfb8878 Clique kernel support (#295)
* Adding experimental clique-based kernels (opt-in only)

Co-authored-by: Stanley Tsang <stanley.tsang@amd.com>
Co-authored-by: Gilbert Lee <gilbert.lee@amd.com>
Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>
2020-11-10 15:44:10 -07:00
Wenkai Du 1fdb216f87 Use target id of xnack off (#298) 2020-11-10 11:10:48 -08:00
Wenkai Du 2e8b3a0857 Use ncclSend/ncclRecv for alltoall type of collectives as default (#297) 2020-11-09 11:23:17 -08:00
gilbertlee-amd bdd8adf1ca Adding a CHANGELOG (#296) 2020-11-05 13:38:30 -07:00
Wenkai Du 709b7e4880 Improve GPU direct RDMA handling on Rome (#294) 2020-11-03 14:29:08 -08:00
Wenkai Du dfa3c41ede Add more Rome models (#292) 2020-10-30 21:26:04 -07:00
gilbertlee-amd bfab1d3592 Adding output to CSV, removing OpenMP, decreasing default numBytes to 64MB, adding aggregate stats (#290) 2020-10-27 09:00:33 -06:00