Commit Graph

519 Commits

Author SHA1 Message Date
gilbertlee-amd ffdf00a2fa Fixing potential race-condition in env var parameter macro (#359)
[ROCm/rccl commit: 4f8e788a61]
2021-04-28 12:04:41 -06:00
saadrahim 7e30cf002d Expanding CI coverage for 8GPU configurations plus extended tests (#350)
[ROCm/rccl commit: 96782191cf]
2021-04-27 09:57:00 -06:00
Wenkai Du 22e8269864 Add libdl linking option (#358)
[ROCm/rccl commit: ad54a14a5c]
2021-04-26 15:24:58 -07:00
Wenkai Du 185ad0deab Use better name for kernel collective trace enable (#357)
"NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,COLL" enables collectives API
trace. Adding "RCCL_KERNEL_COLL_TRACE_ENABLE=1" enables kernel traces.

[ROCm/rccl commit: ed237dcaa7]
2021-04-26 08:35:53 -07:00
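
The trace settings described in commit 185ad0deab can be combined in a shell session roughly as follows. This is a minimal sketch: the three variable names and values come from the commit message, while `./my_rccl_app` is a hypothetical placeholder for whatever RCCL program is being traced.

```shell
# Enable the collectives API trace (values taken from the commit message).
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL

# Additionally enable kernel collective traces.
export RCCL_KERNEL_COLL_TRACE_ENABLE=1

# Show the resulting settings; a real run would launch the application here,
# e.g. ./my_rccl_app (hypothetical name, not part of RCCL).
echo "NCCL_DEBUG=${NCCL_DEBUG} NCCL_DEBUG_SUBSYS=${NCCL_DEBUG_SUBSYS} RCCL_KERNEL_COLL_TRACE_ENABLE=${RCCL_KERNEL_COLL_TRACE_ENABLE}"
```

Leaving `RCCL_KERNEL_COLL_TRACE_ENABLE` unset should keep only the API-level trace, per the commit message.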
Wenkai Du 60a24aa4db Control collective trace from kernel separately (#356)
[ROCm/rccl commit: 9cc9c3360b]
2021-04-23 16:36:19 -07:00
Stanley Tsang 6680e23c63 Message queue refactor to POSIX implementation and leak fix (#355)
* Fixing message queue leak.

* Using POSIX implementation of Message Queues

* Adding unlink to msgqueue

* MsgQueue update

* Adding timeout check to msgqueue broadcast; tightening up system checks

* Removing unnecessary code

* Removing extra argument from print

* Adding explicit msg queue close call to all other ranks

[ROCm/rccl commit: 70597789d0]
2021-04-23 11:33:20 -06:00
Wenkai Du e28bac31aa Tune number of channels for gfx90a (#349)
[ROCm/rccl commit: 415c7cd3d1]
2021-04-19 15:27:01 -07:00
Wenkai Du 951d89b12f Use correct WARP_SIZE for gfx1030 (#348)
[ROCm/rccl commit: 9c718ce6d6]
2021-04-14 14:09:52 -07:00
Wenkai Du 661b1351a3 Limit max channels for ring graph on single node Rome (#347)
* Limit max channels for ring graph on single node Rome
* Partially revert "Use non-temporal access for streaming data (#341)"

[ROCm/rccl commit: a79f74082e]
2021-04-14 10:14:54 -07:00
Wenkai Du 0f4d497edc Add gfx90a target (#344)
* Add gfx90a target

* Support gfx90a topology

Co-authored-by: Eiden Yoshida <eiden.yoshida@amd.com>

[ROCm/rccl commit: 1fe031402a]
2021-04-14 09:29:00 -06:00
Wenkai Du 27f33208e3 Remove link to NUMA lib as it is no longer needed (#346)
[ROCm/rccl commit: 3f18540f50]
2021-04-12 09:53:17 -07:00
TomSang 6105af2dfc Add detection of cooperative multi device launch attribute (#345)
[ROCm/rccl commit: 87f12cbb86]
2021-04-11 13:29:24 -07:00
Wenkai Du 040656ba9b Move RCCL changelog and Copyright out of /usr/share (#343)
[ROCm/rccl commit: def8b4ca0d]
2021-04-09 14:08:40 -07:00
Wenkai Du 6b3389b790 Use non-temporal access for streaming data (#341)
* Use non-temporal access for streaming data

* Revert to ulong2 after fixing compiling issue

[ROCm/rccl commit: 9dfc2c183e]
2021-04-07 17:34:35 -07:00
gilbertlee-amd a7e99734e8 Fixing clique-topology detection (#342)
* Fixing clique-topology detection
* Fix to enable multi-process clique-based kernels

[ROCm/rccl commit: caba0a63d2]
2021-04-07 11:29:44 -06:00
Wenkai Du b4a7fa7011 Cleanup number of channels calculation (#340)
[ROCm/rccl commit: e26ad2995e]
2021-04-05 17:51:56 -07:00
Wenkai Du 8927d8bf17 Fix incorrect net counting (#339)
* Fix incorrect net counting

* Add comments

[ROCm/rccl commit: 17491c918e]
2021-04-05 12:21:57 -07:00
Wenkai Du 018a31877c Rework network port trimming code (#338)
* Rework network port trimming code

* Move Rome related changes to separate source files

[ROCm/rccl commit: 1d2946ee4b]
2021-03-31 10:25:59 -07:00
Wenkai Du 4da9c54d4e Check fine grained memory on peer GPU before enabling P2P (#337)
[ROCm/rccl commit: 0c78553ee0]
2021-03-30 09:06:39 -07:00
Wenkai Du 065bde98d8 collnet: support multiple NICs (#335)
[ROCm/rccl commit: d87dc7c2e8]
2021-03-25 20:59:32 -07:00
gilbertlee-amd e85a70a967 Updating CHANGELOG.md for ROCm 4.1.0 (#336)
[ROCm/rccl commit: 69c0b44553]
2021-03-25 21:44:24 -06:00
Stanley Tsang 016a11f61b Fixing message queue leak. (#331)
[ROCm/rccl commit: 289db2a636]
2021-03-25 19:11:43 -06:00
Wenkai Du 07548845a6 Remove HDP workaround for ROCm 4.2 HIP (#334)
[ROCm/rccl commit: 0fbb9510a5]
2021-03-23 20:11:37 -07:00
Wenkai Du 287ed0f18a Enable collnet in RCCL (#333)
* Enable CollNet and use different number of channels

* topo_expl: enable collnet

[ROCm/rccl commit: 1d6244b18d]
2021-03-19 12:58:13 -07:00
Wenkai Du 260b54adb9 Hide non-public symbols from library (#332)
* Hide non-public symbols from library

* Move flag outside of parallel-jobs check

[ROCm/rccl commit: 244e25d980]
2021-03-18 18:08:08 -07:00
Wenkai Du 052119c20a Sort GPUs by HIP device ID (#329)
* Sort GPUs by HIP device ID

* Remove extra space

[ROCm/rccl commit: b46260260a]
2021-03-16 16:51:32 -07:00
Wenkai Du 7374c512d9 Add GPU memory usage tracker (#326)
[ROCm/rccl commit: f60b76c67a]
2021-03-06 20:32:30 -08:00
Wenkai Du b7253710ca Revert "Port alltoall[v]" (#325)
This reverts commit 2c49121171.

[ROCm/rccl commit: 8e180cf087]
2021-03-06 13:59:31 -08:00
Wenkai Du bcf4ecb0e3 Enable local sendrecv over network if GDR is available on all GPUs (#324)
[ROCm/rccl commit: c018edf0f2]
2021-03-05 19:59:41 -08:00
gilbertlee-amd 8d90821062 Adding pthread_join / pthread_detach to clean up pthreads to avoid leaks (#322)
[ROCm/rccl commit: f4a9b9acba]
2021-02-26 16:29:55 -07:00
Wenkai Du 897cb8d618 Merge pull request #320 from wenkaidu/nbio
Match NBIO only when GPUs and NICs are directly connected to CPU

[ROCm/rccl commit: 7191d0959b]
2021-02-25 17:49:09 -08:00
Stanley Tsang 4c42a55c53 Temporarily disabling multiprocess unit tests
[ROCm/rccl commit: 5d3b1fa4d0]
2021-02-25 23:59:44 +00:00
Eiden Yoshida 310dc0e738 Add ubuntu gfx908 target to CI
[ROCm/rccl commit: 25a7dad765]
2021-02-24 19:01:56 -07:00
Wenkai Du 020ccb6730 Update tuning parameters for XGMI and NET
[ROCm/rccl commit: e820a943e9]
2021-02-23 21:41:26 +00:00
Wenkai Du d24df90882 Match NBIO only when GPUs and NICs are directly connected to CPU
[ROCm/rccl commit: ec8d89b1dd]
2021-02-22 18:52:29 -05:00
Stanley Tsang fb7478e3a2 Do not search for GTest if we're force installing it into build directory (#318)
[ROCm/rccl commit: 61d66c8e18]
2021-02-22 15:22:01 -07:00
Stanley Tsang ce02076d64 Fixing cache deletion for CliqueManager; updating copyright
[ROCm/rccl commit: 45f5255f7c]
2021-02-19 22:22:46 +00:00
Stanley Tsang cc450d0fc5 Make CI only run subset of unit tests.
[ROCm/rccl commit: 6a568a8652]
2021-02-19 21:37:51 +00:00
Cory Bloor 6c28ab176f Add Jenkins docs build (#18)
* Fix typo in copyright

* Minor README improvements

- Prevent underscores from being interpreted as italics in test name format.
- Switch URL to HTTPS.

* Update docs scripts config

- Allow run_doc.sh and run_doxygen.sh to be called from any directory.

* Add docs build to Jenkins

[ROCm/rccl commit: 8aea5edb29]
2021-02-18 16:37:37 -07:00
Wenkai Du 6c3ccc2192 Add support to another Rome model
[ROCm/rccl commit: 95f178324c]
2021-02-18 02:00:31 +00:00
Wenkai Du ab71643c99 Merge remote-tracking branch 'nccl/master' into 2.8.3
[ROCm/rccl commit: c985358e11]
2021-02-15 18:44:47 -05:00
Wenkai Du ea2e665834 Merge remote-tracking branch 'rccl/develop' into 2.8.3
[ROCm/rccl commit: 3a1aebd742]
2021-02-15 13:17:38 -05:00
Wenkai Du 34d7735419 Move HDP flush to CPU
[ROCm/rccl commit: bf8eb40705]
2021-02-12 18:06:19 +00:00
pramenku 7d2074fb31 Update install.sh (#317)
* Update install.sh

install.sh hard-codes /opt/rocm/bin/hipcc for rocm_path and sets default_path=/opt/rocm.
This works only with a standalone ROCm installation. With a side-by-side installation, the script fails with the error below.

Can we use ROCM_PATH=$ROCM_PATH instead of “default_path” as the variable name, and
ROCM_BIN_PATH=$ROCM_PATH/bin, so that rocm_path can be replaced with ROCM_BIN_PATH?

This way, users can export ROCM_PATH as an environment variable as needed and then run the script.
I have tried this locally and it works. ROCM_PATH is also the variable name we commonly use.

If you are OK with this, I can submit a PR for it.

Error when a side-by-side install is done for the driver:
# ./install.sh -dtr 2>&1 | tee /dockerx/6519_rccl-test.log
CMake Error at /usr/share/cmake/Modules/CMakeDetermineCXXCompiler.cmake:48 (message):
Could not find compiler set in environment variable CXX:
/opt/rocm/bin/hipcc.
Call Stack (most recent call first):
CMakeLists.txt:12 (project)

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
-- Configuring incomplete, errors occurred!
See also "/root/driver/rccl/build/release/CMakeFiles/CMakeOutput.log".

* Update install.sh

Removed ROCM_PATH=$ROCM_PATH

* Update install.sh

Set default value if external value is not supplied.

[ROCm/rccl commit: e9f7908592]
2021-02-12 08:44:30 -08:00
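
The final step of commit 7d2074fb31 ("Set default value if external value is not supplied") matches a common shell idiom. The sketch below illustrates that idiom only and is not the actual contents of install.sh: the variable names `ROCM_PATH` and `ROCM_BIN_PATH` and the `/opt/rocm` default come from the commit message, while everything else is assumed for illustration.

```shell
# Use the caller's ROCM_PATH if it is exported and non-empty;
# otherwise fall back to the standalone install location.
ROCM_PATH="${ROCM_PATH:-/opt/rocm}"

# Derive the bin directory from ROCM_PATH instead of hard-coding it.
ROCM_BIN_PATH="${ROCM_PATH}/bin"

# Report what was resolved (illustrative output only).
echo "ROCM_PATH=${ROCM_PATH}"
echo "ROCM_BIN_PATH=${ROCM_BIN_PATH}"
```

With this pattern, a side-by-side installation could be selected by exporting `ROCM_PATH` before running the script, and the `/opt/rocm` default would apply only when nothing is supplied.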
Stanley Tsang 59f9e92374 Fixed temp file creation/deletion with clique mode (#316)
[ROCm/rccl commit: 6b7b312fb9]
2021-02-12 08:44:10 -08:00
Sylvain Jeaugey fc7bdb38a5 2.8.4-1
Fix hang in corner cases of alltoallv using point to point send/recv.
Harmonize error messages.
Fix missing NVTX section in the license.
Update README.


[ROCm/rccl commit: 911d61f214]
2021-02-09 15:36:48 -08:00
Gilbert Lee b36d93ae91 Using GTEST_SKIP() to skip unit tests that have insufficient devices. Skipping out earlier
[ROCm/rccl commit: f1a9ce3fa5]
2021-02-09 03:54:04 +00:00
Wenkai Du 602fe60428 Fix GDRDMA read and remove unused files
[ROCm/rccl commit: 9cc3b56166]
2021-02-09 01:34:39 +00:00
Stanley Tsang f152c8d160 Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16)
* Fixing temp file creation/deletion for Clique kernel mode.

* Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs

* GroupCall MP UT properly quits when too many devices specified

* MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script

[ROCm/rccl commit: d00b7d17bd]
2021-02-05 16:49:25 -08:00
Wenkai Du fe8923ebba Add gfx908 Rome 4 NICs model
[ROCm/rccl commit: 6dfdfef98f]
2021-02-06 00:19:47 +00:00