akolliasAMD
dcf46e84e0
moved default number of max ranks per gpu to 1
...
[ROCm/rccl commit: 8b9291eb47 ]
2022-06-22 17:37:49 +00:00
Ziyue Yang
2b418b5dee
Add Feature - Add NPKit Support in RCCL ( #564 )
...
* apply npkit
* fix bug
* add npkit in readme
[ROCm/rccl commit: 6e93fafdc3 ]
2022-06-20 14:30:19 -07:00
Wenkai Du
0fb000932f
Change default nchannels per peer ( #563 )
...
[ROCm/rccl commit: f274c865c1 ]
2022-06-13 06:39:05 -07:00
arvindcheru
9c0e790eb5
[CMake] GNU Install Dir Enhancements ( #557 )
...
* sd321110 (GNUInstall Dir) enhancements
[ROCm/rccl commit: a1fe1adf1c ]
2022-06-10 18:51:51 -04:00
Edgar
f7ef619ba7
extending the unit-tests for multi-rank support
...
[ROCm/rccl commit: a87d61db2b ]
2022-06-10 14:23:19 +00:00
Edgar
8953f5b5ca
Introduce multi-rank support per device.
...
This is a single commit of the source code changes required to
introduce support for multiple ranks per device.
A new interface (ncclCommInitRankMulti) must be used to take advantage of
this new feature.
[ROCm/rccl commit: 0336ffdf70 ]
2022-06-10 14:23:12 +00:00
Wenkai Du
11a6cdd52f
Fix P2P scheduling ( #560 )
...
[ROCm/rccl commit: 5cb2aca3d9 ]
2022-06-06 13:32:28 -07:00
Wenkai Du
f2dbc77afe
Enable timing profile option ( #558 )
...
[ROCm/rccl commit: 7a6c6927ae ]
2022-06-03 07:05:13 -07:00
Aristotelis
0b55e01ef3
Merge remote-tracking branch 'ncclRepo/master' into develop
...
[ROCm/rccl commit: e0864e7093 ]
2022-06-02 15:27:24 +00:00
Wenkai Du
1e36b432f1
Revert chunksteps changes ( #555 )
...
[ROCm/rccl commit: eef812bed7 ]
2022-05-31 14:45:51 -07:00
Wenkai Du
5becf1669f
Add another Rome model ( #553 )
...
* Add another Rome model
* Add option to force enable intranet on single node
* Limit p2p channels to number of ranks
* Refine p2p channels handling
[ROCm/rccl commit: ef499c4810 ]
2022-05-31 11:31:30 -07:00
akolliasAMD
a03ab8e752
code cleanup ( #554 )
...
[ROCm/rccl commit: a0a686e74c ]
2022-05-31 09:59:36 -04:00
Wenkai Du
2c125ce6ed
Update Rome model ( #552 )
...
[ROCm/rccl commit: c5b77121f0 ]
2022-05-26 09:59:23 -07:00
akolliasAMD
22dc8bd246
Added creation of new tree and added switch for using treesplit for specific cases ( #551 )
...
[ROCm/rccl commit: 98f0809a39 ]
2022-05-25 18:55:14 -04:00
gilbertlee-amd
a2a4888497
Moving opt-in custom signal handler from UnitTests into RCCL ( #550 )
...
* Enable via RCCL_ENABLE_SIGNALHANDLER=1
[ROCm/rccl commit: 700b473211 ]
2022-05-20 09:56:38 -06:00
Wenkai Du
86e8797602
Add switch for pivot alltoall kernel ( #549 )
...
[ROCm/rccl commit: 6707a270b1 ]
2022-05-17 18:14:04 -07:00
Wenkai Du
b30b8becea
Refine and add new Rome models ( #548 )
...
[ROCm/rccl commit: 283dc86a73 ]
2022-05-17 08:23:59 -07:00
Sylvain Jeaugey
1c5734046d
2.12.12-1
...
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.
[ROCm/rccl commit: 7aa1c46fd5 ]
2022-05-13 00:26:57 -07:00
Wenkai Du
b37180b7ed
Improve LL performance ( #546 )
...
* Improve LL performance
* Add split barriers for LL
[ROCm/rccl commit: c9919e0e35 ]
2022-05-10 13:32:10 -07:00
Edgar Gabriel
10ad61f469
Merge pull request #544 from edgargabriel/topic/header-file-include
...
fix cmake logic to handle old and new include dirs
[ROCm/rccl commit: 46b30c5f9b ]
2022-04-28 16:29:08 -05:00
Edgar
053b658a48
fix cmake logic to handle old and new include dirs
...
Starting with ROCm 5.2, the include directories are reorganized.
This PR allows rccl to compile with both the old and the new
directory layout.
The solution uses find_package() to identify the correct
settings for rocm_smi starting from rocm-5.2, and the original (manual)
settings for all previous releases.
Tested with rocm-5.2, 5.1.1, 5.0.2, and 4.5.2.
[ROCm/rccl commit: 4c4a7cb696 ]
2022-04-28 14:33:46 -04:00
gilbertlee-amd
c6804778d1
[TransferBench] Syncing with TransferBench v1.02 ( #541 )
...
[ROCm/rccl commit: 685bcea127 ]
2022-04-27 20:43:24 -06:00
Wenkai Du
95b30d9762
topo_expl: fix build and add tuning support ( #539 )
...
[ROCm/rccl commit: 063da25563 ]
2022-04-26 15:40:07 -07:00
Wenkai Du
f610810d7b
Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
...
Sync up with NCCL 2.12.10
[ROCm/rccl commit: 379940dfac ]
2022-04-26 10:09:37 -07:00
Edgar
1bfc5d06f8
add a signal handler and backtrace
...
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Optionally add C++ function name demangling
[ROCm/rccl commit: 2bf6d254b6 ]
2022-04-25 10:48:17 -04:00
Wenkai Du
347ea354c2
Update tuning parameters
...
[ROCm/rccl commit: 83fd4f70e7 ]
2022-04-18 16:04:04 -07:00
Wenkai Du
67e7e6507e
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: d28e1cb44f ]
2022-04-18 11:15:25 -07:00
Wenkai Du
58b2c1ec9c
Fix random segfault ( #537 )
...
[ROCm/rccl commit: fd2f1b3b88 ]
2022-04-15 14:32:11 -07:00
Wenkai Du
011447e4dc
Add new Rome model ( #536 )
...
[ROCm/rccl commit: 2151c79d14 ]
2022-04-13 11:45:40 -07:00
Wenkai Du
f8023f2e07
Add new Rome model ( #535 )
...
[ROCm/rccl commit: ba4c165bf3 ]
2022-04-12 13:27:32 -07:00
nunnikri
be9374aa34
Installing rccl.h wrapper to /opt/rocm-xxx/include path ( #532 )
...
* Fixing the broken library soft link
* Installing rccl.h wrapper to /opt/rocm-xxx/include path.
The missing wrapper was causing compilation errors with PyTorch; this change fixes it.
[ROCm/rccl commit: b83efe9c5c ]
2022-04-09 07:55:39 -07:00
gilbertlee-amd
e61ff3ce37
Transfer bench single stream mode ( #531 )
...
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes
[ROCm/rccl commit: def6832287 ]
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey
e89ff21d35
Update Makefile to install static library.
...
Make sure make install also installs the static library.
Fixes #662
[ROCm/rccl commit: 9bfc1c6e35 ]
2022-04-08 14:00:43 +02:00
nunnikri
21415407ac
Fixing the broken library soft link ( #529 )
...
[ROCm/rccl commit: acfb0210ea ]
2022-04-07 15:19:33 -07:00
Wenkai Du
5ccdd9f5e1
Increase chunk steps of broadcast and reduce ( #528 )
...
[ROCm/rccl commit: 15b572751e ]
2022-04-07 13:34:04 -07:00
Colin Smith
3830310156
Doxygen fix for ncclRecv ( #527 )
...
Changed the Doxygen command for ncclRecv and pncclRecv to be consistent with other APIs.
[ROCm/rccl commit: b2ffcf6d89 ]
2022-04-05 14:07:56 -07:00
Wenkai Du
9884e61367
Add tuning model ( #523 )
...
[ROCm/rccl commit: 5cc0a405c0 ]
2022-04-04 10:19:57 -07:00
Wenkai Du
3332cdff07
Support multiple tuning tables ( #522 )
...
* Support multiple tuning tables
* [UnitTests] Skip managed memory testing
[ROCm/rccl commit: bbe780ca6c ]
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey
74f8baa0f3
Merge remote-tracking branch 'origin/master'
...
[ROCm/rccl commit: 8133784b32 ]
2022-03-30 02:29:05 -07:00
Sylvain Jeaugey
27130280b2
2.12.10-1
...
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
[ROCm/rccl commit: 353e8ba446 ]
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey
a52e328ba4
Fix merging error
...
[ROCm/rccl commit: 2247152a8e ]
2022-03-30 02:14:32 -07:00
Sylvain Jeaugey
3bc2e34df2
Merge branch 'master' into truncated_msg_warning
...
[ROCm/rccl commit: 2dfd83752c ]
2022-03-30 10:58:05 +02:00
Ke Wen
92bdde35eb
Display host name instead of numeric IP when referring to a peer
...
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"
[ROCm/rccl commit: 1382a87306 ]
2022-03-30 10:47:10 +02:00
Christopher Hesse
f6d1c7261f
Fix typo in net_ib.cc
...
[ROCm/rccl commit: b895abcdb8 ]
2022-03-30 10:45:01 +02:00
Felix Abecassis
54590464ca
Remove unnecessary newline in plugin logging
...
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
[ROCm/rccl commit: 1c7c014ceb ]
2022-03-30 10:44:49 +02:00
Wenkai Du
828f3d11a0
Update tuning parameters ( #518 )
...
* Update tuning parameters
* Respect user algo and topo selections
[ROCm/rccl commit: 7cbbca4da1 ]
2022-03-29 08:15:37 -07:00
gilbertlee-amd
4c32c51772
Adding explicit request for coarse-grained host memory due to changes in HipHostMalloc ( #517 )
...
[ROCm/rccl commit: 2d558c9abc ]
2022-03-25 13:05:07 -06:00
Liam Wrubleski
95c6476678
Packages for test and benchmark executables on all supported OSes using CPack. ( #512 )
...
[ROCm/rccl commit: a8f1e61f48 ]
2022-03-21 15:04:14 -06:00
Wenkai Du
db1e628ba3
Update Rome model matching and add new models ( #516 )
...
* Update Rome model matching and add new models
* Add missing file
* Models update
[ROCm/rccl commit: cd17cf6dce ]
2022-03-21 10:54:40 -07:00
akolliasAMD
3493750b6b
Added alltoallv test and optional args variable on collective args ( #514 )
...
* Added alltoallv test and optional args variable on collective args
[ROCm/rccl commit: 65ea3d80db ]
2022-03-18 13:55:11 -04:00