Commit Graph

736 Commits

Author SHA1 Message Date
akolliasAMD dcf46e84e0 moved default number of max ranks per gpu to 1
[ROCm/rccl commit: 8b9291eb47]
2022-06-22 17:37:49 +00:00
Ziyue Yang 2b418b5dee Add Feature - Add NPKit Support in RCCL (#564)
* apply npkit

* fix bug

* add npkit in readme

[ROCm/rccl commit: 6e93fafdc3]
2022-06-20 14:30:19 -07:00
Wenkai Du 0fb000932f Change default nchannels per peer (#563)
[ROCm/rccl commit: f274c865c1]
2022-06-13 06:39:05 -07:00
arvindcheru 9c0e790eb5 [CMake] GNU Install Dir Enhancements (#557)
* sd321110 (GNUInstall Dir) enhancements

[ROCm/rccl commit: a1fe1adf1c]
2022-06-10 18:51:51 -04:00
Edgar f7ef619ba7 extending the unit-tests for multi-rank support
[ROCm/rccl commit: a87d61db2b]
2022-06-10 14:23:19 +00:00
Edgar 8953f5b5ca Introduce multi-rank support per device.
This is a single commit of the source code changes required to
introduce support for multiple ranks per device.
A new interface (ncclCommRankInitMulti) has to be used to make use of
this new feature.


[ROCm/rccl commit: 0336ffdf70]
2022-06-10 14:23:12 +00:00
Wenkai Du 11a6cdd52f Fix P2P scheduling (#560)
[ROCm/rccl commit: 5cb2aca3d9]
2022-06-06 13:32:28 -07:00
Wenkai Du f2dbc77afe Enable timing profile option (#558)
[ROCm/rccl commit: 7a6c6927ae]
2022-06-03 07:05:13 -07:00
Aristotelis 0b55e01ef3 Merge remote-tracking branch 'ncclRepo/master' into develop
[ROCm/rccl commit: e0864e7093]
2022-06-02 15:27:24 +00:00
Wenkai Du 1e36b432f1 Revert chunksteps changes (#555)
[ROCm/rccl commit: eef812bed7]
2022-05-31 14:45:51 -07:00
Wenkai Du 5becf1669f Add another Rome model (#553)
* Add another Rome model

* Add option to force enable intranet on single node

* Limit p2p channels to number of ranks

* Refine p2p channels handling

[ROCm/rccl commit: ef499c4810]
2022-05-31 11:31:30 -07:00
akolliasAMD a03ab8e752 code cleanup (#554)
[ROCm/rccl commit: a0a686e74c]
2022-05-31 09:59:36 -04:00
Wenkai Du 2c125ce6ed Update Rome model (#552)
[ROCm/rccl commit: c5b77121f0]
2022-05-26 09:59:23 -07:00
akolliasAMD 22dc8bd246 Added creation of new tree and added switch for using treesplit for specific cases (#551)
[ROCm/rccl commit: 98f0809a39]
2022-05-25 18:55:14 -04:00
gilbertlee-amd a2a4888497 Moving opt-in custom signal handler from UnitTests into RCCL (#550)
* Enable via RCCL_ENABLE_SIGNALHANDLER=1

[ROCm/rccl commit: 700b473211]
2022-05-20 09:56:38 -06:00
Wenkai Du 86e8797602 Add switch for pivot alltoall kernel (#549)
[ROCm/rccl commit: 6707a270b1]
2022-05-17 18:14:04 -07:00
Wenkai Du b30b8becea Refine and add new Rome models (#548)
[ROCm/rccl commit: 283dc86a73]
2022-05-17 08:23:59 -07:00
Sylvain Jeaugey 1c5734046d 2.12.12-1
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.


[ROCm/rccl commit: 7aa1c46fd5]
2022-05-13 00:26:57 -07:00
Wenkai Du b37180b7ed Improve LL performance (#546)
* Improve LL performance

* Add split barriers for LL

[ROCm/rccl commit: c9919e0e35]
2022-05-10 13:32:10 -07:00
Edgar Gabriel 10ad61f469 Merge pull request #544 from edgargabriel/topic/header-file-include
fix cmake logic to handle old and new include dirs

[ROCm/rccl commit: 46b30c5f9b]
2022-04-28 16:29:08 -05:00
Edgar 053b658a48 fix cmake logic to handle old and new include dirs
Starting from rocm 5.2 there is a reorganization of the
include directories. This pr allows to compile
rccl on both the old and the new directory layout.
This solution is using find_package() for identifying correct
settings for rocm_smi starting from rocm-5.2, and the original (manual)
settings for all previous releases.

Tested with rocm-5.2, 5.1.1, 5.0.2, and 4.5.2.


[ROCm/rccl commit: 4c4a7cb696]
2022-04-28 14:33:46 -04:00
gilbertlee-amd c6804778d1 [TransferBench] Syncing with TransferBench v1.02 (#541)
[ROCm/rccl commit: 685bcea127]
2022-04-27 20:43:24 -06:00
Wenkai Du 95b30d9762 topo_expl: fix build and add tuning support (#539)
[ROCm/rccl commit: 063da25563]
2022-04-26 15:40:07 -07:00
Wenkai Du f610810d7b Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
Sync up with NCCL 2.12.10

[ROCm/rccl commit: 379940dfac]
2022-04-26 10:09:37 -07:00
Edgar 1bfc5d06f8 add a signal handler and backtrace
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Optionally add C++ function name demangling


[ROCm/rccl commit: 2bf6d254b6]
2022-04-25 10:48:17 -04:00
Wenkai Du 347ea354c2 Update tuning parameters
[ROCm/rccl commit: 83fd4f70e7]
2022-04-18 16:04:04 -07:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 58b2c1ec9c Fix random segfault (#537)
[ROCm/rccl commit: fd2f1b3b88]
2022-04-15 14:32:11 -07:00
Wenkai Du 011447e4dc Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du f8023f2e07 Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
nunnikri be9374aa34 Installing rccl.h wrapper to /opt/rocm-xxx/include path (#532)
* Fixing the broken library soft link

* Installing rccl.h wrapper to /opt/rocm-xxx/include path.

This missing wrapper was causing compilation errors with pytorch. Fixing it

[ROCm/rccl commit: b83efe9c5c]
2022-04-09 07:55:39 -07:00
gilbertlee-amd e61ff3ce37 Transfer bench single stream mode (#531)
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes

[ROCm/rccl commit: def6832287]
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey e89ff21d35 Update Makefile to install static library.
Make sure make install also installs the static library. 
Fixes #662

[ROCm/rccl commit: 9bfc1c6e35]
2022-04-08 14:00:43 +02:00
nunnikri 21415407ac Fixing the broken library soft link (#529)
[ROCm/rccl commit: acfb0210ea]
2022-04-07 15:19:33 -07:00
Wenkai Du 5ccdd9f5e1 Increase chunk steps of broadcast and reduce (#528)
[ROCm/rccl commit: 15b572751e]
2022-04-07 13:34:04 -07:00
Colin Smith 3830310156 Doxygen fix for ncclRecv (#527)
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.

[ROCm/rccl commit: b2ffcf6d89]
2022-04-05 14:07:56 -07:00
Wenkai Du 9884e61367 Add tuning model (#523)
[ROCm/rccl commit: 5cc0a405c0]
2022-04-04 10:19:57 -07:00
Wenkai Du 3332cdff07 Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing

[ROCm/rccl commit: bbe780ca6c]
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey 74f8baa0f3 Merge remote-tracking branch 'origin/master'
[ROCm/rccl commit: 8133784b32]
2022-03-30 02:29:05 -07:00
Sylvain Jeaugey 27130280b2 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.


[ROCm/rccl commit: 353e8ba446]
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey a52e328ba4 Fix merging error
[ROCm/rccl commit: 2247152a8e]
2022-03-30 02:14:32 -07:00
Sylvain Jeaugey 3bc2e34df2 Merge branch 'master' into truncated_msg_warning
[ROCm/rccl commit: 2dfd83752c]
2022-03-30 10:58:05 +02:00
Ke Wen 92bdde35eb Display host name instead of numeric IP when referring to a peer
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"


[ROCm/rccl commit: 1382a87306]
2022-03-30 10:47:10 +02:00
Christopher Hesse f6d1c7261f Fix typo in net_ib.cc
[ROCm/rccl commit: b895abcdb8]
2022-03-30 10:45:01 +02:00
Felix Abecassis 54590464ca Remove unnecessary newline in plugin logging
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>

[ROCm/rccl commit: 1c7c014ceb]
2022-03-30 10:44:49 +02:00
Wenkai Du 828f3d11a0 Update tuning parameters (#518)
* Update tuning parameters

* Respect user algo and topo selections

[ROCm/rccl commit: 7cbbca4da1]
2022-03-29 08:15:37 -07:00
gilbertlee-amd 4c32c51772 Adding explicit request for coarse-grained host memory due to changes in HipHostMalloc (#517)
[ROCm/rccl commit: 2d558c9abc]
2022-03-25 13:05:07 -06:00
Liam Wrubleski 95c6476678 Packages for test and benchmark executables on all supported OSes using CPack. (#512)
[ROCm/rccl commit: a8f1e61f48]
2022-03-21 15:04:14 -06:00
Wenkai Du db1e628ba3 Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models

* Add missing file

* Models update

[ROCm/rccl commit: cd17cf6dce]
2022-03-21 10:54:40 -07:00
akolliasAMD 3493750b6b Added alltoallv test and optional args variable on collective args (#514)
* Added alltoallv test and optional args variable on collective args

[ROCm/rccl commit: 65ea3d80db]
2022-03-18 13:55:11 -04:00