Commit Graph

747 Commits

Author SHA1 Message Date
hubertlu-tw e13eb2eab9 Enhancement of RCCL logging information for topology-aware utilities
[ROCm/rccl commit: a1842df858]
2022-07-11 19:01:10 +00:00
Wenkai Du c129677fe0 Skip HDP cache flush for gfx90a (#578)
* Skip HDP cache flush for gfx90a

* Remove extra debug print

[ROCm/rccl commit: 8c3c8b78c0]
2022-07-08 10:13:32 -07:00
Wenkai Du 659cd52d5c Add more constraints to enable GDR (#579)
* Add more constraints to enable GDR

* Revert deleted line

[ROCm/rccl commit: aa0d7ca882]
2022-07-08 09:52:27 -07:00
Yifan Xiong bf15ad1d72 Reduce AlltoAll port usage in send/recv proxy (#577)
* Reduce AlltoAll port usage when connecting proxy

Reuse socket ports when connecting proxies in AlltoAll.

Existing port usage in AlltoAll is O(n) for recv and O(n) for send.
Reusing socket ports on either the server or the client side makes one
of them O(1); reusing both reduces the total port usage to O(1) and
enables AlltoAll in >64 MI200 nodes.

* Update changelog accordingly

Update changelog accordingly.

[ROCm/rccl commit: 80f53cc171]
2022-07-07 16:15:52 -07:00
Wenkai Du 4b99cef680 Revert "Adding the missing roc:: namespace (#570)" (#576)
This reverts commit fc340decf4.

[ROCm/rccl commit: 2e65881a79]
2022-07-06 10:07:35 -07:00
Wenkai Du e04bba619a Use nontemporal in slow path and add XGMI sys type (#575)
* Use nontemporal in slow path and add XGMI sys type

* Clean up XGMI detection

[ROCm/rccl commit: b250c01cbe]
2022-07-06 07:58:41 -07:00
Wenkai Du 2f4aea93e0 Fix GPU to NIC mapping in tree (#573)
* Fix GPU to NIC mapping in tree

* Update tuning table

[ROCm/rccl commit: 00af1f64e9]
2022-07-03 20:52:52 -07:00
gilbertlee-amd cb5ae7224e Adding git hash info to version output line (#572)
[ROCm/rccl commit: a89a9966aa]
2022-06-28 16:42:51 -06:00
Dmitry Mikushin fc340decf4 Adding the missing roc:: namespace (#570)
* Adding the missing roc:: namespace, effectively changing the value of RCCL_LIBRARY from rccl to roc::rccl.
The important difference is that rccl is treated as a symbolic "-lrccl" by the linker (and fails the linking
due to a missing library search path), while roc::rccl is a target name, which can resolve into an absolute
library path.

Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>

* Adding a changelog entry

* minor updates to wording

* missing period

Co-authored-by: Paul Fultz II <pfultz2@yahoo.com>
Co-authored-by: Saad Rahim <44449863+saadrahim@users.noreply.github.com>

[ROCm/rccl commit: d5bea2cfaa]
2022-06-27 11:44:43 -06:00
Wenkai Du 915a9d3934 Do not set NET GDR level automatically (#571)
[ROCm/rccl commit: 9a285b5e1d]
2022-06-23 16:28:28 -07:00
Wenkai Du 784b12bf75 Use different atomics to check flags in kernel (#568)
[ROCm/rccl commit: c3bb9e70d0]
2022-06-23 09:16:41 -07:00
akolliasAMD dcf46e84e0 moved default number of max ranks per gpu to 1
[ROCm/rccl commit: 8b9291eb47]
2022-06-22 17:37:49 +00:00
Ziyue Yang 2b418b5dee Add Feature - Add NPKit Support in RCCL (#564)
* apply npkit

* fix bug

* add npkit in readme

[ROCm/rccl commit: 6e93fafdc3]
2022-06-20 14:30:19 -07:00
Wenkai Du 0fb000932f Change default nchannels per peer (#563)
[ROCm/rccl commit: f274c865c1]
2022-06-13 06:39:05 -07:00
arvindcheru 9c0e790eb5 [CMake] GNU Install Dir Enhancements (#557)
* sd321110 (GNUInstall Dir) enhancements

[ROCm/rccl commit: a1fe1adf1c]
2022-06-10 18:51:51 -04:00
Edgar f7ef619ba7 extending the unit-tests for multi-rank support
[ROCm/rccl commit: a87d61db2b]
2022-06-10 14:23:19 +00:00
Edgar 8953f5b5ca Introduce multi-rank support per device.
This is a single commit of the source code changes required to
introduce support for multiple ranks per device.
A new interface (ncclCommRankInitMulti) has to be used to make use of
this new feature.


[ROCm/rccl commit: 0336ffdf70]
2022-06-10 14:23:12 +00:00
Wenkai Du 11a6cdd52f Fix P2P scheduling (#560)
[ROCm/rccl commit: 5cb2aca3d9]
2022-06-06 13:32:28 -07:00
Wenkai Du f2dbc77afe Enable timing profile option (#558)
[ROCm/rccl commit: 7a6c6927ae]
2022-06-03 07:05:13 -07:00
Aristotelis 0b55e01ef3 Merge remote-tracking branch 'ncclRepo/master' into develop
[ROCm/rccl commit: e0864e7093]
2022-06-02 15:27:24 +00:00
Wenkai Du 1e36b432f1 Revert chunksteps changes (#555)
[ROCm/rccl commit: eef812bed7]
2022-05-31 14:45:51 -07:00
Wenkai Du 5becf1669f Add another Rome model (#553)
* Add another Rome model

* Add option to force enable intranet on single node

* Limit p2p channels to number of ranks

* Refine p2p channels handling

[ROCm/rccl commit: ef499c4810]
2022-05-31 11:31:30 -07:00
akolliasAMD a03ab8e752 code cleanup (#554)
[ROCm/rccl commit: a0a686e74c]
2022-05-31 09:59:36 -04:00
Wenkai Du 2c125ce6ed Update Rome model (#552)
[ROCm/rccl commit: c5b77121f0]
2022-05-26 09:59:23 -07:00
akolliasAMD 22dc8bd246 Added creation of new tree and added switch for using treesplit for specific cases (#551)
[ROCm/rccl commit: 98f0809a39]
2022-05-25 18:55:14 -04:00
gilbertlee-amd a2a4888497 Moving opt-in custom signal handler from UnitTests into RCCL (#550)
* Enable via RCCL_ENABLE_SIGNALHANDLER=1

[ROCm/rccl commit: 700b473211]
2022-05-20 09:56:38 -06:00
Wenkai Du 86e8797602 Add switch for pivot alltoall kernel (#549)
[ROCm/rccl commit: 6707a270b1]
2022-05-17 18:14:04 -07:00
Wenkai Du b30b8becea Refine and add new Rome models (#548)
[ROCm/rccl commit: 283dc86a73]
2022-05-17 08:23:59 -07:00
Sylvain Jeaugey 1c5734046d 2.12.12-1
Improve allreduce performance when we have more than one network interface per
GPU and we need to use PXN to close rings.
Add support for PCI Gen5 on 5.4 kernels.
Fix crash when setting NCCL_SET_THREAD_NAME.
Fix random crash in init due to uninitialized struct.
Fix hang on cubemesh topologies.
Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a
process.


[ROCm/rccl commit: 7aa1c46fd5]
2022-05-13 00:26:57 -07:00
Wenkai Du b37180b7ed Improve LL performance (#546)
* Improve LL performance

* Add split barriers for LL

[ROCm/rccl commit: c9919e0e35]
2022-05-10 13:32:10 -07:00
Edgar Gabriel 10ad61f469 Merge pull request #544 from edgargabriel/topic/header-file-include
fix cmake logic to handle old and new include dirs

[ROCm/rccl commit: 46b30c5f9b]
2022-04-28 16:29:08 -05:00
Edgar 053b658a48 fix cmake logic to handle old and new include dirs
Starting from rocm 5.2 there is a reorganization of the
include directories. This PR allows compiling
rccl on both the old and the new directory layout.
This solution is using find_package() for identifying correct
settings for rocm_smi starting from rocm-5.2, and the original (manual)
settings for all previous releases.

Tested with rocm-5.2, 5.1.1, 5.0.2, and 4.5.2.


[ROCm/rccl commit: 4c4a7cb696]
2022-04-28 14:33:46 -04:00
gilbertlee-amd c6804778d1 [TransferBench] Syncing with TransferBench v1.02 (#541)
[ROCm/rccl commit: 685bcea127]
2022-04-27 20:43:24 -06:00
Wenkai Du 95b30d9762 topo_expl: fix build and add tuning support (#539)
[ROCm/rccl commit: 063da25563]
2022-04-26 15:40:07 -07:00
Wenkai Du f610810d7b Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
Sync up with NCCL 2.12.10

[ROCm/rccl commit: 379940dfac]
2022-04-26 10:09:37 -07:00
Edgar 1bfc5d06f8 add a signal handler and backtrace
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Optionally add C++ function name demangling


[ROCm/rccl commit: 2bf6d254b6]
2022-04-25 10:48:17 -04:00
Wenkai Du 347ea354c2 Update tuning parameters
[ROCm/rccl commit: 83fd4f70e7]
2022-04-18 16:04:04 -07:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 58b2c1ec9c Fix random segfault (#537)
[ROCm/rccl commit: fd2f1b3b88]
2022-04-15 14:32:11 -07:00
Wenkai Du 011447e4dc Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du f8023f2e07 Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
nunnikri be9374aa34 Installing rccl.h wrapper to /opt/rocm-xxx/include path (#532)
* Fixing the broken library soft link

* Installing rccl.h wrapper to /opt/rocm-xxx/include path.

This missing wrapper was causing compilation errors with PyTorch. Fixing it.

[ROCm/rccl commit: b83efe9c5c]
2022-04-09 07:55:39 -07:00
gilbertlee-amd e61ff3ce37 Transfer bench single stream mode (#531)
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes

[ROCm/rccl commit: def6832287]
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey e89ff21d35 Update Makefile to install static library.
Make sure make install also installs the static library. 
Fixes #662

[ROCm/rccl commit: 9bfc1c6e35]
2022-04-08 14:00:43 +02:00
nunnikri 21415407ac Fixing the broken library soft link (#529)
[ROCm/rccl commit: acfb0210ea]
2022-04-07 15:19:33 -07:00
Wenkai Du 5ccdd9f5e1 Increase chunk steps of broadcast and reduce (#528)
[ROCm/rccl commit: 15b572751e]
2022-04-07 13:34:04 -07:00
Colin Smith 3830310156 Doxygen fix for ncclRecv (#527)
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.

[ROCm/rccl commit: b2ffcf6d89]
2022-04-05 14:07:56 -07:00
Wenkai Du 9884e61367 Add tuning model (#523)
[ROCm/rccl commit: 5cc0a405c0]
2022-04-04 10:19:57 -07:00
Wenkai Du 3332cdff07 Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing

[ROCm/rccl commit: bbe780ca6c]
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey 74f8baa0f3 Merge remote-tracking branch 'origin/master'
[ROCm/rccl commit: 8133784b32]
2022-03-30 02:29:05 -07:00