Commit Graph

798 Commits

Author SHA1 Message Date
Wenkai Du 95b30d9762 topo_expl: fix build and add tuning support (#539)
[ROCm/rccl commit: 063da25563]
2022-04-26 15:40:07 -07:00
Wenkai Du f610810d7b Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
Sync up with NCCL 2.12.10

[ROCm/rccl commit: 379940dfac]
2022-04-26 10:09:37 -07:00
Edgar Gabriel 673c695422 Merge pull request #530 from edgargabriel/topic/signal-intercept
Topic/signal intercept

[ROCm/rccl commit: 39e3002e19]
2022-04-25 10:44:26 -05:00
Edgar 1bfc5d06f8 add a signal handler and backtrace
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Optionally add C++ function name demangling


[ROCm/rccl commit: 2bf6d254b6]
2022-04-25 10:48:17 -04:00
Wenkai Du 347ea354c2 Update tuning parameters
[ROCm/rccl commit: 83fd4f70e7]
2022-04-18 16:04:04 -07:00
Wenkai Du 67e7e6507e Merge remote-tracking branch 'nccl/master' into develop
[ROCm/rccl commit: d28e1cb44f]
2022-04-18 11:15:25 -07:00
Wenkai Du 58b2c1ec9c Fix random segfault (#537)
[ROCm/rccl commit: fd2f1b3b88]
2022-04-15 14:32:11 -07:00
Wenkai Du 011447e4dc Add new Rome model (#536)
[ROCm/rccl commit: 2151c79d14]
2022-04-13 11:45:40 -07:00
Wenkai Du f8023f2e07 Add new Rome model (#535)
[ROCm/rccl commit: ba4c165bf3]
2022-04-12 13:27:32 -07:00
nunnikri be9374aa34 Installing rccl.h wrapper to /opt/rocm-xxx/include path (#532)
* Fixing the broken library soft link

* Installing rccl.h wrapper to /opt/rocm-xxx/include path.

The missing wrapper was causing compilation errors with PyTorch. This fixes it.

[ROCm/rccl commit: b83efe9c5c]
2022-04-09 07:55:39 -07:00
gilbertlee-amd e61ff3ce37 Transfer bench single stream mode (#531)
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes

[ROCm/rccl commit: def6832287]
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey e89ff21d35 Update Makefile to install static library.
Make sure make install also installs the static library. 
Fixes #662

[ROCm/rccl commit: 9bfc1c6e35]
2022-04-08 14:00:43 +02:00
nunnikri 21415407ac Fixing the broken library soft link (#529)
[ROCm/rccl commit: acfb0210ea]
2022-04-07 15:19:33 -07:00
Wenkai Du 5ccdd9f5e1 Increase chunk steps of broadcast and reduce (#528)
[ROCm/rccl commit: 15b572751e]
2022-04-07 13:34:04 -07:00
Colin Smith 3830310156 Doxygen fix for ncclRecv (#527)
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.

[ROCm/rccl commit: b2ffcf6d89]
2022-04-05 14:07:56 -07:00
Wenkai Du 9884e61367 Add tuning model (#523)
[ROCm/rccl commit: 5cc0a405c0]
2022-04-04 10:19:57 -07:00
Wenkai Du 3332cdff07 Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing

[ROCm/rccl commit: bbe780ca6c]
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey 74f8baa0f3 Merge remote-tracking branch 'origin/master'
[ROCm/rccl commit: 8133784b32]
2022-03-30 02:29:05 -07:00
Sylvain Jeaugey 27130280b2 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.


[ROCm/rccl commit: 353e8ba446]
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey a52e328ba4 Fix merging error
[ROCm/rccl commit: 2247152a8e]
2022-03-30 02:14:32 -07:00
Sylvain Jeaugey 3bc2e34df2 Merge branch 'master' into truncated_msg_warning
[ROCm/rccl commit: 2dfd83752c]
2022-03-30 10:58:05 +02:00
Ke Wen 92bdde35eb Display host name instead of numeric IP when referring to a peer
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"


[ROCm/rccl commit: 1382a87306]
2022-03-30 10:47:10 +02:00
Christopher Hesse f6d1c7261f Fix typo in net_ib.cc
[ROCm/rccl commit: b895abcdb8]
2022-03-30 10:45:01 +02:00
Felix Abecassis 54590464ca Remove unnecessary newline in plugin logging
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>

[ROCm/rccl commit: 1c7c014ceb]
2022-03-30 10:44:49 +02:00
Wenkai Du 828f3d11a0 Update tuning parameters (#518)
* Update tuning parameters

* Respect user algo and topo selections

[ROCm/rccl commit: 7cbbca4da1]
2022-03-29 08:15:37 -07:00
gilbertlee-amd 4c32c51772 Adding explicit request for coarse-grained host memory due to changes in HipHostMalloc (#517)
[ROCm/rccl commit: 2d558c9abc]
2022-03-25 13:05:07 -06:00
Liam Wrubleski 95c6476678 Packages for test and benchmark executables on all supported OSes using CPack. (#512)
[ROCm/rccl commit: a8f1e61f48]
2022-03-21 15:04:14 -06:00
Wenkai Du db1e628ba3 Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models

* Add missing file

* Models update

[ROCm/rccl commit: cd17cf6dce]
2022-03-21 10:54:40 -07:00
akolliasAMD 3493750b6b Added alltoallv test and optional args variable on collective args (#514)
* Added alltoallv test and optional args variable on collective args

[ROCm/rccl commit: 65ea3d80db]
2022-03-18 13:55:11 -04:00
John Bachan 7707479804 Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649


[ROCm/rccl commit: 44eb40da0e]
2022-03-15 10:27:59 -07:00
nunnikri a44ff0fad5 Merge pull request #511 from nunnikri/develop
File reorganization as per the new defined standard

[ROCm/rccl commit: a04da71647]
2022-03-10 08:39:29 -08:00
Nirmal Unnikrishnan e740088560 File reorganization with backward compatibility
Updated the header file location and export path


[ROCm/rccl commit: 115461cc04]
2022-03-10 01:28:41 +00:00
Nirmal Unnikrishnan 4a4c053a6a File reorganization as per the new defined standard
The header files will be in /opt/rocm-xxx/include/rccl
Libraries and cmake will be in /opt/rocm-xxx/lib folder.
Added wrappers for header files using rocm-cmake functions for backward compatibility.


[ROCm/rccl commit: 676a4737c1]
2022-03-08 17:32:02 +00:00
gilbertlee-amd 8f7ec04f37 Changing initialization method for UnitTests (#510)
[ROCm/rccl commit: 0687940b84]
2022-03-07 09:22:55 -07:00
Wenkai Du 133aed2dfb Force ring algorithm on single node (#509)
[ROCm/rccl commit: d6d6af710e]
2022-03-04 10:29:02 -08:00
gilbertlee-amd 211ff286be Adding NCCL_DEBUG=INFO for CI runs (#508)
[ROCm/rccl commit: b634b2f1c2]
2022-03-03 18:04:28 -07:00
gilbertlee-amd c802f53282 [UnitTests] Check process mask for custom tests (#507)
[ROCm/rccl commit: 699dc30f05]
2022-03-02 17:24:14 -07:00
akolliasAMD 2419a950fe Added Unit test for nccl send recv (#506)
Added Send Receive test that tests through all pairs

[ROCm/rccl commit: ff54e79799]
2022-03-02 15:50:16 -05:00
Sylvain Jeaugey f8886d8687 2.12.7-1
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.


[ROCm/rccl commit: 3c223c105a]
2022-03-02 20:48:56 +01:00
gilbertlee-amd a182076a0e Unit test refactor (#500)
Refactoring and consolidating single-process / multi-process unit testing

[ROCm/rccl commit: 29ad0f5fbe]
2022-02-25 08:59:07 -07:00
Ziyue Yang dfa9b9e958 Add Pivot AllToAll algorithm for Rome model (#503)
* add a2a pivot interface

* remove debug info

* address comments

* fix bug

* remove custom script

* address comments

* fix bug

[ROCm/rccl commit: b569c0a1db]
2022-02-20 21:09:47 -08:00
Wenkai Du 0f0388ba0b Allow additional options to be passed in through model's definition (#501)
[ROCm/rccl commit: 94e0dc8bfd]
2022-02-17 08:28:58 -08:00
Wenkai Du 5b697e40db Add another Rome model (#497)
[ROCm/rccl commit: 02096c9936]
2022-02-12 10:30:16 -08:00
Ke Wen 92d6888bdc Split IB parameter sanity check into two parts
First part on collective mismatch, second part on internal errors


[ROCm/rccl commit: fbfb6ac5d7]
2022-02-08 15:21:22 -08:00
gilbertlee-amd 9c3189589f [TransferBench] Fix for cases with subsets of configured numa nodes (#495)
[ROCm/rccl commit: f3c2cafd9d]
2022-02-07 12:16:19 -07:00
gilbertlee-amd b2deea27f5 TransferBench: Adding ability to reindex GPUs based on PCIe address (#494)
[ROCm/rccl commit: 84d5fce7dd]
2022-02-02 08:51:41 -07:00
Sylvain Jeaugey ed02fb8993 Fix ext-net/google-fastsocket build
[ROCm/rccl commit: 0144073673]
2022-01-24 07:19:48 -08:00
Sylvain Jeaugey 51df47f9b8 Revert "remove unused basePath"
This reverts commit d973ddac8b.


[ROCm/rccl commit: cc78e9fab8]
2022-01-21 12:30:34 +01:00
void-main d973ddac8b remove unused basePath
[ROCm/rccl commit: 445bc19657]
2022-01-21 12:12:26 +01:00
Wenkai Du 635c0bcc01 Generate proper b-tree with non-repeating channels (#493)
[ROCm/rccl commit: 400df49dbe]
2022-01-19 15:09:17 -08:00