Wenkai Du
95b30d9762
topo_expl: fix build and add tuning support ( #539 )
...
[ROCm/rccl commit: 063da25563 ]
2022-04-26 15:40:07 -07:00
Wenkai Du
f610810d7b
Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
...
Sync up with NCCL 2.12.10
[ROCm/rccl commit: 379940dfac ]
2022-04-26 10:09:37 -07:00
Edgar Gabriel
673c695422
Merge pull request #530 from edgargabriel/topic/signal-intercept
...
Topic/signal intercept
[ROCm/rccl commit: 39e3002e19 ]
2022-04-25 10:44:26 -05:00
Edgar
1bfc5d06f8
add a signal handler and backtrace
...
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Add optional C++ function name demangling
[ROCm/rccl commit: 2bf6d254b6 ]
2022-04-25 10:48:17 -04:00
Wenkai Du
347ea354c2
Update tuning parameters
...
[ROCm/rccl commit: 83fd4f70e7 ]
2022-04-18 16:04:04 -07:00
Wenkai Du
67e7e6507e
Merge remote-tracking branch 'nccl/master' into develop
...
[ROCm/rccl commit: d28e1cb44f ]
2022-04-18 11:15:25 -07:00
Wenkai Du
58b2c1ec9c
Fix random segfault ( #537 )
...
[ROCm/rccl commit: fd2f1b3b88 ]
2022-04-15 14:32:11 -07:00
Wenkai Du
011447e4dc
Add new Rome model ( #536 )
...
[ROCm/rccl commit: 2151c79d14 ]
2022-04-13 11:45:40 -07:00
Wenkai Du
f8023f2e07
Add new Rome model ( #535 )
...
[ROCm/rccl commit: ba4c165bf3 ]
2022-04-12 13:27:32 -07:00
nunnikri
be9374aa34
Installing rccl.h wrapper to /opt/rocm-xxx/include path ( #532 )
...
* Fixing the broken library soft link
* Installing rccl.h wrapper to /opt/rocm-xxx/include path.
The missing wrapper was causing compilation errors with PyTorch; this commit fixes it.
[ROCm/rccl commit: b83efe9c5c ]
2022-04-09 07:55:39 -07:00
gilbertlee-amd
e61ff3ce37
Transfer bench single stream mode ( #531 )
...
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes
[ROCm/rccl commit: def6832287 ]
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey
e89ff21d35
Update Makefile to install static library.
...
Make sure make install also installs the static library.
Fixes #662
[ROCm/rccl commit: 9bfc1c6e35 ]
2022-04-08 14:00:43 +02:00
nunnikri
21415407ac
Fixing the broken library soft link ( #529 )
...
[ROCm/rccl commit: acfb0210ea ]
2022-04-07 15:19:33 -07:00
Wenkai Du
5ccdd9f5e1
Increase chunk steps of broadcast and reduce ( #528 )
...
[ROCm/rccl commit: 15b572751e ]
2022-04-07 13:34:04 -07:00
Colin Smith
3830310156
Doxygen fix for ncclRecv ( #527 )
...
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.
[ROCm/rccl commit: b2ffcf6d89 ]
2022-04-05 14:07:56 -07:00
Wenkai Du
9884e61367
Add tuning model ( #523 )
...
[ROCm/rccl commit: 5cc0a405c0 ]
2022-04-04 10:19:57 -07:00
Wenkai Du
3332cdff07
Support multiple tuning tables ( #522 )
...
* Support multiple tuning tables
* [UnitTests] Skip managed memory testing
[ROCm/rccl commit: bbe780ca6c ]
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey
74f8baa0f3
Merge remote-tracking branch 'origin/master'
...
[ROCm/rccl commit: 8133784b32 ]
2022-03-30 02:29:05 -07:00
Sylvain Jeaugey
27130280b2
2.12.10-1
...
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
[ROCm/rccl commit: 353e8ba446 ]
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey
a52e328ba4
Fix merging error
...
[ROCm/rccl commit: 2247152a8e ]
2022-03-30 02:14:32 -07:00
Sylvain Jeaugey
3bc2e34df2
Merge branch 'master' into truncated_msg_warning
...
[ROCm/rccl commit: 2dfd83752c ]
2022-03-30 10:58:05 +02:00
Ke Wen
92bdde35eb
Display host name instead of numeric IP when referring to a peer
...
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"
[ROCm/rccl commit: 1382a87306 ]
2022-03-30 10:47:10 +02:00
Christopher Hesse
f6d1c7261f
Fix typo in net_ib.cc
...
[ROCm/rccl commit: b895abcdb8 ]
2022-03-30 10:45:01 +02:00
Felix Abecassis
54590464ca
Remove unnecessary newline in plugin logging
...
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
[ROCm/rccl commit: 1c7c014ceb ]
2022-03-30 10:44:49 +02:00
Wenkai Du
828f3d11a0
Update tuning parameters ( #518 )
...
* Update tuning parameters
* Respect user algo and topo selections
[ROCm/rccl commit: 7cbbca4da1 ]
2022-03-29 08:15:37 -07:00
gilbertlee-amd
4c32c51772
Adding explicit request for coarse-grained host memory due to changes in hipHostMalloc ( #517 )
...
[ROCm/rccl commit: 2d558c9abc ]
2022-03-25 13:05:07 -06:00
Liam Wrubleski
95c6476678
Packages for test and benchmark executables on all supported OSes using CPack. ( #512 )
...
[ROCm/rccl commit: a8f1e61f48 ]
2022-03-21 15:04:14 -06:00
Wenkai Du
db1e628ba3
Update Rome model matching and add new models ( #516 )
...
* Update Rome model matching and add new models
* Add missing file
* Models update
[ROCm/rccl commit: cd17cf6dce ]
2022-03-21 10:54:40 -07:00
akolliasAMD
3493750b6b
Added alltoallv test and optional args variable on collective args ( #514 )
...
* Added alltoallv test and optional args variable on collective args
[ROCm/rccl commit: 65ea3d80db ]
2022-03-18 13:55:11 -04:00
John Bachan
7707479804
Add pthread_detach()'s for threads we never pthread_join().
...
Helps reduce diagnostic noise for ThreadSanitizer.
Fixes https://github.com/NVIDIA/nccl/issues/649
[ROCm/rccl commit: 44eb40da0e ]
2022-03-15 10:27:59 -07:00
nunnikri
a44ff0fad5
Merge pull request #511 from nunnikri/develop
...
File reorganization as per the new defined standard
[ROCm/rccl commit: a04da71647 ]
2022-03-10 08:39:29 -08:00
Nirmal Unnikrishnan
e740088560
File reorganization with backward compatibility
...
Updated the header file location and export path
[ROCm/rccl commit: 115461cc04 ]
2022-03-10 01:28:41 +00:00
Nirmal Unnikrishnan
4a4c053a6a
File reorganization as per the new defined standard
...
The header files will be in /opt/rocm-xxx/include/rccl.
Libraries and CMake files will be in the /opt/rocm-xxx/lib folder.
Added wrappers for header files using rocm-cmake functions for backward compatibility.
[ROCm/rccl commit: 676a4737c1 ]
2022-03-08 17:32:02 +00:00
gilbertlee-amd
8f7ec04f37
Changing initialization method for UnitTests ( #510 )
...
[ROCm/rccl commit: 0687940b84 ]
2022-03-07 09:22:55 -07:00
Wenkai Du
133aed2dfb
Force ring algorithm on single node ( #509 )
...
[ROCm/rccl commit: d6d6af710e ]
2022-03-04 10:29:02 -08:00
gilbertlee-amd
211ff286be
Adding NCCL_DEBUG=INFO for CI runs ( #508 )
...
[ROCm/rccl commit: b634b2f1c2 ]
2022-03-03 18:04:28 -07:00
gilbertlee-amd
c802f53282
[UnitTests] Check process mask for custom tests ( #507 )
...
[ROCm/rccl commit: 699dc30f05 ]
2022-03-02 17:24:14 -07:00
akolliasAMD
2419a950fe
Added Unit test for nccl send recv ( #506 )
...
Added Send Receive test that tests through all pairs
[ROCm/rccl commit: ff54e79799 ]
2022-03-02 15:50:16 -05:00
Sylvain Jeaugey
f8886d8687
2.12.7-1
...
Add network communication through another GPU connected with NVLink (PXN).
Add aggregation of messages coming from different local GPUs through PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.
[ROCm/rccl commit: 3c223c105a ]
2022-03-02 20:48:56 +01:00
gilbertlee-amd
a182076a0e
Unit test refactor ( #500 )
...
Refactoring and consolidating single-process / multi-process unit testing
[ROCm/rccl commit: 29ad0f5fbe ]
2022-02-25 08:59:07 -07:00
Ziyue Yang
dfa9b9e958
Add Pivot AllToAll algorithm for Rome model ( #503 )
...
* add a2a pivot interface
* remove debug info
* address comments
* fix bug
* remove custom script
* address comments
* fix bug
[ROCm/rccl commit: b569c0a1db ]
2022-02-20 21:09:47 -08:00
Wenkai Du
0f0388ba0b
Allow additional options to be passed in through model's definition ( #501 )
...
[ROCm/rccl commit: 94e0dc8bfd ]
2022-02-17 08:28:58 -08:00
Wenkai Du
5b697e40db
Add another Rome model ( #497 )
...
[ROCm/rccl commit: 02096c9936 ]
2022-02-12 10:30:16 -08:00
Ke Wen
92d6888bdc
Split IB parameter sanity check into two parts
...
First part on collective mismatch, second part on internal errors
[ROCm/rccl commit: fbfb6ac5d7 ]
2022-02-08 15:21:22 -08:00
gilbertlee-amd
9c3189589f
[TransferBench] Fix for cases with subsets of configured numa nodes ( #495 )
...
[ROCm/rccl commit: f3c2cafd9d ]
2022-02-07 12:16:19 -07:00
gilbertlee-amd
b2deea27f5
TransferBench: Adding ability to reindex GPUs based on PCIe address ( #494 )
...
[ROCm/rccl commit: 84d5fce7dd ]
2022-02-02 08:51:41 -07:00
Sylvain Jeaugey
ed02fb8993
Fix ext-net/google-fastsocket build
...
[ROCm/rccl commit: 0144073673 ]
2022-01-24 07:19:48 -08:00
Sylvain Jeaugey
51df47f9b8
Revert "remove unused basePath"
...
This reverts commit d973ddac8b.
[ROCm/rccl commit: cc78e9fab8 ]
2022-01-21 12:30:34 +01:00
void-main
d973ddac8b
remove unused basePath
...
[ROCm/rccl commit: 445bc19657 ]
2022-01-21 12:12:26 +01:00
Wenkai Du
635c0bcc01
Generate proper b-tree with non-repeating channels ( #493 )
...
[ROCm/rccl commit: 400df49dbe ]
2022-01-19 15:09:17 -08:00