Wenkai Du
063da25563
topo_expl: fix build and add tuning support ( #539 )
2022-04-26 15:40:07 -07:00
Wenkai Du
379940dfac
Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
...
Sync up with NCCL 2.12.10
2022-04-26 10:09:37 -07:00
Edgar Gabriel
39e3002e19
Merge pull request #530 from edgargabriel/topic/signal-intercept
...
Topic/signal intercept
2022-04-25 10:44:26 -05:00
Edgar
2bf6d254b6
add a signal handler and backtrace
...
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Optionally add C++ function name demangling
2022-04-25 10:48:17 -04:00
Wenkai Du
83fd4f70e7
Update tuning parameters
2022-04-18 16:04:04 -07:00
Wenkai Du
d28e1cb44f
Merge remote-tracking branch 'nccl/master' into develop
2022-04-18 11:15:25 -07:00
Wenkai Du
fd2f1b3b88
Fix random segfault ( #537 )
2022-04-15 14:32:11 -07:00
Wenkai Du
2151c79d14
Add new Rome model ( #536 )
2022-04-13 11:45:40 -07:00
Wenkai Du
ba4c165bf3
Add new Rome model ( #535 )
2022-04-12 13:27:32 -07:00
nunnikri
b83efe9c5c
Installing rccl.h wrapper to /opt/rocm-xxx/include path ( #532 )
...
* Fixing the broken library soft link
* Installing rccl.h wrapper to /opt/rocm-xxx/include path.
This missing wrapper was causing compilation errors with PyTorch; this change fixes it.
2022-04-09 07:55:39 -07:00
gilbertlee-amd
def6832287
Transfer bench single stream mode ( #531 )
...
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey
9bfc1c6e35
Update Makefile to install static library.
...
Make sure make install also installs the static library.
Fixes #662
2022-04-08 14:00:43 +02:00
nunnikri
acfb0210ea
Fixing the broken library soft link ( #529 )
2022-04-07 15:19:33 -07:00
Wenkai Du
15b572751e
Increase chunk steps of broadcast and reduce ( #528 )
2022-04-07 13:34:04 -07:00
Colin Smith
b2ffcf6d89
Doxygen fix for ncclRecv ( #527 )
...
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.
2022-04-05 14:07:56 -07:00
Wenkai Du
5cc0a405c0
Add tuning model ( #523 )
2022-04-04 10:19:57 -07:00
Wenkai Du
bbe780ca6c
Support multiple tuning tables ( #522 )
...
* Support multiple tuning tables
* [UnitTests] Skip managed memory testing
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey
8133784b32
Merge remote-tracking branch 'origin/master'
2022-03-30 02:29:05 -07:00
Sylvain Jeaugey
353e8ba446
2.12.10-1
...
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey
2247152a8e
Fix merging error
2022-03-30 02:14:32 -07:00
Sylvain Jeaugey
2dfd83752c
Merge branch 'master' into truncated_msg_warning
2022-03-30 10:58:05 +02:00
Ke Wen
1382a87306
Display host name instead of numeric IP when referring to a peer
...
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"
2022-03-30 10:47:10 +02:00
Christopher Hesse
b895abcdb8
Fix typo in net_ib.cc
2022-03-30 10:45:01 +02:00
Felix Abecassis
1c7c014ceb
Remove unnecessary newline in plugin logging
...
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
2022-03-30 10:44:49 +02:00
Wenkai Du
7cbbca4da1
Update tuning parameters ( #518 )
...
* Update tuning parameters
* Respect user algo and topo selections
2022-03-29 08:15:37 -07:00
gilbertlee-amd
2d558c9abc
Adding explicit request for coarse-grained host memory due to changes in HipHostMalloc ( #517 )
2022-03-25 13:05:07 -06:00
Liam Wrubleski
a8f1e61f48
Packages for test and benchmark executables on all supported OSes using CPack. ( #512 )
2022-03-21 15:04:14 -06:00
Wenkai Du
cd17cf6dce
Update Rome model matching and add new models ( #516 )
...
* Update Rome model matching and add new models
* Add missing file
* Models update
2022-03-21 10:54:40 -07:00
akolliasAMD
65ea3d80db
Added alltoallv test and optional args variable on collective args ( #514 )
...
* Added alltoallv test and optional args variable on collective args
2022-03-18 13:55:11 -04:00
John Bachan
44eb40da0e
Add pthread_detach()'s for threads we never pthread_join(). Helps
...
reduce diagnostic noise for ThreadSanitizer.
Fixes https://github.com/NVIDIA/nccl/issues/649
2022-03-15 10:27:59 -07:00
nunnikri
a04da71647
Merge pull request #511 from nunnikri/develop
...
File reorganization as per the new defined standard
2022-03-10 08:39:29 -08:00
Nirmal Unnikrishnan
115461cc04
File reorganization with backward compatibility
...
Updated the header file location and export path
2022-03-10 01:28:41 +00:00
Nirmal Unnikrishnan
676a4737c1
File reorganization as per the new defined standard
...
The header files will be in /opt/rocm-xxx/include/rccl
Libraries and cmake will be in /opt/rocm-xxx/lib folder.
Added wrappers for header files using rocm-cmake functions for backward compatibility.
2022-03-08 17:32:02 +00:00
gilbertlee-amd
0687940b84
Changing initialization method for UnitTests ( #510 )
2022-03-07 09:22:55 -07:00
Wenkai Du
d6d6af710e
Force ring algorithm on single node ( #509 )
2022-03-04 10:29:02 -08:00
gilbertlee-amd
b634b2f1c2
Adding NCCL_DEBUG=INFO for CI runs ( #508 )
2022-03-03 18:04:28 -07:00
gilbertlee-amd
699dc30f05
[UnitTests] Check process mask for custom tests ( #507 )
2022-03-02 17:24:14 -07:00
akolliasAMD
ff54e79799
Added Unit test for nccl send recv ( #506 )
...
Added Send Receive test that tests through all pairs
2022-03-02 15:50:16 -05:00
Sylvain Jeaugey
3c223c105a
2.12.7-1
...
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.
2022-03-02 20:48:56 +01:00
gilbertlee-amd
29ad0f5fbe
Unit test refactor ( #500 )
...
Refactoring and consolidating single-process / multi-process unit testing
2022-02-25 08:59:07 -07:00
Ziyue Yang
b569c0a1db
Add Pivot AllToAll algorithm for Rome model ( #503 )
...
* add a2a pivot interface
* remove debug info
* address comments
* fix bug
* remove custom script
* address comments
* fix bug
2022-02-20 21:09:47 -08:00
Wenkai Du
94e0dc8bfd
Allow additional options to be passed in through model's definition ( #501 )
2022-02-17 08:28:58 -08:00
Wenkai Du
02096c9936
Add another Rome model ( #497 )
2022-02-12 10:30:16 -08:00
Ke Wen
fbfb6ac5d7
Split IB parameter sanity check into two parts
...
First part on collective mismatch, second part on internal errors
2022-02-08 15:21:22 -08:00
gilbertlee-amd
f3c2cafd9d
[TransferBench] Fix for cases with subsets of configured numa nodes ( #495 )
2022-02-07 12:16:19 -07:00
gilbertlee-amd
84d5fce7dd
TransferBench: Adding ability to reindex GPUs based on PCIe address ( #494 )
2022-02-02 08:51:41 -07:00
Sylvain Jeaugey
0144073673
Fix ext-net/google-fastsocket build
2022-01-24 07:19:48 -08:00
Sylvain Jeaugey
cc78e9fab8
Revert "remove unused basePath"
...
This reverts commit 445bc19657.
2022-01-21 12:30:34 +01:00
void-main
445bc19657
remove unused basePath
2022-01-21 12:12:26 +01:00
Wenkai Du
400df49dbe
Generate proper b-tree with non-repeating channels ( #493 )
2022-01-19 15:09:17 -08:00