Commit Graph

798 Commits

Author SHA1 Message Date
Wenkai Du 063da25563 topo_expl: fix build and add tuning support (#539) 2022-04-26 15:40:07 -07:00
Wenkai Du 379940dfac Merge pull request #533 from ROCmSoftwarePlatform/2.12.10
Sync up with NCCL 2.12.10
2022-04-26 10:09:37 -07:00
Edgar Gabriel 39e3002e19 Merge pull request #530 from edgargabriel/topic/signal-intercept
Topic/signal intercept
2022-04-25 10:44:26 -05:00
Edgar 2bf6d254b6 add a signal handler and backtrace
Tweak the signal handler and force non-release build
Increase ulimit locked memory value
Update the signal handler to use bfd symbol resolution.
Include configure logic to find bfd functions.
Add optional C++ function name demangling
2022-04-25 10:48:17 -04:00
Wenkai Du 83fd4f70e7 Update tuning parameters 2022-04-18 16:04:04 -07:00
Wenkai Du d28e1cb44f Merge remote-tracking branch 'nccl/master' into develop 2022-04-18 11:15:25 -07:00
Wenkai Du fd2f1b3b88 Fix random segfault (#537) 2022-04-15 14:32:11 -07:00
Wenkai Du 2151c79d14 Add new Rome model (#536) 2022-04-13 11:45:40 -07:00
Wenkai Du ba4c165bf3 Add new Rome model (#535) 2022-04-12 13:27:32 -07:00
nunnikri b83efe9c5c Installing rccl.h wrapper to /opt/rocm-xxx/include path (#532)
* Fixing the broken library soft link

* Installing rccl.h wrapper to /opt/rocm-xxx/include path.

This missing wrapper was causing compilation errors with pytorch. Fixing it
2022-04-09 07:55:39 -07:00
gilbertlee-amd def6832287 Transfer bench single stream mode (#531)
- Adding single stream mode
- Removing some unused env vars
- Adding output to CSV mode for p2p benchmark, topology listing modes
2022-04-08 15:20:55 -06:00
Sylvain Jeaugey 9bfc1c6e35 Update Makefile to install static library.
Make sure make install also installs the static library. 
Fixes #662
2022-04-08 14:00:43 +02:00
nunnikri acfb0210ea Fixing the broken library soft link (#529) 2022-04-07 15:19:33 -07:00
Wenkai Du 15b572751e Increase chunk steps of broadcast and reduce (#528) 2022-04-07 13:34:04 -07:00
Colin Smith b2ffcf6d89 Doxygen fix for ncclRecv (#527)
Changed the Doxygen command for ncclRecv and pncclRecv, to be consistent with other APIs.
2022-04-05 14:07:56 -07:00
Wenkai Du 5cc0a405c0 Add tuning model (#523) 2022-04-04 10:19:57 -07:00
Wenkai Du bbe780ca6c Support multiple tuning tables (#522)
* Support multiple tuning tables

* [UnitTests] Skip managed memory testing
2022-03-31 17:09:21 -07:00
Sylvain Jeaugey 8133784b32 Merge remote-tracking branch 'origin/master' 2022-03-30 02:29:05 -07:00
Sylvain Jeaugey 353e8ba446 2.12.10-1
Fix bug with CollNet
Fix bug with zero-bytes send/recv operations
Fix NCCL_PARAM implementation to avoid taking a lock on every call
Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one.
Improve error reporting for network errors.
2022-03-30 02:27:01 -07:00
Sylvain Jeaugey 2247152a8e Fix merging error 2022-03-30 02:14:32 -07:00
Sylvain Jeaugey 2dfd83752c Merge branch 'master' into truncated_msg_warning 2022-03-30 10:58:05 +02:00
Ke Wen 1382a87306 Display host name instead of numeric IP when referring to a peer
For easier interpretation of debug messages like "connection closed by
peer", "peer message truncated" and "peer collective mismatch"
2022-03-30 10:47:10 +02:00
Christopher Hesse b895abcdb8 Fix typo in net_ib.cc 2022-03-30 10:45:01 +02:00
Felix Abecassis 1c7c014ceb Remove unnecessary newline in plugin logging
Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
2022-03-30 10:44:49 +02:00
Wenkai Du 7cbbca4da1 Update tuning parameters (#518)
* Update tuning parameters

* Respect user algo and topo selections
2022-03-29 08:15:37 -07:00
gilbertlee-amd 2d558c9abc Adding explicit request for coarse-grained host memory due to changes in HipHostMalloc (#517) 2022-03-25 13:05:07 -06:00
Liam Wrubleski a8f1e61f48 Packages for test and benchmark executables on all supported OSes using CPack. (#512) 2022-03-21 15:04:14 -06:00
Wenkai Du cd17cf6dce Update Rome model matching and add new models (#516)
* Update Rome model matching and add new models

* Add missing file

* Models update
2022-03-21 10:54:40 -07:00
akolliasAMD 65ea3d80db Added alltoallv test and optional args variable on collective args (#514)
* Added alltoallv test and optional args variable on collective args
2022-03-18 13:55:11 -04:00
John Bachan 44eb40da0e Add pthread_detach()'s for threads we never pthread_join(). Helps
reduce diagnostic noise for ThreadSanitizer.

Fixes https://github.com/NVIDIA/nccl/issues/649
2022-03-15 10:27:59 -07:00
nunnikri a04da71647 Merge pull request #511 from nunnikri/develop
File reorganization as per the new defined standard
2022-03-10 08:39:29 -08:00
Nirmal Unnikrishnan 115461cc04 File reorganization with backward compatibility
Updated the header file location and export path
2022-03-10 01:28:41 +00:00
Nirmal Unnikrishnan 676a4737c1 File reorganization as per the new defined standard
The header files will be in /opt/rocm-xxx/include/rccl
Libraries and cmake will be in /opt/rocm-xxx/lib folder.
Added wrappers for header files using rocm-cmake functions for backward compatibility.
2022-03-08 17:32:02 +00:00
gilbertlee-amd 0687940b84 Changing initialization method for UnitTests (#510) 2022-03-07 09:22:55 -07:00
Wenkai Du d6d6af710e Force ring algorithm on single node (#509) 2022-03-04 10:29:02 -08:00
gilbertlee-amd b634b2f1c2 Adding NCCL_DEBUG=INFO for CI runs (#508) 2022-03-03 18:04:28 -07:00
gilbertlee-amd 699dc30f05 [UnitTests] Check process mask for custom tests (#507) 2022-03-02 17:24:14 -07:00
akolliasAMD ff54e79799 Added Unit test for nccl send recv (#506)
Added Send Receive test that tests through all pairs
2022-03-02 15:50:16 -05:00
Sylvain Jeaugey 3c223c105a 2.12.7-1
Add network communication through another GPU connected with NVLink
(PXN).
Add aggregation of messages coming from different local GPUs through
PXN and going to the same destination.
Add new v5 plugin API with grouped receives and tags.
Add compat for v4 plugins.
Add naming of NCCL threads to help debugging.
Fix NVLink detection and avoid data corruption when some NVLinks are
down.
Add support for Relaxed Ordering for IB.
Add profiling and timing infrastructure.
2022-03-02 20:48:56 +01:00
gilbertlee-amd 29ad0f5fbe Unit test refactor (#500)
Refactoring and consolidating single-process / multi-process unit testing
2022-02-25 08:59:07 -07:00
Ziyue Yang b569c0a1db Add Pivot AllToAll algorithm for Rome model (#503)
* add a2a pivot interface

* remove debug info

* address comments

* fix bug

* remove custom script

* address comments

* fix bug
2022-02-20 21:09:47 -08:00
Wenkai Du 94e0dc8bfd Allow additional options to be passed in through model's definition (#501) 2022-02-17 08:28:58 -08:00
Wenkai Du 02096c9936 Add another Rome model (#497) 2022-02-12 10:30:16 -08:00
Ke Wen fbfb6ac5d7 Split IB parameter sanity check into two parts
First part on collective mismatch, second part on internal errors
2022-02-08 15:21:22 -08:00
gilbertlee-amd f3c2cafd9d [TransferBench] Fix for cases with subsets of configured numa nodes (#495) 2022-02-07 12:16:19 -07:00
gilbertlee-amd 84d5fce7dd TransferBench: Adding ability to reindex GPUs based on PCIe address (#494) 2022-02-02 08:51:41 -07:00
Sylvain Jeaugey 0144073673 Fix ext-net/google-fastsocket build 2022-01-24 07:19:48 -08:00
Sylvain Jeaugey cc78e9fab8 Revert "remove unused basePath"
This reverts commit 445bc19657.
2022-01-21 12:30:34 +01:00
void-main 445bc19657 remove unused basePath 2022-01-21 12:12:26 +01:00
Wenkai Du 400df49dbe Generate proper b-tree with non-repeating channels (#493) 2022-01-19 15:09:17 -08:00