Commit graph

98 commits

Author SHA1 Message Date
Wenkai Du 5e109ed400 Add bfloat16 support in RCCL
Preprocessor symbol RCCL_BFLOAT16 is used as a feature indicator
2019-11-18 13:45:53 -08:00
Wenkai Du 8995047830 Correct RTC frequencies for profiling purposes 2019-11-05 11:36:45 -08:00
Wenkai Du 669f1951a4 Check for fine grain support using memory allocation 2019-11-01 15:58:49 -07:00
Jeff Daily 5a502955c9 additional check for fine grain support in p2pCanConnect (#146) 2019-10-31 08:58:38 -07:00
Wenkai Du 296176a4fd Disable HDP flush for RDMA 2019-10-23 14:40:17 -07:00
Wenkai Du df74d12946 Revert collective chunk and slice steps to avoid drop in throughput 2019-10-18 12:54:00 -07:00
Gilbert Lee 37603ae6cb Reverting GenericOp bug workaround modifications to slice/chunk steps 2019-10-11 09:20:10 -07:00
Gilbert Lee 1392dd2997 Performing __threadfence_system() with only first thread 2019-10-11 09:16:19 -07:00
Gilbert Lee 8ae1bce3bb Fix for GenericOp device primitive bug 2019-10-10 22:39:45 -07:00
Wenkai Du 062c798c86 Merge pull request #136 from wenkaidu/tree
Enable tree kernels in build
2019-10-09 10:58:52 -07:00
Wenkai Du 76976c9e2e Enable tree kernels in build
Need to tune and specify NCCL_TREE_THRESHOLD to allow usage
2019-10-08 23:20:11 +00:00
Changpeng Fang eec319038e Tuning the inline and unroll to reduce the scratch usage
Summary:
 1. remove the noinline attribute for AllReduceThreeKernel;
 2. change AUTOUNROLL for tree functions to 1 or 2;
 Combining 1 and 2 will reduce the scratch usage from 1256 to 952
2019-10-08 14:02:25 -07:00
Wenkai Du 61ef1dcad5 Only generate kernels for sum and copy 2019-09-24 17:01:12 -07:00
Gilbert Lee 86ce0a93b5 RDMA HDP flush fix 2019-09-06 16:35:55 +00:00
Gilbert Lee 3e6b326a19 Revert "Set RDMA default to off state"
This reverts commit 0f16ad966a.
2019-09-05 18:16:53 +00:00
Wenkai Du 8c975353ed Allocate opCount in pinned host memory for P2P transport
To avoid remote P2P read access when checking remote GPU's opCount
2019-08-29 10:22:09 -07:00
Wenkai Du 0f16ad966a Set RDMA default to off state 2019-08-26 10:59:33 -07:00
Wenkai Du 6759660529 Merge pull request #125 from wenkaidu/fix_nvml_id
Assign unused nvmlDev to avoid random number
2019-08-19 09:08:13 -07:00
Wenkai Du 86efdfc3b5 Assign unused nvmlDev to avoid random number 2019-08-16 16:34:14 -07:00
Wenkai Du 7c38da0939 Merge remote-tracking branch 'remotes/nccl/master' into HEAD 2019-08-16 16:13:34 -07:00
Wenkai Du 1faededc03 Tune AUTOUNROLL for better performance
Also remove all unused UNROLL defines
2019-08-16 10:34:53 -07:00
Michael LIAO 9369f8d75d Fix build with hip-clang.
- Add necessary function attribute for HIP programming model.
- Explicitly include hsa headers.
2019-08-15 14:56:04 -04:00
Wenkai Du 2223cccf15 Tune LL threshold for VEGA
Also move abort check after SPINS_BEFORE_CHECK_ABORT, as in NCCL
2019-08-15 09:16:11 -07:00
Wenkai Du 4b77a16f3f Default to a minimum of 2 rings and improve LL loop 2019-08-14 14:12:56 -07:00
Wenkai Du 5782a8d857 Remove duplicate line 2019-08-14 13:22:43 -07:00
Wenkai Du f11c8f60cd RCCL 2.4 update 2019-08-14 10:42:35 -07:00
David Addison fad079a8ae Updated PR#196 to use a common hash function 2019-08-14 10:08:39 -07:00
David Addison 01d1836668 Merge branch 'shm' of git://github.com/lowintelligence/nccl into lowintelligence-shm 2019-08-14 09:45:45 -07:00
David Addison 7f2b337e70 Make use of SO_REUSEPORT conditional
Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
a NCCL bootstrap address via an environment variable.
2019-08-13 16:32:07 -07:00
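The conditional use of SO_REUSEPORT described in commit 7f2b337e70 can be illustrated with a preprocessor guard. This is a minimal sketch, not the code from the commit: the helper name `enable_port_reuse` is hypothetical, and setting SO_REUSEADDR alongside it is an assumption made for the example.

```c
#include <sys/socket.h>

/* Hypothetical helper (not NCCL's actual function).
 * Sets SO_REUSEADDR unconditionally, and SO_REUSEPORT only when the
 * platform headers define it (Linux 3.9 and later).
 * Returns 0 on success, -1 if any setsockopt call fails. */
int enable_port_reuse(int fd) {
  int opt = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) != 0)
    return -1;
#if defined(SO_REUSEPORT)
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)) != 0)
    return -1;
#endif
  return 0;
}
```

On systems whose headers lack SO_REUSEPORT, the guarded call is simply compiled out, so the same source builds against older releases.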
Ke Wen 4d579e51cc Fix NIC distances for 11+ NICs 2019-07-17 06:32:33 -07:00
Ke Wen 920ae57c14 Fix #224: prevent number of IB devices from going out of bound 2019-07-17 06:32:33 -07:00
Ke Wen c8c68fb5f7 Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization
2019-07-12 09:50:17 -07:00
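Sizing a buffer up to a multiple of 2MB, as commit c8c68fb5f7 describes, comes down to a round-up computation. The sketch below shows one way to write it; the macro and function names are hypothetical and the code is not taken from the commit.

```c
#include <stddef.h>

/* 2 MB granularity; the macro name is hypothetical. */
#define IPC_BUFFER_ALIGN ((size_t)2 << 20)

/* Round `bytes` up to the next multiple of 2 MB.
 * A size that is already a multiple is returned unchanged. */
size_t round_up_to_2mb(size_t bytes) {
  return (bytes + IPC_BUFFER_ALIGN - 1) / IPC_BUFFER_ALIGN * IPC_BUFFER_ALIGN;
}
```

For example, a request for 1 byte becomes 2097152 bytes, and a request for 2097153 bytes becomes 4194304 bytes.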
Hirochika Asai 0b192d2299 Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)
Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
2019-07-09 14:45:41 -07:00
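The difference between prefix matching and the exact matching requested by "=" in commit 0b192d2299 can be sketched as follows. This is an illustration only: `hca_matches` is a hypothetical function that handles a single device name, whereas parsing of the full NCCL_IB_HCA syntax is outside its scope.

```c
#include <string.h>

/* Hypothetical matcher (not NCCL's actual parser).
 * Returns 1 if device name `dev` is selected by `spec`, 0 otherwise.
 * A leading '=' in `spec` requests an exact match on the rest of the
 * string; without it, `spec` is treated as a prefix of `dev`. */
int hca_matches(const char *spec, const char *dev) {
  if (spec[0] == '=')
    return strcmp(spec + 1, dev) == 0;
  return strncmp(spec, dev, strlen(spec)) == 0;
}
```

With prefix matching, `mlx5_1` also selects `mlx5_10`; with `=mlx5_1`, only `mlx5_1` is selected.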
Ke Wen 8e04d80382 Merge branch 'master' into HEAD 2019-06-25 13:39:08 -07:00
Ke Wen 7c72dee660 2.4.8-1
Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly
2019-06-25 13:22:47 -07:00
Felix Abecassis 37e4f8729e Fix out-of-bounds read in ncclStrToCpuset (#233)
The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
2019-06-21 10:25:08 +02:00
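The class of bug fixed in commit 37e4f8729e, calling strlen(3) on a buffer that may lack a terminating null byte, can be avoided by bounding the search to the buffer's capacity. The sketch below shows one common approach; it is not necessarily the fix applied in the commit, and `bounded_len` is a hypothetical name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: length of the string in `buf`, never reading
 * past `cap` bytes. Returns `cap` if no null byte is found, which
 * tells the caller the buffer was not terminated. */
size_t bounded_len(const char *buf, size_t cap) {
  const char *end = memchr(buf, '\0', cap);
  return end ? (size_t)(end - buf) : cap;
}
```

A caller can compare the result with the capacity to detect an unterminated buffer before treating it as a C string.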
David Addison 0ceaec9cee NCCL 2.4.7-1
Performance tweaks for PowerPC builds only:
  Set default NCCL_MIN_NRINGS to 4
  Disable PCI-E NUMA distance detection
2019-05-10 13:52:16 -07:00
jakirkham 60a586ded9 Allow CUDA runtime library selection (#220)
Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.
2019-05-07 17:35:14 -07:00
Gustavo Alvarez 9db4b1d801 Add pkgconfig file (#190) 2019-04-08 09:16:54 -07:00
David Addison f40ce73e89 NCCL 2.4.6-1
Added detection of IBM/Power NVLink bridge device.
Add NUMA support to PCI distance calculations.
Added NCCL_IGNORE_CPU_AFFINITY env var.
Fix memory leaks; GithubIssue#180
Compiler warning fix; GithubIssue#178
Replace non-standard variable length arrays. GithubIssue#171
Fix Tree+Shared Memory crash. GithubPR#185
Fix LL cleanup hang during long running DL jobs.
Fix NCCL_RINGS environment variable handling.
Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
Improve bootstrap socket connection reliability at scale.
Fix hostname hashing issue. GithubIssue#187
Code cleanup to rename all non device files from *.cu to *.cc
2019-04-05 13:05:45 -07:00
Cao Zongyan 161763aab2 Fix shared memory collision in multi-communicator case.
The current SHM object name uses only pidHash and ranks as
identification, so names collide when a program runs with
multiple communicators. This change adds commId info to pidHash, so that
the pidHash values of different communicators within the same process
are distinct from each other.
2019-03-15 12:50:32 +08:00
Rong Ou 14e0cf644b Fix crash during shared memory creation (#185)
The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <rong.ou@gmail.com>
2019-03-04 11:42:47 -08:00
Sylvain Jeaugey 1450d42675 2.4.2-1
Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
2019-01-29 15:19:27 -08:00
Christian Sigg 4861e197fd Fix memory leak in bootstrapRoot() 2019-01-07 14:18:46 -08:00
Sylvain Jeaugey c244b51ae7 Replace CUDA_VERSION by CUDART_VERSION 2018-12-13 15:22:17 -08:00
Christian Sigg 3e6afef473 Qualify nullptr_t with std:: 2018-12-13 14:18:09 -08:00
Christian Sigg 346fc49514 Two temporary workarounds for cuda-clang issues. 2018-12-13 14:17:58 -08:00
Christian Sigg d08e9b5279 Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former. 2018-12-13 14:17:46 -08:00
Sylvain Jeaugey 469b69a5d0 Fix #163 : remove warnings 2018-12-11 09:19:16 -08:00
Sylvain Jeaugey 57368189e1 Remove error logging from a normal path
When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)
2018-12-04 14:47:41 -08:00