Commit Graph

105 commits

Author SHA1 Message Date
Wenkai Du d1dae2721d Add ring bandwidth correction factor 2020-01-30 09:52:27 -08:00
Stanley Tsang 20fa04d9b6 Updating copyright notices for 2020. 2020-01-29 15:28:08 -08:00
Wenkai Du 486fd436af Split primitive class to smaller structures 2020-01-29 15:27:23 -08:00
Wenkai Du 1e55645d97 Misc fixes and improvements for 2.5.6
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Replace explicit channel duplication with implicit duplication by using half of the link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
Wenkai Du 6648c81dc6 Merge remote-tracking branch 'remotes/nccl/master' into rccl_2.5.6 2019-12-03 15:42:04 -08:00
Wenkai Du a0be2b8812 Disable direct buffers to reduce scratch memory size 2019-11-20 13:03:16 -08:00
Sylvain Jeaugey 299c554dcc 2.5.6-1 (#255)
Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP
2019-11-19 14:57:39 -08:00
Wenkai Du 5e109ed400 Add bfloat16 support in RCCL
Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator
2019-11-18 13:45:53 -08:00
Wenkai Du 8995047830 Correct RTC frequencies for profiling purpose 2019-11-05 11:36:45 -08:00
Wenkai Du 669f1951a4 Check for fine grain support using memory allocation 2019-11-01 15:58:49 -07:00
Jeff Daily 5a502955c9 additional check for fine grain support in p2pCanConnect (#146) 2019-10-31 08:58:38 -07:00
Wenkai Du 296176a4fd Disable HDP flush for RDMA 2019-10-23 14:40:17 -07:00
Wenkai Du df74d12946 Revert collective chunk and slice steps to avoid drop in throughput 2019-10-18 12:54:00 -07:00
Gilbert Lee 37603ae6cb Reverting GenericOp bug workaround modifications to slice/chunk steps 2019-10-11 09:20:10 -07:00
Gilbert Lee 1392dd2997 Performing __threadfence_system() with only first thread 2019-10-11 09:16:19 -07:00
Gilbert Lee 8ae1bce3bb Fix for GenericOp device primitive bug 2019-10-10 22:39:45 -07:00
Wenkai Du 062c798c86 Merge pull request #136 from wenkaidu/tree
Enable tree kernels in build
2019-10-09 10:58:52 -07:00
Wenkai Du 76976c9e2e Enable tree kernels in build
Need to tune and specify NCCL_TREE_THRESHOLD to allow usage
2019-10-08 23:20:11 +00:00
Changpeng Fang eec319038e Tuning the inline and unroll to reduce the scratch usage
Summary:
 1. remove the noinline attribute for AllReduceThreeKernel;
 2. change AUTOUNROLL for tree functions to 1 or 2;
 Combining 1 and 2 will reduce the scratch usage from 1256 to 952
2019-10-08 14:02:25 -07:00
Wenkai Du 61ef1dcad5 Only generate kernels for sum and copy 2019-09-24 17:01:12 -07:00
Gilbert Lee 86ce0a93b5 RDMA HDP flush fix 2019-09-06 16:35:55 +00:00
Gilbert Lee 3e6b326a19 Revert "Set RDMA default to off state"
This reverts commit 0f16ad966a.
2019-09-05 18:16:53 +00:00
Wenkai Du 8c975353ed Allocate opCount in pinned host memory for P2P transport
To avoid remote P2P read access when checking remote GPU's opCount
2019-08-29 10:22:09 -07:00
Wenkai Du 0f16ad966a Set RDMA default to off state 2019-08-26 10:59:33 -07:00
Wenkai Du 6759660529 Merge pull request #125 from wenkaidu/fix_nvml_id
Assign unused nvmlDev to avoid random number
2019-08-19 09:08:13 -07:00
Wenkai Du 86efdfc3b5 Assign unused nvmlDev to avoid random number 2019-08-16 16:34:14 -07:00
Wenkai Du 7c38da0939 Merge remote-tracking branch 'remotes/nccl/master' into HEAD 2019-08-16 16:13:34 -07:00
Wenkai Du 1faededc03 Tune AUTOUNROLL for better performance
Also remove all unused UNROLL defines
2019-08-16 10:34:53 -07:00
Michael LIAO 9369f8d75d Fix build with hip-clang.
- Add necessary function attribute for HIP programming model.
- Explicitly include hsa headers.
2019-08-15 14:56:04 -04:00
Wenkai Du 2223cccf15 Tune LL threshold for VEGA
Also move abort check after SPINS_BEFORE_CHECK_ABORT, as in NCCL
2019-08-15 09:16:11 -07:00
Wenkai Du 4b77a16f3f Default to minimal 2 rings and improve LL loop 2019-08-14 14:12:56 -07:00
Wenkai Du 5782a8d857 Remove duplicate line 2019-08-14 13:22:43 -07:00
Wenkai Du f11c8f60cd RCCL 2.4 update 2019-08-14 10:42:35 -07:00
David Addison fad079a8ae Updated PR#196 to use a common hash function 2019-08-14 10:08:39 -07:00
David Addison 01d1836668 Merge branch 'shm' of git://github.com/lowintelligence/nccl into lowintelligence-shm 2019-08-14 09:45:45 -07:00
David Addison 7f2b337e70 Make use of SO_REUSEPORT conditional
Fixes: #244

SO_REUSEPORT was introduced in Linux 3.9.
This change allows NCCL to compile against older releases.

The functionality is only required if the user is specifying
an NCCL bootstrap address via an environment variable.
2019-08-13 16:32:07 -07:00
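The conditional use of SO_REUSEPORT described in the commit above can be sketched as follows. This is a minimal illustration, not the NCCL source: the helper name open_reusable_socket is hypothetical, and it assumes a POSIX system. The option is applied only when the build headers define it, so the file still compiles on older systems.

```c
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper (not an NCCL function): open a TCP socket and
 * enable address reuse, plus port reuse when the headers provide it. */
static int open_reusable_socket(void) {
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) return -1;

  int opt = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &opt, sizeof(opt)) != 0) {
    close(fd);
    return -1;
  }
#if defined(SO_REUSEPORT)
  /* Only compiled in when SO_REUSEPORT exists at build time. */
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &opt, sizeof(opt)) != 0) {
    close(fd);
    return -1;
  }
#endif
  return fd;
}
```

The guard is a compile-time check only; a binary built with newer headers can still see the call fail at run time on an older kernel, which this sketch treats as an error.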
Ke Wen 4d579e51cc Fix NIC distances for 11+ NICs 2019-07-17 06:32:33 -07:00
Ke Wen 920ae57c14 Fix #224: prevent number of IB devices from going out of bound 2019-07-17 06:32:33 -07:00
Ke Wen c8c68fb5f7 Size up IPC buffers to multiples of 2MB
Avoid potential CUDA error in concurrent communicator initialization
2019-07-12 09:50:17 -07:00
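Sizing a buffer up to a multiple of 2 MB, as the commit above describes, amounts to a round-up computation. The sketch below shows one common way to write it; the names IPC_ALIGN_BYTES and round_up_ipc_size are hypothetical and are not taken from the NCCL source.

```c
#include <stddef.h>

/* 2 MB granularity; the macro name is an assumption for illustration. */
#define IPC_ALIGN_BYTES ((size_t)2 * 1024 * 1024)

/* Round size up to the next multiple of IPC_ALIGN_BYTES.
 * A size that is already a multiple is returned unchanged. */
static size_t round_up_ipc_size(size_t size) {
  return (size + IPC_ALIGN_BYTES - 1) / IPC_ALIGN_BYTES * IPC_ALIGN_BYTES;
}
```

For example, a request of 1 byte becomes 2097152 bytes, and a request of 2097153 bytes becomes 4194304 bytes.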
Hirochika Asai 0b192d2299 Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236)
Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.
2019-07-09 14:45:41 -07:00
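The difference between prefix matching and the "=" exact-match modifier from the commit above can be illustrated with a small matcher for a single entry. This is a sketch under assumptions: hca_matches is a hypothetical function, and it ignores any other syntax the NCCL_IB_HCA variable may accept, such as lists or port suffixes.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical matcher for one NCCL_IB_HCA entry against one device name.
 * A leading '=' requests an exact match; otherwise the entry is a prefix. */
static bool hca_matches(const char *spec, const char *dev) {
  if (spec[0] == '=') {
    return strcmp(spec + 1, dev) == 0;           /* exact match */
  }
  return strncmp(spec, dev, strlen(spec)) == 0;  /* prefix match */
}
```

With this sketch, the entry mlx5_1 matches both mlx5_1 and mlx5_10, while =mlx5_1 matches only mlx5_1.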
Ke Wen 8e04d80382 Merge branch 'master' into HEAD 2019-06-25 13:39:08 -07:00
Ke Wen 7c72dee660 2.4.8-1
Fix #209: improve socket transport performance
  Split transfers over multiple sockets
  Launch multiple threads to drive sockets
  Detect AWS NICs and set nsockets/nthreads accordingly
2019-06-25 13:22:47 -07:00
Felix Abecassis 37e4f8729e Fix out-of-bounds read in ncclStrToCpuset (#233)
The affinityStr string was not null-terminated but was passed to strlen(3).

Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>
2019-06-21 10:25:08 +02:00
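The class of bug fixed in the commit above, passing a buffer without a terminator to strlen(3), is usually avoided by copying with an explicit length bound and terminating the result. The sketch below is illustrative only: copy_terminated is a hypothetical helper and is not how ncclStrToCpuset is necessarily written.

```c
#include <stddef.h>

/* Hypothetical helper: copy up to src_len bytes from src, which may lack
 * a terminating NUL, into dst. The result is always NUL-terminated and
 * truncated to fit dst_size. Returns the length of the copied string. */
static size_t copy_terminated(char *dst, size_t dst_size,
                              const char *src, size_t src_len) {
  if (dst_size == 0) return 0;
  size_t n = 0;
  while (n < src_len && n + 1 < dst_size && src[n] != '\0') {
    dst[n] = src[n];
    n++;
  }
  dst[n] = '\0';
  return n;
}
```

After the copy, dst can be handed safely to functions such as strlen that rely on the terminator.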
David Addison 0ceaec9cee NCCL 2.4.7-1
Performance tweaks for PowerPC builds only;
      Set default NCCL_MIN_NRINGS to 4
      Disable PCI-E NUMA distance detection
2019-05-10 13:52:16 -07:00
jakirkham 60a586ded9 Allow CUDA runtime library selection (#220)
Makes a change to allow the user to select between the static CUDA
runtime library (default) and the dynamic CUDA runtime library. Does
this by allowing `CUDARTLIB` to be overridden.
2019-05-07 17:35:14 -07:00
Gustavo Alvarez 9db4b1d801 Add pkgconfig file (#190) 2019-04-08 09:16:54 -07:00
David Addison f40ce73e89 NCCL 2.4.6-1
Added detection of IBM/Power NVLink bridge device.
    Add NUMA support to PCI distance calculations.
    Added NCCL_IGNORE_CPU_AFFINITY env var.
    Fix memory leaks; GithubIssue#180
    Compiler warning fix; GithubIssue#178
    Replace non-standard variable length arrays. GithubIssue#171
    Fix Tree+Shared Memory crash. GithubPR#185
    Fix LL cleanup hang during long running DL jobs.
    Fix NCCL_RINGS environment variable handling.
    Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191
    Improve bootstrap socket connection reliability at scale.
    Fix hostname hashing issue. GithubIssue#187
    Code cleanup to rename all non device files from *.cu to *.cc
2019-04-05 13:05:45 -07:00
Cao Zongyan 161763aab2 Fix shared memory collision in multi-communicator case.
The current SHM object name uses only pidHash and ranks as
identification, so names collide when a program runs with
multiple communicators. Here we add commId info into pidHash, so that
the pidHash values of different communicators within the same process
are distinct from each other.
2019-03-15 12:50:32 +08:00
Rong Ou 14e0cf644b Fix crash during shared memory creation (#185)
The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <rong.ou@gmail.com>
2019-03-04 11:42:47 -08:00
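The naming problem described in the commit above can be sketched as follows: a name derived only from the destination rank is identical for every sender, whereas a name that includes both endpoints is distinct per direction. The function shm_name and the name format are assumptions for illustration and do not reproduce the actual NCCL naming scheme.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: build a shared memory object name from a process
 * hash and both endpoints, so that 0 -> 1 and 1 -> 0 get different names.
 * Returns the value of snprintf (the length the full name requires). */
static int shm_name(char *buf, size_t len, uint64_t pid_hash,
                    int send_rank, int recv_rank) {
  return snprintf(buf, len, "example-shm-%" PRIx64 "-%d-%d",
                  pid_hash, send_rank, recv_rank);
}
```

With a tree, ranks 0 and 1 may each send to the other; including the sender keeps the two segments separate.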
Sylvain Jeaugey 1450d42675 2.4.2-1
Add tree algorithms for allreduce to improve performance at scale.
Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle
network errors and permit recovery.
Detect initial CPU affinity and no longer escape it.
2019-01-29 15:19:27 -08:00