Sylvain Jeaugey
7fa81a40a1
Fix #163: remove warnings
...
[ROCm/rccl commit: 469b69a5d0 ]
2018-12-11 09:19:16 -08:00
Ke Wen
e74cca3317
Fix dummy plugin
...
[ROCm/rccl commit: 8606cdb8b2 ]
2018-12-05 17:25:23 -08:00
Sylvain Jeaugey
e2f73fea64
Remove error logging from a normal path
...
When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)
[ROCm/rccl commit: 57368189e1 ]
2018-12-04 14:47:41 -08:00
Sylvain Jeaugey
49819bcb7c
Fix GPU Direct RDMA detection.
...
Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.
[ROCm/rccl commit: 4b39a4cf91 ]
2018-12-04 14:42:28 -08:00
Sylvain Jeaugey
274fd3756a
Add NCCL_NET flag to many debug lines.
...
[ROCm/rccl commit: b8a9a32ccb ]
2018-12-04 13:10:19 -08:00
Sylvain Jeaugey
245786e9a5
Improve INFO message when external network is not found.
...
Fix #162
[ROCm/rccl commit: cdae05b277 ]
2018-12-04 12:10:58 -08:00
David Addison
f842c81503
Fixed some compilation errors when TRACE=1 set
...
[ROCm/rccl commit: 5fe2618c0e ]
2018-11-29 14:12:14 -08:00
Sylvain Jeaugey
5dffba1287
Rework shared memory code to use SYSCHECK macros.
...
This is to handle EINTR/EAGAIN properly (issue #137), and also
make the code consistent with the rest.
Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
[ROCm/rccl commit: eed8218e17 ]
2018-11-29 12:52:13 -08:00
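The shared memory commit above notes that posix_fallocate and mmap do not follow the usual "return -1 and set errno" pattern. A minimal sketch of what such wrappers might look like follows. The names `fallocate_errno` and `mmap_errno` are hypothetical and are not taken from the NCCL source. It relies on two documented behaviours: posix_fallocate returns the error number directly, and mmap reports failure with MAP_FAILED.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Hypothetical wrapper (not NCCL's code): posix_fallocate returns the
 * error number itself and leaves errno untouched, so translate it to
 * the classic -1/errno convention. */
static int fallocate_errno(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 0;
}

/* Hypothetical wrapper (not NCCL's code): mmap signals failure with
 * MAP_FAILED rather than -1. Return -1 on failure (errno is already set
 * by mmap) and hand the mapping back through *out on success. */
static int mmap_errno(void **out, size_t len, int prot, int flags,
                      int fd, off_t offset) {
  void *p = mmap(NULL, len, prot, flags, fd, offset);
  if (p == MAP_FAILED) {
    *out = NULL;
    return -1;
  }
  *out = p;
  return 0;
}
```

With both calls reduced to the same convention, a single errno-checking macro can wrap them like any other system call.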
Sylvain Jeaugey
57bff87e1d
Rework SYSCHECK macros to better handle retries.
...
SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.
Also rework the socket connection code and improve error reporting.
[ROCm/rccl commit: 302d538b73 ]
2018-11-29 12:52:13 -08:00
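The SYSCHECK commit above concerns retrying calls that fail with EINTR or EAGAIN instead of silently missing them. The sketch below illustrates that general pattern only. `RETRY_SYSCALL` and `write_all` are hypothetical names, and the real NCCL macros may differ in both form and behaviour.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical macro (not NCCL's SYSCHECK): evaluate `call` again for as
 * long as it fails with EINTR or EAGAIN, and store the final result in
 * `retval`. Any other failure is left for the caller to report. */
#define RETRY_SYSCALL(retval, call)                               \
  do {                                                            \
    (retval) = (call);                                            \
  } while ((retval) == -1 && (errno == EINTR || errno == EAGAIN))

/* Hypothetical helper: write the whole buffer, retrying interrupted
 * writes and continuing after short writes. Returns 0 on success and
 * -1 on a non-retryable error. */
static int write_all(int fd, const void *buf, size_t len) {
  const char *p = (const char *)buf;
  size_t done = 0;
  while (done < len) {
    ssize_t n;
    RETRY_SYSCALL(n, write(fd, p + done, len - done));
    if (n == -1) {
      fprintf(stderr, "write failed: %s\n", strerror(errno));
      return -1;
    }
    done += (size_t)n;
  }
  return 0;
}
```

Because the retry lives inside the macro, a call site that is not already in a loop still gets retried. Note that retrying on EAGAIN spins on a non-blocking descriptor, so a production version would likely poll or back off.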
Sylvain Jeaugey
f0e160728f
Improve net API description
...
[ROCm/rccl commit: 61b50a63ef ]
2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
ec1247221e
Make network isend/irecv non blocking
...
[ROCm/rccl commit: 98adf2fe11 ]
2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
378a103248
Add support for external network.
...
Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network
[ROCm/rccl commit: 0d3a20f96d ]
2018-11-26 16:24:31 -08:00
Alex Sergeev
b458b3b337
Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) (#156)
...
[ROCm/rccl commit: d7a58cfa58 ]
2018-11-19 17:39:44 -08:00
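The host-hash commit above derives an identifier from the UTS and mount namespace links. A minimal sketch of the idea follows, assuming a Linux /proc filesystem. The function names are hypothetical, and FNV-1a is used purely as a stand-in, since the commit title does not say which hash function NCCL applies.

```c
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* 64-bit FNV-1a, chosen here only for illustration (an assumption, not
 * necessarily the hash used by NCCL). */
static uint64_t fnv1a64(const char *data, size_t len) {
  uint64_t h = 0xcbf29ce484222325ULL; /* FNV offset basis */
  for (size_t i = 0; i < len; i++) {
    h ^= (unsigned char)data[i];
    h *= 0x100000001b3ULL; /* FNV prime */
  }
  return h;
}

/* Hypothetical helper: concatenate the targets of the two namespace
 * symlinks named in the commit title and hash the result. A link that
 * cannot be read is skipped. */
static uint64_t host_hash(void) {
  static const char *paths[] = {"/proc/self/ns/uts", "/proc/self/ns/mnt"};
  char buf[512];
  size_t len = 0;
  for (size_t i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
    /* readlink does not NUL-terminate, so track the length explicitly. */
    ssize_t n = readlink(paths[i], buf + len, sizeof(buf) - len);
    if (n > 0) len += (size_t)n;
  }
  return fnv1a64(buf, len);
}
```

Processes that share both namespaces read the same link targets and so get the same value, which is the property the P2P and SHM paths need when deciding whether two ranks are on the same host.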
Sylvain Jeaugey
726c60336d
Generate nccl.h in build instead of src
...
Generating nccl.h in src makes source directories dirty after builds.
[ROCm/rccl commit: 3c6e25210b ]
2018-11-09 14:00:41 -08:00
Ke Wen
ec81b3ad0e
Add official builds download link
...
[ROCm/rccl commit: 21d9a877be ]
2018-11-08 11:22:28 -08:00
Sylvain Jeaugey
287e36a92d
Add instructions to install packaging toolchain
...
Address #143 and #150: debuild not installed.
[ROCm/rccl commit: f7d31919d7 ]
2018-11-05 11:42:33 -08:00
Sylvain Jeaugey
1cc88accd5
Add install target
...
Fix issue #145
[ROCm/rccl commit: bed43524cc ]
2018-11-05 09:53:59 -08:00
David Addison
8c945929af
2.3.7-1
...
Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.
[ROCm/rccl commit: b56650c7f5 ]
2018-10-24 14:44:59 -07:00
Obihörnchen
8507b66cfd
Fix nccl-tests all_reduce_perf path
...
It's `all_reduce_perf` not `allreduce_perf`
[ROCm/rccl commit: 3202d6b393 ]
2018-10-14 00:53:17 -07:00
Sylvain Jeaugey
8ffcfac437
2.3.5-5
...
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests.
[ROCm/rccl commit: f93fe9bfd9 ]
2018-09-25 14:12:01 -07:00
sclarkson
42037d978f
fix tests on maxwell
...
[ROCm/rccl commit: 680a35c6b7 ]
2017-11-11 19:22:06 -08:00
Sylvain Jeaugey
5047915874
Update README to link to NCCL2
...
[ROCm/rccl commit: 03d856977e ]
2017-08-04 09:44:37 -07:00
Sylvain Jeaugey
eadeff7e4c
Update README to link to NCCL2 part 3
...
[ROCm/rccl commit: 4a33f66e27 ]
2017-08-04 09:44:09 -07:00
Sylvain Jeaugey
39f8f3a3a9
Update README to link to NCCL2 #2
...
[ROCm/rccl commit: d66fb63679 ]
2017-08-04 09:43:29 -07:00
Sylvain Jeaugey
5ef74d8b80
Update README to link to NCCL2
...
[ROCm/rccl commit: 80ae43b443 ]
2017-08-04 09:42:25 -07:00
Sylvain Jeaugey
338f6eaae1
Add support for CUDA9 half semantics
...
[ROCm/rccl commit: 29a1a916dc ]
2017-06-14 11:20:24 -07:00
Sylvain Jeaugey
498770f57b
Merge pull request #78 from ilya-biryukov/master
...
Fix compilation error when compiling with 'clang -x cuda'.
[ROCm/rccl commit: ccfc4567dc ]
2017-04-04 09:47:52 -07:00
Boris Fomitchev
12e9519ed0
Added Pascal nvcc flags, bumped version
...
[ROCm/rccl commit: 649f04d077 ]
2017-03-24 11:58:14 -07:00
Ilya Biryukov
fa9d609d5d
Fix compilation error when compiling with 'clang -x cuda'.
...
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
[ROCm/rccl commit: 8241cd7b6e ]
2017-03-16 12:01:11 +01:00
Sylvain Jeaugey
4644095088
Bumping version to 1.3.3
...
[ROCm/rccl commit: 7fef264bfa ]
2017-03-01 16:44:27 -08:00
Nathan Luehr
72ab3de6f6
Only enable peer access for ring neighbors.
...
This enables support for systems with more than 9 GPUs attached to a single PCIe root complex.
[ROCm/rccl commit: 8996811936 ]
2017-03-01 16:42:38 -08:00
Sylvain Jeaugey
f93f7b5d52
Fix copy/paste typo in error message
...
[ROCm/rccl commit: c219a183d0 ]
2017-03-01 16:42:38 -08:00
Sylvain Jeaugey
d2a9c5d52e
Fix crash in Reduce when non-root ranks have invalid recvbuff
...
[ROCm/rccl commit: 8e1d6f9b60 ]
2017-03-01 16:42:38 -08:00
Chad Whipkey
030edd0507
Qualify nullptr_t with std::.
...
[ROCm/rccl commit: 5eab428294 ]
2017-02-08 07:06:31 -08:00
Sylvain Jeaugey
851663654d
Fix 1.3.2 compilation
...
[ROCm/rccl commit: 2a974f5ca2 ]
2016-12-08 09:11:43 -08:00
Sylvain Jeaugey
56e1a845e9
Adding missing file
...
[ROCm/rccl commit: 648e9fbb58 ]
2016-12-05 18:06:24 -08:00
Sylvain Jeaugey
db3103d96a
1.3.2 release
...
Broadcast tuning
Better checking of inputs
Copy/reduce code simplification
[ROCm/rccl commit: 34d27771c6 ]
2016-12-01 15:17:50 -08:00
Sylvain Jeaugey
9ec80e0873
Replace min BW by average BW in tests
...
[ROCm/rccl commit: 1093821c33 ]
2016-12-01 15:16:35 -08:00
Peter Jin
95860f0721
Add a static library target "staticlib" to the Makefile.
...
Rename the static library "libnccl_static.a" to disambiguate from the
dynamic libraries.
[ROCm/rccl commit: 5765d608cc ]
2016-11-24 11:31:03 -08:00
Kyle Fernandes, ne Jacobs
11e4da5bc0
Remove irrelevant output from ncclReduce Fortran tests
...
[ROCm/rccl commit: c2c515516b ]
2016-11-21 10:18:04 -08:00
Kyle Fernandes, ne Jacobs
daffd4e0ee
Add Copyright header to Fortran bindings source files
...
[ROCm/rccl commit: 9c18468fe2 ]
2016-11-21 10:17:58 -08:00
Kyle Fernandes, ne Jacobs
a35c827aa6
Add Fortran bindings
...
[ROCm/rccl commit: 5f2b32e45b ]
2016-11-17 15:33:34 -08:00
Sylvain Jeaugey
5d496d605d
Bump to 1.3.1
...
[ROCm/rccl commit: 534b9a1697 ]
2016-10-13 10:33:05 -07:00
Sylvain Jeaugey
ea6feab31f
Fix primitives function prototype
...
[ROCm/rccl commit: b2781d0501 ]
2016-10-13 10:32:42 -07:00
Sylvain Jeaugey
ad50fded63
NVML (libwrap) : import the needed definitions
...
[ROCm/rccl commit: bf7d1514f7 ]
2016-10-13 10:28:59 -07:00
Sylvain Jeaugey
827119c43e
Improved allreduce segmentation for small sizes
...
[ROCm/rccl commit: 8bb06c94be ]
2016-10-07 12:42:23 -07:00
Sylvain Jeaugey
d4c7d8014a
Add scan tests
...
[ROCm/rccl commit: ca330b110a ]
2016-09-22 11:58:33 -07:00
Sylvain Jeaugey
5dd42ff223
Make tests check for deltas and report bandwidth
...
[ROCm/rccl commit: 6c77476cc1 ]
2016-09-22 11:58:28 -07:00
Sylvain Jeaugey
b540c2b5cc
Heavy code refactoring to remove a lot of code in collectives (~1000 lines).
...
Have all collectives use the same args, the same ring, and the same primitives for synchronization between threads with the same pattern.
[ROCm/rccl commit: cabd6848e4 ]
2016-09-22 11:57:56 -07:00
Sylvain Jeaugey
93f5211c22
Add profiling API
...
[ROCm/rccl commit: e3dbc6110e ]
2016-09-22 11:56:51 -07:00