saadrahim
af09015f8d
Updating readme for 2.5 release ( #67 )
...
[ROCm/rccl commit: 42c3e4b93d ]
2019-05-22 15:31:12 -06:00
gilbertlee-amd
336883ef2b
Test combined calls ( #64 )
...
* Adding test for queueing multiple different collectives, 1 device per thread
[ROCm/rccl commit: ffe2054ed2 ]
2019-05-22 15:30:37 -06:00
Aaron Enye Shi
6201fd9645
Update README to note install rocm-cmake ( #68 )
...
[ROCm/rccl commit: 6e8f40eb22 ]
2019-05-22 15:29:59 -06:00
gilbertlee-amd
08a65f2201
Adding fix for insufficient devices / better logging for skipped tests ( #63 )
...
[ROCm/rccl commit: a115f577dd ]
2019-05-21 14:34:20 -06:00
Stanley Tsang
7c60e997e0
Renaming jenkinsfile
...
[ROCm/rccl commit: afa945d6e6 ]
2019-05-21 15:54:41 +00:00
Wenkai Du
b815e21d58
Remove extra compiler path setting
...
[ROCm/rccl commit: 4bfa506a6b ]
2019-05-21 00:08:42 +00:00
Wenkai Du
d42406be17
By default will not build test program
...
[ROCm/rccl commit: e517dbed5c ]
2019-05-20 18:37:58 +00:00
gilbertlee-amd
2215ef431d
Merge pull request #62 from gilbertlee-amd/AlignmentTests
...
Adding support for alignment tests via sub-datasets
[ROCm/rccl commit: c57ab960ff ]
2019-05-18 10:54:52 -06:00
Gilbert Lee
57ac9a8a93
Adding support for alignment tests via sub-datasets
...
Added sample alignment test for AllGather
Datasets no longer free memory on destruction so Release() must be used
[ROCm/rccl commit: a50c852851 ]
2019-05-18 00:04:03 +00:00
Stanley Tsang
81c99b59cd
Update README.md
...
Adding mention of requirement for chrpath for unit tests.
[ROCm/rccl commit: 0d6a5a3d25 ]
2019-05-17 11:42:14 -06:00
gilbertlee-amd
7419ef1567
Merge pull request #61 from gilbertlee-amd/master
...
Reducing scope of linker flags to rccl target to avoid warnings.
Changing to tagged version of googletests
Checking for chrpath
[ROCm/rccl commit: 28f8dcb7c9 ]
2019-05-17 11:36:23 -06:00
Gilbert Lee
edf1a1e9c1
Adding check for chrpath for unit tests
...
[ROCm/rccl commit: 248e1d61cc ]
2019-05-17 17:13:40 +00:00
Gilbert Lee
0e17bed9e6
Fixing GoogleTest to 1.8.1 and making changes to tests to support older API
...
[ROCm/rccl commit: 08fcce5ec9 ]
2019-05-16 23:13:49 +00:00
Gilbert Lee
fe79e10c52
Reducing scope of linker flags to rccl target to avoid warnings
...
[ROCm/rccl commit: 11f78df04d ]
2019-05-16 21:39:39 +00:00
Gilbert Lee
60f91f645d
Updating RCCL based on NCCL 2.3.7
...
- Contains modifications to support AMD hardware
- Adds unit tests
[ROCm/rccl commit: 55a4b22ad7 ]
2019-05-16 16:16:18 +00:00
Christian Sigg
3a7974d4c4
Fix memory leak in bootstrapRoot()
...
[ROCm/rccl commit: 4861e197fd ]
2019-01-07 14:18:46 -08:00
Sylvain Jeaugey
4e32c7a19a
Replace CUDA_VERSION by CUDART_VERSION
...
[ROCm/rccl commit: c244b51ae7 ]
2018-12-13 15:22:17 -08:00
Christian Sigg
6a93dca51b
Qualify nullptr_t with std::
...
[ROCm/rccl commit: 3e6afef473 ]
2018-12-13 14:18:09 -08:00
Christian Sigg
fcf027b42d
Two temporary workarounds for cuda-clang issues.
...
[ROCm/rccl commit: 346fc49514 ]
2018-12-13 14:17:58 -08:00
Christian Sigg
457c2ab5ae
Change __CUDACC_VER_*__ preprocessor directives to CUDA_VERSION because clang doesn't define the former.
...
[ROCm/rccl commit: d08e9b5279 ]
2018-12-13 14:17:46 -08:00
Sylvain Jeaugey
7fa81a40a1
Fix #163 : remove warnings
...
[ROCm/rccl commit: 469b69a5d0 ]
2018-12-11 09:19:16 -08:00
Ke Wen
e74cca3317
Fix dummy plugin
...
[ROCm/rccl commit: 8606cdb8b2 ]
2018-12-05 17:25:23 -08:00
Sylvain Jeaugey
e2f73fea64
Remove error logging from a normal path
...
When initNet fails, we should not print the backtrace as it is
supposed to be normal operation (falling back to sockets)
[ROCm/rccl commit: 57368189e1 ]
2018-12-04 14:47:41 -08:00
Sylvain Jeaugey
49819bcb7c
Fix GPU Direct RDMA detection.
...
Whether the network supported GPU Direct RDMA or not was ignored,
causing sockets to break when cards were local enough that NCCL
tried to use it.
[ROCm/rccl commit: 4b39a4cf91 ]
2018-12-04 14:42:28 -08:00
Sylvain Jeaugey
274fd3756a
Add NCCL_NET flag to many debug lines.
...
[ROCm/rccl commit: b8a9a32ccb ]
2018-12-04 13:10:19 -08:00
Sylvain Jeaugey
245786e9a5
Improve INFO message when external network is not found.
...
Fix #162
[ROCm/rccl commit: cdae05b277 ]
2018-12-04 12:10:58 -08:00
David Addison
f842c81503
Fixed some compilation errors when TRACE=1 set
...
[ROCm/rccl commit: 5fe2618c0e ]
2018-11-29 14:12:14 -08:00
Sylvain Jeaugey
5dffba1287
Rework shared memory code to use SYSCHECK macros.
...
This is to handle EINTR/EAGAIN properly (issue #137 ), and also
make the code consistent with the rest.
Unfortunately posix_fallocate and mmap do not follow the classic
return code/errno pattern, so we need to write wrappers around those
functions.
[ROCm/rccl commit: eed8218e17 ]
2018-11-29 12:52:13 -08:00
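The shared memory commit above notes that posix_fallocate and mmap do not follow the classic "return -1 and set errno" pattern. As a rough illustration only (the names `fallocate_errno` and `mmap_errno` are hypothetical and not taken from the commit, and a Linux/POSIX system is assumed), wrappers that normalise both calls to that pattern might look like this:

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>

// Hypothetical wrappers; names and signatures are illustrative only.

// posix_fallocate() returns the error number directly and does not set
// errno, so translate it to the usual "-1 and errno" convention.
inline int fallocate_errno(int fd, off_t offset, off_t len) {
  int err = posix_fallocate(fd, offset, len);
  if (err != 0) {
    errno = err;
    return -1;
  }
  return 0;
}

// mmap() reports failure with MAP_FAILED (and sets errno) instead of
// returning -1, so expose the mapping through an out-parameter.
inline int mmap_errno(void** out, size_t len, int prot, int flags, int fd,
                      off_t offset) {
  assert(out != nullptr);
  void* p = mmap(nullptr, len, prot, flags, fd, offset);
  if (p == MAP_FAILED) {
    *out = nullptr;
    return -1;
  }
  *out = p;
  return 0;
}
```

With both calls reduced to "0 on success, -1 with errno on failure", a single error-checking macro can treat them like any other system call.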
Sylvain Jeaugey
57bff87e1d
Rework SYSCHECK macros to better handle retries.
...
SYSCHECKVAL was not retrying when a retry was needed. Since not all
calls are inside a loop, that means we could silently miss an
EINTR/EAGAIN return code.
Also rework the socket connection code and improve error reporting.
[ROCm/rccl commit: 302d538b73 ]
2018-11-29 12:52:13 -08:00
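The SYSCHECK rework above is about retrying calls that fail with EINTR or EAGAIN instead of silently missing them. The following is a minimal sketch of that idea, not the actual macro: `retry_syscall` and its parameters are hypothetical, and the attempt limit is an assumption added so the sketch always terminates.

```cpp
#include <cassert>
#include <cerrno>
#include <functional>

// Hypothetical helper (not the real SYSCHECK/SYSCHECKVAL macros): call `fn`
// until it succeeds, fails with an errno other than EINTR/EAGAIN, or the
// attempt limit is reached. Returns the last return value of `fn`;
// `attempts`, if given, receives the number of calls made.
inline long retry_syscall(const std::function<long()>& fn, int max_attempts,
                          int* attempts = nullptr) {
  assert(max_attempts > 0);
  long ret = -1;
  int n = 0;
  while (n < max_attempts) {
    errno = 0;
    ret = fn();
    ++n;
    if (ret != -1) break;                           // success
    if (errno != EINTR && errno != EAGAIN) break;   // permanent failure
    // transient failure: loop and call again
  }
  if (attempts != nullptr) *attempts = n;
  return ret;
}
```

A caller would wrap the system call in a lambda, for example `retry_syscall([&] { return (long)read(fd, buf, len); }, 16)`, and inspect errno only when the result is still -1.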
Sylvain Jeaugey
f0e160728f
Improve net API description
...
[ROCm/rccl commit: 61b50a63ef ]
2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
ec1247221e
Make network isend/irecv non blocking
...
[ROCm/rccl commit: 98adf2fe11 ]
2018-11-26 16:24:31 -08:00
Sylvain Jeaugey
378a103248
Add support for external network.
...
Dynamically load external network from libnccl-net.so.
Add init function in networks.
Move PCI scoring to net.cu, only ask transport to provide a path.
Simplify CUDA PCI path detection.
Add dummy external network
[ROCm/rccl commit: 0d3a20f96d ]
2018-11-26 16:24:31 -08:00
Alex Sergeev
b458b3b337
Generate host-hash for P2P and SHM based on $(readlink /proc/self/ns/uts) + $(readlink /proc/self/ns/mnt) ( #156 )
...
[ROCm/rccl commit: d7a58cfa58 ]
2018-11-19 17:39:44 -08:00
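The host-hash commit above derives an identifier from the targets of /proc/self/ns/uts and /proc/self/ns/mnt, so that processes sharing the same UTS and mount namespaces hash to the same value. A rough sketch of that idea follows; the function names and the djb2 hash are assumptions chosen for illustration, not the hash used by the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unistd.h>

// Read the target of a symlink; returns an empty string on failure.
inline std::string read_link(const std::string& path) {
  assert(!path.empty());
  char buf[1024];
  ssize_t n = readlink(path.c_str(), buf, sizeof(buf));
  return n < 0 ? std::string() : std::string(buf, static_cast<size_t>(n));
}

// djb2 string hash, used here only as a stand-in hash function.
inline uint64_t hash_string(const std::string& s) {
  uint64_t h = 5381;
  for (unsigned char c : s) h = h * 33 + c;
  return h;
}

// Hypothetical host hash: two processes in the same UTS and mount
// namespaces read the same link targets and so get the same value.
inline uint64_t namespace_host_hash() {
  return hash_string(read_link("/proc/self/ns/uts") +
                     read_link("/proc/self/ns/mnt"));
}
```

On systems without /proc the two reads return empty strings, so this sketch falls back to the hash of an empty string rather than failing.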
Sylvain Jeaugey
726c60336d
Generate nccl.h in build instead of src
...
Generating nccl.h in src makes source directories dirty after builds.
[ROCm/rccl commit: 3c6e25210b ]
2018-11-09 14:00:41 -08:00
Ke Wen
ec81b3ad0e
Add official builds download link
...
[ROCm/rccl commit: 21d9a877be ]
2018-11-08 11:22:28 -08:00
Sylvain Jeaugey
287e36a92d
Add instructions to install packaging toolchain
...
Address #143 and #150 : debuild not installed.
[ROCm/rccl commit: f7d31919d7 ]
2018-11-05 11:42:33 -08:00
Sylvain Jeaugey
1cc88accd5
Add install target
...
Fix issue #145
[ROCm/rccl commit: bed43524cc ]
2018-11-05 09:53:59 -08:00
David Addison
8c945929af
2.3.7-1
...
Improved LL tuning for multi-node jobs.
Improved bootstrap for large job scaling.
Fixed a hang during bootstrap due to socket reuse.
Added operation name to the COLL INFO logging.
[ROCm/rccl commit: b56650c7f5 ]
2018-10-24 14:44:59 -07:00
Obihörnchen
8507b66cfd
Fix nccl-tests all_reduce_perf path
...
It's `all_reduce_perf` not `allreduce_perf`
[ROCm/rccl commit: 3202d6b393 ]
2018-10-14 00:53:17 -07:00
Sylvain Jeaugey
8ffcfac437
2.3.5-5
...
Add support for inter-node communication using sockets and InfiniBand/RoCE.
Improve latency.
Add support for aggregation.
Improve LL/regular tuning.
Remove tests as those are now at github.com/nvidia/nccl-tests .
[ROCm/rccl commit: f93fe9bfd9 ]
2018-09-25 14:12:01 -07:00
Sylvain Jeaugey
63ab6df5b3
Merge pull request #119 from sclarkson/master
...
Fix tests: call cudaHostUnregister on the host pointer instead of the device pointer.
[ROCm/rccl commit: 286916a1a3 ]
2017-11-28 18:41:26 -08:00
sclarkson
42037d978f
fix tests on maxwell
...
[ROCm/rccl commit: 680a35c6b7 ]
2017-11-11 19:22:06 -08:00
Sylvain Jeaugey
5047915874
Update README to link to NCCL2
...
[ROCm/rccl commit: 03d856977e ]
2017-08-04 09:44:37 -07:00
Sylvain Jeaugey
eadeff7e4c
Update README to link to NCCL2 part 3
...
[ROCm/rccl commit: 4a33f66e27 ]
2017-08-04 09:44:09 -07:00
Sylvain Jeaugey
39f8f3a3a9
Update README to link to NCCL2 #2
...
[ROCm/rccl commit: d66fb63679 ]
2017-08-04 09:43:29 -07:00
Sylvain Jeaugey
5ef74d8b80
Update README to link to NCCL2
...
[ROCm/rccl commit: 80ae43b443 ]
2017-08-04 09:42:25 -07:00
Sylvain Jeaugey
338f6eaae1
Add support for CUDA9 half semantics
...
[ROCm/rccl commit: 29a1a916dc ]
2017-06-14 11:20:24 -07:00
Sylvain Jeaugey
498770f57b
Merge pull request #78 from ilya-biryukov/master
...
Fix compilation error when compiling with 'clang -x cuda'.
[ROCm/rccl commit: ccfc4567dc ]
2017-04-04 09:47:52 -07:00
Boris Fomitchev
12e9519ed0
Added Pascal nvcc flags, bumped version
...
[ROCm/rccl commit: 649f04d077 ]
2017-03-24 11:58:14 -07:00
Ilya Biryukov
fa9d609d5d
Fix compilation error when compiling with 'clang -x cuda'.
...
Functions vFetch and vStore are not found by ADL with clang,
so they need to be declared before usage in ReduceCopy.
[ROCm/rccl commit: 8241cd7b6e ]
2017-03-16 12:01:11 +01:00
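The clang fix above relies on a C++ name lookup rule: inside a template, an unqualified call is resolved using declarations visible at the template's definition plus argument-dependent lookup (ADL) at instantiation. ADL finds nothing for built-in types such as float, so a helper defined only after the template is not found by a conforming compiler. The sketch below shows the "declare before use" arrangement; vFetch, vStore and ReduceCopy are the names from the commit message, but their signatures here are simplified assumptions.

```cpp
#include <cassert>

// Forward declarations placed before ReduceCopy, so ordinary unqualified
// lookup finds them at the template's definition point. Without them,
// clang rejects the calls below when T is a built-in type.
template <typename T> T vFetch(const T* p);
template <typename T> void vStore(T* p, T v);

// Simplified stand-in for the real kernel helper: out[i] = a[i] + b[i].
template <typename T>
void ReduceCopy(const T* a, const T* b, T* out, int n) {
  assert(n >= 0);
  for (int i = 0; i < n; ++i) {
    vStore(out + i, static_cast<T>(vFetch(a + i) + vFetch(b + i)));
  }
}

// Definitions may still come after the template that uses them.
template <typename T> T vFetch(const T* p) { return *p; }
template <typename T> void vStore(T* p, T v) { *p = v; }
```

Removing the two forward declarations reproduces the kind of error the commit describes under clang, while more permissive compilers may still accept the code.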