Wenkai Du
e7eff47be4
Revert "Tuning the inline and unroll to reduce the scratch usage"
...
This reverts commit d8a06589c9.
[ROCm/rccl commit: ca493a6b51 ]
2020-05-15 14:15:40 -07:00
Wenkai Du
5780637fff
Merge pull request #209 from wenkaidu/hip-clang
...
Rename files which only diffs in extension
[ROCm/rccl commit: c245f1507e ]
2020-05-15 13:51:12 -07:00
Wenkai Du
61a82ff572
Merge pull request #208 from wenkaidu/perf_xgmi
...
Give preference to path with more XGMI connections
[ROCm/rccl commit: 706de76046 ]
2020-05-15 10:07:22 -07:00
Wenkai Du
c2afb1f4ca
Rename files which only diffs in extension
...
[ROCm/rccl commit: e7b36304c8 ]
2020-05-15 09:16:32 -07:00
Wenkai Du
27519fd019
Give preference to path with more XGMI connections
...
[ROCm/rccl commit: b3c9852634 ]
2020-05-14 15:33:16 -07:00
Wenkai Du
ced9958094
rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
...
[ROCm/rccl commit: f1058b6353 ]
2020-05-12 23:54:07 +00:00
Stanley Tsang
abba9d5504
Restoring doxygen documentation to nccl.h.in.
...
[ROCm/rccl commit: 787ac13486 ]
2020-05-12 22:03:31 +00:00
Stanley Tsang
e35e4d3401
Updating README and readthedocs documentation.
...
[ROCm/rccl commit: b59b9d328b ]
2020-05-12 20:11:49 +00:00
Wenkai Du
5ba51914f1
Update rccl_bfloat16.h to match rocBLAS
...
[ROCm/rccl commit: d5a07a7b5c ]
2020-05-08 22:48:07 +00:00
Wenkai Du
069322d05a
Set flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
...
[ROCm/rccl commit: 24ea2ef6dd ]
2020-05-08 15:57:14 +00:00
Saad Rahim
65ad48404b
Merge remote-tracking branch 'upstream/master' into develop
...
[ROCm/rccl commit: 33c23fdcda ]
2020-04-29 16:12:37 -07:00
saadrahim
75863c97a4
Refactoring packaging (#193)
...
[ROCm/rccl commit: 308e96877e ]
2020-04-29 16:24:21 -06:00
Wenkai Du
2badecfa20
Merge pull request #199 from wenkaidu/para_jobs
...
Enable parallel jobs for hip-clang build
[ROCm/rccl commit: 914b6ca27c ]
2020-04-29 13:49:48 -07:00
saadrahim
fa25190ddb
Junit test storage call corrected (#197)
...
* Focus testing on CentOS for now
* storing junit
* Reducing test suite to Ubuntu
[ROCm/rccl commit: 65390f9872 ]
2020-04-29 14:22:03 -06:00
Wenkai Du
52096f2cb6
Enable parallel jobs for hip-clang build
...
[ROCm/rccl commit: 3f471ab5b1 ]
2020-04-29 17:58:16 +00:00
saadrahim
c0c0e92ef5
Adding NCCL_DEBUG=INFO Logging to CI (#196)
...
[ROCm/rccl commit: 6b1d70b03b ]
2020-04-27 15:12:15 -06:00
Wenkai Du
779ee97ada
topo_expl: fix build error
...
[ROCm/rccl commit: 5743c6b7d2 ]
2020-04-27 17:17:05 +00:00
Wenkai Du
9813d67cd1
Merge remote-tracking branch 'nccl/master' into HEAD
...
[ROCm/rccl commit: c4edc257b0 ]
2020-04-27 17:16:54 +00:00
Wenkai Du
5f57e6b466
Merge pull request #194 from wenkaidu/search
...
Fix incorrect next device ID in PCI ordered search
[ROCm/rccl commit: cf5070f6c0 ]
2020-04-27 09:54:09 -07:00
Wenkai Du
7b7f781658
Fix incorrect next device ID in PCI ordered search
...
[ROCm/rccl commit: edb49ed2d5 ]
2020-04-25 01:01:13 +00:00
saadrahim
b9acac2db6
Enabling CI Testing Again (#192)
...
Adding CI support based on AMD internal CI refactor.
[ROCm/rccl commit: cc66dd46e9 ]
2020-04-24 10:36:57 -06:00
Gilbert Lee
eebc6f2844
Adding option to re-use streams instead of re-creating per topology
...
[ROCm/rccl commit: 339bf9ff19 ]
2020-04-23 15:53:40 +00:00
Sylvain Jeaugey
c43022b9d8
Fix crash when only a subset of GPUs are visible within a container.
...
Fixes #326.
[ROCm/rccl commit: f36540f55a ]
2020-04-17 10:03:14 -07:00
Sylvain Jeaugey
5df2502deb
Improve robustness of PCI detection
...
Fallback to default values when class/speed is unknown.
[ROCm/rccl commit: 23a9fbb788 ]
2020-04-16 14:27:50 -07:00
aokomoriuta
acd868dfcb
Fix wrong variable name "slice" to "chunk"
...
https://github.com/NVIDIA/nccl/issues/287
[ROCm/rccl commit: a783484ab5 ]
2020-04-14 19:00:51 -07:00
Wenkai Du
728cf9ee10
Revert "Temporary disable 0x803 target due to build error"
...
This reverts commit 8b1ce44c2a.
[ROCm/rccl commit: 5170bd1c02 ]
2020-04-14 16:58:41 +00:00
Wenkai Du
2de0b24c30
rccl-prim-test: auto-detect rings in 4P and 8P configurations
...
[ROCm/rccl commit: ef7064ba9b ]
2020-04-10 18:17:21 +00:00
Sylvain Jeaugey
627e1a06d0
Fix bug #307: wrong NIC selection on the reduction tree.
...
The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.
[ROCm/rccl commit: b5b6c6acdd ]
2020-04-09 17:14:07 -07:00
Aaron Enye Shi
bfbfe370c3
Fix HIP-Clang build with HSA headers
...
HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.
[ROCm/rccl commit: a95090d981 ]
2020-04-03 17:58:23 -04:00
Wenkai Du
8852e54181
topo_expl: update to 2.6
...
[ROCm/rccl commit: 6f54b23503 ]
2020-04-01 13:37:08 -07:00
Wenkai Du
4aeb7f041e
Merge remote-tracking branch 'nccl/master' into v2.6.4_merge
...
[ROCm/rccl commit: fa36fd9ef9 ]
2020-04-01 13:35:12 -07:00
Sylvain Jeaugey
b996c2ca00
Merge pull request #314 from NVIDIA/v2.6
...
2.6.4-1
[ROCm/rccl commit: 533e3702cf ]
2020-03-26 17:31:24 -07:00
Sylvain Jeaugey
40adc74496
2.6.4-1
...
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3: merge PCI path and GPU pointer
capability into a single structure and add other properties.
[ROCm/rccl commit: b221128eca ]
2020-03-20 14:58:36 -07:00
Rashika Kheria
38b445c94f
Check return code for Flush operation
...
Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.
Signed-off-by: Rashika Kheria <rashika@amazon.com>
[ROCm/rccl commit: 6c61492eba ]
2020-03-16 20:40:59 -07:00
Wenkai Du
e3e1c6b29c
rccl-prim-test: add all-to-all benchmark (#185)
...
For gfx908, support simple detection of ring topology.
Call ReduceOrCopyMulti directly from kernel.
Also simplify code by removing kernel start synchronization option
which has no effect on throughput measurements.
[ROCm/rccl commit: ebc823e603 ]
2020-03-16 10:00:54 -07:00
amdkila
eef6314001
set hip::host and hip::device and remove some deprecated targets (#184)
...
[ROCm/rccl commit: b9fb0cd808 ]
2020-03-05 13:36:55 -07:00
Wenkai Du
cb19bce4e0
Merge pull request #183 from wenkaidu/dup_rings
...
Remove condition for ring duplication
[ROCm/rccl commit: 0976e47b06 ]
2020-03-02 17:12:42 -08:00
Wenkai Du
dba615366b
Merge pull request #182 from wenkaidu/topo_expl
...
Topo expl
[ROCm/rccl commit: 88752f9173 ]
2020-03-02 15:44:09 -08:00
Wenkai Du
85fd51a06f
Remove condition for ring duplication
...
Fix insufficient number of rings on single node after pull #179
[ROCm/rccl commit: 62dc28bd2e ]
2020-03-02 12:55:06 -08:00
Wenkai Du
7882b2f0c5
topo_expl: add a few more single node models
...
[ROCm/rccl commit: 32388d60a9 ]
2020-03-02 11:43:03 -08:00
Wenkai Du
593d99d9a9
Check fine grained memory before enabling RDMA
...
Adding back the check which was lost from 2.5 merge.
[ROCm/rccl commit: fb59328a7b ]
2020-03-02 11:18:27 -08:00
Wenkai Du
2a66deb694
Merge pull request #179 from wenkaidu/search
...
Use fraction of system maxWidth as steps for searching
[ROCm/rccl commit: 8b5bc8bca2 ]
2020-02-28 11:05:46 -08:00
Wenkai Du
b750defc28
Merge remote-tracking branch 'remotes/nccl/master'
...
[ROCm/rccl commit: 8e73a2ad60 ]
2020-02-27 12:53:03 -08:00
Wenkai Du
a36c2ecbc4
Add topology visualizer tool
...
[ROCm/rccl commit: 498d5029ad ]
2020-02-26 15:23:34 -08:00
Wenkai Du
3886f9bea8
topo_expl: use bandwidth numbers defined in graph in CPU models
...
[ROCm/rccl commit: 934b6de557 ]
2020-02-26 14:17:36 -08:00
Wenkai Du
45a7541582
Revise PCI BW numbers on Rome
...
[ROCm/rccl commit: d2adc61bf6 ]
2020-02-26 13:17:49 -08:00
Wenkai Du
b4be0ff3b8
Use fraction of system maxWidth as steps for searching
...
This reverts previous workaround of deducting only half of width
from paths.
[ROCm/rccl commit: 8391637613 ]
2020-02-26 09:14:35 -08:00
Wenkai Du
5747c3cac1
Fix abort handling in LL primitives
...
[ROCm/rccl commit: 077c3cda74 ]
2020-02-25 13:42:54 -08:00
Wenkai Du
d640f38d56
Fix system maxSpeed and maxWidth calculation
...
[ROCm/rccl commit: 9b80b3633f ]
2020-02-24 15:18:57 -08:00
Wenkai Du
93d448e2fe
Fix incorrect CR8 detection
...
Also change level of ring graph print to help debugging
[ROCm/rccl commit: f54dc58113 ]
2020-02-21 10:09:49 -08:00