Commit graph

331 commits

Author SHA1 Message Date
Wenkai Du e7eff47be4 Revert "Tuning the inline and unroll to reduce the scratch usage"
This reverts commit d8a06589c9.


[ROCm/rccl commit: ca493a6b51]
2020-05-15 14:15:40 -07:00
Wenkai Du 5780637fff Merge pull request #209 from wenkaidu/hip-clang
Rename files which only diffs in extension

[ROCm/rccl commit: c245f1507e]
2020-05-15 13:51:12 -07:00
Wenkai Du 61a82ff572 Merge pull request #208 from wenkaidu/perf_xgmi
Give preference to path with more XGMI connections

[ROCm/rccl commit: 706de76046]
2020-05-15 10:07:22 -07:00
Wenkai Du c2afb1f4ca Rename files which only diffs in extension
[ROCm/rccl commit: e7b36304c8]
2020-05-15 09:16:32 -07:00
Wenkai Du 27519fd019 Give preference to path with more XGMI connections
[ROCm/rccl commit: b3c9852634]
2020-05-14 15:33:16 -07:00
Wenkai Du ced9958094 rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
[ROCm/rccl commit: f1058b6353]
2020-05-12 23:54:07 +00:00
Stanley Tsang abba9d5504 Restoring doxygen documentation to nccl.h.in.
[ROCm/rccl commit: 787ac13486]
2020-05-12 22:03:31 +00:00
Stanley Tsang e35e4d3401 Updating README and readthedocs documentation.
[ROCm/rccl commit: b59b9d328b]
2020-05-12 20:11:49 +00:00
Wenkai Du 5ba51914f1 Update rccl_bfloat16.h to match rocBLAS
[ROCm/rccl commit: d5a07a7b5c]
2020-05-08 22:48:07 +00:00
Wenkai Du 069322d05a Set flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
[ROCm/rccl commit: 24ea2ef6dd]
2020-05-08 15:57:14 +00:00
Saad Rahim 65ad48404b Merge remote-tracking branch 'upstream/master' into develop
[ROCm/rccl commit: 33c23fdcda]
2020-04-29 16:12:37 -07:00
saadrahim 75863c97a4 Refactoring packaging (#193)
[ROCm/rccl commit: 308e96877e]
2020-04-29 16:24:21 -06:00
Wenkai Du 2badecfa20 Merge pull request #199 from wenkaidu/para_jobs
Enable parallel jobs for hip-clang build

[ROCm/rccl commit: 914b6ca27c]
2020-04-29 13:49:48 -07:00
saadrahim fa25190ddb Junit test storage call corrected (#197)
* Focus testing on Centos for now

* storing junit

* Reducing test suite to Ubuntu

[ROCm/rccl commit: 65390f9872]
2020-04-29 14:22:03 -06:00
Wenkai Du 52096f2cb6 Enable parallel jobs for hip-clang build
[ROCm/rccl commit: 3f471ab5b1]
2020-04-29 17:58:16 +00:00
saadrahim c0c0e92ef5 Adding NCCL_DEBUG=INFO Logging to CI (#196)
[ROCm/rccl commit: 6b1d70b03b]
2020-04-27 15:12:15 -06:00
Wenkai Du 779ee97ada topo_expl: fix build error
[ROCm/rccl commit: 5743c6b7d2]
2020-04-27 17:17:05 +00:00
Wenkai Du 9813d67cd1 Merge remote-tracking branch 'nccl/master' into HEAD
[ROCm/rccl commit: c4edc257b0]
2020-04-27 17:16:54 +00:00
Wenkai Du 5f57e6b466 Merge pull request #194 from wenkaidu/search
Fix incorrect next device ID in PCI ordered search

[ROCm/rccl commit: cf5070f6c0]
2020-04-27 09:54:09 -07:00
Wenkai Du 7b7f781658 Fix incorrect next device ID in PCI ordered search
[ROCm/rccl commit: edb49ed2d5]
2020-04-25 01:01:13 +00:00
saadrahim b9acac2db6 Enabling CI Testing Again (#192)
Adding CI support based on AMD internal CI refactor.

[ROCm/rccl commit: cc66dd46e9]
2020-04-24 10:36:57 -06:00
Gilbert Lee eebc6f2844 Adding option to re-use streams instead of re-creating per topology
[ROCm/rccl commit: 339bf9ff19]
2020-04-23 15:53:40 +00:00
Sylvain Jeaugey c43022b9d8 Fix crash when only a subset of GPUs are visible within a container.
Fixes #326.


[ROCm/rccl commit: f36540f55a]
2020-04-17 10:03:14 -07:00
Sylvain Jeaugey 5df2502deb Improve robustness of PCI detection
Fallback to default values when class/speed is unknown.


[ROCm/rccl commit: 23a9fbb788]
2020-04-16 14:27:50 -07:00
aokomoriuta acd868dfcb Fix wrong variable name "slice" to "chunk"
https://github.com/NVIDIA/nccl/issues/287


[ROCm/rccl commit: a783484ab5]
2020-04-14 19:00:51 -07:00
Wenkai Du 728cf9ee10 Revert "Temporary disable 0x803 target due to build error"
This reverts commit 8b1ce44c2a.


[ROCm/rccl commit: 5170bd1c02]
2020-04-14 16:58:41 +00:00
Wenkai Du 2de0b24c30 rccl-prim-test: auto-detect rings in 4P and 8P configurations
[ROCm/rccl commit: ef7064ba9b]
2020-04-10 18:17:21 +00:00
Sylvain Jeaugey 627e1a06d0 Fix bug #307 : wrong NIC selection on the reduction tree.
The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.


[ROCm/rccl commit: b5b6c6acdd]
2020-04-09 17:14:07 -07:00
Aaron Enye Shi bfbfe370c3 Fix HIP-Clang build with HSA headers
HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.


[ROCm/rccl commit: a95090d981]
2020-04-03 17:58:23 -04:00
Wenkai Du 8852e54181 topo_expl: update to 2.6
[ROCm/rccl commit: 6f54b23503]
2020-04-01 13:37:08 -07:00
Wenkai Du 4aeb7f041e Merge remote-tracking branch 'nccl/master' into v2.6.4_merge
[ROCm/rccl commit: fa36fd9ef9]
2020-04-01 13:35:12 -07:00
Sylvain Jeaugey b996c2ca00 Merge pull request #314 from NVIDIA/v2.6
2.6.4-1

[ROCm/rccl commit: 533e3702cf]
2020-03-26 17:31:24 -07:00
Sylvain Jeaugey 40adc74496 2.6.4-1
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.


[ROCm/rccl commit: b221128eca]
2020-03-20 14:58:36 -07:00
Rashika Kheria 38b445c94f Check return code for Flush operation
Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <rashika@amazon.com>


[ROCm/rccl commit: 6c61492eba]
2020-03-16 20:40:59 -07:00
Wenkai Du e3e1c6b29c rccl-prim-test: add all-to-all benchmark (#185)
For gfx908, support simple detection of ring topology.
Call ReduceOrCopyMulti directly from kernel.
Also simplify code by removing kernel start synchronization option
which has no effect on throughput measurements.

[ROCm/rccl commit: ebc823e603]
2020-03-16 10:00:54 -07:00
amdkila eef6314001 set hip::host and hip::device and remove some deprecated targets (#184)
[ROCm/rccl commit: b9fb0cd808]
2020-03-05 13:36:55 -07:00
Wenkai Du cb19bce4e0 Merge pull request #183 from wenkaidu/dup_rings
Remove condition for ring duplication

[ROCm/rccl commit: 0976e47b06]
2020-03-02 17:12:42 -08:00
Wenkai Du dba615366b Merge pull request #182 from wenkaidu/topo_expl
Topo expl

[ROCm/rccl commit: 88752f9173]
2020-03-02 15:44:09 -08:00
Wenkai Du 85fd51a06f Remove condition for ring duplication
Fix insufficient number of rings on single node after pull #179


[ROCm/rccl commit: 62dc28bd2e]
2020-03-02 12:55:06 -08:00
Wenkai Du 7882b2f0c5 topo_expl: add a few more single node models
[ROCm/rccl commit: 32388d60a9]
2020-03-02 11:43:03 -08:00
Wenkai Du 593d99d9a9 Check fine grained memory before enabling RDMA
Adding back the check which was lost from 2.5 merge.


[ROCm/rccl commit: fb59328a7b]
2020-03-02 11:18:27 -08:00
Wenkai Du 2a66deb694 Merge pull request #179 from wenkaidu/search
Use fraction of system maxWidth as steps for searching

[ROCm/rccl commit: 8b5bc8bca2]
2020-02-28 11:05:46 -08:00
Wenkai Du b750defc28 Merge remote-tracking branch 'remotes/nccl/master'
[ROCm/rccl commit: 8e73a2ad60]
2020-02-27 12:53:03 -08:00
Wenkai Du a36c2ecbc4 Add topology visualizer tool
[ROCm/rccl commit: 498d5029ad]
2020-02-26 15:23:34 -08:00
Wenkai Du 3886f9bea8 topo_expl: use bandwidth numbers defined in graph in CPU models
[ROCm/rccl commit: 934b6de557]
2020-02-26 14:17:36 -08:00
Wenkai Du 45a7541582 Revise PCI BW numbers on Rome
[ROCm/rccl commit: d2adc61bf6]
2020-02-26 13:17:49 -08:00
Wenkai Du b4be0ff3b8 Use fraction of system maxWidth as steps for searching
This reverts previous workaround of deducting only half of width
from paths.


[ROCm/rccl commit: 8391637613]
2020-02-26 09:14:35 -08:00
Wenkai Du 5747c3cac1 Fix abort handling in LL primitives
[ROCm/rccl commit: 077c3cda74]
2020-02-25 13:42:54 -08:00
Wenkai Du d640f38d56 Fix system maxSpeed and maxWidth calculation
[ROCm/rccl commit: 9b80b3633f]
2020-02-24 15:18:57 -08:00
Wenkai Du 93d448e2fe Fix incorrect CR8 detection
Also change level of ring graph print to help debugging


[ROCm/rccl commit: f54dc58113]
2020-02-21 10:09:49 -08:00