Commit graph

402 Commits

Author SHA1 Message Date
Stanley Tsang dc403e0ca2 Making hip-clang the default compiler; documentation update (#216)
* Making hip-clang the default compiler; documentation update

* Adding back --hip-clang to install.sh as a silent option for CI
2020-06-04 11:58:27 -06:00
Wenkai Du 2a4514772c Merge pull request #214 from wenkaidu/gdr
Use cached value for detecting GDR support only once
2020-05-22 13:36:23 -07:00
Wenkai Du 67c8e72ce3 Use cached value for detecting GDR support only once 2020-05-22 17:19:10 +00:00
Wenkai Du 957be85944 Merge pull request #212 from wenkaidu/version
Report HIP version in logs
2020-05-20 16:25:54 -07:00
Wenkai Du e41ab173cf Report HIP version in logs 2020-05-20 18:15:32 +00:00
Wenkai Du af703877cf Merge pull request #210 from wenkaidu/unroll
Revert "Tuning the inline and unroll to reduce the scratch usage"
2020-05-15 15:35:27 -07:00
Wenkai Du ca493a6b51 Revert "Tuning the inline and unroll to reduce the scratch usage"
This reverts commit eec319038e.
2020-05-15 14:15:40 -07:00
Wenkai Du c245f1507e Merge pull request #209 from wenkaidu/hip-clang
Rename files which only diffs in extension
2020-05-15 13:51:12 -07:00
Wenkai Du 706de76046 Merge pull request #208 from wenkaidu/perf_xgmi
Give preference to path with more XGMI connections
2020-05-15 10:07:22 -07:00
Wenkai Du e7b36304c8 Rename files which only diffs in extension 2020-05-15 09:16:32 -07:00
Wenkai Du ca4987e5fb Merge pull request #207 from wenkaidu/hip-clang
rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDe…
2020-05-14 18:31:56 -07:00
Wenkai Du b3c9852634 Give preference to path with more XGMI connections 2020-05-14 15:33:16 -07:00
Wenkai Du f1058b6353 rccl-prim-test: add flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang 2020-05-12 23:54:07 +00:00
Stanley Tsang 66a9f11910 Merge pull request #206 from stanleytsang-amd/develop
Updating RCCL documentation
2020-05-12 17:24:40 -06:00
Stanley Tsang 787ac13486 Restoring doxygen documentation to nccl.h.in. 2020-05-12 22:03:31 +00:00
Stanley Tsang b59b9d328b Updating README and readthedocs documentation. 2020-05-12 20:11:49 +00:00
Wenkai Du 52752aba6e Merge pull request #205 from wenkaidu/bf16
Update rccl_bfloat16.h to match rocBLAS
2020-05-11 09:55:06 -07:00
Wenkai Du d5a07a7b5c Update rccl_bfloat16.h to match rocBLAS 2020-05-08 22:48:07 +00:00
Wenkai Du 94d16c0f0a Merge pull request #204 from wenkaidu/launch_flags
Set flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang
2020-05-08 11:20:50 -07:00
Wenkai Du 24ea2ef6dd Set flags when calling hipExtLaunchMultiKernelMultiDevice in hip-clang 2020-05-08 15:57:14 +00:00
Saad Rahim 33c23fdcda Merge remote-tracking branch 'upstream/master' into develop 2020-04-29 16:12:37 -07:00
saadrahim 308e96877e Refactoring packaging (#193) 2020-04-29 16:24:21 -06:00
Wenkai Du 914b6ca27c Merge pull request #199 from wenkaidu/para_jobs
Enable parallel jobs for hip-clang build
2020-04-29 13:49:48 -07:00
saadrahim 65390f9872 Junit test storage call corrected (#197)
* Focus testing on Centos for now

* storing junit

* Reducing test suite to Ubuntu
2020-04-29 14:22:03 -06:00
Wenkai Du 3f471ab5b1 Enable parallel jobs for hip-clang build 2020-04-29 17:58:16 +00:00
saadrahim 6b1d70b03b Adding NCCL_DEBUG=INFO Logging to CI (#196) 2020-04-27 15:12:15 -06:00
Wenkai Du f7c27c6c9f Merge pull request #195 from wenkaidu/sync_nccl
Sync up with NCCL
2020-04-27 11:45:05 -07:00
Wenkai Du 5743c6b7d2 topo_expl: fix build error 2020-04-27 17:17:05 +00:00
Wenkai Du c4edc257b0 Merge remote-tracking branch 'nccl/master' into HEAD 2020-04-27 17:16:54 +00:00
Wenkai Du cf5070f6c0 Merge pull request #194 from wenkaidu/search
Fix incorrect next device ID in PCI ordered search
2020-04-27 09:54:09 -07:00
Wenkai Du edb49ed2d5 Fix incorrect next device ID in PCI ordered search 2020-04-25 01:01:13 +00:00
saadrahim cc66dd46e9 Enabling CI Testing Again (#192)
Adding CI support based on AMD internal CI refactor.
2020-04-24 10:36:57 -06:00
Gilbert Lee 339bf9ff19 Adding option to re-use streams instead of re-creating per topology 2020-04-23 15:53:40 +00:00
Sylvain Jeaugey f36540f55a Fix crash when only a subset of GPUs are visible within a container.
Fixes #326.
2020-04-17 10:03:14 -07:00
Sylvain Jeaugey 23a9fbb788 Improve robustness of PCI detection
Fallback to default values when class/speed is unknown.
2020-04-16 14:27:50 -07:00
Wenkai Du c017f6e900 Merge pull request #191 from wenkaidu/gfx803
Revert "Temporary disable 0x803 target due to build error"
2020-04-16 09:27:24 -07:00
aokomoriuta a783484ab5 Fix wrong variable name "slice" to "chunk"
https://github.com/NVIDIA/nccl/issues/287
2020-04-14 19:00:51 -07:00
Wenkai Du 5170bd1c02 Revert "Temporary disable 0x803 target due to build error"
This reverts commit cd7ab1425b.
2020-04-14 16:58:41 +00:00
Wenkai Du 3ac98e7d39 Merge pull request #188 from wenkaidu/prim_test
rccl-prim-test: auto-detect rings in 4P and 8P configurations
2020-04-14 09:52:49 -07:00
Wenkai Du ef7064ba9b rccl-prim-test: auto-detect rings in 4P and 8P configurations 2020-04-10 18:17:21 +00:00
Sylvain Jeaugey b5b6c6acdd Fix bug #307 : wrong NIC selection on the reduction tree.
The reduction tree (tree up) was inverting the NICs to use,
causing performance issue in cases where we are using different
NICs on a given channel.
2020-04-09 17:14:07 -07:00
Aaron Enye Shi fa52d4f0aa Merge pull request #187 from aaronenyeshi/fix-hip-vdi-hsa-ext
Fix HIP-Clang build with HSA headers
2020-04-03 19:06:38 -04:00
Aaron Enye Shi a95090d981 Fix HIP-Clang build with HSA headers
HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.
2020-04-03 17:58:23 -04:00
Wenkai Du 3cbe5c8a40 Merge pull request #186 from wenkaidu/v2.6.4
Merge with NCCL 2.6.4
2020-04-02 10:42:01 -07:00
Wenkai Du 6f54b23503 topo_expl: update to 2.6 2020-04-01 13:37:08 -07:00
Wenkai Du fa36fd9ef9 Merge remote-tracking branch 'nccl/master' into v2.6.4_merge 2020-04-01 13:35:12 -07:00
Sylvain Jeaugey 533e3702cf Merge pull request #314 from NVIDIA/v2.6
2.6.4-1
2020-03-26 17:31:24 -07:00
Sylvain Jeaugey b221128eca 2.6.4-1
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.
2020-03-20 14:58:36 -07:00
Rashika Kheria 6c61492eba Check return code for Flush operation
Current NCCL code does not abort for failed Flush operations by
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <rashika@amazon.com>
2020-03-16 20:40:59 -07:00
Wenkai Du ebc823e603 rccl-prim-test: add all-to-all benchmark (#185)
For gfx908, support simple detection of ring topology.
Call ReduceOrCopyMulti directly from kernel.
Also simplify code by removing kernel start synchronization option
which has no effect on throughput measurements.
2020-03-16 10:00:54 -07:00