Commit Graph

135 Commits

Author SHA1 Message Date
Wenkai Du c4edc257b0 Merge remote-tracking branch 'nccl/master' into HEAD 2020-04-27 17:16:54 +00:00
Wenkai Du edb49ed2d5 Fix incorrect next device ID in PCI ordered search 2020-04-25 01:01:13 +00:00
Sylvain Jeaugey f36540f55a Fix crash when only a subset of GPUs are visible within a container.
Fixes #326.
2020-04-17 10:03:14 -07:00
Sylvain Jeaugey 23a9fbb788 Improve robustness of PCI detection
Fall back to default values when class/speed is unknown.
2020-04-16 14:27:50 -07:00
aokomoriuta a783484ab5 Fix wrong variable name "slice" to "chunk"
https://github.com/NVIDIA/nccl/issues/287
2020-04-14 19:00:51 -07:00
Sylvain Jeaugey b5b6c6acdd Fix bug #307 : wrong NIC selection on the reduction tree.
The reduction tree (tree up) was inverting the NICs to use,
causing performance issues in cases where different NICs are
used on a given channel.
2020-04-09 17:14:07 -07:00
Aaron Enye Shi a95090d981 Fix HIP-Clang build with HSA headers
HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.
2020-04-03 17:58:23 -04:00
Wenkai Du 6f54b23503 topo_expl: update to 2.6 2020-04-01 13:37:08 -07:00
Wenkai Du fa36fd9ef9 Merge remote-tracking branch 'nccl/master' into v2.6.4_merge 2020-04-01 13:35:12 -07:00
Sylvain Jeaugey 533e3702cf Merge pull request #314 from NVIDIA/v2.6
2.6.4-1
2020-03-26 17:31:24 -07:00
Sylvain Jeaugey b221128eca 2.6.4-1
Add support for network collectives.
Add support for XML topology dump/injection.
Add text values for GDR and P2P Levels, including "NVL".
Add speed detection for PCI, Infiniband and Ethernet cards.
Add CPU detection for ARM and AMD CPUs.
Add support for adaptive routing on Infiniband.
Change NET plugin API to v3 : merge PCI path and GPU pointer
  capability into a single structure and add other properties.
2020-03-20 14:58:36 -07:00
Rashika Kheria 6c61492eba Check return code for Flush operation
Current NCCL code does not abort when a Flush operation fails in the
underlying network. This may compromise data integrity.

Signed-off-by: Rashika Kheria <rashika@amazon.com>
2020-03-16 20:40:59 -07:00
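
The change described in commit 6c61492eba (propagating a failed flush instead of ignoring it) can be illustrated with a minimal sketch. All names here (`DemoResult`, `demoFlush`, `demoReceive`) are hypothetical stand-ins, not the NCCL network plugin API; the point is only that the receive path returns early when the flush reports an error, so the data is never treated as visible.

```cpp
// Hypothetical result codes; NCCL uses its own ncclResult_t values.
enum DemoResult { demoSuccess = 0, demoSystemError = 1 };

// Hypothetical stand-in for a network flush. It fails when the
// (simulated) network is unhealthy.
DemoResult demoFlush(bool networkHealthy) {
  return networkHealthy ? demoSuccess : demoSystemError;
}

// Receive path that checks the flush result before marking data visible.
DemoResult demoReceive(bool networkHealthy, bool* dataVisible) {
  *dataVisible = false;
  DemoResult res = demoFlush(networkHealthy);
  if (res != demoSuccess) return res;  // propagate the failure to the caller
  *dataVisible = true;                 // only reached after a successful flush
  return demoSuccess;
}
```

A caller would treat any result other than `demoSuccess` as a reason to abort the operation.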
Wenkai Du 0976e47b06 Merge pull request #183 from wenkaidu/dup_rings
Remove condition for ring duplication
2020-03-02 17:12:42 -08:00
Wenkai Du 62dc28bd2e Remove condition for ring duplication
Fix insufficient number of rings on single node after pull #179
2020-03-02 12:55:06 -08:00
Wenkai Du fb59328a7b Check fine grained memory before enabling RDMA
Add back the check that was lost in the 2.5 merge.
2020-03-02 11:18:27 -08:00
Wenkai Du 8b5bc8bca2 Merge pull request #179 from wenkaidu/search
Use fraction of system maxWidth as steps for searching
2020-02-28 11:05:46 -08:00
Wenkai Du 8e73a2ad60 Merge remote-tracking branch 'remotes/nccl/master' 2020-02-27 12:53:03 -08:00
Wenkai Du d2adc61bf6 Revise PCI BW numbers on Rome 2020-02-26 13:17:49 -08:00
Wenkai Du 8391637613 Use fraction of system maxWidth as steps for searching
This reverts previous workaround of deducting only half of width
from paths.
2020-02-26 09:14:35 -08:00
Wenkai Du 077c3cda74 Fix abort handling in LL primitives 2020-02-25 13:42:54 -08:00
Wenkai Du 9b80b3633f Fix system maxSpeed and maxWidth calculation 2020-02-24 15:18:57 -08:00
Wenkai Du f54dc58113 Fix incorrect CR8 detection
Also change level of ring graph print to help debugging
2020-02-21 10:09:49 -08:00
Wenkai Du 5b3856f2ed Merge pull request #172 from wenkaidu/topo_expl
Add topology explorer
2020-02-20 15:16:55 -08:00
Wenkai Du 55f8e2dec7 Add topology explorer 2020-02-19 14:42:06 -08:00
Sylvain Jeaugey c38f174bd4 Fix Allgather operations above 4G with multiple GPUs per process.
Fixes nccl-tests#37.
Direct offsets were still on 32 bits in the low-level primitives.
2020-02-12 11:11:55 -08:00
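
The failure mode named in commit c38f174bd4 (offsets held in 32 bits) can be shown with a small arithmetic sketch. The helper names and the per-rank layout below are assumptions for illustration, not the NCCL primitives: a byte offset computed in 32 bits wraps modulo 2^32 once the total exceeds 4 GiB, while the same product computed in 64 bits does not.

```cpp
#include <cstdint>

// Hypothetical helper: offset of a rank's slot computed in 32 bits.
// Unsigned arithmetic wraps modulo 2^32, so large buffers alias low addresses.
uint32_t demoOffset32(uint32_t rank, uint32_t bytesPerRank) {
  return rank * bytesPerRank;
}

// Same computation widened to 64 bits before multiplying, so it cannot wrap
// for any pair of 32-bit inputs.
uint64_t demoOffset64(uint32_t rank, uint32_t bytesPerRank) {
  return static_cast<uint64_t>(rank) * bytesPerRank;
}
```

With 1.5 GiB per rank (0x60000000 bytes), rank 4 should start at 6 GiB (0x180000000), but the 32-bit version yields 0x80000000, which is 2 GiB.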
Wenkai Du abcfbf1231 Generate 8G6L chordal ring from reference 2020-02-11 22:01:12 +00:00
Wenkai Du d1dae2721d Add ring bandwidth correction factor 2020-01-30 09:52:27 -08:00
Stanley Tsang 20fa04d9b6 Updating copyright notices for 2020. 2020-01-29 15:28:08 -08:00
Wenkai Du 486fd436af Split primitive class to smaller structures 2020-01-29 15:27:23 -08:00
Wenkai Du 1e55645d97 Misc fixes and improvements for 2.5.6
1. Fix RCCL unit test
2. Add ROME detection and tuning
3. Change default P2P level
4. Fix search algorithm for XGMI
5. Replace explicit channel duplication with implicit duplication by using half of link speed
6. Add collective trace support
7. Correct Intel Skylake CPU detection and bandwidth
8. Fix topo connect function
9. Disable GDR read and remove unreachable code
10. Disable LL128 kernels
11. Add tuning parameters
12. Use original clock64() implementation which returns RTC counter value
13. Print out timestamp of collective trace
14. Do not use struct ncclColl in kernel launch parameter
15. Fix abort handling and add tracing
17. Add __launch_bounds__ to kernel functions
18. Remove unused abortCount
19. Unset default MIN_NRINGS and MIN_NCHANNELS
20. Do not allocate shared memory when not using LL128 kernels
21. Correct time print out in tuning log
2020-01-29 15:27:05 -08:00
Luke Yeager 7a18fe0784 [topology] remove NET links when trimming system
This fixes a memory leak.
2020-01-07 13:29:57 -08:00
Christian Sigg 3899f6e0f2 Fix clang build (#274)
The attribute is called `optnone`, not `noopt`.
2019-12-09 09:31:13 -08:00
Sylvain Jeaugey aa15dfb29c Fix clang compilation 2019-12-06 09:55:54 -08:00
Christian Sigg 8c564e9b57 Fix clang build (#271)
Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.
2019-12-06 09:14:55 -08:00
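
The macro approach described in commit 8c564e9b57 can be sketched as follows. The macro name `DEMO_NO_OPTIMIZE` and the function `demoSum` are hypothetical and not necessarily what NCCL uses. The Clang branch uses `optnone`, the attribute name given in the follow-up fix (#274). Clang also defines `__GNUC__`, so `__clang__` has to be tested first.

```cpp
// Hypothetical macro: pick the per-compiler "do not optimize" attribute.
#if defined(__clang__)
#define DEMO_NO_OPTIMIZE __attribute__((optnone))
#elif defined(__GNUC__)
#define DEMO_NO_OPTIMIZE __attribute__((optimize("O0")))
#else
#define DEMO_NO_OPTIMIZE  // unknown compiler: expand to nothing
#endif

// Example function compiled without optimization on GCC and Clang.
DEMO_NO_OPTIMIZE int demoSum(int n) {
  int total = 0;
  for (int i = 1; i <= n; ++i) total += i;
  return total;
}
```

The attribute changes only how the function is compiled, not what it returns, so callers do not need to know which branch was taken.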
Wenkai Du 6648c81dc6 Merge remote-tracking branch 'remotes/nccl/master' into rccl_2.5.6 2019-12-03 15:42:04 -08:00
Wenkai Du a0be2b8812 Disable direct buffers to reduce scratch memory size 2019-11-20 13:03:16 -08:00
Sylvain Jeaugey 299c554dcc 2.5.6-1 (#255)
Add LL128 Protocol.

Rewrite the topology detection and tree/ring creation (#179). Improve
tree performance by sending/receiving from different GPUs. Add
model-based tuning to switch between the different algorithms and
protocols.

Rework P2P/SHM detection in containers (#155, #248).

Detect duplicated devices and return an error (#231).

Add tuning for GCP
2019-11-19 14:57:39 -08:00
Wenkai Du 5e109ed400 Add bfloat16 support in RCCL
Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator
2019-11-18 13:45:53 -08:00
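
A feature-indicator symbol such as `RCCL_BFLOAT16` is typically consumed with a preprocessor check. The sketch below assumes the symbol would be defined by the RCCL header when bfloat16 is supported; it does not include that header, so the symbol is undefined here and the fallback branch is taken. The function name `demoHasBfloat16` is hypothetical.

```cpp
// Reports at compile time whether the RCCL_BFLOAT16 feature indicator is set.
// Assumption: the real definition would come from including the RCCL header.
constexpr bool demoHasBfloat16() {
#ifdef RCCL_BFLOAT16
  return true;   // bfloat16 code paths can be compiled in
#else
  return false;  // fall back to other data types
#endif
}
```

Application code can guard its bfloat16 paths with the same `#ifdef` so that it still builds against library versions that lack the feature.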
Wenkai Du 8995047830 Correct RTC frequencies for profiling purposes 2019-11-05 11:36:45 -08:00
Wenkai Du 669f1951a4 Check for fine grain support using memory allocation 2019-11-01 15:58:49 -07:00
Jeff Daily 5a502955c9 additional check for fine grain support in p2pCanConnect (#146) 2019-10-31 08:58:38 -07:00
Wenkai Du 296176a4fd Disable HDP flush for RDMA 2019-10-23 14:40:17 -07:00
Wenkai Du df74d12946 Revert collective chunk and slice steps to avoid drop in throughput 2019-10-18 12:54:00 -07:00
Gilbert Lee 37603ae6cb Reverting GenericOp bug workaround modifications to slice/chunk steps 2019-10-11 09:20:10 -07:00
Gilbert Lee 1392dd2997 Performing __threadfence_system() with only first thread 2019-10-11 09:16:19 -07:00
Gilbert Lee 8ae1bce3bb Fix for GenericOp device primitive bug 2019-10-10 22:39:45 -07:00
Wenkai Du 062c798c86 Merge pull request #136 from wenkaidu/tree
Enable tree kernels in build
2019-10-09 10:58:52 -07:00
Wenkai Du 76976c9e2e Enable tree kernels in build
Need to tune and specify NCCL_TREE_THRESHOLD to allow usage
2019-10-08 23:20:11 +00:00
Changpeng Fang eec319038e Tuning the inline and unroll to reduce the scratch usage
Summary:
 1. remove the noinline attribute for AllReduceTreeKernel;
 2. change AUTOUNROLL for tree functions to 1 or 2;
 Combining 1 and 2 will reduce the scratch usage from 1256 to 952
2019-10-08 14:02:25 -07:00
Wenkai Du 61ef1dcad5 Only generate kernels for sum and copy 2019-09-24 17:01:12 -07:00