rocm-systems

Author	SHA1	Message	Date
Wenkai Du	c4edc257b0	Merge remote-tracking branch 'nccl/master' into HEAD	2020-04-27 17:16:54 +00:00
Wenkai Du	edb49ed2d5	Fix incorrect next device ID in PCI ordered search	2020-04-25 01:01:13 +00:00
Sylvain Jeaugey	f36540f55a	Fix crash when only a subset of GPUs are visible within a container. Fixes #326.	2020-04-17 10:03:14 -07:00
Sylvain Jeaugey	23a9fbb788	Improve robustness of PCI detection Fallback to default values when class/speed is unknown.	2020-04-16 14:27:50 -07:00
aokomoriuta	a783484ab5	Fix wrong variable name "slice" to "chunk" https://github.com/NVIDIA/nccl/issues/287	2020-04-14 19:00:51 -07:00
Sylvain Jeaugey	b5b6c6acdd	Fix bug #307 : wrong NIC selection on the reduction tree. The reduction tree (tree up) was inverting the NICs to use, causing performance issue in cases where we are using different NICs on a given channel.	2020-04-09 17:14:07 -07:00
Aaron Enye Shi	a95090d981	Fix HIP-Clang build with HSA headers HIP-Clang does not include these HSA headers, and they need to be explicitly added in RCCL.	2020-04-03 17:58:23 -04:00
Wenkai Du	6f54b23503	topo_expl: update to 2.6	2020-04-01 13:37:08 -07:00
Wenkai Du	fa36fd9ef9	Merge remote-tracking branch 'nccl/master' into v2.6.4_merge	2020-04-01 13:35:12 -07:00
Sylvain Jeaugey	533e3702cf	Merge pull request #314 from NVIDIA/v2.6 2.6.4-1	2020-03-26 17:31:24 -07:00
Sylvain Jeaugey	b221128eca	2.6.4-1 Add support for network collectives. Add support for XML topology dump/injection. Add text values for GDR and P2P Levels, including "NVL". Add speed detection for PCI, Infiniband and Ethernet cards. Add CPU detection for ARM and AMD CPUs. Add support for adaptive routing on Infiniband. Change NET plugin API to v3 : merge PCI path and GPU pointer capability into a single structure and add other properties.	2020-03-20 14:58:36 -07:00
Rashika Kheria	6c61492eba	Check return code for Flush operation Current NCCL code does not abort for failed Flush operations by underlying network. This may compromise data integrity. Signed-off-by: Rashika Kheria <rashika@amazon.com>	2020-03-16 20:40:59 -07:00
Wenkai Du	0976e47b06	Merge pull request #183 from wenkaidu/dup_rings Remove condition for ring duplication	2020-03-02 17:12:42 -08:00
Wenkai Du	62dc28bd2e	Remove condition for ring duplication Fix insufficient number of rings on single node after pull #179	2020-03-02 12:55:06 -08:00
Wenkai Du	fb59328a7b	Check fine grained memory before enabling RDMA Adding back the check which was lost from 2.5 merge.	2020-03-02 11:18:27 -08:00
Wenkai Du	8b5bc8bca2	Merge pull request #179 from wenkaidu/search Use fraction of system maxWidth as steps for searching	2020-02-28 11:05:46 -08:00
Wenkai Du	8e73a2ad60	Merge remote-tracking branch 'remotes/nccl/master'	2020-02-27 12:53:03 -08:00
Wenkai Du	d2adc61bf6	Revise PCI BW numbers on Rome	2020-02-26 13:17:49 -08:00
Wenkai Du	8391637613	Use fraction of system maxWidth as steps for searching This reverts previous workaround of deducting only half of width from paths.	2020-02-26 09:14:35 -08:00
Wenkai Du	077c3cda74	Fix abort handling in LL primitives	2020-02-25 13:42:54 -08:00
Wenkai Du	9b80b3633f	Fix system maxSpeed and maxWidth calculation	2020-02-24 15:18:57 -08:00
Wenkai Du	f54dc58113	Fix incorrect CR8 detection Also change level of ring graph print to help debugging	2020-02-21 10:09:49 -08:00
Wenkai Du	5b3856f2ed	Merge pull request #172 from wenkaidu/topo_expl Add topology explorer	2020-02-20 15:16:55 -08:00
Wenkai Du	55f8e2dec7	Add topology explorer	2020-02-19 14:42:06 -08:00
Sylvain Jeaugey	c38f174bd4	Fix Allgather operations above 4G with multiple GPUs per process. Fixes nccl-tests#37. Direct offsets were still on 32 bits in the low-level primitives.	2020-02-12 11:11:55 -08:00
Wenkai Du	abcfbf1231	Generate 8G6L chordal ring from reference	2020-02-11 22:01:12 +00:00
Wenkai Du	d1dae2721d	Add ring bandwidth correction factor	2020-01-30 09:52:27 -08:00
Stanley Tsang	20fa04d9b6	Updating copyright notices for 2020.	2020-01-29 15:28:08 -08:00
Wenkai Du	486fd436af	Split primitive class to smaller structures	2020-01-29 15:27:23 -08:00
Wenkai Du	1e55645d97	Misc fixes and improvements for 2.5.6 1. Fix RCCL unit test 2. Add ROME detection and tuning 3. Change default P2P level 4. Fix search algorithm for XGMI 5. Remove explicit channel duplication with implicit by using half of link speed 6. Add collective trace support 7. Correct Intel Skylake CPU detection and bandwidth 8. Fix topo connect function 9. Disable GDR read and remove unreachable code 10. Disable LL128 kernels 11. Add tuning parameters 12. Use original clock64() implementation which returns RTC counter value 13. Print out timestamp of collective trace 14. Do not use struct ncclColl in kernel launch parameter 15. Fix abort handling and add tracing 17. Add __launch_bounds__ to kernel functions 18. Remove unused abortCount 19. Unset default MIN_NRINGS and MIN_NCHANNELS 20. Do not allocate shared memory when not using LL128 kernels 21. Correct time print out in tuning log	2020-01-29 15:27:05 -08:00
Luke Yeager	7a18fe0784	[topology] remove NET links when trimming system This fixes a memory leak.	2020-01-07 13:29:57 -08:00
Christian Sigg	3899f6e0f2	Fix clang build (#274) The attribute is called `optnone`, not `noopt`.	2019-12-09 09:31:13 -08:00
Sylvain Jeaugey	aa15dfb29c	Fix clang compilation	2019-12-06 09:55:54 -08:00
Christian Sigg	8c564e9b57	Fix clang build (#271) Clang doesn't understand `optimize("O0")`. It has `noopt`, which GCC doesn't understand. Wrap the difference in a macro.	2019-12-06 09:14:55 -08:00
Wenkai Du	6648c81dc6	Merge remote-tracking branch 'remotes/nccl/master' into rccl_2.5.6	2019-12-03 15:42:04 -08:00
Wenkai Du	a0be2b8812	Disable direct buffers to reduce scratch memory size	2019-11-20 13:03:16 -08:00
Sylvain Jeaugey	299c554dcc	2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP	2019-11-19 14:57:39 -08:00
Wenkai Du	5e109ed400	Add bfloat16 support in RCCL Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator	2019-11-18 13:45:53 -08:00
Wenkai Du	8995047830	Correct RTC frequencies for profiling purpose	2019-11-05 11:36:45 -08:00
Wenkai Du	669f1951a4	Check for fine grain support using memory allocation	2019-11-01 15:58:49 -07:00
Jeff Daily	5a502955c9	additional check for fine grain support in p2pCanConnect (#146)	2019-10-31 08:58:38 -07:00
Wenkai Du	296176a4fd	Disable HDP flush for RDMA	2019-10-23 14:40:17 -07:00
Wenkai Du	df74d12946	Revert collective chunk and slice steps to avoid drop in throughput	2019-10-18 12:54:00 -07:00
Gilbert Lee	37603ae6cb	Reverting GenericOp bug workaround modifications to slice/chunk steps	2019-10-11 09:20:10 -07:00
Gilbert Lee	1392dd2997	Performing __threadfence_system() with only first thread	2019-10-11 09:16:19 -07:00
Gilbert Lee	8ae1bce3bb	Fix for GenericOp device primitive bug	2019-10-10 22:39:45 -07:00
Wenkai Du	062c798c86	Merge pull request #136 from wenkaidu/tree Enable tree kernels in build	2019-10-09 10:58:52 -07:00
Wenkai Du	76976c9e2e	Enable tree kernels in build Need to tune and specify NCCL_TREE_THRESHOLD to allow usage	2019-10-08 23:20:11 +00:00
Changpeng Fang	eec319038e	Tuning the inline and unroll to reduce the scratch usage Summary: 1. remove the noinline attribute for AllReduceThreeKernel; 2. change AUTOUNROLL for tree functions to 1 or 2; Combining 1 and 2 will reduce the scratch usage from 1256 to 952	2019-10-08 14:02:25 -07:00
Wenkai Du	61ef1dcad5	Only generate kernels for sum and copy	2019-09-24 17:01:12 -07:00

135 Commits