rocm-systems

Author	SHA1	Message	Date
Wenkai Du	d1dae2721d	Add ring bandwidth correction factor	2020-01-30 09:52:27 -08:00
Stanley Tsang	20fa04d9b6	Updating copyright notices for 2020.	2020-01-29 15:28:08 -08:00
Wenkai Du	486fd436af	Split primitive class to smaller structures	2020-01-29 15:27:23 -08:00
Wenkai Du	1e55645d97	Misc fixes and improvements for 2.5.6 1. Fix RCCL unit test 2. Add ROME detection and tuning 3. Change default P2P level 4. Fix search algorithm for XGMI 5. Remove explicit channel duplication with implicit by using half of link speed 6. Add collective trace support 7. Correct Intel Skylake CPU detection and bandwidth 8. Fix topo connect function 9. Disable GDR read and remove unreachable code 10. Disable LL128 kernels 11. Add tuning parameters 12. Use original clock64() implementation which returns RTC counter value 13. Print out timestamp of collective trace 14. Do not use struct ncclColl in kernel launch parameter 15. Fix abort handling and add tracing 17. Add __launch_bounds__ to kernel functions 18. Remove unused abortCount 19. Unset default MIN_NRINGS and MIN_NCHANNELS 20. Do not allocate shared memory when not using LL128 kernels 21. Correct time print out in tuning log	2020-01-29 15:27:05 -08:00
Wenkai Du	6648c81dc6	Merge remote-tracking branch 'remotes/nccl/master' into rccl_2.5.6	2019-12-03 15:42:04 -08:00
Wenkai Du	a0be2b8812	Disable direct buffers to reduce scratch memory size	2019-11-20 13:03:16 -08:00
Sylvain Jeaugey	299c554dcc	2.5.6-1 (#255) Add LL128 Protocol. Rewrite the topology detection and tree/ring creation (#179). Improve tree performance by sending/receiving from different GPUs. Add model-based tuning to switch between the different algorithms and protocols. Rework P2P/SHM detection in containers (#155, #248). Detect duplicated devices and return an error (#231). Add tuning for GCP	2019-11-19 14:57:39 -08:00
Wenkai Du	5e109ed400	Add bfloat16 support in RCCL Preprocessor symbol RCCL_BFLOAT16 is used as feature indicator	2019-11-18 13:45:53 -08:00
Wenkai Du	8995047830	Correct RTC frequencies for profiling purpose	2019-11-05 11:36:45 -08:00
Wenkai Du	669f1951a4	Check for fine grain support using memory allocation	2019-11-01 15:58:49 -07:00
Jeff Daily	5a502955c9	additional check for fine grain support in p2pCanConnect (#146)	2019-10-31 08:58:38 -07:00
Wenkai Du	296176a4fd	Disable HDP flush for RDMA	2019-10-23 14:40:17 -07:00
Wenkai Du	df74d12946	Revert collective chunk and slice steps to avoid drop in throughput	2019-10-18 12:54:00 -07:00
Gilbert Lee	37603ae6cb	Reverting GenericOp bug workaround modifications to slice/chunk steps	2019-10-11 09:20:10 -07:00
Gilbert Lee	1392dd2997	Performing __threadfence_system() with only first thread	2019-10-11 09:16:19 -07:00
Gilbert Lee	8ae1bce3bb	Fix for GenericOp device primitive bug	2019-10-10 22:39:45 -07:00
Wenkai Du	062c798c86	Merge pull request #136 from wenkaidu/tree Enable tree kernels in build	2019-10-09 10:58:52 -07:00
Wenkai Du	76976c9e2e	Enable tree kernels in build Need to tune and specify NCCL_TREE_THRESHOLD to allow usage	2019-10-08 23:20:11 +00:00
Changpeng Fang	eec319038e	Tuning the inline and unroll to reduce the scratch usage Summary: 1. remove the noinline attribute for AllReduceThreeKernel; 2. change AUTOUNROLL for tree functions to 1 or 2; Combining 1 and 2 will reduce the scratch usage from 1256 to 952	2019-10-08 14:02:25 -07:00
Wenkai Du	61ef1dcad5	Only generate kernels for sum and copy	2019-09-24 17:01:12 -07:00
Gilbert Lee	86ce0a93b5	RDMA HDP flush fix	2019-09-06 16:35:55 +00:00
Gilbert Lee	3e6b326a19	Revert "Set RDMA default to off state" This reverts commit `0f16ad966a`.	2019-09-05 18:16:53 +00:00
Wenkai Du	8c975353ed	Allocate opCount in pinned host memory for P2P transport To avoid remote P2P read access when checking remote GPU's opCount	2019-08-29 10:22:09 -07:00
Wenkai Du	0f16ad966a	Set RDMA default to off state	2019-08-26 10:59:33 -07:00
Wenkai Du	6759660529	Merge pull request #125 from wenkaidu/fix_nvml_id Assign unused nmvlDev to avoid random number	2019-08-19 09:08:13 -07:00
Wenkai Du	86efdfc3b5	Assign unused nmvlDev to avoid random number	2019-08-16 16:34:14 -07:00
Wenkai Du	7c38da0939	Merge remote-tracking branch 'remotes/nccl/master' into HEAD	2019-08-16 16:13:34 -07:00
Wenkai Du	1faededc03	Tune AUTOUNROLL for better performance Also remove all unused UNROLL defines	2019-08-16 10:34:53 -07:00
Michael LIAO	9369f8d75d	Fix build with hip-clang. - Add necessary function attribute for HIP programming model. - Explicitly include hsa headers.	2019-08-15 14:56:04 -04:00
Wenkai Du	2223cccf15	Tune LL threshold for VEGA Also move abort check after SPINS_BEFORE_CHECK_ABORT as NCCL	2019-08-15 09:16:11 -07:00
Wenkai Du	4b77a16f3f	Default to minimal 2 rings and improve LL loop	2019-08-14 14:12:56 -07:00
Wenkai Du	5782a8d857	Remove duplicate line	2019-08-14 13:22:43 -07:00
Wenkai Du	f11c8f60cd	RCCL 2.4 update	2019-08-14 10:42:35 -07:00
David Addison	fad079a8ae	Updated PR#196 to use a common hash function	2019-08-14 10:08:39 -07:00
David Addison	01d1836668	Merge branch 'shm' of git://github.com/lowintelligence/nccl into lowintelligence-shm	2019-08-14 09:45:45 -07:00
David Addison	7f2b337e70	Make use of SO_REUSEPORT conditional Fixes: #244 SO_REUSEPORT was introduced in Linux 3.9 and later. This change allows NCCL to compile against older releases. The functionality is only required if the user is specifying a NCCL bootstrap address via an environment variable.	2019-08-13 16:32:07 -07:00
Ke Wen	4d579e51cc	Fix NIC distances for 11+ NICs	2019-07-17 06:32:33 -07:00
Ke Wen	920ae57c14	Fix #224: prevent number of IB devices from going out of bound	2019-07-17 06:32:33 -07:00
Ke Wen	c8c68fb5f7	Size up IPC buffers to multiples of 2MB Avoid potential CUDA error in concurrent communicator initialization	2019-07-12 09:50:17 -07:00
Hirochika Asai	0b192d2299	Add the exact matching modifier support "=" to the NCCL_IB_HCA variable (#236) Perform exact matching when the prefix "=" is specified in the NCCL_IB_HCA variable to exclude HCAs mlx5_X[0-9]+ when mlx5_X is specified.	2019-07-09 14:45:41 -07:00
Ke Wen	8e04d80382	Merge branch 'master' into HEAD	2019-06-25 13:39:08 -07:00
Ke Wen	7c72dee660	2.4.8-1 Fix #209: improve socket transport performance Split transfers over multiple sockets Launch multiple threads to drive sockets Detect AWS NICs and set nsockets/nthreads accordingly	2019-06-25 13:22:47 -07:00
Felix Abecassis	37e4f8729e	Fix out-of-bounds read in ncclStrToCpuset (#233) The affinityStr string was not null-terminated but was passed to strlen(3). Signed-off-by: Felix Abecassis <fabecassis@nvidia.com>	2019-06-21 10:25:08 +02:00
David Addison	0ceaec9cee	NCCL 2.4.7-1 Performance tweaks for PowerPC builds only; Set default NCCL_MIN_NRINGS to 4 Disable PCI-E NUMA distance detection	2019-05-10 13:52:16 -07:00
jakirkham	60a586ded9	Allow CUDA runtime library selection (#220) Makes a change to allow the user to select between the static CUDA runtime library (default) and the dynamic CUDA runtime library. Does this by allowing `CUDARTLIB` to be overridden.	2019-05-07 17:35:14 -07:00
Gustavo Alvarez	9db4b1d801	Add pkgconfig file (#190)	2019-04-08 09:16:54 -07:00
David Addison	f40ce73e89	NCCL 2.4.6-1 Added detection of IBM/Power NVLink bridge device. Add NUMA support to PCI distance calculations. Added NCCL_IGNORE_CPU_AFFINITY env var. Fix memory leaks; GithubIssue#180 Compiler warning fix; GithubIssue#178 Replace non-standard variable length arrays. GithubIssue#171 Fix Tree+Shared Memory crash. GithubPR#185 Fix LL cleanup hang during long running DL jobs. Fix NCCL_RINGS environment variable handling. Added extra checks to catch repeat calls to ncclCommDestroy() GithubIssue#191 Improve bootstrap socket connection reliability at scale. Fix hostname hashing issue. GithubIssue#187 Code cleanup to rename all non device files from .cu to .cc	2019-04-05 13:05:45 -07:00
Cao Zongyan	161763aab2	Fix shared memory collision in the multi-communicator case. The current SHM object name uses only pidHash and ranks as identification, so names collide when a program runs with multiple communicators. This adds commId info into pidHash, so that the pidHash values of different communicators within the same process are distinct from each other.	2019-03-15 12:50:32 +08:00
Rong Ou	14e0cf644b	Fix crash during shared memory creation (#185) The shared memory filename was only based on the destination. While this was OK for rings since only one rank would send data to a given rank, it would crash with trees because they communicate in both directions. Co-authored-by: Rong Ou <rong.ou@gmail.com>	2019-03-04 11:42:47 -08:00
Sylvain Jeaugey	1450d42675	2.4.2-1 Add tree algorithms for allreduce to improve performance at scale. Add ncclCommAbort() and ncclCommGetAsyncError() to properly handle network errors and permit recovery. Detect initial CPU affinity and no longer escape it.	2019-01-29 15:19:27 -08:00


105 commits