rocm-systems

Author	SHA1	Message	Date
gilbertlee-amd	caba0a63d2	Fixing clique-topology detection (#342) * Fixing clique-topology detection * Fix to enable multi-process clique-based kernels	2021-04-07 11:29:44 -06:00
Wenkai Du	e26ad2995e	Cleanup number of channels calculation (#340)	2021-04-05 17:51:56 -07:00
Wenkai Du	17491c918e	Fix incorrect net counting (#339) * Fix incorrect net counting * Add comments	2021-04-05 12:21:57 -07:00
Wenkai Du	1d2946ee4b	Rework network port trimming code (#338) * Rework network port trimming code * Move Rome related changes to separate source files	2021-03-31 10:25:59 -07:00
Wenkai Du	0c78553ee0	Check fine grained memory on peer GPU before enabling P2P (#337)	2021-03-30 09:06:39 -07:00
Wenkai Du	d87dc7c2e8	collnet: support multiple NICs (#335)	2021-03-25 20:59:32 -07:00
Stanley Tsang	289db2a636	Fixing message queue leak. (#331)	2021-03-25 19:11:43 -06:00
Wenkai Du	0fbb9510a5	Remove HDP workaround for ROCm 4.2 HIP (#334)	2021-03-23 20:11:37 -07:00
Wenkai Du	1d6244b18d	Enable collnet in RCCL (#333) * Enable CollNet and use different number of channels * topo_expl: enable collnet	2021-03-19 12:58:13 -07:00
Wenkai Du	b46260260a	Sort GPUs by HIP device ID (#329) * Sort GPUs by HIP device ID * Remove extra space	2021-03-16 16:51:32 -07:00
Wenkai Du	f60b76c67a	Add GPU memory usage tracker (#326)	2021-03-06 20:32:30 -08:00
Wenkai Du	8e180cf087	Revert "Port alltoall[v]" (#325) This reverts commit `f4d5d3d620`.	2021-03-06 13:59:31 -08:00
Wenkai Du	c018edf0f2	Enable local sendrecv over network if GDR is available on all GPUs (#324)	2021-03-05 19:59:41 -08:00
gilbertlee-amd	f4a9b9acba	Adding pthread_join / pthread_detach to clean up pthreads to avoid leaks (#322)	2021-02-26 16:29:55 -07:00
Wenkai Du	e820a943e9	Update tuning parameters for XGMI and NET	2021-02-23 21:41:26 +00:00
Wenkai Du	ec8d89b1dd	Match NBIO only when GPUs and NICs are directly connected to CPU	2021-02-22 18:52:29 -05:00
Stanley Tsang	45f5255f7c	Fixing cache deletion for CliqueManager; updating copyright	2021-02-19 22:22:46 +00:00
Wenkai Du	95f178324c	Add support to another Rome model	2021-02-18 02:00:31 +00:00
Wenkai Du	c985358e11	Merge remote-tracking branch 'nccl/master' into 2.8.3	2021-02-15 18:44:47 -05:00
Wenkai Du	bf8eb40705	Move HDP flush to CPU	2021-02-12 18:06:19 +00:00
Sylvain Jeaugey	911d61f214	2.8.4-1 Fix hang in corner cases of alltoallv using point to point send/recv. Harmonize error messages. Fix missing NVTX section in the license. Update README.	2021-02-09 15:36:48 -08:00
Wenkai Du	9cc3b56166	Fix GDRDMA read and remove unused files	2021-02-09 01:34:39 +00:00
Stanley Tsang	d00b7d17bd	Update MP UT to support arbitrary # of GPUs; multiple bugfixes (#16) * Fixing temp file creation/deletion for Clique kernel mode. * Refactoring of MP unit tests; include bugfixes and general support for any number of GPUs * GroupCall MP UT properly quits when too many devices specified * MP UT will programmatically set NCCL_COMM_ID if not specified; updated install script	2021-02-05 16:49:25 -08:00
Wenkai Du	ab1e7a0318	Merge remote-tracking branch 'origin/develop' into 2.8.3	2021-02-04 20:02:34 -05:00
gilbertlee-amd	1990ffd76a	Tuning some clique-based kernel parameters (#315)	2021-02-03 20:00:08 -07:00
Wenkai Du	5f97122442	Enable GPU direct RDMA read from GPU	2021-02-03 02:48:30 +00:00
gilbertlee-amd	3e62ceddc5	Clique kernel support (#295) (#15) * Adding experimental clique-based kernels (opt-in only) Co-authored-by: Stanley Tsang <stanley.tsang@amd.com> Co-authored-by: Gilbert Lee <gilbert.lee@amd.com> Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com> Co-authored-by: Stanley Tsang <stanley.tsang@amd.com> Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>	2021-01-28 09:45:01 -07:00
Wenkai Du	41e47a36e7	Use less unroll for clique kernels (#313)	2021-01-15 17:48:10 -08:00
Wenkai Du	2ddbe6646b	Improve collective trace	2021-01-14 19:28:01 -05:00
Wenkai Du	f4d5d3d620	Port alltoall[v]	2021-01-14 19:28:01 -05:00
Wenkai Du	105db19a11	Do not allow GPU as intermediate	2021-01-14 19:28:01 -05:00
Wenkai Du	e055229e56	Revert "Changes to topology based on XGMI (#272)" This reverts commit `01bd2573db`.	2021-01-14 19:28:01 -05:00
Wenkai Du	d469947641	Merge remote-tracking branch 'nccl/master' into no-target-id	2021-01-14 19:27:53 -05:00
Jonas Zhou	3996562690	x86: Add CPU detection for Zhaoxin processors Signed-off-by: Jonas Zhou <JonasZhou@zhaoxin.com>	2020-12-17 11:15:18 -08:00
Wenkai Du	373a108516	Fix Rome PCIe 2 node topology generation (#310)	2020-12-15 17:16:17 -08:00
Wenkai Du	975b14dffa	Add Rome model and improve search (#305)	2020-11-17 14:55:06 -08:00
Sylvain Jeaugey	920dbe5b35	2.8.3-1 Optimization for Tree allreduce on A100. Improve aggregation performance. Use shared buffers for inter-node send/recv. Add NVTX profiling hooks. Accelerate alltoall connections by merging communication for all channels. Add support for one hop communication through NVLink, for faster send/recv communication on cubemesh topologies like DGX-1. Improve alltoall scheduling to better balance intra/inter node communication. Increase send/recv parallelism by 8x, each warp sending or receiving to a different peer. Net: move to v4. Net: make flush operation asynchronous to accelerate alltoall. Net: define maximum number of requests. Fix hang when using LL128 protocol after 2^31 steps. Fix #379: topology injection failing when using less GPUs than described in the XML. Fix #394: protocol mismatch causing hangs or crashes when using one GPU per node.	2020-11-17 11:08:52 -08:00
Wenkai Du	554729079d	Use device's link width and speed if port doesn't report (#304)	2020-11-13 17:58:04 -08:00
Stanley Tsang	2958f7eace	Fixing IPC handle leak (#302)	2020-11-13 10:32:42 -07:00
gilbertlee-amd	c8d08a7c2f	Adding RCCL_CLIQUE_DEBUG to help debug experimental clique feature (#300)	2020-11-13 09:07:11 -07:00
Wenkai Du	4e68229c8b	Skip unused peer connection in scatter and gather (#301)	2020-11-12 15:47:34 -08:00
gilbertlee-amd	41bcfb8878	Clique kernel support (#295) * Adding experimental clique-based kernels (opt-in only) Co-authored-by: Stanley Tsang <stanley.tsang@amd.com> Co-authored-by: Gilbert Lee <gilbert.lee@amd.com> Co-authored-by: Wenkai Du <43822138+wenkaidu@users.noreply.github.com>	2020-11-10 15:44:10 -07:00
Wenkai Du	2e8b3a0857	Use ncclSend/ncclRecv for alltoall type of collectives as default (#297)	2020-11-09 11:23:17 -08:00
Wenkai Du	709b7e4880	Improve GPU direct RDMA handling on Rome (#294)	2020-11-03 14:29:08 -08:00
Wenkai Du	dfa3c41ede	Add more Rome models (#292)	2020-10-30 21:26:04 -07:00
xietingwew	084207e685	fix proxyArgs for trace log	2020-10-21 09:18:40 -07:00
Wenkai Du	dcad0ef7cb	Fix incorrect pointer checking for scatter and gather (#285)	2020-10-19 13:27:09 -07:00
Wenkai Du	c835d8263a	Merge remote-tracking branch 'nccl/master' into nccl_sync	2020-10-15 18:42:38 -04:00
gilbertlee-amd	84a2541e01	Revert "Initial support for clique-based kernels (#276)" (#280) This reverts commit `2b8184808d`.	2020-10-15 11:30:18 -07:00
Sylvain Jeaugey	0e14394c5f	Fix affinity move	2020-10-13 16:58:05 -07:00


247 commits