rocm-systems

Author	SHA1	Message	Date
Wenkai Du	d601c4909c	Merge pull request #685 from ROCmSoftwarePlatform/2.16.5 Sync up to NCCL 2.16.5	2023-02-22 10:29:02 -08:00
Wenkai Du	1c166046a2	Add back __syncthreads() in barrier and adjust stack size (#688)	2023-02-18 08:50:31 -08:00
Ziyue Yang	f4bf47f325	NPKit: improve clock calibration and fix GPU clock API (#683) * Improve clock calibration in NPKit * Improve gfx macro * Fix macro	2023-02-17 12:26:57 -07:00
Wenkai Du	aee7b42bb8	Merge remote-tracking branch 'nccl/master' into HEAD	2023-02-14 17:14:13 -08:00
Wenkai Du	f7a456122c	Remove workaround and use indirect function call (#684)	2023-02-14 13:59:48 -08:00
Wenkai Du	39534e8724	Add HIP event optimization and remove special code for gfx90a	2023-02-10 16:46:01 +00:00
Wenkai Du	e1cb45ff22	Merge remote-tracking branch 'nccl/master' into HEAD	2023-02-04 01:44:43 +00:00
Sylvain Jeaugey	f3d5166783	2.16.5-1 Add support for 400Gbit NDR network adapters (CX7) Handle EINTR in socket poll() function Add NCCL_PROGRESS_APPENDOP_FREQ to control op append overhead Resource cleanup fixes Fix double free in case of init failure Fix crash in ncclCommAbort Revert AMD speed commit	2023-02-02 12:52:47 -08:00
Ziyue Yang	adafc0f759	Add MSCCL Support (#658) * Add MSCCL support * Add alignment and message size checking * Fix nRanks checking, in-place and out-of-place tests and group call handling * Fix hipGraph unit test * Change MSCCL init warning to INFO * Revise license info	2022-12-12 15:51:04 -08:00
Sylvain Jeaugey	28189e2df8	2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments.	2022-11-30 02:31:59 -08:00
Wenkai Du	9cb72a3d0f	Fix collective trace timestamp format (#647)	2022-11-21 08:11:12 -08:00
Wenkai Du	562dd87036	Move hipify to cmake stage Add minimal ROCm/HIP version requirements for Graph support	2022-11-14 18:10:45 +00:00
Wenkai Du	9a077e6947	Merge remote-tracking branch 'nccl/master' into develop	2022-11-03 21:17:42 +00:00
Sylvain Jeaugey	cb111f764a	2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs.	2022-10-25 00:55:55 -07:00
Wenkai Du	4f0e223db4	Merge remote-tracking branch 'nccl/master' into develop	2022-10-20 15:41:29 +00:00
Edgar Gabriel	e645b02cd8	introduce a hw topology aware bintree for hayabusa architecture.	2022-10-03 15:26:21 +00:00
Sylvain Jeaugey	da8152e57a	2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities.	2022-09-27 02:31:13 -07:00
Sylvain Jeaugey	ecab28a7c9	Fix potential deadlock during init in multi-thread mode. Make sure all calls calling cudaMalloc (including devCommSetup) are called before the last bootstrapBarrier. That way, we avoid calls to cudaMalloc be blocked by a NCCL kernel launched on another GPU by another thread which completed init faster. Resolve #623.	2022-09-26 02:13:10 -07:00
Edgar Gabriel	8f3219dbd4	make binary tree work on 2.13.4	2022-09-15 00:01:54 +00:00
Edgar Gabriel	e5d2dfed34	Update init.cc	2022-09-13 17:29:32 -05:00
Edgar Gabriel	be935d7ce7	Merge branch 'develop' into 2.13.4	2022-09-13 17:19:04 -05:00
Edgar Gabriel	65e2ae20e5	add binary tree In addition, introduce the ability to have 2 trees at the same time. Only for allreduce at the moment.	2022-09-13 20:52:32 +00:00
Gilbert Lee	009e79623f	Merge branch 'develop' into 2.13.4	2022-09-09 23:07:04 +00:00
gilbertlee-amd	dd56135a9a	Updating stream caching (#614) - Adding non-captured hipStream for use in setup	2022-09-09 16:30:15 -06:00
Wenkai Du	a79d9e3586	Merge remote-tracking branch 'nccl/master' into develop	2022-09-09 16:05:38 +00:00
Wenkai Du	7bbce085cc	Enable LL128 protocol support (#605) * Enable LL128 protocol support * Use shared memory object directly when possible	2022-09-08 14:45:27 -07:00
gilbertlee-amd	47b2fc3a30	Adding opt-in hipGraph support for RCCL via RCCL_ENABLE_HIPGRAPH (#608) Adding opt-in hipGraph support via RCCL_ENABLE_HIPGRAPH	2022-09-06 10:29:46 -06:00
Edgar Gabriel	4141ec1151	fix channelcount for multi-rank scenario	2022-08-22 19:09:22 +00:00
Wenkai Du	14b8ff153f	Repurpose profiling implementation to simple timestamps tracing (#600)	2022-08-18 15:34:46 -07:00
Sylvain Jeaugey	c4e2aa6c79	2.14.3-1 Add support for improved fault tolerance: non-blocking mode, new init function with config, and ncclCommFinalize function. Reintroduce collnet+chain algorithm, alongside collnet+direct. Add LL protocol for intra-node P2P (on by default) and network communication (off by default). Use network instead of shared memory when performance is better. Fix: wait for CUDA graph destroy before destroying comm with linked graph resources. Remove aggressive polling during enqueue. Fix DMABUF fallback on MOFED 5.4 and earlier.	2022-08-18 02:53:17 -07:00
Ziyue Yang	f6b9686482	Improve alignment and tuning for Pivot A2A algorithm (#593) * Improve alignment and tuning for Pivot A2A algorithm * enable pivot a2a by default	2022-08-05 19:40:19 -07:00
Sylvain Jeaugey	19ab67d172	2.13.4-1 Optimize CUDA graph launch; avoid launching a CPU callback for intra-node operations. Simplify kernel common code to improve the latency of send/recv operations. Strengthen CUDA streams semantics. Change NET API to v6, to add dmabuf support. Add ncclGetLastError() function. Add ncclRemoteError code and use it for remote network errors. Support the use of a different NCCL_NET parameter per communicator. Add support for SHM and P2P transfers using cudaMemcpy.	2022-07-11 08:10:34 -07:00
gilbertlee-amd	a89a9966aa	Adding git hash info to version output line (#572)	2022-06-28 16:42:51 -06:00
Ziyue Yang	6e93fafdc3	Add Feature - Add NPKit Support in RCCL (#564) * apply npkit * fix bug * add npkit in readme	2022-06-20 14:30:19 -07:00
Edgar	0336ffdf70	Introduce multi-rank support per device. This is a single commit of the source code changes required to introduce support for multiple ranks per device. A new interface (ncclCommRankInitMulti) has to be used to make use of this new feature.	2022-06-10 14:23:12 +00:00
Wenkai Du	7a6c6927ae	Enable timing profile option (#558)	2022-06-03 07:05:13 -07:00
Aristotelis	e0864e7093	Merge remote-tracking branch 'ncclRepo/master' into develop	2022-06-02 15:27:24 +00:00
Wenkai Du	ef499c4810	Add another Rome model (#553) * Add another Rome model * Add option to force enable intranet on single node * Limit p2p channels to number of ranks * Refine p2p channels handling	2022-05-31 11:31:30 -07:00
akolliasAMD	98f0809a39	Added creation of new tree and added switch for using treesplit for specific cases (#551)	2022-05-25 18:55:14 -04:00
Wenkai Du	6707a270b1	Add switch for pivot alltoall kernel (#549)	2022-05-17 18:14:04 -07:00
Sylvain Jeaugey	7aa1c46fd5	2.12.12-1 Improve allreduce performance when we have more than one network interface per GPU and we need to use PXN to close rings. Add support for PCI Gen5 on 5.4 kernels. Fix crash when setting NCCL_SET_THREAD_NAME. Fix random crash in init due to uninitialized struct. Fix hang on cubemesh topologies. Add P2P_DIRECT_DISABLE parameter to disable direct access to pointers within a process.	2022-05-13 00:26:57 -07:00
Wenkai Du	d28e1cb44f	Merge remote-tracking branch 'nccl/master' into develop	2022-04-18 11:15:25 -07:00
Wenkai Du	bbe780ca6c	Support multiple tuning tables (#522) * Support multiple tuning tables * [UnitTests] Skip managed memory testing	2022-03-31 17:09:21 -07:00
Sylvain Jeaugey	353e8ba446	2.12.10-1 Fix bug with CollNet Fix bug with zero-bytes send/recv operations Fix NCCL_PARAM implementation to avoid taking a lock on every call Fix bug when setting NCCL_IB_QPS_PER_CONNECTION to more than one. Improve error reporting for network errors.	2022-03-30 02:27:01 -07:00
Sylvain Jeaugey	3c223c105a	2.12.7-1 Add network communication through another GPU connected with NVLink (PXN). Add aggregation of messages coming from different local GPUs through PXN and going to the same destination. Add new v5 plugin API with grouped receives and tags. Add compat for v4 plugins. Add naming of NCCL threads to help debugging. Fix NVLink detection and avoid data corruption when some NVLinks are down. Add support for Relaxed Ordering for IB. Add profiling and timing infrastructure.	2022-03-02 20:48:56 +01:00
Ziyue Yang	b569c0a1db	Add Pivot AllToAll algorithm for Rome model (#503) * add a2a pivot interface * remove debug info * address comments * fix bug * remove custom script * address comments * fix bug	2022-02-20 21:09:47 -08:00
Wenkai Du	598c6fdded	Update Rome models (#491)	2022-01-14 10:03:30 -08:00
Wenkai Du	434ecb0e1f	Merge remote-tracking branch 'origin/develop' into 2.11.4	2022-01-03 09:54:16 -08:00
Wenkai Du	e9bf01fb7e	Determine fine grained memory availability at RCCL bootstrapping (#471)	2021-11-19 08:12:53 -08:00
Wenkai Du	3a919c1f49	Merge remote-tracking branch 'nccl/master' into develop	2021-11-11 14:22:12 -08:00

141 commits