rocm-systems

Author	SHA1	Message	Date
Wenkai Du	f484ff17b9	msccl: add templated kernel (#945) * msccl: add templated kernel * Use defines to improve code readability * Fix kernel indexing and review feedback	2023-11-02 17:21:53 -07:00
Wenkai Du	a497722894	NPkit: misc fixes for MSCCL (#936) * msccl: add xcc_id to timestamp sync * NPKit: add timestamp for rrc operator * NPKit: add timestamp for MSCCL init	2023-10-30 10:00:12 -07:00
Nilesh M Negi	1e5ca6820b	Fix gcnArchName bug in topology dump (#937) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2023-10-28 12:30:36 -05:00
Ziyue Yang	4c117e5335	Fix MSCCL work FIFO out-of-bound issue (#935)	2023-10-27 11:24:52 -07:00
Nilesh M Negi	f22df90e5c	remove gcnArch support (#920) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2023-10-26 12:09:15 -05:00
Wen-Heng (Jack) Chung	7ee5c1c28b	Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911) * Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895) Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu> * Only build gfx941 * demo * fine tune malloc * Fix merge errors * Fix merge errors * Disable parallel build * Adopt --amdgpu-kernarg-preload-count * Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)" This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304. * Revert CMake changes. * NPKIT changes. * Remove some license declarations. * Address code review feedbacks on msccl_kernel_impl.h * Update CMakeLists.txt * Add CMake logic to check the existence of --amdgpu-kernarg-preload-count * Fix NPKIT trace logic. --------- Co-authored-by: Pedram Alizadeh <pmohamma@amd.com> Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu> Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>	2023-10-12 20:17:08 -05:00
akolliasAMD	b85d73c02e	changed the form that RCCL_TREE uses (#888) * changed the form that RCCL_TREE uses	2023-09-15 15:01:33 -06:00
Wenkai Du	26e982d913	Reduce NPKit latency overhead in MSCCL kernel (#893) * Reduce NPKit latency overhead in MSCCL kernel * Fix build error without NPKit enable	2023-09-15 13:28:26 -07:00
Audrey MP	e58ec78d35	Gcn arch name (#886) We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.	2023-09-12 15:34:40 -04:00
Andy li	e1dc4d5e42	enable hip graph on multi-node (#884) * initial checkin * enable msccl when hip graph is on * remove the commented out code of msccl enable check * clean up the code * remove the msccl HighestTransportType check logic	2023-09-11 15:30:04 -07:00
Nusrat Islam	a283f55f12	msccl: add NPKIT profiling for MSCCL send-recv	2023-09-08 13:11:16 -05:00
David Pagan	2ec2648247	Fix static_assert string literal that contains a "\%". This is no longer (#860) valid. They can only be simple escape sequences. Removing '\' fixes issue. Assert message now compiles and emits the '%' as expected.	2023-08-21 16:19:59 -07:00
Ziyue Yang	d33a70e620	NPKit update (#844) * NPKit update 1. Enable NPKit for MSCCL kernels 2. Fix NPKit context index calculation for sendrecv kernels * Update build script for npkit	2023-08-08 17:30:40 -07:00
Wenkai Du	d65c0830c6	Detect HIP_UNCACHED_MEMORY support from HIP version (#842)	2023-08-04 10:17:04 -07:00
Wenkai Du	c8085eb704	Improve collective trace (#835)	2023-08-03 07:16:12 -07:00
Wenkai Du	a7fcd58a97	Enable gfx94x (#808) (#816) (cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)	2023-07-21 07:31:27 -07:00
Wenkai Du	0f14e5a640	npkit: separate network timing between send and test (#798)	2023-07-10 09:31:49 -07:00
Wenkai Du	ce6a2ffac8	Merge pull request #782 from ROCmSoftwarePlatform/2.18.3 Sync up with NCCL 2.18.3	2023-06-29 15:04:16 -07:00
akolliasAMD	9bba4a2f2a	added npkit support into the all_gather run ring algorithm (#790)	2023-06-29 13:59:54 -06:00
Wenkai Du	abd0615351	Merge remote-tracking branch 'nccl/master' into develop	2023-06-26 22:51:56 +00:00
akolliasAMD	9bdf6797a5	fixed npkit size to never be a negative number (#779)	2023-06-21 08:26:40 -06:00
Sylvain Jeaugey	ea38312273	2.18.3-1 Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC. Fix hang with Collnet on bfloat16 on systems with less than one NIC per GPU. Fix long initialization time. Fix data corruption with Collnet when mixing multi-process and multi-GPU per process. Fix crash when shared memory creation fails. Fix Avg operation with Collnet/Chain. Fix performance of alltoall at scale with more than one NIC per GPU. Fix performance for DGX H800. Fix race condition in connection progress causing a crash. Fix network flush with Collnet. Fix performance of aggregated allGather/reduceScatter operations. Fix PXN operation when CUDA_VISIBLE_DEVICES is set. Fix NVTX3 compilation issues on Debian 10.	2023-06-14 01:29:17 -07:00
akolliasAMD	9cdac774ea	Wall clock update and npkit trace script Update (#771) * changed builtin clock to wall_clock64 * updated npkit_Trace_generator to the new version of npkit	2023-06-07 17:47:10 -06:00
Cory Bloor	b1a65afd58	Fix build on additional architectures (#740) * Fix build on additional architectures Instead of directly wrapping a platform-specific operation with a preprocessor check against a gfx macro, it can be more flexible to check a macro that can be overriden by the user. The gfx macro can then just provide the default value for the macro, resulting in the same default behaviour as if the gfx macro was checked directly but with more control at build-time. For example, to build rccl without using buffer_wbinvl1_vol on gfx902, but still use the default on other archs, a user could export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before configuring the build. This flexibility isn't always necessary, but it's nicer to have it and not need it than to need it and not have it. * Define WARP_SIZE using warpSize builtin	2023-06-06 16:45:50 -06:00
Ziyue Yang	7d6e7bcd7d	revert npkit (#748)	2023-05-24 07:41:05 -07:00
Ziyue Yang	11676267b5	fix min, max and avg (#745)	2023-05-18 11:02:59 -07:00
Wenkai Du	897745a266	Remove references to NVLS functions	2023-05-05 07:55:20 -07:00
Wenkai Du	53a1f91857	Merge remote-tracking branch 'nccl/master' into develop	2023-04-25 15:38:32 -07:00
Sylvain Jeaugey	d97a32fac8	2.18.1-1 Add support for IB SHARP to NVLS (NVLink SHARP algorithm). Add NVLS+Tree algorithm. Add support for memory management using cuMem* functions. Use all NICs for Send/Receive operations on systems with more than one NIC per GPU (#804). Add ncclCommSplit primitive, with resource sharing option in config. Fix alltoallv hang (#788) Increase number of channels on H100 when we're not limited by NVLink. Improve error reporting in case of IB failure, printing local and remote ID (#779). Add build option to allow compilation against RDMA includes instead of dynamically loading IB verbs symbols (#802). Fix context creation for progress thread (#803). NET/IB: add option to use multiple QPs in round-robin mode. Fix tree performance issue when NVB is disabled on HCM topologies.	2023-04-18 03:58:25 -07:00
Wenkai Du	4b09ffba43	msccl: print stack and memory usage (#723) * msccl: print stack and memory usage * Update number of kernels calculation	2023-04-14 14:59:03 -07:00
Ziyue Yang	e3b2342f39	MSCCL: Improve executor and integrate scheduler (#694) * MSCCL: improve executor and add scheduler for testing * Use external scheduler * Fix cmake error * Address comments * Fix thread safe issue * Make MSCCL lifecycle APIs thread safe * Make MSCCL internal scheduler aware of topology hint * Revise error message	2023-03-14 14:34:25 -07:00
Sylvain Jeaugey	5d3ab08b69	2.17.1-1 Add new NVLS algorithm for allreduce using NVLink SHARP (intra-node only). Add new config options: cgaClusterSize, minCTAs, maxCTAs, netName. Enable LL128 when we use PXN to close rings. NVTX3 includes update. Fix crash when one CollNet (SHARP) rail fails to initialize.	2023-03-01 00:39:04 -08:00
Ziyue Yang	f4bf47f325	NPKit: improve clock calibration and fix GPU clock API (#683) * Improve clock calibration in NPKit * Improve gfx macro * Fix macro	2023-02-17 12:26:57 -07:00
Wenkai Du	f7a456122c	Remove workaround and use indirect function call (#684)	2023-02-14 13:59:48 -08:00
Wenkai Du	e1cb45ff22	Merge remote-tracking branch 'nccl/master' into HEAD	2023-02-04 01:44:43 +00:00
Ziyue Yang	adafc0f759	Add MSCCL Support (#658) * Add MSCCL support * Add alignment and message size checking * Fix nRanks checking, in-place and out-of-place tests and group call handling * Fix hipGraph unit test * Change MSCCL init warning to INFO * Revise license info	2022-12-12 15:51:04 -08:00
akolliasAMD	eca623df07	decreased warp size for gfx110x (#655)	2022-12-01 12:19:21 -07:00
Sylvain Jeaugey	28189e2df8	2.16.2-1 Add support for CUDA 12.0, drop Kepler (sm_35). Support for H100 features. Make socket code more robust and protected. Solves #555. Improve performance on large CUDA graphs, reducing dependencies. Reduce inter-socket bandwidth on AMD CPUs to favor better paths. Various fixes to ncclCommAbort. Make service thread polling resistant to EINTR. Compile with profiling API by default. Extend NVTX instrumentation with call arguments.	2022-11-30 02:31:59 -08:00
Wenkai Du	562dd87036	Move hipify to cmake stage Add minimal ROCm/HIP version requirements for Graph support	2022-11-14 18:10:45 +00:00
Wenkai Du	9a077e6947	Merge remote-tracking branch 'nccl/master' into develop	2022-11-03 21:17:42 +00:00
Sylvain Jeaugey	cb111f764a	2.15.5-1 Fix crash with CollnetChain on some node topologies Fix hang when interleaving the capture of different graphs Fix hang during init in multi-threaded mode Fix potential data corruption with LL128 protocol on unaligned buffers. Fix CPU usage during preconnect Fixes double-free in the error path for ncclCommInitAll Workaround hang on H100 with Ring/LL128 on 2 GPUs.	2022-10-25 00:55:55 -07:00
Wenkai Du	4f0e223db4	Merge remote-tracking branch 'nccl/master' into develop	2022-10-20 15:41:29 +00:00
Wenkai Du	9ddf0e0649	Support P2P with invisible devices (#636) * Support P2P with invisible devices * Update copyright year	2022-10-17 10:24:59 -07:00
Ziyue Yang	7d6bbc19d4	apply npkit	2022-10-14 01:28:17 +00:00
Edgar Gabriel	e645b02cd8	introduce a hw topology aware bintree for hayabusa architecture.	2022-10-03 15:26:21 +00:00
Sylvain Jeaugey	da8152e57a	2.15.1-1 Add support for H100 (sm90). Make sure NCCL kernel honor user stream priorities.	2022-09-27 02:31:13 -07:00
Wenkai Du	a06e14e39b	Misc fixes and disable binTree	2022-09-14 00:26:19 +00:00
Edgar Gabriel	be935d7ce7	Merge branch 'develop' into 2.13.4	2022-09-13 17:19:04 -05:00
Edgar Gabriel	65e2ae20e5	add binary tree In addition, introduce the ability to have 2 trees at the same time. Only for allreduce at the moment.	2022-09-13 20:52:32 +00:00
Gilbert Lee	009e79623f	Merge branch 'develop' into 2.13.4	2022-09-09 23:07:04 +00:00


183 Commits