rocm-systems

Author	SHA1	Message	Date
BertanDogancay	9ff53eeeae	Merge remote-tracking branch 'nccl/master' into develop	2024-01-30 14:43:43 -08:00
BertanDogancay	81ddf9de89	Merge remote-tracking branch 'nccl/v2.19' into develop	2024-01-24 15:25:33 -08:00
Bertan Dogancay	c4dbf8a914	Fix collective trace when rccl is configured (#1056 ) * Fix collective trace when rccl is configured	2024-01-22 09:26:44 -07:00
Wenkai Du	7e25d5bc55	Use new HIP graph API compatible with CUDA 11030 (#991 ) * Use new HIP graph API compatible with CUDA 11030 * Update dependency to ROCm 6.1 * Fix single stream use case	2024-01-21 19:00:50 -08:00
Bertan Dogancay	28d9b170c9	[DEV] Configure functions in RCCL (#986 ) * configure functions in rccl	2024-01-18 15:07:16 -07:00
Nilesh M Negi	249e9f7f65	Un-escaped character causes error with address sanitizer builds (#992 ) Signed-off-by: Nilesh M Negi <Nilesh.Negi@amd.com> Co-authored-by: Jenkins <jenkins-compute@amd.com>	2024-01-09 13:28:32 -06:00
Pedram Alizadeh	aa5c84c997	Merge pull request #1022 from PedramAlizadeh/sync_nccl_2.18.6 Sync to nccl 2.18.6	2024-01-09 13:29:29 -05:00
akolliasAMD	f4858e14b2	rearranged how the min and max functions are part of msccl (#1025 ) * rearranged how the min and max functions are part of msccl * added more coverage on in place graph tests	2023-12-21 08:58:33 -07:00
PedramAlizadeh	0d515f9388	resolved conflicts, fixed the localNetCount/0 bug	2023-12-18 08:11:34 +00:00
Ziyue Yang	655742a3a6	Fully disable MSCCL when machine is not matched (#1017 ) * Disable MSCCL algorithm meta loading when machine is not matched * fully disable init * fix potential segfault	2023-12-13 08:36:21 -08:00
Wenkai Du	12c08fc52a	msccl: build same number of kernels as in ROCm 5.7 (#1005 ) Removed fullOps kernels from build	2023-12-07 13:36:04 -06:00
Wen-Heng (Jack) Chung	293f0fb752	Use a map to host scratch buffers (#1004 ) * Use a map to host scratch buffers * Address review feedbacks. Deliberately keep mscclSetupScratch function.	2023-12-05 13:15:28 -06:00
Bertan Dogancay	7c0f49a878	IFC mix build (#998 )	2023-12-02 18:49:52 -07:00
Wenkai Du	4ba65d1d6a	Increase max channels to 64 (#993 )	2023-12-01 16:01:11 -08:00
Ziyue Yang	4bb0b4a380	Move MSCCL algorithm loading to initialization to workaround HIP graph conflict (#982 ) * MSCCL: pre-specify channels and pre-load algorithms * add mutex * fix bug * clean include * disable all-gathers temporarily	2023-11-30 09:47:20 -08:00
akolliasAMD	56ce9ef05f	recreated pr 914 to work with current develop branch (#979 )	2023-11-28 16:33:47 -07:00
Ziyue Yang	7fc891bc8d	Fix MSCCL work FIFO allocation with HIP graph enabled (#967 )	2023-11-15 16:43:28 -08:00
Sylvain Jeaugey	88d44d777f	2.19.4-1 Split transport connect phase into multiple steps to avoid port exhaustion when connecting alltoall at large scale. Defaults to 128 peers per round. Fix memory leaks on CUDA graph capture. Fix alltoallv crash on self-sendrecv. Make topology detection more deterministic when PCI speeds are not available (fix issue #1020). Properly close shared memory in NVLS resources. Revert proxy detach after 5 seconds. Add option to print progress during transport connect. Add option to set NCCL_DEBUG to INFO on first WARN.	2023-11-13 10:36:12 -08:00
Wenkai Du	5a800e00cd	msccl: enable basic collective trace (#959 ) To avoid increasing number of kernels, colltrace is only enabled with RCCL_MSCCL_FORCE_FULLOPS=1	2023-11-08 20:14:28 -08:00
Wenkai Du	f484ff17b9	msccl: add templated kernel (#945 ) * msccl: add templated kernel * Use defines to improve code readability * Fix kernel indexing and review feedback	2023-11-02 17:21:53 -07:00
Wenkai Du	a497722894	NPkit: misc fixes for MSCCL (#936 ) * msccl: add xcc_id to timestamp sync * NPKit: add timestamp for rrc operator * NPKit: add timestamp for MSCCL init	2023-10-30 10:00:12 -07:00
Nilesh M Negi	1e5ca6820b	Fix gcnArchName bug in topology dump (#937 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2023-10-28 12:30:36 -05:00
Ziyue Yang	4c117e5335	Fix MSCCL work FIFO out-of-bound issue (#935 )	2023-10-27 11:24:52 -07:00
Nilesh M Negi	f22df90e5c	remove gcnArch support (#920 ) Signed-off-by: nileshnegi <Nilesh.Negi@amd.com>	2023-10-26 12:09:15 -05:00
Wen-Heng (Jack) Chung	7ee5c1c28b	Change MSCCL kernel signature to allow kernel arguments be preloaded via SGPR (#911 ) * Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895) Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu> * Only build gfx941 * demo * fine tune malloc * Fix merge errors * Fix merge errors * Disable parallel build * Adopt --amdgpu-kernarg-preload-count * Revert "Adding a script that will download/compile/run TransferBench/RCCL/UCX/RCCL-tests/RCCL-Unittests/hip-mpi-testsuite (#895)" This reverts commit f5e252dddf02a41b4d1bc512f306f45f97166304. * Revert CMake changes. * NPKIT changes. * Remove some license declarations. * Address code review feedbacks on msccl_kernel_impl.h * Update CMakeLists.txt * Add CMake logic to check the existence of --amdgpu-kernarg-preload-count * Fix NPKIT trace logic. --------- Co-authored-by: Pedram Alizadeh <pmohamma@amd.com> Co-authored-by: Pedram Alizadeh <pmohamma@banff-pla-r27-05.pla.dcgpu> Co-authored-by: Ziyue Yang <ziyyang@microsoft.com>	2023-10-12 20:17:08 -05:00
Sylvain Jeaugey	8c6c595185	2.19.3-1 H800/H100 fixes and tuning. Re-enable intra-process direct pointer buffer access when CUMEM is enabled.	2023-09-26 05:57:15 -07:00
Sylvain Jeaugey	3435178b6c	Merge remote-tracking branch 'origin/master' into v2.19	2023-09-26 05:55:56 -07:00
Sylvain Jeaugey	f9c3dc251e	2.19.1-1 Add local user buffer registration for NVLink SHARP. Add tuning plugin support. Increase net API to v7 to allow for device-side packet reordering; remove support for v4 plugins. Add support for RoCE ECE. Add support for C2C links. Better detect SHM allocation failures to avoid crash with Bus Error. Fix missing thread unlocks in bootstrap (Fixes #936). Disable network flush by default on H100. Move device code from src/collectives/device to src/device.	2023-09-26 05:50:33 -07:00
Kaiming Ouyang	4365458757	Fix cudaMemcpyAsync bug We are trying to use the copy result of first cudaMemcpyAsync in the second cudaMemcpyAsync without sync in between. This patch fixes it by allocating a CPU side array to cache device side addr so that we can avoid this consecutive cuda mem copy. Fixes #957	2023-09-20 05:51:14 -07:00
akolliasAMD	b85d73c02e	changed the form that RCCL_TREE uses (#888 ) * changed the form that RCCL_TREE uses	2023-09-15 15:01:33 -06:00
Wenkai Du	26e982d913	Reduce NPKit latency overhead in MSCCL kernel (#893 ) * Reduce NPKit latency overhead in MSCCL kernel * Fix build error without NPKit enable	2023-09-15 13:28:26 -07:00
Audrey MP	e58ec78d35	Gcn arch name (#886 ) We use CMake to determine if we're compiling against a version of ROCm that supports gcnArchName and handles architecture checking appropriately. It includes a few helper functions as drop ins for the functionality we used gcnArch for before; sometimes to enable flags, and sometimes to set frequencies.	2023-09-12 15:34:40 -04:00
Andy li	e1dc4d5e42	enable hip graph on multi-node (#884 ) * initial checkin * enable msccl when hip graph is on * remove the commented out code of msccl enable check * clean up the code * remove the msccl HighestTransportType check logic	2023-09-11 15:30:04 -07:00
Nusrat Islam	a283f55f12	msccl: add NPKIT profiling for MSCCL send-recv	2023-09-08 13:11:16 -05:00
David Pagan	2ec2648247	Fix static_assert string literal that contains a "\%". This is no longer (#860 ) valid. They can only be simple escape sequences. Removing '\' fixes issue. Assert message now compiles and emits the '%' as expected.	2023-08-21 16:19:59 -07:00
Ziyue Yang	d33a70e620	NPKit update (#844 ) * NPKit update 1. Enable NPKit for MSCCL kernels 2. Fix NPKit context index calculation for sendrecv kernels * Update build script for npkit	2023-08-08 17:30:40 -07:00
Wenkai Du	d65c0830c6	Detect HIP_UNCACHED_MEMORY support from HIP version (#842 )	2023-08-04 10:17:04 -07:00
Wenkai Du	c8085eb704	Improve collective trace (#835 )	2023-08-03 07:16:12 -07:00
Wenkai Du	a7fcd58a97	Enable gfx94x (#808 ) (#816 ) (cherry picked from commit 94da229a7788d74685d1591a4e75a8341de64f41)	2023-07-21 07:31:27 -07:00
Wenkai Du	0f14e5a640	npkit: separate network timing between send and test (#798 )	2023-07-10 09:31:49 -07:00
Wenkai Du	ce6a2ffac8	Merge pull request #782 from ROCmSoftwarePlatform/2.18.3 Sync up with NCCL 2.18.3	2023-06-29 15:04:16 -07:00
akolliasAMD	9bba4a2f2a	added npkit support into the all_gather run ring algorithm (#790 )	2023-06-29 13:59:54 -06:00
Wenkai Du	abd0615351	Merge remote-tracking branch 'nccl/master' into develop	2023-06-26 22:51:56 +00:00
akolliasAMD	9bdf6797a5	fixed npkit size to never be a negative number (#779 )	2023-06-21 08:26:40 -06:00
Sylvain Jeaugey	ea38312273	2.18.3-1 Fix data corruption with Tree/LL128 on systems with 1GPU:1NIC. Fix hang with Collnet on bfloat16 on systems with less than one NIC per GPU. Fix long initialization time. Fix data corruption with Collnet when mixing multi-process and multi-GPU per process. Fix crash when shared memory creation fails. Fix Avg operation with Collnet/Chain. Fix performance of alltoall at scale with more than one NIC per GPU. Fix performance for DGX H800. Fix race condition in connection progress causing a crash. Fix network flush with Collnet. Fix performance of aggregated allGather/reduceScatter operations. Fix PXN operation when CUDA_VISIBLE_DEVICES is set. Fix NVTX3 compilation issues on Debian 10.	2023-06-14 01:29:17 -07:00
akolliasAMD	9cdac774ea	Wall clock update and npkit trace script Update (#771 ) * changed builtin clock to wall_clock64 * updated npkit_Trace_generator to the new version of npkit	2023-06-07 17:47:10 -06:00
Cory Bloor	b1a65afd58	Fix build on additional architectures (#740 ) * Fix build on additional architectures Instead of directly wrapping a platform-specific operation with a preprocessor check against a gfx macro, it can be more flexible to check a macro that can be overridden by the user. The gfx macro can then just provide the default value for the macro, resulting in the same default behaviour as if the gfx macro was checked directly but with more control at build-time. For example, to build rccl without using buffer_wbinvl1_vol on gfx902, but still use the default on other archs, a user could export CXXFLAGS='-Xarch_gfx902 -DRCCL_USE_WBINVL1_VOL=1' before configuring the build. This flexibility isn't always necessary, but it's nicer to have it and not need it than to need it and not have it. * Define WARP_SIZE using warpSize builtin	2023-06-06 16:45:50 -06:00
Ziyue Yang	7d6e7bcd7d	revert npkit (#748 )	2023-05-24 07:41:05 -07:00
Ziyue Yang	11676267b5	fix min, max and avg (#745 )	2023-05-18 11:02:59 -07:00
Wenkai Du	897745a266	Remove references to NVLS functions	2023-05-05 07:55:20 -07:00

206 Commits