rocm-systems

Author	SHA1	Message	Date
Atul Kulkarni	892d258319	Add missing header in alloc.h (#2086 )	2025-12-04 11:26:19 -06:00
Wenkai Du	185e78a8f0	Use one side stream per process (#2063 ) * Use one side stream per process * Handle multiple GPUs per process * Reset stream when not found * Address review comments * Fix missing mutex initializer	2025-12-02 10:03:15 -08:00
corey-derochie-amd	4acd0f64ea	Add copyright to src/device/symmetric/all_reduce.cuh (#2080 )	2025-11-27 14:29:21 -07:00
isaki001	da183596cd	add back missing proxy-counter updates (#2052 )	2025-11-25 15:22:34 -06:00
AbandiGa	b14e32c46e	Fix rcclNetP2pPolicy issue (#2072 ) * fix rcclNetP2pPolicy issue * change the comment to ncclNetIb	2025-11-21 18:28:10 -06:00
Matt Williams	3495baa6b2	Fix ToC in API Library page (#2053 ) * Add intro and remove ToC	2025-11-20 09:35:15 -05:00
Pedram Alizadeh	fb67e5b467	Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037 ) * Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic * Switching to hip_bf16.h from ROCm 6.0.0	2025-11-13 15:56:18 -05:00
AbandiGa	277b6e9bac	Disable bfloat16 pipelining for reduction collectives for gfx950 (#2047 ) * disable bf16 reduce_copy pipelining for gfx950 * edit CHANGELOG * Combine unroll and pipeline local arch calculation into single function * fix multi-node error and disable for gfx950 even if it's not a local build * removed has_gfx950 * disable pipelining for gfx950 in rcclSetPipelining --------- Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com> Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com> Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>	2025-11-13 14:55:09 -06:00
isaki001	0d09f86608	Post thread-block size increase tuning (#2042 ) * for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank * dont override direct allgather threshold if set to -1 * restore 2-node AR simple at earlier message sizes than higher multi-node AR * extend range of LL for single-node RS on gfx950 * update algo/proto for multi-node allreduce on gfx942 * set single-node AR on gfx950 to Tree LL for KB message sizes * decrease threshold for single node Tree for gfx950 AR	2025-11-13 14:51:04 -06:00
Bertan Dogancay	83ffc82fa7	[Launch] Move cudaEventRecord call to capturing stream only (#2050 )	2025-11-13 08:38:09 -06:00
gilbertlee-amd	46b032b760	[GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031 )	2025-11-12 19:34:27 -06:00
Dingming Wu	b811645688	Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027 )	2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea	1678bb9ae7	Fix compilation when enabling indirect function calls (#1994 ) Fix compilation when enabling indirect function calls.	2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar	52f9526bd6	Reduce LL threshold for a2a (#2032 )	2025-11-10 19:14:23 -05:00
Kapil S. Pawar	acdafac49f	[RcclReplayer] Compile without the need for RCCL to be compiled (#2039 )	2025-11-10 15:38:48 -06:00
Dingming Wu	05f914c997	Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023 ) * Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that. * Update src/init.cc to use ERROR instead of WARN Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2025-11-10 11:54:35 -06:00
Dingming Wu	b00ee4c83c	Increment opCount for intra-node comms as well (#2024 ) * Enhance logging in NCCL initialization It's convenient to log comms obj and default channels together for debugging * Add opCount to collDevWork and update increment logic Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms) * Clarify opCount increment logic in enqueue.cc Updated comment to clarify incrementing opCount for intranode communications. * Refactor NCCL_INIT logging format Updated logging format for NCCL_INIT to improve clarity. * Remove duplicate INFO logging in init.cc	2025-11-10 11:23:49 -06:00
Bertan Dogancay	b1e680adc0	[GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030 )	2025-11-07 15:15:25 -05:00
Bertan Dogancay	a9bb7e9807	[Launch] Enable Implicit order launch with serial mode (#2033 )	2025-11-07 13:29:53 -05:00
Ghadeer Ahmed H Alabandi	45991fadad	[NET] Enable capping the number of QPs created for send/recv colls (#1998 )	2025-11-07 00:47:01 +00:00
alex-breslow-amd	56e0b4e445	[gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017 ) * Internal benchmarking shows nice single-node performance uplift for MI300A and MI350	2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul	d6a53d2022	proxy: handle progressOps return code properly. (#2029 )	2025-11-04 09:09:50 -06:00
nawrinsu	166268d715	Fix protocol and channel override when tuner is used (#1985 ) * Fix protocol and channel override when tuner is used * Added comment * Fix README for basic tuner implementation	2025-11-03 13:56:34 -08:00
Nilesh M Negi	62ab7a22d7	Revert "[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006 )" (#2021 ) This reverts commit `bed7cdf863`.	2025-10-31 10:04:12 -05:00
David DeBonis	63d5846452	Single-node AllGather and ReduceScatter Optimization (#2019 ) * Single-node performance tuning * Normalizing value to individual rank	2025-10-31 08:59:46 -06:00
Arm Patinyasakdikul	1ce83d5cc0	Added ERROR message class to handle fatal error messages. (#2002 ) * Added ERROR message class to handle fatal error messages. New ERROR message class will print the message in all debug level, including none. Change some of the fatal error message to be in ERROR instead of WARN. Added new error handler function to print out more meaningful error message in the future. * Added CHANGELOG entry. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Change to no longer reuse NONE as ERROR. ERROR is now a separated class. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-10-30 16:14:20 -05:00
Arm Patinyasakdikul	84fdcab68a	Added copyrights for Palamida scan 7.2. (#2018 )	2025-10-30 13:33:20 -05:00
isaki001	641c0eb51c	P2p batching hang-fix (#2011 ) * prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes * correct computation for channel to part mapping * update changelog * disabling p2p-batching by default	2025-10-30 13:32:01 -05:00
isaki001	72996e4d9f	gfx950 multi-node tuning for LL/LL128 (#1953 ) * increased LL threshold for gfx950 AR to 256KB * AG/RS proto threshold update	2025-10-30 12:08:12 -05:00
Bertan Dogancay	bed7cdf863	[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006 )	2025-10-30 11:45:53 -04:00
Nilesh M Negi	8444b3c6e9	Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003 )	2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar	12f51ba8bf	[Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978 ) * Add initial commit to increase tb size to 512 * Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X * Adjust nthreads for LL * Opt threads for reduce_scatter upper small range * Add macro for single node * Restrict MSCCL to 256 threads to prevent mem access fault * Support pre-MI350 compatibility * Partially refactor threadblock size override * Use const macros instead of numerals * opt out of unused function	2025-10-29 23:24:32 -05:00
alex-breslow-amd	e69b11eba5	Remove nontemporality from stores, put in casts to global address space (#1982 ) * Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores * Removes nontemporality from some key stores for gfx950.	2025-10-28 10:34:48 -07:00
mberenjk	b58f234539	Add support for additional paths in RCCL DMABUF kernel configuration loading (#1825 ) * Adding more path to the kernel load and an environment variable to force enable DMABUF --------- Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>	2025-10-20 13:35:22 -07:00
Nilesh M Negi	c35bc721ad	Fix ncclDevFuncId for AllReduceWithBias (#1980 )	2025-10-17 09:28:57 -05:00
Arm Patinyasakdikul	58eca5d7f8	Disable graph mode memory registration and UBR as unsupported feature. (#1977 )	2025-10-17 09:18:39 -05:00
Rahul Vaidya	624f68b2b2	[Profiler plugin] Fix segfault issue with profiler plugin (#1973 ) * Fix profiler plugin segfault by correctly setting p2p->func * Look for librccl-profiler.so instead of libnccl-profiler.so Signed-off-by: rahulvaidya20 <ravaidya@amd.com> --------- Signed-off-by: rahulvaidya20 <ravaidya@amd.com> Co-authored-by: Yongjie Qiu <Yongjie.Qiu@amd.com>	2025-10-16 16:33:18 -05:00
alex-breslow-amd	154350baaf	MSCCL: Unland PR1788 + Fix for MSCCL Data Corruption (#1960 ) - Earlier fix PR1788 is no longer necessary after ROCr fix and pre-ROCr fix workaround - Inserts an s_waitcnt vmcnt(0), which fixes a data corruption issue in MSCCL	2025-10-15 10:32:25 -07:00
gilbertlee-amd	fedddb452c	Enabling gdrcopy option for gfx950 (#1955 )	2025-10-15 10:55:25 -06:00
alex-breslow-amd	c70f5b4621	[gfx950] Make bypassing __threadfence the default for multinode. (#1947 ) * Gate based on ROCM version, safe for ROCm 7.0.2 and beyond. * Updates naming to gfx9CheapFenceOff since we use this for gfx942 and gfx950. Thanks Nilesh. * Add info logging statement to NCCL_INIT to print whether enabled when INFO logging is enabled.	2025-10-15 09:15:36 -07:00
isaki001	0f99fd84a3	gfx950 channel tuning for ReduceScatter and AllGather (#1940 ) * add channel thresholds to override channel-count adjustments	2025-10-14 09:50:44 -05:00
mberenjk	e738c03e39	fixing the ar_with_bias test issue when running rccl-tests (#1912 ) * fixing the AR_With_Bias issue when running rccl-tests	2025-10-13 13:58:21 -07:00
Arm Patinyasakdikul	ff75860d73	Fix unroll factor display bug. (#1969 )	2025-10-10 15:35:06 -05:00
Surya Periaswamy	5bd5079de1	MSCCL++ fix split path null deref (#1959 ) * Add speriaswamy-amd to CODEOWNERS * MSCCL++: fix split path null deref; key maps by parent ncclUniqueId * removed no-op	2025-10-09 14:08:38 -05:00
Rahul Vaidya	6b200ee6c5	Fix LL128 proto selection to respect user setting (#1822 )	2025-10-09 14:08:03 -05:00
Nusrat Islam	d22a39e954	Update direct AG and single node LL threshold (#1944 ) * update AG direct and single node LL threshold * update thresholds based on MI350 experimental results * disable using LL for direct AG * enable direct AG for lower GPU counts * direct AG single node tuning * fix in-place buffer allocation for AG unit test * whitespace fix * gate direct AG for gfx950 and gfx942 --------- Co-authored-by: Nusrat Islam <nusislam@nova-login-gtu2.prov.gtu.zts.cpe.ice.amd.com>	2025-10-09 10:48:50 -05:00
Artem Kuzmitckii	00a42c80f3	Reverse logic of context tracking enablement from #1927 (#1971 ) With this commit, context tracking is disabled by default and can be enabled via `RCCL_ENABLE_CONTEXT_TRACKING=1` on both CDNA and RDNA. Original PR: https://github.com/ROCm/rccl/pull/1927	2025-10-09 10:24:09 +02:00
BertanDogancay	3f94267f21	Merge remote-tracking branch 'nccl/master' into develop	2025-10-06 18:36:49 -04:00
Nilesh M Negi	342ec086e3	Revert "changes for hugepages backed host buffer for larger allocations (#1841 )" (#1951 ) This reverts commit `65b69bf318`.	2025-10-02 23:43:09 -05:00
amd-jiali	5978d2f9ab	Print out the hipRuntimeVersion message from WARN to always show up (#1911 ) Authored-by: Jiali Li <jialili@amd.com>	2025-10-02 11:32:32 -05:00

1035 Commits