rocm-systems

Author	SHA1	Message	Date
isaki001	da183596cd	add back missing proxy-counter updates (#2052 )	2025-11-25 15:22:34 -06:00
Kapil S. Pawar	5fd86021a8	[RcclReplayer] JSON <-> BIN log format conversion tool (#2056 ) * Add replay log format converter * Add Log Sanitizer * Add no timestamp option (nts) to sanitizer	2025-11-24 11:51:36 -06:00
Nilesh M Negi	db52690c2a	[AzureCI] Increase timeout of per PR and nightly pipeline to 240 mins (#2074 )	2025-11-24 10:55:36 -06:00
AbandiGa	b14e32c46e	Fix rcclNetP2pPolicy issue (#2072 ) * fix rcclNetP2pPolicy issue * change the comment to ncclNetIb	2025-11-21 18:28:10 -06:00
Nilesh M Negi	fca0015962	[AzureCI] Increase UT timeout to 3 hours (#2070 )	2025-11-21 09:18:49 -06:00
Matt Williams	3495baa6b2	Fix ToC in API Library page (#2053 ) * Add intro and remove ToC	2025-11-20 09:35:15 -05:00
Kapil S. Pawar	ab1eb6e70e	[AzureCI] Add RcclReplayer to CI (#2048 )	2025-11-18 12:21:24 -06:00
Kapil S. Pawar	c7f400dbff	Functional Tests for Ext-Profiler Plugin (#2007 ) * Add functional tests for CSV Tuner Plugin * Updated directory structure * Updated and renamed directories * Updated csv conf files * Added tests for ext-profiler * Updated readme * Updated readme	2025-11-18 11:20:39 -06:00
Arm Patinyasakdikul	461e61d10e	Added install.sh flag to suppress warnings. (#2054 )	2025-11-17 00:35:06 -06:00
Pedram Alizadeh	fb67e5b467	Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic (#2037 ) * Using hip_bf16.h instead of hip_bfloat16.h for the __bf16 intrinsic * Switching to hip_bf16.h from ROCm 6.0.0	2025-11-13 15:56:18 -05:00
AbandiGa	277b6e9bac	Disable Bfloat16 pipelining for reduction collectives for gfx950 (#2047 ) * disable bf16 reduce_copy pipelining for gfx950 * edit CHANGELOG * Combine unroll and pipeline local arch calculation into single function * fix multi-node error and disable for gfx950 even if it's not a local build * removed has_gfx950 * disable pipelining for gfx950 in rcclSetPipelining --------- Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-08.prov.gtu.zts.cpe.ice.amd.com> Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h30-18.prov.gtu.zts.cpe.ice.amd.com> Co-authored-by: Ghadeer Alabandi <galaband@cv350-zts-gtu-h28a-08.prov.gtu.zts.cpe.ice.amd.com>	2025-11-13 14:55:09 -06:00
isaki001	0d09f86608	Post thread-block size increase tuning (#2042 ) * for multinode gfx950, extend AR LL128 up to 256MB, extend RS LL128 up to 8MB per rank, extend AG LL up to 64KB per rank * dont override direct allgather threshold if set to -1 * restore 2-node AR simple at earlier message sizes than higher multi-node AR * extend range of LL for single-node RS on gfx950 * update algo/proto for multi-node allreduce on gfx942 * set single-node AR on gfx950 to Tree LL for KB message sizes * decrease threshold for single node Tree for gfx950 AR	2025-11-13 14:51:04 -06:00
Bertan Dogancay	83ffc82fa7	[Launch] Move cudaEventRecord call to capturing stream only (#2050 )	2025-11-13 08:38:09 -06:00
gilbertlee-amd	46b032b760	[GRAPH] Adding support for rail-optimized trees for MI3XX with 4 NICs (#2031 )	2025-11-12 19:34:27 -06:00
nawrinsu	c488c5307e	Add tuner config file (2,4,8 nodes) for gfx950 (#2012 ) * Add tuner config file (2,4,8 nodes) for gfx950 * remove alltoall * Added comment regarding allgather direct	2025-11-12 09:16:36 -08:00
Kapil S. Pawar	c8da880dc7	Added Functional Tests for CSV Tuner Plugin (#1968 ) * Add functional tests for CSV Tuner Plugin * Updated directory structure * Updated and renamed directories * Updated csv conf files * Updated readme * Updated readme * Updated readme	2025-11-11 10:11:19 -06:00
Dingming Wu	b811645688	Adjust nChannels on gfx950 based on ranks and nodes for better bandwidth (#2027 )	2025-11-11 09:46:51 -06:00
Gheorghe-Teodor Bercea	1678bb9ae7	Fix compilation when enabling indirect function calls (#1994 ) Fix compilation when enabling indirect function calls.	2025-11-11 09:36:48 -05:00
Mustafa Abduljabbar	52f9526bd6	Reduce LL threshold for a2a (#2032 )	2025-11-10 19:14:23 -05:00
Kapil S. Pawar	acdafac49f	[RcclReplayer] Compile without the need for RCCL to be compiled (#2039 )	2025-11-10 15:38:48 -06:00
Dingming Wu	05f914c997	Fail the job if flag HIP_HOST_UNCACHED_MEMORY is not set on MI350x (#2023 ) * Fail the job if compiler flag HIP_HOST_UNCACHED_MEMORY is not turned on on mi350x Place the check after initTransportsRank as the GPU arch info in comm->topo->nodes info is populated after that. * Update src/init.cc to use ERROR instead of WARN Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2025-11-10 11:54:35 -06:00
Dingming Wu	b00ee4c83c	Increment opCount for intra-node comms as well (#2024 ) * Enhance logging in NCCL initialization It's convenient to log comms obj and default channels together for debugging * Add opCount to collDevWork and update increment logic Added opCount to collDevWork and incremented it when proxyOpQueue is empty (e.g., for intra-node comms) * Clarify opCount increment logic in enqueue.cc Updated comment to clarify incrementing opCount for intranode communications. * Refactor NCCL_INIT logging format Updated logging format for NCCL_INIT to improve clarity. * Remove duplicate INFO logging in init.cc	2025-11-10 11:23:49 -06:00
Bertan Dogancay	b1e680adc0	[GEN/BUILD] Refactor generator script and reduce build time for old archs. (#2030 )	2025-11-07 15:15:25 -05:00
Bertan Dogancay	a9bb7e9807	[Launch] Enable Implicit order launch with serial mode (#2033 )	2025-11-07 13:29:53 -05:00
Avinash	85baa0d113	Empty kernel test enhancements [tools] (#1999 ) * Initial commit * Improvements-1 * Initial commit for PR * Updates warning, run.sh, decoupled loops * Forcing seq cst for CPU timing	2025-11-07 12:28:06 -06:00
Ghadeer Ahmed H Alabandi	45991fadad	[NET] Enable capping the number of QPs created for send/recv colls (#1998 )	2025-11-07 00:47:01 +00:00
alex-breslow-amd	56e0b4e445	[gfx950] Turn On Single Node One Slice Optimization for gfx950 and MI300A (#2017 ) * Internal benchmarking shows nice single-node performance uplift for MI300A and MI350	2025-11-06 12:12:45 -08:00
Arm Patinyasakdikul	d6a53d2022	proxy: handle progressOps return code properly. (#2029 )	2025-11-04 09:09:50 -06:00
Aravind Ravikumar	07f8f6d6c6	Add S3 upload support for Perf and test reports by run ID and architecture (#2020 ) * Commits to enable scp report copy * Added Post report upload step * Added extra arg for fetch artifacts * Moved to a specific commit * Add write permissions to s3 * Added comment for TheRock sha commit date --------- Co-authored-by: arravikum <arravikum@amd.com>	2025-11-03 19:09:34 -05:00
nawrinsu	166268d715	Fix protocol and channel override when tuner is used (#1985 ) * Fix protocol and channel override when tuner is used * Added comment * Fix README for basic tuner implementation	2025-11-03 13:56:34 -08:00
Ahmed Khan	caffd013f6	Remove duplicate MaxEmptyLinesToKeep from clang-format (#2016 )	2025-11-02 21:44:27 -06:00
Nilesh M Negi	62ab7a22d7	Revert "[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006 )" (#2021 ) This reverts commit `bed7cdf863`.	2025-10-31 10:04:12 -05:00
David DeBonis	63d5846452	Single-node AllGather and ReduceScatter Optimization (#2019 ) * Single-node performance tuning * Normalizing value to individual rank	2025-10-31 08:59:46 -06:00
Arm Patinyasakdikul	1ce83d5cc0	Added ERROR message class to handle fatal error messages. (#2002 ) * Added ERROR message class to handle fatal error messages. New ERROR message class will print the message in all debug level, including none. Change some of the fatal error message to be in ERROR instead of WARN. Added new error handler function to print out more meaningful error message in the future. * Added CHANGELOG entry. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Change to no longer reuse NONE as ERROR. ERROR is now a separated class. * Update CHANGELOG.md Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-10-30 16:14:20 -05:00
Arm Patinyasakdikul	84fdcab68a	Added copyrights for Palamida scan 7.2. (#2018 )	2025-10-30 13:33:20 -05:00
isaki001	641c0eb51c	P2p batching hang-fix (#2011 ) * prevent batching when send/recv bytes dont match, restore bit reversal for channel to part mapping, prevent batching beyond 32-nodes * correct computation for channel to part mapping * update changelog * disabling p2p-batching by default	2025-10-30 13:32:01 -05:00
isaki001	72996e4d9f	gfx950 multi-node tuning for LL/LL128 (#1953 ) * increased LL threshold for gfx950 AR to 256KB * AG/RS proto threshold update	2025-10-30 12:08:12 -05:00
Bertan Dogancay	bed7cdf863	[GEN/BUILD] Refactor generate.py and reduce build time for older archs (#2006 )	2025-10-30 11:45:53 -04:00
Nilesh M Negi	8444b3c6e9	Fix gfx950 gating conditions to match ROCm 7.0.2 (#2003 )	2025-10-29 23:27:04 -05:00
Mustafa Abduljabbar	12f51ba8bf	[Device] Adjust threadblock size for gfx950 to increase LL64/Simple performance for AR, RS and AG (#1978 ) * Add initial commit to increase tb size to 512 * Fix LL perf issue when subset of NCCL_MAX_NTHREADS is used Adding a constant to barrier_generic logic from using fallback logic when nthreads < NCCL_MAX_NTHREADS and nthreads == blockDim.X * Adjust nthreads for LL * Opt threads for reduce_scatter upper small range * Add macro for single node * Restrict MSCCL to 256 threads to prevent mem access fault * Support pre-MI350 compatibility * Partially refactor threadblock size override * Use const macros instead of numerals * opt out of unused function	2025-10-29 23:24:32 -05:00
Bertan Dogancay	b703ffdfa4	[Tools/Replayer] Fix prohibited calls during capture mode (#1938 )	2025-10-29 12:19:32 -04:00
corey-derochie-amd	561ad2fe05	Updated Changelog with 7.1.1 and 7.2.0 stub sections (#2008 ) * Missing ROCm 7.0 & 7.1.0 Changelog entries (#1976) * Update CHANGELOG.md * Update CHANGELOG.md * Apply suggestions from code review Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Update CHANGELOG.md * Update CHANGELOG.md * Update CHANGELOG.md --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com> * Added ROCm 7.2.0 section. * Update CHANGELOG.md * Apply suggestion from @corey-derochie-amd --------- Co-authored-by: Jeffrey Novotny <jnovotny@amd.com>	2025-10-28 13:41:22 -06:00
Atul Kulkarni	cc867dbaf2	Removed RCCL_EXPOSE_STATIC duplicate definition. (#1988 )	2025-10-28 13:01:48 -05:00
Atul Kulkarni	26dc7abb32	Added ROCM_VERSION restriction to alloc unit tests (#1989 )	2025-10-28 12:54:34 -05:00
alex-breslow-amd	e69b11eba5	Remove nontemporality from stores, put in casts to global address space (#1982 ) * Implements casting key loads and stores to address_space(1) so that vector global load and store instructions are emitted by the compiler instead of more costly flat loads and stores * Removes nontemporality from some key stores for gfx950.	2025-10-28 10:34:48 -07:00
corey-derochie-amd	f290e302d3	Updated CODEOWNERS to instead use RCCL-Reviewers team (#2010 ) * Updated CODEOWNERS to instead use RCCL-Reviewers team * Apply suggestion from @nileshnegi Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com> --------- Co-authored-by: Nilesh M Negi <Nilesh.Negi@amd.com>	2025-10-28 09:27:26 -06:00
Kapil S. Pawar	912d53caba	Fix segmentation fault related to ext-profiler plugin (#1986 )	2025-10-23 09:26:35 -05:00
Joseph Macaranas	c2e71e83d1	[External CI] Add references to rocm-systems super repo (#1935 ) - In order to trigger downstream jobs to verify projects that consume rccl, references to those repos are required.	2025-10-22 16:07:05 -04:00
Aravind Ravikumar	506c2e9878	Adding reservation time for salloc in CI (#1992 ) Co-authored-by: arravikum <arravikum@amd.com>	2025-10-22 10:00:01 -04:00
ehsanhosseinzadehKhaligh	aec4f0a659	Updating npkit_trace_generator.py to check npkit directory (#1891 ) * create dir regardless of default or user-provided path if it doesn't exist * Fix npkit_dump_dir on npkit_trace_generator.py --------- Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>	2025-10-22 02:51:16 -05:00

2008 Commits