rocm-systems

Author	SHA1	Message	Date
Rahul Manocha	c4f7593001	clr: Update signal count and pool size for staging buffer (#2889 ) * clr: Update signal count and pool size for staging buffer * Change to naming of variables etc --------- Co-authored-by: Rahul Manocha <rmanocha@amd.com>	2026-01-29 10:34:00 -08:00
sluzynsk-amd	f37b100c34	SWDEV-563777 - further reduce compilation warnings (#2331 ) This change resolves some of the warnings generated during clr builds. Quiet regular output of doxygen. Disable non-documented warnings of doxygen. Signed-off-by: Sebastian Luzynski <Sebastian.Luzynski@amd.com>	2026-01-27 20:51:16 +01:00
SaleelK	340f3aa887	clr: Implement dynamic stream to HWq logic (#1958 ) * clr: Implement dynamic stream to HW queue assignment This change implements dynamic stream to hardware queue (HWq) mapping with the following features: * Queue depth heuristics with weights for optimal HWq assignment * Make last used queue sticky for better locality * Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to pipe mapping based on creation order (single process per device only, as pipe ID is statically assigned by runtime) * More aggressive heuristic usage for better queue distribution * Extend dynamic queues support for all stream priorities Environment variables: * DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics 2 - Depth+Pipe heuristics * DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation * clr: Clean up last_used_queue_	2026-01-23 10:40:54 -08:00
SaleelK	6b28faa532	clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480 ) Problem: The existing SDMA engine selection logic had several issues: 1. Same VirtualGPU/stream could use different SDMA engines for consecutive async copies since copy_engine_status may report engines as busy 2. Busy and Preferred engine check for every copy 3. No global tracking of which VirtualGPU uses which engine, leading to suboptimal resource allocation Solution: Implemented a global SDMA engine allocator with per-stream affinity: - Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments * Maintains global map of active assignments * Enforces exclusivity: different streams use different engines (except inter-GPU copies where preferred engines are prioritized for optimal hardware paths like XGMI links) * Thread-safe allocation/release with Monitor lock - Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_) for fast lookup without map access on hot path - Refactored rocrCopyBuffer() to: 1. Check local cached engine first → use if assigned 2. Call AllocateSdmaEngine() if not assigned → cache result - Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine) into AllocateEngine() for cleaner separation of concerns - Engine release on HostQueue::finish() instead of only VirtualGPU destruction * Improves engine utilization by releasing earlier * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice - Added future path for simple round-robin allocation (kUseSimpleRR) for next-gen GPUs with uniform SDMA bandwidth (disabled by default) Cleanup: - Removed selectSdmaEngine() helper (logic moved to allocator) - Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly) - Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager Benefits: - Ensures consistent per-stream SDMA engine usage - Prevents cross-stream contention and engine thrashing - Prioritizes hardware-optimal paths for inter-GPU transfers - Better resource utilization through earlier release - Cleaner, more maintainable code structure	2026-01-07 19:37:45 -08:00
Ioannis Assiouras	aecc845456	SWDEV-573589 - Fixed performance regression due to the increase of the signal pool (#2470 )	2026-01-02 12:50:56 +00:00
German Andryeyev	741b4b9fdf	SWDEV-558849 - Fix Windows build for ROCR backend (#2368 )	2025-12-29 08:35:22 -05:00
Sourabh U Betigeri	d552491985	SWDEV-572329 - Remove barrier packet (#2304 )	2025-12-19 13:37:48 -08:00
systems-assistant[bot]	b002c6a739	SWDEV-538607 - Add SIMDe as a build dependency, remove naked intrinsic use. (#500 ) Co-authored-by: Alex Voicu <alexandru.voicu@amd.com> Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>	2025-12-15 17:40:51 +00:00
SaleelK	10635483ad	clr: Fix packet batch write logic (#2236 ) * When writing bulk packets, always invalidate packet headers. It's possible that the CP fetcher can have multiple packets in flight. In such cases we may end up with a malformed packet because the writes are not yet complete when the CP finds a valid header.	2025-12-11 04:26:41 -08:00
German Andryeyev	3895aadba6	SWDEV-558849 - Make ROCR path in Windows more stable (#2181 )	2025-12-10 12:37:10 -05:00
SaleelK	acc236fd89	clr: Avoid saving all ProfilingSignals at once (#2108 ) * While reusing signals, it's possible to come across a timestamp that contains several signals, for example when profiling a graph. Reading timestamps from all signals can make the call severely CPU bound. Instead, cache only that signal to avoid the overhead on the critical path.	2025-12-08 11:32:16 -08:00
Ioannis Assiouras	65b769ee16	SWDEV-569101 - increase signal list size to at least DEBUG_HIP_GRAPH_BATCH_SIZE (#2084 )	2025-12-01 18:52:51 -08:00
SaleelK	c105dcd05b	clr: Use graph segment scheduling to process HIP Graphs (#1372 ) * clr: Use graph segment scheduling to process HIP Graphs * Add a broader path to use capture packet capture for all topologies * Refactor code * Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path, Enabled by default * clr: Few fixes and improvements * clr: Detect complex graphs to take classic path * Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling path * clr: Fix a corner-case stack corruption * clr: Track commands of segments instead of snapshots * clr: Fix Batch dispatch logic * Track fence_dirty_ flag for command of other streams * Dependency resolution markers can now accommodate dirty fence on cross streams --------- Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com> Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>	2025-12-01 12:49:26 -08:00
Ioannis Assiouras	4f91b68988	SWDEV-559166 - Remove obsolete member execInfoOffset from KernelParameters (#1790 )	2025-11-12 17:20:36 +00:00
SaleelK	5e418ca256	clr: Allow all engines but prefer recommended engines (#1750 ) * Also honor ROC_P2P_SDMA_SIZE for IPC, since IPC can also mean P2P	2025-11-10 13:10:46 -08:00
SaleelK	738bb19835	clr: Increase kernelArg/managedBuffer size (#1586 ) * Increase the buffer to 4MB. That can help kernel launches limited by a deep kernel pipeline Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>	2025-11-08 18:32:43 -08:00
Pengda Xie	93947241d0	SWDEV-556684 - HSAIL cleanup (#1657 )	2025-11-08 02:22:03 -08:00
Pengda Xie	5dd15e22ca	SWDEV-559514 - Add queue validation to submitMarker sync path (#1308 )	2025-11-08 02:21:36 -08:00
SaleelK	f301053740	clr: Improve logging (#1457 )	2025-10-25 15:55:27 -07:00
SaleelK	839fb95717	clr: Do not increase signal pool (#1354 ) * Do not increase signal pool when profiling, instead allow saving off timestamps. This is slow but a tradeoff to memory footprint of the signals	2025-10-23 22:05:00 -07:00
Ioannis Assiouras	602ea0be1e	SWDEV-558078 - Fix use-after-free in graph tests due to AsyncEventHandler (#1502 )	2025-10-23 22:49:24 +01:00
Pengda Xie	a4bbd73dc6	SWDEV-556684 - Remove HSAIL support (#1183 )	2025-10-23 11:21:49 -07:00
SaleelK	cc18890fe8	clr: Reset barrier_value_packet_ at init (#1162 )	2025-10-13 22:01:46 -07:00
Godavarthy Surya, Anusha	d3cc2c7668	SWDEV-524745 - Part-III Add multi device support for hip graph (#814 ) - Retrieve the list of devices linked to each branch using stream ID x. - Identify the necessary streams for each device to facilitate graph execution. - Create the necessary streams for each device to ensure successful graph execution. - Implement support for launching a multi-device, single-branch graph. Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>	2025-10-10 10:27:27 +05:30
Pengda Xie	d195d925e9	SWDEV-548034 - Separate sdma signal from compute in checkGpuTime (#1201 )	2025-10-09 14:55:25 -07:00
German Andryeyev	bb1295bcdf	SWDEV-547108 - Fix compilation errors under Windows (#1085 ) Also correct AQL print under Windows	2025-09-26 09:42:50 -04:00
Godavarthy Surya, Anusha	fb72d7f851	SWDEV-524746 - Part-II Add multi device support for hip graph. Updated kernel arg manager for each device (#813 ) - Updated kernel arg manager to support allocating kernel args on multiple devices for single graph. - Updated AQL path to capture on the device where graph node is added. Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>	2025-09-25 20:38:18 +05:30
SaleelK	34b9184686	clr: Fix memory corruption for memset nodes (#1068 ) * Detect graph capture and use graph kernelarg memory for FillBuffer pattern	2025-09-23 17:17:33 -07:00
German Andryeyev	ea89ddd589	SWDEV-547108 - Add dll loader for Windows build (#1004 ) The build of ROCR backend will be enabled by default in Windows. It requires the dll loader until ROCR dll will be always available in Windows for any configuration.	2025-09-19 11:25:30 -04:00
SaleelK	149dc17c90	clr: Optimize doorbell ring (#1030 ) Lay foundation to batch packets efficiently for graphs Dynamically copy packets with max threshold set with DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2 Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256 If TS are not collected for a signal for reuse, create a new signal. This can potentially increase signal footprint if the handler doesn't run fast enough.	2025-09-18 15:02:10 -07:00
Ioannis Assiouras	5ac163a811	SWDEV-548770 - Added system scope acquire for all packets in gfx12 (#966 )	2025-09-18 14:33:17 +01:00
Ioannis Assiouras	35629e433d	SWDEV-546146 - Added support for hipMemLocationTypeHost in hipMemSetAccess (#682 )	2025-09-10 23:06:20 +01:00
SaleelK	e197aa83ba	SWDEV-543723 - Execute permission for kernArg buf (#728 ) - Refactor deviceLocalAlloc arguments - Refactor hostAlloc code, have cleaner interface - Kern args buffer need to have execute flag set as CP enforces this on certain newer HW.	2025-09-08 12:21:30 -07:00
SaleelK	c4537e8050	SWDEV-553126 - Improve logging (#835 ) * Ability to mask COPY api usage in logs * Show total graph nodes in logs * Add another log level for detailed debug	2025-09-04 10:08:41 -07:00
Danylo Lytovchenko	2ff2316227	Adjust clang format to the new versions, revert broken macro layout (#714 )	2025-08-22 17:23:22 +02:00
Danylo Lytovchenko	f7338717ae	SWDEV-470698 - fix formatting, add format check workflow (#657 )	2025-08-20 19:58:06 +05:30
Andryeyev, German	72b9408fed	SWDEV-547108 - Fix compilation errors under Windows (#867 ) Interop and numa are not enabled. [ROCm/clr commit: `0ac913e64c`]	2025-08-17 02:33:31 -04:00
Betigeri, Sourabh	35e48d1eaf	SWDEV-546293 - hipMemPrefetchAsync_v2 and hipMemAdvise_v2 implementation (#869 ) [ROCm/clr commit: `cbee74a80e`]	2025-08-15 22:40:04 -07:00
Andryeyev, German	6df9a49437	SWDEV-465041 - Add support for user events with DD (#321 ) * SWDEV-465041 - Add support for user events with DD User events can be replaced with HSA signals. Add the interface to allocate HSA signal for user events and update the status on CL_COMPLETE. Force pinned path with DD to avoid blocking calls. Pinned memory can be released only when the command is complete. Simplify device enqueue path to use generic kernel arg buffer and signals * Fix notifyCmdQueue() logic for OCL * Avoid blocking calls in OCL with DD * Add event destruction in case of failure. [ROCm/clr commit: `2305f8ae56`]	2025-08-12 19:04:36 -04:00
Kudchadker, Saleel	3a849c6962	SWDEV-538195 - Introduce threshold for handler submission (#723 ) - When doing device/stream sync, we can submit a handler which may introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to batch commands for host wait. Default for HIP is 8 commands. - Investigation is underway in ROCr but need to address this for now in HIP runtime. [ROCm/clr commit: `9b045922a8`]	2025-08-06 20:34:42 -07:00
Xie, Pengda	b7d8cb56d1	SWDEV-505833 - Remove DEBUG_CLR_SKIP_RELEASE_SCOPE flag (#735 ) Cleanup debug flag DEBUG_CLR_SKIP_RELEASE_SCOPE [ROCm/clr commit: `4121a860bf`]	2025-08-05 08:31:55 -07:00
Kudchadker, Saleel	a1d3da6bd3	SWDEV-547614 - Demangle graph kernel names (#809 ) [ROCm/clr commit: `1492328894`]	2025-08-01 14:35:30 -07:00
Andryeyev, German	b9669ea266	SWDEV-531678 - Remove split path from the dispatch (#283 ) The split path for blit kernels are no longer necessary, since the new blit kernels don't use the copy size as the global workload [ROCm/clr commit: `da198ac5b2`]	2025-05-12 12:50:32 -04:00
Andryeyev, German	3ea758a2d4	SWDEV-528808 - Release all HW queues even if only one is idle (#240 ) Pytorch may not explicitly idle each queue. Thus, some queues can be considered as busy, but have idle state in reality [ROCm/clr commit: `65a0181a7c`]	2025-05-05 19:09:01 -04:00
Sang, Tao	68deb3d10a	SWDEV-520352 - Remove HostThread and legacy monitor (#230 ) * SWDEV-520352 - Remove HostThread and legacy monitor Remove HostThread, semaphore and legacy monitor. Make the original thread and command queue logic stricter. Add more comments to make the logic clearer. Some other minor improvements. Also part of SWDEV-458943. [ROCm/clr commit: `96cadbc9e9`]	2025-04-29 09:55:24 -04:00
Kudchadker, Saleel	cd14def193	SWDEV-521647 - Fix tracking of hw_event (#206 ) - When a command may possibly have two packets(like device heap initializer), and if there is no signal on the main kernel packet the tracking was broken as it marked HW event of the command as the first packet signal. - Make sure if no completion signal is attached to the second packet then clear the HW event for the command. [ROCm/clr commit: `072fb0804e`]	2025-04-25 08:46:44 -07:00
Kudchadker, Saleel	1b1d6b841e	SWDEV-510186 - Improve logging (#220 ) - Print all arguments for logs, this is useful for debug [ROCm/clr commit: `ce24936970`]	2025-04-25 08:40:31 -07:00
Andryeyev, German	c50f85df20	SWDEV-517481 - Add more restrictions to the queue management (#168 ) [ROCm/clr commit: `4c363df3bf`]	2025-04-10 21:51:45 +05:30
Patel, Jaydeepkumar	2f3bc7f01c	SWDEV-521011 - Allow max stack size as per ISA. (#73 ) [ROCm/clr commit: `9e7248aa36`]	2025-04-08 10:15:38 +05:30
Arandjelovic, Marko	cc5124241b	Revert SWDEV-512344 - Unmap all subbuffers (#26 ) This reverts commit 0b69120cfcb5b4689d9f2037b1a01e274d85c20f. [ROCm/clr commit: `e7ada4effe`]	2025-03-19 21:17:36 +05:30


329 commits