rocm-systems

Author	SHA1	Message	Date
Rahul Manocha	c4f7593001	clr: Update signal count and pool size for staging buffer (#2889 ) * clr: Update signal count and pool size for staging buffer * Change to naming of variables etc --------- Co-authored-by: Rahul Manocha <rmanocha@amd.com>	2026-01-29 10:34:00 -08:00
SaleelK	340f3aa887	clr: Implement dynamic stream to HWq logic (#1958 ) * clr: Implement dynamic stream to HW queue assignment This change implements dynamic stream to hardware queue (HWq) mapping with the following features: * Queue depth heuristics with weights for optimal HWq assignment * Make last used queue sticky for better locality * Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to pipe mapping based on creation order (single process per device only, as pipe ID is statically assigned by runtime) * More aggressive heuristic usage for better queue distribution * Extend dynamic queues support for all stream priorities Environment variables: * DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics 2 - Depth+Pipe heuristics * DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation * clr: Clean up last_used_queue_	2026-01-23 10:40:54 -08:00
SaleelK	6b28faa532	clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480 ) Problem: The existing SDMA engine selection logic had several issues: 1. Same VirtualGPU/stream could use different SDMA engines for consecutive async copies since copy_engine_status may report engines as busy 2. Busy and Preferred engine check for every copy 3. No global tracking of which VirtualGPU uses which engine, leading to suboptimal resource allocation Solution: Implemented a global SDMA engine allocator with per-stream affinity: - Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments * Maintains global map of active assignments * Enforces exclusivity: different streams use different engines (except inter-GPU copies where preferred engines are prioritized for optimal hardware paths like XGMI links) * Thread-safe allocation/release with Monitor lock - Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_) for fast lookup without map access on hot path - Refactored rocrCopyBuffer() to: 1. Check local cached engine first → use if assigned 2. Call AllocateSdmaEngine() if not assigned → cache result - Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine) into AllocateEngine() for cleaner separation of concerns - Engine release on HostQueue::finish() instead of only VirtualGPU destruction * Improves engine utilization by releasing earlier * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice - Added future path for simple round-robin allocation (kUseSimpleRR) for next-gen GPUs with uniform SDMA bandwidth (disabled by default) Cleanup: - Removed selectSdmaEngine() helper (logic moved to allocator) - Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly) - Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager Benefits: - Ensures consistent per-stream SDMA engine usage - Prevents cross-stream contention and engine thrashing - Prioritizes hardware-optimal paths for inter-GPU transfers - Better resource utilization through earlier release - Cleaner, more maintainable code structure	2026-01-07 19:37:45 -08:00
German Andryeyev	3895aadba6	SWDEV-558849 - Make ROCR path in Windows more stable (#2181 )	2025-12-10 12:37:10 -05:00
SaleelK	acc236fd89	clr: Avoid saving all ProfilingSignals at once (#2108 ) * While reusing signals, it's possible to come across a timestamp that contains several signals, such as when profiling a graph. Reading timestamps from all signals can make the call severely CPU bound. Instead, cache only that signal to avoid the overhead on the critical path.	2025-12-08 11:32:16 -08:00
SaleelK	c105dcd05b	clr: Use graph segment scheduling to process HIP Graphs (#1372 ) * clr: Use graph segment scheduling to process HIP Graphs * Add a broader path to use capture packet capture for all topologies * Refactor code * Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path, Enabled by default * clr: Few fixes and improvements * clr: Detect complex graphs to take classic path * Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling path * clr: Fix a corner-case stack corruption * clr: Track commands of segments instead of snapshots * clr: Fix Batch dispatch logic * Track fence_dirty_ flag for command of other streams * Dependency resolution markers can now accommodate dirty fence on cross streams --------- Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com> Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>	2025-12-01 12:49:26 -08:00
SaleelK	5e418ca256	clr: Allow all engines but prefer recommended engines (#1750 ) * Also honor ROC_P2P_SDMA_SIZE for IPC, since IPC can also mean P2P	2025-11-10 13:10:46 -08:00
SaleelK	738bb19835	clr: Increase kernelArg/managedBuffer size (#1586 ) * Increase the buffer to 4MB. That can help kernel launches limited by a deep kernel pipeline Co-authored-by: JeniferC99 <150404595+JeniferC99@users.noreply.github.com>	2025-11-08 18:32:43 -08:00
Pengda Xie	93947241d0	SWDEV-556684 - HSAIL cleanup (#1657 )	2025-11-08 02:22:03 -08:00
Ioannis Assiouras	6d6b136374	SWDEV-559166 - Fix data races in GetSubmissionBatch, CaptureAndSet and SetQueueStatus (#1441 )	2025-10-23 12:18:31 +01:00
SaleelK	cc18890fe8	clr: Reset barrier_value_packet_ at init (#1162 )	2025-10-13 22:01:46 -07:00
German Andryeyev	ea89ddd589	SWDEV-547108 - Add dll loader for Windows build (#1004 ) The build of ROCR backend will be enabled by default in Windows. It requires the dll loader until ROCR dll will be always available in Windows for any configuration.	2025-09-19 11:25:30 -04:00
SaleelK	149dc17c90	clr: Optimize doorbell ring (#1030 ) Lay foundation to batch packets efficiently for graphs Dynamically copy packets with max threshold set with DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2 Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256 If TS are not collected for a signal for reuse, create a new signal. This can potentially increase signal footprint if the handler doesn't run fast enough.	2025-09-18 15:02:10 -07:00
SaleelK	c4537e8050	SWDEV-553126 - Improve logging (#835 ) * Ability to mask COPY api usage in logs * Show total graph nodes in logs * Add another log level for detailed debug	2025-09-04 10:08:41 -07:00
Danylo Lytovchenko	2ff2316227	Adjust clang format to the new versions, revert broken macro layout (#714 )	2025-08-22 17:23:22 +02:00
Danylo Lytovchenko	f7338717ae	SWDEV-470698 - fix formatting, add format check workflow (#657 )	2025-08-20 19:58:06 +05:30
Manocha, Rahul	b3ccf487da	SWDEV-545952 - API definitions for hipStreamSet/GetAttribute (#831 ) Co-authored-by: Rahul Manocha <rmanocha@amd.com> [ROCm/clr commit: `0f49c4a97f`]	2025-08-15 12:51:35 -07:00
Andryeyev, German	6df9a49437	SWDEV-465041 - Add support for user events with DD (#321 ) * SWDEV-465041 - Add support for user events with DD User events can be replaced with HSA signals. Add the interface to allocate HSA signal for user events and update the status on CL_COMPLETE. Force pinned path with DD to avoid blocking calls. Pinned memory can be released only when the command is complete. Simplify device enqueue path to use generic kernel arg buffer and signals * Fix notifyCmdQueue() logic for OCL * Avoid blocking calls in OCL with DD * Add event destruction in case of failure. [ROCm/clr commit: `2305f8ae56`]	2025-08-12 19:04:36 -04:00
Kudchadker, Saleel	3a849c6962	SWDEV-538195 - Introduce threshold for handler submission (#723 ) - When doing device/stream sync, we can submit a handler which may introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to batch commands for host wait. Default for HIP is 8 commands. - Investigation is underway in ROCr but need to address this for now in HIP runtime. [ROCm/clr commit: `9b045922a8`]	2025-08-06 20:34:42 -07:00
Jayaprakash, Karthik	4ea2d9a5ee	SWDEV-531711 - Report correct error code based on device failure. (#286 ) [ROCm/clr commit: `f5b8db33f1`]	2025-05-17 06:33:13 -04:00
Andryeyev, German	3ea758a2d4	SWDEV-528808 - Release all HW queues even if only one is idle (#240 ) Pytorch may not explicitly idle each queue. Thus, some queues can be considered as busy, but have idle state in reality [ROCm/clr commit: `65a0181a7c`]	2025-05-05 19:09:01 -04:00
Assiouras, Ioannis	4efd624960	SWDEV-525593, SWDEV-527293 - Acquire active queue after xferQueue is created (#165 ) For xferQueue VirtualGPU::create is called after ProfilingBegin so the active queue needs to be acquired. [ROCm/clr commit: `d3fb8eda8b`]	2025-04-30 09:21:11 +01:00
Jayaprakash, Karthik	49a527c826	SWDEV-506467 - Skip Abort in case of crash from the device. (#60 ) Change-Id: I964b2f2647d068202e9c38fcddb1337da754df8d [ROCm/clr commit: `b2388dfb88`]	2025-04-29 11:19:02 +05:30
Kudchadker, Saleel	cd14def193	SWDEV-521647 - Fix tracking of hw_event (#206 ) - When a command may possibly have two packets(like device heap initializer), and if there is no signal on the main kernel packet the tracking was broken as it marked HW event of the command as the first packet signal. - Make sure if no completion signal is attached to the second packet then clear the HW event for the command. [ROCm/clr commit: `072fb0804e`]	2025-04-25 08:46:44 -07:00
Andryeyev, German	5c7c86f66d	SWDEV-517481 - Add dynamic queue management (#37 ) Enabled by default. DEBUG_HIP_DYNAMIC_QUEUES controls the feature [ROCm/clr commit: `28967982b2`]	2025-03-19 11:22:50 -04:00
Saleel Kudchadker	c8f39ec2b0	SWDEV-502365 - Track last used command - This change tries to save extra synchronization packets we may insert as we didn't track the completion signals for every command. We track the current enqueued command until it exits the enqueue stage. We also record the exit scope to know if we flushed the caches - Handle correct release scopes and store completion signal as HW events - Use a new finishCommand implementation to only wait for the command passed as the argument Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc [ROCm/clr commit: `e03e4f3b5d`]	2025-03-04 16:05:02 -05:00
Jimbo Xie	cc229f251f	SWDEV-504383 - Cleaned up kForcedTimeout10us and removed IsHwEventReadyForcedWait Also removed active_wait_timeout Change-Id: I7a429f003c09a4df267b5c0983050704260094c6 [ROCm/clr commit: `4872b420c9`]	2025-01-31 14:40:18 -05:00
German Andryeyev	ae379965dd	SWDEV-459826 - Add a crash dump for a failed queue The logic can analyze the AQL queue state and find a failed AQL packet with the kernel's name Change-Id: I1a478fa2c25462cd07a194784958bdf22454b897 [ROCm/clr commit: `ea0b092af8`]	2025-01-28 14:27:46 -05:00
Anusha GodavarthySurya	08c92f4793	SWDEV-480209 - Make internal callbacks non-blocking Change-Id: Ic918d08f341abfd9a7c167d09f9c723cdc43157f [ROCm/clr commit: `683a942364`]	2025-01-10 02:16:11 -05:00
Sourabh Betigeri	7261404002	SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs Change-Id: I5ac63a6626af8c2b4ac382c52dfe1aaf0b3716b8 [ROCm/clr commit: `03dbcd8ca7`]	2024-12-12 19:29:24 -05:00
Michael Xie	945ae82918	SWDEV-499997 - Unify ManagedBuffer and KernelArg buffer implementation Change-Id: I95421c87904dd62d7ee214539a57c7bda1097ff4 [ROCm/clr commit: `cfcc743824`]	2024-12-12 12:56:23 -05:00
German Andryeyev	6604accdb3	SWDEV-501757 - Use signals without interrupts In active wait mode use signals without interrupts by default and switch to the interrupts only if a callback is required. Change-Id: Ibcde8f7d44c70f8fb8fa5e0a7fdd8b08a2982a8e [ROCm/clr commit: `f4b9d3b7bd`]	2024-12-09 15:16:15 -05:00
Saleel Kudchadker	7d7aa8b69c	SWDEV-497145 - Use rocr copyOnEngine API for staged copies - Refactor blit code and clean ASAN instrumentation - Use unified function for rocr copy - Enable shader copy path for unpinned writeBuffer/readBuffer paths - Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for pinned copies or unpinned H2D/D2H copies < 16KB Change-Id: I42045cca79234b340dbf53dafb93044199736ae4 [ROCm/clr commit: `7863eb92dc`]	2024-12-04 13:38:13 -05:00
Sourabh Betigeri	1712acdd2e	Revert "SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs" This reverts commit `ab0ff9163d`. Reason for revert: hipInfo fails on windows. Updating llvm amd-mainline-closed Change-Id: I57e1fa1945188b0bc0a799c4f3d540f2b7713003 [ROCm/clr commit: `2ca644cf22`]	2024-12-02 16:46:12 -05:00
Sourabh Betigeri	ab0ff9163d	SWDEV-440866 - [hip-roclr] Adds support to batch memory operations APIs Change-Id: I449ffca44bbb04d13348d112e896d603c70fd485 [ROCm/clr commit: `bd5d8e9baf`]	2024-11-30 17:54:32 -05:00
Anusha GodavarthySurya	c34f55babb	SWDEV-489084 - Avoid using queue colliding with the graph launch stream Change-Id: I3ecaf8836c8e0883441275139041c702aba0937e [ROCm/clr commit: `06e6561eb5`]	2024-11-29 08:15:58 -05:00
German Andryeyev	faea40cbb3	SWDEV-486602 - Optimize HSA callback performance - Don't generate callbacks for HIP events - Don't process profiling info in the callback for HIP events - Wait for CPU status update of the submitted commands every 50 calls. That will allow to drain the commands and destroy HSA signals. Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9 [ROCm/clr commit: `364dfb0ed1`]	2024-10-11 14:50:25 -04:00
German Andryeyev	f8fc11c2d8	SWDEV-483586 - Unblock staging H2D transfers Although unpinned copies require synchronizations in HIP, runtime can avoid syncs for H2D copies with a staging buffer Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2 [ROCm/clr commit: `29cc678d8d`]	2024-09-21 10:25:27 -04:00
kjayapra-amd	f19260d568	SWDEV-480772 - Remove name variable from amd::Monitor class. Change-Id: Ie2a4fa44f485786227230f8a892e090e718aa30e [ROCm/clr commit: `12a39fbf22`]	2024-09-19 11:55:01 -04:00
German Andryeyev	9d1d3a6493	SWDEV-470612 - Avoid processing internal signals If only external signals were provided, then just process it without adding internal signals Change-Id: Iaefd65d0f8b0a64b9f6a864a9bd73de20a29dfa4 [ROCm/clr commit: `18187cd8fe`]	2024-07-25 10:08:16 -04:00
Anusha GodavarthySurya	7985a72073	SWDEV-468424 - hipgraph capture memset node Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch. Added support to capture single graph memset node. Capture support for memset node is currently disabled. Memset capture will be enabled when capture for multiple packets is supported. Change-Id: I14dfbc41731025cc3a548a730558915def3fa384 [ROCm/clr commit: `346da4bb40`]	2024-07-19 23:52:50 -04:00
Anusha GodavarthySurya	291f079669	SWDEV-467102 - Hidden heap init for graph capture If the graph has kernels that do device-side allocation, during packet capture, heap is allocated because heap pointer has to be added to the AQL packet, and initialized during graph launch. Handle race with wait when 2 kernels with device heap are enqueued on multiple streams. Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688 [ROCm/clr commit: `57156c524d`]	2024-06-17 02:07:25 -04:00
Ioannis Assiouras	dfe46a3093	SWDEV-467069 - Added safety check in activity prof for accumulate command Adding a safety check prevents an invalid memory access if timestamps and kernelNames vectors are of different size. The patch also moves the addKernelNames for the accumulate command into dispatchAqlPacket function. Change-Id: Iea0927e1253800403a1ae3f3d72de1e7d96476c3 [ROCm/clr commit: `d44f44a5b1`]	2024-06-12 21:53:03 +01:00
Ioannis Assiouras	407d1346f2	SWDEV-463865 - changed device,roc and pal namespaces to be nested under amd Change-Id: Icad342843c039c634e249a13a7aa31400730b1dd [ROCm/clr commit: `775dc204aa`]	2024-06-07 12:23:06 -04:00
Ioannis Assiouras	2f430138c5	SWDEV-451594 - Implement Readback and Avoid HDP Flush workaround for device kernel args Change-Id: I6d41a089a17f55306e7ff402588a1e831b20a7a7 [ROCm/clr commit: `bf74ef4025`]	2024-04-19 09:29:20 -04:00
Ioannis Assiouras	78008c05c5	SWDEV-453301 - Remove the option to write multiple packets in dispatchGenericAqlPacket Dispatching multiple packets while ringing the doorbell only once is not supported by the lower layers Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0 [ROCm/clr commit: `d7f352dbed`]	2024-04-05 05:28:10 -04:00
Saleel Kudchadker	f3aedfbec0	SWDEV-301667 - Create TS for each node recorded in graph - Create a vector to allow multiple TS to be stored in Command. - This would mean we don't wait for entire batch in Accumulate command to finish when we exhaust signals. - Reduce the number of signals created at init to 64. This min value may still need to be tuned but the KFD allows max of 4094 interrupt signals per device. - Store kernel names whenever they are available and not just when profiling. If we dynamically enable profiling like for Torch, a crash can happen if hipGraphInstantiate wasn't included in Torch profile scope because we previously entered kernel names only when profiler is attached. Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006 [ROCm/clr commit: `c157bfb202`]	2024-03-26 14:47:24 -04:00
Saleel Kudchadker	34fd1b7fe5	SWDEV-301667 - Remove resetFenceDirty Don't track the status of fence_dirty_ flag on the host, instead clear it when we submit a barrier on the respective stream. Change-Id: I4d98dbf20c81379c9c5da9f5b67629a8f9f6dfcd [ROCm/clr commit: `0b0df605d4`]	2024-01-15 15:43:14 -05:00
Jaydeep Patel	7933b88d7c	SWDEV-431879 - Introduce IsHandlerPending back. It seems that due to the removal of vdev()->isHandlerPending(), the Marker queued to ensure finish is not enqueued, which causes a hang while waiting on the event for the kernel enqueue command. Change-Id: I364abb2dcb4897b11a7eb61b5d85013b69292792 [ROCm/clr commit: `eecbc2e436`]	2023-11-23 08:45:19 -05:00
Saleel Kudchadker	5f009b7cb1	SWDEV-422207 - Track commands for capture - Track all captured commands under a new AccumulateCommand - Add begin() and end() methods to capture commands - Explicit TS object now passed to certain methods because profilingBegin() and profilingEnd() now happen separately and thus can run into threading issues Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f [ROCm/clr commit: `40f41f4d0b`]	2023-11-03 05:09:04 +00:00

136 Commits