rocm-systems

Author	SHA1	Message	Date
SaleelK	340f3aa887	clr: Implement dynamic stream to HWq logic (#1958) * clr: Implement dynamic stream to HW queue assignment This change implements dynamic stream to hardware queue (HWq) mapping with the following features: * Queue depth heuristics with weights for optimal HWq assignment * Make last used queue sticky for better locality * Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to pipe mapping based on creation order (single process per device only, as pipe ID is statically assigned by runtime) * More aggressive heuristic usage for better queue distribution * Extend dynamic queues support for all stream priorities Environment variables: * DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics, 2 - Depth+Pipe heuristics * DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation * clr: Clean up last_used_queue_	2026-01-23 10:40:54 -08:00
SaleelK	6b28faa532	clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480) Problem: The existing SDMA engine selection logic had several issues: 1. Same VirtualGPU/stream could use different SDMA engines for consecutive async copies since copy_engine_status may report engines as busy 2. Busy and Preferred engine check for every copy 3. No global tracking of which VirtualGPU uses which engine, leading to suboptimal resource allocation Solution: Implemented a global SDMA engine allocator with per-stream affinity: - Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments * Maintains global map of active assignments * Enforces exclusivity: different streams use different engines (except inter-GPU copies where preferred engines are prioritized for optimal hardware paths like XGMI links) * Thread-safe allocation/release with Monitor lock - Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_) for fast lookup without map access on hot path - Refactored rocrCopyBuffer() to: 1. Check local cached engine first → use if assigned 2. Call AllocateSdmaEngine() if not assigned → cache result - Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine) into AllocateEngine() for cleaner separation of concerns - Engine release on HostQueue::finish() instead of only VirtualGPU destruction * Improves engine utilization by releasing earlier * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice - Added future path for simple round-robin allocation (kUseSimpleRR) for next-gen GPUs with uniform SDMA bandwidth (disabled by default) Cleanup: - Removed selectSdmaEngine() helper (logic moved to allocator) - Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly) - Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager Benefits: - Ensures consistent per-stream SDMA engine usage - Prevents cross-stream contention and engine thrashing - Prioritizes hardware-optimal paths for inter-GPU transfers - Better resource utilization through earlier release - Cleaner, more maintainable code structure	2026-01-07 19:37:45 -08:00
Ioannis Assiouras	49b8900158	SWDEV-558849 - keep the lastEnqueueCommand_ when PAL backend is enabled (#2320)	2025-12-23 21:24:09 +00:00
German Andryeyev	3895aadba6	SWDEV-558849 - Make ROCR path in Windows more stable (#2181)	2025-12-10 12:37:10 -05:00
Rahul Manocha	4f075902fc	SWDEV-555347 - Remove lock contention in async events loop (#878) * SWDEV-555347 - Remove lock contention in async events loop * SWDEV-555347 - Introduce Pool of AsyncEventItems * create generic mempool for AsyncEventItem * Use BaseShared allocate and free for async event pool --------- Co-authored-by: Rahul Manocha <rmanocha@amd.com>	2025-10-24 08:43:00 -07:00
Ioannis Assiouras	6d6b136374	SWDEV-559166 - Fix data races in GetSubmissionBatch, CaptureAndSet and SetQueueStatus (#1441)	2025-10-23 12:18:31 +01:00
Godavarthy Surya, Anusha	ce560304a8	SWDEV-548417 - Fix Memleaks in Graph (#713) Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>	2025-09-19 17:39:36 +05:30
SaleelK	c4537e8050	SWDEV-553126 - Improve logging (#835) * Ability to mask COPY api usage in logs * Show total graph nodes in logs * Add another log level for detailed debug	2025-09-04 10:08:41 -07:00
Danylo Lytovchenko	f7338717ae	SWDEV-470698 - fix formatting, add format check workflow (#657)	2025-08-20 19:58:06 +05:30
Manocha, Rahul	b3ccf487da	SWDEV-545952 - API definitions for hipStreamSet/GetAttribute (#831) Co-authored-by: Rahul Manocha <rmanocha@amd.com> [ROCm/clr commit: `0f49c4a97f`]	2025-08-15 12:51:35 -07:00
Stojiljkovic, Vladana	33085dd232	SWDEV-533220 - Release marker when HostQueue is destroyed (#460) Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com> [ROCm/clr commit: `14760c6eba`]	2025-08-13 15:15:31 +02:00
Andryeyev, German	6df9a49437	SWDEV-465041 - Add support for user events with DD (#321) * SWDEV-465041 - Add support for user events with DD User events can be replaced with HSA signals. Add the interface to allocate HSA signal for user events and update the status on CL_COMPLETE. Force pinned path with DD to avoid blocking calls. Pinned memory can be released only when the command is complete. Simplify device enqueue path to use generic kernel arg buffer and signals * Fix notifyCmdQueue() logic for OCL * Avoid blocking calls in OCL with DD * Add event destruction in case of failure. [ROCm/clr commit: `2305f8ae56`]	2025-08-12 19:04:36 -04:00
Kudchadker, Saleel	3a849c6962	SWDEV-538195 - Introduce threshold for handler submission (#723) - When doing device/stream sync, we can submit a handler which may introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to batch commands for host wait. Default for HIP is 8 commands. - Investigation is underway in ROCr but need to address this for now in HIP runtime. [ROCm/clr commit: `9b045922a8`]	2025-08-06 20:34:42 -07:00
Patel, Jaydeepkumar	821a1d89b0	SWDEV-536226 - Avoid waiting for lastCommand completion if the GPU has already reported an error; otherwise it causes a hang because the command status never becomes CL_COMPLETE. (#478) [ROCm/clr commit: `a60212b9b4`]	2025-06-25 20:59:17 +05:30
Jayaprakash, Karthik	4ea2d9a5ee	SWDEV-531711 - Report correct error code based on device failure. (#286) [ROCm/clr commit: `f5b8db33f1`]	2025-05-17 06:33:13 -04:00
Andryeyev, German	3ea758a2d4	SWDEV-528808 - Release all HW queues even if only one is idle (#240) PyTorch may not explicitly idle each queue. Thus, some queues can be considered busy but are idle in reality [ROCm/clr commit: `65a0181a7c`]	2025-05-05 19:09:01 -04:00
Sang, Tao	68deb3d10a	SWDEV-520352 - Remove HostThread and legacy monitor (#230) * SWDEV-520352 - Remove HostThread and legacy monitor Remove HostThread, semaphore and legacy monitor. Make the original thread and command queue logic stricter. Add more comments to make the logic clearer. Some other minor improvements. Also part of SWDEV-458943. [ROCm/clr commit: `96cadbc9e9`]	2025-04-29 09:55:24 -04:00
Sang, Tao	60a1e6dbc1	SWDEV-523824 - Fix data validation issue of rocFFT (#154) Fix data validation issue of rocFFT when dynamic queue is on. ReleaseHwQueue() can be called only when there is no command in HostQueue. The checking condition needs to be protected by a lock. [ROCm/clr commit: `18d191fd1d`]	2025-04-08 20:30:06 -04:00
Arandjelovic, Marko	1c83314659	SWDEV-517867 - Remove invalid assert (#55) * Remove invalid assert * Retrigger CI * Rebase [ROCm/clr commit: `8fcaa1ca93`]	2025-04-03 11:14:32 +02:00
Andryeyev, German	5c7c86f66d	SWDEV-517481 - Add dynamic queue management (#37) Enabled by default. DEBUG_HIP_DYNAMIC_QUEUES controls the feature [ROCm/clr commit: `28967982b2`]	2025-03-19 11:22:50 -04:00
Saleel Kudchadker	c8f39ec2b0	SWDEV-502365 - Track last used command - This change tries to save extra synchronization packets we may insert as we didn't track the completion signals for every command. We track the current enqueued command until it exits the enqueue stage. We also record the exit scope to know if we flushed the caches - Handle correct release scopes and store completion signal as HW events - Use a new finishCommand implementation to only wait for the command passed as the argument Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc [ROCm/clr commit: `e03e4f3b5d`]	2025-03-04 16:05:02 -05:00
Aidan Belton-Schure	4b4a35b86b	SWDEV-508279 - Improve HIP event profiling There are 2 functional changes to this patch: * Use GPU timing for internal markers for HIP. * Measure CPU time closer to GPU timer, to reduce delta between GPU/CPU timestamp measurements. There are some smaller non-functional updates: * waifForFence -> waitForFence typo * Remove unused drmProfiling Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a [ROCm/clr commit: `179801a750`]	2025-02-13 04:15:40 -05:00
Saleel Kudchadker	d0656c944b	SWDEV-504494 - Resolve signal dependencies - Resolve signal dependencies for barrier value packet if there are > 1 dependent signals. Barrier Value packet accounts for only 1 dep signal - Better log Change-Id: Ia506ad5d80b91d598f92e7b539f41756e9b4b64b [ROCm/clr commit: `2d450e8b06`]	2025-01-29 19:49:02 +00:00
Anusha GodavarthySurya	08c92f4793	SWDEV-480209 - Make internal callbacks non-blocking Change-Id: Ic918d08f341abfd9a7c167d09f9c723cdc43157f [ROCm/clr commit: `683a942364`]	2025-01-10 02:16:11 -05:00
German Andryeyev	3191f8e942	SWDEV-486602 - Add tracking of HSA handlers Add an atomic counter to track the outstanding HSA handlers. Wait on CPU for the callbacks if the number exceeds the value in DEBUG_HIP_BLOCK_SYNC env variable. Change-Id: I95dc8c4bf0258c7e59411b7504220709ed6898c5 [ROCm/clr commit: `403f624bf8`]	2024-10-25 15:20:50 -04:00
German Andryeyev	0a03665a3f	SWDEV-491375 - Limit the SW batch size Applications may submit commands without waiting for the GPU. That causes growth of unreleased SW commands. Make sure runtime flushes SW queue, if it grows over some threshold, controlled by DEBUG_CLR_MAX_BATCH_SIZE. Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396 [ROCm/clr commit: `8657a77029`]	2024-10-17 10:53:57 -04:00
German Andryeyev	faea40cbb3	SWDEV-486602 - Optimize HSA callback performance - Don't generate callbacks for HIP events - Don't process profiling info in the callback for HIP events - Wait for CPU status update of the submitted commands every 50 calls. That will allow to drain the commands and destroy HSA signals. Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9 [ROCm/clr commit: `364dfb0ed1`]	2024-10-11 14:50:25 -04:00
Ioannis Assiouras	00cb623a67	SWDEV-488851 - Correctly remove the queue from the active set on windows Change-Id: I4d21743ecf7a44636121f85566f898e62ff61e97 [ROCm/clr commit: `07bcc283f9`]	2024-10-02 12:06:59 +01:00
Ioannis Assiouras	b5a8d775d6	SWDEV-476929 - Introduce an activeQueues set The new set tracks only the queues that have a command submitted to them. This allows for fast iteration in waitActiveStreams. Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7 [ROCm/clr commit: `bcc545e6b8`]	2024-09-16 15:53:49 -04:00
taosang2	881ffd6650	SWDEV-467540 - Get lastCommand safely The last command must be retrieved in a protected way when calling awaitCompletion(), where lastCommand will be released and possibly destroyed. This can solve scope lock(notify_lock_) crash in Event::notifyCmdQueue() with AMD_DIRECT_DISPATCH = true. Change-Id: I4297166f912a71112f4a8945d993160ba9afdc34 [ROCm/clr commit: `749385155a`]	2024-06-28 21:18:22 -04:00
Ioannis Assiouras	af089a2171	SWDEV-463865 - namespace changes to prevent symbol conflicts in static builds Change-Id: I09ceb5962b7aa19156909f47167c87d6887c9cd1 [ROCm/clr commit: `3edf1501cc`]	2024-06-12 16:22:27 -04:00
Ioannis Assiouras	60ba0874fa	SWDEV-460925 - Do awaitCompletion before releasing the lastEnqueueCommand Change-Id: I210399dd1bced13c0923fdb1c215e044920c5a4b [ROCm/clr commit: `d6eaf49033`]	2024-05-28 06:31:10 +00:00
Saleel Kudchadker	3a67addd48	SWDEV-459778 - Remove CPU wait for profiler - No cpu wait is needed when profiler is attached, Doing this changes the application profile when roctracer is attached. Change-Id: I2b9cfc48d697cf5ed54bb6a240d8c12bdb079171 [ROCm/clr commit: `51e4368723`]	2024-05-28 06:28:17 +00:00
German Andryeyev	a2ffb2ad40	SWDEV-440746 - Release last command on terminate Change-Id: Ib6a9b8fc9a8692eb17b39b854cefd92c6b59733f [ROCm/clr commit: `0ccdb3e160`]	2024-04-22 09:57:38 -04:00
Jaydeep Patel	7933b88d7c	SWDEV-431879 - Introduce IsHandlerPending back. It seems that due to the removal of vdev()->isHandlerPending(), the Marker queued to ensure finish is not enqueued, and that causes a hang while waiting on the event for the kernel enqueue command. Change-Id: I364abb2dcb4897b11a7eb61b5d85013b69292792 [ROCm/clr commit: `eecbc2e436`]	2023-11-23 08:45:19 -05:00
Saleel Kudchadker	1d4bd084b8	SWDEV-301667 - Cleanup unused paths - Refactor code and cleanup logic for callback saving for event records Change-Id: I5c56aa8e9c968a5bca70fb07ad1796da318e9e89 [ROCm/clr commit: `1338ff37e8`]	2023-11-02 11:43:41 -04:00
German Andryeyev	bd63f3f614	SWDEV-424603 - Use OR for CPU wait request Make sure rocclr doesn't overwrite the client's request for a wait. Change-Id: I0addf18ea408b7f4ecaa1e04b2877cc0bbbfcc0d [ROCm/clr commit: `fe7b36f3cb`]	2023-10-06 16:51:44 -04:00
German Andryeyev	d593231137	SWDEV-424603 - Force CPU wait if profiling Some pytorch tests use a tracer plugin and rely on profiling information to be reported right after hipDeviceSynchronize() Change-Id: Ib021a1e7b1a30b3c24de72627c471810f7f7878d [ROCm/clr commit: `5438b6362e`]	2023-10-06 11:33:06 -04:00
German Andryeyev	ee34d05add	SWDEV-424249 - Check if HwEvent is available Allocate marker only if HW event doesn't exist for the last command. Change-Id: I3e7284202365a9c75313fb5403f0c1908ab51d1e [ROCm/clr commit: `596b496c16`]	2023-10-02 11:27:16 -04:00
German Andryeyev	2d492a201b	SWDEV-423317 - Enable GPU wait for hip sync calls hipStreamSynchronize and hipDeviceSynchronize will no longer wait for CPU commands in DD mode Change-Id: I079c8bbfc34ddc6d3e2d74c92a34665877e512a5 [ROCm/clr commit: `fbea58ba11`]	2023-09-22 13:04:27 -04:00
Saleel Kudchadker	0a26b75238	SWDEV-301667 - Use large signal pool Use large signal pool if profiler is connected or profiling forced enabled. This is needed to mitigate signal creation overhead when profiling as signals are attached to every packet and deeper batch may show overhead of signal allocation. Change-Id: I8034b8a20b55328b87d593bf044f59672f9653e8 [ROCm/clr commit: `1ec0ba3537`]	2023-08-24 19:17:05 -04:00
Rakesh Roy	f887f2fc6f	SWDEV-405329 - Fix cuMask issue for WGP mode - Enable CUs adjacent pairwise for WGP mode - In HostQueue::terminate() do not segfault if virtual device hasn't been created Change-Id: I94402ff333308af5824878086cc238b3993d534d [ROCm/clr commit: `8c1232124e`]	2023-06-30 01:09:01 -04:00
Saleel Kudchadker	858e311f34	SWDEV-364604 - Add ROCclr support for hipEventDisableSystemFence Change-Id: I6127b432a8759359359a1890fda85bc401be6a56 [ROCm/clr commit: `3e603d986a`]	2023-02-21 19:07:35 -05:00
German	73f02aa6dc	SWDEV-382397 - Move VirtualGPU destruction back to the thread exit OS can terminate unfinished queue thread from default stream at any time. Potentially leaving the queue lock in a bad state and causing a deadlock if runtime destroys VirtualGPU later from the host thread. Change-Id: I247f102ee84e6b4dba947504933395071945c85d [ROCm/clr commit: `28daf98f1f`]	2023-02-17 10:05:49 -05:00
German	f857dcc48d	SWDEV-352197 - Destroy virtual device in thread destructor Windows kills threads on exit without any notification. However, runtime can still destroy VirtualGPU object from the host thread with HostQueue destruction. This change also forces RGP trace transfer on the last capture without any delays. Change-Id: I768e87e99e1d23a021e63c12f36e450817743759 [ROCm/clr commit: `ad33a021cb`]	2023-01-31 10:53:48 -05:00
Ajay	3d12929eb8	SWDEV-372757 - thread check workaround for windows hang Change-Id: Ie9f87b88dd0f3078ad1919edc336f297f6b40373 [ROCm/clr commit: `ecea27eb2d`]	2023-01-13 04:05:35 -05:00
German	f5f0a6c618	SWDEV-352487 - Don't add notifications as the last command Change-Id: Ifed34485839ef2c9491e8e8f6bb3569932160b1c [ROCm/clr commit: `e223b0f678`]	2022-10-24 09:39:03 -04:00
Saleel Kudchadker	0dd9add8e1	SWDEV-352001 - Store last scopes for dispatch - Store last fence scopes and use the last value to determine if we need a cache flush again. This helps cases where hipExtLaunchKernel API is used. - Purge code for ROC_EVENT_NO_FLUSH Change-Id: I531cf9c9c60d5e2b3a9e265d0f52f79ed2fa8a8c [ROCm/clr commit: `9b5cbd37a2`]	2022-09-22 11:34:10 -04:00
Joseph Greathouse	b995ea06e8	SWDEV-330307 - Avoid releasing command before last use The fix for SWDEV-329789 moved down the last use of a command object pointer in order to prevent a race condition. However, the previous patch did not move down the release of that command. By releasing the command early, another thread could get a command with the same pointer. That second thread could later submit work to the queue using that new command. The first thread could then perform a comparison against the queue's last command using its own now-stale pointer. This could eventually allow the second thread to skip synchronizing on the queue. This would result in host synchronizations completing before their device work was actually complete. Change-Id: I292b7b369743251ceafe453a4c5cae14a6d01046 [ROCm/clr commit: `6b956f7627`]	2022-08-31 16:07:49 -04:00
Jason Tang	fb753e489d	SWDEV-333471 - Add GPU_FORCE_QUEUE_PROFILING To support both hip and ocl. HIP_FORCE_QUEUE_PROFILING will be replaced with this later on. Change-Id: I6d3514b1568ff049584ed9fd74bbdb3e4f4bf0c3 [ROCm/clr commit: `d92b3a2d90`]	2022-08-19 10:51:41 -04:00

89 commits