Commit graph

89 commits

Author SHA1 Message Date
SaleelK 340f3aa887 clr: Implement dynamic stream to HWq logic (#1958)
* clr: Implement dynamic stream to HW queue assignment

This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:

* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to
  pipe mapping based on creation order (single process per device only,
  as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities

Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - depth heuristics, 2 -
  depth+pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation

* clr: Clean up last_used_queue_
2026-01-23 10:40:54 -08:00
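The depth heuristic with a sticky last-used queue described in the commit above might be sketched roughly as follows. This is only an illustration, not the clr implementation: `HwQueueState`, `PickQueue`, and the `sticky_bonus` weight are hypothetical names, and the pipe-aware heuristic is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-queue bookkeeping; not a clr type.
struct HwQueueState {
  std::size_t depth;  // commands currently outstanding on this HW queue
};

// Returns the index of the queue with the lowest weighted depth.
// The last used queue has `sticky_bonus` subtracted from its score, so it
// keeps winning unless another queue is shallower by more than the bonus.
std::size_t PickQueue(const std::vector<HwQueueState>& queues,
                      std::size_t last_used, long sticky_bonus) {
  assert(!queues.empty() && last_used < queues.size());
  std::size_t best = last_used;
  long best_score = static_cast<long>(queues[last_used].depth) - sticky_bonus;
  for (std::size_t i = 0; i < queues.size(); ++i) {
    long score = static_cast<long>(queues[i].depth) -
                 (i == last_used ? sticky_bonus : 0);
    if (score < best_score) {
      best_score = score;
      best = i;
    }
  }
  return best;
}
```

With depths {5, 4, 9}, last used queue 0, and a bonus of 2, the sketch stays on queue 0; with a bonus of 0 it moves to queue 1.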
SaleelK 6b28faa532 clr: Implement per-stream SDMA engine affinity for improved copy performance (#2480)
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
   async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks were performed for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
   suboptimal resource allocation

Solution:
Implemented a global SDMA engine allocator with per-stream affinity:

- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
  * Maintains global map of active assignments
  * Enforces exclusivity: different streams use different engines (except
    inter-GPU copies where preferred engines are prioritized for optimal
    hardware paths like XGMI links)
  * Thread-safe allocation/release with Monitor lock

- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
  for fast lookup without map access on hot path

- Refactored rocrCopyBuffer() to:
  1. Check local cached engine first → use if assigned
  2. Call AllocateSdmaEngine() if not assigned → cache result

- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
  into AllocateEngine() for cleaner separation of concerns

- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
  * Improves engine utilization by releasing earlier
  * Added virtual ReleaseSdmaEngines() method to device::VirtualDevice

- Added future path for simple round-robin allocation (kUseSimpleRR) for
  next-gen GPUs with uniform SDMA bandwidth (disabled by default)

Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager

Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
2026-01-07 19:37:45 -08:00
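A minimal sketch of the per-stream affinity idea in the commit above, under stated assumptions: `SdmaEngineAllocatorSketch` is a hypothetical stand-in for Device::SdmaEngineAllocator, `std::mutex` replaces the runtime's Monitor lock, and the HSA status and preferred-engine queries as well as the inter-GPU exception are not modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical allocator: one engine (one bit of the mask) per stream.
class SdmaEngineAllocatorSketch {
 public:
  explicit SdmaEngineAllocatorSketch(std::uint32_t engine_mask)
      : free_mask_(engine_mask) {}

  // Returns a one-bit engine mask for `stream`, or 0 if no engine is free.
  // A stream that already holds an engine always gets the same one back.
  std::uint32_t Allocate(const void* stream) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assigned_.find(stream);
    if (it != assigned_.end()) return it->second;
    if (free_mask_ == 0) return 0;
    std::uint32_t engine = free_mask_ & (~free_mask_ + 1u);  // lowest set bit
    free_mask_ &= ~engine;
    assigned_.emplace(stream, engine);
    return engine;
  }

  // Returns the stream's engine to the free set (e.g. when the queue finishes).
  void Release(const void* stream) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assigned_.find(stream);
    if (it == assigned_.end()) return;
    assert((free_mask_ & it->second) == 0);  // an assigned engine is never free
    free_mask_ |= it->second;
    assigned_.erase(it);
  }

 private:
  std::mutex mutex_;
  std::uint32_t free_mask_;
  std::unordered_map<const void*, std::uint32_t> assigned_;
};
```

In the real change the VirtualGPU also caches its engine locally so the hot path avoids the map lookup; the sketch keeps only the global map.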
Ioannis Assiouras 49b8900158 SWDEV-558849 - keep the lastEnqueueCommand_ when PAL backend is enabled (#2320) 2025-12-23 21:24:09 +00:00
German Andryeyev 3895aadba6 SWDEV-558849 - Make ROCR path in Windows more stable (#2181) 2025-12-10 12:37:10 -05:00
Rahul Manocha 4f075902fc SWDEV-555347 - Remove lock contention in async events loop (#878)
* SWDEV-555347 - Remove lock contention in async events loop

* SWDEV-555347 - Introduce Pool of AsyncEventItems

* create generic mempool for AsyncEventItem

* Use BaseShared allocate and free for async event pool

---------

Co-authored-by: Rahul Manocha <rmanocha@amd.com>
2025-10-24 08:43:00 -07:00
Ioannis Assiouras 6d6b136374 SWDEV-559166 - Fix data races in GetSubmissionBatch, CaptureAndSet and SetQueueStatus (#1441) 2025-10-23 12:18:31 +01:00
Godavarthy Surya, Anusha ce560304a8 SWDEV-548417 - Fix Memleaks in Graph (#713)
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
2025-09-19 17:39:36 +05:30
SaleelK c4537e8050 SWDEV-553126 - Improve logging (#835)
* Ability to mask COPY api usage in logs
* Show total graph nodes in logs
* Add another log level for detailed debug
2025-09-04 10:08:41 -07:00
Danylo Lytovchenko f7338717ae SWDEV-470698 - fix formatting, add format check workflow (#657) 2025-08-20 19:58:06 +05:30
Manocha, Rahul b3ccf487da SWDEV-545952 - API definitions for hipStreamSet/GetAttribute (#831)
Co-authored-by: Rahul Manocha <rmanocha@amd.com>

[ROCm/clr commit: 0f49c4a97f]
2025-08-15 12:51:35 -07:00
Stojiljkovic, Vladana 33085dd232 SWDEV-533220 - Release marker when HostQueue is destroyed (#460)
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>

[ROCm/clr commit: 14760c6eba]
2025-08-13 15:15:31 +02:00
Andryeyev, German 6df9a49437 SWDEV-465041 - Add support for user events with DD (#321)
* SWDEV-465041 - Add support for user events with DD

User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals

* Fix notifyCmdQueue() logic for OCL

* Avoid blocking calls in OCL with DD

* Add event destruction in case of failure.

[ROCm/clr commit: 2305f8ae56]
2025-08-12 19:04:36 -04:00
Kudchadker, Saleel 3a849c6962 SWDEV-538195 - Introduce threshold for handler submission (#723)
- When doing device/stream sync, we can submit a handler which may
  introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
  batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr, but this needs to be addressed in the
  HIP runtime for now.

[ROCm/clr commit: 9b045922a8]
2025-08-06 20:34:42 -07:00
Patel, Jaydeepkumar 821a1d89b0 SWDEV-536226 - Avoid waiting for lastCommand completion if the GPU has already reported an error; otherwise it hangs because the command status never becomes CL_COMPLETE. (#478)
[ROCm/clr commit: a60212b9b4]
2025-06-25 20:59:17 +05:30
Jayaprakash, Karthik 4ea2d9a5ee SWDEV-531711 - Report correct error code based on device failure. (#286)
[ROCm/clr commit: f5b8db33f1]
2025-05-17 06:33:13 -04:00
Andryeyev, German 3ea758a2d4 SWDEV-528808 - Release all HW queues even if only one is idle (#240)
PyTorch may not explicitly idle each queue. Thus, some queues can be considered busy
while actually being idle.


[ROCm/clr commit: 65a0181a7c]
2025-05-05 19:09:01 -04:00
Sang, Tao 68deb3d10a SWDEV-520352 - Remove HostThread and legacy monitor (#230)
* SWDEV-520352 - Remove HostThread and legacy monitor

Remove HostThread, semaphore, and legacy monitor.
Make the original thread and command queue logic stricter.
Add more comments to make the logic clearer.
Some other minor improvements.

Also part of SWDEV-458943.

[ROCm/clr commit: 96cadbc9e9]
2025-04-29 09:55:24 -04:00
Sang, Tao 60a1e6dbc1 SWDEV-523824 - Fix data validation issue of rocFFT (#154)
Fix a rocFFT data validation issue when dynamic queues are on.
ReleaseHwQueue() can be called only when there is no command in the HostQueue.
The check of that condition needs to be protected by a lock.

[ROCm/clr commit: 18d191fd1d]
2025-04-08 20:30:06 -04:00
Arandjelovic, Marko 1c83314659 SWDEV-517867 - Remove invalid assert (#55)
* Remove invalid assert

* Retrigger CI

* Rebase

[ROCm/clr commit: 8fcaa1ca93]
2025-04-03 11:14:32 +02:00
Andryeyev, German 5c7c86f66d SWDEV-517481 - Add dynamic queue management (#37)
Enabled by default. DEBUG_HIP_DYNAMIC_QUEUES controls the feature.

[ROCm/clr commit: 28967982b2]
2025-03-19 11:22:50 -04:00
Saleel Kudchadker c8f39ec2b0 SWDEV-502365 - Track last used command
- This change tries to save extra synchronization packets we may insert
  because we didn't track the completion signals for every command. We track
  the current enqueued command until it exits the enqueue stage. We also
  record the exit scope to know if we flushed the caches
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
  passed as the argument

Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc


[ROCm/clr commit: e03e4f3b5d]
2025-03-04 16:05:02 -05:00
Aidan Belton-Schure 4b4a35b86b SWDEV-508279 - Improve HIP event profiling
This patch makes 2 functional changes:
* Use GPU timing for internal markers for HIP.
* Measure CPU time closer to GPU timer, to reduce delta between GPU/CPU timestamp measurements.

There are some smaller non-functional updates:
* waifForFence -> waitForFence typo
* Remove unused drmProfiling

Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a


[ROCm/clr commit: 179801a750]
2025-02-13 04:15:40 -05:00
Saleel Kudchadker d0656c944b SWDEV-504494 - Resolve signal dependencies
- Resolve signal dependencies for barrier value packet if there are > 1
  dependent signals. Barrier Value packet accounts for only 1 dep signal
- Better log

Change-Id: Ia506ad5d80b91d598f92e7b539f41756e9b4b64b


[ROCm/clr commit: 2d450e8b06]
2025-01-29 19:49:02 +00:00
Anusha GodavarthySurya 08c92f4793 SWDEV-480209 - Make internal callbacks non-blocking
Change-Id: Ic918d08f341abfd9a7c167d09f9c723cdc43157f


[ROCm/clr commit: 683a942364]
2025-01-10 02:16:11 -05:00
German Andryeyev 3191f8e942 SWDEV-486602 - Add tracking of HSA handlers
Add an atomic counter to track the outstanding HSA handlers.
Wait on CPU for the callbacks if the number exceeds the value
in DEBUG_HIP_BLOCK_SYNC env variable.

Change-Id: I95dc8c4bf0258c7e59411b7504220709ed6898c5


[ROCm/clr commit: 403f624bf8]
2024-10-25 15:20:50 -04:00
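The handler tracking in the commit above could look roughly like this. It is a sketch under assumptions: `HandlerTracker` and its methods are hypothetical names, and the limit stands in for the value read from DEBUG_HIP_BLOCK_SYNC.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical tracker for outstanding callback handlers.
class HandlerTracker {
 public:
  explicit HandlerTracker(unsigned limit) : limit_(limit) {}

  // Called when a handler is registered. Returns true when the number of
  // outstanding handlers now exceeds the limit, i.e. the caller should wait
  // on the CPU until callbacks drain.
  bool OnSubmit() {
    return outstanding_.fetch_add(1, std::memory_order_relaxed) + 1 > limit_;
  }

  // Called from the callback once a handler has run.
  void OnComplete() {
    unsigned previous = outstanding_.fetch_sub(1, std::memory_order_relaxed);
    assert(previous > 0);  // completion without a matching submit is a bug
    (void)previous;
  }

  unsigned Outstanding() const {
    return outstanding_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<unsigned> outstanding_{0};
  unsigned limit_;
};
```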
German Andryeyev 0a03665a3f SWDEV-491375 - Limit the SW batch size
Applications may submit commands without waiting
for the GPU. That causes the number of unreleased SW commands to grow.
Make sure the runtime flushes the SW queue if it grows over some
threshold, controlled by DEBUG_CLR_MAX_BATCH_SIZE.

Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396


[ROCm/clr commit: 8657a77029]
2024-10-17 10:53:57 -04:00
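As a rough illustration of the batch limit in the commit above, the following sketch flushes a software batch once it grows past a threshold. `BoundedBatch` is a hypothetical class; `max_batch` stands in for DEBUG_CLR_MAX_BATCH_SIZE, and the real flush would submit work to the GPU rather than clear a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software batch that never grows past `max_batch` commands.
class BoundedBatch {
 public:
  explicit BoundedBatch(std::size_t max_batch) : max_batch_(max_batch) {
    assert(max_batch > 0);
  }

  // Records a command id. Returns true if this call triggered a flush
  // because the batch grew over the threshold.
  bool Enqueue(int command_id) {
    pending_.push_back(command_id);
    if (pending_.size() <= max_batch_) return false;
    Flush();
    return true;
  }

  // Stand-in for releasing the batched commands.
  void Flush() {
    flushed_ += pending_.size();
    pending_.clear();
  }

  std::size_t Pending() const { return pending_.size(); }
  std::size_t Flushed() const { return flushed_; }

 private:
  std::size_t max_batch_;
  std::size_t flushed_ = 0;
  std::vector<int> pending_;
};
```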
German Andryeyev faea40cbb3 SWDEV-486602 - Optimize HSA callback performance
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for the CPU status update of the submitted commands
every 50 calls. That allows the commands to drain and the
HSA signals to be destroyed.

Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9


[ROCm/clr commit: 364dfb0ed1]
2024-10-11 14:50:25 -04:00
Ioannis Assiouras 00cb623a67 SWDEV-488851 - Correctly remove the queue from the active set on Windows
Change-Id: I4d21743ecf7a44636121f85566f898e62ff61e97


[ROCm/clr commit: 07bcc283f9]
2024-10-02 12:06:59 +01:00
Ioannis Assiouras b5a8d775d6 SWDEV-476929 - Introduce an activeQueues set
The new set tracks only the queues that have a command
submitted to them. This allows for fast iteration
in waitActiveStreams.

Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7


[ROCm/clr commit: bcc545e6b8]
2024-09-16 15:53:49 -04:00
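The active set from the commit above can be pictured with a small sketch. `StreamQueue` and `ActiveQueueSet` are hypothetical names, and zeroing `pending` stands in for actually waiting on a queue; the point is that the wait only visits queues that had work submitted.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_set>

// Hypothetical queue with a count of submitted, unfinished commands.
struct StreamQueue {
  int pending = 0;
};

// Tracks only the queues that have a command submitted to them.
class ActiveQueueSet {
 public:
  void OnSubmit(StreamQueue* queue) {
    assert(queue != nullptr);
    std::lock_guard<std::mutex> lock(mutex_);
    ++queue->pending;
    active_.insert(queue);
  }

  // Waits on every active queue and returns how many were visited.
  // Idle queues are never touched, so the cost scales with active queues only.
  std::size_t WaitActive() {
    std::lock_guard<std::mutex> lock(mutex_);
    std::size_t visited = active_.size();
    for (StreamQueue* queue : active_) queue->pending = 0;  // stand-in for a wait
    active_.clear();
    return visited;
  }

 private:
  std::mutex mutex_;
  std::unordered_set<StreamQueue*> active_;
};
```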
taosang2 881ffd6650 SWDEV-467540 - Get lastCommand safely
The last command must be fetched under a lock when calling
awaitCompletion(), where lastCommand will be released and
possibly destroyed.
This fixes a scoped lock (notify_lock_) crash in
Event::notifyCmdQueue() with AMD_DIRECT_DISPATCH = true.

Change-Id: I4297166f912a71112f4a8945d993160ba9afdc34


[ROCm/clr commit: 749385155a]
2024-06-28 21:18:22 -04:00
Ioannis Assiouras af089a2171 SWDEV-463865 - namespace changes to prevent symbol conflicts in static builds
Change-Id: I09ceb5962b7aa19156909f47167c87d6887c9cd1


[ROCm/clr commit: 3edf1501cc]
2024-06-12 16:22:27 -04:00
Ioannis Assiouras 60ba0874fa SWDEV-460925 - Do awaitCompletion before releasing the lastEnqueueCommand
Change-Id: I210399dd1bced13c0923fdb1c215e044920c5a4b


[ROCm/clr commit: d6eaf49033]
2024-05-28 06:31:10 +00:00
Saleel Kudchadker 3a67addd48 SWDEV-459778 - Remove CPU wait for profiler
- No CPU wait is needed when a profiler is attached. Waiting changes
the application profile when roctracer is attached.

Change-Id: I2b9cfc48d697cf5ed54bb6a240d8c12bdb079171


[ROCm/clr commit: 51e4368723]
2024-05-28 06:28:17 +00:00
German Andryeyev a2ffb2ad40 SWDEV-440746 - Release last command on terminate
Change-Id: Ib6a9b8fc9a8692eb17b39b854cefd92c6b59733f


[ROCm/clr commit: 0ccdb3e160]
2024-04-22 09:57:38 -04:00
Jaydeep Patel 7933b88d7c SWDEV-431879 - Introduce IsHandlerPending back.
It seems that, due to the removal of vdev()->isHandlerPending(),
the marker queued to ensure finish is not enqueued, which causes
a hang while waiting on the event for a kernel enqueue command.

Change-Id: I364abb2dcb4897b11a7eb61b5d85013b69292792


[ROCm/clr commit: eecbc2e436]
2023-11-23 08:45:19 -05:00
Saleel Kudchadker 1d4bd084b8 SWDEV-301667 - Cleanup unused paths
- Refactor code and cleanup logic for callback saving for event records

Change-Id: I5c56aa8e9c968a5bca70fb07ad1796da318e9e89


[ROCm/clr commit: 1338ff37e8]
2023-11-02 11:43:41 -04:00
German Andryeyev bd63f3f614 SWDEV-424603 - Use OR for CPU wait request
Make sure rocclr doesn't overwrite the client's request
for a wait.

Change-Id: I0addf18ea408b7f4ecaa1e04b2877cc0bbbfcc0d


[ROCm/clr commit: fe7b36f3cb]
2023-10-06 16:51:44 -04:00
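The difference between assigning and OR-ing a wait request, as in the commit above, fits in a few lines. `SyncRequest` and its member names are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>

// Hypothetical holder for a pending "wait on the CPU" request.
struct SyncRequest {
  bool cpu_wait = false;

  // Plain assignment would let a later `false` erase an earlier `true`.
  // OR-ing keeps the wait once any caller has asked for it.
  void Request(bool wait) {
    cpu_wait = cpu_wait || wait;
    assert(!wait || cpu_wait);  // a requested wait is never dropped
  }
};
```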
German Andryeyev d593231137 SWDEV-424603 - Force CPU wait if profiling
Some PyTorch tests use a tracer plugin and rely on profiling information
being reported right after hipDeviceSynchronize()

Change-Id: Ib021a1e7b1a30b3c24de72627c471810f7f7878d


[ROCm/clr commit: 5438b6362e]
2023-10-06 11:33:06 -04:00
German Andryeyev ee34d05add SWDEV-424249 - Check if HwEvent is available
Allocate marker only if HW event doesn't exist for the last command.

Change-Id: I3e7284202365a9c75313fb5403f0c1908ab51d1e


[ROCm/clr commit: 596b496c16]
2023-10-02 11:27:16 -04:00
German Andryeyev 2d492a201b SWDEV-423317 - Enable GPU wait for hip sync calls
hipStreamSynchronize and hipDeviceSynchronize will no longer wait
for CPU commands in DD mode

Change-Id: I079c8bbfc34ddc6d3e2d74c92a34665877e512a5


[ROCm/clr commit: fbea58ba11]
2023-09-22 13:04:27 -04:00
Saleel Kudchadker 0a26b75238 SWDEV-301667 - Use large signal pool
Use a large signal pool if a profiler is connected or profiling is
force-enabled. This mitigates signal creation overhead when
profiling, as signals are attached to every packet and a deeper batch may
show the overhead of signal allocation.

Change-Id: I8034b8a20b55328b87d593bf044f59672f9653e8


[ROCm/clr commit: 1ec0ba3537]
2023-08-24 19:17:05 -04:00
Rakesh Roy f887f2fc6f SWDEV-405329 - Fix cuMask issue for WGP mode
- Enable CUs adjacent pairwise for WGP mode
- In HostQueue::terminate() do not segfault if virtual device hasn't been created

Change-Id: I94402ff333308af5824878086cc238b3993d534d


[ROCm/clr commit: 8c1232124e]
2023-06-30 01:09:01 -04:00
Saleel Kudchadker 858e311f34 SWDEV-364604 - Add ROCclr support for hipEventDisableSystemFence
Change-Id: I6127b432a8759359359a1890fda85bc401be6a56


[ROCm/clr commit: 3e603d986a]
2023-02-21 19:07:35 -05:00
German 73f02aa6dc SWDEV-382397 - Move VirtualGPU destruction back to the thread exit
The OS can terminate an unfinished queue thread from the default stream at any
time, potentially leaving the queue lock in a bad state and causing a
deadlock if the runtime destroys VirtualGPU later from the host thread.

Change-Id: I247f102ee84e6b4dba947504933395071945c85d


[ROCm/clr commit: 28daf98f1f]
2023-02-17 10:05:49 -05:00
German f857dcc48d SWDEV-352197 - Destroy virtual device in thread destructor
Windows kills threads on exit without any notification. However,
runtime can still destroy VirtualGPU object from the host thread with
HostQueue destruction.
This change also forces RGP trace transfer on the last capture without
any delays.

Change-Id: I768e87e99e1d23a021e63c12f36e450817743759


[ROCm/clr commit: ad33a021cb]
2023-01-31 10:53:48 -05:00
Ajay 3d12929eb8 SWDEV-372757 - thread check workaround for Windows hang
Change-Id: Ie9f87b88dd0f3078ad1919edc336f297f6b40373


[ROCm/clr commit: ecea27eb2d]
2023-01-13 04:05:35 -05:00
German f5f0a6c618 SWDEV-352487 - Don't add notifications as the last command
Change-Id: Ifed34485839ef2c9491e8e8f6bb3569932160b1c


[ROCm/clr commit: e223b0f678]
2022-10-24 09:39:03 -04:00
Saleel Kudchadker 0dd9add8e1 SWDEV-352001 - Store last scopes for dispatch
- Store last fence scopes and use the last value to determine if we need a cache flush again. This helps cases where hipExtLaunchKernel API is
used.
- Purge code for ROC_EVENT_NO_FLUSH

Change-Id: I531cf9c9c60d5e2b3a9e265d0f52f79ed2fa8a8c


[ROCm/clr commit: 9b5cbd37a2]
2022-09-22 11:34:10 -04:00
Joseph Greathouse b995ea06e8 SWDEV-330307 - Avoid releasing command before last use
The fix for SWDEV-329789 moved down the last use of a
command object pointer in order to prevent a race condition.
However, the previous patch did not move down the release of
that command. By releasing the command early, another thread
could get a command with the same pointer. That second thread
could later submit work to the queue using that new command.
The first thread could then perform a comparison against the
queue's last command using its own now-stale pointer. This
could eventually allow the second thread to skip synchronizing
on the queue. This would result in host synchronizations
completing before their device work was actually complete.

Change-Id: I292b7b369743251ceafe453a4c5cae14a6d01046


[ROCm/clr commit: 6b956f7627]
2022-08-31 16:07:49 -04:00
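The race described in the commit above comes from comparing a pointer after its object has been released, at which point the address may be reused. A minimal sketch of the safe ordering follows, using `std::shared_ptr` as a stand-in for the runtime's own retain/release reference counting; `CommandSketch` and `LastCommandQueue` are hypothetical names.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical command object.
struct CommandSketch {
  int id;
};

// Hypothetical queue that remembers the last submitted command.
class LastCommandQueue {
 public:
  void Submit(std::shared_ptr<CommandSketch> cmd) { last_ = std::move(cmd); }

  // The caller still holds a strong reference to `cmd` during the comparison,
  // so its address cannot have been handed out to a newer command.
  bool IsLast(const std::shared_ptr<CommandSketch>& cmd) const {
    assert(cmd != nullptr);
    return last_ == cmd;
  }

 private:
  std::shared_ptr<CommandSketch> last_;
};
```

Because the first command is kept alive until after the comparison, a second command submitted in the meantime is guaranteed a different address, and the stale comparison correctly reports a mismatch.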
Jason Tang fb753e489d SWDEV-333471 - Add GPU_FORCE_QUEUE_PROFILING
To support both HIP and OCL. HIP_FORCE_QUEUE_PROFILING will be replaced with this later on.

Change-Id: I6d3514b1568ff049584ed9fd74bbdb3e4f4bf0c3


[ROCm/clr commit: d92b3a2d90]
2022-08-19 10:51:41 -04:00