* clr: Update signal count and pool size for staging buffer
* Change to naming of variables etc
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
* clr: Implement dynamic stream to HW queue assignment
This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:
* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to
pipe mapping based on creation order (single process per device only,
as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities
Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics, 2 -
Depth+Pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation
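A minimal sketch of how a depth heuristic with a sticky last-used queue
could pick a HWq. This is an illustration only, not the runtime code:
HwQueueState, PickHwQueue and the stickyBonus weight are hypothetical names,
and the pipe-aware part of the heuristic is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical per-queue state; the real runtime tracks more than this.
struct HwQueueState {
  std::uint32_t depth;  // packets currently outstanding on the queue
};

// Returns the index of the queue with the lowest cost.
// lastUsed is the queue the stream used previously, or -1 if none.
// stickyBonus is an assumed weight: every queue except the last-used one
// pays it as a penalty, so the stream only migrates when another queue is
// shallower by more than stickyBonus.
inline int PickHwQueue(const std::vector<HwQueueState>& queues, int lastUsed,
                       std::uint32_t stickyBonus) {
  assert(!queues.empty());
  int best = 0;
  std::uint64_t bestCost = std::numeric_limits<std::uint64_t>::max();
  for (std::size_t i = 0; i < queues.size(); ++i) {
    std::uint64_t cost = queues[i].depth;
    if (static_cast<int>(i) != lastUsed) cost += stickyBonus;
    if (cost < bestCost) {
      bestCost = cost;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

With depths {4, 2, 5} and queue 0 last used, a bonus of 3 keeps the stream
on queue 0, while a bonus of 1 lets it move to queue 1.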
* clr: Clean up last_used_queue_
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
async copies since copy_engine_status may report engines as busy
2. Busy and preferred engine checks ran for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
suboptimal resource allocation
Solution:
Implemented a global SDMA engine allocator with per-stream affinity:
- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
* Maintains global map of active assignments
* Enforces exclusivity: different streams use different engines (except
inter-GPU copies where preferred engines are prioritized for optimal
hardware paths like XGMI links)
* Thread-safe allocation/release with Monitor lock
- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
for fast lookup without map access on hot path
- Refactored rocrCopyBuffer() to:
1. Check local cached engine first → use if assigned
2. Call AllocateSdmaEngine() if not assigned → cache result
- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
into AllocateEngine() for cleaner separation of concerns
- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
* Improves engine utilization by releasing earlier
* Added virtual ReleaseSdmaEngines() method to device::VirtualDevice
- Added future path for simple round-robin allocation (kUseSimpleRR) for
next-gen GPUs with uniform SDMA bandwidth (disabled by default)
Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager
Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
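The allocation and release flow described above can be sketched roughly as
follows. This is a simplified illustration under stated assumptions, not the
Device::SdmaEngineAllocator code: EngineAllocatorSketch is a hypothetical
class, std::mutex stands in for the Monitor lock, streams are identified by
a plain integer, and the busy/preferred engine queries and the inter-GPU
exception are omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for a global engine allocator with stream affinity.
class EngineAllocatorSketch {
 public:
  // engineMask has one bit set per usable SDMA engine.
  explicit EngineAllocatorSketch(std::uint32_t engineMask)
      : engineMask_(engineMask) {
    assert(engineMask != 0);
  }

  // Returns a one-hot mask for the engine assigned to streamId,
  // or 0 if every engine is already owned by another stream.
  std::uint32_t Allocate(std::uint64_t streamId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assigned_.find(streamId);
    if (it != assigned_.end()) return it->second;  // keep existing affinity
    const std::uint32_t avail = engineMask_ & ~inUse_;
    if (avail == 0) return 0;
    const std::uint32_t engine = avail & (~avail + 1);  // lowest set bit
    inUse_ |= engine;
    assigned_.emplace(streamId, engine);
    return engine;
  }

  // Gives the engine back, e.g. when the stream finishes.
  void Release(std::uint64_t streamId) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = assigned_.find(streamId);
    if (it == assigned_.end()) return;
    inUse_ &= ~it->second;
    assigned_.erase(it);
  }

 private:
  std::mutex mutex_;
  std::uint32_t engineMask_;
  std::uint32_t inUse_ = 0;
  std::unordered_map<std::uint64_t, std::uint32_t> assigned_;
};
```

A caller would cache the returned mask locally (as assigned_sdma_engine_
does) so that the map is only consulted on the first copy after a release.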
* While reusing signals, it's possible to come across a timestamp
that contains several signals, such as when profiling a graph. Reading
timestamps from all signals can make the call severely CPU bound.
Instead, cache only that signal to avoid the overhead on the critical
path.
* clr: Use graph segment scheduling to process HIP Graphs
* Add a broader path to use packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
Enabled by default
* clr: Few fixes and improvements
* clr: Detect complex graphs to take classic path
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
path
* clr: Fix a corner-case stack corruption
* clr: Track commands of segments instead of snapshots
* clr: Fix Batch dispatch logic
* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accommodate a dirty fence on cross
streams
---------
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
The build of the ROCR backend will be enabled by default on Windows.
It requires the DLL loader until the ROCR DLL is always available on Windows for any configuration.
* Lay foundation to batch packets efficiently for graphs
* Dynamically copy packets with max threshold set with
DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2
* Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256
* If TS are not collected for a signal for reuse, create a new signal.
This can potentially increase signal footprint if the handler doesn't run
fast enough.
* SWDEV-465041 - Add support for user events with DD
User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals
* Fix notifyCmdQueue() logic for OCL
* Avoid blocking calls in OCL with DD
* Add event destruction in case of failure.
[ROCm/clr commit: 2305f8ae56]
- When doing device/stream sync, we can submit a handler which may
introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
HIP runtime.
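One possible reading of the batching above, as a sketch: instead of a host
wait per command, only the command closing each batch (and the final
command) is waited on. HostWaitPoints is a hypothetical helper and
batchSize models the value of DEBUG_CLR_BATCH_CPU_SYNC_SIZE; the actual
runtime logic may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns the indices of commands on which the host would wait when
// numCommands commands are drained in batches of batchSize.
inline std::vector<std::size_t> HostWaitPoints(std::size_t numCommands,
                                               std::size_t batchSize) {
  assert(batchSize > 0);
  std::vector<std::size_t> points;
  for (std::size_t i = 0; i < numCommands; ++i) {
    const bool endsBatch = ((i + 1) % batchSize) == 0;
    const bool isLast = (i + 1) == numCommands;
    if (endsBatch || isLast) points.push_back(i);
  }
  return points;
}
```

For 20 commands and a batch size of 8 this yields waits on commands 7, 15
and 19, i.e. three host waits instead of twenty.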
[ROCm/clr commit: 9b045922a8]
- When a command may have two packets (like the device heap
initializer) and there is no signal on the main kernel packet, the
tracking was broken because it marked the HW event of the command as the
first packet signal.
- Make sure if no completion signal is attached to the second packet
then clear the HW event for the command.
[ROCm/clr commit: 072fb0804e]
- This change tries to save extra synchronization packets we may insert
as we didn't track the completion signals for every command. We track
the current enqueued command until it exits the enqueue stage. We also
record the exit scope to know if we flushed the caches.
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
passed as the argument
Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc
[ROCm/clr commit: e03e4f3b5d]
The logic can analyze the AQL queue state and
find a failed AQL packet with the kernel's name
Change-Id: I1a478fa2c25462cd07a194784958bdf22454b897
[ROCm/clr commit: ea0b092af8]
In active wait mode use signals without interrupts by default and switch
to the interrupts only if a callback is required.
Change-Id: Ibcde8f7d44c70f8fb8fa5e0a7fdd8b08a2982a8e
[ROCm/clr commit: f4b9d3b7bd]
- Refactor blit code and clean ASAN instrumentation
- Use unified function for rocr copy
- Enable shader copy path for unpinned writeBuffer/readBuffer paths
- Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for
pinned copies or unpinned H2D/D2H copies < 16KB
Change-Id: I42045cca79234b340dbf53dafb93044199736ae4
[ROCm/clr commit: 7863eb92dc]
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for CPU status update of the submitted commands
every 50 calls. That will allow to drain the commands and
destroy HSA signals.
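The periodic wait can be pictured as a simple call counter. This is a
sketch only: DrainThrottle is a hypothetical name, and the interval of 50
is taken from the description above rather than from the code.

```cpp
#include <cassert>

// Signals that a CPU-side wait is due on every interval-th submission.
class DrainThrottle {
 public:
  explicit DrainThrottle(unsigned interval) : interval_(interval) {
    assert(interval > 0);
  }

  // Call once per submitted command; returns true when the caller should
  // wait for the CPU status update so that finished commands can be
  // drained and their signals destroyed.
  bool ShouldWait() {
    if (++count_ < interval_) return false;
    count_ = 0;
    return true;
  }

 private:
  unsigned interval_;
  unsigned count_ = 0;
};
```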
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
[ROCm/clr commit: 364dfb0ed1]
Although unpinned copies require synchronizations
in HIP, runtime can avoid syncs for H2D copies with
a staging buffer
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
[ROCm/clr commit: 29cc678d8d]
If only external signals were provided, then just process it
without adding internal signals
Change-Id: Iaefd65d0f8b0a64b9f6a864a9bd73de20a29dfa4
[ROCm/clr commit: 18187cd8fe]
Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch.
Added support to capture single graph memset node.
Capture support for memset node is currently disabled.
Memset capture will be enabled when capture of multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
[ROCm/clr commit: 346da4bb40]
If the graph has kernels that do device-side allocation, the heap is allocated during packet
capture, because the heap pointer has to be added to the AQL packet, and is initialized during
graph launch.
Handle a race with wait when 2 kernels with device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
[ROCm/clr commit: 57156c524d]
Adding a safety check prevents an invalid memory access
if timestamps and kernelNames vectors are of different size.
The patch also moves the addKernelNames for the accumulate command
into dispatchAqlPacket function.
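The kind of guard described above might look like the following sketch.
PairTimestampsWithNames and the "<unknown>" placeholder are hypothetical;
the point is only that the name vector is never indexed past its end when
it is shorter than the timestamp vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Pairs each timestamp with a kernel name, falling back to a placeholder
// when fewer names than timestamps were recorded.
inline std::vector<std::pair<std::uint64_t, std::string>>
PairTimestampsWithNames(const std::vector<std::uint64_t>& timestamps,
                        const std::vector<std::string>& kernelNames) {
  std::vector<std::pair<std::uint64_t, std::string>> out;
  out.reserve(timestamps.size());
  for (std::size_t i = 0; i < timestamps.size(); ++i) {
    // Bounds check before touching kernelNames.
    const std::string name =
        (i < kernelNames.size()) ? kernelNames[i] : std::string("<unknown>");
    out.emplace_back(timestamps[i], name);
  }
  assert(out.size() == timestamps.size());
  return out;
}
```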
Change-Id: Iea0927e1253800403a1ae3f3d72de1e7d96476c3
[ROCm/clr commit: d44f44a5b1]
Dispatching multiple packets while ringing the doorbell only once is not supported by the lower layers.
Change-Id: I7665a2dcdd4ef9e47dadfe410180fed64c5a4ee0
[ROCm/clr commit: d7f352dbed]
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we don't wait for the entire batch in the Accumulate
command to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasn't included in the Torch profile
scope, because we previously entered kernel names only when the profiler
was attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
Don't track the status of the fence_dirty_ flag on the host; instead clear
it when we submit a barrier on the respective stream.
Change-Id: I4d98dbf20c81379c9c5da9f5b67629a8f9f6dfcd
[ROCm/clr commit: 0b0df605d4]
It seems that, due to the removal of vdev()->isHandlerPending(),
the marker queued to ensure finish is not enqueued, which causes
a hang while waiting on the event for the kernel enqueue command.
Change-Id: I364abb2dcb4897b11a7eb61b5d85013b69292792
[ROCm/clr commit: eecbc2e436]
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
[ROCm/clr commit: 40f41f4d0b]