Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
async copies since copy_engine_status may report engines as busy
2. Busy and Preferred engine check for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
suboptimal resource allocation
Solution:
Implemented a global SDMA engine allocator with per-stream affinity:
- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
* Maintains global map of active assignments
* Enforces exclusivity: different streams use different engines (except
inter-GPU copies where preferred engines are prioritized for optimal
hardware paths like XGMI links)
* Thread-safe allocation/release with Monitor lock
- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
for fast lookup without map access on hot path
- Refactored rocrCopyBuffer() to:
1. Check local cached engine first → use if assigned
2. Call AllocateSdmaEngine() if not assigned → cache result
- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
into AllocateEngine() for cleaner separation of concerns
- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
* Improves engine utilization by releasing earlier
* Added virtual ReleaseSdmaEngines() method to device::VirtualDevice
- Added future path for simple round-robin allocation (kUseSimpleRR) for
next-gen GPUs with uniform SDMA bandwidth (disabled by default)
Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager
Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
* clr: Use graph segment scheduling to process HIP Graphs
* Add a broader path to use capture packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
Enabled by default
* clr: Few fixes and improvements
* clr: Detect complex graphs to take classic path
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
path
* clr: Fix a cornercase stack corruption
* clr: Track commands of segments instead of snapshots
* clr: Fix Batch dispatch logic
* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accomodate dirty fence on cross
streams
---------
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
The build of ROCR backend will be enabled by default in Windows.
It requires the dll loader until ROCR dll will be always available in Windows for any configuration.
* SWDEV-551080
* Fix condition for taking shader path, the size check was moved
incorrectly
* Also account for a bitmask returned for preferred engines
* SWDEV-465041 - Add support for user events with DD
User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals
* Fix notifyCmdQueue() logic for OCL
* Avoid blocking calls in OCL with DD
* Add event destruciton in a case of the failure.
[ROCm/clr commit: 2305f8ae56]
- When doing device/stream sync, we can submit a handler which may
introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
HIP runtime.
[ROCm/clr commit: 9b045922a8]
- Currently runtime just uses the local agent as it did not check for
IPCShared()
- With this fix we query hsa_amd_pointer_info and get the right agent
for the memory to pass it to the HSA copy api
[ROCm/clr commit: 46d766e4e2]
- ROCr now reports preferred engine for copy status. We can leverage
this for max bandwidth for inter-GPU copies
- Cleanup logging
[ROCm/clr commit: 1b0ea080e4]
- For D2H cases avoid passing dependent signals to SDMA, the signals
take a while to resolve on SDMA engine
Change-Id: I569635228af977847f201c82ca897002f8f2f4a8
[ROCm/clr commit: 78d0ff2dbc]
- This change tries to save extra synchronization packets we may insert
as we didnt track the completion signals for every command. We track
the current enqueued command until it exits the enqueue stage. We also
record the exit scope to know if we flushed the caches
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
passed as the argument
Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc
[ROCm/clr commit: e03e4f3b5d]
- Use getBuffer/releaseBuffer in BlitManager
- Cleanup XferBuffer as we use ManagedBuffer for both reads/writes
Change-Id: I2661b85dd012763b17a38a743fec1b1d79125f67
[ROCm/clr commit: 37d606d193]
- If any kernel uses device heap, the launch needs to be preceeded by an
init kernel, Save on the extra barrier packet launch/flush between the
init heap kernel and user kernel
Change-Id: I8ebc6246188200e5f673dc464bc76a53bcb8b7c6
[ROCm/clr commit: ca530c660b]
- Fix regression for D2H pinned copies which adds systemscope release.
- Skip cpu wait for D2H unpinned copies as we can pass the signal of the
barrier to rocr copy.
- Fix an old bug in sdmaEngineRetainCount_ logic
- Improve logging
Change-Id: If074bddb05564b15949b0d5f9bf12acd3692174e
[ROCm/clr commit: 4c95ee5e1e]
- When using shader copy, make sure to use release scope for the AQL
packet. This is a potential bug but is hidden as hipMemcpyAsync always
needs synchronization(which inserts a barrier with release scope). For
hipMemcpy we use a barrier packet to make sure its blocking. Eitherways
a barrier gets always used and hides in some ways a potential bug.
Change-Id: I57fb7f769c3179e76d712471c0905104c801d7ba
[ROCm/clr commit: c9dd95bf6c]
- When we use blit(compute) copies, two subsequent copies may read for
the same source buffer, the buffer may get modified by the host in
between and if the src buffer was allocated with non-coherent flag, the
device may simply use stale value from previous cacheline fetch. This is
a corner case.
Change-Id: I2ce261c6f6fa4e5bb608f116548e5cc711ae6f3c
[ROCm/clr commit: b63005d550]
This reverts commit 0830d95f6d.
Reason for revert: There needs to be memcpy size change
Change-Id: If4f51769731e54743ac705b19b4f81b2d5925d5a
[ROCm/clr commit: 446ed661a0]
We are passing this arg as an address, and memcpy complains about
overreading (8 bytes instead of 4).
Change-Id: Ica9207f6c5f6056a4bfc968280c76e779ded13ae
[ROCm/clr commit: a6f2a2c2af]
- Refactor blit code and clean ASAN instrumentation
- Use unified function for rocr copy
- Enable shader copy path for unpinned writeBuffer/readBuffer paths
- Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for
pinned copies or unpinned H2D/D2H copies < 16KB
Change-Id: I42045cca79234b340dbf53dafb93044199736ae4
[ROCm/clr commit: 7863eb92dc]
Although unpinned copies require synchronizations
in HIP, runtime can avoid syncs for H2D copies with
a staging buffer
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
[ROCm/clr commit: 29cc678d8d]
If the graph has kernels that does device side allocation, during packet capture, heap is
allocated because heap pointer has to be added to the AQL packet, and initialized during
graph launch.
Handle race with wait when 2 kernels with device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
[ROCm/clr commit: 57156c524d]