* clr: Implement dynamic stream to HW queue assignment
This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:
* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to
pipe mapping based on creation order (single process per device only,
as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities
Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics 2 -
Depth+Pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation
* clr: Clean up last_used_queue_
* SWDEV-555889 - Support mipmap on rocr
Support mipmap in hip-rt on rocr backend.
Enable all mipmap tests in Windows.
Some other minor improvement.
Add some SRD logs that will be removed finally.
* Add sampler.mipFilter to fix sampler issues on mipmap in rocr.
Fix format issues of view of leveled image and mipmap image in blit kernel in rocr.
Enabled disabled mipmap tests.
* Rewrite view logic
* Set word4.f.PITCH = 0 for mipmap SRD on navi31 to fix unstable test issues.
Reset last error in nagative tests.
* Remove SRD dump log from hip-rt
Let Rocr mipmap log be in condition.
* minor format chang
* Exclude mipmap tests for mi200+ which don't support mipmap.
Problem:
The existing SDMA engine selection logic had several issues:
1. Same VirtualGPU/stream could use different SDMA engines for consecutive
async copies since copy_engine_status may report engines as busy
2. Busy and Preferred engine check for every copy
3. No global tracking of which VirtualGPU uses which engine, leading to
suboptimal resource allocation
Solution:
Implemented a global SDMA engine allocator with per-stream affinity:
- Added Device::SdmaEngineAllocator to manage VirtualGPU → engine assignments
* Maintains global map of active assignments
* Enforces exclusivity: different streams use different engines (except
inter-GPU copies where preferred engines are prioritized for optimal
hardware paths like XGMI links)
* Thread-safe allocation/release with Monitor lock
- Modified VirtualGPU to cache assigned engine locally (assigned_sdma_engine_)
for fast lookup without map access on hot path
- Refactored rocrCopyBuffer() to:
1. Check local cached engine first → use if assigned
2. Call AllocateSdmaEngine() if not assigned → cache result
- Moved HSA API queries (memory_copy_engine_status, memory_get_preferred_copy_engine)
into AllocateEngine() for cleaner separation of concerns
- Engine release on HostQueue::finish() instead of only VirtualGPU destruction
* Improves engine utilization by releasing earlier
* Added virtual ReleaseSdmaEngines() method to device::VirtualDevice
- Added future path for simple round-robin allocation (kUseSimpleRR) for
next-gen GPUs with uniform SDMA bandwidth (disabled by default)
Cleanup:
- Removed selectSdmaEngine() helper (logic moved to allocator)
- Removed getSdmaRWMasks() (allocator accesses maxSdmaReadMask_/WriteMask_ directly)
- Removed unused sdmaEngineReadMask_/WriteMask_ member variables from DmaBlitManager
Benefits:
- Ensures consistent per-stream SDMA engine usage
- Prevents cross-stream contention and engine thrashing
- Prioritizes hardware-optimal paths for inter-GPU transfers
- Better resource utilization through earlier release
- Cleaner, more maintainable code structure
* clr: Use graph segment scheduling to process HIP Graphs
* Add a broader path to use capture packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
Enabled by default
* clr: Few fixes and improvements
* clr: Detect complex graphs to take classic path
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
path
* clr: Fix a cornercase stack corruption
* clr: Track commands of segments instead of snapshots
* clr: Fix Batch dispatch logic
* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accomodate dirty fence on cross
streams
---------
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
* SWDEV-555347 - Remove lock contention in async events loop
* SWDEV-555347 - Introduce Pool of AsyncEventItems
* create generic mempool for AsyncEventItem
* Use BaseShared allocate and free for async event pool
---------
Co-authored-by: Rahul Manocha <rmanocha@amd.com>
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
* SWDEV-554174 Added hipHostRegisterIoMemory flag in test cases
* SWDEV-554174 : Did formatting corrections
* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set
* SWDEV-554608 - set HSA_AMD_MEMORY_POOL_UNCACHED_FLAG if IoMemory is set
* SWDEV-554608 - Add hipHostRegisterIoMemory for hipHostRegister
---------
Co-authored-by: Anavena Venkatesh <Anavena.Venkatesh@amd.com>
Co-authored-by: Rambabu Swargam <rambabu.swargam@amd.com>
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
*Lay foundation to batch packets efficiently for graphs
*Dynamically copy packets with max threshold set with
DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2
*Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256
*If TS are not collected for a signal for reuse, create a new signal.
This can potentially increase signal footprint if the handler doesn't run
fast enough.
* SWDEV-547453 Release the kernel command if the operation returns an error
* SWDEV-547453 - Initialize parameters_ to default value
* SWDEV-547453 - Run clang-format
[ROCm/clr commit: a15957fee9]
* SWDEV-465041 - Add support for user events with DD
User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals
* Fix notifyCmdQueue() logic for OCL
* Avoid blocking calls in OCL with DD
* Add event destruciton in a case of the failure.
[ROCm/clr commit: 2305f8ae56]
- When doing device/stream sync, we can submit a handler which may
introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
HIP runtime.
[ROCm/clr commit: 9b045922a8]
Init and fini kernel needs to be launched when we load and unload code object. Avoid looping through all kernels within a code object just to run the init and fini kernels. Compiler currently only generates 1 init and fini kernel.
[ROCm/clr commit: cd46294b31]
Fini kernel is executed during the invocation of amd::Program destructor,
but the dispatch logic can retain/release the reference counter and
cause double free. Avoid double free with an extra retain() call
[ROCm/clr commit: bddb8f14d1]