* clr: Implement dynamic stream to HW queue assignment
This change implements dynamic stream to hardware queue (HWq) mapping
with the following features:
* Queue depth heuristics with weights for optimal HWq assignment
* Make last used queue sticky for better locality
* Use pipe HWq to pipe mapping - gfx9 follows a round-robin queue to
pipe mapping based on creation order (single process per device only,
as pipe ID is statically assigned by runtime)
* More aggressive heuristic usage for better queue distribution
* Extend dynamic queues support for all stream priorities
Environment variables:
* DEBUG_HIP_DYNAMIC_QUEUE: 0 - disabled, 1 - Depth heuristics 2 -
Depth+Pipe heuristics
* DEBUG_HIP_IGNORE_STREAM_PRIORITY=1: ignore priority stream creation
* clr: Clean up last_used_queue_
- When doing device/stream sync, we can submit a handler which may
introduce some host side delays. Use DEBUG_CLR_BATCH_CPU_SYNC_SIZE to
batch commands for host wait. Default for HIP is 8 commands.
- Investigation is underway in ROCr but need to address this for now in
HIP runtime.
[ROCm/clr commit: 9b045922a8]
- This change tries to save extra synchronization packets we may insert
as we didnt track the completion signals for every command. We track
the current enqueued command until it exits the enqueue stage. We also
record the exit scope to know if we flushed the caches
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
passed as the argument
Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc
[ROCm/clr commit: e03e4f3b5d]
wait() is redesigned with two pathes:
fast path: Use spinlock to wait for notify signal. If the
signal hasn't been received for some loops, go to slow path.
slow path: Use condition_variable's wait().
Improve monitor wrapper for better performance.
Fix some bugs left from name removing patch.
Change-Id: I893a8353121a25d11e37c8e631caf31cc1fc1f24
[ROCm/clr commit: f2ff56af9c]
Applications may submit commands withoout waits
for GPU. That causes a growth of SW unreleased commands.
Make sure runtime flushes SW queue, if it grows over some
threshold, controlled by DEBUG_CLR_MAX_BATCH_SIZE.
Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396
[ROCm/clr commit: 8657a77029]
The new set tracks only the queues that have a command
submitted to them. This allows for fast iteration
in waitActiveStreams.
Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7
[ROCm/clr commit: bcc545e6b8]
- Refactor code and cleanup logic for callback saving for event records
Change-Id: I5c56aa8e9c968a5bca70fb07ad1796da318e9e89
[ROCm/clr commit: 1338ff37e8]
hipStreamSynchronize and hipDeviceSynchronize won't longer wait
for CPU commands in DD mode
Change-Id: I079c8bbfc34ddc6d3e2d74c92a34665877e512a5
[ROCm/clr commit: fbea58ba11]
OS can terminate unfinished queue thread from default stream at any
time. Potentially leaving the queue lock in a bad state and causing a
deadlock if runtime destroys VirtualGPU later from the host thread.
Change-Id: I247f102ee84e6b4dba947504933395071945c85d
[ROCm/clr commit: 28daf98f1f]
Windows kills threads on exit without any notification. However,
runtime can still destroy VirtualGPU object from the host thread with
HostQueue destruction.
This change also forces RGP trace transfer on the last capture without
any delays.
Change-Id: I768e87e99e1d23a021e63c12f36e450817743759
[ROCm/clr commit: ad33a021cb]
The heap must be cleared once per device, but ROCclr doesn't
create a queue per device in HIP. Hence, the clear operation will
be performed during the first queue creation.
Change-Id: I52ceb06d67d11cde6d019c5ab510059f426a9bfb
[ROCm/clr commit: 04bfd93569]
Enqueue a handler callback for hipEventRecords(aka marker_ts_) for every
64 submits, This recycles the memory if we dont end up calling
synchronize for the longest time.
Change-Id: I3d39fe76d52a5d81387927edd85b5663b563682c
[ROCm/clr commit: fa76f03654]
Make sure the logic updates the command status when it's done in
HW, but not on submission.
Add the last command tracking, otherwise queue sync logic in the HIP
upper layer may skip synchronization, assuming the queue is empty.
Change-Id: I2d046792553e74df090a10f7d7a78914610f6df2
[ROCm/clr commit: 5b31c69a95]
Reduce the size of the queueLock and lastCmdLock critical sections
to improve lock contention performance. The smaller the critical
sections are the better.
lasCmdLock is still needed to guarantee that getLastEnqueueCommand_
can retain the command before it is swapped out and released.
Change-Id: Id35d4a77c035b2da0de4c15568b153d49e958bb7
[ROCm/clr commit: 080dcfe857]
The current implementation creates default reference in the stack and assigns it to class member cuMasks_, so whenever the content of the stack changes, cuMask_ would change.
Change-Id: Iefab63c335d504b83c4ae90bd34ae76c6afb8f3c
[ROCm/clr commit: 8ef5da00c7]
When HIP_ENABLE_DEFERRED_LOADING=0, many global variables will be
referenced but they are not initialized in that early time. The patch
will use constexpr to initialze global constant varables in compile
time.
Change-Id: I9d538b7abc6a0ce700ec3332b97fc144db5fc1ef
[ROCm/clr commit: fdef6f722f]
Two threads can enqueue to the same HostQueue (HostQueue::enqueue)
and result in last queued command being the first one reachine queue_.enqueue
NOTE: Temporarly make setLastQueuedCommand empty function to pass the build
Change-Id: Id09c3a28d184986f52b2ec86a2f6a18c40df1f0b
[ROCm/clr commit: 3d15a1e291]