There are 2 functional changes to this patch:
* Use GPU timing for internal markers for HIP.
* Measure CPU time closer to GPU timer, to reduce delta between GPU/CPU timestamp measurements.
There are some smaller non-functional updates:
* waifForFence -> waitForFence typo
* Remove unused drmProfiling
Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a
- Resolve signal dependencies for barrier value packet if there are > 1
depenent signals. Barrier Value packet accounts for only 1 dep signal
- Better log
Change-Id: Ia506ad5d80b91d598f92e7b539f41756e9b4b64b
Add an atomic counter to track the outstanding HSA handlers.
Wait on CPU for the callbacks if the number exceeds the value
in DEBUG_HIP_BLOCK_SYNC env variable.
Change-Id: I95dc8c4bf0258c7e59411b7504220709ed6898c5
Applications may submit commands withoout waits
for GPU. That causes a growth of SW unreleased commands.
Make sure runtime flushes SW queue, if it grows over some
threshold, controlled by DEBUG_CLR_MAX_BATCH_SIZE.
Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for CPU status update of the submitted commands
every 50 calls. That will allow to drain the commands and
destroy HSA signals.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
The new set tracks only the queues that have a command
submitted to them. This allows for fast iteration
in waitActiveStreams.
Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7
We must be in protected way to get last command when calling
awaitCompletion() where lastCommand will be released and
possibly destroyed.
This can solve scope lock(notify_lock_) crash in
Event::notifyCmdQueue() with AMD_DIRECT_DISPATCH = true.
Change-Id: I4297166f912a71112f4a8945d993160ba9afdc34
- No cpu wait is needed when profiler is attached, Doing this changes
the application profile when roctracer is attached.
Change-Id: I2b9cfc48d697cf5ed54bb6a240d8c12bdb079171
It seems that due to removal of vdev()->isHandlerPending(),
Marker queued to ensure finish is not enqueued and that cause
hung at waiting event for kernel enqueue command.
Change-Id: I364abb2dcb4897b11a7eb61b5d85013b69292792
Some pytorch tests use a tracer plugin and rely on profiling information
to be reported right after hipDeviceSynchronize()
Change-Id: Ib021a1e7b1a30b3c24de72627c471810f7f7878d
Use large signal pool if profiler is connected or profiling forced
enabled. This is needed to mitigate signal creation overhead when
profiling as signals are attached to every packet and deeper batch may
show overhead of signal allocation.
Change-Id: I8034b8a20b55328b87d593bf044f59672f9653e8
- Enable CUs adjacent pairwise for WGP mode
- In HostQueue::terminate() do not segfault if virtual device hasn't been created
Change-Id: I94402ff333308af5824878086cc238b3993d534d
OS can terminate unfinished queue thread from default stream at any
time. Potentially leaving the queue lock in a bad state and causing a
deadlock if runtime destroys VirtualGPU later from the host thread.
Change-Id: I247f102ee84e6b4dba947504933395071945c85d
Windows kills threads on exit without any notification. However,
runtime can still destroy VirtualGPU object from the host thread with
HostQueue destruction.
This change also forces RGP trace transfer on the last capture without
any delays.
Change-Id: I768e87e99e1d23a021e63c12f36e450817743759
- Store last fence scopes and use the last value to determine if we need a cache flush again. This helps cases where hipExtLaunchKernel API is
used.
- Purge code for ROC_EVENT_NO_FLUSH
Change-Id: I531cf9c9c60d5e2b3a9e265d0f52f79ed2fa8a8c
The fix for SWDEV-329789 moved down the last use of the a
command object pointer in order to prevent a race condition.
However, the previous patch did not move down the release of
that command. By releasing the command early, another thread
could get a command with the same pointer. That second thread
could later submit work to the queue using that new command.
The first thread could then perform a comparison against the
queue's last command using its own now-stale pointer. This
could eventually allow the second thread to skip synchornizing
on the queue. This would result in host synchronizations
completing before their device work was actually complete.
Change-Id: I292b7b369743251ceafe453a4c5cae14a6d01046
Runtime can reset the last command only if it didn't change
since the query at the beginning of finish()
Change-Id: I629f2d788e9bbaa17ca4e96b1a753f8131e32463
Maintain status of handler callback. For event records we no longer
submit callbacks to reduce the load on the async handler thread. However
without a callback we leak command memory/decrement refcounts. Indicate
status of the handler which we can use to queue a callback when
finish is called.
Change-Id: I89fd02f3d047a0e8162664ee17581a14795f1928
The heap must be cleared once per device, but ROCclr doesn't
create a queue per device in HIP. Hence, the clear operation will
be performed during the first queue creation.
Change-Id: I52ceb06d67d11cde6d019c5ab510059f426a9bfb
Enqueue a handler callback for hipEventRecords(aka marker_ts_) for every
64 submits, This recycles the memory if we dont end up calling
synchronize for the longest time.
Change-Id: I3d39fe76d52a5d81387927edd85b5663b563682c
- Add a global cache state for a device to indicate scopes of submitted
AQL packets
- Remove scopes for TS marker if hipEventReleaseToDevice is passed. Set
env ROC_EVENT_NO_FLUSH=1 to use NOP AQL for event records.
It would flush caches by default with system scope release.
- Calling finish() should ensure if caches are flushed, if not queue a
marker
Change-Id: Ibbbdbb1cd7ac61cb35649169212142545be159e0
If AMD event contains a reference to a HW event, then runtime
could check/wait for HW event. CPU status update will occur later
after HSA signal callback, but it's not important for the result.
Change-Id: I591391a953bbdba6a25ac07e2cd98aeb17cd4596
- Avoid GPU wait on the marker submission and update the command
batch after HSA signal callback upon HSA barrier completion.
Change-Id: I5c1c97212aefc2ae4b99aa9e2a81627ee9a38c1c