If an execution command was split into multiple HW operations, the
runtime has to accumulate the time of all operations.
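
A minimal sketch of the accumulation, assuming each HW operation reports
start and end timestamps in nanoseconds. HwOpTiming and AccumulateDuration
are hypothetical names for illustration, not types or functions from the
runtime, and summing per-operation durations is an assumed reading of
"accumulate":

```cpp
#include <cstdint>
#include <vector>

// Hypothetical per-operation timestamps; not a type from the runtime.
struct HwOpTiming {
  std::uint64_t start_ns;
  std::uint64_t end_ns;
};

// Sums the duration of every HW operation that one command was split into.
// Operations with end <= start contribute nothing.
std::uint64_t AccumulateDuration(const std::vector<HwOpTiming>& ops) {
  std::uint64_t total_ns = 0;
  for (const HwOpTiming& op : ops) {
    if (op.end_ns > op.start_ns) total_ns += op.end_ns - op.start_ns;
  }
  return total_ns;
}
```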
Change-Id: Iaba31e96250918d8190bf63adb4c07730fdfefbf
[ROCm/clr commit: 24f5362296]
Maintain status of handler callback. For event records we no longer
submit callbacks to reduce the load on the async handler thread. However,
without a callback we leak command memory and never decrement refcounts.
Track the status of the handler so that a callback can be queued when
finish is called.
Change-Id: I89fd02f3d047a0e8162664ee17581a14795f1928
[ROCm/clr commit: 5df34a2f7a]
Move hidden heap creation to the kernel launch to make sure it's
allocated on the actual first usage.
Change-Id: I1b65a82fc06d9129ed45a69765bf14ea3d945b04
[ROCm/clr commit: 4975f69337]
Set release scope as system for dispatch AQL when events are passed to
hip*LaunchKernelGGL*
Change-Id: I93b91591e0ab023f1ecc5247f7905eca26147358
[ROCm/clr commit: 02566677cf]
Disable hostcall buffer in OCL for now. COv5 can add hostcallbuffer
metadata for an unknown reason. OCL may then fail the buffer allocation
and the kernel launch.
Change-Id: I34a6a45bac86c57422b764c0d69760c96920d6c5
[ROCm/clr commit: 934149ff0a]
- check PCIe atomics support for printf functionality
- if not enabled, printf won't work
Signed-off-by: sdashmiz <shadi.dashmiz@amd.com>
Change-Id: Ib366e8e71772b02210c4a830bca4bd8cc7a11664
[ROCm/clr commit: 15f1632dfa]
- Add a global cache state for a device to indicate scopes of submitted
AQL packets
- Remove scopes for TS marker if hipEventReleaseToDevice is passed. Set
env ROC_EVENT_NO_FLUSH=1 to use NOP AQL for event records.
By default, caches are flushed with a system scope release.
- finish() should check whether caches are flushed and, if not, queue a
marker
Change-Id: Ibbbdbb1cd7ac61cb35649169212142545be159e0
[ROCm/clr commit: 8eeaa998c0]
Remove assert for kernel arg size, because COv5 reports a value
bigger than the actual usage in most cases
Change-Id: I8e15bc45a9e21b58a5894f9977511ca84408ce61
[ROCm/clr commit: 2be0b1e612]
With COv5, local size calculation must occur before the
runtime programs kernel arguments
Change-Id: I0726c6529bde69b8fcf5360aa83986cf84e04168
[ROCm/clr commit: caa6110c29]
- Fix a crash with AMD_CPU_AFFINITY=1, as numa_bitmask_alloc isn't the
right API to allocate the bitmask
- Do not set affinity for the ROCr thread. It worsens performance rather
than improving it.
- Fix regression from my previous change for event handler.
Change-Id: I3ea75adc2a6333f29752283eddd5b555e9b58cc5
[ROCm/clr commit: 802c2c8a9f]
- Queue handler for hipEventRecord(aka marker_ts_) only if there is a
callback associated with it.
Change-Id: I8a9877ae0e342556053abbaacc9510744a8e772a
[ROCm/clr commit: 3c3c0ca4c5]
Pass active queue for transfers in the cache coherency layer.
That allows the device transfer queue to be used only in
cases when the active queue isn't available, because using the device
transfer queue from another active queue may cause a deadlock
Change-Id: Ifbe7e0303b77dbf6eeda3939ffbc25a3df7472de
[ROCm/clr commit: 95d55fdfa8]
If the reported GlobalMemCacheLine is 0, the runtime may run into an
infinite loop, because KernelSegmentAlignment is chosen as the size of
the cache line.
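
A sketch of how a zero alignment can hang a loop and one way to guard
against it. The fallback value and the names SegmentAlignment and
CountAlignedSlots are assumptions for illustration; the actual fix in the
runtime is not described here:

```cpp
#include <cstddef>

// Hypothetical fallback; the value the runtime picks is not stated.
constexpr std::size_t kFallbackAlignment = 64;

// Never returns 0, even if the device reports a cache line size of 0.
std::size_t SegmentAlignment(std::size_t reported_cache_line) {
  return reported_cache_line != 0 ? reported_cache_line : kFallbackAlignment;
}

// Walks a buffer in alignment-sized steps. With a step of 0 the offset
// would never advance, so the loop relies on the guard above.
std::size_t CountAlignedSlots(std::size_t bytes, std::size_t reported_cache_line) {
  const std::size_t step = SegmentAlignment(reported_cache_line);
  std::size_t slots = 0;
  for (std::size_t offset = 0; offset < bytes; offset += step) ++slots;
  return slots;
}
```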
Change-Id: Ide547940cc0407f16fab10ee210b4fd3ae4eaafc
[ROCm/clr commit: 041ddc0c1c]
Metadata in Codeobject version 5 is an extension of CO3 and CO4.
Add the detection of the new fields and program them in
the setup of the kernel arguments.
Change-Id: I27e58df77320ad00f4f16d35912668db803826af
[ROCm/clr commit: be6a06384e]
Add a state indicator to retain ExternalSignals when needed.
Co-operative group launch uses external signals to indicate a dependency
to the next command.
Change-Id: I6d0daa006e2377c3bbf4aeca0fd5b63c7ac8fbbb
[ROCm/clr commit: 1fbd75b825]
The crash was due to the fact that the external signal structure was
stale even after destroying the command. That is because we skipped the
wait due to a missing check.
Detect external signals and dispatch a barrier in ReleaseGpuMemoryFence.
Also clear external_signals_ at ProfilingBegin.
Change-Id: I991387edcfe928b511bf5e780988ee131321ed5a
[ROCm/clr commit: 3239222516]
Add a threshold for ROCR/SDMA P2P transfers. The ROCR copy path
requires extra barriers in compute for synchronization. That hurts
performance for tiny transfers.
Reduce active wait time to 10us. TensorFlow uses an extra thread
per GPU with constant hipEventQuery() calls. Longer active waits
in ROCr affect CPU performance.
Change-Id: I9020358438615fa2d4617f862f00a562f0a588e7
[ROCm/clr commit: 008133cf41]
A stall in the host thread could occur earlier than the app expects.
Make sure the runtime can grow the signals to the queue size without
any stall. Also, adding a new signal to the end of the pool could
break the dependency chain on signal reuse. The new logic
inserts the new signal after the current one to keep the chain intact.
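
A simplified sketch of growing a circular signal pool by inserting after
the current entry. Signal, SignalPool and their members are hypothetical
names; the real pool tracks HW signals rather than a busy flag:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a HW signal; not a type from the runtime.
struct Signal {
  int id;
  bool busy;
};

class SignalPool {
 public:
  // initial must be at least 1; max_size models the queue size.
  SignalPool(std::size_t initial, std::size_t max_size) : max_size_(max_size) {
    for (std::size_t i = 0; i < initial; ++i) signals_.push_back(Signal{next_id_++, false});
  }

  // Returns the id of the next signal, or -1 when the caller has to wait.
  int Acquire() {
    std::size_t next = (current_ + 1) % signals_.size();
    if (signals_[next].busy) {
      if (signals_.size() >= max_size_) return -1;  // pool is full: stall
      // Grow right after the current signal, so the oldest busy signal
      // still follows the new one in the ring.
      next = current_ + 1;
      signals_.insert(signals_.begin() + static_cast<std::ptrdiff_t>(next),
                      Signal{next_id_++, false});
    }
    current_ = next;
    signals_[current_].busy = true;
    return signals_[current_].id;
  }

  void Release(int id) {
    for (Signal& s : signals_) {
      if (s.id == id) s.busy = false;
    }
  }

  // Ring order of signal ids, for inspection.
  std::vector<int> Order() const {
    std::vector<int> ids;
    for (const Signal& s : signals_) ids.push_back(s.id);
    return ids;
  }

 private:
  std::vector<Signal> signals_;
  std::size_t max_size_;
  std::size_t current_ = 0;
  int next_id_ = 0;
};
```

In this sketch the pool only reports a stall once it has grown to
max_size and the next signal in the ring is still busy.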
Change-Id: I9c90b98515907db8b677528263c3e88cd9581a14
[ROCm/clr commit: 102c19adf3]
A reference to the first element can trigger an assert in a
_GLIBCXX_ASSERTIONS build
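
A sketch of the general pattern, assuming the container can be empty at
the point of access (the message does not say which container is
involved). FirstElementPtr is a hypothetical helper:

```cpp
#include <vector>

// Returns a pointer to the first element without forming a reference.
// &values[0] goes through operator[], which is bounds-checked when
// _GLIBCXX_ASSERTIONS is defined and aborts for an empty vector.
// data() may be called on an empty vector; the result must not be
// dereferenced in that case.
const int* FirstElementPtr(const std::vector<int>& values) {
  return values.data();
}
```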
Change-Id: I59c63c052831307edfe5dcc6384798a43e9596dd
[ROCm/clr commit: 6f2e7c3199]
The optimization is controlled with ROCR_SKIP_KERNEL_ARG_COPY.
This is the initial check-in for experiments. Extra changes are
necessary for full support:
- handle graph capture with the original sysmem alloc
- avoid memobject references, otherwise there is a race condition with
reuse of the arg buffer
- Remove arg setup from hip
Change-Id: Ib0af710f93e79834711fa4049a7c66093711e68b
[ROCm/clr commit: 7e12cf6318]
The kernel arg pool is divided into 8 chunks to avoid long stalls
when the pool is reused.
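
One way to read the chunking, as a sketch: a bump allocator that reports
when it enters a new chunk, so the caller only has to wait for earlier
users of that chunk instead of the whole pool. ChunkedArgPool and its
members are hypothetical, and the wait itself is left out:

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kNumChunks = 8;

// Bump allocator over a pool split into kNumChunks equal chunks.
class ChunkedArgPool {
 public:
  struct Allocation {
    std::size_t offset;      // byte offset inside the pool
    std::size_t chunk;       // chunk that holds the allocation
    bool entered_new_chunk;  // true: wait for earlier users of this chunk
  };

  explicit ChunkedArgPool(std::size_t pool_size) : chunk_size_(pool_size / kNumChunks) {}

  Allocation Allocate(std::size_t size) {
    assert(size <= chunk_size_);  // one allocation never spans two chunks
    bool entered = false;
    if (used_in_chunk_ + size > chunk_size_) {
      chunk_ = (chunk_ + 1) % kNumChunks;  // wrap around: the pool is reused
      used_in_chunk_ = 0;
      entered = true;
    }
    const Allocation result{chunk_ * chunk_size_ + used_in_chunk_, chunk_, entered};
    used_in_chunk_ += size;
    return result;
  }

 private:
  std::size_t chunk_size_;
  std::size_t chunk_ = 0;
  std::size_t used_in_chunk_ = 0;
};
```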
Change-Id: I228e6ca1c09e428c1775f1e5b685220a9a5d71af
[ROCm/clr commit: f78b3a8919]
ROC_AQL_QUEUE_SIZE controls the size of the AQL queue.
The current default value is 4096.
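
An illustrative sketch of reading such a setting with a fallback to the
default. The runtime has its own settings mechanism, so
ParseAqlQueueSize and AqlQueueSizeFromEnv are hypothetical helpers, and
rejecting malformed values is an assumption:

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <limits>

constexpr std::uint32_t kDefaultAqlQueueSize = 4096;

// Parses a decimal queue size; falls back to the default for a missing,
// malformed, zero, or out-of-range value.
std::uint32_t ParseAqlQueueSize(const char* value) {
  if (value == nullptr || !std::isdigit(static_cast<unsigned char>(*value))) {
    return kDefaultAqlQueueSize;
  }
  char* end = nullptr;
  const unsigned long long parsed = std::strtoull(value, &end, 10);
  if (*end != '\0' || parsed == 0 ||
      parsed > std::numeric_limits<std::uint32_t>::max()) {
    return kDefaultAqlQueueSize;
  }
  return static_cast<std::uint32_t>(parsed);
}

// Reads the environment variable named in the message above.
std::uint32_t AqlQueueSizeFromEnv() {
  return ParseAqlQueueSize(std::getenv("ROC_AQL_QUEUE_SIZE"));
}
```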
Change-Id: Icd2a4ee3ba554c06aa05b08defd922d2c63e43fd
[ROCm/clr commit: 7fe696b6ef]
Use a blit kernel to perform
a hipStreamWait/write instead of an AQL packet.
Change-Id: I462671ed5cec37144dfe97ff66439249196117c1
[ROCm/clr commit: cbb8d82bdb]
The original logic left only one slot for HW processing in the queue.
For some reason there is a race condition when the CPU overwrites the
slot before the current active one. The workaround is to also avoid the
slot preceding the current active one, since its HW processing may be
unfinished.
Change-Id: I565495a8feeaedffc9fc8a505edbee5ff5816975
[ROCm/clr commit: 65ddfcc6a8]
Pass the device agent specified by the user to the ROCr API instead of
the device agent attached to the specified stream
Change-Id: I86c98935b9dc404eaa6d47ccdd082a8c3678fb36
[ROCm/clr commit: 169cc857fd]