* SWDEV-520352 - Remove HostThread and legacy monitor
Remove HostThread, semaphore and legacy monitor.
Make original logics of thread and command queue stricker.
Add more comments to make logics clearer.
Some other minor improvement.
Also part of SWDEV-458943.
Also add flag: HIP_FORCE_SPIRV_CODEOBJECT to allow override to force use
SPIRV.
* use cache for already compiled code objects
* address review comments and use the two spirv isa names
Add initial implementation of virtual memory heap with
dynamic virtual memory mapping support for memory pools.
DEBUG_HIP_MEM_POOL_VMHEAP controls the new method.
Change-Id: I8dc5be2e0f34ab472f1800f43bb6243639a5e500
wait() is redesigned with two pathes:
fast path: Use spinlock to wait for notify signal. If the
signal hasn't been received for some loops, go to slow path.
slow path: Use condition_variable's wait().
Improve monitor wrapper for better performance.
Fix some bugs left from name removing patch.
Change-Id: I893a8353121a25d11e37c8e631caf31cc1fc1f24
- Added DEBUG_CLR_SKIP_RELEASE_SCOPE flag to force release scope to
SCOPE_NONE in AQL packet header
Change-Id: Ife02cddb9d5cd4749103ce585d3d5fe9024c6868
- Refactor blit code and clean ASAN instrumentation
- Use unified function for rocr copy
- Enable shader copy path for unpinned writeBuffer/readBuffer paths
- Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for
pinned copies or unpinned H2D/D2H copies < 16KB
Change-Id: I42045cca79234b340dbf53dafb93044199736ae4
Currently amd::Monitor can work in FILO mode for the active waits
and cause a delay in wakeup of some threads. That may have a problem
with the current sysmem pool design.
Change-Id: I145081478d1e0b282d8838855c5718f09cf54b69
Windows alings fields to 8 bytes even with 32bit builds.
Add BUG_CLR_SYSMEM_POOL to cotnrol sysmempool.
Change-Id: I8622aabc9f7391ed7dd8583b252ce9eb41d62293
Applications may submit commands withoout waits
for GPU. That causes a growth of SW unreleased commands.
Make sure runtime flushes SW queue, if it grows over some
threshold, controlled by DEBUG_CLR_MAX_BATCH_SIZE.
Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for CPU status update of the submitted commands
every 50 calls. That will allow to drain the commands and
destroy HSA signals.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
Use env var DEBUG_CLR_KERNARG_HDP_FLUSH_WA=1 to fall back to HDP flush
workaround. The default is 0
Change-Id: I7bdb9be61da60c30d15ac9991b7cd27351e1831c
This change adds a new HIP API `hipExtHostAlloc` which preserves
the functionality of `hipHostMalloc`.
Change-Id: I13504c6fc13465ddd7aed329795bb4f2fef1baff
- Added the optimized multi stream path in graph execution. It uses a fixed number of async streams in the execution
- Optimize the launch latency, where commands
creation and execution is done at the same time
- Optimize the scheduling to use less barriers and waiting signals if
the same queue can be detected
- The new path is controlled by DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single queue async
execution in graphs(applicable for Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
Implement std::mutex based monitor that has much
simpler logics than legacy monitor.
Create DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR to
toggle them.
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = false
(by default), use legacy monitor;
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = true,
use std::mutex based monitor.
If no perf drop of stl::mutex based monitor,
legacy one will be removed later.
Change-Id: I1d21368ff462477d3238d71e4e2a1a7d6b9167ad
Support new comgr unbundling action api to extract codebjects
in compressed and uncompressed modes.
Create HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION ENV to
toggle new path and old path.
If HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION=false(default),
uncompressed codeobject will go old path for better perf,
compressed codeobject will go new path.
If HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION=true,
both uncompressed and compressed codeobjects will go new
path.
Add comgr wrapper for
amd_comgr_action_info_set_bundle_entry_ids()
Change-Id: I79952f132fe21249296685ee12cae05a4f9aec32
This reverts commit e53df57ffe.
Reason for revert: <INSERT REASONING HERE>
New comgr unbundling action leads to perf drop for uncompressed code object. Will create a new patch to use old path for uncompressed , new unbundling api for compressed .
Change-Id: I41ef53b71fc9f7aaa8cf231d4d70945f1117db52
1.Make runtime use comgr to unbundle code objects
2.Support compressed/uncompressed modes
3.Remove HIP_USE_RUNTIME_UNBUNDLER and
HIPRTC_USE_RUNTIME_UNBUNDLER to simplify logics
4.Add comgr wrapper for
amd_comgr_action_info_set_bundle_entry_ids()
Change-Id: Ic41b1ad1b64cca1e31986437983a5146d52a7329
Refactor PrintfDbg::outputArgument() to remove potential risk.
Fix half vector printf issue on all devices.
Fix FEAT-56794 as well.
Change-Id: Iae39359d2128588def2e43d77fe58e868b8e71ff
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we dont wait for entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasnt included in Torch profile scope
beacuse we previously entered kernel names only when profiler is
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006