Windows aligns fields to 8 bytes even in 32-bit builds.
Add DEBUG_CLR_SYSMEM_POOL to control the sysmem pool.
Change-Id: I8622aabc9f7391ed7dd8583b252ce9eb41d62293
Applications may submit commands without waiting
for the GPU. That causes unreleased SW commands to accumulate.
Make sure the runtime flushes the SW queue if it grows over a
threshold controlled by DEBUG_CLR_MAX_BATCH_SIZE.
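
A minimal sketch of the threshold-based flush described above. The class and member names are hypothetical and the flush is a stand-in for waiting on the GPU; only the idea of flushing once the queue grows over a configured batch size comes from this change.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical software queue, for illustration only.
class SoftwareQueue {
 public:
  // maxBatchSize plays the role of DEBUG_CLR_MAX_BATCH_SIZE.
  explicit SoftwareQueue(std::size_t maxBatchSize) : maxBatchSize_(maxBatchSize) {
    assert(maxBatchSize > 0 && "threshold must be positive");
  }

  // Queue a command; flush once the unreleased count grows over the threshold.
  void submit(int command) {
    pending_.push_back(command);
    if (pending_.size() > maxBatchSize_) {
      flush();
    }
  }

  // Stand-in for waiting on the GPU and releasing the completed commands.
  void flush() {
    released_ += pending_.size();
    pending_.clear();
    ++flushCount_;
  }

  std::size_t pending() const { return pending_.size(); }
  std::size_t released() const { return released_; }
  std::size_t flushCount() const { return flushCount_; }

 private:
  std::size_t maxBatchSize_;
  std::vector<int> pending_;
  std::size_t released_ = 0;
  std::size_t flushCount_ = 0;
};
```

With this shape, an application that never waits still keeps at most maxBatchSize + 1 unreleased commands alive at any time.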
Change-Id: Ia4d85c24210ef91c394f638ab6b53b14323a0396
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for the CPU status update of the submitted commands
every 50 calls. That allows the commands to drain and the
HSA signals to be destroyed.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
Use env var DEBUG_CLR_KERNARG_HDP_FLUSH_WA=1 to fall back to the HDP flush
workaround. The default is 0.
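
A sketch of how a 0/1 environment toggle with a default of 0 can be read. The helper names are hypothetical, and the runtime's own flag machinery may work differently; only the variable name and its default come from this change.

```cpp
#include <cstdlib>

// Interpret the raw text of an environment flag.
// Unset, empty, or non-numeric text falls back to the default.
bool parseBoolFlag(const char* text, bool defaultValue) {
  if (text == nullptr || *text == '\0') return defaultValue;
  char* end = nullptr;
  const long parsed = std::strtol(text, &end, 10);
  if (end == text) return defaultValue;  // no digits were consumed
  return parsed != 0;
}

// Hypothetical accessor: true selects the HDP flush workaround.
bool useHdpFlushWorkaround() {
  return parseBoolFlag(std::getenv("DEBUG_CLR_KERNARG_HDP_FLUSH_WA"), false);
}
```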
Change-Id: I7bdb9be61da60c30d15ac9991b7cd27351e1831c
This change adds a new HIP API `hipExtHostAlloc` which preserves
the functionality of `hipHostMalloc`.
Change-Id: I13504c6fc13465ddd7aed329795bb4f2fef1baff
- Add the optimized multi-stream path in graph execution. It uses a fixed number of async streams for execution
- Optimize the launch latency: command
creation and execution are done at the same time
- Optimize the scheduling to use fewer barriers and waiting signals if
the same queue can be detected
- The new path is controlled by DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single-queue async
execution in graphs (applicable to Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
Implement a std::mutex based monitor that has much
simpler logic than the legacy monitor.
Create DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR to
toggle between them.
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = false
(the default), use the legacy monitor;
if DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = true,
use the std::mutex based monitor.
If the std::mutex based monitor shows no perf drop,
the legacy one will be removed later.
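
For illustration, a monitor built only on std::mutex and std::condition_variable can be as small as the sketch below. The class name and interface are hypothetical and non-recursive; this is not the runtime's actual monitor.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical std::mutex based monitor, for illustration only.
class StdMutexMonitor {
 public:
  void lock() { mutex_.lock(); }
  bool tryLock() { return mutex_.try_lock(); }
  void unlock() { mutex_.unlock(); }

  // The caller must already hold the lock. The lock is released while
  // blocked and reacquired before returning. The predicate guards
  // against spurious wakeups.
  template <typename Predicate>
  void wait(Predicate ready) {
    std::unique_lock<std::mutex> guard(mutex_, std::adopt_lock);
    cv_.wait(guard, ready);
    guard.release();  // hand ownership back to the caller without unlocking
  }

  void notify() { cv_.notify_one(); }
  void notifyAll() { cv_.notify_all(); }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
};
```

A waiter calls lock(), then wait() with a predicate over state protected by the monitor, then unlock(); a notifier changes that state under the lock and calls notify() or notifyAll().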
Change-Id: I1d21368ff462477d3238d71e4e2a1a7d6b9167ad
Support the new comgr unbundling action API to extract code objects
in compressed and uncompressed modes.
Create the HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION env var to
toggle between the new path and the old path.
If HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION=false (default),
uncompressed code objects take the old path for better perf and
compressed code objects take the new path.
If HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION=true,
both uncompressed and compressed code objects take the new
path.
Add a comgr wrapper for
amd_comgr_action_info_set_bundle_entry_ids().
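
The path selection described above reduces to a small decision function. The enum and function names below are hypothetical; only the toggle semantics are taken from this change.

```cpp
// Hypothetical names, for illustration only.
enum class UnbundlePath { Old, NewComgrAction };

// alwaysUseNewAction mirrors HIP_ALWAYS_USE_NEW_COMGR_UNBUNDLING_ACTION.
constexpr UnbundlePath selectUnbundlePath(bool isCompressed, bool alwaysUseNewAction) {
  // Compressed code objects always take the new path; uncompressed ones
  // take it only when the toggle forces it.
  return (isCompressed || alwaysUseNewAction) ? UnbundlePath::NewComgrAction
                                              : UnbundlePath::Old;
}
```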
Change-Id: I79952f132fe21249296685ee12cae05a4f9aec32
This reverts commit e53df57ffe.
Reason for revert: The new comgr unbundling action leads to a perf drop for uncompressed code objects. A new patch will use the old path for uncompressed code objects and the new unbundling API for compressed ones.
Change-Id: I41ef53b71fc9f7aaa8cf231d4d70945f1117db52
1. Make the runtime use comgr to unbundle code objects
2. Support compressed/uncompressed modes
3. Remove HIP_USE_RUNTIME_UNBUNDLER and
HIPRTC_USE_RUNTIME_UNBUNDLER to simplify the logic
4. Add a comgr wrapper for
amd_comgr_action_info_set_bundle_entry_ids()
Change-Id: Ic41b1ad1b64cca1e31986437983a5146d52a7329
- Create a vector to allow multiple TS to be stored in a Command.
- This means we don't wait for the entire batch in the Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned, but the KFD allows a max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If profiling is enabled dynamically, as for Torch, a crash
can happen if hipGraphInstantiate wasn't included in the Torch profile scope,
because we previously stored kernel names only when the profiler was
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
- Enable Device kernel args for MI300* for now.
- Fix a perf issue which impacts graph instantiate when dev kernel args
are enabled.
Change-Id: I962e58fd9d8dd1a8db95e601cb03a8e9c7bac97f
- Implement workaround to ensure HDP writes are done by writing and
reading the HDP MMIO register.
- Implement the same workaround for graphs; we no longer need the sentinel
write/readback
Change-Id: I0d3027b46a1f61131ec62e3c8c669ff5184fa6b2
Add GPU_DEBUG_ENABLE to control ttmp behavior. If enabled,
the HW will collect more debug info at some perf cost.
Change-Id: Icee0686b903a7b1bd483710b9d611877cd43c6aa
- Add the new fillBuffer kernel, which allows launching a limited
number of workgroups for memory fill operations
- Switch fill memory to 16-byte writes by default
- Allow limiting the workgroups with DEBUG_CLR_LIMIT_BLIT_WG
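
A CPU-side sketch of the idea behind a fill with a limited number of workgroups: each workgroup walks the buffer in a strided loop and writes 16 bytes at a time, so a fixed workgroup count covers any buffer size. The function name and signature are hypothetical, and the real kernel runs on the GPU.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// CPU stand-in for a fill kernel, for illustration only.
// maxWorkgroups plays the role of the DEBUG_CLR_LIMIT_BLIT_WG limit.
void fillBufferLimited(std::vector<std::uint8_t>& buffer,
                       const std::array<std::uint8_t, 16>& pattern,
                       std::size_t maxWorkgroups) {
  assert(maxWorkgroups > 0);
  assert(buffer.size() % pattern.size() == 0);
  const std::size_t chunks = buffer.size() / pattern.size();
  // On a GPU the outer loop runs in parallel, one iteration per workgroup.
  for (std::size_t wg = 0; wg < maxWorkgroups; ++wg) {
    // Strided loop: workgroup wg handles chunks wg, wg + N, wg + 2N, ...
    for (std::size_t chunk = wg; chunk < chunks; chunk += maxWorkgroups) {
      std::memcpy(buffer.data() + chunk * pattern.size(), pattern.data(),
                  pattern.size());
    }
  }
}
```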
Change-Id: Ibad1822f2d42b2fc71bcfc1917c31409c0623e8e
- Rename HIP_USE_SDMA_QUERY to DEBUG_CLR_USE_SDMA_QUERY as this is
supposed to be a temporary env var for debug purposes only.
Change-Id: If6ebd52ab87624375a3df24ceccdcc05c60a65af
With recent upstream changes (D145770), we can now use the
Comgr unbundler without requiring an env field in the supplied
targetID. For users, this is consistent with previous legacy
unbundler behavior.
Change-Id: I5f085b0fa1ad352bbbb282b75367c206b75f279f
HIP_FORCE_DEV_KERNARG=1 will create a device allocation for the kernel arg
segment. The flag is 0 by default.
Change-Id: Iaaf5a149f3be8596568878d5d272268baf067c60
The change enables VM support in graphs on Windows. That avoids
caching all allocations, at the cost of map/unmap
overhead during memory create/destroy.
Change-Id: I792be00fba099e5e5d3cd44a963e1dfd6976a86d