* clr: SWDEV-547890 - Maintain an MQD for the emulated AQL queue
To simplify the shader debugger implementation, maintain the relevant
parts of the emulated AQL queue's MQD (amd_queue_t): read_dispatch_id,
write_dispatch_id, compute_tmpring_size.
With this MQD, the shader debugger can handle the emulated AQL queue
the same way it does the real AQL queue, no specialization is required.
* clr: SWDEV-547890 - Conservatively update the MQD's read_dispatch_id
The read_dispatch_id cannot be smaller than the current aql_packet_id
- hsa_queue.size for the debugger to work correctly.
The read_dispatch_id really should be updated when the CmdBuf is marked
as complete. Left a FIXME to address it in a future commit.
---------
Co-authored-by: Laurent Morichetti <laurent.morichetti@amd.com>
* clr: SWDEV-547890 - Maintain an MQD for the emulated AQL queue
To simplify the shader debugger implementation, maintain the relevant
parts of the emulated AQL queue's MQD (amd_queue_t): read_dispatch_id,
write_dispatch_id, compute_tmpring_size.
With this MQD, the shader debugger can handle the emulated AQL queue
the same way it does the real AQL queue, no specialization is required.
* clr: SWDEV-547890 - Conservatively update the MQD's read_dispatch_id
The read_dispatch_id cannot be smaller than the current aql_packet_id
- hsa_queue.size for the debugger to work correctly.
The read_dispatch_id really should be updated when the CmdBuf is marked
as complete. Left a FIXME to address it in a future commit.
The split path for blit kernels are no longer necessary, since the new blit kernels
don't use the copy size as the global workload
[ROCm/clr commit: da198ac5b2]
Compute doesn't support IB chaining, but RGP may collect
perf counters, which require more space in CB.
Increase CB size if RGP is enabled.
Change-Id: Iaa0a620ead8541a679b0dfe5e5711af5afdba545
[ROCm/clr commit: 63cf3057ba]
There are 2 functional changes to this patch:
* Use GPU timing for internal markers for HIP.
* Measure CPU time closer to GPU timer, to reduce delta between GPU/CPU timestamp measurements.
There are some smaller non-functional updates:
* waifForFence -> waitForFence typo
* Remove unused drmProfiling
Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a
[ROCm/clr commit: 179801a750]
This PR adds UberTrace-based tracing support to ROCclr's PAL device class.
Legacy RGP-based tracing is still available and is the default.
If UberTrace support is enabled tool-side, this new code path will activate.
Change-Id: I268b2dcef70e850a50e2caef8355f38bf51d4641
[ROCm/clr commit: e550032d25]
The "optimized" version of memcpy is outdated and
was used in win32 only.
Change-Id: I7f2e0e9051e37cec95438266824b5b0025c324c6
[ROCm/clr commit: 7448113cfc]
Avoid a deadlock on the host call buffer creation. Since the buffer will be
allocated in the queue thread, then use direct device memory allocation
skipping the global context lock.
Change-Id: I09b55ee03bb42ab5d320c152b52a8c842c5fdcc1
[ROCm/clr commit: 62559a6e5a]
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we dont wait for entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasnt included in Torch profile scope
beacuse we previously entered kernel names only when profiler is
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x.
Change barrier's code to match the most recent PAL optimizations.
Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
[ROCm/clr commit: 1239309c90]
__amd_streamOpsWrite blitkernel in device-libs has only 3 args.
so getting rid of the 4th unused arg (sizeBytes)
Change-Id: I81cc1107f8b424bf58558c93a2495a1b878aef91
[ROCm/clr commit: e643406caa]
Extension allows to execute the kernels without a wait barrier and L1
invalidation.
Change-Id: I96c485204303f54a0240b93134f4560673e4bd17
[ROCm/clr commit: 13c6f56ca9]
Luxmark still uses HSAIL path and one subtest can benefit from the wave limit.
Change-Id: I16c94e09cd6e2afd6341cb76bf2e9ab7b7713214
[ROCm/clr commit: dec1158d04]
Add GPU_DEBUG_ENABLE to control ttpm behavior. If enabled,
then HW will collect more debug info at some perf cost
Change-Id: Icee0686b903a7b1bd483710b9d611877cd43c6aa
[ROCm/clr commit: 7d661bc7df]
Extra CPU read back will be performed before every submission to make sure
previous writes over PCIE reached GPU. HDP flush is done by CP.
Change-Id: I402d28ca26c8cee4a3920feb3599af8c285d0889
[ROCm/clr commit: cfc07c88ee]
- Add the new fillBuffer kernel, which allows to launch a limited
number of workgroups for memory fill operation
- Switch fill memory to 16 bytes write by default
- Allow to limit the workgroups with DEBUG_CLR_LIMIT_BLIT_WG
Change-Id: Ibad1822f2d42b2fc71bcfc1917c31409c0623e8e
[ROCm/clr commit: f1dc81f427]
Add support of HIP_FORCE_DEV_KERNARG under PAL.
Fix persistent memory detection for a resource view.
Change-Id: Ifb7db2db14e0c2205a9661cfa53887ec61ab26a4
[ROCm/clr commit: 5f297d75d9]
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
[ROCm/clr commit: 40f41f4d0b]
Restore PAL platform destruction.
Update CmdAllocatorCreateInfo::AllocInfo for the new interface.
Change-Id: Iea418eed7ee26166039a4a9cc1999438856e9097
[ROCm/clr commit: bd00826446]
This reverts commit cab71e6e00.
Implement the right way to make ExternalSemaphores be signalled
only after prior works on the stream have been finished.
Change-Id: I9d5974e05d5f229170b928db4566c14e40e3cbaa
[ROCm/clr commit: d433df4761]
- Program unique AQL index for debugger. The logic manages AQL array of packets per HW queue.
- Provide debug state to PAL
Change-Id: I38fa1f5435fa711fd1d44dc391f2e61eb2a25efa
[ROCm/clr commit: d97cc0abbd]