The split path for blit kernels is no longer necessary, since the new blit kernels
don't use the copy size as the global workload.
[ROCm/clr commit: da198ac5b2]
Compute doesn't support IB chaining, but RGP may collect
perf counters, which require more space in CB.
Increase CB size if RGP is enabled.
Change-Id: Iaa0a620ead8541a679b0dfe5e5711af5afdba545
[ROCm/clr commit: 63cf3057ba]
There are 2 functional changes in this patch:
* Use GPU timing for internal markers for HIP.
* Measure CPU time closer to GPU timer, to reduce delta between GPU/CPU timestamp measurements.
There are some smaller non-functional updates:
* waifForFence -> waitForFence typo
* Remove unused drmProfiling
Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a
[ROCm/clr commit: 179801a750]
This PR adds UberTrace-based tracing support to ROCclr's PAL device class.
Legacy RGP-based tracing is still available and is the default.
If UberTrace support is enabled tool-side, this new code path will activate.
Change-Id: I268b2dcef70e850a50e2caef8355f38bf51d4641
[ROCm/clr commit: e550032d25]
The "optimized" version of memcpy is outdated and
was used only on win32.
Change-Id: I7f2e0e9051e37cec95438266824b5b0025c324c6
[ROCm/clr commit: 7448113cfc]
Avoid a deadlock on the host call buffer creation. Since the buffer will be
allocated in the queue thread, use direct device memory allocation,
skipping the global context lock.
Change-Id: I09b55ee03bb42ab5d320c152b52a8c842c5fdcc1
[ROCm/clr commit: 62559a6e5a]
- Create a vector to allow multiple timestamps (TS) to be stored in Command.
- This means we don't wait for the entire batch in the Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned, but the KFD allows a max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling, as Torch does, a crash
can happen if hipGraphInstantiate wasn't included in the Torch profile scope,
because we previously entered kernel names only when the profiler was
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
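A minimal sketch of the "multiple TS per Command" idea from the entry above, under stated assumptions: the `Timestamp` and `Command` types, their field names, and the min/max aggregation are hypothetical illustrations and are not taken from ROCclr's sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical timestamp record; field names are illustrative only.
struct Timestamp {
  std::uint64_t startNs;
  std::uint64_t endNs;
};

// Hypothetical command that owns several timestamps instead of a single one.
class Command {
 public:
  void addTimestamp(const Timestamp& ts) { timestamps_.push_back(ts); }
  std::size_t timestampCount() const { return timestamps_.size(); }

  // Earliest start across all stored timestamps (0 if none were recorded).
  std::uint64_t startNs() const {
    if (timestamps_.empty()) return 0;
    std::uint64_t v = timestamps_.front().startNs;
    for (const Timestamp& ts : timestamps_) v = std::min(v, ts.startNs);
    return v;
  }

  // Latest end across all stored timestamps (0 if none were recorded).
  std::uint64_t endNs() const {
    if (timestamps_.empty()) return 0;
    std::uint64_t v = timestamps_.front().endNs;
    for (const Timestamp& ts : timestamps_) v = std::max(v, ts.endNs);
    return v;
  }

  std::uint64_t elapsedNs() const { return endNs() - startNs(); }

 private:
  std::vector<Timestamp> timestamps_;
};
```

With a vector, each partial submission can contribute its own timestamp, and the command's overall interval is derived when it is queried.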
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x.
Change the barrier code to match the most recent PAL optimizations.
Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
[ROCm/clr commit: 1239309c90]
The __amd_streamOpsWrite blit kernel in device-libs has only 3 args,
so get rid of the unused 4th arg (sizeBytes).
Change-Id: I81cc1107f8b424bf58558c93a2495a1b878aef91
[ROCm/clr commit: e643406caa]
The extension allows kernels to execute without a wait barrier and L1
invalidation.
Change-Id: I96c485204303f54a0240b93134f4560673e4bd17
[ROCm/clr commit: 13c6f56ca9]
Luxmark still uses the HSAIL path, and one subtest can benefit from the wave limit.
Change-Id: I16c94e09cd6e2afd6341cb76bf2e9ab7b7713214
[ROCm/clr commit: dec1158d04]
Add GPU_DEBUG_ENABLE to control ttmp behavior. If enabled,
the HW will collect more debug info at some perf cost.
Change-Id: Icee0686b903a7b1bd483710b9d611877cd43c6aa
[ROCm/clr commit: 7d661bc7df]
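A minimal sketch of how a boolean debug switch such as GPU_DEBUG_ENABLE could be read from the environment. This is an assumption for illustration only: the helpers `parseFlag` and `envFlagEnabled` are hypothetical, and the runtime's real flag mechanism may differ.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: interprets a raw environment value as a boolean.
// A missing or empty value yields the default; "0" means disabled;
// any other value means enabled.
inline bool parseFlag(const char* value, bool defaultValue) {
  if (value == nullptr || *value == '\0') return defaultValue;
  return std::string(value) != "0";
}

// Hypothetical helper: looks the flag up in the process environment.
inline bool envFlagEnabled(const char* name, bool defaultValue = false) {
  return parseFlag(std::getenv(name), defaultValue);
}
```

A caller would query `envFlagEnabled("GPU_DEBUG_ENABLE")` once at device init and pay the extra collection cost only when it returns true.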
An extra CPU read back is performed before every submission to make sure
previous writes over PCIe reached the GPU. The HDP flush is done by the CP.
Change-Id: I402d28ca26c8cee4a3920feb3599af8c285d0889
[ROCm/clr commit: cfc07c88ee]
- Add the new fillBuffer kernel, which allows launching a limited
number of workgroups for memory fill operations
- Switch fill memory to 16-byte writes by default
- Allow limiting the workgroups with DEBUG_CLR_LIMIT_BLIT_WG
Change-Id: Ibad1822f2d42b2fc71bcfc1917c31409c0623e8e
[ROCm/clr commit: f1dc81f427]
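The fill change above can be illustrated with a CPU-side sketch, under stated assumptions: `Vec16` and `fillLimited` are hypothetical names, and the strided loop emulates how a capped number of workgroups could still cover the whole buffer with 16-byte writes. It is not the actual fillBuffer kernel from device-libs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 16-byte element, standing in for a 128-bit wide store.
struct Vec16 {
  std::uint32_t x, y, z, w;
};

// CPU emulation of a fill whose launch size is capped rather than derived
// from the buffer size. Each emulated work-item walks the buffer with a
// stride equal to the total number of launched work-items.
// Returns the number of workgroups that would be launched.
inline std::size_t fillLimited(std::vector<Vec16>& dst, const Vec16& pattern,
                               std::size_t workgroupSize,
                               std::size_t maxWorkgroups) {
  const std::size_t count = dst.size();
  if (count == 0 || workgroupSize == 0 || maxWorkgroups == 0) return 0;
  const std::size_t needed = (count + workgroupSize - 1) / workgroupSize;
  const std::size_t groups = std::min(needed, maxWorkgroups);
  const std::size_t stride = groups * workgroupSize;
  for (std::size_t gid = 0; gid < stride; ++gid) {       // one pass per work-item
    for (std::size_t i = gid; i < count; i += stride) {  // strided walk
      dst[i] = pattern;
    }
  }
  return groups;
}
```

Because every work-item loops until it passes the end of the buffer, the fill stays correct for any cap, and the cap only trades parallelism for a smaller launch.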
Add support for HIP_FORCE_DEV_KERNARG under PAL.
Fix persistent memory detection for a resource view.
Change-Id: Ifb7db2db14e0c2205a9661cfa53887ec61ab26a4
[ROCm/clr commit: 5f297d75d9]
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- An explicit TS object is now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
[ROCm/clr commit: 40f41f4d0b]
Restore PAL platform destruction.
Update CmdAllocatorCreateInfo::AllocInfo for the new interface.
Change-Id: Iea418eed7ee26166039a4a9cc1999438856e9097
[ROCm/clr commit: bd00826446]
This reverts commit cab71e6e00.
Implement the right way to make ExternalSemaphores be signalled
only after prior work on the stream has finished.
Change-Id: I9d5974e05d5f229170b928db4566c14e40e3cbaa
[ROCm/clr commit: d433df4761]
- Program a unique AQL index for the debugger. The logic manages an AQL array of packets per HW queue.
- Provide debug state to PAL
Change-Id: I38fa1f5435fa711fd1d44dc391f2e61eb2a25efa
[ROCm/clr commit: d97cc0abbd]
The change enables VM support in graphs on Windows. That avoids
caching all allocations, at the cost of map/unmap
overhead during memory create/destroy.
Change-Id: I792be00fba099e5e5d3cd44a963e1dfd6976a86d
[ROCm/clr commit: 04b696abee]
- The implementation in mempool graphs requires refcounting the VA object.
That requires release() to update the map only on the actual destruction.
- Add GPU event tracking for paging operations. Otherwise, the runtime
may not always flush the IB.
Change-Id: Idf99ffb894321a38e04b490116a7ca435635918d
[ROCm/clr commit: 7ef2da5aba]
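The refcounting rule in the entry above (release() touches the map only on actual destruction) can be sketched as follows. The `VaRegistry` class and its methods are hypothetical and serve only to illustrate the pattern; they are not ROCclr's types.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>

// Hypothetical registry of virtual-address (VA) ranges.
class VaRegistry {
 public:
  // Adds a reference, inserting the range into the map on first use.
  void retain(std::uintptr_t va, std::size_t size) {
    auto it = map_.find(va);
    if (it == map_.end()) {
      map_.emplace(va, Entry{size, 1});
    } else {
      ++it->second.refs;
    }
  }

  // Drops one reference. The map entry is erased only when the last
  // reference goes away. Returns true if the object was actually destroyed.
  bool release(std::uintptr_t va) {
    auto it = map_.find(va);
    if (it == map_.end()) return false;
    if (--it->second.refs > 0) return false;
    map_.erase(it);
    return true;
  }

  bool contains(std::uintptr_t va) const { return map_.count(va) != 0; }

 private:
  struct Entry {
    std::size_t size;
    unsigned refs;
  };
  std::map<std::uintptr_t, Entry> map_;
};
```

An early release() from one holder therefore leaves the mapping visible to the remaining holders until the final reference is dropped.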
This reverts commit dfa7790030.
Reason for revert: Deferred to a future release.
Change-Id: Ia66c37f0ab9734dee73c930d10d7469d5fd57254
[ROCm/clr commit: 5dc104b3ea]
Windows kills threads on exit without any notification. However,
runtime can still destroy VirtualGPU object from the host thread with
HostQueue destruction.
This change also forces RGP trace transfer on the last capture without
any delays.
Change-Id: I768e87e99e1d23a021e63c12f36e450817743759
[ROCm/clr commit: ad33a021cb]