- Leverage a managed buffer that uses chunks for the fill pattern. Use a
different chunk for the next fill to avoid a wait.
Change-Id: I254483c867e112f66564ffd8f55e0a605d8896c9
[ROCm/clr commit: 175ad024d3]
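The chunking idea above can be sketched as a small ring of pattern slots. This is an illustration only, not the code from the commit: `PatternChunkRing`, its methods and the chunk sizes are hypothetical names, and the storage is plain host memory here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical ring of fixed-size chunks that hold fill patterns.
class PatternChunkRing {
 public:
  PatternChunkRing(size_t chunkSize, size_t chunkCount)
      : chunkSize_(chunkSize), chunkCount_(chunkCount), storage_(chunkSize * chunkCount) {
    assert(chunkSize > 0 && chunkCount > 0);
  }

  // Copies the pattern into the next chunk and returns its byte offset.
  // With more than one chunk, consecutive calls use different chunks, so a
  // fill that is still in flight keeps reading its own copy of the pattern.
  // A real implementation would still have to wait when the ring wraps
  // around onto a chunk that is in use; that part is omitted here.
  size_t acquire(const void* pattern, size_t size) {
    assert(size <= chunkSize_);
    const size_t offset = next_ * chunkSize_;
    std::memcpy(storage_.data() + offset, pattern, size);
    next_ = (next_ + 1) % chunkCount_;
    return offset;
  }

  // Read-only view of the bytes stored at a previously returned offset.
  const unsigned char* at(size_t offset) const { return storage_.data() + offset; }

 private:
  size_t chunkSize_;
  size_t chunkCount_;
  std::vector<unsigned char> storage_;
  size_t next_ = 0;
};
```

Two back-to-back calls to `acquire()` return different offsets, so the second fill does not overwrite the pattern the first one is still using.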
Add a threshold for ROCR/SDMA P2P transfers. The ROCR copy path
requires extra barriers in compute for synchronization, which hurts
performance for tiny transfers.
Reduce active wait time to 10us. TensorFlow uses an extra thread
per GPU with constant hipEventQuery() calls. Longer active waits
in ROCr affect CPU performance.
Change-Id: I9020358438615fa2d4617f862f00a562f0a588e7
[ROCm/clr commit: 008133cf41]
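A bounded active wait of the kind described above might look like the following sketch. It is not the ROCr or clr wait code: `waitForCompletion`, the atomic flag standing in for a signal, and the 50us sleep interval are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical wait helper: spin for at most `activeWait`, then stop burning
// CPU and fall back to sleeping between polls.
// Returns true if completion was observed during the active (spinning) phase.
inline bool waitForCompletion(const std::atomic<bool>& done,
                              std::chrono::microseconds activeWait) {
  assert(activeWait.count() >= 0);
  const auto deadline = std::chrono::steady_clock::now() + activeWait;
  // Active phase: always checks the flag at least once.
  do {
    if (done.load(std::memory_order_acquire)) return true;
  } while (std::chrono::steady_clock::now() < deadline);
  // Passive phase: the 50us poll interval is an arbitrary illustrative value.
  while (!done.load(std::memory_order_acquire)) {
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
  return false;
}
```

Keeping the spin window short matters most for callers that poll from many threads, because every spinning thread occupies a CPU core for the whole window.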
Taking a reference to the first element can trigger an assert in a
_GLIBCXX_ASSERTIONS build.
Change-Id: I59c63c052831307edfe5dcc6384798a43e9596dd
[ROCm/clr commit: 6f2e7c3199]
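As an illustration of this class of problem, assuming the container is a std::vector that may be empty (the commit does not say which container is involved), a pointer can be obtained without indexing element 0. `firstElementPtr` is a hypothetical helper.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: pointer to the first element, or nullptr when empty.
template <typename T>
const T* firstElementPtr(const std::vector<T>& v) {
  // &v[0] or v.front() on an empty vector is undefined behavior, and libstdc++
  // aborts on it when built with -D_GLIBCXX_ASSERTIONS. data() is always
  // valid to call, and the empty case is handled explicitly.
  return v.empty() ? nullptr : v.data();
}

// Usage sketch.
inline void checkFirstElementPtr() {
  std::vector<int> empty;
  std::vector<int> one{7};
  assert(firstElementPtr(empty) == nullptr);
  assert(*firstElementPtr(one) == 7);
}
```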
If we don't create the __amd_rocclr_gwsInit kernel, we still want
to create the rest of the image related blit kernels.
Change-Id: I8bc4645f9f9116eeecbb8b22e981ac4d520f3121
[ROCm/clr commit: 55a0cf0b0c]
Implementation to use a blit kernel to perform
a hipStreamWait/write instead of an AQL packet.
Change-Id: I462671ed5cec37144dfe97ff66439249196117c1
[ROCm/clr commit: cbb8d82bdb]
For the fillBuffer shader, if there are two 32-bit writes to an MMIO
register, a write can get dropped. It has to be a single 64-bit write.
Add an optimization to fillBuffer to issue 64-bit and 16-bit writes.
Change-Id: I3aa78e027898f8ae01e9c8f09004615673720c2b
[ROCm/clr commit: 21ba34d0fe]
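One way to obtain a single 64-bit store is to replicate the fill pattern into one 64-bit word on the host. The sketch below assumes that approach and is not the shader code from the commit; `expandPatternTo64` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Replicates a 1, 2, 4 or 8 byte fill pattern into one 64-bit word so that a
// fill can store 8 bytes with a single write instead of two 32-bit halves.
inline uint64_t expandPatternTo64(const void* pattern, size_t patternSize) {
  assert(patternSize == 1 || patternSize == 2 || patternSize == 4 || patternSize == 8);
  unsigned char bytes[sizeof(uint64_t)];
  // Lay the pattern bytes end to end until all 8 bytes are covered.
  for (size_t i = 0; i < sizeof(bytes); i += patternSize) {
    std::memcpy(bytes + i, pattern, patternSize);
  }
  uint64_t word;
  std::memcpy(&word, bytes, sizeof(word));
  return word;
}
```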
ROC_BARRIER_SYNC will not work with direct dispatch.
Remove it and clean up.
Change-Id: I81368b2e65039477bd0343bb92708dab48867db6
[ROCm/clr commit: aa38af8c96]
- The logic will trace compute and SDMA read/write operations and
apply signals when necessary
- ROC_CPU_WAIT_FOR_SIGNAL, ROC_SYSTEM_SCOPE_SIGNAL
and ROC_SKIP_COPY_SYNC were added to control the tracking
Change-Id: I9e8e6174c63bf7784f7ab00964e2918c8667d364
[ROCm/clr commit: dbc7abaecf]
- If ROCR fails the call for some reason, the signal becomes
invalid and a wait on it can hang. The logic resets the
active signal in such cases
Change-Id: Ia131420200f1bbd7c9a162b8f1b06db8cecf41c6
[ROCm/clr commit: ce2e5eba6b]
- There is a performance regression with a HW wait for an HSA signal
on a ROCr async operation. For now move the logic back to a CPU wait.
- Fix a profiling issue with multiple HSA signals per single timestamp
object. Some copies require multiple ROCR calls and if profiling is
required, then the execution time is derived from all used signals.
Change-Id: Id003e4abb8c2de378eedc152a7e389500fc6f4ce
[ROCm/clr commit: 5a8946190a]
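The commit does not spell out how the time is derived from several signals. A plausible reading, assumed here, is to take the earliest start and the latest end across all of them. `SignalTime` and `combinedDuration` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical start/end timestamps (in ticks) reported for one signal.
struct SignalTime {
  uint64_t start;
  uint64_t end;
};

// Assumed policy: the operation spans from the earliest start to the latest
// end of all signals that were used to carry it out.
inline uint64_t combinedDuration(const std::vector<SignalTime>& signals) {
  assert(!signals.empty());
  uint64_t start = signals.front().start;
  uint64_t end = signals.front().end;
  for (const SignalTime& s : signals) {
    start = std::min(start, s.start);
    end = std::max(end, s.end);
  }
  return end - start;
}
```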
- Correct GSL path to report targets using the TargetID syntax.
- Correct GSL path to check compatibility of code objects when
loading.
- Add the concept of a device ISA and create a registry used by ROCm,
PAL and GSL.
- Support XNACK and SRAMECC target features consistently for PAL and ROCm.
- Correct logic for NullDevices and asserts to avoid memory corruption.
- Allow all NullDevices to be created for HIP.
- Numerous other code improvements.
Change-Id: I40abf3d2b22249c1492d1af5919665f8184f4e0e
[ROCm/clr commit: c7e8d91e14]
Implement the global class for signals tracking per device queue.
Switch to the new tracking mechanism.
Change-Id: I3c4dda04b34e6d18d6a95510d84102909633b415
[ROCm/clr commit: 8698aeef0d]
Make sure the comments in the code match the actual behavior.
HDP read has an internal HDP read cache and doesn't use L2.
Change-Id: I667a4643b0e0d6529008f5e1a0a3269456c55b4e
[ROCm/clr commit: d524514f6a]
CPU read updates L2 with the latest values and requires
invalidation after, because SDMA doesn't use L2 and data can become
out of sync.
Change-Id: I98d1c91ca78a103fa5409e638f97485d62d5b11e
[ROCm/clr commit: 18a821acde]
OCL can't distinguish different copy types, but ROC profiler
expects SDMA transfer visibility. Add extra code to detect
a transfer with the host memory and substitute the OCL command.
Change-Id: I5290acd0e10bc082e00c1d4ae1474a075de7f165
[ROCm/clr commit: bd340d8cbf]
The change reuses HSA signals for dispatches as a wait signal.
Skipping the barrier requires disabling the L2 cache for sysmem
allocations and extra tracking of HDP access with the large BAR.
ROC_BARRIER_SYNC=0 activates the new logic. Barrier sync is
still used by default.
ROC_ACTIVE_WAIT=1 enables unconditional active wait in ROCr.
The change also consolidates the ROCr wait logic under a single function.
Change-Id: I6bd1be30aa88258da1b1f9de319ef5a45852afd8
[ROCm/clr commit: d9397590de]
When HIP_ENABLE_DEFERRED_LOADING=0, many global variables are
referenced before they have been initialized. The patch
uses constexpr to initialize global constant variables at compile
time.
Change-Id: I9d538b7abc6a0ce700ec3332b97fc144db5fc1ef
[ROCm/clr commit: fdef6f722f]
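A minimal sketch of why constexpr helps, using made-up constants rather than the variables touched by the patch (`kKi`, `kMi`, `Limits` and `kLimits` are hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical constants; names and values are illustrative only.
constexpr size_t kKi = 1024;
constexpr size_t kMi = kKi * kKi;

// static_assert shows the value is known to the compiler, so no dynamic
// initializer runs at startup and initialization order cannot matter.
static_assert(kMi == 1048576, "kMi is a compile-time constant");

// A global object can also be constant initialized when its constructor is
// constexpr and its arguments are constant expressions.
struct Limits {
  size_t maxBytes;
  constexpr explicit Limits(size_t mi) : maxBytes(mi * kMi) {}
};
constexpr Limits kLimits{4};
static_assert(kLimits.maxBytes == 4 * 1048576, "kLimits is constant initialized");

// Runtime code that runs early, for example from another static
// initializer, sees the final values.
inline void checkLimits() { assert(kLimits.maxBytes == 4 * kMi); }
```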
Optimization for the fence release removed a sync for mem fill.
Add simple const buffer management for the fill pattern to avoid
pattern overwriting with the async fills.
Change-Id: I63773ac09ceec31d5396d24570e4647ff096326b
[ROCm/clr commit: 2ce6bbebc4]
SWDEV-234947
SWDEV-236298
Instead of forcing a barrier packet, just inject system scope on the next packet.
Change-Id: If9bcee23e08dfe5db731235e2fcb30582cbd4c1c
[ROCm/clr commit: 6a5af4056e]
Apply the optimization change to OpenCL too.
Clean up some unnecessary checks.
Change-Id: I840261fe35baeeadeba7388e86779d482f509aad
[ROCm/clr commit: 6c5a42b33c]
This workaround avoids the performance penalty of the SDMA engine
taking a while to clock up from a lower DPM state. Add env var
GPU_FORCE_BLIT_COPY_SIZE (in KB; 1024 by default for HIP). Forcing
the src and dst agents to be amdgpu makes ROCr take the blit copy path
for what otherwise would have been an SDMA copy.
Change-Id: I222f687155f86000d17d66d25182e490b6710463
[ROCm/clr commit: 5f64e6e7ad]
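The size check implied above could be sketched as follows. Only the variable name GPU_FORCE_BLIT_COPY_SIZE, its KB unit and the HIP default of 1024 come from the commit message. The helper names, reading the value with std::getenv, and the inclusive comparison are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Default for HIP, in KB, as stated in the commit message.
constexpr size_t kDefaultForceBlitCopySizeKB = 1024;

// Hypothetical reader: returns the threshold in KB, falling back to the
// default when the variable is unset or does not start with a number.
inline size_t forceBlitCopySizeKB() {
  const char* value = std::getenv("GPU_FORCE_BLIT_COPY_SIZE");
  if (value == nullptr) return kDefaultForceBlitCopySizeKB;
  char* end = nullptr;
  const unsigned long long parsed = std::strtoull(value, &end, 10);
  return (end == value) ? kDefaultForceBlitCopySizeKB : static_cast<size_t>(parsed);
}

// Assumed policy: copies up to and including the threshold take the blit
// path; larger copies are left to SDMA.
inline bool shouldForceBlitCopy(size_t copyBytes, size_t thresholdKB) {
  return copyBytes <= thresholdKB * 1024;
}
```

With the default threshold, a 4 KB copy would take the blit path and a copy just over 1 MB would stay on SDMA.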