Scheduler in device queue requires relaunching itself. Make sure
scheduler uses exactly the same AQL packet as the host launch.
Change-Id: I4eb03c4c91bf2408a6d4607731f081a2e2c2c8ae
- Address an old bug in offset calculation that was causing out of bound
access.
- Improve logging
Change-Id: Iebdf34dddaa5e987cc72184a2152918adc6a96e0
- Check isAsync flag for small host copies on large bar as it synchronizes
- Use CopyEngine Preference hint if HMM is enabled.
Change-Id: I1ffc4b2604ed03cf5979cdc454178648c5ae5cba
- Leverage managed buffer that would use chunks for fill pattern. Use a
different chunk for the next fill to avoid wait
Change-Id: I254483c867e112f66564ffd8f55e0a605d8896c9
Add a threshold for ROCR/SDMA P2P transfers. ROCR copy path
requires extra barriers in compute for synchronization. That costs
extra performance with tiny transfers.
Reduce active wait time to 10us. Tensorflow uses extra thread
per GPU with constant hipEventQuery() calls. Longer active waits
in ROCr affect CPU performance.
Change-Id: I9020358438615fa2d4617f862f00a562f0a588e7
If we don't create the __amd_rocclr_gwsInit kernel, we still want
to create the rest of the image related blit kernels.
Change-Id: I8bc4645f9f9116eeecbb8b22e981ac4d520f3121
For the fillBuffer shader, if there are two 32bit writes to a MMIO
register, it can get dropped. It has to be a single 64bit write.
Add optimization to fillBuffer to write 64bit and 16bit writes.
Change-Id: I3aa78e027898f8ae01e9c8f09004615673720c2b
- The logic will trace compute, sdma read/write operations and
apply signals when necessary
- ROC_CPU_WAIT_FOR_SIGNAL, ROC_SYSTEM_SCOPE_SIGNAL
and ROC_SKIP_COPY_SYNC were added to control the tracking
Change-Id: I9e8e6174c63bf7784f7ab00964e2918c8667d364
- ROCR fails the call for some reason, then the signal will
become invalid and can hang on a wait. The logic will reset the
active signal in such cases
Change-Id: Ia131420200f1bbd7c9a162b8f1b06db8cecf41c6
- There is a performance regression with a HW wait for HSA signal
on ROCr async operation. For now move the logic back to CPU wait.
- Fix profiling issue with multiple HSA signal per single timestamp
object. Some copies require multiple ROCR calls and if profiling is
required, then the execution time is derived from all used signals.
Change-Id: Id003e4abb8c2de378eedc152a7e389500fc6f4ce
- Correct GSL path to report targets using the TargetID syntax.
- Correct GSL path to check compatibility of code objects when
loading.
- Add concept of an device isa and create a registery used by ROCm,
PAL and GSL.
- Support XNACK and SRAMECC target features consistently for PAL and ROCm.
- Correct logic for NullDevices and asserts to avoid memory coruption.
- Allow all NullDevices to be created for HIP.
- Numerous other code improvements.
Change-Id: I40abf3d2b22249c1492d1af5919665f8184f4e0e
Implement the global class for signals tracking per device queue.
Switch to the new tracking mechanism.
Change-Id: I3c4dda04b34e6d18d6a95510d84102909633b415
Make sure the comments in the code match the actual behavior.
HDP read has internal HDP read cache and doesn't use L2.
Change-Id: I667a4643b0e0d6529008f5e1a0a3269456c55b4e
CPU read updates L2 with the latest values and requires
invalidation after, because SDMA doesn't use L2 and data can become
out of sync.
Change-Id: I98d1c91ca78a103fa5409e638f97485d62d5b11e
OCL can't distinguish different copy types, but ROC profiler
expects SDMA transfer visibility. Add extra code to detect
a transfer with the host memory and substitute OCL command
Change-Id: I5290acd0e10bc082e00c1d4ae1474a075de7f165
The change reuses HSA signals for dispatches as a wait signal.
Skipping the barrier requires to disable L2 cache for sysmem
allocations and extra tracking for HDP access with the large bar.
ROC_BARRIER_SYNC=0 activates the new logic. Barrier sync is
still used by default.
ROC_ACTIVE_WAIT=1 enables unconditional active wait in ROCr.
The change also consolidated ROCr wait logic under single function.
Change-Id: I6bd1be30aa88258da1b1f9de319ef5a45852afd8
When HIP_ENABLE_DEFERRED_LOADING=0, many global variables will be
referenced but they are not initialized in that early time. The patch
will use constexpr to initialze global constant varables in compile
time.
Change-Id: I9d538b7abc6a0ce700ec3332b97fc144db5fc1ef