Use barrier packets for every profile marker that gets submitted
and use the completion signal to get GPU ts. This gives most accurate
dispatch time. Club cache flushes with profile marker if there is a
pending dispatch that needs cache flush. This optimization saves on
extra barrier and helps wall time
Change-Id: Ib62d6d7aabf4743827b561be6c9c5afa813203da
OCL can't distinguish different copy types, but ROC profiler
expects SDMA transfer visibility. Add extra code to detect
a transfer with the host memory and substitute OCL command
Change-Id: I5290acd0e10bc082e00c1d4ae1474a075de7f165
Enable this optimization when the barrier is disabled, since
reuse requires a signal wait.
Use the size of pending AQL signals as the size of signal pool.
Change-Id: I2754a0f8b67e19d2601c58945e10fdf0e8be1624
The change reuses HSA signals for dispatches as a wait signal.
Skipping the barrier requires to disable L2 cache for sysmem
allocations and extra tracking for HDP access with the large bar.
ROC_BARRIER_SYNC=0 activates the new logic. Barrier sync is
still used by default.
ROC_ACTIVE_WAIT=1 enables unconditional active wait in ROCr.
The change also consolidated ROCr wait logic under single function.
Change-Id: I6bd1be30aa88258da1b1f9de319ef5a45852afd8
SWDEV-249719 - root cause: queues with custom CU mask are not inserted
into queuePool_ (i.e., queue of reusable HSA queues) of ROC device class
causing a crash when creating hostcall buffers for printf
Change-Id: Ieee7005d9a5a30b3113394ce23ee65927126d0d6
root cause - cooperative queue is not inserted into queuePool_ (HSA queues) of ROC device calss causing a crash when creating hostcall buffers for printf
Change-Id: I3f9aceb4e5fe6a7c7a2a549a4bb0a3511fe02799
Fix a typo with the name define, when compilation wasn't enabled.
Force CPU prefetch if system was forced in runtime
Change-Id: Id4b578f9fa44a45426fdb5d8ecb1da803aa42313
The current implementation creates default reference in the stack and assigns it to class member cuMasks_, so whenever the content of the stack changes, cuMask_ would change.
Change-Id: Iefab63c335d504b83c4ae90bd34ae76c6afb8f3c
Optimizaiton to remove extra syncs uncovered a bug with the cache
coherency layer, there runtime could lose the track of mem address
if coherency layer performed a sync.
Change-Id: I25647cfa4a4be9cdbd8577ff076a740bbdac79c8
When HIP_ENABLE_DEFERRED_LOADING=0, many global variables will be
referenced but they are not initialized in that early time. The patch
will use constexpr to initialze global constant varables in compile
time.
Change-Id: I9d538b7abc6a0ce700ec3332b97fc144db5fc1ef
- Expose ROCclr interfaces for HIP usage
- ROCr interfaces aren't available in staging, thus control the
build with AMD_HMM_SUPPORT define
Change-Id: Iadc2bcc230e78d3b0dc22b235189c8cc80843446
Optimization for the fence release removed a sync for mem fill.
Add simple const buffer management forr the filled pattern to avoid
pattern overwriting with the async fills.
Change-Id: I63773ac09ceec31d5396d24570e4647ff096326b
SWDEV-234947
SWDEV-236298
Instead of forcing a barrier packet, just inject system scope on the next packet.
Change-Id: If9bcee23e08dfe5db731235e2fcb30582cbd4c1c
Remove queue limitation since we loop through HW queues now.
Add a DevLogError if we fail to create the hsa_queue. A ticket showed a regression there.
Change-Id: I4f58e405f88e75600a762f6d6352838c969cdb5e
[ROCm][TCT][HIP] cooperative stream test case is failing.
Make sure lockXfer() in the blit manager returns a valid value.
Port the latest PAL backend logic into the ROCr backend.
This change doesn't fix the issue, reported in the ticket.
Change-Id: I54101a824f49a2dcfbbf5414cb5b3af41745306d
- Once device assertion occurs, abort the host execution as well.
- TODO: This's the initial support. As we need to drain hostcall queue
to ensure device assertion message being flushed out, hostcall
listener needs an interface to explicitly drain its queue.
Change-Id: I8a04400aa7109bfd054ae5777c41a4abbf0db4a9
1. Enable pitch workaround
2. When we use copy image, we don't need to create the custom pitch image
3. wrtBackImageBuffer_ stores device memory object, not amd image object.
Tests:
conformance kernel read / write test pass with this code change.
Change-Id: I7dca3127adde6ac83e78dd270a2256ebed55c60d
[hipclang-vdi-rocm][perf]~45% to 50% of Performance drop on
rocBLAS_int8 test
- Enable AMD_OPT_FLUSH optimization by default to match HCC
- Disable CPU writes to GPU memory on boards with large bar,
because it requires HDP flush tracking.
- Enable L2 cache on kernel arguments, because L2 will be
invalidated on memory reuse .
Change-Id: I124cf250bdd4d19c523ce542c163813828f8fbdc