Although unpinned copies require synchronization
in HIP, the runtime can avoid syncs for H2D copies by
using a staging buffer
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
If the graph has kernels that do device-side allocation, the heap is allocated
during packet capture, because the heap pointer has to be added to the AQL packet,
and is initialized during graph launch.
Handle the race with a wait when 2 kernels with device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
This reverts commit 5447cf8872.
Reason for revert: SWDEV-455075, SWDEV-461507 - This change forces the
use of ROCr's copy path. Reintroducing the hostBlit copy path for
host-to-host copies.
Change-Id: Ic3c45b49e481c9dcdaa7611f61071778790b7e6c
If we are using the mask returned by getLastUsedSdmaEngine(), then we
need to apply the SDMA Read/Write mask to it before using it with the
HSA copy_on_engine API.
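
The masking step can be sketched as follows. This is a minimal sketch that
assumes the engine masks are plain bit fields with one bit per SDMA engine.
The names SdmaOp, kSdmaReadMask, kSdmaWriteMask and filterEngineMask are
hypothetical and the mask values are illustrative, not taken from the runtime.

```cpp
#include <cstdint>

// Hypothetical per-operation masks: each set bit marks an SDMA engine that is
// permitted for that kind of copy. The values are illustrative only.
constexpr uint32_t kSdmaReadMask = 0x6;   // engines 1 and 2
constexpr uint32_t kSdmaWriteMask = 0x9;  // engines 0 and 3

enum class SdmaOp { Read, Write };

// Restrict the mask of the last used engine to the engines permitted for the
// requested operation. A result of 0 means the cached engine is not usable
// for this copy and the caller has to choose an engine some other way.
inline uint32_t filterEngineMask(uint32_t lastUsedMask, SdmaOp op) {
  const uint32_t allowed = (op == SdmaOp::Read) ? kSdmaReadMask : kSdmaWriteMask;
  return lastUsedMask & allowed;
}
```

Only the filtered mask would then be handed to the copy_on_engine call.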
Change-Id: I6e5dc6c187eeb3c61ee159e9d2a0fa7b4737c06e
The copies can get blocked if the last SDMA engine is used by another
copy, and this can lead to a perf drop in some tests such as GROMACS.
Resetting the last engine by checking the engine status and fetching the
new mask after a few copies can avoid this.
Change-Id: I8fe8ea678db508d291c6242f3741fa9215e99921
The __amd_streamOpsWrite blit kernel in device-libs has only 3 args,
so get rid of the unused 4th arg (sizeBytes).
Change-Id: I81cc1107f8b424bf58558c93a2495a1b878aef91
The new copy kernel can limit the number of launched workgroups.
It can copy in chunks of 16 bytes or 4 bytes.
Workgroup size is increased to 512 or 1024.
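
A host-side sketch of how such a launch could be planned is shown below. It
assumes that the 16-byte path is chosen only when source, destination and
size are all 16-byte aligned, and that each work-item handles one chunk until
the workgroup cap is reached. The selection rule and the names CopyLaunch and
planCopy are hypothetical and not taken from the kernel itself.

```cpp
#include <cstddef>
#include <cstdint>

struct CopyLaunch {
  size_t chunkBytes;     // 16 or 4 bytes copied per step
  size_t workgroupSize;  // work-items per workgroup, e.g. 512 or 1024
  size_t numWorkgroups;  // never more than the requested cap
};

// Pick the chunk size and cap the number of workgroups. When the cap is hit,
// the kernel is assumed to loop so that every chunk is still copied.
inline CopyLaunch planCopy(uintptr_t src, uintptr_t dst, size_t sizeBytes,
                           size_t workgroupSize, size_t maxWorkgroups) {
  const bool aligned16 =
      ((src | dst | static_cast<uintptr_t>(sizeBytes)) % 16) == 0;
  const size_t chunk = aligned16 ? 16 : 4;
  const size_t chunks = (sizeBytes + chunk - 1) / chunk;
  size_t wgs = (chunks + workgroupSize - 1) / workgroupSize;
  if (wgs == 0) wgs = 1;                          // launch at least one workgroup
  if (wgs > maxWorkgroups) wgs = maxWorkgroups;   // apply the cap
  return {chunk, workgroupSize, wgs};
}
```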
Change-Id: Ic3fefa2d5bda6afebd1acc4d41ad310b138af6df
- Add the new fillBuffer kernel, which allows launching a limited
number of workgroups for memory fill operations
- Switch memory fill to 16-byte writes by default
- Allow limiting the workgroups with DEBUG_CLR_LIMIT_BLIT_WG
Change-Id: Ibad1822f2d42b2fc71bcfc1917c31409c0623e8e
Add hipMemcpyDeviceToDeviceNoCU to force a non-blit copy path. This
helps in cases where an app determines that the CUs may be busy and
copies with SDMA may be quicker.
Change-Id: I59b415dd8f6022c244e8d75f265464d5c635df1e
- Track the last SDMA engine per queue; this results in better scheduling
- Reset the last SDMA engine upon batch completion. That ensures we don't
get blocked if the same engine is used by another concurrent copy
Change-Id: Id53111980da7ee41d5c932fb44e4aab5b1e065a3
- Rename HIP_USE_SDMA_QUERY to DEBUG_CLR_USE_SDMA_QUERY as this is
supposed to be a temporary env var for debug purposes only.
Change-Id: If6ebd52ab87624375a3df24ceccdcc05c60a65af
The blit manager requires an image view to reduce the number
of copy kernels. Creation/destruction of a view in ROCr is
an expensive operation. Thus, the runtime can cache views for fast access.
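
The caching idea can be sketched as below. This is a minimal sketch that
assumes a view is identified by its parent image and a format. ImageView,
ViewCache and the creator callback are hypothetical stand-ins and do not
reflect the runtime's actual types or the ROCr API.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Stand-in for an expensive-to-create image view.
struct ImageView {
  uint64_t parent;
  uint32_t format;
};

class ViewCache {
 public:
  using Creator = std::function<std::unique_ptr<ImageView>(uint64_t, uint32_t)>;
  explicit ViewCache(Creator create) : create_(std::move(create)) {}

  // Return the cached view for (parent, format), creating it on first use.
  ImageView* get(uint64_t parent, uint32_t format) {
    const auto key = std::make_pair(parent, format);
    auto it = views_.find(key);
    if (it == views_.end()) {
      it = views_.emplace(key, create_(parent, format)).first;
    }
    return it->second.get();
  }

  // Drop every cached view of a parent image, e.g. when the image is freed.
  void release(uint64_t parent) {
    auto it = views_.lower_bound(std::make_pair(parent, uint32_t{0}));
    while (it != views_.end() && it->first.first == parent) {
      it = views_.erase(it);
    }
  }

  size_t size() const { return views_.size(); }

 private:
  Creator create_;
  std::map<std::pair<uint64_t, uint32_t>, std::unique_ptr<ImageView>> views_;
};
```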
Change-Id: Ia67d775b481cc8326d91215ca22d4a73c1dddb59
- Remove the large BAR memcpy path. Since we end up waiting for a barrier,
it defeats the true intent of the copy. Also, memcpy over PCIe/XGMI
introduces variability in perf for HPC apps like GROMACS
Change-Id: I3b5c9d9ce93333959c39023bf4f703e2ccb6e3af
- Use the regular copy API if we exhaust free SDMA engines instead of
falling back to compute copy. Falling back to compute affects performance
for numerous apps that are GPU bound
Change-Id: I75c767eff0b9f5ada324301c5c327fe2c23a9806
- Maintain a map of SDMA engine# to stream, allocated following a greedy
approach
- Anything past that will always query the SDMA engine status and go with
an SDMA or Blit copy path
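
The two points above can be sketched as follows. This is a minimal sketch
that assumes engines are numbered from 0 and streams are identified by an
integer id. SdmaEngineMap and its methods are hypothetical names, not the
runtime's implementation.

```cpp
#include <cstdint>
#include <map>
#include <optional>

class SdmaEngineMap {
 public:
  explicit SdmaEngineMap(uint32_t numEngines) : numEngines_(numEngines) {}

  // Greedy assignment: a stream keeps the engine it already owns, otherwise it
  // takes the lowest free engine. std::nullopt means all engines are taken and
  // the caller has to query engine status and pick an SDMA or blit path.
  std::optional<uint32_t> acquire(uint64_t stream) {
    for (const auto& [engine, owner] : engineToStream_) {
      if (owner == stream) return engine;
    }
    for (uint32_t e = 0; e < numEngines_; ++e) {
      if (engineToStream_.count(e) == 0) {
        engineToStream_[e] = stream;
        return e;
      }
    }
    return std::nullopt;
  }

  // Free the engine owned by a stream, e.g. when the stream is destroyed.
  void release(uint64_t stream) {
    for (auto it = engineToStream_.begin(); it != engineToStream_.end(); ++it) {
      if (it->second == stream) {
        engineToStream_.erase(it);
        return;
      }
    }
  }

 private:
  uint32_t numEngines_;
  std::map<uint32_t, uint64_t> engineToStream_;  // engine# -> owning stream
};
```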
Change-Id: Ibfaed7f951ab84d80cb0430596a4d11b5aec9202
The scheduler in the device queue needs to relaunch itself. Make sure
the scheduler uses exactly the same AQL packet as the host launch.
Change-Id: I4eb03c4c91bf2408a6d4607731f081a2e2c2c8ae
- Address an old bug in offset calculation that was causing out-of-bounds
access.
- Improve logging
Change-Id: Iebdf34dddaa5e987cc72184a2152918adc6a96e0
- Check the isAsync flag for small host copies on large BAR as it synchronizes
- Use the CopyEngine Preference hint if HMM is enabled.
Change-Id: I1ffc4b2604ed03cf5979cdc454178648c5ae5cba
- Leverage the managed buffer, which uses chunks for the fill pattern. Use a
different chunk for the next fill to avoid a wait
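
The chunk rotation can be sketched as below. This is a minimal sketch that
assumes a fixed number of equally sized chunks handed out round-robin.
PatternPool and stage are hypothetical names, and a host vector stands in for
the managed buffer.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

class PatternPool {
 public:
  PatternPool(size_t chunkBytes, size_t numChunks)
      : chunkBytes_(chunkBytes),
        numChunks_(numChunks),
        storage_(chunkBytes * numChunks) {}

  // Copy the fill pattern into the next chunk and return its address, so a new
  // fill does not have to wait for the previous fill to finish reading its
  // pattern. Returns nullptr if the pattern does not fit into one chunk.
  // A real runtime would still have to wait once the ring wraps around onto a
  // chunk that is in use; that part is omitted here.
  const unsigned char* stage(const void* pattern, size_t size) {
    if (numChunks_ == 0 || size > chunkBytes_) return nullptr;
    unsigned char* chunk = storage_.data() + next_ * chunkBytes_;
    std::memcpy(chunk, pattern, size);
    next_ = (next_ + 1) % numChunks_;
    return chunk;
  }

 private:
  size_t chunkBytes_;
  size_t numChunks_;
  std::vector<unsigned char> storage_;
  size_t next_ = 0;
};
```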
Change-Id: I254483c867e112f66564ffd8f55e0a605d8896c9