Invalidate only the address range that covers the newly copied
code-object. This avoids invalidating I$ for old code objects and thus
might increase I$ hit rate.
When compiling with -O0, some compilers generate a xchg instruction for
the __atomic_store(...) built-in. Using xchg on MMIO memory is
undefined-behavior and may be ignored on certain CPUs.
For native and DTIF backends, unify to use HSAKMT_CALL(...) to call
hsaKmt APIs.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
Using HSA_ENABLE_DTIF to control dtif/native thunk code path
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
When copying for inter devices, Currently only XGMI as exposed. Now
SDMA0/1 will be exposed as well for inter device copies especially that
they are one of the recommended engines.
Signed-off-by: Li Ma <li.ma@amd.com>
Adds a bool to the GPU agent and a public member method to
check if the GPU supports large BAR. This is needed so we can
check if large BAR is supported when a user tries to allocate
an AQL queue in device memory on a given GPU agent.
Also adds an exception to the AQL queue if device-side AQL queues
are requested and the GPU owner of the AQL doesn't support large
BAR. Otherwise, ROCr will currently allow device-side queues
that can cause faults when the user tries to touch their ring
buffers and the user will not know why the faults are occuring.
This relies on the fact that the KFD does not exposed any links
from the CPU to the GPU if large BAR is not enabled (though
links from the GPU to the CPU may still be exposed by the KFD).
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
This patch will adds recommended sdma supports with
limited XGMI SDMA engine. It will use one PCIe SDMA
to do gpu <-> gpu copies which will help improve all
to all copy performance.
Signed-off-by: Shane Xiao <shane.xiao@amd.com>
Use spaces consistently to format the trap handler code. This patch
does not introduce any change in the trap handler. Using `git show -w`
on this patch shows an empty diff.
Change-Id: Ic0244dd203347146ffde65460cd87ecbcc43732a
The trap handler should read the PERF_SNAPSHOT_DATA after all of
PERF_SNAPSHOT_DATA, PERF_SNAPSHOT_PC_LO and PERF_SNAPSHOT_PC_HI. This
patch fixes this.
Change-Id: I7f78e16d7a0d8bfebb34906b4dff73c2eaeb5658
Make sure to clear the HOST_TRAP and PERF_SNAPSHOT bits before returning
from the second level trap handler. As those bits are sticky, this
ensures future re-entry to the trap handler (for context save for
example) will not be confused with a sampling trap.
Change-Id: I05e5e58779a650b324ac6e30d574dc6931340f13
Signed-off-by: Lancelot SIX <lancelot.six@amd.com>
Adding a general stage for agents to release their resources on
shutdown. This avoids a circular dependency during shutdown because
we have to delete allocated resources before deleting memory pools, but
we also have to delete memory pools before destroying agents.
In PcSamplingCreateFromId, convert number of bytes into number of
dwords because DmaFill expects a count of 32-bit words, not raw bytes.
This prevents OOB writes on large sampling buffers.