On large BAR systems, for small-sized code-objects, we get performance
using direct memcpy due to latencies when doing the blit-copy.
[ROCm/ROCR-Runtime commit: da2607024b]
Invalidate only the address range that covers the newly copied
code-object. This avoids invalidating I$ for old code objects and thus
might increase I$ hit rate.
[ROCm/ROCR-Runtime commit: e969e01f54]
Create CP queue and SDMA queue should fail with invalid queue ring
buffer or ring buffer size.
Test unmap or free queue buffers should fail before queue is destroyed.
Use child process to test unmap CWSR buffer will evict queue.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Change-Id: I5dcd51d6b43445d19a986f8b0b82063e20348a5f
[ROCm/ROCR-Runtime commit: bd86fb1e63]
If unmap from GPU return failed, for example, unmap user queue buffer
while queue is active, we should not free obj->mapped_node_id_array,
otherwise, the following unmap user queue buffer after queue is
destroyed still return failed.
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Change-Id: I32aeb18871c2e971d01900d92916c54680f5c9fa
[ROCm/ROCR-Runtime commit: 3e6f51b715]
disable KFDLocalMemoryTest.Fragmentation and
KFDEventTest.MeasureInterruptConsumption as
part of the KFD test suite improvement feature
Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>
[ROCm/ROCR-Runtime commit: f853dda9ba]
When compiling with -O0, some compilers generate a xchg instruction for
the __atomic_store(...) built-in. Using xchg on MMIO memory is
undefined-behavior and may be ignored on certain CPUs.
[ROCm/ROCR-Runtime commit: f011a9506d]
For native and DTIF backends, unify to use HSAKMT_CALL(...) to call
hsaKmt APIs.
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
[ROCm/ROCR-Runtime commit: 7ba77fb193]
Using HSA_ENABLE_DTIF to control dtif/native thunk code path
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
[ROCm/ROCR-Runtime commit: 166b0fa45a]
When copying for inter devices, Currently only XGMI as exposed. Now
SDMA0/1 will be exposed as well for inter device copies especially that
they are one of the recommended engines.
Signed-off-by: Li Ma <li.ma@amd.com>
[ROCm/ROCR-Runtime commit: e38dd98914]
Set the rec_sdma_eng_override_ for other gpus, or DmaCopyOnEngine
will use sdma for D<->D copy, which will trigger invalid argument.
[ROCm/ROCR-Runtime commit: 82a88f2e2b]
* Update createMCObjectStreamer() to use new LLVM API
Obsolete interfaces were removed via llvm-project's
f2ff298867d7733122e32eead5a8c524b09dfdb1
* Fix typo: LLVM_VERSION -> LLVM_VERSION_MAJOR
* Fix typo
[ROCm/ROCR-Runtime commit: ac1e6d59c2]
Random driver deadlock on svm_range_evict_svm_bo_worker() is obeserved on
NPS2/DPX mode. It's seen with xnack off and happens more often on the
partition with less VRAM because of TMR.
Temporarily skip SVM Evict tests on Family AV when xnack is disabled.
Signed-off-by: Amber Lin <Amber.Lin@amd.com>
[ROCm/ROCR-Runtime commit: 5e28208cec]
Adds a bool to the GPU agent and a public member method to
check if the GPU supports large BAR. This is needed so we can
check if large BAR is supported when a user tries to allocate
an AQL queue in device memory on a given GPU agent.
Also adds an exception to the AQL queue if device-side AQL queues
are requested and the GPU owner of the AQL doesn't support large
BAR. Otherwise, ROCr will currently allow device-side queues
that can cause faults when the user tries to touch their ring
buffers and the user will not know why the faults are occuring.
This relies on the fact that the KFD does not exposed any links
from the CPU to the GPU if large BAR is not enabled (though
links from the GPU to the CPU may still be exposed by the KFD).
[ROCm/ROCR-Runtime commit: f2c482d923]
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
[ROCm/ROCR-Runtime commit: 6e3c375bf1]
- Add a new AMD extension API to return preferred SDMA engine mask.
This can use used in conjunction with copy_on_engine API to get
optimal bandwidth.
[ROCm/ROCR-Runtime commit: 57c0c643ce]
This patch will adds recommended sdma supports with
limited XGMI SDMA engine. It will use one PCIe SDMA
to do gpu <-> gpu copies which will help improve all
to all copy performance.
Signed-off-by: Shane Xiao <shane.xiao@amd.com>
[ROCm/ROCR-Runtime commit: 6a63170b38]
The max queues per process is 1024 in KFD,
KFDQMTest.OverSubscribeCpQueues fails with multi-gpu mode
on more than 15 gpus, because 65x16=1040 exceeds 1024, so
changing MAX_CP_QUEUES to adapt it will fix the issue.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
[ROCm/ROCR-Runtime commit: df6048429c]
The parent process can only be ptraced by 1 process
once, to avoid the error we have to add mutex to
synchronize the ptrace call.
Signed-off-by: Eric Huang <jinhuieric.huang@amd.com>
[ROCm/ROCR-Runtime commit: d3265234e9]
Use spaces consistently to format the trap handler code. This patch
does not introduce any change in the trap handler. Using `git show -w`
on this patch shows an empty diff.
Change-Id: Ic0244dd203347146ffde65460cd87ecbcc43732a
[ROCm/ROCR-Runtime commit: e0359e5d35]