KFDPerformanceTest.P2PBandWidthTest[push, push] takes about 3 seconds
on 4 gfx906 GPUs, so the default g_TestTimeout of 2 seconds is not long
enough to wait for the sDMA queue rptr to be consumed. With the kfdtest
command line option --timeout=6000, the test finishes and the result is
reasonable, about twice that of P2PBandWidthTest[push, none]. Change the
P2PBandWidthTest wait timeout to 6 seconds.
Add a timeout argument to WaitOnValue, BaseQueue.Wait4PacketConsumption,
SDMAQueue.Wait4PacketConsumption and PM4Queue.Wait4PacketConsumption,
with g_TestTimeOut as the default value.
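A minimal sketch of the defaulted timeout argument, not the kfdtest
implementation: the name WaitOnValueSketch, the std::atomic parameter,
the millisecond unit and the 1 ms polling interval are all assumptions
made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the kfdtest global default timeout (assumed ms).
static unsigned int g_TestTimeOut = 2000;

// Poll `location` until it equals `expected` or `timeoutMs` elapses.
// Returns true if the value was observed, false on timeout.
// The default argument is evaluated at each call, so callers that omit it
// get whatever g_TestTimeOut holds at that moment.
bool WaitOnValueSketch(const std::atomic<unsigned int>& location,
                       unsigned int expected,
                       unsigned int timeoutMs = g_TestTimeOut) {
    const auto deadline = std::chrono::steady_clock::now() +
                          std::chrono::milliseconds(timeoutMs);
    while (location.load(std::memory_order_acquire) != expected) {
        if (std::chrono::steady_clock::now() >= deadline)
            return false;  // gave up waiting
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return true;
}
```

A slow test can then pass a longer value, for example 6000, at its own
call site while every other caller keeps the global default.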
Change-Id: I0aa04d644339feaeea695e41647ae66568beab9e
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
[ROCm/ROCR-Runtime commit: b2e026fce3]
Adding it to DEBIAN/control won't work, since we use CMake to build
it. Add all required packages to the CMakeLists file instead.
Change-Id: Iaf62f42e0f998d66038338fb2cf793d29c790205
[ROCm/ROCR-Runtime commit: 666f90440a]
This allows an sp3 library built with one gcc version to be
compatible with another gcc version.
Change-Id: If67714bd63376dc781c56ed025be335fe54b2ba5
Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
[ROCm/ROCR-Runtime commit: 81b8815e1a]
The RAS feature enable bit and error reporting are implemented on top
of the existing topology and event mechanisms.
v2: change library interface.
Change-Id: I75807c080b5b26e8115240b05b3d7016cb05a31a
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 8ee93b3187]
These tests all make use of an SDMAQueue in one way or another, so add
them to the SDMA_BLACKLIST to be 100% certain
Change-Id: Ic29e073c2f46249f3e5918145b13d276aec7bb33
[ROCm/ROCR-Runtime commit: 54807526b9]
This is intermittently causing VM faults and excessive evictions, which
causes the rest of the tests to fail. Take it out for now until someone
can investigate
Change-Id: I9c43890bc9f03a4a31efbc18df0df5e40a232c58
[ROCm/ROCR-Runtime commit: 381dba3932]
The RAS feature enable bit and error reporting are implemented on top
of the existing topology and event mechanisms.
Change-Id: I9b018bba80cf4a6998e42a7bff64318c689b1d2a
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 1fbe010354]
On small-BAR multi-GPU systems, hsaKmtMemoryMapToGPU fails due to the
latest kernel P2P sanity check. Switch to hsaKmtMemoryMapToGPUNodes to
fix the failure.
Change-Id: Id8b6329d1243df0e908cc9a171b5c7f9156f4a8b
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
[ROCm/ROCR-Runtime commit: d8009b4fd3]
Map scratch memory to the GPU that was specified when the memory was
allocated.
Change-Id: I788f9ef0dccb63b894a75e75cac5f94a60d7ec48
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
[ROCm/ROCR-Runtime commit: 29b45b8c0a]
The BaseQueue class has a member function GetQueueType, so m_Type is
redundant. m_Type is only used in one function. Move it to a local
variable.
Change-Id: Ice144cf723178dd628cb49261c23d10605f9ee7d
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
[ROCm/ROCR-Runtime commit: 8d65e72045]
These new types are used to create an SDMA queue on a specific engine.
Change-Id: I91c3bcc14fef7404cf42b256a18651432e171091
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
[ROCm/ROCR-Runtime commit: 5173e71810]
EPERM means "operation not permitted" and is returned when CGroup
access checks fail. EACCES means "permission denied" and is returned
when the device file permission bits or access control list don't
allow access.
EPERM can fail silently, since we assume the administrator disabled
a device on purpose in the CGroup. EACCES should produce an error
message and an info message to check the device file permissions.
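The distinction above could be sketched as follows. This is an
illustration only: OpenResult, ClassifyOpenError and the message wording
are hypothetical and are not taken from the thunk sources.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdio>

// Hypothetical result type for this sketch; not part of the thunk API.
enum class OpenResult { Ok, SkippedByCgroup, PermissionDenied, OtherError };

// Classify the errno value left by a failed open() on a device node.
OpenResult ClassifyOpenError(int err, const char* path) {
    switch (err) {
    case 0:
        return OpenResult::Ok;
    case EPERM:
        // CGroup access check failed: assume the device was disabled on
        // purpose, so report nothing.
        return OpenResult::SkippedByCgroup;
    case EACCES:
        // Permission bits or ACL deny access: tell the user what to check.
        std::fprintf(stderr, "Failed to open %s: permission denied\n", path);
        std::fprintf(stderr, "Check the permissions and ACL of %s\n", path);
        return OpenResult::PermissionDenied;
    default:
        return OpenResult::OtherError;
    }
}
```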
Change-Id: Iee4c5584c5fdc4e113c3d760dede6661097b4341
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
[ROCm/ROCR-Runtime commit: 5e4e19d47b]
This feature only supports dGPUs for now.
Change-Id: Ic766ec06892c955dd605ecc335a776335edc0df2
Signed-off-by: Gang Ba <gaba@amd.com>
[ROCm/ROCR-Runtime commit: c54c1dbdcb]
The device whitelist cgroup controller tracks and enforces open and
mknod restrictions on device files. A task should only work with the
/dev/dri/renderN devices that are whitelisted for its cgroup. It is not
an error condition if a certain node is not whitelisted.
Change-Id: I0b997423ccdc00aee98df5b6f04ed6794549604e
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: c1994e28f0]
Add the numa libs to the thunk specs for DEB/RPM, so we can remove the
manual installation requirement
Change-Id: I5aadcf581b64e9a20aee9c1e1204af4715d1e990
[ROCm/ROCR-Runtime commit: 10edccb912]
Move debug trap support capabilities to their own
structure to fix thunk spec vs header mismatch.
Change-Id: I6694601bfa36097502c8ab932e082d7a4645d5b2
Signed-off-by: Philip Cox <Philip.Cox@amd.com>
[ROCm/ROCR-Runtime commit: 105edd4bb4]
On gfx900+, the test sometimes times out due to a CP firmware bug.
Blacklist it until we address the root cause and have a fix.
Change-Id: Iff600a6f6dbd86c56e034f530484205520bced32
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: 7a13bb4d66]
We observe that this test fails on gfx900+. It looks like the SDMA
packets are sometimes not executed at all after we submit them.
Run it with a 2s timeout on gfx900.
[ RUN ] KFDQMTest.SdmaEventInterrupt
[----------] SDMACopyData FAIL! 1485262707170 VS 1485262747814
[----------] Event On Queue 1:0 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1859427148
[ ] 2: 680148
[ ] 3: 6370
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[----------] SDMACopyData FAIL! 1485367669958 VS 1485367750022
[----------] Event On Queue 2:1 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1881615148
[ ] 2: 673629
[ ] 3: 6074
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[----------] SDMACopyData FAIL! 1485427671250 VS 1485427751238
[----------] Event On Queue 2:1 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1881508777
[ ] 2: 741629
[ ] 3: 6074
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[ FAILED ] KFDQMTest.SdmaEventInterrupt (23675 ms)
Change-Id: I7c1b752537d89782570df20838bf976578614f75
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: ab4610cff7]
Create an extra event so that the event id under test is non-zero. That
way we can be sure the context id received in the kernel ISR is non-zero,
which differs from the default value 0 seen when no context id is set.
Change-Id: I7e261d1bbb783d5afd15558c7ac00493b1218cef
Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
[ROCm/ROCR-Runtime commit: 77bab8596f]
GPU resource management can disable some of the GPU nodes.
The kernel driver may not be aware of this.
Get the information for all nodes from the kernel driver, then filter it.
Change-Id: I4eeb126a5efce2192c35f5d2b72be1811e9ded32
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
[ROCm/ROCR-Runtime commit: 3144a84b9a]
Currently the FindDRMRenderNode function accesses sysfs directly to
find the render node. That doesn't work with the GPU management
changes. Change the code to call hsaKmtGetNodeProperties instead.
Change-Id: I3bb537a323bc1e8c49f38d8aabc60c13e268aecd
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
[ROCm/ROCR-Runtime commit: c3b47c0959]
The existing call sysconf(_SC_NPROCESSORS_ONLN) provides the number of
processors available to the scheduler. When a KFD process runs in a
container environment, only a subset (cpuset) of the processors is
available to the current process.
For getting CPU cache information use sched_getaffinity() to get the
number of processors available to the current process.
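A minimal Linux sketch of that approach. The function name
AvailableProcessorCount and the fallback to the online count are
assumptions for illustration; the thunk's actual code may differ.

```cpp
#include <cassert>
#include <sched.h>
#include <unistd.h>

// Number of CPUs the calling process is allowed to run on.
// Falls back to the online processor count if the affinity query fails
// (for example when the fixed-size cpu_set_t is too small for the system).
int AvailableProcessorCount() {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) == 0)  // pid 0 = calling thread
        return CPU_COUNT(&set);
    return static_cast<int>(sysconf(_SC_NPROCESSORS_ONLN));
}
```

Inside a container restricted by a cpuset, this count can be smaller
than the value reported by sysconf(_SC_NPROCESSORS_ONLN).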
Change-Id: Ieac02f1f61c17e24ac34ba502968c69d3bc631cb
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: fb79a0efe2]
Add some infrastructure:
Implement SdmaTimePacket, which records the global GPU timestamp.
Introduce the classes AsyncMPSQ and AsyncMPMQ.
AsyncMPSQ stands for async multiple packet single queue. It takes a set
of packets on creation and submits them to a GPU to run. AsyncMPMQ
stands for async multiple packet multiple queue. It manages a set of
AsyncMPSQ objects and uses a for loop to operate on each of them.
Implement sdma_multicopy helper functions.
Change-Id: I47e1d2ca9630113b2a1d85a0055f3f8ee629fb5f
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: f618b3f075]
For following test cases:
- KFDQMTest.QueueLatency
- KFDQMTest.BasicCuMaskingLinear
- KFDQMTest.BasicCuMaskingEven
- KFDMemoryTest.MMBandWidth
- KFDMemoryTest.MMapLarge
- KFDMemoryTest.MMBench
v2: xml element cannot start with a number, so change the key name of
MMBandWidth and MMBench accordingly
xml element cannot contain whitespaces, so trim whitespaces in "VRAM "
v3: introduce KFDLog-like way to use KFDRecord
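The two XML naming constraints from v2 could be enforced by a helper
like the one below. This is a hypothetical sketch: MakeXmlKey is not part
of kfdtest, where the keys were renamed directly, and it only covers the
two rules mentioned here rather than the full XML name grammar.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: derive an XML element name from a free-form key by
// dropping whitespace and prefixing '_' when the key starts with a digit.
std::string MakeXmlKey(const std::string& raw) {
    std::string key;
    for (unsigned char c : raw) {
        if (!std::isspace(c))
            key.push_back(static_cast<char>(c));
    }
    if (!key.empty() && std::isdigit(static_cast<unsigned char>(key[0])))
        key.insert(key.begin(), '_');  // '_' is a legal first character
    return key;
}
```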
Change-Id: Ifc3ed5657621252a7b39dccf1ef4f50a92593f77
Signed-off-by: Xiaojie Yuan <xiaojie.yuan@amd.com>
[ROCm/ROCR-Runtime commit: 247fa9f1e0]
This change comes from commit a505c9bb("kfdtest: Do not set GTEST_FLAG
throw_on_failure").
It was unintentionally reverted by commit b86f1456("kfdtest: Clean up
comments"), so add the change back.
Fix: b86f1456
Change-Id: Ia9e99c9ca17b99aab62b4db55017018ddae43dfb
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: a6287ba919]
The timestamp written by releaseMemory packet might still not be visible
when we fetch it.
To fix this bug, use event-based wait.
Change-Id: If2324eb3b3a632c711ee4dff4d03a93d5306c289
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: 07bd97a864]
Handle the case that svm.dgpu_aperture does not exist in vm_find_object.
Change-Id: Ic0983d4f321f1b6248514f2fa25162976e90bd75
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
[ROCm/ROCR-Runtime commit: be574169c1]
Use the NodeFrom returned by hsaKmtGetNodeIoLinkProperties() to check
its correctness.
Change-Id: I6ce436dc7c5d5b192bee21156292bd3eff77f916
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: 1fda429726]
Some nodes are unavailable based on the task's cgroup hierarchy. Handle
this situation by ignoring those nodes
Change-Id: I72f9e822d2ec8cf15732df95e427d5549a75b55d
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: 7876bb70a9]
With GPU resource management, some nodes are unavailable based on the
cgroup hierarchy of the task. The kernel exposes all iolinks through
sysfs. Skip the links which are not accessible.
Also, the iolinks specified by the kernel refer to sysfs Node IDs. Map
them to the relevant user Node IDs.
v2: NodeFrom mapped from sysfs Node to User Node
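The filtering and ID translation could look roughly like this. The
IoLink struct, FilterAndMapLinks and the map-based lookup are
hypothetical and only illustrate the idea; they are not the thunk's
data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical io-link record for this sketch; not the thunk's structure.
struct IoLink {
    uint32_t nodeFrom;
    uint32_t nodeTo;
};

// Translate links from sysfs node IDs to user node IDs. A link is dropped
// when either endpoint is missing from the map, i.e. not visible to the task.
std::vector<IoLink> FilterAndMapLinks(
        const std::vector<IoLink>& sysfsLinks,
        const std::unordered_map<uint32_t, uint32_t>& sysfsToUser) {
    std::vector<IoLink> out;
    for (const IoLink& link : sysfsLinks) {
        auto from = sysfsToUser.find(link.nodeFrom);
        auto to = sysfsToUser.find(link.nodeTo);
        if (from == sysfsToUser.end() || to == sysfsToUser.end())
            continue;  // endpoint hidden by the cgroup: skip the link
        out.push_back({from->second, to->second});
    }
    return out;
}
```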
Change-Id: I95312ee6ca51b89fe9e6ca2a9185c2ea1e94afc4
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: 866ef20054]