This is intermittently causing VM faults and excessive evictions, which
causes the rest of the tests to fail. Take it out for now until someone
can investigate
Change-Id: I9c43890bc9f03a4a31efbc18df0df5e40a232c58
[ROCm/ROCR-Runtime commit: 381dba3932]
RAS feature enabling bit and errors return are implemented in
existed topology and event mechanism.
Change-Id: I9b018bba80cf4a6998e42a7bff64318c689b1d2a
Signed-off-by: Eric Huang <JinhuiEric.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 1fbe010354]
On small bar multi-gpu system, hsaKmtMemoryMapToGPU will fail due to latest
kernel P2P sanity check. Swith to use hsaKmtMemoryMapToGPUNodes to fix
the failure
Change-Id: Id8b6329d1243df0e908cc9a171b5c7f9156f4a8b
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
[ROCm/ROCR-Runtime commit: d8009b4fd3]
Map scratch memory to the GPU that specified when allocate the memory
Change-Id: I788f9ef0dccb63b894a75e75cac5f94a60d7ec48
Signed-off-by: shaoyunl <shaoyun.liu@amd.com>
[ROCm/ROCR-Runtime commit: 29b45b8c0a]
HW has limited bits for wave scratch base address stride. Enforcement
prevents programs with larger than supported scratch allocations from
running and clobbering neighboring scratch space.
Change-Id: I574da888e9d1d5e290a9c0025ba13b5ef9f1e5c0
[ROCm/ROCR-Runtime commit: 8e4177382a]
- Process dynamic relocation even if there is
no symbol associated to it.
Change-Id: Iaefee682ee52f5acda8280e5764e6d5fd992774a
[ROCm/ROCR-Runtime commit: a447d79430]
BaseQueue class has a member function GetQueueType so m_Type
is duplicated. m_Type is only used in one function. Move it to
a local variable.
Change-Id: Ice144cf723178dd628cb49261c23d10605f9ee7d
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
[ROCm/ROCR-Runtime commit: 8d65e72045]
Those new types are used to create SDMA queue on specific engine
Change-Id: I91c3bcc14fef7404cf42b256a18651432e171091
Signed-off-by: Oak Zeng <Oak.Zeng@amd.com>
[ROCm/ROCR-Runtime commit: 5173e71810]
EPERM means "operation not permitted" and is returned when CGroup
access checks fail. EACCES means "permission denied" and is returned
when the device file permission bits or access control list don't
allow access.
EPERM can fail silently, since we assume the administrator disabled
a device on purpose in the CGroup. EACCESS should produce an error
message and an info message to check the device file permissions.
Change-Id: Iee4c5584c5fdc4e113c3d760dede6661097b4341
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
[ROCm/ROCR-Runtime commit: 5e4e19d47b]
Also rename blit_agent to region_gpu and add comments to clarify
its role in deprecated region API support rather than to do blits.
Change-Id: I80b1043db2e1c5d40a58fc801eef70a688ea9169
[ROCm/ROCR-Runtime commit: 936ecd1885]
During registration we must not call any function that depends on registered
data as the lists are not yet complete. This includes signal allocation since
allocating shared GPU mapped memory depends on the list of GPUs.
Change-Id: I94d59e847802c546c2a5a0d9f55fe5ac3fd1d878
[ROCm/ROCR-Runtime commit: dda9c17b45]
This feature only support dgpu for now.
Change-Id: Ic766ec06892c955dd605ecc335a776335edc0df2
Signed-off-by: Gang Ba <gaba@amd.com>
[ROCm/ROCR-Runtime commit: c54c1dbdcb]
Device whiltelist controller cgroup allows to track and enforce open and
mknod restrictions on device files. Tasks should works with
/dev/dri/renderN devices that are whitelisted for its cgroup. If a
certain node is not whitelisted it is not an error condition.
Change-Id: I0b997423ccdc00aee98df5b6f04ed6794549604e
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: c1994e28f0]
Add the numa libs to the thunk specs for DEB/RPM, so we can remove the
manual installation requirement
Change-Id: I5aadcf581b64e9a20aee9c1e1204af4715d1e990
[ROCm/ROCR-Runtime commit: 10edccb912]
Move debug trap support capabilities to their own
structure to fix thunk spec vs header mismatch.
Change-Id: I6694601bfa36097502c8ab932e082d7a4645d5b2
Signed-off-by: Philip Cox <Philip.Cox@amd.com>
[ROCm/ROCR-Runtime commit: 105edd4bb4]
Delete the runtime object when the last hsa_shut_down occurs.
Change-Id: I2005d52d06702eaef166714fd5e471cc277924db
[ROCm/ROCR-Runtime commit: 9ec37b5103]
Debug agent requires handles to internal queues for single step debugging.
Added tools only API hsa_amd_runtime_queue_create_register for reporting.
hsa_amd_runtime_queue_create_register sets a callback which is invoked
when internal queues are created.
Change-Id: Ia5190ae724fadba686c15f25b2cd085350eeff0e
[ROCm/ROCR-Runtime commit: 757502ccd6]
Required for debug agent requires copy API and trap handler to be initalized
prior to loading. Existing tools do not make use of internal queue or scratch
memory intercept which is what PostToolsInit allows.
PostToolsInit() will be removed in a following cleanup change.
Change-Id: If43377843808e3eff0defd9204910a67a852902f
[ROCm/ROCR-Runtime commit: 5975c465ad]
On gfx900+, the test sometimes timeout due to cp fw bug.
Blacklist it until we address the root cause and have a fix.
Change-Id: Iff600a6f6dbd86c56e034f530484205520bced32
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: 7a13bb4d66]
We observe this test fails on gfx900+. Looks like the sdma packets are not
executed at all after we submit sometimes.
Run it with timeout 2s on gfx900.
[ RUN ] KFDQMTest.SdmaEventInterrupt
[----------] SDMACopyData FAIL! 1485262707170 VS 1485262747814
[----------] Event On Queue 1:0 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1859427148
[ ] 2: 680148
[ ] 3: 6370
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[----------] SDMACopyData FAIL! 1485367669958 VS 1485367750022
[----------] Event On Queue 2:1 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1881615148
[ ] 2: 673629
[ ] 3: 6074
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[----------] SDMACopyData FAIL! 1485427671250 VS 1485427751238
[----------] Event On Queue 2:1 Timeout, try to resubmit packets!
[----------] The timeout event is signaled!
[ ] Time Consumption (ns)
[ ] 1: 1881508777
[ ] 2: 741629
[ ] 3: 6074
[ ] 4: 5481
/home/pp/code/compute/libhsakmt/tests/kfdtest/src/KFDQMTest.cpp:1670: Failure
Value of: (ret)
Actual: 31
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
[ FAILED ] KFDQMTest.SdmaEventInterrupt (23675 ms)
Change-Id: I7c1b752537d89782570df20838bf976578614f75
Signed-off-by: xinhui pan <xinhui.pan@amd.com>
[ROCm/ROCR-Runtime commit: ab4610cff7]
Create an extra event so that the event id to test is non zero. That
way we can be sure the context id received in kernel ISR is non zero, which
is different from the default value 0 when context id is not set at all.
Change-Id: I7e261d1bbb783d5afd15558c7ac00493b1218cef
Signed-off-by: Yong Zhao <Yong.Zhao@amd.com>
[ROCm/ROCR-Runtime commit: 77bab8596f]
Apertures now overlap with the change to 48bit addressing which
precludes using aperture checks to discover buffer ownership.
Switches to ptrinfo to decide which device a buffer owned by.
This corrects faults in the legacy hsa_memory_copy api.
Change-Id: I5c7ce0216e1cdc96f836fc6fec9c3defdf4b9d90
[ROCm/ROCR-Runtime commit: 1e0d690948]
On update, the removal will occur AFTER the new package is installed,
due to some stupidity with how yum/rpm does things. Only remove it if
we're doing a pure uninstall
Change-Id: I4982610828d8bc1f2d8691b1e4ee1718c89413cc
[ROCm/ROCR-Runtime commit: ed9baefd75]
GPU Resource management can disable some of the GPU nodes.
The Kernel driver could be not aware of this.
Get from Kernel driver information of all the nodes and then filter it.
Change-Id: I4eeb126a5efce2192c35f5d2b72be1811e9ded32
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
[ROCm/ROCR-Runtime commit: 3144a84b9a]
Currently the FindDRMRenderNode function will access the sysfs
directly to find the render node. It doesn't work with the
GPU management changes. Have changed code to call hsaKmtGetNodeProperties
instead.
Change-Id: I3bb537a323bc1e8c49f38d8aabc60c13e268aecd
Signed-off-by: Mike Li <Tianxinmike.Li@amd.com>
[ROCm/ROCR-Runtime commit: c3b47c0959]
The existing call sysconf (_SC_NPROCESSORS_ONLN) provides the number of
processors available to the scheduler. When a KFD process is run under a
container environment, only a subset (cpuset) of processors are
available to the current process.
For getting CPU cache information use sched_getaffinity() to get the
number of processors available to the current process.
Change-Id: Ieac02f1f61c17e24ac34ba502968c69d3bc631cb
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
[ROCm/ROCR-Runtime commit: fb79a0efe2]