IPC use cases with RVD set can't convey proper agent handles.
Runtime discovery is required to properly route the copy in this
case.
Change-Id: I4c97e132fb4b6ac1040de1cb17fe5a3e36d6be48
This is consistent with KFD and has significantly better latency.
KFD is taking this as the definition of the SystemClockCounter.
Change-Id: I4c1b3bc58c738206265c55ebefd41356c013bfe5
Eliminates the need for manually assembling the source of the
second level trap handler to produce the shader binary. Also
separated blit shaders' binary source and version one second
level trap handler binary sources into different header files.
Change-Id: If29a18ee06dc083ec880ea962f234c6b5cac806a
Host to device SDMA copies do not require an HDP cache flush when
connected by xGMI, since the data is copied over the data fabric and
not through HDP.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Reviewed-by: Sean Keely <sean.keely@amd.com>
Change-Id: I78d73a47edcc1a9c0ba59f33cf91485f13f1c45b
Declare the type of HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT
and add a missing break statement.
Change-Id: I86ce8a2e620438e046b60cee991ce1fbe07a3e88
On gfx10+ we need to issue a minimum count of active lanes or
groups before ADC moves on. Ensure that scratch allocations
attempt to reach this limit.
Occupancy throttling due to OOM condition may still drop below this
limit.
Change-Id: I0edf2e40fbe1a95e9a262564cebd2b6a82501a0b
__x86_64__ and __AMD64__ should be already defined by the compiler to
specify the compilation target and shouldn't be defined manually.
I also extended two x86_64 checks to include the Visual Studio macros,
since removing the manual definitions might otherwise break compilation
with that compiler.
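As a rough sketch of this kind of check (the macro SKETCH_TARGET_X86_64
and the helper IsX86_64Target are hypothetical names, not taken from the
runtime sources), a target test that accepts both the GCC/Clang and the
Visual Studio predefined macros might look like:

```cpp
// Hypothetical helper, for illustration only.
// GCC and Clang predefine __x86_64__ when targeting 64-bit x86;
// Visual Studio predefines _M_X64 instead. Neither is defined by hand.
#if defined(__x86_64__) || defined(_M_X64)
#define SKETCH_TARGET_X86_64 1
#else
#define SKETCH_TARGET_X86_64 0
#endif

// Returns true when the translation unit is compiled for 64-bit x86.
constexpr bool IsX86_64Target() { return SKETCH_TARGET_X86_64 != 0; }
```

The result depends only on what the compiler predefines for the target.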
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I600ff449af85bf7d83ecab167d97933922e2d917
Instead of installing to lib or include, use CMAKE_INSTALL_LIBDIR and
CMAKE_INSTALL_INCLUDEDIR to allow the builder to override if desired.
The default LIBDIR should be "lib" to avoid breaking ROCm packaging, but
using GNUInstallDirs would use lib64 on RHEL. By setting a default value
prior to including GNUInstallDirs, we can always use "lib" unless the
builder explicitly overrides it via "-DCMAKE_INSTALL_LIBDIR", which is
typical in most distro scripts.
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I135f21bcfeb02b6849f6e8ca403b39c029a02d5c
Image support does not compile on other architectures, since it relies on
the x86-only header "x86intrin.h".
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I120d15870e74e20bd618e6f5da8c05e28fb1203b
Each time the delay is grown we need to reset elapsed. We want to take
the most accurate sample from the set at a fixed delay.
Without this we will hang if there is ever an insufficiently accurate,
high unit clock read.
Change-Id: Ic65f364067789ac85a6572d67af2d77528e265bb
The loader must use internal interfaces to access page allocation
flags. Code pages should also ensure use of cached memory.
Also relocate i-cache flush after code page copy.
Change-Id: I86d36243b6eebb1d46b991b372a5236baaf941ab
VM faults should not report via the queue error handler.
The system event contains much more useful information.
Change-Id: I744d9b97b23334d7ed2c0f450111c1b8032567e3
Hive ID is used during copy path selection to locate an optimal
pool of SDMA engines. However, for CPU-GPU connections we always
want to use the host port facing engines, known generally as the
PCIe optimized engines. We want this selection even when the
connection is XGMI, hence dropping the hive ID for CPUs.
Change-Id: Iffe44174afecfc0bb3272b806fce549c930a49d9
Excessive scratch allocations can normally trigger occupancy
reduction. This breaks cooperative groups, so if occupancy
reduction is required on a cooperative dispatch, fail with OOM.
Change-Id: I64612a2e38bf1286f3b74c1c2a68ab0c85452771
With asymmetric harvest, hw does not issue groups equally to each SE;
occasionally hw will skip an SE so that the distribution reflects
each SE's CU count. Scratch resources must be allocated to reflect
this asymmetric distribution of groups.
Change-Id: I65e26206500483ea18e6e8796e65ecba5354b029
HW does not ignore low bits of the scratch wave count and will
stride beyond the end of the allocation if the wave count is
ever indivisible by SE count. Rather than returning the allocation
size for cached large scratch allocations, use the requested
scratch size in scratch setup. Scratch cache will retain the
cached allocation's size.
Change-Id: I0129ddc99a8940d01d8fbcd0b02d5061f31f456d
Include the upgrade operation check in the prerm and postun scripts
in the package.
Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
Change-Id: Ic766d8d68b5168e5f1b065d846ca2604d281e5be
discardBlock may be called multiple times on the same block.
We must not discard the block multiple times or we will corrupt
in-use memory accounting.
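A minimal sketch of the guard, assuming a per-block flag records whether
the block was already discarded. The Block and Pool types and the in_use
counter are hypothetical stand-ins, not the allocator's real structures:

```cpp
#include <cstddef>

// Hypothetical block record; the real allocator's layout will differ.
struct Block {
  std::size_t size = 0;
  bool discarded = false;
};

// Hypothetical pool that tracks the number of bytes currently in use.
class Pool {
 public:
  void allocate(Block& b, std::size_t size) {
    b.size = size;
    b.discarded = false;
    in_use_ += size;
  }

  // Safe to call repeatedly: only the first call adjusts the accounting.
  void discardBlock(Block& b) {
    if (b.discarded) return;
    b.discarded = true;
    in_use_ -= b.size;
  }

  std::size_t in_use() const { return in_use_; }

 private:
  std::size_t in_use_ = 0;
};
```

Without the early return, a second discardBlock call on the same block
would subtract its size again and understate the memory still in use.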
Change-Id: Ife9f3162785965a795dcf81887d4d447cc096e62
Minimum queue size was not enforced at the Agent level. Minimum
size should be one page to give uniformity across all ASICs.
Change-Id: I26394f79458d09fbceb79fc8aaf495e2c26a8ff3
On gfx90a only a reduced number of CUs must be used for cooperative
dispatches due to CWSR and launcher interactions with asymmetric
harvest. We must use one fewer CU per SE than the lowest count of
CUs on any SE.
Also adds env var HSA_COOP_CU_COUNT which enables the cooperative
CU count computation. Set to 1 to enable the new computation.
This is an opt-in feature that will become enabled by default (opt-out)
in a future release.
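The per-SE rule above can be sketched as follows. The function name and
the vector of per-SE CU counts are hypothetical, and multiplying the
per-SE limit by the SE count to get a device total is an assumption
rather than something stated here:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: cus_per_se holds the active CU count of each SE.
// Assumption: the device-wide cooperative count is the per-SE limit
// (lowest CU count on any SE, minus one) applied to every SE.
inline uint32_t CooperativeCuCount(const std::vector<uint32_t>& cus_per_se) {
  assert(!cus_per_se.empty());
  const uint32_t min_cus =
      *std::min_element(cus_per_se.begin(), cus_per_se.end());
  if (min_cus == 0) return 0;  // avoid unsigned wrap-around
  return (min_cus - 1) * static_cast<uint32_t>(cus_per_se.size());
}
```

For example, with SEs holding 14, 13, 14 and 14 CUs the lowest count is
13, so each SE contributes 12 CUs under this sketch.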
Change-Id: Ifbb75ced3bbc15876eef44922c6a4f6fde8c4c28
Corrections have been made in libhsakmt, and corresponding changes are required here as well.
Signed-off-by: Chen Gong <curry.gong@amd.com>
Change-Id: Ib697ce25278c2c5ac6ef0206930ec285f46c60d1
The start iterator becomes invalid after its element is erased from
std::map prefetch_map_. This was causing a segfault when the iterator was
incremented afterwards.
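A minimal sketch of the safe pattern, using a plain std::map<int, int>
and a hypothetical EraseRange helper rather than the real prefetch_map_
code. The loop continues from the iterator returned by erase instead of
incrementing the erased one:

```cpp
#include <cstddef>
#include <map>

// Hypothetical helper: erases every key in [first_key, last_key) and
// returns how many entries were removed.
inline std::size_t EraseRange(std::map<int, int>& m, int first_key,
                              int last_key) {
  std::size_t removed = 0;
  auto it = m.lower_bound(first_key);
  while (it != m.end() && it->first < last_key) {
    // erase invalidates `it`; it returns the iterator to the next
    // element, which is the only safe way to keep walking the map.
    it = m.erase(it);
    ++removed;
  }
  return removed;
}
```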
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Change-Id: I4b0b763d2cb4ee99c0b8571c2c526b834e74077a
Prior solution used a single global lock to protect the memory tracking structures.
This change protects the memory tracking structure with a shared mutex (rw lock) in
shared (r) mode for memory allocations and frees so that long duration processes,
calling to kfd, can be done in parallel. Operations which must modify the memory map
take the mutex in exclusive mode (w) and must not call to the thunk while holding
the mutex.
The fragment allocator now requires separate protection and is protected with a
mutex at the device level. Protecting at the device level, rather than pool,
allows retention of the current recursive design and allows calling Trim from
within Allocate. This could be made finer (pool level locks) but would
require backing out of Allocate entirely to call Trim. Trim and any retried
Allocation must be done in isolation (per device) or we may report OOM when
memory is actually available in some pool's fragment cache. So some device
level serialization is required in at least some paths.
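The shared/exclusive split for the tracking structure could look roughly
like the sketch below. AllocationTracker, SlowDriverCall and the map
layout are hypothetical and stand in for the runtime's real structures
and thunk calls. It assumes C++17 for std::shared_mutex:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical tracker; names are illustrative only.
class AllocationTracker {
 public:
  bool Allocate(std::uintptr_t addr, std::size_t size) {
    {
      // Shared (r) mode: long driver calls may overlap across threads.
      std::shared_lock<std::shared_mutex> shared(lock_);
      SlowDriverCall();
    }
    // Exclusive (w) mode: map update only, no driver call while held.
    std::unique_lock<std::shared_mutex> exclusive(lock_);
    return map_.emplace(addr, size).second;
  }

  bool Free(std::uintptr_t addr) {
    {
      std::shared_lock<std::shared_mutex> shared(lock_);
      SlowDriverCall();
    }
    std::unique_lock<std::shared_mutex> exclusive(lock_);
    return map_.erase(addr) == 1;
  }

  // Read-only lookup; returns 0 for unknown addresses.
  std::size_t SizeOf(std::uintptr_t addr) const {
    std::shared_lock<std::shared_mutex> shared(lock_);
    auto it = map_.find(addr);
    return it == map_.end() ? 0 : it->second;
  }

 private:
  static void SlowDriverCall() {}  // placeholder for a long kernel call
  mutable std::shared_mutex lock_;
  std::map<std::uintptr_t, std::size_t> map_;
};
```

The device level fragment allocator mutex is left out of this sketch.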
Change-Id: I7c1e94d6965ffcc602b12fefdd3a6e97b84b5e00
Comments call out the specific operation being selected since the
ternary nest is a bit hard to read.
Change-Id: If033dbaa6cba132e96196ad3fc6d5572042041f4
Argument must be checked for nullptr before being dereferenced and
filled with the default return value.
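As an illustration only, with a hypothetical Status enum and GetValue
function rather than the actual API touched by this change, the check
has to come before the default value is written:

```cpp
#include <cstdint>

// Hypothetical status codes for the sketch.
enum class Status { kSuccess, kInvalidArgument };

// Hypothetical query: validates the output pointer first, then fills it
// with the default return value.
inline Status GetValue(uint32_t* out) {
  if (out == nullptr) return Status::kInvalidArgument;
  *out = 0;  // default value, written only after the null check
  return Status::kSuccess;
}
```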
Change-Id: I9ff366f066a5e18c78129bf59cc3ba00fca3ef18