HW does not ignore low bits of the scratch wave count and will
stride beyond the end of the allocation if the wave count is
ever indivisible by SE count. Rather than returning the allocation
size for cached large scratch allocations, use the requested
scratch size in scratch setup. Scratch cache will retain the
cached allocation's size.
Change-Id: I0129ddc99a8940d01d8fbcd0b02d5061f31f456d
Include the upgrade operation check in the prerm and postun scripts
in the package.
Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
Change-Id: Ic766d8d68b5168e5f1b065d846ca2604d281e5be
discardBlock may be called multiple times on the same block.
We must not discard the block multiple times or we will corrupt
in-use memory accounting.
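
As a minimal sketch of the idempotent discard described above (the BlockTracker class and its members are hypothetical and not taken from the runtime), a repeated discard can be turned into a no-op by keying the accounting on the set of blocks that are still live:

```cpp
#include <cstddef>
#include <unordered_map>

// Hypothetical tracker used only for illustration.
class BlockTracker {
 public:
  // Records a block and adds its size to the in-use total (once per block).
  void Track(void* block, std::size_t bytes) {
    if (live_.emplace(block, bytes).second) in_use_ += bytes;
  }

  // Returns true only for the first discard of a tracked block.
  // Later calls on the same block leave the accounting untouched.
  bool Discard(void* block) {
    auto it = live_.find(block);
    if (it == live_.end()) return false;  // Already discarded or never tracked.
    in_use_ -= it->second;
    live_.erase(it);
    return true;
  }

  std::size_t in_use() const { return in_use_; }

 private:
  std::unordered_map<void*, std::size_t> live_;
  std::size_t in_use_ = 0;
};
```

With this shape, calling Discard twice on the same pointer subtracts the block size exactly once.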
Change-Id: Ife9f3162785965a795dcf81887d4d447cc096e62
Minimum queue size was not enforced at the Agent level. The minimum
size should be one page to give uniformity across all ASICs.
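
A rough sketch of a one-page minimum, assuming a 4 KiB page and that undersized requests are rounded up rather than rejected (both are assumptions; ClampQueuePackets is a hypothetical helper, not a runtime function):

```cpp
#include <algorithm>
#include <cstdint>

constexpr uint32_t kPageSizeBytes = 4096;  // Assumed page size.
constexpr uint32_t kAqlPacketBytes = 64;   // AQL packets are 64 bytes.

// Returns the queue size in packets, raised to at least one page worth.
uint32_t ClampQueuePackets(uint32_t requested_packets) {
  const uint32_t min_packets = kPageSizeBytes / kAqlPacketBytes;
  return std::max(requested_packets, min_packets);
}
```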
Change-Id: I26394f79458d09fbceb79fc8aaf495e2c26a8ff3
On gfx90a only a reduced number of CUs must be used for cooperative
dispatches due to CWSR and launcher interactions with asymmetric
harvest. We must use one fewer CUs per SE than the lowest count of
CUs on any SE.
Also adds env var HSA_COOP_CU_COUNT which enables the cooperative
CU count computation. Set to 1 to enable the new computation.
This is an opt-in feature that will become enabled by default (opt-out)
in a future release.
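
The per-SE rule above can be sketched as follows. This is an illustration only: CooperativeCusPerSe and its input vector are hypothetical names, and how the runtime actually obtains per-SE CU counts is not shown here.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// cus_per_se[i] is the number of active CUs on shader engine i
// (hypothetical input). Returns the CUs usable per SE for cooperative
// dispatch: one fewer than the smallest per-SE count.
uint32_t CooperativeCusPerSe(const std::vector<uint32_t>& cus_per_se) {
  if (cus_per_se.empty()) return 0;
  const uint32_t min_cus =
      *std::min_element(cus_per_se.begin(), cus_per_se.end());
  return min_cus == 0 ? 0 : min_cus - 1;
}
```

For per-SE counts of 14, 13, 14, 14 this yields 12 CUs per SE.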
Change-Id: Ifbb75ced3bbc15876eef44922c6a4f6fde8c4c28
Corrections have been made in libhsakmt, and corresponding changes are required here as well.
Signed-off-by: Chen Gong <curry.gong@amd.com>
Change-Id: Ib697ce25278c2c5ac6ef0206930ec285f46c60d1
The start iterator becomes invalid after it is removed from
std::map prefetch_map_. This caused a segfault when the iterator was
incremented afterwards.
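
A minimal sketch of the safe pattern, using a plain std::map<int, int> as a stand-in for prefetch_map_ (EraseRange is a hypothetical helper): continue from the iterator returned by erase instead of incrementing the erased one.

```cpp
#include <cstddef>
#include <map>

// Erases every entry whose key lies in [first_key, last_key) and
// returns how many entries were removed.
std::size_t EraseRange(std::map<int, int>& m, int first_key, int last_key) {
  std::size_t removed = 0;
  auto it = m.lower_bound(first_key);
  while (it != m.end() && it->first < last_key) {
    // erase() invalidates `it` but returns the iterator to the next
    // element, so the loop never touches the erased node again.
    it = m.erase(it);
    ++removed;
  }
  return removed;
}
```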
Signed-off-by: David Yat Sin <david.yatsin@amd.com>
Change-Id: I4b0b763d2cb4ee99c0b8571c2c526b834e74077a
Prior solution used a single global lock to protect the memory tracking structures.
This change protects the memory tracking structure with a shared mutex (rw lock) in
shared (r) mode for memory allocations and frees so that long duration processes,
calling to kfd, can be done in parallel. Operations which must modify the memory map
take the mutex in exclusive mode (w) and must not call to the thunk while holding
the mutex.
The fragment allocator now requires separate protection and is protected with a
mutex at the device level. Protecting at the device level, rather than pool,
allows retention of the current recursive design and allows calling Trim from
within Allocate. This could be made finer (pool level locks) but would
require backing out of Allocate entirely to call Trim. Trim and any retried
Allocation must be done in isolation (per device) or we may report OOM when
memory is actually available in some pool's fragment cache. So some device
level serialization is required in at least some paths.
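
One plausible reading of the shared/exclusive scheme from the first paragraph, sketched with std::shared_mutex (C++17). Everything here is an assumption for illustration: MemoryTracker is a hypothetical class, and std::malloc/std::free stand in for the long-running driver calls.

```cpp
#include <cstddef>
#include <cstdlib>
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical tracker: driver-style calls run under a shared lock so
// they can overlap; map updates take the lock exclusively and never
// call into the driver while holding it.
class MemoryTracker {
 public:
  void* Allocate(std::size_t bytes) {
    void* ptr = nullptr;
    {
      std::shared_lock<std::shared_mutex> shared(lock_);
      ptr = std::malloc(bytes);  // Stand-in for a slow driver call.
    }
    if (ptr == nullptr) return nullptr;
    std::unique_lock<std::shared_mutex> exclusive(lock_);
    regions_[ptr] = bytes;  // Short map update, no driver call.
    return ptr;
  }

  bool Free(void* ptr) {
    {
      std::unique_lock<std::shared_mutex> exclusive(lock_);
      if (regions_.erase(ptr) == 0) return false;  // Unknown pointer.
    }
    std::shared_lock<std::shared_mutex> shared(lock_);
    std::free(ptr);  // Stand-in for a slow driver call.
    return true;
  }

  std::size_t Count() const {
    std::shared_lock<std::shared_mutex> shared(lock_);
    return regions_.size();
  }

 private:
  mutable std::shared_mutex lock_;
  std::map<void*, std::size_t> regions_;
};
```

The point of the split is that many threads can sit in the slow call at once, while the exclusive section stays short.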
Change-Id: I7c1e94d6965ffcc602b12fefdd3a6e97b84b5e00
Comments call out the specific operation being selected since the
ternary nest is a bit hard to read.
Change-Id: If033dbaa6cba132e96196ad3fc6d5572042041f4
Argument must be checked for nullptr before being dereferenced and
filled with the default return value.
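
The ordering can be sketched as below. Status and GetDefaultValue are hypothetical names, and 0 is an assumed default; the affected API is not named in this message.

```cpp
#include <cstdint>

enum class Status { kSuccess, kInvalidArgument };

// Validates the output pointer before writing the default through it.
Status GetDefaultValue(uint32_t* out) {
  if (out == nullptr) return Status::kInvalidArgument;
  *out = 0;  // Assumed default value.
  return Status::kSuccess;
}
```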
Change-Id: I9ff366f066a5e18c78129bf59cc3ba00fca3ef18
This really should be set to conform to distro standards.
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I8c3bdcc7eb103cec9db6aa9f9cfec25754784be8
With gcc-10.3.0, building hsa-runtime fails with the following log:
compute/hsa/runtime/rocrtst/suites/negative/queue_validation.cc:470:18: error: conversion from ‘unsigned int’ to ‘uint16_t’ {aka ‘short unsigned int’} changes value from ‘4294967295’ to ‘65535’ [-Werror=overflow]
470 | aql().header |= 0xFFFFFFFF << HSA_PACKET_HEADER_TYPE;
| ~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [CMakeFiles/rocrtst64.dir/build.make:339: CMakeFiles/rocrtst64.dir/home/aaliu/work/compute/hsa/runtime/rocrtst/suites/negative/queue_validation.cc.o] Error 1
make[2]: *** Waiting for unfinished jobs....
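
One way to avoid this class of -Werror=overflow failure, sketched under assumptions (this is not necessarily the change made in queue_validation.cc; kHeaderTypeShift is a stand-in for HSA_PACKET_HEADER_TYPE with an assumed value of 0, and SetAllHeaderBits is a hypothetical helper): build the mask at the width of the 16-bit header and narrow it explicitly.

```cpp
#include <cstdint>

constexpr unsigned kHeaderTypeShift = 0;  // Stand-in, assumed value.

// Sets every header bit at or above the shift position without an
// implicit 32-bit to 16-bit narrowing of the constant.
uint16_t SetAllHeaderBits(uint16_t header) {
  const uint16_t mask = static_cast<uint16_t>(0xFFFFu << kHeaderTypeShift);
  header |= mask;
  return header;
}
```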
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Change-Id: I95fe72030368abc211b4b97b5a7ba00b5e094730
GetGlobalMemoryPool had improper return codes for an iterator callback
and did not properly order the APU pool selection path.
Change-Id: I01ab9d23e2352be98d9718bc25889ad4f779d3ca
Clang warns about bitwise operators on bools. Cast to int silences
the warning without introducing short-circuit logic.
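
A small sketch of the cast described above (BothSucceeded is a hypothetical helper). Both calls are always made, which `&&` would not guarantee:

```cpp
// Combines two results with bitwise AND on ints. Unlike `&&`, the
// second function runs even when the first returns false.
bool BothSucceeded(bool (*first)(), bool (*second)()) {
  return (static_cast<int>(first()) & static_cast<int>(second())) != 0;
}
```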
Change-Id: I6e25138e1acf4a5562d3925ea5b2fcef3addb783
Would be nice to get warning count changes highlighted in CI though.
Clang's increasingly suspect diagnostics have caused multiple build
breaks without highlighting any actual issues.
Also: https://embeddedartistry.com/blog/2017/05/22/werror-is-not-your-friend/
Change-Id: I7dc82da58cd86f7b4f1a9fb511c4c039419271d4
Due to a CPACK bug the package needs to remove header file
symlinks. Cleanup is required for uninstall and upgrade
since each release installs to a different folder.
Change-Id: I5ec378b21e69235404781c7bce3c0203eb38eed1
KFD topology has been corrected and the defaults used by this
workaround are no longer true for all chips.
Change-Id: I0242d8077e9666ed1cf0dc3985244258ae5c0924
For APU ASICs, the default size of video memory is relatively small.
Once the reserved region is taken into account, the ratio of max alloc
size to pool size may fall below the expected value, so adjust it.
Change-Id: I798b44d9532aa6a381a1cc19faa5a46110bf0ad6
Early exit if the range is found to be fine grain. Indeterminate
should only apply if the range is neither coarse nor fine.
Change-Id: I54133e14f4e8cfa53e2d612f6112cdcdb5a47dfa
Because of sharing ports with other engines, the
hardware design team has advised that SDMA0 on gfx90a
should only be used for host-to-device data transfers.
The recommendation is to use SDMA1 for any device-to-device
or device-to-host data transfers.
A driver change will ensure that, for each gfx90a
device, only the first PCIe SDMA queue a process
requests will possibly be from SDMA0. This patch ensures
that the first PCIe queue requested (which may be from
SDMA0) is always set up for host-to-device.
Change-Id: I6793ca95596dedaed9d5be1dbd9469ceef2a5c33
Bumps cmake minimum version to 3.7 for version comparison operator.
Previously the Clang cmake project version strings were used. These
are not defined if the clang cmake project has not been loaded.
We should use CMAKE_CXX_COMPILER_VERSION to check the version when
only the compiler binary is redirected and the project files are
not available.
Also adjust device libs lookup logic to handle multiple paths in
CMAKE_PREFIX_PATH.
Change-Id: I67b6958d8241685cd6c3a0af68507c9fdc6331ef
For minimal latency we should place command queues and blit code
in the nearest numa node to each GPU. Add an allocator matching
the current runtime default allocator interface to each GpuAgent
that allocates on the closest numa node as represented by kfd
topology. Use this allocator for queue ring buffers and blit
objects.
Change-Id: I181127f9c27bafe68976312963146616e3f58369
Also make failure to handle queue errors fatal.
Motivation is to improve detection of queue error conditions
that currently appear as application hangs.
Change-Id: I655643616dc0bd303d7df3ce8aca2c099bec3d46
Sets package found and component lists. ROCr does not have components
so this is mostly cosmetic. It's part of maintaining a compliant
cmake project config file though.
Change-Id: Ida2ef746375143babd3a6f938727a47135606f01
Before clang 13, the option -Wno-error=unused-but-set-variable is not
recognized, nor is the diagnostic emitted. Set this option
conditionally on the clang compiler version.
Change-Id: I3c0958dffa985d53b641f9eff4e702988dffd033
Passing 0 into num_cu_mask_count used to be an implicit error.
This has been repurposed as a short hand for enabling all CUs.
Enabling all CUs when HSA_CU_MASK is set will cause the CU mask to
reset to whatever was set by HSA_CU_MASK which may then be queried.
Change-Id: I1d6bb2034595a78ee48fa72aa05563e8ea6c0fff
Delay parsing until after GPU discovery. Use the surfaced
GPU count and maximum physical CU count to limit parsed bit masks.
This prevents pathological input such as
HSA_CU_MASK=0-8000000:0-8000000 from attempting to consume 7TiB.
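
A sketch of the bounding idea, assuming the CU range has already been parsed into two integers (BuildCuMask is a hypothetical helper and not the runtime's parser): the mask is sized from the known CU count, so an oversized range cannot drive the allocation.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Builds a bit mask with bits [first, last] set. The mask is sized by
// max_cus, and indices at or beyond max_cus are ignored.
std::vector<uint32_t> BuildCuMask(uint64_t first, uint64_t last,
                                  uint32_t max_cus) {
  const uint64_t words = (static_cast<uint64_t>(max_cus) + 31) / 32;
  std::vector<uint32_t> mask(words, 0u);
  if (max_cus == 0 || first > last || first >= max_cus) return mask;
  last = std::min<uint64_t>(last, max_cus - 1);  // Clamp to known CUs.
  for (uint64_t cu = first; cu <= last; ++cu) {
    mask[cu / 32] |= 1u << (cu % 32);
  }
  return mask;
}
```

With a 40 CU limit, a requested range of 0-8000000 produces a two-word mask instead of an allocation proportional to the requested range.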
Change-Id: I3773d2db3740c2023b0f6275d1818b69119b0495