doorbell_queue_map should always be allocated or we will need to
add branches around all accesses.
Change-Id: I994c0eaf4be62c1a4a37bd06894272dba1fc1da6
The sdma end ts must be 256-bit aligned in oss 3.0 and prior. Using
the ts pool requires copying into the signal, which is a significant
performance penalty for small copies.
SharedSignal is 128 bytes due to alignment, so it can host the end ts.
Move the sdma end ts into SharedSignal and remove the ts pool and ts copy.
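
A minimal sketch of the layout idea, not the actual ROCr definition. The
names AbiSignalSketch and SharedSignalSketch, the field set, and the 64-byte
struct alignment are assumptions made for illustration. It shows how the
padding of a 128-byte struct can hold a 32-byte (256-bit) aligned end
timestamp, checked at compile time.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a 64-byte ABI signal block (not the real type).
struct AbiSignalSketch {
  std::uint8_t bytes[64];
};

// Hypothetical layout: 64-byte alignment rounds the struct up to 128 bytes,
// so the space after the small fields can host the SDMA end timestamp.
struct alignas(64) SharedSignalSketch {
  AbiSignalSketch abi;                    // bytes 0..63
  std::uint64_t sdma_start_ts;            // byte 64
  void* owner;                            // byte 72
  alignas(32) std::uint64_t sdma_end_ts;  // next 32-byte boundary: byte 96
};

static_assert(sizeof(SharedSignalSketch) == 128, "expected a 128-byte signal");
static_assert(offsetof(SharedSignalSketch, sdma_end_ts) % 32 == 0,
              "end timestamp must be 256-bit aligned");

// Returns true when the end timestamp of a concrete object is 32-byte aligned.
inline bool EndTsAligned(const SharedSignalSketch& s) {
  return reinterpret_cast<std::uintptr_t>(&s.sdma_end_ts) % 32 == 0;
}
```

With the timestamp inside the signal, the engine can write it in place and
no copy from a separate pool is needed.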
Change-Id: I7899bda36ebc9adcaad1d3a3d2b7a489857cc9e8
Impacts GPU_ONLY signal type latency when waiting for small operations.
Using this type improves total SDMA small copy performance by ~40% if
the signal is allowed to spin freely.
Change-Id: I27aa128c63a1bacb3f51fb08f166e4e1d6fef651
Remove agent lookup in time stamp translation for IPC signals. The copy
agent handle is not shared so does not need to be checked for cross
process use. Cross process copy-timestamp read is illegal and continues
to deliver garbage.
Store the copy agent properly when doing CPU-CPU copies.
Change-Id: Ib4008f66ff866922047749dd556c84a32021c1fd
ucode versions are per asic so not valid for feature enablement outside
of bringup/dev. Feature is older than the latest ioctl change that
the thunk depends on so use of this patch with kernel packages that
don't contain the feature is not possible in a supported environment.
Change-Id: I36b14176a7d642017ef1518aeade454b0f3dc749
If M0[23] is set then the driver will interpret the interrupt as a
debug event, rather than a signal event.
Clear M0 before sending the interrupt. All paths here are terminal so
it's not necessary to save/restore M0.
Change-Id: Ibd85b8cc6f8556941f2308a2c3fa3c68702cd606
agentOwner from thunk reflects the GPU which holds the device alias.
We need to return a CPU to better reflect that the memory is system memory.
Change-Id: I9233f8779a4bfd471f68dbbbce07ae4528412e18
Allow user specified profiles if the HSAIL note is not found.
Konstantin reviewed and approved. HSAIL note is not generated by LLVM.
Change-Id: I40fbfbaedd6787b6a716507918f698d02007afe1
Report traps and fatal exceptions through a wavefront's
amd_queue_t.queue_inactive_signal. Previously, only traps were
reported, which required the compiler to pass the signal pointer
in s[0:1].
The signal is obtained through a mapping from doorbell index to
amd_queue_t*. The doorbell is fetched within a wavefront through
the gfx9+ S_SENDMSG(MSG_GET_DOORBELL) instruction.
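
The lookup described above can be sketched on the host side as a table
indexed by doorbell index. This is an illustration only: the real lookup
runs inside the trap handler, and QueueSketch and DoorbellQueueMapSketch are
hypothetical names, not ROCr types.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for amd_queue_t; only the field used here is modeled.
struct QueueSketch {
  std::uint64_t queue_inactive_signal;  // signal handle to notify on a fault
};

// Table indexed by doorbell index. It is sized up front, so lookups never
// need an "is the table allocated?" branch.
class DoorbellQueueMapSketch {
 public:
  explicit DoorbellQueueMapSketch(std::size_t doorbell_count)
      : map_(doorbell_count, nullptr) {}

  void Bind(std::size_t doorbell_index, QueueSketch* queue) {
    map_.at(doorbell_index) = queue;
  }

  // Returns the signal handle of the queue that owns this doorbell, or 0 if
  // no queue is bound to the index.
  std::uint64_t InactiveSignalFor(std::size_t doorbell_index) const {
    const QueueSketch* queue = map_.at(doorbell_index);
    return queue ? queue->queue_inactive_signal : 0;
  }

 private:
  std::vector<QueueSketch*> map_;
};
```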
Change-Id: I319b45f2e15dfcfe4db8f4065da1136e9539a42b
Assembler toolchains are moving from SP3 to LLVM. Replace trap handler
source code with LLVM equivalent.
Fix a trap issue with SQ_WAVE_IB_STS restore. Mostly harmless as all
traps are currently considered fatal to the wavefront.
Change-Id: Iacecd9dd31a1d96a083c8b8327f442f33c861f9f
Adds hsa_amd_register_deallocation_callback and hsa_amd_deregister_deallocation_callback
to notify when HSA memory has been released.
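
A minimal sketch of the register/deregister/notify pattern, assuming a
per-pointer callback list. DeallocNotifierSketch and DeallocCallback are
hypothetical names for illustration. This is not the HSA implementation and
does not reproduce the signatures of the two functions named above.

```cpp
#include <map>
#include <utility>
#include <vector>

using DeallocCallback = void (*)(void* ptr, void* user_data);

class DeallocNotifierSketch {
 public:
  void Register(void* ptr, DeallocCallback cb, void* user_data) {
    callbacks_[ptr].emplace_back(cb, user_data);
  }

  // Removes every registration of cb for ptr; returns false if none existed.
  bool Deregister(void* ptr, DeallocCallback cb) {
    auto it = callbacks_.find(ptr);
    if (it == callbacks_.end()) return false;
    auto& list = it->second;
    const auto before = list.size();
    for (auto i = list.begin(); i != list.end();) {
      i = (i->first == cb) ? list.erase(i) : i + 1;
    }
    const bool removed = list.size() != before;
    if (list.empty()) callbacks_.erase(it);
    return removed;
  }

  // Called by the (hypothetical) allocator when ptr is released. Each
  // callback fires once; the registrations are dropped afterwards.
  void NotifyReleased(void* ptr) {
    auto it = callbacks_.find(ptr);
    if (it == callbacks_.end()) return;
    auto list = std::move(it->second);
    callbacks_.erase(it);
    for (auto& entry : list) entry.first(ptr, entry.second);
  }

 private:
  std::map<void*, std::vector<std::pair<DeallocCallback, void*>>> callbacks_;
};
```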
Change-Id: I1f33cee250ca890e5c2e7fddfa4479aa5874651d
CPUClockCounter is not NTP adjusted (CLOCK_MONOTONIC_RAW) so should be
better for measurements. However, it is implemented with a syscall, while
CLOCK_MONOTONIC is implemented via vDSO. The latency increase becomes
significant when language layers make corresponding clock measurements.
Reverting to CLOCK_MONOTONIC will reduce latency and allow small
duration events to be measured at the cost of incorporating NTP
frequency skew errors. NTP may adjust frequency by 500ppm so limits us
to ~3 decimals in elapsed time.
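
A minimal sketch of the trade-off, assuming a POSIX system. MonotonicNs and
MaxSkewErrorNs are hypothetical helper names, not ROCr functions. A 500 ppm
frequency skew is a relative error of 5e-4, which is why only about three
significant digits of an elapsed time can be trusted.

```cpp
#include <cstdint>
#include <time.h>

// Reads CLOCK_MONOTONIC in nanoseconds using POSIX clock_gettime.
inline std::uint64_t MonotonicNs() {
  timespec ts{};
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return std::uint64_t(ts.tv_sec) * 1000000000ull + std::uint64_t(ts.tv_nsec);
}

// Worst-case error of an elapsed interval if the clock frequency is skewed
// by up to 500 ppm.
inline double MaxSkewErrorNs(std::uint64_t elapsed_ns) {
  constexpr double kMaxSkew = 500e-6;
  return double(elapsed_ns) * kMaxSkew;
}
```

For a 1 ms interval the bound is about 500 ns.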
Change-Id: I920b9f707f47109d80d6c256c475638c03fb8d76
Description was inconsistent with itself and code. Existing behavior
returns HSA_AMD_MEMORY_POOL_INFO_ACCESSIBLE_BY_ALL == true for system
memory pools only and system memory pools do require hsa_amd_agents_allow_access.
Change-Id: I64b287bff9fdb21688aa169296e410edf1b209b5
Check whether the value is true rather than whether it is defined. The
string() call would define it as an empty string, which would pass the
check. This would then leave a trailing - in the version string, which
causes dpkg to fail during package installation.
Change-Id: Ifb5fc15f5dde506e96bff7881a5d3f22d983406e
Search the local src directories first. Otherwise, when using a
system-installed hsakmt, the build would pick up the installed hsa headers.
Change-Id: I9746d6e9db1749a130e4d93e024556754a537083
Joined threads cannot be joined again, nor can they be detached.
The thread library's wait and close allow multiple waits and a separate
close, so this fixes the pthread implementation to match.
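
A minimal sketch of one way to get these semantics on top of pthreads,
assuming waits are not issued concurrently from several threads. ThreadSketch
and the three helper functions are hypothetical names, not the ROCr thread
library.

```cpp
#include <pthread.h>

// Hypothetical thread handle that tolerates repeated waits and a separate
// close.
struct ThreadSketch {
  pthread_t handle;
  bool joined;
};

inline ThreadSketch* CreateThreadSketch(void* (*entry)(void*), void* arg) {
  ThreadSketch* t = new ThreadSketch{};
  if (pthread_create(&t->handle, nullptr, entry, arg) != 0) {
    delete t;
    return nullptr;
  }
  return t;
}

// Safe to call more than once: pthread_join runs only on the first call.
inline bool WaitThreadSketch(ThreadSketch* t) {
  if (t->joined) return true;
  if (pthread_join(t->handle, nullptr) != 0) return false;
  t->joined = true;
  return true;
}

// Releases the handle. A thread that was never joined is detached instead,
// because a joined thread must not be detached.
inline void CloseThreadSketch(ThreadSketch* t) {
  if (!t->joined) pthread_detach(t->handle);
  delete t;
}
```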
Change-Id: I0019271a438f11ed4c6c11854011f5c4f6e16b65
Small times may be given to time conversion if GPU clocks are used to
accumulate elapsed time. Because HSA APIs deal in absolute time this
leads to large conversion offsets of order system uptime. Variation
in relative clock ratio estimation may be amplified in this case,
destroying elapsed time measurements.
This patch fixes the relative clock ratio used for times which predate
the call to hsa_init. This correlates errors in such times allowing
the elapsed time to be correctly computed.
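
The effect can be illustrated with a linear clock translation. This is a
sketch under stated assumptions, not ROCr's conversion code: Translate and
ElapsedNs are hypothetical names, and the model is a single reference point
with a tick-to-nanosecond ratio.

```cpp
// Linear translation from GPU ticks to system nanoseconds around a reference
// point (gpu_ref, sys_ref).
inline double Translate(double gpu_tick, double gpu_ref, double sys_ref,
                        double ratio) {
  return sys_ref + (gpu_tick - gpu_ref) * ratio;
}

// Elapsed system time between two GPU ticks when each tick is converted with
// its own ratio estimate.
inline double ElapsedNs(double start, double end, double gpu_ref,
                        double sys_ref, double ratio_start, double ratio_end) {
  return Translate(end, gpu_ref, sys_ref, ratio_end) -
         Translate(start, gpu_ref, sys_ref, ratio_start);
}
```

When the ticks lie far from the reference, a ratio change of 1e-9 between
the two conversions is multiplied by the full offset and swamps a 1000-tick
interval. Using one fixed ratio for both gives each result the same error,
which cancels in the difference.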
The effective maximum system uptime before elapsed time conversion becomes
inaccurate is ~3.5 months. GPU event timestamps are good for process uptime
of ~3.5 months. These are limited by double's mantissa precision.
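
A rough check of the quoted limit, assuming timestamps are counted in
nanoseconds (the unit is an assumption; the text does not state it). A double
has a 53-bit mantissa, so integers above 2^53 are no longer all
representable. The helper names are hypothetical.

```cpp
#include <cfloat>

// Upper bound of the range in which a double represents every integer: 2^53.
inline double ExactIntegerLimit() {
  double limit = 1.0;
  for (int i = 0; i < DBL_MANT_DIG; ++i) limit *= 2.0;
  return limit;
}

// Days until a 1 ns tick counter exceeds that limit.
inline double DaysUntilPrecisionLoss() {
  constexpr double kNsPerDay = 86400.0 * 1e9;
  return ExactIntegerLimit() / kNsPerDay;
}
```

Under that assumption the limit is reached after a little over 104 days,
about 3.4 months, in line with the figure above.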
Change-Id: I48752ff354920439d91016d6f2b0c8ddfa60b445
At the moment it is not possible to build ROCr with Clang. This is
a spurious limitation. The present PR addresses it by guarding GCC
only flags and by fixing some additional warnings that Clang triggers;
one of said warnings did outline a rather interesting issue with math
being done on void*s. - AlexVlx
Void ptr arithmetic had already been fixed in amd-master branch.
Change-Id: I5ee97e20b5c40b10dd73facecabe75f02ba46462
Non-paged memory can be IPC-shared even when HSA_USERPTR_FOR_PAGED_MEM
is enabled.
Change-Id: I8b1fa6d7a4a9327c78a77b3679697fbf55397093
Signed-off-by: Felix Kuehling <Felix.Kuehling@amd.com>
KFD no longer reports MemoryAccessFault.Failure with retry fault
implementation. ROCr ignores the memory event when Failure = 0.
Use the Flags field instead, which will be non-zero when the
event is triggered.
Change-Id: Ie90799a303b0b2f1b476b20ffafdde79ae137182
Makes malloc memory accessible to GPUs so that the memory has the
capabilities of the pool it is locked to.
This admits fine grained locked memory and reserves API space for any future
special CPU pools.
Change-Id: If8c3dd8582a43f19d3d36b3763c1a688cc419ef0