SDMA4.4 and SDMA5.2+ have increased their available copy size to 2^30
bytes, represented by exponent as bits set in the COUNT field of the
linear copy.
Also note that the full 2^22 byte limit is available from SDMA4 onwards
as it has corrected the 0x3fffe0 HW limitation from SDMA3.
As the copy limit has increased, this can change system performance,
so provide the env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall
back to the original 0x3fffe0 limit for debugging purposes.
Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee
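A minimal sketch of what the copy-size limits in the change above mean
for command counts, assuming a copy is split into back-to-back linear
copy commands that each carry at most the limit. The constant names and
the CopyCommandCount helper are hypothetical and are not taken from the
runtime source.

```cpp
#include <cassert>
#include <cstdint>

// Per-command linear copy limits quoted in the commit text.
constexpr uint64_t kLegacyLimit = 0x3fffe0;    // SDMA3 HW limitation
constexpr uint64_t kSdma4Limit = 1ull << 22;   // SDMA4 onwards
constexpr uint64_t kLargeLimit = 1ull << 30;   // SDMA4.4 and SDMA5.2+

// Number of linear copy commands needed to move `bytes` when each
// command can carry at most `limit` bytes (hypothetical helper).
inline uint64_t CopyCommandCount(uint64_t bytes, uint64_t limit) {
  assert(limit > 0);
  return (bytes + limit - 1) / limit;  // ceiling division
}
```

Under these assumptions a 1 GiB copy needs 257 commands at the legacy
limit, 256 at the 2^22 limit and a single command at the 2^30 limit.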
To trace memcpy asynchronously, both the dst and src agents need to have
profiling enabled, but the API for enabling profiling only enabled it
for GPU agents. CPU agents did not have profiling enabled, so the signal
owner could not be determined, and
hsa_amd_profiling_get_async_copy_time would fail with an HSA status
error because it could not read the agent for the given signal.
Change-Id: Ie165e0e39b8fcd6992a55695b9ffcead10a8e812
- Update CMakeLists.txt
  - find_package for rocprofiler-register
    - this is an optional package until rocprofiler-register is added to the CI
  - define HSA_VERSION_{MAJOR,MINOR,PATCH} ppdefs
- Update runtime.cpp
  - include <rocprofiler-register/rocprofiler-register.h>
  - if rocprofiler-register succeeds, do not support v1 unless explicitly requested
Change-Id: I8f48bbf3f6b52fb91ddade2f198491a1256035fe
Remove override that forces ROCr image blit source and ROCr test to use
code object version 4 now that mainline has been updated to version 5.
Change-Id: I94681e86835c0e382475306ead4cd4132a2ee78f
Add handler to handle HW exception events reported by underlying
drivers. These events are generally caused by GPU resets and need the
application to abort.
As an improvement, in the future, we can provide additional information
about the exception (e.g. mode-reset level).
Change-Id: If3fb5f19f9fce181a9d3b5e34a5506725856e7b0
An AQL packet header field is stored using an atomic release, and needs
to be read using atomic acquire if it may be written by another thread.
Change-Id: I1d75587fd93f9c6216deebffc9a627b404a7e749
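A minimal sketch of the release/acquire pairing described in the change
above, using std::atomic rather than the runtime's own atomic helpers.
The Packet struct, Publish and TryRead are hypothetical; the two packet
type constants are local stand-ins for the values defined in hsa.h, and
the type is assumed to occupy the low 8 bits of the header.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Local stand-ins for the HSA packet type values (see hsa.h).
constexpr uint16_t kPacketTypeInvalid = 1;
constexpr uint16_t kPacketTypeKernelDispatch = 2;

struct Packet {
  std::atomic<uint16_t> header{kPacketTypeInvalid};
  uint32_t payload = 0;
};

// Producer: write the body first, then publish the header with a
// release store so the body is visible to whoever observes the header.
inline void Publish(Packet& p, uint32_t payload) {
  p.payload = payload;
  p.header.store(kPacketTypeKernelDispatch, std::memory_order_release);
}

// Consumer: read the header with an acquire load; the body may only be
// read once the header is seen as valid.
inline bool TryRead(const Packet& p, uint32_t* out) {
  const uint16_t header = p.header.load(std::memory_order_acquire);
  if ((header & 0xff) == kPacketTypeInvalid) return false;
  *out = p.payload;
  return true;
}
```

A plain (non-atomic) read of the header would not be ordered against the
producer's body writes, which is the case the change guards against.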
Define AMD_AQL_FORMAT_INTERCEPT_MARKER AMD vendor AQL packet. Add
support to intercept queue to invoke a callback for these packets.
Change-Id: Ia58d5fe2171f563632b4edd6343e02585f49d149
When the intercept queue copies packets from the proxy queue to the
wrapped queue, it should not attempt to copy packets that are outside
the proxy queue. This could happen if the user of the proxy queue
advances the write pointer beyond the number of free slots and the
packet rewriter reduces the number of packets.
Change-Id: Id02f5df8aee0ed7269f4de813731d507cf2126b3
If an intercept queue is created and multiple packet rewriters are
registered, and if one of the rewriters invokes the packet writer
multiple times, then on returning from the packet writer the packet
rewriter index needs to be restored. Otherwise the next packet writer
call will start with an index of 0 which will be decremented and result
in out of bounds vector access.
Change-Id: Icb3f6a81ea04f1f7b91551b974a1f48c4f32db60
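A simplified sketch of the index save and restore described in the
change above. RewriterChain, Packet, Writer and Rewriter are hypothetical
names, packets are reduced to ints, and the real intercept queue is more
involved; the point is only that the index is restored after each nested
writer call so a rewriter can invoke the writer more than once.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

using Packet = int;
using Writer = std::function<void(Packet)>;
using Rewriter = std::function<void(Packet, const Writer&)>;

class RewriterChain {
 public:
  void AddRewriter(Rewriter r) { rewriters_.push_back(std::move(r)); }

  // Start at the most recently registered rewriter.
  void Submit(Packet pkt) {
    index_ = rewriters_.size();
    Write(pkt);
  }

  const std::vector<Packet>& output() const { return output_; }

 private:
  void Write(Packet pkt) {
    if (index_ == 0) {  // no rewriters left: packet reaches the queue
      output_.push_back(pkt);
      return;
    }
    const std::size_t saved = index_;
    --index_;
    rewriters_[index_](pkt, [this](Packet p) { Write(p); });
    index_ = saved;  // restore so the caller can invoke the writer again
  }

  std::vector<Rewriter> rewriters_;
  std::vector<Packet> output_;
  std::size_t index_ = 0;
};
```

With two rewriters registered, where the later one calls the writer
twice, both calls pass through the earlier rewriter because the index is
restored between them.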
It is possible that packet rewriting an initial packet for the intercept
queue produces more packets than the size of the wrapped queue. The code
would never submit such a set of packets as it attempted to submit
all or none. This can result in an infinite loop.
This is corrected to submit what will fit if the rewrite is larger than
the wrapped queue.
Change-Id: I8f03228c2e15151287e25de46eaee998f829c62a
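A minimal sketch of the "submit what will fit" behaviour described in
the change above, assuming packets are ints and the wrapped queue is a
vector with a fixed capacity. SubmitWhatFits is a hypothetical helper,
not the runtime's function.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Moves as many packets as currently fit from `pending` to `hw_queue`
// and returns how many were moved. An all-or-none policy would move
// nothing whenever pending.size() exceeds the capacity, and so could
// never make progress.
inline std::size_t SubmitWhatFits(std::deque<int>& pending,
                                  std::vector<int>& hw_queue,
                                  std::size_t capacity) {
  const std::size_t free_slots =
      capacity > hw_queue.size() ? capacity - hw_queue.size() : 0;
  std::size_t moved = 0;
  while (moved < free_slots && !pending.empty()) {
    hw_queue.push_back(pending.front());
    pending.pop_front();
    ++moved;
  }
  return moved;
}
```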
The intercept queue submit needs to be obstruction free as it can be
invoked by the runtime async handler helper thread. The code had a busy
wait loop waiting for a free slot to be available to add the retry
barrier packet. Blocking that thread prevents it from servicing other async
handlers which may need to execute in order to allow packets on the
hardware queue to be processed to free up a slot.
Change the code to always leave one free slot unless there is a retry
barrier packet already on the queue.
Change-Id: If901c865550258b790b995d58037b0f99f1968cc
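A minimal sketch of the slot reservation rule described in the change
above: one slot is held back for a retry barrier packet unless such a
packet is already on the queue. SubmittableSlots and its parameters are
hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of slots that may be filled with ordinary packets without
// busy waiting. One slot stays free for a retry barrier packet unless
// one is already queued.
inline uint64_t SubmittableSlots(uint64_t queue_size, uint64_t used,
                                 bool retry_barrier_queued) {
  assert(used <= queue_size);
  const uint64_t free_slots = queue_size - used;
  if (retry_barrier_queued) return free_slots;
  return free_slots > 0 ? free_slots - 1 : 0;
}
```

Because the reserved slot is always available when it is needed, the
submit path never has to spin waiting for the hardware queue to drain.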
Describe the assumption being made when checking if there is a retry
barrier packet on the queue. Also enforce the consequential requirement
of the minimum queue size.
Change-Id: I0efaffc5a79b9e2fdab3655b8b74270118a5c2ff
The intercept queue was processing all the packets on the proxy queue.
This could result in the rewrite of more than one packet being put on
the overflow queue. If there are a lot of packets on the intercept
queue this could result in the overflow queue having more packets than
the size of the hardware queue. The code to submit the overflow queue
fails if it is unable to put all the packets of the overflow on the
hardware queue. This resulted in an infinite loop. It also resulted in
an assert being reported that packets are being added to the overflow
queue when it is not empty.
Correct this by checking if the overflow queue is non-empty after
rewriting each packet. If it is non-empty then stop processing
additional packets. The additional packets will be processed when the
barrier packet added to the hardware queue is executed due to its async
handler. This barrier packet is added to the hardware queue whenever
packets are saved on the overflow queue.
Change-Id: I2537911d3c3ba1aac61a0a35f1ab97426a66b5a2
When forcing SDMA copies, the engine ID specified by the requester should
still be used since the requester has a hint of engine availability.
Change-Id: Idefa9494e407e31da510aa4c7c1fa283c85a4f6e
The Vendor specific header is only 8-bits and this would break the
behavior on big-endian machines. Renaming field to amd_format to match
name in spec sheets.
Change-Id: I65559757657565d3d3ff489d2663a0be42cf8ba5
Some new CPUs have a different cache reporting structure, causing thunk
to leave the cache information empty. Allow the cache information for
CPU agents to be empty as it is not used by language runtimes.
Change-Id: Ic5e880171ab20aa114b4b62bdb4479eb54066f7b
Using new ExtendedCoherent KFD HSA memory flag to achieve system
scope coherence on atomic instructions. Non-compliant systems may
have the need to perform explicit HDP flushes to achieve system
scope coherence using this flag.
Change-Id: Ic6b47c0e97285086fa1f52bbfa4597b81cadafeb
Some negative tests can trigger C++ exceptions to be thrown, which
causes code to leave the ref counts in inconsistent state.
Change-Id: Ifa6d8be986941efcdf20d7ac8b86eb15a8fe9932
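The change above does not say how the ref counts are kept consistent.
One common C++ approach, shown here purely as an assumption, is an RAII
guard whose destructor undoes the increment even when an exception
propagates. RefGuard and DoWork are hypothetical names.

```cpp
#include <cassert>
#include <stdexcept>

// Increments a counter on construction and decrements it on
// destruction, including during stack unwinding.
class RefGuard {
 public:
  explicit RefGuard(int& count) : count_(count) { ++count_; }
  ~RefGuard() { --count_; }
  RefGuard(const RefGuard&) = delete;
  RefGuard& operator=(const RefGuard&) = delete;

 private:
  int& count_;
};

// Stand-in for an API body that may throw on invalid input.
inline void DoWork(int& ref_count, bool fail) {
  RefGuard guard(ref_count);
  if (fail) throw std::invalid_argument("negative test input");
}
```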
Modify hsa_amd_vmem_get_access to handle pointers that are within VA
range of an existing memory mapping
Change-Id: I9f806ec39f6e9a33da8d86dd65d9a472438fa8ed
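A minimal sketch of looking up the mapping that contains an interior
pointer, as the change above describes, assuming mappings are kept in an
ordered map keyed by base address. Mapping, MappingTable and
FindContaining are hypothetical and do not reflect the runtime's actual
data structures.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

struct Mapping {
  uint64_t size;  // length of the mapped VA range in bytes
  int access;     // stand-in for the access permissions
};

using MappingTable = std::map<uint64_t, Mapping>;  // keyed by base VA

// Returns the mapping whose [base, base + size) range contains `va`,
// or nullptr if no mapping covers it.
inline const Mapping* FindContaining(const MappingTable& table,
                                     uint64_t va) {
  auto it = table.upper_bound(va);  // first mapping with base > va
  if (it == table.begin()) return nullptr;
  --it;                             // last mapping with base <= va
  return (va - it->first < it->second.size) ? &it->second : nullptr;
}
```

An exact-match lookup on the base address would miss any pointer that
falls inside the range but not at its start.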
Silence warnings on more stringent compile checks for lack of override
declaration.
Signed-off-by: Jonathan Kim <jonathan.kim@amd.com>
Change-Id: Iaa54dfc3dd74f5ee55763cafbbcf2db73493bb21
On busy systems, memory allocation can take a long time and
increase calls to hsa_signal_create/hsa_amd_signal_create. This
mitigates this issue.
Change-Id: Ib7640273262ebc3dbf1f07049ce5da10b1d6b158
Add compile time asserts to force incrementing API table STEP versions
each time a new function is added to each table. This is required for
profiler team to be able to add preprocessor macros to determine which
versions contain the new APIs.
Also incrementing the major versions to 2 to indicate new numbering
scheme.
Change-Id: I148a436a5ceab6be3906f8263b40ea9b07841577
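A minimal sketch of the kind of compile-time check described in the
change above, assuming a table made only of function pointers. The
table, the step constant and the message are hypothetical; the
runtime's actual macros and table layouts differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical step version for the example table. It must be bumped
// whenever a function is appended to the table.
constexpr unsigned kExampleTableStepVersion = 1;

struct ExampleApiTable {
  void (*fn_first)();
  void (*fn_second)();  // added in step version 1
};

constexpr std::size_t kExampleFunctionCount =
    sizeof(ExampleApiTable) / sizeof(void (*)());

// Adding a third entry without touching the step version breaks the
// build here, which forces the version bump.
static_assert(kExampleTableStepVersion == 1 && kExampleFunctionCount == 2,
              "ExampleApiTable changed: bump kExampleTableStepVersion "
              "and update this assert");
```

Tools built against the headers can then compare the step version in a
preprocessor or constexpr check to decide whether a given entry exists.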
Some GFX9 devices will drop commands if ring buffer submission is less
than 64 DWORDs. Pad submission with a NOP head and trailing null
DWORDs in this case.
Change-Id: I850af490fb699f7efe8aef96d97c600a8e76516b
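A minimal sketch of the padding described in the change above, assuming
the submission is a vector of DWORDs. PadSubmission is a hypothetical
helper and kPlaceholderNopHeader is a placeholder value, not the real
NOP packet encoding.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kMinSubmissionDwords = 64;
// Placeholder only: the real NOP header encoding is device specific.
constexpr uint32_t kPlaceholderNopHeader = 0xDEAD0000u;

// Pads a short submission to the minimum size with one NOP header
// followed by null DWORDs. Longer submissions are left untouched.
inline void PadSubmission(std::vector<uint32_t>& cmds) {
  if (cmds.size() >= kMinSubmissionDwords) return;
  cmds.push_back(kPlaceholderNopHeader);
  cmds.resize(kMinSubmissionDwords, 0u);
}
```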
Also changed enum value to leave gap between enums that only exist in
hsa_region_info_t and enums that exist in both hsa_amd_memory_pool_info_t
Change-Id: I8f9f31200de66648e9328e4203ab283068c993f0
We don't need to keep track of specific blit engines in gang for
submission anymore as ganging early exits on pending bytes.
So tidy up the fluff.
Change-Id: I77e80bf1ad8f561a03fff77bce33aa09d02760c6
When oversubscribing SDMA gangs, a circular deadlock can occur since
gang enqueue is staggered with respect to SDMA engine leader based
on source to destination.
As a result, an enqueued leader may be waiting on a gang item that is
waiting on another enqueued leader or gang item and so on.
To prevent this, first lock the submission to ensure DMA status queries
and submissions are atomic. Once this is in place, be more stringent
with ganging in that all SDMA engines must be available in order to gang.
Finally, re-enable SDMA ganging by default.
Change-Id: I4511e3487db9d26475b5aece4897f10168cc5322
xGMI for compute partitioning in non-SPX modes does not have
a reported bandwidth.
Fix it to at most 2 since each partition is either bounded
by the number of xGMI links or the number of available
SDMA contexts.
Change-Id: I09094bd7548d9eee6f039b0efe849838e5de166e
SDMA ganging is causing some regressions with some applications hanging.
Temporarily disabling SDMA ganging by default until issue is fixed.
Change-Id: I65e172923a53a967df27b30d969ad5d215c4fa09
Use all available SDMA engines capped by xGMI bandwidth for
all D2D copies within a hive.
By default, set the latency boundary copy size as 4KB and below.
Any copy size within this boundary will not gang.
Avoid oversubscribing engines by not ganging on engines with
pending non-ganged work.
An environment variable HSA_ENABLE_SDMA_GANG has been provided
to override default ganging behaviour.
Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d
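A minimal sketch of the ganging decision described in the change above,
assuming the caller already knows the pending byte count of each SDMA
engine and the xGMI-derived cap. GangEngineCount and its parameters are
hypothetical; the gang_enabled flag stands in for the effect of
HSA_ENABLE_SDMA_GANG.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies at or below this size are latency bound and never gang.
constexpr std::size_t kLatencyBoundaryBytes = 4096;

// Returns how many SDMA engines to use for a D2D copy: only engines
// with no pending work are counted, the total is capped by the xGMI
// limit, and at least one engine is always used.
inline std::size_t GangEngineCount(
    std::size_t copy_bytes, const std::vector<std::size_t>& pending_bytes,
    std::size_t xgmi_cap, bool gang_enabled = true) {
  if (!gang_enabled || copy_bytes <= kLatencyBoundaryBytes) return 1;
  std::size_t idle = 0;
  for (std::size_t pending : pending_bytes) {
    if (pending == 0) ++idle;
  }
  return std::max<std::size_t>(1, std::min(idle, xgmi_cap));
}
```

For example, a 1 MiB copy with four engines, one of which has pending
work, would use two engines under a cap of 2 and three under a cap of 8.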
Reverting this as current mainline compiler branch does not support
gfx1150/gfx1151 yet. Will bring back later.
This reverts commit e877840197.
Change-Id: I31ff4fb2d5817538094a7ffaeba96dd6a7d660c7
Add agent info query to return nearest CPU agent. This can be used to
determine which CPU agent is in the same NUMA region as the GPU agent.
Change-Id: I5400b4347ffbf4d2a836df31c4de443a38b0ecd1