Core dump generation considers ulimit to generate the proper size
file.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: I61d991fc003b173f9075b66bff6a931447720695
This API consists in one function to be called from a fault event at the
hsa-runtime to generate a core dump.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: Ib1b90d5beb13f93c4e8ebd21fd61705ebb12ca5d
SegmentBuilder classes are used to get core dump data from the GPUs.
So far, it uses thunk API calls and smaps to collect all data from
the Hardware.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: I2ad70ca5a951885181d3142653b186b0f6be739e
- fix logic for using HSA_TOOLS_LIB when rocprofiler-register support is enabled
- report tool load failure for rocprofiler-register
Change-Id: Ife23aa3e6ed19174376cd694764583b73f8976cd
The alternate scratch memory is used for dispatches that have a low
number of waves but relatively large wave size.
This allows us to keep the tmpring_size.bits.WAVES field of the main
scratch to full occupancy.
Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3
For devices where the CP FW supports asynchronous scratch reclaim, ROCr
is able to claw-back scratch memory that was assigned to an AQL queue.
With that ability, ROCr does not have to rely on using USO
(use-scratch-once) when assigning large amounts of memory to a queue.
If we reach a situation where we are running low on device memory, ROCr
will attempt to claw-back the scratch memory.
Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7
Update queue structure to add members required for asynchronous reclaim
mechanism and dual-scratch. CP will set the AMD_QUEUE_CAPS_ASYNC_RECLAIM
bit on queue-connect to indicate whether the new features are supported.
The new members are ignored by previous versions of CP FW
Change-Id: Ic8e9ef41c5b1d04f09b43bc9b44b31527863d10f
For gfx11, the trap_handler fails to recognize a trap id 3 and report
the exception to the debugger if the debugger is attached.
This is because the 2nd level trap handler looks for the DEBUG_ENABLED
bit in ttmp13 instead of ttmp11. This bit is set by the 1st level trap
handler and is part of the 1st/2nd level trap handler ABI.
Change-Id: Ib36361f53d9bcbbed52320d8c3a9ab2c0b28c7cd
Skip Extended-scope memory pool as allocation is very close to
fine-grain/coarse-grain but with just different PTE flags.
Only test coarse grain on CPU agent other than the first CPU agent.
Stop bisecting the max size once we are withing 5% to total size for
these pool to speed this test on large memory pools.
Change-Id: I77d1b45a1752ef092dda7c7f27723ea0a292a612
SDMA4.4 and SDMA5.2+ has increased it's available copy size to 2^30 bytes
represented by exponent as bits set in the COUNT field of the
linear copy.
Also note that the full 2^22 byte limit is available from SDMA4 onwards
as it has corrected the 0x3fffe0 HW limitation from SDMA3.
As copy limit has increase, this can change system performance
so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall
back to the original 0x3fffe0 limit for debugging purposes.
Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee
To be able to trace memcpy asynchronously, both dst and src agents need to have profiling enabled and the api for enabling profiling was only enabling for gpu agents. CPU agents didn't have profiling enabled so the signal owner could not be known. hsa_amd_profiling_get_async_copy_time will fail with an HSA status error because it can't read the agent for the given signal.
Change-Id: Ie165e0e39b8fcd6992a55695b9ffcead10a8e812
- Update CMakeLists.txt
- find_package for rocprofiler-register
- this is an optional package until rocprofiler-register is added to the CI
- define HSA_VERSION_{MAJOR,MINOR,PATCH} ppdefs
- Update runtime.cpp
- include <rocprofiler-register/rocprofiler-register.h>
- if rocprofiler-register succeeds, do not support v1 unless explicitly requested
Change-Id: I8f48bbf3f6b52fb91ddade2f198491a1256035fe
Remove override that forces ROCr image blit source and ROCr test to use
code object version 4 now that mainline has been updated to version 5.
Change-Id: I94681e86835c0e382475306ead4cd4132a2ee78f
Add handler to handle HW exception events reported by underlying
drivers. These events are generally caused by GPU resets and need the
application to abort.
As an improvement, in the future, we can provide additional information
about the exception (e.g mode-reset level)
Change-Id: If3fb5f19f9fce181a9d3b5e34a5506725856e7b0
An AQL packet header field is stored using an atomic release, and needs
to be read using atomic acquire if it may be written by another thread.
Change-Id: I1d75587fd93f9c6216deebffc9a627b404a7e749
Define AMD_AQL_FORMAT_INTERCEPT_MARKER AMD vendor AQL packet. Add
support to intercept queue to invoke a callback for these packets.
Change-Id: Ia58d5fe2171f563632b4edd6343e02585f49d149
When the intecept queue copies packets from the proxy queue to the
wrapped queue, it should not attempt to copy packets that are outside
the proxy queue. This could happen if the user of the proxy queue
advances the write pointer beyond the number of free slots and the
packet rewriter reduces the number of packets.
Change-Id: Id02f5df8aee0ed7269f4de813731d507cf2126b3
If an intercept queue is created and multiple packet rewriters are
registered, and if one of the rewriters invokes the packet writer
multiple times, then on returning from the packet writer the packet
rewriter index needs to be restored. Otherwise the next packet writer
call will start with an index of 0 which will be decremented and result
in out of bounds vector access.
Change-Id: Icb3f6a81ea04f1f7b91551b974a1f48c4f32db60
It is possible that packet rewriting an initial packet for the intercept
queue produces more packets that the size of the wrapped queue. The code
would never submit the such a set of packets as it attempted to submit
all or none. This can result in an infinite loop.
This is corrected to submit what will fit if the rewrite is larger than
the wrapped queue.
Change-Id: I8f03228c2e15151287e25de46eaee998f829c62a
The intercept queue submit needs to be obstruction free as it can be
invoked by the runtime async handler helper thread. The code had a busy
wait loop waiting for a free slot to be available to add the retry
barrier packet. Blocking that thread prevents it servicing other async
handlers which may need to execute in order to allow packets on the
hardware queue to be processed to free up a slot.
Change the code to always leave one free slot unless there is a retry
barrier packet already on the queue.
Change-Id: If901c865550258b790b995d58037b0f99f1968cc
Describe the assumption being made when checking if there is a retry
barrier packet on the queue. Also enforce the consequential requirement
of the minimum queue size.
Change-Id: I0efaffc5a79b9e2fdab3655b8b74270118a5c2ff
The intercept queue was processing all the packets on the proxy queue.
This could result in the rewrite of more than one packet being put on
the overflow queue. If there are a lot of packets on the intercept
queue this could result in the overflow queue having more packets than
the size of the hardware queue. The code to submit the overflow queue
fails if it is unable to put all the packets of the overflow on the
hardware queue. This resulted in an infinite loop. It also resulted in
an assert being reported that packets are being added to the overflow
queue when it is not empty.
Correct this by checking if the overflow queue is non-empty after
rewriting each packet. If it is non-empty then stop processing
additional packets. The additional packets will be processed when the
barrier packet added to the hardware queue is executed due to its asyn
handler. This barrier packet is added to the hardware queue whenever
packets are saved on the overflow queue.
Change-Id: I2537911d3c3ba1aac61a0a35f1ab97426a66b5a2
When forcing SDMA copies, engine ID specified by the requester should
still be used since the requester has hint of engine availability.
Change-Id: Idefa9494e407e31da510aa4c7c1fa283c85a4f6e
The Vendor specific header is only 8-bits and this would break the
behavior on big-endian machines. Renaming field to amd_format to match
name in spec sheets.
Change-Id: I65559757657565d3d3ff489d2663a0be42cf8ba5
For large memory allocations (>2MB) the thunk should use the
MADV_HUGEPAGE flag for madvise call to optimize allocation performance
on certain operating systems that rely on madvise hint when Traspatent
Huge Pages is not set to always.
Suggested-by: Joseph Greathouse <joseph.greathouse@amd.com>
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
Change-Id: Ic0c753f89a177b0f715942d6e2a7108b08a85f20
Provided pkgconfig file contains interface description which is arch
dependent. In such cases .pc files should be installed in libdir.
Signed-off-by: Tomasz Kłoczko <kloczek@github.com>
Change-Id: Ibbc85ad4aee1ef014c409dfa63313873b590464b
Set CWSR svm range granularity to 0xff, then KFD will migrate the entire
CWSR range from VRAM back to system memory when recovering the CPU page
fault if rocgdb access CWSR area, this avoid the partial CWSR range
migration and stall CWSR GPU mapping issue.
This is a temporary workaround, it should be reverted once the KFD is
fixed.
Change-Id: I80a7248244574edba25b13858b7ebcf1c77b8930
Signed-off-by: Philip Yang <Philip.Yang@amd.com>
Some new CPUs have different cache reporting structure causing thunk to
leave the cache information empty. Allow the cache information for CPU
agents to be empty as they are not used by language-runtimes
Change-Id: Ic5e880171ab20aa114b4b62bdb4479eb54066f7b
This share resource is for IOMMUv2 which is removed from
AMDGPU/KFD.
Change-Id: Ia6e9311f1adc56fac2c9e8fa05b24c5ec8c272a5
Signed-off-by: James Zhu <James.Zhu@amd.com>