- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"
Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
If two attach requests to the same piece of shared memory occur,
a double export or premature dmabuf fd close can occur since the export
and close on demand calls are not atomic.
Use a reference counter on shared memory dmabuf FDs that have
already been opened to avoid this problem.
Change-Id: I14a59209c0385e32582af42a57b33b1c6838a9b1
The wavefront size is currently only exposed as an agent level
attribute. This is not correctyl, because while the agent has a default
wave front size that is usually correct, it can easily be overridden via
options like -mwavefrontsize64 on various ISAs. The wavefrontsize
attribute is actually more of a calling convention that is consistent
within a callgraph. Because the root of each call graph is a kernel in
this architecture, we need to be able to query this on a per-kernel
basis. This information is already avialable in the kernel descriptor
packet, but it wasn't exported.
This patch adds HSA_CODE_SYMBOL_INFO_KERNEL_WAVEFRONT_SIZE as a new
option to query on the executable symbol.
Change-Id: I744815c89cc9d4c82f25479bdd48ae1f32e859ff
Instead of caching shared memory fds for export on the exporter side,
only export the FD in the async handler when requested.
The importer should request export fd closure once import is done.
Change-Id: I469e0cd1749beeb9c506c8a6461745fb039d9c3b
Adjust code to allow the use of non-contiguous chunks of memory to be
mapped within a single VA range.
Change-Id: Ida21ba202927229347b3a32d9b7106df10819cf5
Currently, the definition of hsa_amd_memory_fault_reason_t tries to
set a constant of 0x8000_0000 by using the definition "1 << 31".
However, the 1 in this definition is a signed integer by C++ rules.
On our architectures, shifting a signed integer by 31 results in
signed integer overflow. Signed integer overflow results in
undefined behavior.
Forcing the 1 to be unsigned avoids this.
Change-Id: I860431eeede4eff29598f646abf3c1337b048d71
Users can import device memory without specifying the target node.
DMA buf imports return a Thunk handle that's not useful for
gpu mapping calls.
Fix this by using the import node information to re-import and
map with the correct target GPU.
Also fix IPC detach calls by deregistering the Thunk handle
import immediately during attach instead of failing to do it later
on detach since Thunk handles aren't placed into ROCr allocation
map.
Finally refactor the IPC attach function for cleaner logic flow.
Change-Id: Ib2bf178110b2be98bd6917c765f724e4e613f5f2
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items. Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.
Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxillary gang factor to the number of non-XGMI SDMA engines.
Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
Applies the following changes:
add version number to documentation left navigation bar and page title
add an "About" section with a license page
enable htmlzip, pdf, epub formats when publishing on Read the Docs
set pdf title, author, copyright, and version
rename .sphinx/.doxygen to sphinx/doxygen
remove docBin from URL
update rocm-docs-core dependency
Change-Id: I947cf32cd42d9f4e55b1ddd324ad4a7e4ba3f3e3
Use emplace to prevent copying the MappedHandle objects when inserting
entries into mapped_handle_map_.
Change-Id: Id3f40f1eb73ce30e62da53c5aea4dd715e83ac59
When allocating a memory handle, the NoAddress thunk flag should be set
so that this allocation does not have a virtual address range.
Also, skip mapping the memory when allocating a memory handle
Change-Id: I1c168bc00ddbc158d447197c4dc25f96bad02b19
After a memory handle is created. hsa_amd_vmem_get_access should return
HSA_ACCESS_PERMISSION_NONE insread of reporting the allocation as
invalid.
Change-Id: I1a09d15c220d48497d09c89059493e538f82aeb9
As the KFD IPC IOCTLs will not be upstreamed, change runtime
implementation to use DMA bufs.
DMA buf fds will be passed over abstract unix domain sockets.
The exporter spins a thread that creates a socket server.
The importer connects to the server to fetch the fd.
libDRM will be required to do a manual import and GPU map for
memory that is not already imported and mapped.
For now, use the legacy IPC implementation by default as a
follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY
environment variable.
Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4
Implement HybridMutex to improve latencies compared to KernelMutex when
there is contention between several threads calling hsa_signal_create
and hsa_amd_signal_async_handler.
Change-Id: If53377033e749b0050727964c9303f09b02527cc
Fixes issue where t1_ counters may not be updated when doing dispatch
profiling, causing a divide by 0.
Change-Id: I91060ac3f9fd2183d277e6e7cd810398a453a87f
KFD had some fixes for handling of virtual memory APIs. These fixes are
included in interface version 1.15.
Change-Id: Ie701eccf6e032f9ec0a1f4e8a43718964eebddc6
Update the `hsa.h` header to use the gcc / clang `__BYTE_ORDER__`
macros where available to more accurately autodetect endianness for
the target.
Change-Id: I7312f3badcba9287a30eb14882b91e2a247acc5f
This reverts commit c5db063b2f. This
change is required for the runtime to generate reliable core dump files,
but this feature has been disabled for now by
5e3be9c28a. Until it is needed, revert
the ABI change in the trap handler to maintain compatibility with older
debugger.
Change-Id: I77a1562dc7962befe2bf88442df858e2d2b1c5ab
This reverts commit 803e37ded5.
This commit disables core dump feature. Apparently, gfx1101 SA1 waves
can not enter the trap handler because they receive an invalid
address. However, core dump at the debugger has been moved to rocm
6.2.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: I7915caf58118658e5e7f435f91a0a6216d2fdb42
On some systems, pthread_addr_setaffinity_np does not exist, so we need
to use pthread_setaffinity_np on thread after pthread_create
Provided by Julian Samaroo on github
https: //github.com/RadeonOpenCompute/ROCR-Runtime/pull/143
Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb
Add query to return flags for GPU agent memory properties and AQL
extensions.
Implement flag to determine that GPU agent is an APU
Change-Id: Ic04c51290b2b9763e14989c117f35a2e22297453
When inspecting waves on architectures where SPI may not initialize TTMP
registers, the debugger cannot reliably know if the trap handler was
entered and if it saved valuable information in TTMP registers.
This patch uses the status.skip_export bit (unused by the compute
shaders) to indicate that it got executed before halting a wave.
This is done except for gfx940, where ttmp11[31] can be used (as long as
TTMP registers are always initialized by SPI for this architecture). It
could be possible to be more selective as architectures always
initializing TTMP registers do not require this step, but always doing
is makes maintenance simpler.
Change-Id: I314db6b37772f7daa8bd405e6662a86658d3f5e0
Extracts and creates a core dump ELF file from a fault event, using
core dump front end.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: Ibbbe41b3d13dd3fcb90161e927d48c329cf513a9
Member added to KFDVersion to report if KFD supports core dump
mechanism. This is done through hsaKmtRuntimeEnable API call while
the topology is being built. It also dictates if core dump will be
generated by either KFD or hsa-runtime.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: I2e9d4166563402f78613d728446feb692c52d9d1
Core dump generation considers ulimit to generate the proper size
file.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: I61d991fc003b173f9075b66bff6a931447720695
This API consists in one function to be called from a fault event at the
hsa-runtime to generate a core dump.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: Ib1b90d5beb13f93c4e8ebd21fd61705ebb12ca5d
SegmentBuilder classes are used to get core dump data from the GPUs.
So far, it uses thunk API calls and smaps to collect all data from
the Hardware.
Signed-off-by: Alex Sierra <alex.sierra@amd.com>
Change-Id: I2ad70ca5a951885181d3142653b186b0f6be739e
- fix logic for using HSA_TOOLS_LIB when rocprofiler-register support is enabled
- report tool load failure for rocprofiler-register
Change-Id: Ife23aa3e6ed19174376cd694764583b73f8976cd
The alternate scratch memory is used for dispatches that have a low
number of waves but relatively large wave size.
This allows us to keep the tmpring_size.bits.WAVES field of the main
scratch to full occupancy.
Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3
For devices where the CP FW supports asynchronous scratch reclaim, ROCr
is able to claw-back scratch memory that was assigned to an AQL queue.
With that ability, ROCr does not have to rely on using USO
(use-scratch-once) when assigning large amounts of memory to a queue.
If we reach a situation where we are running low on device memory, ROCr
will attempt to claw-back the scratch memory.
Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7