Commit Graph

1157 Commits

Author SHA1 Message Date
David Yat Sin efe455c2fa Temporary: Set AllocateGTTAccess and node_id for MES
Temporary change to set the AllocateGTTAccess flag and node_id
on MES devices.

Change-Id: I22385d11b17b76cfb44278fa0d8a09bc8721cea6
2024-03-29 19:38:19 +00:00
Konstantin Zhuravlyov 9e8f185397 Add R_AMDGPU_ABS32 support
Change-Id: I0ee0302d919ede44765adf02eab15015573efef2
2024-03-26 18:47:29 -04:00
Konstantin Zhuravlyov c5e74b7d0a Add dynamic relocation types (NFC)
Change-Id: I1b443003077ba241f34444da293e362266c2ae92
2024-03-26 18:47:05 -04:00
Konstantin Zhuravlyov b2c32ad6cb Rename existing relocation types to legacy/v1 (NFC)
Change-Id: Ided7f656c34131b8067a19c0d3b2955fc8823628
2024-03-26 18:46:50 -04:00
Shweta.Khatri 00b63f7452 Replace lazy_ptr's Init() with reset() method
The function Init() called by one of the constructors of lazy_ptr is undefined.
Replacing it with the reset() method sets the object to an uninitialized state and assigns a new constructor function.

Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184

Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3
2024-03-26 15:07:34 -04:00
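The reset() behaviour described in 00b63f7452 can be sketched roughly as follows. This is a hypothetical, single-threaded stand-in: the class name `LazyPtrSketch`, its members, and its lack of locking are assumptions for illustration, not the ROCr lazy_ptr implementation.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical lazy pointer; NOT the ROCr lazy_ptr implementation.
template <typename T>
class LazyPtrSketch {
 public:
  LazyPtrSketch() = default;

  // The constructor delegates to reset() instead of a separate Init().
  explicit LazyPtrSketch(std::function<T*()> ctor) { reset(std::move(ctor)); }

  // Drops any object built so far (back to the uninitialized state)
  // and installs a new constructor function.
  void reset(std::function<T*()> ctor = nullptr) {
    object_.reset();
    ctor_ = std::move(ctor);
  }

  // True once the object has been constructed.
  bool created() const { return object_ != nullptr; }

  // Constructs the object on first access; returns nullptr if no
  // constructor function is installed.
  T* get() {
    if (!object_ && ctor_) object_.reset(ctor_());
    return object_.get();
  }

 private:
  std::function<T*()> ctor_;
  std::unique_ptr<T> object_;
};
```

Constructing a `LazyPtrSketch<int>` with a lambda does not run the lambda until the first `get()`, and a later `reset()` with a different lambda discards the old object so the next `get()` builds a new one.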
Shweta.Khatri 02a40e9272 Convert some comments to Doxygen-style comments
hsa_ext_amd.h - Fix provided by github developer - Mátyás Aradi
Github request - https://github.com/ROCm/ROCR-Runtime/pull/187

Change-Id: I63e4175caebd10be0151f21bd5f048dd011aaf06
2024-03-25 11:47:14 -04:00
David Yat Sin 9d842dd1d8 Fix uninitialized variables
Change-Id: Ie5da4547fa764e55162aff287cbb338ed4324093
2024-03-14 15:20:56 -04:00
pvanhout a93c18dc90 [libamdhsacode] Support COV6/Generic Targets
Change-Id: I4680577eb56dc436fbc134b169f172dd476bff37
2024-03-12 07:37:32 -04:00
Jonathan R. Madsen 5402842d5f Add hsa_api_trace_version.h
- hsa_api_trace.h contains C++
- rocprofiler-sdk needs to include the table version number defines (*_MAJOR_VERSION and *_STEP_VERSION) for the HSA API in its public headers
- rocprofiler-sdk needs its public headers to be C-compatible so hsa_api_trace_version.h was created

Change-Id: Ieece990b3b7775cb0446b545c9e3391c5f691c61
2024-03-12 01:17:34 -04:00
Jonathan Kim eb2100daad Fix deferred dmabuf export on IPC due to GEM object loss
When deferring a dmabuf export on an import call, there may be a
failure to export as the GEM object is not referenced by the kernel
mode driver.  To get around this, do a non-deferred export and
immediately close the dmabuf FD to keep FD creation to a minimum.
This way, the GEM object will have a kernel mode driver reference
when a deferred export is done.

Also, a bad dmabuf FD sent over a socket may not be received by an import
reader, and this can cause a hang.
Set a 10-second timer so that the importer does not block indefinitely.

Change-Id: I11a9b5ec64aa2e16fd6aecdf46c34e4eb56ccfd0
2024-03-07 12:12:06 -05:00
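One way to keep a reader from blocking forever, as eb2100daad describes, is to wait on the socket with a bounded timeout. The sketch below uses POSIX `poll()`. The commit message does not say which mechanism ROCr uses, so the helper `WaitReadable` and the choice of `poll()` are assumptions.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

// Hypothetical helper, not ROCr code. Waits until fd is readable or
// timeout_ms elapses. Returns true if data is ready, false on timeout
// or error.
bool WaitReadable(int fd, int timeout_ms) {
  struct pollfd pfd = {};
  pfd.fd = fd;
  pfd.events = POLLIN;
  int rc;
  do {
    // Note: a retry after EINTR restarts the full timeout in this sketch.
    rc = poll(&pfd, 1, timeout_ms);
  } while (rc < 0 && errno == EINTR);
  return rc > 0 && (pfd.revents & POLLIN) != 0;
}
```

An importer following the commit's 10 second limit would call something like `WaitReadable(sock, 10000)` before reading and treat a `false` result as a failed import rather than hanging.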
Alex Sierra cbeddf9eb6 core dump: Generates a core dump from a fault event
Extracts and creates a core dump ELF file from a fault event, using
core dump front end. GFX11 is not supported.

Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: I5ae154e886f39ab3ce7bbae5803efb27a96c7e2e
2024-03-05 09:28:44 -05:00
Lancelot SIX 5d3f6a63f1 trap_handler: Set status.skip_export when halting a wave
When inspecting waves on architectures where SPI may not initialize TTMP
registers, the debugger cannot reliably know if the trap handler was
entered and if it saved valuable information in TTMP registers.

This patch uses the status.skip_export bit (unused by the compute
shaders) to indicate that it got executed before halting a wave.
This is done except for gfx940, where ttmp11[31] can be used (as long as
TTMP registers are always initialized by SPI for this architecture).  It
could be possible to be more selective as architectures always
initializing TTMP registers do not require this step, but always doing
it makes maintenance simpler.

Change-Id: I5c4148c78062f7ffa049ac7856c2edc82dbc77d1
2024-03-05 09:28:33 -05:00
Jonathan Kim ed462035fa Disable SDMA ganging on non-APU multi-partition modes
Work around SDMA hang in non-SPX modes for non-APU devices by disabling
ganging.
The root cause of the hang has not been found.
Non-APU xGMI modes have only 1 link between socket devices anyway, so
there's likely no real system-level gain in ganging intra-socket.

Change-Id: Ia4eda2f85cbf25151d3dbcf50cc45b8b775c60e2
2024-02-28 14:52:01 -05:00
Jonathan Kim ed260ea970 Fix gang item wait on dependency signals
Gang items have to wait on dependency signals as well as the leader.
Copies should not start if shaders are still operating on memory
to be copied.

Change-Id: I99703b420045ebcba2c9da39ec64678129dc140f
2024-02-27 12:45:41 -05:00
Shweta Khatri f2006d6899 Record interop mapped object in allocation_map_
This allows the VA to be recorded in ROCr so that it is not
treated as an invalid pointer in future API calls.

Change-Id: I8d1d8ef9816a984c89d30a2179b0ce8940fef1da
2024-02-26 13:40:55 -05:00
Jonathan R. Madsen 7ce263b0e4 Update rocprofiler-register support
- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"

Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
2024-02-22 11:55:25 -05:00
Shweta Khatri 24633c7a85 Avoid releasing scratch for blit queues
At hsa_shutdown(), scratch_lock_ may be gone. Blit queues don't need it.

Change-Id: Ic132ac8a6be31fb2f0623137115608b0b222f077
2024-02-22 14:12:05 +00:00
Jonathan Kim 1f63ea3476 Fix export-close race during IPC attach request
If two attach requests to the same piece of shared memory occur,
a double export or premature dmabuf fd close can occur since the export
and close on demand calls are not atomic.

Use a reference counter on shared memory dmabuf FDs that have
already been opened to avoid this problem.

Change-Id: I14a59209c0385e32582af42a57b33b1c6838a9b1
2024-02-22 14:12:05 +00:00
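The reference-counting idea in 1f63ea3476 can be sketched as a small table guarded by a mutex, so that "export on first use" and "close on last use" are decided atomically. The class `ExportedFdTable` and its methods are hypothetical; the actual ROCr data structures are not shown in the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical tracker: one entry per shared-memory handle whose dmabuf FD
// is currently open. Not the ROCr implementation.
class ExportedFdTable {
 public:
  // Returns true if the caller is the first user and must do the export.
  bool Acquire(std::uint64_t handle) {
    std::lock_guard<std::mutex> guard(lock_);
    return ++refs_[handle] == 1;
  }

  // Returns true if the caller is the last user and must close the FD.
  bool Release(std::uint64_t handle) {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = refs_.find(handle);
    if (it == refs_.end()) return false;  // unknown handle: nothing to close
    if (--it->second > 0) return false;   // other users still hold the FD
    refs_.erase(it);
    return true;
  }

 private:
  std::mutex lock_;
  std::map<std::uint64_t, unsigned> refs_;
};
```

With two concurrent attach requests for the same handle, only the first `Acquire` reports that an export is needed, and only the second `Release` reports that the FD should be closed.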
David Yat Sin b77ade9c64 rocrtst: Add non-contiguous VMM map tests
Add rocrtst to test mapping non-contiguous memory to a
single VA range

Change-Id: Id2e57f83512f8b482456b2b1925586951ada7400
2024-02-22 14:12:05 +00:00
David Yat Sin 99e31e43aa rocrtst: Add test for GPU access to memory
Add test to verify whether GPU shaders can read memory created using VMM
APIs.
Split VMM rocrtst to two separate groups: Basic and Access tests

Change-Id: Iead8d46125580c71ccd582e967c8e2e891e75c5e
2024-02-22 14:12:05 +00:00
David Yat Sin 1f50219634 Fix compile error when using clang
Change-Id: Ibacf094934a9b489c052a18eeb6b26639aba3032
2024-02-22 14:12:05 +00:00
David Yat Sin 5b28a1bc17 Fix compile error on certain gcc versions
Change-Id: I8a4fab76d1dcc576eb7706ab45fc786c0cab274a
2024-02-13 15:25:34 -05:00
David Yat Sin ae16b3e14e Use sysconf pagesize for system pagesize
Provided by user huanggyizhi on github
https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/124

Change-Id: Ia03c45f7a869ae2c804accf8163f8ae36c20dd5a
2024-02-13 14:28:10 -05:00
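Querying the page size at run time, as ae16b3e14e describes, typically looks like the sketch below. The helper names and the 4096-byte fallback are assumptions of this example, not taken from the ROCr source.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Asks the OS for the page size instead of assuming 4 KiB.
// The 4096 fallback is an assumption of this sketch.
std::size_t SystemPageSize() {
  const long value = sysconf(_SC_PAGESIZE);
  return value > 0 ? static_cast<std::size_t>(value) : 4096;
}

// Rounds size up to a whole number of pages.
std::size_t AlignUpToPage(std::size_t size) {
  const std::size_t page = SystemPageSize();
  return (size + page - 1) / page * page;
}
```

On systems configured with larger pages (for example 16 KiB or 64 KiB), `AlignUpToPage(1)` returns that larger size, which a hardcoded 4096 would get wrong.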
Joseph Huber 9e26cbac14 Add executable symbol info for the wavefront size
The wavefront size is currently only exposed as an agent level
attribute. This is not correct, because while the agent has a default
wavefront size that is usually correct, it can easily be overridden via
options like -mwavefrontsize64 on various ISAs. The wavefrontsize
attribute is actually more of a calling convention that is consistent
within a callgraph. Because the root of each call graph is a kernel in
this architecture, we need to be able to query this on a per-kernel
basis. This information is already available in the kernel descriptor
packet, but it wasn't exported.

This patch adds HSA_CODE_SYMBOL_INFO_KERNEL_WAVEFRONT_SIZE as a new
option to query on the executable symbol.

Change-Id: I744815c89cc9d4c82f25479bdd48ae1f32e859ff
2024-02-09 15:55:30 +00:00
Jonathan Kim e911335cee Minimize FD creation on IPC Create
Instead of caching shared memory fds for export on the exporter side,
only export the FD in the async handler when requested.
The importer should request export fd closure once import is done.

Change-Id: I469e0cd1749beeb9c506c8a6461745fb039d9c3b
2024-02-07 18:50:54 -05:00
Mythreya 8e312471dc Fix ToolsApiTable versioning
ToolsApiTable's version was incorrectly default initialized to 0.
Fixes error in commit fc889669

Change-Id: I41e9301a9c33b119ee50f6164d21ddf11dc188c4
2024-02-07 17:02:32 -05:00
Shweta Khatri 13800cc6d5 Set max_alloc to 95%, reduce by 1% on fail
Prevents triggering the OOM killer when all physical and swap memory is fully used

Change-Id: I70d558fa9c06fe6217e62d57e11aec6a089aa0bb
2024-02-07 14:46:58 -05:00
David Yat Sin f7de85082e VMM: Allow non-contiguous memory maps
Adjust code to allow the use of non-contiguous chunks of memory to be
mapped within a single VA range.

Change-Id: Ida21ba202927229347b3a32d9b7106df10819cf5
2024-02-07 16:56:52 +00:00
David Yat Sin 776da1a3f7 rocrtst: Add some tests for hsa_amd_pointer_info
Add tests to catch whether ROCr breaks ABI compatibility with the
hsa_amd_pointer_info API in case the hsa_amd_pointer_info struct is
extended.

Change-Id: I4e69bf30db9791e59f895b2798b87985c41242e5
2024-02-07 16:56:52 +00:00
David Yat Sin 0f30da58a7 Improve documentation for set_async_scratch_limit API
Change-Id: I03ca986cdd468c7b167e119bd2f25d5c79ff2142
2024-02-07 16:56:52 +00:00
Mythreya a67af3807f Initial support for scratch allocation tracking
Add new tools table and functions to notify in case of an event

Change-Id: I47f0c2f3c8e02d7bcb74d649903eb4f86721c154
2024-02-07 16:56:52 +00:00
Joseph Greathouse 1d6691e06b Fix undefined behavior in definition of hsa_amd_memory_fault_reason_t
Currently, the definition of hsa_amd_memory_fault_reason_t tries to
set a constant of 0x8000_0000 by using the definition "1 << 31".

However, the 1 in this definition is a signed integer by C++ rules.
On our architectures, shifting a signed integer by 31 results in
signed integer overflow. Signed integer overflow results in
undefined behavior.

Forcing the 1 to be unsigned avoids this.

Change-Id: I860431eeede4eff29598f646abf3c1337b048d71
2024-02-07 16:56:52 +00:00
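The pattern behind 1d6691e06b can be sketched as follows. The enumerator names are hypothetical and are not the real members of hsa_amd_memory_fault_reason_t; only the contrast between `1 << 31` and `1u << 31` comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag enum; names are illustrative, not the HSA enumerators.
enum ExampleFaultReason : std::uint32_t {
  EXAMPLE_FAULT_LOW_BIT = 1u << 0,
  // Writing `1 << 31` would shift a signed int into its sign bit.
  // The unsigned literal keeps the whole shift in unsigned arithmetic.
  EXAMPLE_FAULT_HIGH_BIT = 1u << 31,
};

static_assert(EXAMPLE_FAULT_HIGH_BIT == 0x80000000u, "bit 31 must be set");

// Returns true when the high bit is present in a mask of reasons.
constexpr bool HasHighBit(std::uint32_t mask) {
  return (mask & EXAMPLE_FAULT_HIGH_BIT) != 0;
}
```

Because the literal is unsigned, the value 0x80000000 is produced without relying on how a given compiler or language standard treats shifts into the sign bit.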
Jonathan Kim 1dd4a7dc18 Fix copy logic on devices with no xgmi SDMAs
Fix gang factor overwrite of 0 if there are no xGMI SDMAs
on the device and gang factor is 1.

Change-Id: I041d4b4ae87fb68f224ee4dedb758c6f06c022a9
2024-02-07 16:56:52 +00:00
Jonathan Kim a3efd13a2f Fix IPC import on device memory with no requested nodes
Users can import device memory without specifying the target node.
DMA buf imports return a Thunk handle that's not useful for
gpu mapping calls.

Fix this by using the import node information to re-import and
map with the correct target GPU.

Also fix IPC detach calls by deregistering the Thunk handle
import immediately during attach instead of failing to do it later
on detach since Thunk handles aren't placed into ROCr allocation
map.

Finally refactor the IPC attach function for cleaner logic flow.

Change-Id: Ib2bf178110b2be98bd6917c765f724e4e613f5f2
2024-02-06 23:15:29 +00:00
Jonathan Kim 15691ae460 Fix DMABuf FD closure for IPC attach client
We should also close the client side dmabuf fd after importing for target
nodes.

Change-Id: I74f61dd65bebb03dc002f5df7301efd1ef8d9603
2024-02-06 23:15:29 +00:00
Jonathan Kim 62f3f250ce Optimize and fix SDMA gang copies
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items.  Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.

Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxiliary gang factor to the number of non-XGMI SDMA engines.

Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
2024-01-25 10:42:27 -05:00
James Zhu caedadcc6f rocrtst: change max memory search algorithm.
The old max memory search algorithm used binary search to find the
last successful memory allocation. However, each successful memory
allocation takes time, whereas an unsuccessful memory allocation
returns very quickly. Change the search to start from MAX and step
down by the granularity interval until the first successful memory
allocation is found, which speeds up this test.

Change-Id: Idada3c6f750c94f3bb223f4f3bff4e4ebd3e98f7
Signed-off-by: James Zhu <James.Zhu@amd.com>
2024-01-18 13:46:44 -05:00
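The downward search described in caedadcc6f can be sketched as below. The function name, the callback-based allocator probe, and the return value of 0 on total failure are assumptions for illustration; the actual rocrtst code is not shown in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical sketch. Walks down from max_size in steps of granularity and
// returns the first size for which try_alloc succeeds, or 0 if none does.
std::size_t FindMaxAllocSize(
    std::size_t max_size, std::size_t granularity,
    const std::function<bool(std::size_t)>& try_alloc) {
  assert(granularity > 0);
  // size >= granularity guarantees the subtraction never wraps around.
  for (std::size_t size = max_size; size >= granularity; size -= granularity) {
    if (try_alloc(size)) return size;  // first success is the largest fit
  }
  return 0;
}
```

Under the commit's premise that failed allocations return quickly, this performs many cheap failures and exactly one expensive success, whereas a binary search performs a successful allocation at roughly half of its probes.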
Sam Wu 1c6ad56dc6 Apply doc standards for ReadtheDocs builds
Applies the following changes:
- add version number to documentation left navigation bar and page title
- add an "About" section with a license page
- enable htmlzip, pdf, epub formats when publishing on Read the Docs
- set pdf title, author, copyright, and version
- rename .sphinx/.doxygen to sphinx/doxygen
- remove docBin from URL
- update rocm-docs-core dependency

Change-Id: I947cf32cd42d9f4e55b1ddd324ad4a7e4ba3f3e3
2024-01-18 12:07:27 -05:00
David Yat Sin 84c30dd735 VMM: rocrtst for exporting/importing dmabuf
This is part of patch series for Virtual Memory API.

Change-Id: I1f1357a39b48b0d0611967ce9dd0b83b6a8db864
2024-01-17 10:25:20 -05:00
David Yat Sin a69c1e9f39 VMM: rocrtst for basic virtual memory APIs
This is part of patch series for Virtual Memory API.

Change-Id: Ic3b44435cb09ad17d833b4a4b2551bd211b494e9
2024-01-17 10:25:09 -05:00
David Yat Sin 32b3a3c299 VMM: Use emplace when adding entries
Use emplace to prevent copying the MappedHandle objects when inserting
entries into mapped_handle_map_.

Change-Id: Id3f40f1eb73ce30e62da53c5aea4dd715e83ac59
2024-01-17 10:25:04 -05:00
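The effect of using emplace, as in 32b3a3c299, can be demonstrated with a value type that counts its own copies. `CountedHandle` is a hypothetical stand-in for MappedHandle, and the `std::map` container is an assumption; the commit message does not state the type of mapped_handle_map_.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Hypothetical stand-in for a mapped-handle record that counts its copies.
struct CountedHandle {
  static int copies;
  std::uint64_t size;
  explicit CountedHandle(std::uint64_t s) : size(s) {}
  CountedHandle(const CountedHandle& other) : size(other.size) { ++copies; }
};
int CountedHandle::copies = 0;

using HandleMap = std::map<std::uint64_t, CountedHandle>;

// Builds the value directly inside the map node. Returns the copies made.
int InsertWithEmplace(HandleMap& table, std::uint64_t va, std::uint64_t size) {
  const int before = CountedHandle::copies;
  table.emplace(std::piecewise_construct, std::forward_as_tuple(va),
                std::forward_as_tuple(size));
  return CountedHandle::copies - before;
}

// Builds a temporary pair first, so the value is copied into the node.
int InsertWithPair(HandleMap& table, std::uint64_t va, std::uint64_t size) {
  const int before = CountedHandle::copies;
  std::pair<const std::uint64_t, CountedHandle> entry(va, CountedHandle(size));
  table.insert(entry);
  return CountedHandle::copies - before;
}
```

For a new key, `InsertWithEmplace` reports zero copies, while `InsertWithPair` reports at least one, which matters when the value type is expensive to copy or owns resources.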
David Yat Sin 29efd8eccd VMM: Fix flags when allocating memory handle
When allocating a memory handle, the NoAddress thunk flag should be set
so that this allocation does not have a virtual address range.
Also, skip mapping the memory when allocating a memory handle

Change-Id: I1c168bc00ddbc158d447197c4dc25f96bad02b19
2024-01-17 10:24:58 -05:00
David Yat Sin 2f97049da5 VMM: Default access should be none
After a memory handle is created, hsa_amd_vmem_get_access should return
HSA_ACCESS_PERMISSION_NONE instead of reporting the allocation as
invalid.

Change-Id: I1a09d15c220d48497d09c89059493e538f82aeb9
2024-01-17 10:24:51 -05:00
David Yat Sin 8b85f9e668 VMM: Fix access for multi-GPU
When using multi-GPU for each BO, a new dmabuf_fd needs to be imported
into libdrm.

Change-Id: Iaa2415c8f655a1ce8e92b0878517a11ff014a1d5
2024-01-17 10:24:35 -05:00
Jonathan R. Madsen 8f0ea44c09 Suppress reporting no tools were found with rocprofiler-register
Change-Id: If853517d40e073202d12e2a6b16fb54be5529650
2024-01-17 01:01:19 -05:00
Jonathan Kim e20f41df62 Enable IPC DMA buf
Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation
by default).

Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2
2024-01-16 22:43:27 -05:00
Jonathan Kim 5dfebdbca9 Change IPC implementation to use DMA Bufs
As the KFD IPC IOCTLs will not be upstreamed, change runtime
implementation to use DMA bufs.

DMA buf fds will be passed over abstract unix domain sockets.
The exporter spins a thread that creates a socket server.
The importer connects to the server to fetch the fd.

libDRM will be required to do a manual import and GPU map for
memory that is not already imported and mapped.

For now, use the legacy IPC implementation by default as a
follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY
environment variable.

Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4
2024-01-16 22:43:00 -05:00
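Passing a file descriptor between processes over a Unix domain socket, as 5dfebdbca9 describes, is usually done with `SCM_RIGHTS` ancillary data. The sketch below shows that general Linux/POSIX technique. The helper names are hypothetical, and it works on any already-connected Unix socket rather than reproducing ROCr's abstract-socket server and thread.

```cpp
#include <cassert>
#include <cstring>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

// Hypothetical helper. Sends fd_to_send over a connected Unix domain socket.
bool SendFd(int sock, int fd_to_send) {
  char payload = 'F';  // one byte of ordinary data accompanies the FD
  struct iovec iov = {&payload, 1};
  alignas(struct cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  struct msghdr msg = {};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;  // ask the kernel to transfer the descriptor
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  std::memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));
  return sendmsg(sock, &msg, 0) == 1;
}

// Hypothetical helper. Receives one FD; returns -1 on failure.
int RecvFd(int sock) {
  char payload = 0;
  struct iovec iov = {&payload, 1};
  alignas(struct cmsghdr) char control[CMSG_SPACE(sizeof(int))] = {};
  struct msghdr msg = {};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = control;
  msg.msg_controllen = sizeof(control);
  if (recvmsg(sock, &msg, 0) <= 0) return -1;
  struct cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == nullptr || cmsg->cmsg_level != SOL_SOCKET ||
      cmsg->cmsg_type != SCM_RIGHTS) {
    return -1;  // no descriptor was attached to this message
  }
  int fd = -1;
  std::memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
  return fd;
}
```

The received descriptor is a new number in the receiving process that refers to the same underlying open file, so the receiver is responsible for closing it once the import is done.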
David Yat Sin 0e3f668e2c Use HybridMutex for IPC locks
Change-Id: I24ab4a96237612a7d32beda06cc20b25cb1f0b37
2024-01-16 21:29:39 +00:00
David Yat Sin 8d3fee5095 Use HybridMutex for signal mutexes
Implement HybridMutex to improve latencies compared to KernelMutex when
there is contention between several threads calling hsa_signal_create
and hsa_amd_signal_async_handler.

Change-Id: If53377033e749b0050727964c9303f09b02527cc
2024-01-16 21:29:39 +00:00
David Yat Sin 3d1563ee68 Force t1_ update when profiling is enabled
Fixes issue where t1_ counters may not be updated when doing dispatch
profiling, causing a divide by 0.

Change-Id: I91060ac3f9fd2183d277e6e7cd810398a453a87f
2024-01-16 21:29:39 +00:00