Commit graph

686 commits

Author SHA1 Message Date
David Yat Sin ac5fb8be9e Temporary: Do not early release mutex when not ganging
It seems the Release() function is not reliable and can cause segfaults.
This is a temporary work-around until the Release() function is fixed.

Change-Id: I95470a800c6153673e4b8f4fe46a646903325074
2024-04-30 17:07:39 -04:00
David Yat Sin 57b93e02a4 Use pthread_attr_setaffinity_np when available
If pthread_attr_setaffinity_np function exists use it instead of
pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set
the affinity settings on some systems.

Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949
2024-04-29 15:02:54 +00:00
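A minimal sketch of the approach named in 57b93e02a4, assuming a Linux system where pthread_attr_setaffinity_np is available. The affinity mask is placed on the thread attributes before pthread_create, so the thread starts already pinned instead of being re-pinned afterwards. The names PinCheck, CheckAffinity, StartPinnedThread and FirstAllowedCpu are hypothetical and do not come from the ROCr sources.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper state: which CPU we asked for and what the thread saw.
struct PinCheck {
  int cpu;      // CPU the thread is expected to be pinned to
  bool pinned;  // set by the thread after inspecting its own affinity mask
};

// Thread body: verify that the affinity mask is exactly {cpu}.
static void* CheckAffinity(void* arg) {
  PinCheck* check = static_cast<PinCheck*>(arg);
  cpu_set_t set;
  CPU_ZERO(&set);
  check->pinned =
      pthread_getaffinity_np(pthread_self(), sizeof(set), &set) == 0 &&
      CPU_COUNT(&set) == 1 && CPU_ISSET(check->cpu, &set);
  return nullptr;
}

// Creates a thread whose affinity is set through the attribute object
// (pthread_attr_setaffinity_np) rather than after creation.
bool StartPinnedThread(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);

  pthread_attr_t attr;
  if (pthread_attr_init(&attr) != 0) return false;

  PinCheck check{cpu, false};
  pthread_t thread;
  bool started =
      pthread_attr_setaffinity_np(&attr, sizeof(set), &set) == 0 &&
      pthread_create(&thread, &attr, CheckAffinity, &check) == 0;
  pthread_attr_destroy(&attr);
  if (!started) return false;

  pthread_join(thread, nullptr);
  return check.pinned;
}

// Picks the first CPU the calling thread is allowed to run on.
int FirstAllowedCpu() {
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0) return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
    if (CPU_ISSET(cpu, &set)) return cpu;
  }
  return -1;
}
```

The commit selects the function only "when available"; how that detection is done (build-time check or otherwise) is not stated in the log, so the sketch simply assumes the function exists.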
David Yat Sin 3d999a1adf Perform HDP flush for SDMA copies gfx10/gfx11
Perform HDP flush on gfx10/gfx11 PCIe devices.

Exclude gfx101x devices

Change-Id: Ief76c34634b09b0a7942cb71519d4082ca8b4fad
2024-04-24 18:07:34 -04:00
David Yat Sin 9af225e1b1 Add support for contiguous memory allocations
Support contiguous physical memory allocation flag. Allocations with
this flag will have contiguous physical memory. This is dependent on KFD
support for this flag and the AllocateKfdMemory(..) function call will
fail when it is not supported.

Change-Id: I6c51c8b061f7b026fdcc2aa2c37c74ecc13d95b6
2024-04-24 14:02:07 -04:00
David Yat Sin e539c8dce2 Remove assert for physical vs virtual memory size
On systems with more than 1 TB of memory per NUMA region, this triggers
unnecessary errors.

Change-Id: I1bc7f209b9c1739b516c9f6b0acf434488ac7b8d
2024-04-24 08:43:23 -04:00
David Yat Sin f2751b7030 Fix queue creation for PC Sampling
Fix lazy pointer initialization for dedicated PC Sampling queue.
Previous implementation would always create a queue on GPU agent
creation instead of creating the queue on first use.

Change-Id: Icf300f2b162e59143ba61ba182d9bee6e1308fc1
2024-04-22 19:00:48 +00:00
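The difference between "create on agent creation" and "create on first use" in f2751b7030 can be illustrated with a small lazy holder. This is only a sketch: LazyPtr and FakeQueue are hypothetical stand-ins, not the lazy_ptr class or the queue type used in ROCr.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical lazy holder: the factory runs on the first get(), not when
// the holder itself is constructed.
template <typename T>
class LazyPtr {
 public:
  explicit LazyPtr(std::function<T*()> factory)
      : factory_(std::move(factory)) {}

  // Creates the object on first call; later calls return the same object.
  T* get() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!object_) object_.reset(factory_());
    return object_.get();
  }

 private:
  std::function<T*()> factory_;
  std::mutex mutex_;
  std::unique_ptr<T> object_;
};

// Stand-in for an expensive resource such as a dedicated queue.
struct FakeQueue {
  int id;
};
```

Constructing a LazyPtr<FakeQueue> with a factory does not allocate anything; the factory runs only when get() is first called, which is the behaviour the commit message describes as the intended one.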
Shweta.Khatri bc9cac97fe Fixing compilation errors related to MUSL libc
Fix Musl libc NULL errors and unsupported pthread funcs for compatibility.
Also ensures cleanup and error handling irrespective of CPU affinity override.

Fix submitted by github dev - AngryLoki
https://github.com/ROCm/ROCR-Runtime/issues/181

Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be
2024-04-17 07:14:15 -04:00
David Yat Sin d6d5786051 Adding queue information queries
New hsa_amd_queue_get_info API to support:

- HSA_AMD_QUEUE_INFO_AGENT: Agent that owns the underlying HW queue

- HSA_AMD_QUEUE_INFO_DOORBELL_ID: KFD doorbell ID of the queue
completion signal.

Change-Id: I98842131bcbdd08552649791a5d43e578a615808
2024-04-11 12:53:48 -04:00
David Yat Sin 3443fdf665 PC Sampling: Disable coredump when sessions active
When doing a coredump, we try to park the wave and save its PC in
ttmp7/ttmp11, but these registers will be overwritten by PC Sampling
requests.

Change-Id: I60fb734eb3bed4ee3cc8d8bba9ec4a527fff9671
2024-04-11 12:53:43 -04:00
David Yat Sin 547c9cb143 PC Sampling: Implement lost sample count
Change-Id: Idfdfbac71c1813dd7a97c301619cf8ce83713c53
2024-04-11 12:53:31 -04:00
David Yat Sin 8abbf9475b PC Sampling: Implement flush
Flush is used by the client to retrieve data that are currently stored
in the buffers. This is used by the client to retrieve current data when
the buffers are not full.

Change-Id: Ib8304dcdfb2797cb060ec72df4970d95cf6be348
2024-04-11 12:53:24 -04:00
David Yat Sin 5177d17f5d PC Sampling: Push data to PC Sampling client
Each time there is enough data to fill the client session buffer,
callback the client data ready function to transfer the buffer contents
to the client.

Change-Id: Id79775426fa6d22e00dc2ef6f55c439eacb9b2af
2024-04-11 12:53:17 -04:00
David Yat Sin 855e454671 PC Sampling: Retrieve data from trap handler
Retrieve data from the buffers previously set in the 2nd level trap
handler TMA. We use a double buffering mechanism to allow the 2nd level
trap handler to write to one buffer while we are copying data from the
other.

Co-authored by: Joseph Greathouse <Joseph.Greathouse@amd.com>
Co-authored by: James Zhu <James.Zhu@amd.com>

Change-Id: I252c381ea06b8cf927c4f9af6ea59dedc3717fbb
2024-04-11 12:53:12 -04:00
David Yat Sin efdb72fd71 PC Sampling: Update 2nd level trap handler
Update 2nd level trap handler when PC Sampling is enabled

Change-Id: I95bf2bca8057d2f8313923c7f012f033e12ccc3a
2024-04-11 12:53:06 -04:00
David Yat Sin 8d666dea01 PC Sampling: Allocate resources to retrieve data from trap handler
Allocate required device and host buffers to be able to interact with
the 2nd level trap handler.

Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa
2024-04-11 12:53:00 -04:00
Joseph Greathouse 431a70471e PC Sampling: Add gfx9 2nd trap handler for PC Sampling
Code is valid for gfx9 GPUs excluding gfx94x.

1st level trap handler will use TTMP13[22] to indicate host trap and
TTMP13[21] to indicate stochastic trap.

For each PC sampling method (hosttrap and stochastic), we use a double
buffering mechanism to transfer data between GPU and host.
The GPU will dump data into one buffer while CPU may be reading data
from the other buffer. There are 2 separate signals, one for each
buffer.
When signal != 0, the buffer belongs to the GPU and the GPU can write
to it. Once the buffer has reached the high watermark, the GPU will
set the signal to 0 to wake up the host and so that the host can try
to switch the buffers and read the data.

Co-authored-by: David Yat Sin <David.YatSin@amd.com>
Change-Id: If3eb0913e52fb4788059a71e5feca334612f3d5d
2024-04-11 12:52:54 -04:00
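The double-buffering handshake described in 431a70471e can be modelled on the host alone. The sketch below is a single-threaded simplification under stated assumptions: DoubleBuffer, Produce and Drain are hypothetical names, the "GPU" is just a function call, and a sample written while the active buffer is already handed over is simply dropped. It follows the signal convention from the commit message: a non-zero signal means the producer owns the buffer, and zero means the buffer has reached the watermark and is waiting for the host.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: two buffers, one signal per buffer.
//   signal != 0 : buffer belongs to the producer (the GPU in the commit)
//   signal == 0 : buffer reached the watermark and belongs to the host
class DoubleBuffer {
 public:
  explicit DoubleBuffer(std::size_t watermark) : watermark_(watermark) {
    signal_[0].store(1);
    signal_[1].store(1);
  }

  // Producer side. Returns false if the sample had to be dropped because
  // the active buffer is still waiting for the host.
  bool Produce(std::uint64_t sample) {
    if (signal_[active_].load(std::memory_order_acquire) == 0) return false;
    data_[active_].push_back(sample);
    if (data_[active_].size() >= watermark_) {
      signal_[active_].store(0, std::memory_order_release);  // wake the host
    }
    return true;
  }

  // Host side. If the active buffer was handed over, switch the producer to
  // the other buffer, take the samples, and give the drained buffer back.
  std::vector<std::uint64_t> Drain() {
    const std::size_t full = active_;
    if (signal_[full].load(std::memory_order_acquire) != 0) return {};
    active_ = full ^ 1;  // producer continues in the other buffer
    std::vector<std::uint64_t> out;
    out.swap(data_[full]);  // take the samples and leave the buffer empty
    signal_[full].store(1, std::memory_order_release);
    return out;
  }

 private:
  std::size_t watermark_;
  std::size_t active_ = 0;
  std::atomic<int> signal_[2];
  std::vector<std::uint64_t> data_[2];
};
```

A real implementation would have the trap handler and the host thread running concurrently and would need stronger synchronisation around the buffer switch than this model shows.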
David Yat Sin a83f872a23 PC Sampling: Create dedicated CP queue
Create dedicated CP queue with highest priority for PC Sampling. Reduce
the highest priority that LRT's can set for existing API so that PC
Sampling queue will always have highest priority over any other CP
queues

Change-Id: Ia70d74415edc83b4862a3e18dbdbd7cebe73ab47
2024-04-11 12:52:48 -04:00
David Yat Sin a842247482 PC Sampling: Add start stop and flush APIs
Create PC Sampling APIs for the start and stop functions, and a stub for
the flush function.

Change-Id: I7a093b29dc87e34ac06faaae6cac2be50e4663e1
2024-04-11 12:52:42 -04:00
David Yat Sin 632f9e60f7 PC Sampling: Add create and destroy APIs
Implement PC Sampling session create and destroy APIs.

Change-Id: I93370d3d01b74ee15e71b8b0e20feb8f0066a3dc

Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Signed-off-by: Vladimir Indic <Vladimir.Indic@amd.com>
Change-Id: Ib0c64356a1a4616b12d5dbeebe16273fe2a84abe
2024-04-11 12:52:35 -04:00
David Yat Sin 295acf6b27 PC Sampling: API to list supported configurations
Add new PC Sampling API to list the supported PC Sampling methods and
options on a specific agent. If there is already a PC Sampling session
active on this agent, the list of methods returned will be reduced to
methods that can be run simultaneously with the current active session.

Change-Id: I42ac2b8f30d5c368faf8ed4cf37ca4134db22985
2024-04-11 12:52:30 -04:00
David Yat Sin 0bc244e10a PC Sampling: Create PC Sampling interfaces
Create new interface group for PC Sampling

Change-Id: I59b4cfe9f8d1ae313dc28be1d2ed49f750d8212b
2024-04-11 12:52:23 -04:00
David Yat Sin 71f1a6726c Create fine-grained allocator
Create allocator helper function to provide fine-grained memory on
a specific agent.

Change-Id: I32ba9aceb9c9dc708b140a0c45158e6e7a018844
2024-04-11 12:52:10 -04:00
David Yat Sin 721e56ef5c Extend ExecutePM4() to accept completion signal and fences
ExecutePM4() function can optionally accept extra arguments for
acquire fence scope, release fence scope and completion signal. When
a completion signal is provided, ExecutePM4() does not wait for the
commands to complete.

Change-Id: Ib2a433b7bce1cb6260be8b76fe902335bd5dfada
2024-04-11 12:51:52 -04:00
David Yat Sin d7adc94e3f Add limit checks for HSA_SINGLE_SCRATCH_LIMIT
The hard limit for scratch is 4 GB per XCC. Add checks in case the user
specifies a value exceeding this limit.

Change-Id: Ib3cade762ff66c7e7d6a2d311e482cacbcf2b0de
2024-04-11 14:03:25 +00:00
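One way such a limit check could look is sketched below. The log does not say whether an oversized HSA_SINGLE_SCRATCH_LIMIT value is clamped or rejected, nor how the variable is parsed, so clamping, decimal parsing, and the names kMaxScratchPerXcc and ParseScratchLimit are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdlib>

// 4 GB per XCC, as stated in the commit message.
constexpr std::uint64_t kMaxScratchPerXcc = 4ULL * 1024 * 1024 * 1024;

// Hypothetical parser: reads a decimal byte count from `text` and clamps it
// to the hard limit. Falls back to `fallback` when the text is missing or
// not a number. Clamping (rather than rejecting) is an assumption.
std::uint64_t ParseScratchLimit(const char* text, std::uint64_t fallback) {
  const std::uint64_t safe_fallback = std::min(fallback, kMaxScratchPerXcc);
  if (text == nullptr || *text == '\0') return safe_fallback;

  char* end = nullptr;
  const unsigned long long value = std::strtoull(text, &end, 10);
  if (end == text || *end != '\0') return safe_fallback;

  return std::min<std::uint64_t>(value, kMaxScratchPerXcc);
}
```

A caller would typically pass the result of std::getenv("HSA_SINGLE_SCRATCH_LIMIT") as the first argument.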
Konstantin Zhuravlyov b983c19729 Switch to per-executable contexts in the loader
- Per-executable contexts should be used from now on
  - Global contexts are left as is for now for backwards
    compatibility and will be phased out in follow up
    patches.

Change-Id: I6291abf865c7ed24ee71f5065e539afc23f5ce64
2024-04-09 10:31:51 -04:00
Shweta Khatri 244ad319ac Revert "Use HybridMutex for IPC locks"
This reverts commit 5c520f4544c654e5f18e05cabd1c63d64473cfab.

Reason for revert: This patch is introducing a synchronization related bug in Unit_hipGetSetDevice_MultiThreaded testcase.

Change-Id: I367e4d4f1d75b21658ac1127c58982894a97cedb
2024-04-02 12:27:55 -04:00
David Yat Sin efe455c2fa Temporary: Set AllocateGTTAccess and node_id for MES
Temporary change to set the AllocateGTTAccess flag and node_id
on MES devices.

Change-Id: I22385d11b17b76cfb44278fa0d8a09bc8721cea6
2024-03-29 19:38:19 +00:00
Shweta.Khatri 00b63f7452 Replace lazy_ptr's Init() with reset() method
The function Init() called by one of the constructors of lazy_ptr is undefined.
Replacing it with the reset() method sets the object to an uninitialized state and assigns a new constructor function.

Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184

Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3
2024-03-26 15:07:34 -04:00
David Yat Sin 9d842dd1d8 Fix uninitialized variables
Change-Id: Ie5da4547fa764e55162aff287cbb338ed4324093
2024-03-14 15:20:56 -04:00
pvanhout a93c18dc90 [libamdhsacode] Support COV6/Generic Targets
Change-Id: I4680577eb56dc436fbc134b169f172dd476bff37
2024-03-12 07:37:32 -04:00
Jonathan Kim eb2100daad Fix deferred dmabuf export on IPC due to GEM object loss
When deferring a dmabuf export on an import call, there may be a
failure to export as the GEM object is not referenced by the kernel
mode driver.  To get around this, do a non-deferred export and
immediately close the dmabuf FD to keep FD creation to a minimum.
This way, the GEM object will have a kernel mode driver reference
when a deferred export is done.

Also a bad dmabuf FD sent over a socket may not be received by an import
reader and this can cause a hang.
Set a 10 second timer so that importer is not blocking indefinitely.

Change-Id: I11a9b5ec64aa2e16fd6aecdf46c34e4eb56ccfd0
2024-03-07 12:12:06 -05:00
Alex Sierra cbeddf9eb6 core dump: Generates a core dump from a fault event
Extracts and creates a core dump ELF file from a fault event, using
core dump front end. GFX11 is not supported.

Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: I5ae154e886f39ab3ce7bbae5803efb27a96c7e2e
2024-03-05 09:28:44 -05:00
Lancelot SIX 5d3f6a63f1 trap_handler: Set status.skip_export when halting a wave
When inspecting waves on architectures where SPI may not initialize TTMP
registers, the debugger cannot reliably know if the trap handler was
entered and if it saved valuable information in TTMP registers.

This patch uses the status.skip_export bit (unused by the compute
shaders) to indicate that it got executed before halting a wave.
This is done except for gfx940, where ttmp11[31] can be used (as long as
TTMP registers are always initialized by SPI for this architecture).  It
could be possible to be more selective as architectures always
initializing TTMP registers do not require this step, but always doing
it makes maintenance simpler.

Change-Id: I5c4148c78062f7ffa049ac7856c2edc82dbc77d1
2024-03-05 09:28:33 -05:00
Jonathan Kim ed462035fa Disable SDMA ganging on non-APU multi-partition modes
Work around SDMA hang in non-SPX modes for non-APU devices by disabling
ganging.
Root cause of hang not found.
non-APU xGMI modes have only 1 link between socket devices anyways so
there's likely no real system level gain in ganging intra-socket.

Change-Id: Ia4eda2f85cbf25151d3dbcf50cc45b8b775c60e2
2024-02-28 14:52:01 -05:00
Jonathan Kim ed260ea970 Fix gang item wait on dependency signals
Gang items have to wait on dependency signals as well as the leader.
Copies should not start if shaders are still operating on memory
to be copied.

Change-Id: I99703b420045ebcba2c9da39ec64678129dc140f
2024-02-27 12:45:41 -05:00
Shweta Khatri f2006d6899 Record interop mapped object in allocation_map_
This allows the VA to be recorded in ROCr so that they are not
treated as an invalid pointer in future API calls.

Change-Id: I8d1d8ef9816a984c89d30a2179b0ce8940fef1da
2024-02-26 13:40:55 -05:00
Jonathan R. Madsen 7ce263b0e4 Update rocprofiler-register support
- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"

Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
2024-02-22 11:55:25 -05:00
Shweta Khatri 24633c7a85 Avoid releasing scratch for blit queues
At hsa_shutdown(), scratch_lock_ may be gone. Blit queues don't need it.

Change-Id: Ic132ac8a6be31fb2f0623137115608b0b222f077
2024-02-22 14:12:05 +00:00
Jonathan Kim 1f63ea3476 Fix export-close race during IPC attach request
If two attach requests to the same piece of shared memory occur,
a double export or premature dmabuf fd close can occur since the export
and close on demand calls are not atomic.

Use a reference counter on shared memory dmabuf FDs that have
already been opened to avoid this problem.

Change-Id: I14a59209c0385e32582af42a57b33b1c6838a9b1
2024-02-22 14:12:05 +00:00
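The reference-counting idea in 1f63ea3476 can be sketched as a small cache keyed by memory handle: the first attach exports the FD, later attaches reuse it, and the FD is closed only when the last user releases it. SharedFdCache and its callbacks are hypothetical; the real export and close calls are replaced by injected functions so the sketch stays self-contained.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <mutex>
#include <utility>

// Hypothetical ref-counted FD cache. Export and close are done under one
// lock so that two concurrent attach requests cannot double-export or
// close an FD that is still in use.
class SharedFdCache {
 public:
  SharedFdCache(std::function<int(std::uint64_t)> export_fn,
                std::function<void(int)> close_fn)
      : export_(std::move(export_fn)), close_(std::move(close_fn)) {}

  // Returns the FD for `handle`, exporting it only if no one holds it yet.
  int Acquire(std::uint64_t handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(handle);
    if (it == entries_.end()) {
      const int fd = export_(handle);
      if (fd < 0) return fd;  // export failed, nothing to track
      it = entries_.emplace(handle, Entry{fd, 0}).first;
    }
    ++it->second.refs;
    return it->second.fd;
  }

  // Drops one reference; closes the FD when the last reference goes away.
  void Release(std::uint64_t handle) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = entries_.find(handle);
    if (it == entries_.end()) return;
    if (--it->second.refs == 0) {
      close_(it->second.fd);
      entries_.erase(it);
    }
  }

 private:
  struct Entry {
    int fd;
    int refs;
  };

  std::function<int(std::uint64_t)> export_;
  std::function<void(int)> close_;
  std::mutex mutex_;
  std::map<std::uint64_t, Entry> entries_;
};
```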
David Yat Sin ae16b3e14e Use sysconf pagesize for system pagesize
Provided by user huanggyizhi on github
https://github.com/RadeonOpenCompute/ROCR-Runtime/pull/124

Change-Id: Ia03c45f7a869ae2c804accf8163f8ae36c20dd5a
2024-02-13 14:28:10 -05:00
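Querying the page size at run time, as ae16b3e14e does, is a one-line POSIX call. The sketch below shows the call and a typical use (rounding a size up to a page multiple). The helper names and the 4096-byte fallback are assumptions, not taken from the ROCr sources.

```cpp
#include <unistd.h>

#include <cstddef>

// Returns the system page size reported by sysconf().
// The 4096-byte fallback for a failed query is an assumption of this sketch.
std::size_t SystemPageSize() {
  const long size = sysconf(_SC_PAGESIZE);
  return size > 0 ? static_cast<std::size_t>(size) : 4096;
}

// Rounds `bytes` up to the next multiple of the system page size.
std::size_t AlignUpToPage(std::size_t bytes) {
  const std::size_t page = SystemPageSize();
  return (bytes + page - 1) / page * page;
}
```

This matters on systems whose page size is not 4 KiB, for example 16 KiB or 64 KiB kernels, where a hardcoded constant would produce misaligned sizes.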
Jonathan Kim e911335cee Minimize FD creation on IPC Create
Instead of caching shared memory fds for export on the exporter side,
only export the FD in the async handler when requested.
The importer should request export fd closure once import is done.

Change-Id: I469e0cd1749beeb9c506c8a6461745fb039d9c3b
2024-02-07 18:50:54 -05:00
Mythreya 8e312471dc Fix ToolsApiTable versioning
ToolsApiTable's version was incorrectly default initialized to 0.
Fixes error in commit fc889669

Change-Id: I41e9301a9c33b119ee50f6164d21ddf11dc188c4
2024-02-07 17:02:32 -05:00
David Yat Sin f7de85082e VMM: Allow non-contiguous memory maps
Adjust code to allow the use of non-contiguous chunks of memory to be
mapped within a single VA range.

Change-Id: Ida21ba202927229347b3a32d9b7106df10819cf5
2024-02-07 16:56:52 +00:00
Mythreya a67af3807f Initial support for scratch allocation tracking
Add new tools table and functions to notify in case of an event

Change-Id: I47f0c2f3c8e02d7bcb74d649903eb4f86721c154
2024-02-07 16:56:52 +00:00
Jonathan Kim 1dd4a7dc18 Fix copy logic on devices with no xgmi SDMAs
Fix gang factor overwrite of 0 if there are no xGMI SDMAs
on the device and gang factor is 1.

Change-Id: I041d4b4ae87fb68f224ee4dedb758c6f06c022a9
2024-02-07 16:56:52 +00:00
Jonathan Kim a3efd13a2f Fix IPC import on device memory with no requested nodes
Users can import device memory without specifying the target node.
DMA buf imports return a Thunk handle that's not useful for
gpu mapping calls.

Fix this by using the import node information to re-import and
map with the correct target GPU.

Also fix IPC detach calls by deregistering the Thunk handle
import immediately during attach instead of failing to do it later
on detach since Thunk handles aren't placed into ROCr allocation
map.

Finally refactor the IPC attach function for cleaner logic flow.

Change-Id: Ib2bf178110b2be98bd6917c765f724e4e613f5f2
2024-02-06 23:15:29 +00:00
Jonathan Kim 15691ae460 Fix DMABuf FD closure for IPC attach client
We should also close the client side dmabuf fd after importing for target
nodes.

Change-Id: I74f61dd65bebb03dc002f5df7301efd1ef8d9603
2024-02-06 23:15:29 +00:00
Jonathan Kim 62f3f250ce Optimize and fix SDMA gang copies
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items.  Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.

Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxiliary gang factor to the number of non-XGMI SDMA engines.

Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
2024-01-25 10:42:27 -05:00
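The lookup change listed in 62f3f250ce (vector of pairs replaced by a map) can be illustrated as follows. GangFactorMap and LookupGangFactor are hypothetical names, and returning 1 (no ganging) for an unregistered node is an assumption of this sketch; the commit instead registers every GPU so that the lookup always succeeds.

```cpp
#include <cstdint>
#include <map>

// Hypothetical table: peer node id -> gang factor.
using GangFactorMap = std::map<std::uint32_t, std::uint32_t>;

// O(log N) keyed lookup. Falls back to a gang factor of 1 when the node
// was never registered (an assumption made for this sketch).
std::uint32_t LookupGangFactor(const GangFactorMap& factors,
                               std::uint32_t node_id) {
  const auto it = factors.find(node_id);
  return it == factors.end() ? 1u : it->second;
}
```

Compared with scanning a std::vector of pairs, which is linear in the number of entries, std::map::find walks a balanced tree and needs only a logarithmic number of comparisons.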
David Yat Sin 32b3a3c299 VMM: Use emplace when adding entries
Use emplace to prevent copying the MappedHandle objects when inserting
entries into mapped_handle_map_.

Change-Id: Id3f40f1eb73ce30e62da53c5aea4dd715e83ac59
2024-01-17 10:25:04 -05:00
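The effect of using emplace, as in 32b3a3c299, can be shown with a type that counts its own copies. TrackedHandle is a hypothetical stand-in for MappedHandle, and the use of std::piecewise_construct is one possible way to build the value in place, not necessarily the form used in the commit.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>
#include <utility>

// Counts how many times a TrackedHandle is copy-constructed.
static int g_handle_copies = 0;

// Hypothetical stand-in for an object that is expensive to copy.
struct TrackedHandle {
  std::size_t size;
  explicit TrackedHandle(std::size_t s) : size(s) {}
  TrackedHandle(const TrackedHandle& other) : size(other.size) {
    ++g_handle_copies;
  }
};

using HandleMap = std::map<std::uintptr_t, TrackedHandle>;

// Builds a temporary and inserts it: the object is copied on the way in.
void AddByInsert(HandleMap& map, std::uintptr_t va, std::size_t size) {
  TrackedHandle handle(size);
  map.insert(std::make_pair(va, handle));
}

// Constructs the value directly inside the map node: no copies are made.
void AddByEmplace(HandleMap& map, std::uintptr_t va, std::size_t size) {
  map.emplace(std::piecewise_construct, std::forward_as_tuple(va),
              std::forward_as_tuple(size));
}
```

After AddByEmplace the copy counter is still zero, while AddByInsert increments it at least once.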
David Yat Sin 29efd8eccd VMM: Fix flags when allocating memory handle
When allocating a memory handle, the NoAddress thunk flag should be set
so that this allocation does not have a virtual address range.
Also, skip mapping the memory when allocating a memory handle

Change-Id: I1c168bc00ddbc158d447197c4dc25f96bad02b19
2024-01-17 10:24:58 -05:00