To allow non-POD global variables to last until the last thread
has exited, use "new" to allocate the memory instead of static
allocation.
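
A minimal sketch of the pattern, assuming a hypothetical Registry type and
GlobalRegistry() accessor (neither name is taken from ROCr): the object is
created with "new" on first use and deliberately never deleted, so no
destructor runs while other threads may still be reading it at exit.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical non-POD global; stands in for any map, string or vector.
using Registry = std::map<std::string, int>;

// Returns the process-wide registry. The pointer itself is a trivially
// destructible static, and the pointed-to object is intentionally leaked,
// so it stays valid until the last thread has exited.
Registry& GlobalRegistry() {
  static Registry* const registry = new Registry();
  return *registry;
}
```

The trade-off is that the memory is only reclaimed by the OS at process
exit, which is usually acceptable for a handful of long-lived globals.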
Change-Id: Ica571b61ff8068a52e472c49cb1c44917e60c8c8
An ASAN run of the release build revealed some elements of
the supported_isas static map were still using stack data. This
change makes it use heap data so it will persist.
Change-Id: Ie51887e88b9e2dec27acfc97ea45a6219fea971c
SDMA queue resources are limited when all SDMA copies are bottlenecked
into 2 engines. Callers will not be able to make the best decisions
to allocate queue resources fairly, so have ROCr fall back to the old
round-robin behaviour dictated by KFD.
Change-Id: I93d52297976d74e20129c5eb1dcfbfa5aa5067a7
- Add a new path that avoids WaitAny() calls in AsyncEventsLoop(), with
the HSA_WAIT_ANY_DEBUG key. The new path is selected by default.
The optimization combines all the logic of WaitAny() in a single processing
loop and avoids extra memory allocations and ref counting. It also won't spin
on the CPU if all events are busy.
Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a
On GPUs where EOP is handled in the ASIC, the read_dispatch_id is not always
updated after each packet. Look for the first dispatch packet that needs
scratch memory before allocating scratch.
Change-Id: Ibf4b4b485f99bf2fabfe48e9609ca99111fdafbe
The supported_isas static unordered_map was adding stack
allocated Isa objects. Instead, make the objects statically
allocated, as supported_isas itself is.
Change-Id: I23405e218290d48deea6f984f76c57e7b43e314e
When ROCr is built as a static library, global variables
were often not initialized to valid values at their first
use. This change addresses that problem.
Change-Id: I550fa41feb3bc04b9cc686bcfb4acf2a7b651a88
Devices older than GFX90a hit a segfault on queue unmap when an
SDMA queue has been assigned a fixed engine. Bypass fixing the
engine for these devices for now.
Change-Id: I7d2f882d2377f004a7bb65f3b397396db07ce6d3
To correctly map to all GPUs after an import, use the new extended
registration call that can import a virtual address without having to
specify a target node.
Change-Id: Ifca8f6f6ee24fa99b2af357dcc3ea1de3ab234f7
When hsa_amd_vmem_set_access is called, do not remove permissions for
unspecified agents. Also update the documentation in the header to clarify
this.
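
The intended semantics can be sketched as a merge rather than a replace.
This is an illustration only: AgentId, Access and SetAccess() are
hypothetical names, not the runtime's types or the real
hsa_amd_vmem_set_access signature.

```cpp
#include <cassert>
#include <map>

// Hypothetical stand-ins for agent handles and access permissions.
using AgentId = int;
enum class Access { kNone, kReadOnly, kReadWrite };

// Applies the requested permissions on top of the current ones. Agents that
// are not named in `requested` keep whatever permission they already had.
void SetAccess(std::map<AgentId, Access>& current,
               const std::map<AgentId, Access>& requested) {
  for (const auto& entry : requested) {
    current[entry.first] = entry.second;
  }
}
```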
Change-Id: I3bb4cf08ba399f85cc67b17fd13a4a40d862415f
Socket server accept calls do not guarantee synchronous actions
post-accept. This can result in a race condition.
To resolve this, first limit the socket server's listen backlog to a
single connection. This will force competing clients to busy-retry
until timeout.
Second, make the DMABUF IPC file descriptor send-receive and import
calls into an atomic routine per connection.
With these fixes, we not only resolve potential races but also
guarantee that any exporter process will create at most one
file descriptor that will only last for the duration of the import
transaction. This alleviates any concern about running into system
limits for the number of open file descriptors per process.
Change-Id: I6d8b14795a680d89a2707e082fa027d525792e05
Discarding blocks for reallocation on IPC export, which is done for better
memory performance, triggers memory violations with DMA Buf exports. Bypass
this for now, as application performance drops haven't been observed
with the bypass.
The raw fragment should be passed to the DMA Buf export call as well
since offsets will be implicitly applied in the Thunk/KFD for
export/import calls.
Also, use the agent information directly from the pointer
information so that the export call doesn't have to scan memory to find
this. Pass the node ID in the handle so that the import call doesn't
have to make two thunk imports to fetch the node ID for GPU memory
imports.
Finally, allow the user to use DMA Buf IPC via
HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will
be applied by default.
Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e
Return false if trying to free a NULL pointer (or invalid size)
internally in ROCr. This is to detect errors within ROCr when trying
to free NULL pointers. If a user of ROCr tries to free a NULL
pointer, this condition should be caught at the beginning of the
Runtime::FreeMemory(...) function and return HSA_STATUS_SUCCESS. This
matches the behavior of free(...) and delete, which silently
ignore calls when passed a NULL pointer.
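
A rough sketch of the two layers, under stated assumptions: Status,
InternalFree() and the two-argument FreeMemory() below are hypothetical
stand-ins for hsa_status_t, the internal free path and
Runtime::FreeMemory(...), and std::free stands in for the real deallocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical stand-in for hsa_status_t.
enum Status { kSuccess, kError };

// Internal path: a NULL pointer or zero size indicates a bug inside the
// runtime, so report failure instead of silently succeeding.
bool InternalFree(void* ptr, std::size_t size) {
  if (ptr == nullptr || size == 0) return false;
  std::free(ptr);
  return true;
}

// User-facing path: a NULL pointer is accepted up front, like free(NULL),
// and never reaches the internal check.
Status FreeMemory(void* ptr, std::size_t size) {
  if (ptr == nullptr) return kSuccess;
  return InternalFree(ptr, size) ? kSuccess : kError;
}
```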
Change-Id: I84bc26928b35023e19cd9f214b42c6ee9508029c
Add support for AllocateMemoryOnly inside the XDNA driver.
Move the IsLocalMemory() check inside the KFD driver
since the XDNA driver can, and needs to, create handles
on system memory buffer objects.
Rename the handle variable from thunk_handle to user_mode_driver_handle,
which is more representative if we support non-GPU drivers.
Change-Id: I95db9d575afd1ab0ff2de74cea5175d9a12a721b
Adds support for initializing the XDNA driver so that
a hardware context can be created for an AIE queue.
Right now this initializes the device heap in the driver,
gets the relevant tile parameters for the AIE agent,
and creates a hardware context that backs the AIE queue.
Change-Id: Ib90e1bc67a8637f6db3ff2bebe34677843796417
GFX 9.4.x has better performance for CPU-GPU copies when using
engines in reverse order from other devices.
Change-Id: I1eaebf0e837bb7f44712f40d5115df618f6a73d7
If the KFD doesn't support targeting SDMA engines, ensure that ROCr
selects the correct downstream queue type by using an invalid engine.
Change-Id: Ia6848126f67f3d35ab37248633e8e0e6e2d77fff
- Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create the AQL queue in device
memory.
- Before writing the AQL packet header to the queue, use an SFENCE to ensure
that there is no reordering of the writes over PCIe.
Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d
Remove KFD-specific Allocate/Free calls from the AMD::MemoryRegion.
The KFD-driver-specific Allocate/Free calls are now implemented in
the KfdDriver. Future changes will migrate the remaining KFD-specific
calls out of AMD::MemoryRegion.
This allows the MemoryRegion to be used across AMD drivers like the
XDNA driver.
Change-Id: Ib6a2a9e5e1a15e61644d2592beb3a8e6578c3010
Adds the initial KFD driver interface and uses it to open the
KFD from amd_topology.cpp.
This change is to show the direction of the Driver interface for
initially supporting the KFD and to get feedback on the approach.
For now we wrap relevant ROCt calls behind this generic driver
interface so that we can generalize core ROCr components like
MemoryRegion, Runtime, etc.
Now that ROCt is incorporated into ROCr, we can more fully integrate
ROCt into the Driver interface. Ideally, we get to a point where
the generic Driver interface can support KFD, XDNA, and potential
future drivers.
Change-Id: I4573fd6af1f8398233ee9d3814d9f3139dd0279c
This change adds the initial classes for the AIE agent and AIE AQL
queue.
An AIE agent list is added to the core runtime object.
Change-Id: I84b02f52171b80726dfb2c8431582a3ea2986eb3
Rewrite the logic to fix an issue where pthread_create would return errors
other than EINVAL, and these errors would be ignored.
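
A hedged sketch of the idea, not the actual ROCr code: StartThread() is a
hypothetical helper, and retrying with default attributes on EINVAL is an
assumption made only for illustration. The point is that every non-zero
return value from pthread_create is reported to the caller.

```cpp
#include <pthread.h>

#include <cassert>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Hypothetical helper: starts a thread and returns false on any failure.
bool StartThread(pthread_t* thread, const pthread_attr_t* attr,
                 void* (*entry)(void*), void* arg) {
  int err = pthread_create(thread, attr, entry, arg);
  if (err == EINVAL && attr != nullptr) {
    // Assumed recovery step: the attributes were rejected, so retry once
    // with the default attributes.
    err = pthread_create(thread, nullptr, entry, arg);
  }
  if (err != 0) {
    // EAGAIN, EPERM and any other error are surfaced, not ignored.
    std::fprintf(stderr, "pthread_create failed: %s\n", std::strerror(err));
    return false;
  }
  return true;
}
```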
Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06
Recommended SDMA engines for DMA copies are now exposed for better
GPU-GPU performance. ROCr can now select those DMA engines.
Also lock-in host-device copies to SDMA0 and device-host copies to
SDMA1 for better stability and performance.
Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5
When HSA_OVERRIDE_GFX_VERSION is used, save the overridden GFX
version to OverrideEngineId instead of the original EngineId. There
are places where the real GFX properties are still needed, e.g. CWSR size
calculation.
Change-Id: I9d9149bae465b7cfe55604fc19e7ca34e48b7b1c
Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
The current trap handler has 2 limitations:
1) If it receives a HOST_TRAP, it clears the corresponding bit
and notifies the host, when it should not.
2) When it is entered because of a debug trap (s_trap 3) and the
debugger is not attached, it returns unconditionally. However,
if another exception is reported at the same time as the trap
handler is entered for the debug trap (a memory violation for
example), that other exception ends up being ignored.
This patch addresses both of those issues. It makes it so host traps
and debug traps are ignored when necessary. If any other exception is
reported to the wave, we halt the wave and notify the host, and if no
other exception is reported (i.e. we entered the trap handler because of
host trap or debug trap), we return to shader code.
Other minor defects are also fixed during this refactor:
- Fixed SQ_WAVE_EXCP_FLAG_PRIV_XNACK_ERROR_SHIFT which had an incorrect
value
- Host traps can be sent at any time, including after we have halted a
wave. In such a case, the old approach would have:
1) cleared the trap ID saved in ttmp6
2) clobbered ttmp10 where part of the actual wave's PC is saved.
Change-Id: I9ecd341f4967e686233dec182b3e5b0388ef19bd
This fixes an issue where HW events are missed when we run out of HW events.
We cannot determine whether a HW event has occurred unless we call the
underlying drivers with hsaKmtWaitOnMultipleEvents_Ext. Previous logic
in Signal::WaitAny would switch to ACTIVE_WAIT state if we run out of
hardware events (signal->EopEvent() == NULL) and this would cause the
hsaKmtWaitOnMultipleEvents_Ext call to be skipped. But also, when we
have some signals without hardware events, calling
hsaKmtWaitOnMultipleEvents_Ext with a timeout of 0 so that we can poll
for remaining signals adds overhead with an IOCTL call and may cause
extra delay. Separate AsyncEventLoop into two threads so
that:
1. We can have a new Signal::WaitAnyExceptions to wait for HW events.
This function can be simpler as it does not have to perform all the
timer calculations because it is expected to be always waiting on
hsaKmtWaitOnMultipleEvents_Ext through the lifetime of a process.
2. Signal::WaitAny does not need to have extra code to check for HW
exceptions as it only needs to handle HSA_EVENTTYPE_SIGNAL events. It
can also skip the calls to hsaKmtWaitOnMultipleEvents_Ext if needed.
Change-Id: I52ba99fd6e483e0cb477b7931a0dcc03520aa523
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Delete queues used internally in the agent destructor to make sure any
memory allocated by the queue objects is freed before the agent memory
regions are destroyed.
Change-Id: I4768c9cf66f77ac00a5a355f373f7f22dc266e47
If a user application tries to free memory that is currently being used by
the underlying HW device, the hsaKmtFreeMemory function call will fail.
This would be caused by an incorrect call by the user application. A
system memory error is raised and the user application is expected to
abort when this happens.
Note: This leaves the allocation_map_ table in an inconsistent state as
this address entry is removed from it while the pointer is not actually
freed. But re-organising the FreeMemory() function would require the
memory_lock_ to be held for much longer and may affect performance.
Since this is a very unlikely and invalid use case, we prefer to leave
the FreeMemory() function as is.
Change-Id: I24279eb98620c32d34f4c5ad1b7a0a30cb65835d
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Skip coredump generation when receiving HSA_STATUS_ERROR_MEMORY_FAULT.
We also receive a system error of type HSA_EVENTTYPE_MEMORY and generate
the coredump there. Trying to generate the coredump from 2 places sometimes
causes an unnecessary error message because both places try to create a
coredump file with the same name.
Change-Id: If3f03bab2c24ad71dfeff39ab411bb9ac08b337e
Signed-off-by: David Yat Sin <David.YatSin@amd.com>
Force mem_flags to be explicitly passed in when calling the Queue
constructor, to avoid ambiguity with calls to the Queue constructor that try
to pass only the agent_node_id.
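
A small sketch of the intent, assuming the ambiguity came from a defaulted
mem_flags parameter. The Queue struct below is a hypothetical stand-in and
its parameter order is not taken from the real class.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the runtime's queue class.
struct Queue {
  // No default value for mem_flags: a call that passes a single integer,
  // meant as agent_node_id, no longer compiles instead of being silently
  // interpreted as mem_flags.
  Queue(std::uint32_t mem_flags, std::uint32_t agent_node_id)
      : mem_flags(mem_flags), agent_node_id(agent_node_id) {}

  std::uint32_t mem_flags;
  std::uint32_t agent_node_id;
};
```

With this shape, every call site has to spell out both values, for example
Queue(0, node_id).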
Change-Id: Ib6fedcb9e52d6c9f35f9051dfa989343456ca368
Signed-off-by: David Yat Sin <David.YatSin@amd.com>