Moved the Call to pthread_mutex_lock to an else statement for better
code readibility.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 1635746a9c]
Because eventDescrp->mutex is a non-recursive lock attempting to
acquire the lock with pthread_mutex_lock can cause the system to hang
indefinitely if the lock was already previously aquired with the
preceeding call to pthread_mutex_trylock.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: a97b7df4b9]
allocated memory was previously not freed in the event of an error
with rwlock initialization.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 293092f32f]
On large BAR systems, for small-sized code-objects, we get performance
using direct memcpy due to latencies when doing the blit-copy.
[ROCm/ROCR-Runtime commit: da2607024b]
Using HSA_ENABLE_DTIF to control dtif/native thunk code path
Signed-off-by: Aaron Liu <aaron.liu@amd.com>
Reviewed-by: David Yat Sin <David.YatSin@amd.com>
[ROCm/ROCR-Runtime commit: 166b0fa45a]
This builds on a prior change that allowed for allocating
a user-mode queue's packet buffer in device memory to also
allocate the queue struct in device memory. This provides
additional latency benefits particularly for cases where
dispatches are performed from the GPU itself. Flags are
added to support the various use cases.
[ROCm/ROCR-Runtime commit: 6e3c375bf1]
The initial call to Refresh() in the constructor is
unnecessary as it's handled in Runtime::Load().
Signed-off-by: lyndonli <Lyndon.Li@amd.com>
[ROCm/ROCR-Runtime commit: c34a2798ce]
The scratch_backing_memory_byte_size is not used by CP, but it is
currently used by rocgdb. Putting the field back, but we need to find a
solution for alt_scratch_backing_memory_byte_size.
Also, completely disabling alternate scratch as we need some changes to
support debugger.
[ROCm/ROCR-Runtime commit: 02b38d0614]
Updating ROCr code to match new handshake protocol with CP FW for
asynchronous scratch reclaim.
Increase previous limits when scratch reclaim feature is available.
[ROCm/ROCR-Runtime commit: aa2f98e6f9]
Added HSA_IMAGE_ENABLE_3D_SWIZZLE_DEBUG environment flag to
enable/disable this. Default value is false (view3dAs2dArray = 1)
Enabling this flag will enable support for swizzles that do 3D
interleaving. Note that all features of 3D images are supported
with 2D swizzles,it's just that the access patterns are different
and therefore cache hit-rates may be better or worse, depending
on how it's used. Volumetric algorithms do better with 3D and apps
that tend to access a single slice at a time do better with 2D.
Change-Id: Id8574a6710fe4333a1ee331e5ce9195a81434198
[ROCm/ROCR-Runtime commit: 6361466baa]
Set priority to maximum for signal event handler and minimum for
exceptions event handler.
Change-Id: I1b982d3c2e4c880fafc073fe1a542d01692a6fdc
[ROCm/ROCR-Runtime commit: 7ea25ebb85]
Removed 'args' as a unique pointer and deletion in
'ThreadTrampoline', then declared as a class member.
Change-Id: Ia52058392d0170e8b5e57cfdd2c587f47a6f93f0
Signed-off-by: Apurv Mishra <apurv.mishra@amd.com>
[ROCm/ROCR-Runtime commit: 89115369cc]
WaitSemaphore and PostSemaphore are used in the HybridMutex
implementation. If HybridMutex did not have to call WaitSemaphore when
acquired, then calling PostSemaphore would cause the internal count
inside sem_t to slowly grow to large values and eventually cause
overflow.
Change-Id: I173fc17c874b49926e56991405e9086ea8c138fc
[ROCm/ROCR-Runtime commit: f58aff630c]
Add support for abort timeout when hsa_signal_wait_relaxed is called and
signal does not clear within timeout.
timeout is in seconds
Change-Id: If1db5a8af33c82ddc4b48968c3d8eceb97d0ea6d
[ROCm/ROCR-Runtime commit: 4ec730f1dc]
- Add the new path to avoid WaitAny() calls in AsyncEventsLoopp() with
HSA_WAIT_ANY_DEBUG key. The new path is selected by default.
The optimizaiton combines all logic of WaitAny() in a single processing loop
and avoids extra memory allocations or ref counting. Also it won't spin
on the CPU if all events are busy.
Change-Id: I197ce60d0d023fbb672f700d6e87702686f1f55a
[ROCm/ROCR-Runtime commit: 0fc7369ba5]
Discarding blocks for reallocation on IPC export for better memory
performance trigger memory violations with DMA BUF exports so bypass
this for now as application performance drops haven't been observed
with the bypass.
The raw fragment should be passed to the DMA Buf export call as well
since offsets will be implicitly applied in the Thunk/KFD for
export/import calls.
Also, use the agent information directly from the pointer
information so that the export call doesn't have to scan memory to find
this. Pass the node ID in the handle so that the import call doesn't
have to make two thunk imports to fetch the node ID for GPU memory
imports.
Finally, allow the user to use DMA Buf IPC via
HSA_ENABLE_IPC_MODE_LEGACY=0 for developer testing as legacy mode will
be applied by default.
Change-Id: Ie8fe267f8768fa5df37126078406f7065f69ff4e
[ROCm/ROCR-Runtime commit: 32bb0764b7]
- Use HSA_ALLOCATE_QUEUE_DEV_MEM=1 to create AQL queue in device
memory.
- Before writing AQL packet header to the queue use an SFENCE to ensure
that there is no reodering of the writes over PCIE
Change-Id: I5eacdc35108c4a1e245c75ae349b7495451aa60d
[ROCm/ROCR-Runtime commit: 3baaa6e9c0]
Rewriting logic to fix issue where pthread_create would return errors
other than EINVAL, and these errors would be ignored.
Change-Id: I573958724dcf886c20e8c14e6a9182303b3ffa06
[ROCm/ROCR-Runtime commit: c8dd4d2b3b]
Recommended SDMA engines for DMA copies are now exposed for better
GPU-GPU performance. ROCr can now select those DMA engines.
Also lock-in host-device copies to SDMA0 and device-host copies to
SDMA1 for better stability and performance.
Change-Id: Ideff2e13daf537104efecb8b837bd49ee5096cb5
[ROCm/ROCR-Runtime commit: eb30a5bbc7]
Current DMABUF implemenation is unstable. Switch back to legacy
support for now.
Change-Id: I3be871f38c6524b0bcc9225bab61de4e57771efb
[ROCm/ROCR-Runtime commit: ea646cf958]
New API to accept a file stream for logging
Co-authored-by: David Yat Sin <David.YatSin@amd.com>
Change-Id: Ie09c35ae14ca86a97eb25f61251be287c55d7169
Signed-off-by: Chris Freehill <cfreehil@amd.com>
[ROCm/ROCR-Runtime commit: 26e105d9ab]
This reverts commit ef95ccf81e59b8608861e8f2f256d981eee19df7.
Reason for revert: Causing performance regressions on some systems
Change-Id: I82951350cafbd57c495852d6f90023a3373f04f6
Signed-off-by: Chris Freehill <cfreehil@amd.com>
[ROCm/ROCR-Runtime commit: 1cee8656df]
If pthread_attr_setaffinity_np function exists use it instead of
pthread_setaffinity_np as pthread_setaffinity_np seems to fail to set
the affinity settings on some systems.
Change-Id: Icd8b17039699ac10d9cd5c4dbb6ac44630673949
[ROCm/ROCR-Runtime commit: 57b93e02a4]
Fix Musl libc NULL errors and unsupported pthread funcs for compatibility.
Also ensures cleanup and error handling irrespective of CPU affinity override.
Fix submitted by github dev - AngryLoki
https://github.com/ROCm/ROCR-Runtime/issues/181
Change-Id: Ia487315e504112be5d3370756f23f6e23b9ae4be
[ROCm/ROCR-Runtime commit: bc9cac97fe]
Allocate required device and host buffers to be able to interact with
the 2nd level trap handler.
Change-Id: If99de5aacf956ca57ecafc7b04b797be9c9decaa
[ROCm/ROCR-Runtime commit: 8d666dea01]
The function Init() called by one of the constructors of lazy_ptr is undefined.
Replacing with reset method sets the object to an uninitialized state and assigns a new constructor function
Fix submitted on github by zhoumin2 - https://github.com/ROCm/ROCR-Runtime/pull/184
Change-Id: I7d906d526ce7fe7e2548b01810e6395b13497bf3
[ROCm/ROCR-Runtime commit: 00b63f7452]
- add rocprofiler-register to CPACK_DEBIAN_BINARY_PACKAGE_DEPENDS when found
- add rocprofiler-register to CPACK_RPM_BINARY_PACKAGE_REQUIRES when found
- remove report_tool_load_failures_explicit_
- add HSA_TOOLS_DISABLE_REGISTER flag
- add HSA_TOOLS_REPORT_REGISTER_FAILURE
- use HSA_TOOLS_REPORT_REGISTER_FAILURE instead of HSA_TOOLS_REPORT_LOAD_FAILURE
- changed rocprofiler-register message to not include the word "error"
Change-Id: Ib7fd7f14c42758a54c347874018281bb1b5477a6
[ROCm/ROCR-Runtime commit: 7ce263b0e4]
Optimizations include:
- Greedy gang by placing gang leaders on first D2D sdma blit context
to avoid dead locking with other gang leaders and items. Note that
this is fine since we can't avoid an oversubscription problem when
there is only 1 xGMI link anyways, so treat all xGMI links as a single
pipe for ganging.
- Non-leader gang items don't have to poll on dependency signals so this
opens up more non-blocking SDMA channels.
- unlock gang lock when gangs are not needed.
- Change gang factor lookup from vector pair to map and register all
gpus in gang factor lookup regardless of link type so that we can take
advantage of the O(logN) direct key/value lookup time.
Fixes include:
- HSA_PAGE_SIZE_4KB was an incorrect macro to use for gang size limit.
As a result, small copies ended up ganging and hitting latency limit.
Use hardcoded 4096 bytes instead.
- Cap auxillary gang factor to the number of non-XGMI SDMA engines.
Change-Id: Ic23fde131502906a807134a04599aa6d012e8cbb
[ROCm/ROCR-Runtime commit: 62f3f250ce]
Set HSA_ENABLE_IPC_MODE_LEGACY off (i.e. use DMA bufs implementation
by default).
Change-Id: I7b1c6cb7d19310adf6f0bfe060736f4adbf7adc2
[ROCm/ROCR-Runtime commit: e20f41df62]
As the KFD IPC IOCTLs will not be upstreamed, change runtime
implementation to use DMA bufs.
DMA buf fds will be passed over abstract unix domain sockets.
The exporter spins a thread that creates a socket server.
The importer connects to the server to fetch the fd.
libDRM will be required to do a manual import and GPU map for
memory that is not already imported and mapped.
For now, use the legacy IPC implementation by default as a
follow on patch will disable the HSA_ENABLE_IPC_MODE_LEGACY
environment variable.
Change-Id: Ifd8469e9adfc81f8a1ea78d6010fb10b515ba1b4
[ROCm/ROCR-Runtime commit: 5dfebdbca9]
Implement HybridMutex to improve latencies compared to KernelMutex when
there is contention between several threads calling hsa_signal_create
and hsa_amd_signal_async_handler.
Change-Id: If53377033e749b0050727964c9303f09b02527cc
[ROCm/ROCR-Runtime commit: 8d3fee5095]
On some systems, pthread_addr_setaffinity_np does not exist, so we need
to use pthread_setaffinity_np on thread after pthread_create
Provided by Julian Samaroo on github
https: //github.com/RadeonOpenCompute/ROCR-Runtime/pull/143
Change-Id: I4649f94333f2d7b0a5993b370a4bfc48d92acecb
[ROCm/ROCR-Runtime commit: 6333fdecf3]
The alternate scratch memory is used for dispatches that have a low
number of waves but relatively large wave size.
This allows us to keep the tmpring_size.bits.WAVES field of the main
scratch to full occupancy.
Change-Id: I32d240fac4b7d38200d1eebc1b0fdc8a823920d3
[ROCm/ROCR-Runtime commit: a7a3358067]
For devices where the CP FW supports asynchronous scratch reclaim, ROCr
is able to claw-back scratch memory that was assigned to an AQL queue.
With that ability, ROCr does not have to rely on using USO
(use-scratch-once) when assigning large amounts of memory to a queue.
If we reach a situation where we are running low on device memory, ROCr
will attempt to claw-back the scratch memory.
Change-Id: Iddf8ec84e37ab8b9fdc58bafbe2b61fe2acb6eb7
[ROCm/ROCR-Runtime commit: dca8f3a21d]
SDMA4.4 and SDMA5.2+ has increased it's available copy size to 2^30 bytes
represented by exponent as bits set in the COUNT field of the
linear copy.
Also note that the full 2^22 byte limit is available from SDMA4 onwards
as it has corrected the 0x3fffe0 HW limitation from SDMA3.
As copy limit has increase, this can change system performance
so provide env var HSA_ENABLE_SDMA_COPY_SIZE_OVERRIDE=0 to fall
back to the original 0x3fffe0 limit for debugging purposes.
Change-Id: I0fb6e5378f68e5b8a00ff559271691a943ee06ee
[ROCm/ROCR-Runtime commit: 81c64228e0]
Use all available SDMA engines capped by xGMI bandwith for
all D2D copies within a hive.
By default, set the latency boundary copy size as 4KB and below.
Any copy size in within this boundary will not gang.
Avoid oversubscribing engines by not ganging on engines with
pending non-ganged work.
An enviroment variable HSA_ENABLE_SDMA_GANG has been provided
to override default ganging behaviour.
Change-Id: Iccde76aa1af1d47ea2a151789432c9db4f0ffa8d
[ROCm/ROCR-Runtime commit: 7df0167821]
I've just reverted some code what it was in 5.5 by wrapping new x86
specific bits with #if's, e.g.:
- CPUID is x86 specific
- mwait is x86 specific
Change-Id: I6cefae34282c777c7340daf3f934d2a11742502e
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
[ROCm/ROCR-Runtime commit: 132a19e9c3]
Add support for HSA_ENABLE_PEER_SDMA env variable that can be used to
disable use of SDMA engines for device-to-device transfers. Note that
setting HSA_ENABLE_SDMA=0 will disable all SDMA transfers and override
HSA_ENABLE_PEER_SDMA values.
Change-Id: I737b3c2b2efcf3ff237f98bc748f49b8252ed24a
[ROCm/ROCR-Runtime commit: a397373cea]