Implement HybridMutex to improve latencies compared to KernelMutex when
there is contention between several threads calling hsa_signal_create
and hsa_amd_signal_async_handler.
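The general spin-then-block idea behind a hybrid mutex can be sketched as follows. This is an illustration only, not ROCr's HybridMutex: the class name SpinThenBlockMutex, the spin limit, and the helper ContendedCount are all hypothetical, and standard C++ primitives stand in for whatever the runtime uses internally.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical sketch: try a bounded user-space spin first, and only
// fall back to a blocking wait when the lock stays contended.
class SpinThenBlockMutex {
 public:
  void lock() {
    // Fast path: short critical sections are usually released while we spin.
    for (int i = 0; i < kSpinLimit; ++i) {
      if (try_lock()) return;
      std::this_thread::yield();
    }
    // Slow path: sleep until an unlock() wakes us and we win the flag.
    std::unique_lock<std::mutex> guard(sleep_mutex_);
    wake_.wait(guard, [this] { return try_lock(); });
  }

  bool try_lock() {
    bool expected = false;
    return locked_.compare_exchange_strong(expected, true,
                                           std::memory_order_acquire);
  }

  void unlock() {
    {
      // Clearing the flag under sleep_mutex_ prevents a lost wakeup.
      std::lock_guard<std::mutex> guard(sleep_mutex_);
      locked_.store(false, std::memory_order_release);
    }
    wake_.notify_one();
  }

 private:
  static constexpr int kSpinLimit = 1000;  // assumed value, untuned
  std::atomic<bool> locked_{false};
  std::mutex sleep_mutex_;
  std::condition_variable wake_;
};

// Hypothetical usage: several threads increment one counter under the lock.
int ContendedCount(int thread_count, int iterations) {
  SpinThenBlockMutex mutex;
  int counter = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < thread_count; ++t) {
    workers.emplace_back([&mutex, &counter, iterations] {
      for (int i = 0; i < iterations; ++i) {
        std::lock_guard<SpinThenBlockMutex> guard(mutex);
        ++counter;
      }
    });
  }
  for (std::thread& worker : workers) worker.join();
  return counter;
}
```

The spin phase avoids a kernel transition when the holder releases quickly. The blocking phase keeps waiters from burning CPU under sustained contention.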
Change-Id: If53377033e749b0050727964c9303f09b02527cc
During HSA initialization, ROCr now searches all loaded libraries
for the symbol "HSA_AMD_TOOL_PRIORITY" and adds every library that
exports it to the tools library init list. Tools libraries listed in
the HSA_TOOLS_LIB environment variable are also loaded, in the given
order, and take priority over those found via HSA_AMD_TOOL_PRIORITY.
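One way to picture the resulting init order is the sketch below. It rests on several assumptions: that "priority" means an earlier position in the init list, that HSA_TOOLS_LIB holds whitespace-separated names, and that a library found both ways is initialized once. BuildToolInitList is a hypothetical helper, not a ROCr function.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: merge the HSA_TOOLS_LIB entries with libraries that
// were discovered by scanning for the HSA_AMD_TOOL_PRIORITY symbol.
std::vector<std::string> BuildToolInitList(
    const std::string& hsa_tools_lib,
    const std::vector<std::string>& discovered) {
  std::vector<std::string> init_list;

  // Libraries named in the environment variable come first, in given order.
  std::istringstream names(hsa_tools_lib);
  std::string name;
  while (names >> name) init_list.push_back(name);

  // Discovered libraries follow; skip any already listed (assumed dedup).
  for (const std::string& lib : discovered) {
    if (std::find(init_list.begin(), init_list.end(), lib) == init_list.end()) {
      init_list.push_back(lib);
    }
  }
  return init_list;
}
```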
Change-Id: I739af42bbd777c44a9152c11e17dd69979b65e82
This is consistent with KFD and has significantly better latency.
KFD takes this as the definition of the SystemClockCounter.
Change-Id: I4c1b3bc58c738206265c55ebefd41356c013bfe5
The prior solution used a single global lock to protect the memory tracking structures.
This change protects the memory tracking structure with a shared mutex (rw lock), taken in
shared (r) mode for memory allocations and frees so that long-running operations
that call into KFD can proceed in parallel. Operations which must modify the memory map
take the mutex in exclusive (w) mode and must not call into the thunk while holding
the mutex.
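The locking discipline described above can be sketched with C++17 std::shared_mutex. This is a minimal illustration under stated assumptions, not the ROCr implementation: TrackedAllocator, DriverAllocate, and DriverFree are hypothetical names, and the driver calls are stubbed with the global operator new and delete.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <new>
#include <shared_mutex>

class TrackedAllocator {
 public:
  void* Allocate(std::size_t size) {
    void* ptr = nullptr;
    {
      // Shared mode: many threads may be inside the slow driver call at once.
      std::shared_lock<std::shared_mutex> shared(lock_);
      ptr = DriverAllocate(size);
    }
    if (ptr == nullptr) return nullptr;
    // Exclusive mode: only the map update, no driver call while held.
    std::unique_lock<std::shared_mutex> exclusive(lock_);
    allocations_[ptr] = size;
    return ptr;
  }

  bool Free(void* ptr) {
    {
      // Exclusive mode: remove the tracking entry first, without a driver call.
      std::unique_lock<std::shared_mutex> exclusive(lock_);
      if (allocations_.erase(ptr) == 0) return false;
    }
    // Shared mode: the slow release can overlap with other allocs and frees.
    std::shared_lock<std::shared_mutex> shared(lock_);
    DriverFree(ptr);
    return true;
  }

  std::size_t TrackedCount() const {
    std::shared_lock<std::shared_mutex> shared(lock_);
    return allocations_.size();
  }

 private:
  // Stand-ins for the long-duration driver calls (assumption).
  static void* DriverAllocate(std::size_t size) {
    return ::operator new(size, std::nothrow);
  }
  static void DriverFree(void* ptr) { ::operator delete(ptr); }

  mutable std::shared_mutex lock_;
  std::map<void*, std::size_t> allocations_;
};
```

Because the driver calls run under the shared lock, a thread that acquires the lock exclusively knows that no driver call is in flight while it edits the map.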
The fragment allocator now requires separate protection and is protected with a
mutex at the device level. Protecting at the device level, rather than the pool level,
allows retention of the current recursive design and allows calling Trim from
within Allocate. This could be made finer (pool-level locks) but would
require backing out of Allocate entirely to call Trim. Trim and any retried
allocation must be done in isolation (per device), or we may report OOM when
memory is actually available in some pool's fragment cache. So some device-level
serialization is required in at least some paths.
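The trim-and-retry path can be sketched as below. The sketch assumes a std::recursive_mutex as the device-level lock, which may not match the actual design, and it models memory as byte counters only. The Device class and its members are hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <vector>

class Device {
 public:
  Device(std::size_t capacity, std::size_t pool_count)
      : free_bytes_(capacity), cached_(pool_count, 0) {}

  bool Allocate(std::size_t pool, std::size_t bytes) {
    std::lock_guard<std::recursive_mutex> guard(lock_);
    // Serve from this pool's fragment cache when possible.
    if (cached_[pool] >= bytes) {
      cached_[pool] -= bytes;
      return true;
    }
    // Trim and the retry run under the same device lock, so no other thread
    // can take the reclaimed memory between the two steps.
    if (free_bytes_ < bytes) Trim();
    if (free_bytes_ < bytes) return false;  // genuine OOM
    free_bytes_ -= bytes;
    return true;
  }

  void Free(std::size_t pool, std::size_t bytes) {
    std::lock_guard<std::recursive_mutex> guard(lock_);
    cached_[pool] += bytes;  // freed memory is kept in the pool's cache
  }

  void Trim() {
    // Re-entrant: may be called from Allocate while the lock is already held.
    std::lock_guard<std::recursive_mutex> guard(lock_);
    for (std::size_t& cached : cached_) {
      free_bytes_ += cached;
      cached = 0;
    }
  }

 private:
  std::recursive_mutex lock_;        // device-level lock (assumed recursive)
  std::size_t free_bytes_;           // memory not held by any pool
  std::vector<std::size_t> cached_;  // per-pool fragment cache, in bytes
};
```

With per-pool locks instead, Allocate on one pool could not safely empty another pool's cache without first releasing its own lock, which is the "backing out" the message refers to.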
Change-Id: I7c1e94d6965ffcc602b12fefdd3a6e97b84b5e00