=> hipDeviceSynchronize is not allowed during capture.
=> hipEventSynchronize during capture should return hipErrorCapturedEvent
=> hipEventQuery during capture should return hipErrorCapturedEvent
For hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent and hipStreamQuery:
For a side stream (a stream that is not currently under capture):
=> If the current thread is capturing in relaxed mode, calls are allowed
=> If any stream in the current thread or a concurrent thread is capturing in global mode, calls are not allowed
=> If any stream in the current thread is capturing in ThreadLocal mode, calls are not allowed
For a stream that is currently under capture:
=> calls are not allowed
=> Any call that is not allowed during capture invalidates the capture sequence
=> It is invalid to call synchronous APIs during capture. Synchronous APIs,
such as hipMemcpy(), enqueue work to the legacy stream and synchronize it before returning.
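
The rules above can be sketched as a small decision function. This is only an
illustrative model, not the clr implementation: CaptureMode, ActiveCapture,
sideStreamCallAllowed and callAllowed are hypothetical names, and it assumes
that a ThreadLocal capture on another thread does not block the caller.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the capture rules; names are not taken from clr.
enum class CaptureMode { Global, ThreadLocal, Relaxed };

struct ActiveCapture {
  CaptureMode mode;
  bool onCurrentThread;  // true if the calling thread began this capture
};

// Side stream case: the target stream itself is not under capture.
bool sideStreamCallAllowed(const std::vector<ActiveCapture>& captures) {
  for (const ActiveCapture& c : captures) {
    // A global-mode capture on any thread blocks the call.
    if (c.mode == CaptureMode::Global) return false;
    // A ThreadLocal capture blocks only the thread that started it
    // (assumption: other threads are unaffected).
    if (c.mode == CaptureMode::ThreadLocal && c.onCurrentThread) return false;
    // Relaxed-mode captures never block side-stream calls.
  }
  return true;
}

// Applies to hipStreamSynchronize, hipEventSynchronize, hipStreamWaitEvent
// and hipStreamQuery as listed above.
bool callAllowed(bool targetStreamIsCapturing,
                 const std::vector<ActiveCapture>& captures) {
  if (targetStreamIsCapturing) return false;  // never allowed on a capturing stream
  return sideStreamCallAllowed(captures);
}
```

A caller would treat a false result as an illegal call that also invalidates
the capture sequence in progress.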
Change-Id: I201c6e63e1a5d93fd416a3b520264c0fdbe31237
[ROCm/clr commit: 19b4660cbb]
The new logic has a lock for PAL call and doesn't require the lock for queues.
Change-Id: I61b67c3c4abd2ede44809de1d6beed756766032e
[ROCm/clr commit: 2f3ad43c4a]
alwaysResident setting doesn't require per queue residency tracking.
Thus, the logic can be skipped to avoid the lock of queues.
Change-Id: Ib5cff5b79d3ecb8c2f2eb2565cf069f9a69438b0
[ROCm/clr commit: 95e3958748]
- Create a vector to allow multiple TS to be stored in Command.
- This means we don't wait for the entire batch in the Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available, not just when
profiling. If profiling is enabled dynamically, as with Torch, a crash
can happen if hipGraphInstantiate wasn't included in the Torch profile scope,
because we previously stored kernel names only when a profiler was
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
- With https://gerrit-git.amd.com/c/lightning/ec/llvm-project/+/1002628 applied, at -O0 Kernel::dynamicParallelism() returns true but the virtual queue isn't created
- This causes a segfault inside VirtualGPU::submitKernelInternal() when getVQVirtualAddress() is called
Change-Id: Ia7af042adad2329e870c142caaac3e8fa886f8b8
[ROCm/clr commit: d1fff7cea2]
Under ROCr, physical allocations don't have an initial VA and require an extra
flag in ROCclr. Add an option to have a mempool of physical allocations.
Change-Id: I4d062fe0dd8113d4eaf6e8b51749ed56d8701d1e
[ROCm/clr commit: f296159f62]
This reverts commit 6c46d696ae.
Reason for revert: the Windows staging branch uses Opencl-icd-loader master
Change-Id: I9cca7564a21de1733665a34da6f0322aa3b886e7
[ROCm/clr commit: f44ca70bb7]
Added call to hipDeviceSynchronize in __hipUnregisterFatBinary
to ensure that all HSA async signal handlers complete before removeFatBinary
Change-Id: I756fecca1c2a5eae092613d8079de266399e5685
[ROCm/clr commit: ad32e604c7]
PAL optimized the logic for the barriers, which caused failures with CP DMA on Navi4x.
Change barrier's code to match the most recent PAL optimizations.
Change-Id: I55eeab20f51eb8e920bcbb4b55fbe3c7f77fd3fa
[ROCm/clr commit: 1239309c90]
Previously, we used the following approach and Comgr actions
for device lib linking:
AMD_COMGR_COMPILE_SOURCE_TO_BC (compile with clang driver)
AMD_COMGR_ADD_DEVICE_LIBRARIES (link in device libs with
llvm-link API)
However, the clang driver can link in device libraries as part
of compilation, assuming a --rocm-path is set. In this context,
this is accomplished by using the following Comgr action instead:
AMD_COMGR_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC (compile and
link in device libs with clang driver)
Change-Id: Ie0bbee7d9a12672536b6d751056a941128ed58be
[ROCm/clr commit: 6311ed8a8e]
- Sometimes we want to mask out kernel names, so use the right log level for
kernel logging
Change-Id: Ideae9647c57b86ae390ff2f4131f6d8c6df5c086
[ROCm/clr commit: f1adecd186]
Copies can get blocked if the last SDMA engine is in use by another
copy, which can lead to a performance drop in some tests such as Gromacs.
Resetting the last engine by checking the engine status and fetching a
new mask after a few copies can avoid this.
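
As a rough sketch of the idea only (not the actual ROCclr code), the cached
engine can be dropped after a fixed number of copies so the engine status is
queried again. EngineSelector, queryFreeMask and refreshInterval are
hypothetical names, and choosing the lowest free engine is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

// Hypothetical model: caches the last SDMA engine and re-queries the
// free-engine mask after a few copies instead of reusing it forever.
class EngineSelector {
 public:
  EngineSelector(std::function<uint32_t()> queryFreeMask, unsigned refreshInterval)
      : queryFreeMask_(std::move(queryFreeMask)), refreshInterval_(refreshInterval) {
    assert(refreshInterval_ > 0);
  }

  // Returns the engine index to use for the next copy, or -1 if none is free.
  int pick() {
    if (lastEngine_ < 0 || ++copiesSinceRefresh_ >= refreshInterval_) {
      // Reset the cached engine: fetch a fresh status mask and choose again.
      copiesSinceRefresh_ = 0;
      lastEngine_ = lowestSetBit(queryFreeMask_());
    }
    return lastEngine_;
  }

 private:
  static int lowestSetBit(uint32_t mask) {
    for (int i = 0; i < 32; ++i) {
      if (mask & (1u << i)) return i;
    }
    return -1;  // no engine is free
  }

  std::function<uint32_t()> queryFreeMask_;
  unsigned refreshInterval_;
  unsigned copiesSinceRefresh_ = 0;
  int lastEngine_ = -1;
};
```

With a refresh interval of 3, the cached engine serves three copies and the
fourth copy picks from a freshly queried mask.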
Change-Id: I8fe8ea678db508d291c6242f3741fa9215e99921
[ROCm/clr commit: 1b25484f0f]
Currently we force inlining of everything for HIP. Now we'd like to enable function
call support. The first step is to remove uses of `-amdgpu-early-inline-all` in
various places. This patch removes all of them from clr.
Change-Id: Ib0cad1f586714c9989778b00746aa4c47a4eec95
[ROCm/clr commit: a09204388a]
Mempool has the capability to track dependencies between streams for
faster memory reuse. Enable that capability.
Change-Id: I28266a7e38d0fc4c5d027b9542d3719653840821
[ROCm/clr commit: 17d0c166d2]
OpenCL printf handling did not process vectors of half-precision floats properly
(mainly because the compiler packs 2 halfs into a dword and the runtime failed to
extract the individual parts).
This patch fixes the issue.
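
For illustration, extracting the two halfs from a packed dword can be sketched
as below. This is not the runtime's code: halfBitsToFloat and unpackHalf2 are
hypothetical helpers, and it assumes the first vector element sits in the low
16 bits of the dword.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <utility>

// Convert IEEE 754 binary16 bits to a float (hypothetical helper).
float halfBitsToFloat(uint16_t h) {
  const uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
  const uint32_t exp = (h >> 10) & 0x1Fu;
  uint32_t man = h & 0x3FFu;
  uint32_t bits;
  if (exp == 0) {
    if (man == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: shift the mantissa until the implicit bit appears.
      int e = -1;
      do {
        man <<= 1;
        ++e;
      } while ((man & 0x400u) == 0);
      man &= 0x3FFu;
      bits = sign | (static_cast<uint32_t>(127 - 15 - e) << 23) | (man << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (man << 13);  // infinity or NaN
  } else {
    bits = sign | ((exp + 112u) << 23) | (man << 13);  // rebias 15 -> 127
  }
  float f;
  std::memcpy(&f, &bits, sizeof f);
  return f;
}

// Split one packed dword into its two half values.
// Assumption: element 0 is in the low 16 bits, element 1 in the high 16 bits.
std::pair<float, float> unpackHalf2(uint32_t dword) {
  return {halfBitsToFloat(static_cast<uint16_t>(dword & 0xFFFFu)),
          halfBitsToFloat(static_cast<uint16_t>(dword >> 16))};
}
```

Each extracted value can then be formatted with the usual float conversion.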
Change-Id: Ia1f15ccfb5db52b71c43cfd588dd38f551ee5277
[ROCm/clr commit: 6f390f5af9]