This change removes the stream callback from hipStreamWaitEvent and
uses a stream memory wait operation instead. This allows
hipStreamWaitEvent to be non-blocking on the host.
Change-Id: Ie5530febda5a5bcb5daa0db8a01249d6b137fd43
[ROCm/clr commit: 721c5800ca]
- Add custom compare to the map of queues, which will help with
the round-robin selection
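A minimal sketch of how a custom comparator can support round-robin selection, assuming the queues carry a stable numeric id. The names Queue, QueueCompare, QueuePool, addQueue and nextRoundRobin are hypothetical and are not taken from the runtime. Ordering by id rather than by pointer address makes the iteration order independent of where each queue was allocated:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical stand-in for a hardware queue; only the id matters here.
struct Queue {
  std::uint64_t id;
};

// Order queues by id instead of by pointer address.
struct QueueCompare {
  bool operator()(const Queue* lhs, const Queue* rhs) const {
    return lhs->id < rhs->id;
  }
};

// The mapped value is a hypothetical per-queue user count.
using QueuePool = std::map<Queue*, int, QueueCompare>;

inline void addQueue(QueuePool& pool, Queue* q) {
  assert(q != nullptr && "a queue must exist before it is pooled");
  pool.emplace(q, 0);
}

// Return the queue that follows `last` in id order, wrapping around to the
// first queue. Returns nullptr if the pool is empty.
inline Queue* nextRoundRobin(const QueuePool& pool, Queue* last) {
  if (pool.empty()) return nullptr;
  if (last == nullptr) return pool.begin()->first;
  auto it = pool.upper_bound(last);
  if (it == pool.end()) it = pool.begin();
  return it->first;
}
```

Because the lookup uses upper_bound, the selection keeps advancing in id order even if the previously selected queue has since been removed from the pool.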
Change-Id: Ie67a820bfb1a5b484a1b3edced967eed94228bb8
[ROCm/clr commit: ba8e740be4]
Add initial implementation of virtual memory heap with
dynamic virtual memory mapping support for memory pools.
DEBUG_HIP_MEM_POOL_VMHEAP controls the new method.
Change-Id: I8dc5be2e0f34ab472f1800f43bb6243639a5e500
[ROCm/clr commit: 296dce5570]
Compute doesn't support IB chaining, but RGP may collect
perf counters, which require more space in CB.
Increase CB size if RGP is enabled.
Change-Id: Iaa0a620ead8541a679b0dfe5e5711af5afdba545
[ROCm/clr commit: 63cf3057ba]
- Use correct header in device_library_decl
- Use std:: instead of __hip_internal:: for host compilation
- Hide device-specific code behind __clang__ and __HIP__ checks
Change-Id: I2f3647e00555ed0e79f9954a459c41394c3cd49b
[ROCm/clr commit: c3f49c8788]
- Also add a cache, which allows compiled code objects to be reused
instead of being compiled again. This should improve performance on
multi-GPU systems.
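The idea of reusing compiled code objects can be sketched roughly as follows. This is an illustration only: CodeObjectCache, the compile callback and the (source, options, ISA) key are assumptions, not the cache added by this commit. On a system with several identical GPUs, every device after the first would hit the cache instead of recompiling:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical representation of a compiled code object: raw bytes.
using CodeObject = std::vector<char>;

class CodeObjectCache {
 private:
  // Assumed cache key: source text, build options and target ISA name.
  using Key = std::tuple<std::string, std::string, std::string>;

 public:
  // Hypothetical compile callback: (source, options, ISA name) -> binary.
  using Compiler = std::function<CodeObject(
      const std::string&, const std::string&, const std::string&)>;

  explicit CodeObjectCache(Compiler compile) : compile_(std::move(compile)) {
    assert(static_cast<bool>(compile_) && "a compile callback is required");
  }

  // Return the cached code object for this key, compiling only on a miss.
  std::shared_ptr<const CodeObject> getOrCompile(const std::string& source,
                                                 const std::string& options,
                                                 const std::string& isa) {
    Key key{source, options, isa};
    // Holding the lock while compiling keeps the sketch simple, at the
    // cost of serializing concurrent compilations.
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(key);
    if (it != cache_.end()) {
      ++hits_;
      return it->second;
    }
    auto object =
        std::make_shared<const CodeObject>(compile_(source, options, isa));
    cache_.emplace(std::move(key), object);
    return object;
  }

  // Number of lookups that were served from the cache.
  std::size_t hits() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return hits_;
  }

 private:
  Compiler compile_;
  mutable std::mutex mutex_;
  std::map<Key, std::shared_ptr<const CodeObject>> cache_;
  std::size_t hits_ = 0;
};
```

Returning a shared pointer lets several devices hold the same binary without copying it.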
Change-Id: Ib135d616c076b77f8aaf28de275d408b38021d89
[ROCm/clr commit: 0391aec14a]
This patch makes two functional changes:
* Use GPU timing for internal markers for HIP.
* Measure CPU time closer to the GPU timer, to reduce the delta between GPU and CPU timestamp measurements.
There are also some smaller non-functional updates:
* Fix typo: waifForFence -> waitForFence
* Remove unused drmProfiling
Change-Id: I4c5fa600a842ab60e454888779edcac8449a902a
[ROCm/clr commit: 179801a750]
Resolved an issue where hipEventSynchronize and hipStreamWaitEvent APIs
did not function correctly for events created with the hipEventInterprocess flag.
The bug caused the event to be incorrectly marked as "recorded,"
leading to these APIs failing to wait for the event as expected.
Change-Id: Ic9fdfaab2393beb93d6e0b83661545e902a63499
[ROCm/clr commit: 1cdfbfd270]
- Fix regression for D2H pinned copies, which adds a system-scope release.
- Skip the CPU wait for D2H unpinned copies, since the signal of the
barrier can be passed to the ROCr copy.
- Fix an old bug in sdmaEngineRetainCount_ logic
- Improve logging
Change-Id: If074bddb05564b15949b0d5f9bf12acd3692174e
[ROCm/clr commit: 4c95ee5e1e]
Make `ocltst -m tests/ocltst/liboclruntime.so -t OCLMemoryInfo`
pass in emu, where GPU memory is very large.
Cherry-picked from
https://gerrit-git.amd.com/c/compute/ec/clr/+/1014858
Change-Id: I0228c5e87ce7c366983fd4af71c25e7f8161c2c7
[ROCm/clr commit: de83d7a6ae]
hipGetLastError should return the error returned by any of the previous
APIs in the same host thread, to match the CUDA behavior, whereas
hipExtGetLastError returns the error from the immediately preceding API.
The Ext API was added earlier to accommodate existing HIP apps that
rely on the current behavior of hipGetLastError.
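The difference between the two queries can be illustrated with a toy model in plain C++. This is not the HIP implementation: the Error enum, recordStatus and the two query functions are hypothetical stand-ins, and the reset-on-read behavior of each query is an assumption modeled on cudaGetLastError:

```cpp
#include <cassert>

// Toy stand-ins for hipError_t values; not the real HIP enum.
enum class Error { Success = 0, InvalidValue = 1 };

struct ThreadErrorState {
  Error sticky = Error::Success;  // last failure seen since the last query
  Error last = Error::Success;    // status of the most recent API call
};

// One state object per host thread.
inline ThreadErrorState& errorState() {
  thread_local ThreadErrorState state;
  return state;
}

// Every (hypothetical) API entry point would report its status here.
inline Error recordStatus(Error status) {
  ThreadErrorState& s = errorState();
  s.last = status;                                  // always overwritten
  if (status != Error::Success) s.sticky = status;  // only failures stick
  return status;
}

// Models hipGetLastError: a failure from any earlier call is still visible.
inline Error getLastError() {
  ThreadErrorState& s = errorState();
  Error e = s.sticky;
  s.sticky = Error::Success;  // assumed reset on read
  return e;
}

// Models hipExtGetLastError: only the immediately preceding call counts.
inline Error extGetLastError() {
  ThreadErrorState& s = errorState();
  Error e = s.last;
  s.last = Error::Success;  // assumed reset on read
  return e;
}

// Usage: a failing call followed by a successful one.
inline void demo() {
  recordStatus(Error::InvalidValue);
  recordStatus(Error::Success);
  assert(extGetLastError() == Error::Success);    // sees only the last call
  assert(getLastError() == Error::InvalidValue);  // still sees the failure
}
```

An application that checks for errors only after a batch of calls would therefore want the first query, while one that expects a clean status after each successful call would want the Ext variant.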
Change-Id: I61e95b1fc136cc761e2434e02187b7ed2598b733
[ROCm/clr commit: 4b443f8133]
BatchMemop should be positioned before the image support kernels,
because when there is no image support on the device the total number
of kernels is determined by BlitLinearTotal.
Change-Id: I8e53caf744ba54259ac04bad1762eef21806f3f2
[ROCm/clr commit: 3e01da3dac]
The macro associated with cl_khr_depth_images is defined twice in
the compiler: in opencl-c.h, and automatically by the compiler, deduced
from the cl-ext list. The two definitions co-exist, so there is no need
to remove cl_khr_depth_images from the cl-ext list.
If we remove cl_khr_depth_images from the cl-ext list and do not
include opencl-c.h, the macro is not defined.
This fixes conformance test ./test_compiler compiler_defines_for_extensions
when using Comgr with -include opencl-c-base.h -fdeclare-opencl-builtins
without including opencl-c.h.
Before this change, the test failed with the error
`ERROR: Supported extension cl_khr_depth_images not defined in kernel`
This change is needed to eventually get rid of the opencl-c.pch that is
embedded in comgr, which makes implementing a compilation cache in
comgr hard.
Change-Id: I76497874ebe7163966420d4ac23a0788b93a36fd
[ROCm/clr commit: 8c9e6d0fa5]
Instead of enforcing C++14 here, we can use the current
clang default.
Change-Id: Ib0a178a53c1377f2910edf6fab82b2bac6567ac7
[ROCm/clr commit: 33e48b9629]
- Resolve signal dependencies for the barrier value packet if there is
more than one dependent signal. A barrier value packet accounts for only
one dependent signal.
- Improve logging
Change-Id: Ia506ad5d80b91d598f92e7b539f41756e9b4b64b
[ROCm/clr commit: 2d450e8b06]
The logic can analyze the AQL queue state and
find a failed AQL packet with the kernel's name
Change-Id: I1a478fa2c25462cd07a194784958bdf22454b897
[ROCm/clr commit: ea0b092af8]
wait() is redesigned with two paths:
fast path: Use a spinlock to wait for the notify signal. If the
signal hasn't been received after some loops, go to the slow path.
slow path: Use condition_variable's wait().
Improve the monitor wrapper for better performance.
Fix some bugs left over from the name-removal patch.
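The two-path wait() described above can be sketched in standard C++ as follows. This is a simplified illustration, not the monitor code from this patch: the Notifier class, the spin budget of 1000 iterations and the one-shot (never reset) signal are all assumptions, and the fast path spins on an atomic flag rather than on an actual spinlock:

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class Notifier {
 public:
  // spinLimit is a hypothetical budget of fast-path iterations.
  explicit Notifier(int spinLimit = 1000) : spinLimit_(spinLimit) {
    assert(spinLimit >= 0);
  }

  void notify() {
    {
      // Setting the flag under the mutex prevents a lost wakeup for a
      // waiter that is about to block on the slow path.
      std::lock_guard<std::mutex> lock(mutex_);
      signaled_.store(true, std::memory_order_release);
    }
    cv_.notify_all();
  }

  void wait() {
    // Fast path: poll the flag for a bounded number of iterations.
    for (int i = 0; i < spinLimit_; ++i) {
      if (signaled_.load(std::memory_order_acquire)) return;
      std::this_thread::yield();
    }
    // Slow path: block on the condition variable until notified.
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock,
             [this] { return signaled_.load(std::memory_order_acquire); });
  }

  bool signaled() const { return signaled_.load(std::memory_order_acquire); }

 private:
  const int spinLimit_;
  std::atomic<bool> signaled_{false};
  std::mutex mutex_;
  std::condition_variable cv_;
};
```

The spin budget trades latency against CPU usage: a short wait is satisfied without a system call, while a long wait stops burning cycles once the budget is exhausted.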
Change-Id: I893a8353121a25d11e37c8e631caf31cc1fc1f24
[ROCm/clr commit: f2ff56af9c]