Pass active queue for transfers in the cache coherency layer.
This allows the device transfer queue to be used only when an
active queue isn't available, because using the device transfer
queue from another active queue may cause a deadlock.
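
The selection rule can be sketched as follows. This is a minimal
illustration, not the ROCclr code: Queue, SelectTransferQueue, and both
parameter names are hypothetical.

```cpp
#include <cassert>

// Hypothetical stand-in for a runtime queue object.
struct Queue {
  int id;
};

// Prefer the caller's active queue for the transfer. Fall back to the
// device-wide transfer queue only when no active queue was passed in.
inline Queue* SelectTransferQueue(Queue* activeQueue,
                                  Queue* deviceTransferQueue) {
  // Assumption: the device transfer queue always exists as a fallback.
  assert(deviceTransferQueue != nullptr);
  return activeQueue != nullptr ? activeQueue : deviceTransferQueue;
}
```

Keeping the fallback in one helper means callers never reach for the
device transfer queue while they already hold an active queue.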
Change-Id: Ifbe7e0303b77dbf6eeda3939ffbc25a3df7472de
[ROCm/clr commit: 95d55fdfa8]
If the reported GlobalMemCacheLine is 0, the runtime may run into an
infinite loop, because KernelSegmentAlignment is set to the size of
the cache line.
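
One way a zero alignment can stall a loop, and a guard against it, is
sketched below. The fallback value, SegmentAlignment, and
CountAlignedChunks are assumptions made for illustration; the commit
does not state where the loop lives or which value the runtime picks.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical fallback used when the device reports a cache line of 0.
constexpr std::size_t kFallbackAlignment = 64;

// Derives the segment alignment from the reported cache line size,
// substituting the fallback when the reported value is 0.
constexpr std::size_t SegmentAlignment(std::size_t reportedCacheLine) {
  return reportedCacheLine != 0 ? reportedCacheLine : kFallbackAlignment;
}

// Counts the aligned chunks needed to cover `bytes`. The loop advances
// by `alignment`, so an alignment of 0 would never terminate.
inline std::size_t CountAlignedChunks(std::size_t bytes,
                                      std::size_t alignment) {
  assert(alignment != 0 && "alignment must be non-zero");
  std::size_t chunks = 0;
  for (std::size_t offset = 0; offset < bytes; offset += alignment) {
    ++chunks;
  }
  return chunks;
}
```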
Change-Id: Ide547940cc0407f16fab10ee210b4fd3ae4eaafc
[ROCm/clr commit: 041ddc0c1c]
OCL 2.2 requires SPIR-V, which the runtime doesn't support.
Make sure the PAL backend doesn't report any SPIR-V support.
Change-Id: I8d179069674205b54f7d20d149bcb675bee5cdb0
[ROCm/clr commit: 0bf395af39]
Code object version 5 metadata is an extension of CO3 and CO4.
Detect the new fields and program them during the setup of the
kernel arguments.
Change-Id: I27e58df77320ad00f4f16d35912668db803826af
[ROCm/clr commit: be6a06384e]
Use HSA_AMD_AGENT_INFO_COOPERATIVE_COMPUTE_UNIT_COUNT to get compute
units. This is needed to work around an asymmetric CU harvesting bug on
gfx90a. Add a new device property to get the max available CUs on the
device.
Change-Id: I878f38f14f16c1af01fc0a77157aea1e816a63b8
[ROCm/clr commit: 33aca5a4a6]
Report the proper target id for xnack in the HSAIL path. The runtime
will use the ISA table and report hsailName().
Fix offline compilation path for PAL.
Change-Id: Ic0250bf6b9c193d867aec9800a319da1bf00c3ee
[ROCm/clr commit: a543d4a860]
When OCL fails to obtain a function pointer from GL, it should not call it.
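
The guard can be sketched as below. GlEntryPoint and CallIfResolved are
hypothetical names, and the int(int) signature is only a stand-in for
whatever entry point was looked up.

```cpp
#include <cassert>

// Hypothetical signature for an entry point looked up from the GL driver.
using GlEntryPoint = int (*)(int);

// Calls the entry point only if the lookup succeeded. Returns false, and
// leaves `result` untouched, when the pointer is null.
inline bool CallIfResolved(GlEntryPoint fn, int arg, int* result) {
  assert(result != nullptr);
  if (fn == nullptr) {
    return false;  // lookup failed: let the caller report an error
  }
  *result = fn(arg);
  return true;
}
```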
Change-Id: I50f69d270ce445386906a286e44c4e8c83722302
[ROCm/clr commit: 15101e704b]
Add a state indicator to retain ExternalSignals when needed.
Cooperative group launch uses external signals to indicate a dependency
to the next command.
Change-Id: I6d0daa006e2377c3bbf4aeca0fd5b63c7ac8fbbb
[ROCm/clr commit: 1fbd75b825]
The crash occurred because the external signal structure was stale even
after destroying the command. A missing check caused the wait to be
skipped.
Detect external signals and dispatch a barrier in ReleaseGpuMemoryFence.
Also clear external_signals_ at ProfilingBegin.
Change-Id: I991387edcfe928b511bf5e780988ee131321ed5a
[ROCm/clr commit: 3239222516]
Add a threshold for ROCR/SDMA P2P transfers. The ROCR copy path
requires extra barriers in compute for synchronization, which hurts
performance for tiny transfers.
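
A size threshold of this kind can be sketched as follows. The threshold
value, the enum, and the choice of a kernel blit as the small-copy
alternative are all assumptions; the commit message names none of them.

```cpp
#include <cstddef>

enum class P2PCopyPath {
  KernelBlit,  // assumed alternative path for small copies
  RocrSdma,    // ROCR/SDMA path, which needs extra compute barriers
};

// Hypothetical threshold; the commit does not state the actual value.
constexpr std::size_t kP2PSdmaThreshold = 64 * 1024;

// Routes a P2P copy by size: only copies at or above the threshold take
// the ROCR/SDMA path, so tiny copies avoid the extra barrier cost.
constexpr P2PCopyPath ChooseP2PCopyPath(
    std::size_t bytes, std::size_t threshold = kP2PSdmaThreshold) {
  return bytes >= threshold ? P2PCopyPath::RocrSdma
                            : P2PCopyPath::KernelBlit;
}
```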
Reduce active wait time to 10us. TensorFlow uses an extra thread
per GPU with constant hipEventQuery() calls. Longer active waits
in ROCr affect CPU performance.
Change-Id: I9020358438615fa2d4617f862f00a562f0a588e7
[ROCm/clr commit: 008133cf41]
With SAM on, don't force Persistent for allocations
in HIP. Forcing it makes ROCCLR go down paths we don't want
for HIP.
Change-Id: If54cc16fa891d4cfdc761c6ab21ad707627e822a
[ROCm/clr commit: 5243552768]