Setting HIP_VISIBLE_DEVICES= (an empty value) should be treated as an
invalid device, which makes all devices invisible to the app. This
matches the CUDA behavior.
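A minimal sketch of the intended parsing rule, assuming a hypothetical
helper named parseVisibleDevices (this is not the ROCclr code). It
separates an unset variable from one set to an empty string, and stops at
the first invalid entry, which is how CUDA documents its handling of
CUDA_VISIBLE_DEVICES:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only.
// env == nullptr : variable is unset, every device stays visible.
// env == ""      : treated as an invalid device, so no device is visible.
// Otherwise      : comma-separated indices, parsing stops at the first
//                  invalid entry and keeps the indices seen before it.
std::vector<int> parseVisibleDevices(const char* env, int deviceCount) {
  assert(deviceCount >= 0);
  std::vector<int> visible;
  if (env == nullptr) {
    for (int i = 0; i < deviceCount; ++i) visible.push_back(i);
    return visible;
  }
  const std::string value(env);
  std::size_t pos = 0;
  while (pos <= value.size()) {
    std::size_t comma = value.find(',', pos);
    if (comma == std::string::npos) comma = value.size();
    const std::string token = value.substr(pos, comma - pos);
    // An empty token (including the empty string itself) is invalid.
    if (token.empty() || token.size() > 9 ||
        token.find_first_not_of("0123456789") != std::string::npos) {
      break;
    }
    const int index = std::stoi(token);
    if (index >= deviceCount) break;
    visible.push_back(index);
    pos = comma + 1;
  }
  return visible;
}
```

With this rule, an unset variable exposes every device, while
HIP_VISIBLE_DEVICES= exposes none.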
Change-Id: I937ac4c0b7dacff776cdbe692d4576c81b86ee2d
[ROCm/clr commit: b5799c4dbe]
This reverts commit 06593a072f.
Reason for revert: ROCr query should now be usable in upcoming release.
Change-Id: I2207761ca6af5d585d090bae1af09eb9a8e9bad6
[ROCm/clr commit: a52f5bda8f]
Sync between compute and SDMA engines can be very expensive under Windows.
Use CP DMA for tiny transfers (< 1KiB) to avoid syncs and improve performance.
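The size-based selection can be sketched as follows. The enum, the
constant, and the function name are hypothetical, and routing larger
transfers to SDMA is an assumption; only the 1 KiB cutoff comes from the
message above:

```cpp
#include <cstddef>

// Hypothetical names for illustration only.
enum class CopyEngine { CpDma, Sdma };

// Cutoff taken from the commit message: transfers below 1 KiB use CP DMA.
constexpr std::size_t kCpDmaThresholdBytes = 1024;

// Assumption: anything at or above the cutoff goes to SDMA.
constexpr CopyEngine selectCopyEngine(std::size_t bytes) {
  return bytes < kCpDmaThresholdBytes ? CopyEngine::CpDma : CopyEngine::Sdma;
}
```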
Change-Id: I9db39a2199f7b9e337ed08fd36d9cbc150502f1f
[ROCm/clr commit: 473621c008]
The profiler can retrieve this correlation ID to attribute waves to
specific dispatch locations.
Change-Id: I700e8a91219d612f6a2028c0dda0c92753f3526a
[ROCm/clr commit: b043b4f5a2]
HIP can't rely on the resource tracking used in OCL and requires a different, explicit sync.
Make sure ROCCLR syncs compute only when SDMA is used and vice versa.
The new logic allows enabling CPDMA without unnecessary waits.
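One way to picture the rule is a tracker that asks for a wait only when a
submission switches engines. This is a sketch under that assumption; the
class and method names are hypothetical and do not come from ROCCLR:

```cpp
// Hypothetical names for illustration only.
enum class Engine { Compute, Sdma };

class EngineSyncTracker {
 public:
  // Returns true when the caller should wait for the other engine before
  // submitting to `next`. Back-to-back work on one engine needs no wait.
  bool needsSync(Engine next) {
    const bool mustWait = hasPending_ && last_ != next;
    last_ = next;
    hasPending_ = true;
    return mustWait;
  }

 private:
  Engine last_ = Engine::Compute;
  bool hasPending_ = false;  // false until the first submission
};
```

Under this scheme a sequence of compute-only submissions never triggers
a wait.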
Change-Id: Ib9d1788cfd5afa5ea2fec4c96a37d8b9c4d0059d
[ROCm/clr commit: ff6b4db70b]
Blender creates and destroys big allocations during the benchmark.
That causes big delays, because vidmm has to page-in/page-out memory.
Change-Id: I2baf4545807127406e3d2870a7581ff9ae7bcdb5
[ROCm/clr commit: dc4ad8c99c]
- Make sure the SQTT trace is captured for the RGP server if the queue is destroyed before the normal capture is done.
- Remove the prepare queue from the logic. It's not really used for any HW capture and can cause an RGP server abort if destroyed before capture has even started (delayed capture).
Change-Id: I6eb19963190a5769c6477a5496c1b831a6d59b89
[ROCm/clr commit: c1c5127875]
On 32-bit x86 Windows, sizeof(size_t) = 4, but the parameter size is 8 for
amd::KernelParameterDescriptor::HiddenGlobalOffsetX/Y/Z items.
Loosen the condition to prevent a crash.
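The message does not spell out the new condition. The sketch below shows
one plausible form, assuming a hypothetical helper that writes a host
size_t into a hidden-argument slot and accepts any slot at least as wide
as size_t instead of requiring an exact match:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper for illustration only.
// A strict check (slotSize == sizeof(std::size_t)) rejects an 8-byte slot
// on a 32-bit host where sizeof(std::size_t) is 4. The loosened check
// only requires the slot to be large enough.
bool writeHiddenOffset(unsigned char* slot, std::size_t slotSize,
                       std::size_t value) {
  assert(slot != nullptr);
  if (slotSize < sizeof(std::size_t)) return false;
  std::memset(slot, 0, slotSize);                  // zero the upper bytes
  std::memcpy(slot, &value, sizeof(std::size_t));  // assumes little-endian
  return true;
}
```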
Change-Id: I2216f71f4d4fd6dd3766023b1c821cb3d35d7848
[ROCm/clr commit: 3d281114fb]
The ROCclr assigns zero-based IDs to GPUs in the order they are
discovered. That zero-based ID is what is used to identify the GPU
on which the HIP_OPS activity took place.
When multiple ranks are used, each rank's first logical device always
has GPU ID 0, regardless of which physical device is selected with
CUDA_VISIBLE_DEVICES. Because of this, when merging trace files from
multiple ranks, GPU IDs from different processes may overlap.
The long-term solution is to use the KFD's gpu_id, which is stable
across APIs and processes. Unfortunately, the gpu_id is not yet exposed
by the ROCr, so for now use the driver's node id.
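To illustrate why the node id avoids the overlap, here is a small sketch
with hypothetical types and made-up node id values. Each rank sees its
single device at logical index 0, but the reported ID differs per
physical device:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for illustration only.
struct DeviceInfo {
  std::uint32_t driverNodeId;  // assumed to be unique per physical GPU
};

// Report the driver node id instead of the zero-based logical index.
// Throws std::out_of_range if logicalIndex is not a visible device.
std::uint32_t traceGpuId(const std::vector<DeviceInfo>& visible,
                         std::size_t logicalIndex) {
  return visible.at(logicalIndex).driverNodeId;
}
```

Two ranks that each select one distinct physical device both use logical
index 0, yet their trace records carry different IDs and can be merged
without collisions.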
Change-Id: Ib78854527d600d175bb76e2df0747c33f898c615
[ROCm/clr commit: 9a82118c85]
- Use a dirty flag to determine fence optimization.
- If the fence is dirty, submit a marker at the top level to sync.
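A rough sketch of the dirty-flag idea, assuming the flag is set on every
submission and cleared when a marker is issued. The class and its methods
are hypothetical; the marker submission is represented by a counter:

```cpp
// Hypothetical class for illustration only.
class FenceState {
 public:
  // Record that work was submitted since the last sync.
  void onSubmit() { dirty_ = true; }

  // Issue a marker only when something is outstanding.
  // Returns true if a marker was needed.
  bool syncIfDirty() {
    if (!dirty_) return false;  // fence is clean, skip the marker
    ++markersSubmitted_;        // stands in for the real marker submission
    dirty_ = false;
    return true;
  }

  int markersSubmitted() const { return markersSubmitted_; }

 private:
  bool dirty_ = false;
  int markersSubmitted_ = 0;
};
```

Repeated sync requests with no intervening submission then cost nothing
beyond the flag check.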
Change-Id: I53fb19b5bb05b7c7b37c41637a6c7aaf870b639a
[ROCm/clr commit: 6405b6cdba]