Although unpinned copies require synchronization in HIP, the
runtime can avoid syncs for H2D copies by using a staging
buffer.
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
This reverts commit 7d3c0c5e10.
Changing the error code is considered a breaking change,
so it should be done in major releases only.
The other reason for reverting the commit is that the change itself
is incorrect. CUDA behaves the same way as HIP when
pResDesc or pTexDesc is nullptr.
Change-Id: I3abee6b79279b81ab01c7f8466c7f8e3776c4109
1) Child graph nodes need to have the parent graph's dependencies in
their waitlist.
2) A marker is placed on the base stream with the parent graph's waitlist.
Change-Id: Iec65a0171ea387be05b0733abcc708fb630e4be4
This PR adds UberTrace-based tracing support to ROCclr's PAL device class.
Legacy RGP-based tracing is still available and is the default.
If UberTrace support is enabled tool-side, this new code path will activate.
Change-Id: I268b2dcef70e850a50e2caef8355f38bf51d4641
The new set tracks only the queues that have a command
submitted to them. This allows for fast iteration
in waitActiveStreams.
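
A minimal sketch of the idea, not the actual runtime code: a second set
holds only the queues that received a submission, and the wait loop
walks that set instead of every queue. Queue, QueueTracker, submit and
waitActive are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for a device queue; not ROCclr's real class.
struct Queue {
  int pending = 0;                // commands submitted but not yet completed
  void finish() { pending = 0; }  // stand-in for a blocking wait
};

class QueueTracker {
 public:
  // Register a queue; it is not considered active until a submission.
  void add(Queue* q) { all_.push_back(q); }

  // Record a submission and mark the queue as active.
  void submit(Queue* q) {
    assert(q != nullptr);
    ++q->pending;
    active_.insert(q);
  }

  // Wait only on queues with a submitted command.
  // Returns how many queues were visited.
  std::size_t waitActive() {
    std::size_t visited = 0;
    for (Queue* q : active_) {
      q->finish();
      ++visited;
    }
    active_.clear();  // nothing is outstanding after the wait
    return visited;
  }

 private:
  std::vector<Queue*> all_;            // every known queue
  std::unordered_set<Queue*> active_;  // queues with a submitted command
};
```

With three registered queues and submissions to only one of them, the
wait loop in this sketch visits a single queue.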
Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7
The variable is already set as a cache variable so that the user can
override it, but the hard-coded setting prevented the override.
Removed the hard-coded setting.
Change-Id: I2aecc18ce4f1d1b523ba267ef1c8ef4ea1168d9c
1) Currently cpu wait is set to true, which makes the host wait for the
last command in the queue to finish even if kernel execution has already
finished, causing a delay in the device sync call.
2) Device sync only needs to await completion when the hw event
is not ready.
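
The intended behavior can be sketched as follows. This is an
illustration under assumptions, not the runtime's implementation:
HwEvent, isReady, awaitCompletion and deviceSynchronize are
hypothetical names.

```cpp
#include <cassert>

// Hypothetical stand-in for a hardware completion event; not a ROCclr type.
struct HwEvent {
  bool ready = false;
  int host_waits = 0;  // counts blocking host waits, for illustration only

  bool isReady() const { return ready; }
  void awaitCompletion() {
    ++host_waits;  // a real runtime would block the host thread here
    ready = true;
  }
};

// Blocks the host only when the last event has not completed yet.
// Returns true if the host had to wait.
inline bool deviceSynchronize(HwEvent& last) {
  if (last.isReady()) {
    return false;  // work already finished: skip the host wait
  }
  last.awaitCompletion();
  assert(last.isReady());
  return true;
}
```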
Change-Id: I91e3e89d39a1193ae06abac822cea8ae651493a5
Use env var DEBUG_CLR_KERNARG_HDP_FLUSH_WA=1 to fall back to the HDP
flush workaround. The default is 0.
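
A sketch of how such a switch might be read, assuming a plain getenv
lookup (the runtime has its own flag handling, which may differ).
parseFlag and kernargHdpFlushWorkaroundEnabled are hypothetical helper
names; only the variable name comes from this change.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical helper: treats exactly "1" as enabled.
// A null pointer (variable unset) maps to the default of 0.
inline bool parseFlag(const char* value) {
  return value != nullptr && std::strcmp(value, "1") == 0;
}

// Hypothetical helper: true when the HDP flush workaround is requested.
inline bool kernargHdpFlushWorkaroundEnabled() {
  return parseFlag(std::getenv("DEBUG_CLR_KERNARG_HDP_FLUSH_WA"));
}
```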
Change-Id: I7bdb9be61da60c30d15ac9991b7cd27351e1831c
Ensure the member functions Alloc() and Free() of command_pool_ are not
accessed after command_pool_ has been destroyed.
Signed-off-by: Chong Li <chongli2@amd.com>
Change-Id: Ic2d36423302518a030bd61fa399290ebe2ed8194
1) Since g_devices is not initialized when the stream_per_thread
constructor is called on Windows, m_streams is empty when hipDeviceReset
is called.
2) clear_spt tries to access the empty vector, causing a segfault in
the hipDeviceReset call.
3) On Linux, ROCCLR_INIT_PRIORITY makes sure that g_devices is initialized
before the TLS constructor creates the stream_per_thread object.
Change-Id: Ib2ba643d1278d820287ea3b242ed0878d7529165
The amdgpu-arch tool is not supported for static builds.
This commit detects the build type during CMake configuration
and uses rocm_agent_enumerator for static builds.
Change-Id: I8a295e01f54075507390ef540f16b28bb20237a9
This change adds a new HIP API `hipExtHostAlloc` which preserves
the functionality of `hipHostMalloc`.
Change-Id: I13504c6fc13465ddd7aed329795bb4f2fef1baff
Integration into PyTorch pointed out some value narrowing issues; to
fix them we now use unions. Also removed the check for the -munsafe*
compiler flag. The check is now based only on builtin detection.
Change-Id: I49364503fa429bd862952f9b29879072afa6d553