* When hipMemset3dAsync is captured, a 3D extent (depth > 1) can be passed as a parameter. That worked on NVIDIA, but on AMD the wrong portion of the array was filled: when the Memset3D command was created, the extent dimensions were used to build the pitchedPtr instead of the original array width and height.
* Also, when hipMemset3dAsync is captured, NVIDIA allows any of the extent dimensions to be 0; in that case no work should be done.
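The pitch issue above can be illustrated with a minimal host-side sketch. It is not the runtime code: Extent3D and memset3d are hypothetical names, and the array is assumed to be densely packed bytes with the filled box starting at the origin. The point is only that row and slice strides come from the array dimensions, and that a zero extent is a no-op.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a 3D size (width, height, depth in bytes/rows/slices).
struct Extent3D {
  std::size_t width;
  std::size_t height;
  std::size_t depth;
};

// Fills an extent-sized box at the origin of a densely packed byte array.
// Assumption: strides are derived from the ARRAY dimensions, never the extent.
inline void memset3d(std::vector<std::uint8_t>& data, const Extent3D& array,
                     const Extent3D& extent, std::uint8_t value) {
  assert(data.size() == array.width * array.height * array.depth);
  assert(extent.width <= array.width && extent.height <= array.height &&
         extent.depth <= array.depth);

  // Any zero dimension means there is nothing to fill.
  if (extent.width == 0 || extent.height == 0 || extent.depth == 0) return;

  const std::size_t rowPitch = array.width;                   // bytes per row
  const std::size_t slicePitch = array.width * array.height;  // bytes per slice

  for (std::size_t z = 0; z < extent.depth; ++z) {
    for (std::size_t y = 0; y < extent.height; ++y) {
      std::fill_n(data.data() + z * slicePitch + y * rowPitch, extent.width, value);
    }
  }
}
```

If the extent width were used as the row pitch instead, the second row of the box would start at the wrong offset, which matches the symptom described above.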
Change-Id: I46a605bf9ae801cd3348e98d528c21263a8eefce
Add an atomic counter to track the outstanding HSA handlers.
Wait on the CPU for the callbacks if the count exceeds the value
of the DEBUG_HIP_BLOCK_SYNC environment variable.
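A rough sketch of the counter-and-throttle idea follows. HandlerTracker and its methods are hypothetical names, not the runtime's; the limit is passed in directly rather than read from DEBUG_HIP_BLOCK_SYNC, and treating a non-positive limit as "never block" is an assumption of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical tracker for outstanding completion handlers.
class HandlerTracker {
 public:
  // Assumption: limit <= 0 disables throttling entirely.
  explicit HandlerTracker(int limit) : limit_(limit) {}

  // Called when a command with a completion handler is submitted.
  void onSubmit() { outstanding_.fetch_add(1, std::memory_order_relaxed); }

  // Called from the handler once the command has completed.
  void onComplete() {
    const int prev = outstanding_.fetch_sub(1, std::memory_order_release);
    assert(prev > 0);
    (void)prev;
  }

  int outstanding() const { return outstanding_.load(std::memory_order_acquire); }

  bool overLimit() const { return limit_ > 0 && outstanding() > limit_; }

  // Blocks the submitting CPU thread until handlers drain below the limit.
  void throttle() const {
    while (overLimit()) std::this_thread::yield();
  }

 private:
  const int limit_;
  std::atomic<int> outstanding_{0};
};
```

A submit path would call onSubmit() and then throttle(), while the handler thread calls onComplete().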
Change-Id: I95dc8c4bf0258c7e59411b7504220709ed6898c5
=> GraphExec instance is destroyed before the async launch completes;
destroy it after all pending graph launches
=> Remove GraphExec destroy during next sync point (hipStreamSync,
hipDeviceSync, etc.)
Change-Id: I4df682aae5787fd6e5240a7be936ce50361345d0
1) SW conversions for OCP and FNUZ are enabled on pre-MI300 archs
2) For MI300, only FNUZ is enabled
3) For gfx1200, only OCP is enabled
Change-Id: I90373752a2d15eff20d5deec874ed396ba4e1788
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for the CPU status update of the submitted commands
every 50 calls. This allows the commands to drain and the
HSA signals to be destroyed.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
PAL supports allocating from system memory once device memory is used up
or allocation is larger than the device memory.
Change-Id: Iccd3377e95a6cc6d23e45d4738a17af8b9ee32d7
* When a kernel is launched with hipExtLaunchKernelGGL and a stop event is used, hipGraphInstantiate leaks. Because a stop event is used, profiling is enabled and a Timestamp (ReferencedCountedObject) is created, but it is never released.
* The idea behind this fix is that profiling should be disabled while a command is being captured, so the timestamp should not be created. Because capture information is not available when the kernel command is created, the packet capturing state is used to decide whether to create a timestamp.
Change-Id: Ia23adac4592ded4fb5e236acf99e12e729f63692
Although unpinned copies require synchronization
in HIP, the runtime can avoid syncs for H2D copies with
a staging buffer
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
This reverts commit 7d3c0c5e10.
Changing the error code is considered a breaking change,
so it should be done in major releases only.
The other reason for reverting the commit is that the change itself
is incorrect: CUDA behaves the same way as HIP when
pResDesc or pTexDesc is nullptr.
Change-Id: I3abee6b79279b81ab01c7f8466c7f8e3776c4109
1) Child graph nodes need to have the parent graph dependencies in their waitlist.
2) A marker is placed on the base stream with the parent graph waitlist.
Change-Id: Iec65a0171ea387be05b0733abcc708fb630e4be4
The new set tracks only the queues that have a command
submitted to them. This allows for fast iteration
in waitActiveStreams.
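As an illustration only, the active-set idea might look like the sketch below. Queue, QueueRegistry, and their members are hypothetical names, and clearing the set after waiting is an assumption of this sketch, not something the commit states.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical queue; "waiting" is modeled as clearing a pending flag.
struct Queue {
  explicit Queue(int queueId) : id(queueId), hasPendingWork(false) {}
  int id;
  bool hasPendingWork;
};

class QueueRegistry {
 public:
  void add(Queue* q) { all_.push_back(q); }

  // Record the queue as active the first time a command is submitted to it.
  void onSubmit(Queue* q) {
    assert(q != nullptr);
    q->hasPendingWork = true;
    active_.insert(q);
  }

  // Visits only queues that received a command, instead of every queue.
  // Returns how many queues were visited.
  std::size_t waitActive() {
    const std::size_t visited = active_.size();
    for (Queue* q : active_) q->hasPendingWork = false;  // stand-in for a real wait
    active_.clear();  // assumption: a drained queue is no longer active
    return visited;
  }

  std::size_t activeCount() const { return active_.size(); }
  std::size_t totalCount() const { return all_.size(); }

 private:
  std::vector<Queue*> all_;
  std::unordered_set<Queue*> active_;
};
```

With many idle queues and few busy ones, the wait loop scales with the number of busy queues rather than the total.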
Change-Id: I2c832eefa01280d9a87a5f57874d36d2e9441de7
The variable is already set as a cache variable so that the user can override it,
but the hard-coded setting was preventing the override. Remove the hard-coded setting.
Change-Id: I2aecc18ce4f1d1b523ba267ef1c8ef4ea1168d9c
1) Currently cpu wait is set to true, which makes the host wait for the last
command in the queue to finish even if kernel execution has already
finished, delaying the device sync call.
2) Device sync only needs to await completion when the HW event
is not ready.
Change-Id: I91e3e89d39a1193ae06abac822cea8ae651493a5