- When a command may have two packets (like the device heap
initializer) and there is no signal on the main kernel packet, the
tracking was broken because it marked the first packet's signal as the
HW event of the command.
- If no completion signal is attached to the second packet, clear the
HW event for the command.
- This change tries to save the extra synchronization packets we may
insert because we didn't track the completion signals for every
command. We track the currently enqueued command until it exits the
enqueue stage. We also record the exit scope to know whether we flushed
the caches.
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
passed as the argument
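The two-packet tracking fix above can be sketched roughly as follows. This is a minimal model under stated assumptions, not the runtime's code: `Command`, `SignalId`, and `trackPacket` are hypothetical names, and a completion signal is reduced to an optional integer.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-ins; these are not the runtime's real types.
using SignalId = std::uint64_t;

struct Command {
  // Completion signal currently tracked as the command's HW event.
  std::optional<SignalId> hwEvent;
};

// Called once per dispatched packet of a command. `signal` is empty when the
// packet was submitted without a completion signal.
inline void trackPacket(Command& cmd, std::optional<SignalId> signal) {
  // Always overwrite: a later packet without a signal must clear the event
  // left behind by an earlier packet of the same command.
  cmd.hwEvent = signal;
}
```

With this shape, a heap-init packet that carried a signal followed by a main kernel packet without one leaves the command with no HW event, rather than with the stale first-packet signal.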
Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc
Since hipMemMap can be called for multiple device handles on the same virtual memory, the same is true for hipMemUnmap, meaning that virtual memory can be "partially unmapped".
This means that the unmap function can be called for a specific part of the reserved address, meaning that only the designated subbuffer should be released. If unmap is called on the entire reserved memory, then all subbuffers should be released.
The main point is that for every hsa_amd_vmem_map, there should be a corresponding hsa_amd_vmem_unmap. Otherwise, if the entire memory is unmapped by a single unmap call, HSA will report the memory as "in use" when an attempt is made to delete it.
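The pairing rule can be illustrated with a small bookkeeping model. This is only a sketch: `ReservedRange` is a hypothetical class, subbuffers are assumed not to overlap, and the place where the runtime would call hsa_amd_vmem_unmap is marked by a comment instead of a real call.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Hypothetical bookkeeping for one reserved virtual address range.
class ReservedRange {
 public:
  // Records a mapped subbuffer at [offset, offset + size).
  void map(std::size_t offset, std::size_t size) {
    assert(size > 0);
    mapped_[offset] = size;
  }

  // Releases every subbuffer fully contained in [offset, offset + size) and
  // returns how many per-subbuffer unmap calls were issued.
  std::size_t unmap(std::size_t offset, std::size_t size) {
    std::size_t calls = 0;
    auto it = mapped_.lower_bound(offset);
    while (it != mapped_.end() && it->first + it->second <= offset + size) {
      // The runtime would issue one hsa_amd_vmem_unmap for this subbuffer here.
      it = mapped_.erase(it);
      ++calls;
    }
    return calls;
  }

  std::size_t mappedCount() const { return mapped_.size(); }

 private:
  // Key: start offset of a mapped subbuffer, value: its size in bytes.
  std::map<std::size_t, std::size_t> mapped_;
};
```

Unmapping one 4 KiB slice issues a single unmap, while unmapping the whole reservation issues one unmap per remaining subbuffer, so every map has its matching unmap.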
Change-Id: I039308eafb820decfb1c09f60347f26cdad1a362
- If any kernel uses the device heap, the launch needs to be preceded by
an init kernel. Save on the extra barrier packet launch/flush between
the init heap kernel and the user kernel.
Change-Id: I8ebc6246188200e5f673dc464bc76a53bcb8b7c6
- Resolve signal dependencies for the barrier value packet if there is
more than one dependent signal. The barrier value packet accounts for
only one dependent signal.
- Better log
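One way to picture the dependency resolution is a planner that moves surplus dependencies into preceding barrier-AND packets. This is a sketch under assumptions, not the actual implementation: `planBarrierValue`, `PlannedPacket`, and `SignalId` are hypothetical, and it assumes the first dependency stays with the barrier value packet while the rest are spread over barrier-AND packets, which carry up to five dependent signals each.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using SignalId = std::uint64_t;  // hypothetical stand-in for a signal handle

enum class PacketKind { BarrierAnd, BarrierValue };

struct PlannedPacket {
  PacketKind kind;
  std::vector<SignalId> deps;  // signals this packet waits on
};

// A barrier-AND packet carries up to five dependent signals.
constexpr std::size_t kBarrierAndSlots = 5;

std::vector<PlannedPacket> planBarrierValue(const std::vector<SignalId>& deps) {
  std::vector<PlannedPacket> plan;
  // Dependencies beyond the first go into preceding barrier-AND packets.
  for (std::size_t i = 1; i < deps.size(); i += kBarrierAndSlots) {
    const std::size_t end = std::min(i + kBarrierAndSlots, deps.size());
    PlannedPacket barrier{PacketKind::BarrierAnd, {}};
    barrier.deps.assign(deps.begin() + static_cast<std::ptrdiff_t>(i),
                        deps.begin() + static_cast<std::ptrdiff_t>(end));
    plan.push_back(std::move(barrier));
  }
  // The barrier value packet keeps at most one dependency.
  PlannedPacket value{PacketKind::BarrierValue, {}};
  if (!deps.empty()) value.deps.push_back(deps.front());
  plan.push_back(std::move(value));
  return plan;
}
```

For seven dependent signals this plan yields two barrier-AND packets (five and one dependencies) followed by the barrier value packet with a single dependency.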
Change-Id: Ia506ad5d80b91d598f92e7b539f41756e9b4b64b
This change fixes random segfaults in graph tests that
are seen after the change that made internal callbacks non-blocking.
The callback thread that decreases the GraphExec ref count
may now run after the runtime shutdown. This can cause a segfault
because the hip::device that is accessed in GraphExec destructor
is already destroyed during runtime shutdown. This patch ensures
that the hip::device object stays alive until after the
callback thread completes.
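The lifetime rule can be sketched with shared ownership. This is an illustration only: it uses std::shared_ptr rather than the runtime's own reference counting, `Device` is a hypothetical stand-in for hip::device, and the late callback is modeled as a deferred function instead of a real thread so the ordering is deterministic.

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct Device {  // hypothetical stand-in for hip::device
  bool usable = true;
};

// Returns true if a callback that runs after "shutdown" still sees a live
// device, and the device is destroyed once that callback has finished.
bool lateCallbackSeesLiveDevice() {
  auto device = std::make_shared<Device>();
  std::weak_ptr<Device> observer = device;

  // The pending callback holds its own reference to the device.
  std::function<bool()> pending = [held = device] { return held->usable; };

  device.reset();  // runtime shutdown drops its reference first
  const bool aliveAfterShutdown = !observer.expired();

  const bool callbackOk = pending();  // the late callback touches the device
  pending = nullptr;                  // callback done: last reference released

  return aliveAfterShutdown && callbackOk && observer.expired();
}
```

The device is released only after the callback that still needs it has run, which is the ordering the patch is after.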
Change-Id: I75a6ac01f27a0b2250bbd10ed389ebfb322927af
- Added DEBUG_CLR_SKIP_RELEASE_SCOPE flag to force release scope to
SCOPE_NONE in AQL packet header
Change-Id: Ife02cddb9d5cd4749103ce585d3d5fe9024c6868
In active wait mode, use signals without interrupts by default and switch
to interrupts only if a callback is required.
Change-Id: Ibcde8f7d44c70f8fb8fa5e0a7fdd8b08a2982a8e
- Refactor blit code and clean ASAN instrumentation
- Use unified function for rocr copy
- Enable shader copy path for unpinned writeBuffer/readBuffer paths
- Set GPU_FORCE_BLIT_COPY_SIZE=16 which means we will use BLIT copy for
pinned copies or unpinned H2D/D2H copies < 16KB
Change-Id: I42045cca79234b340dbf53dafb93044199736ae4
hsa_amd_profiling_async_copy_enable is taking 45us for the first call. Disable sdma profiling for enqueuing captured kernel packets and for accumulate command.
Change-Id: I80b51a58c46bccc9c1025e9331515f57c97b5a2a
Runtime may use checkGpuTime() for the wait and not just for the GPU time queries. Hence, the call can't be skipped if profiling isn't enabled.
More changes are required for this optimization.
Change-Id: I79e8918312e755d75f0d26685f2fdc604a8ffb18
Add an atomic counter to track the outstanding HSA handlers.
Wait on CPU for the callbacks if the number exceeds the value
in DEBUG_HIP_BLOCK_SYNC env variable.
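A minimal sketch of such a counter is shown below. `HandlerTracker` and its methods are hypothetical names, and the limit is passed in directly rather than read from DEBUG_HIP_BLOCK_SYNC.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

class HandlerTracker {
 public:
  explicit HandlerTracker(std::uint32_t limit) : limit_(limit) {}

  // Registers one more outstanding handler. Returns true when the caller
  // should wait on the CPU because the count now exceeds the limit.
  bool onSubmit() {
    const std::uint32_t now =
        outstanding_.fetch_add(1, std::memory_order_relaxed) + 1;
    return now > limit_;
  }

  // Called once a handler's callback has run.
  void onComplete() {
    const std::uint32_t before =
        outstanding_.fetch_sub(1, std::memory_order_relaxed);
    assert(before > 0 && "completion without a matching submit");
    (void)before;  // silence unused warning when asserts are compiled out
  }

  std::uint32_t outstanding() const {
    return outstanding_.load(std::memory_order_relaxed);
  }

 private:
  std::atomic<std::uint32_t> outstanding_{0};
  std::uint32_t limit_;
};
```

With a limit of 2, the first two submissions proceed and the third one reports that the caller should block until some callbacks complete.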
Change-Id: I95dc8c4bf0258c7e59411b7504220709ed6898c5
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for the CPU status update of the submitted commands
every 50 calls. This allows the runtime to drain the commands and
destroy HSA signals.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
* In a scenario where a kernel is launched with hipExtLaunchKernelGGL and a stop event is used, hipGraphInstantiate leaks. Since the stop event is used, profiling is enabled and a Timestamp (ReferenceCountedObject) is created, but it doesn't get released.
* The idea behind this solution is that profiling should be disabled when a command is captured, hence the timestamp should not be created. Because information about capturing isn't available when the kernel command is created, the packet capturing state is used to determine whether to create a timestamp or not.
Change-Id: Ia23adac4592ded4fb5e236acf99e12e729f63692
Although unpinned copies require synchronizations
in HIP, runtime can avoid syncs for H2D copies with
a staging buffer
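The reason a staging buffer removes the need for a sync can be sketched as follows. This is an assumption about the general technique, not the runtime's code: `StagingBuffer` is a hypothetical class, and the asynchronous device transfer itself is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical staging step for an unpinned host-to-device copy.
class StagingBuffer {
 public:
  // Snapshots `bytes` from `src`. Once this returns, the caller may reuse or
  // free `src` even though the device transfer has not happened yet.
  void stage(const void* src, std::size_t bytes) {
    assert(src != nullptr || bytes == 0);
    data_.resize(bytes);
    if (bytes != 0) std::memcpy(data_.data(), src, bytes);
  }

  // The asynchronous device copy would read from here instead of `src`.
  const std::vector<unsigned char>& contents() const { return data_; }

 private:
  std::vector<unsigned char> data_;
};
```

Because the user data is copied into runtime-owned memory before the call returns, later writes to the source buffer cannot affect the pending transfer.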
Change-Id: If2203c6bc0cbd89742823688dc8e89e9acd873b2
This patch fixes a potential issue caused by filling the AQL header
before filling the AQL body. The HSA spec specifies "Packet processors
may process AQL packets after the packet format field is updated, but
before the doorbell is signaled."
However, the hipGraph AQL packet was given a valid header before its
body was filled, so the CP could receive an invalid AQL body.
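The intended ordering, body first and header last, can be sketched like this. `Packet` is a simplified stand-in rather than the real 64-byte AQL layout, `publish` is a hypothetical helper, and the two header constants are illustrative values only.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Illustrative header values for an invalid slot and a kernel dispatch.
constexpr std::uint16_t kInvalidHeader = 1;
constexpr std::uint16_t kDispatchHeader = 2;

struct Packet {  // simplified stand-in, not the real AQL packet layout
  std::atomic<std::uint16_t> header{kInvalidHeader};
  std::uint64_t kernelObject = 0;
  std::uint64_t kernargAddress = 0;
};

void publish(Packet& slot, std::uint64_t kernelObject,
             std::uint64_t kernargAddress, std::uint16_t header) {
  assert(header != kInvalidHeader);
  // 1. Fill the body while the header still marks the slot as invalid.
  slot.kernelObject = kernelObject;
  slot.kernargAddress = kernargAddress;
  // 2. Publish the header last. The release store keeps the body writes
  //    ordered before the header becomes visible to a reader.
  slot.header.store(header, std::memory_order_release);
}
```

A consumer that observes a valid header with an acquire load is then guaranteed to see the completed body as well.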
Change-Id: I84af798c19ee2b8805ba19732b0eabdea2958a96
- Added the optimized multi stream path in graph execution. It uses a fixed number of async streams in the execution
- Optimize the launch latency, where command
creation and execution are done at the same time
- Optimize the scheduling to use less barriers and waiting signals if
the same queue can be detected
- The new path is controlled by DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single-queue async
execution in graphs (applicable to Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
=> Added support for capturing multiple AQL packets.
=> Added an interface to call back into the HIP runtime from rocclr to
allocate kernel args from the graph kernel arg pool.
=> Enabled support for capturing the memset node.
Change-Id: I7e1c2ba06927459e024653058af142bd82192c43
Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch.
Added support to capture single graph memset node.
Capture support for memset node is currently disabled.
Memset capture will be enabled when capture for multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
This change modifies the readback mechanism to use a pointer to volatile
instead of a volatile pointer. This ensures that the compiler does not
optimize away the read operation.
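The difference between the two declarations can be shown in a few lines. This is a generic illustration, not the code from the change: `readback`, `VolatilePtr`, and `PtrToVolatile` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Volatile pointer: the pointer variable itself is volatile, the pointee is
// not, so a read through it may still be optimized away.
using VolatilePtr = std::uint32_t* volatile;

// Pointer to volatile: every access through the pointer is a side effect the
// compiler has to perform.
using PtrToVolatile = volatile std::uint32_t*;

static_assert(!std::is_volatile_v<std::remove_pointer_t<VolatilePtr>>,
              "pointee of a volatile pointer is not volatile");
static_assert(std::is_volatile_v<std::remove_pointer_t<PtrToVolatile>>,
              "pointee of a pointer to volatile is volatile");

// Reads one word back through a pointer to volatile so the load is kept.
inline std::uint32_t readback(const void* addr) {
  assert(addr != nullptr);
  const auto* p = static_cast<const volatile std::uint32_t*>(addr);
  return *p;
}
```

Only the second form attaches the volatile qualifier to the memory being read, which is what a readback needs.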
Change-Id: I79ff925d615aa8cc4f950e8ff4b7e608fcb179a4