* clr: Use graph segment scheduling to process HIP Graphs
* Add a broader path to use capture packet capture for all topologies
* Refactor code
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING to toggle new vs classic path,
Enabled by default
* clr: Few fixes and improvements
* clr: Detect complex graphs to take classic path
* Use DEBUG_HIP_GRAPH_SEGMENT_SCHEDULING=2 to force segment scheduling
path
* clr: Fix a cornercase stack corruption
* clr: Track commands of segments instead of snapshots
* clr: Fix Batch dispatch logic
* Track fence_dirty_ flag for command of other streams
* Dependency resolution markers can now accomodate dirty fence on cross
streams
---------
Co-authored-by: Ioannis Assiouras <Ioannis.Assiouras@amd.com>
Co-authored-by: Godavarthy Surya, Anusha <agodavar@amd.com>
- Updated kernel arg manager to support allocating kernel args on multiple devices for single graph.
- Updated AQL path to capture on the device where graph node is added.
Co-authored-by: Anusha GodavarthySurya <Anusha.GodavarthySurya@amd.com>
*Lay foundation to batch packets efficiently for graphs
*Dynamically copy packets with max threshold set with
DEBUG_HIP_GRAPH_BATCH_SIZE, if not stagger packet copy with pow2
*Default threshold for DEBUG_HIP_GRAPH_BATCH_SIZE is 256
*If TS are not collected for a signal for reuse, create a new signal.
This can potentially increase signal footprint if the handler doesn't run
fast enough.
* SWDEV-465041 - Add support for user events with DD
User events can be replaced with HSA signals. Add the interface
to allocate HSA signal for user events and update the status on
CL_COMPLETE.
Force pinned path with DD to avoid blocking calls. Pinned memory
can be released only when the command is complete.
Simplify device enqueue path to use generic kernel arg buffer and
signals
* Fix notifyCmdQueue() logic for OCL
* Avoid blocking calls in OCL with DD
* Add event destruciton in a case of the failure.
[ROCm/clr commit: 2305f8ae56]
- When a command may possibly have two packets(like device heap
initializer), and if there is no signal on the main kernel packet the
tracking was broken as it marked HW event of the command as the first
packet signal.
- Make sure if no completion signal is attached to the second packet
then clear the HW event for the command.
[ROCm/clr commit: 072fb0804e]
- This change tries to save extra synchronization packets we may insert
as we didnt track the completion signals for every command. We track
the current enqueued command until it exits the enqueue stage. We also
record the exit scope to know if we flushed the caches
- Handle correct release scopes and store completion signal as HW events
- Use a new finishCommand implementation to only wait for the command
passed as the argument
Change-Id: Ie4350c5dd24f5d48dfa6ccbabd892f0544caadcc
[ROCm/clr commit: e03e4f3b5d]
- Don't generate callbacks for HIP events
- Don't process profiling info in the callback for HIP events
- Wait for CPU status update of the submitted commands
every 50 calls. That will allow to drain the commands and
destroy HSA signals.
Change-Id: Ib601a350e7e7c2b6c6209a172385389baccf73a9
[ROCm/clr commit: 364dfb0ed1]
* In a scenario where kernel is launched with hipExtLaunchKernelGGL and stop event is used, hipGraphInstantiate leaks. Since stop event is used, profiling is enabled and Timestamp (ReferencedCountedObject) is created, but it doesn't get released.
* The idea behind this solution is that profiling should be disabled when command is captured, hence the timestamp should not be created. Because information about capturing isn't available when kernel command is created, packet capturing state is used to determine whether to create a timestamp or not.
Change-Id: Ia23adac4592ded4fb5e236acf99e12e729f63692
[ROCm/clr commit: da5f1a6146]
Ensure the member function Alloc() and Free() of command_pool_ will not be
accessed after command_pool_ be destructed.
Signed-off-by: Chong Li <chongli2@amd.com>
Change-Id: Ic2d36423302518a030bd61fa399290ebe2ed8194
[ROCm/clr commit: e6a5c81221]
- Added the optimized multi stream path in graph execution. It uses a fixed number of async streams in the execution
- Optimize the launch latency, where commands
creation and execution is done at the same time
- Optimize the scheduling to use less barriers and waiting signals if
the same queue can be detected
- The new path is controlled by DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single queue async
execution in graphs(applicable for Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
[ROCm/clr commit: 9db52f9a46]
=> Added support to capture multiple AQL Packets.
=> Added Interface to callback to hip runtime from rocclr to allocate
kernel args from the graph kernel arg pool.
=> Enabled Support to capture memset node.
Change-Id: I7e1c2ba06927459e024653058af142bd82192c43
[ROCm/clr commit: bd3a35bde1]
Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch.
Added support to capture single graph memset node.
Capture support for memset node is currently disabled.
Memset capture will be enabled when capture for multiple packets are supported..
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
[ROCm/clr commit: 346da4bb40]
Implement std::mutex based monitor that has much
simpler logics than legacy monitor.
Create DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR to
toggle them.
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = false
(by default), use legacy monitor;
If DEBUG_CLR_USE_STDMUTEX_IN_AMD_MONITOR = true,
use std::mutex based monitor.
If no perf drop of stl::mutex based monitor,
legacy one will be removed later.
Change-Id: I1d21368ff462477d3238d71e4e2a1a7d6b9167ad
[ROCm/clr commit: 73c02041e1]
- Remove Last graph node optimization and instead submit a barrier NOP
packet always. This simplifies the code.
Change-Id: Ied443173ba47a08b6df148ac7e3ead712acda11c
[ROCm/clr commit: badf2b0880]
Switch commands creation to the new suballocator to avoid
frequent expensive OS calls
Change-Id: I3597c811820e577c15708bad8b8a41aa53acc400
[ROCm/clr commit: 5b0bfdcbad]
- Create a vector to allow multiple TS to be stored in Command.
- This would mean we dont wait for entire batch in Accumulate command
to finish when we exhaust signals.
- Reduce the number of signals created at init to 64. This min value
may still need to be tuned but the KFD allows max of 4094 interrupt
signals per device.
- Store kernel names whenever they are available and not just when
profiling. If we dynamically enable profiling like for Torch, a crash
can happen if hipGraphInstantiate wasnt included in Torch profile scope
beacuse we previously entered kernel names only when profiler is
attached.
Change-Id: I34e7881a25bbc763f82fdeb3408a8ea58e1ec006
[ROCm/clr commit: c157bfb202]
- The correlation_id had random junk values which we were inserting in
the dispatch AQL packet even when no profiler was attached but if we had
a valid timestamp.
- Also make sure we dont even write the reserved2 field in the AQL
packet if no profiler attached.
Change-Id: Icdb7493198c1bb5e2d786a97e027288660854cd7
[ROCm/clr commit: 9a6ddae7b2]
- Report kernel names for optimized graph path
- Refactor code so that we store profiling info in Accumulate command
Change-Id: Ib97735a0239aeb9fc3a50a4bb7126dd0bcadc8af
[ROCm/clr commit: b056686607]
- Do not use extra barrier to detect graph end. If its a kernel node we
can use a completion signal for the last packet. Saves roughly 6us for
Phantom testcase per graph launch.
Change-Id: I5e0c2479d9964fbeda86ed97533f6718f49a7f91
[ROCm/clr commit: c3bd229f4f]
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
[ROCm/clr commit: 40f41f4d0b]
This reverts commit cab71e6e00.
Implement the right way to make ExternalSemaphores be signalled
only after prior works on the stream have been finished.
Change-Id: I9d5974e05d5f229170b928db4566c14e40e3cbaa
[ROCm/clr commit: d433df4761]
Let ExternalSemaphores be signalled only after prior works on the
stream have been finished.
Change-Id: I856917db905f68f55fdf484f5267f7fe8ea3117f
[ROCm/clr commit: 44a3935cda]
The change enables VM support in graphs on Windows. That allows
to avoid caching of all allocations at the cost of map/unmap
overhead during memory create/destroy.
Change-Id: I792be00fba099e5e5d3cd44a963e1dfd6976a86d
[ROCm/clr commit: 04b696abee]