=> If a GraphExec instance is destroyed before an async launch completes,
defer the destroy until all pending graph launches have finished.
=> Remove GraphExec destroy during next sync point (hipStreamSynchronize,
hipDeviceSynchronize, etc.)
Change-Id: I4df682aae5787fd6e5240a7be936ce50361345d0
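The deferred destroy described above can be sketched as a pending-launch counter. This is a minimal illustration only: GraphExecSketch, its member names, and the release callback are hypothetical and are not the runtime's actual classes, and the sketch is single-threaded (a real runtime would need synchronization).

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a graph exec object; not the HIP runtime's class.
// Not thread-safe: a real implementation would guard the state with a lock
// or atomics.
class GraphExecSketch {
 public:
  explicit GraphExecSketch(std::function<void()> on_release)
      : on_release_(std::move(on_release)) {}

  // Called when a graph launch is enqueued asynchronously.
  void onLaunchEnqueued() { ++pending_launches_; }

  // Called when a previously enqueued launch has finished.
  void onLaunchCompleted() {
    assert(pending_launches_ > 0);
    --pending_launches_;
    maybeRelease();
  }

  // User-facing destroy: only records the request. Resources are released
  // immediately if nothing is pending, otherwise after the last launch.
  void destroy() {
    destroy_requested_ = true;
    maybeRelease();
  }

  bool released() const { return released_; }

 private:
  void maybeRelease() {
    if (destroy_requested_ && pending_launches_ == 0 && !released_) {
      released_ = true;
      if (on_release_) on_release_();
    }
  }

  std::function<void()> on_release_;
  int pending_launches_ = 0;
  bool destroy_requested_ = false;
  bool released_ = false;
};
```

With two launches pending, destroy() leaves the object alive; the release callback fires only after the second onLaunchCompleted(). A destroy with no launch pending releases immediately.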
1) Child Graph nodes need to have parent graph dependencies in waitlist.
2) Marker is placed on base stream with parent graph waitlist
Change-Id: Iec65a0171ea387be05b0733abcc708fb630e4be4
This change adds fixes in the optimized multistream path for childGraph use cases.
1) For childgraph nodes, rely on runNodes() only to process
the childgraph and skip calls to createCommand and enqueueCommands.
This ensures that the start/end markers are enqueued correctly
with respect to the childGraph commands.
In addition, the runNodes() for the childgraph should be called after
the dependency walkthrough to make sure that the subgraph is executed once.
2) Nodes with no outgoing edges should be marked
as leaves regardless of which stream they are assigned to.
This is to ensure that marker dependencies from nodes
that run on non-zero stream to subgraph leafs that run on zero stream
are still set up correctly.
Change-Id: I4a5f4f3b0e0d01e515cdcb045b46c2798f291255
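Point 2) above can be illustrated with a small sketch in which leaf detection looks only at outgoing edges and deliberately ignores the stream assignment. NodeSketch and findLeaves are hypothetical names for this illustration, not the runtime's data structures.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical graph node for illustration only.
struct NodeSketch {
  int stream_id;                   // stream the node was assigned to
  std::vector<std::size_t> edges;  // indices of nodes that depend on this one
};

// Returns the indices of all leaves. A node is a leaf when it has no
// outgoing edges; stream_id is intentionally not consulted, so leaves on
// non-zero streams are reported as well.
std::vector<std::size_t> findLeaves(const std::vector<NodeSketch>& nodes) {
  std::vector<std::size_t> leaves;
  for (std::size_t i = 0; i < nodes.size(); ++i) {
    if (nodes[i].edges.empty()) {
      leaves.push_back(i);
    }
  }
  return leaves;
}
```

Reporting every leaf lets the caller attach end-marker dependencies to all of them, including those that run on a different stream than the marker.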
Release the graph exec after the wait completes and before deleting the
hip::stream object during stream destroy.
Change-Id: I1d68aa8d844f7d3af330c6d09c44af07f8553551
- Added the optimized multi stream path in graph execution. It uses a fixed number of async streams in the execution
- Optimize the launch latency, where command
creation and execution are done at the same time
- Optimize the scheduling to use less barriers and waiting signals if
the same queue can be detected
- The new path is controlled by DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 will use the original path and any other
value will force the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single queue async
execution in graphs (applicable for Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
=> Added support to capture multiple AQL Packets.
=> Added Interface to callback to hip runtime from rocclr to allocate
kernel args from the graph kernel arg pool.
=> Enabled Support to capture memset node.
Change-Id: I7e1c2ba06927459e024653058af142bd82192c43
hipDeviceSynchronize called from __hipUnregisterFatBinary
accesses static maps and monitors. This change ensures these objects
are not destroyed before __hipUnregisterFatBinary is called.
Additionally it disables the teardown process for static build.
Change-Id: I46b58641d60efcf6637a8e99cdd786ffe9e2c77d
This issue was caused by incorrect usage of the getStream call.
If we get the null stream first, typecast it, and then call getStream
again, we lose the advantage of simply passing "nullptr" to indicate
the NULL stream. As a result, we enter the waitActiveStream call and add
barriers to sync across streams.
Change-Id: I94dc4e3ec927295b9e1ab6dee4b37d7d3e00b0cc
Capture AQL packets during graph instantiation and enqueue them during graph launch.
Added support to capture a single graph memset node.
Capture support for the memset node is currently disabled.
Memset capture will be enabled when capture of multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
To refactor childGraph to have its own graphExec,
kernelArgs need to be separated from the graphExec object.
All child nodes that are part of a graph should share the same kernelArg pool.
Otherwise we end up creating multiple device kernel arg memory chunks
for a single graphExec.
Change-Id: I4029a46ebc1fa112d87df64ab1fecbf288fabe5e
This change modifies the readback mechanism to use a pointer to volatile
instead of a volatile pointer. This ensures that the compiler does not
optimize away the read operation.
Change-Id: I79ff925d615aa8cc4f950e8ff4b7e608fcb179a4
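The difference between the two declarations can be sketched in standard C++. The helper name readbackSentinel is hypothetical and only illustrates the qualifier placement; it is not the runtime's readback code.

```cpp
#include <cstdint>
#include <type_traits>

// `T* volatile p`  : the pointer variable is volatile, the pointee is not.
//                    A load through p whose result is unused may be elided.
// `volatile T* p`  : the pointee is volatile. Every access through p is an
//                    observable side effect that the compiler must emit.
static_assert(std::is_volatile<int* volatile>::value,
              "volatile pointer: the pointer itself is volatile");
static_assert(!std::is_volatile<std::remove_pointer<int* volatile>::type>::value,
              "volatile pointer: the pointee is NOT volatile");
static_assert(std::is_volatile<std::remove_pointer<volatile int*>::type>::value,
              "pointer to volatile: the pointee is volatile");

// Hypothetical readback helper: the load goes through a pointer to volatile,
// so it cannot be optimized away even if the caller ignores the result.
inline std::uint32_t readbackSentinel(const std::uint32_t* addr) {
  const volatile std::uint32_t* p = addr;
  return *p;
}
```

The static_asserts make the distinction checkable at compile time: only the pointer-to-volatile form marks the pointed-to memory access itself as volatile.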
If the graph has kernels that do device-side allocation, the heap is allocated during packet
capture, because the heap pointer has to be added to the AQL packet, and is initialized during
graph launch.
Handle the race with wait when 2 kernels with device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
Adding a safety check prevents an invalid memory access
if timestamps and kernelNames vectors are of different size.
The patch also moves the addKernelNames for the accumulate command
into dispatchAqlPacket function.
Change-Id: Iea0927e1253800403a1ae3f3d72de1e7d96476c3
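A minimal sketch of the kind of size check described above. The function name, the pair type, and the choice to return an empty result on mismatch are assumptions made for illustration; the patch's actual handling of a mismatch is not shown here.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using KernelTime = std::pair<std::string, std::uint64_t>;

// Hypothetical helper: pairs each kernel name with its timestamp.
std::vector<KernelTime> pairNamesWithTimestamps(
    const std::vector<std::string>& kernel_names,
    const std::vector<std::uint64_t>& timestamps) {
  std::vector<KernelTime> paired;
  // Safety check: indexing both vectors with the same index is only valid
  // when their sizes match. ASSUMPTION: on mismatch, report nothing.
  if (kernel_names.size() != timestamps.size()) {
    return paired;
  }
  paired.reserve(kernel_names.size());
  for (std::size_t i = 0; i < kernel_names.size(); ++i) {
    paired.emplace_back(kernel_names[i], timestamps[i]);
  }
  return paired;
}
```

Without the check, a shorter timestamps vector would be read out of bounds inside the loop.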
- Print kernelname for graph launches; it's hard to correlate packets
otherwise
- Print correlation_id if any
Change-Id: Ib8db7a00e4e7c98f570e71029e61d86f5dccc2ed
- Remove Last graph node optimization and instead submit a barrier NOP
packet always. This simplifies the code.
Change-Id: Ied443173ba47a08b6df148ac7e3ead712acda11c
Handle the following cases:
- GraphExec instance is destroyed before async launch completes
- GraphExec instance is destroyed after async launch completes
- GraphExec instance is destroyed without a launch
Change-Id: I45a7c82295fea916c7559bd8f796df710513aea1
The Readback and Avoid HDP Flush memory ordering workaround is
used as a fallback solution only when HDP flush register is invalid
Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
A node can be enabled/disabled only if it is a kernel, memcpy or memset node.
If the node is disabled, it becomes an empty node.
To maintain ordering, just enqueue a marker with the respective node dependencies.
Change-Id: I710f3e88ab4e76c81f6f86a40a7dc61fd4c7e440
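The enable/disable rule above can be sketched as follows. NodeKind, Submission, and ToggleNodeSketch are hypothetical names for this illustration and do not mirror the runtime's node classes.

```cpp
// Hypothetical node kinds; Host stands in for any non-toggleable type.
enum class NodeKind { Kernel, Memcpy, Memset, Host };

// What gets enqueued for a node at launch time in this sketch.
enum class Submission { Command, Marker };

// Only kernel, memcpy and memset nodes can be enabled or disabled.
inline bool canToggle(NodeKind kind) {
  return kind == NodeKind::Kernel || kind == NodeKind::Memcpy ||
         kind == NodeKind::Memset;
}

struct ToggleNodeSketch {
  explicit ToggleNodeSketch(NodeKind k) : kind(k), enabled(true) {}

  // Returns false and changes nothing for node types that cannot be toggled.
  bool setEnabled(bool value) {
    if (!canToggle(kind)) return false;
    enabled = value;
    return true;
  }

  // A disabled node behaves like an empty node: only a marker carrying the
  // node's dependencies is enqueued, which preserves ordering.
  Submission submission() const {
    return enabled ? Submission::Command : Submission::Marker;
  }

  NodeKind kind;
  bool enabled;
};
```

Because the marker still waits on the node's dependencies, downstream nodes observe the same ordering whether the node is enabled or not.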
When a kernel does device-side malloc, the initial heap is allocated with __amd_rocclr_initHeap.
During graph launch, the __amd_rocclr_initHeap kernel is enqueued, followed by the actual kernel, so the actual kernel executes after the initHeap kernel.
But with graph optimizations, initHeap gets enqueued during capture on the device null stream and the actual kernel on the graph launch stream,
so there is no proper synchronization. Switch to command creation and enqueue during launch for kernel nodes with a hidden heap.
Change-Id: Iaf600251faef9a448853f19429023c118aa760b9
- Implement workaround to ensure HDP writes are done by writing and
reading the HDP MMIO register.
- Implement the same workaround for graphs, we no longer need sentinel
write/readback
Change-Id: I0d3027b46a1f61131ec62e3c8c669ff5184fa6b2
When a graph is instantiated on device 0 and launched on device 1, switch to command creation and enqueue during launch.
Change-Id: Ied34dc99b2a776130d1354ed3830c6ccab9068e4
Read and write an int-sized sentinel value to dev_ptr or PCIe-connected devices at the tail end of the kernarg surface.
Change-Id: I993d552ac872b3cd56aef4746c4d1d92c58d38b4
hipGraphExecKernelNodeSetParams can also update the kernel function.
Hence the size required for kernel parameters can differ from what was allocated during graph instantiation.
So, create a new 128KB kernel arg pool and allocate kernel args from the pool.
If the pool is full, create another 128KB pool. Release the kernel arg pools when the graph exec object is destroyed.
Change-Id: I9567946d63400c79cbfd4c5439c654c92557ceae
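The pooling scheme above can be sketched with a simple bump allocator over fixed 128KB pools. KernArgPoolsSketch is a hypothetical name, the sketch uses host memory instead of device kernel arg memory, and the 16-byte default alignment is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical sketch: host memory stands in for device kernarg memory.
class KernArgPoolsSketch {
 public:
  static constexpr std::size_t kPoolSize = 128 * 1024;  // 128KB per pool

  // Returns storage for `size` bytes at an offset that is a multiple of
  // `alignment` (a power of two) relative to the pool base. Returns nullptr
  // when the request can never fit in a single pool.
  void* allocate(std::size_t size, std::size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (size > kPoolSize) return nullptr;
    if (!pools_.empty()) {
      const std::size_t start = (offset_ + alignment - 1) & ~(alignment - 1);
      if (start + size <= kPoolSize) {
        offset_ = start + size;
        return pools_.back().get() + start;
      }
    }
    // First allocation, or the current pool is full: create a new pool.
    pools_.emplace_back(new unsigned char[kPoolSize]);
    offset_ = size;
    return pools_.back().get();
  }

  std::size_t poolCount() const { return pools_.size(); }

 private:
  // All pools are released when the owning object is destroyed.
  std::vector<std::unique_ptr<unsigned char[]>> pools_;
  std::size_t offset_ = 0;  // bump offset into the most recent pool
};
```

Tying the pools to the lifetime of the owning object mirrors the statement that the pools are released when the graph exec object is destroyed; individual allocations are never freed on their own.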