Capture AQL packets during GraphInstantiation and enqueue AQL packets during graph launch.
Add support for capturing a single graph memset node.
Capture support for memset nodes is currently disabled.
Memset capture will be enabled once capture of multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
To refactor childGraph so that it has its own graphExec,
kernelArgs needs to be separated from the graphExec object.
All child nodes that are part of a graph should share the same kernelArg pool.
Otherwise we end up creating multiple device kernel arg memory chunks
for a single graphExec.
Change-Id: I4029a46ebc1fa112d87df64ab1fecbf288fabe5e
This change modifies the readback mechanism to use a pointer to volatile
instead of a volatile pointer. This ensures that the compiler does not
optimize away the read operation.
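
A minimal sketch of the distinction, assuming C++17 and a 32-bit
sentinel. WriteAndReadBack and the two type aliases are hypothetical names
for illustration, not the actual ROCclr code:

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Pointer to volatile: the pointee is volatile, so every load and store made
// through the pointer is an observable access the compiler must perform.
using PtrToVolatile = volatile uint32_t*;
// Volatile pointer: only the pointer object itself is volatile; accesses to
// the pointee are ordinary and get no protection from the qualifier.
using VolatilePtr = uint32_t* volatile;

static_assert(std::is_volatile_v<std::remove_pointer_t<PtrToVolatile>>);
static_assert(!std::is_volatile_v<std::remove_pointer_t<VolatilePtr>>);
static_assert(std::is_volatile_v<VolatilePtr>);
static_assert(!std::is_volatile_v<PtrToVolatile>);

// Hypothetical helper: store a sentinel, then read it back.
inline uint32_t WriteAndReadBack(uint32_t* addr, uint32_t sentinel) {
  assert(addr != nullptr);
  PtrToVolatile p = addr;
  *p = sentinel;  // volatile store
  return *p;      // volatile load: must be emitted, cannot be folded away
}
```

Note that volatile only keeps the compiler from eliding the accesses; it
does not by itself provide any hardware memory ordering.
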
Change-Id: I79ff925d615aa8cc4f950e8ff4b7e608fcb179a4
If the graph has kernels that do device-side allocation, the heap is allocated during
packet capture because the heap pointer has to be added to the AQL packet, and it is
initialized during graph launch.
Handle the race with wait when 2 kernels with device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
Add a safety check that prevents an invalid memory access
if the timestamps and kernelNames vectors differ in size.
The patch also moves addKernelNames for the accumulate command
into the dispatchAqlPacket function.
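
A rough sketch of such a guard, assuming the two vectors are walked in
lockstep. PairNamesWithTimestamps and the "<unknown>" fallback are
hypothetical; what the patch does on a mismatch is not stated here:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: attach a kernel name to every timestamp without ever
// indexing kernelNames out of bounds.
inline std::vector<std::pair<std::string, uint64_t>> PairNamesWithTimestamps(
    const std::vector<std::string>& kernelNames,
    const std::vector<uint64_t>& timestamps) {
  static const std::string kUnknown = "<unknown>";  // assumed fallback name
  std::vector<std::pair<std::string, uint64_t>> out;
  out.reserve(timestamps.size());
  for (size_t i = 0; i < timestamps.size(); ++i) {
    // Safety check: only read kernelNames[i] when the index is valid.
    const std::string& name = (i < kernelNames.size()) ? kernelNames[i] : kUnknown;
    out.emplace_back(name, timestamps[i]);
  }
  assert(out.size() == timestamps.size());
  return out;
}
```
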
Change-Id: Iea0927e1253800403a1ae3f3d72de1e7d96476c3
- Print the kernel name for graph launches; it's hard to correlate packets
otherwise
- Print correlation_id if any
Change-Id: Ib8db7a00e4e7c98f570e71029e61d86f5dccc2ed
- Remove Last graph node optimization and instead submit a barrier NOP
packet always. This simplifies the code.
Change-Id: Ied443173ba47a08b6df148ac7e3ead712acda11c
Handle the following GraphExec destruction cases:
- GraphExec instance is destroyed before async launch completes
- GraphExec instance is destroyed after async launch completes
- GraphExec instance is destroyed without a launch
Change-Id: I45a7c82295fea916c7559bd8f796df710513aea1
The Readback and Avoid HDP Flush memory ordering workaround is
used as a fallback solution only when the HDP flush register is invalid.
Change-Id: Ic284eba1f95ed22b0270d3abeb904fb902015b1a
A node can be enabled/disabled only if it is a kernel, memcpy or memset node.
A disabled node becomes an empty node.
To maintain ordering, just enqueue a marker with the respective node dependencies.
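
The rule above can be sketched roughly as follows. NodeType, Node,
Command, SetEnabled and BuildCommand are hypothetical stand-ins for
illustration, not the real hip::GraphNode or amd::Command classes:

```cpp
#include <cassert>
#include <utility>
#include <vector>

enum class NodeType { Kernel, Memcpy, Memset, Empty, Other };

struct Node {
  NodeType type = NodeType::Empty;
  std::vector<int> deps;  // ids of the nodes this node depends on
  bool enabled = true;
};

struct Command {
  bool isMarker = false;      // marker: does no work, only carries ordering
  std::vector<int> waitList;  // dependencies to wait on before completing
};

// Only kernel, memcpy and memset nodes may be toggled; returns false otherwise.
inline bool SetEnabled(Node& node, bool enabled) {
  if (node.type != NodeType::Kernel && node.type != NodeType::Memcpy &&
      node.type != NodeType::Memset) {
    return false;
  }
  node.enabled = enabled;
  return true;
}

// A disabled node behaves like an empty node: it is replaced by a marker
// that keeps the same dependencies, so graph ordering is preserved.
inline Command BuildCommand(const Node& node) {
  Command cmd;
  cmd.isMarker = !node.enabled || node.type == NodeType::Empty;
  cmd.waitList = node.deps;
  assert(cmd.waitList.size() == node.deps.size());
  return cmd;
}
```
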
Change-Id: I710f3e88ab4e76c81f6f86a40a7dc61fd4c7e440
When a kernel does device-side malloc, the initial heap is allocated with __amd_rocclr_initHeap.
During graph launch the __amd_rocclr_initHeap kernel is enqueued, followed by the actual kernel, so the actual kernel executes after the initHeap kernel.
But with graph optimizations, during capture initHeap gets enqueued on the device null stream and the actual kernel on the graph launch stream,
so there is no proper synchronization. Switch to command creation and enqueue during launch for kernel nodes with a hidden heap.
Change-Id: Iaf600251faef9a448853f19429023c118aa760b9
- Implement workaround to ensure HDP writes are done by writing and
reading the HDP MMIO register.
- Implement the same workaround for graphs; we no longer need sentinel
write/readback
Change-Id: I0d3027b46a1f61131ec62e3c8c669ff5184fa6b2
When a graph is instantiated on device 0 and launched on device 1, switch to command creation and enqueue during launch.
Change-Id: Ied34dc99b2a776130d1354ed3830c6ccab9068e4
Read and write an int-sized sentinel value to dev_ptr on PCIe-connected devices at the tail end of the kernarg surface.
Change-Id: I993d552ac872b3cd56aef4746c4d1d92c58d38b4
hipGraphExecKernelNodeSetParams can also update the kernel function.
Hence the size required for kernel parameters can differ from what was allocated during graphInstantiation.
So, create a new 128KB kernel pool and allocate kernel args from the pool.
If the pool is full, create a new 128KB pool. Release the kernel pools when the graph exec object is destroyed.
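
A minimal sketch of the pooling scheme, assuming C++17 and using host
memory as a stand-in for device kernarg memory. KernelArgPools and its
members are hypothetical names, and the 16-byte default alignment is an
assumption:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

class KernelArgPools {
 public:
  static constexpr size_t kPoolSize = 128 * 1024;  // 128KB per pool

  // Returns `size` bytes at an offset aligned to `alignment` (a power of two,
  // relative to the pool base), or nullptr if the request cannot fit in a pool.
  void* Allocate(size_t size, size_t alignment = 16) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    if (size == 0 || size > kPoolSize) return nullptr;
    if (!pools_.empty()) {
      const size_t aligned = (offset_ + alignment - 1) & ~(alignment - 1);
      if (aligned + size <= kPoolSize) {
        offset_ = aligned + size;
        return pools_.back().get() + aligned;
      }
    }
    // No pool yet, or the current pool is full: create a new 128KB pool.
    pools_.push_back(std::make_unique<std::byte[]>(kPoolSize));
    offset_ = size;
    return pools_.back().get();
  }

  size_t PoolCount() const { return pools_.size(); }

 private:
  // All pools are released when the owning object is destroyed.
  std::vector<std::unique_ptr<std::byte[]>> pools_;
  size_t offset_ = 0;  // next free byte in the newest pool
};
```

Space left at the end of a full pool is simply abandoned in this sketch,
which keeps allocation a bump of one offset.
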
Change-Id: I9567946d63400c79cbfd4c5439c654c92557ceae
- Report kernel names for optimized graph path
- Refactor code so that we store profiling info in Accumulate command
Change-Id: Ib97735a0239aeb9fc3a50a4bb7126dd0bcadc8af
- Do not use an extra barrier to detect graph end. If it's a kernel node we
can use a completion signal for the last packet. Saves roughly 6us for
Phantom testcase per graph launch.
Change-Id: I5e0c2479d9964fbeda86ed97533f6718f49a7f91
- Track all captured commands under a new AccumulateCommand
- Add begin() and end() methods to capture commands
- Explicit TS object now passed to certain methods because
profilingBegin() and profilingEnd() now happen separately and thus can
run into threading issues
Change-Id: I171106bdcad72b057836cb2f3fc398db3533119f
- Support graph with different types of nodes with single
branch when DEBUG_CLR_GRAPH_PACKET_CAPTURE flag is enabled
Change-Id: I149a8629769cd0d5849ffefb04f1352668a685b6
Fix wrong logic to get layer index;
Make layered image's layout match CUDA spec;
Fix wrong comparison of element size.
Remove amd::BufferRect from ihipMemcpyAtoHCommand()
and ihipMemcpyHtoACommand().
Change-Id: Icc6a4233fbce2e9b2dc6feb79e6bfbd761684c7d
The three for loops that iterate over all graph nodes for UpdateStream, FillCommands and
EnqueueCommands cause a performance drop for large graphs.
Change-Id: I077accf3a4680d5d944b73200fd6498a7a48f25c
The change enables VM support in graphs on Windows. This avoids
caching all allocations, at the cost of map/unmap
overhead during memory create/destroy.
Change-Id: I792be00fba099e5e5d3cd44a963e1dfd6976a86d
Avoid syncing blocking streams with the default stream,
since that introduces extra command dependencies and
prevents memory from being destroyed after the last submission.
Change-Id: I618e9bd2091c4cf9157125612d8c4759030c5a80