Release the graph exec after the wait completes and before deleting the
hip::stream object during stream destroy.
Change-Id: I1d68aa8d844f7d3af330c6d09c44af07f8553551
- Added an optimized multi-stream path for graph execution. It uses a fixed number of async streams during execution.
- Optimized launch latency by creating and executing commands
at the same time.
- Optimized scheduling to use fewer barriers and waiting signals if
the same queue can be detected.
- The new path is controlled by the DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable: 0 selects the original path, and any other
value forces the number of asynchronous queues used for execution.
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single-queue async
execution in graphs (applicable to Navi families only).
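As a rough illustration of the control described above, the sketch below interprets a value in the style of DEBUG_HIP_FORCE_GRAPH_QUEUES: 0 selects the original path and any other number is taken as the queue count. GraphQueueConfig, parseGraphQueues and graphQueuesFromEnv are hypothetical names for illustration, not the runtime's code. Treating an unset or non-numeric value like 0 is also an assumption.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical holder for the decision; not a HIP runtime type.
struct GraphQueueConfig {
  bool useOptimizedPath;    // false: original path
  unsigned numAsyncQueues;  // meaningful only when useOptimizedPath is true
};

// Interprets the raw text of the variable. Assumption: an unset or
// non-numeric value is treated like 0 (original path).
inline GraphQueueConfig parseGraphQueues(const char* raw) {
  if (raw == nullptr) return {false, 0};
  char* end = nullptr;
  unsigned long value = std::strtoul(raw, &end, 10);
  if (end == raw || value == 0) return {false, 0};
  return {true, static_cast<unsigned>(value)};
}

// Reads the variable from the process environment.
inline GraphQueueConfig graphQueuesFromEnv() {
  return parseGraphQueues(std::getenv("DEBUG_HIP_FORCE_GRAPH_QUEUES"));
}

// Usage example: 0 keeps the original path, 4 requests four queues.
inline void exampleUsage() {
  assert(!parseGraphQueues("0").useOptimizedPath);
  assert(parseGraphQueues("4").numAsyncQueues == 4);
}
```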
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
=> Added support for capturing multiple AQL packets.
=> Added an interface that lets rocclr call back into the HIP runtime to allocate
kernel args from the graph kernel arg pool.
=> Enabled support for capturing memset nodes.
Change-Id: I7e1c2ba06927459e024653058af142bd82192c43
hipDeviceSynchronize, called from __hipUnregisterFatBinary,
accesses static maps and monitors. This change ensures these objects
are not destroyed before __hipUnregisterFatBinary is called.
Additionally, it disables the teardown process for static builds.
Change-Id: I46b58641d60efcf6637a8e99cdd786ffe9e2c77d
This issue was caused by incorrect use of the getStream call.
If we get the null stream first, typecast it, and then call
getStream again, we lose the advantage of simply passing "nullptr" to
indicate the NULL stream. As a result, we enter the waitActiveStream call and add
barriers to sync across streams.
Change-Id: I94dc4e3ec927295b9e1ab6dee4b37d7d3e00b0cc
Also in the scope of SWDEV-467540.
Fix sporadic crash in Unit_hipStreamAddCallback_MultipleThreads by
deferring release() of block_command.
The test launches 1000 threads on the same stream, so the original
code could free block_command too early.
Deferring release() of block_command ensures that block_command
is still valid while block_command->notifyCmdQueue() is called.
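The ordering issue can be sketched with a minimal reference-counted object. The BlockCommand class and notifyThenRelease helper below are hypothetical stand-ins, not the rocclr classes. The only point is that the caller drops its reference after its last use of the object.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a reference-counted command; not the rocclr class.
class BlockCommand {
 public:
  explicit BlockCommand(bool* destroyedFlag) : destroyed_(destroyedFlag) {}

  void retain() { refs_.fetch_add(1, std::memory_order_relaxed); }

  // Drops one reference and destroys the object when the count reaches zero.
  void release() {
    const int previous = refs_.fetch_sub(1, std::memory_order_acq_rel);
    assert(previous > 0);  // guards against releasing more often than retained
    if (previous == 1) {
      *destroyed_ = true;
      delete this;
    }
  }

  void notifyCmdQueue() { ++notifications_; }
  int notifications() const { return notifications_; }

 private:
  ~BlockCommand() = default;  // destruction only through release()

  std::atomic<int> refs_{1};  // creation holds one reference
  int notifications_ = 0;
  bool* destroyed_;
};

// Deferred release: use the command first, drop the caller's reference last.
inline void notifyThenRelease(BlockCommand* cmd) {
  cmd->notifyCmdQueue();  // cmd is still guaranteed alive here
  cmd->release();         // released only after the last use
}
```

Calling release() before notifyCmdQueue() would be a use-after-free whenever that release happened to drop the last reference.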
Change-Id: I31555ee18e6958e34b89f04181867fa4e932a38c
Creation of a ReferenceCountedObject increases the reference count by 1.
Clear the commands from the Node after capture so that they won't be referenced later.
Change-Id: I1cc4085939cf65218ec2aa2e25ab6d737f7cacd3
Capture AQL packets during graph instantiation and enqueue them during graph launch.
Added support for capturing a single graph memset node.
Capture support for memset nodes is currently disabled.
Memset capture will be enabled once capturing multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
On some platforms, a user can request extended shared memory for a
particular kernel. HIP does not support this feature at
the moment, so we set the value to sharedMemPerBlock, which is the
maximum a user can expect for their kernels.
Change-Id: I81005cf0d1c9fb941e77d34fb8385241ffe5bdd0
To refactor childGraph so that it has its own graphExec,
kernelArgs needs to be separated from the graphExec object.
All childNodes that are part of the graph should share the same kernelArg pool.
Otherwise we end up creating multiple device kernel arg memory chunks
for a single graphExec.
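One way to picture the separation is a pool object that graph execs reference rather than own. The sketch below is an assumption-laden illustration: KernelArgPool and GraphExec are hypothetical classes, the pool is plain host memory, and alignment is ignored for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical bump allocator standing in for the device kernel-arg pool.
class KernelArgPool {
 public:
  explicit KernelArgPool(std::size_t bytes) : storage_(bytes), offset_(0) {}

  // Returns nullptr when the pool is exhausted.
  void* allocate(std::size_t bytes) {
    if (bytes > storage_.size() - offset_) return nullptr;
    void* p = storage_.data() + offset_;
    offset_ += bytes;
    return p;
  }

  std::size_t used() const { return offset_; }

 private:
  std::vector<unsigned char> storage_;
  std::size_t offset_;
};

// Hypothetical graph exec that references a pool instead of owning one.
class GraphExec {
 public:
  explicit GraphExec(std::shared_ptr<KernelArgPool> pool) : pool_(std::move(pool)) {
    assert(pool_ != nullptr);
  }

  // A child graph exec reuses the parent's pool rather than creating a new chunk.
  GraphExec makeChild() const { return GraphExec(pool_); }

  void* allocKernelArgs(std::size_t bytes) { return pool_->allocate(bytes); }
  const std::shared_ptr<KernelArgPool>& pool() const { return pool_; }

 private:
  std::shared_ptr<KernelArgPool> pool_;
};
```

With this shape, parent and child allocations come out of one chunk, so a graph with nested child graphs still has a single kernel arg allocation.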
Change-Id: I4029a46ebc1fa112d87df64ab1fecbf288fabe5e
This change modifies the readback mechanism to use a pointer to volatile
instead of a volatile pointer. This ensures that the compiler does not
optimize away the read operation.
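The distinction is easy to get wrong because the two declarations differ only in where the qualifier sits. The sketch below spells out both forms and checks them at compile time. The readback helper and the alias names are hypothetical, and the 32-bit element type is an assumption.

```cpp
#include <cstdint>
#include <type_traits>

// Pointer to volatile: the pointee is volatile, so each read through the
// pointer is an access the compiler must actually perform.
using PtrToVolatile = const volatile std::uint32_t*;

// Volatile pointer: only the pointer variable is volatile; reads of the
// pointee are ordinary loads and may be elided if the result is unused.
using VolatilePtr = const std::uint32_t* volatile;

static_assert(std::is_volatile<std::remove_pointer<PtrToVolatile>::type>::value,
              "pointee is volatile");
static_assert(!std::is_volatile<std::remove_pointer<VolatilePtr>::type>::value,
              "pointee is not volatile");

// Hypothetical readback helper: forces a load from addr.
inline std::uint32_t readback(const void* addr) {
  PtrToVolatile p = static_cast<PtrToVolatile>(addr);
  return *p;
}
```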
Change-Id: I79ff925d615aa8cc4f950e8ff4b7e608fcb179a4
- Introduce a lock when checking isUserObjectValid. The lock is needed
because another thread (T2) can remove the userObject, leading to a buffer overflow
when thread T1 checks ranges.
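A minimal sketch of the locking pattern, assuming the user objects are tracked as address ranges in a map: UserObjectRegistry and its members are hypothetical and do not mirror the runtime's data structures. What matters is that the lookup and the range check run under the same lock that removal takes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>

// Hypothetical registry of user objects keyed by start address.
class UserObjectRegistry {
 public:
  void add(std::uintptr_t start, std::size_t size) {
    assert(size > 0);
    std::lock_guard<std::mutex> guard(lock_);
    ranges_[start] = size;
  }

  void remove(std::uintptr_t start) {
    std::lock_guard<std::mutex> guard(lock_);
    ranges_.erase(start);
  }

  // The lookup and the range check happen under the same lock as remove(),
  // so another thread cannot erase the entry in the middle of the check.
  bool isValid(std::uintptr_t addr) const {
    std::lock_guard<std::mutex> guard(lock_);
    auto it = ranges_.upper_bound(addr);
    if (it == ranges_.begin()) return false;
    --it;  // last range starting at or before addr
    return addr - it->first < it->second;
  }

 private:
  mutable std::mutex lock_;
  std::map<std::uintptr_t, std::size_t> ranges_;
};
```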
Change-Id: I058144b8cc463c90ab6bf5cf96bf937897742917
If the graph has multiple branches, an End command is enqueued on the launch stream, which
makes sure all the internal parallel streams are finished.
When a node is removed from the graph, the indegree and outdegree of its parent and child nodes are not updated correctly,
so the endNode lacks dependencies on the parallel commands, which results in graph sync issues.
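The bookkeeping can be sketched with a toy graph in which degrees are derived from the edge lists, so they cannot go stale when a node is removed. Node, addEdge, removeNode and leaves are hypothetical names. Treating nodes with outdegree 0 as the ones the end command depends on is an assumption drawn from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical graph node; the runtime's node class differs.
struct Node {
  std::vector<Node*> parents;   // incoming edges
  std::vector<Node*> children;  // outgoing edges
  std::size_t inDegree() const { return parents.size(); }
  std::size_t outDegree() const { return children.size(); }
};

inline void addEdge(Node& from, Node& to) {
  assert(&from != &to);  // no self-edges in this sketch
  from.children.push_back(&to);
  to.parents.push_back(&from);
}

// Detaches node from the graph, updating the outdegree of each parent and
// the indegree of each child.
inline void removeNode(Node& node) {
  for (Node* parent : node.parents) {
    auto& c = parent->children;
    c.erase(std::remove(c.begin(), c.end(), &node), c.end());
  }
  for (Node* child : node.children) {
    auto& p = child->parents;
    p.erase(std::remove(p.begin(), p.end(), &node), p.end());
  }
  node.parents.clear();
  node.children.clear();
}

// Nodes with outdegree 0; assumed here to be what an end command waits on.
inline std::vector<Node*> leaves(const std::vector<Node*>& nodes) {
  std::vector<Node*> out;
  for (Node* n : nodes) {
    if (n->outDegree() == 0) out.push_back(n);
  }
  return out;
}
```

If removeNode skipped the parent update, a parent whose only child was removed would still report outdegree 1 and would be missed when collecting the nodes to wait on.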
Change-Id: I33cc2f21220e1c017d88099b29b542e05b683f73
Resolved an issue where a freed virtual buffer was incorrectly
added to the global mapping, causing an assertion error during
the teardown process.
Change-Id: I4801157a28603ce9be1ca0131982b700ff884f7a
Changed the find_package call to prioritize the package
found under the ROCm installation over other system locations.
Change-Id: Ice93c94bbb9cdebd467d3e88bb2e4bfb7a1e76d9
If the graph has kernels that perform device-side allocation, the heap is
allocated during packet capture, because the heap pointer has to be added to the AQL packet, and is initialized during
graph launch.
Handle the race with wait when 2 kernels that use the device heap are enqueued on multiple streams.
Change-Id: I45933b77fcaf7bc8fdf1bc906462e32b5d8d3688
Adding a safety check prevents an invalid memory access
if the timestamps and kernelNames vectors differ in size.
The patch also moves the addKernelNames call for the accumulate command
into the dispatchAqlPacket function.
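A guard of this kind can be as simple as bounding the loop by the shorter vector. The sketch below is illustrative only: pairKernelTimestamps is a hypothetical helper, and silently ignoring the surplus entries is an assumption about how a mismatch is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Pairs each timestamp with a kernel name without indexing past the end of
// the shorter vector.
inline std::vector<std::pair<std::string, std::uint64_t>> pairKernelTimestamps(
    const std::vector<std::uint64_t>& timestamps,
    const std::vector<std::string>& kernelNames) {
  const std::size_t count = std::min(timestamps.size(), kernelNames.size());
  std::vector<std::pair<std::string, std::uint64_t>> out;
  out.reserve(count);
  for (std::size_t i = 0; i < count; ++i) {
    out.emplace_back(kernelNames[i], timestamps[i]);
  }
  assert(out.size() == count);
  return out;
}
```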
Change-Id: Iea0927e1253800403a1ae3f3d72de1e7d96476c3
- Update the intra-socket weight for partitions within a single socket, as
the driver changed it to 13.
- Use the PCIe function to distinguish the partitions of the same device,
such as in TPX mode on gfx942.
Change-Id: I8e64023d44e37c2dbb105cbb343441a48021ba7b