When the source or destination pitch is set to zero in the hip_Memcpy2D struct,
it should default to WidthInBytes + [src/dst]XInBytes.
Change-Id: Id57b53cab40ba72ced231258da9356554c4868c3
[ROCm/clr commit: 7a1e818c82]
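The defaulting rule above can be sketched as a small helper. This is only an illustration under stated assumptions: `effective_pitch` is a hypothetical name, not a function from the runtime, and it takes the relevant hip_Memcpy2D fields as plain arguments.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the HIP runtime): returns the pitch to use
// for one side of a 2D copy. A pitch of zero is treated as "unspecified" and
// falls back to WidthInBytes + [src/dst]XInBytes, as the commit describes.
std::size_t effective_pitch(std::size_t pitch,
                            std::size_t width_in_bytes,
                            std::size_t x_in_bytes) {
  if (pitch != 0) {
    return pitch;  // caller supplied an explicit pitch, keep it
  }
  return width_in_bytes + x_in_bytes;
}
```

With this sketch, a copy with WidthInBytes = 256 and srcXInBytes = 16 and a zero source pitch would use 272 bytes per row, while any non-zero pitch is passed through unchanged.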
- Fixes the -0.0 and +0.0 comparison. For atomicMax, if the value at the
address is -0.0 and val is +0.0, gfx90a's unsafe atomics will swap
them. The CAS loop should behave consistently with this.
- The _system variants of atomicMax and atomicMin produce
incorrect output. Updated them to use a similar implementation to
atomicMax and atomicMin.
Change-Id: I20df36ee29ae0434a6b564f2ba71193fe41cfa59
[ROCm/clr commit: d69cc35750]
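A host-side sketch of a CAS-loop max that orders -0.0 below +0.0 may make the first point concrete. It is not the device implementation: `max_should_replace` and `atomic_max_cas` are hypothetical names, `std::atomic<float>` stands in for the device atomics, and NaN handling is left out.

```cpp
#include <atomic>
#include <cmath>

// Hypothetical predicate: should `candidate` replace `current` in a max?
// A plain `current < candidate` treats -0.0 and +0.0 as equal, so the
// sign bit is checked explicitly to order -0.0 below +0.0.
bool max_should_replace(float current, float candidate) {
  if (current == candidate) {
    return std::signbit(current) && !std::signbit(candidate);
  }
  return current < candidate;
}

// Hypothetical CAS-loop max; returns the value observed before the update.
float atomic_max_cas(std::atomic<float>& address, float val) {
  float observed = address.load();
  while (max_should_replace(observed, val)) {
    // On failure, compare_exchange_weak reloads `observed` and we re-check.
    if (address.compare_exchange_weak(observed, val)) {
      break;
    }
  }
  return observed;
}
```

In this sketch, calling `atomic_max_cas` on a slot holding -0.0 with val = +0.0 stores +0.0, while the reverse case leaves +0.0 in place.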
This is the first step toward removing rocm-ocl-icd.
The AMD ICD is no longer built after this commit.
Header file usage still needs to be removed in future steps.
Change-Id: Ic4ac5476180f9ef2ce87b62891c08b28d6c9bfd2
[ROCm/clr commit: 5f775b8b7f]
Release the graph exec after the wait completes and before deleting the
hip::stream object during stream destroy.
Change-Id: I1d68aa8d844f7d3af330c6d09c44af07f8553551
[ROCm/clr commit: 8e80429b87]
- Added an optimized multi-stream path in graph execution. It uses a
fixed number of async streams for execution
- Optimized the launch latency: command creation and execution are
done at the same time
- Optimized the scheduling to use fewer barriers and waiting signals if
the same queue can be detected
- The new path is controlled by the DEBUG_HIP_FORCE_GRAPH_QUEUES
environment variable, where 0 uses the original path and any other
value forces the number of asynchronous queues for execution
- DEBUG_HIP_FORCE_ASYNC_QUEUE can force single-queue async
execution in graphs (applicable to Navi families only)
Change-Id: I7eb40bc15c45f508d6911868a6f6d4c3598d380e
[ROCm/clr commit: 9db52f9a46]
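The DEBUG_HIP_FORCE_GRAPH_QUEUES semantics described above can be sketched as follows. This is an illustration only: `GraphPathChoice` and `choose_graph_path` are hypothetical names, and the sketch takes the already-parsed setting as an argument rather than showing how the runtime reads it.

```cpp
// Hypothetical result type: which graph execution path to take and, for the
// multi-stream path, how many asynchronous queues to use.
struct GraphPathChoice {
  bool use_multi_stream_path;
  unsigned async_queues;
};

// Hypothetical mapping of the DEBUG_HIP_FORCE_GRAPH_QUEUES value:
// 0 selects the original path, any other value forces that many async queues.
GraphPathChoice choose_graph_path(unsigned force_graph_queues) {
  if (force_graph_queues == 0) {
    return {false, 0};
  }
  return {true, force_graph_queues};
}
```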
=> Added support for capturing multiple AQL packets.
=> Added an interface for calling back into the HIP runtime from rocclr to
allocate kernel args from the graph kernel arg pool.
=> Enabled support for capturing the memset node.
Change-Id: I7e1c2ba06927459e024653058af142bd82192c43
[ROCm/clr commit: bd3a35bde1]
hipDeviceSynchronize, when called from __hipUnregisterFatBinary,
accesses static maps and monitors. This change ensures these objects
are not destroyed before __hipUnregisterFatBinary is called.
Additionally, it disables the teardown process for static builds.
Change-Id: I46b58641d60efcf6637a8e99cdd786ffe9e2c77d
[ROCm/clr commit: 9b33db9b24]
This issue was caused by incorrect usage of the getStream call.
If we get the null stream first, typecast it, and then call
getStream again, we lose the advantage of simply passing "nullptr" to
indicate the NULL stream. We therefore enter the waitActiveStream call
and add barriers to sync across streams.
Change-Id: I94dc4e3ec927295b9e1ab6dee4b37d7d3e00b0cc
[ROCm/clr commit: cda4b7db1c]
If only external signals were provided, then just process them
without adding internal signals.
Change-Id: Iaefd65d0f8b0a64b9f6a864a9bd73de20a29dfa4
[ROCm/clr commit: 18187cd8fe]
Update the num_mip_levels field to better align with the OpenCL
specification, in that mip-mapped images cannot be created for
CL_MEM_OBJECT_IMAGE1D_BUFFER images. Added a check for the mip levels
value used in the clCreateImage call.
Change-Id: I82a25b83ef0637a877409572b7976d9e4413dfac
[ROCm/clr commit: 21a1c9075a]
Also in the scope of SWDEV-467540.
Fix sporadic crash in Unit_hipStreamAddCallback_MultipleThreads by
deferring release() of block_command.
The test invokes 1000 threads on the same stream, so there
is a chance of freeing block_command too early in the original code.
By deferring release() of block_command we make sure block_command
is always valid while block_command->notifyCmdQueue() is called.
Change-Id: I31555ee18e6958e34b89f04181867fa4e932a38c
[ROCm/clr commit: e3ef19e22a]
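The deferred-release idea can be sketched with a minimal reference-counted object. This is not the rocclr code: `BlockCommandSketch` and `notify_then_release` are hypothetical stand-ins, and only the ordering (notify first, release afterwards) reflects the commit message.

```cpp
#include <atomic>

// Hypothetical stand-in for a reference-counted command object.
class BlockCommandSketch {
 public:
  // `destroyed` lets the caller observe when the object is freed.
  explicit BlockCommandSketch(bool* destroyed) : destroyed_(destroyed) {}

  void retain() { refs_.fetch_add(1, std::memory_order_relaxed); }

  // Frees the object when the last reference is dropped.
  void release() {
    if (refs_.fetch_sub(1, std::memory_order_acq_rel) == 1) {
      delete this;
    }
  }

  void notifyCmdQueue() { ++notifications_; }  // placeholder for the real work
  unsigned refs() const { return refs_.load(); }

 private:
  ~BlockCommandSketch() { *destroyed_ = true; }

  std::atomic<unsigned> refs_{1};  // creation holds one reference
  int notifications_ = 0;
  bool* destroyed_;
};

// Deferred release: the reference is dropped only after the notification,
// so the object is guaranteed to be alive while notifyCmdQueue() runs.
void notify_then_release(BlockCommandSketch* cmd) {
  cmd->notifyCmdQueue();
  cmd->release();
}
```

Releasing before the notification would allow another thread holding the last remaining reference to free the object while it is still in use, which is the failure mode the commit describes.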
Creating a ReferenceCountedObject increases the reference count by 1.
Clear the commands from the Node after capture so that they won't be referenced later.
Change-Id: I1cc4085939cf65218ec2aa2e25ab6d737f7cacd3
[ROCm/clr commit: 6ae5d6896c]
Capture AQL packets during GraphInstantiation and enqueue them during graph launch.
Added support for capturing a single graph memset node.
Capture support for the memset node is currently disabled.
Memset capture will be enabled when capture of multiple packets is supported.
Change-Id: I14dfbc41731025cc3a548a730558915def3fa384
[ROCm/clr commit: 346da4bb40]
- HIP path doesn't support resource tracking. Thus, double copy can't be enabled,
because it requires resource tracking.
Change-Id: I0f9c4e185b5b2d2b1abde041fca21bb099db9ccd
[ROCm/clr commit: 4c763e45a1]
Since we made the members public, we can optimize some operations which
do not require redundant conversions to half_raw types.
Change-Id: I31555ef18e695d8e24b89f0418187fa4e932a38a
[ROCm/clr commit: 6a655a77e7]
On some platforms, users can request extended shared memory for a
particular kernel. This feature does not exist in HIP at
the moment, so we set it to sharedMemPerBlock, which is the
maximum users can expect for their kernels.
Change-Id: I81005cf0d1c9fb941e77d34fb8385241ffe5bdd0
[ROCm/clr commit: 4b95e7bc87]
Fixes the memory leak with the hipExtStreamCreateWithCUMask API.
HSA queues with a CU mask set are not reused and are created
every time the API is called, but these queues were not being
destroyed during hipStreamDestroy, causing a memory leak.
Change-Id: Ibfbe019bbd73604e98eca80461efe53fa64bb701
[ROCm/clr commit: 191869b252]
To refactor childGraph to have its own graphExec,
kernelArgs needs to be separated from the graphExec object.
All childNodes that are part of a graph should share the same kernelArg pool.
Otherwise we end up creating multiple device kernel arg memory chunks
for a single graphExec.
Change-Id: I4029a46ebc1fa112d87df64ab1fecbf288fabe5e
[ROCm/clr commit: 35079e834e]