Previously, we used the following approach and Comgr actions
for device lib linking:
AMD_COMGR_COMPILE_SOURCE_TO_BC (compile with clang driver)
AMD_COMGR_ADD_DEVICE_LIBRARIES (link in device libs with
llvm-link API)
However, the clang driver can link in device libraries as part
of compilation, assuming a --rocm-path is set. In this context,
this is accomplished by using the following Comgr action instead:
AMD_COMGR_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC (compile and
link in device libs with clang driver)
Change-Id: Ie0bbee7d9a12672536b6d751056a941128ed58be
The copies can get blocked if the last SDMA engine is used by another
copy and this can lead to perf drop in some of the tests like Gromacs.
Resetting the last engine by checking the engine status and fetching the
new mask after few copies can avoid this.
Change-Id: I8fe8ea678db508d291c6242f3741fa9215e99921
Currently we force inlining everything for HIP. Now we'd like to enable function
supports. The first step is to remove uses of `-amdgpu-early-inline-all` in
various places. This patch is to remove all of them from clr.
Change-Id: Ib0cad1f586714c9989778b00746aa4c47a4eec95
Mempool has capability to track dependency between streams for
faster memory reuse. Enable that capability.
Change-Id: I28266a7e38d0fc4c5d027b9542d3719653840821
OpenCL printf handling did not process vector of half precision floats properly
(mainly because compiler packs 2 halfs into a dword and runtime failed to extract the
individual parts).
This patch fixes the issue.
Change-Id: Ia1f15ccfb5db52b71c43cfd588dd38f551ee5277
- Using runtime unbundler, no any gfx device can load fat binary,
if there is any device without available code object.
- Extract available code object to corresponding gfx devices. So
users can work ROCm with those ready devices without segmentation
fault.
Change-Id: I9f14c65ecebf2d3c4b127a007cb434a3ae98c450
- Enable Device kernel args for MI300* for now.
- Fix a perf issue which impacts graph instantiate when dev kernel args
are enabled.
Change-Id: I962e58fd9d8dd1a8db95e601cb03a8e9c7bac97f
Node can be enabled/disabled only for kernel, memcpy and memset nodes.
If the node is disabled it becomes empty node.
To maintain ordering just enqueue marker with respective node dependencies.
Change-Id: I710f3e88ab4e76c81f6f86a40a7dc61fd4c7e440
- enforce incrementing the table versioning number when a table size changes outside of ifdef for ROCPROFILER_REGISTER
- add new HIP_ENFORCE_ABI entries
- update the HipDispatchTable size and bump HIP_RUNTIME_API_TABLE_STEP_VERSION to 1
- re-enable rocprofiler-register
Change-Id: Ie0cc1d8491c5640056e5dd393ea243e4dce4e8a9
When kernel does device side malloc, initial heap is allocated with __amd_rocclr_initHeap.
During graph launch kernel __amd_rocclr_initHeap is enqueued followed by actual kernel . So kernel will execute after initHeap kernel.
But with graph optimizations during capture initHeap gets enqueued on device null stream and actual kernel on graph launch stream.
So no proper synchronization. Switch to command creation and enqueue during launch for kernel node with hidden heap.
Change-Id: Iaf600251faef9a448853f19429023c118aa760b9
__amd_streamOpsWrite blitkernel in device-libs has only 3 args.
so getting rid of the 4th unused arg (sizeBytes)
Change-Id: I81cc1107f8b424bf58558c93a2495a1b878aef91
With multiple HIP streams it's possible to have a race condition when
one thread stops the traces, but another still performs submisisons.
That may cause a crash on the barrier callback.
Change-Id: Ic56f8277fcfd2c2142a4821d927b938b9f313add
Check the pointer if its present in the arrayset before trying to dereference
it as it can cause access violation if the pointer is allocated using malloc
Change-Id: Ida72b9015dc22269fc1fbe0728e66e3de29fda3d