Previously, we used the following approach and Comgr actions
for device lib linking:
AMD_COMGR_COMPILE_SOURCE_TO_BC (compile with clang driver)
AMD_COMGR_ADD_DEVICE_LIBRARIES (link in device libs with
llvm-link API)
However, the clang driver can link in device libraries as part
of compilation, assuming a --rocm-path is set. In this context,
this is accomplished by using the following Comgr action instead:
AMD_COMGR_COMPILE_SOURCE_WITH_DEVICE_LIBS_TO_BC (compile and
link in device libs with clang driver)
Change-Id: Ie0bbee7d9a12672536b6d751056a941128ed58be
[ROCm/clr commit: 6311ed8a8e]
- Sometimes we want to mask out kernel names; use the right log level for
kernel logging
Change-Id: Ideae9647c57b86ae390ff2f4131f6d8c6df5c086
[ROCm/clr commit: f1adecd186]
Copies can get blocked if the last SDMA engine is in use by another
copy, which can lead to a perf drop in some tests such as Gromacs.
Resetting the last engine by checking the engine status and fetching the
new mask after a few copies avoids this.
Change-Id: I8fe8ea678db508d291c6242f3741fa9215e99921
[ROCm/clr commit: 1b25484f0f]
Currently we force inlining of everything for HIP. We would now like to enable
function call support. The first step is to remove uses of `-amdgpu-early-inline-all`
in various places. This patch removes all of them from clr.
Change-Id: Ib0cad1f586714c9989778b00746aa4c47a4eec95
[ROCm/clr commit: a09204388a]
Mempool has the capability to track dependencies between streams for
faster memory reuse. Enable that capability.
Change-Id: I28266a7e38d0fc4c5d027b9542d3719653840821
[ROCm/clr commit: 17d0c166d2]
OpenCL printf handling did not process vectors of half-precision floats properly
(mainly because the compiler packs two halfs into a dword and the runtime failed to
extract the individual parts).
This patch fixes the issue.
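As an illustration of the extraction step described above, here is a minimal
sketch, not the runtime's actual code. The names HalfBitsToFloat and UnpackHalf2
are hypothetical, and it assumes the first vector element sits in the low 16 bits
of the dword.

```cpp
#include <cstdint>
#include <cstring>
#include <utility>

// Convert the raw bits of one IEEE 754 binary16 value to a float.
// (Hypothetical helper for illustration only.)
inline float HalfBitsToFloat(std::uint16_t h) {
  const std::uint32_t sign = static_cast<std::uint32_t>(h & 0x8000u) << 16;
  const std::uint32_t exp = (h >> 10) & 0x1Fu;
  std::uint32_t mant = h & 0x3FFu;
  std::uint32_t bits;
  if (exp == 0) {
    if (mant == 0) {
      bits = sign;  // signed zero
    } else {
      // Subnormal half: shift the mantissa until its implicit bit appears,
      // then encode it as a normal float.
      int shift = 0;
      while ((mant & 0x400u) == 0) {
        mant <<= 1;
        ++shift;
      }
      mant &= 0x3FFu;
      bits = sign | (static_cast<std::uint32_t>(113 - shift) << 23) | (mant << 13);
    }
  } else if (exp == 0x1Fu) {
    bits = sign | 0x7F800000u | (mant << 13);  // infinity or NaN
  } else {
    bits = sign | ((exp + 112u) << 23) | (mant << 13);  // rebias 15 -> 127
  }
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Split one dword into its two packed halfs.
// Assumption: element 0 is in the low 16 bits, element 1 in the high 16 bits.
inline std::pair<float, float> UnpackHalf2(std::uint32_t dword) {
  return {HalfBitsToFloat(static_cast<std::uint16_t>(dword & 0xFFFFu)),
          HalfBitsToFloat(static_cast<std::uint16_t>(dword >> 16))};
}
```

Under that packing assumption, UnpackHalf2(0xC0003C00u) yields the pair
(1.0f, -2.0f), which is what a printf of a half2 argument would need to print.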
Change-Id: Ia1f15ccfb5db52b71c43cfd588dd38f551ee5277
[ROCm/clr commit: 6f390f5af9]
- Print the SWq for AQL packets; this helps correlate a stream with the HWq
it is mapped to
Change-Id: I610430c0872a1abc6636027c00163ec46983cd65
[ROCm/clr commit: 984c86f407]
- With the runtime unbundler, no gfx device can load a fat binary
if any device lacks an available code object.
- Extract the available code objects for the corresponding gfx devices, so
users can run ROCm on the devices that are ready without a segmentation
fault.
Change-Id: I9f14c65ecebf2d3c4b127a007cb434a3ae98c450
[ROCm/clr commit: 6723277ad4]
- Enable Device kernel args for MI300* for now.
- Fix a perf issue that affects graph instantiation when device kernel args
are enabled.
Change-Id: I962e58fd9d8dd1a8db95e601cb03a8e9c7bac97f
[ROCm/clr commit: 68f40f78dd]
A node can be enabled/disabled only for kernel, memcpy and memset nodes.
If the node is disabled, it becomes an empty node.
To maintain ordering, just enqueue a marker with the respective node dependencies.
Change-Id: I710f3e88ab4e76c81f6f86a40a7dc61fd4c7e440
[ROCm/clr commit: e0e63eb04d]
- enforce incrementing the table versioning number when a table size changes outside of ifdef for ROCPROFILER_REGISTER
- add new HIP_ENFORCE_ABI entries
- update the HipDispatchTable size and bump HIP_RUNTIME_API_TABLE_STEP_VERSION to 1
- re-enable rocprofiler-register
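The enforcement idea above can be sketched with compile-time checks. This is a
hedged illustration only: DemoDispatchTable, DemoTableOffset and the DEMO_* macros
are hypothetical stand-ins, not the real HipDispatchTable or HIP_ENFORCE_ABI
definitions, and it assumes function pointers and size_t have the same size.

```cpp
#include <cstddef>

// Hypothetical dispatch table; the real table has many more entries.
struct DemoDispatchTable {
  std::size_t size;          // runtime stores sizeof(DemoDispatchTable) here
  void (*demo_malloc_fn)();  // entry 0
  void (*demo_free_fn)();    // entry 1
  void (*demo_launch_fn)();  // entry 2, the newly added entry
};

// Expected byte offset of entry n: the size field, then n function pointers.
constexpr std::size_t DemoTableOffset(std::size_t n) {
  return sizeof(std::size_t) + n * sizeof(void (*)());
}

// Bumped whenever an entry is appended to the table.
#define DEMO_TABLE_STEP_VERSION 1

// Fails to compile if an entry moves, which would break existing clients.
#define DEMO_ENFORCE_ABI(TABLE, ENTRY, NUM)                  \
  static_assert(offsetof(TABLE, ENTRY) == DemoTableOffset(NUM), \
                "ABI break: " #ENTRY " changed position")

// Fails to compile if the table grows without the step version being bumped.
#define DEMO_ENFORCE_ABI_VERSIONING(TABLE, NUM_ENTRIES, STEP)  \
  static_assert(sizeof(TABLE) == DemoTableOffset(NUM_ENTRIES) && \
                    DEMO_TABLE_STEP_VERSION == (STEP),          \
                "table size changed: bump the step version")

DEMO_ENFORCE_ABI(DemoDispatchTable, demo_malloc_fn, 0);
DEMO_ENFORCE_ABI(DemoDispatchTable, demo_free_fn, 1);
DEMO_ENFORCE_ABI(DemoDispatchTable, demo_launch_fn, 2);
DEMO_ENFORCE_ABI_VERSIONING(DemoDispatchTable, 3, 1);
```

Keeping these checks outside any feature ifdef means a size change is caught in
every build configuration, not only when the profiler registration path is compiled.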
Change-Id: Ie0cc1d8491c5640056e5dd393ea243e4dce4e8a9
[ROCm/clr commit: d84c5ae3af]
When a kernel does device-side malloc, the initial heap is allocated with __amd_rocclr_initHeap.
During graph launch, the __amd_rocclr_initHeap kernel is enqueued, followed by the actual kernel, so the actual kernel executes after the initHeap kernel.
But with graph optimizations during capture, initHeap gets enqueued on the device null stream and the actual kernel on the graph launch stream,
so there is no proper synchronization. Switch to command creation and enqueue during launch for kernel nodes with a hidden heap.
Change-Id: Iaf600251faef9a448853f19429023c118aa760b9
[ROCm/clr commit: 2dc6ec68a5]
The __amd_streamOpsWrite blit kernel in device-libs has only 3 args,
so get rid of the unused 4th arg (sizeBytes).
Change-Id: I81cc1107f8b424bf58558c93a2495a1b878aef91
[ROCm/clr commit: e643406caa]
With multiple HIP streams it's possible to have a race condition when
one thread stops the traces while another is still performing submissions.
That may cause a crash in the barrier callback.
Change-Id: Ic56f8277fcfd2c2142a4821d927b938b9f313add
[ROCm/clr commit: e2d2fad56c]
Check whether the pointer is present in the arrayset before trying to dereference
it, as dereferencing can cause an access violation if the pointer was allocated using malloc
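A minimal sketch of the lookup-before-dereference pattern described above. The
types DemoArray and DemoArrayRegistry are hypothetical and only stand in for the
runtime's array bookkeeping.

```cpp
#include <set>

// Hypothetical array descriptor tracked by the runtime.
struct DemoArray {
  int width;
};

// Hypothetical registry of every array the runtime has created.
class DemoArrayRegistry {
 public:
  void Add(const DemoArray* array) { arrays_.insert(array); }

  // Returns the array width, or -1 if ptr is not a known array.
  // The pointer is only compared here, never dereferenced, so a pointer that
  // came from plain malloc cannot trigger an access violation.
  int WidthOrInvalid(const void* ptr) const {
    auto it = arrays_.find(static_cast<const DemoArray*>(ptr));
    if (it == arrays_.end()) {
      return -1;
    }
    return (*it)->width;  // safe: the pointer is known to be a DemoArray
  }

 private:
  std::set<const DemoArray*> arrays_;
};
```

A caller would register each array on creation and use WidthOrInvalid (or a
similar query) on any user-supplied pointer before treating it as an array.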
Change-Id: Ida72b9015dc22269fc1fbe0728e66e3de29fda3d
[ROCm/clr commit: 821ae6a103]