Commit Graph

2945 Commits

Author SHA1 Message Date
Chris Freehill c5faafeb25 rocr: Ensure AqlQueue can exit on memory error
A hang would occur when a memory error occurs because the
AQLQueue destructor would be waiting for a signal that
wouldn't come. This change allows it to break out of the
wait loop.


[ROCm/ROCR-Runtime commit: c065d9a7e2]
2025-07-11 12:58:21 -05:00
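Note on c5faafeb25: the hang it describes is a wait loop whose only exit condition is a signal that never arrives after a failure. A minimal sketch of the general remedy follows, assuming a condition-variable wait; `WaitingQueue`, `SignalDone`, `ReportMemoryError`, and `WaitForShutdown` are hypothetical names and not the actual AqlQueue code.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a queue whose teardown waits on a signal.
class WaitingQueue {
 public:
  // Normal path: the awaited signal arrives.
  void SignalDone() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      done_ = true;
    }
    cv_.notify_all();
  }

  // Error path: record the failure and wake any waiter.
  void ReportMemoryError() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      error_ = true;
    }
    cv_.notify_all();
  }

  // Returns true if the signal arrived, false if the wait was abandoned
  // because an error was reported. Without the error_ term in the
  // predicate, this would block forever after a failure.
  bool WaitForShutdown() {
    std::unique_lock<std::mutex> lock(mutex_);
    cv_.wait(lock, [this] { return done_ || error_; });
    return done_;
  }

 private:
  std::mutex mutex_;
  std::condition_variable cv_;
  bool done_ = false;
  bool error_ = false;
};
```

The design point is only that the wait predicate covers every way the awaited event can fail to happen.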
Alysa Liu 7ebf230622 rocrtst: migrate from rsmi API to amdsmi API
Replace ROCm SMI (rsmi) API calls with AMDSMI (amdsmi) API calls
in rocrtst.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 4fab4d70e6]
2025-07-11 11:22:34 -04:00
Honglei Huang 3fb4c8d3d7 rocr/driver: move wallclock frequency query to driver layer
Move the wallclock frequency query from GpuAgent to driver layer to improve
code organization and support multiple driver types. This change:

1. Add GetWallclockFrequency API to KFD/XDNA drivers
2. Move libdrm GPU info query from GpuAgent to driver implementation
3. Update GpuAgent to use the new driver API

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 412e386b50]
2025-07-11 16:14:29 +08:00
Honglei Huang 309e8b1a9f rocr/driver: add support for getting GPU tile configuration
- Implemented GetTileConfig in KfdDriver to retrieve tile configuration for
a specific node.
- Added a stub implementation of GetTileConfig in XdnaDriver.
- Updated driver.h to include a virtual GetTileConfig method.
- Extended hsa_internal.h with a new hsa_get_tile_config function.
- Integrated hsa_get_tile_config into hsa.cpp to call the driver-specific
  implementation.
- Updated driver headers to declare the new GetTileConfig method.

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 9bc38e2ee6]
2025-07-11 16:14:29 +08:00
Honglei Huang e459cc0c3b rocr/driver: add GetClockCounters API to driver interface
This commit introduces a new GetClockCounters API to the driver interface.

- Implemented GetClockCounters in KfdDriver to fetch clock counters
  using hsaKmtGetClockCounters.
- Added a stub implementation of GetClockCounters in XdnaDriver that
  returns HSA_STATUS_ERROR.
- Modified GpuAgent to use driver().GetClockCounters instead of
  directly calling hsaKmtGetClockCounters.

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 8d077dba3b]
2025-07-11 16:14:29 +08:00
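Note on e459cc0c3b: the pattern it describes (a virtual method on the driver interface, a real implementation in one driver, an error-returning stub in another, and the agent calling through the interface) can be sketched as below. This is an illustration under assumptions, not the ROCR source: `Status`, `ClockCounters`, `Driver`, `FakeKfdDriver`, `StubDriver`, and `Agent` are all hypothetical, and the fake driver returns made-up values instead of calling into the kernel driver.

```cpp
#include <cstdint>

// Hypothetical status and result types for the sketch.
enum class Status { kSuccess, kError };
struct ClockCounters {
  std::uint64_t gpu = 0;
  std::uint64_t system = 0;
};

// Driver interface: the agent only ever sees this type.
class Driver {
 public:
  virtual ~Driver() = default;
  virtual Status GetClockCounters(std::uint32_t node_id,
                                  ClockCounters* out) const = 0;
};

// Stand-in for a driver that supports the query (values are fabricated
// for the sketch; a real driver would ask the kernel).
class FakeKfdDriver final : public Driver {
 public:
  Status GetClockCounters(std::uint32_t node_id,
                          ClockCounters* out) const override {
    if (out == nullptr) return Status::kError;
    out->gpu = 1000u + node_id;
    out->system = 2000u;
    return Status::kSuccess;
  }
};

// Stand-in for a driver that does not support the query yet.
class StubDriver final : public Driver {
 public:
  Status GetClockCounters(std::uint32_t, ClockCounters*) const override {
    return Status::kError;
  }
};

// The agent delegates to whichever driver it was constructed with.
class Agent {
 public:
  Agent(const Driver& driver, std::uint32_t node_id)
      : driver_(driver), node_id_(node_id) {}
  Status QueryClocks(ClockCounters* out) const {
    return driver_.GetClockCounters(node_id_, out);
  }

 private:
  const Driver& driver_;
  std::uint32_t node_id_;
};
```

The same shape applies to the other `rocr/driver` commits in this listing that move a direct call behind the driver interface.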
Honglei Huang bacf61dde9 rocr/driver: add GetDeviceHandle to driver interface
This commit introduces a new GetDeviceHandle API to the driver
interface, allowing retrieval of the device handle for a
specific node.

- Implemented GetDeviceHandle in KfdDriver to fetch the AMD GPU
  device handle using hsaKmtGetAMDGPUDeviceHandle.
- Added a stub implementation of GetDeviceHandle in XdnaDriver
  that returns HSA_STATUS_ERROR.
- Modified GpuAgent::InitLibDrm to use driver().GetDeviceHandle
  instead of directly calling hsaKmtGetAMDGPUDeviceHandle.

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 05b83e72d9]
2025-07-11 16:14:29 +08:00
Honglei Huang d675a9e3a0 rocr: replace DMABuf export paths by driver interface
This change improves code maintainability and error handling by
centralizing DMABuf export functionality in the driver interface.

- Replace direct hsaKmtExportDMABufHandle calls with driver's ExportDMABuf method
- Improve error handling with more specific error status returns
- Add explicit invalid parameter checks and assertions
- Consolidate DMABuf export logic in IPC and VMemory paths
- Propagate detailed error status from driver layer

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 837fd044d0]
2025-07-11 13:36:45 +08:00
Tony Gutierrez a99b7358ea rocr: Remove driver usage from filter device
Slightly refactor the RvdFilter so it doesn't need to call into the driver.


[ROCm/ROCR-Runtime commit: cb7b0c8d9f]
2025-07-10 09:41:34 -07:00
David Yat Sin 4e069fe72b doc: Fix doxygen comments for in-out params
[ROCm/ROCR-Runtime commit: 4c2dec5bb8]
2025-07-10 08:21:01 -04:00
Honglei Huang a8e7d69b18 libhsakmt: use uint32_t for loop index variables
This patch changes the type of several loop index variables from int to
uint32_t in fmm.c. The affected functions are:
- __fmm_release
- _fmm_map_to_gpu
- _fmm_unmap_from_gpu

This fixes the following compile warning:

warning: comparison of integer expressions of different signedness:
'int' and 'uint32_t' {aka 'unsigned int'} [-Wsign-compare]
 2009 |         for (i = 0; i < object->handle_num; i++) {

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 45af009c5d]
2025-07-09 13:15:42 +08:00
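Note on a8e7d69b18: the warning arises when a signed loop index is compared against an unsigned bound. A small sketch of the fixed form, written in C++ for consistency with the other notes in this listing (fmm.c itself is C); `Object` and `SumHandles` are hypothetical and only mirror the `handle_num` field named in the warning text.

```cpp
#include <cstdint>

// Hypothetical object with an unsigned element count.
struct Object {
  std::uint32_t handle_num;
  const std::uint64_t* handles;
};

// The index has the same unsigned type as the bound, so the comparison
// i < obj.handle_num no longer mixes signedness.
std::uint64_t SumHandles(const Object& obj) {
  std::uint64_t sum = 0;
  for (std::uint32_t i = 0; i < obj.handle_num; i++) {
    sum += obj.handles[i];
  }
  return sum;
}
```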
Apurv Mishra 6c89d61cef kfdtest: Temporarily blacklist KFDEvictTest suite
Blacklist the KFDEvictTest suite until defects SWDEV 535386 and
537002, which cause these test cases to fail inconsistently,
are fixed.

Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>


[ROCm/ROCR-Runtime commit: 3115384874]
2025-07-04 11:47:20 -04:00
Chris Freehill ad3985af1c rocr: Fix isa entries for gfx906/sramecc
Some of the entries for gfx906 in the ISA table in isa.cpp
had "any" for "sramecc-" instead of "disabled". This fixes
that.


[ROCm/ROCR-Runtime commit: 12430fe25a]
2025-07-02 08:40:30 -05:00
Chris Freehill 0e860e73b0 rocr/rocrtst: Update to c++17
[ROCm/ROCR-Runtime commit: f1bd89bd0d]
2025-06-30 14:02:24 -05:00
Honglei Huang 4fea8ea1fd rocr/driver: add SetTrapHandler API to driver interface
This commit introduces a new SetTrapHandler API to the driver interface.

- Implemented SetTrapHandler in KfdDriver to set trap handlers using
  hsaKmtSetTrapHandler.
- Added a stub implementation of SetTrapHandler in XdnaDriver that returns
  HSA_STATUS_ERROR.
- Updated the driver interface in driver.h to include the new SetTrapHandler
  method.
- Modified GpuAgent to use driver().SetTrapHandler instead of directly calling
  hsaKmtSetTrapHandler.

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: d874b8003a]
2025-06-27 23:32:53 +08:00
Honglei Huang 5e7fd3b5ba rocr: replace direct libhsakmt calls with driver interfaces
Replace direct hsakmt API calls with calls through the driver abstraction layer
in queue management related functions. This includes:
- CreateQueue/DestroyQueue operations
- Queue update and GWS allocation
- CU masking configuration

Also update the corresponding error status types from HSAKMT_STATUS to
hsa_status_t and adjust error handling accordingly.

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: dee5bdc679]
2025-06-26 15:53:01 +08:00
Honglei Huang 75a7da05be rocr: use driver interface for memory and cache properties query
Replace direct libhsakmt calls with driver interface methods
in GpuAgent initialization:
- Replace hsaKmtGetNodeMemoryProperties with driver().GetMemoryProperties
- Replace hsaKmtGetNodeCacheProperties with driver().GetCacheProperties

Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>


[ROCm/ROCR-Runtime commit: 046591419f]
2025-06-26 15:53:01 +08:00
Honglei Huang 97992b809f rocr: remove unused agent properties reference in scratch initialization
The agent properties variable `agent_props` was declared but never used
in the `InitScratchSRD()` function, which caused a compile warning:

runtime/core/runtime/amd_aql_queue.cpp:1880:15: warning:
unused variable ‘agent_props’ [-Wunused-variable]
 1880 |   const auto& agent_props = agent_->properties();

No functional changes, purely a code cleanup commit.


[ROCm/ROCR-Runtime commit: ffa07e28e7]
2025-06-26 13:05:40 +08:00
Tony Gutierrez 8daec0261f rocr: Move OpenSMI call to Driver
[ROCm/ROCR-Runtime commit: 1a339feb1f]
2025-06-25 15:53:02 -07:00
Yiannis Papadopoulos 47093a7f73 rocr/aie: Remove redundant and unused functions.
[ROCm/ROCR-Runtime commit: 2ca4d8f6d4]
2025-06-25 11:32:42 -04:00
Yiannis Papadopoulos 9ca5405b74 rocr/aie: Correct calculation of neural cores and avoid error on invalid queue ID.
[ROCm/ROCR-Runtime commit: e5125c9d5e]
2025-06-25 11:32:42 -04:00
Ken O'Brien 24d10e5c76 rocr: Fixes memory allocation issue
Fixes a bug in memory allocation in which dmabuf export only works on
GPU 0 in a multi-GPU environment.


[ROCm/ROCR-Runtime commit: 7b8a6f8ca2]
2025-06-24 14:53:14 -04:00
Sunday Clement a9a8190453 rocrtst: Add new test for querying Clock Counters
Add a new subtest to the Agent Properties test to check the
functionality of the query.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: d2b35dfee6]
2025-06-23 18:45:09 -04:00
Sunday Clement 315b1abaf9 rocr: Add hsa-agent Queries for Clock Counters
Add support for querying the HSA_AMD_INFO_GET_CLOCK_COUNTERS agent
info through the HSA API in rocr, so that the user no longer has to
make a direct IOCTL call through the kernel driver.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: e97d06530e]
2025-06-23 18:45:09 -04:00
Tony Gutierrez a62368e2ba rocr: Update Driver queue-related APIs
Update the user-mode driver queue APIs to leverage KMT types.

Move queue-related calls to the core::Driver API.


[ROCm/ROCR-Runtime commit: e03d44d742]
2025-06-23 12:21:01 -07:00
David Yat Sin 39bddd8b9d rocr: support reserving non-registered VA
Extend hsa_amd_vmem_address_reserve/hsa_amd_vmem_address_reserve_align
to support HSA_AMD_VMEM_ADDRESS_NO_REGISTER flag. This allocation can be
used to reserve virtual address ranges that can later be used by
hsa_amd_svm_attributes_set for SVM based memory allocations.


[ROCm/ROCR-Runtime commit: b3c48cc68c]
2025-06-18 18:21:11 -04:00
Chris Freehill 14b5faf333 rocr: Add missing close of dmabuf after import
[ROCm/ROCR-Runtime commit: 24f36de037]
2025-06-17 20:22:34 -04:00
David Yat Sin b0e43cc426 rocrtst: Reduce host memory limit to 90%
Further reduce upper bound for rocrtstFunc.Memory_Max_Mem
as previous limit of 95% can still trigger OOM killer.


[ROCm/ROCR-Runtime commit: 649ec63a4f]
2025-06-16 21:02:20 -04:00
David Yat Sin e3b013b208 rocr: Always send free scratch notifications
Always send notification to profiler tools when scratch memory is freed.


[ROCm/ROCR-Runtime commit: 488cfd467c]
2025-06-16 17:39:33 -04:00
Alysa Liu a36892da4d rocr: Fix wrong sizeof argument
Update size calculation from 2 * sizeof(void*) to 2 * sizeof(uint64_t)

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 3b450397d6]
2025-06-16 13:11:07 -04:00
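Note on a36892da4d: `sizeof(void*)` and `sizeof(uint64_t)` happen to match on 64-bit targets but differ on 32-bit ones, so a size computed from the pointer type can be too small for data that is really two 64-bit values. A minimal sketch, assuming the buffer holds a pair of `uint64_t` (the commit message does not say what the buffer holds); `PackPair` is a hypothetical helper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Size is derived from the element type actually stored, not from void*.
constexpr std::size_t kPairBytes = 2 * sizeof(std::uint64_t);
static_assert(kPairBytes == 16, "two 64-bit values occupy 16 bytes");

// Copies two 64-bit values into a byte buffer of exactly kPairBytes.
std::array<unsigned char, kPairBytes> PackPair(std::uint64_t a,
                                               std::uint64_t b) {
  std::array<unsigned char, kPairBytes> out{};
  std::memcpy(out.data(), &a, sizeof(a));
  std::memcpy(out.data() + sizeof(a), &b, sizeof(b));
  return out;
}
```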
Sunday Clement 90e35e8486 rocr: Remove Recursive Include
Removed unnecessary header include in file to prevent circular include.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: 31b6474801]
2025-06-13 12:29:52 -04:00
Sunday Clement 76dbfc159c rocr: Fix Recursive Include in header files
scratch_cache.h includes amd_gpu_agent.h, which in turn includes
scratch_cache.h. This has been fixed by removing the unnecessary
header include.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: 06efa50c09]
2025-06-13 12:29:52 -04:00
David Yat Sin b66b6991b0 rocr: Remove scratch_backing_memory_byte_size
scratch_backing_memory_byte_size was originally removed, and then put
back in e130172218. This was because it
was used by rocgdb. rocgdb code has been updated to not use this field.
Bumped _amdgpu_r_debug for the ABI change.


[ROCm/ROCR-Runtime commit: 3c0af843e3]
2025-06-12 15:33:47 -04:00
David Yat Sin 10b8b00193 cmake: Remove unused file
[ROCm/ROCR-Runtime commit: 17b8f9b24d]
2025-06-12 10:38:58 -04:00
David Yat Sin 37afa1c0eb rocr: Remove support for Kaveri GPUs
Kaveri GPUs are EoL


[ROCm/ROCR-Runtime commit: 24ce840732]
2025-06-12 10:38:58 -04:00
David Yat Sin 8982f2c2c6 rocr: Fix compile warning when using clang
[ROCm/ROCR-Runtime commit: 96d0f07b15]
2025-06-12 10:38:58 -04:00
Alysa Liu ab747b1ffd rocr: Prevent int overflow in arithmetic operation
Cast range->x and range->y to uint64_t before performing multiplication

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 77b86ca908]
2025-06-11 19:36:36 -04:00
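Note on ab747b1ffd: multiplying two 32-bit operands produces a 32-bit result even when it is assigned to a 64-bit variable, so the widening cast has to happen before the multiplication. A sketch, assuming 32-bit unsigned fields (the commit message does not state the field types); `Range`, `AreaNarrow`, and `AreaWide` are hypothetical.

```cpp
#include <cstdint>

// Hypothetical range with 32-bit extents.
struct Range {
  std::uint32_t x;
  std::uint32_t y;
};

// Problematic form: the product is computed in 32 bits and wraps before
// it is widened to the return type.
std::uint64_t AreaNarrow(const Range& r) { return r.x * r.y; }

// Fixed form: widen first, then multiply in 64 bits.
std::uint64_t AreaWide(const Range& r) {
  return static_cast<std::uint64_t>(r.x) * static_cast<std::uint64_t>(r.y);
}
```

With `x = y = 65536`, the narrow form yields 0 because 2^32 wraps in 32-bit arithmetic, while the wide form yields 4294967296. If the fields were signed, the narrow form would be undefined behavior rather than a wrap.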
David Yat Sin ec4830eb5c rocr: document pseudo-code for scratch reclaim
Document CP FW and ROCr pseudo-code for asynchronous reclaim.
No code change.


[ROCm/ROCR-Runtime commit: df5d66eae5]
2025-06-11 16:19:59 -04:00
Chris Freehill 91268a6be9 rocr: Add hsa_amd_portable_export_dmabuf_v2
The original version of hsa_amd_portable_export_dmabuf() did not
consider the conditions under which a dmabuf could be shared.
In the new version (hsa_amd_portable_export_dmabuf_v2()), the caller
can specify the flag HSA_AMD_DMABUF_MAPPING_TYPE_PCIE, which means they
want to share the dmabuf over PCIe. In that case, the new code returns an
error if the GPU is a PCIe GPU that is not in an XGMI hive and large-BAR
is not supported.


[ROCm/ROCR-Runtime commit: a34604bddb]
2025-06-09 15:42:58 -05:00
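Note on 91268a6be9: the PCIe sharing condition in the message reduces to a single predicate. The sketch below only restates that sentence; `GpuInfo` and `CanExportOverPcie` are hypothetical, and treating every other combination as allowed is an assumption, since the message describes only the error case.

```cpp
// Hypothetical summary of the properties the commit message mentions.
struct GpuInfo {
  bool is_pcie;       // GPU is attached over PCIe
  bool in_xgmi_hive;  // GPU is part of an XGMI hive
  bool large_bar;     // large-BAR is supported
};

// Returns false only for the case the message calls out: a PCIe GPU,
// outside any XGMI hive, without large-BAR support.
bool CanExportOverPcie(const GpuInfo& gpu) {
  if (gpu.is_pcie && !gpu.in_xgmi_hive && !gpu.large_bar) {
    return false;
  }
  return true;  // assumption: other combinations are not rejected
}
```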
Chris Freehill 287986ab65 rocr: Add hsa_amd_portable_export_dmabuf_v2
The original version of hsa_amd_portable_export_dmabuf() did not
consider the conditions under which a dmabuf could be shared.
In the new version (hsa_amd_portable_export_dmabuf_v2()), the caller
can specify the flag HSA_AMD_DMABUF_MAPPING_TYPE_PCIE, which means they
want to share the dmabuf over PCIe. In that case, the new code returns an
error if the GPU is a PCIe GPU that is not in an XGMI hive and large-BAR
is not supported.


[ROCm/ROCR-Runtime commit: 3a9d14bb66]
2025-06-09 15:42:58 -05:00
Sunday Clement 5c7524ba3e rocr: Fix Unintentional Integer Overflow
It's safer to have the integer literal explicitly be an unsigned long
in this expression as that's what the type of the errorCode variable
resolves to, preventing any overflow errors.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: dce52be686]
2025-06-09 15:16:10 -04:00
Sunday Clement 1eaee1649a rocr: Fix Unintended Sign Extension
ehdr->e_shentsize and ehdr->e_shnum are both 16-bit unsigned integers,
so they are implicitly promoted to signed int during the
multiplication. They must be explicitly cast to a larger unsigned
type; otherwise, if the signed product is large enough, the value is
sign-extended, resulting in incorrect values.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: d00ca2e9b7]
2025-06-09 15:16:10 -04:00
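Note on 1eaee1649a: the underlying rule is C and C++ integer promotion. Two `uint16_t` operands are promoted to `int` before multiplying, so the product is signed and can exceed `INT_MAX`. A sketch of the fixed form, assuming a platform where `int` is wider than 16 bits; `SectionTableBytes` is a hypothetical helper, not the function touched by the commit.

```cpp
#include <cstdint>
#include <type_traits>

// On platforms where int is wider than 16 bits, uint16_t * uint16_t is
// evaluated as int, which is what makes the unfixed expression risky.
static_assert(
    std::is_same_v<decltype(std::uint16_t{} * std::uint16_t{}), int>,
    "uint16_t operands are promoted to int");

// Fixed form: cast one operand to a 64-bit unsigned type first, so the
// whole multiplication is carried out in uint64_t.
std::uint64_t SectionTableBytes(std::uint16_t entry_size,
                                std::uint16_t entry_count) {
  return static_cast<std::uint64_t>(entry_size) * entry_count;
}
```

The unfixed expression is deliberately not shown as runnable code: with both operands at 0xFFFF the `int` product overflows, which is undefined behavior.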
Alysa Liu 03430838af rocr: Remove structurally dead code
Remove unreachable return statement.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 9b3d15e68d]
2025-06-09 14:01:39 -04:00
Alysa Liu d1c3b7262d rocr: Add proper file descriptor cleanup
Ensure file descriptor 'in' is properly closed in error cases
when calling _lseek() during readFrom() operations.
Fix potential resource leak when errors occur during file operations.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 167602edfb]
2025-06-04 22:37:21 -04:00
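Note on d1c3b7262d: one common way to guarantee a descriptor is closed on every error path is a scope guard. The sketch below is an assumption about the general technique, not the change made in the commit; `ScopedFd` and `ReadFrom` are hypothetical, and the close operation is injected as a callback so the example does not depend on any particular OS API.

```cpp
#include <functional>
#include <utility>

// Closes the wrapped descriptor when it goes out of scope.
class ScopedFd {
 public:
  ScopedFd(int fd, std::function<void(int)> closer)
      : fd_(fd), closer_(std::move(closer)) {}
  ~ScopedFd() {
    if (fd_ >= 0) closer_(fd_);
  }
  ScopedFd(const ScopedFd&) = delete;
  ScopedFd& operator=(const ScopedFd&) = delete;
  int get() const { return fd_; }

 private:
  int fd_;
  std::function<void(int)> closer_;
};

// Hypothetical read routine. seek_ok simulates whether the seek step
// succeeded; close_count records how many times the descriptor was closed.
bool ReadFrom(int fd, bool seek_ok, int& close_count) {
  ScopedFd guard(fd, [&close_count](int) { ++close_count; });
  if (!seek_ok) {
    return false;  // early return: the guard still closes the descriptor
  }
  return true;  // normal return: the guard closes it here as well
}
```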
Sunday Clement 1da312af87 rocr: Fix Potential Deadlock
Moved the call to pthread_mutex_lock to an else statement for better
code readability.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: 1635746a9c]
2025-06-04 10:18:09 -04:00
Sunday Clement 25886ecda8 rocr: Fix Potential Deadlock
Because eventDescrp->mutex is a non-recursive lock, attempting to
acquire it with pthread_mutex_lock can cause the system to hang
indefinitely if the lock was already acquired by the preceding
call to pthread_mutex_trylock.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: a97b7df4b9]
2025-06-04 10:18:09 -04:00
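Note on 25886ecda8 and 1da312af87: the hazard is locking a non-recursive mutex a second time after a successful try-lock. A sketch of the safe shape using `std::mutex` in place of the pthread calls named in the messages; `UpdateUnderLock` is a hypothetical function.

```cpp
#include <mutex>

// Increments counter while holding m exactly once, then returns the new
// value. The blocking lock() is reached only when try_lock() failed, so
// the calling thread never tries to acquire a mutex it already holds.
int UpdateUnderLock(std::mutex& m, int& counter) {
  if (!m.try_lock()) {
    m.lock();  // contended: wait for the current owner to release
  }
  ++counter;
  const int value = counter;
  m.unlock();
  return value;
}
```

Calling `lock()` unconditionally after a successful `try_lock()` would self-deadlock with a non-recursive pthread mutex, and is undefined behavior with `std::mutex`.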
Alysa Liu 6de1c81b71 rocr: Fix inefficient copy operations
Refactor variable assignments to use std::move() where appropriate.
Update function headers to accept parameters by const& where appropriate.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: f6c8cbd293]
2025-06-02 11:18:36 -04:00
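Note on 6de1c81b71: the two techniques it names are moving from a by-value parameter that is about to be stored, and taking read-only parameters by `const&`. A minimal sketch of both; `Named` and its members are hypothetical.

```cpp
#include <string>
#include <utility>

class Named {
 public:
  // Sink parameter: taken by value and moved into the member, so an
  // rvalue argument is never copied.
  explicit Named(std::string name) : name_(std::move(name)) {}

  // Read-only parameter: passed by const reference, so no copy is made.
  bool HasPrefix(const std::string& prefix) const {
    return name_.compare(0, prefix.size(), prefix) == 0;
  }

  const std::string& name() const { return name_; }

 private:
  std::string name_;
};
```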
Alysa Liu 65f5ce6f0a rocr: Fixed inefficient copy operations
Changed variable assignments to use std::move() where appropriate.
Changed function headers to pass string arguments by reference where appropriate.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: ae6851dbb4]
2025-06-02 11:18:36 -04:00
Alysa Liu b97f9ba6d5 rocr: Fixed inefficient copy operations
Changed variable assignments to use std::move() where appropriate.
Revert change in amd_kfd_driver.cpp.

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: a945b5d493]
2025-06-02 11:18:36 -04:00
Alysa Liu 88dd451c64 rocr: Fixed inefficient copy operations
Changed variable assignments to use std::move() where appropriate

Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>


[ROCm/ROCR-Runtime commit: 369d89ade3]
2025-06-02 11:18:36 -04:00
Sunday Clement 3d3cca8083 rocr: Fix Resource Leak
Allocated memory was previously not freed in the event of an error
with rwlock initialization.

Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>


[ROCm/ROCR-Runtime commit: 293092f32f]
2025-05-30 09:16:26 -04:00