A hang occurred on a memory error because the AQLQueue
destructor waited for a signal that would never arrive.
This change lets the destructor break out of the wait
loop.
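A minimal sketch of the idea, not the actual AQLQueue code: the wait
loop checks an error flag in addition to the expected signal.
QueueState, its fields, and WaitForExit are hypothetical names used
only for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the state the destructor waits on.
struct QueueState {
  std::atomic<bool> exited{false};     // set when the expected signal arrives
  std::atomic<bool> mem_error{false};  // set by the memory error handler
};

// Returns true if the signal arrived, false if the wait was abandoned
// because a memory error was reported.
bool WaitForExit(const QueueState& state) {
  while (!state.exited.load(std::memory_order_acquire)) {
    if (state.mem_error.load(std::memory_order_acquire)) {
      return false;  // the signal will never come; stop waiting
    }
    std::this_thread::sleep_for(std::chrono::microseconds(50));
  }
  return true;
}
```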
[ROCm/ROCR-Runtime commit: c065d9a7e2]
Replace ROCm SMI (rsmi) API calls with AMDSMI (amdsmi) API calls
in rocrtst.
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: 4fab4d70e6]
Move the wallclock frequency query from GpuAgent to the driver layer to
improve code organization and support multiple driver types. This change:
1. Adds a GetWallclockFrequency API to the KFD/XDNA drivers
2. Moves the libdrm GPU info query from GpuAgent to the driver implementation
3. Updates GpuAgent to use the new driver API
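
The shape of this refactoring can be sketched as follows. This is an
illustration under assumptions, not the runtime's code: the Status
enum, the class names, the method signature, and the 100 MHz value are
all hypothetical.

```cpp
#include <cstdint>

// Hypothetical status type; the real runtime uses hsa_status_t.
enum class Status { kSuccess, kError };

// Hypothetical driver interface with the query hoisted out of the agent.
class Driver {
 public:
  virtual ~Driver() = default;
  virtual Status GetWallclockFrequency(uint32_t node_id,
                                       uint64_t* frequency) const = 0;
};

// Stand-in for a driver that can answer the query (value is made up).
class FixedClockDriver : public Driver {
 public:
  Status GetWallclockFrequency(uint32_t, uint64_t* frequency) const override {
    if (frequency == nullptr) return Status::kError;
    *frequency = 100000000;  // 100 MHz, illustrative only
    return Status::kSuccess;
  }
};

// Stand-in for a driver that does not support the query.
class UnsupportedDriver : public Driver {
 public:
  Status GetWallclockFrequency(uint32_t, uint64_t*) const override {
    return Status::kError;
  }
};

// Agent-side code depends only on the interface, not on a driver type.
uint64_t WallclockOrZero(const Driver& driver, uint32_t node_id) {
  uint64_t frequency = 0;
  if (driver.GetWallclockFrequency(node_id, &frequency) != Status::kSuccess) {
    return 0;
  }
  return frequency;
}
```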
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 412e386b50]
- Implemented GetTileConfig in KfdDriver to retrieve tile configuration for
a specific node.
- Added a stub implementation of GetTileConfig in XdnaDriver.
- Updated driver.h to include a virtual GetTileConfig method.
- Extended hsa_internal.h with a new hsa_get_tile_config function.
- Integrated hsa_get_tile_config into hsa.cpp to call the driver-specific
implementation.
- Updated driver headers to declare the new GetTileConfig method.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 9bc38e2ee6]
This commit introduces a new GetClockCounters API to the driver interface.
- Implemented GetClockCounters in KfdDriver to fetch clock counters
using hsaKmtGetClockCounters.
- Added a stub implementation of GetClockCounters in XdnaDriver that
returns HSA_STATUS_ERROR.
- Modified GpuAgent to use driver().GetClockCounters instead of
directly calling hsaKmtGetClockCounters.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 8d077dba3b]
This commit introduces a new GetDeviceHandle API to the driver
interface, allowing retrieval of the device handle for a
specific node.
- Implemented GetDeviceHandle in KfdDriver to fetch the AMD GPU
device handle using hsaKmtGetAMDGPUDeviceHandle.
- Added a stub implementation of GetDeviceHandle in XdnaDriver
that returns HSA_STATUS_ERROR.
- Modified GpuAgent::InitLibDrm to use driver().GetDeviceHandle
instead of directly calling hsaKmtGetAMDGPUDeviceHandle.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 05b83e72d9]
This change improves code maintainability and error handling by
centralizing DMABuf export functionality in the driver interface.
- Replace direct hsaKmtExportDMABufHandle calls with driver's ExportDMABuf method
- Improve error handling with more specific error status returns
- Add explicit invalid parameter checks and assertions
- Consolidate DMABuf export logic in IPC and VMemory paths
- Propagate detailed error status from driver layer
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 837fd044d0]
This patch changes the type of several loop index variables from int to
uint32_t in fmm.c. The affected functions are:
- __fmm_release
- _fmm_map_to_gpu
- _fmm_unmap_from_gpu
This fixes the following compile warning:
warning: comparison of integer expressions of different signedness:
'int' and 'uint32_t' {aka 'unsigned int'} [-Wsign-compare]
2009 | for (i = 0; i < object->handle_num; i++) {
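A small sketch of the pattern, assuming only that handle_num is a
uint32_t as the warning states. VmObject and VisitHandles are
hypothetical names and do not appear in fmm.c.

```cpp
#include <cstdint>

// Hypothetical stand-in for the object whose handle_num is a uint32_t.
struct VmObject {
  uint32_t handle_num;
};

// Declaring the index as uint32_t keeps both sides of `i < handle_num`
// unsigned, so -Wsign-compare has nothing to report.
uint32_t VisitHandles(const VmObject& object) {
  uint32_t visited = 0;
  for (uint32_t i = 0; i < object.handle_num; i++) {
    visited++;
  }
  return visited;
}
```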
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: 45af009c5d]
Blacklist the KFDEvictTest suite until defects SWDEV 535386
and 537002, which cause these test cases to fail
inconsistently, are fixed.
Signed-off-by: Apurv Mishra <Apurv.Mishra@amd.com>
[ROCm/ROCR-Runtime commit: 3115384874]
Some of the entries for gfx906 in the ISA table in isa.cpp
had "any" for "sramecc-" instead of "disabled". This fixes
that.
[ROCm/ROCR-Runtime commit: 12430fe25a]
This commit introduces a new SetTrapHandler API to the driver interface.
- Implemented SetTrapHandler in KfdDriver to set trap handlers using
hsaKmtSetTrapHandler.
- Added a stub implementation of SetTrapHandler in XdnaDriver that returns
HSA_STATUS_ERROR.
- Updated the driver interface in driver.h to include the new SetTrapHandler
method.
- Modified GpuAgent to use driver().SetTrapHandler instead of directly calling
hsaKmtSetTrapHandler.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: d874b8003a]
Replace direct hsakmt API calls with calls through the driver abstraction layer
in queue management related functions. This includes:
- CreateQueue/DestroyQueue operations
- Queue update and GWS allocation
- CU masking configuration
Also update the corresponding error status types from HSAKMT_STATUS to
hsa_status_t and adjust error handling accordingly.
Signed-off-by: Honglei Huang <Honglei1.Huang@amd.com>
[ROCm/ROCR-Runtime commit: dee5bdc679]
The agent properties variable `agent_props` was declared but never used
in the `InitScratchSRD()` function, which caused a compile warning:
runtime/core/runtime/amd_aql_queue.cpp:1880:15: warning:
unused variable ‘agent_props’ [-Wunused-variable]
1880 | const auto& agent_props = agent_->properties();
No functional changes, purely a code cleanup commit.
[ROCm/ROCR-Runtime commit: ffa07e28e7]
Added a new subtest to the Agent Properties test to check the
functionality of the query.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: d2b35dfee6]
Add support for querying the HSA_AMD_INFO_GET_CLOCK_COUNTERS agent
info through the HSA API in ROCr, so that users no longer have to
make a direct IOCTL call to the kernel driver.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: e97d06530e]
Extend hsa_amd_vmem_address_reserve/hsa_amd_vmem_address_reserve_align
to support HSA_AMD_VMEM_ADDRESS_NO_REGISTER flag. This allocation can be
used to reserve virtual address ranges that can later be used by
hsa_amd_svm_attributes_set for SVM based memory allocations.
[ROCm/ROCR-Runtime commit: b3c48cc68c]
Further reduce upper bound for rocrtstFunc.Memory_Max_Mem
as previous limit of 95% can still trigger OOM killer.
[ROCm/ROCR-Runtime commit: 649ec63a4f]
scratch_cache.h includes amd_gpu_agent.h, which in turn includes
scratch_cache.h. This is fixed by removing the unnecessary
header include.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 06efa50c09]
scratch_backing_memory_byte_size was originally removed, and then put
back in e130172218. This was because it
was used by rocgdb. rocgdb code has been updated to not use this field.
Bumped _amdgpu_r_debug for the ABI change.
[ROCm/ROCR-Runtime commit: 3c0af843e3]
Cast range->x and range->y to uint64_t before performing multiplication
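A minimal sketch of why the cast matters, assuming the two fields are
32-bit unsigned values. The Range struct and RangeArea are hypothetical
names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for the range type; 32-bit fields are assumed.
struct Range {
  uint32_t x;
  uint32_t y;
};

// Widening both operands first makes the multiplication happen in 64 bits,
// so the product cannot wrap around at 2^32.
uint64_t RangeArea(const Range* range) {
  return static_cast<uint64_t>(range->x) * static_cast<uint64_t>(range->y);
}
```

With x = y = 65536, a 32-bit multiplication wraps to 0, while the
widened form yields 4294967296.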
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: 77b86ca908]
The original version of hsa_amd_portable_export_dmabuf() did not
consider the conditions under which a dmabuf could be shared.
In the new version (hsa_amd_portable_export_dmabuf_v2()), the caller
can specify the flag HSA_AMD_DMABUF_MAPPING_TYPE_PCIE, which means they
want to share the dmabuf over PCIe. In that case, if the device is a
PCIe GPU that is not in an XGMI hive and large-BAR is not supported,
the call returns an error.
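
The rule can be sketched as a single predicate. GpuLinkInfo and
CanExportOverPcie are hypothetical names; the real check lives in the
runtime and may consult other state.

```cpp
// Hypothetical description of the exporting GPU.
struct GpuLinkInfo {
  bool is_pcie;       // device is attached over PCIe
  bool in_xgmi_hive;  // device is part of an XGMI hive
  bool large_bar;     // large-BAR is supported
};

// True when a PCIe-mapped dmabuf export is allowed under the rule above.
bool CanExportOverPcie(const GpuLinkInfo& gpu) {
  const bool needs_large_bar = gpu.is_pcie && !gpu.in_xgmi_hive;
  return !needs_large_bar || gpu.large_bar;
}
```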
[ROCm/ROCR-Runtime commit: a34604bddb]
The original version of hsa_amd_portable_export_dmabuf() did not
consider the conditions under which a dmabuf could be shared.
In the new version (hsa_amd_portable_export_dmabuf_v2()), the caller
can specify the flag HSA_AMD_DMABUF_MAPPING_TYPE_PCIE, which means they
want to share the dmabuf over PCIe. In that case, if the device is a
PCIe GPU that is not in an XGMI hive and large-BAR is not supported,
the call returns an error.
[ROCm/ROCR-Runtime commit: 3a9d14bb66]
It's safer to make the integer literal in this expression explicitly
an unsigned long, since that is the type the errorCode variable
resolves to. This prevents overflow errors.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: dce52be686]
ehdr->e_shentsize and ehdr->e_shnum are both 16-bit unsigned integers,
so they are implicitly promoted to signed int during the
multiplication. They must be explicitly cast to a larger unsigned
type; otherwise, if the signed product is large enough, the value is
sign-extended, resulting in incorrect values.
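
A minimal sketch of the fix. The function name SectionTableBytes is
hypothetical; the parameter names mirror the ELF header fields.

```cpp
#include <cstdint>

// Both operands are uint16_t, so `e_shentsize * e_shnum` alone would be
// computed as signed int after integer promotion. Casting one operand to
// uint64_t first forces the whole multiplication into unsigned 64-bit.
uint64_t SectionTableBytes(uint16_t e_shentsize, uint16_t e_shnum) {
  return static_cast<uint64_t>(e_shentsize) * e_shnum;
}
```

For the extreme case of 0xFFFF entries of 0xFFFF bytes each, the result
is 4294836225, which does not fit in a 32-bit signed int.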
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: d00ca2e9b7]
Ensure file descriptor 'in' is properly closed in error cases
when calling _lseek() during readFrom() operations.
Fix potential resource leak when errors occur during file operations.
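One way to guarantee the descriptor is closed on every path is a small
scope guard. This is a sketch using plain POSIX calls, not the code in
this commit, which may close the descriptor explicitly instead.
ScopedFd and FileSize are hypothetical names.

```cpp
#include <fcntl.h>
#include <unistd.h>

// Closes the wrapped descriptor on every exit path.
class ScopedFd {
 public:
  explicit ScopedFd(int fd) : fd_(fd) {}
  ~ScopedFd() {
    if (fd_ >= 0) ::close(fd_);
  }
  ScopedFd(const ScopedFd&) = delete;
  ScopedFd& operator=(const ScopedFd&) = delete;
  int get() const { return fd_; }

 private:
  int fd_;
};

// Returns the file size in bytes, or -1 on any error.
long long FileSize(const char* path) {
  ScopedFd in(::open(path, O_RDONLY));
  if (in.get() < 0) return -1;
  off_t end = ::lseek(in.get(), 0, SEEK_END);
  if (end < 0) return -1;  // the descriptor is still closed by ~ScopedFd
  return static_cast<long long>(end);
}
```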
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: 167602edfb]
Moved the call to pthread_mutex_lock into an else branch for better
code readability.
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 1635746a9c]
Because eventDescrp->mutex is a non-recursive lock, attempting to
acquire it with pthread_mutex_lock can hang indefinitely if the lock
was already acquired by the preceding call to
pthread_mutex_trylock.
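
A sketch of the safe pattern, assuming a default (non-recursive)
pthread mutex. AcquireOnce is a hypothetical helper, not a function in
the runtime.

```cpp
#include <pthread.h>

// Acquires *mutex exactly once. Returns true if the lock was free and
// taken by trylock, false if the caller had to block for it.
bool AcquireOnce(pthread_mutex_t* mutex) {
  if (pthread_mutex_trylock(mutex) == 0) {
    // The lock is now held by this thread. Calling pthread_mutex_lock
    // here as well would block forever on a non-recursive mutex.
    return true;
  } else {
    pthread_mutex_lock(mutex);  // contended: wait for the owner to release
    return false;
  }
}
```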
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: a97b7df4b9]
Refactor variable assignments to use std::move() where appropriate.
Update function headers to accept parameters by const& where appropriate.
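Both patterns can be illustrated with a short sketch. AgentName and
HasPrefix are hypothetical and do not correspond to code touched by
this commit.

```cpp
#include <string>
#include <utility>

class AgentName {
 public:
  // Take by value, then move into the member: the string buffer is
  // transferred instead of being copied a second time.
  explicit AgentName(std::string name) : name_(std::move(name)) {}

  const std::string& name() const { return name_; }

 private:
  std::string name_;
};

// Read-only parameters are passed by const& to avoid copying.
bool HasPrefix(const std::string& text, const std::string& prefix) {
  return text.compare(0, prefix.size(), prefix) == 0;
}
```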
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: f6c8cbd293]
Changed variable assignments to use std::move() where appropriate.
Changed function headers to pass string arguments by reference where appropriate.
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: ae6851dbb4]
Changed variable assignments to use std::move() where appropriate.
Revert change in amd_kfd_driver.cpp.
Signed-off-by: Alysa Liu <Alysa.Liu@amd.com>
[ROCm/ROCR-Runtime commit: a945b5d493]
Allocated memory was previously not freed when rwlock
initialization failed.
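
A sketch of the cleanup pattern, under the assumption that the
allocation precedes pthread_rwlock_init. Registry, CreateRegistry, and
DestroyRegistry are hypothetical names.

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdlib>

// Hypothetical structure that owns a buffer and a rwlock.
struct Registry {
  pthread_rwlock_t lock;
  int* slots;
};

// Returns a fully initialized Registry, or nullptr on any failure.
Registry* CreateRegistry(std::size_t count) {
  Registry* r = static_cast<Registry*>(std::calloc(1, sizeof(Registry)));
  if (r == nullptr) return nullptr;
  r->slots = static_cast<int*>(std::calloc(count, sizeof(int)));
  if (r->slots == nullptr) {
    std::free(r);
    return nullptr;
  }
  if (pthread_rwlock_init(&r->lock, nullptr) != 0) {
    // Release everything allocated so far before reporting the error.
    std::free(r->slots);
    std::free(r);
    return nullptr;
  }
  return r;
}

void DestroyRegistry(Registry* r) {
  if (r == nullptr) return;
  pthread_rwlock_destroy(&r->lock);
  std::free(r->slots);
  std::free(r);
}
```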
Signed-off-by: Sunday Clement <Sunday.Clement@amd.com>
[ROCm/ROCR-Runtime commit: 293092f32f]