* Fix the amdgpu version string comparison
The intent was to avoid showing the string when it
contains no information.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
* Display the kernel version in amd-smi output
This is an interesting debugging point, especially in the case of
not having a DKMS package installed.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Moving os_kernel_version to static --driver
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
---------
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
* Add support for get and set APIs for CPUISOFreqPolicy and DFCState Control
- Add support for get and set APIs for CPUISOFreqPolicy and DFCState Control
in AMD SMI and also in the CLI tool
* CHANGELOG.md file updated
* SWDEV-562837: Update amdsmi-py-api.md as per the new APIs
Updated amdsmi-py-api.md as per the new APIs added.
---------
Signed-off-by: Soumya <sranjanr@amd.com>
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Co-authored-by: Saka Sitharammurthy <SitharamMurthy.Saka@amd.com>
* Read the ids_flags when fetching GPU info
The ids_flags contains the flags that can help identify if a GPU
is a dGPU or an APU.
* Show correct memory pool for APUs
The kernel policy for APUs will be to choose the bigger pool of
memory (GTT or VRAM) for KFD work. Adjust the policy for the monitor
and default commands to show the right memory pool when using an APU.
* Don't require powercap support
APUs don't necessarily support setting a power cap from sysfs.
Ignore failures caused by the file being missing.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Show edge temperature in default output if hotspot is missing
APUs don't have a hotspot temperature, but they do have an edge
temperature. Use that instead.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Format all "power" keys as watts
There will be more power keys when APU support is added, so format
them properly.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Don't show power limit in output if it's invalid
APUs can't set a power limit using the power_cap1 interface. The limit
will be 0, which makes the default output look odd.
Only add the `/power_limit` if it's valid.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Unify sizes of `amdsmi_power_info_t`
Sizes are used inconsistently. This causes tools to not show
N/A when they should. Make them unified.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
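The APU display changes in this series (memory pool selection, edge temperature fallback, and omitting an invalid power limit) can be sketched as follows. This is a minimal illustration, not the CLI's actual code: the helper names, argument shapes, and output strings are all assumptions.

```python
from typing import Optional, Tuple


def pick_memory_pool(is_apu: bool, vram_bytes: int, gtt_bytes: int) -> Tuple[str, int]:
    """Hypothetical helper: choose which memory pool to report.

    dGPUs report VRAM. APUs report the larger of GTT and VRAM,
    mirroring the kernel policy described in the commit message.
    """
    if is_apu and gtt_bytes > vram_bytes:
        return "GTT", gtt_bytes
    return "VRAM", vram_bytes


def pick_temperature(hotspot: Optional[int], edge: Optional[int]) -> Optional[int]:
    """Hypothetical helper: prefer hotspot, fall back to edge when it is missing."""
    return hotspot if hotspot is not None else edge


def format_power(usage_watts: int, limit_watts: int) -> str:
    """Hypothetical helper: append "/limit" only when the limit is valid (non-zero)."""
    if limit_watts > 0:
        return f"{usage_watts}/{limit_watts} W"
    return f"{usage_watts} W"
```

With these assumptions, an APU whose power limit reads 0 is shown with its usage alone rather than with a trailing `/0`.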
* Fix powercap default to enum for sensor_ind
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
* [SWDEV-559965] Refactor amdsmi set power cap
Modified power cap set to accept args with
optional power_cap type. Added power_cap helper
validate_and_set_power_cap(). Fixed JSON output
format.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
---------
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Changes:
- Fixed `amd-smi` showing:
```console
$ amd-smi
Traceback (most recent call last):
File "/opt/rocm/bin/amd-smi", line 53, in <module>
from amdsmi_init import *
File "/opt/rocm/libexec/amdsmi_cli/amdsmi_init.py", line 38, in <module>
from amdsmi import amdsmi_interface, amdsmi_exception
File "/usr/local/lib/python3.8/dist-packages/amdsmi/__init__.py", line 24, in <module>
from .amdsmi_interface import amdsmi_init
File "/usr/local/lib/python3.8/dist-packages/amdsmi/amdsmi_interface.py", line 5581, in <module>
) -> tuple[int, int]:
TypeError: 'type' object is not subscriptable
```
This was a Python 3.8 issue, which is now resolved by using the
`Tuple[int, int]` annotation from `typing`.
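For context, subscripting the built-in `tuple` in an annotation is only supported from Python 3.9 onward, while `typing.Tuple` also works on 3.8. A minimal sketch of the compatible form, using a hypothetical `split_version` helper rather than the actual function from `amdsmi_interface.py`:

```python
from typing import Tuple


def split_version(version: str) -> Tuple[int, int]:
    """Hypothetical helper: parse "major.minor[.patch]" into (major, minor).

    Annotating the return type as tuple[int, int] would raise
    "TypeError: 'type' object is not subscriptable" on Python 3.8,
    because the annotation is evaluated when the function is defined.
    """
    major, minor = version.split(".")[:2]
    return int(major), int(minor)
```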
* Added Product Serial Number to the raw_bytes cper entries
* Added Product Serial Number to the Python API return
---------
Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: 05ea00dcc4]
* Added Python & C APIs for new node devices. Currently these are functional for node 0 only.
- amdsmi_get_node_handle
- amdsmi_get_npm_info
* Added `amd-smi node` CLI for Node Power Management
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: f8e4771363]
- Updated python integration test to account for PPT1 support changes
- Updated set/reset power-cap input format
- Adjusted python API and updated C++ API test
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Change-Id: Ia9d02868b6e91c88c10a9772d9e2d9f37c3c352f
[ROCm/amdsmi commit: 18faddf6f3]
Allow amdsmi to find libamd_smi.so and librocm-core.so relative to
amdsmi_wrapper.py location.
The amdsmi_wrapper.py file is located in
_rocm_sdk_core/share/amd_smi/amdsmi and the libraries are in
_rocm_sdk_core/lib/libamd_smi.so.26.
_rocm_sdk_core/lib/librocm-core.so.1.
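
A minimal sketch of a lookup relative to the wrapper file, assuming the directory layout described above. The `find_bundled_library` helper and its glob pattern argument are hypothetical and are not taken from amdsmi_wrapper.py:

```python
from pathlib import Path
from typing import Optional


def find_bundled_library(wrapper_file: str, pattern: str) -> Optional[Path]:
    """Hypothetical helper: locate a shared library next to the wrapper.

    Assumes the wrapper lives at <root>/share/amd_smi/amdsmi/<file>
    and the libraries live in <root>/lib.
    """
    # parents[0] = amdsmi, [1] = amd_smi, [2] = share, [3] = <root>
    root = Path(wrapper_file).resolve().parents[3]
    matches = sorted((root / "lib").glob(pattern))
    return matches[0] if matches else None
```

A caller could pass `__file__` together with a pattern such as `libamd_smi.so*`, and fall back to the system search path when the helper returns `None`.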
[ROCm/amdsmi commit: ad20d57162]
- **Added evicted_time metric for kfd processes**.
- Time that queues are evicted on a GPU in milliseconds
- Added to CLI in `amd-smi monitor -q` and `amd-smi process`
- Added to C API and Python API:
- amdsmi_get_gpu_process_list()
- amdsmi_get_gpu_compute_process_info()
- amdsmi_get_gpu_compute_process_info_by_pid()
---------
Signed-off-by: Pryor, Adam <Adam.Pryor@amd.com>
[ROCm/amdsmi commit: 2144cfbba4]
* Update readme doc: amdsmi_get_afids_from_cper() input argument is only bytes, not a list of dicts each with keys “bytes” (List[int]) and “size” (int)
---------
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
[ROCm/amdsmi commit: f7c9fe3011]
Added the following APIs to amdsmi_interface.py:
amdsmi_get_cpu_handle()
amdsmi_get_esmi_err_msg()
amdsmi_get_gpu_event_notification()
amdsmi_get_processor_count_from_handles()
amdsmi_get_processor_handles_by_type()
amdsmi_gpu_validate_ras_eeprom()
amdsmi_init_gpu_event_notification()
amdsmi_set_gpu_event_notification_mask()
amdsmi_stop_gpu_event_notification()
amdsmi_get_gpu_busy_percent()
Added an additional return value to amdsmi_get_xgmi_plpd().
The `policies` entry is added to the end of the dictionary to match the API definition.
The `plpds` entry is marked for deprecation because it carries the same information as `policies`.
---------
Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: 7decbc67a1]
- Changed amd-smi static --vbios to accept ifwi
- Change population logic for vbios version API
- Added IFWI boot_firmware to the CLI, C++, Rust, and Python API
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I4ea504d40a43cfb011ab38fc9a664ecf12d39c8a
[ROCm/amdsmi commit: cd21b5edcc]
* [SWDEV-531904] Unit and Integ Test Updates
Updated: unit_tests.py
- Removed redundant self.setUp() and self.tearDown() calls.
- Removed test_free_name_value_pairs() since it is internal only.
Updated: integration_test.py
- Added logic to set AMDSMI_CLI_PATH from environment or default.
- Raise FileNotFoundError if path does not exist.
- Append CLI path to sys.path and handle ImportError with a clear message.
- Removed redundant @handle_exceptions function decorator.
- Removed redundant self.setUp() and self.tearDown() calls.
Updated: amdsmi_interface.py
- Removed POINTER conversion in amdsmi_get_gpu_pm_metrics_info() and amdsmi_get_gpu_reg_table_info()
All tests pass/skip
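
The CLI path handling added to integration_test.py can be sketched roughly as below. This is an illustration under assumptions: the helper name, the default path, and the error message are hypothetical and are not copied from the test file.

```python
import os
import sys

# Assumed default location; the real default in integration_test.py may differ.
DEFAULT_CLI_PATH = "/opt/rocm/libexec/amdsmi_cli"


def add_cli_to_sys_path(default: str = DEFAULT_CLI_PATH) -> str:
    """Hypothetical helper: resolve AMDSMI_CLI_PATH and make it importable."""
    cli_path = os.environ.get("AMDSMI_CLI_PATH", default)
    if not os.path.isdir(cli_path):
        raise FileNotFoundError(f"amd-smi CLI path does not exist: {cli_path}")
    if cli_path not in sys.path:
        sys.path.append(cli_path)
    return cli_path
```

Imports from the CLI directory would then be wrapped in a `try`/`except ImportError` that reports which path was searched.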
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
* Update tests/python_unittest/integration_test.py
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>
* Review Update 1
Modified: integration_test.py
- Added logic to properly loop through firmware list and display each name and version
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
* Skip xgmi_err tests + improve running output
Changes:
1. Now check for elevated permissions
2. Skip xgmi_error related SYSFS tests, refer to xgmi_read_write.cc
(both are skipped)
3. Added a list of tests and a summary of the additional output
provided
Change-Id: Iefc85c270faad89c625e2bd7af397d24faed2437
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
---------
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: 67eb541c15]
- **Fixed gpuboard and baseboard temperatures enums in amdsmi Python Library**.
- AmdSmiTemperatureType had issues with referencing the right attribute, so we removed the following duplicate enums:
- `AmdSmiTemperatureType.GPUBOARD_NODE_FIRST`
- `AmdSmiTemperatureType.GPUBOARD_VR_FIRST`
- `AmdSmiTemperatureType.BASEBOARD_FIRST`
Change-Id: Ia61446b593bd9182d597c4b4c2ac3c5ffdae7493
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: 286c421a49]
Changes:
- This aligns back to the original struct naming for ROCm 7.0 and removes
any major ABI breakage from updates for the 7.0 release.
- A minor ABI breakage is required since there were additions to the
header. Refer to the changelog for these updates.
Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: d3b73fac82]
Changes:
- Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
- Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
(Violation Status is the first example of this in monitor)
- Improve CLI monitor output:
support multiple GPU lines per GPU, add new columns, and better formatting
- Refactor helpers and logger for flexible unit formatting and table rendering
- Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
new metrics APIs in C++ example
- Sync Python/C++ interface and structures for new metrics fields and naming
- Remove deprecated/unused RSMI activity APIs; no documentation is needed since
the APIs no longer exist in ROCm SMI either.
- Cleanup metric violations + fix handle watch arguments
- Provide better handling/doc for average_flattened_ints()
- Group xcp metrics with brackets in human readable + adjust output size
Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
[ROCm/amdsmi commit: e2e4fc65c1]
Description:
- Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
- Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
- Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
- Enhanced CLI and test cases to allow users to control when the driver reload occurs.
- Updated documentation and changelog to reflect the new driver reload process.
- Improved error handling and logging for driver reload operations.
- Added progress bar and user confirmation prompts for driver reload commands.
* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
- systemctl is typically not enabled in Docker.
- This is also an edge case, where the GPU is in use by an active process, etc. for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions
---------
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: d24dc7ef89]
Add support to query UBB/OAM temperature.
* Updated Python API with new temperature metrics enum
---------
Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
[ROCm/amdsmi commit: abd3c02a3c]
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
[ROCm/amdsmi commit: 645c313f00]
The bug was reproduced like this.
In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done
Terminal #1 starts dumping the CPER entry information that it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
Terminal #3 produces no output, even while terminal #1 continues capturing and printing information.
The fix:
More than 20 CPER entries are already available in the GPU buffer when the command is run from terminal #3. That command starts capturing from the beginning and passes 20 buffers to copy entries into, so the C++ API returns a code saying there is more data available.
The Python CLI should not treat this as an error, but should continue to print what the API returned.
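A minimal sketch of that handling, assuming hypothetical status constants and a hypothetical `collect_cper_entries` helper. The real status values and call signatures come from the amdsmi library and are not reproduced here:

```python
from typing import List, Tuple

# Hypothetical status codes, used only for illustration.
STATUS_SUCCESS = 0
STATUS_MORE_DATA = 1
STATUS_ERROR = 2


def collect_cper_entries(status: int, entries: List[bytes]) -> Tuple[List[bytes], bool]:
    """Return (entries to print, whether another fetch is needed)."""
    if status == STATUS_SUCCESS:
        return entries, False
    if status == STATUS_MORE_DATA:
        # Not an error: the supplied buffers were filled and more entries remain.
        return entries, True
    raise RuntimeError(f"CPER query failed with status {status}")
```

Under this sketch the CLI prints the returned entries in both non-error cases and simply queries again when the second value is `True`.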
---------
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
[ROCm/amdsmi commit: 5b95d227bc]
* Added copyrights
* Fixed type hinting for processor_handle in python_interface
* Fixed incorrect type hints to match the actual return types
---------
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Change-Id: Ie2a09acf628ed0c43eacc8ec78c159d125acbcdb
[ROCm/amdsmi commit: 23b9da656c]
* Adjusted help text
* Adjusted --afid to run only with --cper-file
* Fixed interface return error
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I2b96f4515c85f3b9dd84ba5c2d819729a997141b
[ROCm/amdsmi commit: ac63f410c2]
The accumulated xgmi read and write data in the gpu metrics is indexed
according to the sysfs xgmi_port_num file. Mapped the two so that
read and write are displayed per src_gpu vs. dst_gpu.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
[ROCm/amdsmi commit: 8ed52616ad]
The xgmi command was showing the PCIe bit rate and bandwidth instead of the xgmi values. Corrected the API to get xgmi data from the gpu metrics.
Added a Python API for amdsmi_get_link_metrics. Modified the amdsmi_link_metrics struct.
Added check to confirm non zero partition got xgmi command.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
[ROCm/amdsmi commit: 2eff0b3764]
* Add the API and CLI to show the board voltage.
---------
Change-Id: Icb25bd653bb1d004704b5a21b378ca31b2b242c7
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
[ROCm/amdsmi commit: 970560fc7c]