Added bad_page_threshold_exceeded field to ras, which
compares retired pages count against bad page threshold.
This field displays True if retired pages exceed the
threshold, False if within threshold, or N/A if
threshold data is unavailable.
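
The three-way result described above can be sketched as follows. This is a minimal illustration only: the function name and the use of non-integer values to signal missing data are assumptions, not the actual amd-smi implementation. A count equal to the threshold is treated as within threshold, since the field reports True only when the count exceeds it.

```python
def bad_page_threshold_exceeded(retired_pages, threshold):
    """Return True/False, or "N/A" when either value is unavailable.

    Hypothetical helper: both inputs are assumed to be ints when the
    data is available and anything else (e.g. "N/A") when it is not.
    """
    if not isinstance(retired_pages, int) or not isinstance(threshold, int):
        return "N/A"
    # Only a count strictly above the threshold counts as exceeded.
    return retired_pages > threshold
```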
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Co-authored-by: Arif, Maisam <Maisam.Arif@amd.com>
* Added gpu-board and base-board temperatures to amd-smi metric
* Updated Changelog and adjusted the metric base-board/gpu-board output
* Adjusted output of metric to hide base/gpu-board when not relevant
---------
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
* Filter out amd-smi from process detection
* Fixed "N/A" process names incorrectly losing the "N/" prefix for processes that require elevated permissions
Signed-off-by: adapryor <Adam.pryor@amd.com>
* Update amd-smi doc with examples of CPER and AFID API usage.
---------
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Changes:
- Fixes amd-smi monitor invocations such as:
amd-smi monitor -Vqt, amd-smi monitor -g 0 -Vqt -w 1
amd-smi monitor -Vqt --file /tmp/test1, ...
- Required moving the point at which process is called, so that xcp
information is gathered in the format expected by monitor
- Requires process to be appended first with the gpu data; xcp
info is then gathered and added after the 1st device
Change-Id: I76356a4610944f633a9530970fac66556d65bf11
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Impacts only Guest systems
Fixes following error:
$ amd-smi monitor
AttributeError: 'Namespace' object has no attribute 'violation'
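
This message does not describe how the error was fixed. One common defensive pattern for an argument that is not registered on every platform is to read it with a default. The sketch below assumes a hypothetical helper name and is not the actual amd-smi code.

```python
import argparse


def wants_violation(args):
    """Return True only if a 'violation' flag exists and is set.

    Hypothetical helper: getattr() with a default avoids the
    AttributeError when the argument was never added to the parser.
    """
    return bool(getattr(args, "violation", False))
```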
Change-Id: If501819be3f8e2d2dfd75775dc776873a92465a3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Changes:
- This realigns with the original struct naming for ROCm 7.0 and avoids
any major ABI breakage in the 7.0 release.
- A minor ABI breakage is still required since there were additions to the
header. Refer to the changelog for these updates.
Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
The xgmi_metrics_info variable was being referenced before
assignment when no destination GPUs were found or when the API
call failed. This caused an UnboundLocalError. Fixed this by
initializing xgmi_metrics_info with an empty links structure.
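
A minimal sketch of this kind of fix, assuming a hypothetical function name, a caller-supplied `query` callable, and a `{"links": []}` shape for the empty structure (none of these are taken from the actual code):

```python
def get_xgmi_metrics(dest_gpus, query):
    """Return XGMI metrics, falling back to an empty links structure."""
    # Bind the name up front so it exists on every code path.
    xgmi_metrics_info = {"links": []}
    try:
        if dest_gpus:
            xgmi_metrics_info = query(dest_gpus)
    except RuntimeError:
        # API call failed: keep the empty default instead of leaving
        # the variable unassigned.
        pass
    return xgmi_metrics_info
```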
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Changes:
- Fix for any set without CPU loaded (ex.):
sudo /opt/rocm/bin/amd-smi set -o 250
AttributeError: 'Namespace' object has no attribute 'core_boost_limit'
- Fix for recent changes to memory partition sets
Needed to account for permission denied so that it displays not supported.
EACCES == *_STATUS_PERMISSION, but in this case we need to show
NOT_SUPPORTED
Change-Id: Ie00bbb34d01adfe38300f1ac4c1620d78885b9b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Changes:
- Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
- Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
(Violation Status is the first example of this in monitor)
- Improve CLI monitor output:
support multiple GPU lines per GPU, add new columns, and better formatting
- Refactor helpers and logger for flexible unit formatting and table rendering
- Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
new metrics APIs in C++ example
- Sync Python/C++ interface and structures for new metrics fields and naming
- Remove deprecated/unused RSMI activity APIs, documentation not needed since
the APIs no longer exist in ROCm SMI either.
- Cleanup metric violations + fix handle watch arguments
- Provide better handling/doc for average_flattened_ints()
- Group xcp metrics with brackets in human readable + adjust output size
Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
* Enabled and updated set CPU APIs from CLI
* Fix sets not working consistently across devices + string/int comparison
Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Deepak Mewar <deepak.mewar@amd.com>
Description:
- Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
- Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
- Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
- Enhanced CLI and test cases to allow users to control when the driver reload occurs.
- Updated documentation and changelog to reflect the new driver reload process.
- Improved error handling and logging for driver reload operations.
- Added progress bar and user confirmation prompts for driver reload commands.
* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
- Systemctl is typically not enabled on docker.
- And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions
---------
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
The amd-smi command shows only the executable name of a
process by stripping the absolute path. This caused "N/A"
process names to be incorrectly displayed as "A" in the
output. Corrected this.
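
The failure mode is easy to reproduce: a basename operation treats the slash in "N/A" as a path separator. The sketch below shows one way to guard against it; the helper name is hypothetical and the actual fix may differ.

```python
import os


def display_name(process_name):
    """Strip the directory part of a process path, but leave "N/A" alone."""
    if process_name == "N/A":
        # os.path.basename("N/A") would return "A".
        return process_name
    return os.path.basename(process_name)
```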
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Changes:
- Modified outputs for amd-smi set/reset when in partitions
to display error codes
- Provided some general cleanup for the above ^
----------------------------------------------------
- Updated `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
allow setting power cap to values other than 0, provided the current power cap is not 0.
- Modified power_cap_read_write.cc:
- Added a check to ensure that the power cap can only be set to non-zero values if the current
power cap is not 0.
- Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7
---------
Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Disabled violation argument for monitor on guests as it is supported on BM only.
* Added `-v` and `--violation` args to metric along with `throttle` due to legacy behavior.
* Suppressed the metric throttle arg so it is not shown in help text
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Modified the file used to fetch process name so that complete name with path can be displayed.
Changes:
amd-smi monitor -q
- human readable format will output only the process name
- csv and json formats will print the full path
amd-smi process
- name will always be the full path to the process
amd-smi (default output)
- name will always be truncated.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
* [SWDEV-537852] Update process name help text
Currently the process name displays N/A if it needs elevated
permissions. Updated the help text of the default amd-smi, process,
and monitor commands to mention the elevated permission requirement.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Updates:
- Separate extra API calls from amd-smi CLI to target specific CLI commands that need them.
- Remove extra current_compute_partition SYSFS calls from amd-smi static.
- Remove the partition information from the default `amd-smi static` CLI command.
- Users must now use the `-p` argument to view partition information with `amd-smi static`.
- The help text for the `partition` argument has been updated to reflect this change.
- The partition information can still be accessed using the `amd-smi partition -c -m` or `sudo amd-smi partition -a` commands.
---------
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
* fix links in amdsmi_cli/README.md
* fix xrefs to install docs
* rm rocm-smi examples and add cli tutorial
* rm disclaimer and add amd smi contributing guidelines to index
Signed-off-by: Peter Park <Peter.Park@amd.com>
The bug was reproduced like this.
In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done
Terminal #1 starts dumping the CPER entry information that it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
Terminal #3 produces no output, even while terminal #1 continues capturing and printing information.
The fix:
More than 20 CPER entries are already available in the GPU buffer. When the command in terminal #3 starts capturing from the beginning and passes 20 buffers to copy entries into, the C++ API returns a code saying that more data is available.
The Python CLI should not treat this as an error, but should continue to print what the API returned.
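
The intended handling can be sketched as a loop that treats "more data available" as a normal result. The status constants, the `fetch` callable, and its `(status, entries, next_cursor)` return shape are all hypothetical stand-ins for illustration and do not reflect the real amdsmi API.

```python
STATUS_SUCCESS = 0     # hypothetical: all available entries were returned
STATUS_MORE_DATA = 1   # hypothetical: buffers filled, more entries remain


def drain_cper_entries(fetch, batch_size=20):
    """Collect entries batch by batch until the API reports no more data.

    `fetch(cursor, batch_size)` is an assumed callable returning
    (status, entries, next_cursor).
    """
    entries = []
    cursor = 0
    while True:
        status, batch, cursor = fetch(cursor, batch_size)
        if status not in (STATUS_SUCCESS, STATUS_MORE_DATA):
            raise RuntimeError(f"CPER fetch failed with status {status}")
        # Keep whatever was returned, even when more data is pending.
        entries.extend(batch)
        if status == STATUS_SUCCESS:
            return entries
```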
---------
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
Topology numa_bw is set to N/A for non-XGMI links.
The recent change in the link_type enum mapping caused this
condition to check for PCIE instead of XGMI. Corrected
this.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
The topology method calls amdsmi_topo_get_p2p_status repeatedly
for the same GPU pairs across different table sections,
significantly impacting performance with 60+ GPUs. Reduced the
number of calls by implementing result caching.
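
A minimal sketch of per-pair result caching, assuming the lookup is keyed on the ordered (source, destination) pair. The `query` callable stands in for the real amdsmi_topo_get_p2p_status call, and the wrapper name is hypothetical.

```python
def make_cached_p2p_lookup(query):
    """Wrap a P2P status query so each GPU pair is queried only once."""
    cache = {}

    def lookup(src, dst):
        key = (src, dst)
        if key not in cache:
            # First request for this pair: call the underlying API.
            cache[key] = query(src, dst)
        return cache[key]

    return lookup
```

Each table section can then call `lookup(src, dst)` freely; repeated requests for the same pair are served from the dictionary.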
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>