Commit Graph

472 Commits

Author SHA1 Message Date
Kanangot Balakrishnan, Bindhiya edaae978a2 [SWDEV-553557] Add bad_page_threshold_exceeded to RAS (#677)
Added a bad_page_threshold_exceeded field to ras, which
compares the retired page count against the bad page threshold.
This field displays True if the retired pages exceed the
threshold, False if they are within the threshold, or N/A if
threshold data is unavailable.
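
The three-way result described above can be sketched as follows. This is only an illustration, not the code from the commit: the function name, its parameters, and the use of `None` to mean "data unavailable" are assumptions.

```python
from typing import Optional, Union


def bad_page_threshold_exceeded(
    retired_pages: Optional[int], threshold: Optional[int]
) -> Union[bool, str]:
    """Compare a retired page count against a bad page threshold.

    Hypothetical helper: None stands in for "data unavailable".
    """
    if retired_pages is None or threshold is None:
        # Without both values the comparison is meaningless.
        return "N/A"
    # True only when the count is strictly above the threshold.
    return retired_pages > threshold
```

With these assumptions, a count equal to the threshold still reports False, since the message says True is shown only when retired pages exceed the threshold.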

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Co-authored-by: Arif, Maisam <Maisam.Arif@amd.com>
2025-09-09 09:15:37 -05:00
AL Musaffar, Yazen 4a8ee27225 [SWDEV-545894] Folder name defaulting to lower case fix (#611)
* Folder name defaulting to lower case

* Update amdsmi_cli/amdsmi_cli.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>

* Fixed Based On Comments

* Remove unused variable 'skip_next'

Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>

---------

Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
Co-authored-by: yalmusaf_amdeng <yalmusaf@amd.com>
Co-authored-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-09-07 20:38:29 -05:00
Maisam Arif 2c9f3af026 [SWDEV-540665] Change parser to not accept 0 as a power set input
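
One way a parser can refuse 0 for a power set input is an argparse `type` callback, sketched below. The helper name `positive_int` and the parser's `prog` name are hypothetical, and rejecting negative values as well is an assumption; the commit only states that 0 is no longer accepted.

```python
import argparse


def positive_int(text: str) -> int:
    """argparse type callback that accepts only integers greater than 0."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not an integer") from None
    if value <= 0:
        # 0 is rejected per the commit title; negatives are an assumption.
        raise argparse.ArgumentTypeError("power cap must be greater than 0")
    return value


# Stand-in parser; the real CLI parser is far larger.
parser = argparse.ArgumentParser(prog="set-sketch")
parser.add_argument("-o", "--power-cap", type=positive_int)
```

When the callback raises `ArgumentTypeError`, argparse prints a usage error and exits, so the invalid value never reaches the set logic.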
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I510fa5523b8dd7ea33f49e21cc199d4a2cfcf9bb
2025-08-29 04:18:36 -05:00
gabrpham_amdeng 39b26104d4 reverted help formatting column width to 80
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-28 11:30:24 -05:00
Oosman Saeed 594d5ce8ee [SWDEV-546239] Match amdsmi output with host output
2025-08-27 18:41:59 -05:00
Maisam Arif 978fad01d2 [SWDEV-544299] Fix CLI prefix for amd-smi metric -G
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic184ec824213421388356417e713d9ed5adeddeb
2025-08-27 18:08:06 -05:00
Pham, Gabriel b13fc16d60 Added gpuboard and baseboard temperatures to amd-smi metric (#617)
* Added gpu-board and base-board temperatures to amd-smi metric
* Updated Changelog and adjusted the metric base-board/gpu-board output
* Adjusted output of metric to hide base/gpu-board when not relevant

---------

Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-26 12:49:56 -05:00
Maisam Arif e030f71229 [SWDEV-540665] Power cap on 1VF cli parsing fix
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I5aac8f820fd8ae1c6c1dbae3b5b9e69018c69452
2025-08-22 15:22:44 -05:00
Oosman Saeed dee18e9fb4 continue to process all entries
2025-08-21 23:37:24 -05:00
gabrpham_amdeng 71c8b92076 [SWDEV-549373] Added vbios and pldm information to version header and adjusted platform info display
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-21 18:16:47 -05:00
gabrpham_amdeng 5aae1a31fa Added Version Header to all Help Sections
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-21 17:17:16 -05:00
Pryor, Adam 4ac1c7e453 [SWDEV-540665] Fix power_caps in help text (#642)
Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-08-21 16:45:37 -05:00
Maisam Arif 074c4b7a3f Fix spelling and incorrect error references
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I23e947a0cfd4f68067f9fca703574f44680163d4
2025-08-21 12:36:43 -05:00
Pryor, Adam ad29de4238 [SWDEV-525336] Filter out amd-smi process itself from detection (#638)
* Filter out amd-smi from process detection
* Fixed N/A stripping N/ incorrectly from running elevated processes
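
A minimal sketch of how path stripping can mangle the `N/A` placeholder, and one possible guard. The function name is hypothetical and this is not the code from the commit.

```python
import os


def display_process_name(name: str) -> str:
    """Strip the directory part of a process path, leaving "N/A" intact."""
    if name == "N/A":
        # The placeholder contains "/", so basename would treat "N" as a
        # directory and return only "A".
        return name
    return os.path.basename(name)
```

The guard is needed because `os.path.basename` splits on the last `/` regardless of whether the string is a real path.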

Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-08-21 11:41:03 -05:00
Saeed, Oosman fd5e37a07e [SWDEV-546239] amd-smi ras cper - no data created (#614)
* Update amd-smi doc with examples of CPER and AFID API usage.

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-20 11:27:41 -05:00
AL Musaffar, Yazen e84e364b35 [SWDEV-549789] Removed incorrect CPER AFID references (#619)
* Fix for afid help
* Update amdsmi_parser.py

Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-08-19 18:55:33 -05:00
Pham, Gabriel c0ea186d47 [SWDEV-446394] Updated error message for setting clock limit (#633)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-19 18:51:49 -05:00
Poag, Charis 1b2edd70bd [SWDEV-550355] Fix process + violation output when in partitions (#623)
Changes:
  - Fixes amd-smi monitor invocations such as:
    amd-smi monitor -Vqt, amd-smi monitor -g 0 -Vqt -w 1
    amd-smi monitor -Vqt --file /tmp/test1, ...
  - Required moving where process is called, so that xcp
    information is gathered in the format monitor expects
  - Requires process data to be appended first with the gpu data; xcp
    info is then gathered and added after the 1st device

Change-Id: I76356a4610944f633a9530970fac66556d65bf11
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-19 18:50:51 -05:00
Charis Poag 5fe58a8e38 [SWDEV-550679] Fix amd-smi monitor AttributeError
Impacts only Guest systems

Fixes following error:
$ amd-smi monitor
AttributeError: 'Namespace' object has no attribute 'violation'
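
This kind of error appears when code reads an attribute from an argparse `Namespace` for an option that was never registered on the current platform. A minimal sketch of a defensive read follows; the parser, the `--watch` option, and the default value are stand-ins, and the actual fix in the commit may differ.

```python
import argparse

# Stand-in parser for a platform where "--violation" is not registered.
parser = argparse.ArgumentParser(prog="monitor-sketch")
parser.add_argument("--watch", type=int, default=None)
args = parser.parse_args([])

# args.violation would raise AttributeError here; getattr with a default
# treats a missing option as "not requested".
violation = getattr(args, "violation", False)
```

An alternative with the same effect is to always register the option and hide or ignore it on unsupported platforms.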

Change-Id: If501819be3f8e2d2dfd75775dc776873a92465a3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-19 17:58:44 -05:00
Bindhiya Kanangot Balakrishnan 41488f0c18 [SWDEV-547160] Fix VRAM percentage calculation
The vram_percent calculation was missing
multiplication by 100.
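
For illustration, the corrected arithmetic looks roughly like this. The function name, parameter names, and the zero-total guard are assumptions, not taken from the commit.

```python
def vram_percent(used: float, total: float) -> float:
    """Return VRAM usage as a percentage of total VRAM."""
    if total <= 0:
        # Assumed guard: avoid dividing by zero when the total is unknown.
        return 0.0
    # used / total alone is a fraction in [0, 1]; multiply by 100 for percent.
    return used / total * 100
```

Without the multiplication, 4096 of 16384 would be reported as 0.25 instead of 25.0.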

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-18 17:28:30 -05:00
Arif, Maisam 2d5accd000 [SWDEV-540665] Add power_cap set to Linux Guest (#626)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I3c8d707681c141390b40521231e0d638c81cdeaf
2025-08-18 14:59:14 -05:00
Charis Poag d3b73fac82 Revert Major ABI break for amdsmi_get_violation_status()
Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.

Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-18 11:36:57 -05:00
Maisam Arif c8d0e5c497 [SWDEV-549831] Fixed file outputs not printing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I56b792256c30d618d59d2d40faf5fa0f1c2c4dc6
2025-08-14 11:08:49 -05:00
Bindhiya Kanangot Balakrishnan f0453c2c75 [SWDEV-543308] Fix xgmi_metrics_info initialization in xgmi
The xgmi_metrics_info variable was being referenced before
assignment when no destination GPUs were found or when the API
call failed. This caused an UnboundLocalError. Fixed this by
initializing xgmi_metrics_info with empty links structure.
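
The pattern behind this fix can be sketched as below. The function and its `fetch_link_metrics` callback are hypothetical, and the exact shape of the empty links structure is an assumption.

```python
def collect_xgmi_metrics(dest_gpus, fetch_link_metrics):
    """Return link metrics, falling back to an empty structure.

    fetch_link_metrics is a hypothetical callable standing in for the
    real API call; it may raise RuntimeError on failure.
    """
    # Bind the name before the loop so it exists even when the loop body
    # never runs or every call fails.
    xgmi_metrics_info = {"links": []}
    for gpu in dest_gpus:
        try:
            xgmi_metrics_info = fetch_link_metrics(gpu)
        except RuntimeError:
            continue
    return xgmi_metrics_info
```

If the initial assignment were missing, an empty `dest_gpus` list would reach the `return` with the name unbound, which is the UnboundLocalError the message describes.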

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-07 16:19:10 -05:00
Charis Poag e7964cda49 Fix amd-smi sets attribute error & memory partition sets
* Changes:
- Fix for any set without CPU loaded (ex.):
sudo /opt/rocm/bin/amd-smi set -o 250
AttributeError: 'Namespace' object has no attribute 'core_boost_limit'

- Fix for recent changes to memory partition sets
  Permission-denied errors needed to be displayed as not supported.
  EACCES == *_STATUS_PERMISSION, but in this case we need to show
  NOT_SUPPORTED

Change-Id: Ie00bbb34d01adfe38300f1ac4c1620d78885b9b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-07 16:09:56 -05:00
Poag, Charis e2e4fc65c1 [SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)
Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-08-06 16:03:06 -05:00
62d92968791937c6480e7d49e40bec15_amdeng 1dedeac4e3 [SWDEV-539532] Enabled and updated set CPU APIs from CLI (#513)
* Enabled and updated set CPU APIs from CLI
* Fix sets not working consistently across devices + string/int comparison

Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Deepak Mewar <deepak.mewar@amd.com>
2025-08-06 12:52:35 -05:00
Maisam Arif 81ca193477 Default output driver string truncation
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I88b78b1cb9712f9fee4f94a54811f8f702d4d920
2025-08-06 10:40:37 -05:00
Poag, Charis d24dc7ef89 [SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582)
Description:
  - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
  - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
  - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
  - Enhanced CLI and test cases to allow users to control when the driver reload occurs.
  - Updated documentation and changelog to reflect the new driver reload process.
  - Improved error handling and logging for driver reload operations.
  - Added progress bar and user confirmation prompts for driver reload commands.

* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
  - Systemctl is typically not enabled on docker.
  - And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-05 20:44:28 -05:00
Pham, Gabriel fc5ea762b3 Added Platform Information to Default Command (#553)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:11:42 -05:00
Pryor, Adam 2dc2e12a97 Documentation updates for AMDSMI_GPU_METRICS_CACHE_MS (#564)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-05 19:58:37 -05:00
AL Musaffar, Yazen 27cae85910 [SWDEV-544092] Fix Navi process float conversion (#579)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-08-04 14:40:18 -05:00
Bindhiya Kanangot Balakrishnan b16a66b2c5 [SWDEV-525336] Fix N/A process name display
The amd-smi command shows only the executable
name of a process by stripping the absolute path. This
caused "N/A" process names to incorrectly display as
"A" in the output. Corrected this.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-04 13:51:42 -05:00
Kanangot Balakrishnan, Bindhiya 27a1705d96 [SWDEV-537852] Update compute-partition set error messages (#505)
[SWDEV-537852] Update compute-partition set error messages

Setting compute partition needs sudo privileges. Added
AmdSmiPermissionDeniedException to display CLI elevated
permission errors.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-01 08:22:22 -05:00
Bindhiya Kanangot Balakrishnan 449839a32e [SWDEV-537852] Update help text for InvalidParameterValueException
Updated the help text to display command name.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 10:47:13 -05:00
Kanangot Balakrishnan, Bindhiya 6f7b397998 [SWDEV-537852] Update help and error text (#518)
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 09:06:22 -05:00
Poag, Charis ec055f2c2d [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated the `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-07-22 17:21:15 -05:00
Bindhiya Kanangot Balakrishnan 645c313f00 [SWDEV-543308] Revert amdsmi_link_metrics structure change
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct, as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-14 13:56:26 -05:00
Maisam Arif 10f9aae0b3 Reduced calls to drm devinfo for getting virtualization_mode
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I22a6a9ca15131b37a775e8d4f595fb13c0b043c7
2025-07-11 12:26:42 -05:00
Kanangot Balakrishnan, Bindhiya f6b854b4ed [SWDEV-541289] Update violation argument in amd-smi (#526)
* Disabled the violation argument for monitor on guests, as it is supported on BM only.
* Added `-v` and `--violation` args to metric along with `throttle` due to legacy behavior.
	* Suppressed the metric throttle arg so that it is not shown in help text

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-09 16:38:09 -05:00
Kanangot Balakrishnan, Bindhiya 514517e536 [SWDEV-539721] Show complete process name (#536)
Modified the file used to fetch process name so that complete name with path can be displayed.

Changes:
amd-smi monitor -q
- human readable format will output only the process name
- csv and json formats will print the full path

amd-smi process
- name will always be the full path to the process

amd-smi (default output)
- name will always be truncated.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-09 16:34:39 -05:00
AL Musaffar, Yazen 01a6158c85 [SWDEV-532904] CLI lists unusable UUID without sudo (#510)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-07-09 15:45:03 -05:00
josnarlo 0257140504 [SWDEV-536953] Align Power Cap Behavior with ROCM_SMI
Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
2025-07-09 15:37:40 -05:00
Kanangot Balakrishnan, Bindhiya ce230efaaa [SWDEV-537852] Update process name help text (#517)
* [SWDEV-537852] Update process name help text

Currently the process name displays N/A if reading it needs elevated
permissions. Updated the help text of the default amd-smi, process, and monitor
commands to state the elevated permission requirement.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-07 11:26:10 -05:00
Poag, Charis 88473b7fd0 [SWDEV-533305] Remove partition info from amd-smi static (-p/--partition still available) + CLI API call cleanup (#529)
Updates:
- Separate extra API calls from amd-smi CLI to target specific CLI commands that need them.
- Remove extra current_compute_partition SYSFS calls from amd-smi static.
- Remove the partition information from the default `amd-smi static` CLI command.
- Users must now use the `-p` argument to view partition information with `amd-smi static`.
- The help text for the `partition` argument has been updated to reflect this change.
- The partition information can still be accessed using the `amd-smi partition -c -m` or `sudo amd-smi partition -a` commands.

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-07-07 11:21:46 -05:00
Park, Peter 8039ab9449 Fix links in docs (#532)
* fix links in amdsmi_cli/README.md
* fix xrefs to install docs
* rm rocm-smi examples and add cli tutorial
* rm disclaimer and add amd smi contributing guidelines to index

Signed-off-by: Peter Park <Peter.Park@amd.com>
2025-07-07 11:18:40 -05:00
Saeed, Oosman 5b95d227bc [SWDEV-538308] CPER CLI 20 limit bug (#499)
The bug was reproduced like this.

In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done

Terminal #1 starts dumping the CPER entry information that it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

From terminal #3, there will be no output, even when terminal #1 continues capturing and printing information.

The fix:

Since we already have more than 20 CPER entries available in the GPU buffer, when we run the command from terminal #3 to start capturing from the beginning and pass 20 buffers to copy entries to, the C++ API returns a code saying there is more data available.

The Python CLI should not treat this as an error, but should continue to print what the API returned.
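
The behavior described in the fix can be sketched as follows. The `Status` enum and the function name are hypothetical stand-ins; the real status codes and CLI code paths are not shown in this log.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical stand-ins for the status codes returned by the API."""
    SUCCESS = 0
    MORE_DATA = 1
    ERROR = 2


def entries_to_print(status: Status, entries: list) -> list:
    """Return the entries the CLI should print for a given status."""
    if status in (Status.SUCCESS, Status.MORE_DATA):
        # MORE_DATA means the supplied buffers were filled and further
        # entries remain; what was returned is still valid output.
        return list(entries)
    raise RuntimeError(f"CPER retrieval failed: {status.name}")
```

Under this sketch, a caller that passes 20 buffers and receives `MORE_DATA` prints those 20 entries and can then request the next batch, instead of producing no output.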

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-07-07 11:11:13 -05:00
gabrpham_amdeng a2885d6e70 [SWDEV-539451] Adjusted reset command to prevent reset on partitions
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-07-03 01:11:46 -05:00
Bindhiya Kanangot Balakrishnan fa9ca21520 [SWDEV-540014] Correct topology link_type check
Topology numa_bw checks for non-xgmi links to set as N/A.
The recent change in link_type enum mapping caused this
condition to check for PCIE instead of XGMI. Corrected
the same.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-30 14:01:19 -05:00
Bindhiya Kanangot Balakrishnan c3453f7c97 [SWDEV-530646] Reduce amdsmi_topo_get_p2p_status calls in topology
The topology method calls amdsmi_topo_get_p2p_status repeatedly
for the same GPU pairs across different table sections,
significantly impacting performance with 60+ GPUs. Reduced this
by implementing result caching.
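
Result caching of this kind can be sketched with a dictionary keyed by GPU pair. The wrapper name and the `query` callback are hypothetical; `query` stands in for the real `amdsmi_topo_get_p2p_status` call, whose signature is not reproduced here.

```python
def make_cached_p2p_lookup(query):
    """Wrap a pairwise query so each (src, dst) pair is queried only once.

    query is a hypothetical callable taking two GPU identifiers.
    """
    cache = {}

    def lookup(src, dst):
        key = (src, dst)
        if key not in cache:
            # First request for this pair: call through and remember it.
            cache[key] = query(src, dst)
        return cache[key]

    return lookup
```

Each table section can then call `lookup(src, dst)` freely; the number of underlying calls is bounded by the number of distinct pairs rather than by pairs times sections.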

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-24 11:27:28 -05:00