Commit Graph

1921 Commits

Author SHA1 Message Date
Maisam Arif f732ee4e98 Fix spelling and incorrect error references
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I23e947a0cfd4f68067f9fca703574f44680163d4


[ROCm/amdsmi commit: 074c4b7a3f]
2025-08-21 12:36:43 -05:00
Pryor, Adam 5e4a23dd01 [SWDEV-525336] Filter out amd-smi process itself from detection (#638)
* Filter out amd-smi from process detection
* Fixed "N/A" incorrectly having "N/" stripped for running elevated processes

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: ad29de4238]
2025-08-21 11:41:03 -05:00
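
The self-filtering described in the commit above can be sketched as follows. This is a minimal illustration, not the amd-smi implementation: the `filter_own_process` helper and the list-of-dicts process records with a `pid` key are assumptions made for the example.

```python
import os


def filter_own_process(processes, own_pid=None):
    """Return the process records whose pid differs from the caller's pid.

    `processes` is a hypothetical list of dicts with a "pid" key; the real
    tool's data layout may differ.
    """
    if own_pid is None:
        own_pid = os.getpid()  # pid of the tool doing the detection
    return [proc for proc in processes if proc.get("pid") != own_pid]


# Illustrative records: the second entry stands in for the tool itself.
sample = [
    {"pid": 1234, "name": "python3"},
    {"pid": 4321, "name": "amd-smi"},
]
visible = filter_own_process(sample, own_pid=4321)
```

Passing `own_pid` explicitly keeps the helper easy to test; leaving it out falls back to the pid of the running interpreter.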
Oosman Saeed 7c83dac63d [SWDEV-547223] RAS HBM CRC Read CE failed due to AFID missing 24
cherry-pick aca-decode repo changeset: aca-decode repo: f9e5ad5 (HEAD -> main, origin/main, origin/HEAD) Fix bug in Corrected HBM Error being decoded as AFID 34 (#5)


[ROCm/amdsmi commit: ffca095246]
2025-08-21 11:00:30 -05:00
Saeed, Oosman 3779562abb [SWDEV-546239] amd-smi ras cper - no data created (#614)
* Update amd-smi doc with examples of CPER and AFID API usage.

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: fd5e37a07e]
2025-08-20 11:27:41 -05:00
Pham, Gabriel d32bae0e8f Updated Changelog for updated temperature metrics API (#616)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: e6af86f44a]
2025-08-19 19:02:50 -05:00
AL Musaffar, Yazen 678972b8ec [SWDEV-549789] Removed incorrect CPER AFID references (#619)
* Fix for afid help
* Update amdsmi_parser.py

Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>

[ROCm/amdsmi commit: e84e364b35]
2025-08-19 18:55:33 -05:00
Pryor, Adam 96a28009fc [SWDEV-544620] Add kfd fallback for GPU Processes (#631)
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: b62900c372]
2025-08-19 18:53:16 -05:00
Pham, Gabriel 729b7beddf [SWDEV-446394] Updated error message for setting clock limit (#633)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: c0ea186d47]
2025-08-19 18:51:49 -05:00
Poag, Charis 35b4b0df38 [SWDEV-550355] Fix process + violation output when in partitions (#623)
Changes:
  - Fixes amd-smi monitor such as:
    amd-smi monitor -Vqt, amd-smi monitor -g 0 -Vqt -w 1
    amd-smi monitor -Vqt --file /tmp/test1, ...
  - Required moving when process is called, since xcp
    information is gathered in the format expected by monitor
  - Requires process to be appended first with the gpu data -> xcp
    info to be gathered + added after 1st device

Change-Id: I76356a4610944f633a9530970fac66556d65bf11
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: 1b2edd70bd]
2025-08-19 18:50:51 -05:00
Charis Poag b239e5be60 [SWDEV-550679] Fix amd-smi monitor AttributeError
Impacts only Guest systems

Fixes following error:
$ amd-smi monitor
AttributeError: 'Namespace' object has no attribute 'violation'

Change-Id: If501819be3f8e2d2dfd75775dc776873a92465a3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/amdsmi commit: 5fe58a8e38]
2025-08-19 17:58:44 -05:00
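
The error quoted above is what `argparse` produces when code reads an option that was never registered on the parser. The sketch below shows one defensive way to read such an option. It assumes a hypothetical parser where `--violation` is only registered on some platforms, and it is not the actual change made in this commit.

```python
import argparse


def build_parser(register_violation: bool) -> argparse.ArgumentParser:
    # Hypothetical parser: the violation flag exists only on some platforms.
    parser = argparse.ArgumentParser(prog="monitor-sketch")
    parser.add_argument("--power-usage", action="store_true")
    if register_violation:
        parser.add_argument("--violation", action="store_true")
    return parser


def wants_violation(args: argparse.Namespace) -> bool:
    # getattr with a default avoids AttributeError when the flag
    # was never registered on the parser.
    return bool(getattr(args, "violation", False))


guest_args = build_parser(register_violation=False).parse_args([])
host_args = build_parser(register_violation=True).parse_args(["--violation"])
```

An alternative is to always register the option and hide or ignore it where it is unsupported, so that every `Namespace` has the same attributes.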
Maisam Arif 2851f80253 Removed kfd_ioctl.h from rocm include install
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I7948eb050f79a8a0f71e0b8a8e4e08187ac0bb84


[ROCm/amdsmi commit: 6de6290dc1]
2025-08-19 17:18:14 -05:00
Galantsev, Dmitrii beaffa2c84 [SWDEV-545751] CMAKE - Enable fPIC (#629)
Change-Id: Iaade10e70b3a39d6bca23ae98f9f501339ffd76d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/amdsmi commit: cd33b75540]
2025-08-19 11:39:39 -05:00
Poag, Charis 61c817a2d3 [SWDEV-546220] Fix mVF xcd check within tests (#628)
Adding a check to see if we're in guest -> allowing equal XCD values.
This is because in mVF configurations, we may not be able to read the gfx clock values.

Change-Id: I8e5d9627e061e98ec854734a91624c8077644a2a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: e12d270693]
2025-08-19 11:13:18 -05:00
Bindhiya Kanangot Balakrishnan 8e645a6da7 [SWDEV-547160] Fix VRAM percentage calculation
The vram_percent calculation was missing
multiplication by 100.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>


[ROCm/amdsmi commit: 41488f0c18]
2025-08-18 17:28:30 -05:00
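
A ratio of used to total memory is a fraction between 0 and 1, so it has to be scaled by 100 before it is reported as a percentage. A minimal sketch of the corrected arithmetic, where the function name and the zero-total guard are assumptions for illustration:

```python
def vram_percent(used_bytes: int, total_bytes: int) -> float:
    """Return VRAM usage as a percentage in the range 0-100."""
    if total_bytes <= 0:
        # Guard against division by zero when the total is unavailable.
        return 0.0
    # Without the "* 100" the result would be a fraction such as 0.25.
    return used_bytes / total_bytes * 100


quarter_full = vram_percent(4 * 1024**3, 16 * 1024**3)
```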
Arif, Maisam 4e568b2eea [SWDEV-540665] Add power_cap set to Linux Guest (#626)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I3c8d707681c141390b40521231e0d638c81cdeaf

[ROCm/amdsmi commit: 2d5accd000]
2025-08-18 14:59:14 -05:00
Charis Poag 7ab967ec69 Revert Major ABI break for amdsmi_get_violation_status()
Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.

Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/amdsmi commit: d3b73fac82]
2025-08-18 11:36:57 -05:00
Bill Liu 2ebf71976e [SWDEV-548260] Enable Support for Multiple init() and shutdown()
Implemented reference counting to manage init and shutdown processes,
allowing for multiple initializations and shutdowns.


[ROCm/amdsmi commit: c45a53d751]
2025-08-15 11:44:50 -05:00
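
Reference counting for paired init and shutdown calls can be sketched as below. The class and attribute names are hypothetical and the library itself is not written this way in Python. The sketch only shows the idea that setup runs on the first `init()` and teardown on the last matching `shutdown()`.

```python
import threading


class RefCountedLibrary:
    """Toy model of a library that tolerates nested init/shutdown calls."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self.initialized = False

    def init(self):
        with self._lock:
            if self._count == 0:
                self.initialized = True  # real setup would happen here
            self._count += 1

    def shutdown(self):
        with self._lock:
            if self._count == 0:
                return  # unmatched shutdown: nothing to release
            self._count -= 1
            if self._count == 0:
                self.initialized = False  # real teardown would happen here


lib = RefCountedLibrary()
lib.init()
lib.init()      # second caller, no repeated setup
lib.shutdown()  # first caller done, library stays up
still_up = lib.initialized
lib.shutdown()  # last caller done, library torn down
```

The lock keeps the counter consistent if several threads initialize or shut down concurrently.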
Maisam Arif 029ca0f256 [SWDEV-549831] Fixed file outputs not printing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I56b792256c30d618d59d2d40faf5fa0f1c2c4dc6


[ROCm/amdsmi commit: c8d0e5c497]
2025-08-14 11:08:49 -05:00
Charis Poag 3eb536e34c [SWDEV-548755] Driver reload temporary fix for CQE
Temporary solution until CQE can update how their containers are run.

This is because the driver reload requires:
1) Containers must run serially
   (i.e. no parallel containers running at the same time)
2) Containers must run with extra parameters:
   `--cap-add=SYS_ADMIN -v /lib/modules:/lib/modules`

Change-Id: If6364c9e82da8404b73ac6a9688833f4d18693b0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/amdsmi commit: 425b05cb18]
2025-08-11 13:06:57 -05:00
Galantsev, Dmitrii 8b96ee5271 Bump version to 26.1
Change-Id: I1b6ab552c9be965524ad49a866374a0d21b9ceb3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/amdsmi commit: e7d6590bbc]
2025-08-08 08:12:10 -05:00
josnarlo 4fe4c4df23 Fix getting version information
Change-Id: I2695733307888f5ab41a1265ae4369a2ea011e09


[ROCm/amdsmi commit: 925014ddaf]
2025-08-08 08:12:10 -05:00
Bindhiya Kanangot Balakrishnan 82a2c0dffc [SWDEV-543308] Fix xgmi_metrics_info initialization in xgmi
The xgmi_metrics_info variable was being referenced before
assignment when no destination GPUs were found or when the API
call failed. This caused an UnboundLocalError. Fixed this by
initializing xgmi_metrics_info with empty links structure.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>


[ROCm/amdsmi commit: f0453c2c75]
2025-08-07 16:19:10 -05:00
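
The failure mode described above arises when a local variable is assigned only inside a branch or a `try` block that may not run to completion. A minimal sketch of the fix pattern, assuming a hypothetical `query` callable in place of the real API call:

```python
def collect_xgmi_metrics(dest_gpus, query):
    # Bind the name before any branch so every path returns a valid value.
    xgmi_metrics_info = {"links": []}
    if dest_gpus:
        try:
            xgmi_metrics_info = query(dest_gpus)
        except RuntimeError:
            pass  # keep the empty default when the call fails
    return xgmi_metrics_info


def failing_query(dest_gpus):
    # Stand-in for an API call that fails.
    raise RuntimeError("query failed")


no_destinations = collect_xgmi_metrics([], failing_query)
failed_call = collect_xgmi_metrics(["gpu1"], failing_query)
```

Without the initial assignment, both calls at the bottom would reach the `return` with the name unbound and raise `UnboundLocalError`.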
Charis Poag 79ce271d1f Fix amd-smi sets attribute error & memory partition sets
* Changes:
- Fix for any set without CPU loaded (ex.):
sudo /opt/rocm/bin/amd-smi set -o 250
AttributeError: 'Namespace' object has no attribute 'core_boost_limit'

- Fix for recent changes to memory partition sets
  Needed to account for permission denied -> to display not supported.
  EACCES == *_STATUS_PERMISSION, but in this case need to show
  NOT_SUPPORTED

Change-Id: Ie00bbb34d01adfe38300f1ac4c1620d78885b9b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/amdsmi commit: e7964cda49]
2025-08-07 16:09:56 -05:00
Justin Williams 05ee73d6de CI - Updated Runners & Max Parallels
Signed-off-by: Justin Williams <juwillia@amd.com>


[ROCm/amdsmi commit: d0321875d9]
2025-08-07 12:07:19 -05:00
Poag, Charis 07dfa789d0 [SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)
Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>

[ROCm/amdsmi commit: e2e4fc65c1]
2025-08-06 16:03:06 -05:00
62d92968791937c6480e7d49e40bec15_amdeng 3437d5b5da [SWDEV-539532] Enabled and updated set CPU APIs from CLI (#513)
* Enabled and updated set CPU APIs from CLI
* Fix sets not working consistently across devices + string/int comparison

Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Deepak Mewar <deepak.mewar@amd.com>

[ROCm/amdsmi commit: 1dedeac4e3]
2025-08-06 12:52:35 -05:00
Pham, Gabriel 4565e23aca [SWDEV-542706] Corrected get_od_clk_volt_info (#604)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: b916ceedb6]
2025-08-06 12:24:02 -05:00
Pham, Gabriel c8698c87ef [SWDEV-542706] Adjusted logic for reading pp_od_clk_voltage (#592)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>


[ROCm/amdsmi commit: 95c11daa68]
2025-08-06 11:20:09 -05:00
Maisam Arif f3291ee791 Default output driver string truncation
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I88b78b1cb9712f9fee4f94a54811f8f702d4d920


[ROCm/amdsmi commit: 81ca193477]
2025-08-06 10:40:37 -05:00
Poag, Charis bf8bbd99c6 [SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582)
Description:
  - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
  - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
  - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
  - Enhanced CLI and test cases to allow users to control when the driver reload occurs.
  - Updated documentation and changelog to reflect the new driver reload process.
  - Improved error handling and logging for driver reload operations.
  - Added progress bar and user confirmation prompts for driver reload commands.

* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
  - Systemctl is typically not enabled on docker.
  - And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: d24dc7ef89]
2025-08-05 20:44:28 -05:00
Liu, Shuzhou (Bill) 7ec0a1a7dd Query UBB/OAM temperature API (#581)
Add support to Query UBB/OAM temperature.
* Updated Python API with new temperature metrics enum

---------

Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: abd3c02a3c]
2025-08-05 20:37:45 -05:00
Saeed, Oosman 7a6d75af7c [SWDEV-533349] CodeQL errors in amdsmi source code (#588)
Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>

[ROCm/amdsmi commit: 753a5ea326]
2025-08-05 20:17:21 -05:00
Pham, Gabriel e1a538e551 Added Platform Information to Default Command (#553)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: fc5ea762b3]
2025-08-05 20:11:42 -05:00
Pryor, Adam 32a1ef90cd Documentation updates for AMDSMI_GPU_METRICS_CACHE_MS (#564)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: 2dc2e12a97]
2025-08-05 19:58:37 -05:00
Galantsev, Dmitrii dbc496b36f CI - Use self-hosted machines for format checking due to IP whitelist
Change-Id: I1f0f4af7ed42d849cf4c9384e3c0c6da57b0504c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/amdsmi commit: 4044d1da41]
2025-08-04 21:09:25 -05:00
AL Musaffar, Yazen 00ffa28162 [SWDEV-544092] Fix Navi process float conversion (#579)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>

[ROCm/amdsmi commit: 27cae85910]
2025-08-04 14:40:18 -05:00
Bindhiya Kanangot Balakrishnan 4c2dec0883 [SWDEV-525336] Fix N/A process name display
The amd-smi command shows only the executable
name of a process by stripping the absolute path. This
caused "N/A" process names to be incorrectly displayed as
"A" in the output. Corrected this.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>


[ROCm/amdsmi commit: b16a66b2c5]
2025-08-04 13:51:42 -05:00
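
The "N/A" to "A" symptom follows from treating the placeholder as a path: everything up to the last `/` is dropped. A minimal sketch of the behavior and one way to guard against it, where the `display_name` helper is an assumption for illustration and not the code from this commit:

```python
import os


def display_name(process_name: str) -> str:
    """Return the executable name, leaving the "N/A" placeholder intact."""
    if process_name == "N/A":
        return process_name
    return os.path.basename(process_name)


# Unguarded stripping treats "N/" as a directory component.
unguarded = os.path.basename("N/A")
guarded = display_name("N/A")
executable = display_name("/usr/bin/python3")
```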
Kanangot Balakrishnan, Bindhiya 67f21bb032 [SWDEV-537852] Update compute-partition set error messages (#505)
[SWDEV-537852] Update compute-partition set error messages

Setting compute partition needs sudo privileges. Added
AmdSmiPermissionDeniedException to display CLI elevated
permission errors.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: 27a1705d96]
2025-08-01 08:22:22 -05:00
Arif, Maisam e5d03d099a Revert "[SWDEV-505176] Submodule Unified Header in AMDSMI"
This reverts commit 0e895cf235.


[ROCm/amdsmi commit: 240a607904]
2025-07-30 14:08:24 -05:00
Narlo, Joseph 0e895cf235 [SWDEV-505176] Submodule Unified Header in AMDSMI
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: a315b62e37]
2025-07-30 13:37:01 -05:00
gabrpham_amdeng cab2270feb [SWDEV-543627] Fixed incorrect metric min clock values
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>


[ROCm/amdsmi commit: 4f0d1c8c29]
2025-07-26 04:55:25 -05:00
Kanangot Balakrishnan, Bindhiya dca01f8128 [SWDEV-545342] Remove link type translation (#575)
[ROCm/amdsmi commit: c9f0d1b953]
2025-07-25 13:16:06 -05:00
Williams, Justin 222762d1be Updated CODEOWNERS (#578)
Signed-off-by: Justin Williams <juwillia@amd.com>
Co-authored-by: Justin Williams <juwillia@amd.com>

[ROCm/amdsmi commit: d11ae93eb0]
2025-07-25 09:42:16 -07:00
Bindhiya Kanangot Balakrishnan 10389ae450 [SWDEV-537852] Update help text for InvalidParameterValueException
Updated the help text to display command name.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>


[ROCm/amdsmi commit: 449839a32e]
2025-07-24 10:47:13 -05:00
Justin Williams 634a5c8c2b CI - Added Debian 10 Repository Updates
Signed-off-by: Justin Williams <juwillia@amd.com>


[ROCm/amdsmi commit: 0d76d78e49]
2025-07-24 10:39:38 -05:00
Kanangot Balakrishnan, Bindhiya 46deb667e3 [SWDEV-537852] Update help and error text (#518)
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

[ROCm/amdsmi commit: 6f7b397998]
2025-07-24 09:06:22 -05:00
Justin Williams 65a9397928 CI - Make ABI compliance checks non-blocking with warning labels
Signed-off-by: Justin Williams <juwillia@amd.com>


[ROCm/amdsmi commit: 4c09fcac1f]
2025-07-24 08:49:44 -05:00
Pham, Gabriel 6369febcbd [SWDEV-545342] Fixed amdsmi_link_type_t enumeration (#560)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: e2eac98496]
2025-07-22 18:22:49 -05:00
Williams, Justin 76add85291 CI - Created Automatic Github to Gerrit Mirror (#556)
Signed-off-by: Justin Williams <Justin.Williams@amd.com>

[ROCm/amdsmi commit: 5b72f3a950]
2025-07-22 17:30:40 -05:00
Poag, Charis e754e8e7ad [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>

[ROCm/amdsmi commit: ec055f2c2d]
2025-07-22 17:21:15 -05:00