Commit Graph

1906 Commits

Author SHA1 Message Date
Charis Poag d3b73fac82 Revert Major ABI break for amdsmi_get_violation_status()
Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.

Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-18 11:36:57 -05:00
Bill Liu c45a53d751 [SWDEV-548260] Enable Support for Multiple init() and shutdown()
Implemented reference counting to manage init and shutdown processes,
allowing for multiple initializations and shutdowns.
2025-08-15 11:44:50 -05:00
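The reference counting described in commit c45a53d751 can be pictured with a minimal sketch. It is only an illustration of the general technique, not the library's code: the class `RefCountedLibrary` and its attributes are hypothetical names, and the real setup and teardown work is replaced by a flag.

```python
import threading


class RefCountedLibrary:
    """Hypothetical wrapper: set up on the first init(), tear down on the last shutdown()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = 0
        self.initialized = False

    def init(self):
        with self._lock:
            if self._refs == 0:
                self.initialized = True  # real setup work would go here
            self._refs += 1

    def shutdown(self):
        with self._lock:
            if self._refs == 0:
                raise RuntimeError("shutdown() called without a matching init()")
            self._refs -= 1
            if self._refs == 0:
                self.initialized = False  # real teardown work would go here


lib = RefCountedLibrary()
lib.init()
lib.init()                  # second caller: no re-initialization
lib.shutdown()              # first shutdown: library stays up
still_up = lib.initialized
lib.shutdown()              # last shutdown: library is torn down
```

The lock is there because callers on different threads may initialize and shut down concurrently; whether the actual change guards the counter this way is an assumption.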
Maisam Arif c8d0e5c497 [SWDEV-549831] Fixed file outputs not printing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I56b792256c30d618d59d2d40faf5fa0f1c2c4dc6
2025-08-14 11:08:49 -05:00
Charis Poag 425b05cb18 [SWDEV-548755] Driver reload temporary fix for CQE
Temporary solution until CQE can update how their containers are run.

This is because the driver reload requires:
1) Containers must run serially
   (i.e. no parallel containers running at the same time)
2) Containers must run with extra parameters:
   `--cap-add=SYS_ADMIN -v /lib/modules:/lib/modules`

Change-Id: If6364c9e82da8404b73ac6a9688833f4d18693b0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-11 13:06:57 -05:00
Galantsev, Dmitrii e7d6590bbc Bump version to 26.1
Change-Id: I1b6ab552c9be965524ad49a866374a0d21b9ceb3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-08-08 08:12:10 -05:00
josnarlo 925014ddaf Fix getting version information
Change-Id: I2695733307888f5ab41a1265ae4369a2ea011e09
2025-08-08 08:12:10 -05:00
Bindhiya Kanangot Balakrishnan f0453c2c75 [SWDEV-543308] Fix xgmi_metrics_info initialization in xgmi
The xgmi_metrics_info variable was being referenced before
assignment when no destination GPUs were found or when the API
call failed. This caused an UnboundLocalError. Fixed this by
initializing xgmi_metrics_info with empty links structure.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-07 16:19:10 -05:00
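The failure mode fixed in commit f0453c2c75 is easy to reproduce in isolation. The sketch below assumes a hypothetical helper `collect_xgmi_metrics` and a stand-in `query` callable in place of the real API call; only the variable name `xgmi_metrics_info` and the "empty links structure" idea come from the commit message.

```python
def collect_xgmi_metrics(dest_gpus, query):
    """Hypothetical helper; `query` stands in for the real API call."""
    # Bind the name up front so it exists even if the loop never runs
    # or every call fails. Without this line, the return statement would
    # raise UnboundLocalError in both of those cases.
    xgmi_metrics_info = {"links": []}
    for gpu in dest_gpus:
        try:
            xgmi_metrics_info = query(gpu)
        except RuntimeError:
            continue
    return xgmi_metrics_info


def failing_query(gpu):
    raise RuntimeError("API call failed")


no_gpus = collect_xgmi_metrics([], failing_query)
all_failed = collect_xgmi_metrics(["gpu0"], failing_query)
```

Both calls return the empty structure instead of raising, which matches the behavior the commit message describes.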
Charis Poag e7964cda49 Fix amd-smi sets attribute error & memory partition sets
* Changes:
- Fix for any set without CPU loaded (ex.):
sudo /opt/rocm/bin/amd-smi set -o 250
AttributeError: 'Namespace' object has no attribute 'core_boost_limit'

- Fix for recent changes to memory partition sets
  Permission denied must be displayed as not supported here.
  EACCES == *_STATUS_PERMISSION, but in this case we need to show
  NOT_SUPPORTED

Change-Id: Ie00bbb34d01adfe38300f1ac4c1620d78885b9b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-07 16:09:56 -05:00
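The `AttributeError` quoted in commit e7964cda49 is what `argparse` produces when code reads an option that was never registered on the parser. A minimal sketch of one defensive pattern follows; the parser below is a stand-in, not the real amd-smi parser, and using `getattr` with a default is an assumed fix rather than the one the commit necessarily applied.

```python
import argparse

# Stand-in parser: only the power cap option is registered, mimicking a
# system where the CPU-related options were never added.
parser = argparse.ArgumentParser(prog="set-demo")
parser.add_argument("-o", "--power-cap", type=int)
args = parser.parse_args(["-o", "250"])

# Reading args.core_boost_limit directly would raise AttributeError here.
core_boost_limit = getattr(args, "core_boost_limit", None)
power_cap = args.power_cap
```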
Justin Williams d0321875d9 CI - Updated Runners & Max Parallels
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-08-07 12:07:19 -05:00
Poag, Charis e2e4fc65c1 [SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)
Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-08-06 16:03:06 -05:00
62d92968791937c6480e7d49e40bec15_amdeng 1dedeac4e3 [SWDEV-539532] Enabled and updated set CPU APIs from CLI (#513)
* Enabled and updated set CPU APIs from CLI
* Fix sets not working consistently across devices + string/int comparison

Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Deepak Mewar <deepak.mewar@amd.com>
2025-08-06 12:52:35 -05:00
Pham, Gabriel b916ceedb6 [SWDEV-542706] Corrected get_od_clk_volt_info (#604)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-06 12:24:02 -05:00
Pham, Gabriel 95c11daa68 [SWDEV-542706] Adjusted logic for reading pp_od_clk_voltage (#592)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
2025-08-06 11:20:09 -05:00
Maisam Arif 81ca193477 Default output driver string truncation
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I88b78b1cb9712f9fee4f94a54811f8f702d4d920
2025-08-06 10:40:37 -05:00
Poag, Charis d24dc7ef89 [SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582)
Description:
  - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
  - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
  - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
  - Enhanced CLI and test cases to allow users to control when the driver reload occurs.
  - Updated documentation and changelog to reflect the new driver reload process.
  - Improved error handling and logging for driver reload operations.
  - Added progress bar and user confirmation prompts for driver reload commands.

* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
  - Systemctl is typically not enabled on docker.
  - And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-05 20:44:28 -05:00
Liu, Shuzhou (Bill) abd3c02a3c Query UBB/OAM temperature API (#581)
Add support for querying UBB/OAM temperature.
* Updated Python API with new temperature metrics enum

---------

Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:37:45 -05:00
Saeed, Oosman 753a5ea326 [SWDEV-533349] CodeQL errors in amdsmi source code (#588)
Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>
2025-08-05 20:17:21 -05:00
Pham, Gabriel fc5ea762b3 Added Platform Information to Default Command (#553)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:11:42 -05:00
Pryor, Adam 2dc2e12a97 Documentation updates for AMDSMI_GPU_METRICS_CACHE_MS (#564)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-05 19:58:37 -05:00
Galantsev, Dmitrii 4044d1da41 CI - Use self-hosted machines for format checking due to IP whitelist
Change-Id: I1f0f4af7ed42d849cf4c9384e3c0c6da57b0504c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-08-04 21:09:25 -05:00
AL Musaffar, Yazen 27cae85910 [SWDEV-544092] Fix Navi process float conversion (#579)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-08-04 14:40:18 -05:00
Bindhiya Kanangot Balakrishnan b16a66b2c5 [SWDEV-525336] Fix N/A process name display
The amd-smi command shows only the executable name of a
process by stripping the absolute path. This caused "N/A"
process names to be displayed incorrectly as "A" in the
output. Corrected this.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-04 13:51:42 -05:00
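The "N/A" to "A" truncation in commit b16a66b2c5 follows directly from treating the placeholder as a path. A small sketch, assuming a hypothetical `display_name` helper (the actual fix in the CLI may be structured differently):

```python
import os


def display_name(raw_name):
    """Hypothetical helper: strip the directory part, but keep the N/A placeholder."""
    if raw_name == "N/A":
        return raw_name
    return os.path.basename(raw_name)


naive = os.path.basename("N/A")            # the slash is read as a path separator
fixed = display_name("N/A")
normal = display_name("/usr/bin/python3")
```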
Kanangot Balakrishnan, Bindhiya 27a1705d96 [SWDEV-537852] Update compute-partition set error messages (#505)
Setting compute partition needs sudo privileges. Added
AmdSmiPermissionDeniedException to display CLI elevated
permission errors.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-01 08:22:22 -05:00
Arif, Maisam 240a607904 Revert "[SWDEV-505176] Submodule Unified Header in AMDSMI"
This reverts commit a315b62e37.
2025-07-30 14:08:24 -05:00
Narlo, Joseph a315b62e37 [SWDEV-505176] Submodule Unified Header in AMDSMI
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-30 13:37:01 -05:00
gabrpham_amdeng 4f0d1c8c29 [SWDEV-543627] Fixed incorrect metric min clock values
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-07-26 04:55:25 -05:00
Kanangot Balakrishnan, Bindhiya c9f0d1b953 [SWDEV-545342] Remove link type translation (#575)
2025-07-25 13:16:06 -05:00
Williams, Justin d11ae93eb0 Updated CODEOWNERS (#578)
Signed-off-by: Justin Williams <juwillia@amd.com>
Co-authored-by: Justin Williams <juwillia@amd.com>
2025-07-25 09:42:16 -07:00
Bindhiya Kanangot Balakrishnan 449839a32e [SWDEV-537852] Update help text for InvalidParameterValueException
Updated the help text to display command name.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 10:47:13 -05:00
Justin Williams 0d76d78e49 CI - Added Debian 10 Repository Updates
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-24 10:39:38 -05:00
Kanangot Balakrishnan, Bindhiya 6f7b397998 [SWDEV-537852] Update help and error text (#518)
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 09:06:22 -05:00
Justin Williams 4c09fcac1f CI - Make ABI compliance checks non-blocking with warning labels
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-24 08:49:44 -05:00
Pham, Gabriel e2eac98496 [SWDEV-545342] Fixed amdsmi_link_type_t enumeration (#560)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-22 18:22:49 -05:00
Williams, Justin 5b72f3a950 CI - Created Automatic Github to Gerrit Mirror (#556)
Signed-off-by: Justin Williams <Justin.Williams@amd.com>
2025-07-22 17:30:40 -05:00
Poag, Charis ec055f2c2d [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-07-22 17:21:15 -05:00
Justin Williams 553f2bfce3 CI - Fixed Debian 10 Install Errors
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-22 17:17:15 -05:00
Galantsev, Dmitrii 1042c4fa6b .clangd - Remove google readability config
Change-Id: I0535af5053eac9add068926c44073ae884df2008
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-07-21 15:06:53 -05:00
Castillo, Juan 3b1957e674 [SWDEV-531904] Added test_get_gpu_revision (#533)
* [SWDEV-531904] Added test_get_gpu_revision
New:
- amdsmi_get_gpu_revision() previously not implemented in amdsmi_interface.py
- test_get_gpu_revision() missing integration test.

Updated:
- changelog.md added new doc fields for ROCm 7.1
- amdsmi-py-api.md added field|description doc fields

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-07-15 19:35:54 -05:00
Saeed, Oosman 03414e20ee SWDEV-539482: Different sizes of mem leaks observed in amdsmitst (#538)
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-07-15 14:33:27 -05:00
Bindhiya Kanangot Balakrishnan 645c313f00 [SWDEV-543308] Revert amdsmi_link_metrics structure change
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and python API's, wrapper, and
CLI accordingly.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-14 13:56:26 -05:00
Maisam Arif fcf494bbc5 gpu_metrics caching fix
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6dacb0b81d6677c354ef3c86af4d7d5156a76d8b
2025-07-14 12:12:37 -05:00
dependabot[bot] 28e577f1c0 Bump urllib3 from 2.3.0 to 2.5.0 in /docs/sphinx (#546)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.3.0 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.3.0...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-13 11:43:30 -05:00
Pryor, Adam 42096c1398 Add gpu metrics cache (#541)
* Add gpu metrics caching defaulted to 100ms
* AMDSMI_GPU_METRICS_CACHE_MS is used to set the caching rate limits

---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-13 09:56:29 -05:00
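The caching in commit 42096c1398 can be illustrated with a generic time-based cache. This is a sketch of the technique only: the class `TimedCache`, the `fetch` callable, and the way the environment variable is parsed are assumptions. Only the variable name `AMDSMI_GPU_METRICS_CACHE_MS` and the 100 ms default come from the commit message.

```python
import os
import time


class TimedCache:
    """Hypothetical time-based cache; not the library's actual implementation."""

    def __init__(self, fetch, ttl_ms=None, clock=time.monotonic):
        if ttl_ms is None:
            # Variable name is from the commit message; parsing it as an
            # integer number of milliseconds is an assumption.
            ttl_ms = int(os.environ.get("AMDSMI_GPU_METRICS_CACHE_MS", "100"))
        self._ttl = ttl_ms / 1000.0
        self._fetch = fetch
        self._clock = clock
        self._value = None
        self._stamp = None

    def get(self):
        now = self._clock()
        # Refetch on first use or once the cached value is older than the TTL.
        if self._stamp is None or now - self._stamp >= self._ttl:
            self._value = self._fetch()
            self._stamp = now
        return self._value


calls = []
fake_now = [0.0]


def fetch():
    calls.append(1)
    return len(calls)


cache = TimedCache(fetch, ttl_ms=100, clock=lambda: fake_now[0])
first = cache.get()      # first read: fetches
fake_now[0] = 0.05
second = cache.get()     # 50 ms later: served from the cache
fake_now[0] = 0.15
third = cache.get()      # 150 ms later: refetched
```

The injected clock keeps the demonstration deterministic; production code would rely on the default `time.monotonic`.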
Maisam Arif 10f9aae0b3 Reduced calls to drm devinfo for getting virtualization_mode
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I22a6a9ca15131b37a775e8d4f595fb13c0b043c7
2025-07-11 12:26:42 -05:00
Justin Williams af69f75a86 CI - Added Docs Generation Instructions
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-10 09:42:51 -05:00
Kanangot Balakrishnan, Bindhiya f6b854b4ed [SWDEV-541289] Update violation argument in amd-smi (#526)
* Disabled violation argument for monitor on guests as it is supported on BM only.
* Added `-v` and `--violation` args to metric along with `throttle` due to legacy behavior.
	* Suppressed metric throttle arg and do not show in help text

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-09 16:38:09 -05:00
Kanangot Balakrishnan, Bindhiya 514517e536 [SWDEV-539721] Show complete process name (#536)
Modified the file used to fetch process name so that complete name with path can be displayed.

Changes:
amd-smi monitor -q
- human readable format will output only the process name
- csv and json formats will print the full path

amd-smi process
- name will always be the full path to the process

amd-smi (default output)
- name will always be truncated.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-09 16:34:39 -05:00
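The display rules listed in commit 514517e536 can be summarized as a small decision function. The sketch below uses a hypothetical `format_process_name` helper and covers only the `monitor` and `process` cases; the default output is left out because the commit message does not state the truncation width.

```python
import os


def format_process_name(full_path, command, output_format="human"):
    """Hypothetical helper mirroring the rules listed in the commit message."""
    if command == "process":
        return full_path                       # always the full path
    if command == "monitor":
        if output_format in ("csv", "json"):
            return full_path                   # machine-readable: full path
        return os.path.basename(full_path)     # human readable: name only
    raise ValueError(f"unknown command: {command}")


monitor_human = format_process_name("/usr/bin/python3", "monitor")
monitor_json = format_process_name("/usr/bin/python3", "monitor", "json")
process_human = format_process_name("/usr/bin/python3", "process")
```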
AL Musaffar, Yazen 01a6158c85 [SWDEV-532904] CLI lists unusable UUID without sudo (#510)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-07-09 15:45:03 -05:00
josnarlo 0257140504 [SWDEV-536953] Align Power Cap Behavior with ROCM_SMI
Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
2025-07-09 15:37:40 -05:00
Castillo, Juan 34f465bfc5 [SWDEV-531904] Removed Handle Exceptions function (#531)
Removed:
- handle_exceptions(), which exposed, silenced, and logged AMDSMI exceptions to users and returned success/failure

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-07-07 13:26:26 -05:00