Commit Graph

1913 Commits

Author SHA1 Message Date
Poag, Charis 1b2edd70bd [SWDEV-550355] Fix process + violation output when in partitions (#623)
Changes:
  - Fixes amd-smi monitor such as:
    amd-smi monitor -Vqt, amd-smi monitor -g 0 -Vqt -w 1
    amd-smi monitor -Vqt --file /tmp/test1, ...
  - Required moving where process is called, so that xcp
    information is gathered in the format expected by monitor
  - Requires process to be appended first with the gpu data; xcp
    info is then gathered and added after the 1st device

Change-Id: I76356a4610944f633a9530970fac66556d65bf11
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-19 18:50:51 -05:00
Charis Poag 5fe58a8e38 [SWDEV-550679] Fix amd-smi monitor AttributeError
Impacts only Guest systems

Fixes following error:
$ amd-smi monitor
AttributeError: 'Namespace' object has no attribute 'violation'

Change-Id: If501819be3f8e2d2dfd75775dc776873a92465a3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-19 17:58:44 -05:00
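The `AttributeError` quoted in 5fe58a8e38 is the error argparse raises when code reads an attribute that was never registered on the parsed `Namespace`. The commit does not say how it was fixed; the sketch below shows one common defensive pattern. The helper name `read_optional_flag` and the sample namespace are hypothetical, not taken from amd-smi.

```python
import argparse


def read_optional_flag(args: argparse.Namespace, name: str, default=False):
    """Read an attribute that may not have been registered on this platform."""
    # getattr with a default avoids AttributeError when the option was
    # never added to the parser (for example on a guest system).
    return getattr(args, name, default)


# Hypothetical namespace parsed without a 'violation' option.
guest_args = argparse.Namespace(gpu=None)
violation_requested = read_optional_flag(guest_args, "violation")
```

With this pattern a missing option behaves like an unset one instead of aborting the command.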
Maisam Arif 6de6290dc1 Removed kfd_ioctl.h from rocm include install
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I7948eb050f79a8a0f71e0b8a8e4e08187ac0bb84
2025-08-19 17:18:14 -05:00
Galantsev, Dmitrii cd33b75540 [SWDEV-545751] CMAKE - Enable fPIC (#629)
Change-Id: Iaade10e70b3a39d6bca23ae98f9f501339ffd76d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-08-19 11:39:39 -05:00
Poag, Charis e12d270693 [SWDEV-546220] Fix mVF xcd check within tests (#628)
Adding a check to see if we're in guest -> allowing equal XCD values.
This is because in mVF configurations, we may not be able to read the gfx clock values.

Change-Id: I8e5d9627e061e98ec854734a91624c8077644a2a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-19 11:13:18 -05:00
Bindhiya Kanangot Balakrishnan 41488f0c18 [SWDEV-547160] Fix VRAM percentage calculation
The vram_percent calculation was missing
multiplication by 100.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-18 17:28:30 -05:00
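The fix in 41488f0c18 comes down to one missing factor: without the multiplication by 100, the value is a fraction between 0 and 1 rather than a percentage. A minimal sketch, assuming byte counts as inputs; the function name, the rounding, and the zero-total guard are illustrative assumptions, not the project's code.

```python
def vram_percent(used_bytes: int, total_bytes: int) -> float:
    """Return VRAM usage as a percentage in the range 0-100."""
    if total_bytes <= 0:
        # Assumed guard: avoid dividing by zero when the total is unknown.
        return 0.0
    # The factor of 100 turns the 0-1 fraction into a percentage.
    return round(used_bytes / total_bytes * 100, 2)
```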
Arif, Maisam 2d5accd000 [SWDEV-540665] Add power_cap set to Linux Guest (#626)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I3c8d707681c141390b40521231e0d638c81cdeaf
2025-08-18 14:59:14 -05:00
Charis Poag d3b73fac82 Revert Major ABI break for amdsmi_get_violation_status()
Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.

Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-18 11:36:57 -05:00
Bill Liu c45a53d751 [SWDEV-548260] Enable Support for Multiple init() and shutdown()
Implemented reference counting to manage init and shutdown processes,
allowing for multiple initializations and shutdowns.
2025-08-15 11:44:50 -05:00
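The reference counting described in c45a53d751 can be pictured as follows: only the first init performs setup, and only the last matching shutdown tears down. This is a sketch in Python under stated assumptions; the class name, the lock, and the choice to ignore an unmatched shutdown are hypothetical and not confirmed by the commit.

```python
import threading


class RefCountedLibrary:
    """Toy model of a library that tolerates repeated init()/shutdown()."""

    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0
        self.initialized = False

    def init(self):
        with self._lock:
            if self._count == 0:
                # Stand-in for the real one-time setup work.
                self.initialized = True
            self._count += 1

    def shutdown(self):
        with self._lock:
            if self._count == 0:
                # Assumption: an unmatched shutdown is silently ignored.
                return
            self._count -= 1
            if self._count == 0:
                # Stand-in for the real teardown work.
                self.initialized = False
```

Two callers can each pair their own init and shutdown without tearing down state the other still relies on.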
Maisam Arif c8d0e5c497 [SWDEV-549831] Fixed file outputs not printing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I56b792256c30d618d59d2d40faf5fa0f1c2c4dc6
2025-08-14 11:08:49 -05:00
Charis Poag 425b05cb18 [SWDEV-548755] Driver reload temporary fix for CQE
Temporary solution until CQE can update how their containers are run.

This is because the driver reload requires:
1) Containers must run serially
   (i.e. no parallel containers running at the same time)
2) Containers must run with extra parameters:
   `--cap-add=SYS_ADMIN -v /lib/modules:/lib/modules`

Change-Id: If6364c9e82da8404b73ac6a9688833f4d18693b0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-11 13:06:57 -05:00
Galantsev, Dmitrii e7d6590bbc Bump version to 26.1
Change-Id: I1b6ab552c9be965524ad49a866374a0d21b9ceb3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-08-08 08:12:10 -05:00
josnarlo 925014ddaf Fix getting version information
Change-Id: I2695733307888f5ab41a1265ae4369a2ea011e09
2025-08-08 08:12:10 -05:00
Bindhiya Kanangot Balakrishnan f0453c2c75 [SWDEV-543308] Fix xgmi_metrics_info initialization in xgmi
The xgmi_metrics_info variable was being referenced before
assignment when no destination GPUs were found or when the API
call failed. This caused an UnboundLocalError. Fixed this by
initializing xgmi_metrics_info with empty links structure.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-07 16:19:10 -05:00
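The `UnboundLocalError` described in f0453c2c75 occurs when a local name is assigned only inside a loop or a successful call, then read afterwards. A minimal sketch of the initialize-first pattern; the function name, the `query` callback, the exception type, and the shape of the empty structure are assumptions made for illustration.

```python
def collect_link_metrics(dest_gpus, query):
    """Return link metrics, falling back to an empty links structure."""
    # Bind the name up front so it exists on every code path, including
    # "no destination GPUs" and "every query failed".
    xgmi_metrics_info = {"links": []}
    for gpu in dest_gpus:
        try:
            xgmi_metrics_info = query(gpu)
        except RuntimeError:
            # Assumed failure mode: skip this GPU and keep the last value.
            continue
    return xgmi_metrics_info
```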
Charis Poag e7964cda49 Fix amd-smi sets attribute error & memory partition sets
* Changes:
- Fix for any set without CPU loaded (ex.):
sudo /opt/rocm/bin/amd-smi set -o 250
AttributeError: 'Namespace' object has no attribute 'core_boost_limit'

- Fix for recent changes to memory partition sets
  Needed to account for permission denied -> to display not supported.
  EACCES == *_STATUS_PERMISSION, but in this case need to show
  NOT_SUPPORTED

Change-Id: Ie00bbb34d01adfe38300f1ac4c1620d78885b9b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-07 16:09:56 -05:00
Justin Williams d0321875d9 CI - Updated Runners & Max Parallels
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-08-07 12:07:19 -05:00
Poag, Charis e2e4fc65c1 [SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)
Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-08-06 16:03:06 -05:00
62d92968791937c6480e7d49e40bec15_amdeng 1dedeac4e3 [SWDEV-539532] Enabled and updated set CPU APIs from CLI (#513)
* Enabled and updated set CPU APIs from CLI
* Fix sets not working consistently across devices + string/int comparison

Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Deepak Mewar <deepak.mewar@amd.com>
2025-08-06 12:52:35 -05:00
Pham, Gabriel b916ceedb6 [SWDEV-542706] Corrected get_od_clk_volt_info (#604)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-06 12:24:02 -05:00
Pham, Gabriel 95c11daa68 [SWDEV-542706] Adjusted logic for reading pp_od_clk_voltage (#592)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
2025-08-06 11:20:09 -05:00
Maisam Arif 81ca193477 Default output driver string truncation
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I88b78b1cb9712f9fee4f94a54811f8f702d4d920
2025-08-06 10:40:37 -05:00
Poag, Charis d24dc7ef89 [SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582)
Description:
  - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
  - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
  - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
  - Enhanced CLI and test cases to allow users to control when the driver reload occurs.
  - Updated documentation and changelog to reflect the new driver reload process.
  - Improved error handling and logging for driver reload operations.
  - Added progress bar and user confirmation prompts for driver reload commands.

* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
  - Systemctl is typically not enabled on docker.
  - And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-05 20:44:28 -05:00
Liu, Shuzhou (Bill) abd3c02a3c Query UBB/OAM temperature API (#581)
Add support to Query UBB/OAM temperature.
* Updated Python API with new temperature metrics enum

---------

Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:37:45 -05:00
Saeed, Oosman 753a5ea326 [SWDEV-533349] CodeQL errors in amdsmi source code (#588)
Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>
2025-08-05 20:17:21 -05:00
Pham, Gabriel fc5ea762b3 Added Platform Information to Default Command (#553)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:11:42 -05:00
Pryor, Adam 2dc2e12a97 Documentation updates for AMDSMI_GPU_METRICS_CACHE_MS (#564)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-05 19:58:37 -05:00
Galantsev, Dmitrii 4044d1da41 CI - Use self-hosted machines for format checking due to IP whitelist
Change-Id: I1f0f4af7ed42d849cf4c9384e3c0c6da57b0504c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-08-04 21:09:25 -05:00
AL Musaffar, Yazen 27cae85910 [SWDEV-544092] Fix Navi process float conversion (#579)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-08-04 14:40:18 -05:00
Bindhiya Kanangot Balakrishnan b16a66b2c5 [SWDEV-525336] Fix N/A process name display
The amd-smi command shows only the executable
name of a process by stripping the absolute path. This
caused "N/A" process names to display incorrectly as
"A" in the output. Corrected this.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-04 13:51:42 -05:00
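The "N/A" to "A" symptom in b16a66b2c5 is what path stripping does to any string containing a slash: the placeholder is treated as a directory `N` and a file `A`. The sketch below reproduces the effect with `posixpath.basename` and shows one possible guard. The function name `display_name` is hypothetical, and the actual fix may differ.

```python
import posixpath


def display_name(process_name: str) -> str:
    """Strip the directory part of a process path, preserving placeholders."""
    if process_name == "N/A":
        # The placeholder contains a slash, so it must bypass path stripping.
        return process_name
    return posixpath.basename(process_name)


# Without the guard, the placeholder loses everything before the slash.
unguarded = posixpath.basename("N/A")
```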
Kanangot Balakrishnan, Bindhiya 27a1705d96 [SWDEV-537852] Update compute-partition set error messages (#505)
Setting compute partition needs sudo privileges. Added
AmdSmiPermissionDeniedException to display CLI elevated
permission errors.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-01 08:22:22 -05:00
Arif, Maisam 240a607904 Revert "[SWDEV-505176] Submodule Unified Header in AMDSMI"
This reverts commit a315b62e37.
2025-07-30 14:08:24 -05:00
Narlo, Joseph a315b62e37 [SWDEV-505176] Submodule Unified Header in AMDSMI
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-30 13:37:01 -05:00
gabrpham_amdeng 4f0d1c8c29 [SWDEV-543627] Fixed incorrect metric min clock values
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-07-26 04:55:25 -05:00
Kanangot Balakrishnan, Bindhiya c9f0d1b953 [SWDEV-545342] Remove link type translation (#575)
2025-07-25 13:16:06 -05:00
Williams, Justin d11ae93eb0 Updated CODEOWNERS (#578)
Signed-off-by: Justin Williams <juwillia@amd.com>
Co-authored-by: Justin Williams <juwillia@amd.com>
2025-07-25 09:42:16 -07:00
Bindhiya Kanangot Balakrishnan 449839a32e [SWDEV-537852] Update help text for InvalidParameterValueException
Updated the help text to display command name.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 10:47:13 -05:00
Justin Williams 0d76d78e49 CI - Added Debian 10 Repository Updates
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-24 10:39:38 -05:00
Kanangot Balakrishnan, Bindhiya 6f7b397998 [SWDEV-537852] Update help and error text (#518)
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 09:06:22 -05:00
Justin Williams 4c09fcac1f CI - Make ABI compliance checks non-blocking with warning labels
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-24 08:49:44 -05:00
Pham, Gabriel e2eac98496 [SWDEV-545342] Fixed amdsmi_link_type_t enumeration (#560)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-22 18:22:49 -05:00
Williams, Justin 5b72f3a950 CI - Created Automatic Github to Gerrit Mirror (#556)
Signed-off-by: Justin Williams <Justin.Williams@amd.com>
2025-07-22 17:30:40 -05:00
Poag, Charis ec055f2c2d [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-07-22 17:21:15 -05:00
Justin Williams 553f2bfce3 CI - Fixed Debian 10 Install Errors
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-07-22 17:17:15 -05:00
Galantsev, Dmitrii 1042c4fa6b .clangd - Remove google readability config
Change-Id: I0535af5053eac9add068926c44073ae884df2008
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-07-21 15:06:53 -05:00
Castillo, Juan 3b1957e674 [SWDEV-531904] Added test_get_gpu_revision (#533)
New:
- amdsmi_get_gpu_revision() previously not implemented in amdsmi_interface.py
- test_get_gpu_revision() missing integration test.

Updated:
- changelog.md: added new doc fields for ROCm 7.1
- amdsmi-py-api.md: added field|description doc fields

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-07-15 19:35:54 -05:00
Saeed, Oosman 03414e20ee SWDEV-539482: Different sizes of mem leaks observed in amdsmitst (#538)
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-07-15 14:33:27 -05:00
Bindhiya Kanangot Balakrishnan 645c313f00 [SWDEV-543308] Revert amdsmi_link_metrics structure change
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-14 13:56:26 -05:00
Maisam Arif fcf494bbc5 gpu_metrics caching fix
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6dacb0b81d6677c354ef3c86af4d7d5156a76d8b
2025-07-14 12:12:37 -05:00
dependabot[bot] 28e577f1c0 Bump urllib3 from 2.3.0 to 2.5.0 in /docs/sphinx (#546)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.3.0 to 2.5.0.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.3.0...2.5.0)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-version: 2.5.0
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
2025-07-13 11:43:30 -05:00
Pryor, Adam 42096c1398 Add gpu metrics cache (#541)
* Add gpu metrics caching defaulted to 100ms
* AMDSMI_GPU_METRICS_CACHE_MS is used to set the caching rate limits

---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-13 09:56:29 -05:00
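The caching added in 42096c1398 can be thought of as a time-to-live cache whose lifetime comes from `AMDSMI_GPU_METRICS_CACHE_MS`, with the 100 ms default stated in the commit. The sketch below illustrates that idea only. The class and function names, the parsing rules for the variable, and the injectable clock are assumptions rather than the library's implementation.

```python
import os
import time


def cache_ms_from_env(default_ms: int = 100) -> int:
    """Read the cache lifetime in ms; fall back to the default if unusable."""
    raw = os.environ.get("AMDSMI_GPU_METRICS_CACHE_MS")
    if raw is None:
        return default_ms
    try:
        # Assumption: negative values are clamped to 0 (no caching).
        return max(0, int(raw))
    except ValueError:
        return default_ms


class MetricsCache:
    """Re-read metrics only when the cached copy is older than ttl_ms."""

    def __init__(self, read_fn, ttl_ms: int, clock=time.monotonic):
        self._read_fn = read_fn
        self._ttl_ms = ttl_ms
        self._clock = clock
        self._has_value = False
        self._value = None
        self._stamp = 0.0

    def get(self):
        now = self._clock()
        age_ms = (now - self._stamp) * 1000
        if not self._has_value or age_ms >= self._ttl_ms:
            # Cache miss or expired entry: read again and restamp.
            self._value = self._read_fn()
            self._stamp = now
            self._has_value = True
        return self._value
```

A caller polling faster than the lifetime receives the cached copy, which limits how often the underlying metrics source is read.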