464 Commits

Author SHA1 Message Date
Kanangot Balakrishnan, Bindhiya f8e4771363 [SWDEV-538483] Add NPM APIs and CLI (#817)
* Added Python & C APIs for new node devices. Currently these are functional for node 0 only.
 - amdsmi_get_node_handle
 - amdsmi_get_npm_info
* Added `amd-smi node` CLI for Node Power Management

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-11-13 21:51:31 -06:00
gabrpham_amdeng 18faddf6f3 Added support for configuring PPT1 power cap
- Updated python integration test to account for PPT1 support changes
  - Updated set/reset power-cap input format
  - Adjusted python API and updated C++ API test

Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Change-Id: Ia9d02868b6e91c88c10a9772d9e2d9f37c3c352f
2025-11-13 13:08:12 -06:00
Galantsev, Dmitrii 8bdf951d32 Add numbers to .so because wheels don't allow symlinks (#820)
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-11-06 03:57:31 -06:00
Galantsev, Dmitrii ad20d57162 Find libamd_smi.so and librocm-core.so relative to wrapper.py
Allow amdsmi to find libamd_smi.so and librocm-core.so relative to
amdsmi_wrapper.py location.

The amdsmi_wrapper.py file is located in
_rocm_sdk_core/share/amd_smi/amdsmi and the libraries are in
_rocm_sdk_core/lib/libamd_smi.so.26.
_rocm_sdk_core/lib/librocm-core.so.1.
2025-10-30 12:35:06 -05:00
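
The relative lookup described in the commit above can be sketched roughly as follows. This is an illustration only, not the code from the commit: the helper name `find_bundled_libs` is hypothetical, and the directory depth is inferred from the two paths quoted in the message.

```python
from pathlib import PurePosixPath

def find_bundled_libs(wrapper_file: str) -> dict:
    """Hypothetical helper: derive library paths from the wrapper's location."""
    wrapper = PurePosixPath(wrapper_file)
    # amdsmi_wrapper.py sits in <root>/share/amd_smi/amdsmi, so <root> is
    # three directories above the wrapper's own directory.
    root = wrapper.parent.parents[2]
    lib_dir = root / "lib"
    return {
        "libamd_smi": str(lib_dir / "libamd_smi.so.26"),
        "librocm-core": str(lib_dir / "librocm-core.so.1"),
    }

libs = find_bundled_libs("_rocm_sdk_core/share/amd_smi/amdsmi/amdsmi_wrapper.py")
print(libs["libamd_smi"])
```

Resolving from the wrapper's own location means the lookup keeps working wherever the package is installed, as long as the internal layout stays the same.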
Charis Poag 0a5fdc944f [SWDEV-560847] Fix VRAM type not showing newer types
* Changes:
  - Allows `amd-smi static --vram` (`amdsmi_get_gpu_vram_info()`)
    to read the following types:
    DDR5, LPDDR4, LPDDR5, and HBM3E.

Change-Id: I1eddf9dcb574e1868541cc5063ae95cb6d6e1c59
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-10-29 16:13:42 -05:00
Pryor, Adam 2144cfbba4 [SWDEV-357472] Add evicted_ms metric (#620)
- **Added evicted_time metric for kfd processes**.  
  - Time that queues are evicted on a GPU in milliseconds
  - Added to CLI in `amd-smi monitor -q` and `amd-smi process`
  - Added to C API and Python API:
    - amdsmi_get_gpu_process_list()
    - amdsmi_get_gpu_compute_process_info()
    - amdsmi_get_gpu_compute_process_info_by_pid()

---------

Signed-off-by: Pryor, Adam <Adam.Pryor@amd.com>
2025-10-28 14:49:03 -05:00
Narlo, Joseph ced7d12395 [SWDEV-553416] Fix amdsmi_get_gpu_reg_table_info and amdsmi_get_gpu_pm_metrics_info (#787)
Signed-off-by: Narlo, Joseph <Joseph.Narlo@amd.com>
2025-10-27 14:43:31 -05:00
Poag, Charis 01b4fe6614 [SWDEV-535159] Add support for GPU partition metrics (#490)
[SWDEV-535159] Add support for GPU partition metrics

Changes include:
  - Internal logic to smart-switch between gpu_metrics/xcp_metrics files
  - [WIP] Initial plumbing for new partition metric API

Change-Id: I4340fb1b48bac0117d80d5d486b9e871430d5cd8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add amdsmi_get_gpu_partition_metrics_info() + minor cleanup

Change-Id: I5d60604f18baddbd03852dc90e88aa0b8107d50e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Fix partition metric logic + update logging/tests

Change-Id: I9e89b19ead17694c54e224f8e13ff8ee3eb2e22a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Adjust amd-smi metric/monitor/default to show (some) partition information

Change-Id: I2e8d2745876a19bdaec3c039daa97345c9f701b5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add C++ tests

Change-Id: Ib9eb0b57a6d7a280992e05a4c6eba632826952ef
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Remove modification of energy counter, not needed

Change-Id: I5c48eaaae248ee6dc79abba609d837ec35d78022
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[CLI] amd-smi metric: cleaned up N/A'd multi-valued to show just N/A

Changes:
1. amd-smi metric: cleaned up N/A'd multi-valued to show just N/A
ex.
JPEG_ACTIVITY: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]

Now just shows: N/A

2. [Python Unit Test] Changed testname TestAmdSmiPythonBDF(unittest.TestCase) ->
 AmdSmiPythonUnitTest

Test name was confusing.

Change-Id: Ieb3b036f30002fd22362508eb9fc5d443df395ae
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
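
The N/A cleanup in item 1 above could look something like the sketch below. The helper name `collapse_na` is hypothetical and the actual CLI code is not shown in this log; the sketch only illustrates the described behavior of replacing an all-N/A list with a single N/A.

```python
def collapse_na(values):
    """Hypothetical helper: return "N/A" when every entry is "N/A"."""
    if values and all(v == "N/A" for v in values):
        return "N/A"
    # Mixed or empty lists are left untouched.
    return values

print(collapse_na(["N/A"] * 32))
print(collapse_na([12, "N/A", 7]))
```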

Log cleanup

Change-Id: I1b1a95f1844d35bec7a7bd8cb996f87e4914c069
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add amd-smi partition-metrics CLI + general cleanup

Change-Id: Ia91488e6cb3a4d62b4087afbddfe0b3bb9378fdc
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[1.3 metrics] Remove forwards compatibility for partition metrics

Change-Id: Iab928983e6f6f1587bc9307f6f3fa2b2696ca6f7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Fixed violation output not showing % + general cleanup

Change-Id: Icac1b0a55b18c7628b07109ae0c377d17e0825f1
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Clean up amdsmi_get_gpu_partition_metrics_info & amd-smi partition-metric outputs

Change-Id: I6427028b980874641e9ffb3b5d88ad493dbf9cf4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Fix metrics not found + extra logging/formatting

Change-Id: I841a27bb2c305e97ec7579a13ac915e5be497c3a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Update license to current default

Change-Id: I0de9b8a2d5dbbeab4491097f0354ba17b0d30866
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Cleanup for review

Change-Id: I96ed25c3f2b8968eea1af24c5e5860c2b4e74e6e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Modernize updated/new internal APIs.

Change-Id: I3c48a250eeb703709b14cb5ffa68268d8321626c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Remove extra logging in dynamic metrics

Change-Id: Idb97547bcbe143d6fa1cb5cb278ffe4da615ce14
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Remove amd-smi partition-metric command

Change-Id: Ib83c17e5cd7e0da3798198943bddd46c296b411c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Move new CLI updates to another PR + minor fixes

Change-Id: I3b1163eec12f9b5f7d95ee33de08e168cec1b1fe
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Allow dynamic metrics to work for gpu/xcp metrics 1.9+/1.1+

Updated some logging as well.

Change-Id: I2ed9f5a5ef8afb1520508820ca6153525f0644b4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Allow dyn gpu/xcp metric v1.9+/v1.1+

Added tests for quick check

Change-Id: I576d6f6582a55afb08e5ac57791ce95e2fa184a2
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Update tests for larger subset of version checks

Change-Id: I3cdf4f8bb4fc6161f4c76566939f90545d0f362a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Fix XCP metrics in gpu/partition metric pre-v1.9/v1.1 (dynamic)

Change-Id: I4dabc1ed6bef6b86c8e7f92bf9cb5992f3966fe2
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-10-20 14:43:40 -05:00
Saeed, Oosman f7c9fe3011 [SWDEV-551318] Update readme doc: amdsmi_get_afids_from_cper() input arguments (#766)
* Update readme doc: amdsmi_get_afids_from_cper() input argument is only bytes, not a list of dicts each with keys “bytes” (List[int]) and “size” (int)

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-10-17 15:42:17 -05:00
Narlo, Joseph 7decbc67a1 [SWDEV-539078] Add missing API definitions to python interface (#525)
Added the following APIs to amdsmi_interface.py.
	amdsmi_get_cpu_handle()
	amdsmi_get_esmi_err_msg()
	amdsmi_get_gpu_event_notification()
	amdsmi_get_processor_count_from_handles()
	amdsmi_get_processor_handles_by_type()
	amdsmi_gpu_validate_ras_eeprom()
	amdsmi_init_gpu_event_notification()
	amdsmi_set_gpu_event_notification_mask()
	amdsmi_stop_gpu_event_notification()
	amdsmi_get_gpu_busy_percent()

Added additional return value to API amdsmi_get_xgmi_plpd().
	The entry policies is added to the end of the dictionary to match API definition.
	The entry plpds is marked for deprecation as it has the same information as policies.

---------

Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-10-06 14:50:00 -05:00
Maisam Arif bd3579a1ac Create symbolic links instead of hard links
This unbreaks having sources on one mount point and builds at another.

Signed-off-by: Marius Brehler <marius.brehler@amd.com>
Change-Id: I68363112382a95baaa867cad91e09bdec2b30d90
2025-09-26 12:17:06 -05:00
Narlo, Joseph 3c8fd1bf54 [SWDEV-554880] Sync Unified and Linux Header (#686)
Sync Unified and Linux Header

---------

Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
2025-09-23 16:56:32 -05:00
Maisam Arif cd21b5edcc [SWDEV-554587] Added IFWI Version and boot_firmware API
- Changed amd-smi static --vbios to accept ifwi
- Change population logic for vbios version API
- Added IFWI boot_firmware to the CLI, C++, Rust, and Python API

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I4ea504d40a43cfb011ab38fc9a664ecf12d39c8a
2025-09-23 16:05:10 -05:00
Kanangot Balakrishnan, Bindhiya 6715c5aa92 [SWDEV-534605] Increase max devices supported and drm test link type (#625)
Increased the AMDSMI_MAX_DEVICES to 64 to accommodate all
devices in CPX mode. The link type has been modified in
amd-smi to match with rocm-smi types, updated the same
for drm tests.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-09-17 16:30:04 -05:00
3049ac537468bd90fe07f2cbb3d7a83e_amdeng 67eb541c15 [SWDEV-531904] Unit and Integ Test Updates (#563)
* [SWDEV-531904] Unit and Integ Test Updates
Updated: unit_tests.py
- Removed redundant self.setUp() and self.tearDown() calls.
- Removed test_free_name_value_pairs() since it is internal only.
Updated: integration_test.py
- Added logic to set AMDSMI_CLI_PATH from environment or default.
- Raise FileNotFoundError if path does not exist.
- Append CLI path to sys.path and handle ImportError with a clear message.
- Removed redundant @handle_exceptions function decorator.
- Removed redundant self.setUp() and self.tearDown() calls.
Updated: amdsmi_interface.py
- Removed POINTER conversion in amdsmi_get_gpu_pm_metrics_info() and amdsmi_get_gpu_reg_table_info()

All tests pass/skip

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

* Update tests/python_unittest/integration_test.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>

* Review Update 1
Modified: integration_test.py
- Added logic to properly loop through firmware list and display each name and version

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

* Skip xgmi_err tests + improve running output

Changes:
1. Now check for elevated permissions
2. Skip xgmi_error related SYSFS tests, refer to xgmi_read_write.cc
   (both are skipped)
3. Added list of tests and provided a summary of additional output
   provided

Change-Id: Iefc85c270faad89c625e2bd7af397d24faed2437
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

---------

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-09-11 16:39:31 -05:00
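
The AMDSMI_CLI_PATH handling listed for integration_test.py in the commit above can be sketched as follows. This is a minimal sketch under assumptions: the helper names `resolve_cli_path` and `import_cli_module` are hypothetical, and the default path is a placeholder because the real default is not given in the message.

```python
import importlib
import os
import sys

# Assumed placeholder; the default used by integration_test.py is not shown above.
DEFAULT_CLI_PATH = "/opt/rocm/libexec/amdsmi_cli"

def resolve_cli_path(environ=os.environ, default=DEFAULT_CLI_PATH):
    """Hypothetical helper: pick the CLI directory and add it to sys.path."""
    cli_path = environ.get("AMDSMI_CLI_PATH", default)
    if not os.path.exists(cli_path):
        raise FileNotFoundError(f"AMD SMI CLI path does not exist: {cli_path}")
    if cli_path not in sys.path:
        sys.path.append(cli_path)
    return cli_path

def import_cli_module(name):
    """Hypothetical helper: import a CLI module with a clearer error message."""
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {name!r}; check that AMDSMI_CLI_PATH is correct"
        ) from exc
```

Failing early with FileNotFoundError separates a wrong path from a broken import, which is the distinction the commit message draws.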
Arif, Maisam ed2300516f Revert "[SWDEV-536176] libdrm_amdgpu dependency change (#448)"
This reverts commit 652761de54.
2025-08-27 20:11:17 -05:00
Arif, Maisam 286c421a49 [SWDEV-552378] Removed First enums in amdsmi_interface.py (#659)
- **Fixed gpuboard and baseboard temperatures enums in amdsmi Python Library**.  
  - AmdSmiTemperatureType had issues with referencing the right attribute, so we removed the following duplicate enums:
    - `AmdSmiTemperatureType.GPUBOARD_NODE_FIRST`
    - `AmdSmiTemperatureType.GPUBOARD_VR_FIRST`
    - `AmdSmiTemperatureType.BASEBOARD_FIRST`

Change-Id: Ia61446b593bd9182d597c4b4c2ac3c5ffdae7493
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-27 18:07:17 -05:00
Arif, Maisam 652761de54 [SWDEV-536176] libdrm_amdgpu dependency change (#448)
* Cmake fix updates
* Next fix will be addressing libdrm further

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Justin Williams <juwillia@amd.com>
2025-08-27 09:32:51 -05:00
Charis Poag d3b73fac82 Revert Major ABI break for amdsmi_get_violation_status()
Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.

Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-18 11:36:57 -05:00
Poag, Charis e2e4fc65c1 [SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)
Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-08-06 16:03:06 -05:00
Pham, Gabriel 95c11daa68 [SWDEV-542706] Adjusted logic for reading pp_od_clk_voltage (#592)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
2025-08-06 11:20:09 -05:00
Poag, Charis d24dc7ef89 [SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582)
Description:
  - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
  - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
  - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
  - Enhanced CLI and test cases to allow users to control when the driver reload occurs.
  - Updated documentation and changelog to reflect the new driver reload process.
  - Improved error handling and logging for driver reload operations.
  - Added progress bar and user confirmation prompts for driver reload commands.

* Update build/test strategy to only allow one test execution at a time
* Modify API verbiage + modify systemctl error output
  - Systemctl is typically not enabled on docker.
  - And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-08-05 20:44:28 -05:00
Liu, Shuzhou (Bill) abd3c02a3c Query UBB/OAM temperature API (#581)
Add support to Query UBB/OAM temperature.
* Updated Python API with new temperature metrics enum

---------

Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:37:45 -05:00
gabrpham_amdeng 4f0d1c8c29 [SWDEV-543627] Fixed incorrect metric min clock values
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-07-26 04:55:25 -05:00
Pham, Gabriel e2eac98496 [SWDEV-545342] Fixed amdsmi_link_type_t enumeration (#560)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-22 18:22:49 -05:00
Poag, Charis ec055f2c2d [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated the `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-07-22 17:21:15 -05:00
Castillo, Juan 3b1957e674 [SWDEV-531904] Added test_get_gpu_revision (#533)
* [SWDEV-531904] Added test_get_gpu_revision
New:
- amdsmi_get_gpu_revision() previously not implemented in amdsmi_interface.py
- test_get_gpu_revision() missing integration test.

Updated:
- changelog.md: added new doc fields for ROCm 7.1
- amdsmi-py-api.md: added field|description doc fields

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-07-15 19:35:54 -05:00
Bindhiya Kanangot Balakrishnan 645c313f00 [SWDEV-543308] Revert amdsmi_link_metrics structure change
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-14 13:56:26 -05:00
Narlo, Joseph 2cf6272b53 [SWDEV-541675] Remove Unnecessary API from amdsmi.h (#530)
Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
2025-07-07 11:14:27 -05:00
Saeed, Oosman 5b95d227bc [SWDEV-538308] CPER CLI 20 limit bug (#499)
The bug was reproduced like this.

In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done

Terminal #1 starts dumping the CPER entry information that it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

From terminal #3, there will be no output, even when terminal #1 continues capturing and printing information.

The fix:

Since we already have more than 20 CPER entries available in the GPU buffer, when we run the command from terminal #3 to start capturing from the beginning and pass 20 buffers to copy entries to, the C++ API returns a code saying there is more data available.

The Python CLI should not treat this as an error, but should continue to print what the API returned.

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-07-07 11:11:13 -05:00
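
The fix described in the commit above amounts to treating a "more data available" return code as a normal result. A rough sketch follows; the `Status` enum and the `entries_to_print` helper are hypothetical stand-ins for illustration and are not the library's actual status codes or CLI code.

```python
from enum import Enum

class Status(Enum):
    # Hypothetical status codes for illustration; not the library's enum.
    SUCCESS = 0
    MORE_DATA = 1
    ERROR = 2

def entries_to_print(status, entries):
    """Return the entries the CLI should print for a given API status."""
    if status in (Status.SUCCESS, Status.MORE_DATA):
        # "More data available" still comes with valid entries, so print them.
        return list(entries)
    raise RuntimeError(f"CPER query failed: {status.name}")

print(entries_to_print(Status.MORE_DATA, ["cper_1", "cper_2"]))
```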
Maisam Arif 28a7f536f9 Fix subsystem_id str comparison
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Icbe2440884458b63b42cb653009e7df36eb31e0f
2025-06-19 17:21:17 -05:00
Narlo, Joseph 7c0802889b [SWDEV-489696] Improve AMD SMI Python APIs Functional and Unit Testing (#468)
* Adding python unit tests
* Remove duplicate functions definitions
* Added missing classes for __init__ for py-interface

---------

Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-19 16:38:34 -05:00
Maisam Arif 6da33b8ded [SWDEV-529665] PLDM Bundle naming
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Id7f652ddc4e790027869683a4aaa3226ffc05c83
2025-06-12 02:19:37 -05:00
Arif, Maisam 23b9da656c Fixed type hinting & Added copy rights (#462)
* Added copyrights
* Fixed type hinting for processor_handle in python_interface
* Fixed Incorrect type hinting to actual return types

---------

Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Change-Id: Ie2a09acf628ed0c43eacc8ec78c159d125acbcdb
2025-06-11 17:19:02 -05:00
Maisam Arif b579d89ae2 [SWDEV-537062] Fixed CU Occupancy reporting UINT MAX
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I975579997a9e455eb930f6c0b8fc5f3dc3cbfae4
2025-06-11 10:42:00 -05:00
Maisam Arif 93404a6bff [SWDEV-529665] Fix PLDM version format
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I7df4c2068e32a5c81c83adc69dc82a9f5d725533
2025-06-11 07:35:25 -05:00
Maisam Arif ac63f410c2 Fixed Parser Folder Checking
* Adjusted help text
* Adjusted --afid to run only with --cper-file
* Fixed interface return error

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I2b96f4515c85f3b9dd84ba5c2d819729a997141b
2025-06-10 15:58:06 -05:00
Maisam Arif fb592e003a [SWDEV-536417] CPER Display fixes
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic2f3901d0f4c95bd9ed4beda8aa5fd3d596df8d2
2025-06-10 15:58:06 -05:00
Charis Poag 391451752b [SWDEV-529030/SWDEV-531217] Fix tests & output for partitioned configurations (CPX, DPX, QPX, etc.)
Changes:
  - Updated AMD SMI firmware to display "N/A" for unavailable firmware in partitioned environments, improving clarity.
    Example (in DPX):
    $ amd-smi firmware
    GPU: 0
        FW_LIST:
            ...
            FW 12:
                FW_ID: PM
                FW_VERSION: 00.86.39.00
    GPU: 1
        FW_LIST: N/A
  - Fixed amd-smi partition not showing current partition information on
    ASICs with inability to set memory or accelerator partitions.
    $ amd-smi partition -c -m
    CURRENT_PARTITION:
    GPU_ID  MEMORY  ACCELERATOR_TYPE  ACCELERATOR_PROFILE_INDEX  PARTITION_ID
    0       NPS1    CPX               2                          0
    1       N/A     N/A               N/A                        1
    2       N/A     N/A               N/A                        2
    3       N/A     N/A               N/A                        3
    4       N/A     N/A               N/A                        4
    5       N/A     N/A               N/A                        5
    6       NPS1    SPX               0                          0
    7       NPS1    SPX               0                          0
    8       NPS1    SPX               0                          0

    MEMORY_PARTITION:
    GPU_ID  MEMORY_PARTITION_CAPS  CURRENT_MEMORY_PARTITION
    0       N/A                    NPS1
    1       N/A                    N/A
    2       N/A                    N/A
    3       N/A                    N/A
    4       N/A                    N/A
    5       N/A                    N/A
    6       N/A                    NPS1
    7       N/A                    NPS1
    8       N/A                    NPS1

  - Refactored amd_smi_drm_example.cc:
    - Grouped partition changes and restores original partition settings.
    - Now handles partitioned environments allowing example to continue even if some APIs are not supported in partitioned configurations.
  - Modified amdsmi_asic_info_t (see amdsmi_get_gpu_asic_info()) to report OAM ID as N/A if 0xFFFFFFFF (was 0xFFFF).
    Allows for better handling of OAM IDs in partitioned environments (DNE for non-primary nodes,
    since it's a physical identifier). Easier to handle in tests and example code (i.e., now consistent with the max size of the structure's value).
  - Introduced amdsmi_RAII_open_FD() (internal API) to manage file descriptors using RAII, ensuring proper closure and preventing resource leaks.
    Updated the following APIs to use this function:
      - amdsmi_get_gpu_asic_info(), amdsmi_get_gpu_vram_usage(),
        amdsmi_get_gpu_vram_info(), amdsmi_get_gpu_vbios_info(),
        amdsmi_get_gpu_driver_info(), amdsmi_get_gpu_virtualization_mode()
  - Updated AMD SMI test_base.cc/.h:
    - Improved output and handling for partitioned environments.
    - Added detailed ASIC information logging to align with structure changes.
    - Enhanced error messages for better context before ASSERT checks.
  - Resolved test failures in partitioned environments by updating
    logic and handling for partition-specific configurations.
    Fixed tests include:
      - computepartition_read_write.cc, frequencies_read_write.cc,
        gpu_metrics_read.cc, mem_util_read.cc, memorypartition_read_write.cc,
        perf_level_read.cc, perf_level_read_write.cc, power_cap_read_write.cc,
        power_read.cc, sys_info_read.cc, gpu_busy_read.cc

Change-Id: I36e903f8fddd714c74c719459c71aba8bbb77e6f
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Resetting head + adding fixes for tests run in partitions

Change-Id: I0c1e9ac07488b50c95f3bc6d8a724e67d2c715dc
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-06-05 19:24:49 -05:00
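
The OAM ID change in the commit above (report N/A when the value is 0xFFFFFFFF) can be illustrated with a short sketch. The constant and helper names are hypothetical; only the sentinel value comes from the commit message.

```python
OAM_ID_UNAVAILABLE = 0xFFFFFFFF  # sentinel value named in the commit message

def format_oam_id(oam_id):
    """Hypothetical helper: show "N/A" for the sentinel, the raw ID otherwise."""
    return "N/A" if oam_id == OAM_ID_UNAVAILABLE else oam_id

print(format_oam_id(0xFFFFFFFF))
print(format_oam_id(3))
```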
Arif, Maisam e2692ab533 Add Directory Not Found Status code to map to ENOTDIR (#238)
* Corrected ecc count error return
* Added directory not found error code
* Added ENOTDIR mapping to RSMI_STATUS_DIRECTORY_NOT_FOUND in ErrnoToRsmiStatus

---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-06-03 17:53:33 -05:00
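
The ENOTDIR mapping added in the commit above can be pictured with the sketch below. It is written in Python for brevity, while `ErrnoToRsmiStatus` itself is a C++ function whose full contents are not shown in this log; the table, the helper name, and the fallback value are assumptions, and only the ENOTDIR entry comes from the commit message.

```python
import errno

# Hypothetical subset of an errno-to-status table; only the ENOTDIR entry
# is taken from the commit message.
ERRNO_TO_STATUS = {
    0: "RSMI_STATUS_SUCCESS",
    errno.ENOTDIR: "RSMI_STATUS_DIRECTORY_NOT_FOUND",
}

def errno_to_status(err, fallback="RSMI_STATUS_UNKNOWN_ERROR"):
    """Hypothetical helper: translate an errno value into a status name."""
    return ERRNO_TO_STATUS.get(err, fallback)

print(errno_to_status(errno.ENOTDIR))
```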
Narlo, Joseph ce7d6dfe61 [SWDEV-532769] amd-smi APIs mismatch with documentation (#428)
* Populated socket_power to get power info
---------

Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-03 17:12:13 -05:00
Kanangot Balakrishnan, Bindhiya 8ed52616ad [SWDEV-519061] xgmi command output shows zero for all xgmi acc read/write data in the first column (#392)
The xgmi read and write accumulated data from the gpu metric index
is based on the sysfs xgmi_port_num file. Mapped these two to display
read and write with respect to src_gpu vs. dst_gpu.
---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-02 14:01:06 -05:00
Joseph Narlo ee43ec71e8 [SWDEV-522996] Syncing Unified Header and AMDSMI
Signed-off-by: Joseph Narlo <joseph.narlo@amd.com>
2025-06-02 13:44:33 -05:00
Maisam Arif c89b5db09d Deprecated PASID
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ib008f80f3d736172079358c0ceb3ebca87340d28
2025-05-30 20:48:29 -05:00
Kanangot Balakrishnan, Bindhiya 2eff0b3764 [SWDEV-530633] Use gpu_metric speed and BW for xgmi (#366)
The xgmi command was showing the PCIe bit rate and bandwidth instead of the xgmi values. Corrected the API to get xgmi data from gpu metric.
Added python API for amdsmi_get_link_metrics. Modified the amdsmi_link_metrics struct.
Added check to confirm non zero partition got xgmi command.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-05-30 16:51:11 -05:00
Arif, Maisam 42441c78ea [SWDEV-488303] Adjusted process vram_mem data source (#411)
* [SWDEV-488303] Adjusted process vram_mem data source
* Standardized sscanf format strings

---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-05-29 23:26:12 -05:00
Arif, Maisam 0fdaebdbaa [SWDEV-488303] Updated CU occupancy for per-process retrieval (#243)
Change-Id: I2990597c6dd4b2e8cf3e11ce60f72049ebdd9a8c
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-05-29 20:35:27 -05:00
Liu, Shuzhou (Bill) 970560fc7c [SWDEV-520665] Add support for board voltage (#303)
* Add the API and CLI to show the board voltage. 

---------

Change-Id: Icb25bd653bb1d004704b5a21b378ca31b2b242c7
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-05-29 18:55:08 -05:00
Kanangot Balakrishnan, Bindhiya e7f19b36f0 [SWDEV-463406] ViolationStatus Changes (#288)
* Expanded Violation Status tracking for GPU metrics 1.8
* Added new fields to `amdsmi_violation_status_t` and related interfaces for enhanced violation statuses
---------

Signed-off-by: Kanangot Balakrishnan, Bindhiya <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-05-29 13:26:21 -05:00
Pryor, Adam d0a89393df Remove ring hang (#391)
Change-Id: I856cd0949d3661911ab9302148aa1bc6e72abeed

Signed-off-by: adapryor <Adam.pryor@amd.com>
2025-05-29 11:58:46 -05:00