283 Commits

Author SHA1 Message Date
Bindhiya Kanangot Balakrishnan 8326c33d33 [SWDEV-573540] Add DRM-based wake for suspended AMD GPUs (#2510)
Implements automatic device wake using getDRMDeviceId() DRM call when GPUs
are detected in low-power state. This ensures rocm-smi can access device
information on suspended GPUs.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2026-01-08 10:19:45 -06:00
Mario Limonciello bfb13f2b43 Run pre-commit's whitespace related hooks on projects/rocm-smi-lib (#2117)
* Run pre-commit's whitespace related hooks on projects/rocm-smi-lib

In order for pre-commit to be useful, everything needs to meet a common
baseline.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

* Added Changelog Spaces for formatting

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

---------

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-12-11 15:41:24 -06:00
Charis Poag Jones 933fdc3c7e [SWDEV-558141] Fix rocm-smi --setsclk [0...n] & other clocks in partitioned configurations (#1493)
Changes:
  - Fix `rocm-smi --setsclk [0 .. n]` for multiple devices to continue on fail when
    in a partitioned configuration (ex. in DPX/QPX/CPX/etc).
  - Partitioned configurations or devices which do not support changing
    sclk/mclk/pcie clks will now continue on failure. Will report a "not
    supported" or other (rocm-smi) error codes for these devices.
  - Updates impact other clock settings such as `--setmclk` and
    `--setpcie`.

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-10-23 08:56:41 -05:00
systems-assistant[bot] 88201d2b79 [SWDEV-544729] Updated CLI error handling (#216)
Updated: rocm_smi.py
- Remove all else: clauses from functions where rsmi_ret_ok is part of the if clause, as requested.
- rsmi_ret_ok() function already handles unsuccessful return codes gracefully.
- Updated check_runtime_status() function to sweep through /sys/class/drm to find active runtime_status.
- Updated the message to 'AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status'
- This clarifies the status of the GPU and tells them where to check for more info.

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: gabrpham <Gabriel.Pham@amd.com>
2025-09-16 10:56:03 -05:00
systems-assistant[bot] bfdb3bc636 fix(python): fix comparison to None (#211)
from PEP8 (https://peps.python.org/pep-0008/#programming-recommendations):

> Comparisons to singletons like None should always be done with is or is not, never the equality operators.

Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
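
The PEP 8 recommendation quoted in this commit can be illustrated with a minimal Python sketch. The `AlwaysEqual` class is hypothetical and is not taken from rocm_smi.py; it only shows that `==` can be overridden while `is` cannot.

```python
class AlwaysEqual:
    """Hypothetical class whose __eq__ claims equality with everything."""

    def __eq__(self, other):
        return True


value = AlwaysEqual()

# Equality can be overridden, so `== None` may give a misleading answer.
equals_none = (value == None)  # noqa: E711

# Identity cannot be overridden, so `is None` is reliable.
is_none = value is None
```

Here `equals_none` ends up `True` even though `value` is not `None`, while `is_none` is `False`.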
2025-09-10 14:50:32 -05:00
systems-assistant[bot] 39ea16e544 fix(E712): fix comparison to True/False (#212)
from PEP8 (https://peps.python.org/pep-0008/#programming-recommendations):

> Comparisons to singletons like None should always be done with is or is not, never the equality operators.

Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
2025-09-10 14:50:23 -05:00
Bindhiya Kanangot Balakrishnan c7b6bb9600 SWDEV-476667 - Get pcie bw from GPU metric (#853)
The sysfs pcie bandwidth file pcie_bw is deprecated
in newer asics. This change will get pcie BW from
GPU metric for version 1.5 or later.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-09-10 14:48:31 -05:00
gabrpham 5dbca01d2d [SWDEV-551309] Adjusted rocmsmitst and --resetprofile command (#769) 2025-09-09 14:32:35 -05:00
gabrpham 94e194eba2 [SWDEV-540377] Fixed segfault in --showevent command (#649)
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-28 11:49:36 -05:00
GabrPham 67fd6c0f73 Applied Copilot suggestions
[ROCm/rocm_smi_lib commit: def9a3c92d]
2025-08-06 12:42:44 -05:00
GabrPham 12f71cab20 Adjusted logic for reading pp_od_clk_voltage
Signed-off-by: GabrPham <gabrpham_amdeng@amd.com>


[ROCm/rocm_smi_lib commit: 3bea40dfd0]
2025-08-06 12:42:44 -05:00
GabrPham a1052e7598 Updated Tool and Lib Version
Signed-off-by: GabrPham <gabrpham_amdeng@amd.com>


[ROCm/rocm_smi_lib commit: 25aec994a0]
2025-08-06 12:38:08 -05:00
GabrPham 4e69ac4f59 Update threading to use more stable threading module
Unstable threading was causing segmentation faults. Update to use more
recent threading module rather than the _thread module solved
segmentation fault issue.

multiple issues solved by this commit:
	[SWDEV-537518]
	[SWDEV-540377]
	[SWDEV-540223]
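
As a rough illustration of the move from `_thread` to `threading`, the sketch below starts a few workers and waits for them. The `worker` function and the `results` list are hypothetical stand-ins, not code from rocm_smi.py.

```python
import threading

results = []
results_lock = threading.Lock()


def worker(device_id):
    """Hypothetical per-device task; stands in for real monitoring work."""
    with results_lock:
        results.append(device_id)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    # Thread objects can be joined; _thread.start_new_thread() only
    # returns an identifier, so the caller has no handle to wait on.
    t.join()
```

Joining every thread before the program continues is one way to avoid tearing down shared state while a worker is still running.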

Signed-off-by: GabrPham <gabrpham_amdeng@amd.com>


[ROCm/rocm_smi_lib commit: 7dba992ebd]
2025-07-09 09:36:44 -05:00
Peter Park 5a3556ca85 update copyright years to 2025
revert shared_mutex.h


[ROCm/rocm_smi_lib commit: a156bfa4ae]
2025-06-03 17:16:54 -05:00
Castillo, Juan 08062c0577 Fix WARNING: AMD GPUs visible, but data is inaccessible (#58)
* [SWDEV-531834] Fix AMD GPUs visible, but data is inaccessible:
- Scans directories under /sys/bus/pci/drivers/amdgpu
- Verifies each device's runtime_status to determine if it's active
- Returns False if any device is not in active state
- Handles permission errors gracefully with proper debug logging
- Includes comments explaining behavior differences between Instinct / NAVI hardware

The default status is set to True, assuming devices are active unless
proven otherwise, which accommodates hardware like some Instinct ASICS
which do not support runtime power management.
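
The behaviour described above might look roughly like the following Python sketch. The function name `all_devices_active` and the `driver_dir` parameter are hypothetical, and the sketch assumes each device entry exposes `power/runtime_status`; it also treats every value other than `active` as inactive, which may be stricter than the actual implementation.

```python
import logging
import os


def all_devices_active(driver_dir="/sys/bus/pci/drivers/amdgpu"):
    """Return False if any device under driver_dir reports a non-active runtime_status."""
    all_active = True  # assume active unless proven otherwise
    try:
        entries = os.listdir(driver_dir)
    except OSError as err:
        logging.debug("Cannot list %s: %s", driver_dir, err)
        return all_active
    for entry in entries:
        status_path = os.path.join(driver_dir, entry, "power", "runtime_status")
        if not os.path.isfile(status_path):
            continue  # not a device entry, or no runtime PM information
        try:
            with open(status_path) as status_file:
                status = status_file.read().strip()
        except OSError as err:  # includes PermissionError
            logging.debug("Cannot read %s: %s", status_path, err)
            continue
        if status != "active":
            all_active = False
    return all_active
```

Passing the directory as a parameter keeps the sketch testable against a temporary directory instead of the real sysfs tree.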

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

[ROCm/rocm_smi_lib commit: 47f80145cb]
2025-05-15 14:30:33 -05:00
Poag, Charis efb37d89bc [SWDEV-522992] Make libdrm / libdrm_amdgpu load dynamically (#43)
Changes:
- Now load libdrm/libdrm_amdgpu dynamically

Change-Id: I49fb1f3540b3235a25370f7cfcfb9778db34c2a5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/rocm_smi_lib commit: ce405476ca]
2025-04-16 16:03:42 -05:00
Charis Poag aacf23778d [SWDEV-518325/SWDEV-518320/SWDEV-443309] Fix Partition Enumeration
* Changes:
  - Updates to DRM renderD* / card* pathing for partition devices
  - Now use KFD to discover AMD devices and populate accordingly
    Device MUST have an accessible KFD node (via cgroups)
  - Updated several ROCm SMI CLI outputs to handle SYSFS files
    which are not accessible on partition nodes
  - Added a new method to help get card/drm info
    (rsmi_dev_device_identifiers_get) from ROCm SMI

Change-Id: If844f27ffc595942272abe9c8167ed90a0b0e225
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: a0df877fdf]
2025-04-14 16:03:24 -05:00
Castillo, Juan 07c06318ad [SWDEV-516013]-rocm-smi runtime status check fix (#28)
rocm-smi is not working in mGPU, Blocking DLM tests
Updates include:
 - Creating check_runtime_status function to check for device status of active.
 - Added warning to users that No AMD GPUs are available, check power status/control.
 - Added check for empty string coming from HWMON; if empty, returns unexpected data.

---------

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

[ROCm/rocm_smi_lib commit: 2630bf0a8c]
2025-04-14 13:05:22 -05:00
Castillo, Juan 3aa80ec0e4 SWDEV-518214: GPU Metrics 1.8 (#31)
* SWDEV-518214: GPU Metrics 1.8 (#31)

- Updates:
    - Adding the following metrics to allow new calculations for violation status:
        - Per XCP metrics gfx_below_host_limit_ppt_acc
        - Per XCP metrics gfx_below_host_limit_thm_acc
        - Per XCP metrics gfx_low_utilization_acc
        - Per XCP metrics gfx_below_host_limit_total_acc
    - Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.
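
A minimal sketch of how a sentinel value such as UINT16_MAX could be rendered as N/A in a CLI. The helper name `format_engine_activity` is hypothetical and not part of ROCm SMI.

```python
UINT16_MAX = 0xFFFF  # sentinel for an engine the ASIC does not report


def format_engine_activity(values):
    """Render each reading as text, mapping the sentinel to 'N/A'."""
    return ["N/A" if v == UINT16_MAX else str(v) for v in values]


formatted = format_engine_activity([12, 0, UINT16_MAX])
```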

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/rocm_smi_lib commit: f69e65f7bd]
2025-03-20 18:07:32 -05:00
Charis Poag 8e841f22ac [SWDEV-504146] Fix Device Name
Changes:
  - Fixed Device Name (market name)
  - Added new API rsmi_dev_market_name_get()
  - Updated tests
  - Updated amdgpu_drm.h to match latest mainline kernel
  - Fixed subsystem ID to only show hex value (not subsystem name)
  - rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997


[ROCm/rocm_smi_lib commit: b951a65cf2]
2025-03-09 14:23:22 -05:00
Kanangot Balakrishnan, Bindhiya 7bbbaeb020 [SWDEV-481004] Fix for incorrect gfx_version number (#8)
The target_graphics_version was not formatted properly and was
showing incorrect Target Name. Corrected this by formatting
major, minor and revision numbers.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

[ROCm/rocm_smi_lib commit: 6337f7b05b]
2025-03-09 14:23:22 -05:00
Galantsev, Dmitrii 89913b7899 [SWDEV-495169] Update ROCm SMI CLI and Error handling (#3)
Issues include:

Update ROCm SMI displaying None or Not Supported to N/A
Update ROCm SMI displaying err msg to instead log err

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Change-Id: I1a2ce6e4f329666b5666664a7d7b4475d6c1cbc7

[ROCm/rocm_smi_lib commit: 898ae4ffc1]
2025-03-09 14:23:22 -05:00
Charis Poag 125bdaf4f5 [SWDEV-496693] GPU metrics 1.7
Changes:
    - Added new GPU metrics:
      1) XGMI link status - Up/Down; 1 = up; 0 = down
      2) Graphics clocks below host limit (per XCP)
         accumulators -> used to help calculate a violation status
      3) VRAM max bandwidth at max memory clock
    - Updated rocm-smi --showmetrics to include new metrics.
    Units/values reflect as indicated by driver, may differ
    from AMD SMI or other ROCm SMI interfaces which
    use these fields.
    - N/A fields means the device does not support providing this
    data.

Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 88a7e4b8ad]
2025-03-09 14:23:21 -05:00
Charis Poag 4451b18408 [SWDEV-475712] Fix MI2x target_graphics_version
Removed correcting target_graphics_version by
product name. Instead detected target_graphics_version which
needs to be corrected -> populate accordingly.

Change-Id: I90765c87e0629daea5c732dace8acfd17e8c62c7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: d2efac3d93]
2025-03-09 14:23:21 -05:00
Charis Poag 72fd1821b0 Merge amd-staging into amd-master 20241125
Change-Id: I801dcda853066d8d2e19a8727b2b07dcafc253b4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 562575c73d]
2024-11-25 08:39:32 -06:00
Charis Poag 06fead7e41 [SWDEV-499029] Fix unable to change memory partition modes
Changes:
  * [API] Removed checking board name, fixes for other MI ASICs
  * [CLI] Increased progress bar to change memory partition modes
    to 140 seconds, since driver reload is variable per system

Change-Id: Ifcaf40d28b4adf5eaa800c9e3748d33749dc414a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: d04cec7f1d]
2024-11-22 20:19:29 -05:00
Charis Poag ff46ab4258 Merge amd-staging into amd-master 20241112
Change-Id: I3fba6fb940aa19532037e2125fd1837de4d3f282
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 99c1b5a0df]
2024-11-12 16:43:50 -06:00
Charis Poag 2258c26c53 [SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 46902274b6]
2024-11-12 12:18:44 -04:00
Zhang Ava 2a5e1a98bb Merge amd-staging into amd-master 20241106
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Ib125dee62e5a893871f5c6df7715177973361a02


[ROCm/rocm_smi_lib commit: fa2c9180d7]
2024-11-08 08:42:13 +08:00
Jorge López d51bd18649 Updates driverInitialized() to support amdgpu built as module as well as kernel built-in. Fixes ROCm/rocm_smi_lib#102 and is an updated version of ROCm/rocm_smi_lib#104
Change-Id: Icb3abe820bc67035b822358a1c04bd09a7c22b6b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 35c1d00f5a]
2024-11-05 16:30:37 -05:00
Charis Poag 7f4ec4f82f Merge amd-staging into amd-master 20241022
Change-Id: I823ffdba9f1db614542658a2af61df917a44c07a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 7504cd04eb]
2024-10-22 18:23:12 -05:00
Oliveira, Daniel d41fbc88ca [SWDEV-490187 / SWDEV-491215] Remove reset gpu partition + NPS test disabled
The reset gpu partition support for both compute and memory were removed

Code changes related to the following:
  * rsmi_dev_compute_partition_reset()
  * rsmi_dev_memory_partition_reset()
  * CLI
  * Unit tests
  * Documentation

Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: a1295714f2]
2024-10-21 14:22:57 -05:00
Charis Poag 986c8e09ef Merge amd-staging into amd-master 20240930
Change-Id: I814a16d5e1f9371e00dbbb3623dc975ab2359f44
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 28c2cc3298]
2024-09-30 10:56:18 -04:00
Charis Poag 0b40a73798 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 + --showmetrics
Changes:
- Added new GPU metrics:
  1) Violation status' (ex. PVIOL/TVIOL) accumulators
  2) XCP (Graphics Compute Partitions) statistics
  3) pcie other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields means the device does not support providing this
data.

Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 0609cbf1d0]
2024-09-27 13:18:05 -04:00
Zhang Ava da7bf7e366 Merge amd-staging into amd-master 20240917
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I198c849530508a90eee8ae5454035b9c610b3f5a


[ROCm/rocm_smi_lib commit: af2507807f]
2024-09-19 18:44:19 +08:00
James Xu ddba959395 SWDEV-478077 - logging.warn used instead of logging.warning
- logging.warn() is deprecated in favour of logging.warning()
- for some reason, this is the only place in all of rocm_smi.py
	that uses logging.warn() as pointed out on github
	https://github.com/ROCm/rocm_smi_lib/issues/187
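
For reference, a small Python sketch of the preferred call. The logger name and the message text are made up for illustration.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("rocm_smi_example")  # hypothetical logger name
logger.addHandler(logging.StreamHandler(stream))
logger.setLevel(logging.WARNING)

# warn() is a deprecated alias; warning() is the supported spelling.
logger.warning("device %d: value not supported", 0)
```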

Change-Id: Ie1e4a0ea16b996fbed2e902c8edfe68087a5a5fa


[ROCm/rocm_smi_lib commit: fe6a49d186]
2024-09-16 13:50:26 -04:00
Zhang Ava 9316bae9d8 Merge amd-staging into amd-master 20240911
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Iaa1be5b9c6eb4c205ced9d610feada93ad28aa50


[ROCm/rocm_smi_lib commit: 743bd50aa5]
2024-09-13 18:31:57 +08:00
Oliveira, Daniel 902bd6b1ae [SWDEV-483822] rocm-smi shows 'warning' for unsupported curves
Options '--showvoltagerange' and '--showvc' show 'warning' instead of 'error' for unsupported voltage curves

Code changes related to the following:
  * CLI

Change-Id: Ide662c98202c32ad01ccaf3c47a61f2543f82ebb
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 72b112f8f3]
2024-09-10 11:36:36 -05:00
Zhang Ava 454af241e3 Merge amd-staging into amd-master 20240828
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I13187d5772ee1e5e74d9daf4268b90819b4198d0


[ROCm/rocm_smi_lib commit: ad511e9b0d]
2024-08-29 20:09:31 +08:00
Charis Poag 16ce520533 Fix rocm-smi --showfw displaying error fw prints
Updates:
  - [CLI] Previously --showfw displayed fw that
    does not exist on systems. This change removes
    that extra output.

Change-Id: If8b063001b80b03579ea1378dfd890c60f62ccd7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 6b8db74578]
2024-08-27 15:43:16 -04:00
Galantsev, Dmitrii 4b21bb4b29 Merge amd-staging into amd-master 20240808
Change-Id: I15b180364b79de72a74ae52fbce7009122a01415
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: ee3caa23ed]
2024-08-08 16:38:24 -05:00
Maisam Arif 02d2ceddaf Bump version tool:2.3.1+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic67456d7484c2f5a0ce0e086e56b29e20d9d9745


[ROCm/rocm_smi_lib commit: 055b023d2e]
2024-08-08 01:40:55 -05:00
Zhang Ava 86acd39029 Merge amd-staging into amd-master 20240801
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I8c9b1a2805e83e5de5873ef8fafaf38143c2ebd8


[ROCm/rocm_smi_lib commit: 481928965f]
2024-08-02 13:12:38 +08:00
Ranjith Ramakrishnan 70b29ed8c3 SWDEV-469004 - Append additional path to system path
rocm-smi is installed in /opt/rocm-ver/bin , but not as a soft link in wheel package
For rocm-smi to work from bin directory, it needs the extra path to find rsmiBindings.py

Change-Id: I41388f680cb2ab9f11dc135639b0d30b66082392


[ROCm/rocm_smi_lib commit: c9201f7736]
2024-07-31 19:52:46 -04:00
Maisam Arif 1998b57059 [SWDEV-464799] Handle UnicodeEncodeError with non UTF-8 locales
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ifb8e6e3c7891c4f70faba5441fb87cc4ba2302f3


[ROCm/rocm_smi_lib commit: c2235eea35]
2024-07-31 17:01:01 -04:00
Maisam Arif 39bada540d Merge amd-staging into amd-master 20240710
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I82f353d21279e2c1ee1788cb9e949b6d3b7e3270


[ROCm/rocm_smi_lib commit: 3cd677419e]
2024-07-10 19:57:39 -05:00
Maisam Arif 96fd0e1ea4 Bump version lib:7.3.0 tool:2.3.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I637b34e03580d5b5efb1e12805a9cdeb7778de74


[ROCm/rocm_smi_lib commit: db4d81b944]
2024-07-10 19:55:15 -05:00
Zhang Ava fed31f08b4 Merge amd-staging into amd-master 20240628
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I9493cdf35b64cfa0a99de017e2d6b521af71cf14


[ROCm/rocm_smi_lib commit: 4c0ce45912]
2024-07-04 14:19:02 +08:00
Charis Poag 33eb3fa429 [SWDEV-463213] Add partition ID fallback + new API
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
  below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.

Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8]  = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
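
The bit layout above can be sketched in Python as follows. The function names are hypothetical, and the fallback rule (use bits [2:0] when bits [31:28] are zero) is an assumption made for illustration; the library may decide differently, for example based on the current partition mode.

```python
def decode_location_id(value):
    """Split a 64-bit identifier into the fields listed above."""
    return {
        "domain": (value >> 32) & 0xFFFFFFFF,  # bits [63:32]
        "partition_id": (value >> 28) & 0xF,   # bits [31:28]
        "bus": (value >> 8) & 0xFF,            # bits [15:8]
        "device": (value >> 3) & 0x1F,         # bits [7:3]
        "function": value & 0x7,               # bits [2:0]
    }


def partition_id_with_fallback(value):
    """Prefer bits [31:28]; fall back to the function bits when they are zero."""
    fields = decode_location_id(value)
    return fields["partition_id"] or fields["function"]


# Partition id 3 reported in bits [31:28], bus 0x43.
primary = partition_id_with_fallback((3 << 28) | (0x43 << 8))
# Bits [31:28] are zero, so the function bits (here 2) are used instead.
fallback = partition_id_with_fallback((0x43 << 8) | 2)
```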

Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 323ab1105d]
2024-06-27 17:27:01 -05:00
jingyu1l ff020dae44 Merge amd-staging into amd-master 20240622
Signed-off-by: jingyu1l <Jingyu1.Li@amd.com>
Change-Id: I7d1c62c8e61c5e43200efd4b5abd7f48e8182e65


[ROCm/rocm_smi_lib commit: 5463955787]
2024-06-27 14:37:24 +08:00