265 Commits

Author SHA1 Message Date
Castillo, Juan f69e65f7bd SWDEV-518214: GPU Metrics 1.8 (#31)
* SWDEV-518214: GPU Metrics 1.8 (#31)

- Updates:
    - Adding the following metrics to allow new calculations for violation status:
        - Per XCP metrics gfx_below_host_limit_ppt_acc
        - Per XCP metrics gfx_below_host_limit_thm_acc
        - Per XCP metrics gfx_low_utilization_acc
        - Per XCP metrics gfx_below_host_limit_total_acc
    - Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-03-20 18:07:32 -05:00
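The UINT16_MAX convention described in f69e65f7bd can be illustrated with a minimal Python sketch. Only the 40-engine count and the "UINT16_MAX means N/A" rule come from the commit message; the function name `format_jpeg_activity` and the sample readings are hypothetical and are not taken from rocm_smi.py.

```python
UINT16_MAX = 0xFFFF  # sentinel the commit message says marks an unsupported engine
NUM_JPEG_ENGINES = 40  # engine slots available as of GPU Metrics 1.8


def format_jpeg_activity(raw_values):
    """Map raw per-engine readings to display strings, showing N/A for the sentinel."""
    if len(raw_values) != NUM_JPEG_ENGINES:
        raise ValueError(f"expected {NUM_JPEG_ENGINES} values, got {len(raw_values)}")
    return ["N/A" if v == UINT16_MAX else str(v) for v in raw_values]


# Hypothetical sample: an ASIC exposing 32 engines, with the last 8 slots unsupported.
sample = [0] * 32 + [UINT16_MAX] * 8
formatted = format_jpeg_activity(sample)
```

Keeping the sentinel check in one helper means callers never have to compare against 65535 themselves.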
Charis Poag b951a65cf2 [SWDEV-504146] Fix Device Name
Changes:
  - Fixed Device Name (market name)
  - Added new API rsmi_dev_market_name_get()
  - Updated tests
  - Updated amdgpu_drm.h to match latest mainline kernel
  - Fixed subsystem ID to only show hex value (not subsystem name)
  - rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997
2025-03-09 14:23:22 -05:00
Kanangot Balakrishnan, Bindhiya 6337f7b05b [SWDEV-481004] Fix for incorrect gfx_version number (#8)
The target_graphics_version was not formatted properly and was
showing an incorrect Target Name. Corrected this by formatting the
major, minor and revision numbers.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-03-09 14:23:22 -05:00
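As a rough illustration of the formatting fix in 6337f7b05b, the sketch below builds a target name from major, minor and revision numbers. It assumes the raw value packs the three fields in decimal as major * 10000 + minor * 100 + revision and that minor and revision are printed as hex digits. Neither assumption is stated in the commit message, and the function name is hypothetical.

```python
def format_target_graphics_version(raw_version):
    """Build a 'gfx...' target name from an assumed decimal-packed version."""
    major = raw_version // 10000          # assumed: leading decimal digits
    minor = (raw_version // 100) % 100    # assumed: middle two decimal digits
    revision = raw_version % 100          # assumed: last two decimal digits
    # Minor and revision are rendered in hex, so revision 10 becomes 'a'.
    return f"gfx{major}{minor:x}{revision:x}"


names = [format_target_graphics_version(v) for v in (90010, 90402, 110000)]
```

Under these assumptions a revision of 10 prints as `a`, which is the kind of detail a plain decimal concatenation would get wrong.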
Galantsev, Dmitrii 898ae4ffc1 [SWDEV-495169] Update ROCm SMI CLI and Error handling (#3)
Issues include:

- ROCm SMI now displays N/A instead of None or Not Supported
- ROCm SMI now logs errors instead of displaying error messages

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Change-Id: I1a2ce6e4f329666b5666664a7d7b4475d6c1cbc7
2025-03-09 14:23:22 -05:00
Charis Poag 88a7e4b8ad [SWDEV-496693] GPU metrics 1.7
Changes:
    - Added new GPU metrics:
      1) XGMI link status - Up/Down; 1 = up; 0 = down
      2) Graphics clocks below host limit (per XCP)
         accumulators -> used to help calculate a violation status
      3) VRAM max bandwidth at max memory clock
    - Updated rocm-smi --showmetrics to include new metrics.
    Units/values are shown as reported by the driver and may differ
    from AMD SMI or other ROCm SMI interfaces which
    use these fields.
    - N/A fields mean the device does not support providing this
    data.

Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-03-09 14:23:21 -05:00
Charis Poag d2efac3d93 [SWDEV-475712] Fix MI2x target_graphics_version
Removed the correction of target_graphics_version by
product name. Instead, detect a target_graphics_version that
needs to be corrected and populate it accordingly.

Change-Id: I90765c87e0629daea5c732dace8acfd17e8c62c7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-03-09 14:23:21 -05:00
Charis Poag 562575c73d Merge amd-staging into amd-master 20241125
Change-Id: I801dcda853066d8d2e19a8727b2b07dcafc253b4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-25 08:39:32 -06:00
Charis Poag d04cec7f1d [SWDEV-499029] Fix unable to change memory partition modes
Changes:
  * [API] Removed checking board name, fixes for other MI ASICs
  * [CLI] Increased progress bar to change memory partition modes
    to 140 seconds, since driver reload is variable per system

Change-Id: Ifcaf40d28b4adf5eaa800c9e3748d33749dc414a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-22 20:19:29 -05:00
Charis Poag 99c1b5a0df Merge amd-staging into amd-master 20241112
Change-Id: I3fba6fb940aa19532037e2125fd1837de4d3f282
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-12 16:43:50 -06:00
Charis Poag 46902274b6 [SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-12 12:18:44 -04:00
Zhang Ava fa2c9180d7 Merge amd-staging into amd-master 20241106
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Ib125dee62e5a893871f5c6df7715177973361a02
2024-11-08 08:42:13 +08:00
Jorge López 35c1d00f5a Updates driverInitialized() to support amdgpu built as module as well as kernel built-in. Fixes ROCm/rocm_smi_lib#102 and is an updated version of ROCm/rocm_smi_lib#104
Change-Id: Icb3abe820bc67035b822358a1c04bd09a7c22b6b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-05 16:30:37 -05:00
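One way to handle both the module and built-in cases mentioned in 35c1d00f5a is sketched below. This is not the actual driverInitialized() code. It assumes that a loadable module exposes an `initstate` file reading `live` under /sys/module/amdgpu, while a built-in driver exposes the directory without that file. The function name and the temporary-directory demo are hypothetical.

```python
import os
import tempfile


def driver_initialized(sysfs_module_dir="/sys/module/amdgpu"):
    """Return True if amdgpu looks loaded, as a module or built into the kernel.

    Assumption: loadable modules expose an 'initstate' file that reads 'live'
    once loaded, while built-in drivers expose the directory without it.
    """
    if not os.path.isdir(sysfs_module_dir):
        return False
    initstate_path = os.path.join(sysfs_module_dir, "initstate")
    if not os.path.isfile(initstate_path):
        return True  # no initstate file: treated as built-in
    with open(initstate_path) as f:
        return f.read().strip() == "live"


# Demo against a fake sysfs tree so the sketch runs on any machine.
with tempfile.TemporaryDirectory() as root:
    builtin_dir = os.path.join(root, "builtin")
    module_dir = os.path.join(root, "module")
    os.mkdir(builtin_dir)
    os.mkdir(module_dir)
    with open(os.path.join(module_dir, "initstate"), "w") as f:
        f.write("live\n")
    checks = (
        driver_initialized(builtin_dir),
        driver_initialized(module_dir),
        driver_initialized(os.path.join(root, "missing")),
    )
```

Taking the sysfs directory as a parameter keeps the check testable without a GPU present.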
Charis Poag 7504cd04eb Merge amd-staging into amd-master 20241022
Change-Id: I823ffdba9f1db614542658a2af61df917a44c07a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-10-22 18:23:12 -05:00
Oliveira, Daniel a1295714f2 [SWDEV-490187 / SWDEV-491215] Remove reset gpu partition + NPS test disabled
Reset GPU partition support for both compute and memory was removed.

Code changes related to the following:
  * rsmi_dev_compute_partition_reset()
  * rsmi_dev_memory_partition_reset()
  * CLI
  * Unit tests
  * Documentation

Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-10-21 14:22:57 -05:00
Charis Poag 28c2cc3298 Merge amd-staging into amd-master 20240930
Change-Id: I814a16d5e1f9371e00dbbb3623dc975ab2359f44
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-30 10:56:18 -04:00
Charis Poag 0609cbf1d0 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 + --showmetrics
Changes:
- Added new GPU metrics:
  1) Violation status (e.g. PVIOL/TVIOL) accumulators
  2) XCP (Graphics Compute Partitions) statistics
  3) PCIe other end recovery counter
- Added rocm-smi --showmetrics
Units/values are shown as reported by the driver and may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields mean the device does not support providing this
data.

Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-27 13:18:05 -04:00
Zhang Ava af2507807f Merge amd-staging into amd-master 20240917
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I198c849530508a90eee8ae5454035b9c610b3f5a
2024-09-19 18:44:19 +08:00
James Xu fe6a49d186 SWDEV-478077 - logging.warn used instead of logging.warning
- logging.warn() is deprecated in favour of logging.warning()
- for some reason, this is the only place in all of rocm_smi.py
	that uses logging.warn() as pointed out on github
	https://github.com/ROCm/rocm_smi_lib/issues/187

Change-Id: Ie1e4a0ea16b996fbed2e902c8edfe68087a5a5fa
2024-09-16 13:50:26 -04:00
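The change in fe6a49d186 amounts to swapping a deprecated alias for the supported call. A minimal sketch follows; it uses a named logger with an in-memory stream so it is self-contained, whereas the commit message refers to the module-level `logging.warn()`. The logger name and message text are hypothetical.

```python
import io
import logging

stream = io.StringIO()
logger = logging.getLogger("rocm_smi_example")  # hypothetical logger name
logger.setLevel(logging.WARNING)
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s: %(message)s"))
logger.addHandler(handler)
logger.propagate = False  # keep the demo output out of the root logger

# Supported spelling; warn() is the deprecated alias of warning().
logger.warning("value not supported on this device")

output = stream.getvalue()
```

The two spellings log at the same WARNING level, so the replacement does not change what is emitted.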
Zhang Ava 743bd50aa5 Merge amd-staging into amd-master 20240911
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Iaa1be5b9c6eb4c205ced9d610feada93ad28aa50
2024-09-13 18:31:57 +08:00
Oliveira, Daniel 72b112f8f3 [SWDEV-483822] rocm-smi shows 'warning' for unsupported curves
Options '--showvoltagerange' and '--showvc' show 'warning' instead of 'error' for unsupported voltage curves

Code changes related to the following:
  * CLI

Change-Id: Ide662c98202c32ad01ccaf3c47a61f2543f82ebb
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-09-10 11:36:36 -05:00
Zhang Ava ad511e9b0d Merge amd-staging into amd-master 20240828
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I13187d5772ee1e5e74d9daf4268b90819b4198d0
2024-08-29 20:09:31 +08:00
Charis Poag 6b8db74578 Fix rocm-smi --showfw displaying error fw prints
Updates:
  - [CLI] Previously --showfw displayed fw that
    does not exist on systems. This change removes
    that extra output.

Change-Id: If8b063001b80b03579ea1378dfd890c60f62ccd7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-08-27 15:43:16 -04:00
Galantsev, Dmitrii ee3caa23ed Merge amd-staging into amd-master 20240808
Change-Id: I15b180364b79de72a74ae52fbce7009122a01415
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-08-08 16:38:24 -05:00
Maisam Arif 055b023d2e Bump version tool:2.3.1+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic67456d7484c2f5a0ce0e086e56b29e20d9d9745
2024-08-08 01:40:55 -05:00
Zhang Ava 481928965f Merge amd-staging into amd-master 20240801
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I8c9b1a2805e83e5de5873ef8fafaf38143c2ebd8
2024-08-02 13:12:38 +08:00
Ranjith Ramakrishnan c9201f7736 SWDEV-469004 - Append additional path to system path
rocm-smi is installed in /opt/rocm-ver/bin, but not as a soft link in the wheel package.
For rocm-smi to work from the bin directory, it needs the extra path to find rsmiBindings.py

Change-Id: I41388f680cb2ab9f11dc135639b0d30b66082392
2024-07-31 19:52:46 -04:00
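The idea behind c9201f7736, extending the module search path so a script in bin can import rsmiBindings.py, might look roughly like the sketch below. The helper name, the example prefix and the relative directory `../libexec/rocm_smi` are assumptions made for illustration; the commit message does not say which path is appended.

```python
import os
import sys


def add_bindings_dir(script_path, relative_dir):
    """Append a directory, resolved relative to the script, to sys.path once."""
    script_dir = os.path.dirname(os.path.abspath(script_path))
    bindings_dir = os.path.normpath(os.path.join(script_dir, relative_dir))
    if bindings_dir not in sys.path:
        sys.path.append(bindings_dir)
    return bindings_dir


# Hypothetical layout: script in <prefix>/bin, bindings in <prefix>/libexec/rocm_smi.
added = add_bindings_dir(
    "/opt/rocm/bin/rocm-smi", os.path.join("..", "libexec", "rocm_smi")
)
```

Appending rather than prepending leaves any earlier entry on sys.path in control of which copy of the bindings is imported.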
Maisam Arif c2235eea35 [SWDEV-464799] Handle UnicodeEncodeError with non UTF-8 locales
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ifb8e6e3c7891c4f70faba5441fb87cc4ba2302f3
2024-07-31 17:01:01 -04:00
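The commit message for c2235eea35 does not describe how the UnicodeEncodeError is handled. One common approach, shown here only as a sketch, is to retry the write with unencodable characters replaced. The helper `safe_write` and the sample text are hypothetical.

```python
import io


def safe_write(stream, text):
    """Write text, replacing characters the stream's encoding cannot represent."""
    try:
        stream.write(text)
    except UnicodeEncodeError:
        encoding = getattr(stream, "encoding", None) or "ascii"
        # Round-trip through the target encoding so bad characters become '?'.
        stream.write(text.encode(encoding, errors="replace").decode(encoding))


# Simulate a non UTF-8 locale with an ASCII-only text stream.
buffer = io.BytesIO()
ascii_stream = io.TextIOWrapper(buffer, encoding="ascii", write_through=True)
safe_write(ascii_stream, "Temperature: 45°C")
ascii_stream.flush()
result = buffer.getvalue()
```

With this fallback the degree sign degrades to `?` instead of aborting the whole command.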
Maisam Arif 3cd677419e Merge amd-staging into amd-master 20240710
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I82f353d21279e2c1ee1788cb9e949b6d3b7e3270
2024-07-10 19:57:39 -05:00
Maisam Arif db4d81b944 Bump version lib:7.3.0 tool:2.3.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I637b34e03580d5b5efb1e12805a9cdeb7778de74
2024-07-10 19:55:15 -05:00
Zhang Ava 4c0ce45912 Merge amd-staging into amd-master 20240628
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I9493cdf35b64cfa0a99de017e2d6b521af71cf14
2024-07-04 14:19:02 +08:00
Charis Poag 323ab1105d [SWDEV-463213] Add partition ID fallback + new API
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
  below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.

Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8]  = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes

Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-06-27 17:27:01 -05:00
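The bit layout listed in 323ab1105d can be decoded with a few shifts and masks. The sketch below is illustrative only and is not the rsmi_dev_partition_id_get() implementation. The function names are hypothetical, and the fallback rule (use bits [2:0] when bits [31:28] read zero) is an assumption about how the fallback might be applied.

```python
def decode_packed_id(packed_id):
    """Split the 64-bit value into the fields listed in the commit message."""
    return {
        "domain": (packed_id >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (packed_id >> 28) & 0xF,    # bits [31:28]
        "bus": (packed_id >> 8) & 0xFF,             # bits [15:8]
        "device": (packed_id >> 3) & 0x1F,          # bits [7:3]
        "function": packed_id & 0x7,                # bits [2:0]
    }


def partition_id_with_fallback(packed_id):
    """Prefer bits [31:28]; fall back to bits [2:0] when they read zero (assumed rule)."""
    fields = decode_packed_id(packed_id)
    return fields["partition_id"] or fields["function"]


# Hypothetical IDs: bus 0x0c, device 0, partition 3 reported in either position.
primary = (0x3 << 28) | (0x0C << 8)
fallback = (0x0C << 8) | 0x3
```

Both sample values yield partition 3 through `partition_id_with_fallback`, whichever bit range the driver used.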
jingyu1l 5463955787 Merge amd-staging into amd-master 20240622
Signed-off-by: jingyu1l <Jingyu1.Li@amd.com>
Change-Id: I7d1c62c8e61c5e43200efd4b5abd7f48e8182e65
2024-06-27 14:37:24 +08:00
Bill(Shuzhou) Liu 57e8e72b79 Change error message for concise json/csv
The error message is changed to not supported instead of errors.

Change-Id: I28bd1e009770674389534be12519cc34673ba846
2024-06-21 16:16:36 -04:00
guanyu12 e7e8b59cba Merge amd-staging into amd-master 20240606
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I06c0f47701f580cd7440dc9fb5d394fad97d06aa
2024-06-06 16:48:50 +08:00
Roopa Malavally 2fd36e33ad ROCm SMI Documentation Reorg
Change-Id: I3e4db2c50a43a51eeea4d3e06ba4811ad1859368
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-05-31 16:25:35 -05:00
Maisam Arif 217827e2c1 Merge amd-staging into amd-master 20240508
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I8ce1757eb0666c7c3242556e20c0fd5de22d740b
2024-05-08 00:29:23 -05:00
Maisam Arif 9c16cc8baf Bump version lib:7.2.0 tool:2.2.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I07138dad67d796fb8c2dd418a384f663dd8532c0
2024-05-07 21:04:29 -05:00
Oliveira, Daniel 8e6d66e15b fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options restored: --showclkvolt, --showvc, --showvoltagerange, --setvc
    * Rework: 48ddd9ab
  * Bump CLI version
  * CHANGELOG.md

Change-Id: I817ca224de923fdaa992df84592d63b4d5a12b22
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-07 20:47:26 -05:00
guanyu12 2938796bc2 Merge amd-staging into amd-master 20240506
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Ifecbc41972411afaf0e7d7b9f07982114402a65a
2024-05-06 14:47:47 +08:00
Ori Messinger 0c48cd9122 ROCm SMI LIB: Fix rsmiBindings.py.in Mismatch
This commit aligns the rsmiBindings.py.in file's
"notification_type_names" & "rsmi_evt_notification_type_t" with
those found in the rsmiBindings.py file.

Change-Id: I67f36606c505992fb98495651310bd70a1755033
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2024-05-02 23:22:44 -05:00
Maisam Arif c425848141 Bump version lib:7.1.0 tool:2.1.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6f3d7c64aacf36c9d33d663e23559a7f50cd8db6
2024-05-02 03:30:48 -04:00
Oliveira, Daniel 48ddd9abd7 fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options removed: --showclkvolt, --showvc, --showvoltagerange, --setvc

Change-Id: Ieedb845eeadcea2f2e447ec576c253ad2a814176
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-02 03:29:59 -04:00
Ori Messinger 3282aaa8de ROCm SMI LIB: Add Ring Hang Event Enums
This patch adds 'ring hang' enums to ROCM SMI LIB.
This event type name is KFD_SMI_EVENT_RING_HANG.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9b886eb1fc027f03bcca1e5d1a89a2a186b64bf5
2024-05-01 17:02:52 -05:00
guanyu12 6881fc9c2e Merge amd-staging into amd-master 20240411
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I25ed71cca91a0d78110a995861cff93ba748e056
2024-04-11 10:24:26 +08:00
Charis Poag b86f92230d [SWDEV-450463] Fix --showmemuse clarity
* Updates:
  - [CLI] Updated --showmemuse:
    -> Add VRAM%, provide better context as "GPU Allocated Memory (VRAM%)"
    -> Update "GPU memory use (%)" as
       "GPU Memory Read/Write Activity(%)"
  - [CLI] Updated --showmaxpower and rocm-smi (no arg)
    -> Rounding was inconsistent for values past the decimal point.
       The CLI now shows the floor of the device's value

Change-Id: Ib76dea2cb8483a1d7f53df675b0a94d8d01c81b9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-04-08 10:25:46 -04:00
Junyi Hou 9e2a6ea4bf Fix typos in rocm_smi.py, README.md, rsmiBindings.py
Change-Id: Ib03cec6130983a56657a388799fc2afaf3b8f728
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-04-05 11:15:41 -04:00
Charis Poag 6fada8c4a6 Merge amd-staging into amd-master 20240401
Change-Id: I52c8665735e86deed53645197c11889fc7ece8c5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-04-01 17:48:06 -05:00
Charis Poag f5c32b5415 Add ROCm 6.1.1 changelog, ROCm SMI deprecation, vbios fix
* Updates:
    - Add ROCm 6.1.1 Changelog updates
    - Add planned ROCm SMI deprecation notice
    - Fix rocm-smi --showvbios showing extra errors
      for GPUs which do not have a VBIOS (MI300a ASICs)

Change-Id: I0e5ccfe2677f9c7909ca13863a920e323e82b439
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-30 00:11:09 -05:00
guanyu12 8d4261c5c5 Merge amd-staging into amd-master 20240321
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I006fc6c187f134a4851e262fa53ab6bf8d58759d
2024-03-21 14:03:51 +08:00
Charis Poag c5acd4ee88 Update ROCm 6.0/6.1 CHANGELOG.md & README.md
* Updates:
    - [CHANGELOG.md] Provide 6.1 and 6.0 changes
    - [README.md] Update readme with relevant changes
    - [CLI] Updated --showpower to expand on types of power provided to users

Change-Id: Ic653cc81f80b7973654e2c23e1ab70567b930aa7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-20 00:17:33 -05:00