Commit Graph

270 Commits

Author SHA1 Message Date
Peter Park a156bfa4ae update copyright years to 2025
revert shared_mutex.h
2025-06-03 17:16:54 -05:00
Castillo, Juan 47f80145cb Fix WARNING: AMD GPUs visible, but data is inaccessible (#58)
* [SWDEV-531834] Fix AMD GPUs visible, but data is inaccessible:
- Scans directories under /sys/bus/pci/drivers/amdgpu
- Verifies each device's runtime_status to determine if it's active
- Returns False if any device is not in active state
- Handles permission errors gracefully with proper debug logging
- Includes comments explaining behavior differences between Instinct / NAVI hardware

The default status is set to True, assuming devices are active unless
proven otherwise, which accommodates hardware like some Instinct ASICs
which do not support runtime power management.

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-05-15 14:30:33 -05:00
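
The runtime status check described in commit 47f80145cb could be sketched roughly as below. This is an illustration, not the code from the commit. The function name is borrowed from commit 2630bf0a8c in this log, the `driver_dir` parameter is hypothetical (added so the sketch can be pointed at a test directory), and the `power/runtime_status` location is assumed from the standard Linux runtime PM sysfs layout.

```python
import logging
import os

AMDGPU_DRIVER_DIR = "/sys/bus/pci/drivers/amdgpu"


def check_runtime_status(driver_dir=AMDGPU_DRIVER_DIR):
    """Return False if any amdgpu device reports a non-active runtime status."""
    status = True  # default: assume devices are active unless proven otherwise
    try:
        entries = sorted(os.listdir(driver_dir))
    except OSError as err:
        logging.debug("Cannot list %s: %s", driver_dir, err)
        return status
    for entry in entries:
        status_file = os.path.join(driver_dir, entry, "power", "runtime_status")
        if not os.path.isfile(status_file):
            continue  # not a device directory, or no runtime PM attribute
        try:
            with open(status_file, "r") as handle:
                state = handle.read().strip()
        except OSError as err:
            # PermissionError is a subclass of OSError; log it and move on
            logging.debug("Cannot read %s: %s", status_file, err)
            continue
        if state != "active":
            logging.debug("%s runtime status is %r", entry, state)
            status = False
    return status
```

Keeping the default at True means a device with no readable status file does not trigger the warning by itself.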
Poag, Charis ce405476ca [SWDEV-522992] Make libdrm / libdrm_amdgpu load dynamically (#43)
Changes:
- Now load libdrm/libdrm_amdgpu dynamically

Change-Id: I49fb1f3540b3235a25370f7cfcfb9778db34c2a5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-04-16 16:03:42 -05:00
Charis Poag a0df877fdf [SWDEV-518325/SWDEV-518320/SWDEV-443309] Fix Partition Enumeration
* Changes:
  - Updates to DRM renderD* / card* pathing for partition devices
  - Now use KFD to discover AMD devices and populate accordingly
    Device MUST have an accessible KFD node (via cgroups)
  - Updated several ROCm SMI CLI outputs to handle SYSFS files
    which are not accessible on partition nodes
  - Added a new method to help get card/drm info
    (rsmi_dev_device_identifiers_get) from ROCm SMI

Change-Id: If844f27ffc595942272abe9c8167ed90a0b0e225
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-04-14 16:03:24 -05:00
Castillo, Juan 2630bf0a8c [SWDEV-516013]-rocm-smi runtime status check fix (#28)
rocm-smi is not working in mGPU, Blocking DLM tests
Updates include:
 - Creating check_runtime_status function to check for device status of active.
 - Added warning to users that No AMD GPUs are available, check power status/control.
 - Added check for empty string coming from HWMON, if empty returns unexpected data.

---------

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-04-14 13:05:22 -05:00
Castillo, Juan f69e65f7bd SWDEV-518214: GPU Metrics 1.8 (#31)
* SWDEV-518214: GPU Metrics 1.8 (#31)

- Updates:
    - Adding the following metrics to allow new calculations for violation status:
        - Per XCP metrics gfx_below_host_limit_ppt_acc
        - Per XCP metrics gfx_below_host_limit_thm_acc
        - Per XCP metrics gfx_low_utilization_acc
        - Per XCP metrics gfx_below_host_limit_total_acc
    - Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-03-20 18:07:32 -05:00
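
The UINT16_MAX convention mentioned in commit f69e65f7bd can be illustrated with a small sketch. The helper name `format_engine_activity` is hypothetical and is not part of ROCm SMI. It only shows how a sentinel value might be rendered as N/A when printing per-engine values.

```python
UINT16_MAX = 0xFFFF  # sentinel the commit message says marks an unsupported engine


def format_engine_activity(values):
    """Render per-engine values, showing the sentinel as 'N/A'."""
    return ["N/A" if value == UINT16_MAX else str(value) for value in values]


if __name__ == "__main__":
    # Three engines: two supported, one reported as UINT16_MAX.
    print(format_engine_activity([0, 12, UINT16_MAX]))
```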
Charis Poag b951a65cf2 [SWDEV-504146] Fix Device Name
Changes:
  - Fixed Device Name (market name)
  - Added new API rsmi_dev_market_name_get()
  - Updated tests
  - Updated amdgpu_drm.h to match latest mainline kernel
  - Fixed subsystem ID to only show hex value (not subsystem name)
  - rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997
2025-03-09 14:23:22 -05:00
Kanangot Balakrishnan, Bindhiya 6337f7b05b [SWDEV-481004] Fix for incorrect gfx_version number (#8)
The target_graphics_version was not formatted properly and was
showing incorrect Target Name. Corrected this by formatting
major, minor and revision numbers.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-03-09 14:23:22 -05:00
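
One way to format a target name from major, minor and revision numbers, as commit 6337f7b05b describes, is sketched below. It assumes the numeric value is encoded as major * 10000 + minor * 100 + revision, and that minor and revision are printed as hexadecimal digits. Both are assumptions for illustration; the commit message does not state the encoding. The function name is hypothetical.

```python
def format_gfx_version(target_version):
    """Build a 'gfx...' name from an assumed major/minor/revision encoding."""
    major = target_version // 10000
    minor = (target_version // 100) % 100
    revision = target_version % 100
    # Assumption: minor and revision are rendered in hexadecimal.
    return f"gfx{major}{minor:x}{revision:x}"


if __name__ == "__main__":
    for value in (90010, 90402, 110000):
        print(value, "->", format_gfx_version(value))
```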
Galantsev, Dmitrii 898ae4ffc1 [SWDEV-495169] Update ROCm SMI CLI and Error handling (#3)
Issues include:

Update ROCm SMI displaying None or Not Supported to N/A
Update ROCm SMI displaying err msg to instead log err

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Change-Id: I1a2ce6e4f329666b5666664a7d7b4475d6c1cbc7
2025-03-09 14:23:22 -05:00
Charis Poag 88a7e4b8ad [SWDEV-496693] GPU metrics 1.7
Changes:
    - Added new GPU metrics:
      1) XGMI link status - Up/Down; 1 = up; 0 = down
      2) Graphics clocks below host limit (per XCP)
         accumulators -> used to help calculate a violation status
      3) VRAM max bandwidth at max memory clock
    - Updated rocm-smi --showmetrics to include new metrics.
    Units/values reflect as indicated by driver, may differ
    from AMD SMI or other ROCm SMI interfaces which
    use these fields.
    - N/A fields mean the device does not support providing this
    data.

Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-03-09 14:23:21 -05:00
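
The XGMI link status encoding listed in commit 88a7e4b8ad (1 = up, 0 = down) could be turned into display text as in this sketch. The helper name is hypothetical, and treating any other value as N/A is an assumption based on the note that N/A marks unsupported data.

```python
def xgmi_link_status_label(raw):
    """Map a raw XGMI link status value to a display label."""
    labels = {1: "Up", 0: "Down"}
    # Assumption: anything other than 0 or 1 means the value is not provided.
    return labels.get(raw, "N/A")


if __name__ == "__main__":
    print([xgmi_link_status_label(v) for v in (1, 0, 0xFFFF)])
```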
Charis Poag d2efac3d93 [SWDEV-475712] Fix MI2x target_graphics_version
Removed correcting target_graphics_version by
product name. Instead detected target_graphics_version which
needs to be corrected -> populate accordingly.

Change-Id: I90765c87e0629daea5c732dace8acfd17e8c62c7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-03-09 14:23:21 -05:00
Charis Poag 562575c73d Merge amd-staging into amd-master 20241125
Change-Id: I801dcda853066d8d2e19a8727b2b07dcafc253b4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-25 08:39:32 -06:00
Charis Poag d04cec7f1d [SWDEV-499029] Fix unable to change memory partition modes
Changes:
  * [API] Removed checking board name, fixes for other MI ASICs
  * [CLI] Increased progress bar to change memory partition modes
    to 140 seconds, since driver reload is variable per system

Change-Id: Ifcaf40d28b4adf5eaa800c9e3748d33749dc414a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-22 20:19:29 -05:00
Charis Poag 99c1b5a0df Merge amd-staging into amd-master 20241112
Change-Id: I3fba6fb940aa19532037e2125fd1837de4d3f282
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-12 16:43:50 -06:00
Charis Poag 46902274b6 [SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-12 12:18:44 -04:00
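
Waiting for SYSFS files to become active after a driver reload, as commit 46902274b6 mentions, is in essence a poll with a timeout. The sketch below shows that general pattern only. The function name, parameters and the 40 second default (taken from the CLI time-bar figure in the commit message) are illustrative assumptions, not the library's implementation.

```python
import os
import time


def wait_for_sysfs(path, timeout_s=40.0, poll_s=0.5):
    """Poll until `path` exists; return False if the timeout expires first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if os.path.exists(path):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

Using `time.monotonic()` keeps the deadline unaffected by wall-clock adjustments during the wait.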
Zhang Ava fa2c9180d7 Merge amd-staging into amd-master 20241106
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Ib125dee62e5a893871f5c6df7715177973361a02
2024-11-08 08:42:13 +08:00
Jorge López 35c1d00f5a Updates driverInitialized() to support amdgpu built as module as well as kernel built-in. Fixes ROCm/rocm_smi_lib#102 and is an updated version of ROCm/rocm_smi_lib#104
Change-Id: Icb3abe820bc67035b822358a1c04bd09a7c22b6b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-05 16:30:37 -05:00
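
A check that accepts amdgpu both as a loadable module and as a kernel built-in, as commit 35c1d00f5a describes, might look like the sketch below. It assumes that a loaded module exposes an `initstate` file reading `live` under `/sys/module/amdgpu`, while a built-in driver has the directory without that file. The Python function and its `module_dir` parameter are hypothetical and do not reproduce the library's `driverInitialized()`.

```python
import os


def driver_initialized(module_dir="/sys/module/amdgpu"):
    """Return True if amdgpu appears to be loaded, as a module or built in."""
    if not os.path.isdir(module_dir):
        return False  # driver is not present at all
    initstate = os.path.join(module_dir, "initstate")
    if not os.path.exists(initstate):
        return True  # assumption: no initstate file means built into the kernel
    try:
        with open(initstate, "r") as handle:
            return handle.read().strip() == "live"
    except OSError:
        return False
```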
Charis Poag 7504cd04eb Merge amd-staging into amd-master 20241022
Change-Id: I823ffdba9f1db614542658a2af61df917a44c07a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-10-22 18:23:12 -05:00
Oliveira, Daniel a1295714f2 [SWDEV-490187 / SWDEV-491215] Remove reset gpu partition + NPS test disabled
The reset gpu partition support for both compute and memory were removed

Code changes related to the following:
  * rsmi_dev_compute_partition_reset()
  * rsmi_dev_memory_partition_reset()
  * CLI
  * Unit tests
  * Documentation

Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-10-21 14:22:57 -05:00
Charis Poag 28c2cc3298 Merge amd-staging into amd-master 20240930
Change-Id: I814a16d5e1f9371e00dbbb3623dc975ab2359f44
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-30 10:56:18 -04:00
Charis Poag 0609cbf1d0 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 + --showmetrics
Changes:
- Added new GPU metrics:
  1) Violation status (e.g. PVIOL/TVIOL) accumulators
  2) XCP (Graphics Compute Partitions) statistics
  3) pcie other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields mean the device does not support providing this
data.

Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-27 13:18:05 -04:00
Zhang Ava af2507807f Merge amd-staging into amd-master 20240917
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I198c849530508a90eee8ae5454035b9c610b3f5a
2024-09-19 18:44:19 +08:00
James Xu fe6a49d186 SWDEV-478077 - logging.warn used instead of logging.warning
- logging.warn() is deprecated in favour of logging.warning()
- for some reason, this is the only place in all of rocm_smi.py
	that uses logging.warn() as pointed out on github
	https://github.com/ROCm/rocm_smi_lib/issues/187

Change-Id: Ie1e4a0ea16b996fbed2e902c8edfe68087a5a5fa
2024-09-16 13:50:26 -04:00
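
The change in commit fe6a49d186 amounts to calling `logging.warning()` where `logging.warn()` was used. A minimal illustration follows. The function name and message text are hypothetical and are not taken from rocm_smi.py.

```python
import logging


def report_unsupported(device_id, feature):
    """Log a warning using the supported logging.warning() call."""
    # logging.warn() is the deprecated spelling; logging.warning() replaces it.
    logging.warning("GPU[%s]: %s is not supported", device_id, feature)


if __name__ == "__main__":
    report_unsupported(0, "voltage curve")
```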
Zhang Ava 743bd50aa5 Merge amd-staging into amd-master 20240911
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: Iaa1be5b9c6eb4c205ced9d610feada93ad28aa50
2024-09-13 18:31:57 +08:00
Oliveira, Daniel 72b112f8f3 [SWDEV-483822] rocm-smi shows 'warning' for unsupported curves
Options '--showvoltagerange' and '--showvc' show 'warning' instead of 'error' for unsupported voltage curves

Code changes related to the following:
  * CLI

Change-Id: Ide662c98202c32ad01ccaf3c47a61f2543f82ebb
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-09-10 11:36:36 -05:00
Zhang Ava ad511e9b0d Merge amd-staging into amd-master 20240828
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I13187d5772ee1e5e74d9daf4268b90819b4198d0
2024-08-29 20:09:31 +08:00
Charis Poag 6b8db74578 Fix rocm-smi --showfw displaying error fw prints
Updates:
  - [CLI] Previously --showfw displayed fw that
    does not exist on systems. This change removes
    that extra output.

Change-Id: If8b063001b80b03579ea1378dfd890c60f62ccd7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-08-27 15:43:16 -04:00
Galantsev, Dmitrii ee3caa23ed Merge amd-staging into amd-master 20240808
Change-Id: I15b180364b79de72a74ae52fbce7009122a01415
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-08-08 16:38:24 -05:00
Maisam Arif 055b023d2e Bump version tool:2.3.1+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic67456d7484c2f5a0ce0e086e56b29e20d9d9745
2024-08-08 01:40:55 -05:00
Zhang Ava 481928965f Merge amd-staging into amd-master 20240801
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I8c9b1a2805e83e5de5873ef8fafaf38143c2ebd8
2024-08-02 13:12:38 +08:00
Ranjith Ramakrishnan c9201f7736 SWDEV-469004 - Append additional path to system path
rocm-smi is installed in /opt/rocm-ver/bin, but not as a soft link in wheel package
For rocm-smi to work from bin directory, it needs the extra path to find rsmiBindings.py

Change-Id: I41388f680cb2ab9f11dc135639b0d30b66082392
2024-07-31 19:52:46 -04:00
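
Appending an extra directory to the module search path, as commit c9201f7736 describes, can be sketched as follows. The helper name and the relative directory passed in the usage example are hypothetical. The commit message does not say which directory holds rsmiBindings.py.

```python
import os
import sys


def add_bindings_path(script_file, relative_dir):
    """Append a directory, resolved relative to the script, to sys.path."""
    base = os.path.dirname(os.path.realpath(script_file))
    candidate = os.path.normpath(os.path.join(base, relative_dir))
    if candidate not in sys.path:
        sys.path.append(candidate)  # appended last, so existing entries win
    return candidate


if __name__ == "__main__":
    # Hypothetical layout: bindings live in a sibling directory of bin/.
    print(add_bindings_path(__file__, "../libexec/rocm_smi"))
```

Appending rather than inserting at the front keeps any already-importable copy of the bindings first in the search order.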
Maisam Arif c2235eea35 [SWDEV-464799] Handle UnicodeEncodeError with non UTF-8 locales
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ifb8e6e3c7891c4f70faba5441fb87cc4ba2302f3
2024-07-31 17:01:01 -04:00
Maisam Arif 3cd677419e Merge amd-staging into amd-master 20240710
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I82f353d21279e2c1ee1788cb9e949b6d3b7e3270
2024-07-10 19:57:39 -05:00
Maisam Arif db4d81b944 Bump version lib:7.3.0 tool:2.3.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I637b34e03580d5b5efb1e12805a9cdeb7778de74
2024-07-10 19:55:15 -05:00
Zhang Ava 4c0ce45912 Merge amd-staging into amd-master 20240628
Signed-off-by: Zhang Ava <niandong.zhang@amd.com>
Change-Id: I9493cdf35b64cfa0a99de017e2d6b521af71cf14
2024-07-04 14:19:02 +08:00
Charis Poag 323ab1105d [SWDEV-463213] Add partition ID fallback + new API
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
  below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.

Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8]  = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes

Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-06-27 17:27:01 -05:00
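
The bit layout listed in commit 323ab1105d can be decoded with shifts and masks, as in the sketch below. The function names are hypothetical. The fallback rule shown (use bits [2:0] when bits [31:28] are zero) is an assumption about how the fallback might be applied, not the logic of rsmi_dev_partition_id_get().

```python
def decode_bdfid(bdfid):
    """Split a 64-bit identifier into the fields listed in the commit message."""
    return {
        "domain": (bdfid >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (bdfid >> 28) & 0xF,    # bits [31:28]
        "bus": (bdfid >> 8) & 0xFF,             # bits [15:8]
        "device": (bdfid >> 3) & 0x1F,          # bits [7:3]
        "function": bdfid & 0x7,                # bits [2:0]
    }


def partition_id_from_bdfid(bdfid):
    """Return the partition id, falling back to the function bits."""
    fields = decode_bdfid(bdfid)
    if fields["partition_id"] != 0:
        return fields["partition_id"]
    # Assumed fallback: the partition id may be carried in bits [2:0].
    return fields["function"]


if __name__ == "__main__":
    print(partition_id_from_bdfid((3 << 28) | (0x43 << 8)))
    print(partition_id_from_bdfid((0x43 << 8) | 2))
```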
jingyu1l 5463955787 Merge amd-staging into amd-master 20240622
Signed-off-by: jingyu1l <Jingyu1.Li@amd.com>
Change-Id: I7d1c62c8e61c5e43200efd4b5abd7f48e8182e65
2024-06-27 14:37:24 +08:00
Bill(Shuzhou) Liu 57e8e72b79 Change error message for concise json/csv
The error message is changed to not supported instead of errors.

Change-Id: I28bd1e009770674389534be12519cc34673ba846
2024-06-21 16:16:36 -04:00
guanyu12 e7e8b59cba Merge amd-staging into amd-master 20240606
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I06c0f47701f580cd7440dc9fb5d394fad97d06aa
2024-06-06 16:48:50 +08:00
Roopa Malavally 2fd36e33ad ROCm SMI Documentation Reorg
Change-Id: I3e4db2c50a43a51eeea4d3e06ba4811ad1859368
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-05-31 16:25:35 -05:00
Maisam Arif 217827e2c1 Merge amd-staging into amd-master 20240508
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I8ce1757eb0666c7c3242556e20c0fd5de22d740b
2024-05-08 00:29:23 -05:00
Maisam Arif 9c16cc8baf Bump version lib:7.2.0 tool:2.2.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I07138dad67d796fb8c2dd418a384f663dd8532c0
2024-05-07 21:04:29 -05:00
Oliveira, Daniel 8e6d66e15b fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options restored: --showclkvolt, --showvc, --showvoltagerange, --setvc
    * Rework: 48ddd9ab
  * Bump CLI version
  * CHANGELOG.md

Change-Id: I817ca224de923fdaa992df84592d63b4d5a12b22
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-07 20:47:26 -05:00
guanyu12 2938796bc2 Merge amd-staging into amd-master 20240506
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Ifecbc41972411afaf0e7d7b9f07982114402a65a
2024-05-06 14:47:47 +08:00
Ori Messinger 0c48cd9122 ROCm SMI LIB: Fix rsmiBindings.py.in Mismatch
This commit aligns the rsmiBindings.py.in file's
"notification_type_names" & "rsmi_evt_notification_type_t" with
those found in the rsmiBindings.py file.

Change-Id: I67f36606c505992fb98495651310bd70a1755033
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2024-05-02 23:22:44 -05:00
Maisam Arif c425848141 Bump version lib:7.1.0 tool:2.1.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6f3d7c64aacf36c9d33d663e23559a7f50cd8db6
2024-05-02 03:30:48 -04:00
Oliveira, Daniel 48ddd9abd7 fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options removed: --showclkvolt, --showvc, --showvoltagerange, --setvc

Change-Id: Ieedb845eeadcea2f2e447ec576c253ad2a814176
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-02 03:29:59 -04:00
Ori Messinger 3282aaa8de ROCm SMI LIB: Add Ring Hang Event Enums
This patch adds 'ring hang' enums to ROCM SMI LIB.
This event type name is KFD_SMI_EVENT_RING_HANG.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9b886eb1fc027f03bcca1e5d1a89a2a186b64bf5
2024-05-01 17:02:52 -05:00
guanyu12 6881fc9c2e Merge amd-staging into amd-master 20240411
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I25ed71cca91a0d78110a995861cff93ba748e056
2024-04-11 10:24:26 +08:00
Charis Poag b86f92230d [SWDEV-450463] Fix --showmemuse clarity
* Updates:
  - [CLI] Updated --showmemuse:
    -> Add VRAM%, provide better context as "GPU Allocated Memory (VRAM%)"
    -> Update "GPU memory use (%)" as
       "GPU Memory Read/Write Activity(%)"
  - [CLI] Updated --showmaxpower and rocm-smi (no arg)
    -> Rounding was inconsistent with values past decimal.
       This provides the floor value of the device

Change-Id: Ib76dea2cb8483a1d7f53df675b0a94d8d01c81b9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-04-08 10:25:46 -04:00