192 commits

Author SHA1 Message Date
Bindhiya Kanangot Balakrishnan aa16cca39a [SWDEV-549108] Increase gpu_metrics API execution test threshold (#2617)
Increased threshold from 2100 µs to 3100 µs to accommodate
gpu_metric read time variation across Navi systems.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2026-01-15 11:20:17 -06:00
Yazen AL Musaffar cb372748f8 [ROCM-SMI] [SWDEV-569731] rsmi tests failing on Frequency/Power/GpuMetrics ReadOnly Fix (#2303)
* Updated unsupported metric version file for rocm_smi_tests Frequency/Power/GpuMetrics ReadOnly tests

Signed-off-by: yalmusaf_amdeng <Yazen.ALMusaffar@amd.com>
2026-01-06 16:46:38 -06:00
Mario Limonciello bfb13f2b43 Run pre-commit's whitespace related hooks on projects/rocm-smi-lib (#2117)
* Run pre-commit's whitespace related hooks on projects/rocm-smi-lib

In order for pre-commit to be useful, everything needs to meet a common
baseline.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>

* Added Changelog Spaces for formatting

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

---------

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-12-11 15:41:24 -06:00
Yazen AL Musaffar c9d6a8720c [SWDEV-548312] Fix for rsmitstReadWrite.TestPciReadWrite failure in rsmi-tests on MI200. (#1834)
* Fix for rsmitstReadWrite.TestPciReadWrite failure in rsmi-tests

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>

* Resolved comments

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>

---------

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>
Co-authored-by: yalmusaf_amdeng <yalmusaf@amd.com>
2025-12-03 15:21:36 -06:00
Bindhiya Kanangot Balakrishnan e8c3b22734 [SWDEV-556483] Fix runtime PM suspend causing test failures (#1931)
Added runtime PM detection and DRM ioctl-based device wake
to handle GPUs in BACO state. Modified tests to wake
suspended devices before reading sysfs files.
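
The detection step described above can be sketched roughly as follows. This is a minimal illustration, assuming the standard Linux runtime-PM attribute `power/runtime_status` under the device's sysfs directory; the helper name `is_runtime_suspended` and the example path are hypothetical and not taken from the commit, and the DRM ioctl-based wake is not shown.

```python
from pathlib import Path


def is_runtime_suspended(device_dir):
    """Return True if the kernel reports the device as runtime-suspended.

    device_dir is a device sysfs directory, for example
    /sys/class/drm/card0/device (hypothetical example path).
    A missing or unreadable attribute is treated as "not suspended".
    """
    status_file = Path(device_dir) / "power" / "runtime_status"
    try:
        status = status_file.read_text().strip()
    except OSError:
        return False
    # The kernel writes values such as "active" or "suspended" here.
    return status == "suspended"
```

A test could call such a helper before reading sysfs files and wake the device only when it returns True.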

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-11-25 13:36:45 -06:00
darren-amd 16e7ee32e6 [rocm-smi-lib] Add iomanip include to frequencies_read (#1797) 2025-11-24 16:38:21 -05:00
Bindhiya Kanangot Balakrishnan 97b6e806da SWDEV-560768 - SMI test return if no devices available (#1369)
Return from Setup if no monitor devices are available.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-10-16 15:35:18 -05:00
Bindhiya Kanangot Balakrishnan b4288fd8d4 SWDEV-554099 - Update rsmi tests expected output (#1364)
Updated rsmitst expected outputs to accommodate the
returned status.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-10-16 15:34:07 -05:00
systems-assistant[bot] 857e5ef3ce chore: unset executable permission (#213)
Co-authored-by: Eisuke Kawashima <e-kwsm@users.noreply.github.com>
Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-09-16 11:06:54 -05:00
gabrpham 5dbca01d2d [SWDEV-551309] Adjusted rocmsmitst and --resetprofile command (#769) 2025-09-09 14:32:35 -05:00
gabrpham ee38e26ab2 [SWDEV_543709] Updated tests with new expectations for output (#692)
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-09-09 14:32:01 -05:00
Castillo, Juan 8133e89e82 [SWDEV-539845] Add support for board voltage (#92)
* Add the API and CLI to show the board voltage.

---------

Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Liu, Shuzhou (Bill) <Shuzhou.Liu@amd.com>

[ROCm/rocm_smi_lib commit: bab82d98b7]
2025-07-03 01:58:50 -05:00
Charis Poag b45713faf5 [SWDEV-530035] Fix tests run with partitioned configurations (CPX, DPX, QPX, etc.)
Changes: - Updates to APIs to handle null pointers or RSMI_STATUS_NOT_SUPPORTED
  - Fixes to tests to handle partitioned configurations correctly
  - Synced with latest AMD SMI API changes
Change-Id: I7a932f9336ef29ccb01d3b15e2101f6136b45720


[ROCm/rocm_smi_lib commit: 12b78439d2]
2025-06-06 16:39:29 -05:00
Peter Park 5a3556ca85 update copyright years to 2025
revert shared_mutex.h


[ROCm/rocm_smi_lib commit: a156bfa4ae]
2025-06-03 17:16:54 -05:00
Castillo, Juan eaa2000af5 [SWDEV-523359] fan_read_write: Add set fan speed validation check. (#61)
[SWDEV-523359] fan_read_write: Add set fan speed validation check.
- Handled NOT_SUPPORTED status which previously caused rsmitst to false fail
- Added continue statement to proceed with the rest of the FanReadWrite test.
- Fixed spacing on line 140

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

[ROCm/rocm_smi_lib commit: ac31c6e576]
2025-05-26 09:54:41 -05:00
Castillo, Juan 3aa80ec0e4 SWDEV-518214: GPU Metrics 1.8 (#31)
* SWDEV-518214: GPU Metrics 1.8 (#31)

- Updates:
    - Adding the following metrics to allow new calculations for violation status:
        - Per XCP metrics gfx_below_host_limit_ppt_acc
        - Per XCP metrics gfx_below_host_limit_thm_acc
        - Per XCP metrics gfx_low_utilization_acc
        - Per XCP metrics gfx_below_host_limit_total_acc
    - Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.
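
The sentinel convention mentioned above (unsupported engines reported as UINT16_MAX and shown as N/A) can be illustrated with a short sketch. The function name `format_engine_activity` is hypothetical and this is not the actual CLI code; it only shows the display rule under that assumption.

```python
UINT16_MAX = 0xFFFF  # sentinel meaning "not available" in a 16-bit field


def format_engine_activity(values):
    """Render per-engine values, showing the sentinel as "N/A"."""
    return ["N/A" if v == UINT16_MAX else str(v) for v in values]
```

For example, `format_engine_activity([0, 57, 0xFFFF])` yields `["0", "57", "N/A"]`.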

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/rocm_smi_lib commit: f69e65f7bd]
2025-03-20 18:07:32 -05:00
Poag, Charis 66d66a872d [SWDEV-514998/SWDEV-511662] Fix tests for Guest and BM with static CPX config (#26)
Guest: Tests needed to account for not supporting changing compute
partitions.

BM: Tests need to account for invalid responses from Driver (due to
static CPX config).

Change-Id: I09ccee981c6b73684b64e5053068920a6c1b6439

Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/rocm_smi_lib commit: 23e945c6b3]
2025-03-09 14:08:02 -05:00
Poag, Charis ef21f5d254 Revert "[SWDEV-514998/SWDEV-511662] Fix tests for Guest and BM with static CPX config" (#25)
* Revert - this reverts commit c03341cc02efb70c35c4e96ff4fc3e6c53f5be9d.

* Revert "[SWDEV-514998/SWDEV-511662] Fix tests for Guest and BM with static CPX config"

This reverts commit 9bd169da4801c32f7c48f83cb70f790faa0dca96.

[ROCm/rocm_smi_lib commit: 08fee73075]
2025-03-09 14:08:02 -05:00
Kanangot Balakrishnan, Bindhiya c648438732 SWDEV-510419: Restore compute partition after memory partition test (#15)
The memory partition test was changing the original compute partition
based on the default compute mode. Corrected this to restore the
original compute partition.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

[ROCm/rocm_smi_lib commit: 1d6b8d9422]
2025-03-09 14:08:02 -05:00
Charis Poag 4d47e514f3 [SWDEV-514998/SWDEV-511662] Fix tests for Guest and BM with static CPX config
Guest: Tests needed to account for not supporting changing compute
partitions.

BM: Tests need to account for invalid responses from Driver (due to
static CPX config).


[ROCm/rocm_smi_lib commit: 1dd9ca9df4]
2025-03-09 14:07:51 -05:00
Charis Poag 7b867182f3 [SWDEV-504146] Fix Device Name
Changes: - Fixed Device Name (market name)
  - Added new API rsmi_dev_market_name_get()
  - Updated tests
  - Updated amdgpu_drm.h to match latest mainline kernel
  - Fixed subsystem ID to only show hex value (not subsystem name)
  - rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997


[ROCm/rocm_smi_lib commit: 6a5e94c451]
2025-02-19 08:49:50 -06:00
gabrpham a62f424b90 Fixed reset event issues
Issues include:
	SWDEV-480250
	SWDEV-480255
	SWDEV-480248

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: Icf12211e4b136f26fce18f09a7bf8b7e9cd20691


[ROCm/rocm_smi_lib commit: 6f51cd651e]
2024-12-30 13:12:46 -05:00
Charis Poag 7b3c814501 [SWDEV-496693] GPU metrics 1.7
Changes:
    - Added new GPU metrics:
      1) XGMI link status - Up/Down; 1 = up; 0 = down
      2) Graphics clocks below host limit (per XCP)
         accumulators -> used to help calculate a violation status
      3) VRAM max bandwidth at max memory clock
    - Updated rocm-smi --showmetrics to include new metrics.
    Units/values are reported as indicated by the driver and may
    differ from AMD SMI or other ROCm SMI interfaces which
    use these fields.
    - An N/A field means the device does not support providing this
    data.

Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 4de2168866]
2024-12-15 16:48:13 -05:00
gabrpham 21d3a831d7 [SWDEV-478748] Changing PCIE Read/Write message TEST FAILURE to WARNING
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: I534a94b358f7fddbe3c11d249c6e090cf3fa121e


[ROCm/rocm_smi_lib commit: 5428d29b19]
2024-11-13 15:05:26 -06:00
Charis Poag 2258c26c53 [SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 46902274b6]
2024-11-12 12:18:44 -04:00
Oliveira, Daniel d41fbc88ca [SWDEV-490187 / SWDEV-491215] Remove reset gpu partition + NPS test disabled
The reset gpu partition support for both compute and memory were removed

Code changes related to the following:
  * rsmi_dev_compute_partition_reset()
  * rsmi_dev_memory_partition_reset()
  * CLI
  * Unit tests
  * Documentation

Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: a1295714f2]
2024-10-21 14:22:57 -05:00
Charis Poag 0b40a73798 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 + --showmetrics
Changes:
- Added new GPU metrics:
  1) Violation status (e.g. PVIOL/TVIOL) accumulators
  2) XCP (Graphics Compute Partitions) statistics
  3) PCIe other end recovery counter
- Added rocm-smi --showmetrics
Units/values are reported as indicated by the driver and may
differ from AMD SMI or other ROCm SMI interfaces which
use these fields.
- An N/A field means the device does not support providing this
data.

Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 0609cbf1d0]
2024-09-27 13:18:05 -04:00
Charis Poag 0d5c46fe52 [SWDEV-475552/SWDEV-475351] Fix segfault TestComputePartitionReadWrite
In order to check partition IDs we must keep checking the number of
devices, since this fluctuates with partition updates
and there are DRM minor limitations.

For the DRM minor limitation of 64, the user must remove other drivers
using PCIe space. You can see these by:
ls /sys/class/drm

Recommendation: rmmod the unneeded driver and reload amdgpu to
ensure CPX can enumerate with all XCPs (Graphic Cluster Partitions).

Change-Id: Ib663503f0b7264dce163f6ac2d50795fc8dc5eba
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: c11209f618]
2024-07-27 17:47:54 -05:00
Charis Poag 33eb3fa429 [SWDEV-463213] Add partition ID fallback + new API
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
  below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.

Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8]  = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
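
The bit layout above can be decoded as in the sketch below. The function names are hypothetical, and the fallback rule (use the function bits when bits [31:28] are zero) is an assumption made for illustration; the actual logic in rsmi_dev_partition_id_get() may differ.

```python
def decode_bdfid(bdfid):
    """Split a 64-bit BDF id into the fields listed in the layout above."""
    return {
        "domain": (bdfid >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (bdfid >> 28) & 0xF,    # bits [31:28]
        "bus": (bdfid >> 8) & 0xFF,             # bits [15:8]
        "device": (bdfid >> 3) & 0x1F,          # bits [7:3]
        "function": bdfid & 0x7,                # bits [2:0]
    }


def partition_id_with_fallback(bdfid):
    """Prefer bits [31:28]; fall back to bits [2:0] (assumed rule)."""
    fields = decode_bdfid(bdfid)
    if fields["partition_id"] != 0:
        return fields["partition_id"]
    return fields["function"]
```

With domain 1, partition id 3, bus 0x45, device 2 and function 0 packed into one value, `partition_id_with_fallback` returns 3; with bits [31:28] cleared and function 5, it returns 5.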

Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 323ab1105d]
2024-06-27 17:27:01 -05:00
Oliveira, Daniel 05596bc060 fix: [SWDEV-461904] [rocm/rocm_smi_lib]
Checks returned error by rsmi_dev_od_volt_info_get() before assert

Code changes related to the following:
  * Unit tests

Change-Id: Icc0f329e35992aae19f07243024521181467bcd3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 497ef4a7ef]
2024-05-14 18:25:00 -05:00
Oliveira, Daniel af02873dfb fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options restored: --showclkvolt, --showvc, --showvoltagerange, --setvc
    * Rework: 162d1d24
  * Bump CLI version
  * CHANGELOG.md

Change-Id: I817ca224de923fdaa992df84592d63b4d5a12b22
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 8e6d66e15b]
2024-05-07 20:47:26 -05:00
Oliveira, Daniel 162d1d24a4 fix: [SWDEV-458862] [rocm/rocm_smi_lib]
Fixes reading pp_od_clk_voltage new variable format and size.

Code changes related to the following:
  * get_od_clk_volt_info()
  * get_od_clk_volt_curve_regions()
  * Unit tests
  * CLI options removed: --showclkvolt, --showvc, --showvoltagerange, --setvc

Change-Id: Ieedb845eeadcea2f2e447ec576c253ad2a814176
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 48ddd9abd7]
2024-05-02 03:29:59 -04:00
Ori Messinger 38b048f5f9 ROCm SMI LIB: Add Ring Hang Event Enums
This patch adds 'ring hang' enums to ROCM SMI LIB.
This event type name is KFD_SMI_EVENT_RING_HANG.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9b886eb1fc027f03bcca1e5d1a89a2a186b64bf5


[ROCm/rocm_smi_lib commit: 3282aaa8de]
2024-05-01 17:02:52 -05:00
Oliveira, Daniel 5ddf42fe4e fix: [SWDEV-450058] [rocm/rocm_smi_lib]
Fixes TestMeasureApiExecutionTime test failures

Code changes related to the following:
  * Unit tests

Change-Id: I6223078f219448deb6bfbd78edae371a5a4cf03c
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: adf5c1da67]
2024-04-09 16:20:14 -04:00
Oliveira, Daniel 729a26605b fix: [SWDEV-432974] [rocm/rocm_smi_lib]
Checks returned error by get_gpu_pci_bandwith() before assert

Code changes related to the following:
  * Unit tests

Change-Id: Ia0fe64f168711147c5e66c7917cf633be40dee9f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 35b561fd69]
2024-03-01 17:30:07 -06:00
Oliveira, Daniel b86b8e165a fix: [rocm/rocm_smi_lib] rsmi_dev_activity_metric_get gfx/memory activity does not update with GPU activity
Checks and forces rereading gpu metrics unconditionally

Code changes related to the following:
  * Device::dev_log_gpu_metrics()
  * Examples
  * Unit tests

Change-Id: Ic1c4f34a39f2bf197263f80ddbb84da26345807d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: b4d37caa70]
2024-02-16 09:47:45 -06:00
Oliveira, Daniel ea66076ea9 fix: [rocm/rocm_smi_lib] header cleanup Remove non-unified headers
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards

Code changes related to the following:
  * 'rsmi_dev_metrics_' APIs
  * Functional tests
  * Examples

Change-Id: I7d562a95889361ee6f8f7588f8a790f42c8eb262
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: ce36198cb1]
2024-02-14 17:50:26 -06:00
Charis Poag 059fd6260e [SWDEV-423481/SWDEV-423393] Align all device identifier details
Updated:
 * [CLI] Fixed vram % - printf style formatting causes many data errors
   This fix updates to the recommended way of outputting formatted data.
   https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
 * [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
       -> CLI name: "GUID"
       -> ROCm SMI calls: no arg, -i, --showhw, --showproduct
 * [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
       -> CLI name: "Node"
       -> ROCm SMI calls: no arg, --showhw, --showproduct
 * [CLI] Added target gfx version from kfd
       -> CLI name: "GFX Version" or "GFX VER"
       -> ROCm SMI calls: --showhw, --showproduct
 * [CLI] Base ROCm CLI
       -> Removed - stacked id formatting:
	   This is to simplify identifiers helpful to users.
	   More identifiers can be found on -i --showhw, --showproduct
 * [CLI] Update -i, --showhw, --showproduct, w/out arg
      -> Card ID/DID/Model/SKU/VBIOS:
            All unsupported values now display "N/A" instead
            of "unknown" or "unsupported"
 * [CLI] Showhw now expands data based on content

Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 4b5ccb57f0]
2024-02-13 19:52:29 -05:00
Charis Poag 443d034d36 Add rsmi_dev_target_graphics_version_get
Updates:
   - [API] rsmi_dev_target_graphics_version_get takes the
     value reported by KFD and parses it into a human-readable
     value. If the device does not support this, it returns the
     MAX UINT64 value and RSMI_STATUS_NOT_SUPPORTED.
     Otherwise, it converts the value to base-10 format, removing
     extra 0's and putting it in the correct format. If the user
     provides a nullptr, it returns RSMI_STATUS_INVALID_ARGS.
    - [Test/Example] sys_info_read updated to include
     new rsmi_dev_target_graphics_version_get tests

Change-Id: I50f94e06b8733a5dec2eb08f284b44927f36abcd
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 5d2cd0c271]
2024-01-29 14:25:24 -06:00
Bill(Shuzhou) Liu 0566bbc47a Return NOT_SUPPORT for set function in VM guest
Fix the unit tests which fail in a VM guest environment.

Change-Id: Id7c58887692bbdecba54f5d2d8463b292e19b4ad


[ROCm/rocm_smi_lib commit: a0ec98c30d]
2024-01-17 11:18:25 -06:00
Oliveira, Daniel c0335b2695 rocm_smi_lib: Fix gpu_metrics_v1_5 support
Adds support and implements APIs for 'gpu_metrics_v1_5'

Code changes related to the following:
  * gpu metrics 1.5 support
  * Unit tests
  * Examples

Build changes related to the following: None

Change-Id: Ie8917dd63c1dd1a94467b100fa44b634cebe62b6
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 373621aed3]
2024-01-05 14:24:34 -06:00
Charis Poag 18fa660402 Memory partition permission denied fix
Received EACCES return for file that does not have
write access (read only). Permissions would be an
issue, but we check for sudo/root permissions early on.

Change-Id: I98615b02e4acccc59facb42225887a6b7273716b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: c6b0c93e6f]
2023-12-06 21:51:30 -05:00
Galantsev, Dmitrii f38b62abf5 TESTS - Temporarily disable overdrive tests
Change-Id: Ice06d31e874621abf3135548eedfe2158281891d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 48163b8d4f]
2023-12-06 19:33:17 -06:00
Galantsev, Dmitrii bb50cf42a2 TESTS - Fix overdrive error on not-supported
Change-Id: I47e7f499229b47b151f4ba4d5fa9c59ac04d6816
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 102c2c692a]
2023-12-06 02:43:04 -06:00
Oliveira, Daniel e2a833f347 rocm_smi_lib: Fix GPU Metrics Max Elements Read Exceeded
Code changes related to the following:
  * Check smallest copy size for multi-valued metrics
  * Unit tests: gpu_metric_read
  * ROCMSMI examples

Build changes related to the following:
  * CMakeLists.txt

Change-Id: Ieb2363020fa21c93fbacd0edcc1d394eed183051
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 8e0d3d5a39]
2023-12-04 17:01:08 -06:00
Galantsev, Dmitrii 7fc67c88ce Fix ASAN for tests and log metrics better
Change-Id: Ib495cfc28c48a4d291a89673a3b6fc13313845c7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: a128867497]
2023-11-30 15:39:05 -05:00
Oliveira, Daniel 83589929db rocm_smi_lib: Fix Refactoring gpu_metrics code
Uses new support for 'gpu_metrics_v1_4'

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * new data structure fields added in 1.4
  * added APIs for all other existing metrics before 1.4
  * added support to older metrics; 1.1, and 1.2
  * public APIs renamed to start with prefix 'rsmi_dev_metrics_'
  * Unit tests updated
  * Examples updated

Build changes related to the following: None

Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 2c8ba4cae9]
2023-11-10 19:05:09 -06:00
Charis Poag e89751e202 Partition EBUSY with RSMI_STATUS_BUSY & invalid GPU Metrics check
* Updates:
   - [API/CLI] rsmi_dev_*_partition_set &
     rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
     EBUSY writes + cleaned up accidental map insertions
     (maplookup[] can insert values that are not in the map,
     map.at(key) fixes this potential issue)
   - [API] rsmi_dev_gpu_metrics_info_get() - returns
     RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
     outside of 1v1/1v2/1v3
   - [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
     EBUSY write errors; kept backward compatibility
     for other writes which do not care about these states
   - [API] rsmi_dev_od_volt_info_get()
      & rsmi_dev_od_volt_curve_regions_get() have better logging
     + Expose more details on why they are erroring
   - [Utils/logs/example] Expose AMD GPU gfx target version to aid in
     system troubleshooting
   - [Utils] Added test methods that look at od volt
     freq & regions into here - for easier access across
     several tests
   - [Utils] Updated getRSMIStatusString(new argument - fullstatus;
     default to true for backwards compatibility)
     -> true shows shortened RSMI STATUS response
   - [Utils] Added splitString to cut out noisy return responses
     (used in getRSMIStatusString(), when fullstatus = true)
   - [Utils] Added getFileCreationDate() to expose build date
     of the library - helpful for local builds or experimental builds
   - [Utils] Macro cleanup
   - [Example] Added a few gpu_metric checks - helpful for upcoming
     updates
   - [Device] SYSFS/DebugFS - now have better r/w displayed in logs
   - [LOGS] Expose library build date - see above for details
   - [Tests] Add more warnings/errors to test builds
   - [Tests] Moved up Partition tests for ordered test runs - helped
     identify issues with GPU BUSY writes
   - [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
     with waits for busy status found & cleaned up how we checked
     for partition changes - with RSMI responses exposed more clearly
   - [Tests] perf_determinism - multi gpu now properly runs through
     with full resets as needed
   - [Tests] volt_freq_curv_read - better error handling with more
     verbose output
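
The "waits for busy status" behaviour noted for compute_partition_read_write in the list above can be sketched as a simple retry loop. The status constants and the function name `set_with_busy_retry` are hypothetical stand-ins, not the RSMI API; the retry count and delay are assumed values for illustration.

```python
import time

STATUS_SUCCESS = "SUCCESS"  # stand-in for a successful RSMI status
STATUS_BUSY = "BUSY"        # stand-in for RSMI_STATUS_BUSY


def set_with_busy_retry(set_fn, retries=5, delay_s=1.0, sleep=time.sleep):
    """Call set_fn, retrying while it reports BUSY, up to `retries` extra times."""
    status = set_fn()
    attempts = 0
    while status == STATUS_BUSY and attempts < retries:
        sleep(delay_s)  # give the device time to finish its current work
        status = set_fn()
        attempts += 1
    return status
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; any status other than BUSY is returned to the caller unchanged.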

Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 57b6135e54]
2023-10-27 14:52:02 -04:00
Charis Poag 73d4fbf53d bdfid fix for partition & xgmi nodes
* Updates:
    - [API] After discovering all amd gpus, we now properly
      map correct bdf (xgmi nodes). Especially important for
      partition changes - aka secondary nodes.
    - [API] While adding new secondary nodes we now have
      better grouping -> due to resorting based on
      kfd properties list & matching to primary uniqueid
    - [API] All secondary nodes are now AddToDeviceList
      with correct bdf (location id), provided by kfd
    - [API] Modified AddToDeviceList(..., uint64_t bdfid):
      providing an optional field - bdfid. This allows working
      around primary pcie cards with xgmi nodes
    - [API] Utils - cpplint minor fixes
    - [Example] Replaced all endl references with newline, fixed
      spacing, and fixed some values incorrectly displayed as hex
      (needed dec representation)
    - [API] kfd node functions - now print full path of file
      for trace logs
    - [Tests] power_read.cc: Added in generic power test to
      confirm guaranteeing specific return values

Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 6f1afd2678]
2023-10-13 20:14:39 -05:00
Galantsev, Dmitrii 2e5f5fd51a TESTS - Skip XGMI test
Change-Id: Idd9f505f36fac4a670e5129f835aa051b5c4c9fa
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 2a7589a065]
2023-10-12 21:27:55 -05:00