Commit Graph

307 Commits

Author SHA1 Message Date
Castillo, Juan 7c882b2f69 SWDEV-518209: GPU Metrics 1.8 (#177)
- Updates:
    - Adding the following metrics to allow new calculations for violation status:
        - Per XCP metrics gfx_below_host_limit_ppt_acc
        - Per XCP metrics gfx_below_host_limit_thm_acc
        - Per XCP metrics gfx_low_utilization_acc
        - Per XCP metrics gfx_below_host_limit_total_acc
    - Increasing available JPEG engines to 40. Current ASICs may not support all 40. Unsupported engines will be indicated as UINT16_MAX, or N/A in the CLI.

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-03-19 10:24:02 -05:00
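Note on commit 7c882b2f69: the message above says that JPEG engines an ASIC does not support are indicated as UINT16_MAX, or N/A in the CLI. The following is a minimal Python sketch of that sentinel handling, not code from the amd-smi source. The function name format_engine_activity and the sample readings are hypothetical.

```python
# Sketch only: maps the UINT16_MAX "unsupported" sentinel to "N/A".
# format_engine_activity is a hypothetical helper, not an amd-smi function.
UINT16_MAX = 0xFFFF


def format_engine_activity(values):
    """Return per-engine values with unsupported engines shown as 'N/A'."""
    return ["N/A" if value == UINT16_MAX else value for value in values]


if __name__ == "__main__":
    # Hypothetical readings: two supported engines, one unsupported.
    print(format_engine_activity([12, 0, UINT16_MAX]))
```

A caller that aggregates engine activity would need to drop the "N/A" entries first, since the sentinel is not a real measurement.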
Poag, Charis 48cb5529d2 [SWDEV-493274/SWDEV-514998] Add AMD SMI partition tests + Add Guest amd-smi static --partition (#127)
* [SWDEV-493274/SWDEV-514998] Add AMD SMI partition tests + Add Guest amd-smi static --partition

Changes:
    - Added amd-smi static --partition for guest systems
    - Added C++ tests for memory and compute (accelerator) partitions
    - Added Python tests for amdsmi_get_gpu_vram_info(),
       amdsmi_get_gpu_accelerator_partition_profile_config()
    - Updated Python tests for
      amdsmi_get_gpu_accelerator_partition_profile()
      Now includes more profile and resource detail
    - Added amdsmi_get_gpu_xcd_counter();
      Tests provided for both C++/Python APIs
    - Added AmdSmiVramType & AmdSmiVramVendor: they were missing
      and are required for Python testing.

Change-Id: Ib6549d8ccc5fb68726f38745b87c78f890186022
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-03-11 16:38:46 -05:00
Arif, Maisam 0e67568902 [SWDEV-501958] Doc Update deprecating pasid in 7.0 (#166)
Change-Id: Ie19ba271c901d0be324143474871241272166124

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I024f7e2b5e7a5fcd6e1d12181d21ffacfe29c00f
2025-03-07 14:56:46 -06:00
Narlo, Joseph d7c3ad0886 [SWDEV-515031] Change Header Version to 25.2.0 (#109)
Change Versioning Scheme to match https://semver.org/
Dropping the year enum and API fields in a future release.
Should not impact library versioning since we are now starting from 25.2.0
---------

Signed-off-by: Joseph Narlo <joseph.narlo@amd.com>
Co-authored-by: Arif, Maisam <Maisam.Arif@amd.com>
Change-Id: Id090e23f156926d08f9c0b781447388adf268cf6
2025-02-26 19:17:09 -06:00
Narlo, Joseph dc4a16da6f [SWDEV-513651] Sync Unified And Linux Header (#98)
Signed-off-by: Joseph Narlo <joseph.narlo@amd.com>
2025-02-06 22:25:50 -06:00
Kanangot Balakrishnan, Bindhiya a7283196a7 [PLAT-156250] Blacklist VoltCurvRead test for unsupported devices (#96)
Blacklisted TestVoltCurvRead for devices with gfx_target_version
90400, 90401 and 90402 as it is not supported on these systems.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-02-05 18:44:54 -06:00
Castillo, Juan 9cc5c303a2 [SWDEV-508173] [AMDSMI] Python API missing function errors (#46)
* [SWDEV-508173] Updates include:
- Updating py-interface to import amdsmi_get_gpu_reg_table_info and amdsmi_get_gpu_pm_metrics_info.
- Updating the ctypes from byref to pointer.

Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>
2025-01-21 14:11:41 -06:00
Scaffidi, Salvatore 3793be7735 [SWDEV-463406] Update API with fields for gfx_clock_below_host_limit and low_utilization violations
Updated API with fields for gfx_clock_below_host_limit and low_utilization violations
Change-Id: I25647bae6e7b785f44dab024272767658688bcad

---------
Signed-off-by: Scaffidi, Salvatore <Salvatore.Scaffidi@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
2025-01-08 22:07:23 -06:00
Juan Castillo f8b8347627 [SWDEV-496693]GPU Metrics 1.7
Features added:
- [SWDEV-475244] Add new interface to get max memory bandwidth
Updated API: amdsmi_get_gpu_vram_info
Updated: struct amdsmi_vram_info_t to include vram_max_bandwidth
CLI: amd-smi static --vram

- [SWDEV-488349] Add new interface for XGMI link status
New API: amdsmi_get_gpu_xgmi_link_status
CLI: amd-smi xgmi --link-status

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Change-Id: I1aa35b741136eb4f02f7ea9a95b865886273eb72
2024-12-18 10:57:06 -06:00
Joe Narlo d0a7332d32 SWDEV-492272 [AMDSMI] Build/Compiler warnings messages
Fix compiler warnings

Signed-off-by: Joe Narlo <Joseph.Narlo@amd.com>
Change-Id: I10657b8f3ef18a9b45311e8f6509958297a57823
2024-12-13 00:38:07 -05:00
Joe Narlo 3052ad4220 SWDEV-495787 [AMDSMI] Different license headers
Change copyrights to MIT and remove date

Signed-off-by: Joe Narlo <Joseph.Narlo@amd.com>
Change-Id: I16f5b412f2b9ddefaaa1771aa714cc18829a1be4
2024-11-22 08:55:28 -05:00
Adam Pryor b7789d4699 Revert "[SWDEV-446215] Update cmake to put test libs in proper lib dir"
This reverts commit 6e01df00ca.

Reason for revert: The gtest of amdsmi differs from that of other components, so it was installed in a share/amdsmi/lib folder. It cannot be installed in a common folder such as /usr/local/bin or /usr/bin because all other components search those folders first.

This is breaking ROCmValidationSuite and other tools. Per Wang, Yanyao this should be reverted.

Change-Id: Id61bc6056fe41800e738616f39293e9b8762a377
2024-11-15 15:08:12 -05:00
Maisam Arif afd06950c1 Revert "SWDEV-489696 [AMD SMI] Update python integration test"
This reverts commit 06e7bf8a98.

Reason for revert: Changes needed

Change-Id: I96cc956a2f1c73a2828c70ec9aa22931ba570d8f
2024-11-14 18:54:48 -05:00
Joe Narlo 06e7bf8a98 SWDEV-489696 [AMD SMI] Update python integration test
Initial update

Signed-off-by: Joe Narlo <Joseph.Narlo@amd.com>
Change-Id: I7c5777159f591f8b402168576b14ef8c1157e8d9
2024-11-14 17:52:01 -05:00
adapryor 6e01df00ca [SWDEV-446215] Update cmake to put test libs in proper lib dir
Change-Id: I2e91b904b3f869cdba717d872c10d799d0260c30
Signed-off-by: adapryor <Adam.pryor@amd.com>
2024-10-29 16:07:58 -04:00
gabrpham 00b3184e9f SWDEV-478748 Changed TestPciReadWrite Test Failure message to Warning
TEST FAILURE message for `amdsmi_get_gpu_pci_throughput` and
`amdsmi_get_gpu_pci_bandwidth` changed to WARNING to indicate that
pcie_bw and/or pp_dpm_pcie sysfs files may not be supported on respective
devices.

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: I1ad6e15eceacb5a00b022458ee5fb21df9d845c7
2024-10-18 16:32:57 -05:00
Charis Poag 5eff39915b [SWDEV-463406] Add violation_status current counter/accumulated values
Changes:
  - amdsmi_violation_status_t now includes current accumulated/counter
   values
  - Tests/wrapper now include added values
  - Removed ASIC references in header for host/bm alignment
  - Fix violation_status->per_hbm_thrm /
    violation_status->active_hbm_thrm
    calculations.

Change-Id: Ic86a7cbad5198a41018f82f6b588b83158d9ba0b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-10-04 15:56:01 -04:00
Charis Poag 3a4abbd8c0 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6
Changes:
    - Added new GPU metrics:
      1) Violation status' (ex. PVIOL/TVIOL) accumulators
      2) XCP (Graphics Compute Partitions) statistics
      3) pcie other end recovery counter
    - CLI/API/tests changes were made accordingly

Change-Id: I589b9b1f570f25dda12d95bb501feca85da8b3bb
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-27 12:04:21 -05:00
Lang Yu 7a557b1c50 SWDEV-463405: Add amdsmi_get_link_topology_nearest support
amdsmi_get_link_topology_nearest() is used to retrieve
the set of GPUs that are nearest to a given device
at a specific interconnectivity level.

Code changes related to the following:
    * API
    * CLI
    * Unit tests
    * Examples

Header Unification Change: "/amdsmi/+/1122408"

Change-Id: Id0317797c652c267742513936d321677793ec634
Signed-off-by: Lang Yu <lang.yu@amd.com>
2024-09-26 16:43:27 -05:00
Ryo Ficano 9979be8512 [SWDEV-482963] [Test updates] Add new tests for p0 items - BM v2
Updates:
- Added tests for these API calls:

amdsmi_get_socket_handles
amdsmi_get_processor_type
amdsmi_get_clk_freq
amdsmi_get_gpu_process_info
amdsmi_get_gpu_ras_block_features_enabled
amdsmi_get_gpu_ecc_count
amdsmi_get_gpu_memory_usage
amdsmi_get_gpu_vendor_name
amdsmi_get_utilization_count

- Added amdsmi_init() and amdsmi_shut_down() before and after each test.
- Updated README and removed all pytest references.

Change-Id: Ida0c165a466571b1df36c413161bd95c070f6ff1
Signed-off-by: Ryo Ficano <Ryo.Ficano@amd.com>
2024-09-26 14:08:13 -04:00
gabrpham 8bc4abc88b Corrected partition changes in header and wrapper
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: Iafd7de8f08924873da841ee6eca62100a17b2b6c
2024-09-20 17:01:55 -05:00
gabrpham c9a489d437 Moved partition_id from static --asic-info to static --partition.
partition_id also removed from the `amdsmi_asic_info_t` struct and
supporting API has been added for querying partition information.

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: Id5a6291a77d11bb97a1c7a200fc465898e86e081
2024-09-20 03:48:42 -04:00
Maisam Arif 3b7f661e71 Moved KFD information to separate structure and API
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: If6eaea589edc704cf408d6391b5f2154134035e7
2024-09-20 03:48:42 -04:00
Charis Poag ede0e6318d Fix python unittest not being installed by the amd-smi-lib-test package
Moving to TESTS_COMPONENT allows files to be placed
within the amd-smi-lib-test package.
Previously, they were put within the amd-smi-lib package,
where installation would never be triggered with the
latest changes.

Change-Id: Id49dbe69bfc7d5bd1af403c28b946fe1edf64d8e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-18 19:25:48 -05:00
Juan Castillo ac593f9fa0 [SWDEV-482966/ SWDEV-482967] Removing pytest dependency + install path change
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Change-Id: I7aace93fcad18d67443e6849c10a1fbbc65d0fa8
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2024-09-18 00:27:00 -04:00
Eisuke Kawashima 1b6ec8df07 chore: unset executable permission
Change-Id: I06727774f3b1657a7955b172a40d0dfc9c76d6b9
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-09-16 17:34:39 -04:00
Maisam Arif 105db1afcd Updated License Dates
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I8ca199c129c06508bc3e23745ab5ac2d20dce928
2024-09-16 16:14:47 -04:00
Charis Poag a33e4c9e14 [SWDEV-483526] Fix MI3x partitions not showing all logical nodes
Changes:
- Updates to amdsmi_asic_info_t structure to include:
  target_graphics_version, kfd_id, node_id, partition_id
- Updates to amd-smi static --asic to display new
  amdsmi_asic_info_t fields
- Updates to gpu enumeration during amdsmi_init()
  to discover all logical GPUs when in a non-SPX mode
  (ex. DPX, TPX, QPX, or CPX)
 - Updates to amdsmi_get_gpu_bdf_id(..) to include
   partition_id details when in BDF or optional bits.
     - bits [63:32] = domain
     - bits [31:28] or bits [2:0] = partition id
     - bits [27:16] = reserved
     - bits [15:8]  = Bus
     - bits [7:3] = Device
     - bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes

- C++/Python tests updated to reflect these outputs

Change-Id: I4be0ea35bb98f3109ae2ca9e82f6b21baa38de29
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-11 16:35:17 -05:00
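Note on commit a33e4c9e14: the bit layout listed for amdsmi_get_gpu_bdf_id(..) can be unpacked with plain shifts and masks. The following is a minimal Python sketch based only on the bit ranges in that message. The function name decode_bdf_id is hypothetical and is not part of the AMD SMI API. Because the message says the partition id may sit in bits [31:28] or fall back to bits [2:0], the sketch returns both fields and leaves the choice to the caller.

```python
# Sketch only: unpacks the 64-bit BDF id using the bit ranges quoted in the
# commit message. decode_bdf_id is a hypothetical helper, not an AMD SMI call.
def decode_bdf_id(bdf_id: int) -> dict:
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (bdf_id >> 28) & 0xF,    # bits [31:28]
        # bits [27:16] are reserved and ignored here
        "bus": (bdf_id >> 8) & 0xFF,             # bits [15:8]
        "device": (bdf_id >> 3) & 0x1F,          # bits [7:3]
        # bits [2:0]; may carry the partition id as a fallback in non-SPX modes
        "function": bdf_id & 0x7,
    }


if __name__ == "__main__":
    # Hypothetical value: domain 0, partition 2, bus 0x45, device 0, function 0.
    sample = (0 << 32) | (2 << 28) | (0x45 << 8) | (0 << 3) | 0
    print(decode_bdf_id(sample))
```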
Tim Huang 260edaa752 [SWDEV-463402] - Support retrieving connection type and P2P capabilities between two GPUs
1. Add an API interface amdsmi_topo_get_p2p_status to retrieve
connection type and P2P capabilities between 2 GPUs.

2. Add a P2P status test in hw_topology_read
to print P2P capability information.

3. Add the following tables for CLI topology subcommands:
  - CACHE COHERANCY TABLE
  - ATOMICS TABLE
  - DMA TABLE
  - BI-DIRECTIONAL TABLE

Change-Id: I199173030d4170115cea27c472958a4826e4e1bf
Signed-off-by: Tim Huang <tim.huang@amd.com>
2024-09-06 09:42:34 -04:00
Maisam Arif 97c487372f Clean up unused files & Update License info
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I5b58e8fe3d9eeac207b07ce0fe4134dd717dbd90
2024-09-05 09:52:48 -04:00
gabrpham 95ca2b83a1 Changed power parameter in amdsmi_get_energy_count() to energy_accumulator
Issue linked here: https://github.com/ROCm/amdsmi/issues/38

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: I622236eb3f0144aefeb6c82d2713b4822bfeeb11
2024-09-04 09:38:08 -04:00
Oliveira, Daniel 893f13ab98 SWDEV-463399: amdsmi_get_gpu_vram_info() adds bit-width
Driver info `amdgpu_gpu_info.vram_bit_width` is exposed through amdsmi_get_gpu_vram_info().

Code changes related to the following:
  * API
  * CLI
  * Unit tests
  * Examples

Change-Id: I8abd8db7a603078b2b1c008b2685cecf35caf3d2
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-08-27 18:22:50 -04:00
Oliveira, Daniel af3670d758 SWDEV-463372: amdsmi_get_utilization_count() adds decoder_activity
GPU Metrics info `gpu_metrics.vcn_activity` is exposed through amdsmi_get_utilization_count().

Code changes related to the following:
  * API
  * CLI
  * Unit tests

Change-Id: I831b2a81bdc0e090a6698dcb689d10f91ed87dd9
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-08-27 16:58:34 -05:00
Maisam Arif 2388ff7e3c Whitespace
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I8335e617670a471a97bf54886b3221b6222e507f
2024-07-10 19:22:02 -05:00
Charis Poag 7194aaebf3 [SWDEV-455442/SWDEV-464645] Add back voltage curve testing for MI300
Validation requires running tests for MI300 systems, this update
removes the exclusion for these systems.

Change-Id: Idacf3e8bf0bd569f1cfa6192af47993eb5440ee6
2024-07-08 14:24:26 -05:00
Dalibor Stanisavljevic 7b2463abe0 SWDEV-457337 - Fix header alignment
Change-Id: I9f25f6c4f0d00c76b66d13162f30be11368f5b59
Signed-off-by: Dalibor Stanisavljevic <Dalibor.Stanisavljevic@amd.com>
2024-05-23 04:41:57 -04:00
Maisam Arif 7d999aa34c SWDEV-458102 - Updates to pp_od_clk_voltage parsing
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I650dae1a99856dcde914fe66917cf9111f3ce0e2
2024-05-15 03:18:24 -05:00
Maisam Arif 52843152a5 SWDEV-444567 - Added Ring Hang Event
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
2024-05-03 17:21:28 -04:00
Maisam Arif 11c72946eb Revert "SWDEV-458102 - Deprecated Voltage Curve API"
This reverts commit 1423fb632e.

Change-Id: I8a3eaf0a9f28200e09fb35d5260fbc070fe8a4a9
2024-05-02 15:27:16 -05:00
Charis Poag c24d66740e SWDEV-450580 - Fix powercap set
Updates:
     * CLI - Added AMDSMIHelpers.convert_SI_unit() to help
       conversion of units
     * API - Reverted to uW for power cap limits
     * CLI - amd-smi static --limit now includes MIN_POWER
     * Tests now are all using uW units to keep W conversion
       to only happen in CLI
     * Python API now reflects same units as uW (what is seen
       in amdgpu driver)
     * CLI - amd-smi metric --power:
       Fixed power seen on gpu_metrics v1.3

Change-Id: I32d9ba78d0d8806772f0860f9a803a885b3f316a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-05-02 10:13:39 -05:00
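Note on commit c24d66740e: the message says the API and tests keep power cap values in uW (microwatts), with conversion to W happening only in the CLI. The following is a minimal Python sketch of that split. The signature of AMDSMIHelpers.convert_SI_unit() is not shown in this log, so the helper microwatts_to_watts below is a hypothetical stand-in, and the sample value is invented.

```python
# Sketch only: keeps microwatts as the API-level unit and converts to watts
# at display time. microwatts_to_watts is a hypothetical helper, not the
# CLI's convert_SI_unit().
MICRO = 1_000_000


def microwatts_to_watts(power_uw: int) -> float:
    """Convert a power value in microwatts to watts for display."""
    return power_uw / MICRO


if __name__ == "__main__":
    power_cap_uw = 300_000_000  # hypothetical API-level value in uW
    print(f"{microwatts_to_watts(power_cap_uw)} W")
```

Keeping a single unit below the display layer avoids mixing W and uW values when the same number is passed between the API, the tests, and the driver.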
Maisam Arif 1423fb632e SWDEV-458102 - Deprecated Voltage Curve API
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I111c3ce26d2ab66d5e755432f4b8a9bfa631f805
2024-05-02 02:53:29 -04:00
Maisam Arif 1bd18c1a65 Added new ecc blocks and adjusted metric --ecc-block filtering
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ib2f69c7d59ee5108024794434fb202b5e4f58738
2024-04-18 15:01:41 -04:00
Oliveira, Daniel 08e2e21bab fix: [SWDEV-442525] [rocm/amd_smi_lib]
Fixes gpu_process_list

Code changes related to the following:
  * amdsmi_get_gpu_process_list()
  * CLI
  * Examples
  * Unit tests
  * Changelog
  * Readme
  * rocm_smi_lib commit: 677433b367

Change-Id: I9210fbca7a5da92d0a8b472b72ca82597c8e4fb5
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-03-27 16:48:24 -05:00
Oliveira, Daniel c6208c0db0 fix: [rocm/amd_smi_lib] Navi3X/Navi2X/MI100 amdsmitst 2 test cases fail when running
Checks the error returned by get_gpu_pci_bandwidth() before asserting

Code changes related to the following:
  * Unit tests

Change-Id: I950eee5d92607eea08722af7d7c84e8457cd4e60
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-28 15:11:22 -06:00
Oliveira, Daniel 475424525e fix: [rocm/amd_smi_lib] TestFrequenciesRead & TestPciReadWrite test cases failed
Fixes asserts in unit tests, and 'pp_dpm_pcie' condition

Code changes related to the following:
  * rsmi_dev_pci_bandwidth_set()
  * Functional tests

Change-Id: Id5e6851393fa3b51bb8cad87daca1efaf500a7e0
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-22 03:40:50 -05:00
Oliveira, Daniel 78074d7d77 fix: [rocm/amd_smi_lib] amdsmi_get_gpu_activity gfx/memory activity does not update
Checks and forces rereading gpu metrics unconditionally

Code changes related to the following:
  * Device::dev_log_gpu_metrics()
  * amdsmi_get_gpu_metrics_header_info()
    Removed unintentionally during work on 'header cleanup Remove non-unified headers'
  * Examples
  * Unit tests

Change-Id: I83710e173c0f7102d0b7f865c18474c979a95cd8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-13 10:15:17 -06:00
Oliveira, Daniel 55734d2d7a fix: [rocm/amd_smi_lib] header cleanup Remove non-unified headers
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards

Code changes related to the following:
  * '_get_gpu_metrics_' APIs
  * Functional tests

Change-Id: I2dd2ecde11c1d77e343e0ae0e10aeb9120ae9b99
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-01-26 10:38:48 -05:00
Charis Poag 34bd26c68e Fix metric type error output + re-align with ROCm SMI metrics
Changes:
* [CLI] Provide fix for "/opt/rocm/bin/amd-smi metric
TypeError: '>' not supported between instances of 'str' and 'i"
--> Python API was updated, CLI needed to reflect these changes
* [API] Updated amdsmi.h's with ROCm SMI
--> Incorrectly added mem_bandwidth_acc & mem_max_bandwidth
--> Realigned wrapper with updates
* [Test] Added metrics not shown in gpu_metrics_read.cc

Change-Id: Ia3a172377fd5a582254dd5a46d81dbec7e763cd9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-01-24 21:23:40 -06:00
Galantsev, Dmitrii a60f5d2d4c SWDEV-409184 - Exclude some tests in VM
Change-Id: Ic196a113426fc63a0b2aadfa04ab4b10ed6434e3
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-11 01:38:15 -06:00
Charis Poag 5ff5af0b5a Fix GPU metric tests & cleanup test output
- CLI: Added average_power to display if current_power is empty
- CLI: fixed PCIe current_speed not displaying GT/s
- ROCm API: 1.3 & 1.4
    -> commented out setting avg clocks to current clock value
       (leave as max uint value, not re-assign; these are not same values)
    -> commented out setting current_socket_power = average_power
       (leave as max uint value, not re-assign; these are not same values)
    -> For all non-array clocks, placed value in first
       array[0] to keep outputs consistent
       (helps xcd calc)
- ROCm API: rsmi_dev_metrics_curr_gfxclk_get fixed to count
  XCDs using backwards compatible rsmi_dev_gpu_metrics_info_get.
- ^ Fixes XCD count overall + assigning clock[0] in 1.3 to curr
  freq
- AMD SMI API: amdsmi_get_gpu_metrics_info() initialized all new
  1.5 metric values for all lower metric tables
- AMD SMI API: wrapper -> fix is here + returns correct AMD SMI return
- AMD SMI API: wrapper -> now displays amdsmi return status as
  string in logs
- gpu_metrics_read.cc -> now has better overview of backwards
  compatible output
- gpu_metrics_read.cc -> Cleaned up output, added units, and
  display all array output

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Change-Id: Id5b60ded5b0ed2cdf0f96ca72c79e356f0410960
2023-12-19 14:18:15 -05:00