Commit graph

345 commits

Author SHA1 Message Date
Bill(Shuzhou) Liu c489cb8f3f Add support for deferred RAS errors in API
The API will support deferred errors.

Change-Id: I221a146f09fefde1fc31e5f746d0870e07c93561
2024-03-04 22:46:44 -05:00
Maisam Arif 69caba8727 Bump Version to 24.4.0.0 & Corrected argument checks for set subcommand
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I651f8ca652c764f30845503dd869f435f728d5ba
2024-02-23 20:47:19 -06:00
Bill(Shuzhou) Liu db33cda0c1 Unify the amdsmi_get_pcie_info python interface
Make the python interface consistent with the C interface.

Change-Id: Idda08f888947c757e475d5a024b0ec3d8e1d846a
2024-02-22 03:33:59 -05:00
Maisam Arif f58613561c Refactor ESMI Initialization and Argument Parsing
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Iefab3a8110e0d3c525ee0cef1bdef9101550e9de
2024-02-21 19:02:14 -05:00
Deepak Mewar 84608807da Fix for multiple hsmp freq sources not reported on some setups
Change-Id: I8afe7076bd7790cf408ef104c50ac8d258b7d3fc
Signed-off-by: Maisam Arif <maisarif@amd.com>
2024-02-21 06:30:03 -06:00
Maisam Arif 703fdb0ed2 Aligned cache property enum with Host
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ie64a33f55c9a9a7cc8c806419509897351f37c70
2024-02-20 05:48:53 -06:00
Maisam Arif 61f8888488 24.3.0 Version update
Change-Id: I936c896117ad64d06ea919a8b7bd6ba4cc388592
Signed-off-by: Maisam Arif <maisarif@amd.com>
2024-02-15 17:21:24 -05:00
Maisam Arif 77710921a4 Align list and cache_info to Host
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I4fa55b360b74d5a202d0b9b4eb7aee660b0a1bcf
2024-02-15 01:47:59 -05:00
Deepak Mewar 34ccbb5d1b Updated amdsmi header for ESMI doxygen formatting
Referencing https://github.com/ROCm/amdsmi/pull/10

Change-Id: I516e3643130db8a4213aee7dfcaca27363e3171e
Signed-off-by: Maisam Arif <maisarif@amd.com>
2024-02-14 02:03:05 -06:00
Oliveira, Daniel 78074d7d77 fix: [rocm/amd_smi_lib] amdsmi_get_gpu_activity gfx/memory activity does not update
Checks and forces rereading gpu metrics unconditionally

Code changes related to the following:
  * Device::dev_log_gpu_metrics()
  * amdsmi_get_gpu_metrics_header_info()
    Removed unintentionally during work on 'header cleanup Remove non-unified headers'
  * Examples
  * Unit tests

Change-Id: I83710e173c0f7102d0b7f865c18474c979a95cd8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-02-13 10:15:17 -06:00
Maisam Arif f831cf49f7 Renamed amdsmi_get_metrics_table to amdsmi_get_cpu_metrics_table
Renamed structs to be more consistent with what they are calling

Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I6f2be2fcb76f004aa592f0dad8545565700ccd4b
2024-02-12 16:30:18 -06:00
Bill(Shuzhou) Liu 86d025daaa Add @platform doxygen alias
The @platform alias describes the platforms (for example,
gpu_baremetal and/or host) on which an API can be used.

get_platform.py is a tool for comparing APIs across platforms.

Change-Id: I902bc4fea048269eace6e9f3f4a8e93f3adb7f87
2024-02-07 07:28:38 -05:00
Deepak Mewar 6f7273fda5 Added amdsmi cpu family & cpu model
- Updated header and source files
- Updated python interface
- Generated python wrapper for updated header
- Updated the CLI to have cpu family & cpu model
  as part of metric table

Change-Id: Iea440251797270d5d29ffe883b0ad6db790be658
2024-02-06 18:46:27 -05:00
Maisam Arif 88192d8b6b SWDEV-436533 - Cache Info Struct Update
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ic640fa657cdcc32d7b00ff78fc9452ec7e05dd07
2024-02-05 16:51:04 -05:00
Maisam Arif 59d885a9ca Fixed gpu_metric and cache cli checks
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ic71e2b50dfa8fc106a17079842a7564a8e24b69d
2024-02-01 05:47:18 -05:00
Oliveira, Daniel 55734d2d7a fix: [rocm/amd_smi_lib] header cleanup Remove non-unified headers
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards

Code changes related to the following:
  * '_get_gpu_metrics_' APIs
  * Functional tests

Change-Id: I2dd2ecde11c1d77e343e0ae0e10aeb9120ae9b99
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-01-26 10:38:48 -05:00
Charis Poag 34bd26c68e Fix metric type error output + re-align with ROCm SMI metrics
Changes:
* [CLI] Provide fix for "/opt/rocm/bin/amd-smi metric
TypeError: '>' not supported between instances of 'str' and 'i"
--> Python API was updated, CLI needed to reflect these changes
* [API] Updated amdsmi.h's with ROCm SMI
--> Incorrectly added mem_bandwidth_acc & mem_max_bandwidth
--> Realigned wrapper with updates
* [Test] Added metrics not shown in gpu_metrics_read.cc

Change-Id: Ia3a172377fd5a582254dd5a46d81dbec7e763cd9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-01-24 21:23:40 -06:00
Bill(Shuzhou) Liu 0b67c2ccc4 Unified API
amdsmi_get_link_metrics() and amdsmi_get_pcie_info()

Change-Id: Iea060e449813b842236243b772e8809497ce98fe
2024-01-24 18:27:20 -05:00
Maisam Arif c400a22d4d 24.2.0 Version update
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ied7c24d63ca38c2e5ea5eca6b411e0156f61a403
2024-01-24 11:13:02 -06:00
Maisam Arif c48c989bbc 24.1.0 Version update
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ibfe92d199b10dc48ece85dfdeda1041f5ea98626
2024-01-24 12:09:48 -05:00
Deepak Mewar 5d0b479661 amdsmi library updated for esmi error status mapping to amdsmi
Change-Id: I7e4dd146a1a9af496556efcf811b2e1ed565b09e
2024-01-16 11:41:22 -06:00
Deepak Mewar a0c95e855b amdsmi library updated for metric table structure
Change-Id: Ie8a9840a9020282599dd413e964d86bfb8850f6a
2024-01-16 11:41:22 -06:00
Deepak Mewar 9f3a6dbd29 amdsmi library and sample code updated for amdsmi_get_metrics_table
Change-Id: Ie03c556f5c38fe4a0365743d3a94220e3aa62b23
2024-01-16 11:41:22 -06:00
Bill(Shuzhou) Liu 5a6b5d2a0a Use the same mutex as rocm-smi
Share the same mutex as rocm-smi implementation. Handle the crash
when a user is not in render group.

Change-Id: I486b26569f9b523b41bbdaf95d51f4a730978cfd
2024-01-15 13:12:49 -05:00
Charis Poag 5ff5af0b5a Fix GPU metric tests & cleanup test output
- CLI: Added average_power to display if current_power is empty
- CLI: Fixed PCIe current_speed not displaying GT/s
- ROCm API: 1.3 & 1.4
  -> Commented out setting avg clocks to current clock value
     (leave as max uint value, not re-assign; these are not same values)
  -> Commented out setting current_socket_power = average_power
     (leave as max uint value, not re-assign; these are not same values)
  -> For all non-array clocks, placed value in first array[0] to keep
     outputs consistent (helps xcd calc)
- ROCm API: rsmi_dev_metrics_curr_gfxclk_get fixed to count
  XCDs using backwards compatible rsmi_dev_gpu_metrics_info_get.
- ^ Fixes XCD count overall + assigning clock[0] in 1.3 to curr
  freq
- AMD SMI API: amdsmi_get_gpu_metrics_info() initialized all new
  1.5 metric values for all lower metric tables
- AMD SMI API: wrapper -> fix is here + returns correct AMD SMI return
- AMD SMI API: wrapper -> now displays amdsmi return status as
  string in logs
- gpu_metrics_read.cc -> now has better overview of backwards
  compatible output
- gpu_metrics_read.cc -> Cleaned up output, added units, and
  display all array output

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Change-Id: Id5b60ded5b0ed2cdf0f96ca72c79e356f0410960
2023-12-19 14:18:15 -05:00
Naveen Krishna Chatradhi 65eed73f4d amd-smi: fix cpu specific apis and header
1. provide prototype and documentation for esmi specific api.
   define structures and update classes as required
2. update cmake files as required and add esmi api to the
   amdsmi esmi integration example.

Change-Id: I753ec176f9b381e74c9646525dfd9075237bf8d9
2023-12-18 06:28:15 -05:00
Charis Poag 8f3861e1d9 Add vcn and jpeg activity
Changes:
    - Add new engine field vcn_activity (from 1.4/1.5
      gpu_metrics)
    - Updated log output to enhance view of gpu_metric
      data as json pretty print
    - Added new fields provided in 1.5
    - Added unit overview in python API, CLI is WIP

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Change-Id: I7d9f29e7ecc35dcd0697814c222cdd02b0d5518e
2023-12-15 22:18:46 -05:00
Bill(Shuzhou) Liu 59b510de2b Support max_num_cu_shared and num_cache_instance
Add the above fields to cache info. Remove driver_date from the CLI and
remove the disable properties of cache.

Change-Id: I80672490908d9e32a149076cc37459fa56b8b0bf
2023-12-14 09:59:35 -05:00
Bill(Shuzhou) Liu de7e74f7db Collect compute partition devices under the same socket
The socket represents a physical device, and the partition devices
should belong to the socket. The partition devices differ only in
the function id of their BDF. Use the BD part of the BDF to
identify a socket.

Change-Id: I5d355a6f5db02faa7555b760a36c7351b8d8d835
2023-11-29 08:23:23 -06:00
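
A minimal sketch of the grouping idea described in commit de7e74f7db above, assuming PCI addresses in the common `domain:bus:device.function` text form. The helper names `socket_key` and `group_by_socket` are hypothetical and are not part of the amdsmi API; the library's actual implementation may differ.

```python
from collections import defaultdict


def socket_key(bdf: str) -> str:
    """Return the part of a PCI address before the function id.

    Hypothetical helper: "0000:03:00.1" becomes "0000:03:00".
    """
    bd, sep, _function = bdf.rpartition(".")
    # If there is no "." separator, treat the whole string as the key.
    return bd if sep else bdf


def group_by_socket(bdfs):
    """Group addresses that differ only in function id under one key."""
    sockets = defaultdict(list)
    for bdf in bdfs:
        sockets[socket_key(bdf)].append(bdf)
    return dict(sockets)


if __name__ == "__main__":
    # Illustrative addresses only, not taken from a real system.
    devices = ["0000:03:00.0", "0000:03:00.1", "0000:83:00.0"]
    for key, members in group_by_socket(devices).items():
        print(key, members)
```

Under these assumptions, the two partition devices on bus 03 fall under one socket key, while the device on bus 83 forms its own.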
Maisam Arif b54086a037 Change xgmi_physical_id to oam_id
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I35fb36ec0e9f72a7135d8bb9070dbdc0e956b93a
2023-11-22 12:16:38 -06:00
Maisam Arif 5b36b438b7 Refactor gpu_metrics usage in CLI
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I599878971ab94a768d008f046f2d303ad76fdb3b
2023-11-22 03:32:55 -06:00
Maisam Arif d790ebc62b Refactor gpu_metrics usage in libraries
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I763638d4b546bf49b234e823df81028c357e8f49
2023-11-22 03:32:15 -06:00
Bill(Shuzhou) Liu ac1ba33371 Add APIs for PM table and register table
Read the PM table and register table as name-value pairs.

Change-Id: Ie44fe67a28af3341bd6beb90d809e90f280351ac
2023-11-20 12:31:18 -05:00
Maisam Arif 545e57d3e3 SWDEV-426130 - Updated firmware subcommand output
Corrected truncation
Corrected xgmi to ta_xgmi
Remapped smc (system management controller) to pm (power management)

Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I404cefa7b90a454d4f4b08f6490448b47cf32107
2023-11-14 11:56:43 -05:00
Deepak Mewar 0c790752ac modified local esmi functions called from amdsmi_init
for gtest compatibility

Change-Id: I627c9887a1f1e340c358f060818a1a7d74ce33f9
2023-11-10 15:50:42 -05:00
Maisam Arif 5dba2f3120 Updated License Dates
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Id6fd66b03c602232ecc1a063a534a15fe3a03f56
2023-11-07 03:57:08 -05:00
Bill(Shuzhou) Liu 56b246cc3c Support cache type in cache info
Add the cache type to the cache info.

Change-Id: Ic13ca9640b65d2b414eeebe7b884530f2036aac8
2023-11-02 04:53:38 -05:00
Deepak Mewar 28f6383639 Esmi Auxiliary API wrappers removed from amdsmi library
that are called during amdsmi initialization
    amdsmi_get_cpu_family,
    amdsmi_get_cpu_model,
    amdsmi_get_cpu_threads_per_core,
    amdsmi_get_number_of_cpu_cores,
    amdsmi_get_number_of_cpu_sockets

Added amdsmi_get_cpucore_info to amdsmi library

Change-Id: Ib88d580e1d85afdf578963247e585cfae05c58ad
2023-10-30 20:59:21 -04:00
Maisam Arif 2b4637ff9f SWDEV-410051 - Updates to board_info struct & CLI
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I8735d8965140ee5da0c35106b388af1dca87ec71
2023-10-27 16:52:56 -05:00
Maisam Arif 5018a57b62 Updated READMEs & Versioning for 6.0 Release
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Idadece3c1022ecba4291b96ddbe23112e27394de
2023-10-16 16:57:49 -05:00
Maisam Arif 1f8d9cb9ef Added memory & compute partitions to amd-smi lib
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: If3acea6ad281298f1f05785b2e6d8e70fae8d89b
2023-10-13 21:47:59 -04:00
Deepak Mewar ee890c5060 esmi: remove energy reporting, fix errors from clang compiler
Clang compiler reporting errors while generating python wrappers for esmi lib

Change-Id: I62352aba3b87f9a6b044c97af6b9fd649612b622
2023-10-13 14:45:25 -04:00
Bill(Shuzhou) Liu d92d4e4b38 Add new API for RAS related information
The API to get the EEPROM version and ECC schema.

Change-Id: Iee6b3c555541a33bf16bf9ac1fd60100dfff5643
2023-10-13 02:06:14 -04:00
Galantsev, Dmitrii 6d72d65c48 Merge rocmsmi/amd-staging into amd-dev 20231010
Change-Id: I492562094a004eb78b2cc2b52d14d013d9f97112
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-11 18:58:12 -05:00
Galantsev, Dmitrii 1b606acf73 Fix amdsmi.h and update wrapper
Having an unnamed struct confuses our wrapper generator.
Adding a name solved it.

Change-Id: Iab3e73317fb21fb3667beef04878d4f3da96eadf
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-10 17:58:25 -05:00
Bill(Shuzhou) Liu 6ca95c1a2d Add support for XGMI physical id
Get XGMI physical id from sysfs.

Change-Id: Ifd9e431bc2fbfd759d888a71b99046a5eb07b6ed
2023-10-10 09:29:05 -04:00
Charis Poag 31a1fcce7d Add rsmi_dev_power_get
* Updates:
  - [API] Added rsmi_dev_power_get(uint32_t dv_ind, uint64_t *power,
                                   RSMI_POWER_TYPE *type)
          provides generic get to average or current power &
          provides backwards compatibility
  - Added a utility function to get MonitorTypes
    (monitor_type_string(type)) &
    RSMI_POWER_TYPE (power_type_string(type))
    strings
  - [Tests] Added rsmi_dev_power_get tests and
    provided better verification of return values for
    all power APIs
  - [Tests] Updated power outputs to show correct
    units
  - [example] Now uses avg, current, and generic
    power functions with type output response

Change-Id: I5ca06ca37fd5f61e100f2835b664d6cdd1ca42e6
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-10 00:34:19 -05:00
Deepak Mewar 192fb538be added metric table wrapper APIS & test code
Change-Id: I24207b3c32d7294337140a1f5108b81f3bf33580
2023-10-10 00:03:11 -04:00
Oliveira, Daniel 4e4ebde640 rocm_smi_lib: Fix Modernize and refactor gpu_metrics
Adds support for 'gpu_metrics_v1_4' and new counters

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * The new gpu_metrics are now part of the Device

Build changes related to the following: None

Change-Id: Ie748e977cd0a01c6a2fb82260014c0699605dbb3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-10-09 21:43:22 -05:00
Charis Poag b251bb0c9f Rename NPS -> memory partition + compute partition node fix
* Updates:
        - rocm_smi_lib + CLI:
          Rename all "NPS mode" -> "memory partition"
          related files/functions/API/CLI to align with correct
          technical naming
        - rocm_smi_main: fixed identifying primary card's unique id
          utilize rsmi_dev_unique_id_get to map which
          KFD nodes belong to it
        - rsmi_dev_*_partition*: now have better logging output
        - compute partition tests:
          Added 20 sec delay for workaround until GPU
          busy is confirmed as the issue
        - CPPLint fixes/formatting
        - [Example] Moved all endl to "\n" for efficiency
        - [Example] Added Edge & Junction temperature examples
        - [Example] Added rsmi_minmax_bandwidth_get() example - WIP

Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-06 11:51:09 -04:00