Граф коммитов

158 Коммитов

Автор SHA1 Сообщение Дата
Oliveira, Daniel 713d259f88 rocm_smi_lib: Fix Refactoring gpu_metrics code
Uses new support for 'gpu_metrics_v1_4'

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * new data structure fields added in 1.4
  * added APIs for all other existing metrics before 1.4
  * added support to older metrics; 1.1, and 1.2
  * added support to dump_internal_metrics_table()
  * public APIs renamed to start with prefix 'rsmi_dev_metrics_'
  * Unit tests updated
  * Examples updated

Build changes related to the following: None

Change-Id: I23e59f99d3ed43318cd6bd43bd2f0c5387e9ccb9
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-11-20 19:36:47 -06:00
Oliveira, Daniel 2c8ba4cae9 rocm_smi_lib: Fix Refactoring gpu_metrics code
Uses new support for 'gpu_metrics_v1_4'

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * new data structure fields added in 1.4
  * added APIs for all other existing metrics before 1.4
  * added support to older metrics; 1.1, and 1.2
  * public APIs renamed to start with prefix 'rsmi_dev_metrics_'
  * Unit tests updated
  * Examples updated

Build changes related to the following: None

Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-11-10 19:05:09 -06:00
Bill(Shuzhou) Liu e5627d2bf1 Sort GPU index using BDF
Sort GPU index based on BDF. Also add an API to get the XGMI
physical id.

Change-Id: I998876e435165c59d450ecd0b979315278b488a5
2023-11-06 20:51:25 -06:00
Galantsev, Dmitrii a099f0682a Fix issues introduced in 57b6135e54
- std=c++.. is not required because CMAKE_CXX_STANDARD is set
- nullptr check breaks the test because we rely on nullptr as an api for
  checking feature availability.
- enum number setting is unnecessary

Change-Id: I393e6dd3f292b7fa4198302f140c0443ba5e50f5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-03 17:54:35 -05:00
Bill(Shuzhou) Liu de5bc164de Query the CPU and GPU link type
The rsmi_topo_get_link_type() is extended to support query the CPU
and GPU link type by passing dv_ind_dst as 0xFFFFFFFF.

Change-Id: I1f212a01e8120adb70a08ab772fa9faaaecefa29
2023-10-31 10:17:24 -04:00
Charis Poag 57b6135e54 Partition EBUSY with RSMI_STATUS_BUSY & invalid GPU Metrics check
* Updates:
   - [API/CLI] rsmi_dev_*_partition_set &
     rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
     EBUSY writes + cleaned up accidental map insertions
     (maplookup[] can insert values that are not in the map,
     map.at(key) fixes this potential issue)
   - [API] rsmi_dev_gpu_metrics_info_get() - returns
     RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
     outside of 1v1/1v2/1v3
   - [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
     EBUSY write errors; kept backward compatibility
     for other writes which do not care about these states
   - [API] rsmi_dev_od_volt_info_get()
      & rsmi_dev_od_volt_curve_regions_get() have better logging
     + Expose more details on why they are erroring
   - [Utils/logs/example] Expose AMD GPU gfx target version to aid in
     system troubleshooting
   - [Utils] Added test methods that look at od volt
     freq & regions into here - for easier access across
     several tests
   - [Utils] Updated getRSMIStatusString(new argument - fullstatus;
     default to true for backwards compatibility)
     -> true shows shortened RSMI STATUS response
   - [Utils] Added splitString to cut out noisy return responses
     (used in getRSMIStatusString(), when fullstatus = true)
   - [Utils] Added getFileCreationDate() to expose build date
     of the library - helpful for local builds or experimental builds
   - [Utils] Macro cleanup
   - [Example] Added a few gpu_metric checks - helpful for upcoming
     updates
   - [Device] SYSFS/DebugFS - now have better r/w displayed in logs
   - [LOGS] Expose library build date - see above for details
   - [Tests] Add more warnings/errors to test builds
   - [Tests] Moved up Partition tests for ordered test runs - helped
     identify issues with GPU BUSY writes
   - [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
     with waits for busy status found & cleaned up how we checked
     for partition changes - with RSMI responses exposed more clearly
   - [Tests] perf_determinism - multi gpu now properly runs through
     with full resets as needed
   - [Tests] volt_freq_curv_read - better error handling with more
     verbose output

Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-27 14:52:02 -04:00
Charis Poag 6f1afd2678 bdfid fix for partition & xgmi nodes
* Updates:
    - [API] After discovering all amd gpus, we now properly
      map correct bdf (xgmi nodes). Especially important for
      partition changes - aka secondary nodes.
    - [API] While adding new secondary nodes we now have
      better grouping -> due to resorting based on
      kfd properties list & matching to primary uniqueid
    - [API] All secondary nodes are now AddToDeviceList
      with correct bdf (location id), provided by kfd
    - [API] Modified AddToDeviceList(..., uint64_t bdfid):
      providing an optional field - bdfid. This allows working
      around primary pcie cards with xgmi nodes
    - [API] Utils - cpplint minor fixes
    - [Example] Removed all endl references w/ newline, fixed
      spacing, and some incorrect values displaying as hex
      (needed dec representation)
    - [API] kfd node functions - now print full path of file
      for trace logs
    - [Tests] power_read.cc: Added in generic power test to
      confirm guaranteeing specific return values

Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-13 20:14:39 -05:00
Charis Poag 31a1fcce7d Add rsmi_dev_power_get
* Updates:
  - [API] Added rsmi_dev_power_get(uint32_t dv_ind,
                                   uint64_t *power,
                                   RSMI_POWER_TYPE
                                   *type)
          provides generic get to average or
          current power & provides backwards
          compatibility
  - Added a utility function to get MonitorTypes
    (monitor_type_string(type)) &
    RSMI_POWER_TYPE (power_type_string(type))
    strings
  - [Tests] Added rsmi_dev_power_get tests and
    provided better verification of return values for
    all power APIs
  - [Tests] Updated power outputs to show correct
    units
  - [example] Now uses avg, current, and generic
    power functions with type output response

Change-Id: I5ca06ca37fd5f61e100f2835b664d6cdd1ca42e6
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-10 00:34:19 -05:00
Oliveira, Daniel 4e4ebde640 rocm_smi_lib: Fix Modernize and refactor gpu_metrics
Adds support for 'gpu_metrics_v1_4' and new counters

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * The new gpu_metrics are now part of the Device

Build changes related to the following: None

Change-Id: Ie748e977cd0a01c6a2fb82260014c0699605dbb3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-10-09 21:43:22 -05:00
Charis Poag b251bb0c9f Rename NPS -> memory partition + compute partition node fix
* Updates:
        - rocm_smi_lib + CLI:
          Rename all "NPS mode" -> "memory partition"
          related files/functions/API/CLI to align with correct
          technical naming
        - rocm_smi_main: fixed identifying primary card's unique id
          utilize rsmi_dev_unique_id_get to map which
          KFD nodes belong to it
        - rsmi_dev_*_partition*: now have better logging output
        - compute partition tests:
          Added 20 sec delay for workaround until GPU
          busy is confirmed as the issue
        - CPPLint fixes/formatting
        - [Example] Moved all endl to "\n" for efficiency
        - [Example] Added Edge & Junction temperature examples
        - [Example] Added rsmi_minmax_bandwidth_get() example - WIP

Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-06 11:51:09 -04:00
Charis Poag f078375350 Add Current (Instant) Socket Power
* Updates:
    - rocm_smi_logger:
      General cleanup &
      Aligned to cpplint rules for usage
    - rocm_smi_monitor:
      Fixed MonitorTypes
      from not displaying properly in logs
      & Added socket power label + current
      socket power MonitorTypes
    - rocm_smi API:
      Added rsmi_dev_current_socket_power_get API
    - rocm_smi CLI:
      General cleanup,
      Concise info now displays device data
      in variable width (see printLogSpacer's
      new field),
      printLogSpacer now as an adjustable
      variable that overrides appWidth,
      Added Socket Power to base rocm-smi +
      --showpower CLI calls,
      --showpower & base rocm-smi CLI defaults
      to printing socket power (if not available,
      displays average power)
    - Cleaned up temp label references
    - power_read gtests:
      Added current socket power to testing

Change-Id: Ica57e6f98ad96e2584e7c7955e188f68d2dab89d
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-09-25 01:38:54 -04:00
Galantsev, Dmitrii 3d40c4bb2c SWDEV-422836 - Add sleep frequency support
Change-Id: I0bde403b010bf036ce44ed0600cc7eb03742c6b6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-25 01:38:27 -04:00
Ori Messinger d44a6ef523 ROCm SMI LIB: Add Missing Firmware Blocks
The purpose of this patch is to add the following missing firmware
blocks to the SMI LIB:
-RSMI_FW_BLOCK_MES
-RSMI_FW_BLOCK_MES_KIQ

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5d4d37d883878dd02ef8533d4eb8891d54d70630
2023-09-25 01:37:38 -04:00
Oliveira, Daniel e0483f2ee2 rocm_smi_lib: Fix [linux BM] [AMDSMI] Memory Bandwidth
Implements APIs for 'gpu_metrics_v1_3' utilization averages

Code changes related to the following:
  * rsmi_dev_activity_metric_get()
  * rsmi_dev_activity_avg_mm_get()
  * CLI shows "Avg.Memory Bandwidth" under "--showmemuse"

Change-Id: I8e4600f350a7c18499abf022534db2b875f09d5f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-09-21 11:00:29 -04:00
Charis Poag ed6777a8e7 Add GPU partition nodes
* Updates:
    - Fixed infinit loop on systems
      which did not have VRAM files
    - Fixed concise info from throwing exception
      with no amdgpu driver loaded
    - Fix for ability to see all nodes when
      after switching partitions (mirrors
      original card display/settings)
    - Added to logs build type, lib path,
      and set env. variables

Change-Id: Ic0333df355144ce2242cecea93fe4ce51caf311c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-09-07 22:17:54 -05:00
Galantsev, Dmitrii 4aef767596 Cleanup rocm_smi.cc
Change-Id: Ia676c237222b0dd5d9e8a054a93776f3b11e2225
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-07 15:50:40 -04:00
Bill(Shuzhou) Liu fab0542ab1 Fix doxygen warning messages
The Doxygen will enable warning as error message.

Change-Id: Ie7a7c9a823388c4140f31489604d65ec43005772
2023-09-07 08:48:38 -04:00
Bill(Shuzhou) Liu 471fbfddc1 Numa affinity shows large number
Change the affinity from unsigned int to integer to represent -1.

Change-Id: I82dc6f476b45fa4ec03a3c686fe8e6e2b7761b56
2023-08-25 09:01:08 -04:00
Charis Poag f191c2753c Error handling for unset freqs
Sending RSMI_STATUS_UNEXPECTED_DATA for drivers
which do not set some clock freqs

Change-Id: I43a9515c2757dddd412bb25cfd54095e63367030
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-08-23 10:44:57 -05:00
Bill(Shuzhou) Liu a10f00bf57 Fallback to kfd node when VRAM sysfs not available
The driver may not expose VRAM sysfs in certain system. Add a
fallback to it.

Change-Id: Ib3be71b4f4d2c79318d5026b0a97f3657d8a97b6
2023-08-17 14:36:03 -05:00
Charis Poag 755e14dbad [SWDEV-399953] Smart Temperature detection + partitioning display
* Updates:
    - Fix for devices which do not have edge sensors, but junction
    - Added partitioning (memory and dynamic) displays for
      base rocm-smi CLI calls
    - Added subheading for base rocm-smi call output
    - Added better hwmon and device detection logging

Change-Id: I8219884b2e532d6ed379527cacdc1f2b232a5451
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-08-10 19:53:38 -04:00
Oliveira, Daniel cc5ab079df Fix rsmitstReadWrite.TestPowerReadWrite test failure
Code changes related to the following:
  * All reinforcement work moved to their own files
  * Self contained changes only to support them
  * New files added to CMakeLists.txt

Change-Id: I761e91f54392824df9145eaed8b9805986861285
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-08-09 21:51:05 -05:00
Charis Poag 9c7eed7edc [lib] Enhance Logger: gpu_metrics + enable console out
* Updates:
    - Env variable RSMI_LOGGING=0 or any other value
        -> all logging off
    - Env variable RSMI_LOGGING=1 -> logs only
    - Env variable RSMI_LOGGING=2 -> console only
    - Env variable RSMI_LOGGING=3 -> both logs + console
    - Metrics output includes hexdump of current file
      and decoded metrics (functions: logHexDump
      and log_gpu_metrics)
    - System info gathered, now includes if system's
      perceived endianness - little or big endian
      helpful for viewing decoded hexdump or any
      binary translation
    - Added templates for printing unsigned hex
      (print_unsigned_hex_and_int), unsigned integers
      (print_unsigned_int), and printing both unsigned
      hex and int with an optional header
      (print_unsigned_hex_and_int)
    - Fixed some build compile warnings/errors -
      ex. doing strncpys for sku or board names
      this operation is expected and needed
      and for temp file writes if unsuccessful
      we now properly send RSMI_STATUS_FILE_ERROR
    - Fixed on RHEL 8.8/9.x logrotate does not properly
      initialize

Change-Id: Ifa0f0218c9cafd0a8cd6aa8e7f94d61e9107200f
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-08-01 21:46:19 -05:00
Oliveira, Daniel 573620f586 Add revision to --showhw
Code changes related to the following:
  * Added 'rsmi_dev_revision_get()' related code
  * Test code
  * Functional tests

Change-Id: I8c2097c65384a028c8c8437b717d05d52fe45250
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-07-18 16:17:33 -05:00
Bill(Shuzhou) Liu d9b6af7a09 Expand showpids to provide more details
Provide details of GPU usage by an application.

Change-Id: I0f36df7d358754c2c8a60432b736d98f667ee99c
2023-06-16 08:52:18 -04:00
Charis Poag c3a095a180 [SWDEV-398070] Adding logging to ROCm SMI (by default off)
Updates:
    * [rocm-smi] Provide a thread-safe logging feature
    * [rocm-smi] Adding logrotation into install/upgrade/remove
      scripts
    * [rocm-smi] Updated cmake lists to include rocm_smi_logger
    * [rocm-smi] Updated DEB/RPM install/remove logging file &
      folder with all users having r/w privledges for
      /var/log/rocm_smi_lib/ROCm-SMI-lib.log
    * [rocm-smi] Added ability to do a glob search for multiple files
      (globFileExists), assists doing file searches with * strings
    * [rocm-smi] Added ability to log system details when RSMI_LOGGING
      is turned on (getSystemDetails())
    * [rocm-smi] Added logging to provide which ROCm API is being called
      when RSMI_LOGGING is on
    * [rocm-smi] Added logging to provide SYSFS path and read value,
      when RSMI_LOGGING is on. Provides error reponse on failure.
    * [rocm-smi] Added logging to provide SYSFS path and read value,
      when RSMI_LOGGING is on. Provides error reponse on failure.
    * [rocm-smi] Added environment variable RSMI_LOGGING to control
      when logging is enabled or disabled. By default, by not
      setting this env. variable, logging is turned off. When
      setting RSMI_LOGGING=<any value>, logging is enabled
      which is placed in /var/log/rocm_smi_lib/ROCm-SMI-lib.log file.
      Setting RSMI_LOGGING is allowed in both debug and release builds.
    * [rocm-smi] Removed an initialize procedure which keeps
      debug_inf_loop. Seems this feature is not being used.

Change-Id: I79b48387609c6233c6f05b04fb8bba66b68c2399
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-05-17 21:18:52 -05:00
Charis Poag f44d1ea8bc [SWDEV-387906] Fix rocm-smi initialize crash
Fix was needed due to hwmon updates.
Several voltage sensors (ex. vddgfx/vddnb)
are unsupported or not applicable
to upcoming hardware. This was not the case
for previous hardware sensors, resulting in
the rocm-smi crash observed.

Change-Id: Ib8593e10811638def26fc7a1eda29309e328db09
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-03-17 15:04:34 -05:00
Charis Poag 0d3558945b [SWDEV-335697] Add RSMI_STATUS_SETTING_UNAVAILABLE for dynamic partition
Updates:
    * Added RSMI_STATUS_SETTING_UNAVAILABLE for
      rsmi_dev_compute_partition_set - gives users
      better error output when attempting to set
      compute partition to values not listed in
      available_compute_partition SYSFS
    * Updated python --setcomputepartition to
      provide better output when receiving
      RSMI_STATUS_SETTING_UNAVAILABLE
    * Updated all test & example files to check for
      RSMI_STATUS_SETTING_UNAVAILABLE when doing
      rsmi_dev_compute_partition_set

Change-Id: Ida5d54880d9b9b6e4a0468cdb962fdc0c18d6257
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-27 11:17:44 -06:00
Charis Poag 77c950a4bf [SWDEV-381630] Add reset partition functionality
Updates:
    * Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
    * Added --resetcomputepartition and --resetnpsmode python smi calls
    * Added temp data files rocmsmi_boot_compute_partition_<device num>
      & rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
      if data cannot be read or device does not support
    * Cleaned up NPS & compute API documentation
    * Added creation and reading of API temp files (used in reset
      functionality)
    * Cleaned up output of rocm_smi_example
    * Updated rocm_smi_example to check if running with sudo permission
      before executing write API calls (cleans up erroneous output)
    * Added template specialization for storing temp data, requires
      specific rsmi_type_t enums (restrics what data can be stored)
    * Added storage of temp data, if temp files do not exist
    * Updated google tests for NPS & compute to include reset API calls

Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-17 12:55:08 -06:00
Charis Poag 9ef376cd61 SWDEV-342812- Add NPS support
Updates:
    * Added rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to set multiple SYSFS files in debug build
    * Added ability to see user's env variables set for debug build
    * Added tests for rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to restart AMD GPU driver, used in nps_mode_set
    * Updated ROCm_SMI_Manual.pdf to include new APIs
    * Added progress bar for long running python_smi_tools, used
      in setting nps_mode if runs longer than .1 seconds

Change-Id: I6d61bedd28d7cba6aff432ad2d127ba741b7d15a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-14 11:54:24 -06:00
Charis Poag 4d7f3f2bc7 SWDEV-335697- Add support for dynamic partitioning
Original updates:
    * Added .gitignore to help with future commits
    * Updated/added copyrights on modified or added files
    * Updated rocm_smi.h/.cc
      - Added 3 new SMI API functions:
          rsmi_dev_compute_partition_set &
          rsmi_dev_compute_partition_get
      - Added helpful maps/enums used in
        new get/set compute_partition API calls
    * Updated rocm_smi.py
      - Added --showcomputepartition
      - Added --setcomputepartition
      - Fixed a few mistypes
    * Updated rsmiBindings.py - added helpful class/dict/list
    * Updated rocm_smi_example.cc
      - Added helpful MACRO to detect if api is not supported.
      - Added current_compute_partition set/get rocm lib calls
      - Added helpful macro to discover future RSMI errors
      - Commented out test_set_freq, was having permission issues
        on a Navi21
    * Updated rocm_smi_main.cc
      - Added helpful map to debug API calls, left in for future use
      - Added comment to better understand a non-class function returns
    * Added computepartition_read_write.cc/.h
      - Added get/set compute partition API test calls
      - Confirmed on devices that do not support the API calls, tests pass
    * Updated rocm_smi_test/main.cc
      - Calls new compute partition gtests

Added following updates from review feedback:
   * Updated rocm_smi.h/cc
       - Removed C++ API calls, adding support for both C/C++
         API calls could cause confusion and adds extra work for us
       - rsmi_dev_compute_partition_get -> Fixed an edge case where
         user gives a small buffer length size (smaller than data
         received), but does not receive the partial buffer back.
         google Tests are updated to reflect this find.
   * Updated rocm_smi_example.cc
       - Fixed test_set_freq, issue was that file was not writable.
         We now indicate this warning, so prior errors make sense.
       - General test code cleanup. Removed extra code,
         by creating loops for tests.
   * Updated rocm_smi_main.cc
     - Moved and got rid of an external reference to a map used
       for debugging RSMI enums, now is a const public reference.
   * Updated rocm_smi.py
     - Updated python code to identify NOT_SUPPORTED due to
       (currently) only a few GPU support the feature

Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
2023-01-13 10:46:40 -05:00
Sreekant Somasekharan aa5cba122c Fix documentation mistake related to get memory overdrive function.
Changes made on rsmi_perf_determinism_mode_set function documentation
as well for styling consistency.

Change-Id: I09ce8139eb9cbda94352ac7725c4c9b9bb06bd59
2022-06-30 08:57:52 -04:00
Sreekant Somasekharan 1432e5e040 Add rsmi lib function to get memory overdrive value
Change-Id: I515b51d5ce4baf966bb31714886a0d72330026bc
2022-06-23 11:42:50 -04:00
Divya Shikre afe996c2ed Update get_frequencies to handle failures.
Show an optional debug log (RSMI_DEBUG_BITFIELD=2) to
the user in the following scenarios:
1. If more than one current frequency is found
2. If frequencies are not read in increasing order of
   their value
If current frequency is not available, index for it is
set to -1, values will not have * next to it in the
output. This will also be handled in rocm_smi.py.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I477ec065f7513c8045d6392f12ef6cb835a6b8f6
2022-05-11 15:33:15 -04:00
Divya Shikre 99be3451d7 Add DEBUG_LOG macro
Add DEBUG_LOG that will optionally print error
message when RSMI_DEBUG_BITFIELD is set to 2.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I6017e92d8a9e5f9861ae29ece0488d4bc198f996
2022-05-11 11:03:24 -04:00
Divya Shikre c9b42bff57 Add RSMI_CLK_TYPE_PCIE to rsmi_clk_type_t
showclocks/showclkfrq does not display pp_dpm_pcie values
in sriov. This fix adds pcie clocks to rsmi_clk_type_t
where rest of the clocks are present.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I6d129ae412623b369c14456ae9781b2dbceb2139
2022-05-06 09:15:39 -04:00
Ori Messinger 9d6403bb17 ROCm SMI LIB: Add Missing GPU Blocks
This patch adds the following 4 missing GPU blocks to the SMI LIB:
-RSMI_GPU_BLOCK_MMHUB
-RSMI_GPU_BLOCK_PCIE_BIF
-RSMI_GPU_BLOCK_HDP
-RSMI_GPU_BLOCK_XGMI_WAFL

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia1ec6f53e195f4bf7b8f073d6bed4fdb6572e546
2022-05-05 00:44:16 -04:00
Harish Kasiviswanathan 8de6ed2b8d rocm_smi_lib: add stdbool.h needed for C90
'bool' keyword is supported only from C99 onwards. Include stdbool.h
for older compilers

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I09fd5cf6eac20e7185e85a1123bc4826958b2b7c
2021-12-14 15:25:59 -05:00
Elena Sakhnovitch 50ea68e694 [ROCm SMI LIB]: Add rsmi_minmax_bandwidth_get()
API provides min/max bandwidth values between nodes.
(Current implementation only supports directly (1 hop)
connected XGMI devices.

Signed-off-by: Elena Sakhnovitch
Change-Id: Ifc95da13845fbe7903c5386d320183ffd58c5b53
2021-10-28 17:00:41 -04:00
Ori Messinger ff02042c64 ROCm SMI LIB: Add rsmi_is_P2P_accessible() API
Implements rsmi_is_p2p_accessible API.
The function returns True if P2P is possible between two nodes.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ic7316eebcec4480175c7ad04c21a42b2e1a4c454
2021-10-13 22:01:33 -04:00
Elena Sakhnovitch 5e1bfcadd7 rocm_smi_lib: fix gpu_metrics_v1_3 support
Signed-off-by: Elena Sakhnovitch
Change-Id: Ia7a6b17eb0f317465613ba92ae7548a221c46ee3
2021-08-13 11:59:50 -04:00
Elena Sakhnovitch fee82af1fe rocm_smi_lib: add gpu_metrics_v1_3 support
Signed-off-by: Elena Sakhnovitch
Change-Id: I4a9dedc80b8fce60e12c5baf8651d54d16a6a41c
2021-08-13 09:23:35 -04:00
Harish Kasiviswanathan 14201290a2 Add timestamp resolution info in comments
Specify that timestamp resolution is in ns in header file.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4db00a07c0b5c43ae23c98213f2fbbcf93110234
2021-05-05 12:32:58 -04:00
Harish Kasiviswanathan 6b10a7761b Add support to read gpu_metrics version 1.2
gpu_metrics version 1.2 provides atomic timestamp. Use this timestamp.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I7a1a675f53b93718f34b1f2979173e9064e0ef93
2021-05-05 12:31:10 -04:00
Harish Kasiviswanathan e83cf605c6 Change #define RSMI_GPU_METRICS_API_CONTENT_VER
Chnage to RSMI_GPU_METRICS_API_CONTENT_VER_1. In preparation for
supporting additional formats

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4367a2622a0fa41e6b05bc4436ecd24b8c4e30e2
2021-05-04 20:51:10 -04:00
Ori Messinger 83cd2fe4f1 ROCm SMI LIB: Add Default GPU Power Cap
Implement default GPU power cap functionality in the LIB.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53
2021-04-22 10:49:55 -04:00
Harish Kasiviswanathan 844acbc0d8 Add energy counter resolution to rsmi_dev_energy_count_get
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I03b70968257db7a45e21d7ba62542cdedd18eb85
2021-04-22 10:25:06 -04:00
Divya Shikre 9f9a7aaf65 Add new setrange function in C++ lib
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I670aaeb93827bf4b2cc08eb36d0f9756f00e4e4e
2021-04-19 22:38:59 -04:00
Elena a383dd23aa [rocm-smi-lib] add HBM temperature conversion factor
Change-Id: I45339c87c3d2a40670baf1b76ada60dceb650dc0
2021-04-19 16:41:48 -04:00
Bill(Shuzhou) Liu 8eec0a7d36 Add energy accumulator counter
The energy accumulator counter tracks all energy consumed.

Change-Id: I5b25f817b7802d81c477361447f0ecd7ec02fc61
2021-04-14 10:43:01 -04:00