Commit Graph

122 Commits

Author SHA1 Message Date
Istvan Kiss f0c16aa8a0 Update documentation and add python API documentation
Change-Id: Ibccf5b6a5fba81cea42e04a022deac8a3207b9b8


[ROCm/rocm_smi_lib commit: 50a079af0f]
2024-03-06 22:01:30 -05:00
Charis Poag 078f678c30 Fix rocm_smi library calls
- [CLI] Rounded VRAM output on CLI, no diffrence in output
    - [python API] Fixed initializing calls which reuse initializeRsmi()
      calls - now we set a global reference to rocmsmi to use
      throughout API calls (see error below)

Traceback (most recent call last):
  File "/home/charpoag/rocmsmi_pythonapi.py", line 9, in <module>
    rocm_smi.initializeRsmi()
  File "/opt/rocm/libexec/rocm_smi/rocm_smi.py", line 3531, in initializeRsmi
    ret_init = rocmsmi.rsmi_init(0)
NameError: name 'rocmsmi' is not defined

Change-Id: I0eff3b8a432abf6d4344a02b9f638e1191c51a19
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 90160a7c9c]
2024-03-04 21:08:08 -06:00
Charis Poag 059fd6260e [SWDEV-423481/SWDEV-423393] Align all device identifier details
Updated:
 * [CLI] Fixed vram % - printf style formatting causes many data errors
   This fix updates to the recommended way of outputting formatted data.
   https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
 * [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
       -> CLI name: "GUID"
       -> ROCm SMI calls: no arg, -i, --showhw, --showproduct
 * [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
       -> CLI name: "Node"
       -> ROCm SMI calls: no arg, --showhw, --showproduct
 * [CLI] Added target gfx version from kfd
       -> CLI name: "GFX Version" or "GFX VER"
       -> ROCm SMI calls: --showhw, --showproduct
 * [CLI] Base ROCm CLI
       -> Removed - stacked id formatting:
	   This is to simplify identifiers helpful to users.
	   More identifiers can be found on -i --showhw, --showproduct
 * [CLI] Update -i, --showhw, --showproduct, w/out arg
      -> Card ID/DID/Model/SKU/VBIOS:
            All unsupported values now display "N/A" instead
            of "unknown" or "unsupported"
 * [CLI] Showhw now expands data based on content

Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 4b5ccb57f0]
2024-02-13 19:52:29 -05:00
Vladimir Stempen dc98babe34 Fix [Not supported] status for get_compute_process_info_by_pid
On some systems [rocm-smi --showpids] reports
get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN

get_compute_process_info_by_pid fails because cu_occupancy debugfs method
is not provided on some graphics cards and GFX revisions by design

Proposing a change to return success status when only cu_occupancy debugfs method
is not found and provide cu_occupancy invalidation value to mark only
this parameter as UNKNOWN

Change-Id: Iae37070d9bd19483b4e6c8ee24c7d9a4c92f00d7
Signed-off-by: Vladimir Stempen <Vladimir.Stempen@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 677433b367]
2024-02-13 18:17:47 -05:00
Bill(Shuzhou) Liu 194f98b809 Support set min or max clock
In addition to be able to set clock range, new setextremum option
is added to set only min/max clock as sometimes one of them may
not be supported.

Change-Id: I7c91ba308f3fc6c78efc88117509c515d403a6cb


[ROCm/rocm_smi_lib commit: 4e0a7f2f67]
2024-02-09 09:24:26 -06:00
Charis Poag 2e3cc74c79 [SWDEV-437365] Fix --showpower
Updates:
  - [CLI] Switching to use generic rsmi_dev_power_get()
  this is a backwards compatible function to
  retrieve power values. More consistent than
  previous fixes.
  - [API] Update API for rsmi_dev_power_get()
  Now provides @depricated for this function.
  Providing notes on newer ASICS only support
  current socket power, where as previous
  ASICS only provided average power.

Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 51aec98edd]
2024-02-02 00:00:38 -06:00
Bill(Shuzhou) Liu 3da1539121 Voltage clock display as 0 when overdrive and voltage not supported
Change the python tool not to display above information if it is
not supported.

Change-Id: I48ffd95f07168219a629dfb391c1b4587308286d


[ROCm/rocm_smi_lib commit: 905c25e59b]
2024-01-19 17:11:08 -05:00
Galantsev, Dmitrii a435423020 Bump version lib:7.0.0 tool:2.0.0+hash
Change-Id: I7f2fd5605a93d07f61b997a25e1fbcf2780ea5cb
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: e1c972a193]
2023-11-21 17:19:41 -06:00
Galantsev, Dmitrii a854fbe9f6 Add version hash
Change-Id: I6cf18b00a45ebd106f981e92681cab2ef25924e2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: d61aaf44e1]
2023-11-21 17:14:38 -06:00
Charis Poag b6ae7c5775 Fix CLI checks for secondary die
MCM die check was inconsistent (using avg power).
By using only the energy counter, this provides
a consistent way of checking which die is the MCM node.

Change-Id: I532fa2047706d0f1e92e643ce1e6759e45b65ec0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 553d26ef3a]
2023-11-21 11:58:52 -05:00
Charis Poag e89751e202 Partition EBUSY with RSMI_STATUS_BUSY & invalid GPU Metrics check
* Updates:
   - [API/CLI] rsmi_dev_*_partition_set &
     rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
     EBUSY writes + cleaned up accidental map insertions
     (maplookup[] can insert values that are not in the map,
     map.at(key) fixes this potential issue)
   - [API] rsmi_dev_gpu_metrics_info_get() - returns
     RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
     outside of 1v1/1v2/1v3
   - [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
     EBUSY write errors; kept backward compatibility
     for other writes which do not care about these states
   - [API] rsmi_dev_od_volt_info_get()
      & rsmi_dev_od_volt_curve_regions_get() have better logging
     + Expose more details on why they are erroring
   - [Utils/logs/example] Expose AMD GPU gfx target version to aid in
     system troubleshooting
   - [Utils] Added test methods that look at od volt
     freq & regions into here - for easier access across
     several tests
   - [Utils] Updated getRSMIStatusString(new argument - fullstatus;
     default to true for backwards compatibility)
     -> true shows shortened RSMI STATUS response
   - [Utils] Added splitString to cut out noisy return responses
     (used in getRSMIStatusString(), when fullstatus = true)
   - [Utils] Added getFileCreationDate() to expose build date
     of the library - helpful for local builds or experimental builds
   - [Utils] Macro cleanup
   - [Example] Added a few gpu_metric checks - helpful for upcoming
     updates
   - [Device] SYSFS/DebugFS - now have better r/w displayed in logs
   - [LOGS] Expose library build date - see above for details
   - [Tests] Add more warnings/errors to test builds
   - [Tests] Moved up Partition tests for ordered test runs - helped
     identify issues with GPU BUSY writes
   - [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
     with waits for busy status found & cleaned up how we checked
     for partition changes - with RSMI responses exposed more clearly
   - [Tests] perf_determinism - multi gpu now properly runs through
     with full resets as needed
   - [Tests] volt_freq_curv_read - better error handling with more
     verbose output

Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 57b6135e54]
2023-10-27 14:52:02 -04:00
Charis Poag d57d65a607 Rename NPS -> memory partition + compute partition node fix
* Updates:
        - rocm_smi_lib + CLI:
          Rename all "NPS mode" -> "memory partition"
          related files/functions/API/CLI to align with correct
          technical naming
        - rocm_smi_main: fixed identifying primary card's unique id
          utilize rsmi_dev_unique_id_get to map which
          KFD nodes belong to it
        - rsmi_dev_*_partition*: now have better logging output
        - compute partition tests:
          Added 20 sec delay for workaround until GPU
          busy is confirmed as the issue
        - CPPLint fixes/formatting
        - [Example] Moved all endl to "\n" for efficiency
        - [Example] Added Edge & Junction temperature examples
        - [Example] Added rsmi_minmax_bandwidth_get() example - WIP

Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: b251bb0c9f]
2023-10-06 11:51:09 -04:00
Galantsev, Dmitrii 3652c8a3e7 Add --version to CLI
Change-Id: Id2a8f10f544ed04e874db773820534eddd73f55d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: d862bee754]
2023-10-02 17:57:02 -05:00
Ori Messinger 5f26166271 ROCm SMI CLI: Add Missing Firmware Blocks
The purpose of this patch is to add the following missing firmware
blocks to the SMI CLI:
-RSMI_FW_BLOCK_MES
-RSMI_FW_BLOCK_MES_KIQ

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: If9cabdc60ffcf08f27c9e6bdc20e8a26b192a738


[ROCm/rocm_smi_lib commit: aa89f2e125]
2023-09-29 18:13:16 -04:00
Charis Poag fd5066437b Add Current (Instant) Socket Power
* Updates:
    - rocm_smi_logger:
      General cleanup &
      Aligned to cpplint rules for usage
    - rocm_smi_monitor:
      Fixed MonitorTypes
      from not displaying properly in logs
      & Added socket power label + current
      socket power MonitorTypes
    - rocm_smi API:
      Added rsmi_dev_current_socket_power_get API
    - rocm_smi CLI:
      General cleanup,
      Concise info now displays device data
      in variable width (see printLogSpacer's
      new field),
      printLogSpacer now as an adjustable
      variable that overrides appWidth,
      Added Socket Power to base rocm-smi +
      --showpower CLI calls,
      --showpower & base rocm-smi CLI defaults
      to printing socket power (if not available,
      displays average power)
    - Cleaned up temp label references
    - power_read gtests:
      Added current socket power to testing

Change-Id: Ica57e6f98ad96e2584e7c7955e188f68d2dab89d
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: f078375350]
2023-09-25 01:38:54 -04:00
Galantsev, Dmitrii 80c47e3c09 SWDEV-422836 - Add sleep frequency support
Change-Id: I0bde403b010bf036ce44ed0600cc7eb03742c6b6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 3d40c4bb2c]
2023-09-25 01:38:27 -04:00
Bill(Shuzhou) Liu 93557c6e4e Change the python tool id output label
Change the label from GPU to Device as we call rsmi_dev_id_get().

Change-Id: I8ffe3673d434e5291ebd5cc909afb7d18154ecb6


[ROCm/rocm_smi_lib commit: 2247c4b46c]
2023-09-25 01:31:04 -04:00
Galantsev, Dmitrii ff072106c2 PY: Remove f-strings from rocm_smi.py
Change-Id: I0a422e8f66473af837460ecb2450e5be329163b0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 1683245ecf]
2023-09-22 19:15:59 -05:00
Oliveira, Daniel 1bf68ad1c9 rocm_smi_lib: Fix [linux BM] [AMDSMI] Memory Bandwidth
Implements APIs for 'gpu_metrics_v1_3' utilization averages

Code changes related to the following:
  * rsmi_dev_activity_metric_get()
  * rsmi_dev_activity_avg_mm_get()
  * CLI shows "Avg.Memory Bandwidth" under "--showmemuse"

Change-Id: I8e4600f350a7c18499abf022534db2b875f09d5f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: e0483f2ee2]
2023-09-21 11:00:29 -04:00
Galantsev, Dmitrii 3397cadf11 rocm_smi.py: Fix pipe into head error
When piping rocm_smi into 'head' it failed with "Broken pipe" error. The
error can be safely ignored. head closes the pipe early which causes
calls a SIGPIPE signal to be raised.

https://docs.python.org/3/library/signal.html#note-on-sigpipe

Change-Id: I4a589c6ed9a8c5b50de84b33e28115c6b510045f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 094c98a74f]
2023-09-15 16:37:10 -04:00
Galantsev, Dmitrii 11679b8e05 rsmiBindings.py - Add initRsmiBindings()
Library path was printed at all times even with --json flag.
This commit adds a mandatory initRsmiBindings function which is a core
component of the rsmiBindings.py library.

It **MUST** be called on import.

Change-Id: Ic6ae1ec5d1fabba288910e6aed6c4706e53e5cd7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 3b95214fff]
2023-09-15 16:37:10 -04:00
Galantsev, Dmitrii 6335bdb6e0 Add errors for existing but empty dev files
Change-Id: Iad9febc50f9b8e6085f8b605249ee884d2f134d6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: a4b470fe71]
2023-09-14 17:30:03 -04:00
Oliveira, Daniel 24edb0acab rocm_smi_lib: Fix rocm-smi --resetfans results in Permission Denied
For operations related to:
  --resetfans
  --setfan

We report 'Not supported' for these cases instead of 'Permission denied'

Code changes related to the following:
  * rocm_smi_properties
  * rocm_smi related APIs

Change-Id: I144646efc3804fabd45cc5a46351803950b4feb7
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 12f395e592]
2023-09-14 16:54:29 -04:00
Galantsev, Dmitrii cc740d7d22 PY: Silence error output when printing concise info
Change-Id: I9ce4ad523b3fe2ec8afc5bea791810ec67558f11
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 4acfb00ad5]
2023-09-12 19:16:16 -04:00
Charis Poag d975792f47 Add GPU partition nodes
* Updates:
    - Fixed infinit loop on systems
      which did not have VRAM files
    - Fixed concise info from throwing exception
      with no amdgpu driver loaded
    - Fix for ability to see all nodes when
      after switching partitions (mirrors
      original card display/settings)
    - Added to logs build type, lib path,
      and set env. variables

Change-Id: Ic0333df355144ce2242cecea93fe4ce51caf311c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: ed6777a8e7]
2023-09-07 22:17:54 -05:00
Oliveira, Daniel a044785231 rocm_smi_lib: Fix rocm-smi --showfan shows 'unable to detect fan'
Code changes related to the following:
  * Reverts earlier fix for the same issue
  * Check for existence of files before reading

Change-Id: I175b20c3343c414b12b79dc3fc404f53fbaabf3a
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 328ce0150b]
2023-08-30 14:45:05 -05:00
Bill(Shuzhou) Liu 087a642570 Numa affinity shows large number
Change the affinity from unsigned int to integer to represent -1.

Change-Id: I82dc6f476b45fa4ec03a3c686fe8e6e2b7761b56


[ROCm/rocm_smi_lib commit: 471fbfddc1]
2023-08-25 09:01:08 -04:00
Oliveira, Daniel 2b199f14ec rocm_smi_lib/rocm_smi.py: Fix Add 'GPU name' in rocm-smi output
Code changes related to the following:
  * rocm_smi.py

Change-Id: I600e776bf479f972b8d639ce5a658a24916aed3c
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 3602447109]
2023-08-24 13:22:24 -05:00
Oliveira, Daniel 3992aafc08 rocm_smi_lib/rocm_smi.py: Fix rocm-smi --resetfans shows 'permission denied'
Properly handles 'Not supported' fan cases where:
 * sysfs file (pwm#_enable) exists
 * sysfs file (pwm#_enable) does not exist

Change-Id: Ifa3c290e5ee1d27a550e94d86cd25ad8dcef3f59
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 654f65118b]
2023-08-24 10:54:50 -05:00
Oliveira, Daniel c166a66017 rocm_smi_lib/rocm_smi.py: Fix rocm-smi --showfan shows 'unable to detect fan'
Properly handles 'Unable to detect' vs 'Not supported' fan cases where:
 * sysfs file (pwm#) exists, and readings report zero (0), "Unable to detect fan speed"
 * sysfs file (pwm#) does not exist, then "Not supported"

Change-Id: If4b0312c872b76647a3e54427ba2a3f3e8e6dab1
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: f9fd6b0a96]
2023-08-23 20:43:08 -05:00
Charis Poag 47420111a8 [SWDEV-399953] Smart Temperature detection + partitioning display
* Updates:
    - Fix for devices which do not have edge sensors, but junction
    - Added partitioning (memory and dynamic) displays for
      base rocm-smi CLI calls
    - Added subheading for base rocm-smi call output
    - Added better hwmon and device detection logging

Change-Id: I8219884b2e532d6ed379527cacdc1f2b232a5451
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 755e14dbad]
2023-08-10 19:53:38 -04:00
Bill(Shuzhou) Liu f9f936e3d2 Change reset power error message to logging
Since the reset will continue if the reset power and current power
is the same, error may confuse the user.

Change-Id: I35b9ef17afd47b5af5bd2b8882a44f63991fe509


[ROCm/rocm_smi_lib commit: aeb6c61f54]
2023-07-27 15:18:28 -05:00
Bill(Shuzhou) Liu aca23ecb0b Handle csv output when the command is not based on the device
Fix the error only one csv line can be printed out when output
is not based on device.

Change-Id: Idacc5d98acc223e932fb3d46c888bfa04778b73c


[ROCm/rocm_smi_lib commit: 80d650b95a]
2023-07-26 15:28:18 -05:00
Maisam Arif 60a4b3cb19 SWDEV-394316 - Handle not applicable vbios
Change-Id: I3390078a63c9a5eff67024b84a3be1369c4b1460
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>


[ROCm/rocm_smi_lib commit: c78ec46671]
2023-07-25 16:33:22 -05:00
Oliveira, Daniel 4c063f4038 Add revision to --showhw
Code changes related to the following:
  * Added 'rsmi_dev_revision_get()' related code
  * Test code
  * Functional tests

Change-Id: I8c2097c65384a028c8c8437b717d05d52fe45250
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>


[ROCm/rocm_smi_lib commit: 573620f586]
2023-07-18 16:17:33 -05:00
Bill(Shuzhou) Liu 8c41a911b4 rocm-smi --showevents shows wrong gpuID
Use the gpuid returned from the event data instead.

Change-Id: I7f286cc105f7ea12985223e603504f0ef3d9724e


[ROCm/rocm_smi_lib commit: 0aeb6025bd]
2023-07-13 08:28:53 -05:00
Bill(Shuzhou) Liu 86a1deaca9 Expand showpids to provide more details
Provide details of GPU usage by an application.

Change-Id: I0f36df7d358754c2c8a60432b736d98f667ee99c


[ROCm/rocm_smi_lib commit: d9b6af7a09]
2023-06-16 08:52:18 -04:00
Galantsev, Dmitrii 61ac3f8ca9 --showtempgraph - Show N/A when no temp found
If temp in hwmon was missing - rocm-smi crashed.
e.g. /sys/class/drm/card1/device/hwmon/hwmon5/temp1_input

This change displays "N/A" for temp instead of crashing.

Change-Id: I02f84a466bd3acfbd9b65e7e4ca0f18e76606c3b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: 713f85721b]
2023-06-12 19:16:39 -05:00
Maisam Arif a9d4f69eea SWDEV-404157 - Fixed printLog delimiter parsing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I3d8e22d185790f4325aeacc18e4bfcfe8777d356


[ROCm/rocm_smi_lib commit: 00e170c2f5]
2023-06-08 20:02:51 -05:00
Galantsev, Dmitrii 6aef6b09ea Clean-up python errors and warnings
Used pyright to show errors and warnings and resolved most

Change-Id: I0fdf7dcdf08db5c35dec80f6645e0a395fbe4197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rocm_smi_lib commit: e8391c9d7c]
2023-06-01 17:37:57 -04:00
Charis Poag d38ba30a0f [SWDEV-392571] Fix concise info when missing VRAM info
Updates:
    * [rocm-smi] Added larger app width size, which helps
      display missing device info
    * [rocm-smi] Added better context when rsmi_ret_ok
      does not return with RSMI_STATUS_SUCCESS
    * [rocm-smi] Removed all references to an
      undefined function (printLogNoDev())
    * [rocm-smi] Fixed not detecting non-int
      values when setting the voltage curve
    * [rocm-smi] Added better context on missing
      sysfs file when setting clock overdrive
      values
    * [rocm-smi] Fixed getMemInfo() calls not
      referencing tuple values (making it easier
      to read)
    * [rocm-smi] Silenced concise info spitting
      out errors for missing VRAM files, instead
      display which metric is "unsupported" if
      the files are missing
    * [rocm-smi] Updated function descriptions for
      rsmi_ret_ok & getMemInfo
    * [rocm-smi] Updated getMemInfo to provide a
      quiet call, to silence for concise info calls.
      This provides a way to keep the output clean.
    * [rocm-smi-lib] Added when using debug sysfs
      files, to state, which enums are enabled
      for debug

Change-Id: I0e9e0c97ccf71467ced0e1a1f71803327a8be2b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 6be92b9e26]
2023-04-13 15:11:35 -04:00
Bill(Shuzhou) Liu 050e38f2a1 Validate the clock frequency when set it
Add the check of the clock frequency when set it.

Change-Id: I707291bfb5007bb69100c780af50a4b0f697bb37


[ROCm/rocm_smi_lib commit: b6789891b0]
2023-04-06 11:54:38 -04:00
Bill(Shuzhou) Liu 06028876b0 mem_use_pct uninitialized error
Initialize mem_use_pct if the memory info is not available.

Change-Id: Id8e285050149c51077356826c8f99719b473060d


[ROCm/rocm_smi_lib commit: fcb6afa289]
2023-02-27 16:47:45 -06:00
Charis Poag 5c3a3f3641 [SWDEV-335697] Add RSMI_STATUS_SETTING_UNAVAILABLE for dynamic partition
Updates:
    * Added RSMI_STATUS_SETTING_UNAVAILABLE for
      rsmi_dev_compute_partition_set - gives users
      better error output when attempting to set
      compute partition to values not listed in
      available_compute_partition SYSFS
    * Updated python --setcomputepartition to
      provide better output when receiving
      RSMI_STATUS_SETTING_UNAVAILABLE
    * Updated all test & example files to check for
      RSMI_STATUS_SETTING_UNAVAILABLE when doing
      rsmi_dev_compute_partition_set

Change-Id: Ida5d54880d9b9b6e4a0468cdb962fdc0c18d6257
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 0d3558945b]
2023-02-27 11:17:44 -06:00
Bill(Shuzhou) Liu b093ec72a1 Memory usage division by zero
The showAllConcise with division by zero error.

Change-Id: I469f1b9f268842cd51662be6f9036f555a8949b2


[ROCm/rocm_smi_lib commit: 55bc2e2072]
2023-02-24 10:12:36 -06:00
Charis Poag 2001f3620e [SWDEV-381630] Add reset partition functionality
Updates:
    * Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
    * Added --resetcomputepartition and --resetnpsmode python smi calls
    * Added temp data files rocmsmi_boot_compute_partition_<device num>
      & rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
      if data cannot be read or device does not support
    * Cleaned up NPS & compute API documentation
    * Added creation and reading of API temp files (used in reset
      functionality)
    * Cleaned up output of rocm_smi_example
    * Updated rocm_smi_example to check if running with sudo permission
      before executing write API calls (cleans up erroneous output)
    * Added template specialization for storing temp data, requires
      specific rsmi_type_t enums (restrics what data can be stored)
    * Added storage of temp data, if temp files do not exist
    * Updated google tests for NPS & compute to include reset API calls

Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 77c950a4bf]
2023-02-17 12:55:08 -06:00
Charis Poag 3ab5bb3230 SWDEV-342812- Add NPS support
Updates:
    * Added rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to set multiple SYSFS files in debug build
    * Added ability to see user's env variables set for debug build
    * Added tests for rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to restart AMD GPU driver, used in nps_mode_set
    * Updated ROCm_SMI_Manual.pdf to include new APIs
    * Added progress bar for long running python_smi_tools, used
      in setting nps_mode if runs longer than .1 seconds

Change-Id: I6d61bedd28d7cba6aff432ad2d127ba741b7d15a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/rocm_smi_lib commit: 9ef376cd61]
2023-02-14 11:54:24 -06:00
Bill(Shuzhou) Liu 79583335ee rocm-smi --showxgmierr return error instead of error counter values
The rocm-smi pass wrong arguments

Change-Id: I3a3923abdd263d4af87f3ec90670bb16afa2ef9b


[ROCm/rocm_smi_lib commit: ae10e842af]
2023-02-13 16:36:24 -05:00
Ori Messinger 58d4fd4edb ROCm SMI CLI: Fix --showproductname bug
This patch fixes a --showproductname bug, which is related to the
device's SKU. If a device with a VBIOS value that cannot be decoded
is used, that device's SKU cannot be parsed out of the VBIOS string.

Now, when the VBIOS value cannot be decoded, an error will be
printed instead of crashing with an 'UnboundLocalError' message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I680a182e94107e782235b8a2477ab165988f7703


[ROCm/rocm_smi_lib commit: 56f9d6bfc0]
2023-02-02 14:52:13 -05:00
Charis Poag 1b8d3f507a SWDEV-335697- Add support for dynamic partitioning
Original updates:
    * Added .gitignore to help with future commits
    * Updated/added copyrights on modified or added files
    * Updated rocm_smi.h/.cc
      - Added 3 new SMI API functions:
          rsmi_dev_compute_partition_set &
          rsmi_dev_compute_partition_get
      - Added helpful maps/enums used in
        new get/set compute_partition API calls
    * Updated rocm_smi.py
      - Added --showcomputepartition
      - Added --setcomputepartition
      - Fixed a few mistypes
    * Updated rsmiBindings.py - added helpful class/dict/list
    * Updated rocm_smi_example.cc
      - Added helpful MACRO to detect if api is not supported.
      - Added current_compute_partition set/get rocm lib calls
      - Added helpful macro to discover future RSMI errors
      - Commented out test_set_freq, was having permission issues
        on a Navi21
    * Updated rocm_smi_main.cc
      - Added helpful map to debug API calls, left in for future use
      - Added comment to better understand a non-class function returns
    * Added computepartition_read_write.cc/.h
      - Added get/set compute partition API test calls
      - Confirmed on devices that do not support the API calls, tests pass
    * Updated rocm_smi_test/main.cc
      - Calls new compute partition gtests

Added following updates from review feedback:
   * Updated rocm_smi.h/cc
       - Removed C++ API calls, adding support for both C/C++
         API calls could cause confusion and adds extra work for us
       - rsmi_dev_compute_partition_get -> Fixed an edge case where
         user gives a small buffer length size (smaller than data
         received), but does not receive the partial buffer back.
         google Tests are updated to reflect this find.
   * Updated rocm_smi_example.cc
       - Fixed test_set_freq, issue was that file was not writable.
         We now indicate this warning, so prior errors make sense.
       - General test code cleanup. Removed extra code,
         by creating loops for tests.
   * Updated rocm_smi_main.cc
     - Moved and got rid of an external reference to a map used
       for debugging RSMI enums, now is a const public reference.
   * Updated rocm_smi.py
     - Updated python code to identify NOT_SUPPORTED due to
       (currently) only a few GPU support the feature

Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b


[ROCm/rocm_smi_lib commit: 4d7f3f2bc7]
2023-01-13 10:46:40 -05:00