Commit Graph

143 Commits

Author SHA1 Message Date
Hao Zhou 5aa94c48d1 Revert "Merge amd-staging into amd-master 20230602"
Revert submission 869878

Reason for revert: <RPM package on RHEL9.x is broken>
Reverted Changes:
I4886ef2a6:Merge amd-staging into amd-master 20230602
I0f277acf3:Revert "Revert "Merge amd-staging into amd-master ...

Change-Id: Ie370327c8db0404c9cedde42c1376e3cec56fae0
2023-06-02 02:12:07 -04:00
Hao Zhou 8560d96c81 Merge amd-staging into amd-master 20230602
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I4886ef2a60b08e17bd9165ff0cd46b4297e15972
2023-06-02 10:11:48 +08:00
Galantsev, Dmitrii e8391c9d7c Clean-up python errors and warnings
Used pyright to show errors and warnings and resolved most

Change-Id: I0fdf7dcdf08db5c35dec80f6645e0a395fbe4197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-01 17:37:57 -04:00
Hao Zhou ecb1303732 Merge amd-staging into amd-master 20230414
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I518af7182bb6537c9c03a30d53c44d2143f3064f
2023-04-14 12:17:58 +08:00
Charis Poag 6be92b9e26 [SWDEV-392571] Fix concise info when missing VRAM info
Updates:
    * [rocm-smi] Added larger app width size, which helps
      display missing device info
    * [rocm-smi] Added better context when rsmi_ret_ok
      does not return with RSMI_STATUS_SUCCESS
    * [rocm-smi] Removed all references to an
      undefined function (printLogNoDev())
    * [rocm-smi] Fixed not detecting non-int
      values when setting the voltage curve
    * [rocm-smi] Added better context on missing
      sysfs file when setting clock overdrive
      values
    * [rocm-smi] Fixed getMemInfo() calls not
      referencing tuple values (making it easier
      to read)
    * [rocm-smi] Silenced concise info spitting
      out errors for missing VRAM files, instead
      display which metric is "unsupported" if
      the files are missing
    * [rocm-smi] Updated function descriptions for
      rsmi_ret_ok & getMemInfo
    * [rocm-smi] Updated getMemInfo to provide a
      quiet mode that silences output for concise
      info calls. This keeps the output clean.
    * [rocm-smi-lib] When using debug sysfs files,
      added a statement of which enums are enabled
      for debug

Change-Id: I0e9e0c97ccf71467ced0e1a1f71803327a8be2b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-04-13 15:11:35 -04:00
Hao Zhou b21fe43ec1 Merge amd-staging into amd-master 20230407
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Icf2ea8ab76a521f31bc551b4cf644fc833765263
2023-04-07 11:06:58 +08:00
Bill(Shuzhou) Liu b6789891b0 Validate the clock frequency when setting it
Add a check of the clock frequency when setting it.

Change-Id: I707291bfb5007bb69100c780af50a4b0f697bb37
2023-04-06 11:54:38 -04:00
Hao Zhou c189126be6 Merge amd-staging into amd-master 20230302
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I17c723b907c988754d2da5d04092c1d435e29cc1
2023-03-02 14:13:00 +08:00
Bill(Shuzhou) Liu fcb6afa289 mem_use_pct uninitialized error
Initialize mem_use_pct if the memory info is not available.

Change-Id: Id8e285050149c51077356826c8f99719b473060d
2023-02-27 16:47:45 -06:00
Charis Poag 0d3558945b [SWDEV-335697] Add RSMI_STATUS_SETTING_UNAVAILABLE for dynamic partition
Updates:
    * Added RSMI_STATUS_SETTING_UNAVAILABLE for
      rsmi_dev_compute_partition_set - gives users
      better error output when attempting to set
      compute partition to values not listed in
      available_compute_partition SYSFS
    * Updated python --setcomputepartition to
      provide better output when receiving
      RSMI_STATUS_SETTING_UNAVAILABLE
    * Updated all test & example files to check for
      RSMI_STATUS_SETTING_UNAVAILABLE when doing
      rsmi_dev_compute_partition_set

Change-Id: Ida5d54880d9b9b6e4a0468cdb962fdc0c18d6257
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-27 11:17:44 -06:00
Bill(Shuzhou) Liu 55bc2e2072 Memory usage division by zero
Fix the division-by-zero error in showAllConcise.

Change-Id: I469f1b9f268842cd51662be6f9036f555a8949b2
2023-02-24 10:12:36 -06:00
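A minimal sketch of the kind of guard the division-by-zero fix above implies, assuming the percentage is computed from a used/total pair. The function name, its parameters, and the fallback value of 0 are illustrative assumptions, not the code in rocm_smi.py.

```python
def mem_use_pct(used, total):
    """Return memory use as a percentage, guarding against a zero total.

    Hypothetical helper: the name, arguments, and the fallback of 0
    are assumptions made for illustration only.
    """
    # A missing or zero total would otherwise raise ZeroDivisionError.
    if not total:
        return 0
    return 100.0 * used / total
```

Returning a defined value when the total is unavailable also covers the uninitialized-variable case, because the result is always assigned.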
Hao Zhou d93a307f4b Merge amd-staging into amd-master 20230223
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: If82fc6ac361e6ef5a68ae9dfe333dd8685a02fb5
2023-02-23 18:13:38 +08:00
Charis Poag 77c950a4bf [SWDEV-381630] Add reset partition functionality
Updates:
    * Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
    * Added --resetcomputepartition and --resetnpsmode python smi calls
    * Added temp data files rocmsmi_boot_compute_partition_<device num>
      & rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
      if data cannot be read or device does not support
    * Cleaned up NPS & compute API documentation
    * Added creation and reading of API temp files (used in reset
      functionality)
    * Cleaned up output of rocm_smi_example
    * Updated rocm_smi_example to check if running with sudo permission
      before executing write API calls (cleans up erroneous output)
    * Added template specialization for storing temp data, requires
      specific rsmi_type_t enums (restricts what data can be stored)
    * Added storage of temp data, if temp files do not exist
    * Updated google tests for NPS & compute to include reset API calls

Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-17 12:55:08 -06:00
Hao Zhou 405dff6b64 Merge amd-staging into amd-master 20230216
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I1a439acaf0208f9bc68ac19753c1df98ea5387ee
2023-02-16 17:04:21 +08:00
Charis Poag 9ef376cd61 SWDEV-342812- Add NPS support
Updates:
    * Added rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to set multiple SYSFS files in debug build
    * Added ability to see user's env variables set for debug build
    * Added tests for rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
    * Added ability to restart AMD GPU driver, used in nps_mode_set
    * Updated ROCm_SMI_Manual.pdf to include new APIs
    * Added progress bar for long running python_smi_tools, used
      in setting nps_mode if runs longer than .1 seconds

Change-Id: I6d61bedd28d7cba6aff432ad2d127ba741b7d15a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-02-14 11:54:24 -06:00
Bill(Shuzhou) Liu ae10e842af rocm-smi --showxgmierr return error instead of error counter values
rocm-smi passes the wrong arguments

Change-Id: I3a3923abdd263d4af87f3ec90670bb16afa2ef9b
2023-02-13 16:36:24 -05:00
Hao Zhou c8aace17df Merge amd-staging into amd-master 20230203
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I0519fc50fbcb604bc548ce1883c9f57c9a3be250
2023-02-03 11:28:16 +08:00
Ori Messinger 56f9d6bfc0 ROCm SMI CLI: Fix --showproductname bug
This patch fixes a --showproductname bug, which is related to the
device's SKU. If a device with a VBIOS value that cannot be decoded
is used, that device's SKU cannot be parsed out of the VBIOS string.

Now, when the VBIOS value cannot be decoded, an error will be
printed instead of crashing with an 'UnboundLocalError' message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I680a182e94107e782235b8a2477ab165988f7703
2023-02-02 14:52:13 -05:00
Hao Zhou 0802602499 Merge amd-staging into amd-master 20230118
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I3672f4919d7636f2a9521f0364e65c0dda1c2b2b
2023-01-18 09:13:06 +08:00
Charis Poag 4d7f3f2bc7 SWDEV-335697- Add support for dynamic partitioning
Original updates:
    * Added .gitignore to help with future commits
    * Updated/added copyrights on modified or added files
    * Updated rocm_smi.h/.cc
      - Added 3 new SMI API functions:
          rsmi_dev_compute_partition_set &
          rsmi_dev_compute_partition_get
      - Added helpful maps/enums used in
        new get/set compute_partition API calls
    * Updated rocm_smi.py
      - Added --showcomputepartition
      - Added --setcomputepartition
      - Fixed a few mistypes
    * Updated rsmiBindings.py - added helpful class/dict/list
    * Updated rocm_smi_example.cc
      - Added helpful MACRO to detect if api is not supported.
      - Added current_compute_partition set/get rocm lib calls
      - Added helpful macro to discover future RSMI errors
      - Commented out test_set_freq, was having permission issues
        on a Navi21
    * Updated rocm_smi_main.cc
      - Added helpful map to debug API calls, left in for future use
      - Added comment to better understand a non-class function returns
    * Added computepartition_read_write.cc/.h
      - Added get/set compute partition API test calls
      - Confirmed on devices that do not support the API calls, tests pass
    * Updated rocm_smi_test/main.cc
      - Calls new compute partition gtests

Added following updates from review feedback:
   * Updated rocm_smi.h/cc
       - Removed C++ API calls, adding support for both C/C++
         API calls could cause confusion and adds extra work for us
       - rsmi_dev_compute_partition_get -> Fixed an edge case where
         user gives a small buffer length size (smaller than data
         received), but does not receive the partial buffer back.
         Google Tests are updated to reflect this finding.
   * Updated rocm_smi_example.cc
       - Fixed test_set_freq, issue was that file was not writable.
         We now indicate this warning, so prior errors make sense.
       - General test code cleanup. Removed extra code,
         by creating loops for tests.
   * Updated rocm_smi_main.cc
     - Moved and got rid of an external reference to a map used
       for debugging RSMI enums, now is a const public reference.
   * Updated rocm_smi.py
     - Updated python code to identify NOT_SUPPORTED because
       (currently) only a few GPUs support the feature

Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
2023-01-13 10:46:40 -05:00
Hao Zhou afa6e806e6 Merge amd-staging into amd-master 20230106
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Ie0c9c44a7fb39ecdcd56158d501c738ceb64096c
2023-01-06 13:05:17 +08:00
Ori Messinger 5c478e9eb9 ROCm SMI CLI: Fix --showproductname bugs
This patch fixes a couple of --showproductname bugs, both of which
are related to the device's SKU.
Previously if a device with a non-standard VBIOS name was used,
fetching that device's SKU wasn't working correctly.

A standard VBIOS name should follow the following pattern:
AAA-BBBBBB-CCC
Where the middle section "BBBBBB" between the hyphens is the SKU.

Now, SKU can be correctly fetched even with a non-standard VBIOS
name, and return 'unknown' if SKU does not exist.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5899a859c6131c6048bb31a4305ddacbac3075a9
2023-01-05 11:53:04 -05:00
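The SKU rule described in the commit above (the middle section of AAA-BBBBBB-CCC) can be sketched as follows. This is only an illustration under stated assumptions: the function name and the exact fallback string are hypothetical, and the real rocm_smi.py logic may differ.

```python
def sku_from_vbios(vbios_name):
    """Return the middle section of an AAA-BBBBBB-CCC style name.

    Hypothetical helper: the name and the 'unknown' fallback string
    are assumptions made for illustration only.
    """
    parts = vbios_name.strip().split("-")
    # A standard name has at least three hyphen-separated sections,
    # and the SKU is the second one.
    if len(parts) >= 3 and parts[1]:
        return parts[1]
    # Non-standard names have no recoverable SKU.
    return "unknown"
```

For the pattern given in the commit message, `sku_from_vbios("AAA-BBBBBB-CCC")` yields `"BBBBBB"`, while a name with no hyphens falls through to the fallback.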
Hao Zhou cd31d17736 Merge amd-staging into amd-master 20221221
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I34a54bbf6a3b434f92fb5ca5abf699c61ab4a8a2
2022-12-21 09:03:03 +08:00
Ori Messinger 932feb6e49 ROCm SMI CLI: Add --showtempgraph Feature
The purpose of this patch is to add a new feature to the smi cli.
Use ./rocm-smi --showtempgraph to print a persistent bar graph for
each GPU's temperature.

The bar graphs refresh continuously to show current temps, and the
graphs change in a color gradient depending on the temperature.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I98902b76c42cc7281420759f5ebe8c78f7785e66
2022-12-15 18:20:32 -05:00
Hao Zhou 34f4b63853 Merge amd-staging into amd-master 20221212
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I555da02f58185a9eca00954755c2bfd8e418e153
2022-12-12 16:25:01 +08:00
kent.russell@amd.com 248c6f79f4 rocm_smi.py: Fix order of CE and UE reporting
We append CE then UE, but in the table right after, it goes UE then CE.
Fix the order of the table, and add capitals for consistency

Change-Id: I208f37685508ab1e2ff83d3456620bbbf3a16268
2022-12-08 12:28:37 -05:00
Hao Zhou 0f02a3a272 Merge amd-staging into amd-master 20220909
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Ic8bbdad24b0671f6b77543daa9656f5c3662c2c8
2022-09-09 09:26:33 +08:00
Alex Sierra 03fab6b2b6 Consider invalid peer link type during topology report
Invalid peer links are labeled as N/A during topology report creation.
This invalid link type could be triggered by having a configuration
with CPU XGMI iolinks and disabling XGMI peer-to-peer access. This can
be done by setting the driver parameter 'use_xgmi_p2p = 0'.

Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: Ifb09a8f3266a3f07686615dfb45781d6cfe55e83
2022-09-06 13:47:32 -05:00
Hao Zhou 1efd6ee29c Merge amd-staging into amd-master 20220901
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Ic59465c248f96a74d20226810c9ae98360797e34
2022-09-01 09:54:40 +08:00
Ori Messinger dfd88b593f ROCm SMI CLI: Modify Column Header
The purpose of this patch is to modify the column header of the default
'./rocm-smi' command from 'Temp' to 'Temp (DieEdge)' for clarity.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I127a9214be97a1185c3db010f1c9176d1f412ec9
2022-08-31 09:47:14 -04:00
Hao Zhou a5e286d250 Merge amd-staging into amd-master 20220826
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Ibef408e62669ec105571e605f333642cebc33112
2022-08-26 13:45:55 +08:00
Elena Sakhnovitch 8b2bc318eb [rocm_smi.py] bugfix for non-alphanum parse issue
--showdeviceid
Fix for false-positive "FRU is corrupted" messages,
since str(sn).isalphanum() triggers on empty struct.

--showproductname
fix script termination on non-alphanum product name

Change-Id: I78d4998e156f9b0d9f45338bed2a0d30b789e220
2022-08-23 19:28:19 -04:00
Hao Zhou 350b77a1fc Merge amd-staging into amd-master 20220722
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I1575353fb596e1fa001b888ff8c3a4555375efee
2022-07-22 11:51:56 +08:00
Divya Shikre 8144dd4d8e Add perf determinism to perf_level_string
This fixes the 'unknown' value being displayed
for Perf Level because of a missing mapping of
RSMI_DEV_PERF_LEVEL_DETERMINISM to its string
value.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I479c2baea450f0ff61640ad81cbd4d08ad56ff8e
2022-07-21 08:55:38 -04:00
Ori Messinger cbb068ccac ROCm SMI CLI: Force RETCODE to 0 by Default
The purpose of this patch is to set RETCODE equal to 0 by default
unless an appropriate '--loglevel LEVEL' has been set.

To allow a non-zero RETCODE value, you must use any loglevel that
is not 'warning' or 'None' (default).

You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9484a750206a3f464c59952304e72c59c3d12465
2022-07-18 18:33:29 -04:00
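The return-code rule described in the commit above can be sketched roughly as below. The function name and signature are hypothetical, and the list of levels that suppress the code is taken only from the commit message.

```python
def effective_retcode(retcode, loglevel=None):
    """Return the exit code the CLI would report under the stated rule.

    Hypothetical helper: per the commit message, a non-zero RETCODE is
    only allowed when the log level is neither 'warning' nor None.
    """
    # The default (None) and 'warning' both force a zero exit code.
    if loglevel is None or str(loglevel).lower() in ("warning", "none"):
        return 0
    return retcode
```

Under this sketch a failing call exits with 0 by default, and a script that needs the real code would pass a level such as `--loglevel error`.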
Hao Zhou 46e21f2509 Merge amd-staging into amd-master 20220708
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I8c0061b099bb140ecc6c3c4b491165da44c6b96a
2022-07-08 08:56:45 +08:00
Elena Sakhnovitch 5d5ba738db rocm_smi.py: improve error output
Match alignment of error output with general output

Signed-off-by: Elena Sakhnovitch

Change-Id: Id4334152f4ad5665ff37d5d47e6f7ca0107a9428
2022-06-24 12:19:43 -04:00
Hao Zhou 4752e3184a Merge amd-staging into amd-master 20220624
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Ieb7c5bc9c3480dabb8534a0be5839f00f60e100b
2022-06-24 11:53:19 +08:00
Sreekant Somasekharan 1432e5e040 Add rsmi lib function to get memory overdrive value
Change-Id: I515b51d5ce4baf966bb31714886a0d72330026bc
2022-06-23 11:42:50 -04:00
Elena Sakhnovitch 0f88f59ddd [rocm_smi.py] Hiding unnecessary N/A lines
Hiding not applicable/unsupported sensors under INFO

Signed-off-by: Elena Sakhnovitch
Change-Id: I89c80ca7c6365ef3a2dd751a575ddf90044c8a2e
2022-06-23 11:02:13 -04:00
Hao Zhou 0635134df4 Merge amd-staging into amd-master 20220617
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I1bfa6a012b2bfb7f744e018f129539a495c2c5db
2022-06-17 11:08:54 +08:00
Kent Russell 6b6e840337 rocm_smi.py: Handle corrupted serial number
If the FRU has been corrupted, then the serial number will come in with
any manner of random bytes, which will cause decode() to fail
spectacularly. Check that the serial returned by the kernel is
alphanumeric, and print to the error log if not (then continue to the
next device).

Change-Id: If4f35b140b6089e02729b1490ed6b48d614a122a
2022-06-16 17:29:08 -04:00
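A minimal sketch of the check described in the commit above, assuming the serial number arrives as raw bytes. The function name, the UTF-8 encoding, and the trailing-NUL stripping are assumptions for illustration and are not taken from rocm_smi.py.

```python
def decode_serial(raw):
    """Return the serial as text if it is valid, otherwise None.

    Hypothetical helper: the encoding and NUL handling are assumptions.
    """
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Random bytes from a corrupted FRU cannot be decoded.
        return None
    # Drop padding NULs that a fixed-size buffer might carry.
    text = text.rstrip("\x00")
    # str.isalnum() is False for an empty string, so an empty serial
    # is also reported as invalid here.
    return text if text.isalnum() else None
```

A caller would log an error and move on to the next device when this returns `None`. Because an empty string is not alphanumeric, a caller that wants to treat "no serial" differently from "corrupted serial" would need to test for emptiness first.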
Elena Sakhnovitch 4dd2398f3d [rocm_smi.py] error feedback improvement
Clean up the overly verbose error reporting system.

Signed-off-by: Elena Sakhnovitch
Signed-off-by: Sreekant Somasekharan
Change-Id: Icc96086810b8dcfc426848b8c349a2572026c3bd
2022-06-16 14:32:13 -04:00
Ori Messinger 2b8d0ad70f ROCm SMI CLI: Fix setClockRange Error
This patch changes the error handling for setClockRange.

When a device does not support modifying a clock type (sclk/mclk),
an error message is printed through the python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I37d9ea4189b1ca81e5deaab5efa6cfa4901b89b3
2022-06-15 15:47:51 -04:00
Hao Zhou 90571416c1 Merge amd-staging into amd-master 20220610
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Iaefd0a9180925d91d2e3ec03be84e5b04cf262b6
2022-06-10 09:08:16 +08:00
Divya Shikre dcab886394 Print log when PIDs don't use any GPU device.
showpidgpus prints 'none' when no GPU devices are
being used by the running process. Adding a fix
to print a relevant message.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I165a6644a76c8e1c3c3cad676dcfd41eb1c4724f
2022-05-31 16:17:42 -04:00
Hao Zhou 4da4de6dbe Merge amd-staging into amd-master 20220526
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Id96c706f0b9ecee20a9ded0fb1ee220f53219067
2022-05-26 09:23:37 +08:00
Elena Sakhnovitch 44ea49eb01 [rocm_smi.py]: shownodesbw fix for non xgmi
Improve error output for non-xgmi nodes bandwidth

Signed-off-by: Elena Sakhnovitch
Change-Id: I833970d3200a75c7639d33bf19e0e83afe176c8d
2022-05-24 16:45:32 -04:00
Ori Messinger 786f66671a ROCm SMI CLI: Fix --showvoltagerange bug
This patch fixes a --showvoltagerange bug, which attempts to check
the voltage curve on a device that does not have any voltage
regions in its OverDrive voltage frequency data (odvf).

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I647c30c978ffb13f6819ac3d069ee340710a7f99
2022-05-21 05:02:15 -04:00
Ori Messinger 4298cbb400 ROCm SMI CLI: Fix setPowerOverdrive resetPowerOverdrive Bugs
Fixes bug in the 'setPowerOverdrive' function which mishandles
GPUs with secondary dies. Secondary dies have a default power cap
of 0W and cannot be changed, so they are now skipped.

Fixes bug in the 'resetPowerOverdrive' function which incorrectly
resets the wattage to the current value.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I483fa3f58b1fa44a3bf7bae3b52c59ce523ae152
2022-05-21 05:01:32 -04:00