Commit Graph

93 Commits

Author SHA1 Message Date
kent.russell@amd.com 3fa81a26e4 rocm_smi.py: Fix order of CE and UE reporting
We append CE then UE, but in the table right after, it goes UE then CE.
Fix the order of the table, and add capitals for consistency

Change-Id: I208f37685508ab1e2ff83d3456620bbbf3a16268


[ROCm/amdsmi commit: 248c6f79f4]
2022-12-08 12:28:37 -05:00
Alex Sierra ca07577907 Consider invalid peer link type during topology report
Invalid peer links are labeled as N/A during topology report creation.
This invalid link type could be triggered by having a configuration
with CPU XGMI iolinks and disable XGMI peer to peer access. This can
be done by setting the driver parameter 'use_xgmi_p2p = 0'.

Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: Ifb09a8f3266a3f07686615dfb45781d6cfe55e83


[ROCm/amdsmi commit: 03fab6b2b6]
2022-09-06 13:47:32 -05:00
Ori Messinger e0c6a44916 ROCm SMI CLI: Modify Column Header
The purpose of this patch is to modify the column header of the default
'./rocm-smi' command from 'Temp' to 'Temp (DieEdge)' for clarity.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I127a9214be97a1185c3db010f1c9176d1f412ec9


[ROCm/amdsmi commit: dfd88b593f]
2022-08-31 09:47:14 -04:00
Elena Sakhnovitch 6c8a8c5ae6 [rocm_smi.py] bugfix for non-alphanum parse issue
--showdeviceid
Fix for false-positive "FRU is corrupted" messages,
since str(sn).isalphanum() triggers on empty struct.

--showproductname
fix script termination on non-alphanum product name

Change-Id: I78d4998e156f9b0d9f45338bed2a0d30b789e220


[ROCm/amdsmi commit: 8b2bc318eb]
2022-08-23 19:28:19 -04:00
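The false positive described in commit 6c8a8c5ae6 can be illustrated with a small sketch. It assumes the check in question is Python's built-in str.isalnum() (the message spells it isalphanum()), which returns False for an empty string. The helper name classify_serial and its return values are hypothetical and are not taken from rocm_smi.py.

```python
def classify_serial(sn: str) -> str:
    """Hypothetical helper: classify a serial-number string read from the FRU."""
    if sn == "":
        # str.isalnum() returns False for an empty string, so an empty
        # value has to be handled before the alphanumeric check.
        return "empty"
    if not sn.isalnum():
        return "corrupted"
    return "ok"


print("".isalnum())                 # False
print(classify_serial(""))          # empty
print(classify_serial("ABC123"))    # ok
print(classify_serial("AB\x00?"))   # corrupted
```

Handling the empty case first keeps a missing serial number from being reported as a corrupted FRU.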
Divya Shikre 4d175f7726 Add perf determinism to perf_level_string
This fixes the 'unknown' value being displayed
for Perf Level because of a missing mapping of
RSMI_DEV_PERF_LEVEL_DETERMINISM to its string
value.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I479c2baea450f0ff61640ad81cbd4d08ad56ff8e


[ROCm/amdsmi commit: 8144dd4d8e]
2022-07-21 08:55:38 -04:00
Ori Messinger b7f6850450 ROCm SMI CLI: Force RETCODE to 0 by Default
The purpose of this patch is to set RETCODE equal to 0 by default
unless an appropriate '--loglevel LEVEL' has been set.

To allow a non-zero RETCODE value, you must use any loglevel that
is not 'warning' or 'None' (default).

You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9484a750206a3f464c59952304e72c59c3d12465


[ROCm/amdsmi commit: cbb068ccac]
2022-07-18 18:33:29 -04:00
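The rule stated in commit b7f6850450 can be sketched as a small function. This is only an illustration of the described behaviour; the name effective_retcode and the way the log level is passed in are assumptions, not the actual rocm_smi.py code.

```python
from typing import Optional

# Log levels that keep the exit code at 0, per the commit message:
# 'warning' and None (the default, when no --loglevel is given).
QUIET_LEVELS = {None, "warning"}


def effective_retcode(retcode: int, loglevel: Optional[str]) -> int:
    """Hypothetical helper: return the exit code the CLI would report."""
    if loglevel in QUIET_LEVELS:
        return 0
    return retcode


print(effective_retcode(2, None))       # 0
print(effective_retcode(2, "warning"))  # 0
print(effective_retcode(2, "error"))    # 2
```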
Elena Sakhnovitch 63c35faea6 rocm_smi.py: improve error output
Match alignment of error output with general output

Signed-off-by: Elena Sakhnovitch

Change-Id: Id4334152f4ad5665ff37d5d47e6f7ca0107a9428


[ROCm/amdsmi commit: 5d5ba738db]
2022-06-24 12:19:43 -04:00
Sreekant Somasekharan b405977e0e Add rsmi lib function to get memory overdrive value
Change-Id: I515b51d5ce4baf966bb31714886a0d72330026bc


[ROCm/amdsmi commit: 1432e5e040]
2022-06-23 11:42:50 -04:00
Elena Sakhnovitch d0c3b5c1e9 [rocm_smi.py] Hiding unnecessary N/A lines
Hiding not applicable/unsupported sensors under INFO

Signed-off-by: Elena Sakhnovitch
Change-Id: I89c80ca7c6365ef3a2dd751a575ddf90044c8a2e


[ROCm/amdsmi commit: 0f88f59ddd]
2022-06-23 11:02:13 -04:00
Kent Russell 8a9c88c35e rocm_smi.py: Handle corrupted serial number
If the FRU has been corrupted, then the serial number will come in with
any manner of random bytes, which will cause decode() to fail
spectacularly. Check that the serial returned by the kernel is
alphanumeric, and print to the error log if not (then continue to the
next device).

Change-Id: If4f35b140b6089e02729b1490ed6b48d614a122a


[ROCm/amdsmi commit: 6b6e840337]
2022-06-16 17:29:08 -04:00
Elena Sakhnovitch 6e9f35e1c6 [rocm_smi.py] error feedback improvement
Clean up the overly verbose error reporting system.

Signed-off-by: Elena Sakhnovitch
Signed-off-by: Sreekant Somasekharan
Change-Id: Icc96086810b8dcfc426848b8c349a2572026c3bd


[ROCm/amdsmi commit: 4dd2398f3d]
2022-06-16 14:32:13 -04:00
Ori Messinger 99b2e41906 ROCm SMI CLI: Fix setClockRange Error
This patch changes the error handling for setClockRange.

When a device does not support modifying a clock type (sclk/mclk),
an error message is printed through the python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I37d9ea4189b1ca81e5deaab5efa6cfa4901b89b3


[ROCm/amdsmi commit: 2b8d0ad70f]
2022-06-15 15:47:51 -04:00
Divya Shikre fdeb60d881 Print log when PIDs don't use any GPU device.
showpidgpus prints 'none' when no GPU devices are
being used by the running process. Adding a fix
to print a relevant message.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I165a6644a76c8e1c3c3cad676dcfd41eb1c4724f


[ROCm/amdsmi commit: dcab886394]
2022-05-31 16:17:42 -04:00
Elena Sakhnovitch ccf3ac2b15 [rocm_smi.py]: shownodesbw fix for non xgmi
Improve error output for non-xgmi nodes bandwidth

signed-off-by: Elena Sakhnovitch
Change-Id: I833970d3200a75c7639d33bf19e0e83afe176c8d


[ROCm/amdsmi commit: 44ea49eb01]
2022-05-24 16:45:32 -04:00
Ori Messinger 23b3bcc038 ROCm SMI CLI: Fix --showvoltagerange bug
This patch fixes a --showvoltagerange bug, which attempts to check
the voltage curve on a device that does not have any voltage
regions in its OverDrive voltage frequency data (odvf).

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I647c30c978ffb13f6819ac3d069ee340710a7f99


[ROCm/amdsmi commit: 786f66671a]
2022-05-21 05:02:15 -04:00
Ori Messinger cf61df76ad ROCm SMI CLI: Fix setPowerOverdrive resetPowerOverdrive Bugs
Fixes bug in the 'setPowerOverdrive' function which mishandles
GPUs with secondary dies. Secondary dies have a default power cap
of 0W and cannot be changed, so they are now skipped.

Fixes bug in the 'resetPowerOverdrive' function which incorrectly
resets the wattage to the current value.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I483fa3f58b1fa44a3bf7bae3b52c59ce523ae152


[ROCm/amdsmi commit: 4298cbb400]
2022-05-21 05:01:32 -04:00
Divya Shikre f4e33b90c9 Update get_frequencies to handle failures.
Show an optional debug log (RSMI_DEBUG_BITFIELD=2) to
the user in the following scenarios:
1. If more than one current frequency is found
2. If frequencies are not read in increasing order of
   their value
If the current frequency is not available, its index is
set to -1 and no value will have a * next to it in the
output. This will also be handled in rocm_smi.py.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I477ec065f7513c8045d6392f12ef6cb835a6b8f6


[ROCm/amdsmi commit: afe996c2ed]
2022-05-11 15:33:15 -04:00
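The checks listed in commit f4e33b90c9 can be sketched in Python. The real change is in the library's get_frequencies; this version is only an illustration. It assumes each input line looks like "<index>: <value>Mhz" with an optional trailing "*" marking the current frequency. The function name, the returned tuple and the note strings are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed line format: "<index>: <value>Mhz", optionally followed by "*".
LINE_RE = re.compile(r"\s*\d+:\s*(\d+)\s*mhz\s*(\*)?", re.IGNORECASE)


def parse_frequencies(lines: List[str]) -> Tuple[List[int], int, List[str]]:
    """Return (frequencies in MHz, index of the current one or -1, debug notes)."""
    freqs: List[int] = []
    current = -1  # stays -1 when no line is marked as current
    notes: List[str] = []
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        value = int(match.group(1))
        if freqs and value < freqs[-1]:
            notes.append("frequencies are not in increasing order")
        if match.group(2):
            if current != -1:
                notes.append("more than one current frequency found")
            # This sketch keeps the last marked entry; the library may differ.
            current = len(freqs)
        freqs.append(value)
    return freqs, current, notes


print(parse_frequencies(["0: 500Mhz", "1: 800Mhz *", "2: 1000Mhz"]))
print(parse_frequencies(["0: 500Mhz", "1: 800Mhz"]))
```

A caller that receives -1 as the index simply prints the list without marking any entry.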
Elena Sakhnovitch 65841a8fd0 Revert "rocm_smi.py: Don't try to print absent clock files"
This reverts commit 4de1e4094a.
The DRM device ID does not always match the GPU ID in rocm_smi.py. This leads to cases where the wrong device is checked by os.path.isfile().

Change-Id: Ib6f2b9be123b7eb64334d3feec57f63d7eb37d6f


[ROCm/amdsmi commit: be66d67ef2]
2022-05-03 16:42:42 -04:00
Elena Sakhnovitch 67d69e127e [rocm_smi.py] Hide unsupported clocks under debug
Signed-off-by: Elena Sakhnovitch <elena.sakhnovitch@amd.com>
Change-Id: I1f2c7b93d9a81f2735c76e8d441f9e298288f5c0


[ROCm/amdsmi commit: 9d7fd34d2b]
2022-05-03 16:38:22 -04:00
Bill(Shuzhou) Liu 9bf38c36a3 Sanity check amdgpu module is loaded in rocm_smi.py
Instead of checking /proc/modules for amdgpu, the code will check
/sys/module/amdgpu/initstate which covers the case when the driver
is compiled into the kernel.

Change-Id: Id39ec5b0eb9b68204bc9f5f779057ba8cc090bdc


[ROCm/amdsmi commit: 9f6614e83b]
2022-04-14 11:28:38 -04:00
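A check along the lines of commit 9bf38c36a3 might look like the sketch below. The function name is hypothetical, and the assumption that an initialized module reports the string "live" in its initstate file is not stated in the commit message.

```python
from pathlib import Path


def amdgpu_initialized(initstate_path: str = "/sys/module/amdgpu/initstate") -> bool:
    """Hypothetical helper: True if the initstate file exists and reads 'live'."""
    try:
        # Assumption: an initialized module reports "live" in this file.
        return Path(initstate_path).read_text().strip() == "live"
    except OSError:
        # File missing or unreadable: treat the driver as not initialized.
        return False


print(amdgpu_initialized())
```

Taking the path as a parameter keeps the helper testable on machines without the amdgpu driver.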
Ori Messinger a21208fc4e ROCm SMI CLI: Fix formatCsv Bug
Fixes a bug in the 'formatCsv' function which mishandles json
data conversion for 'system' data types.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I705060409bf5ae75b994ffda270843065ca12321


[ROCm/amdsmi commit: e800cbf161]
2022-04-07 19:33:46 -04:00
Kent Russell da9b4c606e README: Remove restrictive licensing language
Also update copyright years

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: Ic9ead543c4937680afc1957623c4d5fcbfbd58b0


[ROCm/amdsmi commit: 85571318e2]
2022-03-16 13:52:25 -04:00
Elena Sakhnovitch 26ef2abe05 [rocm_smi.py] resetPowerOverdrive fix
resetPowerOverdrive: improve output messages.

Signed-off-by: Elena Sakhnovitch
Change-Id: Ic5b9084f0637458c36e460231f2d3622b0a23aa6


[ROCm/amdsmi commit: a3317714cb]
2022-03-04 11:26:45 -05:00
Ranjith Ramakrishnan 2a0ecb1e56 File reorganization with backward compatibility
Wrapper header files
Soft link to libraries and binaries
rocm_smi.py and rsmiBindings.py installed in libexec/rocm_smi
Binaries, libraries and header files installed as per File Reorg folder structure

Change-Id: I3166ab67f89c2ae4aafbc87bb00c9a5233221ade


[ROCm/amdsmi commit: f1da5591b5]
2022-03-03 18:48:52 -05:00
Elena Sakhnovitch 99a9fbfea8 [rocm_smi.py]: fix input error type for --setclock
signed-off-by: Elena Sakhnovitch
Change-Id: I9626978780f360c591fb8908f5b759f2289dff0b


[ROCm/amdsmi commit: 9b871fcd9f]
2022-02-22 14:24:38 -05:00
Ori Messinger e9afb27da3 ROCm SMI CLI: Hide Failed Command Warning
The purpose of this patch is to hide 'One or more commands failed.'
from showing up, unless an appropriate log level has been set.

You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ifa309cd62596491a6ea5892e0752251f037fc0e9


[ROCm/amdsmi commit: 007f326c34]
2022-02-09 11:52:33 -05:00
Sreekant Somasekharan 304636c27d Print ASD firmware version in hex instead of decimal format
Change-Id: Idf113f63b79f2d2903ae795d272d232a43680516


[ROCm/amdsmi commit: cf2f0b0508]
2022-01-18 10:44:20 -05:00
Elena Sakhnovitch 48a2251ff6 [rocm_smi.py] remove \r symbol at print
Remove carriage return at the end of the line in printLog function.
On Linux, the end of a line is encoded with \n, not \n\r.

Change-Id: If3835d773033b53a7f25b4a0284df359a6f9555d


[ROCm/amdsmi commit: 1aeb27c4c9]
2021-12-08 10:13:56 -05:00
Divya Shikre 58b5a538a7 Add fix to display correct GPU Memory Activity and GFX Activity value.
The driver fills in 0xFF for all the metrics not supported on that ASIC.
So if 0xFF is detected, return RSMI_STATUS_NOT_SUPPORTED.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I86a38148c7a288ea0db94893f685560eaac098ab


[ROCm/amdsmi commit: 7b1daaef96]
2021-11-25 14:28:06 -05:00
Ori Messinger 7e248102eb ROCm SMI CLI: Fix printErrLog Arguments
This patch removes every erroneous occurrence of a third argument
when calling printErrLog(device, err), since it takes two arguments.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5971cc68b69c86f37c69f44e4785dabfc82c7955


[ROCm/amdsmi commit: 40eed25a3b]
2021-11-08 12:54:00 -05:00
Elena Sakhnovitch 398df0b9d0 [ROCm-SMI] add --showNodesBw
Display min and max bandwidth between gpu nodes

Signed-off-by: Elena Sakhnovitch
Change-Id: I7289fb83f80e2f899996b7d7560ece670cc5f31f


[ROCm/amdsmi commit: 13cde8429d]
2021-10-29 12:49:35 -04:00
Elena Sakhnovitch ff2bcc16fa [rocm_smi.py] remove repetitive footnote
Printing "Primary die (usually one above or below the secondary) shows
total (primary + secondary) socket power information" footnote only one time, not
for every secondary die.

Signed-off-by: Elena Sakhnovitch
Change-Id: Iae9c5c94945ec38ecdb128a576a4eacafc30a044


[ROCm/amdsmi commit: 15e4fe80e1]
2021-10-29 08:32:06 -04:00
Ori Messinger de16bc4552 ROCm SMI CLI: Add --showtopoaccess Functionality
The purpose of this patch is to implement --showtopoaccess
functionality in the CLI, which shows True or False if P2P is
possible between two given GPUs.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I07d70d80ae7b484136b31d5d22780c4990029391


[ROCm/amdsmi commit: e2d9a37e5f]
2021-10-14 11:06:05 -04:00
Elena Sakhnovitch 8b42fe51b5 [rocm_smi.py]: fix fan 255% error
signed-off-by: Elena Sakhnovitch
Change-Id: I265ba32bc3777db5f04f1924547fe432ba78c3d0


[ROCm/amdsmi commit: 2f84906cc2]
2021-09-29 21:11:06 -04:00
Elena Sakhnovitch cda3383b3b [rocm_smi.py]: pep8 formatting
signed-off-by: Elena Sakhnovitch
Change-Id: If12b3371cd6acac16d9f6b3adf5f5cc8df28992f


[ROCm/amdsmi commit: 80140c3b02]
2021-08-26 10:23:58 -04:00
Elena Sakhnovitch 8e8586591a [rocm_smi.py] --showpower error bugfix
Fix error message in -P for secondary die

Signed-off-by: Elena Sakhnovitch
Change-Id: Ica3c0a83b565d2231fad23389b9378056a0f56b3


[ROCm/amdsmi commit: 2db7e2a312]
2021-07-30 00:08:14 -04:00
Elena Sakhnovitch fc4aa3d271 [rocm_smi.py] add secondary die check.
Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: I46618002c1967ec115db88becbaba9e7c0a08af1


[ROCm/amdsmi commit: b59e752122]
2021-07-29 17:46:12 -04:00
Harish Kasiviswanathan 419b720ea5 rocm_smi.py: Remove extraneous line during process termination
At the tail end, when the process is terminating, the subprocess module fails
to find the process. This results in extraneous printing of a line with
char 'b'. Fix this.

BUG: SWDEV-296409

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I39aacf8ae948a5acec0aa93296cc0e0aec88b3ef


[ROCm/amdsmi commit: a03acf2c07]
2021-07-27 16:26:49 -04:00
Ori Messinger 546e11c058 ROCm SMI Python CLI: Fix printLog Collisions
Python's default 'print' implementation is not thread safe, causing
empty lines to be printed during multithreaded code execution.

This fixes the --showevents output for multi-GPU systems.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I72f7341cdf4401f1fed4cd8f7d7a4a90bf9a3a4c


[ROCm/amdsmi commit: 95348f37cc]
2021-07-21 23:58:07 -04:00
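One common way to avoid the interleaved output described in commit 546e11c058 is to serialize calls to print with a lock. The sketch below shows that general technique; it is not necessarily how the patch itself changed printLog, and the names print_log and worker are hypothetical.

```python
import threading

# A single lock shared by all threads that write log lines.
_print_lock = threading.Lock()


def print_log(message: str) -> None:
    """Print one complete line while holding the lock."""
    with _print_lock:
        print(message, flush=True)


def worker(device: int) -> None:
    # Stand-in for a per-GPU event listener thread.
    for i in range(3):
        print_log(f"GPU[{device}]: event {i}")


threads = [threading.Thread(target=worker, args=(d,)) for d in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the lock held for the whole call, each line and its newline are written as one unit, so no empty lines appear between messages from different threads.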
Ori Messinger 0cdc8fb26c ROCm SMI Python CLI: Add Zero Padding to Device Model
Use zero padding for the hexadecimal value 'device_model' inside
showProductName with a padding length of 4.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I962b94d414c6ba050d951486ad9e7559123f8850


[ROCm/amdsmi commit: 03ae187a35]
2021-07-17 04:29:52 -04:00
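The padding described in commit 0cdc8fb26c corresponds to a standard Python format specification. In this sketch the function name and the "0x" prefix are assumptions; only the zero padding to four hexadecimal digits comes from the commit message.

```python
def format_device_model(device_model: int) -> str:
    """Hypothetical helper: render a device model as 4-digit zero-padded hex."""
    # "04x" means: lowercase hex, padded with zeros to a width of 4.
    return "0x{:04x}".format(device_model)


print(format_device_model(0x66))    # 0x0066
print(format_device_model(0x738c))  # 0x738c
```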
Divya Shikre d356da056d Add fix to show usage of setperfdeterminism functionality in --help command
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ife93c887eea2a9aae69f2923dba45c7cde4838d3


[ROCm/amdsmi commit: 686e6ac654]
2021-05-12 17:29:37 -04:00
Kent Russell 23635d1f90 rocm_smi.py: Fix gpu reset error
Since device is a list, we need to pass a single item to the isAmdGpu
function.

Fixes: ffbe481241 "rocm_smi.py: Don't try to reset non-AMD GPUs"

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I19a74377636ff4589f11d092f41e1d35c1acb307


[ROCm/amdsmi commit: 242d94a668]
2021-04-28 07:44:55 -04:00
Kent Russell 4de1e4094a rocm_smi.py: Don't try to print absent clock files
Instead of throwing "Unsupported clock" errors for ASICs that don't
support a certain clock type (e.g. dcefclk on MI-series), just dump the
warning to logging.debug and don't try to read the clock

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: If3cb9a472b03aa535a76fc24bcd9f77122090634


[ROCm/amdsmi commit: b931380f02]
2021-04-23 10:19:04 -04:00
Ori Messinger 8a1ca3d26c rocm_smi.py: Show 'Out of Spec' warning only if required
Use default power cap exposed via sysfs to determine when to
show 'Out of Spec' warning.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I0fa3612b50e230856b0d5a390f876b35268d9587


[ROCm/amdsmi commit: b71e07b3fb]
2021-04-22 14:44:05 -04:00
Ori Messinger f225c95878 ROCm SMI Python CLI: Add showevent Functionality
Implement showevent functionality in the ROCm SMI Python CLI.

It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62


[ROCm/amdsmi commit: a9e7e5a475]
2021-04-22 10:21:07 -04:00
Elena 3eb9426800 [rocm_smi.py] add energy counter
--showenergycounter

Signed-off-by: Elena Sakhnovitch
Change-Id: Iede0f2b06523f7cb2719489a883e9c49722f8d93


[ROCm/amdsmi commit: c80fc54500]
2021-04-21 18:40:19 -04:00
Elena 23d7d4a5ff [rocm_smi.py] Coarse Grain Utilization Counters
--showuse
--showmemuse

====================================
========= % time GPU is busy =======
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 0
====================================

Change-Id: I9db115ad78b394469206b22d195781a430b2f1d8


[ROCm/amdsmi commit: 771b4af95c]
2021-04-21 17:23:21 -04:00
Harish Kasiviswanathan 608afb879b Suppress warning message in getFanSpeed function
Many data center cards are fanless. Don't show warning if unable to get
fan speed. The fan speed will be reported as 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df


[ROCm/amdsmi commit: 1c9e384c8f]
2021-04-21 15:29:44 -04:00
Divya Shikre 38cee239c7 Update setrange functionality in CLI
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic942bd76297c50caf189bfc0972d30dc42d91f32


[ROCm/amdsmi commit: 56c132873b]
2021-04-20 15:39:05 -04:00
Divya Shikre 86e595089b Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031


[ROCm/amdsmi commit: dc431506f5]
2021-04-20 13:12:27 -04:00