Commit graph

279 Commits

Author SHA1 Message Date
Elena Sakhnovitch 9a7f79905d rocm_smi_lib: add gpu_metrics_v1_3 support
Signed-off-by: Elena Sakhnovitch
Change-Id: I4a9dedc80b8fce60e12c5baf8651d54d16a6a41c


[ROCm/amdsmi commit: fee82af1fe]
2021-08-13 09:23:35 -04:00
Harish Kasiviswanathan 0fa22cf381 Fall back to pci-ids if FRU product_name is empty
rocm-smi --showproductname does not show "Card series" in its output if
the product_name exported by the kernel is an empty string. This has been
raised as a regression by a customer.

BUG: SWDEV-297228

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I9aae24778e2d3a30aa661d8f338278c1666590fb


[ROCm/amdsmi commit: 7a8c3f3629]
2021-08-04 10:53:55 -04:00
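
The fallback described in 0fa22cf381 can be sketched roughly as follows. This is a minimal illustration, not the library's code: `card_series` and both of its parameters are hypothetical names, and the lookup of the pci.ids name is assumed to have happened elsewhere.

```python
def card_series(fru_product_name, pci_ids_name):
    """Prefer the FRU product name; fall back to the pci.ids name.

    Both arguments are hypothetical inputs: the string the kernel exports
    as product_name, and the name looked up from the pci.ids database.
    """
    name = (fru_product_name or "").strip()
    # An empty (or whitespace-only) FRU string counts as "not provided".
    return name if name else pci_ids_name


print(card_series("", "Example pci.ids name"))
print(card_series("Example FRU name", "Example pci.ids name"))
```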
Bill(Shuzhou) Liu e72af58262 support rocm_smi_lib version in the header file
Package the rocm_smi64Config.h into deb/rpm.

Change-Id: Ic4ba90646a0dbeb8bc2dd4edf455004b1a7ea859


[ROCm/amdsmi commit: 26874d2a10]
2021-08-04 10:19:44 -04:00
Bill(Shuzhou) Liu e26cec9f5a Add -g compiler option for ADDRESS_SANITIZER
Add -g compiler option for Address Sanitizer

Change-Id: I958fefa6c4b5871c29734ab1d4ec238c9e073192


[ROCm/amdsmi commit: 42d39d3e34]
2021-08-03 13:54:19 -04:00
Elena Sakhnovitch 8e8586591a [rocm_smi.py] --showpower error bugfix
Fix error message in -P for secondary die

Signed-off-by: Elena Sakhnovitch
Change-Id: Ica3c0a83b565d2231fad23389b9378056a0f56b3


[ROCm/amdsmi commit: 2db7e2a312]
2021-07-30 00:08:14 -04:00
Elena Sakhnovitch fc4aa3d271 [rocm_smi.py] add secondary die check.
Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: I46618002c1967ec115db88becbaba9e7c0a08af1


[ROCm/amdsmi commit: b59e752122]
2021-07-29 17:46:12 -04:00
Harish Kasiviswanathan 419b720ea5 rocm_smi.py: Remove extraneous line during process termination
At the tail end, when the process is terminating, the subprocess module
fails to find the process. This results in an extraneous line containing
the character 'b' being printed. Fix this.

BUG: SWDEV-296409

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I39aacf8ae948a5acec0aa93296cc0e0aec88b3ef


[ROCm/amdsmi commit: a03acf2c07]
2021-07-27 16:26:49 -04:00
Icarus Sparry f860abd385 Add dependency on rocm-core
Signed-off-by: Icarus Sparry <icarus.sparry@amd.com>
Change-Id: Ie2a5b08747129a1313edf2a834f2e0e8638372c2
(cherry picked from commit 3d74653383)


[ROCm/amdsmi commit: de025ca5f6]
2021-07-27 09:42:30 -04:00
Ori Messinger 546e11c058 ROCm SMI Python CLI: Fix printLog Collisions
Python's default 'print' implementation is not thread safe, causing
empty lines to be printed during multithreaded code execution.

This fixes the --showevents output for multi-GPU systems.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I72f7341cdf4401f1fed4cd8f7d7a4a90bf9a3a4c


[ROCm/amdsmi commit: 95348f37cc]
2021-07-21 23:58:07 -04:00
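
One common way to avoid the kind of collision described in 546e11c058 is to serialize output behind a lock and write each line in a single call. The sketch below assumes that approach; it is not taken from rocm_smi.py, and `print_log`, `demo`, and the message format are hypothetical.

```python
import io
import threading

_print_lock = threading.Lock()


def print_log(message, stream):
    """Write one complete line while holding a lock (hypothetical helper)."""
    with _print_lock:
        # Text and newline go out in a single write, so another thread
        # cannot slip its own output in between them.
        stream.write(message + "\n")


def demo(num_threads=8, lines_per_thread=50):
    """Log from several threads at once and return the lines written."""
    stream = io.StringIO()

    def worker(index):
        for n in range(lines_per_thread):
            print_log(f"GPU[{index}] event {n}", stream)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return stream.getvalue().splitlines()


lines = demo()
print(len(lines), "lines,", sum(1 for line in lines if not line), "empty")
```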
Ori Messinger 0cdc8fb26c ROCm SMI Python CLI: Add Zero Padding to Device Model
Use zero padding for the hexadecimal value 'device_model' inside
showProductName with a padding length of 4.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I962b94d414c6ba050d951486ad9e7559123f8850


[ROCm/amdsmi commit: 03ae187a35]
2021-07-17 04:29:52 -04:00
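
The zero padding from 0cdc8fb26c corresponds to a width-4 hexadecimal format specifier in Python. The sketch below assumes a `0x` prefix in the displayed value; `format_device_model` is a hypothetical name and the sample IDs are arbitrary.

```python
def format_device_model(device_model):
    """Render an integer device ID as hex, zero-padded to 4 digits."""
    # "04x" means: lowercase hex, minimum width 4, padded with zeros.
    return f"0x{device_model:04x}"


print(format_device_model(0x66))    # 0x0066
print(format_device_model(0x738C))  # 0x738c
```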
Bill(Shuzhou) Liu d9e824060c AddressSanitizer report stack-use-after-scope
Fix the stack-use-after-scope error reported by the AddressSanitizer.

Bug: SWDEV-291913
Change-Id: I0ffd71af8679b8bff6c363096fafe75dffcf329e


[ROCm/amdsmi commit: 8c60dbebaa]
2021-06-25 13:33:38 -04:00
Divya Shikre 7a0b4bc8ac Add fix to ignore error returned when perf determinism is not supported.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I89b6a0a3dbba6fbd4b12ff2e20670eff9f32ed7f


[ROCm/amdsmi commit: 6edea7a92e]
2021-06-14 12:18:22 -04:00
Divya Shikre d356da056d Add fix to show usage of setperfdeterminism functionality in --help command
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ife93c887eea2a9aae69f2923dba45c7cde4838d3


[ROCm/amdsmi commit: 686e6ac654]
2021-05-12 17:29:37 -04:00
Divya Shikre ebec7991cb Return an error when user tries to set out of range clock values for setsrange functionality
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ibe1075c1d2b6c009332a52b81f4b41f7e93d0756


[ROCm/amdsmi commit: 462d4adc24]
2021-05-11 12:32:19 -04:00
Harish Kasiviswanathan 10a16579c1 Add timestamp resolution info in comments
Specify that timestamp resolution is in ns in header file.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4db00a07c0b5c43ae23c98213f2fbbcf93110234


[ROCm/amdsmi commit: 14201290a2]
2021-05-05 12:32:58 -04:00
Harish Kasiviswanathan 0e17236bc5 Add support to read gpu_metrics version 1.2
gpu_metrics version 1.2 provides atomic timestamp. Use this timestamp.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I7a1a675f53b93718f34b1f2979173e9064e0ef93


[ROCm/amdsmi commit: 6b10a7761b]
2021-05-05 12:31:10 -04:00
Harish Kasiviswanathan 3c7b9cef95 Change #define RSMI_GPU_METRICS_API_CONTENT_VER
Change to RSMI_GPU_METRICS_API_CONTENT_VER_1, in preparation for
supporting additional formats.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4367a2622a0fa41e6b05bc4436ecd24b8c4e30e2


[ROCm/amdsmi commit: e83cf605c6]
2021-05-04 20:51:10 -04:00
Harish Kasiviswanathan ab54197e08 Move gpu_metrics functions to different file
No logic change. Only structural change

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id5e1a678c0888f04081ee06db4521c72b5eb9b16


[ROCm/amdsmi commit: c416726054]
2021-05-04 20:49:51 -04:00
Ori Messinger a9e6f40bbb ROCm SMI LIB: Add Default Power Cap To rsmitst
Implement default GPU power cap functionality in rsmitst.
It is available in the "rsmitstReadOnly.TestPowerRead" test, and
is displayed as: "Default Power Cap: #uW" (where uW is microwatts).

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I564ea3785f1a93dfd30587634057516549fa762c


[ROCm/amdsmi commit: 5b42cdf780]
2021-04-28 12:42:34 -04:00
Kent Russell 23635d1f90 rocm_smi.py: Fix gpu reset error
Since device is a list, we need to pass a single item to the isAmdGpu
function.

Fixes: ffbe481241 "rocm_smi.py: Don't try to reset non-AMD GPUs"

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I19a74377636ff4589f11d092f41e1d35c1acb307


[ROCm/amdsmi commit: 242d94a668]
2021-04-28 07:44:55 -04:00
Kent Russell 4de1e4094a rocm_smi.py: Don't try to print absent clock files
Instead of throwing "Unsupported clock" errors for ASICs that don't
support a certain clock type (e.g. dcefclk on MI-series), just dump the
warning to logging.debug and don't try to read the clock

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: If3cb9a472b03aa535a76fc24bcd9f77122090634


[ROCm/amdsmi commit: b931380f02]
2021-04-23 10:19:04 -04:00
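
The behavior described in 4de1e4094a (skip a clock file that is absent and record that at debug level) might look roughly like this. `read_clock_levels` is a hypothetical helper, and the directory layout is an assumption made for illustration.

```python
import logging
import os


def read_clock_levels(device_dir, clock_file):
    """Return the lines of a clock file, or None if the file is absent.

    An ASIC that lacks a clock type simply has no such file, so the
    absence is logged at debug level instead of being reported as an error.
    """
    path = os.path.join(device_dir, clock_file)
    if not os.path.isfile(path):
        logging.debug("Clock file %s is absent; skipping", path)
        return None
    with open(path) as handle:
        return handle.read().splitlines()
```

A caller would then treat `None` as "nothing to print" for that clock type rather than as a failure.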
Ori Messinger 8a1ca3d26c rocm_smi.py: Show 'Out of Spec' warning only if required
Use the default power cap exposed via sysfs to determine when to
show the 'Out of Spec' warning.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I0fa3612b50e230856b0d5a390f876b35268d9587


[ROCm/amdsmi commit: b71e07b3fb]
2021-04-22 14:44:05 -04:00
Ori Messinger 9537c89a6b ROCm SMI LIB: Add Default GPU Power Cap
Implement default GPU power cap functionality in the LIB.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53


[ROCm/amdsmi commit: 83cd2fe4f1]
2021-04-22 10:49:55 -04:00
Harish Kasiviswanathan 52dc52654d Add energy counter resolution to rsmi_dev_energy_count_get
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I03b70968257db7a45e21d7ba62542cdedd18eb85


[ROCm/amdsmi commit: 844acbc0d8]
2021-04-22 10:25:06 -04:00
Ori Messinger f225c95878 ROCm SMI Python CLI: Add showevent Functionality
Implement showevent functionality in the ROCm SMI Python CLI.

It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62


[ROCm/amdsmi commit: a9e7e5a475]
2021-04-22 10:21:07 -04:00
Elena 3eb9426800 [rocm_smi.py] add energy counter
--showenergycounter

Signed-off-by: Elena Sakhnovitch
Change-Id: Iede0f2b06523f7cb2719489a883e9c49722f8d93


[ROCm/amdsmi commit: c80fc54500]
2021-04-21 18:40:19 -04:00
Elena 23d7d4a5ff [rocm_smi.py] Coarse Grain Utilization Counters
--showuse
--showmemuse

====================================
========= % time GPU is busy =======
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 0
====================================

Change-Id: I9db115ad78b394469206b22d195781a430b2f1d8


[ROCm/amdsmi commit: 771b4af95c]
2021-04-21 17:23:21 -04:00
Harish Kasiviswanathan 608afb879b Suppress warning message in getFanSpeed function
Many data center cards are fanless. Don't show warning if unable to get
fan speed. The fan speed will be reported as 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df


[ROCm/amdsmi commit: 1c9e384c8f]
2021-04-21 15:29:44 -04:00
Harish Kasiviswanathan abedccf6f3 Add time profile for set_power_cap function
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id728cb5fe85b3558e52b4517508211dca499e801


[ROCm/amdsmi commit: 92cf7ff28a]
2021-04-21 15:29:44 -04:00
Divya Shikre 38cee239c7 Update setrange functionality in CLI
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic942bd76297c50caf189bfc0972d30dc42d91f32


[ROCm/amdsmi commit: 56c132873b]
2021-04-20 15:39:05 -04:00
Divya Shikre 86e595089b Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031


[ROCm/amdsmi commit: dc431506f5]
2021-04-20 13:12:27 -04:00
Divya Shikre 5db8002118 Add new setrange function in C++ lib
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I670aaeb93827bf4b2cc08eb36d0f9756f00e4e4e


[ROCm/amdsmi commit: 9f9a7aaf65]
2021-04-19 22:38:59 -04:00
Divya Shikre 3a11b92287 Fix for cli errors - extra args in perf_determinism, undefined variable in setClocks
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Id138cfcbea4384f520537cc045d358024177b1ac


[ROCm/amdsmi commit: d9f7bd0ff4]
2021-04-19 17:32:07 -04:00
Elena 1fa63e0e9c [rocm-smi-lib] add HBM temperature conversion factor
Change-Id: I45339c87c3d2a40670baf1b76ada60dceb650dc0


[ROCm/amdsmi commit: a383dd23aa]
2021-04-19 16:41:48 -04:00
Elena ab17fca25f Adding 4 new HBM temperature sensors.
Signed-off-by: Elena Sakhnovitch
Change-Id: Iaea04c38e8c2353e85d8aa2b871fdb82727157de


[ROCm/amdsmi commit: 81c066350f]
2021-04-17 23:58:49 -04:00
Bill(Shuzhou) Liu 6e21939768 Unit test for energy accumulator counter
Add a few unit tests for energy accumulator counter.

Change-Id: Ib78a67e29465de9c14e6e934c5d62ec64de66d8a


[ROCm/amdsmi commit: 392d13e318]
2021-04-14 16:04:46 -04:00
Bill(Shuzhou) Liu 62bef2b6c4 Unit tests for coarse grain utilization counters
The unit tests for GFX and Memory activity counters.

Change-Id: I968dabc9ef6de9d335d7f751b290fb713b51a79c


[ROCm/amdsmi commit: 6340176b99]
2021-04-14 10:53:55 -04:00
Bill(Shuzhou) Liu 919364871d Add energy accumulator counter
The energy accumulator counter tracks all energy consumed.

Change-Id: I5b25f817b7802d81c477361447f0ecd7ec02fc61


[ROCm/amdsmi commit: 8eec0a7d36]
2021-04-14 10:43:01 -04:00
Bill(Shuzhou) Liu 38ddf00856 Add coarse grain utilization counter
The coarse grain utilization counter includes GFX and Memory activity.

Change-Id: I5d09976792d3f4a1c1081651fa24ff857016d4c0


[ROCm/amdsmi commit: 9bfb9ac297]
2021-04-14 10:40:19 -04:00
Kent Russell ffbe481241 rocm_smi.py: Don't try to reset non-AMD GPUs
This won't work for obvious reasons, so exit with an error instead of
trying to access a file that doesn't exist and segfaulting

Change-Id: Id1230922fa6e9a19e9394280faad88a43c7d2e34


[ROCm/amdsmi commit: c7c2ac5559]
2021-04-13 08:00:17 -04:00
Kent Russell f9cd4e6093 CMakeLists: Add python3 to required packages
Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I434b24d12e92d2f6a6928b7450e74c3898303a44


[ROCm/amdsmi commit: b016a8269a]
2021-04-12 11:33:39 -04:00
Divya Shikre 0fc1abdced Update performance determinism api as per the modified sysfs interface.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib0ec5128819644a2ff6c916da9194a7fe1dad795


[ROCm/amdsmi commit: aaf2120117]
2021-04-07 16:38:48 -04:00
Bill(Shuzhou) Liu 7b48f14374 Add support for the HBM temperature
rsmi_dev_temp_metric_get() can also report the HBM temperatures,
which are retrieved from gpu_metrics.

Change-Id: I96b979296e90cf881523627b41b1a02849676416


[ROCm/amdsmi commit: da480b4589]
2021-04-05 15:55:55 -04:00
Cole Nelson 005f98d117 CMakeLists.txt: add ENABLE_LDCONFIG to support multi-version install
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
Change-Id: If06e8b7b57ad12f22c1970622d241a42083d575e


[ROCm/amdsmi commit: f990d775b7]
2021-03-30 15:39:47 -04:00
Chris Freehill 7337bfaef9 Handle different gpu_metrics content versions for format v1
Change-Id: I344d1815da683befc8f8b5caf921803b267ae29f


[ROCm/amdsmi commit: 5e2a4f3a15]
2021-03-24 14:34:55 -05:00
Chris Freehill 826996c1c1 Adjust event counters to report only new events
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.

Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f


[ROCm/amdsmi commit: ce475b009c]
2021-02-24 11:02:17 -06:00
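
The adjustment described in 826996c1c1 amounts to remembering the last running total and reporting the difference. A minimal sketch, with `EventCounter` and `new_events` as hypothetical names rather than anything from the library:

```python
class EventCounter:
    """Turn a running total into a count of events since the last read."""

    def __init__(self):
        self._last_total = 0

    def new_events(self, running_total):
        # The source reports a cumulative value, so the number of new
        # events is the current total minus the previously recorded one.
        delta = running_total - self._last_total
        self._last_total = running_total
        return delta


counter = EventCounter()
print(counter.new_events(5))   # 5
print(counter.new_events(12))  # 7
print(counter.new_events(12))  # 0
```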
Chris Freehill d1e4491505 Handle set freq for double-digit index in rocm_smi.py
rocm_smi.py --set<m|s>clk was treating the freq as a string.
This causes problems in parsing when the index is more than 1
digit. Now, treat the indexes as integers.

Change-Id: Ia0d859d33b685fe90689a86ff1c83980808b1514


[ROCm/amdsmi commit: 11440536cf]
2021-02-23 18:51:29 -06:00
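
To illustrate why d1e4491505 treats indexes as integers: strings compare character by character, so multi-digit values sort and compare incorrectly. The helper below is a hypothetical sketch of integer parsing, not the code from rocm_smi.py.

```python
def parse_clock_levels(args):
    """Convert command-line level arguments to a sorted list of integers."""
    levels = []
    for arg in args:
        if not arg.isdigit():
            raise ValueError(f"clock level must be a non-negative integer: {arg!r}")
        levels.append(int(arg))
    return sorted(set(levels))


# As strings, "10" and "11" sort before "2".
print(sorted(["10", "2", "11"]))              # ['10', '11', '2']
# As integers, the order is numeric.
print(parse_clock_levels(["10", "2", "11"]))  # [2, 10, 11]
```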
Chris Freehill 9d2e2ffffd Change Debian Architecture from amd64 to any
rocm_smi_lib is not currently known to be limited to compiling
on specific architectures.

Change-Id: I209e8baa063e99ebe5ff09eaf0dc6541770aa829


[ROCm/amdsmi commit: 7effb405f0]
2021-02-01 13:48:38 -06:00
Chris Freehill fff19b1b3e Don't use hwmon# as indicator of gpu
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between gpu nodes
and non-gpu nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is used as an
indicator that the node is indeed a gpu.

Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85


[ROCm/amdsmi commit: ff9546aa62]
2021-01-29 13:05:10 -05:00
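
The discovery rule from fff19b1b3e can be sketched as a file-existence check. The actual change is in the C++ library; this Python version only illustrates the idea, and it assumes a directory layout of `<root>/cardN/device/vbios_version`. `find_gpu_nodes` is a hypothetical name.

```python
from pathlib import Path


def find_gpu_nodes(drm_root):
    """Return the names of card directories that expose vbios_version."""
    gpus = []
    for card in sorted(Path(drm_root).glob("card*")):
        # The presence of vbios_version, rather than an hwmon# directory,
        # is taken as the sign that this node is a GPU.
        if (card / "device" / "vbios_version").is_file():
            gpus.append(card.name)
    return gpus
```

On a Linux system the root would typically be something like `/sys/class/drm`, which is an assumption here rather than something the commit message states.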
Ori Messinger 42b33ea096 ROCm SMI Python CLI: Fix Lower Power Cap Warning
The purpose of this patch is to fix a power cap bug for --setpoweroverdrive.
This bug occurs when the user attempts to set a lower wattage than the current
or default wattage, which displays an unnecessary warning message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I730d2c6031b7d7c4af5acf32ecd28da5ca21ab12


[ROCm/amdsmi commit: 20e2d260fb]
2021-01-27 03:24:22 -05:00
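
The intent of 42b33ea096 can be read as: warn only when the requested cap exceeds a reference value, never when it is lower. The sketch below assumes the default power cap is that reference; the function name and the exact comparison are hypothetical.

```python
def needs_power_cap_warning(requested_watts, default_cap_watts):
    """Return True only if the requested cap is above the default cap."""
    # Lowering the cap, or setting it to the default, is within spec,
    # so no warning is needed in those cases.
    return requested_watts > default_cap_watts


print(needs_power_cap_warning(150, 225))  # False
print(needs_power_cap_warning(250, 225))  # True
```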