Commit graph

266 Commits

Author SHA1 Message Date
Divya Shikre ebec7991cb Return an error when user tries to set out of range clock values for setsrange functionality
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ibe1075c1d2b6c009332a52b81f4b41f7e93d0756


[ROCm/amdsmi commit: 462d4adc24]
2021-05-11 12:32:19 -04:00
Harish Kasiviswanathan 10a16579c1 Add timestamp resolution info in comments
Specify that timestamp resolution is in ns in header file.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4db00a07c0b5c43ae23c98213f2fbbcf93110234


[ROCm/amdsmi commit: 14201290a2]
2021-05-05 12:32:58 -04:00
Harish Kasiviswanathan 0e17236bc5 Add support to read gpu_metrics version 1.2
gpu_metrics version 1.2 provides atomic timestamp. Use this timestamp.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I7a1a675f53b93718f34b1f2979173e9064e0ef93


[ROCm/amdsmi commit: 6b10a7761b]
2021-05-05 12:31:10 -04:00
Harish Kasiviswanathan 3c7b9cef95 Change #define RSMI_GPU_METRICS_API_CONTENT_VER
Change to RSMI_GPU_METRICS_API_CONTENT_VER_1, in preparation for
supporting additional formats.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4367a2622a0fa41e6b05bc4436ecd24b8c4e30e2


[ROCm/amdsmi commit: e83cf605c6]
2021-05-04 20:51:10 -04:00
Harish Kasiviswanathan ab54197e08 Move gpu_metrics functions to different file
No logic change. Only structural change

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id5e1a678c0888f04081ee06db4521c72b5eb9b16


[ROCm/amdsmi commit: c416726054]
2021-05-04 20:49:51 -04:00
Ori Messinger a9e6f40bbb ROCm SMI LIB: Add Default Power Cap To rsmitst
Implement default GPU power cap functionality in rsmitst.
It is available in the "rsmitstReadOnly.TestPowerRead" test, and
is displayed as: "Default Power Cap: #uW" (where uW is microwatts).

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I564ea3785f1a93dfd30587634057516549fa762c


[ROCm/amdsmi commit: 5b42cdf780]
2021-04-28 12:42:34 -04:00
Kent Russell 23635d1f90 rocm_smi.py: Fix gpu reset error
Since device is a list, we need to pass a single item to the isAmdGpu
function.

Fixes: ffbe481241 "rocm_smi.py: Don't try to reset non-AMD GPUs"

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I19a74377636ff4589f11d092f41e1d35c1acb307


[ROCm/amdsmi commit: 242d94a668]
2021-04-28 07:44:55 -04:00
Kent Russell 4de1e4094a rocm_smi.py: Don't try to print absent clock files
Instead of throwing "Unsupported clock" errors for ASICs that don't
support a certain clock type (e.g. dcefclk on MI-series), just dump the
warning to logging.debug and don't try to read the clock

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: If3cb9a472b03aa535a76fc24bcd9f77122090634


[ROCm/amdsmi commit: b931380f02]
2021-04-23 10:19:04 -04:00
Ori Messinger 8a1ca3d26c rocm_smi.py: Show 'Out of Spec' warning only if required
Use the default power cap exposed via sysfs to determine when to
show the 'Out of Spec' warning.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I0fa3612b50e230856b0d5a390f876b35268d9587


[ROCm/amdsmi commit: b71e07b3fb]
2021-04-22 14:44:05 -04:00
Ori Messinger 9537c89a6b ROCm SMI LIB: Add Default GPU Power Cap
Implement default GPU power cap functionality in the LIB.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53


[ROCm/amdsmi commit: 83cd2fe4f1]
2021-04-22 10:49:55 -04:00
Harish Kasiviswanathan 52dc52654d Add energy counter resolution to rsmi_dev_energy_count_get
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I03b70968257db7a45e21d7ba62542cdedd18eb85


[ROCm/amdsmi commit: 844acbc0d8]
2021-04-22 10:25:06 -04:00
Ori Messinger f225c95878 ROCm SMI Python CLI: Add showevent Functionality
Implement showevent functionality in the ROCm SMI Python CLI.

It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62


[ROCm/amdsmi commit: a9e7e5a475]
2021-04-22 10:21:07 -04:00
Elena 3eb9426800 [rocm_smi.py] add energy counter
--showenergycounter

Signed-off-by: Elena Sakhnovitch
Change-Id: Iede0f2b06523f7cb2719489a883e9c49722f8d93


[ROCm/amdsmi commit: c80fc54500]
2021-04-21 18:40:19 -04:00
Elena 23d7d4a5ff [rocm_smi.py] Coarse Grain Utilization Counters
--showuse
--showmemuse

====================================
========= % time GPU is busy =======
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 0
====================================

Change-Id: I9db115ad78b394469206b22d195781a430b2f1d8


[ROCm/amdsmi commit: 771b4af95c]
2021-04-21 17:23:21 -04:00
Harish Kasiviswanathan 608afb879b Suppress warning message in getFanSpeed function
Many data center cards are fanless. Don't show warning if unable to get
fan speed. The fan speed will be reported as 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df


[ROCm/amdsmi commit: 1c9e384c8f]
2021-04-21 15:29:44 -04:00
Harish Kasiviswanathan abedccf6f3 Add time profile for set_power_cap function
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id728cb5fe85b3558e52b4517508211dca499e801


[ROCm/amdsmi commit: 92cf7ff28a]
2021-04-21 15:29:44 -04:00
Divya Shikre 38cee239c7 Update setrange functionality in CLI
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic942bd76297c50caf189bfc0972d30dc42d91f32


[ROCm/amdsmi commit: 56c132873b]
2021-04-20 15:39:05 -04:00
Divya Shikre 86e595089b Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031


[ROCm/amdsmi commit: dc431506f5]
2021-04-20 13:12:27 -04:00
Divya Shikre 5db8002118 Add new setrange function in C++ lib
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I670aaeb93827bf4b2cc08eb36d0f9756f00e4e4e


[ROCm/amdsmi commit: 9f9a7aaf65]
2021-04-19 22:38:59 -04:00
Divya Shikre 3a11b92287 Fix for cli errors - extra args in perf_determinism, undefined variable in setClocks
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Id138cfcbea4384f520537cc045d358024177b1ac


[ROCm/amdsmi commit: d9f7bd0ff4]
2021-04-19 17:32:07 -04:00
Elena 1fa63e0e9c [rocm-smi-lib] add HBM temperature conversion factor
Change-Id: I45339c87c3d2a40670baf1b76ada60dceb650dc0


[ROCm/amdsmi commit: a383dd23aa]
2021-04-19 16:41:48 -04:00
Elena ab17fca25f Adding 4 new HBM temperature sensors.
Signed-off-by: Elena Sakhnovitch
Change-Id: Iaea04c38e8c2353e85d8aa2b871fdb82727157de


[ROCm/amdsmi commit: 81c066350f]
2021-04-17 23:58:49 -04:00
Bill(Shuzhou) Liu 6e21939768 Unit test for energy accumulator counter
Add a few unit tests for energy accumulator counter.

Change-Id: Ib78a67e29465de9c14e6e934c5d62ec64de66d8a


[ROCm/amdsmi commit: 392d13e318]
2021-04-14 16:04:46 -04:00
Bill(Shuzhou) Liu 62bef2b6c4 Unit tests for coarse grain utilization counters
The unit tests for GFX and Memory activity counters.

Change-Id: I968dabc9ef6de9d335d7f751b290fb713b51a79c


[ROCm/amdsmi commit: 6340176b99]
2021-04-14 10:53:55 -04:00
Bill(Shuzhou) Liu 919364871d Add energy accumulator counter
The energy accumulator counter tracks all energy consumed.

Change-Id: I5b25f817b7802d81c477361447f0ecd7ec02fc61


[ROCm/amdsmi commit: 8eec0a7d36]
2021-04-14 10:43:01 -04:00
Bill(Shuzhou) Liu 38ddf00856 Add coarse grain utilization counter
The coarse grain utilization counter includes GFX and Memory activity.

Change-Id: I5d09976792d3f4a1c1081651fa24ff857016d4c0


[ROCm/amdsmi commit: 9bfb9ac297]
2021-04-14 10:40:19 -04:00
Kent Russell ffbe481241 rocm_smi.py: Don't try to reset non-AMD GPUs
This won't work for obvious reasons, so exit with an error instead of
trying to access a file that doesn't exist and segfaulting

Change-Id: Id1230922fa6e9a19e9394280faad88a43c7d2e34


[ROCm/amdsmi commit: c7c2ac5559]
2021-04-13 08:00:17 -04:00
Kent Russell f9cd4e6093 CMakeLists: Add python3 to required packages
Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I434b24d12e92d2f6a6928b7450e74c3898303a44


[ROCm/amdsmi commit: b016a8269a]
2021-04-12 11:33:39 -04:00
Divya Shikre 0fc1abdced Update performance determinism api as per the modified sysfs interface.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib0ec5128819644a2ff6c916da9194a7fe1dad795


[ROCm/amdsmi commit: aaf2120117]
2021-04-07 16:38:48 -04:00
Bill(Shuzhou) Liu 7b48f14374 Add support for the HBM temperature
The rsmi_dev_temp_metric_get() function can also report the HBM
temperatures, which are retrieved from gpu_metrics.

Change-Id: I96b979296e90cf881523627b41b1a02849676416


[ROCm/amdsmi commit: da480b4589]
2021-04-05 15:55:55 -04:00
Cole Nelson 005f98d117 CMakeLists.txt: add ENABLE_LDCONFIG to support multi-version install
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
Change-Id: If06e8b7b57ad12f22c1970622d241a42083d575e


[ROCm/amdsmi commit: f990d775b7]
2021-03-30 15:39:47 -04:00
Chris Freehill 7337bfaef9 Handle different gpu_metrics content versions for format v1
Change-Id: I344d1815da683befc8f8b5caf921803b267ae29f


[ROCm/amdsmi commit: 5e2a4f3a15]
2021-03-24 14:34:55 -05:00
Chris Freehill 826996c1c1 Adjust event counters to report only new events
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.

Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f


[ROCm/amdsmi commit: ce475b009c]
2021-02-24 11:02:17 -06:00
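The running-total correction described in commit 826996c1c1 can be sketched as follows. This is a minimal Python illustration of the idea, not the library's C++ code: the class name `DeltaCounter` and its `update` method are hypothetical, and the first read is assumed to be measured against a baseline of zero.

```python
class DeltaCounter:
    """Turn a running-total counter into a count of new events per read."""

    def __init__(self):
        # Assumption: no events have been recorded before the first read.
        self._last_total = 0

    def update(self, running_total):
        """Record the latest running total and return only the new events."""
        new_events = running_total - self._last_total
        self._last_total = running_total
        return new_events


counter = DeltaCounter()
first = counter.update(10)   # 10 new events since the baseline
second = counter.update(15)  # 5 new events since the previous read
```

Keeping the previously recorded value next to the counter means each read only needs one subtraction, at the cost of the caller having to reuse the same object between reads.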
Chris Freehill d1e4491505 Handle set freq for double-digit index in rocm_smi.py
rocm_smi.py --set<m|s>clk was treating the freq as a string.
This causes problems in parsing when the index is more than 1
digit. Now, treat the indexes as integers.

Change-Id: Ia0d859d33b685fe90689a86ff1c83980808b1514


[ROCm/amdsmi commit: 11440536cf]
2021-02-23 18:51:29 -06:00
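One way a string-typed index can go wrong with multi-digit values is shown below. This is an illustrative Python sketch only, not the actual rocm_smi.py code; `parse_level_indexes` is a hypothetical helper, and the exact failure mode in the original script may have differed.

```python
def parse_level_indexes(tokens):
    """Convert CLI tokens such as ["0", "10"] into integer level indexes."""
    return [int(token) for token in tokens]


# Iterating over a raw string splits a two-digit index into two characters.
assert list("10") == ["1", "0"]

# Converting each token to int keeps the index whole.
assert parse_level_indexes(["0", "10"]) == [0, 10]
```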
Chris Freehill 9d2e2ffffd Change Debian Architecture from amd64 to any
rocm_smi_lib is not currently known to only compile
on specific architectures.

Change-Id: I209e8baa063e99ebe5ff09eaf0dc6541770aa829


[ROCm/amdsmi commit: 7effb405f0]
2021-02-01 13:48:38 -06:00
Chris Freehill fff19b1b3e Don't use hwmon# as indicator of gpu
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between gpu nodes
and non-gpu nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is used as an
indicator that the node is indeed a gpu.

Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85


[ROCm/amdsmi commit: ff9546aa62]
2021-01-29 13:05:10 -05:00
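The discovery check described in commit fff19b1b3e amounts to testing for one file. A minimal Python sketch of that test, assuming the caller already has the device directory to inspect; the function name `is_gpu_node` is hypothetical, and the real check lives in the C++ library:

```python
import os


def is_gpu_node(device_dir):
    """Return True if device_dir contains a vbios_version file."""
    return os.path.isfile(os.path.join(device_dir, "vbios_version"))
```

A call such as `is_gpu_node("/sys/class/drm/card0/device")` would be a plausible use on a Linux system, though the exact directory the library inspects is an assumption here.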
Ori Messinger 42b33ea096 ROCm SMI Python CLI: Fix Lower Power Cap Warning
The purpose of this patch is to fix a power cap bug for --setpoweroverdrive.
This bug occurs when the user attempts to set a lower wattage than the current
or default wattage, which displays an unnecessary warning message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I730d2c6031b7d7c4af5acf32ecd28da5ca21ab12


[ROCm/amdsmi commit: 20e2d260fb]
2021-01-27 03:24:22 -05:00
Ori Messinger d41364d1cf ROCm SMI Python CLI & LIB: Add GPU Reset Functionality
The purpose of this patch is to implement GPU reset functionality
in the LIB, and to call it from the rocm_smi python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Iaf525f7016f8354a7fd93af0209ca2e97ef4fd56


[ROCm/amdsmi commit: 80f629b9be]
2021-01-26 17:52:24 -05:00
Ori Messinger a5fee40cbb ROCm SMI Python CLI: Fix Fan Speed Bug
The purpose of this patch is to fix a fan speed bug for --showfan.
This bug occurs when the current and/or maximum fan speeds are not
found by the LIB, which displayed an unclear error message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ied06e460f22391238dd2d86572813e2a5a64f45b


[ROCm/amdsmi commit: 4f297bdeb3]
2021-01-26 08:51:04 -05:00
Kent Russell 8d37749c05 Fix typo in --setmrange documentation
mrange is for MCLK, not SCLK, so fix the typo accordingly

Change-Id: Ib20774b073288a8ec193322f2f767616979c95da


[ROCm/amdsmi commit: a902770f86]
2021-01-25 13:20:20 -05:00
Elena bb879e7f38 ROCm SMI Python CLI: Fix division by zero fan bug
Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: If259ac1ad6d77ce85b2b7616d972b6e7964a9f78


[ROCm/amdsmi commit: 61cdfff562]
2021-01-20 18:21:23 -05:00
Kent Russell 4a35269cc1 CMakeLists: Fix libasan usage
static-libasan doesn't exist, so use the easier-to-remember
shared-libsan and change static-libasan to static-libsan

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: Ieef480aacdd770f3bb40673a2e8f8306b308b1c9


[ROCm/amdsmi commit: ef7f99a7e2]
2021-01-15 15:39:05 -05:00
Chris Freehill 23345c0c3a Comment out CPACK_RPM_PACKAGE_SUGGESTS line
This line makes the build fail on CentOS. It may be
that it's not supported on that distro.

See https://bugzilla.redhat.com/show_bug.cgi?id=1811358

Change-Id: Ied7ce634ae9fb2b1544f85c0b10ceecc039c388a


[ROCm/amdsmi commit: 47b882b8d3]
2021-01-12 17:15:52 -06:00
Kent Russell 2ecaedb600 rocm-smi: Try to find the librocm_smi64.so in a few locations
Instead of looking solely in ../lib, try looking in any /opt folder as a
backup option. This is a little more robust and hopefully leads to fewer
issues trying to find the lib

Change-Id: Ie0d3944b48b32d9965917e5c831388838b6d4ef7


[ROCm/amdsmi commit: c7b6b47211]
2021-01-08 15:29:11 -05:00
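The fallback search described in commit 2ecaedb600 might look roughly like the sketch below. The function name `find_library`, its parameters, and the `/opt/*/lib` layout are assumptions made for illustration; the commit message only says that `../lib` is tried first and folders under `/opt` serve as a backup.

```python
import glob
import os


def find_library(script_dir, name="librocm_smi64.so", opt_root="/opt"):
    """Return the first existing candidate path for the library, or None."""
    # Preferred location: ../lib relative to the script.
    candidates = [os.path.join(script_dir, "..", "lib", name)]
    # Backup (assumed layout): <opt_root>/<any folder>/lib/<name>.
    candidates += sorted(glob.glob(os.path.join(opt_root, "*", "lib", name)))
    for path in candidates:
        if os.path.isfile(path):
            return os.path.normpath(path)
    return None
```

Returning `None` instead of raising leaves it to the caller to decide how to report a missing library.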
Chris Freehill 55e86989c1 Remove adding of bogus hwmon label entries
If we fail to find an expected temperature or voltage label
file, previously we were attempting to re-add a mapping of file
index to sensor types. Attempting to insert a map item that is already
present has no effect, so there should be no functional change.

This was a remnant of old code that should have been deleted.

Change-Id: Ie6f8a62f619a1ae58756e0fd891532434518cf78


[ROCm/amdsmi commit: bb5132a66c]
2021-01-06 11:01:07 -05:00
Chris Freehill 76323354d1 Introduce RSMI_DEBUG_INFINITE_LOOP
The environment variable RSMI_DEBUG_INFINITE_LOOP is introduced
to facilitate debugging RSMI in user applications. When this
env. variable is non-zero, an infinite loop will be entered in
rsmi_init(). At this point, a debugger can be attached and RSMI
can be debugged. This only applies to debug builds.

Change-Id: I23f6dd730fc965764295070de053314a1cc5b6aa


[ROCm/amdsmi commit: 68095b50e7]
2021-01-06 10:30:24 -05:00
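The decision behind RSMI_DEBUG_INFINITE_LOOP can be sketched in Python as below. This only models the "non-zero in a debug build" condition from the commit message; the real check is in the C++ rsmi_init(), the function name `should_wait_for_debugger` is hypothetical, and treating a non-numeric value as unset is an assumption.

```python
import os


def should_wait_for_debugger(environ=os.environ, debug_build=True):
    """Return True when the debug loop should be entered."""
    if not debug_build:
        # The commit message states the variable only applies to debug builds.
        return False
    raw = environ.get("RSMI_DEBUG_INFINITE_LOOP", "0")
    try:
        return int(raw) != 0
    except ValueError:
        # Assumption: a non-numeric value is treated as "not set".
        return False
```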
Kent Russell e4175d0eeb CMakeLists: Add sudo to Suggests field
There are some systems that don't have sudo, and since we require sudo
for any of the "set" functionality, add it to "Suggests".

See https://github.com/RadeonOpenCompute/ROCm/issues/1245

Change-Id: I9428b9a68810ee8b51f91bb2e3b63312463161b0


[ROCm/amdsmi commit: 7b5f220f76]
2021-01-04 10:46:46 -05:00
Kent Russell 2411ad3aea CMakeLists: Make rocm_smi_lib provide rocm-smi
Now that rocm-smi is deprecated, change the DEB/RPM info so that it
provides the rocm-smi package. This will allow for a seamless transition
over during ROCm upgrades

Change-Id: Ia29aab6e45c5974f7b623b786d0649710ba1f7cc


[ROCm/amdsmi commit: 36a0465127]
2021-01-04 10:46:40 -05:00
Ori Messinger 848697c287 ROCm SMI Python CLI: Fix --showclkfrq/--showclocks Failure
The purpose of this patch is to check if each valid clock is supported
on the GPU before attempting to retrieve its value.

The valid clocks are: dcefclk, fclk, mclk, pcie, sclk, socclk.

This should get rid of the 'one or more commands failed' message when
running --showclkfrq or --showclocks on a machine that doesn't support
all the possible valid clocks.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I1fb10989fc1a36f38b68a23e17e6e600ed0ac85b


[ROCm/amdsmi commit: 3b52c895cc]
2020-12-18 17:46:23 -05:00
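The "check support before reading" approach from commit 848697c287 can be illustrated with a short Python sketch. The clock names come from the commit message; `collect_clocks` and the two callbacks `is_supported` and `read_clock` are hypothetical stand-ins for whatever the CLI actually calls.

```python
VALID_CLOCKS = ["dcefclk", "fclk", "mclk", "pcie", "sclk", "socclk"]


def collect_clocks(is_supported, read_clock):
    """Read only the clocks the GPU supports, skipping the rest silently."""
    results = {}
    for clock in VALID_CLOCKS:
        if not is_supported(clock):
            # Skipping here avoids a failed read for an unsupported clock.
            continue
        results[clock] = read_clock(clock)
    return results
```

Because unsupported clocks are never read, a GPU that lacks some of them no longer produces a failure for the whole command.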
Divya Shikre 22516a3b63 Fix for error while reading gpu_metrics sysfs file
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: If69b7eeb3573ebece9ed0cb539f5ddffbe3c2f09


[ROCm/amdsmi commit: efd234c9e3]
2020-12-18 15:31:16 -05:00