コミットグラフ

321 コミット

作成者 SHA1 メッセージ 日付
Ori Messinger 83cd2fe4f1 ROCm SMI LIB: Add Default GPU Power Cap
Implement default GPU power cap functionality in the LIB.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53
2021-04-22 10:49:55 -04:00
Harish Kasiviswanathan 844acbc0d8 Add energy counter resolution to rsmi_dev_energy_count_get
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I03b70968257db7a45e21d7ba62542cdedd18eb85
2021-04-22 10:25:06 -04:00
Ori Messinger a9e7e5a475 ROCm SMI Python CLI: Add showevent Functionality
Implement showevent functionality in the ROCm SMI Python CLI.

It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62
2021-04-22 10:21:07 -04:00
Elena c80fc54500 [rocm_smi.py] add energy counter
--showenergycounter

Signed-off-by: Elena Sakhnovitch
Change-Id: Iede0f2b06523f7cb2719489a883e9c49722f8d93
2021-04-21 18:40:19 -04:00
Elena 771b4af95c [rocm_smi.py] Coarse Grain Utilization Counters
--showuse
--showmemuse

====================================
========= % time GPU is busy =======
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 0
====================================

Change-Id: I9db115ad78b394469206b22d195781a430b2f1d8
2021-04-21 17:23:21 -04:00
Harish Kasiviswanathan 1c9e384c8f Suppress warning message in getFanSpeed function
Many data center cards are fanless. Don't show warning if unable to get
fan speed. The fan speed will be reported as 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df
2021-04-21 15:29:44 -04:00
Harish Kasiviswanathan 92cf7ff28a Add time profile for set_power_cap function
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id728cb5fe85b3558e52b4517508211dca499e801
2021-04-21 15:29:44 -04:00
Divya Shikre 56c132873b Update setrange functionality in CLI
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic942bd76297c50caf189bfc0972d30dc42d91f32
2021-04-20 15:39:05 -04:00
Divya Shikre dc431506f5 Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031
2021-04-20 13:12:27 -04:00
Divya Shikre 9f9a7aaf65 Add new setrange function in C++ lib
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I670aaeb93827bf4b2cc08eb36d0f9756f00e4e4e
2021-04-19 22:38:59 -04:00
Divya Shikre d9f7bd0ff4 Fix for cli errors - extra args in perf_determinism, undefined variable in setClocks
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Id138cfcbea4384f520537cc045d358024177b1ac
2021-04-19 17:32:07 -04:00
Elena a383dd23aa [rocm-smi-lib] add HBM temperature conversion factor
Change-Id: I45339c87c3d2a40670baf1b76ada60dceb650dc0
2021-04-19 16:41:48 -04:00
Elena 81c066350f Adding 4 new HBM temperature sensors.
Signed-off-by: Elena Sakhnovitch
Change-Id: Iaea04c38e8c2353e85d8aa2b871fdb82727157de
2021-04-17 23:58:49 -04:00
Bill(Shuzhou) Liu 392d13e318 Unit test for energy accumulator counter
Add a few unit tests for energy accumulator counter.

Change-Id: Ib78a67e29465de9c14e6e934c5d62ec64de66d8a
2021-04-14 16:04:46 -04:00
Bill(Shuzhou) Liu 6340176b99 Unit tests for coarse grain utilization counters
The unit tests for GFX and Memory activity counters.

Change-Id: I968dabc9ef6de9d335d7f751b290fb713b51a79c
2021-04-14 10:53:55 -04:00
Bill(Shuzhou) Liu 8eec0a7d36 Add energy accumulator counter
The energy accumulator counter tracks all energy consumed.

Change-Id: I5b25f817b7802d81c477361447f0ecd7ec02fc61
2021-04-14 10:43:01 -04:00
Bill(Shuzhou) Liu 9bfb9ac297 Add coarse grain utilization counter
The coarse grain utilization counter includes GFX and Memory activity.

Change-Id: I5d09976792d3f4a1c1081651fa24ff857016d4c0
2021-04-14 10:40:19 -04:00
Kent Russell c7c2ac5559 rocm_smi.py: Don't try to reset non-AMD GPUs
This won't work for obvious reasons, so exit with an error instead of
trying to access a file that doesn't exist and segfaulting

Change-Id: Id1230922fa6e9a19e9394280faad88a43c7d2e34
2021-04-13 08:00:17 -04:00
Kent Russell b016a8269a CMakeLists: Add python3 to required packages
Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I434b24d12e92d2f6a6928b7450e74c3898303a44
2021-04-12 11:33:39 -04:00
Divya Shikre aaf2120117 Update performance determinism api as per the modified sysfs interface.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib0ec5128819644a2ff6c916da9194a7fe1dad795
2021-04-07 16:38:48 -04:00
Bill(Shuzhou) Liu da480b4589 Add support for the HBM temperature
The rsmi_dev_temp_metric_get() can also support the HBM
temperatures which is retrieved from gpu_metrics.

Change-Id: I96b979296e90cf881523627b41b1a02849676416
2021-04-05 15:55:55 -04:00
Cole Nelson f990d775b7 CMakeLists.txt: add ENABLE_LDCONFIG to support multi-version install
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
Change-Id: If06e8b7b57ad12f22c1970622d241a42083d575e
2021-03-30 15:39:47 -04:00
Chris Freehill 5e2a4f3a15 Handle different gpu_metrics content versions for format v1
Change-Id: I344d1815da683befc8f8b5caf921803b267ae29f
2021-03-24 14:34:55 -05:00
Chris Freehill ce475b009c Adjust event counters to report only new events
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.

Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f
2021-02-24 11:02:17 -06:00
Chris Freehill 11440536cf Handle set freq for double-digit index in rocm_smi.py
rocm_smi.py --set<m|s>clk was treating the freq as a string.
This causes problems in parsing when the index is more than 1
digit. Now, treat the indexes as integers.

Change-Id: Ia0d859d33b685fe90689a86ff1c83980808b1514
2021-02-23 18:51:29 -06:00
Chris Freehill 7effb405f0 Change Debian Architecture from amd64 to any
rocm_smi_lib is not currently known to only compile
on specific architectures.

Change-Id: I209e8baa063e99ebe5ff09eaf0dc6541770aa829
2021-02-01 13:48:38 -06:00
Chris Freehill ff9546aa62 Don't use hwmon# as indicator of gpu
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between gpus nodes
and non-gpu nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is used as an
indicator that the node is indeed a gpu.

Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85
2021-01-29 13:05:10 -05:00
Ori Messinger 20e2d260fb ROCm SMI Python CLI: Fix Lower Power Cap Warning
The purpose of this patch is to fix a power cap bug for --setpoweroverdrive.
This bug occurs when the user attempts to set a lower wattage than the current
or default wattage, which displays an unnecessary warning message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I730d2c6031b7d7c4af5acf32ecd28da5ca21ab12
2021-01-27 03:24:22 -05:00
Ori Messinger 80f629b9be ROCm SMI Python CLI & LIB: Add GPU Reset Functionality
The purpose of this patch is to implement GPU reset functionality
in the LIB, and to call it from the rocm_smi python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Iaf525f7016f8354a7fd93af0209ca2e97ef4fd56
2021-01-26 17:52:24 -05:00
Ori Messinger 4f297bdeb3 ROCm SMI Python CLI: Fix Fan Speed Bug
The purpose of this patch is to fix a fan speed bug for --showfan.
This bug occurs when the current and/or maximum fan speeds are not
found by the LIB, which displayed an unclear error message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ied06e460f22391238dd2d86572813e2a5a64f45b
2021-01-26 08:51:04 -05:00
Kent Russell a902770f86 Fix type in --setmrange documentation
mrange is for MCLK, not SCLK, so fix the typo accordingly

Change-Id: Ib20774b073288a8ec193322f2f767616979c95da
2021-01-25 13:20:20 -05:00
Elena 61cdfff562 ROCm SMI Pythoc CLI: Fix division by zero fan bug
Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: If259ac1ad6d77ce85b2b7616d972b6e7964a9f78
2021-01-20 18:21:23 -05:00
Kent Russell ef7f99a7e2 CMakeLIsts: Fix libasan usage
static-libasan doesn't exist, so use the easier-to-remember
shared-libsan and change static-libasan to static-libsan

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: Ieef480aacdd770f3bb40673a2e8f8306b308b1c9
2021-01-15 15:39:05 -05:00
Chris Freehill 47b882b8d3 Comment out CPACK_RPM_PACKAGE_SUGGESTS line
This line make the build fail on Centos. It may be
that it's not supported on that disto.

See https://bugzilla.redhat.com/show_bug.cgi?id=1811358

Change-Id: Ied7ce634ae9fb2b1544f85c0b10ceecc039c388a
2021-01-12 17:15:52 -06:00
Kent Russell c7b6b47211 rocm-smi: Try find the librocm_smi64.so in a few locations
Instead of looking solely in ../lib, try looking in any /opt folder as a
backup option. This is a little more robust and hopefully leads to fewer
issues trying to find the lib

Change-Id: Ie0d3944b48b32d9965917e5c831388838b6d4ef7
2021-01-08 15:29:11 -05:00
Chris Freehill bb5132a66c Remove adding of bogus hwmon label entries
If we fail to find an expected temperature or voltage label
file, previously we were attempting to re-add a mapping of file
index to sensor types. Attempting to insert a map item that is already
present has no effect, so there should be no functional change.

This was a remnant of old code that should have been deleted.

Change-Id: Ie6f8a62f619a1ae58756e0fd891532434518cf78
2021-01-06 11:01:07 -05:00
Chris Freehill 68095b50e7 Introduce RSMI_DEBUG_INFINITE_LOOP
The environment variable RSMI_DEBUG_INFINITE_LOOP is introduced
to facilitate debugging RSMI in user applications. When this
env. variable is non-zero, an infinite loop will be entered in
rsmi_init(). At this point, a debugger can be attached and RSMI
can be debugger. This only applies to debug builds.

Change-Id: I23f6dd730fc965764295070de053314a1cc5b6aa
2021-01-06 10:30:24 -05:00
Kent Russell 7b5f220f76 CMakeLists: Add sudo to Suggests field
There are some systems that don't have sudo, and since we require sudo
for any of the "set" functionality, add it to "Suggests".

See https://github.com/RadeonOpenCompute/ROCm/issues/1245

Change-Id: I9428b9a68810ee8b51f91bb2e3b63312463161b0
2021-01-04 10:46:46 -05:00
Kent Russell 36a0465127 CMakeLists: Make rocm_smi_lib provide rocm-smi
Now that rocm-smi is deprecated, change the DEB/RPM info so that it
provides the rocm-smi package. This will allow for a seamless transition
over during ROCm upgrades

Change-Id: Ia29aab6e45c5974f7b623b786d0649710ba1f7cc
2021-01-04 10:46:40 -05:00
Ori Messinger 3b52c895cc ROCm SMI Python CLI: Fix --showclkfrq/--showclocks Failure
The purpose of this patch is to check if each valid clock is supported
on the GPU before attempting to retrieve its value.

The valid clocks are: dcefclk, fclk, mclk, pcie, sclk, socclk.

This should get rid of the 'one or more commands failed' message when
running --showclkfrq or --showclocks on a machine that doesn't support
all the possible valid clocks.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I1fb10989fc1a36f38b68a23e17e6e600ed0ac85b
2020-12-18 17:46:23 -05:00
Divya Shikre efd234c9e3 Fix for error while reading gpu_metrics sysfs file
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: If69b7eeb3573ebece9ed0cb539f5ddffbe3c2f09
2020-12-18 15:31:16 -05:00
Ori Messinger d9da490214 ROCm SMI Python CLI: Add Json Functionality to showPids
The purpose of this patch is to add Json functionality to showPids
by modifying the print2DArray function to use printSysLog.

Change-Id: Ie834d209b29332777c3f13f776f81c37d94b01b6
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-12-18 02:16:04 -05:00
Ashutosh Mishra d097cb21e1 Standardizing Package name
Enables standards compliant package naming for debian and rpm

Signed-off-by: Ashutosh Mishra <ashutosh.mishra@amd.com>
Change-Id: Ib37d29aedc20b610619f6921f4147b41c0eaf134
2020-12-17 02:47:53 -05:00
Chris Freehill 7e17684532 Correct TestPerfLevelReadWrite test
Enums referenced in the test did not match what's in rocm_smi.h.
Added static assert to try to catch this. Also moved enum string
map to test_common.cc/h where other such maps are.

Also, fixed some cpplint issues.

Change-Id: I683553248ceb2fabb28ce1a1208bc9744aaf88d6
2020-12-16 17:12:04 -06:00
Divya Shikre 47ca37aef7 Fix for inconsistent GPU indexing between rocm-smi and rbt/hip.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I0d966c91bfe0f0d51859ff098d15011a3e4e8b29
2020-12-11 15:11:21 -05:00
Chris Freehill db6d8d36ea Make rocm_smi.py handle disappearing PIDs
rocm_smi.py had an issue where it gets process information
in 2 different places. If the process disappears in between
those 2 places, a crash would occur.

This fix gracefully returns in this scenario.
Reading the file information from /proc instead of using
the python subProcess() call was considered, but it has the
drawback of not being able to read the process names of
processes not owned by the caller.

Change-Id: If812c4641f00da37e99defb0740f670107c8a797
2020-12-10 20:53:45 -06:00
Chris Freehill 6377e0258d Show more info in stderr when rsmi_init() fails
Some rsmi apps fail without much explanation when
rsmi_init() fails. This patch hopes to provide some clues to
the reason for the failure.

Change-Id: Id51308dc327b9871d537dd3e709b677db4ef10bc
2020-12-10 07:32:03 -05:00
Chris Freehill f4938b0ac9 Fix process killed while holding mutex
Previously, when a process holding a shared mutex was killed,
the next time an RSMI application was started, it would not be
able to obtain the mutex--the application would have to exit.
This fix uses pthread_mutexattr_setrobust() to detect this
situation and act accordingingly.

Also, add some missing, needed mutexes and move mutexes
closer to where the protect resource is used.

Change-Id: Icfdc3a246f4cfa3fd008e3f13472199abd76fd35
2020-12-04 12:59:55 -05:00
Divya Shikre a0d10e021b Fix for syntax error caused due to performance determinism commit.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I02fbfec667e7f96ab0d0662036cf339a56025ba6
2020-12-02 16:31:01 -05:00
Divya Shikre 60d0f3052f Adding Performance Determinism Mode to rocm_smi lib, CLI & gtest.
A special mode of operation to achieve minimal performance variation by letting
the user have the ability to provide the desired frequency to be set as the soft limit.
The user can control the entry and exit to the mode via rocm-smi a mechanism to
enter / exit performance determinism mode as below.

Enter performance determinism mode:
- hold a lock
- write performance_determinism to power_dpm_force_performance_level
- write input clk_freq to pp_dpm_sclk
- release lock

Exit performance determinism_mode:
- hold a lock
- write auto to power_dpm_force_performance_level
- release lock

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ia24e27954cdf1c4337ffc83d8948fbdfaf4552d2
2020-12-02 11:11:00 -05:00