Commit Graph

129 Commits

Author SHA1 Message Date
Divya Shikre 7b1daaef96 Add fix to display correct GPU Memory Activity and GFX Activity value.
The driver fills memory with 0xFF for every metric not supported on that ASIC.
So if 0xFF is detected, return RSMI_STATUS_NOT_SUPPORTED.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I86a38148c7a288ea0db94893f685560eaac098ab
2021-11-25 14:28:06 -05:00
Divya Shikre f61cb1b41d Add fix for out of range temperature value for HBM.
The driver fills memory with 0xFF for every metric not supported on that ASIC.
So if 0xFF is detected, return RSMI_STATUS_NOT_SUPPORTED.

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Iacb6474486e3732f2aa824ff447c17f8243b65cd
2021-11-23 15:37:41 -05:00
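
The check described in the two commits above can be sketched roughly as follows. This is a minimal illustration in Python, not the library's C++ code: the helper name `read_metric`, the buffer layout, and the string stand-ins for the status codes are all assumptions made for the example.

```python
import struct

# Stand-in status codes for this sketch; the real values live in the library's headers.
RSMI_STATUS_SUCCESS = "RSMI_STATUS_SUCCESS"
RSMI_STATUS_NOT_SUPPORTED = "RSMI_STATUS_NOT_SUPPORTED"


def read_metric(raw: bytes, offset: int, size: int):
    """Decode one little-endian unsigned field from a raw metrics buffer.

    Returns (status, value). A field whose bytes are all 0xFF is treated as
    "not supported on this ASIC", following the commit message.
    """
    field = raw[offset:offset + size]
    if len(field) != size:
        raise ValueError("field lies outside the buffer")
    if all(b == 0xFF for b in field):
        return RSMI_STATUS_NOT_SUPPORTED, None
    return RSMI_STATUS_SUCCESS, int.from_bytes(field, "little")


# Hypothetical buffer: a 16-bit activity value of 37, then an unsupported 16-bit field.
buf = struct.pack("<H", 37) + b"\xff\xff"
supported = read_metric(buf, 0, 2)
unsupported = read_metric(buf, 2, 2)
```

Returning a status instead of the raw value keeps a 0xFF-filled field from being reported as a very large activity percentage or temperature.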
Elena Sakhnovitch 50ea68e694 [ROCm SMI LIB]: Add rsmi_minmax_bandwidth_get()
The API provides min/max bandwidth values between nodes.
The current implementation only supports directly connected (1 hop)
XGMI devices.

Signed-off-by: Elena Sakhnovitch
Change-Id: Ifc95da13845fbe7903c5386d320183ffd58c5b53
2021-10-28 17:00:41 -04:00
Ori Messinger ff02042c64 ROCm SMI LIB: Add rsmi_is_P2P_accessible() API
Implements the rsmi_is_P2P_accessible API.
The function returns True if P2P is possible between two nodes.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ic7316eebcec4480175c7ad04c21a42b2e1a4c454
2021-10-13 22:01:33 -04:00
Elena Sakhnovitch 5e1bfcadd7 rocm_smi_lib: fix gpu_metrics_v1_3 support
Signed-off-by: Elena Sakhnovitch
Change-Id: Ia7a6b17eb0f317465613ba92ae7548a221c46ee3
2021-08-13 11:59:50 -04:00
Elena Sakhnovitch fee82af1fe rocm_smi_lib: add gpu_metrics_v1_3 support
Signed-off-by: Elena Sakhnovitch
Change-Id: I4a9dedc80b8fce60e12c5baf8651d54d16a6a41c
2021-08-13 09:23:35 -04:00
Harish Kasiviswanathan 7a8c3f3629 Fall back to pci-ids if FRU product_name is empty
rocm-smi --showproductname will not show "Card series" in its output if
the product_name exported by the kernel is an empty string. This has been
raised as a regression by a customer.

BUG: SWDEV-297228

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I9aae24778e2d3a30aa661d8f338278c1666590fb
2021-08-04 10:53:55 -04:00
Bill(Shuzhou) Liu 8c60dbebaa AddressSanitizer report stack-use-after-scope
Fix the stack-use-after-scope error reported by the AddressSanitizer.

Bug: SWDEV-291913
Change-Id: I0ffd71af8679b8bff6c363096fafe75dffcf329e
2021-06-25 13:33:38 -04:00
Divya Shikre 462d4adc24 Return an error when user tries to set out of range clock values for setsrange functionality
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ibe1075c1d2b6c009332a52b81f4b41f7e93d0756
2021-05-11 12:32:19 -04:00
Harish Kasiviswanathan 6b10a7761b Add support to read gpu_metrics version 1.2
gpu_metrics version 1.2 provides atomic timestamp. Use this timestamp.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I7a1a675f53b93718f34b1f2979173e9064e0ef93
2021-05-05 12:31:10 -04:00
Harish Kasiviswanathan e83cf605c6 Change #define RSMI_GPU_METRICS_API_CONTENT_VER
Change to RSMI_GPU_METRICS_API_CONTENT_VER_1, in preparation for
supporting additional formats.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4367a2622a0fa41e6b05bc4436ecd24b8c4e30e2
2021-05-04 20:51:10 -04:00
Harish Kasiviswanathan c416726054 Move gpu_metrics functions to different file
No logic change. Only structural change

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Id5e1a678c0888f04081ee06db4521c72b5eb9b16
2021-05-04 20:49:51 -04:00
Ori Messinger 83cd2fe4f1 ROCm SMI LIB: Add Default GPU Power Cap
Implement default GPU power cap functionality in the LIB.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53
2021-04-22 10:49:55 -04:00
Harish Kasiviswanathan 844acbc0d8 Add energy counter resolution to rsmi_dev_energy_count_get
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I03b70968257db7a45e21d7ba62542cdedd18eb85
2021-04-22 10:25:06 -04:00
Divya Shikre dc431506f5 Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031
2021-04-20 13:12:27 -04:00
Divya Shikre 9f9a7aaf65 Add new setrange function in C++ lib
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I670aaeb93827bf4b2cc08eb36d0f9756f00e4e4e
2021-04-19 22:38:59 -04:00
Elena a383dd23aa [rocm-smi-lib] add HBM temperature conversion factor
Change-Id: I45339c87c3d2a40670baf1b76ada60dceb650dc0
2021-04-19 16:41:48 -04:00
Bill(Shuzhou) Liu 8eec0a7d36 Add energy accumulator counter
The energy accumulator counter tracks all energy consumed.

Change-Id: I5b25f817b7802d81c477361447f0ecd7ec02fc61
2021-04-14 10:43:01 -04:00
Bill(Shuzhou) Liu 9bfb9ac297 Add coarse grain utilization counter
The coarse grain utilization counter includes GFX and Memory activity.

Change-Id: I5d09976792d3f4a1c1081651fa24ff857016d4c0
2021-04-14 10:40:19 -04:00
Divya Shikre aaf2120117 Update performance determinism api as per the modified sysfs interface.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib0ec5128819644a2ff6c916da9194a7fe1dad795
2021-04-07 16:38:48 -04:00
Bill(Shuzhou) Liu da480b4589 Add support for the HBM temperature
rsmi_dev_temp_metric_get() can also report HBM
temperatures, which are retrieved from gpu_metrics.

Change-Id: I96b979296e90cf881523627b41b1a02849676416
2021-04-05 15:55:55 -04:00
Chris Freehill 5e2a4f3a15 Handle different gpu_metrics content versions for format v1
Change-Id: I344d1815da683befc8f8b5caf921803b267ae29f
2021-03-24 14:34:55 -05:00
Chris Freehill ce475b009c Adjust event counters to report only new events
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.

Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f
2021-02-24 11:02:17 -06:00
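
The bookkeeping described above (record the last value read, report the difference) might look like the sketch below. The class name `EventCounter` and the choice to take the baseline when the counter is created are assumptions for illustration; the commit does not say where the library records its first value.

```python
class EventCounter:
    """Turns a running total into a count of events since the previous read."""

    def __init__(self, read_total):
        self._read_total = read_total  # callable returning the running total
        self._previous = read_total()  # baseline (assumed to be taken at creation)

    def read_new_events(self) -> int:
        current = self._read_total()
        delta = current - self._previous
        self._previous = current  # remember this reading for the next call
        return delta


# Hypothetical running totals as they might come back from successive reads.
totals = iter([100, 130, 130, 175])
counter = EventCounter(lambda: next(totals))
deltas = [counter.read_new_events() for _ in range(3)]
```

With these made-up totals, the three reads report 30, 0 and 45 new events rather than the raw totals.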
Chris Freehill ff9546aa62 Don't use hwmon# as indicator of gpu
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between gpu nodes
and non-gpu nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is used as an
indicator that the node is indeed a gpu.

Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85
2021-01-29 13:05:10 -05:00
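
A rough sketch of the new discovery test is shown below. Only the file name `vbios_version` comes from the commit message; the function name `is_gpu_node` and the directory layout are hypothetical, and the demonstration uses a throwaway directory rather than a real sysfs tree.

```python
import os
import tempfile


def is_gpu_node(device_dir: str) -> bool:
    """Return True if the directory contains a vbios_version file."""
    return os.path.isfile(os.path.join(device_dir, "vbios_version"))


# Demonstration against temporary directories standing in for device nodes.
with tempfile.TemporaryDirectory() as root:
    for name in ("card0", "card1"):
        os.makedirs(os.path.join(root, name))
    # Only card0 gets a vbios_version file in this made-up layout.
    with open(os.path.join(root, "card0", "vbios_version"), "w") as f:
        f.write("placeholder\n")
    results = {name: is_gpu_node(os.path.join(root, name)) for name in ("card0", "card1")}
```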
Ori Messinger 80f629b9be ROCm SMI Python CLI & LIB: Add GPU Reset Functionality
The purpose of this patch is to implement GPU reset functionality
in the LIB, and to call it from the rocm_smi python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Iaf525f7016f8354a7fd93af0209ca2e97ef4fd56
2021-01-26 17:52:24 -05:00
Chris Freehill bb5132a66c Remove adding of bogus hwmon label entries
If we fail to find an expected temperature or voltage label
file, previously we were attempting to re-add a mapping of file
index to sensor types. Attempting to insert a map item that is already
present has no effect, so there should be no functional change.

This was a remnant of old code that should have been deleted.

Change-Id: Ie6f8a62f619a1ae58756e0fd891532434518cf78
2021-01-06 11:01:07 -05:00
Chris Freehill 68095b50e7 Introduce RSMI_DEBUG_INFINITE_LOOP
The environment variable RSMI_DEBUG_INFINITE_LOOP is introduced
to facilitate debugging RSMI in user applications. When this
env. variable is non-zero, an infinite loop will be entered in
rsmi_init(). At this point, a debugger can be attached and RSMI
can be debugged. This only applies to debug builds.

Change-Id: I23f6dd730fc965764295070de053314a1cc5b6aa
2021-01-06 10:30:24 -05:00
Divya Shikre efd234c9e3 Fix for error while reading gpu_metrics sysfs file
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: If69b7eeb3573ebece9ed0cb539f5ddffbe3c2f09
2020-12-18 15:31:16 -05:00
Divya Shikre 47ca37aef7 Fix for inconsistent GPU indexing between rocm-smi and rbt/hip.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I0d966c91bfe0f0d51859ff098d15011a3e4e8b29
2020-12-11 15:11:21 -05:00
Chris Freehill 6377e0258d Show more info in stderr when rsmi_init() fails
Some rsmi apps fail without much explanation when
rsmi_init() fails. This patch hopes to provide some clues to
the reason for the failure.

Change-Id: Id51308dc327b9871d537dd3e709b677db4ef10bc
2020-12-10 07:32:03 -05:00
Chris Freehill f4938b0ac9 Fix process killed while holding mutex
Previously, when a process holding a shared mutex was killed,
the next time an RSMI application was started, it would not be
able to obtain the mutex--the application would have to exit.
This fix uses pthread_mutexattr_setrobust() to detect this
situation and act accordingly.

Also, add some missing, needed mutexes and move mutexes
closer to where the protected resource is used.

Change-Id: Icfdc3a246f4cfa3fd008e3f13472199abd76fd35
2020-12-04 12:59:55 -05:00
Divya Shikre 60d0f3052f Adding Performance Determinism Mode to rocm_smi lib, CLI & gtest.
A special mode of operation that achieves minimal performance variation by letting
the user provide the desired frequency to be set as the soft limit.
The user controls entry to and exit from the mode via rocm-smi. The mechanism to
enter / exit performance determinism mode is as below.

Enter performance determinism mode:
- hold a lock
- write performance_determinism to power_dpm_force_performance_level
- write input clk_freq to pp_dpm_sclk
- release lock

Exit performance determinism mode:
- hold a lock
- write auto to power_dpm_force_performance_level
- release lock

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ia24e27954cdf1c4337ffc83d8948fbdfaf4552d2
2020-12-02 11:11:00 -05:00
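
The enter and exit sequences listed in this commit can be sketched as below. The file names and the values written come from the commit message; everything else is an assumption for illustration, including the function names, the use of an in-process `threading.Lock` as the lock, the plain-integer format of the frequency, and the temporary directory standing in for the device's sysfs directory. A later commit in this log updates the API for a modified sysfs interface, so this reflects only the sequence as described here.

```python
import os
import tempfile
import threading

# Stand-in for the lock mentioned in the commit message.
_device_lock = threading.Lock()


def _write(device_dir: str, name: str, value: str) -> None:
    with open(os.path.join(device_dir, name), "w") as f:
        f.write(value)


def _read(device_dir: str, name: str) -> str:
    with open(os.path.join(device_dir, name)) as f:
        return f.read()


def enter_perf_determinism(device_dir: str, clk_freq: int) -> None:
    with _device_lock:  # hold a lock; released when the block exits
        _write(device_dir, "power_dpm_force_performance_level", "performance_determinism")
        _write(device_dir, "pp_dpm_sclk", str(clk_freq))  # value format is assumed


def exit_perf_determinism(device_dir: str) -> None:
    with _device_lock:
        _write(device_dir, "power_dpm_force_performance_level", "auto")


# Exercise both sequences against a temporary directory, not real hardware.
with tempfile.TemporaryDirectory() as fake_device_dir:
    enter_perf_determinism(fake_device_dir, 1000)
    level_after_enter = _read(fake_device_dir, "power_dpm_force_performance_level")
    sclk_after_enter = _read(fake_device_dir, "pp_dpm_sclk")
    exit_perf_determinism(fake_device_dir)
    level_after_exit = _read(fake_device_dir, "power_dpm_force_performance_level")
```

Holding the lock across both writes keeps another caller from observing the performance level changed but the clock not yet set.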
Chris Freehill 63064b0000 Quiet address sanitizer warnings
Also,
* Fix some doxygen issues
* Fix address sanitizer issues in rsmitst

Change-Id: Ie6c6fd9af5c418210b7064e79650fb92cd4a5e2b
2020-11-10 14:16:39 -06:00
Ramesh Errabolu 328878343c Update ROCm SMI library with ability to read CU occupancy
Change-Id: Ib9882fa2d81c13604af282279bfa116bc2fd05a4
2020-10-14 09:33:37 -04:00
Chris Freehill 5465d872aa Revert "Revert "Support for RSMI_EVNT_GRP_XGMI_DATA_OUT counters""
This reverts commit ae6d3fbdd0.

Change-Id: Ic412a64d35aab74caf12bf4c791f0a66ac15b061
2020-10-08 10:36:30 -04:00
Kent Russell e350278b68 Remove extraneous mutexes
We already grab the mutex before getting the device name, so we don't
need to grab it again

Change-Id: Ib627ba3a39c485f6069af052cfd3e6c522873d43
2020-10-08 07:55:07 -04:00
Chris Freehill ae6d3fbdd0 Revert "Support for RSMI_EVNT_GRP_XGMI_DATA_OUT counters"
This reverts commit 946bf93dfb.

Temporarily reverting until the driver side of this is upstream

Change-Id: I2d8243208c1271ebad90bc2ee0fda2dfefb0831b
2020-10-07 18:42:56 -04:00
Kent Russell df7c3434cd Check FRU-based product information if available
WKS and server cards have an FRU with product information, so try to use
that for product name and product SKU if it exists.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I40bbd3bf62f4cb02e96015ed1630112691cacbc3
2020-10-07 14:09:23 -04:00
Chris Freehill c6f02b4d62 Fail gracefully if drm directory is not found
Change-Id: I0f3ab2721108355752caf0280124469b98af4967
2020-10-05 21:12:11 -04:00
Chris Freehill 946bf93dfb Support for RSMI_EVNT_GRP_XGMI_DATA_OUT counters
Also some format fixes

Change-Id: Id3c0f6b3cf5b327bb9ca6acb6091dc67764c8032
2020-10-05 17:22:19 -05:00
Divya Shikre 8b48564ce3 Adding functionality that will parse gpu_metrics sysfs file
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I3a84870b83eb4cd0ed46f10bb19169c91f99fd8e
2020-10-02 10:25:41 -04:00
Mukul Joshi fb2ed24372 Use correct string conversion function for VRAM and SDMA usage
VRAM and SDMA usage can be 64-bit long numbers. Use stoull()
instead of stoi() to convert the VRAM and SDMA usage strings to
numbers.

Change-Id: Ifadbada9f33320fc67666036ce8439823c1d1fb7
2020-09-21 12:28:22 -04:00
Mukul Joshi 406859ca8a Update KFD SMI event notification handling
The event bitmask in the KFD SMI event is now replaced with an event index in
the SMI event message. Sending an event bitmask, which was a 64-bit
field with only 1 bit set, was quite wasteful of memory and also
potentially limited the interface to 64 events. Instead, the kernel now sends
the event index in the SMI event message. As a result, update the
KFD SMI event handling to expect the event index in the message.

Change-Id: I3e74620788d3c1f7c0bdaa69e9d9ab3d1aba2c92
2020-09-16 13:24:50 -04:00
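
The trade-off between the two encodings can be illustrated with a small sketch. The function names are hypothetical, and the numbering convention (events counted from 1, so index i corresponds to bit i - 1) is an assumption made for the example, not something stated in the commit.

```python
def bitmask_from_index(index: int) -> int:
    """Encode an event index as a 64-bit mask with exactly one bit set."""
    if not 1 <= index <= 64:
        raise ValueError("a 64-bit mask can only encode events 1 through 64")
    return 1 << (index - 1)


def index_from_bitmask(mask: int) -> int:
    """Recover the event index from a mask that has exactly one bit set."""
    if mask <= 0 or mask & (mask - 1):
        raise ValueError("expected exactly one bit set")
    return mask.bit_length()


round_trip = [index_from_bitmask(bitmask_from_index(i)) for i in (1, 5, 64)]
```

An index of 65 cannot be encoded as a mask at all, which is the 64-event limit the commit mentions; sending the index directly avoids it.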
Chris Freehill b015052a07 Make sure all sensor labels have valid mappings
There may not be label files for some sensors on older
devices. We need to make sure there is a valid dummy
mapping in these cases.

Change-Id: Id6a8b71e554552be84a0e42a477070b504151e7f
2020-09-11 17:32:54 -05:00
Divya Shikre 54d4b9d500 Adding setsrange, setmrange, setvc, setslevel and setmlevel functionality to rocm lib and cli
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I5fd65ea7bcd5403aaf2e42d2aa28d837929da253
2020-09-08 18:42:39 -04:00
Chris Freehill c2439d28e8 Correct usage of bitwise &
Also, fix warning related to catch() and cpplint error.

Change-Id: I4292170538d0f700fccb605814c5058543abe74a
2020-07-26 20:08:24 -05:00
Chris Freehill 52514835f0 Update xgmi event counter documentation
Also:
* fix doxygen manual generation that was altered during
  OAM refactor
* quiet some compile warnings.

Change-Id: I548a3cf00eb887bea3dbf58e362ca6dfe90bde28
2020-07-16 17:42:56 -05:00
Mukul Joshi eea1ed8c3d Add support to retrieve process SDMA usage information.
Also, print SDMA usage information in TestProcInfoRead.

Change-Id: I8d19be3b8653e298c81237e5067eca75a1743e70
2020-07-13 17:32:08 -04:00
Chris Freehill 68155baed5 Handle un-readable kfd properties files
Some systems have kfd sysfs properties entries that
are unreadable--for example, when a multi-gpu system is
dividing the gpus among containers, each container may
only be able to access certain gpus.

Previously, all kfd topology node properties entries were
assumed to be valid. Now, we check for readability before
declaring them "valid".

Fixes SWDEV-240169

Also:
* remove an assertion that would happen when no pcie
device identifier files are found on the system.
* fix cpplint issues

Change-Id: I74321b685159dd2628c890b33c39ad82988cb9dd
2020-07-10 12:35:31 -04:00
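
The readability check described above might be sketched like this. The function name `readable_nodes` and the directory layout are hypothetical, and the demonstration uses a missing file as a stand-in for an entry the container is not permitted to read.

```python
import os
import tempfile


def readable_nodes(node_dirs):
    """Return the node directories whose properties file can actually be read."""
    valid = []
    for node in node_dirs:
        try:
            with open(os.path.join(node, "properties")) as f:
                f.read()
        except OSError:
            continue  # unreadable or missing: skip it instead of assuming it is valid
        valid.append(node)
    return valid


# Temporary directories standing in for two topology nodes.
with tempfile.TemporaryDirectory() as root:
    visible = os.path.join(root, "0")
    hidden = os.path.join(root, "1")
    os.makedirs(visible)
    os.makedirs(hidden)  # no properties file here
    with open(os.path.join(visible, "properties"), "w") as f:
        f.write("placeholder 1\n")
    valid_names = [os.path.basename(p) for p in readable_nodes([visible, hidden])]
```

Attempting the read, rather than only checking that the entry exists, is what distinguishes a usable node from one that is merely listed.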
Chris Freehill c2ef9a6879 Fix docs + cmake_utils path issues
This corrects issues that arose after OAM reorganization.
It should address SWDEV-243294.

Also, fix some compile warnings that show up on RHEL.

Change-Id: Id14d444905da35cd7346bcfbcd82b6d0572708c4
2020-07-08 09:47:25 -05:00