Commit Graph

53 Commits

Author SHA1 Message Date
Divya Shikre 686e6ac654 Add fix to show usage of setperfdeterminism functionality in --help command
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ife93c887eea2a9aae69f2923dba45c7cde4838d3
2021-05-12 17:29:37 -04:00
Kent Russell 242d94a668 rocm_smi.py: Fix gpu reset error
Since device is a list, we need to pass a single item to the isAmdGpu
function.

Fixes: c7c2ac5559 "rocm_smi.py: Don't try to reset non-AMD GPUs"

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I19a74377636ff4589f11d092f41e1d35c1acb307
2021-04-28 07:44:55 -04:00
Kent Russell b931380f02 rocm_smi.py: Don't try to print absent clock files
Instead of throwing "Unsupported clock" errors for ASICs that don't
support a certain clock type (e.g. dcefclk on MI-series), just dump the
warning to logging.debug and don't try to read the clock

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: If3cb9a472b03aa535a76fc24bcd9f77122090634
2021-04-23 10:19:04 -04:00
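A minimal sketch of the behavior described in b931380f02, assuming the clock levels live in sysfs files named pp_dpm_<clock> under a per-device directory. The function name read_clock_levels and the device_dir parameter are hypothetical and are not taken from rocm_smi.py.

```python
import logging
import os


def read_clock_levels(device_dir, clock):
    """Return the lines of pp_dpm_<clock>, or None if the file is absent."""
    # Assumption: one sysfs file per clock type, e.g. pp_dpm_sclk.
    path = os.path.join(device_dir, "pp_dpm_" + clock)
    if not os.path.isfile(path):
        # Not an error: this ASIC simply does not expose the clock,
        # so log at debug level and skip the read entirely.
        logging.debug("Clock file %s not present; skipping %s", path, clock)
        return None
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]
```

Callers can treat a None result as "clock not supported" and omit that clock from the output instead of reporting a failure.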
Ori Messinger b71e07b3fb rocm_smi.py: Show 'Out of Spec' warning only if required
Use default power cap exposed via sysfs to determine when to
show 'Out of Spec' warning.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I0fa3612b50e230856b0d5a390f876b35268d9587
2021-04-22 14:44:05 -04:00
Ori Messinger a9e7e5a475 ROCm SMI Python CLI: Add showevent Functionality
Implement showevent functionality in the ROCm SMI Python CLI.

It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62
2021-04-22 10:21:07 -04:00
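A hedged sketch of how an option like --showevents (a9e7e5a475) could be parsed with argparse. The helper parse_events and the comma handling are assumptions made for illustration; the actual CLI may parse its arguments differently.

```python
import argparse

# Event names listed in the commit message.
VALID_EVENTS = ("VM_FAULT", "THERMAL_THROTTLE", "GPU_RESET")


def parse_events(argv):
    """Parse --showevents and return the normalized list of event names."""
    parser = argparse.ArgumentParser(prog="rocm-smi")
    parser.add_argument("--showevents", nargs="+", metavar="EVENT", default=[])
    args = parser.parse_args(argv)

    events = []
    for token in args.showevents:
        # Accept both "A, B" (separate tokens with trailing commas)
        # and "A,B" (a single comma-joined token).
        for part in token.split(","):
            part = part.strip().upper()
            if part:
                events.append(part)

    invalid = [e for e in events if e not in VALID_EVENTS]
    if invalid:
        parser.error("unsupported event type(s): " + ", ".join(invalid))
    return events
```

With this sketch, an unknown event name ends the program through argparse's usual usage error rather than being silently ignored.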
Elena c80fc54500 [rocm_smi.py] add energy counter
--showenergycounter

Signed-off-by: Elena Sakhnovitch
Change-Id: Iede0f2b06523f7cb2719489a883e9c49722f8d93
2021-04-21 18:40:19 -04:00
Elena 771b4af95c [rocm_smi.py] Coarse Grain Utilization Counters
--showuse
--showmemuse

====================================
========= % time GPU is busy =======
GPU[0]          : GPU use (%): 0
GPU[0]          : GFX Activity: 0
====================================

Change-Id: I9db115ad78b394469206b22d195781a430b2f1d8
2021-04-21 17:23:21 -04:00
Harish Kasiviswanathan 1c9e384c8f Suppress warning message in getFanSpeed function
Many data center cards are fanless. Don't show warning if unable to get
fan speed. The fan speed will be reported as 0

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df
2021-04-21 15:29:44 -04:00
Divya Shikre 56c132873b Update setrange functionality in CLI
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ic942bd76297c50caf189bfc0972d30dc42d91f32
2021-04-20 15:39:05 -04:00
Divya Shikre dc431506f5 Add support for mi200 clocks being continuous.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifb7570054572239b9f48eaefe51e879fb3569031
2021-04-20 13:12:27 -04:00
Divya Shikre d9f7bd0ff4 Fix for cli errors - extra args in perf_determinism, undefined variable in setClocks
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Id138cfcbea4384f520537cc045d358024177b1ac
2021-04-19 17:32:07 -04:00
Elena 81c066350f Adding 4 new HBM temperature sensors.
Signed-off-by: Elena Sakhnovitch
Change-Id: Iaea04c38e8c2353e85d8aa2b871fdb82727157de
2021-04-17 23:58:49 -04:00
Kent Russell c7c2ac5559 rocm_smi.py: Don't try to reset non-AMD GPUs
This won't work for obvious reasons, so exit with an error instead of
trying to access a file that doesn't exist and segfaulting

Change-Id: Id1230922fa6e9a19e9394280faad88a43c7d2e34
2021-04-13 08:00:17 -04:00
Divya Shikre aaf2120117 Update performance determinism api as per the modified sysfs interface.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib0ec5128819644a2ff6c916da9194a7fe1dad795
2021-04-07 16:38:48 -04:00
Chris Freehill 11440536cf Handle set freq for double-digit index in rocm_smi.py
rocm_smi.py --set<m|s>clk was treating the freq as a string.
This causes problems in parsing when the index is more than 1
digit. Now, treat the indexes as integers.

Change-Id: Ia0d859d33b685fe90689a86ff1c83980808b1514
2021-02-23 18:51:29 -06:00
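To illustrate the problem fixed in 11440536cf: iterating over the string "10" yields the characters "1" and "0", so a two-digit level is misread as two single-digit levels. The sketch below converts each argument to an integer first. The helper names are hypothetical, and combining the levels into a bitmask is an assumption about how they are consumed.

```python
def parse_levels(args):
    """Convert CLI level arguments such as ['0', '10', '12'] to integers."""
    levels = []
    for arg in args:
        if not arg.isdigit():
            raise ValueError("invalid clock level: %r" % arg)
        levels.append(int(arg))
    return levels


def levels_to_mask(levels):
    """Set one bit per requested level (assumed bitmask representation)."""
    mask = 0
    for level in levels:
        mask |= 1 << level
    return mask
```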
Ori Messinger 20e2d260fb ROCm SMI Python CLI: Fix Lower Power Cap Warning
The purpose of this patch is to fix a power cap bug for --setpoweroverdrive.
This bug occurs when the user attempts to set a lower wattage than the current
or default wattage, which displays an unnecessary warning message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I730d2c6031b7d7c4af5acf32ecd28da5ca21ab12
2021-01-27 03:24:22 -05:00
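One possible reading of the fix in 20e2d260fb, sketched under the assumption that a warning is only warranted when the requested cap exceeds the default cap. The function name and the exact comparison are hypothetical; the commit message does not spell out the condition.

```python
def should_warn_power_cap(requested_watts, default_watts):
    """Return True only when the requested cap is above the default cap."""
    # Lowering the cap (or keeping it at the default) is assumed to be
    # harmless, so it should not trigger a warning.
    return requested_watts > default_watts
```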
Ori Messinger 80f629b9be ROCm SMI Python CLI & LIB: Add GPU Reset Functionality
The purpose of this patch is to implement GPU reset functionality
in the LIB, and to call it from the rocm_smi python CLI.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Iaf525f7016f8354a7fd93af0209ca2e97ef4fd56
2021-01-26 17:52:24 -05:00
Ori Messinger 4f297bdeb3 ROCm SMI Python CLI: Fix Fan Speed Bug
The purpose of this patch is to fix a fan speed bug for --showfan.
This bug occurs when the current and/or maximum fan speeds are not
found by the LIB, which displayed an unclear error message.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ied06e460f22391238dd2d86572813e2a5a64f45b
2021-01-26 08:51:04 -05:00
Kent Russell a902770f86 Fix typo in --setmrange documentation
mrange is for MCLK, not SCLK, so fix the typo accordingly

Change-Id: Ib20774b073288a8ec193322f2f767616979c95da
2021-01-25 13:20:20 -05:00
Elena 61cdfff562 ROCm SMI Python CLI: Fix division by zero fan bug
Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: If259ac1ad6d77ce85b2b7616d972b6e7964a9f78
2021-01-20 18:21:23 -05:00
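A minimal sketch of a guard against the division by zero named in 61cdfff562, assuming the fan percentage is computed as current speed over maximum speed. The function name and the choice to return 0.0 when no maximum is available are assumptions, not the patch itself.

```python
def fan_percent(current, maximum):
    """Return fan speed as a percentage of the maximum, or 0.0 if unknown."""
    if not maximum:
        # A maximum of 0 (or None) would otherwise raise ZeroDivisionError.
        return 0.0
    return round(100.0 * current / maximum, 1)
```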
Kent Russell c7b6b47211 rocm-smi: Try to find the librocm_smi64.so in a few locations
Instead of looking solely in ../lib, try looking in any /opt folder as a
backup option. This is a little more robust and hopefully leads to fewer
issues trying to find the lib

Change-Id: Ie0d3944b48b32d9965917e5c831388838b6d4ef7
2021-01-08 15:29:11 -05:00
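The lookup order described in c7b6b47211 could be sketched as follows. The /opt/*/lib layout, the function names, and the injectable exists parameter are assumptions for illustration; only ../lib and "any /opt folder" come from the commit message.

```python
import glob
import os


def candidate_lib_paths(script_dir, lib_name="librocm_smi64.so"):
    """List places to look for the library, most preferred first."""
    # First choice: ../lib relative to the script.
    candidates = [os.path.join(script_dir, "..", "lib", lib_name)]
    # Backup: any installation directory under /opt (assumed layout).
    candidates += sorted(glob.glob(os.path.join("/opt", "*", "lib", lib_name)))
    return candidates


def find_lib(script_dir, exists=os.path.isfile):
    """Return the first candidate that exists, or None if none do."""
    for path in candidate_lib_paths(script_dir):
        if exists(path):
            return os.path.normpath(path)
    return None
```

Returning None lets the caller print a clear "library not found" message instead of failing inside the loader.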
Ori Messinger 3b52c895cc ROCm SMI Python CLI: Fix --showclkfrq/--showclocks Failure
The purpose of this patch is to check if each valid clock is supported
on the GPU before attempting to retrieve its value.

The valid clocks are: dcefclk, fclk, mclk, pcie, sclk, socclk.

This should get rid of the 'one or more commands failed' message when
running --showclkfrq or --showclocks on a machine that doesn't support
all the possible valid clocks.

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I1fb10989fc1a36f38b68a23e17e6e600ed0ac85b
2020-12-18 17:46:23 -05:00
Ori Messinger d9da490214 ROCm SMI Python CLI: Add Json Functionality to showPids
The purpose of this patch is to add Json functionality to showPids
by modifying the print2DArray function to use printSysLog.

Change-Id: Ie834d209b29332777c3f13f776f81c37d94b01b6
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-12-18 02:16:04 -05:00
Chris Freehill db6d8d36ea Make rocm_smi.py handle disappearing PIDs
rocm_smi.py had an issue where it gets process information
in 2 different places. If the process disappears in between
those 2 places, a crash would occur.

This fix gracefully returns in this scenario.
Reading the file information from /proc instead of using
the python subProcess() call was considered, but it has the
drawback of not being able to read the process names of
processes not owned by the caller.

Change-Id: If812c4641f00da37e99defb0740f670107c8a797
2020-12-10 20:53:45 -06:00
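A hedged sketch of the graceful return described in db6d8d36ea, assuming the process name is obtained by running ps in a subprocess. The function name and the injectable run parameter are hypothetical.

```python
import subprocess


def get_process_name(pid, run=subprocess.check_output):
    """Return the command name for pid, or None if the process is gone."""
    try:
        # "comm=" prints only the command name, with no header line.
        out = run(["ps", "-p", str(pid), "-o", "comm="])
    except (subprocess.CalledProcessError, OSError):
        # ps exits non-zero when the PID no longer exists; treat that
        # as "process disappeared" rather than crashing.
        return None
    name = out.decode().strip()
    return name or None
```

A caller that collects PIDs in one place and names in another can skip any PID for which this returns None.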
Divya Shikre a0d10e021b Fix for syntax error caused due to performance determinism commit.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I02fbfec667e7f96ab0d0662036cf339a56025ba6
2020-12-02 16:31:01 -05:00
Divya Shikre 60d0f3052f Adding Performance Determinism Mode to rocm_smi lib, CLI & gtest.
A special mode of operation that achieves minimal performance variation by letting
the user provide the desired frequency to be set as the soft limit.
The user controls entry to and exit from the mode via rocm-smi, which provides a
mechanism to enter / exit performance determinism mode as below.

Enter performance determinism mode:
- hold a lock
- write performance_determinism to power_dpm_force_performance_level
- write input clk_freq to pp_dpm_sclk
- release lock

Exit performance determinism mode:
- hold a lock
- write auto to power_dpm_force_performance_level
- release lock

Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ia24e27954cdf1c4337ffc83d8948fbdfaf4552d2
2020-12-02 11:11:00 -05:00
Ori Messinger c0c1fd2098 ROCm SMI Python CLI: Fix --gpureset Bug
The purpose of this patch is to fix a bug present when using the
--gpureset option on a machine that has both an AMD GPU and a
non-AMD GPU (such as a motherboard's integrated graphics).

This bug occurs due to non-AMD GPUs being ignored by the LIB when
enumerating a list of valid AMD GPUs, causing the gpuReset method
to attempt a reset on the integrated graphics.

Change-Id: I1c03a3c41f905786e3c8246ec0c2b42786ff1770
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-11-25 11:21:36 -05:00
Ori Messinger 015c7d59d0 ROCm SMI Python CLI: GPU showproductname SKU Fix
The purpose of this patch is to fix a bug present when using the
--showproductname option, resulting in the following error:
undefined symbol: rsmi_dev_sku_get

This bug fix uses a substring from vbios version instead of using the
LIB's rsmi_dev_sku_get to avoid getting the undefined symbol error.

Change-Id: I56d72a481d5dde44c56106ae297f4bcff40ac15f
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-11-12 15:38:52 -05:00
Chris Freehill 438d28612f Use relative path to find librocm_smi
Change-Id: Ifca3f54d680a802c1c5fa360d17e64338b9ac9a8
2020-10-29 14:36:48 -05:00
Elena Sakhnovitch 4117719edd ROCm SMI Python CLI: --rasinject partial support
This implementation is copied directly from the previous rocm_smi.py
script; This feature is experimental and will be updated or removed with
future releases.

Signed-off-by: Elena Saknovitch
Change-Id: I5cd38266946302bc4123aeafaa825e13f704235e
2020-10-22 17:22:13 -04:00
Chris Freehill 1982fdc4fb Add new XGMI counter events to rsmiBindings.py
Also, correct RSMI_EVNT_LAST to new value.

Change-Id: I9f693cb398bba583201f6b5b5f0e2d45ede2e4e0
2020-10-22 17:21:50 -04:00
Divya Shikre 94291bf882 Fix for weight/hops not being updated
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I333d49fa011b85d41eca63c082c0615febe2f7e9
2020-10-20 15:01:06 -04:00
Ori Messinger 20ae72b078 ROCm SMI Python CLI: Add CU Occupancy to showPids function
The purpose of this patch is to add CU occupancy functionality to showPids
by calling rsmi_compute_process_info_get from the LIB.

Now showPids shows the following information on (KFD compute) processes:
PID, process name, GPU(s), VRAM used, SDMA used, and CU occupancy.

Change-Id: Ie005901e0eb946ef0fbb3523245ca451c4eed595
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-10-15 21:21:32 -04:00
Ramesh Errabolu 328878343c Update ROCm SMI library with ability to read CU occupancy
Change-Id: Ib9882fa2d81c13604af282279bfa116bc2fd05a4
2020-10-14 09:33:37 -04:00
Ori Messinger e3c9aec714 ROCm SMI Python CLI: Check for amdgpu Driver Initialization
The purpose of this patch is to check for amdgpu driver initialization
before attempting to initialize rocmsmi in the CLI.

Additionally, since the '--help' functionality does not rely on anything
external to the CLI, it can now be called without the driver initialized.

Change-Id: I2fcce60ca6d9f77835549e3558c4bb1747499c5c
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-10-08 11:17:45 -04:00
Kent Russell df7c3434cd Check FRU-based product information if available
WKS and server cards have an FRU with product information, so try to use
that for product name and product SKU if it exists.

Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I40bbd3bf62f4cb02e96015ed1630112691cacbc3
2020-10-07 14:09:23 -04:00
Ori Messinger 4ed1c1d492 ROCm SMI Python CLI: Implement --setclock for all Valid Clocks
The purpose of this patch is to implement --setclock functionality for
all of the valid clocks (can be set with --setclock TYPE LEVEL).

The valid clocks are: dcefclk, fclk, mclk, pcie, sclk, socclk.
This functionality uses the existing 'setClocks' method.

Change-Id: I1d62baf372427ac1c0642c26a949663b673ef335
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-09-22 15:41:51 -04:00
Chris Freehill 8f9f9433d8 Enable library-based rocm_smi.py
Change-Id: I5443308905456defc9818fac07ac2f20fe9426fd
2020-09-16 09:31:30 -05:00
Elena Sakhnovitch 91f8fcb7b1 ROCm SMI CLI: Add JSON support for topo functions
-Add divider between devices for --showclocks to increase readability.
-Fix fan rounding error
-Fix spaces to comply with coding standard
-Fix @param description error in topo functions
-JSON result for topology:
{
  "card0": {
    "(Topology) Numa Node": "0",
    "(Topology) Numa Affinity": "4294967295"
  },
  "card1": {
    "(Topology) Numa Node": "0",
    "(Topology) Numa Affinity": "4294967295"
  },
  "system": {
    "(Topology) Weight between DRM devices 0 and 1": "40",
    "(Topology) Hops between DRM devices 0 and 1": "2",
    "(Topology) Link type between DRM devices 0 and 1": "PCIE"
  }
}

Signed-off-by: Elena Sakhnovitch <Elena.Sakhnovitch@amd.com>
Change-Id: I711c100362826ed729ff90edd407009237d64f8f
2020-09-10 12:57:14 -04:00
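A small sketch of how output shaped like the JSON in 91f8fcb7b1 could be assembled. The function topology_json and its input structures are hypothetical; only the key names and the string-valued fields mirror the sample in the commit message.

```python
import json


def topology_json(numa, links):
    """Build the topology JSON from per-card and per-link data.

    numa:  {"card0": (numa_node, numa_affinity), ...}
    links: {(src, dst): (weight, hops, link_type), ...}
    """
    result = {}
    for card, (node, affinity) in numa.items():
        result[card] = {
            "(Topology) Numa Node": str(node),
            "(Topology) Numa Affinity": str(affinity),
        }
    system = {}
    for (src, dst), (weight, hops, link_type) in links.items():
        pair = "between DRM devices %d and %d" % (src, dst)
        system["(Topology) Weight " + pair] = str(weight)
        system["(Topology) Hops " + pair] = str(hops)
        system["(Topology) Link type " + pair] = str(link_type)
    result["system"] = system
    return json.dumps(result, indent=2)
```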
Elena Sakhnovitch edcae88fe9 Add README.md starter file
signed-off-by: Elena Sakhnovitch
Change-Id: I677b7d643c6559693c5ad627b704ee36631cc32e
2020-09-10 11:09:42 -04:00
Elena Sakhnovitch 8b82621e72 ROCm SMI Python CLI: Implement --showbw
PCIE bandwidth functionality

Signed-off-by: Elena Sakhnovitch
Change-Id: I5a9ddc589846b6032739d491319078ead5723a27
2020-09-09 14:52:58 -04:00
Harish Kasiviswanathan f1786a3095 Don't hard code rocm_smi_lib path
During rocm_smi_lib installation the path should be set using ldconfig

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I0cab18f492013b783d1ce632591ce295f934a168
2020-09-08 19:29:09 -04:00
Divya Shikre 54d4b9d500 Adding setsrange, setmrange, setvc, setslevel and setmlevel functionality to rocm lib and cli
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I5fd65ea7bcd5403aaf2e42d2aa28d837929da253
2020-09-08 18:42:39 -04:00
Ori Messinger 95d43e30e3 ROCm SMI Python CLI: Implement show/set mclk OverDrive
The purpose of this patch is to implement show and set mclk OverDrive.
This implementation is copied directly from the previous rocm_smi.py
script since this functionality is mostly deprecated.

Change-Id: I705430f873a73f954b6812c222a385ff4e9b6eb2
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-09-08 14:24:11 -04:00
Ori Messinger 2d59d0877b ROCm SMI Python CLI: Implement Valid Clocks
The purpose of this patch is to implement the remaining valid clocks.
The valid clocks are: dcefclk, fclk, mclk, pcie, sclk, socclk
This functionality is needed for the 'setClocks' method.

Change-Id: Ie648fb29dbbd61f0f064d4462ac566911f1ca2aa
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
2020-09-02 06:40:59 -04:00
Divya Shikre d1f4c252b0 Adding voltage range functionality to rocm cli
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I9288c0c6cda2a984c34cfd2570deec640b6c9f0d
2020-08-28 12:04:36 -04:00
Divya Shikre 49734f8d34 Adding logic to skip the loop if src and dest device are the same in HW Topology.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ib9cfbf5a7238ba75f6463e8fa6250bb9946b7979
2020-08-20 10:44:28 -04:00
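The skip described in 49734f8d34 amounts to ignoring the diagonal when walking all source and destination pairs. A minimal sketch, with a hypothetical function name:

```python
def device_pairs(devices):
    """Return every (src, dest) pair of distinct devices."""
    pairs = []
    for src in devices:
        for dest in devices:
            if src == dest:
                # A device has no meaningful link to itself; skip it.
                continue
            pairs.append((src, dest))
    return pairs
```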
Harish Kasiviswanathan 9f5d4a698e Update rsmi_process_info_t with sdma_usage field
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: Ie326e75674127a2e13f17fac344e2b672e877ce1
2020-08-19 17:54:15 -04:00
Divya Shikre 1276e4b9e9 Adding gpu reset functionality to rocm cli
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ifc0a239e8e8046fd7f56893d0101e0866cc3185f
2020-08-19 13:37:47 -04:00
Divya Shikre 2e8dc4f2a9 Adding Sdma Usage to showpids
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I72a9e1adc61eba382f1ac17c8e50b2a8bd6d6898
2020-08-14 12:12:34 -04:00