ROCm SMI Documentation Reorg
Change-Id: I3e4db2c50a43a51eeea4d3e06ba4811ad1859368
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Committed by: Charis Poag
Parent: 10f3c2325c
Commit: 2fd36e33ad
+1
-88
@@ -1,92 +1,5 @@
|
||||
# ROCm System Management Interface (ROCm SMI) Library
|
||||
|
||||
The ROCm System Management Interface Library, or ROCm SMI library, is part of the Radeon Open Compute [ROCm](https://github.com/RadeonOpenCompute) software stack. It is a C library for Linux that provides a user space interface for applications to monitor and control GPU applications.

For additional information, refer to the [ROCm Documentation](https://rocm.docs.amd.com/projects/rocm_smi_lib/en/latest/).
|
||||
|
||||
## DISCLAIMER
|
||||
|
||||
The information contained herein is for informational purposes only, and is subject to change without notice. In addition, any stated support is planned and is also subject to change. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein.
|
||||
|
||||
© 2022-2024 Advanced Micro Devices, Inc. All Rights Reserved.
|
||||
|
||||
## Planned Deprecation Notice
ROCm System Management Interface (ROCm SMI) Library is planned to be ***deprecated***. Release date to be announced soon. Please start migrating to AMD SMI.

- Documentation: [https://rocm.docs.amd.com](https://rocm.docs.amd.com/projects/amdsmi/en/latest/)
- GitHub: [https://github.com/ROCm/amdsmi](https://github.com/ROCm/amdsmi)
|
||||
|
||||
## Installation
|
||||
|
||||
### Install amdgpu using ROCm
|
||||
* Install amdgpu driver:
|
||||
See the example below; your release and link may differ. Running `amdgpu-install --usecase=rocm` updates the amdgpu driver and installs the ROCm SMI packages on your device.
|
||||
```shell
sudo apt update
wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
sudo apt install ./amdgpu-install_6.0.60002-1_all.deb
sudo amdgpu-install --usecase=rocm
```
|
||||
* Verify the installation by running `rocm-smi --help`.
|
||||
|
||||
## Building ROCm SMI
|
||||
|
||||
### Additional Required software for building
|
||||
|
||||
In order to build the ROCm SMI library, the following components are required. Note that the software versions listed are what was used in development. Earlier versions are not guaranteed to work:
|
||||
|
||||
* CMake (v3.5.0)
|
||||
* g++ (5.4.0)
|
||||
|
||||
In order to build the latest documentation, the following are required:
|
||||
|
||||
* Python 3.8+
|
||||
* NPM (sass)
|
||||
|
||||
The source code for ROCm SMI is available on [Github](https://github.com/RadeonOpenCompute/rocm_smi_lib).
|
||||
|
||||
After the ROCm SMI library git repository has been cloned to a local Linux machine, building the library is achieved by following the typical CMake build sequence. Specifically,
|
||||
|
||||
```shell
mkdir -p build
cd build
cmake ..
make -j $(nproc)
# Install library file and header; default location is /opt/rocm
make install
```
|
||||
|
||||
The built library will appear in the `build` folder.
|
||||
|
||||
To build the rpm and deb packages, follow the steps above and then run:
|
||||
|
||||
```shell
make package
```
|
||||
|
||||
#### Documentation
|
||||
|
||||
The following is an example of how to build the docs:
|
||||
|
||||
```shell
python3 -m venv .venv
.venv/bin/python3 -m pip install -r docs/sphinx/requirements.txt
.venv/bin/python3 -m sphinx -T -E -b html -d docs/_build/doctrees -D language=en docs docs/_build/html
```
|
||||
|
||||
#### Building the Tests
|
||||
|
||||
In order to verify the build and capability of ROCm SMI on your system and to see an example of how ROCm SMI can be used, you may build and run the tests that are available in the repo. To build the tests, follow these steps:
|
||||
|
||||
```bash
mkdir build
cd build
cmake -DBUILD_TESTS=ON ..
make -j $(nproc)
```
|
||||
|
||||
To run the test, execute the program `rsmitst` that is built from the steps above.
|
||||
|
||||
## Usage Basics
|
||||
## Use C++ in ROCm SMI
|
||||
|
||||
### Device Indices
|
||||
|
||||
|
||||
@@ -1,2 +0,0 @@
|
||||
```{include} ../README.md
|
||||
```
|
||||
@@ -0,0 +1,47 @@
|
||||
|
||||
.. meta::
|
||||
:description: Using ROCm SMI
|
||||
:keywords: install, SMI, library, api, AMD, ROCm
|
||||
|
||||
|
||||
Using C++ in ROCm SMI
|
||||
*********************
|
||||
|
||||
Device indices
|
||||
---------------
|
||||
|
||||
Many of the functions in the library take a "device index". The device index is a number greater than or equal to 0, and less than the number of devices detected, as determined by `rsmi_num_monitor_devices()`. The index is used to distinguish the detected devices from one another. It is important to note that a device may end up with a different index after a reboot, so an index should not be relied upon to be constant over reboots.
|
||||
|
||||
Hello ROCm SMI
|
||||
================
|
||||
|
||||
The only required ROCm-SMI call for any program that wants to use ROCm-SMI is the `rsmi_init()` call. This call initializes some internal data structures that will be used by subsequent ROCm-SMI calls.
|
||||
|
||||
When ROCm-SMI is no longer being used, `rsmi_shut_down()` should be called. This provides a way to do any releasing of resources that ROCm-SMI may have held. In many cases, this may have no effect, but may be necessary in future versions of the library.
|
||||
|
||||
A simple "Hello World" type program that displays the device ID of detected devices would look like this:
|
||||
|
||||
.. code-block:: c

    #include <stdint.h>
    #include "rocm_smi/rocm_smi.h"

    int main() {
      rsmi_status_t ret;
      uint32_t num_devices;
      uint16_t dev_id;

      // We will skip return code checks for this example, but it
      // is recommended to always check this as some calls may not
      // apply for some devices or ROCm releases

      ret = rsmi_init(0);
      ret = rsmi_num_monitor_devices(&num_devices);

      // Use an unsigned index so it matches the type of num_devices
      for (uint32_t i = 0; i < num_devices; ++i) {
        ret = rsmi_dev_id_get(i, &dev_id);
        // dev_id holds the device ID of device i, upon a
        // successful call
      }
      ret = rsmi_shut_down();
      return 0;
    }
|
||||
@@ -0,0 +1,454 @@
|
||||
|
||||
# Using Python in ROCm SMI
|
||||
|
||||
This tool acts as a command line interface for manipulating and monitoring the amdgpu kernel, and is intended to replace and deprecate the existing rocm_smi.py CLI tool.
It uses ctypes to call the rocm_smi_lib API.

Recommended: At least one AMD GPU with the ROCm driver installed

Required: ROCm SMI library installed (librocm_smi64)
|
||||
|
||||
## Installation
|
||||
|
||||
Follow the installation procedure for rocm_smi_lib. Refer to the [installation](../install/install.rst) section.
|
||||
|
||||
LD_LIBRARY_PATH must be set to the folder containing librocm_smi64.
|
||||
|
||||
## Version
|
||||
|
||||
The SMI reports two versions: the ROCM-SMI version and the ROCM-SMI-LIB version.

- The ROCM-SMI version is the CLI/tool version number, with the commit ID appended after a `+` sign.
- The ROCM-SMI-LIB version is the library package version number.
|
||||
|
||||
```
|
||||
ROCM-SMI version: 2.0.0+8e78352
|
||||
ROCM-SMI-LIB version: 6.1.0
|
||||
```
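
A script that needs the tool version and the commit ID separately can split the ROCM-SMI version string on the `+` sign. This is only a minimal sketch that assumes the format shown above; `split_tool_version` is a hypothetical helper, not something rocm-smi provides.

```python
def split_tool_version(version):
    """Split 'X.Y.Z+commit' into (version number, commit ID).

    Hypothetical helper: assumes the commit ID follows the first '+'.
    If there is no '+', the commit ID is returned as an empty string.
    """
    number, _, commit = version.partition("+")
    return number, commit


print(split_tool_version("2.0.0+8e78352"))  # ('2.0.0', '8e78352')
```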
|
||||
|
||||
## Usage
|
||||
|
||||
For detailed and up-to-date usage information, consult the help:

    /opt/rocm/bin/rocm-smi -h

For convenience, the output of the `-h` flag follows:
|
||||
|
||||
```
|
||||
/opt/rocm/bin/rocm-smi -h
|
||||
usage: rocm-smi [-h] [-V] [-d DEVICE [DEVICE ...]] [--alldevices] [--showhw] [-a] [-i] [-v] [-e [EVENT [EVENT ...]]]
|
||||
[--showdriverversion] [--showtempgraph] [--showfwinfo [BLOCK [BLOCK ...]]] [--showmclkrange]
|
||||
[--showmemvendor] [--showsclkrange] [--showproductname] [--showserial] [--showuniqueid]
|
||||
[--showvoltagerange] [--showbus] [--showpagesinfo] [--showpendingpages] [--showretiredpages]
|
||||
[--showunreservablepages] [-f] [-P] [-t] [-u] [--showmemuse] [--showvoltage] [-b] [-c] [-g] [-l] [-M]
|
||||
[-m] [-o] [-p] [-S] [-s] [--showmeminfo TYPE [TYPE ...]] [--showpids [VERBOSE]]
|
||||
[--showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]]] [--showreplaycount]
|
||||
[--showrasinfo [SHOWRASINFO [SHOWRASINFO ...]]] [--showvc] [--showxgmierr] [--showtopo]
|
||||
[--showtopoaccess] [--showtopoweight] [--showtopohops] [--showtopotype] [--showtoponuma]
|
||||
[--showenergycounter] [--shownodesbw] [--showcomputepartition] [--showmemorypartition] [-r]
|
||||
[--resetfans] [--resetprofile] [--resetpoweroverdrive] [--resetxgmierr] [--resetperfdeterminism]
|
||||
[--resetcomputepartition] [--resetmemorypartition] [--setclock TYPE LEVEL] [--setsclk LEVEL [LEVEL ...]]
|
||||
[--setmclk LEVEL [LEVEL ...]] [--setpcie LEVEL [LEVEL ...]] [--setslevel SCLKLEVEL SCLK SVOLT]
|
||||
[--setmlevel MCLKLEVEL MCLK MVOLT] [--setvc POINT SCLK SVOLT] [--setsrange SCLKMIN SCLKMAX]
|
||||
[--setextremum min|max sclk|mclk CLK] [--setmrange MCLKMIN MCLKMAX] [--setfan LEVEL]
|
||||
[--setperflevel LEVEL] [--setoverdrive %] [--setmemoverdrive %] [--setpoweroverdrive WATTS]
|
||||
[--setprofile SETPROFILE] [--setperfdeterminism SCLK]
|
||||
[--setcomputepartition {CPX,SPX,DPX,TPX,QPX,cpx,spx,dpx,tpx,qpx}]
|
||||
[--setmemorypartition {NPS1,NPS2,NPS4,NPS8,nps1,nps2,nps4,nps8}] [--rasenable BLOCK ERRTYPE]
|
||||
[--rasdisable BLOCK ERRTYPE] [--rasinject BLOCK] [--gpureset] [--load FILE | --save FILE]
|
||||
[--autorespond RESPONSE] [--loglevel LEVEL] [--json] [--csv]
|
||||
|
||||
AMD ROCm System Management Interface | ROCM-SMI version: 2.0.0+8e78352
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--gpureset Reset specified GPU (One GPU must be specified)
|
||||
--load FILE Load Clock, Fan, Performance and Profile settings
|
||||
from FILE
|
||||
--save FILE Save Clock, Fan, Performance and Profile settings to
|
||||
FILE
|
||||
|
||||
-V, --version Show version information
|
||||
|
||||
-d DEVICE [DEVICE ...], --device DEVICE [DEVICE ...] Execute command on specified device
|
||||
|
||||
Display Options:
|
||||
--alldevices
|
||||
--showhw Show Hardware details
|
||||
-a, --showallinfo Show Temperature, Fan and Clock values
|
||||
|
||||
Topology:
|
||||
-i, --showid Show DEVICE ID
|
||||
-v, --showvbios Show VBIOS version
|
||||
-e [EVENT [EVENT ...]], --showevents [EVENT [EVENT ...]] Show event list
|
||||
--showdriverversion Show kernel driver version
|
||||
--showtempgraph Show Temperature Graph
|
||||
--showfwinfo [BLOCK [BLOCK ...]] Show FW information
|
||||
--showmclkrange Show mclk range
|
||||
--showmemvendor Show GPU memory vendor
|
||||
--showsclkrange Show sclk range
|
||||
--showproductname Show SKU/Vendor name
|
||||
--showserial Show GPU's Serial Number
|
||||
--showuniqueid Show GPU's Unique ID
|
||||
--showvoltagerange Show voltage range
|
||||
--showbus Show PCI bus number
|
||||
|
||||
Pages information:
|
||||
--showpagesinfo Show retired, pending and unreservable pages
|
||||
--showpendingpages Show pending retired pages
|
||||
--showretiredpages Show retired pages
|
||||
--showunreservablepages Show unreservable pages
|
||||
|
||||
Hardware-related information:
|
||||
-f, --showfan Show current fan speed
|
||||
-P, --showpower Show current Average or Socket Graphics Package Power
|
||||
Consumption
|
||||
-t, --showtemp Show current temperature
|
||||
-u, --showuse Show current GPU use
|
||||
--showmemuse Show current GPU memory used
|
||||
--showvoltage Show current GPU voltage
|
||||
|
||||
Software-related/controlled information:
|
||||
-b, --showbw Show estimated PCIe use
|
||||
-c, --showclocks Show current clock frequencies
|
||||
-g, --showgpuclocks Show current GPU clock frequencies
|
||||
-l, --showprofile Show Compute Profile attributes
|
||||
-M, --showmaxpower Show maximum graphics package power this GPU will
|
||||
consume
|
||||
-m, --showmemoverdrive Show current GPU Memory Clock OverDrive level
|
||||
-o, --showoverdrive Show current GPU Clock OverDrive level
|
||||
-p, --showperflevel Show current DPM Performance Level
|
||||
-S, --showclkvolt Show supported GPU and Memory Clocks and Voltages
|
||||
-s, --showclkfrq Show supported GPU and Memory Clock
|
||||
--showmeminfo TYPE [TYPE ...] Show Memory usage information for given block(s) TYPE
|
||||
--showpids [VERBOSE] Show current running KFD PIDs (pass details to
|
||||
VERBOSE for detailed information)
|
||||
--showpidgpus [SHOWPIDGPUS [SHOWPIDGPUS ...]] Show GPUs used by specified KFD PIDs (all if no arg
|
||||
given)
|
||||
--showreplaycount Show PCIe Replay Count
|
||||
--showrasinfo [SHOWRASINFO [SHOWRASINFO ...]] Show RAS enablement information and error counts for
|
||||
the specified block(s) (all if no arg given)
|
||||
--showvc Show voltage curve
|
||||
--showxgmierr Show XGMI error information since last read
|
||||
--showtopo Show hardware topology information
|
||||
--showtopoaccess Shows the link accessibility between GPUs
|
||||
--showtopoweight Shows the relative weight between GPUs
|
||||
--showtopohops Shows the number of hops between GPUs
|
||||
--showtopotype Shows the link type between GPUs
|
||||
--showtoponuma Shows the numa nodes
|
||||
--showenergycounter Energy accumulator that stores amount of energy
|
||||
consumed
|
||||
--shownodesbw Shows the numa nodes
|
||||
--showcomputepartition Shows current compute partitioning
|
||||
--showmemorypartition Shows current memory partition
|
||||
|
||||
Set options:
|
||||
--setclock TYPE LEVEL Set Clock Frequency Level(s) for specified clock
|
||||
(requires manual Perf level)
|
||||
--setsclk LEVEL [LEVEL ...] Set GPU Clock Frequency Level(s) (requires manual
|
||||
Perf level)
|
||||
--setmclk LEVEL [LEVEL ...] Set GPU Memory Clock Frequency Level(s) (requires
|
||||
manual Perf level)
|
||||
--setpcie LEVEL [LEVEL ...] Set PCIE Clock Frequency Level(s) (requires manual
|
||||
Perf level)
|
||||
--setslevel SCLKLEVEL SCLK SVOLT Change GPU Clock frequency (MHz) and Voltage (mV) for
|
||||
a specific Level
|
||||
--setmlevel MCLKLEVEL MCLK MVOLT Change GPU Memory clock frequency (MHz) and Voltage
|
||||
for (mV) a specific Level
|
||||
--setvc POINT SCLK SVOLT Change SCLK Voltage Curve (MHz mV) for a specific
|
||||
point
|
||||
--setsrange SCLKMIN SCLKMAX Set min and max SCLK speed
|
||||
--setextremum min|max sclk|mclk CLK Set min/max of SCLK/MCLK speed
|
||||
--setmrange MCLKMIN MCLKMAX Set min and max MCLK speed
|
||||
--setfan LEVEL Set GPU Fan Speed (Level or %)
|
||||
--setperflevel LEVEL Set Performance Level
|
||||
--setoverdrive % Set GPU OverDrive level (requires manual|high Perf
|
||||
level)
|
||||
--setmemoverdrive % Set GPU Memory Overclock OverDrive level (requires
|
||||
manual|high Perf level)
|
||||
--setpoweroverdrive WATTS Set the maximum GPU power using Power OverDrive in
|
||||
Watts
|
||||
--setprofile SETPROFILE Specify Power Profile level (#) or a quoted string of
|
||||
CUSTOM Profile attributes "# # # #..." (requires
|
||||
manual Perf level)
|
||||
--setperfdeterminism SCLK Set clock frequency limit to get minimal performance
|
||||
variation
|
||||
--setcomputepartition {CPX,SPX,DPX,TPX,QPX,cpx,spx,dpx,tpx,qpx} Set compute partition
|
||||
--setmemorypartition {NPS1,NPS2,NPS4,NPS8,nps1,nps2,nps4,nps8} Set memory partition
|
||||
--rasenable BLOCK ERRTYPE Enable RAS for specified block and error type
|
||||
--rasdisable BLOCK ERRTYPE Disable RAS for specified block and error type
|
||||
--rasinject BLOCK Inject RAS poison for specified block (ONLY WORKS ON
|
||||
UNSECURE BOARDS)
|
||||
|
||||
Reset options:
|
||||
-r, --resetclocks Reset clocks and OverDrive to default
|
||||
--resetfans Reset fans to automatic (driver) control
|
||||
--resetprofile Reset Power Profile back to default
|
||||
--resetpoweroverdrive Set the maximum GPU power back to the device deafult
|
||||
state
|
||||
--resetxgmierr Reset XGMI error count
|
||||
--resetperfdeterminism Disable performance determinism
|
||||
--resetcomputepartition Resets to boot compute partition state
|
||||
--resetmemorypartition Resets to boot memory partition state
|
||||
|
||||
Auto-response options:
|
||||
--autorespond RESPONSE Response to automatically provide for all prompts
|
||||
(NOT RECOMMENDED)
|
||||
|
||||
Output options:
|
||||
--loglevel LEVEL How much output will be printed for what program is
|
||||
doing, one of debug/info/warning/error/critical
|
||||
--json Print output in JSON format
|
||||
--csv Print output in CSV format
|
||||
```
|
||||
|
||||
## Detailed Option Descriptions
|
||||
`--setextremum <min/max> <sclk or mclk> <value in MHz to set to>`
|
||||
If the ASIC supports it, you can set a maximum or minimum sclk or mclk value through the Python CLI tool (for example, `rocm-smi --setextremum max sclk 1500`). See the example below.
|
||||
|
||||
```shell
|
||||
$ sudo /opt/rocm/bin/rocm-smi --setextremum max sclk 2100
|
||||
|
||||
============================ ROCm System Management Interface ============================
|
||||
|
||||
******WARNING******
|
||||
|
||||
Operating your AMD GPU outside of official AMD specifications or outside of
|
||||
factory settings, including but not limited to the conducting of overclocking,
|
||||
over-volting or under-volting (including use of this interface software,
|
||||
even if such software has been directly or indirectly provided by AMD or otherwise
|
||||
affiliated in any way with AMD), may cause damage to your AMD GPU, system components
|
||||
and/or result in system failure, as well as cause other problems.
|
||||
DAMAGES CAUSED BY USE OF YOUR AMD GPU OUTSIDE OF OFFICIAL AMD SPECIFICATIONS OR
|
||||
OUTSIDE OF FACTORY SETTINGS ARE NOT COVERED UNDER ANY AMD PRODUCT WARRANTY AND
|
||||
MAY NOT BE COVERED BY YOUR BOARD OR SYSTEM MANUFACTURER'S WARRANTY.
|
||||
Use this utility with caution.
|
||||
|
||||
Do you accept these terms? [y/N] y
|
||||
================================ Set Valid sclk Extremum =================================
|
||||
GPU[0] : Successfully set max sclk to 2100(MHz)
|
||||
GPU[1] : Successfully set max sclk to 2100(MHz)
|
||||
GPU[2] : Successfully set max sclk to 2100(MHz)
|
||||
GPU[3] : Successfully set max sclk to 2100(MHz)
|
||||
================================== End of ROCm SMI Log ===================================
|
||||
```
|
||||
|
||||
--setsclk/--setmclk # [# # ...]:
|
||||
This allows you to set a mask for the levels. For example, if a GPU has 8 clock levels,
|
||||
you can set a mask to use levels 0, 5, 6 and 7 with --setsclk 0 5 6 7 . This will only
|
||||
use the base level, and the top 3 clock levels. This will allow you to keep the GPU at
|
||||
base level when there is no GPU load, and the top 3 levels when the GPU load increases.
|
||||
|
||||
NOTES:
|
||||
The clock levels change dynamically with GPU load, following the default
Compute and Graphics profiles. The thresholds and delays for a custom mask cannot
be controlled through the SMI tool.
|
||||
|
||||
This flag automatically sets the Performance Level to "manual" as the mask is not
|
||||
applied when the Performance level is set to auto
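
One way to picture the level mask is as a bitmask with one bit per clock level. The sketch below is purely illustrative: `levels_to_mask` is a hypothetical helper, and whether rocm-smi builds the mask in exactly this way internally is an assumption.

```python
def levels_to_mask(levels):
    """Return an integer with bit N set for every clock level N in `levels`.

    Hypothetical helper for illustration; not part of rocm-smi.
    """
    mask = 0
    for level in levels:
        if level < 0:
            raise ValueError("clock levels must be non-negative")
        mask |= 1 << level
    return mask


# Levels 0, 5, 6 and 7 of an 8-level GPU, as in the --setsclk 0 5 6 7 example.
print(f"{levels_to_mask([0, 5, 6, 7]):#010b}")  # 0b11100001
```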
|
||||
|
||||
--setfan LEVEL:
|
||||
This sets the fan speed to a value ranging from 0 to maxlevel, or from 0%-100%
|
||||
|
||||
If the level ends with a %, the fan speed is calculated as pct*maxlevel/100
|
||||
(maxlevel is usually 255, but is determined by the ASIC)
|
||||
|
||||
NOTE: While the hardware is usually capable of overriding this value when required, it is
|
||||
recommended to not set the fan level lower than the default value for extended periods
|
||||
of time
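
The percentage form can be worked through with a small calculation. This is a sketch under stated assumptions: `fan_level` is a hypothetical helper, the default `max_level` of 255 is only the usual value mentioned above, and truncating the result (rather than rounding) is an assumption about how the fraction is handled.

```python
def fan_level(value, max_level=255):
    """Translate a --setfan style argument into a raw fan level.

    Hypothetical helper: a trailing '%' means pct * max_level / 100,
    truncated to an integer (the truncation is an assumption).
    """
    if value.endswith("%"):
        pct = int(value[:-1])
        if not 0 <= pct <= 100:
            raise ValueError("percentage must be between 0 and 100")
        return pct * max_level // 100
    level = int(value)
    if not 0 <= level <= max_level:
        raise ValueError("level must be between 0 and max_level")
    return level


print(fan_level("50%"))  # 127
print(fan_level("128"))  # 128
```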
|
||||
|
||||
--setperflevel LEVEL:
|
||||
This lets you use the pre-defined Performance Level values for clocks and power profile, which can include:
|
||||
auto (Automatically change values based on GPU workload)
|
||||
low (Keep values low, regardless of workload)
|
||||
high (Keep values high, regardless of workload)
|
||||
manual (Only use values defined by --setsclk and --setmclk)
|
||||
|
||||
--setoverdrive/--setmemoverdrive #:
|
||||
***DEPRECATED IN NEWER KERNEL VERSIONS (use --setslevel/--setmlevel instead)***
|
||||
This sets the percentage above maximum for the max Performance Level.
|
||||
For example, --setoverdrive 20 will increase the top sclk level by 20%, similarly
|
||||
--setmemoverdrive 20 will increase the top mclk level by 20%. Thus if the maximum
|
||||
clock level is 1000MHz, then --setoverdrive 20 will increase the maximum clock to 1200MHz
|
||||
|
||||
NOTES:
|
||||
This option can be used in conjunction with the --setsclk/--setmclk mask
|
||||
|
||||
Operating the GPU outside of specifications can cause irreparable damage to your hardware
|
||||
Observe the warning displayed when using this option
|
||||
|
||||
This flag automatically sets the clock to the highest level, as only the highest level is
|
||||
increased by the OverDrive value
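
The percentage arithmetic in the example above can be checked with a few lines. This is only a sketch of the calculation described in the text; `overdriven_clock_mhz` is a hypothetical name and the function does not talk to the driver.

```python
def overdriven_clock_mhz(max_clock_mhz, overdrive_pct):
    """Return the top clock after raising it by overdrive_pct percent."""
    return max_clock_mhz * (100 + overdrive_pct) / 100


# A 1000 MHz top level with --setoverdrive 20, as in the text.
print(overdriven_clock_mhz(1000, 20))  # 1200.0
```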
|
||||
|
||||
--setpoweroverdrive/--resetpoweroverdrive #:
|
||||
This allows users to change the maximum power available to a GPU package.
|
||||
The input value is in Watts. This limit is enforced by the hardware, and
|
||||
some cards allow users to set it to a higher value than the default that
|
||||
ships with the GPU. This Power OverDrive mode allows the GPU to run at
|
||||
higher frequencies for longer periods of time, though this may mean the
|
||||
GPU uses more power than it is allowed to use per power supply
|
||||
specifications. Each GPU has a model-specific maximum Power OverDrive that
it will accept; attempting to set a higher limit than that will cause this
command to fail.
|
||||
|
||||
NOTES:
|
||||
Operating the GPU outside of specifications can cause irreparable damage to your hardware
|
||||
Observe the warning displayed when using this option
|
||||
|
||||
--setprofile SETPROFILE:
|
||||
The Compute Profile accepts 1 or n parameters, either the Profile to select (see --showprofile for a list
|
||||
of preset Power Profiles) or a quoted string of values for the CUSTOM profile.
|
||||
NOTE: These values can vary based on the ASIC, and may include:
|
||||
|
||||
| Setting             | Description                                        |
|---------------------|----------------------------------------------------|
| SCLK_PROFILE_ENABLE | Whether or not to apply the 3 following SCLK settings (0=disable,1=enable) |
|                     | **NOTE: This is a hidden field. If set to 0, the following 3 values are displayed as '-'** |
| SCLK_UP_HYST        | Delay before sclk is increased (in milliseconds)   |
| SCLK_DOWN_HYST      | Delay before sclk is decreased (in milliseconds)   |
| SCLK_ACTIVE_LEVEL   | Workload required before sclk levels change (in %) |
| MCLK_PROFILE_ENABLE | Whether or not to apply the 3 following MCLK settings (0=disable,1=enable) |
|                     | **NOTE: This is a hidden field. If set to 0, the following 3 values are displayed as '-'** |
| MCLK_UP_HYST        | Delay before mclk is increased (in milliseconds)   |
| MCLK_DOWN_HYST      | Delay before mclk is decreased (in milliseconds)   |
| MCLK_ACTIVE_LEVEL   | Workload required before mclk levels change (in %) |
|
||||
|
||||
Other settings:
|
||||
|
||||
| Setting          | Description                                                               |
|------------------|---------------------------------------------------------------------------|
| BUSY_SET_POINT   | Threshold for raw activity level before levels change                     |
| FPS              | Frames Per Second                                                         |
| USE_RLC_BUSY     | When set to 1, DPM is switched up as long as RLC busy message is received |
| MIN_ACTIVE_LEVEL | Workload required before levels change (in %)                             |
|
||||
|
||||
NOTES:
|
||||
When a compute queue is detected, the COMPUTE Power Profile values will be automatically
|
||||
applied to the system, provided that the Perf Level is set to "auto"
|
||||
|
||||
The CUSTOM Power Profile is only applied when the Performance Level is set to "manual"
|
||||
so using this flag will automatically set the performance level to "manual"
|
||||
|
||||
It is not possible to modify the non-CUSTOM Profiles. These are hard-coded by the kernel
|
||||
|
||||
-P, --showpower:
|
||||
Show average or instantaneous socket graphics package power consumption
|
||||
|
||||
"Graphics Package" refers to the GPU plus any HBM (High-Bandwidth memory) modules, if present
|
||||
|
||||
-M, --showmaxpower:
|
||||
Show the maximum Graphics Package power that the GPU will attempt to consume.
|
||||
This limit is enforced by the hardware.
|
||||
|
||||
--loglevel:
|
||||
This will allow the user to set a logging level for the SMI's actions. Currently this is
|
||||
only implemented for sysfs writes, but can easily be expanded upon in the future to log
|
||||
other things from the SMI
|
||||
|
||||
--showmeminfo:
|
||||
This allows the user to see the amount of used and total memory for a given block (vram,
|
||||
vis_vram, gtt). It returns the number of bytes used and total number of bytes for each block
|
||||
'all' can be passed as a field to return all blocks, otherwise a quoted-string is used for
|
||||
multiple values (e.g. "vram vis_vram")
|
||||
vram refers to the Video RAM, or graphics memory, on the specified device
|
||||
vis_vram refers to Visible VRAM, which is the CPU-accessible video memory on the device
|
||||
gtt refers to the Graphics Translation Table
|
||||
|
||||
-b, --showbw:
|
||||
This shows an approximation of the number of bytes received and sent by the GPU over
|
||||
the last second through the PCIe bus. Note that this will not work for APUs since data for
|
||||
the GPU portion of the APU goes through the memory fabric and does not 'enter/exit'
|
||||
the chip via the PCIe interface, thus no accesses are generated, and the performance
|
||||
counters can't count accesses that are not generated.
|
||||
NOTE: It is not possible to easily grab the size of every packet that is transmitted
in real time, so the kernel estimates the bandwidth by taking the maximum payload size (mps),
which is the largest size a PCIe packet can have, and multiplying it by the number of packets
received and sent. This means that the SMI reports the maximum estimated bandwidth;
the actual usage could be (and likely will be) less.
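
The estimate described in the note is a simple multiplication, sketched below. The function name and the sample numbers are hypothetical; the real counters come from the kernel, and this snippet only illustrates why the reported figure is an upper bound.

```python
def estimated_pcie_bytes(packets_sent, packets_received, max_payload_size):
    """Upper-bound estimate: every packet is assumed to carry a full payload.

    Returns (bytes sent, bytes received). Hypothetical helper for illustration.
    """
    sent = packets_sent * max_payload_size
    received = packets_received * max_payload_size
    return sent, received


# Illustrative numbers only: 1000 packets out, 2000 in, 256-byte max payload.
print(estimated_pcie_bytes(1000, 2000, 256))  # (256000, 512000)
```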
|
||||
|
||||
--showrasinfo:
|
||||
This shows the RAS information for a given block. This includes enablement of the block
|
||||
(currently GFX, SDMA and UMC are the only supported blocks) and the number of errors
|
||||
ue - Uncorrectable errors
|
||||
ce - Correctable errors
|
||||
|
||||
## Clock Type Descriptions
|
||||
|
||||
| Clock type | Description |
| ---------- | --- |
| DCEFCLK | DCE (Display) |
| FCLK | Data fabric (VG20 and later) - Data flow from XGMI, Memory, PCIe |
| SCLK | GFXCLK (Graphics core) |
| | **Note - SOCCLK split from SCLK as of Vega10. Pre-Vega10 they were both controlled by SCLK** |
| MCLK | GPU Memory (VRAM) |
| PCLK | PCIe bus |
| | **Note - This gives 2 speeds, PCIe Gen1 x1 and the highest available based on the hardware** |
| SOCCLK | System clock (VG10 and later) - Data Fabric (DF), MM HUB, AT HUB, SYSTEM HUB, OSS, DFD |
| | **Note - DF split from SOCCLK as of Vega20. Pre-Vega20 they were both controlled by SOCCLK** |
|
||||
|
||||
--gpureset:
|
||||
This flag will attempt to reset the GPU for a specified device. This will invoke the GPU reset through
|
||||
the kernel debugfs file amdgpu_gpu_recover. Note that GPU reset will not always work, depending on the
|
||||
manner in which the GPU is hung.
|
||||
|
||||
--showdriverversion:
|
||||
This flag will print out the AMDGPU module version for amdgpu-pro or ROCm kernels. For other kernels,
|
||||
it will simply print out the name of the kernel (`uname -r`)
|
||||
|
||||
--showserial:
|
||||
This flag will print out the serial number for the graphics card
|
||||
NOTE: This is currently only supported on Vega20 server cards that support it. Consumer cards and
|
||||
cards older than Vega20 will not support this feature.
|
||||
|
||||
--showproductname:
|
||||
This uses the pci.ids file to print out more information regarding the GPUs on the system.
|
||||
'update-pciids' may need to be executed on the machine to get the latest PCI ID snapshot,
|
||||
as certain newer GPUs will not be present in the stock pci.ids file, and the file may even
|
||||
be absent on certain OS installation types
|
||||
|
||||
--showpagesinfo | --showretiredpages | --showpendingpages | --showunreservablepages:
|
||||
These flags display the different "bad pages" as reported by the kernel. The three
|
||||
types of pages are:
|
||||
Retired pages (reserved pages) - These pages are reserved and are unable to be used
|
||||
Pending pages - These pages are pending for reservation, and will be reserved/retired
|
||||
Unreservable pages - These pages are not reservable for some reason
|
||||
|
||||
--showmemuse | --showuse | --showmeminfo:
|
||||
--showuse and --showmemuse are used to indicate how busy the respective blocks are. For
|
||||
example, for --showuse (gpu_busy_percent sysfs file), the SMU samples every ms or so to see
|
||||
if any GPU block (RLC, MEC, PFP, CP) is busy. If so, that's 1 (or high). If not, that's 0 (low).
|
||||
If we have 5 high and 5 low samples, that means 50% utilization (50% GPU busy, or 50% GPU use).
|
||||
The windows and sampling vary from generation to generation, but that is how GPU and VRAM use
|
||||
is calculated in a generic sense.
|
||||
--showmeminfo (and VRAM% in concise output) will show the amount of VRAM used (visible, total, GTT),
|
||||
as well as the total available for those partitions. The percentage shown there indicates the
|
||||
amount of used memory in terms of current allocations
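
The busy-percentage idea described above for `--showuse` can be sketched as an average over binary samples. This is a simplified illustration with a hypothetical helper name; as the text notes, the real sampling windows vary by GPU generation.

```python
def busy_percent(samples):
    """Return the percentage of samples in which the block was busy.

    `samples` is a sequence of 1 (busy/high) and 0 (idle/low) readings.
    Hypothetical helper for illustration only.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    busy = sum(1 for sample in samples if sample)
    return 100 * busy / len(samples)


# Five high and five low samples, as in the example in the text.
print(busy_percent([1] * 5 + [0] * 5))  # 50.0
```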
|
||||
|
||||
## OverDrive settings
|
||||
|
||||
Enabling OverDrive requires both a card that supports OverDrive and a driver parameter that enables its use.
|
||||
Because OverDrive features can damage your card, most workstation and server GPUs cannot use OverDrive.
|
||||
Consumer GPUs that can use OverDrive must enable this feature by setting bit 14 in the amdgpu driver's
|
||||
ppfeaturemask module parameter
|
||||
|
||||
For OverDrive functionality, the OverDrive bit (bit 14) must be enabled (by default, the
|
||||
OverDrive bit is disabled on the ROCK and upstream kernels). This can be done by setting
|
||||
amdgpu.ppfeaturemask accordingly in the kernel parameters, or by changing the default value
|
||||
inside amdgpu_drv.c (if building your own kernel).
|
||||
|
||||
As an example, if the ppfeaturemask is set to 0xffffbfff (11111111111111111011111111111111),
|
||||
then enabling the OverDrive bit would make it 0xffffffff (11111111111111111111111111111111).
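
The bit arithmetic in this example can be verified with a short sketch. The helper names are hypothetical, and the snippet only manipulates an integer; it does not read or write the actual `amdgpu.ppfeaturemask` parameter.

```python
OVERDRIVE_BIT = 14  # bit position named in the text above


def overdrive_enabled(mask):
    """Return True if the OverDrive bit is set in the given mask."""
    return bool(mask & (1 << OVERDRIVE_BIT))


def enable_overdrive(mask):
    """Return the mask with the OverDrive bit set."""
    return mask | (1 << OVERDRIVE_BIT)


mask = 0xFFFFBFFF
print(overdrive_enabled(mask))      # False
print(hex(enable_overdrive(mask)))  # 0xffffffff
```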
|
||||
|
||||
These are the flags that require OverDrive functionality to be enabled for the flag to work:
|
||||
--showclkvolt
|
||||
--showvoltagerange
|
||||
--showvc
|
||||
--showsclkrange
|
||||
--showmclkrange
|
||||
--setslevel
|
||||
--setmlevel
|
||||
--setoverdrive
|
||||
--setpoweroverdrive
|
||||
--resetpoweroverdrive
|
||||
--setvc
|
||||
--setsrange
|
||||
--setmrange
|
||||
|
||||
@@ -1,3 +0,0 @@
|
||||
# ROCm System Management Interface (ROCm SMI) Library
|
||||
|
||||
The ROCm System Management Interface Library, or ROCm SMI library, is part of the Radeon Open Compute [ROCm](https://github.com/RadeonOpenCompute) software stack. It is a C library for Linux that provides a user space interface for applications to monitor and control GPU applications.
|
||||
Executable
@@ -0,0 +1,51 @@
|
||||
.. meta::
|
||||
:description: ROCm SMI
|
||||
:keywords: install, SMI, library, api, AMD, ROCm
|
||||
|
||||
****************************************************
|
||||
ROCm System Management Interface (ROCm SMI) library
|
||||
****************************************************
|
||||
|
||||
The ROCm System Management Interface library, or ROCm SMI library, is part of the ROCm software stack. It is a C library for Linux that provides a user space interface for applications to monitor and control GPU applications.
|
||||
|
||||
For more information, refer to `GitHub <https://github.com/ROCm/rocm_smi_lib>`_.
|
||||
|
||||
.. NOTE::
|
||||
|
||||
The AMD System Management Interface Library (AMD SMI library) is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices. This library will replace rocm_smi_lib over time. We recommend that users transition to the AMD SMI library.
|
||||
|
||||
For more information, refer to the `GitHub repository <https://github.com/ROCm/amdsmi>`_.
|
||||
|
||||
|
||||
|
||||
.. grid:: 2
|
||||
:gutter: 3
|
||||
|
||||
.. grid-item-card:: Install
|
||||
|
||||
* :doc:`ROCm SMI installation <./install/install>`
|
||||
|
||||
.. grid-item-card:: API Reference
|
||||
|
||||
* :doc:`Files <../doxygen/html/files>`
|
||||
* :doc:`Globals <../doxygen/html/globals>`
|
||||
* :doc:`Data structures <../doxygen/html/annotated>`
|
||||
* :doc:`Modules <../doxygen/html/modules>`
|
||||
* :doc:`Python API <reference/python_api>`
|
||||
|
||||
.. grid-item-card:: How to
|
||||
|
||||
* :doc:`Use C++ in ROCm SMI <how-to/use-cpp>`
|
||||
* :doc:`Use Python in ROCm SMI <how-to/use-python>`
|
||||
|
||||
|
||||
.. grid-item-card:: Tutorials
|
||||
|
||||
* :doc:`C++ <tutorials/cpp_tutorials>`
|
||||
* :doc:`Python <tutorials/python_tutorials>`
|
||||
|
||||
|
||||
To contribute to the documentation, refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
|
||||
|
||||
You can find licensing information on the `Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
|
||||
|
||||
@@ -0,0 +1,98 @@
|
||||
.. meta::
|
||||
:description: Install ROCm SMI
|
||||
:keywords: install, SMI, library, api, AMD, ROCm
|
||||
|
||||
|
||||
*********************
|
||||
Installing ROCm SMI
|
||||
*********************
|
||||
|
||||
Planned deprecation notice
|
||||
----------------------------
|
||||
|
||||
The ROCm System Management Interface (ROCm SMI) library is planned to be **deprecated**. The release date will be announced soon. We recommend migrating to AMD SMI.
|
||||
|
||||
Install amdgpu using ROCm
|
||||
--------------------------
|
||||
Use the following instructions to install AMDGPU using ROCm:
|
||||
|
||||
1. Install the amdgpu driver. The following is an example; your release and link may differ. Running `amdgpu-install --usecase=rocm` updates the amdgpu driver and installs the ROCm SMI packages on your device.
|
||||
|
||||
.. code-block:: shell

   sudo apt update
   wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
   sudo apt install ./amdgpu-install_6.0.60002-1_all.deb
   sudo amdgpu-install --usecase=rocm
|
||||
|
||||
* `rocm-smi --help`
|
||||
|
||||
Building ROCm SMI
|
||||
******************
|
||||
|
||||
Additional required software
|
||||
============================
|
||||
|
||||
To build the ROCm SMI library, the following components are required.
|
||||
|
||||
.. Note::
|
||||
|
||||
The following software versions were used in development. Earlier versions are not guaranteed to work:
|
||||
|
||||
* CMake (v3.5.0)
|
||||
* g++ (5.4.0)
|
||||
|
||||
To build the latest documentation, the following are required:
|
||||
|
||||
* Python 3.8+
|
||||
* NPM (sass)
|
||||
|
||||
The source code for ROCm SMI is available on `GitHub <https://github.com/RadeonOpenCompute/rocm_smi_lib>`_.
|
||||
|
||||
After cloning the ROCm SMI library git repository to a local Linux machine, use the following CMake build sequence to build the library:
|
||||
|
||||
.. code-block:: shell

   mkdir -p build
   cd build
   cmake ..
   make -j $(nproc)
   # Install library file and header; default location is /opt/rocm
   make install
|
||||
|
||||
|
||||
The built library will appear in the `build` folder.
|
||||
|
||||
To build the rpm and deb packages, follow the preceding steps and then run:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
make package
|
||||
|
||||
|
||||
Building documentation
|
||||
=======================
|
||||
|
||||
The following is an example of how to build the docs:
|
||||
|
||||
.. code-block:: shell

   python3 -m venv .venv
   .venv/bin/python3 -m pip install -r docs/sphinx/requirements.txt
   .venv/bin/python3 -m sphinx -T -E -b html -d docs/_build/doctrees -D language=en docs docs/_build/html
|
||||
|
||||
|
||||
Building tests
|
||||
=================
|
||||
|
||||
To verify the build and capability of ROCm SMI on your system and to see an example of how ROCm SMI can be used, you may build and run the tests that are available in the repo. To build the tests, follow these steps:
|
||||
|
||||
.. code-block:: bash

   mkdir build
   cd build
   cmake -DBUILD_TESTS=ON ..
   make -j $(nproc)
|
||||
|
||||
To run the tests, execute the `rsmitst` program built in the preceding steps.
|
||||
|
||||
@@ -1,2 +0,0 @@
|
||||
```{include} ../python_smi_tools/README.md
|
||||
```
|
||||
@@ -0,0 +1,14 @@
|
||||
.. meta::
|
||||
:description: ROCm SMI API reference
|
||||
:keywords: API, SMI, AMD, ROCm
|
||||
|
||||
******************
|
||||
API reference
|
||||
******************
|
||||
|
||||
This section provides technical descriptions and important information about the different ROCm SMI and library components.
|
||||
|
||||
* {doc}`Library <../doxygen/docBin/html/files>`
|
||||
* {doc}`Functions <../doxygen/docBin/html/globals>`
|
||||
* {doc}`Data structures <../doxygen/docBin/html/annotated>`
|
||||
|
||||
@@ -1,8 +1,8 @@
|
||||
====================
|
||||
Python API Reference
|
||||
====================
|
||||
=====================
|
||||
Python API reference
|
||||
=====================
|
||||
|
||||
This chapter describes the ROCm SMI Python module API.
|
||||
This section describes the ROCm SMI Python module API.
|
||||
|
||||
.. default-domain:: py
|
||||
.. py:currentmodule:: rocm_smi
|
||||
@@ -196,6 +196,8 @@ Functions
|
||||
|
||||
.. autofunction:: rocm_smi.showPower
|
||||
|
||||
.. autofunction:: rocm_smi.showPowerPlayTable
|
||||
|
||||
.. autofunction:: rocm_smi.showProduct
|
||||
|
||||
.. autofunction:: rocm_smi.showProfile
|
||||
@@ -5,27 +5,41 @@ defaults:
|
||||
maxdepth: 6
|
||||
root: index
|
||||
subtrees:
|
||||
|
||||
- caption: Installation
|
||||
entries:
|
||||
- file: install/install
|
||||
title: ROCm SMI installation
|
||||
|
||||
- caption: How to
|
||||
entries:
|
||||
- file: how-to/use-cpp
|
||||
title: Use C++ in ROCm SMI
|
||||
- file: how-to/use-python
|
||||
title: Use Python in ROCm SMI
|
||||
|
||||
- caption: API Reference
|
||||
entries:
|
||||
- file: doxygen/html/files
|
||||
title: Files
|
||||
- file: doxygen/html/globals
|
||||
title: Globals
|
||||
- file: doxygen/html/annotated
|
||||
title: Data structures
|
||||
- file: doxygen/html/modules
|
||||
title: Modules
|
||||
- file: reference/python_api
|
||||
title: Python API
|
||||
|
||||
|
||||
- caption: Tutorials
|
||||
entries:
|
||||
- file: c++_tutorials
|
||||
title: C++ Tutorials
|
||||
- file: python_tutorials
|
||||
title: Python Tutorials
|
||||
- caption: How to Guide
|
||||
entries:
|
||||
- file: c++_usage
|
||||
title: C++ How to Guide
|
||||
- file: python_usage
|
||||
title: Python How to Guide
|
||||
- caption: Reference
|
||||
entries:
|
||||
- file: doxygen/html/index
|
||||
title: C++ Reference
|
||||
- file: python_api
|
||||
title: Python Reference
|
||||
- file: tutorials/cpp_tutorials
|
||||
title: C++
|
||||
- file: tutorials/python_tutorials
|
||||
title: Python
|
||||
|
||||
- caption: About
|
||||
entries:
|
||||
- file: license
|
||||
title: License
|
||||
- file: CHANGELOG
|
||||
title: Changelog
|
||||
@@ -1,8 +1,9 @@
|
||||
====================
|
||||
C++ Tutorials
|
||||
====================
|
||||
.. meta::
|
||||
:description: ROCm SMI tutorial
|
||||
:keywords: install, SMI, library, api, AMD, ROCm
|
||||
|
||||
This chapter contains the ROCm SMI C++ API tutorials.
|
||||
ROCm SMI C++ API tutorial
|
||||
----------------------------
|
||||
|
||||
.. code-block:: c++
|
||||
|
||||
@@ -1,8 +1,10 @@
|
||||
====================
|
||||
Python Tutorials
|
||||
====================
|
||||
.. meta::
|
||||
:description: ROCm SMI Python tutorial
|
||||
:keywords: install, SMI, library, api, AMD, ROCm
|
||||
|
||||
This chapter contains the rocm_smi Python API tutorials.
|
||||
|
||||
ROCm SMI Python API tutorial
|
||||
-----------------------------
|
||||
|
||||
.. code-block:: python
|
||||
|
||||
@@ -0,0 +1 @@
|
||||
|
||||
@@ -10,8 +10,7 @@ Required: ROCm SMI library installed (librocm_smi64)
|
||||
|
||||
## Installation
|
||||
|
||||
Follow installation procedure for rocm_smi_lib.
|
||||
Please refer to [https://github.com/RadeonOpenCompute/rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib) for the installation guide.
|
||||
Follow installation procedure for rocm_smi_lib. Refer to [https://github.com/RadeonOpenCompute/rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib) for the installation guide.
|
||||
LD_LIBRARY_PATH should be set to the folder containing librocm_smi64.
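
One way to confirm that `LD_LIBRARY_PATH` points at the right folder is to scan its entries for a file whose name starts with `librocm_smi64`. The following is a minimal sketch under that assumption. The function name `find_rocm_smi_library` is hypothetical, and the exact file name on disk (for example, a `.so` suffix with or without a version) is not specified here, so the sketch matches on the prefix only.

```python
import os
from pathlib import Path
from typing import Optional

LIBRARY_PREFIX = "librocm_smi64"  # library name given in the text above


def find_rocm_smi_library(search_path: Optional[str] = None) -> Optional[Path]:
    """Return the first file starting with LIBRARY_PREFIX found in search_path.

    search_path is a list of directories joined by os.pathsep; it defaults to
    the LD_LIBRARY_PATH environment variable.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    for entry in search_path.split(os.pathsep):
        if not entry:
            continue  # skip empty entries such as a leading separator
        directory = Path(entry)
        if not directory.is_dir():
            continue
        matches = sorted(directory.glob(LIBRARY_PREFIX + "*"))
        if matches:
            return matches[0]
    return None


if __name__ == "__main__":
    found = find_rocm_smi_library()
    print(found if found else "librocm_smi64 not found in LD_LIBRARY_PATH")
```

The dynamic loader also searches locations outside `LD_LIBRARY_PATH`, so a miss here does not by itself mean the library cannot be loaded.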
|
||||
|
||||
## Version
|
||||
@@ -220,7 +219,7 @@ $ sudo /opt/rocm/bin/rocm-smi --setextremum max sclk 2100
|
||||
DAMAGES CAUSED BY USE OF YOUR AMD GPU OUTSIDE OF OFFICIAL AMD SPECIFICATIONS OR
|
||||
OUTSIDE OF FACTORY SETTINGS ARE NOT COVERED UNDER ANY AMD PRODUCT WARRANTY AND
|
||||
MAY NOT BE COVERED BY YOUR BOARD OR SYSTEM MANUFACTURER'S WARRANTY.
|
||||
Please use this utility with caution.
|
||||
Use this utility with caution.
|
||||
|
||||
Do you accept these terms? [y/N] y
|
||||
================================ Set Valid sclk Extremum =================================
|
||||
@@ -273,7 +272,7 @@ GPU[3] : Successfully set max sclk to 2100(MHz)
|
||||
This option can be used in conjunction with the --setsclk/--setmclk mask
|
||||
|
||||
Operating the GPU outside of specifications can cause irreparable damage to your hardware
|
||||
Please observe the warning displayed when using this option
|
||||
Observe the warning displayed when using this option
|
||||
|
||||
This flag automatically sets the clock to the highest level, as only the highest level is
|
||||
increased by the OverDrive value
|
||||
@@ -291,7 +290,7 @@ GPU[3] : Successfully set max sclk to 2100(MHz)
|
||||
|
||||
NOTES:
|
||||
Operating the GPU outside of specifications can cause irreparable damage to your hardware
|
||||
Please observe the warning displayed when using this option
|
||||
Observe the warning displayed when using this option
|
||||
|
||||
--setprofile SETPROFILE:
|
||||
The Compute Profile accepts 1 or n parameters, either the Profile to select (see --showprofile for a list
|
||||
|
||||