From e196f98dbaac2a0e54fd4dd7ee6eb28d2dd9d26f Mon Sep 17 00:00:00 2001
From: Peter Park
Date: Wed, 2 Oct 2024 14:42:34 -0400
Subject: [PATCH] docs: Remove redundant/stale docs

bump rocm-docs-core to 1.8.2
rm unused files
rm stale docs
fix sphinx conf
reorg docs
SWDEV-482203 -- add note to usage guides
update readmes

Change-Id: I9e0111ac8fe2a691ac964b27436ba47747c27904
Signed-off-by: Peter Park
---
 README.md                                     |  447 +-
 amdsmi_cli/README.md                          | 1236 +----
 docs/amdsmi_changelog_link.md                 |    2 -
 docs/amdsmi_cli_readme_link.md                |    2 -
 docs/amdsmi_release_notes_link.md             |    2 -
 docs/conceptual/test.md                       |    1 -
 docs/conf.py                                  |   68 +-
 ...AMD-SMI-CLI-tool.md => amdsmi-cli-tool.md} |  139 +-
 docs/how-to/amdsmi-cpp-lib.md                 |  198 +
 docs/how-to/amdsmi-py-lib.md                  |  105 +
 docs/how-to/using-amdsmi-for-C++.rst          |  254 -
 docs/index.md                                 |   82 +
 docs/index.rst                                |   50 -
 docs/install/build.md                         |  110 +
 docs/install/install.md                       |  144 +
 docs/install/install.rst                      |  147 -
 docs/license.rst                              |    9 +-
 docs/py-interface_readme_link.md              |    2 -
 docs/reference/amdsmi-cpp-api.md              |   21 +
 .../amdsmi-py-api.md}                         |  137 +-
 docs/reference/changelog.md                   | 1733 ++++++
 docs/reference/index.rst                      |   14 -
 docs/sphinx/_toc.yml.in                       |   56 +-
 docs/sphinx/requirements.in                   |    2 +-
 docs/sphinx/requirements.txt                  |   56 +-
 py-interface/README.md                        | 4888 +----------------
 26 files changed, 2871 insertions(+), 7034 deletions(-)
 delete mode 100644 docs/amdsmi_changelog_link.md
 delete mode 100644 docs/amdsmi_cli_readme_link.md
 delete mode 100644 docs/amdsmi_release_notes_link.md
 delete mode 100644 docs/conceptual/test.md
 rename docs/how-to/{using-AMD-SMI-CLI-tool.md => amdsmi-cli-tool.md} (94%)
 create mode 100644 docs/how-to/amdsmi-cpp-lib.md
 create mode 100644 docs/how-to/amdsmi-py-lib.md
 delete mode 100644 docs/how-to/using-amdsmi-for-C++.rst
 create mode 100644 docs/index.md
 delete mode 100644 docs/index.rst
 create mode 100644 docs/install/build.md
 create mode 100644 docs/install/install.md
 delete mode 100644 docs/install/install.rst
 delete mode 100644 docs/py-interface_readme_link.md
 create mode 100644 docs/reference/amdsmi-cpp-api.md
 rename docs/{how-to/using-amdsmi-for-python.md => reference/amdsmi-py-api.md} (97%)
 create mode 100644 docs/reference/changelog.md
 delete mode 100644 docs/reference/index.rst

diff --git a/README.md b/README.md
index 0cc125e01f..1cbc06c7c6 100644
--- a/README.md
+++ b/README.md
@@ -1,345 +1,195 @@
-# AMD System Management Interface (AMD SMI) Library
+# AMD System Management Interface (AMD SMI) library
-The AMD System Management Interface Library, or AMD SMI library, is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices.
+The AMD System Management Interface (AMD SMI) library offers a unified tool for managing and monitoring GPUs,
+particularly in high-performance computing environments. It provides a user-space interface that allows applications to
+control GPU operations, monitor performance, and retrieve information about the system's drivers and GPUs.
-For additional information refer to [ROCm Documentation](https://rocm.docs.amd.com/projects/amdsmi/en/latest/)
+For information on available features, installation steps, API reference material, and helpful tips, refer to the online
+documentation at [rocm.docs.amd.com/projects/amdsmi](https://rocm.docs.amd.com/projects/amdsmi/en/latest/)
-Note: This project is a successor to [rocm_smi_lib](https://github.com/RadeonOpenCompute/rocm_smi_lib)
-
-and [esmi_ib_library](https://github.com/amd/esmi_ib_library)
+>[!NOTE]
+>This project is a successor to [rocm_smi_lib](https://github.com/ROCm/rocm_smi_lib)
+>and [esmi_ib_library](https://github.com/amd/esmi_ib_library).
 ## Supported platforms
-At initial release, the AMD SMI library will support Linux bare metal and Linux virtual machine guest for AMD GPUs. In the future release, the library will be extended to support AMD EPYC™ CPUs.
+At initial release, the AMD SMI library will support Linux bare metal and Linux
+virtual machine guest for AMD GPUs. In a future release, the library will be
+extended to support AMD EPYC™ CPUs.
-AMD SMI library can run on AMD ROCm supported platforms, refer to [System requirements (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html) for more information.
+AMD SMI library can run on AMD ROCm supported platforms, refer to
+[System requirements (Linux)](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html)
+for more information.
-To run the AMD SMI library, the amdgpu driver and the hsmp driver needs to be installed. Optionally, the libdrm can be
-installed to query firmware information and hardware IPs.
+## Installation
-## Install CLI Tool and Libraries
+* [Install the AMD SMI library and CLI tool](https://rocm.docs.amd.com/projects/amdsmi/en/latest/install/install.html)
-**Disclaimer: CLI Tool is provided as an example code to aid the development of telemetry tools and is not guaranteed to be backwards compatible. The Python or C++ Library is recommended as a reliable data source.**
+## Requirements
+The following are required to install and use the AMD SMI libraries and CLI tool.
-### Requirements
+* Python 3.6.8+ (64-bit)
+* `amdgpu` driver must be loaded for [`amdsmi_init()`](./docs/how-to/amdsmi-cpp-lib#hello-amd-smi) to work.
-* python 3.6.8+ 64-bit
-  - prerequisite modules:
-    - python3-wheel
-    - python3-setuptools
-* amdgpu driver must be loaded for amdsmi_init() to pass
+## Install amdgpu driver and AMD SMI with ROCm
-### Installation
+1. Get the `amdgpu-install` installer following the instructions for your Linux distribution at
+   [Installation via AMDGPU installer](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/amdgpu-install.html#installation).
-### Install amdgpu using ROCm
-* Install amdgpu driver:
-See example below, your release and link may differ. The `amdgpu-install --usecase=rocm` triggers both an amdgpu driver update and AMD SMI packages to be installed on your device.
-```shell
-sudo apt update
-wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb
-sudo apt install ./amdgpu-install_6.0.60002-1_all.deb
-sudo amdgpu-install --usecase=rocm
-```
-* amd-smi --help
+2. Use `amdgpu-install` to install the `amdgpu` driver and ROCm packages with
+   AMD SMI included.
-### Install Example for Ubuntu 22.04 (without ROCm)
+   ``` shell
+   sudo amdgpu-install --usecase=rocm
+   ```
-``` bash
-apt install amd-smi-lib
-# if installed with rocm ignore the export
-export PATH="${PATH:+${PATH}:}~/opt/rocm/bin"
-amd-smi --help
-```
+   The `amdgpu-install --usecase=rocm` option triggers both an `amdgpu` driver
+   update and AMD SMI packages to be installed on your device.
-### Optional autocompletion
+3. Verify your installation.
-`amd-smi` cli application supports autocompletion. The package should attempt to install it, if argcomplete is not installed you can enable it by using the following commands:
+   ```shell
+   amd-smi --help
+   ```
-```bash
-python3 -m pip install argcomplete
-activate-global-python-argcomplete --user
-# restart shell to enable
-```
+## Install AMD SMI without ROCm
-### Manual/Multiple Rocm Instance Python Library Install
+The following are example steps to install the AMD SMI libraries and CLI tool on
+Ubuntu 22.04.
-In the event there are multiple rocm installations and pyenv is not being used, to use the correct amdsmi version you must uninstall previous versions of amd-smi and install the version you want directly from your rocm instance.
+1. Install the library.
-#### Python Library Install Example for Ubuntu 22.04
+   ```shell
+   sudo apt install amd-smi-lib
+   ```
-Remove previous amdsmi installation:
+2. Add the installation directory to your PATH. If installed with ROCm, ignore
+   this step.
-```bash
-python3 -m pip list | grep amd
-python3 -m pip uninstall amdsmi
-```
+   ```shell
+   export PATH="${PATH:+${PATH}:}~/opt/rocm/bin"
+   ```
-Then install Python library from your target rocm instance:
+3. Verify your installation.
-``` bash
-apt install amd-smi-lib
-cd /opt/rocm/share/amd_smi
-python3 -m pip install --upgrade pip
-python3 -m pip install --user .
-```
+   ```shell
+   amd-smi --help
+   ```
-Now you have the amdsmi python library in your python path:
+## AMD SMI basic usage
-``` bash
-~$ python3
-Python 3.8.10 (default, May 26 2023, 14:05:08)
-[GCC 9.4.0] on linux
-Type "help", "copyright", "credits" or "license" for more information.
->>> import amdsmi
->>>
-```
+### C++ library
-### Installing the Python Prerequisite Modules
+For developers focused on performance monitoring, system diagnostics, or resource management, the AMD SMI C++ library
+offers a powerful and versatile tool to unlock the full capabilities of AMD hardware.
-Python3-setuptools and python3-wheel can both be installed through the pip installer as shown below:
+Refer to the [user guide](https://rocm.docs.amd.com/projects/amdsmi/en/latest/how-to/amdsmi-cpp-lib.html) and the
+detailed [C++ API reference](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-cpp-api.html) in the
+ROCm documentation portal.
-```bash
-python3 -m pip install setuptools wheel
-```
+### Python library
-## Usage Basics for the C Library
+The AMD SMI Python interface provides an easy-to-use
+[API](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-lib.html) for interacting with AMD
+hardware. It simplifies tasks like monitoring and controlling GPU operations, allowing for rapid development.
-### Device/Socket handles
+Refer to the [user guide](https://rocm.docs.amd.com/projects/amdsmi/en/latest/how-to/amdsmi-py-lib.html) and the
+detailed [Python API reference](https://rocm.docs.amd.com/projects/amdsmi/en/latest/reference/amdsmi-py-api.html) in the
+ROCm documentation portal.
-Many of the functions in the library take a "socket handle" or "device handle". The socket is an abstraction of hardware physical socket. This will enable amd-smi to provide a better representation of the hardware to user. Although there is always one distinct GPU for a socket, the APU may have both
-GPU device and CPU device on the same socket. Moreover, for MI200, it may have multiple GCDs.
+### CLI tool
-To discover the sockets in the system, `amdsmi_get_socket_handles()` is called to get list of sockets
-handles, which in turn can be used to query the devices in that socket using `amdsmi_get_processor_handles()`. The device handler is used to distinguish the detected devices from one another. It is important to note that a device may end up with a different device handles after restart application, so a device handle should not be relied upon to be constant over process.
+A versatile command line tool for managing and monitoring AMD hardware. You can use `amd-smi` for:
-The list of socket handles discovered using `amdsmi_get_socket_handles()`,can also be used to query the cpus in that socket using `amdsmi_get_processor_handles_by_type()`, which in turn can then be used to query the cores in that cpu using `amdsmi_get_processor_handles_by_type()` again.
+- Device information: Quickly retrieve detailed information about AMD GPUs
+- Performance monitoring: Real-time monitoring of GPU utilization, memory, temperature, and power consumption
-## Hello AMD SMI
+- Process information: Identify which processes are using GPUs
-The only required AMD-SMI call for any program that wants to use AMD-SMI is the `amdsmi_init()` call. This call initializes some internal data structures that will be used by subsequent AMD-SMI calls. In the call, a flag can be passed if the application is only interested in a specific device type.
+- Configuration management: Adjust GPU settings like clock speeds and power limits
-When AMD-SMI is no longer being used, `amdsmi_shut_down()` should be called. This provides a way to do any releasing of resources that AMD-SMI may have held.
+- Error reporting: Monitor and report GPU errors for proactive maintenance
-1) A simple "Hello World" type program that displays the temperature of detected devices would look like this:
-
-```c++
-#include <iostream>
-#include <vector>
-#include "amd_smi/amdsmi.h"
-
-int main() {
-  amdsmi_status_t ret;
-
-  // Init amdsmi for sockets and devices. Here we are only interested in AMD_GPUS.
-  ret = amdsmi_init(AMDSMI_INIT_AMD_GPUS);
-
-  // Get all sockets
-  uint32_t socket_count = 0;
-
-  // Get the socket count available in the system.
-  ret = amdsmi_get_socket_handles(&socket_count, nullptr);
-
-  // Allocate the memory for the sockets
-  std::vector<amdsmi_socket_handle> sockets(socket_count);
-  // Get the socket handles in the system
-  ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]);
-
-  std::cout << "Total Socket: " << socket_count << std::endl;
-
-  // For each socket, get identifier and devices
-  for (uint32_t i=0; i < socket_count; i++) {
-    // Get Socket info
-    char socket_info[128];
-    ret = amdsmi_get_socket_info(sockets[i], 128, socket_info);
-    std::cout << "Socket " << socket_info<< std::endl;
-
-    // Get the device count for the socket.
-    uint32_t device_count = 0;
-    ret = amdsmi_get_processor_handles(sockets[i], &device_count, nullptr);
-
-    // Allocate the memory for the device handlers on the socket
-    std::vector<amdsmi_processor_handle> processor_handles(device_count);
-    // Get all devices of the socket
-    ret = amdsmi_get_processor_handles(sockets[i],
-              &device_count, &processor_handles[0]);
-
-    // For each device of the socket, get name and temperature.
-    for (uint32_t j=0; j < device_count; j++) {
-      // Get device type. Since the amdsmi is initialized with
-      // AMD_SMI_INIT_AMD_GPUS, the processor_type must be AMDSMI_PROCESSOR_TYPE_AMD_GPU.
-      processor_type_t processor_type;
-      ret = amdsmi_get_processor_type(processor_handles[j], &processor_type);
-      if (processor_type != AMDSMI_PROCESSOR_TYPE_AMD_GPU) {
-        std::cout << "Expect AMDSMI_PROCESSOR_TYPE_AMD_GPU device type!\n";
-        return 1;
-      }
-
-      // Get device name
-      amdsmi_board_info_t board_info;
-      ret = amdsmi_get_gpu_board_info(processor_handles[j], &board_info);
-      std::cout << "\tdevice "
-                  << j <<"\n\t\tName:" << board_info.product_name << std::endl;
-
-      // Get temperature
-      int64_t val_i64 = 0;
-      ret = amdsmi_get_temp_metric(processor_handles[j], AMDSMI_TEMPERATURE_TYPE_EDGE,
-              AMDSMI_TEMP_CURRENT, &val_i64);
-      std::cout << "\t\tTemperature: " << val_i64 << "C" << std::endl;
-    }
-  }
-
-  // Clean up resources allocated at amdsmi_init. It will invalidate sockets
-  // and devices pointers
-  ret = amdsmi_shut_down();
-
-  return 0;
-}
-```
-
-2) A sample program that displays the power of detected cpus would look like this:
-
-```c++
-#include <iostream>
-#include <vector>
-#include "amd_smi/amdsmi.h"
-
-int main(int argc, char **argv) {
-  amdsmi_status_t ret;
-  uint32_t socket_count = 0;
-
-  // Initialize amdsmi for AMD CPUs
-  ret = amdsmi_init(AMDSMI_INIT_AMD_CPUS);
-
-  ret = amdsmi_get_socket_handles(&socket_count, nullptr);
-
-  // Allocate the memory for the sockets
-  std::vector<amdsmi_socket_handle> sockets(socket_count);
-
-  // Get the sockets of the system
-  ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]);
-
-  std::cout << "Total Socket: " << socket_count << std::endl;
-
-  // For each socket, get cpus
-  for (uint32_t i = 0; i < socket_count; i++) {
-    uint32_t cpu_count = 0;
-
-    // Set processor type as AMDSMI_PROCESSOR_TYPE_AMD_CPU
-    processor_type_t processor_type = AMDSMI_PROCESSOR_TYPE_AMD_CPU;
-    ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, nullptr, &cpu_count);
-
-    // Allocate the memory for the cpus
-    std::vector<amdsmi_processor_handle> plist(cpu_count);
-
-    // Get the cpus for each socket
-    ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, &plist[0], &cpu_count);
-
-    for (uint32_t index = 0; index < plist.size(); index++) {
-      uint32_t socket_power;
-      std::cout<<"CPU "<(socket_power)/1000<`
+
+3. Build the library by following the typical CMake build sequence (run as root user or use `sudo` before `make install`
+   command); for instance:
+
+   ```bash
+   mkdir -p build
+   cd build
+   cmake ..
+   make -j $(nproc)
+   make install
+   ```
+
+   The built library is located in the `build/` directory. To build the `rpm` and `deb` packages use the following
+   command:
+
+   ```bash
+   make package
+   ```
+
+### Rebuild the Python wrapper
+
+The Python wrapper for the AMD SMI library is found in the [auto-generated file](#py_lib_fs)
+`py-interface/amdsmi_wrapper.py`. It is essential to regenerate this wrapper whenever there are changes to the C++ API.
+It is not regenerated automatically.
+
+To regenerate the wrapper, use the following command.
+
+```shell
 ./update_wrapper.sh
 ```
-After this command, the file in `py-interface/amdsmi_wrapper.py` will be automatically updated on each compile.
+After this command, the file in `py-interface/amdsmi_wrapper.py` will be updated
+on compile.
-Note: To be able to re-generate python wrapper you need **docker** installed.
+>[!NOTE]
+>You need Docker installed on your system to regenerate the Python wrapper.
-Note: python_wrapper is NOT automatically re-generated. You must run `./update_wrapper.sh`.
+### Build the tests
-### Additional Required software for building
-
-In order to build the AMD SMI library, the following components are required. Note that the software versions listed are what was used in development. Earlier versions are not guaranteed to work:
-
-* CMake (v3.14.0) - `python3 -m pip install cmake`
-* g++ (5.4.0)
-
-In order to build the AMD SMI python package, the following components are required:
-
-* python (3.6.8 or above)
-* virtualenv - `python3 -m pip install virtualenv`
-
-In order to build the latest documentation, the following are required:
-
-* DOxygen (1.8.11)
-* latex (pdfTeX 3.14159265-2.6-1.40.16)
-
-The source code for AMD SMI is available on Github.
-
-After the AMD SMI library git repository has been cloned to a local Linux machine, the Default location for the library and headers is /opt/rocm. Before installation, the old rocm directories should be deleted:
-/opt/rocm
-/opt/rocm-{number}
-
-Building the library is achieved by following the typical CMake build sequence (run as root user or use 'sudo' before 'make install' command), specifically:
-
-```bash
-mkdir -p build
-cd build
-cmake ..
-make -j $(nproc)
-make install
-```
-
-The built library will appear in the `build` folder.
-
-To build the rpm and deb packages follow the above steps with:
-
-```bash
-make package
-```
-
-### Building the Tests
-
-In order to verify the build and capability of AMD SMI on your system and to see an example of how AMD SMI can be used, you may build and run the tests that are available in the repo. To build the tests, follow these steps:
+To verify the build and capabilities of AMD SMI on your system, as well as to see practical examples of its usage, you
+can build and run the available [tests in the repository](https://github.com/ROCm/amdsmi/tree/amd-staging/tests). Follow
+these steps to build the tests:
 ```bash
 mkdir -p build
@@ -348,13 +198,24 @@ cmake -DBUILD_TESTS=ON ..
 make -j $(nproc)
 ```
-### Run the Tests
+#### Run the tests
-To run the test, execute the program `amdsmitst` that is built from the steps above.
-Path to the program `amdsmitst`: build/tests/amd_smi_test/
+Once the tests are [built](#build-the-tests), you can run them by executing the `amdsmitst` program. The executable can
+be found at `build/tests/amd_smi_test/`.
+
+### Build the docs
+
+To build the documentation, follow the instructions at
+[Building documentation](https://rocm.docs.amd.com/en/latest/contribute/building.html).
 ## DISCLAIMER
-The information contained herein is for informational purposes only, and is subject to change without notice. In addition, any stated support is planned and is also subject to change. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein.
+The information contained herein is for informational purposes only, and is subject to change without notice. In
+addition, any stated support is planned and is also subject to change. While every precaution has been taken in the
+preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is
+under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no
+representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes
+no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular
+purposes, with respect to the operation or use of AMD hardware, software or other products described herein.
 © 2023-2024 Advanced Micro Devices, Inc. All Rights Reserved.
diff --git a/amdsmi_cli/README.md b/amdsmi_cli/README.md
index f7aa99f01c..ef883b9d2f 100644
--- a/amdsmi_cli/README.md
+++ b/amdsmi_cli/README.md
@@ -1,1228 +1,28 @@
-# AMD SMI CLI Tool
+# AMD SMI CLI tool
-**Disclaimer: CLI Tool is provided as an example code to aid the development of telemetry tools and is not guaranteed to be backwards compatible. The Python or C++ Library is recommended as a reliable data source.**
+A command line tool for manipulating and monitoring the `amdgpu` kernel;
+`amd-smi` is intended to replace and deprecate the existing
+[`rocm-smi`](https://github.com/rocm/rocm_smi_lib) CLI tool.
-This tool acts as a command line interface for manipulating
-and monitoring the amdgpu kernel, and is intended to replace
-and deprecate the existing rocm_smi CLI tool & gpuv-smi tool.
-It uses Ctypes to call the amd_smi_lib API.
-Recommended: At least one AMD GPU with AMD driver installed
+When using the CLI tool, you should have at least one AMD GPU and the driver
+installed.
-## Install CLI Tool and Python Library
+>[!NOTE]
+>The AMD SMI CLI tool is provided as an example code to aid the development of
+>telemetry tools. The Python or C++ library is recommended as a robust data
+>source.
-### Requirements
+Find the documentation in the `docs/` directory.
-* python 3.6.8+ 64-bit
-* amdgpu or amd_hsmp driver must be loaded for amdsmi_init() to pass
+- [Install AMD SMI](../docs/install/install.md)
+- [About the tool and how to get started](../docs/how-to/amdsmi-cli-tool.md)
-### Installation
+## Online documentation
-* [Install amdgpu driver](../README.md#install-amdgpu-using-rocm)
-* Optionally install amd_hsmp driver for ESMI CPU functions
-* Install amd-smi-lib package through package manager
-* amd-smi --help
+Explore the latest documentation on the [ROCm documentation
+portal](https://rocm.docs.amd.com/projects/en/latest/index.html).
-### Install Example for Ubuntu 22.04
+- [Install AMD SMI](https://rocm.docs.amd.com/projects/en/latest/install/install.html)
-``` bash
-apt install amd-smi-lib
-amd-smi --help
-```
+- [CLI tool usage](https://rocm.docs.amd.com/projects/en/latest/how-to/amdsmi-cli-tool.html).
-### Optional autocompletion
-
-`amd-smi` cli application supports autocompletion. The package should attempt to install it, if argcomplete is not installed you can enable it by using the following commands:
-
-```bash
-python3 -m pip install argcomplete
-activate-global-python-argcomplete --user
-# restart shell to enable
-```
-
-### Manual/Multiple Rocm Instance Python Library Install
-
-In the event there are multiple rocm installations and pyenv is not being used, to use the correct amdsmi version you must uninstall previous versions of amd-smi and install the version you want directly from your rocm instance.
-
-#### Python Library Install Example for Ubuntu 22.04
-
-Remove previous amdsmi installation:
-
-```bash
-python3 -m pip list | grep amd
-python3 -m pip uninstall amdsmi
-```
-
-Then install Python library from your target rocm instance:
-
-``` bash
-apt install amd-smi-lib
-amd-smi --help
-cd /opt/rocm/share/amd_smi
-python3 -m pip install --upgrade pip
-python3 -m pip install --user .
-```
-
-Now you have the amdsmi python library in your python path:
-
-``` bash
-~$ python3
-Python 3.8.10 (default, May 26 2023, 14:05:08)
-[GCC 9.4.0] on linux
-Type "help", "copyright", "credits" or "license" for more information.
->>> import amdsmi
->>>
-```
-
-## Usage
-
-AMD-SMI reports the version and current platform detected when running the command line interface (CLI) without arguments:
-
-``` bash
-~$ amd-smi
-usage: amd-smi [-h]  ...
-
-AMD System Management Interface | Version: 24.7.0.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal
-
-options:
-  -h, --help          show this help message and exit
-
-AMD-SMI Commands:
-                      Descriptions:
-    version           Display version information
-    list              List GPU information
-    static            Gets static information about the specified GPU
-    firmware (ucode)  Gets firmware information about the specified GPU
-    bad-pages         Gets bad page information about the specified GPU
-    metric            Gets metric/performance information about the specified GPU
-    process           Lists general process information running on the specified GPU
-    event             Displays event information for the given GPU
-    topology          Displays topology information of the devices
-    set               Set options for devices
-    reset             Reset options for devices
-    monitor (dmon)    Monitor metrics for target devices
-    xgmi              Displays xgmi information of the devices
-```
-
-Example commands:
-
-``` bash
-amd-smi static --gpu 0
-amd-smi metric
-amd-smi process --gpu 0 1
-amd-smi reset --gpureset --gpu all
-```
-
-More detailed verison information is available from `amd-smi version`
-
-Each command will have detailed information via `amd-smi [command] --help`
-
-
-## Commands
-
-For convenience, here is the help output for each command
-
-```bash
-~$ amd-smi list --help
-usage: amd-smi list [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
-                    [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]]
-
-Lists all the devices on the system and the links between devices.
-Lists all the sockets and for each socket, GPUs and/or CPUs associated to
-that socket alongside some basic information for each device.
-In virtualization environments, it can also list VFs associated to each
-GPU with some basic information for each VF.
-
-options:
-  -h, --help                  show this help message and exit
-  -g, --gpu GPU [GPU ...]     Select a GPU ID, BDF, or UUID from the possible choices:
-                              ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714
-                              ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084
-                              ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854
-                              ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537
-                                all | Selects all devices
-  -U, --cpu CPU [CPU ...]     Select a CPU ID from the possible choices:
-                              ID: 0
-                              ID: 1
-                              ID: 2
-                              ID: 3
-                                all | Selects all devices
-  -O, --core CORE [CORE ...]  Select a Core ID from the possible choices:
-                              ID: 0 - 95
-                                all | Selects all devices
-
-Command Modifiers:
-  --json                      Displays output in JSON format (human readable by default).
-  --csv                       Displays output in CSV format (human readable by default).
-  --file FILE                 Saves output into a file on the provided path (stdout by default).
-  --loglevel LEVEL            Set the logging level from the possible choices:
-                                DEBUG, INFO, WARNING, ERROR, CRITICAL
-```
-
-```bash
-~$ amd-smi static --help
-usage: amd-smi static [-h] [-g GPU [GPU ...] | -U CPU [CPU ...]] [-a] [-b] [-V] [-d] [-v]
-                      [-c] [-B] [-R] [-r] [-p] [-l] [-P] [-x] [-u] [-s] [-i]
-                      [--json | --csv] [--file FILE] [--loglevel LEVEL]
-
-If no GPU is specified, returns static information for all GPUs on the system.
-If no static argument is provided, all static information will be displayed.
-
-Static Arguments:
-  -h, --help                  show this help message and exit
-  -g, --gpu GPU [GPU ...]     Select a GPU ID, BDF, or UUID from the possible choices:
-                              ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714
-                              ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084
-                              ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854
-                              ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537
-                                all | Selects all devices
-  -U, --cpu CPU [CPU ...]     Select a CPU ID from the possible choices:
-                              ID: 0
-                              ID: 1
-                              ID: 2
-                              ID: 3
-                                all | Selects all devices
-  -a, --asic                  All asic information
-  -b, --bus                   All bus information
-  -V, --vbios                 All video bios information (if available)
-  -d, --driver                Displays driver version
-  -v, --vram                  All vram information
-  -c, --cache                 All cache information
-  -B, --board                 All board information
-  -R, --process-isolation     The process isolation status
-  -r, --ras                   Displays RAS features information
-  -p, --partition             Partition information
-  -l, --limit                 All limit metric values (i.e. power and thermal limits)
-  -P, --policy                The available DPM policy
-  -x, --xgmi-plpd             The available XGMI per-link power down policy
-  -u, --numa                  All numa node information
-
-CPU Arguments:
-  -s, --smu                   All SMU FW information
-  -i, --interface-ver         Displays hsmp interface version
-
-Command Modifiers:
-  --json                      Displays output in JSON format (human readable by default).
-  --csv                       Displays output in CSV format (human readable by default).
-  --file FILE                 Saves output into a file on the provided path (stdout by default).
-  --loglevel LEVEL            Set the logging level from the possible choices:
-                                DEBUG, INFO, WARNING, ERROR, CRITICAL
-```
-
-```bash
-~$ amd-smi firmware --help
-usage: amd-smi firmware [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
-                        [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-f]
-
-If no GPU is specified, return firmware information for all GPUs on the system.
-
-Firmware Arguments:
-  -h, --help                   show this help message and exit
-  -g, --gpu GPU [GPU ...]      Select a GPU ID, BDF, or UUID from the possible choices:
-                               ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714
-                               ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084
-                               ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854
-                               ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537
-                                 all | Selects all devices
-  -U, --cpu CPU [CPU ...]      Select a CPU ID from the possible choices:
-                               ID: 0
-                               ID: 1
-                               ID: 2
-                               ID: 3
-                                 all | Selects all devices
-  -O, --core CORE [CORE ...]   Select a Core ID from the possible choices:
-                               ID: 0 - 95
-                                 all | Selects all devices
-  -f, --ucode-list, --fw-list  All FW list information
-
-Command Modifiers:
-  --json                       Displays output in JSON format (human readable by default).
-  --csv                        Displays output in CSV format (human readable by default).
-  --file FILE                  Saves output into a file on the provided path (stdout by default).
-  --loglevel LEVEL             Set the logging level from the possible choices:
-                                 DEBUG, INFO, WARNING, ERROR, CRITICAL
-```
-
-```bash
-~$ amd-smi bad-pages --help
-usage: amd-smi bad-pages [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
-                         [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-p]
-                         [-r] [-u]
-
-If no GPU is specified, return bad page information for all GPUs on the system.
-
-Bad Pages Arguments:
-  -h, --help                  show this help message and exit
-  -g, --gpu GPU [GPU ...]     Select a GPU ID, BDF, or UUID from the possible choices:
-                              ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714
-                              ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084
-                              ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854
-                              ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537
-                                all | Selects all devices
-  -U, --cpu CPU [CPU ...]     Select a CPU ID from the possible choices:
-                              ID: 0
-                              ID: 1
-                              ID: 2
-                              ID: 3
-                                all | Selects all devices
-  -O, --core CORE [CORE ...]  Select a Core ID from the possible choices:
-                              ID: 0 - 95
-                                all | Selects all devices
-  -p, --pending               Displays all pending retired pages
-  -r, --retired               Displays retired pages
-  -u, --un-res                Displays unreservable pages
-
-Command Modifiers:
-  --json                      Displays output in JSON format (human readable by default).
-  --csv                       Displays output in CSV format (human readable by default).
-  --file FILE                 Saves output into a file on the provided path (stdout by default).
-  --loglevel LEVEL            Set the logging level from the possible choices:
-                                DEBUG, INFO, WARNING, ERROR, CRITICAL
-```
-
-```bash
-~$ amd-smi metric --help
-usage: amd-smi metric [-h] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]]
-                      [-w INTERVAL] [-W TIME] [-i ITERATIONS] [-m] [-u] [-p] [-c] [-t]
-                      [-P] [-e] [-k] [-f] [-C] [-o] [-l] [-x] [-E] [--cpu-power-metrics]
-                      [--cpu-prochot] [--cpu-freq-metrics] [--cpu-c0-res]
-                      [--cpu-lclk-dpm-level NBIOID] [--cpu-pwr-svi-telemtry-rails]
-                      [--cpu-io-bandwidth IO_BW LINKID_NAME]
-                      [--cpu-xgmi-bandwidth XGMI_BW LINKID_NAME] [--cpu-metrics-ver]
-                      [--cpu-metrics-table] [--cpu-socket-energy] [--cpu-ddr-bandwidth]
-                      [--cpu-temp] [--cpu-dimm-temp-range-rate DIMM_ADDR]
-                      [--cpu-dimm-pow-consumption DIMM_ADDR]
-                      [--cpu-dimm-thermal-sensor DIMM_ADDR] [--core-boost-limit]
-                      [--core-curr-active-freq-core-limit] [--core-energy]
-                      [--json | --csv] [--file FILE] [--loglevel LEVEL]
-
-If no GPU is specified, returns metric information for all GPUs on the system.
-If no metric argument is provided all metric information will be displayed.
-
-Metric arguments:
-  -h, --help                   show this help message and exit
-  -g, --gpu GPU [GPU ...]      Select a GPU ID, BDF, or UUID from the possible choices:
-                               ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714
-                               ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084
-                               ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854
-                               ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537
-                                 all | Selects all devices
-  -U, --cpu CPU [CPU ...]      Select a CPU ID from the possible choices:
-                               ID: 0
-                               ID: 1
-                               ID: 2
-                               ID: 3
-                                 all | Selects all devices
-  -O, --core CORE [CORE ...]   Select a Core ID from the possible choices:
-                               ID: 0 - 95
-                                 all | Selects all devices
-  -w, --watch INTERVAL         Reprint the command in a loop of INTERVAL seconds
-  -W, --watch_time TIME        The total TIME to watch the given command
-  -i, --iterations ITERATIONS  Total number of ITERATIONS to loop on the given command
-  -m, --mem-usage              Memory usage per block
-  -u, --usage                  Displays engine usage information
-  -p, --power                  Current power usage
-  -c, --clock                  Average, max, and current clock frequencies
-  -t, --temperature            Current temperatures
-  -P, --pcie                   Current PCIe speed, width, and replay count
-  -e, --ecc                    Total number of ECC errors
-  -k, --ecc-blocks             Number of ECC errors per block
-  -f, --fan                    Current fan speed
-  -C, --voltage-curve          Display voltage curve
-  -o, --overdrive              Current GPU clock overdrive level
-  -l, --perf-level             Current DPM performance level
-  -x, --xgmi-err               XGMI error information since last read
-  -E, --energy                 Amount of energy consumed
-
-CPU Arguments:
-  --cpu-power-metrics          CPU power metrics
-  --cpu-prochot                Displays prochot status
-  --cpu-freq-metrics           Displays currentFclkMemclk frequencies and cclk frequency limit
-  --cpu-c0-res                 Displays C0 residency
-  --cpu-lclk-dpm-level NBIOID  Displays lclk dpm level range. Requires socket ID and NBOID as inputs
-  --cpu-pwr-svi-telemtry-rails  Displays svi based telemetry for all rails
-  --cpu-io-bandwidth IO_BW LINKID_NAME  Displays current IO bandwidth for the selected CPU.
- input parameters are bandwidth type(1) and link ID encodings - i.e. P2, P3, G0 - G7 - --cpu-xgmi-bandwidth XGMI_BW LINKID_NAME Displays current XGMI bandwidth for the selected CPU - input parameters are bandwidth type(1,2,4) and link ID encodings - i.e. P2, P3, G0 - G7 - --cpu-metrics-ver Displays metrics table version - --cpu-metrics-table Displays metric table - --cpu-socket-energy Displays socket energy for the selected CPU socket - --cpu-ddr-bandwidth Displays per socket max ddr bw, current utilized bw, - and current utilized ddr bw in percentage - --cpu-temp Displays cpu socket temperature - --cpu-dimm-temp-range-rate DIMM_ADDR Displays dimm temperature range and refresh rate - --cpu-dimm-pow-consumption DIMM_ADDR Displays dimm power consumption - --cpu-dimm-thermal-sensor DIMM_ADDR Displays dimm thermal sensor - -CPU Core Arguments: - --core-boost-limit Get boost limit for the selected cores - --core-curr-active-freq-core-limit Get Current CCLK limit set per Core - --core-energy Displays core energy for the selected core - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi process --help -usage: amd-smi process [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] - [-w INTERVAL] [-W TIME] [-i ITERATIONS] [-G] [-e] [-p PID] - [-n NAME] - -If no GPU is specified, returns information for all GPUs on the system. -If no process argument is provided all process information will be displayed. - -Process arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] 
Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -w, --watch INTERVAL Reprint the command in a loop of INTERVAL seconds - -W, --watch_time TIME The total TIME to watch the given command - -i, --iterations ITERATIONS Total number of ITERATIONS to loop on the given command - -G, --general pid, process name, memory usage - -e, --engine All engine usages - -p, --pid PID Gets all process information about the specified process based on Process ID - -n, --name NAME Gets all process information about the specified process based on Process Name. - If multiple processes have the same name information is returned for all of them. - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi event --help -usage: amd-smi event [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] - -If no GPU is specified, returns event information for all GPUs on the system. - -Event Arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] 
Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi topology --help -usage: amd-smi topology [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-a] - [-w] [-o] [-t] [-b] - -If no GPU is specified, returns information for all GPUs on the system. -If no topology argument is provided all topology information will be displayed. - -Topology arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] 
Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -a, --access Displays link accessibility between GPUs - -w, --weight Displays relative weight between GPUs - -o, --hops Displays the number of hops between GPUs - -t, --link-type Displays the link type between GPUs - -b, --numa-bw Display max and min bandwidth between nodes - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi set --help -usage: amd-smi set [-h] (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-f %] - [-l LEVEL] [-P SETPROFILE] [-d SCLKMAX] [-C PARTITION] [-M PARTITION] - [-o WATTS] [-p POLICY_ID] [-x POLICY_ID] [-R STATUS] - [--cpu-pwr-limit PWR_LIMIT] [--cpu-xgmi-link-width MIN_WIDTH MAX_WIDTH] - [--cpu-lclk-dpm-level NBIOID MIN_DPM MAX_DPM] [--cpu-pwr-eff-mode MODE] - [--cpu-gmi3-link-width MIN_LW MAX_LW] [--cpu-pcie-link-rate LINK_RATE] - [--cpu-df-pstate-range MAX_PSTATE MIN_PSTATE] [--cpu-enable-apb] - [--cpu-disable-apb DF_PSTATE] [--soc-boost-limit BOOST_LIMIT] - [--core-boost-limit BOOST_LIMIT] [--json | --csv] [--file FILE] - [--loglevel LEVEL] - -A GPU must be specified to set a configuration. -A set argument must be provided; Multiple set arguments are accepted - -Set Arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] 
Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -f, --fan % Set GPU fan speed (0-255 or 0-100%) - -l, --perf-level LEVEL Set performance level - -P, --profile SETPROFILE Set power profile level (#) or a quoted string of custom profile attributes - -d, --perf-determinism SCLKMAX Set GPU clock frequency limit and performance level to determinism to get minimal performance variation - -C, --compute-partition PARTITION Set one of the following the compute partition modes: - CPX, SPX, DPX, TPX, QPX - -M, --memory-partition PARTITION Set one of the following the memory partition modes: - NPS1, NPS2, NPS4, NPS8 - -o, --power-cap WATTS Set power capacity limit - -p, --dpm-policy POLICY_ID Set the GPU DPM policy using policy id - -x, --xgmi-plpd POLICY_ID Set the GPU XGMI per-link power down policy using policy id - -R, --process-isolation STATUS Enable or disable the GPU process isolation: 0 for disable and 1 for enable. - -CPU Arguments: - --cpu-pwr-limit PWR_LIMIT Set power limit for the given socket. Input parameter is power limit value. - --cpu-xgmi-link-width MIN_WIDTH MAX_WIDTH Set max and Min linkwidth. Input parameters are min and max link width values - --cpu-lclk-dpm-level NBIOID MIN_DPM MAX_DPM Sets the max and min dpm level on a given NBIO. - Input parameters are die_index, min dpm, max dpm. - --cpu-pwr-eff-mode MODE Sets the power efficency mode policy. Input parameter is mode. 
- --cpu-gmi3-link-width MIN_LW MAX_LW Sets max and min gmi3 link width range - --cpu-pcie-link-rate LINK_RATE Sets pcie link rate - --cpu-df-pstate-range MAX_PSTATE MIN_PSTATE Sets max and min df-pstates - --cpu-enable-apb Enables the DF p-state performance boost algorithm - --cpu-disable-apb DF_PSTATE Disables the DF p-state performance boost algorithm. Input parameter is DFPstate (0-3) - --soc-boost-limit BOOST_LIMIT Sets the boost limit for the given socket. Input parameter is socket BOOST_LIMIT value - -CPU Core Arguments: - --core-boost-limit BOOST_LIMIT Sets the boost limit for the given core. Input parameter is core BOOST_LIMIT value - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi reset --help -usage: amd-smi reset [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-G] [-c] - [-f] [-p] [-x] [-d] [-C] [-M] [-o] [-l] - -A GPU must be specified to reset a configuration. -A reset argument must be provided; Multiple reset arguments are accepted - -Reset Arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] 
Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -G, --gpureset Reset the specified GPU - -c, --clocks Reset clocks and overdrive to default - -f, --fans Reset fans to automatic (driver) control - -p, --profile Reset power profile back to default - -x, --xgmierr Reset XGMI error counts - -d, --perf-determinism Disable performance determinism - -C, --compute-partition Reset compute partitions on the specified GPU - -M, --memory-partition Reset memory partitions on the specified GPU - -o, --power-cap Reset power capacity limit to max capable - -l, --run-shader SHADER_NAME Run the shader on processor. Only CLEANER shader can be used to clean up data in LDS/GPRs - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi monitor --help -usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] - [-w INTERVAL] [-W TIME] [-i ITERATIONS] [-p] [-t] [-u] [-m] [-n] - [-d] [-e] [-v] [-r] [-q] - -Monitor a target device for the specified arguments. -If no arguments are provided, all arguments will be enabled. -Use the watch arguments to run continuously - -Monitor Arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] 
Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -w, --watch INTERVAL Reprint the command in a loop of INTERVAL seconds - -W, --watch_time TIME The total TIME to watch the given command - -i, --iterations ITERATIONS Total number of ITERATIONS to loop on the given command - -p, --power-usage Monitor power usage in Watts - -t, --temperature Monitor temperature in Celsius - -u, --gfx Monitor graphics utilization (%) and clock (MHz) - -m, --mem Monitor memory utilization (%) and clock (MHz) - -n, --encoder Monitor encoder utilization (%) and clock (MHz) - -d, --decoder Monitor decoder utilization (%) and clock (MHz) - -e, --ecc Monitor ECC single bit, ECC double bit, and PCIe replay error counts - -v, --vram-usage Monitor memory usage in MB - -r, --pcie Monitor PCIe bandwidth in Mb/s - -q, --process Enable Process information table below monitor output - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -```bash -~$ amd-smi xgmi --help -usage: amd-smi xgmi [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] - [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-m] - -If no GPU is specified, returns information for all GPUs on the system. -If no xgmi argument is provided all xgmi information will be displayed. - -XGMI arguments: - -h, --help show this help message and exit - -g, --gpu GPU [GPU ...] 
Select a GPU ID, BDF, or UUID from the possible choices: - ID: 0 | BDF: 0000:01:00.0 | UUID: 7eff74a0-0000-1000-808f-7e20764e2714 - ID: 1 | BDF: 0001:01:00.0 | UUID: b6ff74a0-0000-1000-80ae-7c8cefe1f084 - ID: 2 | BDF: 0002:01:00.0 | UUID: 36ff74a0-0000-1000-8071-25d815189854 - ID: 3 | BDF: 0003:01:00.0 | UUID: f4ff74a0-0000-1000-80c4-4c2be5e66537 - all | Selects all devices - -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: - ID: 0 - ID: 1 - ID: 2 - ID: 3 - all | Selects all devices - -O, --core CORE [CORE ...] Select a Core ID from the possible choices: - ID: 0 - 95 - all | Selects all devices - -m, --metric Metric XGMI information - -Command Modifiers: - --json Displays output in JSON format (human readable by default). - --csv Displays output in CSV format (human readable by default). - --file FILE Saves output into a file on the provided path (stdout by default). - --loglevel LEVEL Set the logging level from the possible choices: - DEBUG, INFO, WARNING, ERROR, CRITICAL -``` - -### Example output from amd-smi static - -Here is some example output from the tool: - -```bash -~$ amd-smi static -CPU: 0 - SMU: - FW_VERSION: 85.90.0 - INTERFACE_VERSION: - PROTO VERSION: 6 - -CPU: 1 - SMU: - FW_VERSION: 85.90.0 - INTERFACE_VERSION: - PROTO VERSION: 6 - -CPU: 2 - SMU: - FW_VERSION: 85.90.0 - INTERFACE_VERSION: - PROTO VERSION: 6 - -CPU: 3 - SMU: - FW_VERSION: 85.90.0 - INTERFACE_VERSION: - PROTO VERSION: 6 - - -GPU: 0 - ASIC: - MARKET_NAME: MI300A - VENDOR_ID: 0x1002 - VENDOR_NAME: Advanced Micro Devices Inc. 
[AMD/ATI] - SUBVENDOR_ID: 0x1002 - DEVICE_ID: 0x74a0 - REV_ID: 0x00 - ASIC_SERIAL: 0x7E8F7E20764E2714 - OAM_ID: 0 - BUS: - BDF: 0000:01:00.0 - MAX_PCIE_WIDTH: 16 - MAX_PCIE_SPEED: 32 GT/s - PCIE_INTERFACE_VERSION: Gen 5 - SLOT_TYPE: PCIE - VBIOS: - NAME: N/A - BUILD_DATE: N/A - PART_NUMBER: N/A - VERSION: N/A - LIMIT: - MAX_POWER: 550 W - MIN_POWER: 0 W - SOCKET_POWER: 550 W - SLOWDOWN_EDGE_TEMPERATURE: N/A - SLOWDOWN_HOTSPOT_TEMPERATURE: 100 °C - SLOWDOWN_VRAM_TEMPERATURE: 105 °C - SHUTDOWN_EDGE_TEMPERATURE: N/A - SHUTDOWN_HOTSPOT_TEMPERATURE: 110 °C - SHUTDOWN_VRAM_TEMPERATURE: 115 °C - DRIVER: - NAME: amdgpu - VERSION: 6.9.0-rc5+ - BOARD: - MODEL_NUMBER: N/A - PRODUCT_SERIAL: N/A - FRU_ID: N/A - PRODUCT_NAME: Aqua Vanjaram [Instinct MI300A] - MANUFACTURER_NAME: Advanced Micro Devices, Inc. [AMD/ATI] - RAS: - EEPROM_VERSION: 0x0 - PARITY_SCHEMA: DISABLED - SINGLE_BIT_SCHEMA: DISABLED - DOUBLE_BIT_SCHEMA: DISABLED - POISON_SCHEMA: ENABLED - ECC_BLOCK_STATE: - UMC: DISABLED - SDMA: ENABLED - GFX: ENABLED - MMHUB: ENABLED - ATHUB: DISABLED - PCIE_BIF: DISABLED - HDP: DISABLED - XGMI_WAFL: DISABLED - DF: DISABLED - SMN: DISABLED - SEM: DISABLED - MP0: DISABLED - MP1: DISABLED - FUSE: DISABLED - MCA: DISABLED - VCN: DISABLED - JPEG: DISABLED - IH: DISABLED - MPIO: DISABLED - PARTITION: - COMPUTE_PARTITION: SPX - MEMORY_PARTITION: NPS1 - SOC_PSTATE: - NUM_SUPPORTED: 4 - CURRENT_ID: 1 - POLICIES: - POLICY_ID: 0 - POLICY_DESCRIPTION: pstate_default - POLICY_ID: 1 - POLICY_DESCRIPTION: soc_pstate_0 - POLICY_ID: 2 - POLICY_DESCRIPTION: soc_pstate_1 - POLICY_ID: 3 - POLICY_DESCRIPTION: soc_pstate_2 - XGMI_PLPD: - NUM_SUPPORTED: 3 - CURRENT_ID: 1 - PLPDS: - POLICY_ID: 0 - POLICY_DESCRIPTION: plpd_disallow - POLICY_ID: 1 - POLICY_DESCRIPTION: plpd_default - POLICY_ID: 2 - POLICY_DESCRIPTION: plpd_optimized - PROCESS_ISOLATION: N/A - NUMA: - NODE: 0 - AFFINITY: 0 - VRAM: - TYPE: HBM - VENDOR: N/A - SIZE: 64289 MB - CACHE_INFO: - CACHE_0: - CACHE_PROPERTIES: DATA_CACHE, 
SIMD_CACHE - CACHE_SIZE: 32 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 348 - CACHE_1: - CACHE_PROPERTIES: INST_CACHE, SIMD_CACHE - CACHE_SIZE: 64 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 120 - CACHE_2: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 4096 KB - CACHE_LEVEL: 2 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - CACHE_3: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 262144 KB - CACHE_LEVEL: 3 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - -GPU: 1 - ASIC: - MARKET_NAME: MI300A - VENDOR_ID: 0x1002 - VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI] - SUBVENDOR_ID: 0x1002 - DEVICE_ID: 0x74a0 - REV_ID: 0x00 - ASIC_SERIAL: 0xB6AE7C8CEFE1F084 - OAM_ID: 1 - BUS: - BDF: 0001:01:00.0 - MAX_PCIE_WIDTH: 16 - MAX_PCIE_SPEED: 32 GT/s - PCIE_INTERFACE_VERSION: Gen 5 - SLOT_TYPE: PCIE - VBIOS: - NAME: N/A - BUILD_DATE: N/A - PART_NUMBER: N/A - VERSION: N/A - LIMIT: - MAX_POWER: 550 W - MIN_POWER: 0 W - SOCKET_POWER: 550 W - SLOWDOWN_EDGE_TEMPERATURE: N/A - SLOWDOWN_HOTSPOT_TEMPERATURE: 100 °C - SLOWDOWN_VRAM_TEMPERATURE: 105 °C - SHUTDOWN_EDGE_TEMPERATURE: N/A - SHUTDOWN_HOTSPOT_TEMPERATURE: 110 °C - SHUTDOWN_VRAM_TEMPERATURE: 115 °C - DRIVER: - NAME: amdgpu - VERSION: 6.9.0-rc5+ - BOARD: - MODEL_NUMBER: N/A - PRODUCT_SERIAL: N/A - FRU_ID: N/A - PRODUCT_NAME: Aqua Vanjaram [Instinct MI300A] - MANUFACTURER_NAME: Advanced Micro Devices, Inc. 
[AMD/ATI] - RAS: - EEPROM_VERSION: 0x0 - PARITY_SCHEMA: DISABLED - SINGLE_BIT_SCHEMA: DISABLED - DOUBLE_BIT_SCHEMA: DISABLED - POISON_SCHEMA: ENABLED - ECC_BLOCK_STATE: - UMC: DISABLED - SDMA: ENABLED - GFX: ENABLED - MMHUB: ENABLED - ATHUB: DISABLED - PCIE_BIF: DISABLED - HDP: DISABLED - XGMI_WAFL: DISABLED - DF: DISABLED - SMN: DISABLED - SEM: DISABLED - MP0: DISABLED - MP1: DISABLED - FUSE: DISABLED - MCA: DISABLED - VCN: DISABLED - JPEG: DISABLED - IH: DISABLED - MPIO: DISABLED - PARTITION: - COMPUTE_PARTITION: SPX - MEMORY_PARTITION: NPS1 - SOC_PSTATE: - NUM_SUPPORTED: 4 - CURRENT_ID: 1 - POLICIES: - POLICY_ID: 0 - POLICY_DESCRIPTION: pstate_default - POLICY_ID: 1 - POLICY_DESCRIPTION: soc_pstate_0 - POLICY_ID: 2 - POLICY_DESCRIPTION: soc_pstate_1 - POLICY_ID: 3 - POLICY_DESCRIPTION: soc_pstate_2 - XGMI_PLPD: - NUM_SUPPORTED: 3 - CURRENT_ID: 1 - PLPDS: - POLICY_ID: 0 - POLICY_DESCRIPTION: plpd_disallow - POLICY_ID: 1 - POLICY_DESCRIPTION: plpd_default - POLICY_ID: 2 - POLICY_DESCRIPTION: plpd_optimized - PROCESS_ISOLATION: N/A - NUMA: - NODE: 1 - AFFINITY: 1 - VRAM: - TYPE: HBM - VENDOR: N/A - SIZE: 64289 MB - CACHE_INFO: - CACHE_0: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 32 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 348 - CACHE_1: - CACHE_PROPERTIES: INST_CACHE, SIMD_CACHE - CACHE_SIZE: 64 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 120 - CACHE_2: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 4096 KB - CACHE_LEVEL: 2 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - CACHE_3: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 262144 KB - CACHE_LEVEL: 3 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - -GPU: 2 - ASIC: - MARKET_NAME: MI300A - VENDOR_ID: 0x1002 - VENDOR_NAME: Advanced Micro Devices Inc. 
[AMD/ATI] - SUBVENDOR_ID: 0x1002 - DEVICE_ID: 0x74a0 - REV_ID: 0x00 - ASIC_SERIAL: 0x367125D815189854 - OAM_ID: 2 - BUS: - BDF: 0002:01:00.0 - MAX_PCIE_WIDTH: 16 - MAX_PCIE_SPEED: 32 GT/s - PCIE_INTERFACE_VERSION: Gen 5 - SLOT_TYPE: PCIE - VBIOS: - NAME: N/A - BUILD_DATE: N/A - PART_NUMBER: N/A - VERSION: N/A - LIMIT: - MAX_POWER: 550 W - MIN_POWER: 0 W - SOCKET_POWER: 550 W - SLOWDOWN_EDGE_TEMPERATURE: N/A - SLOWDOWN_HOTSPOT_TEMPERATURE: 100 °C - SLOWDOWN_VRAM_TEMPERATURE: 105 °C - SHUTDOWN_EDGE_TEMPERATURE: N/A - SHUTDOWN_HOTSPOT_TEMPERATURE: 110 °C - SHUTDOWN_VRAM_TEMPERATURE: 115 °C - DRIVER: - NAME: amdgpu - VERSION: 6.9.0-rc5+ - BOARD: - MODEL_NUMBER: N/A - PRODUCT_SERIAL: N/A - FRU_ID: N/A - PRODUCT_NAME: Aqua Vanjaram [Instinct MI300A] - MANUFACTURER_NAME: Advanced Micro Devices, Inc. [AMD/ATI] - RAS: - EEPROM_VERSION: 0x0 - PARITY_SCHEMA: DISABLED - SINGLE_BIT_SCHEMA: DISABLED - DOUBLE_BIT_SCHEMA: DISABLED - POISON_SCHEMA: ENABLED - ECC_BLOCK_STATE: - UMC: DISABLED - SDMA: ENABLED - GFX: ENABLED - MMHUB: ENABLED - ATHUB: DISABLED - PCIE_BIF: DISABLED - HDP: DISABLED - XGMI_WAFL: DISABLED - DF: DISABLED - SMN: DISABLED - SEM: DISABLED - MP0: DISABLED - MP1: DISABLED - FUSE: DISABLED - MCA: DISABLED - VCN: DISABLED - JPEG: DISABLED - IH: DISABLED - MPIO: DISABLED - PARTITION: - COMPUTE_PARTITION: SPX - MEMORY_PARTITION: NPS1 - SOC_PSTATE: - NUM_SUPPORTED: 4 - CURRENT_ID: 1 - POLICIES: - POLICY_ID: 0 - POLICY_DESCRIPTION: pstate_default - POLICY_ID: 1 - POLICY_DESCRIPTION: soc_pstate_0 - POLICY_ID: 2 - POLICY_DESCRIPTION: soc_pstate_1 - POLICY_ID: 3 - POLICY_DESCRIPTION: soc_pstate_2 - XGMI_PLPD: - NUM_SUPPORTED: 3 - CURRENT_ID: 1 - PLPDS: - POLICY_ID: 0 - POLICY_DESCRIPTION: plpd_disallow - POLICY_ID: 1 - POLICY_DESCRIPTION: plpd_default - POLICY_ID: 2 - POLICY_DESCRIPTION: plpd_optimized - PROCESS_ISOLATION: N/A - NUMA: - NODE: 2 - AFFINITY: 2 - VRAM: - TYPE: HBM - VENDOR: N/A - SIZE: 64289 MB - CACHE_INFO: - CACHE_0: - CACHE_PROPERTIES: DATA_CACHE, 
SIMD_CACHE - CACHE_SIZE: 32 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 348 - CACHE_1: - CACHE_PROPERTIES: INST_CACHE, SIMD_CACHE - CACHE_SIZE: 64 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 120 - CACHE_2: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 4096 KB - CACHE_LEVEL: 2 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - CACHE_3: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 262144 KB - CACHE_LEVEL: 3 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - -GPU: 3 - ASIC: - MARKET_NAME: MI300A - VENDOR_ID: 0x1002 - VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI] - SUBVENDOR_ID: 0x1002 - DEVICE_ID: 0x74a0 - REV_ID: 0x00 - ASIC_SERIAL: 0xF4C44C2BE5E66537 - OAM_ID: 3 - BUS: - BDF: 0003:01:00.0 - MAX_PCIE_WIDTH: 16 - MAX_PCIE_SPEED: 32 GT/s - PCIE_INTERFACE_VERSION: Gen 5 - SLOT_TYPE: PCIE - VBIOS: - NAME: N/A - BUILD_DATE: N/A - PART_NUMBER: N/A - VERSION: N/A - LIMIT: - MAX_POWER: 550 W - MIN_POWER: 0 W - SOCKET_POWER: 550 W - SLOWDOWN_EDGE_TEMPERATURE: N/A - SLOWDOWN_HOTSPOT_TEMPERATURE: 100 °C - SLOWDOWN_VRAM_TEMPERATURE: 105 °C - SHUTDOWN_EDGE_TEMPERATURE: N/A - SHUTDOWN_HOTSPOT_TEMPERATURE: 110 °C - SHUTDOWN_VRAM_TEMPERATURE: 115 °C - DRIVER: - NAME: amdgpu - VERSION: 6.9.0-rc5+ - BOARD: - MODEL_NUMBER: N/A - PRODUCT_SERIAL: N/A - FRU_ID: N/A - PRODUCT_NAME: Aqua Vanjaram [Instinct MI300A] - MANUFACTURER_NAME: Advanced Micro Devices, Inc. 
[AMD/ATI] - RAS: - EEPROM_VERSION: 0x0 - PARITY_SCHEMA: DISABLED - SINGLE_BIT_SCHEMA: DISABLED - DOUBLE_BIT_SCHEMA: DISABLED - POISON_SCHEMA: ENABLED - ECC_BLOCK_STATE: - UMC: DISABLED - SDMA: ENABLED - GFX: ENABLED - MMHUB: ENABLED - ATHUB: DISABLED - PCIE_BIF: DISABLED - HDP: DISABLED - XGMI_WAFL: DISABLED - DF: DISABLED - SMN: DISABLED - SEM: DISABLED - MP0: DISABLED - MP1: DISABLED - FUSE: DISABLED - MCA: DISABLED - VCN: DISABLED - JPEG: DISABLED - IH: DISABLED - MPIO: DISABLED - PARTITION: - COMPUTE_PARTITION: SPX - MEMORY_PARTITION: NPS1 - SOC_PSTATE: - NUM_SUPPORTED: 4 - CURRENT_ID: 1 - POLICIES: - POLICY_ID: 0 - POLICY_DESCRIPTION: pstate_default - POLICY_ID: 1 - POLICY_DESCRIPTION: soc_pstate_0 - POLICY_ID: 2 - POLICY_DESCRIPTION: soc_pstate_1 - POLICY_ID: 3 - POLICY_DESCRIPTION: soc_pstate_2 - XGMI_PLPD: - NUM_SUPPORTED: 3 - CURRENT_ID: 1 - PLPDS: - POLICY_ID: 0 - POLICY_DESCRIPTION: plpd_disallow - POLICY_ID: 1 - POLICY_DESCRIPTION: plpd_default - POLICY_ID: 2 - POLICY_DESCRIPTION: plpd_optimized - PROCESS_ISOLATION: N/A - NUMA: - NODE: 3 - AFFINITY: 3 - VRAM: - TYPE: HBM - VENDOR: N/A - SIZE: 64289 MB - CACHE_INFO: - CACHE_0: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 32 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 348 - CACHE_1: - CACHE_PROPERTIES: INST_CACHE, SIMD_CACHE - CACHE_SIZE: 64 KB - CACHE_LEVEL: 1 - MAX_NUM_CU_SHARED: 2 - NUM_CACHE_INSTANCE: 120 - CACHE_2: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 4096 KB - CACHE_LEVEL: 2 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - CACHE_3: - CACHE_PROPERTIES: DATA_CACHE, SIMD_CACHE - CACHE_SIZE: 262144 KB - CACHE_LEVEL: 3 - MAX_NUM_CU_SHARED: 228 - NUM_CACHE_INSTANCE: 1 - -``` - -## Disclaimer - -The information contained herein is for informational purposes only, and is subject to change without notice. 
While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. - -AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. - -Copyright (c) 2014-2024 Advanced Micro Devices, Inc. All rights reserved. diff --git a/docs/amdsmi_changelog_link.md b/docs/amdsmi_changelog_link.md deleted file mode 100644 index 66efc0fecd..0000000000 --- a/docs/amdsmi_changelog_link.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../CHANGELOG.md -``` diff --git a/docs/amdsmi_cli_readme_link.md b/docs/amdsmi_cli_readme_link.md deleted file mode 100644 index f2bd033d44..0000000000 --- a/docs/amdsmi_cli_readme_link.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../amdsmi_cli/README.md -``` diff --git a/docs/amdsmi_release_notes_link.md b/docs/amdsmi_release_notes_link.md deleted file mode 100644 index 849e567969..0000000000 --- a/docs/amdsmi_release_notes_link.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../amdsmi_cli/Release_Notes.md -``` diff --git a/docs/conceptual/test.md b/docs/conceptual/test.md deleted file mode 100644 index 9daeafb986..0000000000 --- a/docs/conceptual/test.md +++ /dev/null @@ -1 +0,0 @@ -test diff --git a/docs/conf.py b/docs/conf.py index 2b84ea8ce5..a8ac2208ef 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -4,29 +4,57 @@ # list see the 
documentation: # https://www.sphinx-doc.org/en/master/usage/configuration.html -import subprocess +import re +import shutil -from rocm_docs import ROCmDocs +# get version number to print in docs +def get_version_info(filepath): + with open(filepath, 'r') as f: + content = f.read() -get_version_year = r'sed -n -e "s/^#define\ AMDSMI_LIB_VERSION_YEAR\ //p" ../include/amd_smi/amdsmi.h' -get_version_major = r'sed -n -e "s/^#define\ AMDSMI_LIB_VERSION_MAJOR\ //p" ../include/amd_smi/amdsmi.h' -get_version_minor = r'sed -n -e "s/^#define\ AMDSMI_LIB_VERSION_MINOR\ //p" ../include/amd_smi/amdsmi.h' -get_version_release = r'sed -n -e "s/^#define\ AMDSMI_LIB_VERSION_RELEASE\ //p" ../include/amd_smi/amdsmi.h' -version_year = subprocess.getoutput(get_version_year) -version_major = subprocess.getoutput(get_version_major) -version_minor = subprocess.getoutput(get_version_minor) -version_release = subprocess.getoutput(get_version_release) -name = f"AMD SMI {version_year}.{version_major}.{version_minor}.{version_release}" + version_pattern = ( + r'^#define\s+AMDSMI_LIB_VERSION_YEAR\s+(\d+)\s*$|' + r'^#define\s+AMDSMI_LIB_VERSION_MAJOR\s+(\d+)\s*$|' + r'^#define\s+AMDSMI_LIB_VERSION_MINOR\s+(\d+)\s*$|' + r'^#define\s+AMDSMI_LIB_VERSION_RELEASE\s+(\d+)\s*$' + ) + matches = re.findall(version_pattern, content, re.MULTILINE) + + if len(matches) == 4: + version_year, version_major, version_minor, version_release = [ + match for match in matches if any(match) + ] + return version_year[0], version_major[1], version_minor[2], version_release[3] + else: + raise ValueError("Couldn't find all VERSION numbers.") + +# copy changelog to docs/ +shutil.copy2("../CHANGELOG.md", "./reference/changelog.md") + +version_year, version_major, version_minor, version_release = get_version_info('../include/amd_smi/amdsmi.h') +version_number = f"{version_year}.{version_major}.{version_minor}.{version_release}" + +# project info +project = "AMD SMI" +author = "Advanced Micro Devices, Inc." 
+copyright = "Copyright (c) 2024 Advanced Micro Devices, Inc. All rights reserved." +version = version_number +release = version_number + +html_theme = "rocm_docs_theme" +html_theme_options = {"flavor": "rocm"} +html_title = f"AMD SMI {version_number} documentation" +exclude_patterns = ["rocm-smi-lib"] +suppress_warnings = ["etoc.toctree"] external_toc_path = "./sphinx/_toc.yml" -docs_core = ROCmDocs(f"{name} Documentation") -docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/docBin/xml") -docs_core.enable_api_reference() -docs_core.setup() -docs_core.html_theme_options = { - "repository_url": "https://github.com/RadeonOpenCompute/amdsmi" -} +external_projects_current_project = "amdsmi" +extensions = ["rocm_docs", "rocm_docs.doxygen"] -for sphinx_var in ROCmDocs.SPHINX_VARS: - globals()[sphinx_var] = getattr(docs_core, sphinx_var) +doxygen_root = "doxygen" +doxysphinx_enabled = True +doxygen_project = { + "name": "AMD SMI C++ API reference", + "path": "doxygen/docBin/xml", +} diff --git a/docs/how-to/using-AMD-SMI-CLI-tool.md b/docs/how-to/amdsmi-cli-tool.md similarity index 94% rename from docs/how-to/using-AMD-SMI-CLI-tool.md rename to docs/how-to/amdsmi-cli-tool.md index 6371281088..c4a7243f6f 100644 --- a/docs/how-to/using-AMD-SMI-CLI-tool.md +++ b/docs/how-to/amdsmi-cli-tool.md @@ -1,10 +1,37 @@ -# Using AMD SMI Command Line Interface tool +--- +myst: + html_meta: + "description lang=en": "Learn how to use the AMD SMI command line tool." + "keywords": "api, smi, lib, system, management, interface, example" +--- -**Disclaimer: CLI Tool is provided as an example code to aid the development of telemetry tools. 
Python or C++ Library is recommended as a reliable data source.** +# AMD SMI CLI tool usage -AMD-SMI reports the version and current platform detected when running the command line interface (CLI) without arguments: +This tool is a command line interface (CLI) for manipulating and monitoring the +`amdgpu` kernel; it is intended to replace and deprecate the existing `rocm_smi` +CLI tool and `gpuv-smi` tool. The AMD SMI CLI tool uses Ctypes to call the +`amd_smi_lib` API. -``` bash +When using the CLI tool, you should have at least one AMD GPU and the driver +installed. + +```{admonition} Disclaimer +The AMD SMI CLI tool is provided as example code to aid the development of +telemetry tools. The [Python](./amdsmi-py-lib) or [C++ +library](./amdsmi-cpp-lib) is recommended as a robust data source. +``` + +## Install the CLI Tool and Python library + +Refer to the [installation instructions](../install/install.md). + +## Get started + +The amd-smi command provides system management and monitoring capabilities for +AMD hardware. When run without arguments, it reports the version and platform +detected: + +```shell-session ~$ amd-smi usage: amd-smi [-h] ... @@ -26,28 +53,38 @@ AMD-SMI Commands: topology Displays topology information of the devices set Set options for devices reset Reset options for devices - monitor Monitor metrics for target devices + monitor (dmon) Monitor metrics for target devices xgmi Displays xgmi information of the devices ``` Example commands: -``` bash +```shell-session amd-smi static --gpu 0 amd-smi metric amd-smi process --gpu 0 1 amd-smi reset --gpureset --gpu all ``` -More detailed verison information is available from `amd-smi version` +```{note} +For command-specific help, use `amd-smi [command] --help` to see more detailed +usage information. See [Commands](#cmds). -Each command will have detailed information via `amd-smi [command] --help` +For more detailed version information, use `amd-smi version`.
+``` +(cmds)= ## Commands -For convenience, here is the help output for each command +The following is the help output for each command, providing quick reference +details for usage. -```bash +(cmd-list)= +### amd-smi list + +Lists GPU information. + +```shell-session ~$ amd-smi list --help usage: amd-smi list [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] @@ -84,7 +121,13 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-static)= +### amd-smi static + +Gets static information about the specified GPU. See the [sample +output](#cli-ex-static) for `amd-smi static`. + +```shell-session ~$ amd-smi static --help usage: amd-smi static [-h] [-g GPU [GPU ...] | -U CPU [CPU ...]] [-a] [-b] [-V] [-d] [-v] [-c] [-B] [-R] [-r] [-p] [-l] [-P] [-x] [-u] [-s] [-i] @@ -134,7 +177,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-firmware)= +### amd-smi firmware + +Gets firmware information about the specified GPU. + +```shell-session ~$ amd-smi firmware --help usage: amd-smi firmware [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-f] @@ -168,7 +216,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-bad-pages)= +### amd-smi bad-pages + +Gets bad page information about the specified GPU. + +```shell-session ~$ amd-smi bad-pages --help usage: amd-smi bad-pages [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-p] @@ -205,7 +258,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-metric)= +### amd-smi metric + +Gets metrics and performance information about the specified GPU. + +```shell-session ~$ amd-smi metric --help usage: amd-smi metric [-h] [-g GPU [GPU ...] | -U CPU [CPU ...]
| -O CORE [CORE ...]] [-w INTERVAL] [-W TIME] [-i ITERATIONS] [-m] [-u] [-p] [-c] [-t] @@ -295,7 +353,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-process)= +### amd-smi process + +Lists general process information running on the specified GPU. + +```shell-session ~$ amd-smi process --help usage: amd-smi process [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] @@ -339,7 +402,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-event)= +### amd-smi event + +Displays event information for the given GPU. + +```shell-session ~$ amd-smi event --help usage: amd-smi event [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] @@ -372,7 +440,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-topology)= +### amd-smi topology + +Displays topology information of the specified devices. + +```shell-session ~$ amd-smi topology --help usage: amd-smi topology [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-a] @@ -412,7 +485,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-set)= +### amd-smi set + +Set options for specified devices. + +```shell-session ~$ amd-smi set --help usage: amd-smi set [-h] (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-f %] [-l LEVEL] [-P SETPROFILE] [-d SCLKMAX] [-C PARTITION] [-M PARTITION] @@ -482,7 +560,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-reset)= +### amd-smi reset + +Reset options for specified devices. + +```shell-session ~$ amd-smi reset --help usage: amd-smi reset [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] (-g GPU [GPU ...] | -U CPU [CPU ...] 
| -O CORE [CORE ...]) [-G] [-c] @@ -527,7 +610,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-monitor)= +### amd-smi monitor + +Monitor metrics for target devices. + +```shell-session ~$ amd-smi monitor --help usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] @@ -577,7 +665,12 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` -```bash +(cmd-xgmi)= +### amd-smi xgmi + +Displays XGMI information of specified devices. + +```shell-session ~$ amd-smi xgmi --help usage: amd-smi xgmi [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] [-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]] [-m] @@ -612,9 +705,11 @@ Command Modifiers: DEBUG, INFO, WARNING, ERROR, CRITICAL ``` +(cli-ex-static)= ### Example output from amd-smi static -Here is some example output from the tool: +To gain a sense of the AMD SMI CLI's output, the following block is sample +output from the CLI tool: ```bash ~$ amd-smi static diff --git a/docs/how-to/amdsmi-cpp-lib.md b/docs/how-to/amdsmi-cpp-lib.md new file mode 100644 index 0000000000..1c45a9a434 --- /dev/null +++ b/docs/how-to/amdsmi-cpp-lib.md @@ -0,0 +1,198 @@ +--- +myst: + html_meta: + "description lang=en": "Get started with the AMD SMI C++ library. Basic usage and examples." + "keywords": "api, smi, lib, c++, system, management, interface, ROCm" +--- + +# AMD SMI C++ library usage and examples + +This section presents a brief overview and some basic examples on the AMD SMI +library's usage. Whether you are developing applications for performance +monitoring, system diagnostics, or resource allocation, the AMD SMI C++ library +serves as a valuable tool for leveraging the full potential of AMD hardware in +your projects. + +```{note} +``hipcc`` and other compilers will not automatically link in the ``libamd_smi`` +dynamic library. 
To compile code that uses the AMD SMI library API, ensure the +``libamd_smi.so`` can be located by setting the ``LD_LIBRARY_PATH`` environment +variable to the directory containing ``libamd_smi.so`` (usually +``/opt/rocm/lib``) or by passing the ``-lamd_smi`` flag to the compiler. +``` + +```{seealso} +Refer to the [C++ library API reference](../reference/amdsmi-cpp-api.md). +``` + +(device_socket_handle)= +## Device and socket handles + +Many functions in the library take a _socket handle_ or _device handle_. A +_socket_ refers to a physical hardware socket, abstracted by the library to +represent the hardware more effectively to the user. While there is always one +unique GPU per socket, an APU may house both a GPU and CPU on the same socket. +For MI200 GPUs, multiple GCDs may reside within a single socket. + +To identify the sockets in a system, use the `amdsmi_get_socket_handles()` +function, which returns a list of socket handles. These handles can then be used +with `amdsmi_get_processor_handles()` to query devices within each socket. The +device handle is used to differentiate between detected devices; however, it's +important to note that a device handle may change after restarting the +application, so it should not be considered a persistent identifier across +processes. + +The list of socket handles obtained from `amdsmi_get_socket_handles()` can +also be used to query the CPUs in each socket by calling +`amdsmi_get_processor_handles_by_type()`. This function can then be called again +to query the cores within each CPU. + +(cpp_hello_amdsmi)= +## Hello AMD SMI + +An application using AMD SMI must call `amdsmi_init()` to initialize the AMD SMI +library before all other calls. This call initializes the internal data +structures required for subsequent AMD SMI operations. In the call, a flag can +be passed to indicate if the application is interested in a specific device +type.
+ +`amdsmi_shut_down()` must be the last call to properly close the connection to +the driver and make sure that any resources held by AMD SMI are released. + +1. A simple "Hello World" type program that displays the temperature of detected + devices looks like this: + + ```cpp + #include <iostream> + #include <vector> + #include "amd_smi/amdsmi.h" + + int main() { + amdsmi_status_t ret; + + // Init amdsmi for sockets and devices. Here we are only interested in AMD_GPUS. + ret = amdsmi_init(AMDSMI_INIT_AMD_GPUS); + + // Get all sockets + uint32_t socket_count = 0; + + // Get the socket count available in the system. + ret = amdsmi_get_socket_handles(&socket_count, nullptr); + + // Allocate the memory for the sockets + std::vector<amdsmi_socket_handle> sockets(socket_count); + // Get the socket handles in the system + ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]); + + std::cout << "Total Socket: " << socket_count << std::endl; + + // For each socket, get identifier and devices + for (uint32_t i=0; i < socket_count; i++) { + // Get Socket info + char socket_info[128]; + ret = amdsmi_get_socket_info(sockets[i], 128, socket_info); + std::cout << "Socket " << socket_info<< std::endl; + + // Get the device count for the socket. + uint32_t device_count = 0; + ret = amdsmi_get_processor_handles(sockets[i], &device_count, nullptr); + + // Allocate the memory for the device handles on the socket + std::vector<amdsmi_processor_handle> processor_handles(device_count); + // Get all devices of the socket + ret = amdsmi_get_processor_handles(sockets[i], + &device_count, &processor_handles[0]); + + // For each device of the socket, get name and temperature. + for (uint32_t j=0; j < device_count; j++) { + // Get device type. Since the amdsmi is initialized with + // AMDSMI_INIT_AMD_GPUS, the processor_type must be AMDSMI_PROCESSOR_TYPE_AMD_GPU.
+ processor_type_t processor_type; + ret = amdsmi_get_processor_type(processor_handles[j], &processor_type); + if (processor_type != AMDSMI_PROCESSOR_TYPE_AMD_GPU) { + std::cout << "Expect AMDSMI_PROCESSOR_TYPE_AMD_GPU device type!\n"; + return 1; + } + + // Get device name + amdsmi_board_info_t board_info; + ret = amdsmi_get_gpu_board_info(processor_handles[j], &board_info); + std::cout << "\tdevice " + << j <<"\n\t\tName:" << board_info.product_name << std::endl; + + // Get temperature + int64_t val_i64 = 0; + ret = amdsmi_get_temp_metric(processor_handles[j], AMDSMI_TEMPERATURE_TYPE_EDGE, + AMDSMI_TEMP_CURRENT, &val_i64); + std::cout << "\t\tTemperature: " << val_i64 << "C" << std::endl; + } + } + + // Clean up resources allocated at amdsmi_init. It will invalidate sockets + // and devices pointers + ret = amdsmi_shut_down(); + + return 0; + } + ``` + +2. A sample program that displays the power of detected CPUs looks like this: + + ```cpp + #include <iostream> + #include <vector> + #include "amd_smi/amdsmi.h" + + int main(int argc, char **argv) { + amdsmi_status_t ret; + uint32_t socket_count = 0; + + // Initialize amdsmi for AMD CPUs + ret = amdsmi_init(AMDSMI_INIT_AMD_CPUS); + + ret = amdsmi_get_socket_handles(&socket_count, nullptr); + + // Allocate the memory for the sockets + std::vector<amdsmi_socket_handle> sockets(socket_count); + + // Get the sockets of the system + ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]); + + std::cout << "Total Socket: " << socket_count << std::endl; + + // For each socket, get cpus + for (uint32_t i = 0; i < socket_count; i++) { + uint32_t cpu_count = 0; + + // Set processor type as AMDSMI_PROCESSOR_TYPE_AMD_CPU + processor_type_t processor_type = AMDSMI_PROCESSOR_TYPE_AMD_CPU; + ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, nullptr, &cpu_count); + + // Allocate the memory for the cpus + std::vector<amdsmi_processor_handle> plist(cpu_count); + + // Get the cpus for each socket + ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type,
&plist[0], &cpu_count); + + for (uint32_t index = 0; index < plist.size(); index++) { + uint32_t socket_power; + std::cout<<"CPU "<(socket_power)/1000< - #include - #include "amd_smi/amdsmi.h" - - int main() { - amdsmi_status_t ret; - - // Init amdsmi for sockets and devices. Here we are only interested in AMD_GPUS. - ret = amdsmi_init(AMDSMI_INIT_AMD_GPUS); - - // Get all sockets - uint32_t socket_count = 0; - - // Get the socket count available in the system. - ret = amdsmi_get_socket_handles(&socket_count, nullptr); - - // Allocate the memory for the sockets - std::vector sockets(socket_count); - // Get the socket handles in the system - ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]); - - std::cout << "Total Socket: " << socket_count << std::endl; - - // For each socket, get identifier and devices - for (uint32_t i=0; i < socket_count; i++) { - // Get Socket info - char socket_info[128]; - ret = amdsmi_get_socket_info(sockets[i], 128, socket_info); - std::cout << "Socket " << socket_info<< std::endl; - - // Get the device count for the socket. - uint32_t device_count = 0; - ret = amdsmi_get_processor_handles(sockets[i], &device_count, nullptr); - - // Allocate the memory for the device handlers on the socket - std::vector processor_handles(device_count); - // Get all devices of the socket - ret = amdsmi_get_processor_handles(sockets[i], - &device_count, &processor_handles[0]); - - // For each device of the socket, get name and temperature. - for (uint32_t j=0; j < device_count; j++) { - // Get device type. Since the amdsmi is initialized with - // AMD_SMI_INIT_AMD_GPUS, the processor_type must be AMD_GPU. 
- processor_type_t processor_type; - ret = amdsmi_get_processor_type(processor_handles[j], &processor_type); - if (processor_type != AMD_GPU) { - std::cout << "Expect AMD_GPU device type!\n"; - return 1; - } - - // Get device name - amdsmi_board_info_t board_info; - ret = amdsmi_get_gpu_board_info(processor_handles[j], &board_info); - std::cout << "\tdevice " - << j <<"\n\t\tName:" << board_info.product_name << std::endl; - - // Get temperature - int64_t val_i64 = 0; - ret = amdsmi_get_temp_metric(processor_handles[j], AMDSMI_TEMPERATURE_TYPE_EDGE, - AMDSMI_TEMP_CURRENT, &val_i64); - std::cout << "\t\tTemperature: " << val_i64 << "C" << std::endl; - } - } - - // Clean up resources allocated at amdsmi_init. It will invalidate sockets - // and devices pointers - ret = amdsmi_shut_down(); - - return 0; - } - - -2) A sample program that displays the power of detected cpus would look like this: - -.. code-block:: - - #include - #include - #include "amd_smi/amdsmi.h" - - int main(int argc, char **argv) { - amdsmi_status_t ret; - uint32_t socket_count = 0; - - // Initialize amdsmi for AMD CPUs - ret = amdsmi_init(AMDSMI_INIT_AMD_CPUS); - - ret = amdsmi_get_socket_handles(&socket_count, nullptr); - - // Allocate the memory for the sockets - std::vector sockets(socket_count); - - // Get the sockets of the system - ret = amdsmi_get_socket_handles(&socket_count, &sockets[0]); - - std::cout << "Total Socket: " << socket_count << std::endl; - - // For each socket, get cpus - for (uint32_t i = 0; i < socket_count; i++) { - uint32_t cpu_count = 0; - - // Set processor type as AMD_CPU - processor_type_t processor_type = AMD_CPU; - ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, nullptr, &cpu_count); - - // Allocate the memory for the cpus - std::vector plist(cpu_count); - - // Get the cpus for each socket - ret = amdsmi_get_processor_handles_by_type(sockets[i], processor_type, &plist[0], &cpu_count); - - for (uint32_t index = 0; index < plist.size(); 
index++) { - uint32_t socket_power; - std::cout<<"CPU "<(socket_power)/1000<. + +```{note} +AMD SMI is the successor to . +``` + +::::{grid} 2 +:gutter: 3 + +:::{grid-item-card} Install +* [Library and CLI tool installation](./install/install.md) +* [Build from source](./install/build.md) +::: + +:::{grid-item-card} How to +* [C++ library usage](./how-to/amdsmi-cpp-lib.md) +* [Python library usage](./how-to/amdsmi-py-lib.md) +* [CLI tool usage](./how-to/amdsmi-cli-tool.md) +::: + +:::{grid-item-card} Reference +* [C++ API](./reference/amdsmi-cpp-api.md) + * [Modules](../doxygen/docBin/html/modules) + * [Files](../doxygen/docBin/html/files) + * [Globals](../doxygen/docBin/html/globals) + * [Data structures](../doxygen/docBin/html/annotated) + * [Data fields](../doxygen/docBin/html/functions_data_fields) +* [Python API](./reference/amdsmi-py-api.md) +::: + +:::{grid-item-card} Tutorials +* [AMD SMI examples (GitHub)](https://github.com/ROCm/amdsmi/tree/amd-staging/example) +* [ROCm SMI examples (GitHub)](https://github.com/ROCm/rocm_smi_lib/tree/amd-staging/example) +::: +:::: + +To contribute to the documentation, refer to +{doc}`Contributing to ROCm `. + +Find ROCm licensing information on the +{doc}`Licensing ` page. + + + +
+The information contained herein is for informational purposes only, and is +subject to change without notice. While every precaution has been taken in the +preparation of this document, it may contain technical inaccuracies, omissions +and typographical errors, and AMD is under no obligation to update or otherwise +correct this information. Advanced Micro Devices, Inc. makes no representations +or warranties with respect to the accuracy or completeness of the contents of +this document, and assumes no liability of any kind, including the implied +warranties of noninfringement, merchantability or fitness for particular +purposes, with respect to the operation or use of AMD hardware, software or +other products described herein. + +AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced +Micro Devices, Inc. Other product names used in this publication are for +identification purposes only and may be trademarks of their respective +companies. + +Copyright (c) 2014-2024 Advanced Micro Devices, Inc. All rights reserved. +
diff --git a/docs/index.rst b/docs/index.rst deleted file mode 100644 index 7d5725cba1..0000000000 --- a/docs/index.rst +++ /dev/null @@ -1,50 +0,0 @@ -.. meta:: - :description: AMDSMI documentation and API reference library - :keywords: amdsmi, ROCm, API, documentation - -******************************************************************** -AMD SMI documentation -******************************************************************** - -The AMD System Management Interface (SMI) Library, or AMD SMI library, is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices. - -You can access the AMD SMI code on the `GitHub repository `_. - -.. Note:: - -This project is a successor to `rocm_smi_lib. `_ - -.. grid:: 2 - :gutter: 3 - - .. grid-item-card:: Install - - * :doc:`AMD SMI installation <./install/install>` - - .. grid-item-card:: API reference - - * :doc:`Files <../doxygen/docBin/html/files>` - * :doc:`Globals <../doxygen/docBin/html/globals>` - * :doc:`Data structures <../doxygen/docBin/html/annotated>` - * :doc:`Modules <../doxygen/docBin/html/modules>` - * :doc:`Data fields <../doxygen/docBin/html/functions_data_fields>` - - .. grid-item-card:: How to - - * :doc:`Use AMD SMI for C++ library ` - * :doc:`Use AMD SMI for Python library ` - * :doc:`Use AMD SMI CLI tool ` - - - .. grid-item-card:: Tutorials - - * `AMD SMI GitHub samples `_ - * `ROCm SMI Github samples `_ - - -To contribute to the documentation, refer to -`Contributing to ROCm `_. - -You can find licensing information on the -`Licensing `_ page. - diff --git a/docs/install/build.md b/docs/install/build.md new file mode 100644 index 0000000000..e8829bc3ce --- /dev/null +++ b/docs/install/build.md @@ -0,0 +1,110 @@ +--- +myst: + html_meta: + "description lang=en": "How to build AMD SMI from source." 
+ "keywords": "system, management, interface, contribute, contributing, ROCm, develop, testing" +--- + +# Building AMD SMI + +This section describes the prerequisites and steps to build AMD SMI from source. + +(build_reqs)= +## Required software + +To build the AMD SMI library, the following components are required. Note that +the software versions specified were used during development; earlier +versions are not guaranteed to work. + +* CMake (v3.14.0 or later) -- `python3 -m pip install cmake` +* g++ (v5.4.0 or later) + +In order to build the AMD SMI Python package, the following components are +required: + +* Python (3.6.8 or later) + * prerequisite modules: + * python3-wheel + * python3-setuptools +* virtualenv -- `python3 -m pip install virtualenv` + +## Build steps + +1. Clone the AMD SMI repository to your local Linux machine. + + ```shell + git clone https://github.com/ROCm/amdsmi.git + ``` + +2. The default installation location for the library and headers is `/opt/rocm`. + Before installation, any old ROCm directories should be deleted: + + * `/opt/rocm` + * `/opt/rocm-` + +3. Build the library by following the typical CMake build sequence (run as root + user or use `sudo` before `make install` command); for instance: + + ```bash + mkdir -p build + cd build + cmake .. + make -j $(nproc) + make install + ``` + + The built library is located in the `build/` directory. To build the `rpm` + and `deb` packages use the following command: + + ```bash + make package + ``` + +(rebuild_py_wrapper)= +## Rebuild the Python wrapper + +The Python wrapper for the AMD SMI library is found in the [auto-generated +file](#py_lib_fs) `py-interface/amdsmi_wrapper.py`. It is essential to +regenerate this wrapper whenever there are changes to the C++ API. It is not +regenerated automatically. + +To regenerate the wrapper, use the following command. 
+ +```shell +./update_wrapper.sh +``` + +After this command, the file in `py-interface/amdsmi_wrapper.py` will be updated +on compile. + +```{note} +You need Docker installed on your system to regenerate the Python wrapper. +``` + +(build_tests)= +## Build the tests + +To verify the build and capabilities of AMD SMI on your system, as well as to +see practical examples of its usage, you can build and run the available [tests +in the repository](https://github.com/ROCm/amdsmi/tree/amd-staging/tests). +Follow these steps to build the tests: + +```bash +mkdir -p build +cd build +cmake -DBUILD_TESTS=ON .. +make -j $(nproc) +``` + +(run_tests)= +### Run the tests + +Once the tests are [built](#build_tests), you can run them by executing the +`amdsmitst` program. The executable can be found at `build/tests/amd_smi_test/`. + +(build_docs)= +## Build the docs + +To build the documentation, follow the instructions at [Building +documentation](https://rocm.docs.amd.com/en/latest/contribute/building.html). + diff --git a/docs/install/install.md b/docs/install/install.md new file mode 100644 index 0000000000..b7bdb66f8e --- /dev/null +++ b/docs/install/install.md @@ -0,0 +1,144 @@ +--- +myst: + html_meta: + "description lang=en": "How to install AMD SMI libraries and CLI tool." + "keywords": "system, management, interface, cpu, gpu, hsmp, versions" +--- + +# AMD SMI library and CLI tool + +This section describes how to install the AMD SMI library, Python interface, +and command line tool either as part of the +{doc}`ROCm software stack ` -- or manually. + +(install_reqs)= +## Requirements +The following are required to install and use the AMD SMI libraries and CLI +tool. + +* Python 3.6.8+ (64-bit) +* `amdgpu` driver must be loaded for [`amdsmi_init()`](#cpp_hello_amdsmi) to + work. + +### Supported platforms + +At initial release, the AMD SMI library will support Linux bare metal and Linux +virtual machine guest for AMD GPUs. 
In a future release, the library will be +extended to support AMD EPYC™ CPUs. + +The AMD SMI library can run on AMD ROCm supported platforms; refer to +{doc}`System requirements (Linux) ` +for more information. + + +To run the AMD SMI library, the `amdgpu` driver and the `amd_hsmp` driver need +to be installed. Optionally, `libdrm` can be installed to query firmware +information and hardware IPs. + +(install_amdgpu_rocm)= +## Install amdgpu driver and AMD SMI with ROCm + + +1. Get the `amdgpu-install` installer following the instructions for your + Linux distribution at {doc}`rocm-install-on-linux:install/amdgpu-install`. + + See the following example; your desired ROCm release and install URL may be + different. + + ```shell + sudo apt update + wget https://repo.radeon.com/amdgpu-install/6.2.2/ubuntu/noble/amdgpu-install_6.2.60202-1_all.deb + sudo apt install ./amdgpu-install_6.2.60202-1_all.deb + ``` + +2. Use `amdgpu-install` to install the `amdgpu` driver and ROCm packages with + AMD SMI included. + + ```shell + sudo amdgpu-install --usecase=rocm + ``` + + The `amdgpu-install --usecase=rocm` option triggers both an `amdgpu` driver + update and AMD SMI packages to be installed on your device. + +3. Verify your installation. + + ```shell + amd-smi --help + ``` + +(install_without_rocm)= +## Install AMD SMI without ROCm + +The following are example steps to install the AMD SMI libraries and CLI tool on +Ubuntu 22.04. + +1. Install the library. + + ```shell + sudo apt install amd-smi-lib + ``` + +2. Add the installation directory to your PATH. If installed with ROCm, ignore + this step. + + ```shell + export PATH="${PATH:+${PATH}:}/opt/rocm/bin" + ``` + +3. Verify your installation. + + ```shell + amd-smi --help + ``` + +## Optionally enable CLI autocompletion + +The `amd-smi` CLI application supports autocompletion. If `argcomplete` is not +installed and enabled already, do so using the following commands.
+ +```shell +python3 -m pip install argcomplete +activate-global-python-argcomplete --user +# restart shell to enable +``` + +(install-manual-py-lib)= +## Install the Python library for multiple ROCm instances + +If {doc}`multiple ROCm versions are installed +` and you +are not using `pyenv`, uninstall previous versions of AMD SMI before installing +the desired version from your ROCm instance. + +### Manually install the Python library + +The following are example AMD SMI installation steps on Ubuntu 22.04 without +ROCm. + +1. Remove previous AMD SMI installation. + + ```shell + python3 -m pip list | grep amd + python3 -m pip uninstall amdsmi + ``` + +2. Install the AMD SMI Python library from your target ROCm instance. + + ```shell + apt install amd-smi-lib + cd /opt/rocm/share/amd_smi + python3 -m pip install --upgrade pip + python3 -m pip install --user . + ``` + +3. You should now have the AMD SMI Python library in your Python path: + + ```shell-session + ~$ python3 + Python 3.8.10 (default, May 26 2023, 14:05:08) + [GCC 9.4.0] on linux + Type "help", "copyright", "credits" or "license" for more information. + >>> import amdsmi + >>> + ``` diff --git a/docs/install/install.rst b/docs/install/install.rst deleted file mode 100644 index fa6355e720..0000000000 --- a/docs/install/install.rst +++ /dev/null @@ -1,147 +0,0 @@ -.. meta:: - :description: Install AMD SMI - :keywords: install, SMI, AMD, ROCm - -******************************************************************** -Installation -******************************************************************** - -AMD System Management Interface (AMD SMI) library -------------------------------------------------- - -The AMD System Management Interface Library (AMD SMI library) is a C library for Linux that provides a user space interface for applications to monitor and control AMD devices. - -.. Note:: - -This project is a successor to `rocm_smi_lib. 
`_ - -Supported platforms -===================== -In its initial release, the AMD SMI library supports Linux bare metal and Linux virtual machine guest for AMD GPUs. In a future release, the library will extend to support AMD EPYC™ CPUs. - -The AMD SMI library can run on AMD ROCm-supported platforms. Refer to `System requirements - Linux `_ for more information. - -To run the AMD SMI library, the `amdgpu` driver and the `hsmp` driver must be installed. Optionally, `libdrm` can be installed to query firmware information and hardware IPs. - - -CLI tool and libraries installation ------------------------------------- - -Requirements -============= - -* Python 3.6.8+ 64-bit -* amdgpu driver must be loaded for `amdsmi_init()` to pass - -Installation steps -------------------- - -1. Install amdgpu using ROCm. - -2. Install amdgpu driver. See the following example. Note that your release and link may differ. The `amdgpu-install --usecase=rocm` triggers both the amdgpu driver update and AMD SMI packages to be installed on your device. - -.. code-block:: shell - - sudo apt update - - wget https://repo.radeon.com/amdgpu-install/6.0.2/ubuntu/jammy/amdgpu-install_6.0.60002-1_all.deb - - sudo apt install ./amdgpu-install_6.0.60002-1_all.deb - - sudo amdgpu-install --usecase=rocm - - amd-smi --help - -3. Install an example for Ubuntu 22.04 (without ROCm). - -.. code-block:: bash - - apt install amd-smi-lib - - # if installed with rocm ignore the export - - export PATH="${PATH:+${PATH}:}~/opt/rocm/bin" - - amd-smi --help - - -Optional autocompletion ------------------------- - -The `amd-smi` cli application supports autocompletion. The package should attempt to install it, if argcomplete is not installed, you can enable it by using the following commands: - -.. 
code:: bash - - python3 -m pip install argcomplete - - activate-global-python-argcomplete --user - - # restart shell to enable - - -Manual/Multiple ROCm instance Python library install ------------------------------------------------------- - -In the event there are multiple ROCm installations and `pyenv` is not being used to use the correct amdsmi version, you must uninstall previous versions of AMD SMI and install the latest version you want directly from your ROCm instance. - -Python library install example for Ubuntu 22.04 -================================================= - -1. Remove any existing AMD SMI installation: - -.. code-block:: bash - - python3 -m pip list | grep amd - - python3 -m pip uninstall amdsmi - - -2. Install Python library from your target ROCm instance: - -.. code:: bash - - apt install amd-smi-lib - - cd /opt/rocm/share/amd_smi - - python3 -m pip install --upgrade pip - - python3 -m pip install --user - - -Now you have the AMD SMI Python library in your Python path: - - -.. code:: bash - - ~$ python3 - - Python 3.8.10 (default, May 26 2023, 14:05:08) - - [GCC 9.4.0] on linux - -3. Type "help", "copyright", "credits" or "license" for more information - -.. code:: bash - - import amdsmi - - -Sphinx documentation -===================== - -Run the following commands to build the documentation locally: - -.. code-block:: bash - - cd docs - - python3 -m pip install -r sphinx/requirements.txt - - python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html - - -The output is available in `docs/_build/html`. - -For additional details, see `Contribute to ROCm documentation `_. - diff --git a/docs/license.rst b/docs/license.rst index ddb544496e..dce45dfce0 100644 --- a/docs/license.rst +++ b/docs/license.rst @@ -1,6 +1,9 @@ -======= +.. meta:: + :description: Review the AMD SMI license agreement. + :keywords: amdsmi + +******* License -======= +******* .. 
include:: ../LICENSE - :literal: diff --git a/docs/py-interface_readme_link.md b/docs/py-interface_readme_link.md deleted file mode 100644 index c583458eb7..0000000000 --- a/docs/py-interface_readme_link.md +++ /dev/null @@ -1,2 +0,0 @@ -```{include} ../py-interface/README.md -``` diff --git a/docs/reference/amdsmi-cpp-api.md b/docs/reference/amdsmi-cpp-api.md new file mode 100644 index 0000000000..6e7f50edb4 --- /dev/null +++ b/docs/reference/amdsmi-cpp-api.md @@ -0,0 +1,21 @@ +--- +myst: + html_meta: + "description lang=en": "Explore the AMD SMI C++ API." + "keywords": "api, smi, lib, cpp, header, system, management, interface, ROCm" +--- + +# AMD SMI C++ API reference + +This section provides comprehensive documentation for the AMD SMI C++ API. +Explore these sections to understand the full scope of available +functionalities and how to implement them in your applications. + +- {doc}`Modules <../doxygen/docBin/html/modules>` + +- {doc}`Files <../doxygen/docBin/html/files>` + +- {doc}`Globals <../doxygen/docBin/html/globals>` + +- {doc}`Data structures <../doxygen/docBin/html/annotated>` + diff --git a/docs/how-to/using-amdsmi-for-python.md b/docs/reference/amdsmi-py-api.md similarity index 97% rename from docs/how-to/using-amdsmi-for-python.md rename to docs/reference/amdsmi-py-api.md index 997a37bbb4..80f5068714 100644 --- a/docs/how-to/using-amdsmi-for-python.md +++ b/docs/reference/amdsmi-py-api.md @@ -1,73 +1,20 @@ -# AMD SMI Python Library +--- +myst: + html_meta: + "description lang=en": "Explore the AMD SMI Python API." + "keywords": "api, smi, lib, py, system, management, interface, ROCm" +--- -## Requirements +# AMD SMI Python API reference -* Python 3.6+ 64-bit -* Driver must be loaded for amdsmi_init() to pass +The AMD SMI Python interface provides a convenient way to interact with AMD +hardware through a simple and accessible API. 
Compatible with Python 3.6 and +higher, this library requires the AMD driver to be loaded for initialization -- +review the [prerequisites](#install_reqs). -## Overview - -### Folder structure - -File Name | Note ----|--- -`__init__.py` | Python package initialization file -`amdsmi_interface.py` | Amdsmi library python interface -`amdsmi_wrapper.py` | Python wrapper around amdsmi binary -`amdsmi_exception.py` | Amdsmi exceptions python file -`README.md` | Documentation - -### Usage - -`amdsmi` folder should be copied and placed next to importing script. It should be imported as: - -```python -from amdsmi import * - -try: - amdsmi_init() - - # amdsmi calls ... - -except AmdSmiException as e: - print(e) -finally: - try: - amdsmi_shut_down() - except AmdSmiException as e: - print(e) -``` - -To initialize amdsmi lib, amdsmi_init() must be called before all other calls to amdsmi lib. - -To close connection to driver, amdsmi_shut_down() must be the last call. - -### Exceptions - -All exceptions are in `amdsmi_exception.py` file. -Exceptions that can be thrown are: - -* `AmdSmiException`: base amdsmi exception class -* `AmdSmiLibraryException`: derives base `AmdSmiException` class and represents errors that can occur in amdsmi-lib. -When this exception is thrown, `err_code` and `err_info` are set. `err_code` is an integer that corresponds to errors that can occur -in amdsmi-lib and `err_info` is a string that explains the error that occurred. -Example: - -```python -try: - num_of_GPUs = len(amdsmi_get_processor_handles()) - if num_of_GPUs == 0: - print("No GPUs on machine") -except AmdSmiException as e: - print("Error code: {}".format(e.err_code)) - if e.err_code == amdsmi_wrapper.AMDSMI_STATUS_RETRY: - print("Error info: {}".format(e.err_info)) -``` - -* `AmdSmiRetryException` : Derives `AmdSmiLibraryException` class and signals device is busy and call should be retried. 
-* `AmdSmiTimeoutException` : Derives `AmdSmiLibraryException` class and represents that call had timed out. -* `AmdSmiParameterException`: Derives base `AmdSmiException` class and represents errors related to invaild parameters passed to functions. When this exception is thrown, err_msg is set and it explains what is the actual and expected type of the parameters. -* `AmdSmiBdfFormatException`: Derives base `AmdSmiException` class and represents invalid bdf format. +This section provides comprehensive documentation for the AMD SMI Python API. +Explore these sections to understand the full scope of available functionalities +and how to implement them in your applications. ## API @@ -532,17 +479,17 @@ Input parameters: Output: List of Dictionaries containing cache information following the schema below: Schema: -```JSON +```json { - cache_properties: + "cache_properties": { "type" : "array", "items" : {"type" : "string"} }, - cache_size: {"type" : "number"}, - cache_level: {"type" : "number"}, - max_num_cu_shared: {"type" : "number"}, - num_cache_instance: {"type" : "number"} + "cache_size": {"type" : "number"}, + "cache_level": {"type" : "number"}, + "max_num_cu_shared": {"type" : "number"}, + "num_cache_instance": {"type" : "number"} } ``` @@ -2102,6 +2049,7 @@ except AmdSmiException as e: ``` ### amdsmi_set_gpu_process_isolation + Description: Enable/disable the system Process Isolation for the given device handle. Input parameters: @@ -2132,6 +2080,7 @@ except AmdSmiException as e: ``` ### amdsmi_clean_gpu_local_data + Description: Clear the SRAM data of the given device. This can be called between user logins to prevent information leak. 
Input parameters: @@ -2160,7 +2109,6 @@ except AmdSmiException as e: print(e) ``` - ### amdsmi_get_gpu_overdrive_level Description: Get the overdrive percent associated with the device with provided @@ -3767,6 +3715,44 @@ except AmdSmiException as e: print(e) ``` +### amdsmi_get_gpu_accelerator_partition_profile + +**Note: CURRENTLY HARDCODED TO RETURN EMPTY VALUES** + +Description: Get partition information for target device + +Input parameters: + +* `processor_handle` the device handle + +Output: Dictionary with fields: + +Field | Description +---|--- +`partition_id` | ID of the partition on the GPU provided +`partition_profile` | Dict containing partition data (TBD) + +Exceptions that can be thrown by `amdsmi_get_gpu_accelerator_partition_profile` function: + +* `AmdSmiLibraryException` +* `AmdSmiRetryException` +* `AmdSmiParameterException` + +Example: + +```python +try: + devices = amdsmi_get_processor_handles() + if len(devices) == 0: + print("No GPUs on machine") + else: + for device in devices: + partition_id = amdsmi_get_gpu_accelerator_partition_profile(device)["partition_id"] + print(partition_id) +except AmdSmiException as e: + print(e) +``` + ### amdsmi_get_xgmi_info Description: Returns XGMI information for the GPU. 
@@ -3839,8 +3825,7 @@ try: else: print(amdsmi_get_gpu_device_uuid(devices[0])) - nearest_gpus = amdsmi_topology_nearest_t() - nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType(2)) + nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType.AMDSMI_LINK_TYPE_PCIE) if (nearest_gpus['count']) == 0: print("No nearest GPUs found on machine") else: diff --git a/docs/reference/changelog.md b/docs/reference/changelog.md new file mode 100644 index 0000000000..3caaedcd35 --- /dev/null +++ b/docs/reference/changelog.md @@ -0,0 +1,1733 @@ +# Changelog for AMD SMI Library + +Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/projects/amdsmi](https://rocm.docs.amd.com/projects/amdsmi/en/latest/). + +***All information listed below is for reference and subject to change.*** + +## amd_smi_lib for ROCm 6.3.0 + +### Changes + +- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**. +Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery: + - `uint64_t accumulation_counter` - used for all throttled calculations + - `uint64_t prochot_residency_acc` - Processor hot accumulator + - `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations) + - `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations) + - `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator + - `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator + - `uint16_t num_partition` - corresponds to the current total number of partitions + - `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with current GPU, provides gfx busy & accumulators, jpeg, and decoder (VCN) engine utilizations + - `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphic engine 
utilization (%) + - `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - jpeg engine utilization (%) + - `uint16_t vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%) + - `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - graphic engine utilization accumulated (%) + - `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter + +- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**. + ***Only available for MI300+ ASICs.*** + Users can now retrieve violation statuses through either our Python or C++ APIs. Additionally, we have + added the capability to view these outputs conveniently through `amd-smi metric --throttle` and `amd-smi monitor --violation`. + Example outputs are listed below (below is for reference, output is subject to change): + +```shell +$ amd-smi metric --throttle +GPU: 0 + THROTTLE: + ACCUMULATION_COUNTER: 1226415116 + PROCHOT_ACCUMULATED: 0 + PPT_ACCUMULATED: 12 + SOCKET_THERMAL_ACCUMULATED: 0 + VR_THERMAL_ACCUMULATED: 0 + HBM_THERMAL_ACCUMULATED: 0 + PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE + PPT_VIOLATION_ACTIVE: NOT ACTIVE + SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + PROCHOT_VIOLATION_PERCENT: 0 % + PPT_VIOLATION_PERCENT: 0 % + SOCKET_THERMAL_VIOLATION_PERCENT: 0 % + VR_THERMAL_VIOLATION_PERCENT: 0 % + HBM_THERMAL_VIOLATION_PERCENT: 0 % + +GPU: 1 + THROTTLE: + ACCUMULATION_COUNTER: 1226415121 + PROCHOT_ACCUMULATED: 0 + PPT_ACCUMULATED: 12 + SOCKET_THERMAL_ACCUMULATED: 0 + VR_THERMAL_ACCUMULATED: 0 + HBM_THERMAL_ACCUMULATED: 0 + PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE + PPT_VIOLATION_ACTIVE: NOT ACTIVE + SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + PROCHOT_VIOLATION_PERCENT: 0 % + PPT_VIOLATION_PERCENT: 0 % + SOCKET_THERMAL_VIOLATION_PERCENT: 0 % +
VR_THERMAL_VIOLATION_PERCENT: 0 % + HBM_THERMAL_VIOLATION_PERCENT: 0 % +... +``` + +```shell +$ amd-smi monitor --violation +GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL + 0 0 % 0 % 0 % 0 % 0 % + 1 0 % 0 % 0 % 0 % 0 % + 2 0 % 0 % 0 % 0 % 0 % + 3 0 % 0 % 0 % 0 % 0 % + 4 0 % 0 % 0 % 0 % 0 % + 5 0 % 0 % 0 % 0 % 0 % + 6 0 % 0 % 0 % 0 % 0 % + 7 0 % 0 % 0 % 0 % 0 % + 8 0 % 0 % 0 % 0 % 0 % + 9 0 % 0 % 0 % 0 % 0 % + 10 0 % 0 % 0 % 0 % 0 % + 11 0 % 0 % 0 % 0 % 0 % + 12 0 % 0 % 0 % 0 % 0 % + 13 0 % 0 % 0 % 0 % 0 % + 14 0 % 0 % 0 % 0 % 0 % + 15 0 % 0 % 0 % 0 % 0 % +... +``` + +- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`**. + ***Partition specific features are only available on MI300+ ASICs*** + Users can now retrieve graphic utilization statistic on a per-XCP (per-partition) basis. Here all XCP activities will be listed, + but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`. + Example outputs are listed below (below is for reference, output is subject to change): + +```shell +$ amd-smi metric --usage +GPU: 0 + USAGE: + GFX_ACTIVITY: 0 % + UMC_ACTIVITY: 0 % + MM_ACTIVITY: N/A + VCN_ACTIVITY: [0 %, N/A, N/A, N/A] + JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + GFX_BUSY_INST: + XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + JPEG_BUSY: + XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + 
N/A, N/A, N/A] + XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + VCN_BUSY: + XCP_0: [0 %, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A] + GFX_BUSY_ACC: + XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + +GPU: 1 + USAGE: + GFX_ACTIVITY: 0 % + UMC_ACTIVITY: 0 % + MM_ACTIVITY: N/A + VCN_ACTIVITY: [0 
%, N/A, N/A, N/A] + JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + GFX_BUSY_INST: + XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + JPEG_BUSY: + XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, 
N/A, + N/A, N/A, N/A] + VCN_BUSY: + XCP_0: [0 %, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A] + GFX_BUSY_ACC: + XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + +... +``` + +- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value**. + ***Feature is only available on MI300+ ASICs*** + Users can now retrieve both through `amdsmi_get_pcie_info()` which has an updated structure: + +```C +typedef struct { + ... 
+ struct pcie_metric_ { + uint16_t pcie_width; //!< current PCIe width + uint32_t pcie_speed; //!< current PCIe speed in MT/s + uint32_t pcie_bandwidth; //!< current instantaneous PCIe bandwidth in Mb/s + uint64_t pcie_replay_count; //!< total number of the replays issued on the PCIe link + uint64_t pcie_l0_to_recovery_count; //!< total number of times the PCIe link transitioned from L0 to the recovery state + uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link + uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device + uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver + uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter + uint64_t reserved[12]; + } pcie_metric; + uint64_t reserved[32]; +} amdsmi_pcie_info_t; +``` + + - Example outputs are listed below (below is for reference, output is subject to change): + +```shell +$ amd-smi metric --pcie +GPU: 0 + PCIE: + WIDTH: 16 + SPEED: 32 GT/s + BANDWIDTH: 18 Mb/s + REPLAY_COUNT: 0 + L0_TO_RECOVERY_COUNT: 0 + REPLAY_ROLL_OVER_COUNT: 0 + NAK_SENT_COUNT: 0 + NAK_RECEIVED_COUNT: 0 + CURRENT_BANDWIDTH_SENT: N/A + CURRENT_BANDWIDTH_RECEIVED: N/A + MAX_PACKET_SIZE: N/A + LC_PERF_OTHER_END_RECOVERY: 0 + +GPU: 1 + PCIE: + WIDTH: 16 + SPEED: 32 GT/s + BANDWIDTH: 18 Mb/s + REPLAY_COUNT: 0 + L0_TO_RECOVERY_COUNT: 0 + REPLAY_ROLL_OVER_COUNT: 0 + NAK_SENT_COUNT: 0 + NAK_RECEIVED_COUNT: 0 + CURRENT_BANDWIDTH_SENT: N/A + CURRENT_BANDWIDTH_RECEIVED: N/A + MAX_PACKET_SIZE: N/A + LC_PERF_OTHER_END_RECOVERY: 0 +... +``` + +- **Updated BDF commands to look use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**. +This aligns BDF output with ROCm SMI. +See below for overview as seen from `rsmi_dev_pci_id_get()` now provides partition ID. See API for better detail. Previously these bits were reserved bits (right before domain) and partition id was within function. 
+ - bits [63:32] = domain + - bits [31:28] = partition id + - bits [27:16] = reserved + - bits [15: 0] = pci bus/device/function + +- **Moved python tests directory path install location**. + - `/opt//share/amd_smi/pytest/..` to `/opt//share/amd_smi/tests/python_unittest/..` + - On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed. + - Removed pytest dependency, our python testing now only depends on the unittest framework. + +- **Added retrieving a set of GPUs that are nearest to a given device at a specific link type level**. + - Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries. + +- **Added more supported utilization count types to `amdsmi_get_utilization_count()`**. + +- **Added `amd-smi set -L/--clk-limit ...` command**. + Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency. + +- **Added unittest functionality to test amdsmi API calls in Python**. + +- **Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`**. + - Changes propagate forwards into the python interface as well, however we are maintaing backwards compatibility and keeping the `power` field in the python API until ROCm 6.4. + +- **Added GPU memory overdrive percentage to `amd-smi metric -o`**. + - Added `amdsmi_get_gpu_mem_overdrive_level()` function to amd-smi C and Python Libraries. + +- **Added retrieving connection type and P2P capabilities between two GPUs**. + - Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries. + - Added retrieving P2P link capabilities to CLI `amd-smi topology`. + +```shell +$ amd-smi topology -h +usage: amd-smi topology [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] + [-g GPU [GPU ...]] [-a] [-w] [-o] [-t] [-b] + +If no GPU is specified, returns information for all GPUs on the system. +If no topology argument is provided all topology information will be displayed. 
+ +Topology arguments: + -h, --help show this help message and exit + -g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices: + ID: 0 | BDF: 0000:0c:00.0 | UUID: + ID: 1 | BDF: 0000:22:00.0 | UUID: + ID: 2 | BDF: 0000:38:00.0 | UUID: + ID: 3 | BDF: 0000:5c:00.0 | UUID: + ID: 4 | BDF: 0000:9f:00.0 | UUID: + ID: 5 | BDF: 0000:af:00.0 | UUID: + ID: 6 | BDF: 0000:bf:00.0 | UUID: + ID: 7 | BDF: 0000:df:00.0 | UUID: + all | Selects all devices + + -a, --access Displays link accessibility between GPUs + -w, --weight Displays relative weight between GPUs + -o, --hops Displays the number of hops between GPUs + -t, --link-type Displays the link type between GPUs + -b, --numa-bw Display max and min bandwidth between nodes + -c, --coherent Display cache coherant (or non-coherant) link capability between nodes + -n, --atomics Display 32 and 64-bit atomic io link capability between nodes + -d, --dma Display P2P direct memory access (DMA) link capability between nodes + -z, --bi-dir Display P2P bi-directional link capability between nodes + +Command Modifiers: + --json Displays output in JSON format (human readable by default). + --csv Displays output in CSV format (human readable by default). + --file FILE Saves output into a file on the provided path (stdout by default). 
+ --loglevel LEVEL Set the logging level from the possible choices: + DEBUG, INFO, WARNING, ERROR, CRITICAL +``` + +```shell +$ amd-smi topology -cndz +CACHE COHERANCY TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 SELF C NC NC C C C NC +0000:22:00.0 C SELF NC C C C NC C +0000:38:00.0 NC NC SELF C C NC C NC +0000:5c:00.0 NC C C SELF NC C NC NC +0000:9f:00.0 C C C NC SELF NC NC C +0000:af:00.0 C C NC C NC SELF C C +0000:bf:00.0 C NC C NC NC C SELF NC +0000:df:00.0 NC C NC NC C C NC SELF + +ATOMICS TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 SELF 64,32 64,32 64 32 32 N/A 64,32 +0000:22:00.0 64,32 SELF 64 32 32 N/A 64,32 64,32 +0000:38:00.0 64,32 64 SELF 32 N/A 64,32 64,32 64,32 +0000:5c:00.0 64 32 32 SELF 64,32 64,32 64,32 32 +0000:9f:00.0 32 32 N/A 64,32 SELF 64,32 32 32 +0000:af:00.0 32 N/A 64,32 64,32 64,32 SELF 32 N/A +0000:bf:00.0 N/A 64,32 64,32 64,32 32 32 SELF 64,32 +0000:df:00.0 64,32 64,32 64,32 32 32 N/A 64,32 SELF + +DMA TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 SELF T T F F T F T +0000:22:00.0 T SELF F F T F T T +0000:38:00.0 T F SELF T F T T T +0000:5c:00.0 F F T SELF T T T F +0000:9f:00.0 F T F T SELF T F F +0000:af:00.0 T F T T T SELF F T +0000:bf:00.0 F T T T F F SELF F +0000:df:00.0 T T T F F T F SELF + +BI-DIRECTIONAL TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 SELF T T F F T F T +0000:22:00.0 T SELF F F T F T T +0000:38:00.0 T F SELF T F T T T +0000:5c:00.0 F F T SELF T T T F +0000:9f:00.0 F T F T SELF T F F +0000:af:00.0 T F T T T SELF F T +0000:bf:00.0 F T T T F F SELF F +0000:df:00.0 T T T F F T F SELF + +Legend: + SELF = Current GPU + ENABLED / DISABLED = Link is enabled or disabled + N/A = Not supported + T/F = True / 
False + C/NC = Coherant / Non-Coherant io links + 64,32 = 64 bit and 32 bit atomic support + - +``` + +- **Created new amdsmi_kfd_info_t and added information under `amd-smi list`**. + - Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers were added in to a new `amdsmi_kfd_info_t` which gets populated via the API `amdsmi_get_gpu_kfd_info()`. + - This info has been added to the `amd-smi list`. + - These new fields are only available for BM/Guest Linux devices at this time. + +```C +typedef struct { + uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported + uint32_t node_id; //< 0xFFFFFFFF if not supported + uint32_t current_partition_id; //< 0xFFFFFFFF if not supported + uint32_t reserved[12]; +} amdsmi_kfd_info_t; +``` + +```shell +$ amd-smi list +GPU: 0 + BDF: 0000:23:00.0 + UUID: + KFD_ID: 45412 + NODE_ID: 1 + PARTITION_ID: 0 + +GPU: 1 + BDF: 0000:26:00.0 + UUID: + KFD_ID: 59881 + NODE_ID: 2 + PARTITION_ID: 0 +``` + +- **Added Subsystem Device ID to `amd-smi static --asic`**. + - No underlying changes to amdsmi_get_gpu_asic_info + +```shell +$ amd-smi static --asic +GPU: 0 + ASIC: + MARKET_NAME: MI308X + VENDOR_ID: 0x1002 + VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI] + SUBVENDOR_ID: 0x1002 + DEVICE_ID: 0x74a2 + SUBSYSTEM_ID: 0x74a2 + REV_ID: 0x00 + ASIC_SERIAL: + OAM_ID: 5 + NUM_COMPUTE_UNITS: 20 + TARGET_GRAPHICS_VERSION: gfx942 +``` + +- **Added Target_Graphics_Version to `amd-smi static --asic` and `amdsmi_get_gpu_asic_info()`**. + +```C +typedef struct { + char market_name[AMDSMI_256_LENGTH]; + uint32_t vendor_id; //< Use 32 bit to be compatible with other platform. 
+ char vendor_name[AMDSMI_MAX_STRING_LENGTH]; + uint32_t subvendor_id; //< The subsystem vendor id + uint64_t device_id; //< The device id of a GPU + uint32_t rev_id; + char asic_serial[AMDSMI_NORMAL_STRING_LENGTH]; + uint32_t oam_id; //< 0xFFFF if not supported + uint32_t num_of_compute_units; //< 0xFFFFFFFF if not supported + uint64_t target_graphics_version; //< 0xFFFFFFFFFFFFFFFF if not supported + uint32_t reserved[15]; +} amdsmi_asic_info_t; +``` + +```shell +$ amd-smi static --asic +GPU: 0 + ASIC: + MARKET_NAME: MI308X + VENDOR_ID: 0x1002 + VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI] + SUBVENDOR_ID: 0x1002 + DEVICE_ID: 0x74a2 + SUBSYSTEM_ID: 0x74a2 + REV_ID: 0x00 + ASIC_SERIAL: + OAM_ID: 5 + NUM_COMPUTE_UNITS: 20 + TARGET_GRAPHICS_VERSION: gfx942 +``` + +- **Updated Partition APIs and struct information and added partition_id to `amd-smi static --partition`**. + - As part of an overhaul to partition information, some partition information will be made available in the `amdsmi_accelerator_partition_profile_t`. + - This struct will be filled out by a new API, `amdsmi_get_gpu_accelerator_partition_profile()`. + - Future data from these APIs will eventually get added to `amd-smi partition`. + +```C +#define AMDSMI_MAX_ACCELERATOR_PROFILE 32 +#define AMDSMI_MAX_CP_PROFILE_RESOURCES 32 +#define AMDSMI_MAX_ACCELERATOR_PARTITIONS 8 + +/** + * @brief Accelerator Partition. This enum is used to identify + * various accelerator partitioning settings.
+ */ +typedef enum { + AMDSMI_ACCELERATOR_PARTITION_INVALID = 0, + AMDSMI_ACCELERATOR_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work + //!< together with shared memory + AMDSMI_ACCELERATOR_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work + //!< together with shared memory + AMDSMI_ACCELERATOR_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs + //!< work together with shared memory + AMDSMI_ACCELERATOR_PARTITION_QPX, //!< Quad GPU mode (QPX)- Quarter XCCs + //!< work together with shared memory + AMDSMI_ACCELERATOR_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with + //!< shared memory +} amdsmi_accelerator_partition_type_t; + +/** + * @brief Possible Memory Partition Modes. + * This union is used to identify various memory partitioning settings. + */ +typedef union { + struct { + uint32_t nps1_cap :1; // bool 1 = true; 0 = false; Max uint32 means unsupported + uint32_t nps2_cap :1; // bool 1 = true; 0 = false; Max uint32 means unsupported + uint32_t nps4_cap :1; // bool 1 = true; 0 = false; Max uint32 means unsupported + uint32_t nps8_cap :1; // bool 1 = true; 0 = false; Max uint32 means unsupported + uint32_t reserved :28; + } amdsmi_nps_flags_t; + + uint32_t nps_cap_mask; +} amdsmi_nps_caps_t; + +typedef struct { + amdsmi_accelerator_partition_type_t profile_type; // SPX, DPX, QPX, CPX and so on + uint32_t num_partitions; // On MI300X, SPX: 1, DPX: 2, QPX: 4, CPX: 8, length of resources array + uint32_t profile_index; + amdsmi_nps_caps_t memory_caps; // Possible memory partition capabilities + uint32_t num_resources; // length of index_of_resources_profile + uint32_t resources[AMDSMI_MAX_ACCELERATOR_PARTITIONS][AMDSMI_MAX_CP_PROFILE_RESOURCES]; + uint64_t reserved[6]; +} amdsmi_accelerator_partition_profile_t; +``` + +```shell +$ amd-smi static --partition +GPU: 0 + PARTITION: + COMPUTE_PARTITION: CPX + MEMORY_PARTITION: NPS4 + PARTITION_ID: 0 +``` + +### Removals + +- **Removed usage of _validate_positive in Parser and replaced with 
_positive_int and _not_negative_int as appropriate**. + - This will allow 0 to be a valid input for several options in setting CPUs where appropriate (for example, as a mode or NBIOID). + +### Optimizations + +- **Adjusted ordering of gpu_metrics calls to ensure that pcie_bw values remain stable in `amd-smi metric` & `amd-smi monitor`**. + - With this change, additional padding was added to PCIE_BW in `amd-smi monitor --pcie`. + +### Resolved issues + +- **Improved Offline install process & lowered dependency for PyYAML**. + +- **Fixed CPX not showing total number of logical GPUs**. + - Updates were made to `amdsmi_init()` and `amdsmi_get_gpu_bdf_id(..)`. In order to display all logical devices, we needed a way to provide an order to the enumerated GPUs. This was done by adding a partition_id within the BDF optional pci_id bits. + - Due to driver changes in KFD, some devices may report bits [31:28] or [2:0]. With the newly added `amdsmi_get_gpu_bdf_id(..)`, we provided this fallback to properly retrieve the partition ID. We +plan to eventually remove partition ID from the function portion of the BDF (Bus Device Function). See below for PCI ID description. + + - bits [63:32] = domain + - bits [31:28] or bits [2:0] = partition id + - bits [27:16] = reserved + - bits [15:8] = Bus + - bits [7:3] = Device + - bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes + + - Previously in non-SPX modes (for example, CPX, TPX, or DPX) some MI3x ASICs would not report all logical GPU devices within AMD SMI. 
+ +```shell +$ amd-smi monitor -p -t -v +GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL + 0 248 W 55 °C 48 °C 283 MB 196300 MB + 1 247 W 55 °C 48 °C 283 MB 196300 MB + 2 247 W 55 °C 48 °C 283 MB 196300 MB + 3 247 W 55 °C 48 °C 283 MB 196300 MB + 4 221 W 50 °C 42 °C 283 MB 196300 MB + 5 221 W 50 °C 42 °C 283 MB 196300 MB + 6 222 W 50 °C 42 °C 283 MB 196300 MB + 7 221 W 50 °C 42 °C 283 MB 196300 MB + 8 239 W 53 °C 46 °C 283 MB 196300 MB + 9 239 W 53 °C 46 °C 283 MB 196300 MB + 10 239 W 53 °C 46 °C 283 MB 196300 MB + 11 239 W 53 °C 46 °C 283 MB 196300 MB + 12 219 W 51 °C 48 °C 283 MB 196300 MB + 13 219 W 51 °C 48 °C 283 MB 196300 MB + 14 219 W 51 °C 48 °C 283 MB 196300 MB + 15 219 W 51 °C 48 °C 283 MB 196300 MB + 16 222 W 51 °C 47 °C 283 MB 196300 MB + 17 222 W 51 °C 47 °C 283 MB 196300 MB + 18 222 W 51 °C 47 °C 283 MB 196300 MB + 19 222 W 51 °C 48 °C 283 MB 196300 MB + 20 241 W 55 °C 48 °C 283 MB 196300 MB + 21 241 W 55 °C 48 °C 283 MB 196300 MB + 22 241 W 55 °C 48 °C 283 MB 196300 MB + 23 240 W 55 °C 48 °C 283 MB 196300 MB + 24 211 W 51 °C 45 °C 283 MB 196300 MB + 25 211 W 51 °C 45 °C 283 MB 196300 MB + 26 211 W 51 °C 45 °C 283 MB 196300 MB + 27 211 W 51 °C 45 °C 283 MB 196300 MB + 28 227 W 51 °C 49 °C 283 MB 196300 MB + 29 227 W 51 °C 49 °C 283 MB 196300 MB + 30 227 W 51 °C 49 °C 283 MB 196300 MB + 31 227 W 51 °C 49 °C 283 MB 196300 MB +``` + +- **Fixed incorrect implementation of the Python API `amdsmi_get_gpu_metrics_header_info()`**. + +- **`amdsmitst` TestGpuMetricsRead now prints metric in correct units**. + +### Known issues + +- N/A + +### Upcoming changes + +- **Python API for `amdsmi_get_energy_count()` will deprecate the `power` field in ROCm 6.4 and use `energy_accumulator` field instead**. + +- **Added preliminary `amd-smi partition` command**. + - The new partition command can be used to display GPU information, including memory and accelerator partition information. 
+ - The command will be at full functionality once additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()` has been implemented. + +## amd_smi_lib for ROCm 6.2.1 + +### Additions + +- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**. +Guest VMs do not support getting current ECC counts from the Host cards. + +- **Added `amd-smi static --ras` on Guest VMs**. +Guest VMs can view enabled/disabled RAS features that are on Host cards. + +### Optimizations + +- N/A + +### Fixes + +- **Fixed TypeError in `amd-smi process -G`**. + +- **Updated CLI error strings to handle empty and invalid GPU/CPU inputs**. + +- **Fixed Guest VM showing passthrough options**. + +- **Fixed firmware formatting where leading 0s were missing**. + +### Known Issues + +- N/A + +## amd_smi_lib for ROCm 6.2.0 + +### Additions + +- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**. + +- **Added optional process table under `amd-smi monitor -q`**. +The monitor subcommand within the CLI Tool now has the `-q` option to enable an optional process table underneath the original monitored output. + +```shell +$ amd-smi monitor -q +GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK ENC_UTIL ENC_CLOCK DEC_UTIL DEC_CLOCK SINGLE_ECC DOUBLE_ECC PCIE_REPLAY VRAM_USED VRAM_TOTAL PCIE_BW + 0 199 W 103 °C 84 °C 99 % 1920 MHz 31 % 1000 MHz N/A 0 MHz N/A 0 MHz 0 0 0 1235 MB 16335 MB N/A Mb/s + +PROCESS INFO: +GPU NAME PID GTT_MEM CPU_MEM VRAM_MEM MEM_USAGE GFX ENC + 0 rvs 1564865 0.0 B 0.0 B 1.1 GB 0.0 B 0 ns 0 ns +``` + +- **Added Handling to detect VMs with passthrough configurations in CLI Tool**. +Previously, the CLI Tool allowed only a restricted set of options for virtual machines with passthrough GPUs. Now we offer an expanded set of functions available to passthrough-configured GPUs. + +- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**. 
+VMs can now set process isolation and clear the SRAM from the CLI tool using the following commands: + +```shell +amd-smi set --process-isolation <0 or 1> +amd-smi reset --clean_local_data +``` + +- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**. +Added macros to reference max size limitations for certain amdsmi functions such as max dpm policies and max fanspeed. + +- **Added Ring Hang event**. +Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_notification_type_t` enum. + +### Optimizations + +- **Updated CLI error strings to specify invalid device type queried**. + +```shell +$ amd-smi static --asic --gpu 123123 +Can not find a device: GPU '123123' Error code: -3 +``` + +- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**. +Previously, if a process with elevated permissions was running, amd-smi required sudo to display all output. Now amd-smi populates all process data and returns N/A for elevated process names instead. However, if run with sudo, you can see the name, like so: + +```shell +$ amd-smi process +GPU: 0 + PROCESS_INFO: + NAME: N/A + PID: 1693982 + MEMORY_USAGE: + GTT_MEM: 0.0 B + CPU_MEM: 0.0 B + VRAM_MEM: 10.1 GB + MEM_USAGE: 0.0 B + USAGE: + GFX: 0 ns + ENC: 0 ns +``` + +```shell +$ sudo amd-smi process +GPU: 0 + PROCESS_INFO: + NAME: TransferBench + PID: 1693982 + MEMORY_USAGE: + GTT_MEM: 0.0 B + CPU_MEM: 0.0 B + VRAM_MEM: 10.1 GB + MEM_USAGE: 0.0 B + USAGE: + GFX: 0 ns + ENC: 0 ns +``` + +- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**. +Changed the naming to more accurately reflect what the function does. This change also extends to the CLI, where we changed the `clear-sram-data` command to `clean_local_data`. + +- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**. 
+Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locked value. The new struct has the following format: + +```shell + typedef struct { ++ uint32_t clk; + uint32_t min_clk; + uint32_t max_clk; ++ uint8_t clk_locked; ++ uint8_t clk_deep_sleep; + uint32_t reserved[4]; + } amdsmi_clk_info_t; +``` + +- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**. +Multiple structures used by APIs were changed for alignment unification: + - Changed the `amdsmi_vram_info_t` field `vram_size_mb` to `vram_size` + - Updated `amdsmi_vram_type_t` to include new enums and added the `AMDSMI` prefix + - Updated `amdsmi_status_t`, where some enums were missing the `AMDSMI_STATUS` prefix + - Added `AMDSMI_PROCESSOR_TYPE` prefix to `processor_type_t` enums + - Removed the fields structure definition in favor of an anonymous definition in `amdsmi_bdf_t` + +- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**. +Multiple structures used by APIs were changed for alignment unification. `AMDSMI` prefix was added to the following structures: + - Added AMDSMI prefix to `amdsmi_container_types_t` enums + - Added AMDSMI prefix to `amdsmi_clk_type_t` enums + - Added AMDSMI prefix to `amdsmi_compute_partition_type_t` enums + - Added AMDSMI prefix to `amdsmi_memory_partition_type_t` enums + - Added AMDSMI prefix to `amdsmi_clk_type_t` enums + - Added AMDSMI prefix to `amdsmi_temperature_type_t` enums + - Added AMDSMI prefix to `amdsmi_fw_block_t` enums + +- **Changed dpm_policy references to soc_pstate**. +The file structure that referred to dpm_policy changed to soc_pstate, and we have changed the APIs and CLI tool to be in line with the current structure. `amdsmi_get_dpm_policy()` and `amdsmi_set_dpm_policy()` are no longer valid; the new APIs are `amdsmi_get_soc_pstate()` and `amdsmi_set_soc_pstate()`. 
The CLI tool has been changed from `--policy` to `--soc-pstate`. + +- **Updated `amdsmi_get_gpu_board_info()` product_name to fall back to pciids**. +Previously, on devices without a FRU, we would not populate the product name in the `amdsmi_board_info_t` structure. Now we fall back to the name listed in the pciids file, if available. + +- **Updated CLI voltage curve command output**. +The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point, or outputs N/A for each curve point if not applicable. + +```shell +GPU: 0 + VOLTAGE_CURVE: + POINT_0_FREQUENCY: 872 Mhz + POINT_0_VOLTAGE: 736 mV + POINT_1_FREQUENCY: 1354 Mhz + POINT_1_VOLTAGE: 860 mV + POINT_2_FREQUENCY: 1837 Mhz + POINT_2_VOLTAGE: 1186 mV +``` + +- **Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`**. +Updated sizes that work for retrieving relevant board information across AMD's +ASIC products. This requires users to update any ABIs using this structure. + +### Fixes + +- **Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool**. +When running `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either a clear of the mutex in /dev/shm or a reboot of the machine. + +- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**. +Multiple process outputs in the CLI tool were not being registered correctly. 
The JSON output did not handle multiple processes and is now in a new valid JSON format: + +```shell +[ + { + "gpu": 0, + "process_list": [ + { + "process_info": { + "name": "TransferBench", + "pid": 420157, + "mem_usage": { + "value": 0, + "unit": "B" + } + } + }, + { + "process_info": { + "name": "rvs", + "pid": 420315, + "mem_usage": { + "value": 0, + "unit": "B" + } + } + } + ] + } +] +``` + +- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**. +Throttle status may work for older ASICs, but will be replaced with PVIOL and TVIOL metrics for future ASIC support. It remains a field in the gpu_metrics API and in `amd-smi metric --power`. + +- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**. +Previously if there was a partial failure to retrieve character strings, we would return +garbage output to users using the API. This fix intends to populate as many values as possible. +For any `amdsmi_board_info_t` data member that cannot be populated because of a failure, +`\0` is provided instead, ensuring an empty char string value. + +- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**. +The parsing of `pp_od_clk_voltage` was not dynamic enough to work with the dropping of voltage curve support on MI series cards. This propagates down to correcting the CLI's output `amd-smi metric --voltage-curve` to N/A if voltage curve is not enabled. + +### Known Issues + +- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not currently work and will be supported in a future release**. + +## amd_smi_lib for ROCm 6.1.2 + +### Additions + +- **Added process isolation and clean shader APIs and CLI commands**. +Added APIs and CLI commands to address LeftoverLocals security issues, allowing clearing the SRAM data and setting process isolation on a per-GPU basis. 
New APIs: + - `amdsmi_get_gpu_process_isolation()` + - `amdsmi_set_gpu_process_isolation()` + - `amdsmi_set_gpu_clear_sram_data()` + +- **Added `MIN_POWER` to output of `amd-smi static --limit`**. +This change helps users identify the range to which they can change the power cap of the GPU. The change is added to simplify why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap ` or `amd-smi reset -g all --power-cap`. + +```shell +$ amd-smi static --limit +GPU: 0 + LIMIT: + MAX_POWER: 203 W + MIN_POWER: 0 W + SOCKET_POWER: 203 W + SLOWDOWN_EDGE_TEMPERATURE: 100 °C + SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C + SLOWDOWN_VRAM_TEMPERATURE: 100 °C + SHUTDOWN_EDGE_TEMPERATURE: 105 °C + SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C + SHUTDOWN_VRAM_TEMPERATURE: 105 °C + +GPU: 1 + LIMIT: + MAX_POWER: 213 W + MIN_POWER: 213 W + SOCKET_POWER: 213 W + SLOWDOWN_EDGE_TEMPERATURE: 109 °C + SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C + SLOWDOWN_VRAM_TEMPERATURE: 100 °C + SHUTDOWN_EDGE_TEMPERATURE: 114 °C + SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C + SHUTDOWN_VRAM_TEMPERATURE: 105 °C +``` + +### Optimizations + +- **Updated `amd-smi monitor --pcie` output**. +The source for pcie bandwidth monitor output was a legacy file we no longer support and was causing delays within the monitor command. The output is no longer using TX/RX but instantaneous bandwidth from gpu_metrics instead; updated output: + +```shell +$ amd-smi monitor --pcie +GPU PCIE_BW + 0 26 Mb/s +``` + +- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**. +`amdsmi_get_power_cap_info` will return in uW as originally reflected by driver. Previously `amdsmi_get_power_cap_info` returned W values, this conflicts with our sets and modifies values retrieved from driver. We decided to keep the values returned from driver untouched (in original units, uW). Then in CLI we will convert to watts (as previously done - no changes here). 
Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same). + +- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. +Previously, calls returned "No bad pages found." if no pages were found. Now they only return the list type, which can be empty. + +- **Updated `amd-smi metric --ecc-blocks` output**. +The ecc blocks argument was outputting blocks without counters available. The filtering now shows only blocks that have counters available: + +``` shell +$ amd-smi metric --ecc-block +GPU: 0 + ECC_BLOCKS: + UMC: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + SDMA: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + GFX: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + MMHUB: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + PCIE_BIF: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + HDP: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + XGMI_WAFL: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 +``` + +- **Removed `amdsmi_get_gpu_process_info` from Python library**. +amdsmi_get_gpu_process_info was removed from the C library in an earlier build, but the API was still in the Python interface. + +### Fixes + +- **Fixed `amd-smi metric --power` now provides power output for Navi2x/Navi3x/MI1x**. +These systems use an older version of gpu_metrics in amdgpu. This fix only updates what the CLI outputs. +No change in any of our APIs. 
+ +```shell +$ amd-smi metric --power +GPU: 0 + POWER: + SOCKET_POWER: 11 W + GFX_VOLTAGE: 768 mV + SOC_VOLTAGE: 925 mV + MEM_VOLTAGE: 1250 mV + POWER_MANAGEMENT: ENABLED + THROTTLE_STATUS: UNTHROTTLED + +GPU: 1 + POWER: + SOCKET_POWER: 17 W + GFX_VOLTAGE: 781 mV + SOC_VOLTAGE: 806 mV + MEM_VOLTAGE: 1250 mV + POWER_MANAGEMENT: ENABLED + THROTTLE_STATUS: UNTHROTTLED +``` + +- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**. +Updates required `amdsmi_get_power_cap_info` to return in uW as originally reflected by the driver. Previously `amdsmi_get_power_cap_info` returned W values, which conflicted with our set functions and modified the values retrieved from the driver. We decided to keep the values returned from the driver untouched (in original units, uW). Then in the CLI we convert to watts (as previously done - no changes here). Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same). + +- **Fixed Python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. +Previously, Python interface calls to populate bad pages resulted in a `ValueError: NULL pointer access`. This fixes the `bad-pages` CLI subcommand as well. + +### Known Issues + +- N/A + +## amd_smi_lib for ROCm 6.1.1 + +### Changes + +- **Updated metrics --clocks**. +Output for `amd-smi metric --clock` is updated to reflect each engine and to include bug fixes for the clock lock status and deep sleep status. 
+ +``` shell +$ amd-smi metric --clock +GPU: 0 + CLOCK: + GFX_0: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_1: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_2: + CLK: 112 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_3: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_4: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_5: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_6: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + GFX_7: + CLK: 113 MHz + MIN_CLK: 500 MHz + MAX_CLK: 1800 MHz + CLK_LOCKED: DISABLED + DEEP_SLEEP: ENABLED + MEM_0: + CLK: 900 MHz + MIN_CLK: 900 MHz + MAX_CLK: 1200 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: DISABLED + VCLK_0: + CLK: 29 MHz + MIN_CLK: 914 MHz + MAX_CLK: 1480 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + VCLK_1: + CLK: 29 MHz + MIN_CLK: 914 MHz + MAX_CLK: 1480 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + VCLK_2: + CLK: 29 MHz + MIN_CLK: 914 MHz + MAX_CLK: 1480 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + VCLK_3: + CLK: 29 MHz + MIN_CLK: 914 MHz + MAX_CLK: 1480 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + DCLK_0: + CLK: 22 MHz + MIN_CLK: 711 MHz + MAX_CLK: 1233 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + DCLK_1: + CLK: 22 MHz + MIN_CLK: 711 MHz + MAX_CLK: 1233 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + DCLK_2: + CLK: 22 MHz + MIN_CLK: 711 MHz + MAX_CLK: 1233 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED + DCLK_3: + CLK: 22 MHz + MIN_CLK: 711 MHz + MAX_CLK: 1233 MHz + CLK_LOCKED: N/A + DEEP_SLEEP: ENABLED +``` + +- **Added deferred ecc counts**. 
+Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks` + +```shell +$ amd-smi metric --ecc --ecc-blocks +GPU: 0 + ECC: + TOTAL_CORRECTABLE_COUNT: 0 + TOTAL_UNCORRECTABLE_COUNT: 0 + TOTAL_DEFERRED_COUNT: 0 + CACHE_CORRECTABLE_COUNT: 0 + CACHE_UNCORRECTABLE_COUNT: 0 + ECC_BLOCKS: + UMC: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + SDMA: + CORRECTABLE_COUNT: 0 + UNCORRECTABLE_COUNT: 0 + DEFERRED_COUNT: 0 + ... +``` + +- **Updated `amd-smi topology --json` to align with host/guest**. +Topology's `--json` output now is changed to align with output host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for full list). See examples shown below. + +*Previous format:* + +```shell +$ amd-smi topology --json +[ + { + "gpu": 0, + "link_accessibility": { + "gpu_0": "ENABLED", + "gpu_1": "DISABLED" + }, + "weight": { + "gpu_0": 0, + "gpu_1": 40 + }, + "hops": { + "gpu_0": 0, + "gpu_1": 2 + }, + "link_type": { + "gpu_0": "SELF", + "gpu_1": "PCIE" + }, + "numa_bandwidth": { + "gpu_0": "N/A", + "gpu_1": "N/A" + } + }, + { + "gpu": 1, + "link_accessibility": { + "gpu_0": "DISABLED", + "gpu_1": "ENABLED" + }, + "weight": { + "gpu_0": 40, + "gpu_1": 0 + }, + "hops": { + "gpu_0": 2, + "gpu_1": 0 + }, + "link_type": { + "gpu_0": "PCIE", + "gpu_1": "SELF" + }, + "numa_bandwidth": { + "gpu_0": "N/A", + "gpu_1": "N/A" + } + } +] +``` + +*New format:* + +```shell +$ amd-smi topology --json +[ + { + "gpu": 0, + "bdf": "0000:01:00.0", + "links": [ + { + "gpu": 0, + "bdf": "0000:01:00.0", + "weight": 0, + "link_status": "ENABLED", + "link_type": "SELF", + "num_hops": 0, + "bandwidth": "N/A", + }, + { + "gpu": 1, + "bdf": "0001:01:00.0", + "weight": 15, + "link_status": "ENABLED", + "link_type": "XGMI", + "num_hops": 1, + "bandwidth": "50000-100000", + }, + ... + ] + }, + ... 
+] +``` + +```shell +$ /opt/rocm/bin/amd-smi topology -a -t --json +[ + { + "gpu": 0, + "bdf": "0000:08:00.0", + "links": [ + { + "gpu": 0, + "bdf": "0000:08:00.0", + "link_status": "ENABLED", + "link_type": "SELF" + }, + { + "gpu": 1, + "bdf": "0000:44:00.0", + "link_status": "DISABLED", + "link_type": "PCIE" + } + ] + }, + { + "gpu": 1, + "bdf": "0000:44:00.0", + "links": [ + { + "gpu": 0, + "bdf": "0000:08:00.0", + "link_status": "DISABLED", + "link_type": "PCIE" + }, + { + "gpu": 1, + "bdf": "0000:44:00.0", + "link_status": "ENABLED", + "link_type": "SELF" + } + ] + } +] +``` + +### Fixes + +- **Fix for GPU reset error on non-amdgpu cards**. +Previously our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix +updates the CLI to target only AMD ASICs. + +- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()` Navi32/31 cards**. +Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix +provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards). + +- **Fix for `amd-smi process`**. +Fixed output results when getting processes running on a device. + +- **Improved Error handling for `amd-smi process`**. +Fixed Attribute Error when getting processes in CSV format. + +### Known issues + +- `amd-smi bad-pages` can result in "ValueError: NULL pointer access" with certain PM FW versions. + +## amd_smi_lib for ROCm 6.1.0 + +### Additions + +- **Added Monitor Command**. +Provides users the ability to customize GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI's `rocm-smi` (no argument) output and additionally allows users to customize what data is helpful for their use case. + +```shell +$ amd-smi monitor -h +usage: amd-smi monitor [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL] + [-g GPU [GPU ...] | -U CPU [CPU ...] 
| -O CORE [CORE ...]] + [-w INTERVAL] [-W TIME] [-i ITERATIONS] [-p] [-t] [-u] [-m] [-n] + [-d] [-s] [-e] [-v] [-r] + +Monitor a target device for the specified arguments. +If no arguments are provided, all arguments will be enabled. +Use the watch arguments to run continuously + +Monitor Arguments: + -h, --help show this help message and exit + -g, --gpu GPU [GPU ...] Select a GPU ID, BDF, or UUID from the possible choices: + ID: 0 | BDF: 0000:01:00.0 | UUID: + all | Selects all devices + -U, --cpu CPU [CPU ...] Select a CPU ID from the possible choices: + ID: 0 + all | Selects all devices + -O, --core CORE [CORE ...] Select a Core ID from the possible choices: + ID: 0 - 23 + all | Selects all devices + -w, --watch INTERVAL Reprint the command in a loop of INTERVAL seconds + -W, --watch_time TIME The total TIME to watch the given command + -i, --iterations ITERATIONS Total number of ITERATIONS to loop on the given command + -p, --power-usage Monitor power usage in Watts + -t, --temperature Monitor temperature in Celsius + -u, --gfx Monitor graphics utilization (%) and clock (MHz) + -m, --mem Monitor memory utilization (%) and clock (MHz) + -n, --encoder Monitor encoder utilization (%) and clock (MHz) + -d, --decoder Monitor decoder utilization (%) and clock (MHz) + -s, --throttle-status Monitor thermal throttle status + -e, --ecc Monitor ECC single bit, ECC double bit, and PCIe replay error counts + -v, --vram-usage Monitor memory usage in MB + -r, --pcie Monitor PCIe Tx/Rx in MB/s + +Command Modifiers: + --json Displays output in JSON format (human readable by default). + --csv Displays output in CSV format (human readable by default). + --file FILE Saves output into a file on the provided path (stdout by default). 
+ --loglevel LEVEL Set the logging level from the possible choices: + DEBUG, INFO, WARNING, ERROR, CRITICAL +``` + +```shell +$ amd-smi monitor -ptumv +GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_USED VRAM_TOTAL + 0 171 W 32 °C 33 °C 0 % 114 MHz 0 % 900 MHz 283 MB 196300 MB + 1 175 W 33 °C 34 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB + 2 177 W 31 °C 33 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB + 3 172 W 33 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB + 4 178 W 32 °C 32 °C 0 % 113 MHz 0 % 900 MHz 284 MB 196300 MB + 5 176 W 33 °C 35 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB + 6 176 W 32 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB + 7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB +``` + +- **Integrated ESMI Tool**. +Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in `amd-smi static` and `amd-smi metric` commands. Only available for limited target processors. As of ROCm 6.0.2, this is listed as: + - AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh + - AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0-AFh + + See a few examples listed below. 
+ +```shell +$ amd-smi static -U all +CPU: 0 + SMU: + FW_VERSION: 85.90.0 + INTERFACE_VERSION: + PROTO VERSION: 6 +``` + +```shell +$ amd-smi metric -O 0 1 2 +CORE: 0 + BOOST_LIMIT: + VALUE: 400 MHz + CURR_ACTIVE_FREQ_CORE_LIMIT: + VALUE: 400 MHz + CORE_ENERGY: + VALUE: N/A + +CORE: 1 + BOOST_LIMIT: + VALUE: 400 MHz + CURR_ACTIVE_FREQ_CORE_LIMIT: + VALUE: 400 MHz + CORE_ENERGY: + VALUE: N/A + +CORE: 2 + BOOST_LIMIT: + VALUE: 400 MHz + CURR_ACTIVE_FREQ_CORE_LIMIT: + VALUE: 400 MHz + CORE_ENERGY: + VALUE: N/A +``` + +```shell +$ amd-smi metric -U all +CPU: 0 + POWER_METRICS: + SOCKET POWER: 102675 mW + SOCKET POWER LIMIT: 550000 mW + SOCKET MAX POWER LIMIT: 550000 mW + PROCHOT: + PROCHOT_STATUS: 0 + FREQ_METRICS: + FCLKMEMCLK: + FCLK: 2000 MHz + MCLK: 1300 MHz + CCLKFREQLIMIT: 400 MHz + SOC_CURRENT_ACTIVE_FREQ_LIMIT: + FREQ: 400 MHz + FREQ_SRC: [HSMP Agent] + SOC_FREQ_RANGE: + MAX_SOCKET_FREQ: 3700 MHz + MIN_SOCKET_FREQ: 400 MHz + C0_RESIDENCY: + RESIDENCY: 4 % + SVI_TELEMETRY_ALL_RAILS: + POWER: 102673 mW + METRIC_VERSION: + VERSION: 11 + METRICS_TABLE: + CPU_FAMILY: 25 + CPU_MODEL: 144 + RESPONSE: + MTBL_ACCUMULATION_COUNTER: 2887162626 + MTBL_MAX_SOCKET_TEMPERATURE: 41.0 °C + MTBL_MAX_VR_TEMPERATURE: 39.0 °C + MTBL_MAX_HBM_TEMPERATURE: 40.0 °C + MTBL_MAX_SOCKET_TEMPERATURE_ACC: 108583340881.125 °C + MTBL_MAX_VR_TEMPERATURE_ACC: 109472702595.0 °C + MTBL_MAX_HBM_TEMPERATURE_ACC: 111516663941.0 °C + MTBL_SOCKET_POWER_LIMIT: 550.0 W + MTBL_MAX_SOCKET_POWER_LIMIT: 550.0 W + MTBL_SOCKET_POWER: 102.678 W + MTBL_TIMESTAMP_RAW: 288731677361880 + MTBL_TIMESTAMP_READABLE: Tue Mar 19 12:32:21 2024 + MTBL_SOCKET_ENERGY_ACC: 166127.84 kJ + MTBL_CCD_ENERGY_ACC: 3317.837 kJ + MTBL_XCD_ENERGY_ACC: 21889.147 kJ + MTBL_AID_ENERGY_ACC: 121932.397 kJ + MTBL_HBM_ENERGY_ACC: 18994.108 kJ + MTBL_CCLK_FREQUENCY_LIMIT: 3.7 GHz + MTBL_GFXCLK_FREQUENCY_LIMIT: 0.0 MHz + MTBL_FCLK_FREQUENCY: 1999.988 MHz + MTBL_UCLK_FREQUENCY: 1299.993 MHz + MTBL_SOCCLK_FREQUENCY: [35.716, 35.715, 35.714, 
35.714] MHz + MTBL_VCLK_FREQUENCY: [0.0, 53.749, 53.749, 53.749] MHz + MTBL_DCLK_FREQUENCY: [7.143, 44.791, 44.791, 44.791] MHz + MTBL_LCLK_FREQUENCY: [20.872, 18.75, 35.938, 599.558] MHz + MTBL_FCLK_FREQUENCY_TABLE: [1200.0, 1600.0, 1900.0, 2000.0] MHz + MTBL_UCLK_FREQUENCY_TABLE: [900.0, 1100.0, 1200.0, 1300.0] MHz + MTBL_SOCCLK_FREQUENCY_TABLE: [800.0, 1000.0, 1142.857, 1142.857] MHz + MTBL_VCLK_FREQUENCY_TABLE: [914.286, 1300.0, 1560.0, 1720.0] MHz + MTBL_DCLK_FREQUENCY_TABLE: [711.111, 975.0, 1300.0, 1433.333] MHz + MTBL_LCLK_FREQUENCY_TABLE: [600.0, 844.444, 1150.0, 1150.0] MHz + MTBL_CCLK_FREQUENCY_ACC: [4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, + 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, + 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, + 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, + 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, 4399751656.639, + 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, + 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, + 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, + 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, + 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] GHz + MTBL_GFXCLK_FREQUENCY_ACC: [0.0, 0.0, 250534397827.603, 251546257401.82, 250811364089.836, + 249999070486.505, 251622633562.855, 251342375116.05] MHz + MTBL_GFXCLK_FREQUENCY: [0.0, 0.0, 31.091, 31.414, 31.141, 31.478, 31.32, 31.453] + MHz + MTBL_MAX_CCLK_FREQUENCY: 3.7 GHz + MTBL_MIN_CCLK_FREQUENCY: 0.4 GHz + MTBL_MAX_GFXCLK_FREQUENCY: 2100.0 MHz + MTBL_MIN_GFXCLK_FREQUENCY: 500.0 MHz + MTBL_MAX_LCLK_DPM_RANGE: 2 + MTBL_MIN_LCLK_DPM_RANGE: 0 + MTBL_XGMI_WIDTH: 0.0 + MTBL_XGMI_BITRATE: 0.0 Gbps + MTBL_XGMI_READ_BANDWIDTH_ACC: [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] Gbps + MTBL_XGMI_WRITE_BANDWIDTH_ACC: 
[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] Gbps + MTBL_SOCKET_C0_RESIDENCY: 4.329 % + MTBL_SOCKET_GFX_BUSY: 0.0 % + MTBL_HBM_BANDWIDTH_UTILIZATION: 0.001 % + MTBL_SOCKET_C0_RESIDENCY_ACC: 311523106.34 + MTBL_SOCKET_GFX_BUSY_ACC: 84739.281 + MTBL_HBM_BANDWIDTH_ACC: 33231180.073 Gbps + MTBL_MAX_HBM_BANDWIDTH: 5324.801 Gbps + MTBL_DRAM_BANDWIDTH_UTILIZATION_ACC: 612843.699 + MTBL_PCIE_BANDWIDTH_ACC: [0.0, 0.0, 0.0, 0.0] Gbps + MTBL_PROCHOT_RESIDENCY_ACC: 0 + MTBL_PPT_RESIDENCY_ACC: 2887162626 + MTBL_SOCKET_THM_RESIDENCY_ACC: 2887162626 + MTBL_VR_THM_RESIDENCY_ACC: 0 + MTBL_HBM_THM_RESIDENCY_ACC: 2887162626 + SOCKET_ENERGY: + RESPONSE: N/A + DDR_BANDWIDTH: + RESPONSE: N/A + CPU_TEMP: + RESPONSE: N/A +``` + +- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**. +Using the AMD SMI tool, users can retrieve VCN, JPEG engine, and PCIe error metrics by calling `amd-smi metric -P` or `amd-smi metric --usage`. Depending on device support, `VCN_ACTIVITY` updates for MI3x ASICs (with 4 separate VCN engine activities); for older ASICs, `MM_ACTIVITY` reports UVD/VCN engine activity (average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields. + +```shell +$ amd-smi metric -P +GPU: 0 + PCIE: + WIDTH: 16 + SPEED: 16 GT/s + REPLAY_COUNT: 0 + L0_TO_RECOVERY_COUNT: 1 + REPLAY_ROLL_OVER_COUNT: 0 + NAK_SENT_COUNT: 0 + NAK_RECEIVED_COUNT: 0 + CURRENT_BANDWIDTH_SENT: N/A + CURRENT_BANDWIDTH_RECEIVED: N/A + MAX_PACKET_SIZE: N/A +``` + +```shell +$ amd-smi metric --usage +GPU: 0 + USAGE: + GFX_ACTIVITY: 0 % + UMC_ACTIVITY: 0 % + MM_ACTIVITY: N/A + VCN_ACTIVITY: [0 %, 0 %, 0 %, 0 %] + JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 + %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, + 0 %, 0 %, 0 %, 0 %] + +``` + +- **Added AMDSMI Tool Version**. 
+AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version. +The AMDSMI Tool version is the CLI/tool version number with the commit ID appended after the `+` sign. +The AMDSMI Library version is the library package version number. +The ROCm version is the system's installed ROCm version; if ROCm is not installed, it reports N/A. + +```shell +$ amd-smi version +AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0 +``` + +- **Added XGMI table**. +Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (e.g. MI300). Here users can view the accumulated read/write data transfer size (in kilobytes) over XGMI or PCIe. + +```shell +$ amd-smi xgmi +LINK METRIC TABLE: + bdf bit_rate max_bandwidth link_type 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +GPU0 0000:0c:00.0 32 Gb/s 512 Gb/s XGMI + Read N/A 2 KB 2 KB 1 KB 2 KB 1 KB 2 KB 2 KB + Write N/A 1 KB 1 KB 1 KB 1 KB 1 KB 1 KB 1 KB +GPU1 0000:22:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB N/A 2 KB 2 KB 1 KB 2 KB 1 KB 2 KB + Write 0 KB N/A 1 KB 1 KB 1 KB 1 KB 1 KB 1 KB +GPU2 0000:38:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 1 KB N/A 2 KB 1 KB 2 KB 0 KB 0 KB + Write 0 KB 1 KB N/A 1 KB 1 KB 1 KB 1 KB 1 KB +GPU3 0000:5c:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 0 KB 2 KB N/A 1 KB 0 KB 0 KB 2 KB + Write 0 KB 1 KB 1 KB N/A 1 KB 1 KB 1 KB 1 KB +GPU4 0000:9f:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 1 KB 0 KB 0 KB N/A 2 KB 0 KB 2 KB + Write 0 KB 1 KB 1 KB 1 KB N/A 1 KB 1 KB 1 KB +GPU5 0000:af:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 2 KB 0 KB 0 KB 0 KB N/A 2 KB 0 KB + Write 0 KB 1 KB 1 KB 1 KB 1 KB N/A 1 KB 1 KB +GPU6 0000:bf:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 0 KB 0 KB 0 KB 0 KB 0 KB N/A 0 KB + Write 0 KB 1 KB 1 KB 1 KB 1 KB 1 KB N/A 1 KB +GPU7 0000:df:00.0 32 Gb/s 512 Gb/s XGMI + Read 0 KB 0 KB 0 KB 0 KB 0 KB 0 KB 0 KB N/A + Write 0 KB 1 KB 1 KB 1 KB 1 KB 1 KB 1 KB N/A + +``` + +- 
**Added units of measure to JSON output**. +We added units of measure to the JSON/CSV output of the `amd-smi metric`, `amd-smi static`, and `amd-smi monitor` commands. + +Ex. + +```shell +amd-smi metric -p --json +[ + { + "gpu": 0, + "power": { + "socket_power": { + "value": 10, + "unit": "W" + }, + "gfx_voltage": { + "value": 6, + "unit": "mV" + }, + "soc_voltage": { + "value": 918, + "unit": "mV" + }, + "mem_voltage": { + "value": 1250, + "unit": "mV" + }, + "power_management": "ENABLED", + "throttle_status": "UNTHROTTLED" + } + } +] +``` + +### Changes + +- **Topology is now left-aligned with the BDF of each device listed in each table's rows/columns**. +We provided each device's BDF for every table's rows/columns, then left-aligned the data. We want AMD SMI Tool output to be easy to understand and digest for our users. Having users scroll up to find this information made it difficult to follow, especially for devices which have many devices associated with one ASIC. + +```shell +$ amd-smi topology +ACCESS TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:22:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:38:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:5c:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:9f:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:af:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:bf:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED +0000:df:00.0 ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED ENABLED + +WEIGHT TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 0 15 15 15 15 15 15 15 +0000:22:00.0 15 0 15 15 15 15 15 15 +0000:38:00.0 15 15 0 15 15 15 15 15 +0000:5c:00.0 15 15 15 0 15 
15 15 15 +0000:9f:00.0 15 15 15 15 0 15 15 15 +0000:af:00.0 15 15 15 15 15 0 15 15 +0000:bf:00.0 15 15 15 15 15 15 0 15 +0000:df:00.0 15 15 15 15 15 15 15 0 + +HOPS TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 0 1 1 1 1 1 1 1 +0000:22:00.0 1 0 1 1 1 1 1 1 +0000:38:00.0 1 1 0 1 1 1 1 1 +0000:5c:00.0 1 1 1 0 1 1 1 1 +0000:9f:00.0 1 1 1 1 0 1 1 1 +0000:af:00.0 1 1 1 1 1 0 1 1 +0000:bf:00.0 1 1 1 1 1 1 0 1 +0000:df:00.0 1 1 1 1 1 1 1 0 + +LINK TYPE TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 SELF XGMI XGMI XGMI XGMI XGMI XGMI XGMI +0000:22:00.0 XGMI SELF XGMI XGMI XGMI XGMI XGMI XGMI +0000:38:00.0 XGMI XGMI SELF XGMI XGMI XGMI XGMI XGMI +0000:5c:00.0 XGMI XGMI XGMI SELF XGMI XGMI XGMI XGMI +0000:9f:00.0 XGMI XGMI XGMI XGMI SELF XGMI XGMI XGMI +0000:af:00.0 XGMI XGMI XGMI XGMI XGMI SELF XGMI XGMI +0000:bf:00.0 XGMI XGMI XGMI XGMI XGMI XGMI SELF XGMI +0000:df:00.0 XGMI XGMI XGMI XGMI XGMI XGMI XGMI SELF + +NUMA BW TABLE: + 0000:0c:00.0 0000:22:00.0 0000:38:00.0 0000:5c:00.0 0000:9f:00.0 0000:af:00.0 0000:bf:00.0 0000:df:00.0 +0000:0c:00.0 N/A 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 +0000:22:00.0 50000-50000 N/A 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 +0000:38:00.0 50000-50000 50000-50000 N/A 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 +0000:5c:00.0 50000-50000 50000-50000 50000-50000 N/A 50000-50000 50000-50000 50000-50000 50000-50000 +0000:9f:00.0 50000-50000 50000-50000 50000-50000 50000-50000 N/A 50000-50000 50000-50000 50000-50000 +0000:af:00.0 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 N/A 50000-50000 50000-50000 +0000:bf:00.0 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 N/A 50000-50000 +0000:df:00.0 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 50000-50000 
50000-50000 N/A +``` + +### Fixes + +- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**. +For devices which do not report this data (e.g. Navi3X/Navi2X/MI100), we have added checks to confirm that these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, tests now display a return string. +- **Fix for devices which have an older pyyaml installed**. +On platforms identified as having an older pyyaml version or pip, we now manually update both pip and pyyaml as needed. This corrects the issue shown below. The fix impacts the following CLI commands: + - `amd-smi list` + - `amd-smi static` + - `amd-smi firmware` + - `amd-smi metric` + - `amd-smi topology` + +```shell +TypeError: dump_all() got an unexpected keyword argument 'sort_keys' +``` + +- **Fix for crash when user is not a member of video/render groups**. +AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user. + +## amd_smi_lib for ROCm 6.0.0 + +### Additions + +- **Integrated the E-SMI (EPYC-SMI) library**. +You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details. + +- **Added support for gfx942 metrics**. +You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance. + +- **Compute and memory partition support**. +Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration. + +### Optimizations + +- Updated to C++17, gtest-1.14, and cmake 3.14 + +### Changes + +- **GPU index sorting made consistent with other tools**. +To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number. +- **Topology output is now aligned with GPU BDF table**. 
+Earlier versions of the topology output were difficult to read since each GPU was displayed linearly. +Now the information is displayed as a table by each GPU's BDF, which more closely resembles rocm-smi output. + +### Fixes + +- **Fix for driver not initialized**. +If the driver module is not loaded, the user receives an error response indicating that the amdgpu module is not loaded. diff --git a/docs/reference/index.rst b/docs/reference/index.rst deleted file mode 100644 index 019a2bb523..0000000000 --- a/docs/reference/index.rst +++ /dev/null @@ -1,14 +0,0 @@ -.. meta:: - :description: Install AMD SMI - :keywords: install, SMI, AMD, ROCm - -****************** -API reference -****************** - -This section provides technical descriptions and important information about the different AMD SMI and library components. - -* {doc}`Library <../doxygen/docBin/html/files>` -* {doc}`Functions <../doxygen/docBin/html/globals>` -* {doc}`Data structures <../doxygen/docBin/html/annotated>` - diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index fe0899a0fd..6b5e1ab687 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -1,49 +1,51 @@ defaults: numbered: false root: index -subtrees: -- entries: - - file: what-is-AMDSMI.rst - title: What is AMD SMI? 
- +subtrees: - caption: Install entries: - - file: install/install.rst - title: AMD SMI installation - - + - file: install/install.md + title: Library and CLI tool installation + - file: install/build.md + title: Build from source + - caption: How to entries: - - file: how-to/using-amdsmi-for-C++.rst - title: Use AMD SMI for C++ library - - file: how-to/using-amdsmi-for-python.md - title: Use AMD SMI for Python library - - file: how-to/using-AMD-SMI-CLI-tool.md - title: Use AMD SMI CLI tool - -- caption: API reference + - file: how-to/amdsmi-cpp-lib.md + title: C++ library usage + - file: how-to/amdsmi-py-lib.md + title: Python library usage + - file: how-to/amdsmi-cli-tool.md + title: CLI tool usage + +- caption: Reference entries: + - file: reference/amdsmi-cpp-api.md + title: C++ API + entries: + - file: doxygen/docBin/html/modules + title: Modules - file: doxygen/docBin/html/files title: Files - file: doxygen/docBin/html/globals title: Globals - file: doxygen/docBin/html/annotated title: Data structures - - file: doxygen/docBin/html/modules - title: Modules - file: doxygen/docBin/html/functions_data_fields title: Data fields + - file: reference/amdsmi-py-api.md + title: Python API + - file: reference/changelog.md + title: Changelog - - - caption: Tutorials entries: - - url: https://github.com/ROCm/amdsmi/tree/amd-staging/example - title: AMD SMI GitHub samples - - url: https://github.com/ROCm/rocm_smi_lib/tree/amd-staging/docs - title: ROCm SMI lib GitHub samples - + - url: https://github.com/ROCm/amdsmi/tree/amd-staging/example + title: AMD SMI examples (GitHub) + - url: https://github.com/ROCm/rocm_smi_lib/tree/amd-staging/example + title: ROCm SMI lib examples (GitHub) + - caption: About entries: - - file: license.md + - file: license.md diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in index 221c930455..aa0042a51d 100644 --- a/docs/sphinx/requirements.in +++ b/docs/sphinx/requirements.in @@ -1 +1 @@ -rocm-docs-core[api_reference]==1.4.0 
+rocm-docs-core[api_reference]==1.8.2 diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt index fd2a5eee85..cd2e0a3d83 100644 --- a/docs/sphinx/requirements.txt +++ b/docs/sphinx/requirements.txt @@ -6,9 +6,9 @@ # accessible-pygments==0.0.5 # via pydata-sphinx-theme -alabaster==0.7.16 +alabaster==1.0.0 # via sphinx -babel==2.15.0 +babel==2.16.0 # via # pydata-sphinx-theme # sphinx @@ -16,9 +16,9 @@ beautifulsoup4==4.12.3 # via pydata-sphinx-theme breathe==4.35.0 # via rocm-docs-core -certifi==2024.6.2 +certifi==2024.8.30 # via requests -cffi==1.16.0 +cffi==1.17.1 # via # cryptography # pynacl @@ -31,7 +31,7 @@ click==8.1.7 # sphinx-external-toc click-log==0.4.0 # via doxysphinx -cryptography==42.0.8 +cryptography==43.0.1 # via pyjwt deprecated==1.2.14 # via pygithub @@ -41,15 +41,15 @@ docutils==0.21.2 # myst-parser # pydata-sphinx-theme # sphinx -doxysphinx==3.3.8 +doxysphinx==3.3.10 # via rocm-docs-core -fastjsonschema==2.19.1 +fastjsonschema==2.20.0 # via rocm-docs-core gitdb==4.0.11 # via gitpython gitpython==3.1.43 # via rocm-docs-core -idna==3.7 +idna==3.10 # via requests imagesize==1.4.1 # via sphinx @@ -67,13 +67,13 @@ markdown-it-py==3.0.0 # myst-parser markupsafe==2.1.5 # via jinja2 -mdit-py-plugins==0.4.1 +mdit-py-plugins==0.4.2 # via myst-parser mdurl==0.1.2 # via markdown-it-py mpire==2.10.2 # via doxysphinx -myst-parser==3.0.1 +myst-parser==4.0.0 # via rocm-docs-core numpy==1.26.4 # via doxysphinx @@ -83,11 +83,11 @@ packaging==24.1 # sphinx pycparser==2.22 # via cffi -pydata-sphinx-theme==0.15.3 +pydata-sphinx-theme==0.15.4 # via # rocm-docs-core # sphinx-book-theme -pygithub==2.3.0 +pygithub==2.4.0 # via rocm-docs-core pygments==2.18.0 # via @@ -97,13 +97,13 @@ pygments==2.18.0 # sphinx pyjson5==1.6.6 # via doxysphinx -pyjwt[crypto]==2.8.0 +pyjwt[crypto]==2.9.0 # via pygithub pynacl==1.5.0 # via pygithub -pyparsing==3.1.2 +pyparsing==3.1.4 # via doxysphinx -pyyaml==6.0.1 +pyyaml==6.0.2 # via # myst-parser # rocm-docs-core @@ 
-112,15 +112,15 @@ requests==2.32.3 # via # pygithub # sphinx -rocm-docs-core[api-reference]==1.4.0 +rocm-docs-core[api-reference]==1.8.2 # via -r requirements.in smmap==5.0.1 # via gitdb snowballstemmer==2.2.0 # via sphinx -soupsieve==2.5 +soupsieve==2.6 # via beautifulsoup4 -sphinx==7.3.7 +sphinx==8.0.2 # via # breathe # myst-parser @@ -135,33 +135,33 @@ sphinx-book-theme==1.1.3 # via rocm-docs-core sphinx-copybutton==0.5.2 # via rocm-docs-core -sphinx-design==0.6.0 +sphinx-design==0.6.1 # via rocm-docs-core sphinx-external-toc==1.0.1 # via rocm-docs-core -sphinx-notfound-page==1.0.2 +sphinx-notfound-page==1.0.4 # via rocm-docs-core -sphinxcontrib-applehelp==1.0.8 +sphinxcontrib-applehelp==2.0.0 # via sphinx -sphinxcontrib-devhelp==1.0.6 +sphinxcontrib-devhelp==2.0.0 # via sphinx -sphinxcontrib-htmlhelp==2.0.5 +sphinxcontrib-htmlhelp==2.1.0 # via sphinx sphinxcontrib-jsmath==1.0.1 # via sphinx -sphinxcontrib-qthelp==1.0.7 +sphinxcontrib-qthelp==2.0.0 # via sphinx -sphinxcontrib-serializinghtml==1.1.10 +sphinxcontrib-serializinghtml==2.0.0 # via sphinx -tomli==2.0.1 +tomli==2.0.2 # via sphinx -tqdm==4.66.4 +tqdm==4.66.5 # via mpire typing-extensions==4.12.2 # via # pydata-sphinx-theme # pygithub -urllib3==2.2.1 +urllib3==2.2.3 # via # pygithub # requests diff --git a/py-interface/README.md b/py-interface/README.md index 41fe962bab..c3c6b5ca83 100644 --- a/py-interface/README.md +++ b/py-interface/README.md @@ -1,4879 +1,23 @@ -# AMD SMI Python Library +# AMD SMI Python library -## Requirements +The AMD SMI Python interface offers an accessible way to interact +with AMD hardware through a user-friendly API. Find the documentation in the +`docs/` directory. 
-* Python 3.6+ 64-bit -* Driver must be loaded for amdsmi_init() to pass +- [Install AMD SMI](../docs/install/install.md) +- [About the library and how to get started](../docs/how-to/amdsmi-py-lib.md) +- [Python API reference](../docs/reference/amdsmi-py-api.md) -## Overview +## Online documentation -### Folder structure +Explore the latest documentation on the [ROCm documentation +portal](https://rocm.docs.amd.com/projects/en/latest/index.html). -File Name | Note ----|--- -`__init__.py` | Python package initialization file -`amdsmi_interface.py` | Amdsmi library python interface -`amdsmi_wrapper.py` | Python wrapper around amdsmi binary -`amdsmi_exception.py` | Amdsmi exceptions python file -`README.md` | Documentation +- [Install AMD + SMI](https://rocm.docs.amd.com/projects/en/latest/install/install.html) -### Usage +- [Python library + usage](https://rocm.docs.amd.com/projects/en/latest/how-to/amdsmi-py-lib.html). -`amdsmi` folder should be copied and placed next to importing script. It should be imported as: - -```python -from amdsmi import * - -try: - amdsmi_init() - - # amdsmi calls ... - -except AmdSmiException as e: - print(e) -finally: - try: - amdsmi_shut_down() - except AmdSmiException as e: - print(e) -``` - -To initialize amdsmi lib, amdsmi_init() must be called before all other calls to amdsmi lib. - -To close connection to driver, amdsmi_shut_down() must be the last call. - -### Exceptions - -All exceptions are in `amdsmi_exception.py` file. -Exceptions that can be thrown are: - -* `AmdSmiException`: base amdsmi exception class -* `AmdSmiLibraryException`: derives base `AmdSmiException` class and represents errors that can occur in amdsmi-lib. -When this exception is thrown, `err_code` and `err_info` are set. `err_code` is an integer that corresponds to errors that can occur -in amdsmi-lib and `err_info` is a string that explains the error that occurred. 
-Example: - -```python -try: - num_of_GPUs = len(amdsmi_get_processor_handles()) - if num_of_GPUs == 0: - print("No GPUs on machine") -except AmdSmiException as e: - print("Error code: {}".format(e.err_code)) - if e.err_code == amdsmi_wrapper.AMDSMI_STATUS_RETRY: - print("Error info: {}".format(e.err_info)) -``` - -* `AmdSmiRetryException` : Derives `AmdSmiLibraryException` class and signals device is busy and call should be retried. -* `AmdSmiTimeoutException` : Derives `AmdSmiLibraryException` class and represents that call had timed out. -* `AmdSmiParameterException`: Derives base `AmdSmiException` class and represents errors related to invaild parameters passed to functions. When this exception is thrown, err_msg is set and it explains what is the actual and expected type of the parameters. -* `AmdSmiBdfFormatException`: Derives base `AmdSmiException` class and represents invalid bdf format. - -## API - -### amdsmi_init - -Description: Initialize amdsmi with AmdSmiInitFlags - -Input parameters: AmdSmiInitFlags - -Output: `None` - -Exceptions that can be thrown by `amdsmi_init` function: - -* `AmdSmiLibraryException` - -Initialize GPUs only example: - -```python -try: - # by default we initalize with AmdSmiInitFlags.INIT_AMD_GPUS - ret = amdsmi_init() - # continue with amdsmi -except AmdSmiException as e: - print("Init GPUs failed") - print(e) -``` - -Initialize CPUs only example: - -```python -try: - ret = amdsmi_init(AmdSmiInitFlags.INIT_AMD_CPUS) - # continue with amdsmi -except AmdSmiException as e: - print("Init CPUs failed") - print(e) -``` - -Initialize both GPUs and CPUs example: - -```python -try: - ret = amdsmi_init(AmdSmiInitFlags.INIT_AMD_APUS) - # continue with amdsmi -except AmdSmiException as e: - print("Init both GPUs & CPUs failed") - print(e) -``` - -### amdsmi_shut_down - -Description: Finalize and close connection to driver - -Input parameters: `None` - -Output: `None` - -Exceptions that can be thrown by `amdsmi_shut_down` function: - -* 
`AmdSmiLibraryException` - -Example: - -```python -try: - amdsmi_init() - amdsmi_shut_down() -except AmdSmiException as e: - print("Shut down failed") - print(e) -``` - -### amdsmi_get_processor_type - -Description: Checks the type of device with provided handle. - -Input parameters: device handle as an instance of `amdsmi_processor_handle` - -Output: Integer, type of gpu - -Exceptions that can be thrown by `amdsmi_get_processor_type` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - type_of_GPU = amdsmi_get_processor_type(processor_handle) - if type_of_GPU == 1: - print("This is an AMD GPU") -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_processor_handles - -Description: Returns list of GPU device handle objects on current machine - -Input parameters: `None` - -Output: List of GPU device handle objects - -Exceptions that can be thrown by `amdsmi_get_processor_handles` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - print(amdsmi_get_gpu_device_uuid(device)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_socket_handles - -**Note: CURRENTLY HARDCODED TO RETURN DUMMY DATA** - -Description: Returns list of socket device handle objects on current machine - -Input parameters: `None` - -Output: List of socket device handle objects - -Exceptions that can be thrown by `amdsmi_get_socket_handles` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - sockets = amdsmi_get_socket_handles() - print('Socket numbers: {}'.format(len(sockets))) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_socket_info - -**Note: CURRENTLY HARDCODED TO RETURN EMPTY VALUES** - -Description: Return socket name - -Input parameters: -`socket_handle` socket handle - -Output: Socket name - -Exceptions that can be thrown by `amdsmi_get_socket_info` function: - -* 
`AmdSmiLibraryException` - -Example: - -```python -try: - socket_handles = amdsmi_get_socket_handles() - if len(socket_handles) == 0: - print("No sockets on machine") - else: - for socket in socket_handles: - print(amdsmi_get_socket_info(socket)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_processor_handle_from_bdf - -Description: Returns device handle from the given BDF - -Input parameters: bdf string in form of either `<domain>:<bus>:<device>.<function>` or `<bus>:<device>.<function>` in hexcode format. -Where: - -* `<domain>` is 4 hex digits long from 0000-FFFF interval -* `<bus>` is 2 hex digits long from 00-FF interval -* `<device>` is 2 hex digits long from 00-1F interval -* `<function>` is 1 hex digit long from 0-7 interval - -Output: device handle object - -Exceptions that can be thrown by `amdsmi_get_processor_handle_from_bdf` function: - -* `AmdSmiLibraryException` -* `AmdSmiBdfFormatException` - -Example: - -```python -try: - device = amdsmi_get_processor_handle_from_bdf("0000:23:00.0") - print(amdsmi_get_gpu_device_uuid(device)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_device_bdf - -Description: Returns BDF of the given device - -Input parameters: - -* `processor_handle` dev for which to query - -Output: BDF string in form of `<domain>:<bus>:<device>.<function>` in hexcode format. 
-Where: - -* `<domain>` is 4 hex digits long from 0000-FFFF interval -* `<bus>` is 2 hex digits long from 00-FF interval -* `<device>` is 2 hex digits long from 00-1F interval -* `<function>` is 1 hex digit long from 0-7 interval - -Exceptions that can be thrown by `amdsmi_get_gpu_device_bdf` function: - -* `AmdSmiParameterException` -* `AmdSmiLibraryException` - -Example: - -```python -try: - device = amdsmi_get_processor_handles()[0] - print("Device's bdf:", amdsmi_get_gpu_device_bdf(device)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_device_uuid - -Description: Returns the UUID of the device - -Input parameters: - -* `processor_handle` dev for which to query - -Output: UUID string unique to the device - -Exceptions that can be thrown by `amdsmi_get_gpu_device_uuid` function: - -* `AmdSmiParameterException` -* `AmdSmiLibraryException` - -Example: - -```python -try: - device = amdsmi_get_processor_handles()[0] - print("Device UUID: ", amdsmi_get_gpu_device_uuid(device)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_driver_info - -Description: Returns the info of the driver - -Input parameters: - -* `processor_handle` dev for which to query - -Output: Dictionary with fields - -Field | Content ----|--- -`driver_name` | driver name -`driver_version` | driver_version -`driver_date` | driver_date - -Exceptions that can be thrown by `amdsmi_get_gpu_driver_info` function: - -* `AmdSmiParameterException` -* `AmdSmiLibraryException` - -Example: - -```python -try: - device = amdsmi_get_processor_handles()[0] - print("Driver info: ", amdsmi_get_gpu_driver_info(device)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_asic_info - -Description: Returns asic information for the given GPU - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Content ----|--- -`market_name` | market name -`vendor_id` | vendor id -`vendor_name` | vendor name -`device_id` | device id -`rev_id` | revision id 
-`asic_serial` | asic serial -`oam_id` | oam id -`num_of_compute_units` | number of compute units on asic -`target_graphics_version` | hardware graphics version - -Exceptions that can be thrown by `amdsmi_get_gpu_asic_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - asic_info = amdsmi_get_gpu_asic_info(device) - print(asic_info) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_kfd_info - -Description: Returns KFD(kernel fusion driver) information for the given GPU -This correlates to GUID in rocm-smi - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Content ----|--- -`kfd_id` | KFD's unique GPU identifier -`node_id` | KFD's internal GPU index - -Exceptions that can be thrown by `amdsmi_get_gpu_kfd_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - kfd_info = amdsmi_get_gpu_kfd_info(device) - print(kfd_info) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_power_cap_info - -Description: Returns dictionary of power capabilities as currently configured -on the given GPU. 
It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description | Units ----|---|--- -`power_cap` | power capability | uW -`dpm_cap` | dynamic power management capability | MHz -`default_power_cap` | default power capability | uW -`min_power_cap` | min power capability | uW -`max_power_cap` | max power capability | uW - -Exceptions that can be thrown by `amdsmi_get_power_cap_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - power_info = amdsmi_get_power_cap_info(device) - print(power_info['power_cap']) - print(power_info['dpm_cap']) - print(power_info['default_power_cap']) - print(power_info['min_power_cap']) - print(power_info['max_power_cap']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_vram_info - -Description: Returns dictionary of vram information for the given GPU. - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`vram_type` | vram type -`vram_vendor` | vram vendor -`vram_size` | vram size in mb - -Exceptions that can be thrown by `amdsmi_get_gpu_vram_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vram_info = amdsmi_get_gpu_vram_info(device) - print(vram_info['vram_type']) - print(vram_info['vram_vendor']) - print(vram_info['vram_size']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_cache_info - -Description: Returns a list of dictionaries containing cache information for the given GPU. 
- -Input parameters: - -* `processor_handle` device which to query - -Output: List of Dictionaries containing cache information following the schema below: -Schema: - -```JSON -{ - cache_properties: - { - "type" : "array", - "items" : {"type" : "string"} - }, - cache_size: {"type" : "number"}, - cache_level: {"type" : "number"}, - max_num_cu_shared: {"type" : "number"}, - num_cache_instance: {"type" : "number"} -} -``` - -Field | Description ----|--- -`cache_properties` | list of up to 4 cache property type strings. Ex. data ("DATA_CACHE"), instruction ("INST_CACHE"), CPU ("CPU_CACHE"), or SIMD ("SIMD_CACHE"). -`cache_size` | size of cache in KB -`cache_level` | level of cache -`max_num_cu_shared` | max number of compute units shared -`num_cache_instance` | number of cache instances - -Exceptions that can be thrown by `amdsmi_get_gpu_cache_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - cache_info = amdsmi_get_gpu_cache_info(device) - for cache_index, cache_values in cache_info.items(): - print(cache_values['cache_properties']) - print(cache_values['cache_size']) - print(cache_values['cache_level']) - print(cache_values['max_num_cu_shared']) - print(cache_values['num_cache_instance']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_vbios_info - -Description: Returns the static information for the VBIOS on the device. 
- -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`name` | vbios name -`build_date` | vbios build date -`part_number` | vbios part number -`version` | vbios version string - -Exceptions that can be thrown by `amdsmi_get_gpu_vbios_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vbios_info = amdsmi_get_gpu_vbios_info(device) - print(vbios_info['name']) - print(vbios_info['build_date']) - print(vbios_info['part_number']) - print(vbios_info['version']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_fw_info - -Description: Returns GPU firmware related information. - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`fw_list` | List of dictionaries that contain information about a certain firmware block - -Exceptions that can be thrown by `amdsmi_get_fw_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - firmware_list = amdsmi_get_fw_info(device)['fw_list'] - for firmware_block in firmware_list: - print(firmware_block['fw_name']) - # String formated hex or decimal value ie: 21.00.00.AC or 130 - print(firmware_block['fw_version']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_activity - -Description: Returns the engine usage for the given GPU. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary of activites to their respective usage percentage or 'N/A' if not supported - -Field | Description ----|--- -`gfx_activity` | graphics engine usage percentage (0 - 100) -`umc_activity` | memory engine usage percentage (0 - 100) -`mm_activity` | average multimedia engine usages in percentage (0 - 100) - -Exceptions that can be thrown by `amdsmi_get_gpu_activity` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - engine_usage = amdsmi_get_gpu_activity(device) - print(engine_usage['gfx_activity']) - print(engine_usage['umc_activity']) - print(engine_usage['mm_activity']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_power_info - -Description: Returns the current power and voltage for the given GPU. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`current_socket_power` | current socket power -`average_socket_power` | average socket power -`gfx_voltage` | voltage gfx -`soc_voltage` | voltage soc -`mem_voltage` | voltage mem -`power_limit` | power limit - -Exceptions that can be thrown by `amdsmi_get_power_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - power_measure = amdsmi_get_power_info(device) - print(power_measure['current_socket_power']) - print(power_measure['average_socket_power']) - print(power_measure['gfx_voltage']) - print(power_measure['soc_voltage']) - print(power_measure['mem_voltage']) - print(power_measure['power_limit']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_vram_usage - -Description: Returns total VRAM and VRAM in use - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`vram_total` | VRAM total -`vram_used` | VRAM currently in use - -Exceptions that can be thrown by `amdsmi_get_gpu_vram_usage` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vram_usage = amdsmi_get_gpu_vram_usage(device) - print(vram_usage['vram_used']) - print(vram_usage['vram_total']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_clock_info - -Description: Returns the clock measure for the given GPU. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query -* `clock_type` one of `AmdSmiClkType` enum values: - -Field | Description ----|--- -`SYS` | SYS clock type -`GFX` | GFX clock type -`DF` | DF clock type -`DCEF` | DCEF clock type -`SOC` | SOC clock type -`MEM` | MEM clock type -`PCIE` | PCIE clock type -`VCLK0` | VCLK0 clock type -`VCLK1` | VCLK1 clock type -`DCLK0` | DCLK0 clock type -`DCLK1` | DCLK1 clock type - -Output: Dictionary with fields - -Field | Description ----|--- -`clk` | Current clock for given clock type -`min_clk` | Minimum clock for given clock type -`max_clk` | Maximum clock for given clock type -`clk_locked` | flag only supported on GFX clock domain -`clk_deep_sleep` | clock deep sleep mode flag - -Exceptions that can be thrown by `amdsmi_get_clock_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - clock_measure = amdsmi_get_clock_info(device, AmdSmiClkType.GFX) - print(clock_measure['clk']) - print(clock_measure['min_clk']) - print(clock_measure['max_clk']) - print(clock_measure['clk_locked']) - print(clock_measure['clk_deep_sleep']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_pcie_info - -Description: Returns the pcie metric and static information for the given GPU. For accurate PCIe Bandwidth measurements it is recommended to use this function once per 1000ms -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with 2 fields `pcie_static` and `pcie_metric` - -Fields | Description ----|--- -`pcie_static` |
Subfield | Description
`max_pcie_width` | Maximum number of pcie lanes available
`max_pcie_speed` | Maximum capable pcie speed in GT/s
`pcie_interface_version` | PCIe generation ie. 3,4,5...
`slot_type` | The type of form factor of the slot: OAM, PCIE, CEM, or Unknown
-`pcie_metric` |
Subfield | Description
`pcie_width` | Current number of pcie lanes available
`pcie_speed` | Current pcie speed capable in GT/s
`pcie_bandwidth` | Current instantaneous bandwidth usage in Mb/s
`pcie_replay_count` | Total number of PCIe replays (NAKs)
`pcie_l0_to_recovery_count` | PCIE L0 to recovery state transition accumulated count
`pcie_replay_roll_over_count` | PCIe Replay accumulated count
`pcie_nak_sent_count` | PCIe NAK sent accumulated count
`pcie_nak_received_count` | PCIe NAK received accumulated count
- -Exceptions that can be thrown by `amdsmi_get_pcie_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - pcie_info = amdsmi_get_pcie_info(device) - print(pcie_info["pcie_static"]) - print(pcie_info["pcie_metric"]) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_bad_page_info - -Description: Returns bad page info for the given GPU. -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: List consisting of dictionaries with fields for each bad page found; can be an empty list - -Field | Description ----|--- -`value` | Value of page -`page_address` | Address of bad page -`page_size` | Size of bad page -`status` | Status of bad page - -Exceptions that can be thrown by `amdsmi_get_gpu_bad_page_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - bad_page_info = amdsmi_get_gpu_bad_page_info(device) - if not bad_page_info: # Can be empty list - print("No bad pages found") - continue - for bad_page in bad_page_info: - print(bad_page["value"]) - print(bad_page["page_address"]) - print(bad_page["page_size"]) - print(bad_page["status"]) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_memory_reserved_pages - -Description: Returns reserved memory page info for the given GPU. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: List consisting of dictionaries with fields for each reserved memory page found; can be an empty list - -Field | Description ----|--- -`value` | Value of memory reserved page -`page_address` | Address of memory reserved page -`page_size` | Size of memory reserved page -`status` | Status of memory reserved page - -Exceptions that can be thrown by `amdsmi_get_gpu_memory_reserved_pages` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - reserved_memory_page_info = amdsmi_get_gpu_memory_reserved_pages(device) - if not reserved_memory_page_info: # Can be empty list - print("No memory reserved pages found") - continue - for reserved_memory_page in reserved_memory_page_info: - print(reserved_memory_page["value"]) - print(reserved_memory_page["page_address"]) - print(reserved_memory_page["page_size"]) - print(reserved_memory_page["status"]) -except AmdSmiException as e: - print(e) -``` - - -### amdsmi_get_gpu_process_list - -Description: Returns the list of processes running on the target GPU; Requires root level access to display root process names; otherwise will return "N/A" - -Input parameters: - -* `processor_handle` device which to query - -Output: List of Dictionaries with the corresponding fields; empty list if no running process are detected - -Field | Description ----|--- -`name` | Name of process. If user does not have permission this will be "N/A" -`pid` | Process ID -`mem` | Process memory usage -`engine_usage` |
Subfield | Description
`gfx` | GFX engine usage in ns
`enc` | Encode engine usage in ns
-`memory_usage` |
Subfield | Description
`gtt_mem` | GTT memory usage
`cpu_mem` | CPU memory usage
`vram_mem` | VRAM memory usage
- -Exceptions that can be thrown by `amdsmi_get_gpu_process_list` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - processes = amdsmi_get_gpu_process_list(device) - if len(processes) == 0: - print("No processes running on this GPU") - else: - for process in processes: - print(process) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_total_ecc_count - -Description: Returns the ECC error count for the given GPU. -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Description ----|--- -`correctable_count` | Correctable ECC error count -`uncorrectable_count` | Uncorrectable ECC error count -`deferred_count` | Deferred ECC error count - -Exceptions that can be thrown by `amdsmi_get_gpu_total_ecc_count` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - ecc_error_count = amdsmi_get_gpu_total_ecc_count(device) - print(ecc_error_count["correctable_count"]) - print(ecc_error_count["uncorrectable_count"]) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_board_info - -Description: Returns board info for the given GPU - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields correctable and uncorrectable - -Field | Description ----|--- -`model_number` | Board serial number -`product_serial` | Product serial -`fru_id` | FRU ID -`product_name` | Product name -`manufacturer_name` | Manufacturer name - -Exceptions that can be thrown by `amdsmi_get_gpu_board_info` function: - -* 
`AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - device = amdsmi_get_processor_handle_from_bdf("0000:23.00.0") - board_info = amdsmi_get_gpu_board_info(device) - print(board_info["model_number"]) - print(board_info["product_serial"]) - print(board_info["fru_id"]) - print(board_info["product_name"]) - print(board_info["manufacturer_name"]) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_ras_feature_info - -Description: Returns RAS version and schema information -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: List containing dictionaries with fields - -Field | Description ----|--- -`eeprom_version` | eeprom version -`parity_schema` | parity schema -`single_bit_schema` | single bit schema -`double_bit_schema` | double bit schema -`poison_schema` | poison schema - -Exceptions that can be thrown by `amdsmi_get_gpu_ras_feature_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - ras_info = amdsmi_get_gpu_ras_feature_info(device) - print(ras_info) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_ras_block_features_enabled - -Description: Returns status of each RAS block for the given GPU. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: List containing dictionaries with fields for each RAS block - -Field | Description ----|--- -`block` | RAS block -`status` | RAS block status - -Exceptions that can be thrown by `amdsmi_get_gpu_ras_block_features_enabled` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - ras_block_features = amdsmi_get_gpu_ras_block_features_enabled(device) - print(ras_block_features) -except AmdSmiException as e: - print(e) -``` - -### AmdSmiEventReader class - -Description: Providing methods for event monitoring. This is context manager class. -Can be used with `with` statement for automatic cleanup. - -Methods: - -#### Constructor - -Description: Allocates a new event reader notifier to monitor different types of events for the given GPU - -Input parameters: - -* `processor_handle` device handle corresponding to the device on which to listen for events -* `event_types` list of event types from AmdSmiEvtNotificationType enum. Specifying which events to collect for the given device. - -Event Type | Description ----|------ -`VMFAULT` | VM page fault -`THERMAL_THROTTLE` | thermal throttle -`GPU_PRE_RESET` | gpu pre reset -`GPU_POST_RESET` | gpu post reset -`RING_HANG` | ring hang event - -#### read - -Description: Reads events on the given device. When event is caught, device handle, message and event type are returned. Reading events stops when timestamp passes without event reading. - -Input parameters: - -* `timestamp` number of milliseconds to wait for an event to occur. If event does not happen monitoring is finished -* `num_elem` number of events. This is optional parameter. Default value is 10. 
- -#### stop - -Description: Any resources used by event notification for the the given device will be freed with this function. This can be used explicitly or -automatically using `with` statement, like in the examples below. This should be called either manually or automatically for every created AmdSmiEventReader object. - -Input parameters: `None` - -Example with manual cleanup of AmdSmiEventReader: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - event = AmdSmiEventReader(device[0], AmdSmiEvtNotificationType.GPU_PRE_RESET, AmdSmiEvtNotificationType.GPU_POST_RESET) - event.read(10000) -except AmdSmiException as e: - print(e) -finally: - event.stop() -``` - -Example with automatic cleanup using `with` statement: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - with AmdSmiEventReader(device[0], AmdSmiEvtNotificationType.GPU_PRE_RESET, AmdSmiEvtNotificationType.GPU_POST_RESET) as event: - event.read(10000) -except AmdSmiException as e: - print(e) - -``` - -### amdsmi_set_gpu_pci_bandwidth - -Description: Control the set of allowed PCIe bandwidths that can be used -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `bw_bitmask` A bitmask indicating the indices of the bandwidths that are -to be enabled (1) and disabled (0) - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_pci_bandwidth` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_pci_bandwidth(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_power_cap - -Description: Set the power cap value. 
It is not supported on virtual machine -guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_ind` a 0-based sensor index. Normally, this will be 0. If a -device has more than one sensor, it could be greater than 0 -* `cap` int that indicates the desired power cap, in microwatts - -Output: None - -Exceptions that can be thrown by `amdsmi_set_power_cap` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - power_cap = 250 * 1000000 - amdsmi_set_power_cap(device, 0, power_cap) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_power_profile - -Description: Set the power profile. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `reserved` Not currently used, set to 0 -* `profile` a amdsmi_power_profile_preset_masks_t that hold the mask of -the desired new power profile - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_power_profile` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - profile = ... - amdsmi_set_gpu_power_profile(device, 0, profile) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_clk_range - -Description: This function sets the clock range information. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `min_clk_value` minimum clock value for desired clock range -* `max_clk_value` maximum clock value for desired clock range -* `clk_type`AMDSMI_CLK_TYPE_SYS | AMDSMI_CLK_TYPE_MEM range type - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_clk_range` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_clk_range(device, 0, 1000, AmdSmiClkType.AMDSMI_CLK_TYPE_SYS) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_bdf_id - -Description: Get the unique PCI device identifier associated for a device - -Input parameters: - -* `processor_handle` device which to query - -Output: device bdf -The format of bdfid will be as follows: - -BDFID = ((DOMAIN & 0xffffffff) << 32) | ((BUS & 0xff) << 8) | - ((DEVICE & 0x1f) <<3 ) | (FUNCTION & 0x7) - -| Name | Field | ----------- | ------- | -| Domain | [64:32] | -| Reserved | [31:16] | -| Bus | [15: 8] | -| Device | [ 7: 3] | -| Function | [ 2: 0] | - -Exceptions that can be thrown by `amdsmi_get_gpu_bdf_id` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - bdfid = amdsmi_get_gpu_bdf_id(device) - print(bdfid) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_pci_bandwidth - -Description: Get the list of possible PCIe bandwidths that are available. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with the possible T/s values and associated number of lanes - -Field | Content ----|--- -`transfer_rate` | transfer_rate dictionary -`lanes` | lanes - -transfer_rate dictionary - -Field | Content ----|--- -`num_supported` | num_supported -`current` | current -`frequency` | list of frequency - -Exceptions that can be thrown by `amdsmi_get_gpu_pci_bandwidth` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - bandwidth = amdsmi_get_gpu_pci_bandwidth(device) - print(bandwidth) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_pci_throughput - -Description: Get PCIe traffic information. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with the fields - -Field | Content ----|--- -`sent` | number of bytes sent in 1 second -`received` | the number of bytes received -`max_pkt_sz` | maximum packet size - -Exceptions that can be thrown by `amdsmi_get_gpu_pci_throughput` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - pci = amdsmi_get_gpu_pci_throughput(device) - print(pci) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_pci_replay_counter - -Description: Get PCIe replay counter - -Input parameters: - -* `processor_handle` device which to query - -Output: counter value -The sum of the NAK's received and generated by the GPU - -Exceptions that can be thrown by `amdsmi_get_gpu_pci_replay_counter` function: - -* 
`AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - counter = amdsmi_get_gpu_pci_replay_counter(device) - print(counter) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_topo_numa_affinity - -Description: Get the NUMA node associated with a device - -Input parameters: - -* `processor_handle` device which to query - -Output: NUMA node value - -Exceptions that can be thrown by `amdsmi_get_gpu_topo_numa_affinity` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - numa_node = amdsmi_get_gpu_topo_numa_affinity(device) - print(numa_node) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_energy_count - -Description: Get the energy accumulator counter information of the device. 
-energy_accumulator * counter_resolution = total_energy_consumption in micro-Joules -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query - -Output: Dictionary with fields - -Field | Content ----|--- -`power` | counter for energy accumulation since last restart/gpu rest (Deprecating in 6.4) -`energy_accumulator` | counter for energy accumulation since last restart/gpu rest -`counter_resolution` | counter resolution -`timestamp` | timestamp - -Exceptions that can be thrown by `amdsmi_get_energy_count` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - energy_dict = amdsmi_get_energy_count(device) - print(energy_dict) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_memory_total - -Description: Get the total amount of memory that exists - -Input parameters: - -* `processor_handle` device which to query -* `mem_type` enum AmdSmiMemoryType - -Output: total amount of memory - -Exceptions that can be thrown by `amdsmi_get_gpu_memory_total` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vram_memory_total = amdsmi_get_gpu_memory_total(device, amdsmi_interface.AmdSmiMemoryType.VRAM) - print(vram_memory_total) - vis_vram_memory_total = amdsmi_get_gpu_memory_total(device, amdsmi_interface.AmdSmiMemoryType.VIS_VRAM) - print(vis_vram_memory_total) - gtt_memory_total = amdsmi_get_gpu_memory_total(device, amdsmi_interface.AmdSmiMemoryType.GTT) - print(gtt_memory_total) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_od_clk_info - -Description: This function sets the clock 
frequency information. -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `level` AMDSMI_FREQ_IND_MIN|AMDSMI_FREQ_IND_MAX to set the minimum (0) -or maximum (1) speed -* `clk_value` value to apply to the clock range -* `clk_type` AMDSMI_CLK_TYPE_SYS | AMDSMI_CLK_TYPE_MEM range type - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_od_clk_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_od_clk_info( - device, - AmdSmiFreqInd.AMDSMI_FREQ_IND_MAX, - 1000, - AmdSmiClkType.AMDSMI_CLK_TYPE_SYS - ) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_memory_usage - -Description: Get the current memory usage - -Input parameters: - -* `processor_handle` device which to query -* `mem_type` enum AmdSmiMemoryType - -Output: the amount of memory currently being used - -Exceptions that can be thrown by `amdsmi_get_gpu_memory_usage` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vram_memory_usage = amdsmi_get_gpu_memory_usage(device, amdsmi_interface.AmdSmiMemoryType.VRAM) - print(vram_memory_usage) - vis_vram_memory_usage = amdsmi_get_gpu_memory_usage(device, amdsmi_interface.AmdSmiMemoryType.VIS_VRAM) - print(vis_vram_memory_usage) - gtt_memory_usage = amdsmi_get_gpu_memory_usage(device, amdsmi_interface.AmdSmiMemoryType.GTT) - print(gtt_memory_usage) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_od_volt_info - -Description: This function sets 1 of the 3 voltage curve points. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `vpoint` voltage point [0|1|2] on the voltage curve -* `clk_value` clock value component of voltage curve point -* `volt_value` voltage value component of voltage curve point - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_od_volt_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_od_volt_info(device, 1, 1000, 980) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_fan_rpms - -Description: Get the fan speed in RPMs of the device with the specified device -handle and 0-based sensor index. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` a 0-based sensor index. Normally, this will be 0. If a device has -more than one sensor, it could be greater than 0. - -Output: Fan speed in rpms as integer - -Exceptions that can be thrown by `amdsmi_get_gpu_fan_rpms` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - fan_rpm = amdsmi_get_gpu_fan_rpms(device, 0) - print(fan_rpm) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_fan_speed - -Description: Get the fan speed for the specified device as a value relative to -AMDSMI_MAX_FAN_SPEED. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` a 0-based sensor index. Normally, this will be 0. If a device has -more than one sensor, it could be greater than 0. 
- -Output: Fan speed in relative to MAX - -Exceptions that can be thrown by `amdsmi_get_gpu_fan_speed` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - fan_speed = amdsmi_get_gpu_fan_speed(device, 0) - print(fan_speed) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_fan_speed_max - -Description: Get the max fan speed of the device with provided device handle. -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` a 0-based sensor index. Normally, this will be 0. If a device has -more than one sensor, it could be greater than 0. - -Output: Max fan speed as integer - -Exceptions that can be thrown by `amdsmi_get_gpu_fan_speed_max` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - max_fan_speed = amdsmi_get_gpu_fan_speed_max(device, 0) - print(max_fan_speed) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_is_gpu_power_management_enabled - -Description: Returns is power management enabled - -Input parameters: - -* `processor_handle` GPU device which to query - -Output: Bool true if power management enabled else false - -Exceptions that can be thrown by `amdsmi_is_gpu_power_management_enabled` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for processor in devices: - is_power_management_enabled = amdsmi_is_gpu_power_management_enabled(processor) - 
print(is_power_management_enabled) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_temp_metric - -Description: Get the temperature metric value for the specified metric, from the -specified temperature sensor on the specified device. It is not supported on virtual -machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_type` part of device from which temperature should be obtained -* `metric` enum indicated which temperature value should be retrieved - -Output: Temperature as integer in millidegrees Celcius - -Exceptions that can be thrown by `amdsmi_get_temp_metric` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - temp_metric = amdsmi_get_temp_metric(device, AmdSmiTemperatureType.EDGE, - AmdSmiTemperatureMetric.CURRENT) - print(temp_metric) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_volt_metric - -Description: Get the voltage metric value for the specified metric, from the -specified voltage sensor on the specified device. 
It is not supported on virtual -machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_type` part of device from which voltage should be obtained -* `metric` enum indicated which voltage value should be retrieved - -Output: Voltage as integer in millivolts - -Exceptions that can be thrown by `amdsmi_get_gpu_volt_metric` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - voltage = amdsmi_get_gpu_volt_metric(device, AmdSmiVoltageType.VDDGFX, - AmdSmiVoltageMetric.AVERAGE) - print(voltage) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_utilization_count - -Description: Get coarse/fine grain utilization counter of the specified device - -Input parameters: - -* `processor_handle` handle for the given device -* `counter_types` List of AmdSmiUtilizationCounterType counters requested - -Output: List containing dictionaries with fields - -Field | Description ----|--- -`timestamp` | The timestamp when the counter is retreived - Resolution: 1 ns -`Dictionary for each counter` |
Subfield | Description
`type` | Counter that was requested
`value` | Value gotten for utilization counter
- -Exceptions that can be thrown by `amdsmi_get_utilization_count` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - utilization = amdsmi_get_utilization_count( - device, - AmdSmiUtilizationCounterType.COARSE_GRAIN_GFX_ACTIVITY - ) - print(utilization) - utilization = amdsmi_get_utilization_count( - device, - [AmdSmiUtilizationCounterType.COARSE_GRAIN_GFX_ACTIVITY, - AmdSmiUtilizationCounterType.COARSE_GRAIN_MEM_ACTIVITY, - AmdSmiUtilizationCounterType.COARSE_DECODER_ACTIVITY, - AmdSmiUtilizationCounterType.FINE_GRAIN_GFX_ACTIVITY, - AmdSmiUtilizationCounterType.FINE_GRAIN_MEM_ACTIVITY, - AmdSmiUtilizationCounterType.FINE_DECODER_ACTIVITY] - ) - print(utilization) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_perf_level - -Description: Get the performance level of the device with provided device handle. -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Performance level as enum value of dev_perf_level_t - -Exceptions that can be thrown by `amdsmi_get_gpu_perf_level` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - perf_level = amdsmi_get_gpu_perf_level(dev) - print(perf_level) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_perf_determinism_mode - -Description: Enter performance determinism mode with provided device handle. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `clkvalue` softmax value for GFXCLK in MHz - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_perf_determinism_mode` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_perf_determinism_mode(device, 1333) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_process_isolation - -Description: Get the status of the Process Isolation - -Input parameters: - -* `processor_handle` handle for the given device - -Output: integer corresponding to isolation_status; 0 - disabled, 1 - enabled - -Exceptions that can be thrown by `amdsmi_get_gpu_process_isolation` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - isolate = amdsmi_get_gpu_process_isolation(device) - print("Process Isolation Status: ", isolate) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_process_isolation - -Description: Enable/disable the system Process Isolation for the given device handle. - -Input parameters: - -* `processor_handle` handle for the given device -* `pisolate` the process isolation status to set. 0 is the process isolation disabled, and 1 is the process isolation enabled. 
- -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_process_isolation` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_process_isolation(device, 1) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_clean_gpu_local_data - -Description: Clear the SRAM data of the given device. This can be called between user logins to prevent information leak. - -Input parameters: - -* `processor_handle` handle for the given device - -Output: None - -Exceptions that can be thrown by `amdsmi_clean_gpu_local_data` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_clean_gpu_local_data(device) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_overdrive_level - -Description: Get the overdrive percent associated with the device with provided -device handle. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Overdrive percentage as integer - -Exceptions that can be thrown by `amdsmi_get_gpu_overdrive_level` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - od_level = amdsmi_get_gpu_overdrive_level(dev) - print(od_level) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_mem_overdrive_level - -Description: Get the GPU memory clock overdrive percent associated with the device with provided -device handle. 
It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Overdrive percentage as integer - -Exceptions that can be thrown by `amdsmi_get_gpu_mem_overdrive_level` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - od_level = amdsmi_get_gpu_mem_overdrive_level(dev) - print(od_level) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_clk_freq - -Description: Get the list of possible system clock speeds of device for a -specified clock type. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `clk_type` the type of clock for which the frequency is desired - -Output: Dictionary with fields - -Field | Description ----|--- -`num_supported` | The number of supported frequencies -`current` | The current frequency index -`frequency` | List of frequencies, only the first num_supported frequencies are valid - -Exceptions that can be thrown by `amdsmi_get_clk_freq` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_get_clk_freq(device, AmdSmiClkType.SYS) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_od_volt_info - -Description: This function retrieves the voltage/frequency curve information. -If the num_regions is 0 then the voltage curve is not supported. -It is not supported on virtual machine guest. - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Dictionary with fields - -Field | Description ----|--- -`curr_sclk_range` |
Subfield | Description
`lower_bound` | lower bound sclk range
`upper_bound` | upper bound sclk range
-`curr_mclk_range` |
Subfield | Description
`lower_bound` | lower bound mclk range
`upper_bound` | upper bound mclk range
-`sclk_freq_limits` |
Subfield | Description
`lower_bound` | lower bound sclk range limit
`upper_bound` | upper bound sclk range limit
-`mclk_freq_limits` |
Subfield | Description
`lower_bound` | lower bound mclk range limit
`upper_bound` | upper bound mclk range limit
-`curve.vc_points` | List of voltage curve points -`num_regions` | The number of voltage curve regions - -Exceptions that can be thrown by `amdsmi_get_gpu_od_volt_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_get_gpu_od_volt_info(dev) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_metrics_info - -Description: This function retrieves the gpu metrics information. It is not -supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Dictionary with fields - -| Field | Description |Unit| -|-------|-------------|----| -`temperature_edge` | Edge temperature value | Celsius (C) -`temperature_hotspot` | Hotspot (aka junction) temperature value | Celsius (C) -`temperature_mem` | Memory temperature value | Celsius (C) -`temperature_vrgfx` | vrgfx temperature value | Celsius (C) -`temperature_vrsoc` | vrsoc temperature value | Celsius (C) -`temperature_vrmem` | vrmem temperature value | Celsius (C) -`average_gfx_activity` | Average gfx activity | % -`average_umc_activity` | Average umc (Universal Memory Controller) activity | % -`average_mm_activity` | Average mm (multimedia) engine activity | % -`average_socket_power` | Average socket power | W -`energy_accumulator` | Energy accumulated with a 15.3 uJ resolution over 1ns | uJ -`system_clock_counter` | System clock counter | ns -`average_gfxclk_frequency` | Average gfx clock frequency | MHz -`average_socclk_frequency` | Average soc clock frequency | MHz -`average_uclk_frequency` | Average uclk frequency | MHz -`average_vclk0_frequency` | Average vclk0 frequency | MHz -`average_dclk0_frequency` | Average dclk0 frequency | MHz -`average_vclk1_frequency` | Average vclk1 frequency | MHz -`average_dclk1_frequency` | Average dclk1 
frequency | MHz -`current_gfxclk` | Current gfx clock | MHz -`current_socclk` | Current soc clock | MHz -`current_uclk` | Current uclk | MHz -`current_vclk0` | Current vclk0 | MHz -`current_dclk0` | Current dclk0 | MHz -`current_vclk1` | Current vclk1 | MHz -`current_dclk1` | Current dclk1 | MHz -`throttle_status` | Current throttle status | bool -`current_fan_speed` | Current fan speed | RPM -`pcie_link_width` | PCIe link width (number of lanes) | lanes -`pcie_link_speed` | PCIe link speed in 0.1 GT/s (Giga Transfers per second) | GT/s -`padding` | padding -`gfx_activity_acc` | gfx activity accumulated | % -`mem_activity_acc` | Memory activity accumulated | % -`temperature_hbm` | list of hbm temperatures | Celsius (C) -`firmware_timestamp` | timestamp from PMFW (10ns resolution) | ns -`voltage_soc` | soc voltage | mV -`voltage_gfx` | gfx voltage | mV -`voltage_mem` | mem voltage | mV -`indep_throttle_status` | ASIC independent throttle status (see drivers/gpu/drm/amd/pm/swsmu/inc/amdgpu_smu.h for bit flags) | -`current_socket_power` | Current socket power (also known as instant socket power) | W -`vcn_activity` | List of VCN encode/decode engine utilization per AID | % -`gfxclk_lock_status` | Clock lock status. Bits 0:7 correspond to each gfx clock engine instance. 
Bits 0:5 for APU/AID devices | -`xgmi_link_width` | XGMI bus width | lanes -`xgmi_link_speed` | XGMI bitrate | GB/s -`pcie_bandwidth_acc` | PCIe accumulated bandwidth | GB/s -`pcie_bandwidth_inst` | PCIe instantaneous bandwidth | GB/s -`pcie_l0_to_recov_count_acc` | PCIe L0 to recovery state transition accumulated count | -`pcie_replay_count_acc` | PCIe replay accumulated count | -`pcie_replay_rover_count_acc` | PCIe replay rollover accumulated count | -`xgmi_read_data_acc` | XGMI accumulated read data transfer size (KiloBytes) | KB -`xgmi_write_data_acc` | XGMI accumulated write data transfer size (KiloBytes) | KB -`current_gfxclks` | List of current gfx clock frequencies | MHz -`current_socclks` | List of current soc clock frequencies | MHz -`current_vclk0s` | List of current v0 clock frequencies | MHz -`current_dclk0s` | List of current d0 clock frequencies | MHz -`pcie_nak_sent_count_acc` | PCIe NAC sent count accumulated | -`pcie_nak_rcvd_count_acc` | PCIe NAC received count accumulated | -`jpeg_activity` | List of JPEG engine activity | % - -Exceptions that can be thrown by `amdsmi_get_gpu_metrics_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_get_gpu_metrics_info(dev) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_od_volt_curve_regions - -Description: This function will retrieve the current valid regions in the -frequency/voltage space. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `num_regions` number of freq volt regions - -Output: List containing a dictionary with fields for each freq volt region - -Field | Description ----|--- -`freq_range` |
Subfield | Description
`lower_bound` | lower bound freq range
`upper_bound` | upper bound freq range
-`volt_range` |
Subfield | Description
`lower_bound` | lower bound volt range
`upper_bound` | upper bound volt range
- -Exceptions that can be thrown by `amdsmi_get_gpu_od_volt_curve_regions` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_get_gpu_od_volt_curve_regions(device, 3) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_power_profile_presets - -Description: Get the list of available preset power profiles and an indication of -which profile is currently active. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` number of freq volt regions - -Output: Dictionary with fields - -Field | Description ----|--- -`available_profiles` | Which profiles are supported by this system -`current` | Which power profile is currently active -`num_profiles` | How many power profiles are available - -Exceptions that can be thrown by `amdsmi_get_gpu_power_profile_presets` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_get_gpu_power_profile_presets(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_counter_group_supported - -Description: Tell if an event group is supported by a given device. 
-It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` device which to query -* `event_group` event group being checked for support - -Output: None - -Exceptions that can be thrown by `amdsmi_gpu_counter_group_supported` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_gpu_counter_group_supported(device, AmdSmiEventGroup.XGMI) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_create_counter - -Description: Creates a performance counter object - -Input parameters: - -* `processor_handle` device which to query -* `event_type` event group being checked for support - -Output: An event handle of the newly created performance counter object - -Exceptions that can be thrown by `amdsmi_gpu_create_counter` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - event_handle = amdsmi_gpu_create_counter(device, AmdSmiEventGroup.XGMI) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_destroy_counter - -Description: Destroys a performance counter object - -Input parameters: - -* `event_handle` event handle of the performance counter object - -Output: None - -Exceptions that can be thrown by `amdsmi_gpu_destroy_counter` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - event_handle = amdsmi_gpu_create_counter(device, AmdSmiEventGroup.XGMI) - amdsmi_gpu_destroy_counter(event_handle) -except 
AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_control_counter - -Description: Issue performance counter control commands. It is not supported -on virtual machine guest - -Input parameters: - -* `event_handle` event handle of the performance counter object -* `counter_command` command being passed to counter as AmdSmiCounterCommand - -Output: None - -Exceptions that can be thrown by `amdsmi_gpu_control_counter` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - event_handle = amdsmi_gpu_create_counter(device, AmdSmiEventType.XGMI_1_REQUEST_TX) - amdsmi_gpu_control_counter(event_handle, AmdSmiCounterCommand.CMD_START) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_read_counter - -Description: Read the current value of a performance counter - -Input parameters: - -* `event_handle` event handle of the performance counter object - -Output: Dictionary with fields - -Field | Description ----|--- -`value` | Counter value -`time_enabled` | Time that the counter was enabled in nanoseconds -`time_running` | Time that the counter was running in nanoseconds - -Exceptions that can be thrown by `amdsmi_gpu_read_counter` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - event_handle = amdsmi_gpu_create_counter(device, AmdSmiEventType.XGMI_1_REQUEST_TX) - amdsmi_gpu_read_counter(event_handle) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_available_counters - -Description: Get the number of currently available counters. 
It is not supported -on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `event_group` event group being checked as AmdSmiEventGroup - -Output: Number of available counters for the given device of the inputted event group - -Exceptions that can be thrown by `amdsmi_get_gpu_available_counters` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - available_counters = amdsmi_get_gpu_available_counters(device, AmdSmiEventGroup.XGMI) - print(available_counters) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_perf_level - -Description: Set a desired performance level for given device. It is not -supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `perf_level` performance level being set as AmdSmiDevPerfLevel - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_perf_level` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_perf_level(device, AmdSmiDevPerfLevel.STABLE_PEAK) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_reset_gpu - -Description: Reset the gpu associated with the device with provided device handle -It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: None - -Exceptions that can be thrown by `amdsmi_reset_gpu` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - 
print("No GPUs on machine") - else: - for device in devices: - amdsmi_reset_gpu(device) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_fan_speed - -Description: Set the fan speed for the specified device with the provided speed, -in RPMs. It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` sensor index as integer -* `fan_speed` the speed to which the function will attempt to set the fan - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_fan_speed` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_fan_speed(device, 0, 1333) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_reset_gpu_fan - -Description: Reset the fan to automatic driver control. It is not -supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `sensor_idx` sensor index as integer - -Output: None - -Exceptions that can be thrown by `amdsmi_reset_gpu_fan` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_reset_gpu_fan(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_clk_freq - -Description: Control the set of allowed frequencies that can be used for the -specified clock. 
It is not supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `clk_type` the type of clock for which the set of frequencies will be modified -as AmdSmiClkType -* `freq_bitmask` bitmask indicating the indices of the frequencies that are to -be enabled (1) and disabled (0). Only the lowest ::amdsmi_frequencies_t.num_supported -bits of this mask are relevant. - -Output: None - -Exceptions that can be thrown by `amdsmi_set_clk_freq` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - freq_bitmask = 0 - amdsmi_set_clk_freq(device, AmdSmiClkType.GFX, freq_bitmask) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_soc_pstate - -Description: Get dpm policy information. - -Input parameters: - -* `processor_handle` handle for the given device -* `policy_id` the policy id to set. - -Output: Dictionary with fields - -Field | Description ----|--- -`num_supported` | total number of supported policies -`current_id` | current policy id -`policies` | list of dictionaries containing possible policies - -Exceptions that can be thrown by `amdsmi_get_soc_pstate` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - dpm_policies = amdsmi_get_soc_pstate(device) - print(dpm_policies) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_soc_pstate - -Description: Set the dpm policy to corresponding policy_id. Typically following: 0(default),1,2,3 - -Input parameters: - -* `processor_handle` handle for the given device -* `policy_id` the policy id to set. 
- -Output: None - -Exceptions that can be thrown by `amdsmi_set_soc_pstate` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_soc_pstate(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_xgmi_plpd - -Description: Set the xgmi per-link power down policy parameter for the processor - -Input parameters: - -* `processor_handle` handle for the given device -* `policy_id` the xgmi plpd id to set. - -Output: None - -Exceptions that can be thrown by `amdsmi_set_xgmi_plpd` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_xgmi_plpd(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_xgmi_plpd - -Description: Get the xgmi per-link power down policy parameter for the processor - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Dict containing information about xgmi per-link power down policy - -Field | Description ----|--- -`num_supported` | The number of supported policies -`current_id` | The current policy index -`plpds` | List of policies. 
- -Exceptions that can be thrown by `amdsmi_get_xgmi_plpd` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - xgmi_plpd = amdsmi_get_xgmi_plpd(device) - print(xgmi_plpd) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_overdrive_level - -Description: **deprecated** Set the overdrive percent associated with the -device with provided device handle with the provided value. It is not -supported on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `overdrive_value` value to which the overdrive level should be set - -Output: None - -Exceptions that can be thrown by `amdsmi_set_gpu_overdrive_level` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_overdrive_level(device, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_ecc_count - -Description: Retrieve the error counts for a GPU block. 
It is not supported -on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `block` The block for which error counts should be retrieved - -Output: Dict containing information about error counts - -Field | Description ----|--- -`correctable_count` | Count of correctable errors -`uncorrectable_count` | Count of uncorrectable errors -`deferred_count` | Count of deferred errors - -Exceptions that can be thrown by `amdsmi_get_gpu_ecc_count` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - ecc_count = amdsmi_get_gpu_ecc_count(device, AmdSmiGpuBlock.UMC) - print(ecc_count) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_ecc_enabled - -Description: Retrieve the enabled ECC bit-mask. It is not supported on virtual -machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: Enabled ECC bit-mask - -Exceptions that can be thrown by `amdsmi_get_gpu_ecc_enabled` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - enabled = amdsmi_get_gpu_ecc_enabled(device) - print(enabled) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_ecc_status - -Description: Retrieve the ECC status for a GPU block. 
It is not supported -on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device -* `block` The block for which ECC status should be retrieved - -Output: ECC status for a requested GPU block - -Exceptions that can be thrown by `amdsmi_get_gpu_ecc_status` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - status = amdsmi_get_gpu_ecc_status(device, AmdSmiGpuBlock.UMC) - print(status) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_status_code_to_string - -Description: Get a description of a provided AMDSMI error status - -Input parameters: - -* `status` The error status for which a description is desired - -Output: String description of the provided error code - -Exceptions that can be thrown by `amdsmi_status_code_to_string` function: - -* `AmdSmiParameterException` - -Example: - -```python -try: - status_str = amdsmi_status_code_to_string(ctypes.c_uint32(0)) - print(status_str) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_compute_process_info - -Description: Get process information about processes currently using GPU - -Input parameters: None - -Output: List of python dicts each containing a process information - -Field | Description ----|--- -`process_id` | Process ID -`pasid` | PASID -`vram_usage` | VRAM usage -`sdma_usage` | SDMA usage in microseconds -`cu_occupancy` | Compute Unit usage in percents - -Exceptions that can be thrown by `amdsmi_get_gpu_compute_process_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` - -Example: - -```python -try: - procs = amdsmi_get_gpu_compute_process_info() - for proc in procs: - print(proc) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_compute_process_info_by_pid - -Description: Get process information 
about processes currently using GPU - -Input parameters: - -* `pid` The process ID for which process information is being requested - -Output: Dict containing a process information - -Field | Description ----|--- -`process_id` | Process ID -`pasid` | PASID -`vram_usage` | VRAM usage -`sdma_usage` | SDMA usage in microseconds -`cu_occupancy` | Compute Unit usage in percents - -Exceptions that can be thrown by `amdsmi_get_gpu_compute_process_info_by_pid` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - pid = 0 # << valid pid here - proc = amdsmi_get_gpu_compute_process_info_by_pid(pid) - print(proc) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_compute_process_gpus - -Description: Get the device indices currently being used by a process - -Input parameters: - -* `pid` The process id of the process for which the number of gpus currently being used is requested - -Output: List of indices of devices currently being used by the process - -Exceptions that can be thrown by `amdsmi_get_gpu_compute_process_gpus` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - pid = 0 # << valid pid here - indices = amdsmi_get_gpu_compute_process_gpus(pid) - print(indices) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_gpu_xgmi_error_status - -Description: Retrieve the XGMI error status for a device. 
It is not supported on -virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: XGMI error status for a requested device - -Exceptions that can be thrown by `amdsmi_gpu_xgmi_error_status` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - status = amdsmi_gpu_xgmi_error_status(device) - print(status) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_reset_gpu_xgmi_error - -Description: Reset the XGMI error status for a device. It is not supported -on virtual machine guest - -Input parameters: - -* `processor_handle` handle for the given device - -Output: None - -Exceptions that can be thrown by `amdsmi_reset_gpu_xgmi_error` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_reset_gpu_xgmi_error(device) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_vendor_name - -Description: Returns the device vendor name - -Input parameters: - -* `processor_handle` device which to query - -Output: device vendor name - -Exceptions that can be thrown by `amdsmi_get_gpu_vendor_name` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vendor_name = amdsmi_get_gpu_vendor_name(device) - print(vendor_name) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_id - -Description: Get the device id associated with the device with provided device handler - -Input parameters: - -* 
`processor_handle` device which to query - -Output: device id - -Exceptions that can be thrown by `amdsmi_get_gpu_id` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - dev_id = amdsmi_get_gpu_id(device) - print(dev_id) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_vram_vendor - -Description: Get the vram vendor string of a gpu device. - -Input parameters: - -* `processor_handle` device which to query - -Output: vram vendor - -Exceptions that can be thrown by `amdsmi_get_gpu_vram_vendor` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - vram_vendor = amdsmi_get_gpu_vram_vendor(device) - print(vram_vendor) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_subsystem_id - -Description: Get the subsystem device id associated with the device with provided device handle. 
- -Input parameters: - -* `processor_handle` device which to query - -Output: subsystem device id - -Exceptions that can be thrown by `amdsmi_get_gpu_subsystem_id` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - id = amdsmi_get_gpu_subsystem_id(device) - print(id) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_subsystem_name - -Description: Get the name string for the device subsytem - -Input parameters: - -* `processor_handle` device which to query - -Output: device subsytem - -Exceptions that can be thrown by `amdsmi_get_gpu_subsystem_name` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - subsystem_nam = amdsmi_get_gpu_subsystem_name(device) - print(subsystem_nam) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_lib_version - -Description: Get the build version information for the currently running build of AMDSMI. 
- -Output: amdsmi build version - -Exceptions that can be thrown by `amdsmi_get_lib_version` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - version = amdsmi_get_lib_version() - print(version) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_topo_get_numa_node_number - -Description: Retrieve the NUMA CPU node number for a device - -Input parameters: - -* `processor_handle` device which to query - -Output: node number of NUMA CPU for the device - -Exceptions that can be thrown by `amdsmi_topo_get_numa_node_number` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - node_number = amdsmi_topo_get_numa_node_number() - print(node_number) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_topo_get_link_weight - -Description: Retrieve the weight for a connection between 2 GPUs. - -Input parameters: - -* `processor_handle_src` the source device handle -* `processor_handle_dest` the destination device handle - -Output: the weight for a connection between 2 GPUs - -Exceptions that can be thrown by `amdsmi_topo_get_link_weight` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - processor_handle_src = devices[0] - processor_handle_dest = devices[1] - weight = amdsmi_topo_get_link_weight(processor_handle_src, processor_handle_dest) - print(weight) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_minmax_bandwidth_between_processors - -Description: Retreive minimal and maximal io link bandwidth between 2 GPUs. 
- -Input parameters: - -* `processor_handle_src` the source device handle -* `processor_handle_dest` the destination device handle - -Output: Dictionary with fields: - -Field | Description ----|--- -`min_bandwidth` | minimal bandwidth for the connection -`max_bandwidth` | maximal bandwidth for the connection - -Exceptions that can be thrown by `amdsmi_get_minmax_bandwidth_between_processors` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - processor_handle_src = devices[0] - processor_handle_dest = devices[1] - bandwidth = amdsmi_get_minmax_bandwidth_between_processors(processor_handle_src, processor_handle_dest) - print(bandwidth['min_bandwidth']) - print(bandwidth['max_bandwidth']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_topo_get_link_type - -Description: Retrieve the hops and the connection type between 2 GPUs - -Input parameters: - -* `processor_handle_src` the source device handle -* `processor_handle_dest` the destination device handle - -Output: Dictionary with fields: - -Field | Description ----|--- -`hops` | number of hops -`type` | the connection type - -Exceptions that can be thrown by `amdsmi_topo_get_link_type` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - processor_handle_src = devices[0] - processor_handle_dest = devices[1] - link_type = amdsmi_topo_get_link_type(processor_handle_src, processor_handle_dest) - print(link_type['hops']) - print(link_type['type']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_is_P2P_accessible - -Description: Return P2P availability status between 2 GPUs - -Input parameters: - -* `processor_handle_src` the source device 
handle -* `processor_handle_dest` the destination device handle - -Output: P2P availability status between 2 GPUs - -Exceptions that can be thrown by `amdsmi_is_P2P_accessible` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - processor_handle_src = devices[0] - processor_handle_dest = devices[1] - accessible = amdsmi_is_P2P_accessible(processor_handle_src, processor_handle_dest) - print(accessible) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_compute_partition - -Description: Get the compute partition from the given GPU - -Input parameters: - -* `processor_handle` the device handle - -Output: String of the partition type - -Exceptions that can be thrown by `amdsmi_get_gpu_compute_partition` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - compute_partition_type = amdsmi_get_gpu_compute_partition(device) - print(compute_partition_type) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_compute_partition - -Description: Set the compute partition to the given GPU - -Input parameters: - -* `processor_handle` the device handle -* `compute_partition` the type of compute_partition to set - -Output: String of the partition type - -Exceptions that can be thrown by `amdsmi_set_gpu_compute_partition` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - compute_partition = AmdSmiComputePartitionType.SPX - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_compute_partition(device, 
compute_partition) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_gpu_memory_partition - -Description: Get the memory partition from the given GPU - -Input parameters: - -* `processor_handle` the device handle - -Output: String of the partition type - -Exceptions that can be thrown by `amdsmi_get_gpu_memory_partition` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - memory_partition_type = amdsmi_get_gpu_memory_partition(device) - print(memory_partition_type) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_gpu_memory_partition - -Description: Set the memory partition to the given GPU - -Input parameters: - -* `processor_handle` the device handle -* `memory_partition` the type of memory_partition to set - -Output: String of the partition type - -Exceptions that can be thrown by `amdsmi_set_gpu_memory_partition` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - memory_partition = AmdSmiMemoryPartitionType.NPS1 - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - amdsmi_set_gpu_memory_partition(device, memory_partition) -except AmdSmiException as e: - print(e) -``` - - -### amdsmi_get_gpu_accelerator_partition_profile - -**Note: CURRENTLY HARDCODED TO RETURN EMPTY VALUES** - -Description: Get partition information for target device - -Input parameters: - -* `processor_handle` the device handle - -Output: Dictionary with fields: - -Field | Description ----|--- -`partition_id` | ID of the partition on the GPU provided -`partition_profile` | Dict containing partition data (TBD) - -Exceptions that can be thrown by `amdsmi_get_gpu_accelerator_partition_profile` function: - -* 
`AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - partition_id = amdsmi_get_gpu_accelerator_partition_profile(device)["partition_id"] - print(partition_id) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_xgmi_info - -Description: Returns XGMI information for the GPU. - -Input parameters: - -* `processor_handle` device handle - -Output: Dictionary with fields: - -Field | Description ----|--- -`xgmi_lanes` | xgmi lanes -`xgmi_hive_id` | xgmi hive id -`xgmi_node_id` | xgmi node id -`index` | index - -Exceptions that can be thrown by `amdsmi_get_xgmi_info` function: - -* `AmdSmiLibraryException` -* `AmdSmiRetryException` -* `AmdSmiParameterException` - -Example: - -```python -try: - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs on machine") - else: - for device in devices: - xgmi_info = amdsmi_get_xgmi_info(device) - print(xgmi_info['xgmi_lanes']) - print(xgmi_info['xgmi_hive_id']) - print(xgmi_info['xgmi_node_id']) - print(xgmi_info['index']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_link_topology_nearest - -Description: Retrieve the set of GPUs that are nearest to a given device - at a specific interconnectivity level. - -Input parameters: -* `processor_handle` The identifier of the given device. -* `link_type` The AmdSmiLinkType level to search for nearest devices - -Output: Dictionary holding the following fields. 
-* `count` number of nearest devices found based on given topology level -* `processor_list` list of all nearest device handlers found - - -Exceptions that can be thrown by `amdsmi_get_link_topology_nearest` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - amdsmi_init() - - devices = amdsmi_get_processor_handles() - if len(devices) == 0: - print("No GPUs found on machine") - exit() - else: - print(amdsmi_get_gpu_device_uuid(devices[0])) - - nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType.AMDSMI_LINK_TYPE_PCIE) - if (nearest_gpus['count']) == 0: - print("No nearest GPUs found on machine") - else: - print("Nearest GPUs") - for gpu in nearest_gpus['processor_list']: - print(amdsmi_get_gpu_device_uuid(gpu)) - -except AmdSmiException as e: - print(e) -finally: - try: - amdsmi_shut_down() - except AmdSmiException as e: - print(e) -``` - -## CPU APIs - -### amdsmi_get_processor_info - -**Note: CURRENTLY HARDCODED TO RETURN EMPTY VALUES** - -Description: Return processor name - -Input parameters: -`processor_handle` processor handle - -Output: Processor name - -Exceptions that can be thrown by `amdsmi_get_processor_info` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_processor_handles() - if len(processor_handles) == 0: - print("No processors on machine") - else: - for processor in processor_handles: - print(amdsmi_get_processor_info(processor)) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_hsmp_proto_ver - -Description: Get the hsmp protocol version. 
- -Output: amdsmi hsmp protocol version - -Exceptions that can be thrown by `amdsmi_get_cpu_hsmp_proto_ver` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - version = amdsmi_get_cpu_hsmp_proto_ver(processor) - print(version) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_smu_fw_version - -Description: Get the SMU Firmware version. - -Output: amdsmi SMU Firmware version - -Exceptions that can be thrown by `amdsmi_get_cpu_smu_fw_version` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - version = amdsmi_get_cpu_smu_fw_version(processor) - print(version['debug']) - print(version['minor']) - print(version['major']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_prochot_status - -Description: Get the CPU's prochot status. - -Output: amdsmi cpu prochot status - -Exceptions that can be thrown by `amdsmi_get_cpu_prochot_status` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - prochot = amdsmi_get_cpu_prochot_status(processor) - print(prochot) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_fclk_mclk - -Description: Get the Data fabric clock and Memory clock in MHz. 
- -Output: amdsmi data fabric clock and memory clock - -Exceptions that can be thrown by `amdsmi_get_cpu_fclk_mclk` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - clk = amdsmi_get_cpu_fclk_mclk(processor) - for fclk, mclk in clk.items(): - print(fclk) - print(mclk) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_cclk_limit - -Description: Get the core clock in MHz. - -Output: amdsmi core clock - -Exceptions that can be thrown by `amdsmi_get_cpu_cclk_limit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - cclk_limit = amdsmi_get_cpu_cclk_limit(processor) - print(cclk_limit) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_current_active_freq_limit - -Description: Get current active frequency limit of the socket. 
- -Output: amdsmi frequency value in MHz and frequency source name - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_current_active_freq_limit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - freq_limit = amdsmi_get_cpu_socket_current_active_freq_limit(processor) - for freq, src in freq_limit.items(): - print(freq) - print(src) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_freq_range - -Description: Get socket frequency range - -Output: amdsmi maximum frequency and minimum frequency - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_freq_range` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - freq_range = amdsmi_get_cpu_socket_freq_range(processor) - for fmax, fmin in freq_range.items(): - print(fmax) - print(fmin) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_core_current_freq_limit - -Description: Get socket frequency limit of the core - -Output: amdsmi frequency - -Exceptions that can be thrown by `amdsmi_get_cpu_core_current_freq_limit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpucore_handles() - if len(processor_handles) == 0: - print("No CPU cores on machine") - else: - for processor in processor_handles: - freq_limit = amdsmi_get_cpu_core_current_freq_limit(processor) - print(freq_limit) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_power - -Description: Get the socket power. 
- -Output: amdsmi socket power - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_power` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - sock_power = amdsmi_get_cpu_socket_power(processor) - print(sock_power) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_power_cap - -Description: Get the socket power cap. - -Output: amdsmi socket power cap - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_power_cap` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - sock_power = amdsmi_get_cpu_socket_power_cap(processor) - print(sock_power) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_power_cap_max - -Description: Get the socket power cap max. - -Output: amdsmi socket power cap max - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_power_cap_max` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - sock_power = amdsmi_get_cpu_socket_power_cap_max(processor) - print(sock_power) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_pwr_svi_telemetry_all_rails - -Description: Get the SVI based power telemetry for all rails. 
- -Output: amdsmi svi based power value - -Exceptions that can be thrown by `amdsmi_get_cpu_pwr_svi_telemetry_all_rails` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - power = amdsmi_get_cpu_pwr_svi_telemetry_all_rails(processor) - print(power) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_socket_power_cap - -Description: Set the power cap value for a given socket. - -Input: amdsmi socket power cap value - -Exceptions that can be thrown by `amdsmi_set_cpu_socket_power_cap` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - power = amdsmi_set_cpu_socket_power_cap(processor, 1000) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_pwr_efficiency_mode - -Description: Set the power efficiency profile policy. 
- -Input: mode(0, 1, or 2) - -Exceptions that can be thrown by `amdsmi_set_cpu_pwr_efficiency_mode` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - policy = amdsmi_set_cpu_pwr_efficiency_mode(processor, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_core_boostlimit - -Description: Get boost limit of the cpu core - -Output: amdsmi frequency - -Exceptions that can be thrown by `amdsmi_get_cpu_core_boostlimit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpucore_handles() - if len(processor_handles) == 0: - print("No CPU cores on machine") - else: - for processor in processor_handles: - boost_limit = amdsmi_get_cpu_core_boostlimit(processor) - print(boost_limit) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_c0_residency - -Description: Get the cpu socket C0 residency. - -Output: amdsmi C0 residency value - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_c0_residency` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - c0_residency = amdsmi_get_cpu_socket_c0_residency(processor) - print(c0_residency) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_core_boostlimit - -Description: Set the cpu core boost limit. 
- -Output: amdsmi boostlimit value - -Exceptions that can be thrown by `amdsmi_set_cpu_core_boostlimit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpucore_handles() - if len(processor_handles) == 0: - print("No CPU cores on machine") - else: - for processor in processor_handles: - boost_limit = amdsmi_set_cpu_core_boostlimit(processor, 1000) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_socket_boostlimit - -Description: Set the cpu socket boost limit. - -Input: amdsmi boostlimit value - -Exceptions that can be thrown by `amdsmi_set_cpu_socket_boostlimit` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - boost_limit = amdsmi_set_cpu_socket_boostlimit(processor, 1000) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_ddr_bw - -Description: Get the CPU DDR Bandwidth. - -Output: amdsmi ddr bandwidth data - -Exceptions that can be thrown by `amdsmi_get_cpu_ddr_bw` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - ddr_bw = amdsmi_get_cpu_ddr_bw(processor) - print(ddr_bw['max_bw']) - print(ddr_bw['utilized_bw']) - print(ddr_bw['utilized_pct']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_temperature - -Description: Get the socket temperature. 
- -Output: amdsmi temperature value - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_temperature` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - ptmon = amdsmi_get_cpu_socket_temperature(processor) - print(ptmon) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_dimm_temp_range_and_refresh_rate - -Description: Get DIMM temperature range and refresh rate. - -Output: amdsmi dimm metric data - -Exceptions that can be thrown by `amdsmi_get_cpu_dimm_temp_range_and_refresh_rate` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - dimm = amdsmi_get_cpu_dimm_temp_range_and_refresh_rate(processor) - print(dimm['range']) - print(dimm['ref_rate']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_dimm_power_consumption - -Description: amdsmi_get_cpu_dimm_power_consumption. - -Output: amdsmi dimm power consumption value - -Exceptions that can be thrown by `amdsmi_get_cpu_dimm_power_consumption` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - dimm = amdsmi_get_cpu_dimm_power_consumption(processor) - print(dimm['power']) - print(dimm['update_rate']) - print(dimm['dimm_addr']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_dimm_thermal_sensor - -Description: Get DIMM thermal sensor value. 
- -Output: amdsmi dimm temperature data - -Exceptions that can be thrown by `amdsmi_get_cpu_dimm_thermal_sensor` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - dimm = amdsmi_get_cpu_dimm_thermal_sensor(processor) - print(dimm['sensor']) - print(dimm['update_rate']) - print(dimm['dimm_addr']) - print(dimm['temp']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_xgmi_width - -Description: Set xgmi width. - -Input: amdsmi xgmi width - -Exceptions that can be thrown by `amdsmi_set_cpu_xgmi_width` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - xgmi_width = amdsmi_set_cpu_xgmi_width(processor, 0, 100) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_gmi3_link_width_range - -Description: Set gmi3 link width range. - -Input: minimum & maximum link width to be set. - -Exceptions that can be thrown by `amdsmi_set_cpu_gmi3_link_width_range` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - gmi_link_width = amdsmi_set_cpu_gmi3_link_width_range(processor, 0, 100) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_cpu_apb_enable - -Description: Enable APB. 
- -Input: amdsmi processor handle - -Exceptions that can be thrown by `amdsmi_cpu_apb_enable` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - apb_enable = amdsmi_cpu_apb_enable(processor) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_cpu_apb_disable - -Description: Disable APB. - -Input: pstate value - -Exceptions that can be thrown by `amdsmi_cpu_apb_disable` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - apb_disable = amdsmi_cpu_apb_disable(processor, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_socket_lclk_dpm_level - -Description: Set NBIO lclk dpm level value. - -Input: nbio id, min value, max value - -Exceptions that can be thrown by `amdsmi_set_cpu_socket_lclk_dpm_level` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for socket in socket_handles: - nbio = amdsmi_set_cpu_socket_lclk_dpm_level(socket, 0, 0, 2) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_socket_lclk_dpm_level - -Description: Get NBIO LCLK dpm level. 
- -Output: nbio id - -Exceptions that can be thrown by `amdsmi_get_cpu_socket_lclk_dpm_level` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - nbio = amdsmi_get_cpu_socket_lclk_dpm_level(processor) - print(nbio['max_dpm_level']) - print(nbio['max_dpm_level']) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_pcie_link_rate - -Description: Set pcie link rate. - -Input: rate control value - -Exceptions that can be thrown by `amdsmi_set_cpu_pcie_link_rate` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - link_rate = amdsmi_set_cpu_pcie_link_rate(processor, 0, 0) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_set_cpu_df_pstate_range - -Description: Set df pstate range. - -Input: max pstate, min pstate - -Exceptions that can be thrown by `amdsmi_set_cpu_df_pstate_range` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - pstate_range = amdsmi_set_cpu_df_pstate_range(processor, 0, 2) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_current_io_bandwidth - -Description: Get current input output bandwidth. 
- -Output: link id and bw type to which io bandwidth to be obtained - -Exceptions that can be thrown by `amdsmi_get_cpu_current_io_bandwidth` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - io_bw = amdsmi_get_cpu_current_io_bandwidth(processor) - print(io_bw) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_cpu_current_xgmi_bw - -Description: Get current xgmi bandwidth. - -Output: amdsmi link id and bw type to which xgmi bandwidth to be obtained - -Exceptions that can be thrown by `amdsmi_get_cpu_current_xgmi_bw` function: - -* `AmdSmiLibraryException` - -Example: - -```python -try: - processor_handles = amdsmi_get_cpusocket_handles() - if len(processor_handles) == 0: - print("No CPU sockets on machine") - else: - for processor in processor_handles: - xgmi_bw = amdsmi_get_cpu_current_xgmi_bw(processor) - print(xgmi_bw) -except AmdSmiException as e: - print(e) -``` - -### amdsmi_get_hsmp_metrics_table_version - -Description: Get HSMP metrics table version. 
-
-Output: amdsmi HSMP metrics table version
-
-Exceptions that can be thrown by `amdsmi_get_hsmp_metrics_table_version` function:
-
-* `AmdSmiLibraryException`
-
-Example:
-
-```python
-try:
-    processor_handles = amdsmi_get_cpusocket_handles()
-    if len(processor_handles) == 0:
-        print("No CPU sockets on machine")
-    else:
-        for processor in processor_handles:
-            met_ver = amdsmi_get_hsmp_metrics_table_version(processor)
-            print(met_ver)
-except AmdSmiException as e:
-    print(e)
-```
-
-### amdsmi_get_hsmp_metrics_table
-
-Description: Get HSMP metrics table
-
-Output: HSMP metric table data
-
-Exceptions that can be thrown by `amdsmi_get_hsmp_metrics_table` function:
-
-* `AmdSmiLibraryException`
-
-Example:
-
-```python
-try:
-    processor_handles = amdsmi_get_cpusocket_handles()
-    if len(processor_handles) == 0:
-        print("No CPU sockets on machine")
-    else:
-        for processor in processor_handles:
-            mtbl = amdsmi_get_hsmp_metrics_table(processor)
-            print(mtbl['accumulation_counter'])
-            print(mtbl['max_socket_temperature'])
-            print(mtbl['max_vr_temperature'])
-            print(mtbl['max_hbm_temperature'])
-            print(mtbl['socket_power_limit'])
-            print(mtbl['max_socket_power_limit'])
-            print(mtbl['socket_power'])
-except AmdSmiException as e:
-    print(e)
-```
-
-### amdsmi_first_online_core_on_cpu_socket
-
-Description: Get first online core on cpu socket.
-
-Output: first online core on cpu socket
-
-Exceptions that can be thrown by `amdsmi_first_online_core_on_cpu_socket` function:
-
-* `AmdSmiLibraryException`
-
-Example:
-
-```python
-try:
-    processor_handles = amdsmi_get_cpusocket_handles()
-    if len(processor_handles) == 0:
-        print("No CPU sockets on machine")
-    else:
-        for processor in processor_handles:
-            pcore_ind = amdsmi_first_online_core_on_cpu_socket(processor)
-            print(pcore_ind)
-except AmdSmiException as e:
-    print(e)
-```
-
-### amdsmi_get_cpu_family
-
-Description: Get cpu family.
-
-Output: cpu family
-
-Exceptions that can be thrown by `amdsmi_get_cpu_family` function:
-
-* `AmdSmiLibraryException`
-
-Example:
-
-```python
-try:
-    cpu_family = amdsmi_get_cpu_family()
-    print(cpu_family)
-except AmdSmiException as e:
-    print(e)
-```
-
-### amdsmi_get_cpu_model
-
-Description: Get cpu model.
-
-Output: cpu model
-
-Exceptions that can be thrown by `amdsmi_get_cpu_model` function:
-
-* `AmdSmiLibraryException`
-
-Example:
-
-```python
-try:
-    cpu_model = amdsmi_get_cpu_model()
-    print(cpu_model)
-except AmdSmiException as e:
-    print(e)
-```
+- [Python API
+  reference](https://rocm.docs.amd.com/projects/en/latest/reference/amdsmi-py-api.html).