[SWDEV-488276/SWDEV-497613] Update memory partition set functionality

Changes:
  - [CLI] Added warning screen to AMD SMI users
    setting memory partition
  - [CLI] Added a progress bar to CLI memory partition sets, shown for up to 40 seconds
  - [API] Updated to wait until the driver reloads with SYSFS files active
  - [CLI] Users can now set or reset without providing -g all.
    Previously required:
      sudo amd-smi set -g all <set arguments>
      sudo amd-smi reset -g all <set arguments>
    Now also accepted:
      sudo amd-smi set <set arguments>
      sudo amd-smi reset <set arguments>
  - [SWDEV-475712][CLI/API] Fixed target_graphics_version field
    not properly displaying for older MI or Navi ASICs.
  - [All APIs] Added a catch for the driver reporting invalid arguments;
    these APIs now return AMDSMI_STATUS_INVAL
    (e.g. changing to NPS8 if the device does not support it)
  - [Install] Modified paths for Python install commands to support
    multi-ROCm installs

Change-Id: Id11f25d68a82d23c6b2d77ccb30b51e860dd0ca7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Charis Poag
2024-11-08 17:31:25 -06:00
parent 19cc4718c0
commit 3ea4a42a6e
24 changed files with 1711 additions and 726 deletions
+53 -1
@@ -8,7 +8,7 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
### Added
- **Added support for `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
- **Added support for `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
Guest VMs now support getting current ECC counts and RAS information from the Host cards.
- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**.
@@ -497,6 +497,56 @@ GPU: 0
### Changed
- **Improvement: Users now have the ability to set and reset without providing `-g all` using AMD SMI CLI**.
Users can now run set and reset without `-g all`. Previously, users were required to provide:
`sudo amd-smi set -g all <set arguments>` or `sudo amd-smi reset -g all <set arguments>`
This update allows the following commands:
`sudo amd-smi set <set arguments>` or `sudo amd-smi reset <set arguments>`
When no GPU is given, the action defaults to attempting the set/reset on all AMD GPUs on the user's system.
- **Improvement: `amd-smi set --memory-partition` now includes a warning banner and progress bar**.
For devices which support dynamically changing memory partitions, AMD SMI now displays a warning. The warning tells users that this action requires quitting all GPU workloads and that it triggers an AMD GPU driver reload. Because this process takes time to complete, a progress bar is shown until the change can be verified as successful. Otherwise, AMD SMI reports any errors to the user, along with the actions that can be taken. See the example below:
```shell
$ sudo amd-smi set -M NPS1
****** WARNING ******
Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
AMD SMI will then attempt to change memory (NPS) partition mode.
Upon a successful set, AMD SMI will then initiate an action to restart amdgpu driver.
This action will change all GPU's in the hive to the requested memory (NPS) partition mode.
Please use this utility with caution.
Do you accept these terms? [Y/N] y
Updating memory partition for gpu 0: [████████████████████████████████████████] 40/40 secs remain
GPU: 0
MEMORYPARTITION: Successfully set memory partition to NPS1
GPU: 1
MEMORYPARTITION: Successfully set memory partition to NPS1
GPU: 2
MEMORYPARTITION: Successfully set memory partition to NPS1
...
```
- **Updated `amdsmi_get_gpu_accelerator_partition_profile` to provide driver memory partition capabilities**.
The driver can now report which memory partition modes the user can set. Users can now see the available
memory partition modes when a memory partition mode set (`amdsmi_set_gpu_memory_partition`) returns an invalid argument.
This change also updates `amd-smi partition`, `amd-smi partition --memory`, and `amd-smi partition --accelerator` (*see note below).
***Note: *Subject to change for ROCm 6.4***
- **Updated `amdsmi_set_gpu_memory_partition` to not return until a successful restart of AMD GPU Driver.**
The call now keeps checking, for up to roughly 40 seconds, for a successful restart of the AMD GPU driver. It also checks that the memory partition (NPS) SYSFS files have been updated to reflect the user's requested memory partition (NPS) mode; otherwise, it reports an error back to the user. Because of these changes, AMD SMI's CLI now reflects the maximum wait of 40 seconds while a memory partition change is in progress.
- **All APIs now have the ability to catch driver reporting invalid arguments.**
AMD SMI APIs can now return `AMDSMI_STATUS_INVAL` when the driver returns `EINVAL`.
For example, this happens if the user tries to set NPS8 but that memory partition mode is not available on the device. The available modes are commonly referred to as `CAPS` (see `amd-smi partition --memory`) and are provided by `amdsmi_get_gpu_accelerator_partition_profile` (*see note below).
***Note: *Subject to change for ROCm 6.4***
- **Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**.
This aligns BDF output with ROCm SMI.
See below for an overview: the value, as seen from `rsmi_dev_pci_id_get()`, now provides the partition ID. See the API for more detail. Previously these bits were reserved (right before the domain) and the partition ID was within the function bits.
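The wait-until-reloaded behaviour described above for `amdsmi_set_gpu_memory_partition` can be pictured as a bounded polling loop. The following is only a sketch under stated assumptions, not the library's implementation: `wait_for_mode` and its parameters are hypothetical names, and the caller supplies `read_mode`, a function that returns the currently reported NPS mode (for example by reading the relevant SYSFS file, whose path is not given here).

```python
import time
from typing import Callable


def wait_for_mode(read_mode: Callable[[], str], requested: str,
                  timeout_s: float = 40.0, interval_s: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep,
                  clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll until read_mode() reports `requested` or the timeout expires.

    Hypothetical helper: returns True on a confirmed mode change and
    False if the deadline passes first.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            if read_mode().strip().upper() == requested.upper():
                return True
        except OSError:
            # Assumption: the file may be missing while the driver reloads.
            pass
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting; a caller would normally leave them at their defaults.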
@@ -590,6 +640,8 @@ GPU: 0
### Resolved issues
- **Fixed `amdsmi_get_gpu_asic_info`'s `target_graphics_version` and `amd-smi --asic` not displaying properly for MI2x or Navi 3x ASICs**.
- **Fixed `amd-smi reset` commands showing an AttributeError**.
- **Improved Offline install process & lowered dependency for PyYAML**.
+7
@@ -107,6 +107,13 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wformat=2 -fno-common -Wstrict-overflow
# Intentionally leave out -Wsign-promo. It causes spurious warnings.
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Woverloaded-virtual -Wreorder")
# Add CMAKE debug flags
if ("${CMAKE_BUILD_TYPE}" STREQUAL Release)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
else ()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -O0 -DDEBUG")
endif ()
set(COMMON_SRC_DIR "${PROJECT_SOURCE_DIR}/src")
set(ROCM_SRC_DIR "${PROJECT_SOURCE_DIR}/rocm_smi/src")
set(AMDSMI_SRC_DIR "${PROJECT_SOURCE_DIR}/src/amd_smi")
+5 -3
@@ -156,9 +156,11 @@ do_install_amdsmi_python_lib() {
"AMD-SMI python library will not be installed."
return
fi
local amdsmi_python_lib_path="/opt/rocm/share/amd_smi"
local amdsmi_setup_py_path="/opt/rocm/share/amd_smi/setup.py"
# install python library at @CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@/amdsmi
local python_lib_path=@CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@
local amdsmi_python_lib_path="$python_lib_path"
local amdsmi_setup_py_path="$python_lib_path/setup.py"
# Decide installation method based on setuptools version
if [[ "$(printf '%s\n' "$setuptools_version" "28.5" | sort -V | head -n1)" == "$setuptools_version" ]]; then
+4 -2
@@ -157,8 +157,10 @@ do_install_amdsmi_python_lib() {
return
fi
local amdsmi_python_lib_path="/opt/rocm/share/amd_smi"
local amdsmi_setup_py_path="/opt/rocm/share/amd_smi/setup.py"
# install python library at @CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@/amdsmi
local python_lib_path=@CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@
local amdsmi_python_lib_path="$python_lib_path"
local amdsmi_setup_py_path="$python_lib_path/setup.py"
# Decide installation method based on setuptools version
if [[ "$(printf '%s\n' "$setuptools_version" "28.5" | sort -V | head -n1)" == "$setuptools_version" ]]; then
@@ -49,11 +49,13 @@ AMDSMI_ERROR_MESSAGES = {
31: "Device Not found",
32: "Device not initialized",
33: "No more free slot",
34: "Driver not loaded",
# Reserved for future error messages
40: "No data was found for given input",
41: "Insufficient size for operation",
42: "Unexpected size of data was read",
43: "The data read or provided was unexpected",
54: "AMDGPU restart error",
}
def _get_error_message(error_code):
+98 -7
@@ -25,6 +25,9 @@ import sys
import threading
import time
import json
import multiprocessing
import os
from _version import __version__
from amdsmi_helpers import AMDSMIHelpers
@@ -3890,10 +3893,10 @@ class AMDSMICommands():
args.process_isolation = process_isolation
if clk_limit:
args.clk_limit = clk_limit
# Handle No GPU passed
if args.gpu == None:
raise ValueError('No GPU provided, specific GPU target(s) are needed')
args.gpu = self.device_handles
# Handle multiple GPUs
handled_multiple_gpus, device_handle = self.helpers.handle_gpus(args, self.logger, self.set_gpu)
@@ -3975,13 +3978,101 @@ class AMDSMICommands():
raise ValueError(f"Unable to set compute partition to {args.compute_partition} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'computepartition', f"Successfully set compute partition to {args.compute_partition}")
if args.memory_partition:
####################################################################
# Get current and available memory partition modes #
# Info used if AMDSMI_STATUS_INVAL is caught & to set progress bar #
####################################################################
try:
memory_partition = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu) # this info likely actually comes from different apis than used here
except amdsmi_exception.AmdSmiLibraryException as e:
memory_partition = "N/A"
logging.debug("Failed to get current memory partition for GPU %s | %s", gpu_id, e.get_error_info())
try:
mem_caps_str = "N/A"
partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu)
temp_mem_caps = partition_dict['partition_profile']['memory_caps']
mem_caps = temp_mem_caps.nps_cap_mask
if temp_mem_caps.amdsmi_nps_flags_t == None:
mem_caps_list = []
if mem_caps & 1 == 1:
mem_caps_list.append("NPS1")
if mem_caps & 2 == 2:
mem_caps_list.append("NPS2")
if mem_caps & 4 == 4:
mem_caps_list.append("NPS4")
if mem_caps & 8 == 8:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
else:
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
mem_caps_list = []
if mem_caps.nps1_cap == 1:
mem_caps_list.append("NPS1")
if mem_caps.nps2_cap == 1:
mem_caps_list.append("NPS2")
if mem_caps.nps4_cap == 1:
mem_caps_list.append("NPS4")
if mem_caps.nps8_cap == 1:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
if mem_caps_str == "":
mem_caps_str = "N/A"
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info())
memory_dict = {'caps': mem_caps_str, 'current': memory_partition}
###############################################################
# memory partition set starts here #
###############################################################
showProgressBar = False
if ((str(memory_dict['current']) != "N/A") and (str(args.memory_partition) in mem_caps_str)
and ((str(memory_dict['current']) != str(args.memory_partition)))):
showProgressBar = True # Only show progress bar if
# 1) Device can set memory partition modes
# 2) Requested mode is a valid mode to set
# 3) Current is not already the requested mode
# otherwise function will return fast
threads = []
kTimeWait = 40
self.helpers.increment_set_count()
set_count = self.helpers.get_set_count()
if set_count == 1: # only show reload warning on 1st set
self.helpers.confirm_changing_memory_partition_gpu_reload_warning()
memory_partition = amdsmi_interface.AmdSmiMemoryPartitionType[args.memory_partition]
try:
if set_count == 1 and showProgressBar: # only show the progress bar on the 1st set, when a mode change is expected
string_out = f"Updating memory partition for gpu {gpu_id}"
t1 = multiprocessing.Process(target=self.helpers.showProgressbar,
args=(string_out, kTimeWait,))
threads.append(t1)
t1.start()
amdsmi_interface.amdsmi_set_gpu_memory_partition(args.gpu, memory_partition)
for thread in threads:
thread.terminate()
thread.join()
except amdsmi_exception.AmdSmiLibraryException as e:
f = open(os.devnull, 'w') #redirect to /dev/null (crossplatform)
print("\n\n", end='\r', flush=True, file=f)
for thread in threads:
thread.terminate() # stop the progress bar first; join() alone would block until it finishes
thread.join()
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_INVAL:
print(f"[amdsmi_wrapper.AMDSMI_STATUS_INVAL] Unable to set memory partition to {args.memory_partition} on {gpu_string}")
print(f"Valid Memory partition Modes: {mem_caps_str}\n")
# fall through for value error
f = open(os.devnull, 'w') #redirect to /dev/null (crossplatform)
print("\n\n", end='\r', flush=True, file=f)
raise ValueError(f"Unable to set memory partition to {args.memory_partition} on {gpu_string}") from e
except Exception as e:
for thread in threads:
thread.terminate() # stop the progress bar first; join() alone would block until it finishes
thread.join()
raise ValueError(f"Generic error found | Unable to set memory partition to {args.memory_partition} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'memorypartition', f"Successfully set memory partition to {args.memory_partition}")
if isinstance(args.power_cap, int):
try:
@@ -4226,7 +4317,7 @@ class AMDSMICommands():
self.set_core(args, multiple_devices, core, core_boost_limit)
elif self.helpers.is_amdgpu_initialized(): # Only GPU is initialized
if args.gpu == None:
raise ValueError('No GPU provided, specific GPU target(s) are needed')
args.gpu = self.device_handles
self.logger.clear_multiple_devices_ouput()
self.set_gpu(args, multiple_devices, gpu, fan, perf_level,
profile, perf_determinism, compute_partition,
@@ -4281,7 +4372,7 @@ class AMDSMICommands():
# Handle No GPU passed
if args.gpu == None:
raise ValueError('No GPU provided, specific GPU target(s) are needed')
args.gpu = self.device_handles
# Handle multiple GPUs
handled_multiple_gpus, device_handle = self.helpers.handle_gpus(args, self.logger, self.reset)
@@ -5299,7 +5390,7 @@ class AMDSMICommands():
mem_caps_list.append("NPS4")
if mem_caps.nps8_cap == 1:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
if mem_caps_str == "":
mem_caps_str = "N/A"
except amdsmi_exception.AmdSmiLibraryException as e:
@@ -5350,7 +5441,7 @@ class AMDSMICommands():
mem_caps_list.append("NPS4")
if mem_caps & 8 == 8:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
else:
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
mem_caps_list = []
@@ -5362,7 +5453,7 @@ class AMDSMICommands():
mem_caps_list.append("NPS4")
if mem_caps.nps8_cap == 1:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
if mem_caps_str == "":
mem_caps_str = "N/A"
except amdsmi_exception.AmdSmiLibraryException as e:
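The capability handling in the hunks above turns an NPS bit mask into a display string by building a list and stripping brackets and quotes from its `str()` form. A minimal sketch of the same mapping, assuming only the bit values shown in the diff (1, 2, 4, 8 for NPS1, NPS2, NPS4, NPS8); `NPS_MODE_BITS` and `decode_nps_caps` are hypothetical names, not part of AMD SMI:

```python
# Bit values taken from the checks in the diff (mem_caps & 1, & 2, & 4, & 8).
NPS_MODE_BITS = (("NPS1", 1), ("NPS2", 2), ("NPS4", 4), ("NPS8", 8))


def decode_nps_caps(mask: int) -> str:
    """Return a comma-separated list of NPS modes set in `mask`, or "N/A"."""
    names = [name for name, bit in NPS_MODE_BITS if mask & bit]
    # ", ".join yields the same text as stripping "[", "]" and "'" from str(list).
    return ", ".join(names) if names else "N/A"
```

Joining the names directly avoids the chained `replace` calls and cannot leave stray quote characters in the output.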
+49
@@ -27,6 +27,7 @@ import platform
import sys
import time
import re
import multiprocessing
from typing import List, Union
from enum import Enum
@@ -54,6 +55,7 @@ class AMDSMIHelpers():
self._is_linux = False
self._is_windows = False
self._count_of_sets_called = 0
if self.operating_system.startswith("Linux"):
self._is_linux = True
@@ -77,6 +79,11 @@ class AMDSMIHelpers():
self._is_virtual_os = False
self._is_passthrough = True
def increment_set_count(self):
self._count_of_sets_called += 1
def get_set_count(self):
return self._count_of_sets_called
def is_virtual_os(self):
return self._is_virtual_os
@@ -740,6 +747,30 @@ class AMDSMIHelpers():
else:
sys.exit('Confirmation not given. Exiting without setting value')
def confirm_changing_memory_partition_gpu_reload_warning(self, auto_respond=False):
""" Print the memory partition driver-reload warning and prompt the user to accept the terms.
:param auto_respond: Response to automatically provide for all prompts
"""
print('''
****** WARNING ******\n
Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
AMD SMI will then attempt to change memory (NPS) partition mode.
Upon a successful set, AMD SMI will then initiate an action to restart amdgpu driver.
This action will change all GPU's in the hive to the requested memory (NPS) partition mode.
Please use this utility with caution.
''')
if not auto_respond:
user_input = input('Do you accept these terms? [Y/N] ')
else:
user_input = auto_respond
if user_input in ['Yes', 'yes', 'y', 'Y', 'YES']:
print('')
return
else:
print('Confirmation not given. Exiting without setting value')
sys.exit(1)
def is_valid_profile(self, profile):
profile_presets = amdsmi_interface.amdsmi_wrapper.amdsmi_power_profile_preset_masks_t__enumvalues
@@ -818,3 +849,21 @@ class AMDSMIHelpers():
except Exception as _:
continue
return pci_devices
def progressbar(self, it, prefix="", size=60, out=sys.stdout):
count = len(it)
def show(j):
x = int(size*j/count)
print("{}[{}{}] {}/{} secs remain".format(prefix, u"█"*x, "."*(size-x), j, count),
end='\r', file=out, flush=True)
show(0)
for i, item in enumerate(it):
yield item
show(i+1)
print("\n\n", end='\r', flush=True, file=out)
def showProgressbar(self, title="", timeInSeconds=13):
if title != "":
title += ": "
for i in self.progressbar(range(timeInSeconds), title, 40):
time.sleep(1)
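The progress bar helper above redraws a single line once per second. As an illustration only, one frame of that line can be produced by a pure function, which makes the layout easy to check without a terminal. `render_bar` is a hypothetical name, and the `█` fill glyph is assumed from the sample output in the changelog:

```python
def render_bar(elapsed: int, total: int, width: int = 40, prefix: str = "") -> str:
    """Build one frame of a text progress bar.

    `elapsed` of `total` seconds are drawn as filled cells out of `width`;
    the remaining cells are dots.
    """
    filled = int(width * elapsed / total)
    return "{}[{}{}] {}/{} secs remain".format(
        prefix, "█" * filled, "." * (width - filled), elapsed, total)
```

A caller would print each frame with `end='\r'` and `flush=True` so successive frames overwrite one another, as the helper in the diff does.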
+6 -2
@@ -26,6 +26,7 @@ import atexit
import logging
import signal
import sys
import os
from pathlib import Path
@@ -134,8 +135,11 @@ def amdsmi_cli_shutdown():
def signal_handler(sig, frame):
logging.debug(f"Handling signal: {sig}")
sys.exit(0)
try:
sys.exit(0)
except Exception as e:
logging.error("Unable to cleanly shut down amd-smi-lib, exception: %s", str(type(e).__name__))
os._exit(0)
if not AMDSMI_INITIALIZED:
AMDSMI_INIT_FLAG = amdsmi_cli_init()
+6 -6
@@ -1032,7 +1032,7 @@ class AMDSMIParser(argparse.ArgumentParser):
# Subparser help text
set_value_help = "Set options for devices"
set_value_subcommand_help = "A GPU must be specified to set a configuration.\
set_value_subcommand_help = "If no GPU is specified, will select all GPUs on the system.\
\nA set argument must be provided; Multiple set arguments are accepted"
set_value_optionals_title = "Set Arguments"
@@ -1073,8 +1073,8 @@ class AMDSMIParser(argparse.ArgumentParser):
set_value_parser.formatter_class=lambda prog: AMDSMISubparserHelpFormatter(prog)
set_value_parser.set_defaults(func=func)
# Device args are required as safeguard from the user applying the operation to all gpus unintentionally
self._add_device_arguments(set_value_parser, required=True)
# Device args are optional; if omitted, the operation applies to all GPUs
self._add_device_arguments(set_value_parser, required=False)
if self.helpers.is_amdgpu_initialized():
if self.helpers.is_baremetal():
@@ -1126,7 +1126,7 @@ class AMDSMIParser(argparse.ArgumentParser):
# Subparser help text
reset_help = "Reset options for devices"
reset_subcommand_help = "A GPU must be specified to reset a configuration.\
reset_subcommand_help = "If no GPU is specified, will select all GPUs on the system.\
\nA reset argument must be provided; Multiple reset arguments are accepted"
reset_optionals_title = "Reset Arguments"
@@ -1148,8 +1148,8 @@ class AMDSMIParser(argparse.ArgumentParser):
# Add Universal Arguments
self._add_command_modifiers(reset_parser)
# Device args are required as safeguard from the user applying the operation to all gpus unintentionally
self._add_device_arguments(reset_parser, required=True)
# Device args are optional; if omitted, the operation applies to all GPUs
self._add_device_arguments(reset_parser, required=False)
if self.helpers.is_baremetal():
# Add Baremetal reset arguments
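The parser change above makes the device argument optional, and the command code then falls back to every device handle when none is given. A simplified sketch of that pattern with plain `argparse` follows. It is an assumption-laden illustration: `build_parser` and `resolve_targets` are hypothetical names, GPUs are selected by list index only, and the real CLI's device selection is richer than this.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with an optional -g/--gpu selector (not required)."""
    parser = argparse.ArgumentParser(prog="demo-set")
    parser.add_argument("-g", "--gpu", nargs="+", default=None)
    parser.add_argument("-M", "--memory-partition")
    return parser


def resolve_targets(args: argparse.Namespace, all_devices: list) -> list:
    """Return the devices to act on; no selector (or 'all') means every device."""
    if args.gpu is None or args.gpu == ["all"]:
        return list(all_devices)
    # Simplification: treat each selector as an index into all_devices.
    return [all_devices[int(index)] for index in args.gpu]
```

With this shape, `demo-set -M NPS1` and `demo-set -g all -M NPS1` resolve to the same target list, which mirrors the behaviour the changelog describes.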
+3 -3
@@ -507,8 +507,8 @@ usage: amd-smi set [-h] (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...
[--core-boost-limit BOOST_LIMIT] [--json | --csv] [--file FILE]
[--loglevel LEVEL]
A GPU must be specified to set a configuration.
A set argument must be provided; Multiple set arguments are accepted.
If no GPU is specified, will select all GPUs on the system.
A set argument must be provided; Multiple set arguments are accepted
Set Arguments:
-h, --help show this help message and exit
@@ -578,7 +578,7 @@ usage: amd-smi reset [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
(-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-G] [-c]
[-f] [-p] [-x] [-d] [-C] [-M] [-o] [-l]
A GPU must be specified to reset a configuration.
If no GPU is specified, will select all GPUs on the system.
A reset argument must be provided; Multiple reset arguments are accepted
Reset Arguments:
@@ -24,6 +24,7 @@
#include <limits>
#include <type_traits>
#include <string>
#include "amd_smi/amdsmi.h"
#include "amd_smi/impl/amd_smi_gpu_device.h"
@@ -48,6 +49,8 @@ amdsmi_status_t smi_amdgpu_get_driver_version(amd::smi::AMDSmiGPUDevice* device,
amdsmi_status_t smi_amdgpu_get_pcie_speed_from_pcie_type(uint16_t pcie_type, uint32_t *pcie_speed);
amdsmi_status_t smi_amdgpu_get_market_name_from_dev_id(uint32_t device_id, char *market_name);
amdsmi_status_t smi_amdgpu_is_gpu_power_management_enabled(amd::smi::AMDSmiGPUDevice* device, bool *enabled);
std::string smi_split_string(std::string str, char delim);
std::string smi_amdgpu_get_status_string(amdsmi_status_t ret, bool fullStatus);
template<typename>
+1
@@ -88,6 +88,7 @@ class AmdSmiLibraryException(AmdSmiException):
amdsmi_wrapper.AMDSMI_STATUS_FILE_NOT_FOUND : "AMDSMI_STATUS_FILE_NOT_FOUND - File or directory not found",
amdsmi_wrapper.AMDSMI_STATUS_ARG_PTR_NULL : "AMDSMI_STATUS_ARG_PTR_NULL - Parsed argument is invalid",
amdsmi_wrapper.AMDSMI_STATUS_MAP_ERROR : "AMDSMI_STATUS_MAP_ERROR - The internal library error did not map to a status code",
amdsmi_wrapper.AMDSMI_STATUS_AMDGPU_RESTART_ERR: "AMDSMI_STATUS_AMDGPU_RESTART_ERR - AMDGPU restart failed, please check dmesg for errors",
amdsmi_wrapper.AMDSMI_STATUS_UNKNOWN_ERROR : "AMDSMI_STATUS_UNKNOWN_ERROR - An unknown error occurred"
}
+7 -2
@@ -1653,8 +1653,13 @@ def amdsmi_get_gpu_asic_info(
processor_handle, ctypes.byref(asic_info_struct))
)
market_name = _pad_hex_value(asic_info_struct.market_name.decode("utf-8"), 4)
target_graphics_version = str(asic_info_struct.target_graphics_version)
if len(target_graphics_version) == 4 and ("Instinct MI2" in market_name):
hex_part = str(hex(int(str(asic_info_struct.target_graphics_version)[2:]))).replace("0x", "")
target_graphics_version = str(asic_info_struct.target_graphics_version)[:2] + hex_part
asic_info = {
"market_name": _pad_hex_value(asic_info_struct.market_name.decode("utf-8"), 4),
"market_name": market_name,
"vendor_id": asic_info_struct.vendor_id,
"vendor_name": asic_info_struct.vendor_name.decode("utf-8"),
"subvendor_id": asic_info_struct.subvendor_id,
@@ -1663,7 +1668,7 @@ def amdsmi_get_gpu_asic_info(
"asic_serial": asic_info_struct.asic_serial.decode("utf-8"),
"oam_id": asic_info_struct.oam_id,
"num_compute_units": asic_info_struct.num_of_compute_units,
"target_graphics_version": "gfx" + str(asic_info_struct.target_graphics_version)
"target_graphics_version": "gfx" + target_graphics_version
}
string_values = ["market_name", "vendor_name"]
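The `target_graphics_version` fix above rewrites a four-digit version for MI2x parts by keeping the first two digits and converting the rest to hexadecimal. A standalone sketch of that transformation is shown below. `format_target_graphics_version` is a hypothetical name, and the raw value `9010` and the market name used in the usage note are illustrative assumptions rather than values taken from the diff.

```python
def format_target_graphics_version(raw_version: int, market_name: str) -> str:
    """Return a 'gfx...' string, hex-encoding the tail for MI2x parts."""
    version = str(raw_version)
    if len(version) == 4 and "Instinct MI2" in market_name:
        # Keep the first two digits; render the remaining digits as hex.
        version = version[:2] + format(int(version[2:]), "x")
    return "gfx" + version
```

For instance, a raw value of `9010` with a market name containing "Instinct MI2" becomes `gfx90a`, while any value that does not meet both conditions is passed through with only the `gfx` prefix added.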
+2 -2
@@ -987,12 +987,12 @@ struct_amdsmi_accelerator_partition_profile_t._pack_ = 1 # source:False
struct_amdsmi_accelerator_partition_profile_t._fields_ = [
('profile_type', amdsmi_accelerator_partition_type_t),
('num_partitions', ctypes.c_uint32),
('profile_index', ctypes.c_uint32),
('memory_caps', amdsmi_nps_caps_t),
('profile_index', ctypes.c_uint32),
('num_resources', ctypes.c_uint32),
('resources', ctypes.c_uint32 * 32 * 8),
('PADDING_0', ctypes.c_ubyte * 4),
('reserved', ctypes.c_uint64 * 6),
('reserved', ctypes.c_uint64 * 13),
]
amdsmi_accelerator_partition_profile_t = struct_amdsmi_accelerator_partition_profile_t
@@ -652,11 +652,6 @@ static rsmi_status_t test_set_compute_partitioning(uint32_t dv_ind) {
std::cout << "\n" << "\n";
}
std::cout << "About to initate compute partition reset..." << "\n";
ret = rsmi_dev_compute_partition_reset(dv_ind);
CHK_RSMI_NOT_SUPPORTED_RET(ret)
std::cout << "Done resetting compute partition." << "\n";
std::string myComputePartition = originalComputePartition;
if (myComputePartition.empty() == false) {
std::cout << "Resetting back to original compute partition to "
@@ -709,11 +704,6 @@ static rsmi_status_t test_set_memory_partition(uint32_t dv_ind) {
<< "." << "\n\n\n";
}
std::cout << "About to initate memory partition reset...\n";
ret = rsmi_dev_memory_partition_reset(dv_ind);
CHK_RSMI_NOT_SUPPORTED_RET(ret)
std::cout << "Done resetting memory partition.\n";
std::string myMemPart = originalMemoryPartition;
if (myMemPart.empty() == false) {
std::cout << "Resetting memory partition to " << originalMemoryPartition
+33 -40
@@ -4596,25 +4596,6 @@ rsmi_status_t
rsmi_dev_compute_partition_set(uint32_t dv_ind,
rsmi_compute_partition_type_t compute_partition);
/**
* @brief Reverts a selected device's compute partition setting back to its
* boot state.
*
* @details Given a device index @p dv_ind , this function will attempt to
* revert its compute partition setting back to its boot state.
*
* @param[in] dv_ind a device index
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
* @retval ::RSMI_STATUS_BUSY A resource or mutex could not be acquired
* because it is already being used - device is busy
*
*/
rsmi_status_t rsmi_dev_compute_partition_reset(uint32_t dv_ind);
/**
* @brief Retrieves the partition_id for a desired device
*
@@ -4680,6 +4661,39 @@ rsmi_status_t
rsmi_dev_memory_partition_get(uint32_t dv_ind, char *memory_partition,
uint32_t len);
/**
* @brief Retrieves the available memory partition capabilities
* for a desired device
*
* @details
* Given a device index @p dv_ind and a string @p memory_partition_caps ,
* and uint32 @p len , this function will attempt to obtain the device's
* available memory partition capabilities string. Upon successful
 * retrieval, the obtained device's available memory partition capabilities
* string shall be stored in the passed @p memory_partition_caps
* char string variable.
*
* @param[in] dv_ind a device index
*
* @param[inout] memory_partition_caps a pointer to a char string variable,
* which the device's available memory partition capabilities will be written to.
*
 * @param[in] len the length of the caller provided buffer @p memory_partition_caps ,
* suggested length is 30 or greater.
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
* large enough to hold the entire memory partition value. In this case,
* only @p len bytes will be written.
*
*/
rsmi_status_t rsmi_dev_memory_partition_capabilities_get(
uint32_t dv_ind, char *memory_partition_caps, uint32_t len);
/**
* @brief Modifies a selected device's current memory partition setting.
*
@@ -4707,27 +4721,6 @@ rsmi_status_t
rsmi_dev_memory_partition_set(uint32_t dv_ind,
rsmi_memory_partition_type_t memory_partition);
/**
* @brief Reverts a selected device's memory partition setting back to its
* boot state.
*
* @details Given a device index @p dv_ind , this function will attempt to
* revert its current memory partition setting back to its boot state.
*
* @param[in] dv_ind a device index
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_PERMISSION function requires root access
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function
* @retval ::RSMI_STATUS_AMDGPU_RESTART_ERR could not successfully restart
* the amdgpu driver
* @retval ::RSMI_STATUS_BUSY A resource or mutex could not be acquired
* because it is already being used - device is busy
*
*/
rsmi_status_t rsmi_dev_memory_partition_reset(uint32_t dv_ind);
/** @} */ // end of memory_partition
/*****************************************************************************/
@@ -182,6 +182,7 @@ enum DevInfoTypes {
kDevAvailableComputePartition,
kDevComputePartition,
kDevMemoryPartition,
kDevAvailableMemoryPartition,
// The information read from pci core sysfs
kDevPCieTypeStart = 1000,
@@ -245,6 +246,8 @@ class Device {
bool DeviceAPISupported(std::string name, uint64_t variant,
uint64_t sub_variant);
rsmi_status_t restartAMDGpuDriver(void);
rsmi_status_t isRestartInProgress(bool *isRestartInProgress,
bool *isAMDGPUModuleLive);
rsmi_status_t storeDevicePartitions(uint32_t dv_ind);
template <typename T> std::string readBootPartitionState(uint32_t dv_ind);
rsmi_status_t check_amdgpu_property_reinforcement_query(uint32_t dev_idx, AMDGpuVerbTypes_t verb_type);
@@ -92,7 +92,8 @@ std::pair<bool, std::string> executeCommand(std::string command,
rsmi_status_t storeTmpFile(uint32_t dv_ind, std::string parameterName,
std::string stateName, std::string storageData);
std::vector<std::string> getListOfAppTmpFiles();
bool containsString(std::string originalString, std::string substring);
bool containsString(std::string originalString, std::string substring,
bool displayComparisons = false);
std::tuple<bool, std::string> readTmpFile(
uint32_t dv_ind,
std::string stateName,
@@ -141,6 +142,8 @@ std::string removeNewLines(const std::string &s);
std::string removeString(const std::string origStr,
const std::string &removeMe);
void system_wait(int milli_seconds);
int countDigit(uint64_t n);
template <typename T>
std::string print_int_as_hex(T i, bool showHexNotation = true,
int overloadBitSize = 0) {
Diff suppressed because the file is too large
+189 -75
@@ -5729,12 +5729,22 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
LOG_TRACE(ss);
REQUIRE_ROOT_ACCESS
DEVICE_MUTEX
const uint32_t kMaxBoardLength = 128;
bool isCorrectDevice = false;
char boardName[128];
char boardName[kMaxBoardLength];
boardName[0] = '\0';
const uint32_t kMaxMemoryCapabilitiesSize = 30;
char available_memory_capabilities[kMaxMemoryCapabilitiesSize];
available_memory_capabilities[0] = '\0';
const uint32_t kMaxCurrentMemoryMode = 5;
char current_memory_mode[kMaxCurrentMemoryMode];
current_memory_mode[0] = '\0';
// rsmi_dev_memory_partition_set is only available for the discrete variant,
// others are required to update through bios settings
rsmi_dev_name_get(dv_ind, boardName, 128);
rsmi_dev_name_get(dv_ind, boardName, static_cast<size_t>(kMaxBoardLength));
std::string myBoardName = boardName;
if (!myBoardName.empty()) {
std::transform(myBoardName.begin(), myBoardName.end(), myBoardName.begin(),
@@ -5747,18 +5757,19 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
if (!isCorrectDevice) {
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Fail "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Cause: device board name does not support this action"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_NOT_SUPPORTED) << " |";
<< " | ======= end ======= "
<< " | Fail "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Cause: device board name does not support this action"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_NOT_SUPPORTED, false);
LOG_ERROR(ss);
return RSMI_STATUS_NOT_SUPPORTED;
}
// Is the current mode already what user requested?
switch (memory_partition) {
case RSMI_MEMORY_PARTITION_NPS1:
case RSMI_MEMORY_PARTITION_NPS2:
@@ -5775,7 +5786,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Cause: requested setting was invalid"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_INVALID_ARGS) << " |";
<< getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
LOG_ERROR(ss);
return RSMI_STATUS_INVALID_ARGS;
}
@@ -5797,7 +5808,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
<< " | Cause: could not retrieve current memory partition or retrieved"
<< " unexpected data"
<< " | Returning = "
<< getRSMIStatusString(ret_get) << " |";
<< getRSMIStatusString(ret_get, false);
LOG_ERROR(ss);
return ret_get;
}
@@ -5813,11 +5824,52 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data: " << newMemoryPartition
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_SUCCESS) << " |";
<< getRSMIStatusString(RSMI_STATUS_SUCCESS, false);
LOG_TRACE(ss);
return RSMI_STATUS_SUCCESS;
}
// is this an available mode to set to?
std::string memory_capabilities_str = "unknown";
std::string user_requested_memory_partition = newMemoryPartition;
std::transform(user_requested_memory_partition.begin(), user_requested_memory_partition.end(),
user_requested_memory_partition.begin(), ::toupper);
rsmi_status_t caps_ret = rsmi_dev_memory_partition_capabilities_get(dv_ind,
available_memory_capabilities, kMaxMemoryCapabilitiesSize);
memory_capabilities_str = available_memory_capabilities;
std::transform(memory_capabilities_str.begin(), memory_capabilities_str.end(),
memory_capabilities_str.begin(), ::toupper);
ss << __PRETTY_FUNCTION__ << " | user_requested_memory_partition: "
<< user_requested_memory_partition
<< "; memory_capabilities_str: " << memory_capabilities_str
<< "; rsmi_dev_memory_partition_capabilities_get(" << dv_ind
<< ", " << user_requested_memory_partition << "): return = "
<< amd::smi::getRSMIStatusString(caps_ret, false);
LOG_DEBUG(ss);
if ((caps_ret == RSMI_STATUS_SUCCESS)
&& (!memory_capabilities_str.empty())
&& (!user_requested_memory_partition.empty())) {
bool is_available_mode = amd::smi::containsString(memory_capabilities_str,
user_requested_memory_partition, true);
ss << __PRETTY_FUNCTION__
<< " | is_available_mode: " << (is_available_mode ? "True": "False");
LOG_DEBUG(ss);
if (is_available_mode == false) { // report RSMI_STATUS_INVALID_ARGS
ss << __PRETTY_FUNCTION__
<< " | ======= Check if available mode ======= "
<< " | WARNING: detected invalid mode to set to, will try to set anyways"
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data (user requested mode): " << user_requested_memory_partition
<< " | Available Memory Partition Modes: " << memory_capabilities_str
<< " | Cause: requested setting was not an available mode"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
LOG_INFO(ss);
}
}
GET_DEV_FROM_INDX
int ret = dev->writeDevInfo(amd::smi::kDevMemoryPartition,
newMemoryPartition);
@@ -5835,7 +5887,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Cause: issue writing requested setting of " + newMemoryPartition
<< " | Returning = "
<< getRSMIStatusString(err) << " |";
<< getRSMIStatusString(err, false);
LOG_ERROR(ss);
return err;
}
@@ -5849,8 +5901,76 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data: " << newMemoryPartition
<< " | Returning = "
<< getRSMIStatusString(restartRet) << " |";
<< getRSMIStatusString(restartRet, false);
LOG_TRACE(ss);
if (restartRet != RSMI_STATUS_SUCCESS) {
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Fail - AMD GPU driver restart failed"
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Cause: issue restarting the AMD GPU driver after writing requested setting of " + newMemoryPartition
<< " | Returning = "
<< getRSMIStatusString(restartRet, false);
LOG_ERROR(ss);
return restartRet;
}
std::string current_memory_mode_str = "unknown";
rsmi_status_t can_read_sysfs_again = RSMI_STATUS_AMDGPU_RESTART_ERR;
int maxWaitSeconds = 10;
const int k1000_MS_WAIT = 1000;
// wait until we can read SYSFS again
if (restartRet == RSMI_STATUS_SUCCESS) {
while ((current_memory_mode_str != user_requested_memory_partition)
&& (maxWaitSeconds > 0)) {  // stop polling once the wait budget is spent
maxWaitSeconds -= 1;
can_read_sysfs_again =
rsmi_dev_memory_partition_get(dv_ind, current_memory_mode, kMaxCurrentMemoryMode);
if (can_read_sysfs_again == RSMI_STATUS_SUCCESS) {
current_memory_mode_str.clear();
current_memory_mode_str = current_memory_mode;
ss << __PRETTY_FUNCTION__
<< " | ======= rsmi_dev_memory_partition_get ======= "
<< " | Success - can read SYSFS"
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data (user requested mode): " << user_requested_memory_partition
<< " | Current Memory Partition Mode: " << current_memory_mode_str
<< " | Available Memory Partition Modes: " << memory_capabilities_str
<< " | total wait time (sec): " << (10 - maxWaitSeconds)
<< " | Returning = "
<< getRSMIStatusString(can_read_sysfs_again, false);
LOG_TRACE(ss);
if (!current_memory_mode_str.empty()
&& (current_memory_mode_str == user_requested_memory_partition)) {
break;
}
}
amd::smi::system_wait(k1000_MS_WAIT);
}
}
if (current_memory_mode_str == user_requested_memory_partition) {
restartRet = RSMI_STATUS_SUCCESS;
} else {
restartRet = RSMI_STATUS_AMDGPU_RESTART_ERR;
}
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Success - completed driver restart and all SYSFS are active"
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data: " << user_requested_memory_partition
<< " | Current Memory Partition Mode: " << current_memory_mode_str
<< " | Available Memory Partition Modes: " << memory_capabilities_str
<< " | Returning = "
<< getRSMIStatusString(restartRet, false);
LOG_TRACE(ss);
return restartRet;
CATCH
}
@@ -5927,79 +6047,73 @@ rsmi_dev_memory_partition_get(uint32_t dv_ind, char *memory_partition,
CATCH
}
rsmi_status_t rsmi_dev_compute_partition_reset(uint32_t dv_ind) {
rsmi_status_t rsmi_dev_memory_partition_capabilities_get(
uint32_t dv_ind, char *memory_partition_caps, uint32_t len) {
TRY
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << " | ======= start =======, " << dv_ind;
LOG_TRACE(ss);
REQUIRE_ROOT_ACCESS
if ((len == 0) || (memory_partition_caps == nullptr)) {
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Fail "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
<< " | Cause: user sent invalid arguments, len = 0 or memory_partition_caps"
<< " was a null ptr"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
LOG_ERROR(ss);
return RSMI_STATUS_INVALID_ARGS;
}
CHK_SUPPORT_NAME_ONLY(memory_partition_caps)
DEVICE_MUTEX
GET_DEV_FROM_INDX
rsmi_status_t ret = RSMI_STATUS_NOT_SUPPORTED;
// Only use 1st index, rest are there in-case of future issues
// NOTE: Partitions sets cause rocm-smi indexes to fluctuate
// since the nodes are grouped in respect to primary node - why we only use
// 1st node/device id to reset
std::string bootState =
dev->readBootPartitionState<rsmi_compute_partition_type_t>(0);
std::string availableMemoryPartitions;
rsmi_status_t ret =
get_dev_value_line(amd::smi::kDevAvailableMemoryPartition,
dv_ind, &availableMemoryPartitions);
if (ret != RSMI_STATUS_SUCCESS) {
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | FAIL "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
<< " | Data: could not retrieve requested data"
<< " | Returning = "
<< getRSMIStatusString(ret, false);
LOG_ERROR(ss);
return ret;
}
// Initiate reset
// If bootState is UNKNOWN, we cannot reset - return RSMI_STATUS_NOT_SUPPORTED
// Likely due to device not supporting it
if (bootState != "UNKNOWN") {
rsmi_compute_partition_type_t compute_partition =
mapStringToRSMIComputePartitionTypes.at(bootState);
ret = rsmi_dev_compute_partition_set(dv_ind, compute_partition);
std::size_t length = availableMemoryPartitions.copy(memory_partition_caps, len-1);
memory_partition_caps[length]='\0';
if (len < (availableMemoryPartitions.size() + 1)) {
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Fail "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
<< " | Cause: requested size was insufficient"
<< " | Returning = "
<< getRSMIStatusString(RSMI_STATUS_INSUFFICIENT_SIZE, false);
LOG_ERROR(ss);
return RSMI_STATUS_INSUFFICIENT_SIZE;
}
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Success - if original boot state was not unknown or valid setting"
<< " | Success "
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevComputePartition)
<< " | Data: " << bootState
<< amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
<< " | Data: " << memory_partition_caps
<< " | Returning = "
<< getRSMIStatusString(ret) << " |";
LOG_TRACE(ss);
return ret;
CATCH
}
rsmi_status_t rsmi_dev_memory_partition_reset(uint32_t dv_ind) {
TRY
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << "| ======= start =======, " << dv_ind;
LOG_TRACE(ss);
REQUIRE_ROOT_ACCESS
DEVICE_MUTEX
GET_DEV_FROM_INDX
rsmi_status_t ret = RSMI_STATUS_NOT_SUPPORTED;
// Only use 1st index, rest are there in-case of future issues
// NOTE: Partitions sets cause rocm-smi indexes to fluctuate.
// Since the nodes are grouped in respect to primary node - why we only use
// 1st node/device id to reset
std::string bootState =
dev->readBootPartitionState<rsmi_memory_partition_type_t>(0);
// Initiate reset
// If bootState is UNKNOWN, we cannot reset - return RSMI_STATUS_NOT_SUPPORTED
// Likely due to device not supporting it
if (bootState != "UNKNOWN") {
rsmi_memory_partition_type_t memory_partition =
mapStringToMemoryPartitionTypes.at(bootState);
ret = rsmi_dev_memory_partition_set(dv_ind, memory_partition);
}
ss << __PRETTY_FUNCTION__
<< " | ======= end ======= "
<< " | Success - if original boot state was not unknown or valid setting"
<< " | Device #: " << dv_ind
<< " | Type: "
<< amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
<< " | Data: " << bootState
<< " | Returning = "
<< getRSMIStatusString(ret) << " |";
<< getRSMIStatusString(ret, false);
LOG_TRACE(ss);
return ret;
CATCH
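The wait-for-SYSFS step in `rsmi_dev_memory_partition_set` above polls once per second against a budget of `maxWaitSeconds`. One way to express that kind of polling with an explicit attempt limit is sketched below. `poll_until` and its parameters are hypothetical names used only for illustration and are not part of ROCm SMI; the code in the diff calls `rsmi_dev_memory_partition_get` and `amd::smi::system_wait` rather than a generic predicate.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not a ROCm SMI API): calls `ready` up to `max_attempts`
// times and sleeps for `interval` after each failed attempt.
// Returns true as soon as `ready` returns true, false once attempts run out.
bool poll_until(const std::function<bool()> &ready, int max_attempts,
                std::chrono::milliseconds interval) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    if (ready()) {
      return true;
    }
    std::this_thread::sleep_for(interval);
  }
  return false;
}
```

With an attempt limit in the loop itself, a device that never reports the requested mode produces a failure status instead of blocking the caller.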
+72 -4
@@ -140,6 +140,7 @@ static const char *kDevAvailableComputePartitionFName =
"available_compute_partition";
static const char *kDevComputePartitionFName = "current_compute_partition";
static const char *kDevMemoryPartitionFName = "current_memory_partition";
static const char *kDevAvailableMemoryPartitionFName = "available_memory_partition";
// Firmware version files
static const char *kDevFwVersionAsdFName = "fw_version/asd_fw_version";
@@ -328,6 +329,7 @@ static const std::map<DevInfoTypes, const char *> kDevAttribNameMap = {
{kDevAvailableComputePartition, kDevAvailableComputePartitionFName},
{kDevComputePartition, kDevComputePartitionFName},
{kDevMemoryPartition, kDevMemoryPartitionFName},
{kDevAvailableMemoryPartition, kDevAvailableMemoryPartitionFName},
};
static const std::map<rsmi_dev_perf_level, const char *> kDevPerfLvlMap = {
@@ -479,6 +481,7 @@ Device::devInfoTypesStrings = {
{kDevAvailableComputePartition, "kDevAvailableComputePartition"},
{kDevComputePartition, "kDevComputePartition"},
{kDevMemoryPartition, "kDevMemoryPartition"},
{kDevAvailableMemoryPartition, "kDevAvailableMemoryPartition"},
{kDevPCieVendorID, "kDevPCieVendorID"},
{kDevSocPstate, "kDevSocPstate"},
{kDevXgmiPlpd, "kDevXgmiPlpd"},
@@ -1308,6 +1311,7 @@ int Device::readDevInfo(DevInfoTypes type, std::string *val) {
case kDevMemoryPartition:
case kDevNumaNode:
case kDevXGMIPhysicalID:
case kDevAvailableMemoryPartition:
case kDevProcessIsolation:
return readDevInfoStr(type, val);
break;
@@ -1486,10 +1490,15 @@ bool Device::DeviceAPISupported(std::string name, uint64_t variant,
rsmi_status_t Device::restartAMDGpuDriver(void) {
REQUIRE_ROOT_ACCESS
std::ostringstream ss;
bool restartSuccessful = true;
bool success = false;
std::string out;
bool wasGdmServiceActive = false;
bool restartInProgress = true;
bool isRestartInProgress = true;
bool isAMDGPUModuleLive = false;
std::string captureRestartErr;
// sudo systemctl is-active gdm
// we do not care about the success of checking if gdm is active
@@ -1498,8 +1507,8 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
(restartSuccessful = true);
// if gdm is active -> sudo systemctl stop gdm
// TODO: are are there other display manager's we need to take into account?
// see https://en.wikipedia.org/wiki/GNOME_Display_Manager
// TODO(AMD_SMI_team): are there other display managers we need to take into account?
// see https://help.gnome.org/admin/gdm/stable/overview.html.en_GB
if (success && (out == "active")) {
wasGdmServiceActive = true;
std::tie(success, out) = executeCommand("systemctl stop gdm&", false);
@@ -1509,8 +1518,13 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
// sudo modprobe -r amdgpu
// sudo modprobe amdgpu
std::tie(success, out) =
executeCommand("modprobe -r amdgpu && modprobe amdgpu&", false);
executeCommand("modprobe -r amdgpu && modprobe amdgpu&", true);
restartSuccessful &= success;
captureRestartErr = out;
if (success) {
restartSuccessful = false;
}
// if gdm was active -> sudo systemctl start gdm
if (wasGdmServiceActive) {
@@ -1518,7 +1532,61 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
restartSuccessful &= success;
}
return (restartSuccessful ? RSMI_STATUS_SUCCESS :
// Return early if there was an issue restarting amdgpu
if (!restartSuccessful) {
ss << __PRETTY_FUNCTION__ << " | [WARNING] Issue found during amdgpu restart: "
<< captureRestartErr << "; restartSuccessful: " << (restartSuccessful ? "True" : "False");
LOG_INFO(ss);
return RSMI_STATUS_AMDGPU_RESTART_ERR;
}
// wait for amdgpu module to come back up
rsmi_status_t status = Device::isRestartInProgress(&isRestartInProgress,
&isAMDGPUModuleLive);
const int kTimeToWaitForDriverMSec = 1000;
int maxLoops = 10; // wait a max of 10 sec
while (status != RSMI_STATUS_SUCCESS) {
maxLoops -= 1;
if (maxLoops == 0) {
break;
}
amd::smi::system_wait(kTimeToWaitForDriverMSec);
status = Device::isRestartInProgress(&isRestartInProgress,
&isAMDGPUModuleLive);
}
return ((restartSuccessful && (!isRestartInProgress && isAMDGPUModuleLive)) ?
RSMI_STATUS_SUCCESS :
RSMI_STATUS_AMDGPU_RESTART_ERR);
}
rsmi_status_t Device::isRestartInProgress(bool *isRestartInProgress,
bool *isAMDGPUModuleLive) {
REQUIRE_ROOT_ACCESS
std::ostringstream ss;
bool restartSuccessful = true;
bool success = false;
std::string out;
bool deviceRestartInProgress = true; // Assume in progress, we intend to disprove
bool isSystemAMDGPUModuleLive = false; // Assume AMD GPU module is not live,
// we intend to disprove
// wait for amdgpu module to come back up
std::tie(success, out) = executeCommand("cat /sys/module/amdgpu/initstate", true);
ss << __PRETTY_FUNCTION__
<< " | success = " << success
<< " | out = " << out;
LOG_DEBUG(ss);
if ((success == true) && (!out.empty())) {
isSystemAMDGPUModuleLive = containsString(out, "live");
}
if (isSystemAMDGPUModuleLive) {  // test the local result, not the out-pointer
deviceRestartInProgress = false;
}
*isRestartInProgress = deviceRestartInProgress;
*isAMDGPUModuleLive = isSystemAMDGPUModuleLive;
return ((*isAMDGPUModuleLive && !*isRestartInProgress) ? RSMI_STATUS_SUCCESS :
RSMI_STATUS_AMDGPU_RESTART_ERR);
}
+45 -5
@@ -63,6 +63,7 @@
#include <sstream>
#include <string>
#include <vector>
#include <cmath>
#include "rocm_smi/rocm_smi.h"
#include "rocm_smi/rocm_smi_kfd.h"
@@ -357,6 +358,7 @@ rsmi_status_t ErrnoToRsmiStatus(int err) {
case EIO: return RSMI_STATUS_UNEXPECTED_SIZE;
case ENXIO: return RSMI_STATUS_UNEXPECTED_DATA;
case EBUSY: return RSMI_STATUS_BUSY;
case EINVAL: return RSMI_STATUS_INVALID_ARGS;
default: return RSMI_STATUS_UNKNOWN_ERROR;
}
}
@@ -429,14 +431,14 @@ std::pair<bool, std::string> executeCommand(std::string command, bool stdOut) {
char buffer[128];
std::string stdoutAndErr;
bool successfulRun = true;
command = "stdbuf -i0 -o0 -e0 " + command; // remove stdOut and err buffering
command = "stdbuf -i0 -o0 -e0 " + command; // remove stdOut and err buffering
FILE *pipe = popen(command.c_str(), "r");
if (!pipe) {
stdoutAndErr = "[ERROR] popen failed to call " + command;
successfulRun = false;
} else {
//read until end of process
// read until end of process
while (!feof(pipe)) {
// use buffer to read and add to stdoutAndErr
if (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
@@ -459,8 +461,19 @@ std::pair<bool, std::string> executeCommand(std::string command, bool stdOut) {
// originalString - string to search for substring
// substring - string looking to find
bool containsString(std::string originalString, std::string substring) {
return (originalString.find(substring) != std::string::npos);
// displayComparisons = defaults to false, set to true to see debug prints
bool containsString(std::string originalString, std::string substring,
bool displayComparisons) {
std::ostringstream ss;
bool found = originalString.find(substring) != std::string::npos;
if (displayComparisons) {
ss << __PRETTY_FUNCTION__
<< " | originalString: " << originalString
<< " | substring: " << substring
<< " | found: " << (found ? "True": "False");
LOG_TRACE(ss);
}
return found;
}
// Creates and stores supplied data into a temporary file (within /tmp/).
@@ -1217,7 +1230,9 @@ rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind, std::string *gfx_vers
// separate out parts -> put back into normal graphics version format
major = static_cast<uint64_t>((orig_target_version / 10000) * 100);
minor = static_cast<uint64_t>((orig_target_version % 10000 / 100) * 10);
if (minor == 0) major *= 10; // 0 as a minor is correct, but bump up by 10
if ((minor == 0) && (countDigit(major) < 4)) {
major *= 10; // 0 as a minor is correct, but bump up by 10
}
rev = static_cast<uint64_t>(orig_target_version % 100);
*gfx_version = "gfx" + std::to_string(major + minor + rev);
ss << __PRETTY_FUNCTION__
@@ -1278,6 +1293,31 @@ std::queue<std::string> getAllDeviceGfxVers() {
return deviceGfxVersions;
}
// milli_seconds: time to wait, in milliseconds
// 1 sec = 1000ms
// .5 sec = 500ms
void system_wait(int milli_seconds) {
std::ostringstream ss;
auto start = std::chrono::high_resolution_clock::now();
// 1 ms = 1000 us
int waitTime = milli_seconds * 1000;
ss << __PRETTY_FUNCTION__ << " | "
<< "** Waiting for " << std::dec << waitTime
<< " us (" << waitTime/1000 << " milli-seconds) **";
LOG_DEBUG(ss);
usleep(waitTime);
auto stop = std::chrono::high_resolution_clock::now();
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
ss << __PRETTY_FUNCTION__ << " | "
<< "** Waiting took " << duration.count() / 1000
<< " milli-seconds **";
LOG_DEBUG(ss);
}
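int countDigit(uint64_t n) {
if (n == 0) {
return 1;  // log10(0) is -infinity; treat 0 as a single digit
}
return static_cast<int>(std::floor(std::log10(static_cast<double>(n)) + 1));
}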
} // namespace smi
} // namespace amd
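The `rsmi_get_gfx_target_version` hunk above only bumps `major` when it has fewer than four digits. A small worked sketch of that arithmetic follows. `count_digits` and `format_gfx_version` are hypothetical names, the digit count uses integer division rather than `log10` as a design choice for this sketch, and the input values in the usage note are illustrative numbers in the `major * 10000 + minor * 100 + revision` layout that the arithmetic implies.

```cpp
#include <cstdint>
#include <string>

// Counts decimal digits using integer arithmetic only; 0 counts as one digit.
int count_digits(uint64_t n) {
  int digits = 1;
  while (n >= 10) {
    n /= 10;
    ++digits;
  }
  return digits;
}

// Mirrors the arithmetic shown in the diff: split the packed value into
// major/minor/revision parts and reassemble them as a "gfx" string.
std::string format_gfx_version(uint64_t target_version) {
  uint64_t major = (target_version / 10000) * 100;
  uint64_t minor = (target_version % 10000 / 100) * 10;
  if (minor == 0 && count_digits(major) < 4) {
    major *= 10;  // only widen a major part that is still three digits long
  }
  uint64_t rev = target_version % 100;
  return "gfx" + std::to_string(major + minor + rev);
}
```

Under these assumptions, `format_gfx_version(90402)` yields `gfx942` and `format_gfx_version(110000)` yields `gfx1100`; without the digit-count guard the second value would have been widened to `gfx11000`.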
+47 -6
@@ -1455,16 +1455,57 @@ amdsmi_get_gpu_accelerator_partition_profile(amdsmi_processor_handle processor_h
amdsmi_accelerator_partition_profile_t *profile,
uint32_t *partition_id) {
AMDSMI_CHECK_INIT();
// TODO: also fill out profile later
// default to 0xffffffff if not supported
*partition_id = std::numeric_limits<uint32_t>::max();
auto tmp_partition_id = uint32_t(0);
if (profile == nullptr) {
return AMDSMI_STATUS_INVAL;
}
std::ostringstream ss;
// TODO(amdsmi_team): also fill out profile later
amdsmi_nps_caps_t flags;
flags.amdsmi_nps_flags_t.nps1_cap = 0;
flags.amdsmi_nps_flags_t.nps2_cap = 0;
flags.amdsmi_nps_flags_t.nps4_cap = 0;
flags.amdsmi_nps_flags_t.nps8_cap = 0;
profile->memory_caps = flags;
amdsmi_status_t status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &tmp_partition_id);
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS){
// TODO(amdsmi_team): add resources here ^
auto tmp_partition_id = uint32_t(0);
auto tmp_xcd_count = uint16_t(0);
amdsmi_status_t status = AMDSMI_STATUS_NOT_SUPPORTED;
status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &tmp_partition_id);
if (status == AMDSMI_STATUS_SUCCESS) {
*partition_id = tmp_partition_id;
}
// Add memory partition capabilities here
constexpr uint32_t kLenCapsSize = 30;
char memory_caps[kLenCapsSize];
status = rsmi_wrapper(rsmi_dev_memory_partition_capabilities_get, processor_handle,
memory_caps, kLenCapsSize);
ss << __PRETTY_FUNCTION__
<< " | rsmi_dev_memory_partition_capabilities_get Returning: "
<< smi_amdgpu_get_status_string(status, false)
<< " | Type: memory_partition_capabilities"
<< " | Data: " << memory_caps;
LOG_DEBUG(ss);
std::string memory_caps_str = "N/A";
if (status == AMDSMI_STATUS_SUCCESS) {
memory_caps_str = std::string(memory_caps);
if (memory_caps_str.find("NPS1") != std::string::npos) {
flags.amdsmi_nps_flags_t.nps1_cap = 1;
}
if (memory_caps_str.find("NPS2") != std::string::npos) {
flags.amdsmi_nps_flags_t.nps2_cap = 1;
}
if (memory_caps_str.find("NPS4") != std::string::npos) {
flags.amdsmi_nps_flags_t.nps4_cap = 1;
}
if (memory_caps_str.find("NPS8") != std::string::npos) {
flags.amdsmi_nps_flags_t.nps8_cap = 1;
}
}
profile->memory_caps = flags;
return status;
}
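The capability handling in `amdsmi_get_gpu_accelerator_partition_profile` above maps substrings of the capabilities string to per-mode flags. A self-contained sketch of that mapping is shown below. `NpsCaps` and `parse_nps_caps` are hypothetical stand-ins, not the real `amdsmi_nps_caps_t` layout or an AMD SMI function, and the comma-separated input format in the usage note is an assumption.

```cpp
#include <string>

// Hypothetical stand-in for the NPS capability flags; this is not the real
// amdsmi_nps_caps_t layout.
struct NpsCaps {
  bool nps1 = false;
  bool nps2 = false;
  bool nps4 = false;
  bool nps8 = false;
};

// Sets a flag for every NPS mode name found in the capabilities string.
NpsCaps parse_nps_caps(const std::string &caps) {
  NpsCaps flags;
  flags.nps1 = caps.find("NPS1") != std::string::npos;
  flags.nps2 = caps.find("NPS2") != std::string::npos;
  flags.nps4 = caps.find("NPS4") != std::string::npos;
  flags.nps8 = caps.find("NPS8") != std::string::npos;
  return flags;
}
```

For an input such as `"NPS1, NPS2, NPS4"`, the sketch sets the first three flags and leaves `nps8` cleared; an empty string leaves every flag cleared.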
+32
@@ -624,3 +624,35 @@ amdsmi_status_t smi_amdgpu_is_gpu_power_management_enabled(amd::smi::AMDSmiGPUDe
return AMDSMI_STATUS_SUCCESS;
}
std::string smi_amdgpu_split_string(std::string str, char delim) {
std::stringstream ss(str);
std::string token;
if (str.empty()) {
return "";
}
// return the text before the first delimiter (or all of str if there is none)
if (std::getline(ss, token, delim)) {
return token;
}
return "";  // nothing could be read
}
// wrapper to return the string expression of an amdsmi_status_t return
// amdsmi_status_t ret - return value of an AMD SMI API function
// bool fullStatus - defaults to true, set to false to chop off description
// Returns:
// string - if fullStatus == true, returns full description of return value
// ex. 'RSMI_STATUS_SUCCESS: The function has been executed successfully.'
// string - if fullStatus == false, returns a minimalized return value
// ex. 'RSMI_STATUS_SUCCESS'
std::string smi_amdgpu_get_status_string(amdsmi_status_t ret, bool fullStatus = true) {
const char *err_str;
amdsmi_status_code_to_string(ret, &err_str);
if (!fullStatus) {
return smi_amdgpu_split_string(std::string(err_str), ':');
}
return std::string(err_str);
}
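The `fullStatus == false` path above keeps only the text before the first `:` of a status string. A minimal sketch of that trimming, using the example string from the comment in the diff, is shown below. `before_first` is a hypothetical helper for illustration and not part of AMD SMI.

```cpp
#include <string>

// Hypothetical helper: returns the text before the first `delim`, or the whole
// string when `delim` does not occur.
std::string before_first(const std::string &text, char delim) {
  // find() returns npos when delim is absent, and substr(0, npos) copies
  // the entire string.
  return text.substr(0, text.find(delim));
}
```

Called as `before_first("RSMI_STATUS_SUCCESS: The function has been executed successfully.", ':')`, it returns `RSMI_STATUS_SUCCESS`.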