[SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
- [CLI] Added a warning screen for AMD SMI users
setting the memory partition
- [CLI] Added a progress bar to CLI set output that runs for up to 40 seconds
- [API] Updated the set call to wait until the driver reloads with SYSFS files active
- [CLI] Users can now set or reset without providing -g all. Previously required:
amd-smi set -g all <set arguments>
or amd-smi reset -g all <set arguments>
Users can now directly call: sudo amd-smi set <set arguments>
or sudo amd-smi reset <set arguments>
- [SWDEV-475712][CLI/API] Fixed target_graphics_version field
not displaying properly for older MI or Navi ASICs.
- [All APIs] Added a catch for the driver reporting invalid arguments;
these APIs now return AMDSMI_STATUS_INVAL
(e.g. changing to NPS8 if the device does not support it)
- [Install] Modified paths for Python install commands to support
multi-ROCm installs
Change-Id: Id11f25d68a82d23c6b2d77ccb30b51e860dd0ca7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
+53
-1
@@ -8,7 +8,7 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
|
||||
|
||||
### Added
|
||||
|
||||
- **Added support for `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
|
||||
- **Added support for `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
|
||||
Guest VMs now support getting current ECC counts and ras information from the Host cards.
|
||||
|
||||
- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**.
|
||||
@@ -497,6 +497,56 @@ GPU: 0
|
||||
|
||||
### Changed
|
||||
|
||||
- **Improvement: Users can now set and reset without providing `-g all` in the AMD SMI CLI**.
Previously, users were required to provide:
`sudo amd-smi set -g all <set arguments>` or `sudo amd-smi reset -g all <set arguments>`
This update makes the `-g all` argument optional, allowing the commands:
`sudo amd-smi set <set arguments>` or `sudo amd-smi reset <set arguments>`
When no GPU is specified, the set or reset is attempted on all AMD GPUs on the user's system.
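The fallback to all GPUs can be sketched with Python's `argparse`. This is a minimal illustration, not the AMD SMI parser itself: `build_parser`, `resolve_targets`, and the `ALL_GPUS` list are hypothetical names, and only the idea of targeting every device when `-g` is omitted comes from this change.

```python
import argparse

# Hypothetical stand-in for the device handles a CLI would discover at startup.
ALL_GPUS = ["gpu0", "gpu1", "gpu2"]


def build_parser():
    """Build a tiny 'set'-style parser where -g/--gpu is optional."""
    parser = argparse.ArgumentParser(prog="set-sketch")
    parser.add_argument("-g", "--gpu", nargs="+", default=None)
    parser.add_argument("-M", "--memory-partition", default=None)
    return parser


def resolve_targets(args, all_gpus):
    """Return the GPUs to act on: every GPU if none (or 'all') was given."""
    if args.gpu is None or args.gpu == ["all"]:
        return list(all_gpus)
    return args.gpu


# No -g given, so the request applies to every known GPU.
args = build_parser().parse_args(["-M", "NPS1"])
print(resolve_targets(args, ALL_GPUS))
```

Keeping the fallback in one helper means the set and reset paths can share the same rule instead of each raising its own "No GPU provided" error.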
|
||||
|
||||
- **Improvement: `amd-smi set --memory-partition` now includes a warning banner and progress bar**.
For devices that support dynamically changing memory partitions, AMD SMI now displays a warning before the change. The warning tells users that this action requires quitting all GPU workloads and that it will trigger an AMD GPU driver reload. Because the process takes time to complete, a progress bar is shown until the change can be verified as successful. Otherwise, AMD SMI reports any errors to the user along with the actions that can be taken. See the example below:
|
||||
```shell
$ sudo amd-smi set -M NPS1

****** WARNING ******

Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
AMD SMI will then attempt to change memory (NPS) partition mode.
Upon a successful set, AMD SMI will then initiate an action to restart amdgpu driver.
This action will change all GPU's in the hive to the requested memory (NPS) partition mode.

Please use this utility with caution.

Do you accept these terms? [Y/N] y

Updating memory partition for gpu 0: [████████████████████████████████████████] 40/40 secs remain

GPU: 0
    MEMORYPARTITION: Successfully set memory partition to NPS1

GPU: 1
    MEMORYPARTITION: Successfully set memory partition to NPS1

GPU: 2
    MEMORYPARTITION: Successfully set memory partition to NPS1
...
```
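The progress-bar line in the output above can be reproduced with a small pure function. The sketch below is modeled on the format string used by this commit's `progressbar` helper. `render_bar` is a hypothetical name, and the real helper redraws the line once per second from a separate process, which is omitted here.

```python
def render_bar(prefix, elapsed, total, width=40):
    """Return one progress-bar line for `elapsed` of `total` seconds."""
    # Number of filled cells, rounded down, as in the helper's int(size*j/count).
    filled = int(width * elapsed / total)
    return "{}[{}{}] {}/{} secs remain".format(
        prefix, "█" * filled, "." * (width - filled), elapsed, total)


print(render_bar("Updating memory partition for gpu 0: ", 10, 40))
print(render_bar("Updating memory partition for gpu 0: ", 40, 40))
```

Note that the counter counts up (elapsed over total) even though the label reads "secs remain", which is why a completed bar shows `40/40`.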
|
||||
|
||||
- **Updated `amdsmi_get_gpu_accelerator_partition_profile` to provide driver memory partition capabilities**.
The driver can now report which memory partition modes the user can set. Users can see the available
memory partition modes when a memory partition mode set (`amdsmi_set_gpu_memory_partition`) returns an invalid argument.
This change also updates `amd-smi partition`, `amd-smi partition --memory`, and `amd-smi partition --accelerator` (*see note below).
***Note: *Subject to change for ROCm 6.4***
|
||||
|
||||
- **Updated `amdsmi_set_gpu_memory_partition` to not return until a successful restart of AMD GPU Driver.**
The API now checks for up to approximately 40 seconds for a successful restart of the AMD GPU driver. It also keeps checking whether the memory partition (NPS) SYSFS files have been updated to reflect the user's requested memory partition (NPS) mode. Otherwise, the API reports an error back to the user. Because of these changes, AMD SMI's CLI now reflects the maximum wait of 40 seconds while a memory partition change is in progress.
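The wait described above is a poll-with-deadline loop. The following sketch shows that pattern in Python under stated assumptions: `wait_until` and `driver_reloaded` are hypothetical names, and the real checks (driver module state and NPS SYSFS contents) are replaced by a stand-in predicate.

```python
import time


def wait_until(predicate, timeout_s=40.0, interval_s=1.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Hypothetical check that succeeds on the third poll.
polls = {"n": 0}


def driver_reloaded():
    polls["n"] += 1
    return polls["n"] >= 3


print(wait_until(driver_reloaded, timeout_s=5.0, interval_s=0.01))
```

Passing the clock and sleep functions as parameters keeps the loop testable without waiting for real time to pass.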
|
||||
|
||||
- **All APIs now have the ability to catch driver reporting invalid arguments.**
AMD SMI APIs can now return `AMDSMI_STATUS_INVAL` when the driver returns `EINVAL`.
For example, this occurs if the user tries to set NPS8 but that memory partition mode is not available on the device. The available modes are commonly referred to as `CAPS` (see `amd-smi partition --memory`) and are provided by `amdsmi_get_gpu_accelerator_partition_profile` (*see note below).
|
||||
***Note: *Subject to change for ROCm 6.4***
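To illustrate how the reported capabilities relate to an invalid-argument result, the sketch below decodes a capability bit mask into mode names and rejects a request that is not among them. The bit values (1, 2, 4, 8) follow the checks in this commit's CLI code. The function names are hypothetical, and `ValueError` stands in for the library's invalid-argument status.

```python
# Bit-to-name mapping, following the mask checks in the CLI change.
NPS_BITS = ((1, "NPS1"), (2, "NPS2"), (4, "NPS4"), (8, "NPS8"))


def decode_nps_caps(mask):
    """Return the NPS mode names whose bit is set in `mask`."""
    return [name for bit, name in NPS_BITS if mask & bit]


def caps_string(mask):
    """Format the modes for display, or 'N/A' if none are set."""
    modes = decode_nps_caps(mask)
    return ", ".join(modes) if modes else "N/A"


def check_request(requested, mask):
    """Raise ValueError if `requested` is not an available mode."""
    if requested not in decode_nps_caps(mask):
        raise ValueError(
            f"Unable to set memory partition to {requested}; "
            f"valid modes: {caps_string(mask)}")


print(caps_string(0b0101))
```

The sketch compares against the decoded list rather than searching the display string, which avoids accidental substring matches.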
|
||||
|
||||
- **Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**.
This aligns BDF output with ROCm SMI.
See below for an overview: the BDF, as seen from `rsmi_dev_pci_id_get()`, now provides the partition ID. See the API for more detail. Previously, these bits were reserved (right before domain) and the partition ID was within function.
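As a rough illustration of a BDF value that carries a partition ID, the sketch below packs and unpacks the fields. It assumes the bit layout described in the ROCm SMI documentation for `rsmi_dev_pci_id_get()` (domain in bits 63:32, partition ID in 31:28, bus in 15:8, device in 7:3, function in 2:0). That layout is not shown in this excerpt, so verify it against the API header before relying on it. `pack_bdfid` and `unpack_bdfid` are hypothetical helper names.

```python
def pack_bdfid(domain, partition_id, bus, device, function):
    """Pack PCI fields into a 64-bit BDF ID (assumed layout, see lead-in)."""
    return (((domain & 0xFFFFFFFF) << 32)
            | ((partition_id & 0xF) << 28)
            | ((bus & 0xFF) << 8)
            | ((device & 0x1F) << 3)
            | (function & 0x7))


def unpack_bdfid(bdfid):
    """Split a 64-bit BDF ID back into its fields."""
    return {
        "domain": (bdfid >> 32) & 0xFFFFFFFF,
        "partition_id": (bdfid >> 28) & 0xF,
        "bus": (bdfid >> 8) & 0xFF,
        "device": (bdfid >> 3) & 0x1F,
        "function": bdfid & 0x7,
    }


bdfid = pack_bdfid(domain=0, partition_id=1, bus=0x23, device=0, function=0)
print(hex(bdfid), unpack_bdfid(bdfid))
```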
|
||||
@@ -590,6 +640,8 @@ GPU: 0
|
||||
|
||||
### Resolved issues
|
||||
|
||||
- **Fixed `amdsmi_get_gpu_asic_info`'s `target_graphics_version` and `amd-smi --asic` not displaying properly for MI2x or Navi 3x ASICs**.
|
||||
|
||||
- **Fixed `amd-smi reset` commands showing an AttributeError**.
|
||||
|
||||
- **Improved Offline install process & lowered dependency for PyYAML**.
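The `target_graphics_version` fix in the list above can be illustrated with a short sketch that follows the conversion in this commit's `amdsmi_get_gpu_asic_info` change: for a four-digit version on an `Instinct MI2` part, the last two digits are rewritten in hexadecimal. `format_gfx_version` is a hypothetical helper name, and the market-name strings are illustrative inputs.

```python
def format_gfx_version(raw_version, market_name):
    """Return a 'gfx...' string, hex-encoding the tail for Instinct MI2 parts."""
    version = str(raw_version)
    if len(version) == 4 and "Instinct MI2" in market_name:
        # e.g. "9010" -> "90" + hex(10) -> "90a"
        version = version[:2] + format(int(version[2:]), "x")
    return "gfx" + version


print(format_gfx_version(9010, "AMD Instinct MI210"))
print(format_gfx_version(1100, "Navi 31"))
```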
|
||||
|
||||
@@ -107,6 +107,13 @@ set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wformat=2 -fno-common -Wstrict-overflow
|
||||
# Intentionally leave out -Wsign-promo. It causes spurious warnings.
|
||||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Woverloaded-virtual -Wreorder")
|
||||
|
||||
# Add CMAKE debug flags
|
||||
if ("${CMAKE_BUILD_TYPE}" STREQUAL Release)
|
||||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O2")
|
||||
else ()
|
||||
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -ggdb -O0 -DDEBUG")
|
||||
endif ()
|
||||
|
||||
set(COMMON_SRC_DIR "${PROJECT_SOURCE_DIR}/src")
|
||||
set(ROCM_SRC_DIR "${PROJECT_SOURCE_DIR}/rocm_smi/src")
|
||||
set(AMDSMI_SRC_DIR "${PROJECT_SOURCE_DIR}/src/amd_smi")
|
||||
|
||||
@@ -156,9 +156,11 @@ do_install_amdsmi_python_lib() {
|
||||
"AMD-SMI python library will not be installed."
|
||||
return
|
||||
fi
|
||||
|
||||
local amdsmi_python_lib_path="/opt/rocm/share/amd_smi"
|
||||
local amdsmi_setup_py_path="/opt/rocm/share/amd_smi/setup.py"
|
||||
|
||||
# install python library at @CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@/amdsmi
|
||||
local python_lib_path=@CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@
|
||||
local amdsmi_python_lib_path="$python_lib_path"
|
||||
local amdsmi_setup_py_path="$python_lib_path/setup.py"
|
||||
|
||||
# Decide installation method based on setuptools version
|
||||
if [[ "$(printf '%s\n' "$setuptools_version" "28.5" | sort -V | head -n1)" == "$setuptools_version" ]]; then
|
||||
|
||||
+4
-2
@@ -157,8 +157,10 @@ do_install_amdsmi_python_lib() {
|
||||
return
|
||||
fi
|
||||
|
||||
local amdsmi_python_lib_path="/opt/rocm/share/amd_smi"
|
||||
local amdsmi_setup_py_path="/opt/rocm/share/amd_smi/setup.py"
|
||||
# install python library at @CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@/amdsmi
|
||||
local python_lib_path=@CPACK_PACKAGING_INSTALL_PREFIX@/@SHARE_INSTALL_PREFIX@
|
||||
local amdsmi_python_lib_path="$python_lib_path"
|
||||
local amdsmi_setup_py_path="$python_lib_path/setup.py"
|
||||
|
||||
# Decide installation method based on setuptools version
|
||||
if [[ "$(printf '%s\n' "$setuptools_version" "28.5" | sort -V | head -n1)" == "$setuptools_version" ]]; then
|
||||
|
||||
@@ -49,11 +49,13 @@ AMDSMI_ERROR_MESSAGES = {
|
||||
31: "Device Not found",
|
||||
32: "Device not initialized",
|
||||
33: "No more free slot",
|
||||
34: "Driver not loaded",
|
||||
# Reserved for future error messages
|
||||
40: "No data was found for given input",
|
||||
41: "Insufficient size for operation",
|
||||
42: "Unexpected size of data was read",
|
||||
43: "The data read or provided was unexpected",
|
||||
54: "AMDGPU restart error",
|
||||
}
|
||||
|
||||
def _get_error_message(error_code):
|
||||
|
||||
@@ -25,6 +25,9 @@ import sys
|
||||
import threading
|
||||
import time
|
||||
import json
|
||||
import multiprocessing
|
||||
import threading
|
||||
import os
|
||||
|
||||
from _version import __version__
|
||||
from amdsmi_helpers import AMDSMIHelpers
|
||||
@@ -3890,10 +3893,10 @@ class AMDSMICommands():
|
||||
args.process_isolation = process_isolation
|
||||
if clk_limit:
|
||||
args.clk_limit = clk_limit
|
||||
|
||||
|
||||
# Handle No GPU passed
|
||||
if args.gpu == None:
|
||||
-raise ValueError('No GPU provided, specific GPU target(s) are needed')
+args.gpu = self.device_handles
|
||||
|
||||
# Handle multiple GPUs
|
||||
handled_multiple_gpus, device_handle = self.helpers.handle_gpus(args, self.logger, self.set_gpu)
|
||||
@@ -3975,13 +3978,101 @@ class AMDSMICommands():
|
||||
raise ValueError(f"Unable to set compute partition to {args.compute_partition} on {gpu_string}") from e
|
||||
self.logger.store_output(args.gpu, 'computepartition', f"Successfully set compute partition to {args.compute_partition}")
|
||||
if args.memory_partition:
|
||||
####################################################################
|
||||
# Get current and available memory partition modes #
|
||||
# Info used if AMDSMI_STATUS_INVAL is caught & to set progress bar #
|
||||
####################################################################
|
||||
try:
|
||||
memory_partition = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu) # this info likely actually comes from different apis than used here
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
memory_partition = "N/A"
|
||||
logging.debug("Failed to get current memory partition for GPU %s | %s", gpu_id, e.get_error_info())
|
||||
try:
|
||||
mem_caps_str = "N/A"
|
||||
partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu)
|
||||
temp_mem_caps = partition_dict['partition_profile']['memory_caps']
|
||||
mem_caps = temp_mem_caps.nps_cap_mask
|
||||
if temp_mem_caps.amdsmi_nps_flags_t == None:
|
||||
mem_caps_list = []
|
||||
if mem_caps & 1 == 1:
|
||||
mem_caps_list.append("NPS1")
|
||||
if mem_caps & 2 == 2:
|
||||
mem_caps_list.append("NPS2")
|
||||
if mem_caps & 4 == 4:
|
||||
mem_caps_list.append("NPS4")
|
||||
if mem_caps & 8 == 8:
|
||||
mem_caps_list.append("NPS8")
|
||||
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
|
||||
else:
|
||||
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
|
||||
mem_caps_list = []
|
||||
if mem_caps.nps1_cap == 1:
|
||||
mem_caps_list.append("NPS1")
|
||||
if mem_caps.nps2_cap == 1:
|
||||
mem_caps_list.append("NPS2")
|
||||
if mem_caps.nps4_cap == 1:
|
||||
mem_caps_list.append("NPS4")
|
||||
if mem_caps.nps8_cap == 1:
|
||||
mem_caps_list.append("NPS8")
|
||||
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
|
||||
if mem_caps_str == "":
|
||||
mem_caps_str = "N/A"
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info())
|
||||
memory_dict = {'caps': mem_caps_str, 'current': memory_partition}
|
||||
|
||||
###############################################################
|
||||
# memory partition set starts here #
|
||||
###############################################################
|
||||
showProgressBar = False
|
||||
if ((str(memory_dict['current']) != "N/A") and (str(args.memory_partition) in mem_caps_str)
|
||||
and ((str(memory_dict['current']) != str(args.memory_partition)))):
|
||||
showProgressBar = True # Only show progress bar if
|
||||
# 1) Device can set memory partition modes
|
||||
# 2) Requested mode is a valid mode to set
|
||||
# 3) Current is not already the requested mode
|
||||
# otherwise function will return fast
|
||||
threads = []
|
||||
kTimeWait = 40
|
||||
self.helpers.increment_set_count()
|
||||
set_count = self.helpers.get_set_count()
|
||||
if set_count == 1: # only show reload warning on 1st set
|
||||
self.helpers.confirm_changing_memory_partition_gpu_reload_warning()
|
||||
memory_partition = amdsmi_interface.AmdSmiMemoryPartitionType[args.memory_partition]
|
||||
try:
|
||||
if set_count == 1 and showProgressBar: # only show reload warning on 1st set
|
||||
string_out = f"Updating memory partition for gpu {gpu_id}"
|
||||
t1 = multiprocessing.Process(target=self.helpers.showProgressbar,
|
||||
args=(string_out, kTimeWait,))
|
||||
threads.append(t1)
|
||||
t1.start()
|
||||
amdsmi_interface.amdsmi_set_gpu_memory_partition(args.gpu, memory_partition)
|
||||
for thread in threads:
|
||||
thread.terminate()
|
||||
thread.join()
|
||||
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
f = open(os.devnull, 'w') #redirect to /dev/null (crossplatform)
|
||||
print("\n\n", end='\r', flush=True, file=f)
|
||||
for thread in threads:
|
||||
thread.join()
|
||||
thread.terminate()
|
||||
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
|
||||
raise PermissionError('Command requires elevation') from e
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_INVAL:
|
||||
print(f"[amdsmi_wrapper.AMDSMI_STATUS_INVAL] Unable to set memory partition to {args.memory_partition} on {gpu_string}")
|
||||
print(f"Valid Memory partition Modes: {mem_caps_str}\n")
|
||||
# fall through for value error
|
||||
|
||||
f = open(os.devnull, 'w') #redirect to /dev/null (crossplatform)
|
||||
print("\n\n", end='\r', flush=True, file=f)
|
||||
raise ValueError(f"Unable to set memory partition to {args.memory_partition} on {gpu_string}") from e
|
||||
except Exception as e:
|
||||
for thread in threads:
|
||||
thread.join()
|
||||
thread.terminate()
|
||||
raise ValueError(f"Generic error found | Unable to set memory partition to {args.memory_partition} on {gpu_string}") from e
|
||||
self.logger.store_output(args.gpu, 'memorypartition', f"Successfully set memory partition to {args.memory_partition}")
|
||||
if isinstance(args.power_cap, int):
|
||||
try:
|
||||
@@ -4226,7 +4317,7 @@ class AMDSMICommands():
|
||||
self.set_core(args, multiple_devices, core, core_boost_limit)
|
||||
elif self.helpers.is_amdgpu_initialized(): # Only GPU is initialized
|
||||
if args.gpu == None:
|
||||
-raise ValueError('No GPU provided, specific GPU target(s) are needed')
+args.gpu = self.device_handles
|
||||
self.logger.clear_multiple_devices_ouput()
|
||||
self.set_gpu(args, multiple_devices, gpu, fan, perf_level,
|
||||
profile, perf_determinism, compute_partition,
|
||||
@@ -4281,7 +4372,7 @@ class AMDSMICommands():
|
||||
|
||||
# Handle No GPU passed
|
||||
if args.gpu == None:
|
||||
-raise ValueError('No GPU provided, specific GPU target(s) are needed')
+args.gpu = self.device_handles
|
||||
|
||||
# Handle multiple GPUs
|
||||
handled_multiple_gpus, device_handle = self.helpers.handle_gpus(args, self.logger, self.reset)
|
||||
@@ -5299,7 +5390,7 @@ class AMDSMICommands():
|
||||
mem_caps_list.append("NPS4")
|
||||
if mem_caps.nps8_cap == 1:
|
||||
mem_caps_list.append("NPS8")
|
||||
-mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
+mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
|
||||
if mem_caps_str == "":
|
||||
mem_caps_str = "N/A"
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
@@ -5350,7 +5441,7 @@ class AMDSMICommands():
|
||||
mem_caps_list.append("NPS4")
|
||||
if mem_caps & 8 == 8:
|
||||
mem_caps_list.append("NPS8")
|
||||
-mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
+mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
|
||||
else:
|
||||
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
|
||||
mem_caps_list = []
|
||||
@@ -5362,7 +5453,7 @@ class AMDSMICommands():
|
||||
mem_caps_list.append("NPS4")
|
||||
if mem_caps.nps8_cap == 1:
|
||||
mem_caps_list.append("NPS8")
|
||||
-mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
+mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "").replace("\'", "")
|
||||
if mem_caps_str == "":
|
||||
mem_caps_str = "N/A"
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
|
||||
@@ -27,6 +27,7 @@ import platform
|
||||
import sys
|
||||
import time
|
||||
import re
|
||||
import multiprocessing
|
||||
|
||||
from typing import List, Union
|
||||
from enum import Enum
|
||||
@@ -54,6 +55,7 @@ class AMDSMIHelpers():
|
||||
|
||||
self._is_linux = False
|
||||
self._is_windows = False
|
||||
self._count_of_sets_called = 0
|
||||
|
||||
if self.operating_system.startswith("Linux"):
|
||||
self._is_linux = True
|
||||
@@ -77,6 +79,11 @@ class AMDSMIHelpers():
|
||||
self._is_virtual_os = False
|
||||
self._is_passthrough = True
|
||||
|
||||
def increment_set_count(self):
|
||||
self._count_of_sets_called += 1
|
||||
|
||||
def get_set_count(self):
|
||||
return self._count_of_sets_called
|
||||
|
||||
def is_virtual_os(self):
|
||||
return self._is_virtual_os
|
||||
@@ -740,6 +747,30 @@ class AMDSMIHelpers():
|
||||
else:
|
||||
sys.exit('Confirmation not given. Exiting without setting value')
|
||||
|
||||
def confirm_changing_memory_partition_gpu_reload_warning(self, auto_respond=False):
|
||||
""" Print the warning for running outside of specification and prompt user to accept the terms.
|
||||
|
||||
:param auto_respond: Response to automatically provide for all prompts
|
||||
"""
|
||||
print('''
|
||||
****** WARNING ******\n
|
||||
Setting Dynamic Memory (NPS) partition modes require users to quit all GPU workloads.
|
||||
AMD SMI will then attempt to change memory (NPS) partition mode.
|
||||
Upon a successful set, AMD SMI will then initiate an action to restart amdgpu driver.
|
||||
This action will change all GPU's in the hive to the requested memory (NPS) partition mode.
|
||||
|
||||
Please use this utility with caution.
|
||||
''')
|
||||
if not auto_respond:
|
||||
user_input = input('Do you accept these terms? [Y/N] ')
|
||||
else:
|
||||
user_input = auto_respond
|
||||
if user_input in ['Yes', 'yes', 'y', 'Y', 'YES']:
|
||||
print('')
|
||||
return
|
||||
else:
|
||||
print('Confirmation not given. Exiting without setting value')
|
||||
sys.exit(1)
|
||||
|
||||
def is_valid_profile(self, profile):
|
||||
profile_presets = amdsmi_interface.amdsmi_wrapper.amdsmi_power_profile_preset_masks_t__enumvalues
|
||||
@@ -818,3 +849,21 @@ class AMDSMIHelpers():
|
||||
except Exception as _:
|
||||
continue
|
||||
return pci_devices
|
||||
|
||||
def progressbar(self, it, prefix="", size=60, out=sys.stdout):
|
||||
count = len(it)
|
||||
def show(j):
|
||||
x = int(size*j/count)
|
||||
print("{}[{}{}] {}/{} secs remain".format(prefix, u"█"*x, "."*(size-x), j, count),
|
||||
end='\r', file=out, flush=True)
|
||||
show(0)
|
||||
for i, item in enumerate(it):
|
||||
yield item
|
||||
show(i+1)
|
||||
print("\n\n", end='\r', flush=True, file=out)
|
||||
|
||||
def showProgressbar(self, title="", timeInSeconds=13):
|
||||
if title != "":
|
||||
title += ": "
|
||||
for i in self.progressbar(range(timeInSeconds), title, 40):
|
||||
time.sleep(1)
|
||||
|
||||
@@ -26,6 +26,7 @@ import atexit
|
||||
import logging
|
||||
import signal
|
||||
import sys
|
||||
import os
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
@@ -134,8 +135,11 @@ def amdsmi_cli_shutdown():
|
||||
|
||||
def signal_handler(sig, frame):
|
||||
logging.debug(f"Handling signal: {sig}")
|
||||
sys.exit(0)
|
||||
|
||||
try:
|
||||
sys.exit(0)
|
||||
except Exception as e:
|
||||
logging.error("Unable to cleanly shut down amd-smi-lib, exception: %s", str(type(e).__name__))
|
||||
os._exit(0)
|
||||
|
||||
if not AMDSMI_INITIALIZED:
|
||||
AMDSMI_INIT_FLAG = amdsmi_cli_init()
|
||||
|
||||
@@ -1032,7 +1032,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
|
||||
# Subparser help text
|
||||
set_value_help = "Set options for devices"
|
||||
-set_value_subcommand_help = "A GPU must be specified to set a configuration.\
+set_value_subcommand_help = "If no GPU is specified, will select all GPUs on the system.\
|
||||
\nA set argument must be provided; Multiple set arguments are accepted"
|
||||
set_value_optionals_title = "Set Arguments"
|
||||
|
||||
@@ -1073,8 +1073,8 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
set_value_parser.formatter_class=lambda prog: AMDSMISubparserHelpFormatter(prog)
|
||||
set_value_parser.set_defaults(func=func)
|
||||
|
||||
-# Device args are required as safeguard from the user applying the operation to all gpus unintentionally
-self._add_device_arguments(set_value_parser, required=True)
+# Providing no -g 0 or -g all, is not required
+self._add_device_arguments(set_value_parser, required=False)
|
||||
|
||||
if self.helpers.is_amdgpu_initialized():
|
||||
if self.helpers.is_baremetal():
|
||||
@@ -1126,7 +1126,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
|
||||
# Subparser help text
|
||||
reset_help = "Reset options for devices"
|
||||
-reset_subcommand_help = "A GPU must be specified to reset a configuration.\
+reset_subcommand_help = "If no GPU is specified, will select all GPUs on the system.\
|
||||
\nA reset argument must be provided; Multiple reset arguments are accepted"
|
||||
reset_optionals_title = "Reset Arguments"
|
||||
|
||||
@@ -1148,8 +1148,8 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
|
||||
# Add Universal Arguments
|
||||
self._add_command_modifiers(reset_parser)
|
||||
-# Device args are required as safeguard from the user applying the operation to all gpus unintentionally
-self._add_device_arguments(reset_parser, required=True)
+# Providing no -g 0 or -g all, is not required
+self._add_device_arguments(reset_parser, required=False)
|
||||
|
||||
if self.helpers.is_baremetal():
|
||||
# Add Baremetal reset arguments
|
||||
|
||||
@@ -507,8 +507,8 @@ usage: amd-smi set [-h] (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...
|
||||
[--core-boost-limit BOOST_LIMIT] [--json | --csv] [--file FILE]
|
||||
[--loglevel LEVEL]
|
||||
|
||||
-A GPU must be specified to set a configuration.
-A set argument must be provided; Multiple set arguments are accepted.
+If no GPU is specified, will select all GPUs on the system.
+A set argument must be provided; Multiple set arguments are accepted
|
||||
|
||||
Set Arguments:
|
||||
-h, --help show this help message and exit
|
||||
@@ -578,7 +578,7 @@ usage: amd-smi reset [-h] [--json | --csv] [--file FILE] [--loglevel LEVEL]
|
||||
(-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-G] [-c]
|
||||
[-f] [-p] [-x] [-d] [-C] [-M] [-o] [-l]
|
||||
|
||||
-A GPU must be specified to reset a configuration.
+If no GPU is specified, will select all GPUs on the system.
|
||||
A reset argument must be provided; Multiple reset arguments are accepted
|
||||
|
||||
Reset Arguments:
|
||||
|
||||
@@ -24,6 +24,7 @@
|
||||
|
||||
#include <limits>
|
||||
#include <type_traits>
|
||||
#include <string>
|
||||
|
||||
#include "amd_smi/amdsmi.h"
|
||||
#include "amd_smi/impl/amd_smi_gpu_device.h"
|
||||
@@ -48,6 +49,8 @@ amdsmi_status_t smi_amdgpu_get_driver_version(amd::smi::AMDSmiGPUDevice* device,
|
||||
amdsmi_status_t smi_amdgpu_get_pcie_speed_from_pcie_type(uint16_t pcie_type, uint32_t *pcie_speed);
|
||||
amdsmi_status_t smi_amdgpu_get_market_name_from_dev_id(uint32_t device_id, char *market_name);
|
||||
amdsmi_status_t smi_amdgpu_is_gpu_power_management_enabled(amd::smi::AMDSmiGPUDevice* device, bool *enabled);
|
||||
std::string smi_split_string(std::string str, char delim);
|
||||
std::string smi_amdgpu_get_status_string(amdsmi_status_t ret, bool fullStatus);
|
||||
|
||||
|
||||
template<typename>
|
||||
|
||||
@@ -88,6 +88,7 @@ class AmdSmiLibraryException(AmdSmiException):
|
||||
amdsmi_wrapper.AMDSMI_STATUS_FILE_NOT_FOUND : "AMDSMI_STATUS_FILE_NOT_FOUND - File or directory not found",
|
||||
amdsmi_wrapper.AMDSMI_STATUS_ARG_PTR_NULL : "AMDSMI_STATUS_ARG_PTR_NULL - Parsed argument is invalid",
|
||||
amdsmi_wrapper.AMDSMI_STATUS_MAP_ERROR : "AMDSMI_STATUS_MAP_ERROR - The internal library error did not map to a status code",
|
||||
amdsmi_wrapper.AMDSMI_STATUS_AMDGPU_RESTART_ERR: "AMDSMI_STATUS_AMDGPU_RESTART_ERR - AMDGPU restart failed, please check dmesg for errors",
|
||||
amdsmi_wrapper.AMDSMI_STATUS_UNKNOWN_ERROR : "AMDSMI_STATUS_UNKNOWN_ERROR - An unknown error occurred"
|
||||
}
|
||||
|
||||
|
||||
@@ -1653,8 +1653,13 @@ def amdsmi_get_gpu_asic_info(
|
||||
processor_handle, ctypes.byref(asic_info_struct))
|
||||
)
|
||||
|
||||
market_name = _pad_hex_value(asic_info_struct.market_name.decode("utf-8"), 4)
|
||||
target_graphics_version = str(asic_info_struct.target_graphics_version)
|
||||
if len(target_graphics_version) == 4 and ("Instinct MI2" in market_name):
|
||||
hex_part = str(hex(int(str(asic_info_struct.target_graphics_version)[2:]))).replace("0x", "")
|
||||
target_graphics_version = str(asic_info_struct.target_graphics_version)[:2] + hex_part
|
||||
asic_info = {
|
||||
"market_name": _pad_hex_value(asic_info_struct.market_name.decode("utf-8"), 4),
|
||||
"market_name": market_name,
|
||||
"vendor_id": asic_info_struct.vendor_id,
|
||||
"vendor_name": asic_info_struct.vendor_name.decode("utf-8"),
|
||||
"subvendor_id": asic_info_struct.subvendor_id,
|
||||
@@ -1663,7 +1668,7 @@ def amdsmi_get_gpu_asic_info(
|
||||
"asic_serial": asic_info_struct.asic_serial.decode("utf-8"),
|
||||
"oam_id": asic_info_struct.oam_id,
|
||||
"num_compute_units": asic_info_struct.num_of_compute_units,
|
||||
"target_graphics_version": "gfx" + str(asic_info_struct.target_graphics_version)
|
||||
"target_graphics_version": "gfx" + target_graphics_version
|
||||
}
|
||||
|
||||
string_values = ["market_name", "vendor_name"]
|
||||
|
||||
@@ -987,12 +987,12 @@ struct_amdsmi_accelerator_partition_profile_t._pack_ = 1 # source:False
|
||||
struct_amdsmi_accelerator_partition_profile_t._fields_ = [
|
||||
('profile_type', amdsmi_accelerator_partition_type_t),
|
||||
('num_partitions', ctypes.c_uint32),
|
||||
('profile_index', ctypes.c_uint32),
|
||||
('memory_caps', amdsmi_nps_caps_t),
|
||||
('profile_index', ctypes.c_uint32),
|
||||
('num_resources', ctypes.c_uint32),
|
||||
('resources', ctypes.c_uint32 * 32 * 8),
|
||||
('PADDING_0', ctypes.c_ubyte * 4),
|
||||
('reserved', ctypes.c_uint64 * 6),
|
||||
('reserved', ctypes.c_uint64 * 13),
|
||||
]
|
||||
|
||||
amdsmi_accelerator_partition_profile_t = struct_amdsmi_accelerator_partition_profile_t
|
||||
|
||||
@@ -652,11 +652,6 @@ static rsmi_status_t test_set_compute_partitioning(uint32_t dv_ind) {
|
||||
std::cout << "\n" << "\n";
|
||||
}
|
||||
|
||||
std::cout << "About to initate compute partition reset..." << "\n";
|
||||
ret = rsmi_dev_compute_partition_reset(dv_ind);
|
||||
CHK_RSMI_NOT_SUPPORTED_RET(ret)
|
||||
std::cout << "Done resetting compute partition." << "\n";
|
||||
|
||||
std::string myComputePartition = originalComputePartition;
|
||||
if (myComputePartition.empty() == false) {
|
||||
std::cout << "Resetting back to original compute partition to "
|
||||
@@ -709,11 +704,6 @@ static rsmi_status_t test_set_memory_partition(uint32_t dv_ind) {
|
||||
<< "." << "\n\n\n";
|
||||
}
|
||||
|
||||
std::cout << "About to initate memory partition reset...\n";
|
||||
ret = rsmi_dev_memory_partition_reset(dv_ind);
|
||||
CHK_RSMI_NOT_SUPPORTED_RET(ret)
|
||||
std::cout << "Done resetting memory partition.\n";
|
||||
|
||||
std::string myMemPart = originalMemoryPartition;
|
||||
if (myMemPart.empty() == false) {
|
||||
std::cout << "Resetting memory partition to " << originalMemoryPartition
|
||||
|
||||
@@ -4596,25 +4596,6 @@ rsmi_status_t
|
||||
rsmi_dev_compute_partition_set(uint32_t dv_ind,
|
||||
rsmi_compute_partition_type_t compute_partition);
|
||||
|
||||
/**
|
||||
* @brief Reverts a selected device's compute partition setting back to its
|
||||
* boot state.
|
||||
*
|
||||
* @details Given a device index @p dv_ind , this function will attempt to
|
||||
* revert its compute partition setting back to its boot state.
|
||||
*
|
||||
* @param[in] dv_ind a device index
|
||||
*
|
||||
* @retval ::RSMI_STATUS_SUCCESS call was successful
|
||||
* @retval ::RSMI_STATUS_PERMISSION function requires root access
|
||||
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
|
||||
* support this function
|
||||
* @retval ::RSMI_STATUS_BUSY A resource or mutex could not be acquired
|
||||
* because it is already being used - device is busy
|
||||
*
|
||||
*/
|
||||
rsmi_status_t rsmi_dev_compute_partition_reset(uint32_t dv_ind);
|
||||
|
||||
/**
|
||||
* @brief Retrieves the partition_id for a desired device
|
||||
*
|
||||
@@ -4680,6 +4661,39 @@ rsmi_status_t
|
||||
rsmi_dev_memory_partition_get(uint32_t dv_ind, char *memory_partition,
|
||||
uint32_t len);
|
||||
|
||||
/**
|
||||
* @brief Retrieves the available memory partition capabilities
|
||||
* for a desired device
|
||||
*
|
||||
* @details
|
||||
* Given a device index @p dv_ind and a string @p memory_partition_caps ,
|
||||
* and uint32 @p len , this function will attempt to obtain the device's
|
||||
* available memory partition capabilities string. Upon successful
|
||||
* retrieval, the obtained device's available memory partition capabilities
|
||||
* string shall be stored in the passed @p memory_partition_caps
|
||||
* char string variable.
|
||||
*
|
||||
* @param[in] dv_ind a device index
|
||||
*
|
||||
* @param[inout] memory_partition_caps a pointer to a char string variable,
|
||||
* which the device's available memory partition capabilities will be written to.
|
||||
*
|
||||
* @param[in] len the length of the caller provided buffer @p memory_partition_caps ,
|
||||
* suggested length is 30 or greater.
|
||||
*
|
||||
* @retval ::RSMI_STATUS_SUCCESS call was successful
|
||||
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
|
||||
* @retval ::RSMI_STATUS_UNEXPECTED_DATA data provided to function is not valid
|
||||
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
|
||||
* support this function
|
||||
* @retval ::RSMI_STATUS_INSUFFICIENT_SIZE is returned if @p len bytes is not
|
||||
* large enough to hold the entire memory partition value. In this case,
|
||||
* only @p len bytes will be written.
|
||||
*
|
||||
*/
|
||||
rsmi_status_t rsmi_dev_memory_partition_capabilities_get(
|
||||
uint32_t dv_ind, char *memory_partition_caps, uint32_t len);
|
||||
|
||||
/**
 * @brief Modifies a selected device's current memory partition setting.
 *
@@ -4707,27 +4721,6 @@ rsmi_status_t
rsmi_dev_memory_partition_set(uint32_t dv_ind,
                              rsmi_memory_partition_type_t memory_partition);

/**
 * @brief Reverts a selected device's memory partition setting back to its
 * boot state.
 *
 * @details Given a device index @p dv_ind , this function will attempt to
 * revert its current memory partition setting back to its boot state.
 *
 * @param[in] dv_ind a device index
 *
 * @retval ::RSMI_STATUS_SUCCESS call was successful
 * @retval ::RSMI_STATUS_PERMISSION function requires root access
 * @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
 * support this function
 * @retval ::RSMI_STATUS_AMDGPU_RESTART_ERR could not successfully restart
 * the amdgpu driver
 * @retval ::RSMI_STATUS_BUSY A resource or mutex could not be acquired
 * because it is already being used - device is busy
 *
 */
rsmi_status_t rsmi_dev_memory_partition_reset(uint32_t dv_ind);

/** @} */ // end of memory_partition

/*****************************************************************************/

@@ -182,6 +182,7 @@ enum DevInfoTypes {
  kDevAvailableComputePartition,
  kDevComputePartition,
  kDevMemoryPartition,
  kDevAvailableMemoryPartition,

  // The information read from pci core sysfs
  kDevPCieTypeStart = 1000,
@@ -245,6 +246,8 @@ class Device {
  bool DeviceAPISupported(std::string name, uint64_t variant,
                          uint64_t sub_variant);
  rsmi_status_t restartAMDGpuDriver(void);
  rsmi_status_t isRestartInProgress(bool *isRestartInProgress,
                                    bool *isAMDGPUModuleLive);
  rsmi_status_t storeDevicePartitions(uint32_t dv_ind);
  template <typename T> std::string readBootPartitionState(uint32_t dv_ind);
  rsmi_status_t check_amdgpu_property_reinforcement_query(uint32_t dev_idx, AMDGpuVerbTypes_t verb_type);

@@ -92,7 +92,8 @@ std::pair<bool, std::string> executeCommand(std::string command,
rsmi_status_t storeTmpFile(uint32_t dv_ind, std::string parameterName,
                           std::string stateName, std::string storageData);
std::vector<std::string> getListOfAppTmpFiles();
bool containsString(std::string originalString, std::string substring);
bool containsString(std::string originalString, std::string substring,
                    bool displayComparisons = false);
std::tuple<bool, std::string> readTmpFile(
    uint32_t dv_ind,
    std::string stateName,
@@ -141,6 +142,8 @@ std::string removeNewLines(const std::string &s);

std::string removeString(const std::string origStr,
                         const std::string &removeMe);
void system_wait(int milli_seconds);
int countDigit(uint64_t n);
template <typename T>
std::string print_int_as_hex(T i, bool showHexNotation = true,
                             int overloadBitSize = 0) {

@@ -5729,12 +5729,22 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
  LOG_TRACE(ss);
  REQUIRE_ROOT_ACCESS
  DEVICE_MUTEX
  const uint32_t kMaxBoardLength = 128;
  bool isCorrectDevice = false;
  char boardName[128];
  char boardName[kMaxBoardLength];
  boardName[0] = '\0';

  const uint32_t kMaxMemoryCapabilitiesSize = 30;
  char available_memory_capabilities[kMaxMemoryCapabilitiesSize];
  available_memory_capabilities[0] = '\0';

  const uint32_t kMaxCurrentMemoryMode = 5;
  char current_memory_mode[kMaxCurrentMemoryMode];
  current_memory_mode[0] = '\0';

  // rsmi_dev_memory_partition_set is only available for discrete variant,
  // others are required to update through bios settings
  rsmi_dev_name_get(dv_ind, boardName, 128);
  rsmi_dev_name_get(dv_ind, boardName, static_cast<size_t>(kMaxBoardLength));
  std::string myBoardName = boardName;
  if (!myBoardName.empty()) {
    std::transform(myBoardName.begin(), myBoardName.end(), myBoardName.begin(),
@@ -5747,18 +5757,19 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,

  if (!isCorrectDevice) {
    ss << __PRETTY_FUNCTION__
    << " | ======= end ======= "
    << " | Fail "
    << " | Device #: " << dv_ind
    << " | Type: "
    << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
    << " | Cause: device board name does not support this action"
    << " | Returning = "
    << getRSMIStatusString(RSMI_STATUS_NOT_SUPPORTED) << " |";
       << " | ======= end ======= "
       << " | Fail "
       << " | Device #: " << dv_ind
       << " | Type: "
       << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
       << " | Cause: device board name does not support this action"
       << " | Returning = "
       << getRSMIStatusString(RSMI_STATUS_NOT_SUPPORTED, false);
    LOG_ERROR(ss);
    return RSMI_STATUS_NOT_SUPPORTED;
  }

  // Is the current mode already what user requested?
  switch (memory_partition) {
    case RSMI_MEMORY_PARTITION_NPS1:
    case RSMI_MEMORY_PARTITION_NPS2:
@@ -5775,7 +5786,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
         << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
         << " | Cause: requested setting was invalid"
         << " | Returning = "
         << getRSMIStatusString(RSMI_STATUS_INVALID_ARGS) << " |";
         << getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
      LOG_ERROR(ss);
      return RSMI_STATUS_INVALID_ARGS;
  }
@@ -5797,7 +5808,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
       << " | Cause: could not retrieve current memory partition or retrieved"
       << " unexpected data"
       << " | Returning = "
       << getRSMIStatusString(ret_get) << " |";
       << getRSMIStatusString(ret_get, false);
    LOG_ERROR(ss);
    return ret_get;
  }
@@ -5813,11 +5824,52 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
       << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
       << " | Data: " << newMemoryPartition
       << " | Returning = "
       << getRSMIStatusString(RSMI_STATUS_SUCCESS) << " |";
       << getRSMIStatusString(RSMI_STATUS_SUCCESS, false);
    LOG_TRACE(ss);
    return RSMI_STATUS_SUCCESS;
  }

  // is this an available mode to set to?
  std::string memory_capabilities_str = "unknown";
  std::string user_requested_memory_partition = newMemoryPartition;
  std::transform(user_requested_memory_partition.begin(), user_requested_memory_partition.end(),
                 user_requested_memory_partition.begin(), ::toupper);
  rsmi_status_t caps_ret = rsmi_dev_memory_partition_capabilities_get(dv_ind,
      available_memory_capabilities, kMaxMemoryCapabilitiesSize);
  memory_capabilities_str = available_memory_capabilities;
  std::transform(memory_capabilities_str.begin(), memory_capabilities_str.end(),
                 memory_capabilities_str.begin(), ::toupper);
  ss << __PRETTY_FUNCTION__ << " | user_requested_memory_partition: "
     << user_requested_memory_partition
     << "; memory_capabilities_str: " << memory_capabilities_str
     << "; rsmi_dev_memory_partition_capabilities_get(" << dv_ind
     << ", " << user_requested_memory_partition << "): return = "
     << amd::smi::getRSMIStatusString(caps_ret, false);
  LOG_DEBUG(ss);
  if ((caps_ret == RSMI_STATUS_SUCCESS)
      && (!memory_capabilities_str.empty())
      && (!user_requested_memory_partition.empty())) {
    bool is_available_mode = amd::smi::containsString(memory_capabilities_str,
        user_requested_memory_partition, true);
    ss << __PRETTY_FUNCTION__
       << " | is_available_mode: " << (is_available_mode ? "True": "False");
    LOG_DEBUG(ss);
    if (is_available_mode == false) { // report RSMI_STATUS_INVALID_ARGS
      ss << __PRETTY_FUNCTION__
         << " | ======= Check if available mode ======= "
         << " | WARNING: detected invalid mode to set to, will try to set anyways"
         << " | Device #: " << dv_ind
         << " | Type: "
         << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
         << " | Data (user requested mode): " << user_requested_memory_partition
         << " | Available Memory Partition Modes: " << memory_capabilities_str
         << " | Cause: requested setting was not an available mode"
         << " | Returning = "
         << getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
      LOG_INFO(ss);
    }
  }

  GET_DEV_FROM_INDX
  int ret = dev->writeDevInfo(amd::smi::kDevMemoryPartition,
                              newMemoryPartition);
@@ -5835,7 +5887,7 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
       << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
       << " | Cause: issue writing requested setting of " + newMemoryPartition
       << " | Returning = "
       << getRSMIStatusString(err) << " |";
       << getRSMIStatusString(err, false);
    LOG_ERROR(ss);
    return err;
  }
@@ -5849,8 +5901,76 @@ rsmi_dev_memory_partition_set(uint32_t dv_ind,
     << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
     << " | Data: " << newMemoryPartition
     << " | Returning = "
     << getRSMIStatusString(restartRet) << " |";
     << getRSMIStatusString(restartRet, false);
  LOG_TRACE(ss);
  if (restartRet != RSMI_STATUS_SUCCESS) {
    ss << __PRETTY_FUNCTION__
       << " | ======= end ======= "
       << " | Fail - restart AMD GPU detected"
       << " | Device #: " << dv_ind
       << " | Type: "
       << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
       << " | Cause: issue writing requested setting of " + newMemoryPartition
       << " | Returning = "
       << getRSMIStatusString(restartRet, false);
    LOG_ERROR(ss);
    return restartRet;
  }

  std::string current_memory_mode_str = "unknown";
  rsmi_status_t can_read_sysfs_again = RSMI_STATUS_AMDGPU_RESTART_ERR;
  int maxWaitSeconds = 10;
  const int k1000_MS_WAIT = 1000;
  // wait until we can read SYSFS again
  if (restartRet == RSMI_STATUS_SUCCESS) {
    while (current_memory_mode_str != user_requested_memory_partition) {
      maxWaitSeconds -= 1;
      can_read_sysfs_again =
        rsmi_dev_memory_partition_get(dv_ind, current_memory_mode, kMaxCurrentMemoryMode);
      if (can_read_sysfs_again == RSMI_STATUS_SUCCESS) {
        current_memory_mode_str.clear();
        current_memory_mode_str = current_memory_mode;
        ss << __PRETTY_FUNCTION__
           << " | ======= rsmi_dev_memory_partition_get ======= "
           << " | Success - can read SYSFS"
           << " | Device #: " << dv_ind
           << " | Type: "
           << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
           << " | Data (user requested mode): " << user_requested_memory_partition
           << " | Current Memory Partition Mode: " << current_memory_mode_str
           << " | Available Memory Partition Modes: " << memory_capabilities_str
           << " | total wait time (sec): " << (10 - maxWaitSeconds)
           << " | Returning = "
           << getRSMIStatusString(can_read_sysfs_again, false);
        LOG_TRACE(ss);
        if (!current_memory_mode_str.empty()
            && (current_memory_mode_str == user_requested_memory_partition)) {
          break;
        }
      }
      amd::smi::system_wait(k1000_MS_WAIT);
    }
  }

  if (current_memory_mode_str == user_requested_memory_partition) {
    restartRet = RSMI_STATUS_SUCCESS;
  } else {
    restartRet = RSMI_STATUS_AMDGPU_RESTART_ERR;
  }

  ss << __PRETTY_FUNCTION__
     << " | ======= end ======= "
     << " | Success - completed driver restart and all SYSFS are active"
     << " | Device #: " << dv_ind
     << " | Type: "
     << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
     << " | Data: " << user_requested_memory_partition
     << " | Current Memory Partition Mode: " << current_memory_mode_str
     << " | Available Memory Partition Modes: " << memory_capabilities_str
     << " | Returning = "
     << getRSMIStatusString(restartRet, false);
  LOG_TRACE(ss);

  return restartRet;
  CATCH
}
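The wait loop in this hunk re-reads the current memory partition once per second until it matches the requested mode. Below is a minimal sketch of the same polling idea with an explicit retry budget, so the caller always gets an answer after a bounded number of attempts. The helper name `poll_until_match`, its callback parameter, and the attempt/delay values are hypothetical and are not part of rocm_smi_lib.

```cpp
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical helper (not part of rocm_smi_lib): calls read_mode() up to
// max_attempts times, sleeping `delay` between attempts, and returns true
// as soon as the value read equals `expected`.
bool poll_until_match(const std::function<bool(std::string *)> &read_mode,
                      const std::string &expected, int max_attempts,
                      std::chrono::milliseconds delay) {
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    std::string current;
    // read_mode returns false while the value cannot be read yet
    // (for example, while sysfs files are still missing after a reload).
    if (read_mode(&current) && current == expected) {
      return true;
    }
    // Do not sleep after the final attempt.
    if (attempt + 1 < max_attempts) {
      std::this_thread::sleep_for(delay);
    }
  }
  return false;  // budget exhausted without observing the expected mode
}
```

In the context of this commit, `read_mode` would wrap a call such as `rsmi_dev_memory_partition_get`, and a `false` result would presumably map to `RSMI_STATUS_AMDGPU_RESTART_ERR`; that mapping is an assumption of this sketch.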
@@ -5927,79 +6047,73 @@ rsmi_dev_memory_partition_get(uint32_t dv_ind, char *memory_partition,
  CATCH
}

rsmi_status_t rsmi_dev_compute_partition_reset(uint32_t dv_ind) {
rsmi_status_t rsmi_dev_memory_partition_capabilities_get(
    uint32_t dv_ind, char *memory_partition_caps, uint32_t len) {
  TRY
  std::ostringstream ss;
  ss << __PRETTY_FUNCTION__ << " | ======= start =======, " << dv_ind;
  LOG_TRACE(ss);
  REQUIRE_ROOT_ACCESS

  if ((len == 0) || (memory_partition_caps == nullptr)) {
    ss << __PRETTY_FUNCTION__
       << " | ======= end ======= "
       << " | Fail "
       << " | Device #: " << dv_ind
       << " | Type: "
       << amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
       << " | Cause: user sent invalid arguments, len = 0 or memory_partition_caps"
       << " was a null ptr"
       << " | Returning = "
       << getRSMIStatusString(RSMI_STATUS_INVALID_ARGS, false);
    LOG_ERROR(ss);
    return RSMI_STATUS_INVALID_ARGS;
  }
  CHK_SUPPORT_NAME_ONLY(memory_partition_caps)
  DEVICE_MUTEX
  GET_DEV_FROM_INDX
  rsmi_status_t ret = RSMI_STATUS_NOT_SUPPORTED;

  // Only use 1st index, rest are there in-case of future issues
  // NOTE: Partitions sets cause rocm-smi indexes to fluctuate
  // since the nodes are grouped in respect to primary node - why we only use
  // 1st node/device id to reset
  std::string bootState =
    dev->readBootPartitionState<rsmi_compute_partition_type_t>(0);
  std::string availableMemoryPartitions;
  rsmi_status_t ret =
      get_dev_value_line(amd::smi::kDevAvailableMemoryPartition,
                         dv_ind, &availableMemoryPartitions);
  if (ret != RSMI_STATUS_SUCCESS) {
    ss << __PRETTY_FUNCTION__
       << " | ======= end ======= "
       << " | FAIL "
       << " | Device #: " << dv_ind
       << " | Type: "
       << amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
       << " | Data: could not retrieve requested data"
       << " | Returning = "
       << getRSMIStatusString(ret, false);
    LOG_ERROR(ss);
    return ret;
  }

  // Initiate reset
  // If bootState is UNKNOWN, we cannot reset - return RSMI_STATUS_NOT_SUPPORTED
  // Likely due to device not supporting it
  if (bootState != "UNKNOWN") {
    rsmi_compute_partition_type_t compute_partition =
      mapStringToRSMIComputePartitionTypes.at(bootState);
    ret = rsmi_dev_compute_partition_set(dv_ind, compute_partition);
  std::size_t length = availableMemoryPartitions.copy(memory_partition_caps, len-1);
  memory_partition_caps[length]='\0';

  if (len < (availableMemoryPartitions.size() + 1)) {
    ss << __PRETTY_FUNCTION__
       << " | ======= end ======= "
       << " | Fail "
       << " | Device #: " << dv_ind
       << " | Type: "
       << amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
       << " | Cause: requested size was insufficient"
       << " | Returning = "
       << getRSMIStatusString(RSMI_STATUS_INSUFFICIENT_SIZE, false);
    LOG_ERROR(ss);
    return RSMI_STATUS_INSUFFICIENT_SIZE;
  }
  ss << __PRETTY_FUNCTION__
     << " | ======= end ======= "
     << " | Success - if original boot state was not unknown or valid setting"
     << " | Success "
     << " | Device #: " << dv_ind
     << " | Type: "
     << amd::smi::Device::get_type_string(amd::smi::kDevComputePartition)
     << " | Data: " << bootState
     << amd::smi::Device::get_type_string(amd::smi::kDevAvailableMemoryPartition)
     << " | Data: " << memory_partition_caps
     << " | Returning = "
     << getRSMIStatusString(ret) << " |";
  LOG_TRACE(ss);
  return ret;
  CATCH
}

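The new `rsmi_dev_memory_partition_capabilities_get` copies at most `len-1` characters into the caller's buffer, NUL-terminates it, and reports an insufficient-size status when the full string does not fit. The following is a minimal standalone sketch of that copy-and-report pattern. The `CopyStatus` enum and `copy_to_caller_buffer` are hypothetical stand-ins for the real status codes and function, and the sample capability string used in the usage note is illustrative only.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for the RSMI status codes used in the hunk.
enum class CopyStatus { kSuccess, kInvalidArgs, kInsufficientSize };

// Copies `src` into the caller's buffer `dst` of capacity `len` bytes and
// always NUL-terminates it. Returns kInsufficientSize when `src` plus the
// terminator does not fit; `dst` then holds the first len-1 characters.
CopyStatus copy_to_caller_buffer(const std::string &src, char *dst,
                                 uint32_t len) {
  if (dst == nullptr || len == 0) {
    return CopyStatus::kInvalidArgs;
  }
  // std::string::copy writes at most len-1 characters and no terminator.
  std::size_t written = src.copy(dst, len - 1);
  dst[written] = '\0';
  if (len < src.size() + 1) {
    return CopyStatus::kInsufficientSize;
  }
  return CopyStatus::kSuccess;
}
```

With a 30-byte buffer, a string such as "NPS1, NPS2, NPS4" fits and yields `kSuccess`; with a 5-byte buffer, "NPS1, NPS2" is truncated to "NPS1" and yields `kInsufficientSize`.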
rsmi_status_t rsmi_dev_memory_partition_reset(uint32_t dv_ind) {
  TRY
  std::ostringstream ss;
  ss << __PRETTY_FUNCTION__ << "| ======= start =======, " << dv_ind;
  LOG_TRACE(ss);
  REQUIRE_ROOT_ACCESS
  DEVICE_MUTEX
  GET_DEV_FROM_INDX
  rsmi_status_t ret = RSMI_STATUS_NOT_SUPPORTED;

  // Only use 1st index, rest are there in-case of future issues
  // NOTE: Partitions sets cause rocm-smi indexes to fluctuate.
  // Since the nodes are grouped in respect to primary node - why we only use
  // 1st node/device id to reset
  std::string bootState =
    dev->readBootPartitionState<rsmi_memory_partition_type_t>(0);

  // Initiate reset
  // If bootState is UNKNOWN, we cannot reset - return RSMI_STATUS_NOT_SUPPORTED
  // Likely due to device not supporting it
  if (bootState != "UNKNOWN") {
    rsmi_memory_partition_type_t memory_partition =
      mapStringToMemoryPartitionTypes.at(bootState);
    ret = rsmi_dev_memory_partition_set(dv_ind, memory_partition);
  }
  ss << __PRETTY_FUNCTION__
     << " | ======= end ======= "
     << " | Success - if original boot state was not unknown or valid setting"
     << " | Device #: " << dv_ind
     << " | Type: "
     << amd::smi::Device::get_type_string(amd::smi::kDevMemoryPartition)
     << " | Data: " << bootState
     << " | Returning = "
     << getRSMIStatusString(ret) << " |";
     << getRSMIStatusString(ret, false);
  LOG_TRACE(ss);
  return ret;
  CATCH

@@ -140,6 +140,7 @@ static const char *kDevAvailableComputePartitionFName =
                                              "available_compute_partition";
static const char *kDevComputePartitionFName = "current_compute_partition";
static const char *kDevMemoryPartitionFName = "current_memory_partition";
static const char *kDevAvailableMemoryPartitionFName = "available_memory_partition";

// Firmware version files
static const char *kDevFwVersionAsdFName = "fw_version/asd_fw_version";
@@ -328,6 +329,7 @@ static const std::map<DevInfoTypes, const char *> kDevAttribNameMap = {
    {kDevAvailableComputePartition, kDevAvailableComputePartitionFName},
    {kDevComputePartition, kDevComputePartitionFName},
    {kDevMemoryPartition, kDevMemoryPartitionFName},
    {kDevAvailableMemoryPartition, kDevAvailableMemoryPartitionFName},
};

static const std::map<rsmi_dev_perf_level, const char *> kDevPerfLvlMap = {
@@ -479,6 +481,7 @@ Device::devInfoTypesStrings = {
  {kDevAvailableComputePartition, "kDevAvailableComputePartition"},
  {kDevComputePartition, "kDevComputePartition"},
  {kDevMemoryPartition, "kDevMemoryPartition"},
  {kDevAvailableMemoryPartition, "kDevAvailableMemoryPartition"},
  {kDevPCieVendorID, "kDevPCieVendorID"},
  {kDevSocPstate, "kDevSocPstate"},
  {kDevXgmiPlpd, "kDevXgmiPlpd"},
@@ -1308,6 +1311,7 @@ int Device::readDevInfo(DevInfoTypes type, std::string *val) {
    case kDevMemoryPartition:
    case kDevNumaNode:
    case kDevXGMIPhysicalID:
    case kDevAvailableMemoryPartition:
    case kDevProcessIsolation:
      return readDevInfoStr(type, val);
      break;
@@ -1486,10 +1490,15 @@ bool Device::DeviceAPISupported(std::string name, uint64_t variant,

rsmi_status_t Device::restartAMDGpuDriver(void) {
  REQUIRE_ROOT_ACCESS
  std::ostringstream ss;
  bool restartSuccessful = true;
  bool success = false;
  std::string out;
  bool wasGdmServiceActive = false;
  bool restartInProgress = true;
  bool isRestartInProgress = true;
  bool isAMDGPUModuleLive = false;
  std::string captureRestartErr;

  // sudo systemctl is-active gdm
  // we do not care about the success of checking if gdm is active
@@ -1498,8 +1507,8 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
  (restartSuccessful = true);

  // if gdm is active -> sudo systemctl stop gdm
  // TODO: are are there other display manager's we need to take into account?
  // see https://en.wikipedia.org/wiki/GNOME_Display_Manager
  // TODO(AMD_SMI_team): are there other display managers we need to take into account?
  // see https://help.gnome.org/admin/gdm/stable/overview.html.en_GB
  if (success && (out == "active")) {
    wasGdmServiceActive = true;
    std::tie(success, out) = executeCommand("systemctl stop gdm&", false);
@@ -1509,8 +1518,13 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
  // sudo modprobe -r amdgpu
  // sudo modprobe amdgpu
  std::tie(success, out) =
    executeCommand("modprobe -r amdgpu && modprobe amdgpu&", false);
    executeCommand("modprobe -r amdgpu && modprobe amdgpu&", true);
  restartSuccessful &= success;
  captureRestartErr = out;

  if (success) {
    restartSuccessful = false;
  }

  // if gdm was active -> sudo systemctl start gdm
  if (wasGdmServiceActive) {
@@ -1518,7 +1532,61 @@ rsmi_status_t Device::restartAMDGpuDriver(void) {
    restartSuccessful &= success;
  }

  return (restartSuccessful ? RSMI_STATUS_SUCCESS :
  // Return early if there was an issue restarting amdgpu
  if (!restartSuccessful) {
    ss << __PRETTY_FUNCTION__ << " | [WARNING] Issue found during amdgpu restart: "
       << captureRestartErr << "; restartSuccessful: " << (restartSuccessful ? "True" : "False");
    LOG_INFO(ss);
    return RSMI_STATUS_AMDGPU_RESTART_ERR;
  }

  // wait for amdgpu module to come back up
  rsmi_status_t status = Device::isRestartInProgress(&isRestartInProgress,
                                                     &isAMDGPUModuleLive);
  const int kTimeToWaitForDriverMSec = 1000;
  int maxLoops = 10; // wait a max of 10 sec
  while (status != RSMI_STATUS_SUCCESS) {
    maxLoops -= 1;
    if (maxLoops == 0) {
      break;
    }
    amd::smi::system_wait(kTimeToWaitForDriverMSec);
    status = Device::isRestartInProgress(&isRestartInProgress,
                                         &isAMDGPUModuleLive);
  }

  return ((restartSuccessful && (!isRestartInProgress && isAMDGPUModuleLive)) ?
          RSMI_STATUS_SUCCESS :
          RSMI_STATUS_AMDGPU_RESTART_ERR);
}

rsmi_status_t Device::isRestartInProgress(bool *isRestartInProgress,
                                          bool *isAMDGPUModuleLive) {
  REQUIRE_ROOT_ACCESS
  std::ostringstream ss;
  bool restartSuccessful = true;
  bool success = false;
  std::string out;
  bool deviceRestartInProgress = true; // Assume in progress, we intend to disprove
  bool isSystemAMDGPUModuleLive = false; // Assume AMD GPU module is not live,
  // we intend to disprove

  // wait for amdgpu module to come back up
  std::tie(success, out) = executeCommand("cat /sys/module/amdgpu/initstate", true);
  ss << __PRETTY_FUNCTION__
     << " | success = " << success
     << " | out = " << out;
  LOG_DEBUG(ss);
  if ((success == true) && (!out.empty())) {
    isSystemAMDGPUModuleLive = containsString(out, "live");
  }
  if (isSystemAMDGPUModuleLive) {
    deviceRestartInProgress = false;
  }
  *isRestartInProgress = deviceRestartInProgress;
  *isAMDGPUModuleLive = isSystemAMDGPUModuleLive;

  return ((*isAMDGPUModuleLive && !*isRestartInProgress) ? RSMI_STATUS_SUCCESS :
          RSMI_STATUS_AMDGPU_RESTART_ERR);
}

@@ -63,6 +63,7 @@
#include <sstream>
#include <string>
#include <vector>
#include <cmath>

#include "rocm_smi/rocm_smi.h"
#include "rocm_smi/rocm_smi_kfd.h"
@@ -357,6 +358,7 @@ rsmi_status_t ErrnoToRsmiStatus(int err) {
    case EIO: return RSMI_STATUS_UNEXPECTED_SIZE;
    case ENXIO: return RSMI_STATUS_UNEXPECTED_DATA;
    case EBUSY: return RSMI_STATUS_BUSY;
    case EINVAL: return RSMI_STATUS_INVALID_ARGS;
    default: return RSMI_STATUS_UNKNOWN_ERROR;
  }
}
@@ -429,14 +431,14 @@ std::pair<bool, std::string> executeCommand(std::string command, bool stdOut) {
  char buffer[128];
  std::string stdoutAndErr;
  bool successfulRun = true;
  command = "stdbuf -i0 -o0 -e0 " + command; // remove stdOut and err buffering
  command = "stdbuf -i0 -o0 -e0 " + command; // remove stdOut and err buffering

  FILE *pipe = popen(command.c_str(), "r");
  if (!pipe) {
    stdoutAndErr = "[ERROR] popen failed to call " + command;
    successfulRun = false;
  } else {
    //read until end of process
    // read until end of process
    while (!feof(pipe)) {
      // use buffer to read and add to stdoutAndErr
      if (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
@@ -459,8 +461,19 @@ std::pair<bool, std::string> executeCommand(std::string command, bool stdOut) {

// originalString - string to search for substring
// substring - string looking to find
bool containsString(std::string originalString, std::string substring) {
  return (originalString.find(substring) != std::string::npos);
// displayComparisons = defaults to false, set to true to see debug prints
bool containsString(std::string originalString, std::string substring,
                    bool displayComparisons) {
  std::ostringstream ss;
  bool found = originalString.find(substring) != std::string::npos;
  if (displayComparisons) {
    ss << __PRETTY_FUNCTION__
       << " | originalString: " << originalString
       << " | substring: " << substring
       << " | found: " << (found ? "True": "False");
    LOG_TRACE(ss);
  }
  return found;
}

// Creates and stores supplied data into a temporary file (within /tmp/).
@@ -1217,7 +1230,9 @@ rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind, std::string *gfx_vers
  // separate out parts -> put back into normal graphics version format
  major = static_cast<uint64_t>((orig_target_version / 10000) * 100);
  minor = static_cast<uint64_t>((orig_target_version % 10000 / 100) * 10);
  if (minor == 0) major *= 10; // 0 as a minor is correct, but bump up by 10
  if ((minor == 0) && (countDigit(major) < 4)) {
    major *= 10; // 0 as a minor is correct, but bump up by 10
  }
  rev = static_cast<uint64_t>(orig_target_version % 100);
  *gfx_version = "gfx" + std::to_string(major + minor + rev);
  ss << __PRETTY_FUNCTION__
@@ -1278,6 +1293,31 @@ std::queue<std::string> getAllDeviceGfxVers() {
  return deviceGfxVersions;
}

// milli_seconds: time to wait, in milliseconds
// 1 sec = 1000ms
// .5 sec = 500ms
void system_wait(int milli_seconds) {
  std::ostringstream ss;
  auto start = std::chrono::high_resolution_clock::now();
  // 1 ms = 1000 us
  int waitTime = milli_seconds * 1000;
  ss << __PRETTY_FUNCTION__ << " | "
     << "** Waiting for " << std::dec << waitTime
     << " us (" << waitTime/1000 << " milli-seconds) **";
  LOG_DEBUG(ss);
  usleep(waitTime);
  auto stop = std::chrono::high_resolution_clock::now();
  auto duration =
    std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
  ss << __PRETTY_FUNCTION__ << " | "
     << "** Waiting took " << duration.count() / 1000
     << " milli-seconds **";
  LOG_DEBUG(ss);
}

int countDigit(uint64_t n) {
  return static_cast<int>(std::floor(log10(n) + 1));
}

} // namespace smi
} // namespace amd

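The `rsmi_get_gfx_target_version` hunk above splits the raw target version into major, minor, and revision parts and only bumps the major part when it has fewer than four digits. Below is a minimal standalone sketch of that arithmetic for experimenting with sample inputs. `format_gfx_version` and `count_digits` are hypothetical names; `count_digits` is an integer-only alternative to the commit's `log10`-based `countDigit`, and the sample inputs in the usage note are illustrative values rather than values taken from this commit.

```cpp
#include <cstdint>
#include <string>

// Integer-only digit count (hypothetical alternative to countDigit);
// treats 0 as having one digit.
int count_digits(uint64_t n) {
  int digits = 1;
  while (n >= 10) {
    n /= 10;
    ++digits;
  }
  return digits;
}

// Mirrors the arithmetic shown in the hunk: split the raw value into
// major/minor/revision and reassemble it as "gfx<number>".
std::string format_gfx_version(uint64_t orig_target_version) {
  uint64_t major = (orig_target_version / 10000) * 100;
  uint64_t minor = (orig_target_version % 10000 / 100) * 10;
  // A zero minor is valid; only widen majors that are still under 4 digits.
  if (minor == 0 && count_digits(major) < 4) {
    major *= 10;
  }
  uint64_t rev = orig_target_version % 100;
  return "gfx" + std::to_string(major + minor + rev);
}
```

For example, an input of 90402 gives major 900, minor 40, revision 2 and so "gfx942", while 110000 gives major 1100 with no bump and so "gfx1100". Under the old unconditional bump, the second case would have produced a five-digit number.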
@@ -1455,16 +1455,57 @@ amdsmi_get_gpu_accelerator_partition_profile(amdsmi_processor_handle processor_h
                                          amdsmi_accelerator_partition_profile_t *profile,
                                          uint32_t *partition_id) {
    AMDSMI_CHECK_INIT();
    // TODO: also fill out profile later
    // default to 0xffffffff if not supported
    *partition_id = std::numeric_limits<uint32_t>::max();
    auto tmp_partition_id = uint32_t(0);
    if (profile == nullptr) {
        return AMDSMI_STATUS_INVAL;
    }
    std::ostringstream ss;
    // TODO(amdsmi_team): also fill out profile later
    amdsmi_nps_caps_t flags;
    flags.amdsmi_nps_flags_t.nps1_cap = 0;
    flags.amdsmi_nps_flags_t.nps2_cap = 0;
    flags.amdsmi_nps_flags_t.nps4_cap = 0;
    flags.amdsmi_nps_flags_t.nps8_cap = 0;
    profile->memory_caps = flags;

    amdsmi_status_t status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &tmp_partition_id);
    if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS){
    // TODO(amdsmi_team): add resources here ^
    auto tmp_partition_id = uint32_t(0);
    auto tmp_xcd_count = uint16_t(0);
    amdsmi_status_t status = AMDSMI_STATUS_NOT_SUPPORTED;

    status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &tmp_partition_id);
    if (status == AMDSMI_STATUS_SUCCESS) {
        *partition_id = tmp_partition_id;
    }

    // Add memory partition capabilities here
    constexpr uint32_t kLenCapsSize = 30;
    char memory_caps[kLenCapsSize];
    status = rsmi_wrapper(rsmi_dev_memory_partition_capabilities_get, processor_handle,
                          memory_caps, kLenCapsSize);
    ss << __PRETTY_FUNCTION__
       << " | rsmi_dev_memory_partition_capabilities_get Returning: "
       << smi_amdgpu_get_status_string(status, false)
       << " | Type: memory_partition_capabilities"
       << " | Data: " << memory_caps;
    LOG_DEBUG(ss);
    std::string memory_caps_str = "N/A";
    if (status == AMDSMI_STATUS_SUCCESS) {
        memory_caps_str = std::string(memory_caps);
        if (memory_caps_str.find("NPS1") != std::string::npos) {
            flags.amdsmi_nps_flags_t.nps1_cap = 1;
        }
        if (memory_caps_str.find("NPS2") != std::string::npos) {
            flags.amdsmi_nps_flags_t.nps2_cap = 1;
        }
        if (memory_caps_str.find("NPS4") != std::string::npos) {
            flags.amdsmi_nps_flags_t.nps4_cap = 1;
        }
        if (memory_caps_str.find("NPS8") != std::string::npos) {
            flags.amdsmi_nps_flags_t.nps8_cap = 1;
        }
    }
    profile->memory_caps = flags;

    return status;
}

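The profile function above turns the capabilities string into per-mode flags by searching for each `NPSn` token. A minimal standalone sketch of that mapping is shown here. The `NpsCaps` struct and `parse_nps_caps` are hypothetical stand-ins for `amdsmi_nps_caps_t` and the inline logic, and the sample strings in the usage note are illustrative rather than captured driver output.

```cpp
#include <string>

// Hypothetical plain struct standing in for the amdsmi NPS capability flags.
struct NpsCaps {
  bool nps1 = false;
  bool nps2 = false;
  bool nps4 = false;
  bool nps8 = false;
};

// Sets one flag per NPS mode whose name appears anywhere in `caps`.
NpsCaps parse_nps_caps(const std::string &caps) {
  NpsCaps out;
  out.nps1 = caps.find("NPS1") != std::string::npos;
  out.nps2 = caps.find("NPS2") != std::string::npos;
  out.nps4 = caps.find("NPS4") != std::string::npos;
  out.nps8 = caps.find("NPS8") != std::string::npos;
  return out;
}
```

Parsing "NPS1, NPS4" sets only the `nps1` and `nps4` flags, and a placeholder such as "N/A" leaves every flag cleared, matching the zero-initialized defaults in the hunk.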
@@ -624,3 +624,35 @@ amdsmi_status_t smi_amdgpu_is_gpu_power_management_enabled(amd::smi::AMDSmiGPUDe
    return AMDSMI_STATUS_SUCCESS;
}

std::string smi_amdgpu_split_string(std::string str, char delim) {
    std::vector<std::string> tokens;
    std::stringstream ss(str);
    std::string token;

    if (str.empty()) {
        return "";
    }

    while (std::getline(ss, token, delim)) {
        tokens.push_back(token);
        return token; // return 1st match
    }
}

// wrapper to return string expression of a rsmi_status_t return
// rsmi_status_t ret - return value of RSMI API function
// bool fullStatus - defaults to true, set to false to chop off description
// Returns:
// string - if fullStatus == true, returns full description of return value
// ex. 'RSMI_STATUS_SUCCESS: The function has been executed successfully.'
// string - if fullStatus == false, returns a minimalized return value
// ex. 'RSMI_STATUS_SUCCESS'
std::string smi_amdgpu_get_status_string(amdsmi_status_t ret, bool fullStatus = true) {
    const char *err_str;
    amdsmi_status_code_to_string(ret, &err_str);
    if (!fullStatus) {
        return smi_amdgpu_split_string(std::string(err_str), ':');
    }
    return std::string(err_str);
}
