Merge amd-dev into amd-master 20240502

Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: I9d8d0cd0f4ffe39605d087dd52a7768fc15db49d
This commit is contained in:
Maisam Arif
2024-05-02 16:40:26 -05:00
parent 881920c864 bf6fc51f4f
Commit f8c19dce67
20 changed files: 1009 additions and 224 deletions
+103 -8
@@ -4,14 +4,54 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/](
***All information listed below is for reference and subject to change.***
## amd_smi_lib for ROCm 6.1.1
## amd_smi_lib for ROCm 6.1.2
### Added
- N/A
- **Added process isolation and clean shader APIs and CLI commands**
Added CLI commands and APIs to address LeftoverLocals security issues. They allow clearing the SRAM data and setting process isolation on a per-GPU basis. New APIs:
- `amdsmi_get_gpu_process_isolation()`
- `amdsmi_set_gpu_process_isolation()`
- `amdsmi_set_gpu_clear_sram_data()`
- **Added `MIN_POWER` to output of `amd-smi static --limit`**
This change helps users identify the range within which they can set the GPU's power cap. It also makes it easier to see why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap <value in W>` or `amd-smi reset -g all --power-cap`.
```shell
$ amd-smi static --limit
GPU: 0
LIMIT:
MAX_POWER: 203 W
MIN_POWER: 0 W
SOCKET_POWER: 203 W
SLOWDOWN_EDGE_TEMPERATURE: 100 °C
SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C
SLOWDOWN_VRAM_TEMPERATURE: 100 °C
SHUTDOWN_EDGE_TEMPERATURE: 105 °C
SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C
SHUTDOWN_VRAM_TEMPERATURE: 105 °C
GPU: 1
LIMIT:
MAX_POWER: 213 W
MIN_POWER: 213 W
SOCKET_POWER: 213 W
SLOWDOWN_EDGE_TEMPERATURE: 109 °C
SLOWDOWN_HOTSPOT_TEMPERATURE: 110 °C
SLOWDOWN_VRAM_TEMPERATURE: 100 °C
SHUTDOWN_EDGE_TEMPERATURE: 114 °C
SHUTDOWN_HOTSPOT_TEMPERATURE: 115 °C
SHUTDOWN_VRAM_TEMPERATURE: 105 °C
```
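A minimal sketch of how the `MIN_POWER`/`MAX_POWER` pair can be read, using the two GPUs from the sample output above. The helper name `power_cap_range` is hypothetical and not part of amd-smi. It assumes both limits are already in watts and that equal limits mean the cap cannot be changed, which is what this changelog describes for devices with overdrive disabled.

```python
def power_cap_range(min_power_w, max_power_w):
    """Return the (low, high) range a power cap may be set to, or None if it is fixed.

    Hypothetical helper for illustration only; it is not an amd-smi API.
    """
    if min_power_w >= max_power_w:
        # Equal limits: there is no room to move the cap on this device.
        return None
    return (min_power_w, max_power_w)


# Values copied from the sample `amd-smi static --limit` output.
print(power_cap_range(0, 203))    # GPU 0: adjustable range
print(power_cap_range(213, 213))  # GPU 1: no adjustable range
```

A caller could check the result before attempting `amd-smi set --power-cap`, rather than relying on the error path.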
### Changed
- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**
`amdsmi_get_power_cap_info` now returns values in uW, as originally reported by the driver. Previously it returned values in W, which conflicted with our set calls and modified the values retrieved from the driver. We decided to keep the driver's values untouched (in their original units, uW) and convert to watts in the CLI (as was already done, so no change there). Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case `min_power_cap` and `max_power_cap` are the same).
- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**
Previously, calls returned the string "No bad pages found." when no pages were found. They now always return a list, which can be empty.
- **Updated `amd-smi metric --ecc-blocks` output**
The ECC blocks argument was outputting blocks without counters available. The filtering is updated to show only blocks that have counters available:
@@ -52,6 +92,58 @@ GPU: 0
- **Removed `amdsmi_get_gpu_process_info` from python library**
`amdsmi_get_gpu_process_info` was removed from the C library in an earlier build, but the API was still present in the Python interface.
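The microwatt convention described for `amdsmi_get_power_cap_info` in this section can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the function names are hypothetical, the sample dictionary is made up, and only the dictionary keys (`power_cap`, `default_power_cap`, `min_power_cap`, `max_power_cap`) come from the code in this commit.

```python
MICROWATTS_PER_WATT = 1_000_000

# Keys taken from this commit's CLI code; assumed here to hold microwatt values.
POWER_KEYS = ("power_cap", "default_power_cap", "min_power_cap", "max_power_cap")


def power_cap_info_to_watts(power_cap_info):
    """Return a copy of the dict with the known power keys converted from uW to whole W."""
    converted = dict(power_cap_info)
    for key in POWER_KEYS:
        if key in converted:
            converted[key] = converted[key] // MICROWATTS_PER_WATT
    return converted


def watts_to_microwatts(watts):
    """Convert a user-supplied watt value back to uW before passing it to a setter."""
    return watts * MICROWATTS_PER_WATT


# Made-up sample shaped like the dict the CLI reads; not real device output.
sample = {"power_cap": 203_000_000, "min_power_cap": 0, "max_power_cap": 203_000_000}
print(power_cap_info_to_watts(sample))
print(watts_to_microwatts(203))
```

Keeping the library values in the driver's units and converting only at the display and input boundaries means the get and set paths agree on a single unit.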
### Optimizations
- **Updated `amd-smi monitor --pcie` output**
The source for pcie bandwidth monitor output was a legacy file we no longer support and was causing delays within the monitor command. The output is no longer using TX/RX but instantaneous bandwidth from gpu_metrics instead; updated output:
```shell
$ amd-smi monitor --pcie
GPU PCIE_BW
0 26 Mb/s
```
### Fixed
- **Fixed `amd-smi metric --power` to provide power output for Navi2x/Navi3x/MI1x**
These systems use an older version of gpu_metrics in amdgpu. This fix only updates what CLI outputs.
No change in any of our APIs.
```shell
$ amd-smi metric --power
GPU: 0
POWER:
SOCKET_POWER: 11 W
GFX_VOLTAGE: 768 mV
SOC_VOLTAGE: 925 mV
MEM_VOLTAGE: 1250 mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
GPU: 1
POWER:
SOCKET_POWER: 17 W
GFX_VOLTAGE: 781 mV
SOC_VOLTAGE: 806 mV
MEM_VOLTAGE: 1250 mV
POWER_MANAGEMENT: ENABLED
THROTTLE_STATUS: UNTHROTTLED
```
- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**
This fix required `amdsmi_get_power_cap_info` to return values in uW, as originally reported by the driver. Previously it returned values in W, which conflicted with our set calls and modified the values retrieved from the driver. We decided to keep the driver's values untouched (in their original units, uW) and convert to watts in the CLI (as was already done, so no change there). Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case `min_power_cap` and `max_power_cap` are the same).
- **Fixed python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**
Previously, Python interface calls that returned populated bad pages resulted in a `ValueError: NULL pointer access`. This also fixes the `bad-pages` CLI subcommand.
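Since the bad-page calls now return a plain list that may be empty, callers decide how to present the empty case. The sketch below mirrors the approach taken in this commit's CLI code; the function name `summarize_bad_pages` and the sample entry are hypothetical, and the status is assumed to arrive as an enum name string carrying the `AMDSMI_MEM_PAGE_STATUS_` prefix.

```python
def summarize_bad_pages(bad_pages):
    """Turn a possibly empty bad-page list into CLI-friendly output.

    Hypothetical helper for illustration; not part of the amdsmi package.
    """
    if not bad_pages:
        # An empty list is a valid result and is reported as a message.
        return "No bad pages found."
    summary = []
    for page in bad_pages:
        summary.append({
            "page_address": page["page_address"],
            "page_size": page["page_size"],
            # Drop the enum prefix so only the short status name is shown.
            "status": page["status"].replace("AMDSMI_MEM_PAGE_STATUS_", ""),
        })
    return summary


# Made-up sample entry; real values come from the library.
pages = [{"page_address": 4096, "page_size": 4096,
          "status": "AMDSMI_MEM_PAGE_STATUS_RESERVED"}]
print(summarize_bad_pages([]))
print(summarize_bad_pages(pages))
```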
## amd_smi_lib for ROCm 6.1.1
### Added
- N/A
### Changed
- **Updated metrics --clocks**
Output for `amd-smi metric --clock` is updated to reflect each engine and bug fixes for the clock lock status and deep sleep status.
@@ -188,9 +280,10 @@ GPU: 0
```
- **Updated `amd-smi topology --json` to align with host/guest**
Topology's `--json` output now is changed to align with output reported bt host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for full list). See examples shown below.
Topology's `--json` output now is changed to align with output host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for full list). See examples shown below.
*Previous format:*
*Previous format:*
```shell
$ amd-smi topology --json
[
@@ -244,6 +337,7 @@ $ amd-smi topology --json
```
*New format:*
```shell
$ amd-smi topology --json
[
@@ -275,6 +369,7 @@ $ amd-smi topology --json
...
]
```
```shell
$ /opt/rocm/bin/amd-smi topology -a -t --json
[
@@ -323,18 +418,18 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json
### Fixed
- **Fix for GPU reset error on non-amdgpu cards**
- **Fix for GPU reset error on non-amdgpu cards**
Previously our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix
updates the CLI to target only AMD ASICs.
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
- **Fix for `amd-smi process`**
- **Fix for `amd-smi process`**
Fixed output results when getting processes running on a device.
- **Improved Error handling for `amd-smi process`**
- **Improved Error handling for `amd-smi process`**
Fixed Attribute Error when getting process in csv format
### Known issues
+1 -1
@@ -28,7 +28,7 @@ find_program(GIT NAMES git)
## Setup the package version based on git tags.
set(PKG_VERSION_GIT_TAG_PREFIX "amdsmi_pkg_ver")
get_package_version_number("24.5.1" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
get_package_version_number("24.5.2" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
message("Package version: ${PKG_VERSION_STR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MAJOR "${CPACK_PACKAGE_VERSION_MAJOR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MINOR "${CPACK_PACKAGE_VERSION_MINOR}")
+9 -6
@@ -79,7 +79,7 @@ amd-smi will report the version and current platform detected when running the c
~$ amd-smi
usage: amd-smi [-h] ...
AMD System Management Interface | Version: 24.5.1.0 | ROCm version: 6.1.1 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.5.2.0 | ROCm version: 6.1.2 | Platform: Linux Baremetal
options:
-h, --help show this help message and exit
@@ -148,9 +148,9 @@ Command Modifiers:
```bash
~$ amd-smi static --help
usage: amd-smi static [-h] [-g GPU [GPU ...] | -U CPU [CPU ...]] [-a] [-b] [-V] [-d] [-v]
[-c] [-B] [-r] [-p] [-l] [-u] [-s] [-i] [--json | --csv]
[--file FILE] [--loglevel LEVEL]
usage: amd-smi static [-h] [-g GPU [GPU ...]] [-a] [-b] [-V] [-d] [-v] [-c] [-B] [-r] [-p]
[-l] [-P] [-x] [-s] [-u] [--json | --csv] [--file FILE]
[--loglevel LEVEL]
If no GPU is specified, returns static information for all GPUs on the system.
If no static argument is provided, all static information will be displayed.
@@ -179,6 +179,7 @@ Static Arguments:
-r, --ras Displays RAS features information
-p, --partition Partition information
-l, --limit All limit metric values (i.e. power and thermal limits)
-s, --process-isolation The process isolation status
-u, --numa All numa node information
CPU Arguments:
@@ -474,13 +475,13 @@ Command Modifiers:
```bash
usage: amd-smi set [-h] (-g GPU [GPU ...] | -U CPU [CPU ...] | -O CORE [CORE ...]) [-f %]
[-l LEVEL] [-P SETPROFILE] [-d SCLKMAX] [-C PARTITION] [-M PARTITION]
[-o WATTS] [-p POLICY] [--cpu-pwr-limit PWR_LIMIT]
[-o WATTS] [-p POLICY] [-i STATUS] [--cpu-pwr-limit PWR_LIMIT]
[--cpu-xgmi-link-width MIN_WIDTH MAX_WIDTH]
[--cpu-lclk-dpm-level NBIOID MIN_DPM MAX_DPM] [--cpu-pwr-eff-mode MODE]
[--cpu-gmi3-link-width MIN_LW MAX_LW] [--cpu-pcie-link-rate LINK_RATE]
[--cpu-df-pstate-range MAX_PSTATE MIN_PSTATE] [--cpu-enable-apb]
[--cpu-disable-apb DF_PSTATE] [--soc-boost-limit BOOST_LIMIT]
[--core-boost-limit BOOST_LIMIT] [--json | --csv] [--file FILE]
[--core-boost-limit BOOST_LIMIT] [-c] [--json | --csv] [--file FILE]
[--loglevel LEVEL]
A GPU must be specified to set a configuration.
@@ -514,6 +515,8 @@ Set Arguments:
-o, --power-cap WATTS Set power capacity limit
-p, --dpm-policy POLICY_ID Set the GPU DPM policy using policy id
-x, --xgmi-plpd POLICY_ID Set the GPU XGMI per-link power down policy using policy id
-i, --process-isolation STATUS Enable or disable the GPU process isolation: 0 for disable and 1 for enable.
-c, --clear-sram-data Clear the GPU SRAM data
CPU Arguments:
--cpu-pwr-limit PWR_LIMIT Set power limit for the given socket. Input parameter is power limit value.
+170 -113
@@ -245,7 +245,7 @@ class AMDSMICommands():
def static_gpu(self, args, multiple_devices=False, gpu=None, asic=None, bus=None, vbios=None,
limit=None, driver=None, ras=None, board=None, numa=None, vram=None,
cache=None, partition=None, dfc_ucode=None, fb_info=None, num_vf=None,
policy=None, xgmi_plpd=None):
policy=None, xgmi_plpd=None, process_isolation=None):
"""Get Static information for target gpu
Args:
@@ -270,6 +270,7 @@ class AMDSMICommands():
num_vf (bool, optional): Value override for args.num_vf. Defaults to None.
policy (bool, optional): Value override for args.policy. Defaults to None.
xgmi_plpd (bool, optional): Value override for args.xgmi_plpd. Defaults to None.
process_isolation (bool, optional): Value override for args.process_isolation. Defaults to None.
Returns:
None: Print output via AMDSMILogger to destination
"""
@@ -306,8 +307,10 @@ class AMDSMICommands():
args.policy = policy
if xgmi_plpd:
args.xgmi_plpd = xgmi_plpd
current_platform_args += ["ras", "limit", "partition", "policy", "xgmi_plpd"]
current_platform_values += [args.ras, args.limit, args.partition, args.policy, args.xgmi_plpd]
if process_isolation:
args.process_isolation = process_isolation
current_platform_args += ["ras", "limit", "partition", "policy", "xgmi_plpd", "process_isolation"]
current_platform_values += [args.ras, args.limit, args.partition, args.policy, args.xgmi_plpd, args.process_isolation]
if self.helpers.is_linux() and not self.helpers.is_virtual_os():
if numa:
@@ -411,7 +414,11 @@ class AMDSMICommands():
power_limit_error = False
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu)
max_power_limit = power_cap_info['max_power_cap']
max_power_limit = AMDSMIHelpers.convert_SI_unit(max_power_limit, AMDSMIHelpers.SI_Unit.MICRO)
min_power_limit = power_cap_info['min_power_cap']
min_power_limit = AMDSMIHelpers.convert_SI_unit(min_power_limit, AMDSMIHelpers.SI_Unit.MICRO)
socket_power_limit = power_cap_info['power_cap']
socket_power_limit = AMDSMIHelpers.convert_SI_unit(socket_power_limit, AMDSMIHelpers.SI_Unit.MICRO)
except amdsmi_exception.AmdSmiLibraryException as e:
power_limit_error = True
max_power_limit = "N/A"
@@ -489,11 +496,18 @@ class AMDSMICommands():
power_unit = 'W'
temp_unit_human_readable = '\N{DEGREE SIGN}C'
temp_unit_json = 'C'
if self.logger.is_human_readable_format():
if not power_limit_error:
max_power_limit = f"{max_power_limit} {power_unit}"
socket_power_limit = f"{socket_power_limit} {power_unit}"
if not power_limit_error:
max_power_limit = self.helpers.unit_format(self.logger,
max_power_limit,
power_unit)
min_power_limit = self.helpers.unit_format(self.logger,
min_power_limit,
power_unit)
socket_power_limit = self.helpers.unit_format(self.logger,
socket_power_limit,
power_unit)
if self.logger.is_human_readable_format():
if not slowdown_temp_edge_limit_error:
slowdown_temp_edge_limit = f"{slowdown_temp_edge_limit} {temp_unit_human_readable}"
if not slowdown_temp_hotspot_limit_error:
@@ -506,13 +520,8 @@ class AMDSMICommands():
shutdown_temp_hotspot_limit = f"{shutdown_temp_hotspot_limit} {temp_unit_human_readable}"
if not shutdown_temp_vram_limit_error:
shutdown_temp_vram_limit = f"{shutdown_temp_vram_limit} {temp_unit_human_readable}"
if self.logger.is_json_format():
if not power_limit_error:
max_power_limit = {"value" : max_power_limit,
"unit" : power_unit}
socket_power_limit = {"value" : socket_power_limit,
"unit" : power_unit}
if self.logger.is_json_format():
if not slowdown_temp_edge_limit_error:
slowdown_temp_edge_limit = {"value" : slowdown_temp_edge_limit,
"unit" : temp_unit_json}
@@ -535,6 +544,7 @@ class AMDSMICommands():
limit_info = {}
# Power limits
limit_info['max_power'] = max_power_limit
limit_info['min_power'] = min_power_limit
limit_info['socket_power'] = socket_power_limit
# Shutdown limits
@@ -643,6 +653,16 @@ class AMDSMICommands():
logging.debug("Failed to get xgmi_plpd info for gpu %s | %s", gpu_id, e.get_error_info())
static_dict['xgmi_plpd'] = policy_info
if 'process_isolation' in current_platform_args:
if args.process_isolation:
try:
status = amdsmi_interface.amdsmi_get_gpu_process_isolation(args.gpu)
status = "Enabled" if status else "Disabled"
except amdsmi_exception.AmdSmiLibraryException as e:
status = "N/A"
logging.debug("Failed to get process isolation for gpu %s | %s", gpu_id, e.get_error_info())
static_dict['process_isolation'] = status
if 'numa' in current_platform_args:
if args.numa:
try:
@@ -779,7 +799,7 @@ class AMDSMICommands():
bus=None, vbios=None, limit=None, driver=None, ras=None,
board=None, numa=None, vram=None, cache=None, partition=None,
dfc_ucode=None, fb_info=None, num_vf=None, cpu=None,
interface_ver=None, policy=None, xgmi_plpd = None):
interface_ver=None, policy=None, xgmi_plpd = None, process_isolation=None):
"""Get Static information for target gpu and cpu
Args:
@@ -804,6 +824,7 @@ class AMDSMICommands():
interface_ver (bool, optional): Value override for args.interface_ver. Defaults to None
policy (bool, optional): Value override for args.policy. Defaults to None.
xgmi_plpd (bool, optional): Value override for args.xgmi_plpd. Defaults to None.
process_isolation (bool, optional): Value override for args.process_isolation. Defaults to None.
Raises:
IndexError: Index error if gpu list is empty
@@ -829,7 +850,8 @@ class AMDSMICommands():
gpu_args_enabled = False
gpu_attributes = ["asic", "bus", "vbios", "limit", "driver", "ras",
"board", "numa", "vram", "cache", "partition",
"dfc_ucode", "fb_info", "num_vf", "policy", "xgmi_plpd"]
"dfc_ucode", "fb_info", "num_vf", "policy", "xgmi_plpd",
"process_isolation"]
for attr in gpu_attributes:
if hasattr(args, attr):
if getattr(args, attr):
@@ -859,7 +881,8 @@ class AMDSMICommands():
self.static_gpu(args, multiple_devices, gpu, asic,
bus, vbios, limit, driver, ras,
board, numa, vram, cache, partition,
dfc_ucode, fb_info, num_vf, policy)
dfc_ucode, fb_info, num_vf, policy, xgmi_plpd,
process_isolation)
elif self.helpers.is_amd_hsmp_initialized(): # Only CPU is initialized
if args.cpu == None:
args.cpu = self.cpu_handles
@@ -873,7 +896,8 @@ class AMDSMICommands():
self.static_gpu(args, multiple_devices, gpu, asic,
bus, vbios, limit, driver, ras,
board, numa, vram, cache, partition,
dfc_ucode, fb_info, num_vf, policy, xgmi_plpd)
dfc_ucode, fb_info, num_vf, policy, xgmi_plpd,
process_isolation)
def firmware(self, args, multiple_devices=False, gpu=None, fw_list=True):
@@ -998,14 +1022,19 @@ class AMDSMICommands():
# Get gpu_id for logging
gpu_id = self.helpers.get_gpu_id_from_device_handle(args.gpu)
bad_pages_not_found = "No bad pages found."
try:
bad_page_info = amdsmi_interface.amdsmi_get_gpu_bad_page_info(args.gpu)
# If bad_page_info is an empty list overwrite with not found error statement
if bad_page_info == []:
bad_page_info = bad_pages_not_found
bad_page_error = True
else:
bad_page_error = False
except amdsmi_exception.AmdSmiLibraryException as e:
bad_page_info = "N/A"
logging.debug("Failed to get bad page info for gpu %s | %s", gpu_id, e.get_error_info())
if bad_page_info == "N/A" or bad_page_info == "No bad pages found.":
bad_page_error = True
logging.debug("Failed to get bad page info for gpu %s | %s", gpu_id, e.get_error_info())
if args.retired:
if bad_page_error:
@@ -1017,13 +1046,17 @@ class AMDSMICommands():
bad_page_info_entry = {}
bad_page_info_entry["page_address"] = bad_page["page_address"]
bad_page_info_entry["page_size"] = bad_page["page_size"]
bad_page_info_entry["status"] = bad_page["status"].name
status_string = amdsmi_interface.amdsmi_wrapper.amdsmi_memory_page_status_t__enumvalues[bad_page["status"]]
bad_page_info_entry["status"] = status_string.replace("AMDSMI_MEM_PAGE_STATUS_", "")
bad_page_info_output.append(bad_page_info_entry)
# Remove brackets if there is only one value
if len(bad_page_info_output) == 1:
bad_page_info_output = bad_page_info_output[0]
values_dict['retired'] = bad_page_info_output
if bad_page_info_output == []:
values_dict['retired'] = bad_pages_not_found
else:
values_dict['retired'] = bad_page_info_output
if args.pending:
if bad_page_error:
@@ -1035,13 +1068,17 @@ class AMDSMICommands():
bad_page_info_entry = {}
bad_page_info_entry["page_address"] = bad_page["page_address"]
bad_page_info_entry["page_size"] = bad_page["page_size"]
bad_page_info_entry["status"] = bad_page["status"].name
status_string = amdsmi_interface.amdsmi_wrapper.amdsmi_memory_page_status_t__enumvalues[bad_page["status"]]
bad_page_info_entry["status"] = status_string.replace("AMDSMI_MEM_PAGE_STATUS_", "")
bad_page_info_output.append(bad_page_info_entry)
# Remove brackets if there is only one value
if len(bad_page_info_output) == 1:
bad_page_info_output = bad_page_info_output[0]
values_dict['pending'] = bad_page_info_output
if bad_page_info_output == []:
values_dict['pending'] = bad_pages_not_found
else:
values_dict['pending'] = bad_page_info_output
if args.un_res:
if bad_page_error:
@@ -1053,13 +1090,17 @@ class AMDSMICommands():
bad_page_info_entry = {}
bad_page_info_entry["page_address"] = bad_page["page_address"]
bad_page_info_entry["page_size"] = bad_page["page_size"]
bad_page_info_entry["status"] = bad_page["status"].name
status_string = amdsmi_interface.amdsmi_wrapper.amdsmi_memory_page_status_t__enumvalues[bad_page["status"]]
bad_page_info_entry["status"] = status_string.replace("AMDSMI_MEM_PAGE_STATUS_", "")
bad_page_info_output.append(bad_page_info_entry)
# Remove brackets if there is only one value
if len(bad_page_info_output) == 1:
bad_page_info_output = bad_page_info_output[0]
values_dict['un_res'] = bad_page_info_output
if bad_page_info_output == []:
values_dict['un_res'] = bad_pages_not_found
else:
values_dict['un_res'] = bad_page_info_output
# Store values in logger.output
self.logger.store_output(args.gpu, 'values', values_dict)
@@ -1295,24 +1336,19 @@ class AMDSMICommands():
for key, value in power_info.items():
if value == 0xFFFF:
power_info[key] = "N/A"
elif self.logger.is_human_readable_format():
if "voltage" in key:
power_info[key] = f"{value} {voltage_unit}"
elif "power" in key:
power_info[key] = f"{value} {power_unit}"
elif self.logger.is_json_format():
if "voltage" in key:
power_info[key] = {"value" : value,
"unit" : voltage_unit}
elif "power" in key:
power_info[key] = {"value" : value,
"unit" : power_unit}
power_dict['socket_power'] = power_info['current_socket_power']
if power_dict['socket_power'] == "N/A":
# For older gpu's when current power doesn't populate we use the average socket power instead
power_dict['socket_power'] = power_info['average_socket_power']
elif "voltage" in key:
power_info[key] = self.helpers.unit_format(self.logger,
value,
voltage_unit)
elif "power" in key:
if ((key == "current_socket_power" or key == "average_socket_power")
and value != "N/A"):
power_dict['socket_power'] = self.helpers.unit_format(self.logger,
value,
power_unit)
power_info[key] = self.helpers.unit_format(self.logger,
value,
power_unit)
power_dict['gfx_voltage'] = power_info['gfx_voltage']
power_dict['soc_voltage'] = power_info['soc_voltage']
@@ -3326,7 +3362,8 @@ class AMDSMICommands():
def set_gpu(self, args, multiple_devices=False, gpu=None, fan=None, perf_level=None,
profile=None, perf_determinism=None, compute_partition=None,
memory_partition=None, power_cap=None, dpm_policy=None, xgmi_plpd = None):
memory_partition=None, power_cap=None, dpm_policy=None, xgmi_plpd = None,
process_isolation=None, clear_sram_data = None):
"""Issue reset commands to target gpu(s)
Args:
@@ -3342,7 +3379,8 @@ class AMDSMICommands():
power_cap (int, optional): Value override for args.power_cap. Defaults to None.
dpm_policy (int, optional): Value override for args.dpm_policy. Defaults to None.
xgmi_plpd (int, optional): Value override for args.xgmi_plpd. Defaults to None.
process_isolation (int, optional): Value override for args.process_isolation. Defaults to None.
clear_sram_data (int, optional): Value override for args.clear_sram_data. Defaults to None.
Raises:
ValueError: Value error if no gpu value is provided
IndexError: Index error if gpu list is empty
@@ -3371,6 +3409,10 @@ class AMDSMICommands():
args.dpm_policy = dpm_policy
if xgmi_plpd:
args.xgmi_plpd = xgmi_plpd
if process_isolation:
args.process_isolation = process_isolation
if clear_sram_data:
args.clear_sram_data = clear_sram_data
# Handle No GPU passed
if args.gpu == None:
raise ValueError('No GPU provided, specific GPU target(s) are needed')
@@ -3389,9 +3431,11 @@ class AMDSMICommands():
args.compute_partition,
args.memory_partition,
args.perf_determinism is not None,
args.power_cap,
args.dpm_policy,
args.xgmi_plpd]):
args.power_cap is not None,
args.dpm_policy is not None,
args.xgmi_plpd is not None,
args.process_isolation is not None,
args.clear_sram_data]):
command = " ".join(sys.argv[1:])
raise AmdSmiRequiredCommandException(command, self.logger.format)
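The move from truthiness to `is not None` in the check above matters because `0` is a legitimate value for options such as `--process-isolation` (0 disables it). The sketch below uses a hypothetical stand-alone parser, not amd-smi's actual parser, to show the difference.

```python
import argparse

# Hypothetical parser for illustration; amd-smi builds its own parser elsewhere.
parser = argparse.ArgumentParser(prog="set-sketch")
parser.add_argument("-i", "--process-isolation", type=int, choices=[0, 1], default=None)

# The user explicitly asks to disable process isolation.
args = parser.parse_args(["-i", "0"])

# A truthiness check treats 0 like "option not given".
requested_truthy = any([args.process_isolation])
# An explicit None check recognizes that the option was supplied.
requested_explicit = any([args.process_isolation is not None])

print(requested_truthy, requested_explicit)  # False True
```

With the truthiness form, `-i 0` would fall through to the "no set argument provided" error even though the user supplied one.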
@@ -3455,32 +3499,16 @@ class AMDSMICommands():
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set memory partition to {args.memory_partition} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'memorypartition', f"Successfully set memory partition to {args.memory_partition}")
if args.dpm_policy:
try:
amdsmi_interface.amdsmi_set_dpm_policy(args.gpu, args.dpm_policy)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set dpm policy to {args.dpm_policy} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'dpmpolicy', f"Successfully set dpm policy to id {args.dpm_policy}")
if args.xgmi_plpd:
try:
amdsmi_interface.amdsmi_set_xgmi_plpd(args.gpu, args.xgmi_plpd)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set XGMI policy to {args.xgmi_plpd} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'xgmiplpd', f"Successfully set per-link power down policy to id {args.dpm_policy}")
if isinstance(args.power_cap, int):
try:
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu)
logging.debug(f"Power cap info for gpu {gpu_id} | {power_cap_info}")
min_power_cap = power_cap_info["min_power_cap"]
min_power_cap = AMDSMIHelpers.convert_SI_unit(min_power_cap, AMDSMIHelpers.SI_Unit.MICRO)
max_power_cap = power_cap_info["max_power_cap"]
max_power_cap = AMDSMIHelpers.convert_SI_unit(max_power_cap, AMDSMIHelpers.SI_Unit.MICRO)
current_power_cap = power_cap_info["power_cap"]
current_power_cap = AMDSMIHelpers.convert_SI_unit(current_power_cap, AMDSMIHelpers.SI_Unit.MICRO)
except amdsmi_exception.AmdSmiLibraryException as e:
raise ValueError(f"Unable to get power cap info from {gpu_string}") from e
@@ -3488,7 +3516,9 @@ class AMDSMICommands():
self.logger.store_output(args.gpu, 'powercap', f"Power cap is already set to {args.power_cap}")
elif args.power_cap >= min_power_cap and args.power_cap <= max_power_cap:
try:
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 0, args.power_cap * 1000000)
new_power_cap = AMDSMIHelpers.convert_SI_unit(args.power_cap, AMDSMIHelpers.SI_Unit.BASE,
AMDSMIHelpers.SI_Unit.MICRO)
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 0, new_power_cap)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
@@ -3499,6 +3529,48 @@ class AMDSMICommands():
if min_power_cap == 0:
min_power_cap = 1
self.logger.store_output(args.gpu, 'powercap', f"Power cap must be between {min_power_cap} and {max_power_cap}")
if isinstance(args.dpm_policy, int):
try:
amdsmi_interface.amdsmi_set_dpm_policy(args.gpu, args.dpm_policy)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set dpm policy to {args.dpm_policy} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'dpmpolicy', f"Successfully set dpm policy to id {args.dpm_policy}")
if isinstance(args.xgmi_plpd, int):
try:
amdsmi_interface.amdsmi_set_xgmi_plpd(args.gpu, args.xgmi_plpd)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set XGMI policy to {args.xgmi_plpd} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'xgmiplpd', f"Successfully set per-link power down policy to id {args.xgmi_plpd}")
if isinstance(args.process_isolation, int):
status_string = "Enabled" if args.process_isolation else "Disabled"
result = f"Requested process isolation to {status_string}" # This should not print out
try:
current_status = amdsmi_interface.amdsmi_get_gpu_process_isolation(args.gpu)
if current_status == args.process_isolation:
result = f"Process isolation is already {status_string}"
else:
amdsmi_interface.amdsmi_set_gpu_process_isolation(args.gpu, args.process_isolation)
result = f"Successfully set process isolation to {status_string}"
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to set process isolation to {status_string} on {gpu_string}") from e
self.logger.store_output(args.gpu, 'process_isolation', result)
if args.clear_sram_data:
try:
# Only 1 can be used for now.
amdsmi_interface.amdsmi_set_gpu_clear_sram_data(args.gpu, 1)
result = 'Successfully cleared GPU SRAM data'
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to clear SRAM data on GPU {gpu_id}") from e
self.logger.store_output(args.gpu, 'clear_sram_data', result)
if multiple_devices:
self.logger.store_multiple_device_output()
@@ -3513,7 +3585,8 @@ class AMDSMICommands():
cpu=None, cpu_pwr_limit=None, cpu_xgmi_link_width=None, cpu_lclk_dpm_level=None,
cpu_pwr_eff_mode=None, cpu_gmi3_link_width=None, cpu_pcie_link_rate=None,
cpu_df_pstate_range=None, cpu_enable_apb=None, cpu_disable_apb=None,
soc_boost_limit=None, core=None, core_boost_limit=None, dpm_policy=None, xgmi_plpd=None):
soc_boost_limit=None, core=None, core_boost_limit=None, dpm_policy=None, xgmi_plpd=None,
process_isolation=None, clear_sram_data=None):
"""Issue reset commands to target gpu(s)
Args:
@@ -3544,7 +3617,8 @@ class AMDSMICommands():
core_boost_limit (int, optional): Value override for args.core_boost_limit. Defaults to None
dpm_policy (int, optional): Value override for args.dpm_policy. Defaults to None.
xgmi_plpd (int, optional): Value override for args.xgmi_plpd. Defaults to None.
process_isolation (int, optional): Value override for args.process_isolation. Defaults to None.
clear_sram_data (int, optional): Value override for args.clear_sram_data. Defaults to None.
Raises:
ValueError: Value error if no gpu value is provided
IndexError: Index error if gpu list is empty
@@ -3564,13 +3638,13 @@ class AMDSMICommands():
# Check if a GPU argument has been set
gpu_args_enabled = False
gpu_attributes = ["fan", "perf_level", "profile", "perf_determinism", "compute_partition",
"memory_partition", "power_cap", "dpm_policy", "xgmi_plpd"]
"memory_partition", "power_cap", "dpm_policy", "xgmi_plpd", "process_isolation",
"clear_sram_data"]
for attr in gpu_attributes:
if hasattr(args, attr):
if getattr(args, attr) is not None:
gpu_args_enabled = True
break
# Check if a CPU argument has been set
cpu_args_enabled = False
cpu_attributes = ["cpu_pwr_limit", "cpu_xgmi_link_width", "cpu_lclk_dpm_level", "cpu_pwr_eff_mode",
@@ -3578,7 +3652,7 @@ class AMDSMICommands():
"cpu_enable_apb", "cpu_disable_apb", "soc_boost_limit"]
for attr in cpu_attributes:
if hasattr(args, attr):
if getattr(args, attr) is not None:
if getattr(args, attr) not in [None, False]:
cpu_args_enabled = True
break
@@ -3620,7 +3694,8 @@ class AMDSMICommands():
self.logger.clear_multiple_devices_ouput()
self.set_gpu(args, multiple_devices, gpu, fan, perf_level,
profile, perf_determinism, compute_partition,
memory_partition, power_cap, dpm_policy, xgmi_plpd)
memory_partition, power_cap, dpm_policy, xgmi_plpd,
process_isolation, clear_sram_data)
elif self.helpers.is_amd_hsmp_initialized(): # Only CPU is initialized
if args.cpu == None and args.core == None:
raise ValueError('No CPU or CORE provided, specific target(s) are needed')
@@ -3639,7 +3714,8 @@ class AMDSMICommands():
self.logger.clear_multiple_devices_ouput()
self.set_gpu(args, multiple_devices, gpu, fan, perf_level,
profile, perf_determinism, compute_partition,
memory_partition, power_cap, dpm_policy, xgmi_plpd)
memory_partition, power_cap, dpm_policy, xgmi_plpd,
process_isolation, clear_sram_data)
def reset(self, args, multiple_devices=False, gpu=None, gpureset=None,
@@ -3660,7 +3736,6 @@ class AMDSMICommands():
compute_partition (bool, optional): Value override for args.compute_partition. Defaults to None.
memory_partition (bool, optional): Value override for args.memory_partition. Defaults to None.
power_cap (int, optional): Value override for args.power_cap. Defaults to None.
Raises:
ValueError: Value error if no gpu value is provided
IndexError: Index error if gpu list is empty
@@ -3838,20 +3913,26 @@ class AMDSMICommands():
try:
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu)
logging.debug(f"Power cap info for gpu {gpu_id} | {power_cap_info}")
default_power_cap = power_cap_info["default_power_cap"]
default_power_cap_in_w = power_cap_info["default_power_cap"]
default_power_cap_in_w = AMDSMIHelpers.convert_SI_unit(default_power_cap_in_w, AMDSMIHelpers.SI_Unit.MICRO)
current_power_cap_in_w = power_cap_info["power_cap"]
current_power_cap_in_w = AMDSMIHelpers.convert_SI_unit(current_power_cap_in_w, AMDSMIHelpers.SI_Unit.MICRO)
except amdsmi_exception.AmdSmiLibraryException as e:
raise ValueError(f"Unable to get power cap info from {gpu_id}") from e
if args.power_cap == default_power_cap:
self.logger.store_output(args.gpu, 'powercap', f"Power cap is already set to {default_power_cap}")
if current_power_cap_in_w == default_power_cap_in_w:
self.logger.store_output(args.gpu, 'powercap', f"Power cap is already set to {default_power_cap_in_w}")
else:
try:
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 0, default_power_cap * 1000000)
default_power_cap_in_uw = AMDSMIHelpers.convert_SI_unit(default_power_cap_in_w,
AMDSMIHelpers.SI_Unit.BASE,
AMDSMIHelpers.SI_Unit.MICRO)
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 0, default_power_cap_in_uw)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to reset power cap to {default_power_cap} on GPU {gpu_id}") from e
self.logger.store_output(args.gpu, 'powercap', f"Successfully set power cap to {default_power_cap}")
raise ValueError(f"Unable to reset power cap to {default_power_cap_in_w} on GPU {gpu_id}") from e
self.logger.store_output(args.gpu, 'powercap', f"Successfully set power cap to {default_power_cap_in_w}")
if multiple_devices:
self.logger.store_multiple_device_output()
@@ -4234,38 +4315,14 @@ class AMDSMICommands():
self.logger.table_header += 'VRAM_TOTAL'.rjust(12)
if args.pcie:
try:
pcie_bw = amdsmi_interface.amdsmi_get_gpu_pci_throughput(args.gpu)
sent = pcie_bw['sent'] * pcie_bw['max_pkt_sz']
received = pcie_bw['received'] * pcie_bw['max_pkt_sz']
bw_unit = "Mb/s"
packet_size_unit = "B"
if sent > 0:
sent = sent // 1024 // 1024
if received > 0:
received = received // 1024 // 1024
if self.logger.is_human_readable_format():
sent = f"{sent} {bw_unit}"
received = f"{received} {bw_unit}"
pcie_bw['max_pkt_sz'] = f"{pcie_bw['max_pkt_sz']} {packet_size_unit}"
if self.logger.is_json_format():
sent = {"value" : sent,
"unit" : bw_unit}
received = {"value" : received,
"unit" : bw_unit}
pcie_bw['max_pkt_sz'] = {"value" : pcie_bw['max_pkt_sz'],
"unit" : packet_size_unit}
monitor_values['pcie_tx'] = sent
monitor_values['pcie_rx'] = received
pcie_info = amdsmi_interface.amdsmi_get_pcie_info(args.gpu)['pcie_metric']
pcie_bw_unit = 'Mb/s'
monitor_values['pcie_bw'] = self.helpers.unit_format(self.logger, pcie_info['pcie_bandwidth'], pcie_bw_unit)
except amdsmi_exception.AmdSmiLibraryException as e:
monitor_values['pcie_tx'] = "N/A"
monitor_values['pcie_rx'] = "N/A"
logging.debug("Failed to get pci throughput on gpu %s | %s", gpu_id, e.get_error_info())
monitor_values['pcie_bw'] = "N/A"
logging.debug("Failed to get pci bandwidth on gpu %s | %s", gpu_id, e.get_error_info())
self.logger.table_header += 'PCIE_TX'.rjust(10)
self.logger.table_header += 'PCIE_RX'.rjust(10)
self.logger.table_header += 'PCIE_BW'.rjust(10)
self.logger.store_output(args.gpu, 'values', monitor_values)
+44
@@ -29,6 +29,7 @@ import time
from subprocess import run
from subprocess import PIPE, STDOUT
from typing import List
from enum import Enum
from amdsmi_init import *
from BDF import BDF
@@ -726,3 +727,46 @@ class AMDSMIHelpers():
if logger.is_human_readable_format():
return f"{value} {unit}"
return f"{value}"
class SI_Unit(float, Enum):
GIGA = 1000000000 # 10^9
MEGA = 1000000 # 10^6
KILO = 1000 # 10^3
HECTO = 100 # 10^2
DEKA = 10 # 10^1
BASE = 1 # 10^0
DECI = 0.1 # 10^-1
CENTI = 0.01 # 10^-2
MILLI = 0.001 # 10^-3
MICRO = 0.000001 # 10^-6
NANO = 0.000000001 # 10^-9
def convert_SI_unit(val: float, unit_in: SI_Unit, unit_out = SI_Unit.BASE) -> float:
"""Convert a value from one SI unit prefix to another.
unit_out defaults to SI_Unit.BASE.
This function returns a float.
params:
val: float value to convert
unit_in: SI_Unit member giving the unit of the input value (e.g. SI_Unit.MICRO)
unit_out: SI_Unit member giving the unit of the returned value
(e.g. SI_Unit.MICRO); defaults to SI_Unit.BASE
return:
float: the value expressed in unit_out
"""
return val * unit_in / unit_out
def convert_SI_unit(val: int, unit_in: SI_Unit, unit_out=SI_Unit.BASE) -> int:
"""Convert a value from one SI unit prefix to another.
unit_out defaults to SI_Unit.BASE.
This function returns an int (the result is truncated).
params:
val: int value to convert
unit_in: SI_Unit member giving the unit of the input value (e.g. SI_Unit.MICRO)
unit_out: SI_Unit member giving the unit of the returned value
(e.g. SI_Unit.MICRO); defaults to SI_Unit.BASE
return:
int: the value expressed in unit_out
"""
return int(float(val) * unit_in / unit_out)
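As the hunk is shown, `convert_SI_unit` is defined twice under the same name, so Python keeps only the second (integer) definition. That version multiplies by a float factor and then truncates with `int()`. The sketch below is an alternative, not the method used in this commit: it stores each prefix as a power-of-ten exponent so that microwatt/watt conversions stay in exact integer arithmetic. The names `SIExp` and `convert_si_int` are hypothetical.

```python
from enum import IntEnum


class SIExp(IntEnum):
    """SI prefixes stored as power-of-ten exponents (hypothetical helper)."""
    MICRO = -6
    MILLI = -3
    BASE = 0
    KILO = 3
    MEGA = 6


def convert_si_int(val: int, unit_in: SIExp, unit_out: SIExp = SIExp.BASE) -> int:
    """Convert an integer between SI prefixes using integer arithmetic only."""
    shift = int(unit_in) - int(unit_out)
    if shift >= 0:
        return val * 10 ** shift
    # Scaling down floors toward negative infinity, like integer division.
    return val // 10 ** (-shift)


power_cap_uw = 203_000_000
power_cap_w = convert_si_int(power_cap_uw, SIExp.MICRO)              # uW -> W
back_to_uw = convert_si_int(power_cap_w, SIExp.BASE, SIExp.MICRO)    # W -> uW
print(power_cap_w, back_to_uw)
```

The trade-off is that fractional results are dropped when scaling down, so this form only suits values that are whole numbers in the target unit.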
+10 -4
@@ -545,6 +545,7 @@ class AMDSMIParser(argparse.ArgumentParser):
board_help = "All board information"
dpm_policy_help = "The available DPM policy"
xgmi_plpd_help = "The available XGMI per-link power down policy"
process_isolation_help = "The process isolation status"
# Options arguments help text for Hypervisors and Baremetal
ras_help = "Displays RAS features information"
@@ -586,6 +587,7 @@ class AMDSMIParser(argparse.ArgumentParser):
static_parser.add_argument('-l', '--limit', action='store_true', required=False, help=limit_help)
static_parser.add_argument('-P', '--policy', action='store_true', required=False, help=dpm_policy_help)
static_parser.add_argument('-x', '--xgmi-plpd', action='store_true', required=False, help=xgmi_plpd_help)
static_parser.add_argument('-R', '--process-isolation', action='store_true', required=False, help=process_isolation_help)
if self.helpers.is_linux() and not self.helpers.is_virtual_os():
static_parser.add_argument('-u', '--numa', action='store_true', required=False, help=numa_help)
@@ -967,8 +969,9 @@ class AMDSMIParser(argparse.ArgumentParser):
set_compute_partition_help = f"Set one of the following compute partition modes:\n\t{compute_partition_choices_str}"
set_memory_partition_help = f"Set one of the following memory partition modes:\n\t{memory_partition_choices_str}"
set_power_cap_help = "Set power capacity limit"
set_dpm_policy_help = f"Set the GPU DPM policy using policy id\n"
set_xgmi_plpd_help = f"Set the GPU XGMI per-link power down policy using policy id\n"
set_dpm_policy_help = "Set the GPU DPM policy using policy id\n"
set_xgmi_plpd_help = "Set the GPU XGMI per-link power down policy using policy id\n"
set_process_isolation_help = "Enable or disable the GPU process isolation: 0 for disable and 1 for enable.\n"
# Help text for CPU set options
set_cpu_pwr_limit_help = "Set power limit for the given socket. Input parameter is power limit value."
@@ -982,6 +985,7 @@ class AMDSMIParser(argparse.ArgumentParser):
set_cpu_enable_apb_help = "Enables the DF p-state performance boost algorithm"
set_cpu_disable_apb_help = "Disables the DF p-state performance boost algorithm. Input parameter is DFPstate (0-3)"
set_soc_boost_limit_help = "Sets the boost limit for the given socket. Input parameter is socket BOOST_LIMIT value"
run_gpu_clear_sram_data_help = "Clear the GPU SRAM data\n"
# Help text for CPU Core set options
set_core_boost_limit_help = "Sets the boost limit for the given core. Input parameter is core BOOST_LIMIT value"
@@ -1006,6 +1010,8 @@ class AMDSMIParser(argparse.ArgumentParser):
set_value_parser.add_argument('-o', '--power-cap', action='store', type=self._positive_int, required=False, help=set_power_cap_help, metavar='WATTS')
set_value_parser.add_argument('-p', '--dpm-policy', action='store', required=False, type=self._not_negative_int, help=set_dpm_policy_help, metavar='POLICY_ID')
set_value_parser.add_argument('-x', '--xgmi-plpd', action='store', required=False, type=self._not_negative_int, help=set_xgmi_plpd_help, metavar='POLICY_ID')
set_value_parser.add_argument('-R', '--process-isolation', action='store', choices=[0,1], type=self._not_negative_int, required=False, help=set_process_isolation_help, metavar='STATUS')
set_value_parser.add_argument('-c', '--clear-sram-data', action='store_true', required=False, help=run_gpu_clear_sram_data_help)
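The two new `set` options combine a validated integer choice with a plain flag. A minimal sketch of how `argparse` handles that combination follows; `not_negative_int` is a hypothetical stand-in for the parser's `_not_negative_int` method, whose body is not shown in this diff.

```python
import argparse


def not_negative_int(text: str) -> int:
    """Hypothetical stand-in for the parser's _not_negative_int validator."""
    value = int(text)
    if value < 0:
        raise argparse.ArgumentTypeError(f"{text} is not a non-negative integer")
    return value


parser = argparse.ArgumentParser(prog="set-sketch")
# The type callable runs first; the converted value is then checked against choices.
parser.add_argument("-R", "--process-isolation", type=not_negative_int,
                    choices=[0, 1], metavar="STATUS")
parser.add_argument("-c", "--clear-sram-data", action="store_true")

args = parser.parse_args(["-R", "1", "-c"])
print(args.process_isolation, args.clear_sram_data)
```

Because `choices` is compared against the converted value, the list holds integers rather than strings; a value such as `2` passes the type check but is rejected by `choices`.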
if self.helpers.is_amd_hsmp_initialized():
# Optional CPU Args
@@ -1104,7 +1110,7 @@ class AMDSMIParser(argparse.ArgumentParser):
throttle_help = "Monitor thermal throttle status"
ecc_help = "Monitor ECC single bit, ECC double bit, and PCIe replay error counts"
mem_usage_help = "Monitor memory usage in MB"
pcie_throughput_help = "Monitor PCIe Tx/Rx in MB/s"
pcie_bandwidth_help = "Monitor PCIe bandwidth in Mb/s"
# Create monitor subparser
monitor_parser = subparsers.add_parser('monitor', help=monitor_help, description=monitor_subcommand_help)
@@ -1127,7 +1133,7 @@ class AMDSMIParser(argparse.ArgumentParser):
monitor_parser.add_argument('-s', '--throttle-status', action='store_true', required=False, help=throttle_help)
monitor_parser.add_argument('-e', '--ecc', action='store_true', required=False, help=ecc_help)
monitor_parser.add_argument('-v', '--vram-usage', action='store_true', required=False, help=mem_usage_help)
monitor_parser.add_argument('-r', '--pcie', action='store_true', required=False, help=pcie_throughput_help)
monitor_parser.add_argument('-r', '--pcie', action='store_true', required=False, help=pcie_bandwidth_help)
def _add_rocm_smi_parser(self, subparsers, func):
+1 -1
@@ -48,7 +48,7 @@ PROJECT_NAME = AMD SMI
# could be handy for archiving the generated documentation or if some version
# control system is used.
PROJECT_NUMBER = "24.5.1.0"
PROJECT_NUMBER = "24.5.2.0"
# Using the PROJECT_BRIEF tag one can provide an optional one line description
# for a project that appears at the top of each page and should give viewer a
+5 -5
@@ -657,15 +657,15 @@ int main() {
CHK_AMDSMI_RET(ret)
printf(" Output of amdsmi_get_power_cap_info:\n");
std::cout << "\t\t Power Cap: " << cap_info.power_cap
<< "W\n";
<< " uW\n";
std::cout << "\t\t Default Power Cap: " << cap_info.default_power_cap
<< "\n\n";
<< " uW\n\n";
std::cout << "\t\t Dpm Cap: " << cap_info.dpm_cap
<< "\n\n";
<< " MHz\n\n";
std::cout << "\t\t Min Power Cap: " << cap_info.min_power_cap
<< "\n\n";
<< " uW\n\n";
std::cout << "\t\t Max Power Cap: " << cap_info.max_power_cap
<< "\n\n";
<< " uW\n\n";
/// Get GPU Metrics info
std::cout << "\n\n";
+70 -7
@@ -154,7 +154,7 @@ typedef enum {
#define AMDSMI_LIB_VERSION_MAJOR 5
//! Minor version should be updated for each API change, but without changing headers
#define AMDSMI_LIB_VERSION_MINOR 1
#define AMDSMI_LIB_VERSION_MINOR 2
//! Release version should be set to 0 as default and can be updated by the PMs for each CSP point release
#define AMDSMI_LIB_VERSION_RELEASE 0
@@ -522,11 +522,11 @@ typedef struct {
} amdsmi_pcie_info_t;
typedef struct {
uint64_t power_cap;
uint64_t default_power_cap;
uint64_t dpm_cap;
uint64_t min_power_cap;
uint64_t max_power_cap;
uint64_t power_cap; //!< current power cap (uW)
uint64_t default_power_cap; //!< default power cap (uW)
uint64_t dpm_cap; //!< dpm power cap (MHz)
uint64_t min_power_cap; //!< minimum power cap (uW)
uint64_t max_power_cap; //!< maximum power cap (uW)
uint64_t reserved[3];
} amdsmi_power_cap_info_t;
@@ -3455,6 +3455,68 @@ amdsmi_status_t amdsmi_get_xgmi_plpd(amdsmi_processor_handle processor_handle,
amdsmi_status_t amdsmi_set_xgmi_plpd(amdsmi_processor_handle processor_handle,
uint32_t plpd_id);
/**
* @brief Get the status of the Process Isolation
*
* @platform{gpu_bm_linux} @platform{guest_1vf}
*
* @details Given a processor handle @p processor_handle, this function will write
* the current process isolation status to @p pisolate. A value of 0 means process
* isolation is disabled, and 1 means it is enabled.
*
* @param[in] processor_handle a processor handle
*
* @param[in, out] pisolate the process isolation status.
* If this parameter is nullptr, this function will return
* ::AMDSMI_STATUS_INVAL
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
*/
amdsmi_status_t amdsmi_get_gpu_process_isolation(amdsmi_processor_handle processor_handle,
uint32_t* pisolate);
/**
* @brief Enable/disable the system Process Isolation
*
* @platform{gpu_bm_linux} @platform{guest_1vf}
*
* @details Given a processor handle @p processor_handle and a process isolation
* flag @p pisolate, this function will set the Process Isolation for this processor.
* A value of 0 disables process isolation, and 1 enables it.
*
* @note This function requires root access
*
* @param[in] processor_handle a processor handle
*
* @param[in] pisolate the process isolation status to set.
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
*/
amdsmi_status_t amdsmi_set_gpu_process_isolation(amdsmi_processor_handle processor_handle,
uint32_t pisolate);
/**
* @brief Clear the GPU SRAM data
*
* @platform{gpu_bm_linux} @platform{guest_1vf}
*
* @details Given a processor handle @p processor_handle and a clean flag @p sclean,
* this function will clear the SRAM data of this processor. This can be called between
* user logins to prevent information leaks.
*
* @note This function requires root access
*
* @param[in] processor_handle a processor handle
*
* @param[in] sclean the clean flag. Only 1 takes effect; other values
* are reserved for future use.
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
*/
amdsmi_status_t amdsmi_set_gpu_clear_sram_data(amdsmi_processor_handle processor_handle,
uint32_t sclean);
/** @} End PerfCont */
/*****************************************************************************/
@@ -4546,7 +4608,8 @@ amdsmi_get_gpu_board_info(amdsmi_processor_handle processor_handle, amdsmi_board
/**
* @brief Returns the power caps as currently configured in the
* system. It is not supported on virtual machine guest
* system. Power in units of uW.
* It is not supported on virtual machine guest
*
* @platform{gpu_bm_linux} @platform{host}
*
+219 -16
@@ -414,13 +414,13 @@ Input parameters:
Output: Dictionary with fields
Field | Description
---|---
`power_cap` | power capability
`dpm_cap` | dynamic power management capability
`default_power_cap` | default power capability
`min_power_cap` | min power capability
`max_power_cap` | max power capability
Field | Description | Units
---|---|---
`power_cap` | power capability | uW
`dpm_cap` | dynamic power management capability | MHz
`default_power_cap` | default power capability | uW
`min_power_cap` | min power capability | uW
`max_power_cap` | max power capability | uW
Exceptions that can be thrown by `amdsmi_get_power_cap_info` function:
@@ -843,7 +843,7 @@ Input parameters:
* `processor_handle` device which to query
Output: List consisting of dictionaries with fields for each bad page found
Output: List consisting of dictionaries with fields for each bad page found; can be an empty list
Field | Description
---|---
@@ -868,7 +868,7 @@ try:
else:
for device in devices:
bad_page_info = amdsmi_get_gpu_bad_page_info(device)
if not len(bad_page_info):
if not bad_page_info: # Can be empty list
print("No bad pages found")
continue
for bad_page in bad_page_info:
@@ -880,6 +880,53 @@ except AmdSmiException as e:
print(e)
```
### amdsmi_get_gpu_memory_reserved_pages
Description: Returns reserved memory page info for the given GPU.
It is not supported on virtual machine guests.
Input parameters:
* `processor_handle` device which to query
Output: List consisting of dictionaries with fields for each reserved memory page found; can be an empty list
Field | Description
---|---
`value` | Value of memory reserved page
`page_address` | Address of memory reserved page
`page_size` | Size of memory reserved page
`status` | Status of memory reserved page
Exceptions that can be thrown by `amdsmi_get_gpu_memory_reserved_pages` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            reserved_memory_page_info = amdsmi_get_gpu_memory_reserved_pages(device)
            if not reserved_memory_page_info:  # Can be empty list
                print("No memory reserved pages found")
                continue
            for reserved_memory_page in reserved_memory_page_info:
                print(reserved_memory_page["value"])
                print(reserved_memory_page["page_address"])
                print(reserved_memory_page["page_size"])
                print(reserved_memory_page["status"])
except AmdSmiException as e:
    print(e)
```
### amdsmi_get_gpu_process_list
Description: Returns the list of processes running on the target GPU; May require root level access
@@ -1963,6 +2010,98 @@ except AmdSmiException as e:
print(e)
```
### amdsmi_get_gpu_process_isolation
Description: Get the status of the Process Isolation
Input parameters:
* `processor_handle` handle for the given device
Output: integer corresponding to isolation_status; 0 - disabled, 1 - enabled
Exceptions that can be thrown by `amdsmi_get_gpu_process_isolation` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            isolate = amdsmi_get_gpu_process_isolation(device)
            print("Process Isolation Status: ", isolate)
except AmdSmiException as e:
    print(e)
```
### amdsmi_set_gpu_process_isolation
Description: Enable/disable the system Process Isolation for the given device handle.
Input parameters:
* `processor_handle` handle for the given device
* `pisolate` the process isolation status to set: 0 disables process isolation, and 1 enables it.
Output: None
Exceptions that can be thrown by `amdsmi_set_gpu_process_isolation` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            amdsmi_set_gpu_process_isolation(device, 1)
except AmdSmiException as e:
    print(e)
```
### amdsmi_set_gpu_clear_sram_data
Description: Clear the SRAM data of the given device. This can be called between user logins to prevent information leaks.
Input parameters:
* `processor_handle` handle for the given device
* `sclean` the clean flag. Only 1 takes effect; other values are reserved for future use.
Output: None
Exceptions that can be thrown by `amdsmi_set_gpu_clear_sram_data` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            amdsmi_set_gpu_clear_sram_data(device, 1)
except AmdSmiException as e:
    print(e)
```
### amdsmi_get_gpu_overdrive_level
Description: Get the overdrive percent associated with the device with provided
@@ -2602,6 +2741,75 @@ except AmdSmiException as e:
print(e)
```
### amdsmi_get_dpm_policy
Description: Get dpm policy information.
Input parameters:
* `processor_handle` handle for the given device
Output: Dictionary with fields
Field | Description
---|---
`num_supported` | total number of supported policies
`current_id` | current policy id
`policies` | list of dictionaries containing possible policies
Exceptions that can be thrown by `amdsmi_get_dpm_policy` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            dpm_policies = amdsmi_get_dpm_policy(device)
            print(dpm_policies)
except AmdSmiException as e:
    print(e)
```
### amdsmi_set_dpm_policy
Description: Set the dpm policy to the given policy_id. Typical values are 0 (default), 1, 2, and 3.
Input parameters:
* `processor_handle` handle for the given device
* `policy_id` the policy id to set.
Output: None
Exceptions that can be thrown by `amdsmi_set_dpm_policy` function:
* `AmdSmiLibraryException`
* `AmdSmiRetryException`
* `AmdSmiParameterException`
Example:
```python
try:
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs on machine")
    else:
        for device in devices:
            amdsmi_set_dpm_policy(device, 0)
except AmdSmiException as e:
    print(e)
```
### amdsmi_set_xgmi_plpd
Description: Set the xgmi per-link power down policy parameter for the processor
@@ -3159,13 +3367,8 @@ Example:
```python
try:
devices = amdsmi_get_processor_handles()
if len(devices) == 0:
print("No GPUs on machine")
else:
for device in devices:
version = amdsmi_get_lib_version()
print(version)
version = amdsmi_get_lib_version()
print(version)
except AmdSmiException as e:
print(e)
```
+4
@@ -134,6 +134,10 @@ from .amdsmi_interface import amdsmi_set_gpu_fan_speed
from .amdsmi_interface import amdsmi_reset_gpu_fan
from .amdsmi_interface import amdsmi_set_clk_freq
from .amdsmi_interface import amdsmi_set_gpu_overdrive_level
from .amdsmi_interface import amdsmi_set_dpm_policy
from .amdsmi_interface import amdsmi_set_xgmi_plpd
from .amdsmi_interface import amdsmi_set_gpu_clear_sram_data
from .amdsmi_interface import amdsmi_set_gpu_process_isolation
# # Physical State Queries
from .amdsmi_interface import amdsmi_get_gpu_fan_rpms
+90 -30
@@ -365,6 +365,7 @@ class AmdSmiProcessorType(IntEnum):
NON_AMD_GPU = amdsmi_wrapper.NON_AMD_GPU
NON_AMD_CPU = amdsmi_wrapper.NON_AMD_CPU
class AmdSmiEventReader:
def __init__(
self, processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
@@ -439,21 +440,22 @@ def _format_bad_page_info(bad_page_info, bad_page_count: ctypes.c_uint32) -> Lis
Format bad page info data retrieved.
Parameters:
bad_page_info(`POINTER(amdsmi_retired_page_record_t)`): Pointer to bad page info
retrieved.
bad_page_info(`amdsmi_retired_page_record_t`): A populated list of amdsmi_retired_page_record_t(s)
retrieved. Ex: (amdsmi_wrapper.amdsmi_retired_page_record_t * #)()
bad_page_count(`c_uint32`): Bad page count.
Returns:
`list`: List containing formatted bad pages.
`list`: List containing formatted bad pages. Can be empty
"""
if not isinstance(
bad_page_info, ctypes.POINTER(
amdsmi_wrapper.amdsmi_retired_page_record_t)
):
raise AmdSmiParameterException(
bad_page_info, ctypes.POINTER(
amdsmi_wrapper.amdsmi_retired_page_record_t)
)
if bad_page_count.value == 0:
return []
# Check if each struct within bad_page_info is valid
for bad_page in bad_page_info:
if not isinstance(bad_page, amdsmi_wrapper.amdsmi_retired_page_record_t):
raise AmdSmiParameterException(
bad_page, amdsmi_wrapper.amdsmi_retired_page_record_t
)
table_records = []
for i in range(bad_page_count.value):
@@ -1802,23 +1804,24 @@ def amdsmi_get_gpu_bad_page_info(
)
num_pages = ctypes.c_uint32()
retired_page_record = ctypes.POINTER(
amdsmi_wrapper.amdsmi_retired_page_record_t)()
nullptr = ctypes.POINTER(amdsmi_wrapper.amdsmi_retired_page_record_t)()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_bad_page_info(
processor_handle, ctypes.byref(num_pages), retired_page_record
processor_handle, ctypes.byref(num_pages), nullptr
)
)
table_records = _format_bad_page_info(retired_page_record, num_pages)
if num_pages.value == 0:
return "No bad pages found."
else:
table_records = _format_bad_page_info(retired_page_record, num_pages)
return []
return table_records
bad_pages = (amdsmi_wrapper.amdsmi_retired_page_record_t * num_pages.value)()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_bad_page_info(
processor_handle, ctypes.byref(num_pages), bad_pages
)
)
return _format_bad_page_info(bad_pages, num_pages)
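The rewritten function follows a two-call pattern: one call with a null buffer to learn the record count, then a second call with an array of that size. The sketch below reproduces the pattern with `ctypes` alone; `PageRecord` and `fake_get_pages` are hypothetical stand-ins for `amdsmi_retired_page_record_t` and the library call, and the sample data is invented for illustration.

```python
import ctypes


class PageRecord(ctypes.Structure):
    """Hypothetical stand-in for amdsmi_retired_page_record_t."""
    _fields_ = [("page_address", ctypes.c_uint64),
                ("page_size", ctypes.c_uint64)]


def fake_get_pages(num_pages, records):
    """Pretend library call: reports the count and fills the buffer if one is given."""
    data = [(0x1000, 4096), (0x2000, 4096)]
    if records is not None:
        for i in range(min(num_pages.value, len(data))):
            records[i].page_address, records[i].page_size = data[i]
    num_pages.value = len(data)


def get_pages():
    num_pages = ctypes.c_uint32()
    fake_get_pages(num_pages, None)              # first call: count only
    if num_pages.value == 0:                     # compare the plain int, not the c_uint32
        return []
    records = (PageRecord * num_pages.value)()   # array length must be a plain int
    fake_get_pages(num_pages, records)           # second call: fill the buffer
    return [{"page_address": r.page_address, "page_size": r.page_size}
            for r in records]


print(get_pages())
```

Note the use of `.value` in both places: a `c_uint32` instance does not compare equal to a Python integer, so tests and array sizes need the unwrapped value.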
def amdsmi_get_gpu_total_ecc_count(
@@ -2734,6 +2737,7 @@ def amdsmi_set_clk_freq(
)
)
def amdsmi_set_dpm_policy(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
policy_id: int,
@@ -2748,6 +2752,7 @@ def amdsmi_set_dpm_policy(
)
)
def amdsmi_set_xgmi_plpd(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
policy_id: int,
@@ -2762,6 +2767,37 @@ def amdsmi_set_xgmi_plpd(
)
)
def amdsmi_set_gpu_process_isolation(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
pisolate: int,
):
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
raise AmdSmiParameterException(
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
)
_check_res(
amdsmi_wrapper.amdsmi_set_gpu_process_isolation(
processor_handle, pisolate
)
)
def amdsmi_set_gpu_clear_sram_data(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
sclean: int,
):
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
raise AmdSmiParameterException(
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
)
_check_res(
amdsmi_wrapper.amdsmi_set_gpu_clear_sram_data(
processor_handle, sclean
)
)
def amdsmi_set_gpu_overdrive_level(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle, overdrive_value: int
):
@@ -2793,6 +2829,7 @@ def amdsmi_get_gpu_bdf_id(processor_handle: amdsmi_wrapper.amdsmi_processor_hand
return bdfid.value
def amdsmi_set_gpu_pci_bandwidth(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle, bitmask: int
) -> None:
@@ -3089,7 +3126,6 @@ def amdsmi_set_gpu_od_volt_info(
)
def amdsmi_get_gpu_fan_rpms(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle, sensor_idx: int
) -> int:
@@ -3320,6 +3356,7 @@ def amdsmi_get_clk_freq(
"frequency": list(freq.frequency)[: freq.num_supported - 1],
}
def amdsmi_get_dpm_policy(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> Dict[str, Any]:
@@ -3351,6 +3388,7 @@ def amdsmi_get_dpm_policy(
"policies": polices,
}
def amdsmi_get_xgmi_plpd(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> Dict[str, Any]:
@@ -3382,6 +3420,25 @@ def amdsmi_get_xgmi_plpd(
"plpds": polices,
}
def amdsmi_get_gpu_process_isolation(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> int:
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
raise AmdSmiParameterException(
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
)
pisolate = ctypes.c_uint32()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_process_isolation(
processor_handle, ctypes.byref(pisolate)
)
)
return pisolate.value
def amdsmi_get_gpu_od_volt_info(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> Dict[str, Any]:
@@ -3804,21 +3861,24 @@ def amdsmi_get_gpu_memory_reserved_pages(
)
num_pages = ctypes.c_uint32()
retired_page_record = ctypes.POINTER(
amdsmi_wrapper.amdsmi_retired_page_record_t)()
nullptr = ctypes.POINTER(amdsmi_wrapper.amdsmi_retired_page_record_t)()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_memory_reserved_pages(
processor_handle, ctypes.byref(num_pages), retired_page_record
processor_handle, ctypes.byref(num_pages), nullptr
)
)
table_records = _format_bad_page_info(retired_page_record, num_pages)
if num_pages.value == 0:
return "No bad pages found."
else:
table_records = _format_bad_page_info(retired_page_record, num_pages)
return []
return table_records
mem_reserved_pages = (amdsmi_wrapper.amdsmi_retired_page_record_t * num_pages.value)()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_memory_reserved_pages(
processor_handle, ctypes.byref(num_pages), mem_reserved_pages
)
)
return _format_bad_page_info(mem_reserved_pages, num_pages)
def amdsmi_get_gpu_metrics_header_info(
+17 -7
@@ -2076,6 +2076,15 @@ amdsmi_get_xgmi_plpd.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_
amdsmi_set_xgmi_plpd = _libraries['libamd_smi.so'].amdsmi_set_xgmi_plpd
amdsmi_set_xgmi_plpd.restype = amdsmi_status_t
amdsmi_set_xgmi_plpd.argtypes = [amdsmi_processor_handle, uint32_t]
amdsmi_get_gpu_process_isolation = _libraries['libamd_smi.so'].amdsmi_get_gpu_process_isolation
amdsmi_get_gpu_process_isolation.restype = amdsmi_status_t
amdsmi_get_gpu_process_isolation.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint32)]
amdsmi_set_gpu_process_isolation = _libraries['libamd_smi.so'].amdsmi_set_gpu_process_isolation
amdsmi_set_gpu_process_isolation.restype = amdsmi_status_t
amdsmi_set_gpu_process_isolation.argtypes = [amdsmi_processor_handle, uint32_t]
amdsmi_set_gpu_clear_sram_data = _libraries['libamd_smi.so'].amdsmi_set_gpu_clear_sram_data
amdsmi_set_gpu_clear_sram_data.restype = amdsmi_status_t
amdsmi_set_gpu_clear_sram_data.argtypes = [amdsmi_processor_handle, uint32_t]
amdsmi_get_lib_version = _libraries['libamd_smi.so'].amdsmi_get_lib_version
amdsmi_get_lib_version.restype = amdsmi_status_t
amdsmi_get_lib_version.argtypes = [ctypes.POINTER(struct_amdsmi_version_t)]
@@ -2589,7 +2598,7 @@ __all__ = \
'amdsmi_get_gpu_pci_throughput', 'amdsmi_get_gpu_perf_level',
'amdsmi_get_gpu_pm_metrics_info',
'amdsmi_get_gpu_power_profile_presets',
'amdsmi_get_gpu_process_list',
'amdsmi_get_gpu_process_isolation', 'amdsmi_get_gpu_process_list',
'amdsmi_get_gpu_ras_block_features_enabled',
'amdsmi_get_gpu_ras_feature_info',
'amdsmi_get_gpu_reg_table_info', 'amdsmi_get_gpu_revision',
@@ -2646,18 +2655,19 @@ __all__ = \
'amdsmi_set_cpu_socket_boostlimit',
'amdsmi_set_cpu_socket_lclk_dpm_level',
'amdsmi_set_cpu_socket_power_cap', 'amdsmi_set_cpu_xgmi_width',
'amdsmi_set_dpm_policy', 'amdsmi_set_gpu_clk_range',
'amdsmi_set_gpu_compute_partition',
'amdsmi_set_dpm_policy', 'amdsmi_set_gpu_clear_sram_data',
'amdsmi_set_gpu_clk_range', 'amdsmi_set_gpu_compute_partition',
'amdsmi_set_gpu_event_notification_mask',
'amdsmi_set_gpu_fan_speed', 'amdsmi_set_gpu_memory_partition',
'amdsmi_set_gpu_od_clk_info', 'amdsmi_set_gpu_od_volt_info',
'amdsmi_set_gpu_overdrive_level', 'amdsmi_set_gpu_pci_bandwidth',
'amdsmi_set_gpu_perf_determinism_mode',
'amdsmi_set_gpu_perf_level', 'amdsmi_set_gpu_power_profile',
'amdsmi_set_power_cap', 'amdsmi_set_xgmi_plpd',
'amdsmi_shut_down', 'amdsmi_smu_fw_version_t',
'amdsmi_socket_handle', 'amdsmi_status_code_to_string',
'amdsmi_status_t', 'amdsmi_stop_gpu_event_notification',
'amdsmi_set_gpu_process_isolation', 'amdsmi_set_power_cap',
'amdsmi_set_xgmi_plpd', 'amdsmi_shut_down',
'amdsmi_smu_fw_version_t', 'amdsmi_socket_handle',
'amdsmi_status_code_to_string', 'amdsmi_status_t',
'amdsmi_stop_gpu_event_notification',
'amdsmi_temp_range_refresh_rate_t', 'amdsmi_temperature_metric_t',
'amdsmi_temperature_type_t', 'amdsmi_topo_get_link_type',
'amdsmi_topo_get_link_weight', 'amdsmi_topo_get_numa_node_number',
+56 -1
@@ -3362,7 +3362,7 @@ rsmi_status_t rsmi_dev_dpm_policy_get(uint32_t dv_ind,
*
* @note This function requires root access
*
* @param[in] processor_handle a processor handle
* @param[in] dv_ind a device index
*
* @param[in] policy_id the dpm policy will be modified
*
@@ -3410,6 +3410,61 @@ rsmi_status_t rsmi_dev_xgmi_plpd_get(uint32_t dv_ind,
*/
rsmi_status_t rsmi_dev_xgmi_plpd_set(uint32_t dv_ind,
uint32_t plpd_id);
/**
* @brief Get the status of the Process Isolation
*
* @details Given a device index @p dv_ind, this function will write
* the current process isolation status to @p pisolate. A value of 0 means process
* isolation is disabled, and 1 means it is enabled.
*
* @param[in] dv_ind a device index
*
* @param[in, out] pisolate the process isolation status.
* If this parameter is nullptr, this function will return
* ::RSMI_STATUS_INVAL
*
* @return ::RSMI_STATUS_SUCCESS is returned upon successful call, non-zero on fail
*/
rsmi_status_t rsmi_dev_process_isolation_get(uint32_t dv_ind,
uint32_t* pisolate);
/**
* @brief Enable/disable the system Process Isolation
*
* @details Given a device index @p dv_ind and a process isolation flag @p pisolate,
* this function will set the Process Isolation for this device. A value of 0 disables
* process isolation, and 1 enables it.
*
* @note This function requires root access
*
* @param[in] dv_ind a device index
*
* @param[in] pisolate the process isolation status to set.
*
* @return ::RSMI_STATUS_SUCCESS is returned upon successful call, non-zero on fail
*/
rsmi_status_t rsmi_dev_process_isolation_set(uint32_t dv_ind,
uint32_t pisolate);
/**
* @brief Clear the GPU SRAM data
*
*
* @details Given a device index @p dv_ind, this function clears the
* GPU SRAM data of this device. It can be called between user logins to prevent information leaks.
*
* @note This function requires root access
*
* @param[in] dv_ind a device index
*
* @param[in] sclean the clean flag. Only 1 takes effect; all other values
* are reserved for future use.
*
* @return ::RSMI_STATUS_SUCCESS is returned upon successful call, non-zero on fail
*/
rsmi_status_t rsmi_dev_gpu_clear_sram_data(uint32_t dv_ind, uint32_t sclean);
/** @} */ // end of PerfCont
/*****************************************************************************/
+3
@@ -101,6 +101,8 @@ enum DevKFDNodePropTypes {
enum DevInfoTypes {
kDevPerfLevel,
kDevProcessIsolation,
kDevShaderClean,
kDevOverDriveLevel,
kDevMemOverDriveLevel,
kDevDevID,
@@ -222,6 +224,7 @@ class Device {
void set_drm_render_minor(uint32_t minor) {drm_render_minor_ = minor;}
static rsmi_dev_perf_level perfLvlStrToEnum(std::string s);
uint64_t bdfid(void) const {return bdfid_;}
int get_partition_id() const {return (bdfid_ >> 28) & 0xf; } // location_id[31:28]
void set_bdfid(uint64_t val) {bdfid_ = val;}
pthread_mutex_t *mutex(void) {return mutex_.ptr;}
evt::dev_evt_grp_set_t* supported_event_groups(void) {
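The new `get_partition_id()` accessor reads a 4-bit field out of the BDF id. A minimal standalone sketch of that extraction follows, assuming (as the inline comment in the hunk states) that the partition id lives in bits [31:28]. The free function `partition_id_from_bdfid` is a hypothetical name used only for illustration and is not part of the library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical free-function counterpart of Device::get_partition_id().
// Assumption: the partition id occupies bits [31:28] of the 64-bit BDF id.
inline int partition_id_from_bdfid(std::uint64_t bdfid) {
  const int id = static_cast<int>((bdfid >> 28) & 0xf);
  assert(id >= 0 && id <= 15);  // a 4-bit field can only hold 0..15
  return id;
}
```

Because the result is masked to four bits, it is always a valid non-negative index, which is what the later bounds check against the parsed sysfs line relies on.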
+139
@@ -59,6 +59,7 @@
#include <cstring>
#include <fstream>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <vector>
@@ -1974,6 +1975,144 @@ rsmi_dev_gpu_clk_freq_set(uint32_t dv_ind,
}
rsmi_status_t rsmi_dev_process_isolation_get(uint32_t dv_ind,
uint32_t* pisolate) {
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << "| ======= start ======= dev_ind:"
<< dv_ind;
LOG_TRACE(ss);
CHK_SUPPORT_NAME_ONLY(pisolate)
// The enforce_isolation sysfs file holds one enable flag per partition,
// indexed by partition_id. For SPX, the partition_id will be 0.
int partition_id = dev->get_partition_id();
DEVICE_MUTEX
std::string str_val;
rsmi_status_t ret = get_dev_value_line(amd::smi::kDevProcessIsolation, dv_ind, &str_val);
if (ret == RSMI_STATUS_FILE_ERROR) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", get_dev_value_line() ret was RSMI_STATUS_FILE_ERROR "
<< "-> reporting RSMI_STATUS_NOT_SUPPORTED";
LOG_ERROR(ss);
return RSMI_STATUS_NOT_SUPPORTED;
}
if (ret != RSMI_STATUS_SUCCESS) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", get_dev_value_line() ret was not RSMI_STATUS_SUCCESS"
<< " -> reporting " << amd::smi::getRSMIStatusString(ret);
LOG_ERROR(ss);
return ret;
}
/*
For 4 partitions: enforce isolation is enabled on partition 2 and
disabled on partitions 0, 1, and 3.
$ cat /sys/class/drm/cardX/device/enforce_isolation
0 0 1 0
*/
std::stringstream iss(str_val);
int number;
std::vector<int> partition_status;
while ( iss >> number )
partition_status.push_back(number);
if (partition_status.size() <= partition_id) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", the sysfs line " << str_val
<< " does not have the partition_id "
<< partition_id;
LOG_ERROR(ss);
return RSMI_STATUS_UNEXPECTED_DATA;
}
*pisolate = partition_status[partition_id];
return RSMI_STATUS_SUCCESS;
}
rsmi_status_t rsmi_dev_process_isolation_set(uint32_t dv_ind,
uint32_t pisolate) {
rsmi_status_t ret;
TRY
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << " | ======= start =======";
LOG_TRACE(ss);
REQUIRE_ROOT_ACCESS
DEVICE_MUTEX
GET_DEV_FROM_INDX
// To set a value, the settings for all partitions must be written together.
// For two partitions:
// echo "1 0" | sudo tee /sys/class/drm/cardX/device/enforce_isolation
int partition_id = dev->get_partition_id();
std::string str_val;
rsmi_status_t ret = get_dev_value_line(amd::smi::kDevProcessIsolation, dv_ind, &str_val);
if (ret == RSMI_STATUS_FILE_ERROR) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", get_dev_value_line() ret was RSMI_STATUS_FILE_ERROR "
<< "-> reporting RSMI_STATUS_NOT_SUPPORTED";
LOG_ERROR(ss);
return RSMI_STATUS_NOT_SUPPORTED;
}
if (ret != RSMI_STATUS_SUCCESS) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", get_dev_value_line() ret was not RSMI_STATUS_SUCCESS"
<< " -> reporting " << amd::smi::getRSMIStatusString(ret);
LOG_ERROR(ss);
return ret;
}
// Craft the string that needs to be written.
// (1) parse the read enforce_isolation data into a vector
std::stringstream iss(str_val);
int number;
std::vector<int> partition_status;
while ( iss >> number ) {
partition_status.push_back(number);
}
// (2) Validate the data
if (partition_status.size() <= partition_id) {
ss << __PRETTY_FUNCTION__ << " | ======= end ======="
<< ", the sysfs line " << str_val
<< " does not have the partition_id "
<< partition_id;
LOG_ERROR(ss);
return RSMI_STATUS_UNEXPECTED_DATA;
}
// (3) Create the complete list with the update
partition_status[partition_id] = pisolate;
std::stringstream result;
std::copy(partition_status.begin(), partition_status.end(),
std::ostream_iterator<int>(result, " "));
std::string value = result.str();
int write_ret = dev->writeDevInfo(amd::smi::kDevProcessIsolation , value);
return amd::smi::ErrnoToRsmiStatus(write_ret);
CATCH
}
rsmi_status_t rsmi_dev_gpu_clear_sram_data(uint32_t dv_ind,
uint32_t sclean) {
rsmi_status_t ret;
TRY
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << " | ======= start =======";
LOG_TRACE(ss);
REQUIRE_ROOT_ACCESS
DEVICE_MUTEX
GET_DEV_FROM_INDX
std::string value = std::to_string(sclean);
int write_ret = dev->writeDevInfo(amd::smi::kDevShaderClean, value);
return amd::smi::ErrnoToRsmiStatus(write_ret);
CATCH
}
rsmi_status_t
rsmi_dev_dpm_policy_set(uint32_t dv_ind,
uint32_t policy_id) {
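The setter above follows a read, modify, write pattern: it parses the per-partition flags, replaces one entry, and writes the whole line back. The sketch below isolates just the string handling so it can be exercised without sysfs access. `update_isolation_line` is a hypothetical helper, not a function in rocm_smi, and unlike the `std::ostream_iterator` version in the hunk it emits no trailing separator. Whether the driver cares about that trailing space is not something this diff establishes.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): rewrites the flag of one partition
// in a whitespace-separated enforce_isolation line such as "0 0 1 0".
// Returns false and leaves *updated untouched when the line has no entry
// for partition_id.
inline bool update_isolation_line(const std::string& line,
                                  std::size_t partition_id,
                                  int flag,
                                  std::string* updated) {
  assert(updated != nullptr);

  // (1) Parse the current per-partition flags.
  std::istringstream in(line);
  std::vector<int> status;
  for (int value; in >> value;) {
    status.push_back(value);
  }

  // (2) Validate that the requested partition exists in the line.
  if (partition_id >= status.size()) {
    return false;
  }

  // (3) Apply the update and serialize all partitions again.
  status[partition_id] = flag;
  std::ostringstream out;
  for (std::size_t i = 0; i < status.size(); ++i) {
    if (i != 0) out << ' ';
    out << status[i];
  }
  *updated = out.str();
  return true;
}
```

Taking `partition_id` as `std::size_t` avoids the signed/unsigned comparison that the `int partition_id` in the hunk produces against `partition_status.size()`.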
+13
@@ -82,6 +82,8 @@ static const char *kDevPCieVendorIDFName = "vendor";
// Device sysfs file names
static const char *kDevPerfLevelFName = "power_dpm_force_performance_level";
static const char *kDevProcessIsolationFName = "enforce_isolation";
static const char *kDevShaderCleanFName = "run_cleaner_shader";
static const char *kDevDevProdNameFName = "product_name";
static const char *kDevDevProdNumFName = "product_number";
static const char *kDevDevIDFName = "device";
@@ -317,6 +319,8 @@ static const std::map<DevInfoTypes, const char *> kDevAttribNameMap = {
{kDevGpuMetrics, kDevGpuMetricsFName},
{kDevPmMetrics, kDevPmMetricsFName},
{kDevDPMPolicy, kDevDPMPolicyFName},
{kDevProcessIsolation, kDevProcessIsolationFName},
{kDevShaderClean, kDevShaderCleanFName},
{kDevRegMetrics, kDevRegMetricsFName},
{kDevGpuReset, kDevGpuResetFName},
{kDevAvailableComputePartition, kDevAvailableComputePartitionFName},
@@ -475,6 +479,8 @@ Device::devInfoTypesStrings = {
{kDevMemoryPartition, "kDevMemoryPartition"},
{kDevPCieVendorID, "kDevPCieVendorID"},
{kDevDPMPolicy, "kDevDPMPolicy"},
{kDevProcessIsolation, "kDevProcessIsolation"},
{kDevShaderClean, "kDevShaderClean"},
};
static const std::map<const char *, dev_depends_t> kDevFuncDependsMap = {
@@ -516,6 +522,9 @@ static const std::map<const char *, dev_depends_t> kDevFuncDependsMap = {
{"rsmi_dev_perf_level_set", {{kDevPerfLevelFName}, {}}},
{"rsmi_dev_perf_level_set_v1", {{kDevPerfLevelFName}, {}}},
{"rsmi_dev_perf_level_get", {{kDevPerfLevelFName}, {}}},
{"rsmi_dev_process_isolation_set", {{kDevProcessIsolationFName}, {}}},
{"rsmi_dev_process_isolation_get", {{kDevProcessIsolationFName}, {}}},
{"rsmi_dev_gpu_clear_sram_data", {{kDevShaderCleanFName}, {}}},
{"rsmi_perf_determinism_mode_set", {{kDevPerfLevelFName,
kDevPowerODVoltageFName}, {}}},
{"rsmi_dev_overdrive_level_set", {{kDevOverDriveLevelFName}, {}}},
@@ -939,6 +948,8 @@ int Device::writeDevInfo(DevInfoTypes type, std::string val) {
sysfs_path += kDevAttribNameMap.at(type);
switch (type) {
case kDevGPUMClk:
case kDevProcessIsolation:
case kDevShaderClean:
case kDevDCEFClk:
case kDevFClk:
case kDevGPUSClk:
@@ -1212,6 +1223,7 @@ int Device::readDevInfo(DevInfoTypes type, std::vector<std::string> *val) {
switch (type) {
case kDevGPUMClk:
case kDevProcessIsolation:
case kDevGPUSClk:
case kDevDCEFClk:
case kDevFClk:
@@ -1279,6 +1291,7 @@ int Device::readDevInfo(DevInfoTypes type, std::string *val) {
case kDevMemoryPartition:
case kDevNumaNode:
case kDevXGMIPhysicalID:
case kDevProcessIsolation:
return readDevInfoStr(type, val);
break;
+24 -5
@@ -1208,15 +1208,10 @@ amdsmi_get_power_cap_info(amdsmi_processor_handle processor_handle,
if ((status == AMDSMI_STATUS_SUCCESS) && !set_ret_success)
set_ret_success = true;
// Dividing by 1000000 to get measurement in Watts
(info->default_power_cap) /= 1000000;
status = rsmi_wrapper(rsmi_dev_power_cap_range_get, processor_handle, sensor_ind,
&(info->max_power_cap), &(info->min_power_cap));
// Dividing by 1000000 to get measurement in Watts
(info->max_power_cap) /= 1000000;
(info->min_power_cap) /= 1000000;
if ((status == AMDSMI_STATUS_SUCCESS) && !set_ret_success)
set_ret_success = true;
@@ -1385,6 +1380,30 @@ amdsmi_status_t amdsmi_get_xgmi_plpd(amdsmi_processor_handle processor_handle,
reinterpret_cast<rsmi_dpm_policy_t*>(policy));
}
amdsmi_status_t amdsmi_get_gpu_process_isolation(amdsmi_processor_handle processor_handle,
uint32_t* pisolate) {
AMDSMI_CHECK_INIT();
return rsmi_wrapper(rsmi_dev_process_isolation_get, processor_handle,
pisolate);
}
amdsmi_status_t amdsmi_set_gpu_process_isolation(amdsmi_processor_handle processor_handle,
uint32_t pisolate) {
AMDSMI_CHECK_INIT();
return rsmi_wrapper(rsmi_dev_process_isolation_set, processor_handle,
pisolate);
}
amdsmi_status_t amdsmi_set_gpu_clear_sram_data(amdsmi_processor_handle processor_handle,
uint32_t sclean) {
AMDSMI_CHECK_INIT();
return rsmi_wrapper(rsmi_dev_gpu_clear_sram_data, processor_handle,
sclean);
}
amdsmi_status_t
amdsmi_get_gpu_memory_reserved_pages(amdsmi_processor_handle processor_handle,
uint32_t *num_pages,
-2
@@ -201,8 +201,6 @@ amdsmi_status_t smi_amdgpu_get_power_cap(amd::smi::AMDSmiGPUDevice* device, int
return AMDSMI_STATUS_API_FAILED;
}
// Dividing by 1000000 to get measurement in Watts
*cap /= 1000000;
return AMDSMI_STATUS_SUCCESS;
}
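With the division removed, callers of the power cap APIs now receive microwatts and must convert for display themselves. A minimal sketch of that conversion is shown below. The helper names and the `kMicrowattsPerWatt` constant are hypothetical and chosen for illustration; only the factor of 1000000 comes from the removed lines.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical constant: number of microwatts in one watt.
constexpr std::uint64_t kMicrowattsPerWatt = 1000000;

// Converts a driver-reported value in microwatts to whole watts.
// Integer division truncates any fractional watt.
inline std::uint64_t microwatts_to_watts(std::uint64_t microwatts) {
  return microwatts / kMicrowattsPerWatt;
}

// Converts whole watts to microwatts, for example before passing a
// user-supplied cap to an API that expects microwatts.
inline std::uint64_t watts_to_microwatts(std::uint64_t watts) {
  // Guard against overflow of the 64-bit product.
  assert(watts <= std::numeric_limits<std::uint64_t>::max() / kMicrowattsPerWatt);
  return watts * kMicrowattsPerWatt;
}
```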
+31 -18
@@ -89,9 +89,10 @@ void TestPowerCapReadWrite::Close() {
void TestPowerCapReadWrite::Run(void) {
amdsmi_status_t ret;
uint64_t orig, min, max, new_cap;
uint64_t default_cap, min, max, new_cap, curr_cap;
clock_t start, end;
double cpu_time_used;
const uint64_t MICRO_CONVERSION = 1000000;
TestBase::Run();
if (setup_failed_) {
@@ -110,22 +111,24 @@ void TestPowerCapReadWrite::Run(void) {
ASSERT_EQ(ret, AMDSMI_STATUS_INVAL);
min = info.min_power_cap;
max = info.max_power_cap;
orig = info.default_power_cap;
default_cap = info.default_power_cap;
curr_cap = info.power_cap;
new_cap = (max + min)/2;
// Check if power cap is within the range
// skip the test otherwise
if (orig < min || orig > max) {
std::cout << "Power cap is not within the range. Skipping test for " << dv_ind << std::endl;
if (new_cap < min || new_cap > max) {
std::cout << "Power cap requested (" << new_cap
<< " uW) is not within the range. Skipping test for " << dv_ind << std::endl;
continue;
}
new_cap = (max + min)/2;
IF_VERB(STANDARD) {
std::cout << "Original Power Cap: " << orig << " uW" << std::endl;
std::cout << "Power Cap Range: " << max << " uW to " << min <<
std::cout << "[Before Set] Default Power Cap: " << default_cap << " uW" << std::endl;
std::cout << "[Before Set] Current Power Cap: " << curr_cap << " uW" << std::endl;
std::cout << "[Before Set] Power Cap Range [max to min]: " << max << " uW to " << min <<
" uW" << std::endl;
std::cout << "Setting new cap to " << new_cap << "..." << std::endl;
std::cout << "[Before Set] Setting new cap to " << new_cap << "..." << std::endl;
}
start = clock();
ret = amdsmi_set_power_cap(processor_handles_[dv_ind], 0, new_cap);
@@ -142,25 +145,35 @@ void TestPowerCapReadWrite::Run(void) {
ret = amdsmi_get_power_cap_info(processor_handles_[dv_ind], 0, &info);
CHK_ERR_ASRT(ret)
new_cap = info.default_power_cap;
curr_cap = info.power_cap;
// TODO(cfreehil) add some kind of assertion to verify new_cap is correct
// (or within a range)
IF_VERB(STANDARD) {
std::cout << "Time spent: " << cpu_time_used << " uS" << std::endl;
std::cout << "New Power Cap: " << new_cap << " uW" << std::endl;
std::cout << "Resetting cap to " << orig << "..." << std::endl;
std::cout << "[After Set] Time spent: " << cpu_time_used << " uS" << std::endl;
std::cout << "[After Set] Current Power Cap: " << curr_cap << " uW" << std::endl;
std::cout << "[After Set] Requested Power Cap: " << new_cap << " uW" << std::endl;
std::cout << "[After Set] Power Cap Range [max to min]: " << max << " uW to "
<< min << " uW" << std::endl;
std::cout << "[After Set] Resetting cap to " << default_cap << "..." << std::endl;
}
// Confirm the values are equal when expressed in whole watts
ASSERT_EQ(curr_cap/MICRO_CONVERSION, new_cap/MICRO_CONVERSION);
ret = amdsmi_set_power_cap(processor_handles_[dv_ind], 0, orig);
// Reset to default power cap
ret = amdsmi_set_power_cap(processor_handles_[dv_ind], 0, default_cap);
CHK_ERR_ASRT(ret)
ret = amdsmi_get_power_cap_info(processor_handles_[dv_ind], 0, &info);
CHK_ERR_ASRT(ret)
new_cap = info.default_power_cap;
curr_cap = info.power_cap;
IF_VERB(STANDARD) {
std::cout << "Current Power Cap: " << new_cap << " uW" << std::endl;
std::cout << "[After Reset] Current Power Cap: " << curr_cap << " uW" << std::endl;
std::cout << "[After Reset] Requested Power Cap (default): " << default_cap << " uW"
<< std::endl;
std::cout << "[After Reset] Power Cap Range [max to min]: " << max << " uW to "
<< min << " uW" << std::endl;
}
// Confirm the values are equal when expressed in whole watts
ASSERT_EQ(curr_cap/MICRO_CONVERSION, default_cap/MICRO_CONVERSION);
}
}
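The updated test picks the midpoint of the reported range as the new cap and then compares the requested and current caps at watt granularity. The sketch below restates those two steps as standalone functions. The names `midpoint_cap` and `same_whole_watts` are hypothetical, and the overflow-safe midpoint form is a substitution for the `(max + min)/2` expression used in the test.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: midpoint of [min_uw, max_uw] in microwatts, written
// as min + (max - min) / 2 so the intermediate sum cannot overflow.
inline std::uint64_t midpoint_cap(std::uint64_t min_uw, std::uint64_t max_uw) {
  assert(min_uw <= max_uw);
  return min_uw + (max_uw - min_uw) / 2;
}

// Hypothetical helper: true when two microwatt values fall in the same whole
// watt, mirroring the test's ASSERT_EQ on values divided by 1000000.
inline bool same_whole_watts(std::uint64_t a_uw, std::uint64_t b_uw) {
  constexpr std::uint64_t kMicro = 1000000;
  return a_uw / kMicro == b_uw / kMicro;
}
```

When overdrive is disabled and the minimum and maximum caps coincide, the midpoint equals both bounds, so the range check in the test passes and the set call is effectively a no-op.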