[SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558)

Changes:
  - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
  - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
    (Violation Status is the first example of this in monitor)
  - Improve CLI monitor output:
    support multiple GPU lines per GPU, add new columns, and better formatting
  - Refactor helpers and logger for flexible unit formatting and table rendering
  - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
    new metrics APIs in C++ example
  - Sync Python/C++ interface and structures for new metrics fields and naming
  - Remove deprecated/unused RSMI activity APIs, documentation not needed since
    the APIs no longer exist in ROCm SMI either.
  - Cleanup metric violations + fix handle watch arguments
  - Provide better handling/doc for average_flattened_ints()
  - Group xcp metrics with brackets in human readable + adjust output size

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>

[ROCm/amdsmi commit: e2e4fc65c1]
This commit is contained in:
Poag, Charis
2025-08-06 16:03:06 -05:00
committato da GitHub
parent 3437d5b5da
commit 07dfa789d0
12 ha cambiato i file con 737 aggiunte e 396 eliminazioni
+9 -10
Vedi File
@@ -117,12 +117,12 @@ $ amd-smi
<a name="separate-driver-reload-anchor"></a>
- **Separated driver reload from `amdsmi_set_gpu_memory_partition()` / `amdsmi_set_gpu_memory_partition_mode()` and CLI (`sudo amd-smi set -M <NPS mode>`)**
- Providing new API (`amdsmi_gpu_driver_reload()`) and CLI (`sudo amd-smi reset -r` or `sudo amd-smi reset --reload-driver`) once user is ready to reload driver. We understand
the automatic reload could be at an inconvienient time. This is why we now provide this
the automatic reload could be at an inconvenient time. This is why we now provide this
functionality in separate API/CLI commands to use when the time is right.
- It is important to understand, the memory (NPS) partition change requires:
1) Memory partition change request (`amdsmi_set_gpu_memory_partition()` / `amdsmi_set_gpu_memory_partition_mode()`) or CLI (`sudo amd-smi set -M <NPS mode>`)
2) Driver reload (`amdsmi_gpu_driver_reload()` / `sudo amd-smi reset -r` or `sudo amd-smi reset --reload-driver`) \[\*\]
\[\*\] <i>Driver reload requires all GPU activity on all devices to be stopped.</i>
1) Memory partition change request (`amdsmi_set_gpu_memory_partition()` / `amdsmi_set_gpu_memory_partition_mode()`) or CLI (`sudo amd-smi set -M <NPS mode>`)
2) Driver reload (`amdsmi_gpu_driver_reload()` / `sudo amd-smi reset -r` or `sudo amd-smi reset --reload-driver`) \[\*\]
***Driver reload requires all GPU activity on all devices to be stopped.***
- **Modified `amd-smi` CLI `monitor` and `metric` for violations**.
- Disabled `amd-smi monitor --violation` on guests.
@@ -164,9 +164,6 @@ $ amd-smi
- `AMDSMI_EVT_NOTIF_PROCESS_START`
- `AMDSMI_EVT_NOTIF_PROCESS_END`
- **Updated `amdsmi_get_clock_info` in `amdsmi_interface.py`**.
- The `clk_deep_sleep` field now returns the sleep integer value.
- **Added Power Cap to `amd-smi monitor`**.
- `amd-smi monitor -p` will display the power cap along with power.
@@ -357,6 +354,8 @@ $ amd-smi
- **Removed duplicated GPU IDs when receiving events using the `amd-smi event` command**.
- **Fixed `amd-smi monitor` decoder utilization (`DEC%`) not showing up on MI3x ASICs**.
### Upcoming changes
- N/A
@@ -800,7 +799,7 @@ Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to
### Changed
- **AMDSMI Library Version number to reflect changes in backwards compatability**.
- **AMDSMI Library Version number to reflect changes in backwards compatibility**.
- Removed Year from AMDSMI Library version number.
- Version changed from 25.2.0.0 (Year.Major.Minor.Patch) to 25.2.0 (Major.Minor.Patch)
- Removed year in all version references
@@ -852,7 +851,7 @@ Functions affected by struct change are:
- **Added violation status output for Graphics Clock Below Host Limit to our CLI: `amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**.
***Only available for MI300+ ASICs.***
Users can retrieve violation status' through either our Python or C++ APIs.
Additionally, we have added capability to view these outputs conviently through `amd-smi metric --throttle` and `amd-smi monitor --violation`.
Additionally, we have added capability to view these outputs conveniently through `amd-smi metric --throttle` and `amd-smi monitor --violation`.
Example outputs are listed below (below is for reference, output is subject to change):
```console
@@ -963,7 +962,7 @@ Functions affected by struct change are:
...
```
- **Changed amd-smi partition --accelerator & `amdsmi_get_gpu_accelerator_partition_profile_config()` detect users running without root/sudo privledges**
- **Changed amd-smi partition --accelerator & `amdsmi_get_gpu_accelerator_partition_profile_config()` detect users running without root/sudo permissions**
- Updated `amdsmi_get_gpu_accelerator_partition_profile_config()` to return `AMDSMI_STATUS_NO_PERM` immediately if users run without root/sudo permissions.
- Updated `amd-smi partition --accelerator` to provide a warning for users without root/sudo permissions (see example below, ***output subject to change***).
+79 -53
Vedi File
@@ -75,12 +75,64 @@ def _print_error(e, destination):
f = open(destination, "w", encoding="utf-8")
f.write(e)
f.close()
print("Error occured. Result written to " + str(destination) + " file")
print("Error occurred. Result written to " + str(destination) + " file")
def configure_logging_and_execute(args, amd_smi_commands):
"""
Configures logging based on the provided arguments and executes the subcommand.
Args:
args: Parsed command-line arguments.
amd_smi_commands: Instance of AMDSMICommands.
"""
# Remove previous log handlers
for handler in logging.root.handlers[:]:
logging.root.removeHandler(handler)
# To enable debug logs in AMD SMI library:
# set RSMI_LOGGING = 1 for logging to files
# set RSMI_LOGGING = 2 for logging to stdout
# set RSMI_LOGGING = 3 for logging to stdout and files
# set RSMI_LOGGING = 0 to disable logging
# Files will be located in /var/log/amd_smi_lib/AMD-SMI-lib.log*
# log string with the following format:
# loglevel | YYYY-MM-DD HH:MM:SS.ms | filename:line | message
logging_dict = {
'DEBUG': logging.DEBUG,
'INFO': logging.INFO,
'WARNING': logging.WARNING,
'ERROR': logging.ERROR,
'CRITICAL': logging.CRITICAL
}
time = '%(asctime)s.%(msecs)03d'
datefmt = '%Y-%m-%d %H:%M:%S'
logging.basicConfig(format='%(levelname)s | ' + time + ' | %(filename)s:%(lineno)d | %(message)s',
level=logging_dict[args.loglevel], datefmt=datefmt)
# Disable traceback for non-debug log levels
if args.loglevel == "DEBUG":
sys.tracebacklimit = 10
else:
sys.tracebacklimit = -1
logging.debug(args)
# Execute subcommands
try:
args.func(args)
except amdsmi_cli_exceptions.AmdSmiException as e:
_print_error(str(e), amd_smi_commands.logger.destination)
except amdsmi_exception.AmdSmiLibraryException as e:
exc = amdsmi_cli_exceptions.AmdSmiLibraryErrorException(amd_smi_commands.logger.format, e.get_error_code())
_print_error(str(exc), amd_smi_commands.logger.destination)
if __name__ == "__main__":
# Disable traceback before possible init errors in AMDSMICommands and AMDSMIParser
if "DEBUG" in sys.argv:
copy_argv = str(sys.argv.copy()).upper()
if "DEBUG" in copy_argv:
sys.tracebacklimit = 10
else:
sys.tracebacklimit = -1
@@ -107,57 +159,31 @@ if __name__ == "__main__":
sys_argv=sys.argv,
helpers=amd_smi_helpers)
try:
try:
argcomplete.autocomplete(amd_smi_parser)
except NameError:
logging.debug("argcomplete module not found. Autocomplete will not work.")
argcomplete.autocomplete(amd_smi_parser)
except NameError:
logging.debug("argcomplete module not found. Autocomplete will not work.")
# Store possible subcommands & aliases for later errors
valid_commands = amd_smi_parser.possible_commands
valid_commands += ['--help', '-h']
# Store possible subcommands & aliases for later errors
valid_commands = amd_smi_parser.possible_commands
valid_commands += ['--help', '-h']
sys.argv = [arg.lower() if arg.startswith('--') or not arg.startswith('-')
else arg for arg in sys.argv]
if len(sys.argv) == 1:
args = amd_smi_parser.parse_args(args=['default'])
elif sys.argv[1] in valid_commands:
args = amd_smi_parser.parse_args(args=None)
else:
raise amdsmi_cli_exceptions.AmdSmiInvalidSubcommandException(sys.argv[1],amd_smi_commands.logger.destination)
sys.argv = [arg.lower() if arg.startswith('--') or not arg.startswith('-')
else arg for arg in sys.argv]
if len(sys.argv) == 1:
args = amd_smi_parser.parse_args(args=['default'])
elif sys.tracebacklimit == 10 and (sys.argv[1] == '--loglevel'):
args = amd_smi_parser.parse_args(args=['default', '--loglevel'] + sys.argv[2:])
elif sys.argv[1] in valid_commands:
args = amd_smi_parser.parse_args(args=None)
else:
raise amdsmi_cli_exceptions.AmdSmiInvalidSubcommandException(sys.argv[1],amd_smi_commands.logger.destination)
# Handle command modifiers before subcommand execution
# human readable is the default output format
if hasattr(args, 'json') and args.json:
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.json.value
if hasattr(args, 'csv') and args.csv:
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.csv.value
if hasattr(args, 'file') and args.file:
amd_smi_commands.logger.destination = args.file
# Remove previous log handlers
for handler in logging.root.handlers[:]:
logging.root.removeHandler(handler)
logging_dict = {'DEBUG' : logging.DEBUG,
'INFO' : logging.INFO,
'WARNING': logging.WARNING,
'ERROR': logging.ERROR,
'CRITICAL': logging.CRITICAL}
# To enable debug logs on rocm-smi library set RSMI_LOGGING = 1 in environment
logging.basicConfig(format='%(levelname)s: %(message)s', level=logging_dict[args.loglevel])
# Disable traceback for non-debug log levels
if args.loglevel == "DEBUG":
sys.tracebacklimit = 10
else:
sys.tracebacklimit = -1
logging.debug(args)
# Execute subcommands
args.func(args)
except amdsmi_cli_exceptions.AmdSmiException as e:
_print_error(str(e), amd_smi_commands.logger.destination)
except amdsmi_exception.AmdSmiLibraryException as e:
exc = amdsmi_cli_exceptions.AmdSmiLibraryErrorException(amd_smi_commands.logger.format, e.get_error_code())
_print_error(str(exc), amd_smi_commands.logger.destination)
# Handle command modifiers before subcommand execution
# human readable is the default output format
if hasattr(args, 'json') and args.json:
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.json.value
if hasattr(args, 'csv') and args.csv:
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.csv.value
if hasattr(args, 'file') and args.file:
amd_smi_commands.logger.destination = args.file
configure_logging_and_execute(args, amd_smi_commands)
+214 -169
Vedi File
@@ -27,6 +27,7 @@ import os
import sys
import threading
import time
import copy
from _version import __version__
from amdsmi_cli_exceptions import AmdSmiInvalidParameterException, AmdSmiRequiredCommandException, AmdSmiInvalidCommandException
@@ -1430,7 +1431,29 @@ class AMDSMICommands():
def build_xcp_dict(self, key, violation_status, num_partition):
return {f"xcp_{i}": violation_status[key][i] for i in range(num_partition)}
if not isinstance(violation_status[key], list):
if "active_" in key:
if violation_status[key] != "N/A":
if violation_status[key] is True:
violation_status[key] = "ACTIVE"
elif violation_status[key] is False:
violation_status[key] = "NOT ACTIVE"
ret = violation_status[key]
elif isinstance(violation_status[key], list):
for row in violation_status[key]:
for element in row:
if element != "N/A":
if "active_" in key:
if element is True:
row[row.index(element)] = "ACTIVE"
elif element is False:
row[row.index(element)] = "NOT ACTIVE"
elif ("per_" or "acc_") in key:
row[row.index(element)] = element
else:
continue
ret = {f"xcp_{i}": violation_status[key][i] for i in range(num_partition)}
return ret
def metric_gpu(self, args, multiple_devices=False, watching_output=False, gpu=None,
usage=None, watch=None, watch_time=None, iterations=None, power=None,
@@ -1469,7 +1492,7 @@ class AMDSMICommands():
guest_data (bool, optional): Value override for args.guest_data. Defaults to None.
fb_usage (bool, optional): Value override for args.fb_usage. Defaults to None.
xgmi (bool, optional): Value override for args.xgmi. Defaults to None.
throttle (bool, optional): Value override for args.violation. Defaults to None.
throttle (bool, optional): Value override for args.throttle. Defaults to None.
Raises:
IndexError: Index error if gpu list is empty
@@ -1506,6 +1529,8 @@ class AMDSMICommands():
args.clock = clock
if temperature:
args.temperature = temperature
if voltage:
args.voltage = voltage
if pcie:
args.pcie = pcie
if ecc:
@@ -1532,10 +1557,11 @@ class AMDSMICommands():
args.energy = energy
if throttle:
args.violation = throttle
args.throttle = throttle
current_platform_args += ["fan", "voltage_curve", "overdrive", "perf_level",
"xgmi_err", "energy", "throttle"]
current_platform_values += [args.fan, args.voltage_curve, args.overdrive,
args.perf_level, args.xgmi_err, args.energy, args.violation,
args.perf_level, args.xgmi_err, args.energy, args.throttle,
]
if self.helpers.is_hypervisor():
@@ -1636,88 +1662,22 @@ class AMDSMICommands():
gpu_metric = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("#3 - Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
gpu_metric = {
"temperature_edge": "N/A",
"temperature_hotspot": "N/A",
"temperature_mem": "N/A",
"temperature_vrgfx": "N/A",
"temperature_vrsoc": "N/A",
"temperature_vrmem": "N/A",
"average_gfx_activity": "N/A",
"average_umc_activity": "N/A",
"average_mm_activity": "N/A",
"average_socket_power": "N/A",
"energy_accumulator": "N/A",
"system_clock_counter": "N/A",
"average_gfxclk_frequency": "N/A",
"average_socclk_frequency": "N/A",
"average_uclk_frequency": "N/A",
"average_vclk0_frequency": "N/A",
"average_dclk0_frequency": "N/A",
"average_vclk1_frequency": "N/A",
"average_dclk1_frequency": "N/A",
"current_gfxclk": "N/A",
"current_socclk": "N/A",
"current_uclk": "N/A",
"current_vclk0": "N/A",
"current_dclk0": "N/A",
"current_vclk1": "N/A",
"current_dclk1": "N/A",
"throttle_status": "N/A",
"current_fan_speed": "N/A",
"pcie_link_width": "N/A",
"pcie_link_speed": "N/A",
"gfx_activity_acc": "N/A",
"mem_activity_acc": "N/A",
"temperature_hbm": "N/A",
"firmware_timestamp": "N/A",
"voltage_soc": "N/A",
"voltage_gfx": "N/A",
"voltage_mem": "N/A",
"indep_throttle_status": "N/A",
"current_socket_power": "N/A",
"vcn_activity": "N/A",
"gfxclk_lock_status": "N/A",
"xgmi_link_width": "N/A",
"xgmi_link_speed": "N/A",
"pcie_bandwidth_acc": "N/A",
"pcie_bandwidth_inst": "N/A",
"pcie_l0_to_recov_count_acc": "N/A",
"pcie_replay_count_acc": "N/A",
"pcie_replay_rover_count_acc": "N/A",
"xgmi_read_data_acc": "N/A",
"xgmi_write_data_acc": "N/A",
"current_gfxclks": "N/A",
"current_socclks": "N/A",
"current_vclk0s": "N/A",
"current_dclk0s": "N/A",
"jpeg_activity": "N/A",
"pcie_nak_sent_count_acc": "N/A",
"pcie_nak_rcvd_count_acc": "N/A",
"accumulation_counter": "N/A",
"prochot_residency_acc": "N/A",
"ppt_residency_acc": "N/A",
"socket_thm_residency_acc": "N/A",
"vr_thm_residency_acc": "N/A",
"hbm_thm_residency_acc": "N/A",
"num_partition": "N/A",
"xcp_stats.gfx_busy_inst": "N/A",
"xcp_stats.jpeg_busy": "N/A",
"xcp_stats.vcn_busy": "N/A",
"xcp_stats.gfx_busy_acc": "N/A",
"xcp_stats.gfx_below_host_limit_acc": "N/A",
"xcp_stats.gfx_below_host_limit_ppt_acc": "N/A",
"xcp_stats.gfx_below_host_limit_thm_acc": "N/A",
"xcp_stats.gfx_low_utilization_acc": "N/A",
"xcp_stats.gfx_below_host_limit_total_acc": "N/A",
"xcp_stats.gfx_below_host_limit_ppt_per": "N/A",
"xcp_stats.gfx_below_host_limit_thm_per": "N/A",
"xcp_stats.gfx_low_utilization_per": "N/A",
"xcp_stats.gfx_below_host_limit_total_per": "N/A",
"pcie_lc_perf_other_end_recovery": "N/A",
"vram_max_bandwidth": "N/A",
"xgmi_link_status": "N/A",
}
gpu_metric = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
# Workaround for XCP (partition) metrics not providing num_partition in v1.0
# Confirmed with driver team that we can default to 1 if num_partition is not defined.
# Pending partitions exist, ie. partition_id > 0. See logic below.
try:
partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
partition_id = "N/A"
num_partition = gpu_metric['num_partition']
if num_partition == "N/A" and isinstance(partition_id, int) and partition_id > 0:
num_partition = 1 # Workaround for XCP metrics not providing num_partition in v1.0
logging.debug(f"num_partition is N/A and partition_id: {partition_id} (greater > 0).\nModified num_partition: {num_partition} to adjust for XCP metrics.")
if self.logger.is_json_format():
values_dict['gpu'] = int(gpu_id)
# Populate the pcie_dict first due to multiple gpu metrics calls incorrectly increasing bandwidth
@@ -1821,7 +1781,6 @@ class AMDSMICommands():
# TODO: move vcn_activity and jpeg_activity into amdsmi_get_gpu_activity
engine_usage['vcn_activity'] = gpu_metric['vcn_activity']
engine_usage['jpeg_activity'] = gpu_metric['jpeg_activity']
num_partition = gpu_metric['num_partition']
engine_usage['gfx_busy_inst'] = "N/A"
engine_usage['jpeg_busy'] = "N/A"
engine_usage['vcn_busy'] = "N/A"
@@ -2560,7 +2519,7 @@ class AMDSMICommands():
values_dict['mem_usage'] = memory_usage
if "throttle" in current_platform_args:
if args.violation:
if args.throttle:
throttle_status = {
# Current values - counter/accumulated
'accumulation_counter': "N/A",
@@ -2571,9 +2530,9 @@ class AMDSMICommands():
'hbm_thermal_accumulated': "N/A",
'gfx_clk_below_host_limit_accumulated': "N/A", # deprecated
'gfx_clk_below_host_limit_power_accumulated': "N/A",
'gfx_clk_below_host_limit_thermal_violation_accumulated': "N/A",
'gfx_clk_below_host_limit_violation_accumulated': "N/A",
'low_utilization_violation_accumulated': "N/A",
'gfx_clk_below_host_limit_thermal_accumulated': "N/A",
'total_gfx_clk_below_host_limit_accumulated': "N/A",
'low_utilization_accumulated': "N/A",
# violation status values - active/not active
'prochot_violation_status': "N/A",
@@ -2581,9 +2540,10 @@ class AMDSMICommands():
'socket_thermal_violation_status': "N/A",
'vr_thermal_violation_status': "N/A",
'hbm_thermal_violation_status': "N/A",
'gfx_clk_below_host_limit_violation_status': "N/A", # deprecated
'gfx_clk_below_host_limit_power_violation_status': "N/A",
'gfx_clk_below_host_limit_thermal_violation_status': "N/A",
'gfx_clk_below_host_limit_violation_status': "N/A",
'total_gfx_clk_below_host_limit_violation_status': "N/A",
'low_utilization_violation_status': "N/A",
# violation activity values - percent
@@ -2592,12 +2552,12 @@ class AMDSMICommands():
'socket_thermal_violation_activity': "N/A",
'vr_thermal_violation_activity': "N/A",
'hbm_thermal_violation_activity': "N/A",
'gfx_clk_below_host_limit_violation_activity': "N/A", # deprecated
'gfx_clk_below_host_limit_power_violation_activity': "N/A",
'gfx_clk_below_host_limit_thermal_violation_activity': "N/A",
'gfx_clk_below_host_limit_violation_activity': "N/A",
'total_gfx_clk_below_host_limit_violation_activity': "N/A",
'low_utilization_violation_activity': "N/A",
}
num_partition = gpu_metric['num_partition']
try:
violation_status = amdsmi_interface.amdsmi_get_violation_status(args.gpu)
@@ -2609,18 +2569,18 @@ class AMDSMICommands():
throttle_status['hbm_thermal_accumulated'] = violation_status['acc_hbm_thrm']
throttle_status['gfx_clk_below_host_limit_accumulated'] = violation_status['acc_gfx_clk_below_host_limit'] #deprecated
throttle_status['gfx_clk_below_host_limit_power_accumulated'] = self.build_xcp_dict('acc_gfx_clk_below_host_limit_pwr', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_thermal_violation_accumulated'] = self.build_xcp_dict('acc_gfx_clk_below_host_limit_thm', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_violation_accumulated'] = self.build_xcp_dict('acc_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['low_utilization_violation_accumulated'] = self.build_xcp_dict('acc_low_utilization', violation_status, num_partition)
throttle_status['prochot_violation_status'] = violation_status['active_prochot_thrm']
throttle_status['ppt_violation_status'] = violation_status['active_ppt_pwr']
throttle_status['socket_thermal_violation_status'] = violation_status['active_socket_thrm']
throttle_status['vr_thermal_violation_status'] = violation_status['active_vr_thrm']
throttle_status['hbm_thermal_violation_status'] = violation_status['active_hbm_thrm']
throttle_status['gfx_clk_below_host_limit_violation_status'] = violation_status['active_gfx_clk_below_host_limit'] # deprecated
throttle_status['gfx_clk_below_host_limit_thermal_accumulated'] = self.build_xcp_dict('acc_gfx_clk_below_host_limit_thrm', violation_status, num_partition)
throttle_status['total_gfx_clk_below_host_limit_accumulated'] = self.build_xcp_dict('acc_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['low_utilization_accumulated'] = self.build_xcp_dict('acc_low_utilization', violation_status, num_partition)
throttle_status['prochot_violation_status'] = self.build_xcp_dict('active_prochot_thrm', violation_status, num_partition)
throttle_status['ppt_violation_status'] = self.build_xcp_dict('active_ppt_pwr', violation_status, num_partition)
throttle_status['socket_thermal_violation_status'] = self.build_xcp_dict('active_socket_thrm', violation_status, num_partition)
throttle_status['vr_thermal_violation_status'] = self.build_xcp_dict('active_vr_thrm', violation_status, num_partition)
throttle_status['hbm_thermal_violation_status'] = self.build_xcp_dict('active_hbm_thrm', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit', violation_status, num_partition) # deprecated
throttle_status['gfx_clk_below_host_limit_power_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit_pwr', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_thermal_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit_thm', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_thermal_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit_thrm', violation_status, num_partition)
throttle_status['total_gfx_clk_below_host_limit_violation_status'] = self.build_xcp_dict('active_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['low_utilization_violation_status'] = self.build_xcp_dict('active_low_utilization', violation_status, num_partition)
throttle_status['prochot_violation_activity'] = violation_status['per_prochot_thrm']
throttle_status['ppt_violation_activity'] = violation_status['per_ppt_pwr']
@@ -2629,20 +2589,15 @@ class AMDSMICommands():
throttle_status['hbm_thermal_violation_activity'] = violation_status['per_hbm_thrm']
throttle_status['gfx_clk_below_host_limit_violation_activity'] = violation_status['per_gfx_clk_below_host_limit'] # deprecated
throttle_status['gfx_clk_below_host_limit_power_violation_activity'] = self.build_xcp_dict('per_gfx_clk_below_host_limit_pwr', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_thermal_violation_activity'] = self.build_xcp_dict('per_gfx_clk_below_host_limit_thm', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_violation_activity'] = self.build_xcp_dict('per_low_utilization', violation_status, num_partition)
throttle_status['low_utilization_violation_activity'] = self.build_xcp_dict('per_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['gfx_clk_below_host_limit_thermal_violation_activity'] = self.build_xcp_dict('per_gfx_clk_below_host_limit_thrm', violation_status, num_partition)
throttle_status['total_gfx_clk_below_host_limit_violation_activity'] = self.build_xcp_dict('per_gfx_clk_below_host_limit_total', violation_status, num_partition)
throttle_status['low_utilization_violation_activity'] = self.build_xcp_dict('per_low_utilization', violation_status, num_partition)
except amdsmi_exception.AmdSmiLibraryException as e:
values_dict['throttle'] = throttle_status
logging.debug("Failed to get violation status' for gpu %s | %s", gpu_id, e.get_error_info())
for key, value in throttle_status.items():
if "_status" in key:
if value is True:
throttle_status[key] = "ACTIVE"
elif value is False:
throttle_status[key] = "NOT ACTIVE"
activity_unit = ''
if "_activity" in key:
@@ -2651,21 +2606,18 @@ class AMDSMICommands():
if self.logger.is_human_readable_format():
if isinstance(value, (list, dict)):
for k, v in value.items():
for index, activity in enumerate(v):
if activity != "N/A":
value[k][index] = f"{activity} {activity_unit}"
value[k] = '[' + ", ".join(value[k]) + ']'
for index, activity in enumerate(v):
value[k][index] = self.helpers.unit_format(self.logger, activity, activity_unit)
value[k] = '[' + ", ".join(value[k]) + ']'
elif value != "N/A":
throttle_status[key] = f"{value} {activity_unit}"
value = self.helpers.unit_format(self.logger, value, activity_unit)
if self.logger.is_json_format():
if isinstance(value, list):
for index, activity in enumerate(value):
if activity != "N/A":
throttle_status[key][index] = {"value" : activity,
"unit" : activity_unit}
if isinstance(value, (list, dict)):
for k, v in value.items():
for index, activity in enumerate(v):
value[k][index] = self.helpers.unit_format(self.logger, activity, activity_unit)
elif value != "N/A":
throttle_status[key] = {"value" : value,
"unit" : activity_unit}
throttle_status[key] = self.helpers.unit_format(self.logger, value, activity_unit)
values_dict['throttle'] = throttle_status
# Store timestamp first if watching_output is enabled
@@ -5525,7 +5477,6 @@ class AMDSMICommands():
self.logger.clear_multiple_devices_output()
return
def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
watch=None, watch_time=None, iterations=None, power_usage=None,
temperature=None, gfx_util=None, mem_util=None, encoder=None,
@@ -5691,9 +5642,29 @@ class AMDSMICommands():
gpu_metric_debug_info = json.dumps(gpu_metrics_info, indent=4)
logging.debug("GPU Metrics table for GPU %s | %s", gpu_id, gpu_metric_debug_info)
except amdsmi_exception.AmdSmiLibraryException as e:
gpu_metrics_info = {} # Empty dict to avoid NameError
gpu_metrics_info = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
# Workaround for XCP (partition) metrics not providing num_partition in v1.0
# Confirmed with driver team that we can default to 1 if num_partition is not defined.
# Pending partitions exist, ie. partition_id > 0. See logic below.
try:
partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
partition_id = "N/A"
num_partition = gpu_metrics_info['num_partition']
if num_partition == "N/A":
num_partition = partition_id
num_xcp = num_partition # used later for XCP metrics
self.logger.table_header += 'XCP'.rjust(5, ' ')
self.logger.store_output(args.gpu, 'xcp', partition_id) # Starting with partition_id.
# Outputs which have xcp details
# will update this value via num_xcp.
# This value will help map to primary device.
# Store the pcie_bw values due to possible increase in bandwidth due to repeated gpu_metrics calls
if args.pcie:
try:
@@ -5725,10 +5696,11 @@ class AMDSMICommands():
self.logger.table_header += 'POWER'.rjust(7)
if args.power_usage and not args.default_output:
# Get Max Power Cap
# Get Current Power Cap
try:
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu)
monitor_values['max_power'] = power_cap_info['max_power_cap']
monitor_values['max_power'] = power_cap_info['power_cap'] # Get current power cap (`power_cap`) socket is set to
# `max_power_cap`, is the maximum value it can be set to
monitor_values['max_power'] = self.helpers.convert_SI_unit(monitor_values['max_power'], AMDSMIHelpers.SI_Unit.MICRO)
if self.logger.is_human_readable_format() and monitor_values['max_power'] != "N/A":
@@ -5785,7 +5757,7 @@ class AMDSMICommands():
monitor_values['gfx_clk'] = f"{monitor_values['gfx_clk']} {freq_unit}"
if self.logger.is_json_format():
monitor_values['gfx_clk'] = {"value" : monitor_values['gfx_clk'],
"unit" : freq_unit}
"unit" : freq_unit}
except (KeyError, amdsmi_exception.AmdSmiLibraryException) as e:
monitor_values['gfx_clk'] = "N/A"
@@ -5795,13 +5767,13 @@ class AMDSMICommands():
try:
gfx_util = gpu_metrics_info['average_gfx_activity']
monitor_values['gfx'] = round(gfx_util)
activity_unit = '%'
if gfx_util != "N/A":
if self.logger.is_human_readable_format():
monitor_values['gfx'] = f"{monitor_values['gfx']} {activity_unit}"
if self.logger.is_json_format():
monitor_values['gfx'] = {"value" : monitor_values['gfx'],
monitor_values['gfx'] = gfx_util
if self.logger.is_human_readable_format():
monitor_values['gfx'] = f"{monitor_values['gfx']} {activity_unit}"
if self.logger.is_json_format():
monitor_values['gfx'] = {"value" : monitor_values['gfx'],
"unit" : activity_unit}
except (KeyError, amdsmi_exception.AmdSmiLibraryException) as e:
monitor_values['gfx'] = "N/A"
@@ -5812,14 +5784,14 @@ class AMDSMICommands():
if args.mem:
try:
mem_util = gpu_metrics_info['average_umc_activity']
monitor_values['mem'] = round(mem_util)
activity_unit = '%'
if mem_util != "N/A":
if self.logger.is_human_readable_format():
monitor_values['mem'] = f"{monitor_values['mem']} {activity_unit}"
if self.logger.is_json_format():
monitor_values['mem'] = {"value" : monitor_values['mem'],
"unit" : activity_unit}
monitor_values['mem'] = mem_util
if self.logger.is_human_readable_format():
monitor_values['mem'] = f"{monitor_values['mem']} {activity_unit}"
if self.logger.is_json_format():
monitor_values['mem'] = {"value" : monitor_values['mem'],
"unit" : activity_unit}
except (KeyError, amdsmi_exception.AmdSmiLibraryException) as e:
monitor_values['mem'] = "N/A"
logging.debug("Failed to get mem utilization on gpu %s | %s", gpu_id, e)
@@ -5878,19 +5850,13 @@ class AMDSMICommands():
if args.decoder:
try:
# Get List of vcn activity values
# Note: MI3x ASICs only support decoding, so the vcn_activity is used for decoding activity.
# Note: MI3x ASICs only support decoding, so the vcn_activity/vcn_busy
# is used for decoding activity.
decoder_util = gpu_metrics_info['vcn_activity']
decoding_activity_avg = []
for value in decoder_util:
if isinstance(value, int):
decoding_activity_avg.append(value)
# Averaging the possible decoding activity values
if decoding_activity_avg:
decoding_activity_avg = round(sum(decoding_activity_avg) / len(decoding_activity_avg))
else:
decoding_activity_avg = "N/A"
if (gpu_metrics_info['vcn_activity'][0] == "N/A" and
gpu_metrics_info['xcp_stats.vcn_busy'][partition_id][0] != "N/A"):
decoder_util = gpu_metrics_info['xcp_stats.vcn_busy'][partition_id]
decoding_activity_avg = self.helpers.average_flattened_ints(decoder_util, context="decoder_util")
monitor_values['decoder'] = decoding_activity_avg
activity_unit = '%'
@@ -6050,6 +6016,10 @@ class AMDSMICommands():
"vr_tviol": "N/A",
"hbm_tviol": "N/A",
"gfx_clkviol": "N/A",
"gfxclk_pviol": "N/A",
"gfxclk_tviol": "N/A",
"gfxclk_totalviol": "N/A",
"low_utilviol": "N/A"
}
try:
violations = amdsmi_interface.amdsmi_get_violation_status(args.gpu)
@@ -6060,6 +6030,10 @@ class AMDSMICommands():
violation_status['vr_tviol'] = violations['per_vr_thrm']
violation_status['hbm_tviol'] = violations['per_hbm_thrm']
violation_status['gfx_clkviol'] = violations['per_gfx_clk_below_host_limit']
violation_status['gfxclk_pviol'] = violations['per_gfx_clk_below_host_limit_pwr']
violation_status['gfxclk_tviol'] = violations['per_gfx_clk_below_host_limit_thrm']
violation_status['gfxclk_totalviol'] = violations['per_gfx_clk_below_host_limit_total']
violation_status['low_utilviol'] = violations['per_low_utilization']
except amdsmi_exception.AmdSmiLibraryException as e:
monitor_values['pviol'] = violation_status['pviol']
monitor_values['tviol'] = violation_status['tviol']
@@ -6068,6 +6042,10 @@ class AMDSMICommands():
monitor_values['vr_tviol'] = violation_status['vr_tviol']
monitor_values['hbm_tviol'] = violation_status['hbm_tviol']
monitor_values['gfx_clkviol'] = violation_status['gfx_clkviol']
monitor_values['gfxclk_pviol'] = violation_status['gfxclk_pviol']
monitor_values['gfxclk_tviol'] = violation_status['gfxclk_tviol']
monitor_values['gfxclk_totalviol'] = violation_status['gfxclk_totalviol']
monitor_values['low_utilviol'] = violation_status['low_utilviol']
logging.debug("Failed to get violation status on gpu %s | %s", gpu_id, e.get_error_info())
violation_status_unit = "%"
kPVIOL_MAX_WIDTH = 7
@@ -6077,23 +6055,32 @@ class AMDSMICommands():
kVR_MAX_WIDTH = 10
kHBM_MAX_WIDTH = 11
kGFXC_MAX_WIDTH = 13
kGFXC_PVIOL_MAX_WIDTH = 58
kGFXC_TVIOL_MAX_WIDTH = kGFXC_PVIOL_MAX_WIDTH
kGFXC_TOTALVIOL_MAX_WIDTH = kGFXC_PVIOL_MAX_WIDTH
kLOW_UTILVIOL_MAX_WIDTH = kGFXC_PVIOL_MAX_WIDTH
for key, value in violation_status.items():
if value != "N/A":
if key == "tviol_active":
monitor_values[key] = value
if not isinstance(value, list):
if value != "N/A":
if key == 'tviol_active' or key == 'xcp':
monitor_values[key] = value
else:
monitor_values[key] = self.helpers.unit_format(self.logger, violation_status[key], violation_status_unit)
else:
monitor_values[key] = self.helpers.unit_format(self.logger, violation_status[key], violation_status_unit)
monitor_values[key] = violation_status[key]
else:
monitor_values[key] = violation_status[key]
if num_partition != "N/A":
# these are one after another, in order to display each in sub-sections
new_xcp_dict = {}
for current_xcp in range(num_partition):
new_xcp_dict[f"xcp_{current_xcp}"] = self.helpers.unit_format(self.logger, value[current_xcp], "%")
monitor_values[key] = new_xcp_dict
else:
monitor_values[key] = value[0] if value else "N/A"
# save deep copy of monitor values, used later to grab xcp specific values
monitor_values_deepcopy = copy.deepcopy(monitor_values)
if self.logger.is_human_readable_format():
monitor_values['pviol'] = monitor_values['pviol'].rjust(kPVIOL_MAX_WIDTH, ' ')
monitor_values['tviol'] = monitor_values['tviol'].rjust(kTVIOL_MAX_WIDTH, ' ')
monitor_values['phot_tviol'] = monitor_values['phot_tviol'].rjust(kPHOT_MAX_WIDTH, ' ')
monitor_values['vr_tviol'] = monitor_values['vr_tviol'].rjust(kVR_MAX_WIDTH, ' ')
monitor_values['hbm_tviol'] = monitor_values['hbm_tviol'].rjust(kHBM_MAX_WIDTH, ' ')
monitor_values['gfx_clkviol'] = monitor_values['gfx_clkviol'].rjust(kGFXC_MAX_WIDTH, ' ')
self.logger.table_header += 'PVIOL'.rjust(kPVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'TVIOL'.rjust(kTVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'TVIOL_ACTIVE'.rjust(kTVIOL_ACTIVE_MAX_WIDTH, ' ')
@@ -6101,9 +6088,69 @@ class AMDSMICommands():
self.logger.table_header += 'VR_TVIOL'.rjust(kVR_MAX_WIDTH, ' ')
self.logger.table_header += 'HBM_TVIOL'.rjust(kHBM_MAX_WIDTH, ' ')
self.logger.table_header += 'GFX_CLKVIOL'.rjust(kGFXC_MAX_WIDTH, ' ')
self.logger.table_header += 'GFXCLK_PVIOL'.rjust(kGFXC_PVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'GFXCLK_TVIOL'.rjust(kGFXC_TVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'GFXCLK_TOTALVIOL'.rjust(kGFXC_TOTALVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'LOW_UTILVIOL'.rjust(kLOW_UTILVIOL_MAX_WIDTH, ' ')
self.logger.store_output(args.gpu, 'values', monitor_values)
# Print/capture by XCPs
if num_partition != "N/A" and partition_id == 0:
current_xcp = 0
while (current_xcp in range(num_partition) or current_xcp == 0):
if not multiple_devices and watching_output and current_xcp == 0:
# Need to clear output for single device, otherwise while watching output
# XCP detail will continue stacking on top of each other
self.logger.clear_multiple_devices_output()
if watching_output:
self.logger.store_output(args.gpu, 'timestamp', int(time.time()))
self.logger.store_output(args.gpu, 'xcp', current_xcp)
if current_xcp != 0: # set all other values without XCP stats to N/A
monitor_values['pviol'] = "N/A"
monitor_values['tviol'] = "N/A"
monitor_values['tviol_active'] = "N/A"
monitor_values['phot_tviol'] = "N/A"
monitor_values['vr_tviol'] = "N/A"
monitor_values['hbm_tviol'] = "N/A"
monitor_values['gfx_clkviol'] = "N/A"
for k, _ in monitor_values.items(): # change other keys to "N/A" since we should have all applicable XCP stats
# eg. amd-smi monitor -p -t -V should only show XCP info for violations
# below primary device
if k != 'xcp' and k not in ['gfxclk_pviol', 'gfxclk_tviol', 'gfxclk_totalviol', 'low_utilviol']:
monitor_values[k] = "N/A"
if isinstance(monitor_values_deepcopy['gfxclk_pviol'], dict):
monitor_values['gfxclk_pviol'] = monitor_values_deepcopy['gfxclk_pviol'][f"xcp_{current_xcp}"]
if isinstance(monitor_values_deepcopy['gfxclk_tviol'], dict):
monitor_values['gfxclk_tviol'] = monitor_values_deepcopy['gfxclk_tviol'][f"xcp_{current_xcp}"]
if isinstance(monitor_values_deepcopy['gfxclk_totalviol'], dict):
monitor_values['gfxclk_totalviol'] = monitor_values_deepcopy['gfxclk_totalviol'][f"xcp_{current_xcp}"]
if isinstance(monitor_values_deepcopy['low_utilviol'], dict):
monitor_values['low_utilviol'] = monitor_values_deepcopy['low_utilviol'][f"xcp_{current_xcp}"]
if self.logger.is_human_readable_format():
monitor_values['pviol'] = monitor_values['pviol'].rjust(kPVIOL_MAX_WIDTH, ' ')
monitor_values['tviol'] = monitor_values['tviol'].rjust(kTVIOL_MAX_WIDTH, ' ')
monitor_values['phot_tviol'] = monitor_values['phot_tviol'].rjust(kPHOT_MAX_WIDTH, ' ')
monitor_values['vr_tviol'] = monitor_values['vr_tviol'].rjust(kVR_MAX_WIDTH, ' ')
monitor_values['hbm_tviol'] = monitor_values['hbm_tviol'].rjust(kHBM_MAX_WIDTH, ' ')
monitor_values['gfx_clkviol'] = monitor_values['gfx_clkviol'].rjust(kGFXC_MAX_WIDTH, ' ')
monitor_values['gfxclk_pviol'] = str(monitor_values['gfxclk_pviol']).rjust(kGFXC_PVIOL_MAX_WIDTH, ' ').strip().replace('\'', '')
monitor_values['gfxclk_tviol'] = str(monitor_values['gfxclk_tviol']).rjust(kGFXC_TVIOL_MAX_WIDTH, ' ').strip().replace('\'', '')
monitor_values['gfxclk_totalviol'] = str(monitor_values['gfxclk_totalviol']).rjust(kGFXC_TOTALVIOL_MAX_WIDTH, ' ').strip().replace('\'', '')
monitor_values['low_utilviol'] = str(monitor_values['low_utilviol']).rjust(kLOW_UTILVIOL_MAX_WIDTH, ' ').strip().replace('\'', '')
self.logger.store_output(args.gpu, 'values', monitor_values)
self.logger.store_multiple_device_output()
current_xcp += 1
else:
self.logger.store_output(args.gpu, 'xcp', num_xcp)
self.logger.store_output(args.gpu, 'values', monitor_values)
self.logger.store_multiple_device_output()
# Store typical output for all commands (XCP data will be handled separately, eg. violation status)
if not args.violation:
self.logger.store_output(args.gpu, 'values', monitor_values)
# intialize dual_csv_format; applicable to process only
dual_csv_output = False
@@ -6207,7 +6254,7 @@ class AMDSMICommands():
self.logger.store_watch_output(multiple_device_enabled=False)
self.logger.print_output(multiple_device_enabled=False, watching_output=watching_output, tabular=True, dual_csv_output=dual_csv_output)
self.logger.print_output(multiple_device_enabled=True, watching_output=watching_output, tabular=True, dual_csv_output=dual_csv_output)
def xgmi(self, args, multiple_devices=False, gpu=None, metric=None, xgmi_link_status=None):
@@ -6947,7 +6994,7 @@ class AMDSMICommands():
try:
gpu_metrics = amdsmi_interface.amdsmi_get_gpu_metrics_info(processor)
except amdsmi_exception.AmdSmiLibraryException as e:
gpu_metrics = "N/A"
gpu_metrics = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
# partition info
try:
@@ -6999,9 +7046,7 @@ class AMDSMICommands():
# mem utilization, GPU utilization, power usage, and temperature from gpu_metrics
if gpu_metrics != "N/A":
mem_util = gpu_metrics['average_umc_activity']
mem_util = round(mem_util)
gfx_util = gpu_metrics['average_gfx_activity']
gfx_util = round(gfx_util)
if gpu_metrics['current_socket_power'] != "N/A":
current_power = gpu_metrics['current_socket_power']
else:
@@ -1014,13 +1014,28 @@ class AMDSMIHelpers():
return:
str or dict : formatted output
"""
if value == "N/A":
return "N/A"
if logger.is_json_format():
return {"value": value, "unit": unit}
if logger.is_human_readable_format():
return f"{value} {unit}".rstrip()
return f"{value}"
if isinstance(value, list):
formatted_values = []
for val in value:
if isinstance(val, str) and val == "N/A":
formatted_values.append("N/A")
else:
formatted_values.append(self.unit_format(logger, val, unit))
return formatted_values
else:
if value == "N/A":
return "N/A"
if logger.is_json_format():
if unit:
return {"value": value, "unit": unit}
else:
return value
if logger.is_human_readable_format():
if unit:
return f"{value} {unit}".rstrip()
else:
return f"{value}".rstrip()
return f"{value}"
def unit_unformat(self, logger, formatted_value):
"""
@@ -1483,3 +1498,22 @@ class AMDSMIHelpers():
ranges[cpu] = f"{start_setbit}-{end_setbit}"
return ranges
@staticmethod
def average_flattened_ints(data, context="data"):
"""Calculate the average of flattened integers from a list or tuple
Args:
data (list or tuple): Data to calculate the average from
context (str, optional): Context for logging. Defaults to "data".
Returns:
float or str: Average of integers if available, otherwise "N/A"
"""
# Type validation - ensure data is list or tuple
# Note: Data can be nested list of lists and will filter out N/A values
if not isinstance(data, (list, tuple)):
logging.debug(f"Invalid data type for {context}: expected list/tuple, got {type(data)}")
return "N/A"
# Flatten nested lists and filter integers
flat = [v for value in data for v in (value if isinstance(value, list) else [value]) if isinstance(v, int)]
return round(sum(flat) / len(flat)) if flat else "N/A"
@@ -157,6 +157,9 @@ class AMDSMILogger():
elif key == 'gpu':
stored_gpu = string_value
table_values += string_value.rjust(3)
elif key == 'xcp':
stored_gpu = string_value
table_values += string_value.rjust(5)
elif key == 'timestamp':
stored_timestamp = string_value
table_values += string_value.rjust(10) + ' '
@@ -170,6 +173,8 @@ class AMDSMILogger():
table_values += string_value.rjust(7)
elif key in ('gfx_clk'):
table_values += string_value.rjust(10)
elif key in ('vram_usage'):
table_values += string_value.rjust(16)
elif key in ('mem_clock', 'vram_used'):
table_values += string_value.rjust(11)
elif key in ('vram_total', 'vram_free'):
@@ -217,6 +222,8 @@ class AMDSMILogger():
table_values += string_value.rjust(11)
elif key == "gfx_clkviol":
table_values += string_value.rjust(13)
elif key in ("gfxclk_pviol", "gfxclk_tviol", "gfxclk_totalviol", "low_utilviol"):
table_values += string_value.rjust(58)
elif key == "process_list":
#Add an additional padding between the first instance of GPU and NAME
table_values += ' '
@@ -1014,8 +1014,8 @@ class AMDSMIParser(argparse.ArgumentParser):
metric_parser.add_argument('-l', '--perf-level', action='store_true', required=False, help=perf_level_help)
metric_parser.add_argument('-x', '--xgmi-err', action='store_true', required=False, help=xgmi_err_help)
metric_parser.add_argument('-E', '--energy', action='store_true', required=False, help=energy_help)
metric_parser.add_argument('-v', '--violation', action='store_true', required=False, help=throttle_help)
metric_parser.add_argument('-T', '--throttle', dest='violation', action='store_true', required=False, help=argparse.SUPPRESS)
metric_parser.add_argument('-v', '--violation', dest='throttle', action='store_true', required=False, help=throttle_help)
metric_parser.add_argument('-T', '--throttle', dest='throttle', action='store_true', required=False, help=argparse.SUPPRESS)
# Options to only display to Hypervisors
if self.helpers.is_hypervisor():
@@ -872,6 +872,7 @@ int main() {
// For each device of the socket, get name and temperature.
for (uint32_t device_index = 0; device_index < device_count; device_index++) {
std::cout << "Device Index: " << device_index << std::endl;
std::cout << "SMI gpu #: " << gpu_number << std::endl;
// Commenting out the code to get CPU socket count and GPU count
// Doesn't work on system with no supported CPU sockets
@@ -884,6 +885,95 @@ int main() {
std::cout << "GPU count: " << gpus << std::endl;
#endif
// Commenting out since, not verified to work on all ASICs yet.
#if 0
amdsmi_name_value_t *pm_metrics = {};
uint32_t num_metrics = 0;
ret = amdsmi_get_gpu_pm_metrics_info(processor_handles[device_index],
&pm_metrics, &num_metrics);
const char* err_str;
amdsmi_status_code_to_string(ret, &err_str);
std::cout << " Output of amdsmi_get_gpu_pm_metrics_info:" << err_str << "\n";
if (ret == AMDSMI_STATUS_SUCCESS) {
CHK_AMDSMI_RET(ret)
std::cout << "\tNumber of PM metrics: " << num_metrics << std::endl;
for (uint32_t j = 0; j < num_metrics; j++) {
std::cout << "\tPM Metric Name: " << pm_metrics[j].name
<< ", Value: " << pm_metrics[j].value << std::endl;
}
}
free(pm_metrics);
// typedef enum {
// AMDSMI_REG_XGMI, //!< XGMI registers
// AMDSMI_REG_WAFL, //!< WAFL registers
// AMDSMI_REG_PCIE, //!< PCIe registers
// AMDSMI_REG_USR, //!< Usr registers
// AMDSMI_REG_USR1 //!< Usr1 registers
// } amdsmi_reg_type_t;
std::map<amdsmi_reg_type_t, std::string> reg_type_map = {
{AMDSMI_REG_XGMI, "XGMI"},
{AMDSMI_REG_WAFL, "WAFL"},
{AMDSMI_REG_PCIE, "PCIE"},
{AMDSMI_REG_USR, "USR"},
{AMDSMI_REG_USR1, "USR1"}
};
for (uint32_t j = static_cast<uint32_t>(AMDSMI_REG_XGMI);
j <= static_cast<uint32_t>(AMDSMI_REG_USR1); j++) {
amdsmi_name_value_t *reg_metrics = {};
amdsmi_reg_type_t reg_type = static_cast<amdsmi_reg_type_t>(j);
std::string reg_type_str = "N/A";
ret = amdsmi_get_gpu_reg_table_info(processor_handles[device_index],
reg_type, &reg_metrics, &num_metrics);
if (auto it = reg_type_map.find(reg_type); it != reg_type_map.end()) {
reg_type_str = it->second;
}
// Skipping these for now due to some ASICS having issues
if (reg_type == AMDSMI_REG_USR1 || reg_type == AMDSMI_REG_XGMI ||
reg_type == AMDSMI_REG_USR) {
std::cout << "\tSkipping " << reg_type_str << " registers for now."
<< std::endl;
free(reg_metrics);
continue;
}
amdsmi_status_code_to_string(ret, &err_str);
std::cout << " Output of amdsmi_get_gpu_reg_table_info(" << gpu_number << ", "
<< reg_type_str << "): " << err_str << "\n";
if (ret == AMDSMI_STATUS_SUCCESS) {
CHK_AMDSMI_RET(ret)
std::cout << "\tNumber of Register metrics: " << num_metrics << std::endl;
for (uint32_t k = 0; k < num_metrics; k++) {
if (reg_metrics == nullptr) {
std::cout << "\tRegister Number: " << k
<< ", Type: " << reg_type_str
<< ", Register Metric Name: N/A, Value: N/A" << std::endl;
continue;
}
if (reg_metrics[k].name == nullptr) {
std::cout << "\tRegister Number: " << k
<< ", Type: " << reg_type_str
<< ", Register Metric Name: "
<< (reg_metrics[k].name != nullptr ?
reg_metrics[k].name : "N/A")
<< ", Value: N/A" << std::endl;
continue;
}
std::cout << "\tRegister Number: " << k
<< ", Type: " << reg_type_str
<< ", Register Metric Name: "
<< (reg_metrics[k].name != nullptr ?
reg_metrics[k].name : "N/A")
<< ", Value: " << reg_metrics[k].value << std::endl;
}
}
free(reg_metrics);
std::cout << std::endl;
}
std::cout << std::endl;
#endif
// Get device type. Since the amdsmi is initialized with
// AMD_SMI_INIT_AMD_GPUS, the processor_type must be AMDSMI_PROCESSOR_TYPE_AMD_GPU.
processor_type_t processor_type = {};
@@ -1909,8 +1999,8 @@ int main() {
}
}
}
gpu_number++;
}
gpu_number++;
}
}
// Clean up resources allocated at amdsmi_init. It will invalidate sockets
+3 -3
Vedi File
@@ -714,17 +714,17 @@ typedef struct {
Gfx clock below host limit violation; 1 = active 0 = not active; Max uint8 means unsupported.*/
//GPU metrics 1.8 violations
uint64_t acc_gfx_clk_below_host_limit_pwr[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Current gfx clock below host limit power count; Max uint64 means unsupported
uint64_t acc_gfx_clk_below_host_limit_thm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Current gfx clock below host limit thermal count; Max uint64 means unsupported
uint64_t acc_gfx_clk_below_host_limit_thrm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Current gfx clock below host limit thermal count; Max uint64 means unsupported
uint64_t acc_low_utilization[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Current low utilization count; Max uint64 means unsupported
uint64_t acc_gfx_clk_below_host_limit_total[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Current gfx clock below host limit total count; Max uint64 means unsupported
uint64_t per_gfx_clk_below_host_limit_pwr[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit power violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_gfx_clk_below_host_limit_thm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_gfx_clk_below_host_limit_thrm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_low_utilization[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Low utilization violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_gfx_clk_below_host_limit_total[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Any Gfx clock below host limit violation % (greater than 0% is a violation); Max uint64 means unsupported
uint8_t active_gfx_clk_below_host_limit_pwr[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit power violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_gfx_clk_below_host_limit_thm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit thermal violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_gfx_clk_below_host_limit_thrm[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Gfx clock below host limit thermal violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_low_utilization[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; //!< New Driver 1.8 fields: Low utilization violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_gfx_clk_below_host_limit_total[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC];//!< New Driver 1.8 fields: Any Gfx clock host limit violation; 1 = active 0 = not active; Max uint8 means unsupported
uint64_t reserved[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]; // reserved for new violation info
@@ -787,6 +787,103 @@ def _notifyTypeToString(notify_type_b):
else:
return "Unknown"
def _NA_amdsmi_get_gpu_metrics_info() -> Dict[str, str]:
"""
Get 'N/A' metric values for gpu_metric, used for exception handling.
Parameters:
None
Returns:
Dict[str, str]: A dictionary with keys as metric names and values as 'N/A'.
This is used to indicate that the metric is not available or applicable.
Raises:
N/A
"""
na_gpu_metrics_info = {
"common_header.structure_size": "N/A",
"common_header.format_revision": "N/A",
"common_header.content_revision": "N/A",
"temperature_edge": "N/A",
"temperature_hotspot": "N/A",
"temperature_mem": "N/A",
"temperature_vrgfx": "N/A",
"temperature_vrsoc": "N/A",
"temperature_vrmem": "N/A",
"average_gfx_activity": "N/A",
"average_umc_activity": "N/A",
"average_mm_activity": "N/A",
"average_socket_power": "N/A",
"energy_accumulator": "N/A",
"system_clock_counter": "N/A",
"average_gfxclk_frequency": "N/A",
"average_socclk_frequency": "N/A",
"average_uclk_frequency": "N/A",
"average_vclk0_frequency": "N/A",
"average_dclk0_frequency": "N/A",
"average_vclk1_frequency": "N/A",
"average_dclk1_frequency": "N/A",
"current_gfxclk": "N/A",
"current_socclk": "N/A",
"current_uclk": "N/A",
"current_vclk0": "N/A",
"current_dclk0": "N/A",
"current_vclk1": "N/A",
"current_dclk1": "N/A",
"throttle_status": "N/A",
"current_fan_speed": "N/A",
"pcie_link_width": "N/A",
"pcie_link_speed": "N/A",
"gfx_activity_acc": "N/A",
"mem_activity_acc": "N/A",
"temperature_hbm": "N/A",
"firmware_timestamp": "N/A",
"voltage_soc": "N/A",
"voltage_gfx": "N/A",
"voltage_mem": "N/A",
"indep_throttle_status": "N/A",
"current_socket_power": "N/A",
"vcn_activity": "N/A",
"gfxclk_lock_status": "N/A",
"xgmi_link_width": "N/A",
"xgmi_link_speed": "N/A",
"pcie_bandwidth_acc": "N/A",
"pcie_bandwidth_inst": "N/A",
"pcie_l0_to_recov_count_acc": "N/A",
"pcie_replay_count_acc": "N/A",
"pcie_replay_rover_count_acc": "N/A",
"xgmi_read_data_acc": "N/A",
"xgmi_write_data_acc": "N/A",
"current_gfxclks": "N/A",
"current_socclks": "N/A",
"current_vclk0s": "N/A",
"current_dclk0s": "N/A",
"jpeg_activity": "N/A",
"pcie_nak_sent_count_acc": "N/A",
"pcie_nak_rcvd_count_acc": "N/A",
"accumulation_counter": "N/A",
"prochot_residency_acc": "N/A",
"ppt_residency_acc": "N/A",
"socket_thm_residency_acc": "N/A",
"vr_thm_residency_acc": "N/A",
"hbm_thm_residency_acc": "N/A",
"num_partition": "N/A",
"xcp_stats.gfx_busy_inst": "N/A",
"xcp_stats.jpeg_busy": "N/A",
"xcp_stats.vcn_busy": "N/A",
"xcp_stats.gfx_busy_acc": "N/A",
"xcp_stats.gfx_below_host_limit_acc": "N/A",
"xcp_stats.gfx_below_host_limit_ppt_acc": "N/A",
"xcp_stats.gfx_below_host_limit_thm_acc": "N/A",
"xcp_stats.gfx_low_utilization_acc": "N/A",
"xcp_stats.gfx_below_host_limit_total_acc": "N/A",
"pcie_lc_perf_other_end_recovery": "N/A",
"vram_max_bandwidth": "N/A",
"xgmi_link_status": "N/A"
}
return na_gpu_metrics_info
def amdsmi_get_socket_handles() -> List[c_void_p]:
"""
@@ -2351,9 +2448,9 @@ def amdsmi_get_violation_status(
"acc_hbm_thrm": _validate_if_max_uint(violation_status.acc_hbm_thrm, MaxUIntegerTypes.UINT64_T),
"acc_gfx_clk_below_host_limit": _validate_if_max_uint(violation_status.acc_gfx_clk_below_host_limit, MaxUIntegerTypes.UINT64_T),
"acc_gfx_clk_below_host_limit_pwr": list(violation_status.acc_gfx_clk_below_host_limit_pwr),
"acc_gfx_clk_below_host_limit_thm": list(violation_status.acc_gfx_clk_below_host_limit_thm),
"acc_low_utilization": list(violation_status.acc_low_utilization),
"acc_gfx_clk_below_host_limit_thrm": list(violation_status.acc_gfx_clk_below_host_limit_thrm),
"acc_gfx_clk_below_host_limit_total": list(violation_status.acc_gfx_clk_below_host_limit_total),
"acc_low_utilization": list(violation_status.acc_low_utilization),
"per_prochot_thrm": _validate_if_max_uint(violation_status.per_prochot_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True),
"per_ppt_pwr": _validate_if_max_uint(violation_status.per_ppt_pwr, MaxUIntegerTypes.UINT64_T, isActivity=True), #PVIOL
"per_socket_thrm": _validate_if_max_uint(violation_status.per_socket_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True), #TVIOL
@@ -2361,9 +2458,9 @@ def amdsmi_get_violation_status(
"per_hbm_thrm": _validate_if_max_uint(violation_status.per_hbm_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True),
"per_gfx_clk_below_host_limit": _validate_if_max_uint(violation_status.per_gfx_clk_below_host_limit, MaxUIntegerTypes.UINT64_T, isActivity=True),
"per_gfx_clk_below_host_limit_pwr": list(violation_status.per_gfx_clk_below_host_limit_pwr),
"per_gfx_clk_below_host_limit_thm": list(violation_status.per_gfx_clk_below_host_limit_thm),
"per_low_utilization": list(violation_status.per_low_utilization),
"per_gfx_clk_below_host_limit_thrm": list(violation_status.per_gfx_clk_below_host_limit_thrm),
"per_gfx_clk_below_host_limit_total": list(violation_status.per_gfx_clk_below_host_limit_total),
"per_low_utilization": list(violation_status.per_low_utilization),
"active_prochot_thrm": _validate_if_max_uint(violation_status.active_prochot_thrm, MaxUIntegerTypes.UINT8_T, isBool=True),
"active_ppt_pwr": _validate_if_max_uint(violation_status.active_ppt_pwr, MaxUIntegerTypes.UINT8_T, isBool=True), #PVIOL
"active_socket_thrm": _validate_if_max_uint(violation_status.active_socket_thrm, MaxUIntegerTypes.UINT8_T, isBool=True), #TVIOL
@@ -2371,9 +2468,9 @@ def amdsmi_get_violation_status(
"active_hbm_thrm": _validate_if_max_uint(violation_status.active_hbm_thrm, MaxUIntegerTypes.UINT8_T, isBool=True),
"active_gfx_clk_below_host_limit": _validate_if_max_uint(violation_status.active_gfx_clk_below_host_limit, MaxUIntegerTypes.UINT8_T, isBool=True),
"active_gfx_clk_below_host_limit_pwr": list(violation_status.active_gfx_clk_below_host_limit_pwr),
"active_gfx_clk_below_host_limit_thm": list(violation_status.active_gfx_clk_below_host_limit_thm),
"active_low_utilization": list(violation_status.active_low_utilization),
"active_gfx_clk_below_host_limit_thrm": list(violation_status.active_gfx_clk_below_host_limit_thrm),
"active_gfx_clk_below_host_limit_total": list(violation_status.active_gfx_clk_below_host_limit_total),
"active_low_utilization": list(violation_status.active_low_utilization),
}
# Create 2d array with each XCD's stats
@@ -2381,25 +2478,25 @@ def amdsmi_get_violation_status(
for xcp_index, xcp_metrics in enumerate(dict_return['acc_gfx_clk_below_host_limit_pwr']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
dict_return['acc_gfx_clk_below_host_limit_pwr'][xcp_index] = xcp_detail
if 'acc_gfx_clk_below_host_limit_thm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['acc_gfx_clk_below_host_limit_thm']):
if 'acc_gfx_clk_below_host_limit_thrm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['acc_gfx_clk_below_host_limit_thrm']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
dict_return['acc_gfx_clk_below_host_limit_thm'][xcp_index] = xcp_detail
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
dict_return['acc_gfx_clk_below_host_limit_thrm'][xcp_index] = xcp_detail
if 'acc_low_utilization' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['acc_low_utilization']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
dict_return['acc_low_utilization'][xcp_index] = xcp_detail
if 'acc_gfx_clk_below_host_limit_total' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['acc_gfx_clk_below_host_limit_total']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
dict_return['acc_gfx_clk_below_host_limit_total'][xcp_index] = xcp_detail
if 'per_gfx_clk_below_host_limit_pwr' in dict_return:
@@ -2408,12 +2505,12 @@ def amdsmi_get_violation_status(
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
dict_return['per_gfx_clk_below_host_limit_pwr'][xcp_index] = xcp_detail
if 'per_gfx_clk_below_host_limit_thm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['per_gfx_clk_below_host_limit_thm']):
if 'per_gfx_clk_below_host_limit_thrm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['per_gfx_clk_below_host_limit_thrm']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
dict_return['per_gfx_clk_below_host_limit_thm'][xcp_index] = xcp_detail
dict_return['per_gfx_clk_below_host_limit_thrm'][xcp_index] = xcp_detail
if 'per_low_utilization' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['per_low_utilization']):
xcp_detail = []
@@ -2433,12 +2530,12 @@ def amdsmi_get_violation_status(
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT8_T, isBool=True))
dict_return['active_gfx_clk_below_host_limit_pwr'][xcp_index] = xcp_detail
if 'active_gfx_clk_below_host_limit_thm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['active_gfx_clk_below_host_limit_thm']):
if 'active_gfx_clk_below_host_limit_thrm' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['active_gfx_clk_below_host_limit_thrm']):
xcp_detail = []
for val in xcp_metrics:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT8_T, isBool=True))
dict_return['active_gfx_clk_below_host_limit_thm'][xcp_index] = xcp_detail
dict_return['active_gfx_clk_below_host_limit_thrm'][xcp_index] = xcp_detail
if 'active_low_utilization' in dict_return:
for xcp_index, xcp_metrics in enumerate(dict_return['active_low_utilization']):
xcp_detail = []
@@ -4614,6 +4711,9 @@ def amdsmi_get_gpu_metrics_info(
)
gpu_metrics_output = {
"common_header.structure_size": _validate_if_max_uint(gpu_metrics.common_header.structure_size, MaxUIntegerTypes.UINT16_T),
"common_header.format_revision": _validate_if_max_uint(gpu_metrics.common_header.format_revision, MaxUIntegerTypes.UINT8_T),
"common_header.content_revision": _validate_if_max_uint(gpu_metrics.common_header.content_revision, MaxUIntegerTypes.UINT8_T),
"temperature_edge": _validate_if_max_uint(gpu_metrics.temperature_edge, MaxUIntegerTypes.UINT16_T),
"temperature_hotspot": _validate_if_max_uint(gpu_metrics.temperature_hotspot, MaxUIntegerTypes.UINT16_T),
"temperature_mem": _validate_if_max_uint(gpu_metrics.temperature_mem, MaxUIntegerTypes.UINT16_T),
@@ -871,15 +871,15 @@ struct_amdsmi_violation_status_t._fields_ = [
('active_gfx_clk_below_host_limit', ctypes.c_ubyte),
('PADDING_0', ctypes.c_ubyte * 2),
('acc_gfx_clk_below_host_limit_pwr', ctypes.c_uint64 * 8 * 8),
('acc_gfx_clk_below_host_limit_thm', ctypes.c_uint64 * 8 * 8),
('acc_gfx_clk_below_host_limit_thrm', ctypes.c_uint64 * 8 * 8),
('acc_low_utilization', ctypes.c_uint64 * 8 * 8),
('acc_gfx_clk_below_host_limit_total', ctypes.c_uint64 * 8 * 8),
('per_gfx_clk_below_host_limit_pwr', ctypes.c_uint64 * 8 * 8),
('per_gfx_clk_below_host_limit_thm', ctypes.c_uint64 * 8 * 8),
('per_gfx_clk_below_host_limit_thrm', ctypes.c_uint64 * 8 * 8),
('per_low_utilization', ctypes.c_uint64 * 8 * 8),
('per_gfx_clk_below_host_limit_total', ctypes.c_uint64 * 8 * 8),
('active_gfx_clk_below_host_limit_pwr', ctypes.c_ubyte * 8 * 8),
('active_gfx_clk_below_host_limit_thm', ctypes.c_ubyte * 8 * 8),
('active_gfx_clk_below_host_limit_thrm', ctypes.c_ubyte * 8 * 8),
('active_low_utilization', ctypes.c_ubyte * 8 * 8),
('active_gfx_clk_below_host_limit_total', ctypes.c_ubyte * 8 * 8),
('reserved', ctypes.c_uint64 * 8 * 8),
@@ -705,7 +705,7 @@ struct AMDGpuMetrics_v18_t {
uint16_t m_average_gfx_activity;
uint16_t m_average_umc_activity; // memory controller
/* VRAM max bandwidthi (in GB/sec) at max memory clock */
/* VRAM max bandwidth (in GB/sec) at max memory clock */
uint64_t m_mem_max_bandwidth;
/* Energy (15.259uJ (2^-16) units) */
+167 -127
Vedi File
@@ -1043,20 +1043,32 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
violation_status->active_hbm_thrm = std::numeric_limits<uint8_t>::max();
violation_status->active_gfx_clk_below_host_limit = std::numeric_limits<uint8_t>::max();
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_pwr, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_thm, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_low_utilization, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_total, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_pwr,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_thrm,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_low_utilization,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->acc_gfx_clk_below_host_limit_total,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_pwr, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_thm, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_low_utilization, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_total, std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_pwr,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_thrm,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_low_utilization,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->per_gfx_clk_below_host_limit_total,
std::numeric_limits<uint64_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_pwr, std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_thm, std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_low_utilization, std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_total, std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_pwr,
std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_thrm,
std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_low_utilization,
std::numeric_limits<uint8_t>::max());
fill_2d_array(violation_status->active_gfx_clk_below_host_limit_total,
std::numeric_limits<uint8_t>::max());
const auto p1 = std::chrono::system_clock::now();
auto current_time = std::chrono::duration_cast<std::chrono::microseconds>(
@@ -1081,14 +1093,14 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
}
// default to 0xffffffff as not supported
uint32_t partitition_id = std::numeric_limits<uint32_t>::max();
uint32_t partition_id = std::numeric_limits<uint32_t>::max();
auto tmp_partition_id = uint32_t(0);
amdsmi_status_t status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, 0,
&(tmp_partition_id));
// Do not return early if this value fails
// continue to try getting all info
if (status == AMDSMI_STATUS_SUCCESS) {
partitition_id = tmp_partition_id;
partition_id = tmp_partition_id;
}
amdsmi_gpu_metrics_t metric_info_a = {};
@@ -1102,15 +1114,28 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
return status;
}
// if all of these values are "undefined" then the feature is not supported on the ASIC
// Note: Both XCP and partition_id will default to 0, if gpu_metrics file is not present.
// This is why we can check elements in kFIRST_ELEMENT == 0 for both XCP and partition_id.
const uint32_t kFIRST_ELEMENT = 0;
// Check if violation status is supported:
// If all of these values are "undefined" then the feature is not supported on the ASIC
if (metric_info_a.accumulation_counter == std::numeric_limits<uint64_t>::max()
&& metric_info_a.prochot_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.ppt_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.socket_thm_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.vr_thm_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.hbm_thm_residency_acc == std::numeric_limits<uint64_t>::max()
&& (metric_info_a.xcp_stats->gfx_below_host_limit_acc[partitition_id]
== std::numeric_limits<uint64_t>::max())) {
&& metric_info_a.xcp_stats[kFIRST_ELEMENT].gfx_below_host_limit_acc[kFIRST_ELEMENT]
== std::numeric_limits<uint64_t>::max()
&& metric_info_a.xcp_stats[kFIRST_ELEMENT].gfx_below_host_limit_ppt_acc[kFIRST_ELEMENT]
== std::numeric_limits<uint64_t>::max()
&& metric_info_a.xcp_stats[kFIRST_ELEMENT].gfx_below_host_limit_thm_acc[kFIRST_ELEMENT]
== std::numeric_limits<uint64_t>::max()
&& metric_info_a.xcp_stats[kFIRST_ELEMENT].gfx_low_utilization_acc[kFIRST_ELEMENT]
== std::numeric_limits<uint64_t>::max()
&& metric_info_a.xcp_stats[kFIRST_ELEMENT].gfx_below_host_limit_total_acc[kFIRST_ELEMENT]
== std::numeric_limits<uint64_t>::max()) {
ss << __PRETTY_FUNCTION__
<< " | ASIC does not support throttle violations!, "
<< "returning AMDSMI_STATUS_NOT_SUPPORTED";
@@ -1136,8 +1161,26 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
violation_status->acc_socket_thrm = metric_info_b.socket_thm_residency_acc;
violation_status->acc_vr_thrm = metric_info_b.vr_thm_residency_acc;
violation_status->acc_hbm_thrm = metric_info_b.hbm_thm_residency_acc;
violation_status->acc_gfx_clk_below_host_limit //deprecated
= metric_info_b.xcp_stats->gfx_below_host_limit_acc[partitition_id];
violation_status->acc_gfx_clk_below_host_limit // deprecated
= metric_info_b.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT];
// Copy XCP accumulators into 2D array
auto copy_xcp_metric = [](const auto& src, auto& dst, auto member_ptr) {
for (size_t i = 0; i < AMDSMI_MAX_NUM_XCP; ++i) {
std::copy(
std::begin(src[i].*member_ptr),
std::end(src[i].*member_ptr),
dst[i]);
}
};
copy_xcp_metric(metric_info_b.xcp_stats, violation_status->acc_gfx_clk_below_host_limit_pwr,
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_ppt_acc);
copy_xcp_metric(metric_info_b.xcp_stats, violation_status->acc_gfx_clk_below_host_limit_thrm,
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_thm_acc);
copy_xcp_metric(metric_info_b.xcp_stats, violation_status->acc_low_utilization,
&amdsmi_gpu_xcp_metrics_t::gfx_low_utilization_acc);
copy_xcp_metric(metric_info_b.xcp_stats, violation_status->acc_gfx_clk_below_host_limit_total,
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_total_acc);
ss << __PRETTY_FUNCTION__ << " | "
<< "[gpu_metrics A] metric_info_a.accumulation_counter: " << std::dec
@@ -1152,8 +1195,9 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
<< metric_info_a.vr_thm_residency_acc << "\n"
<< "; metric_info_a.hbm_thm_residency_acc: " << std::dec
<< metric_info_a.hbm_thm_residency_acc << "\n"
<< "; metric_info_b.xcp_stats->gfx_below_host_limit_acc[" << partitition_id << "]: "
<< std::dec << metric_info_a.xcp_stats->gfx_below_host_limit_acc[partitition_id] << "\n"
<< "; metric_info_a.xcp_stats[" << partition_id << "].gfx_below_host_limit_acc["
<< kFIRST_ELEMENT << "]: " << std::dec // deprecated
<< metric_info_a.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT] << "\n"
<< " [gpu_metrics B] metric_info_b.accumulation_counter: " << std::dec
<< metric_info_b.accumulation_counter << "\n"
<< "; metric_info_b.prochot_residency_acc: " << std::dec
@@ -1166,46 +1210,11 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
<< metric_info_b.vr_thm_residency_acc << "\n"
<< "; metric_info_b.hbm_thm_residency_acc: " << std::dec
<< metric_info_b.hbm_thm_residency_acc << "\n"
<< "; metric_info_b.xcp_stats->gfx_below_host_limit_acc[" << partitition_id << "]: " //deprecated
<< std::dec << metric_info_b.xcp_stats->gfx_below_host_limit_acc[partitition_id] << "\n";
<< "; metric_info_b.xcp_stats[" << partition_id << "].gfx_below_host_limit_acc["
<< kFIRST_ELEMENT << "]: " << std::dec // deprecated
<< metric_info_b.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT] << "\n";
LOG_DEBUG(ss);
auto copy_gfx_acc = [](auto priv_it, auto priv_end, auto pub_it, auto gfx_acc_ptr) {
for (; priv_it != priv_end; ++priv_it, ++pub_it) {
std::copy(std::begin((*priv_it).*gfx_acc_ptr),
std::end((*priv_it).*gfx_acc_ptr),
std::begin(*pub_it));
}
};
copy_gfx_acc(
std::begin(metric_info_b.xcp_stats),
std::end(metric_info_b.xcp_stats),
std::begin(violation_status->acc_gfx_clk_below_host_limit_pwr),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_ppt_acc
);
copy_gfx_acc(
std::begin(metric_info_b.xcp_stats),
std::end(metric_info_b.xcp_stats),
std::begin(violation_status->acc_gfx_clk_below_host_limit_thm),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_thm_acc
);
copy_gfx_acc(
std::begin(metric_info_b.xcp_stats),
std::end(metric_info_b.xcp_stats),
std::begin(violation_status->acc_low_utilization),
&amdsmi_gpu_xcp_metrics_t::gfx_low_utilization_acc
);
copy_gfx_acc(
std::begin(metric_info_b.xcp_stats),
std::end(metric_info_b.xcp_stats),
std::begin(violation_status->acc_gfx_clk_below_host_limit_total),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_total_acc
);
if ( (metric_info_b.prochot_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.prochot_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.prochot_residency_acc >= metric_info_a.prochot_residency_acc)
@@ -1309,15 +1318,19 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
<< violation_status->active_hbm_thrm << "\n";
LOG_DEBUG(ss);
}
/* //deprecated
if ((metric_info_b.xcp_stats->gfx_below_host_limit_acc[partitition_id] != std::numeric_limits<uint64_t>::max() ||
metric_info_a.xcp_stats->gfx_below_host_limit_acc[partitition_id] != std::numeric_limits<uint64_t>::max()) &&
(metric_info_b.xcp_stats->gfx_below_host_limit_acc[partitition_id] >= metric_info_a.xcp_stats->gfx_below_host_limit_acc[partitition_id]) &&
// deprecated - design likely needs to include both [XCP][XCC], like the new metrics
if ((metric_info_b.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT]
!= std::numeric_limits<uint64_t>::max() ||
metric_info_a.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT]
!= std::numeric_limits<uint64_t>::max()) &&
(metric_info_b.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT]
>= metric_info_a.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT]) &&
((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) {
violation_status->per_gfx_clk_below_host_limit =
(((metric_info_b.xcp_stats->gfx_below_host_limit_acc[partitition_id] -
metric_info_a.xcp_stats->gfx_below_host_limit_acc[partitition_id]) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
(((metric_info_b.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT] -
metric_info_a.xcp_stats[partition_id].gfx_below_host_limit_acc[kFIRST_ELEMENT])
* 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_gfx_clk_below_host_limit > 0) {
violation_status->active_gfx_clk_below_host_limit = 1;
@@ -1327,68 +1340,95 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED gfx_below_host_limit_acc | per_gfx_clk_below_host_limit: " << std::dec
<< violation_status->per_gfx_clk_below_host_limit
<< "%; active_ppt_pwr = " << std::dec
<< "%; active_ppt_pwr = " << std::boolalpha
<< violation_status->active_gfx_clk_below_host_limit << "\n";
LOG_DEBUG(ss);
}
*/
uint64_t counter_delta = metric_info_b.accumulation_counter - metric_info_a.accumulation_counter;
auto calc_viol_actv_percent = [](auto priv_it1, auto end1, auto priv_it2, auto pub_it, auto act_it, auto viol_ptr, uint64_t counter_delta) {
for (; priv_it1 != end1; ++priv_it1, ++priv_it2, ++pub_it, ++act_it) {
auto& priv_it_arr2 = (*priv_it2).*viol_ptr;
auto& priv_it_arr1 = (*priv_it1).*viol_ptr;
for (size_t i = 0; i < AMDSMI_MAX_NUM_XCC; ++i) {
uint64_t value2 = priv_it_arr2[i];
uint64_t value1 = priv_it_arr1[i];
if ((value2 != std::numeric_limits<uint64_t>::max() ||
value1 != std::numeric_limits<uint64_t>::max()) &&
(value2 > value1) && (counter_delta > 0)) {
(*pub_it)[i] = ((value2 - value1) * 100) / counter_delta;
(*act_it)[i] = (((*pub_it)[i]) > 0) ? 1 : 0;
// one-shot processing of all XCP violation metrics
// using a lambda function to avoid code duplication
using MetricArrayType = uint64_t[AMDSMI_MAX_NUM_XCC];
using MetricMemberPtr = MetricArrayType amdsmi_gpu_xcp_metrics_t::*;
auto process_all_XCP_violation_metrics = [&](
const std::vector<std::pair<std::string, MetricMemberPtr>>& metric_members,
std::vector<std::reference_wrapper<
uint64_t[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]>> per_arrays,
std::vector<std::reference_wrapper<
uint8_t[AMDSMI_MAX_NUM_XCP][AMDSMI_MAX_NUM_XCC]>> active_arrays) {
uint64_t counter_delta = static_cast<uint64_t>(metric_info_b.accumulation_counter)
- static_cast<uint64_t>(metric_info_a.accumulation_counter);
ss << __PRETTY_FUNCTION__ << " | Processing all XCP metrics with counter_delta: "
<< std::dec << counter_delta << "\n";
LOG_DEBUG(ss);
for (size_t metric_idx = 0; metric_idx < metric_members.size(); ++metric_idx) {
const auto& member_pair = metric_members[metric_idx];
const std::string& member_name = member_pair.first;
MetricMemberPtr member_ptr = member_pair.second;
auto& per_arr = per_arrays[metric_idx].get();
auto& active_arr = active_arrays[metric_idx].get();
ss << " [Metric] " << member_name << "\n";
for (uint32_t xcp = 0; xcp < AMDSMI_MAX_NUM_XCP; ++xcp) {
const MetricArrayType& arr_a = metric_info_a.xcp_stats[xcp].*member_ptr;
const MetricArrayType& arr_b = metric_info_b.xcp_stats[xcp].*member_ptr;
ss << " xcp: " << xcp << " (";
for (uint32_t xcc = 0; xcc < AMDSMI_MAX_NUM_XCC; ++xcc) {
uint64_t val_a = arr_a[xcc];
uint64_t val_b = arr_b[xcc];
if (val_b == std::numeric_limits<uint64_t>::max() ||
val_a == std::numeric_limits<uint64_t>::max() ||
counter_delta <= 0 ||
val_b < val_a) {
per_arr[xcp][xcc] = std::numeric_limits<uint64_t>::max();
active_arr[xcp][xcc] = std::numeric_limits<uint8_t>::max();
ss << "[Invalid] (" << std::dec << per_arr[xcp][xcc]
<< ", " << static_cast<int>(active_arr[xcp][xcc]) << ") ";
continue;
}
uint64_t percent = ((val_b - val_a) * 100) / counter_delta;
per_arr[xcp][xcc] = percent;
active_arr[xcp][xcc] = (percent > 0) ? 1 : 0;
ss << "[Valid] (" << std::dec << percent << "%, "
<< std::boolalpha << static_cast<bool>(active_arr[xcp][xcc])
<< ") | val_b: " << std::dec << val_b
<< ", val_a: " << std::dec << val_a
<< ", counter_delta: " << std::dec << counter_delta << " ";
}
ss << ")\n";
}
}
LOG_DEBUG(ss);
};
calc_viol_actv_percent(
std::begin(metric_info_a.xcp_stats),
std::end(metric_info_a.xcp_stats),
std::begin(metric_info_b.xcp_stats),
std::begin(violation_status->per_gfx_clk_below_host_limit_pwr),
std::begin(violation_status->active_gfx_clk_below_host_limit_pwr),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_ppt_acc,
counter_delta
);
// Prepare metric members and arrays for processing
const std::vector<std::pair<std::string, MetricMemberPtr>> metric_members = {
{"gfx_below_host_limit_ppt_acc", &amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_ppt_acc},
{"gfx_below_host_limit_thm_acc", &amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_thm_acc},
{"gfx_low_utilization_acc", &amdsmi_gpu_xcp_metrics_t::gfx_low_utilization_acc},
{"gfx_below_host_limit_total_acc",
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_total_acc}
};
calc_viol_actv_percent(
std::begin(metric_info_a.xcp_stats),
std::end(metric_info_a.xcp_stats),
std::begin(metric_info_b.xcp_stats),
std::begin(violation_status->per_gfx_clk_below_host_limit_thm),
std::begin(violation_status->active_gfx_clk_below_host_limit_thm),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_thm_acc,
counter_delta
);
calc_viol_actv_percent(
std::begin(metric_info_a.xcp_stats),
std::end(metric_info_a.xcp_stats),
std::begin(metric_info_b.xcp_stats),
std::begin(violation_status->per_low_utilization),
std::begin(violation_status->active_low_utilization),
&amdsmi_gpu_xcp_metrics_t::gfx_low_utilization_acc,
counter_delta
);
calc_viol_actv_percent(
std::begin(metric_info_a.xcp_stats),
std::end(metric_info_a.xcp_stats),
std::begin(metric_info_b.xcp_stats),
std::begin(violation_status->per_gfx_clk_below_host_limit_total),
std::begin(violation_status->active_gfx_clk_below_host_limit_total),
&amdsmi_gpu_xcp_metrics_t::gfx_below_host_limit_total_acc,
counter_delta
);
process_all_XCP_violation_metrics(
metric_members,
{
std::ref(violation_status->per_gfx_clk_below_host_limit_pwr),
std::ref(violation_status->per_gfx_clk_below_host_limit_thrm),
std::ref(violation_status->per_low_utilization),
std::ref(violation_status->per_gfx_clk_below_host_limit_total)
},
{
std::ref(violation_status->active_gfx_clk_below_host_limit_pwr),
std::ref(violation_status->active_gfx_clk_below_host_limit_thrm),
std::ref(violation_status->active_low_utilization),
std::ref(violation_status->active_gfx_clk_below_host_limit_total)
});
ss << __PRETTY_FUNCTION__ << " | "
<< "RETURNING AMDSMI_STATUS_SUCCESS | "
@@ -1406,20 +1446,20 @@ amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_ha
<< violation_status->per_vr_thrm
<< "; violation_status->per_hbm_thrm (%): " << std::dec
<< violation_status->per_hbm_thrm
<< "; violation_status->per_gfx_clk_below_host_limit (%): " << std::dec //deprecated
<< "; violation_status->per_gfx_clk_below_host_limit (%): " << std::dec // deprecated
<< violation_status->per_gfx_clk_below_host_limit
<< "; violation_status->active_prochot_thrm (bool): " << std::dec
<< "; violation_status->active_prochot_thrm (bool): " << std::boolalpha
<< static_cast<int>(violation_status->active_prochot_thrm)
<< "; violation_status->active_ppt_pwr (bool): " << std::dec
<< "; violation_status->active_ppt_pwr (bool): " << std::boolalpha
<< static_cast<int>(violation_status->active_ppt_pwr)
<< "; violation_status->active_socket_thrm (bool): " << std::dec
<< "; violation_status->active_socket_thrm (bool): " << std::boolalpha
<< static_cast<int>(violation_status->active_socket_thrm)
<< "; violation_status->active_vr_thrm (bool): " << std::dec
<< "; violation_status->active_vr_thrm (bool): " << std::boolalpha
<< static_cast<int>(violation_status->active_vr_thrm)
<< "; violation_status->active_hbm_thrm (bool): " << std::dec
<< "; violation_status->active_hbm_thrm (bool): " << std::boolalpha
<< static_cast<int>(violation_status->active_hbm_thrm)
<< "; violation_status->active_gfx_clk_below_host_limit (bool): " << std::dec //deprecated
<< static_cast<int>(violation_status->active_gfx_clk_below_host_limit)
<< "; violation_status->active_gfx_clk_below_host_limit (bool): " // deprecated
<< std::boolalpha << static_cast<int>(violation_status->active_gfx_clk_below_host_limit)
<< "\n";
LOG_INFO(ss);