[SWDEV-535159] Add support for GPU partition metrics (#490)

[SWDEV-535159] Add support for GPU partition metrics

Changes include:
  - Internal logic to smart-switch between gpu_metrics/xcp_metrics files
  - [WIP] Initial plumbing for new partition metric API

Change-Id: I4340fb1b48bac0117d80d5d486b9e871430d5cd8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add amdsmi_get_gpu_partition_metrics_info() + minor cleanup

Change-Id: I5d60604f18baddbd03852dc90e88aa0b8107d50e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Fix partition metric logic + update logging/tests

Change-Id: I9e89b19ead17694c54e224f8e13ff8ee3eb2e22a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Adjust amd-smi metric/monitor/default to show (some) partition information

Change-Id: I2e8d2745876a19bdaec3c039daa97345c9f701b5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add C++ tests

Change-Id: Ib9eb0b57a6d7a280992e05a4c6eba632826952ef
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Remove modification of energy counter, not needed

Change-Id: I5c48eaaae248ee6dc79abba609d837ec35d78022
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[CLI] amd-smi metric: cleaned up N/A'd multi-valued to show just N/A

Changes:
1. amd-smi metric: cleaned up N/A'd multi-valued to show just N/A
ex.
JPEG_ACTIVITY: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]

Now just shows: N/A

2. [Python Unit Test] Renamed test class TestAmdSmiPythonBDF(unittest.TestCase) ->
 AmdSmiPythonUnitTest

Test name was confusing.
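
A minimal sketch of the collapsing behavior described in change 1, assuming the CLI holds multi-valued metrics as Python lists and uses the string "N/A" for missing entries. The helper name `collapse_all_na` is hypothetical and is not taken from the amd-smi sources.

```python
from typing import Any, List, Union

NA = "N/A"


def collapse_all_na(values: Union[List[Any], Any]) -> Union[List[Any], Any]:
    """Return a single "N/A" when every entry of a non-empty list is "N/A".

    Hypothetical helper: anything else (mixed lists, empty lists, scalars)
    is returned unchanged.
    """
    if isinstance(values, list) and values and all(v == NA for v in values):
        return NA
    return values


if __name__ == "__main__":
    # 32 unavailable JPEG engines collapse to one marker.
    print(collapse_all_na([NA] * 32))
    # A list with at least one real reading is kept as-is.
    print(collapse_all_na([12, NA, 7]))
```

Lists that still carry at least one real reading are left alone, so per-engine values stay visible.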

Change-Id: Ieb3b036f30002fd22362508eb9fc5d443df395ae
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Log cleanup

Change-Id: I1b1a95f1844d35bec7a7bd8cb996f87e4914c069
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Add amd-smi partition-metrics CLI + general cleanup

Change-Id: Ia91488e6cb3a4d62b4087afbddfe0b3bb9378fdc
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[1.3 metrics] Remove forward compatibility for partition metrics

Change-Id: Iab928983e6f6f1587bc9307f6f3fa2b2696ca6f7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Fixed violation output not showing % + general cleanup

Change-Id: Icac1b0a55b18c7628b07109ae0c377d17e0825f1
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Clean up amdsmi_get_gpu_partition_metrics_info & amd-smi partition-metric outputs

Change-Id: I6427028b980874641e9ffb3b5d88ad493dbf9cf4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Fix metrics not found + extra logging/formatting

Change-Id: I841a27bb2c305e97ec7579a13ac915e5be497c3a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Update license to current default

Change-Id: I0de9b8a2d5dbbeab4491097f0354ba17b0d30866
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Cleanup for review

Change-Id: I96ed25c3f2b8968eea1af24c5e5860c2b4e74e6e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Modernize updated/new internal APIs.

Change-Id: I3c48a250eeb703709b14cb5ffa68268d8321626c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Remove extra logging in dynamic metrics

Change-Id: Idb97547bcbe143d6fa1cb5cb278ffe4da615ce14
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Remove amd-smi partition-metric command

Change-Id: Ib83c17e5cd7e0da3798198943bddd46c296b411c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Move new CLI updates to another PR + minor fixes

Change-Id: I3b1163eec12f9b5f7d95ee33de08e168cec1b1fe
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Allow dynamic metrics to work for gpu/xcp metrics 1.9+/1.1+

Updated some logging as well.

Change-Id: I2ed9f5a5ef8afb1520508820ca6153525f0644b4
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Allow dyn gpu/xcp metric v1.9+/v1.1+

Added tests for quick check

Change-Id: I576d6f6582a55afb08e5ac57791ce95e2fa184a2
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Update tests for larger subset of version checks

Change-Id: I3cdf4f8bb4fc6161f4c76566939f90545d0f362a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

* Fix XCP metrics in gpu/partition metric pre-v1.9/v1.1 (dynamic)

Change-Id: I4dabc1ed6bef6b86c8e7f92bf9cb5992f3966fe2
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: 01b4fe6614]
This commit is contained in:
Poag, Charis
2025-10-20 14:43:40 -05:00
committed by GitHub
parent 428bded17a
commit ce19b921b0
22 changed files with 2235 additions and 815 deletions
+16 -34
@@ -1632,6 +1632,7 @@ class AMDSMICommands():
# Add timestamp and store values for specified arguments
values_dict = {}
is_partition_metrics = False # True if we get the metrics from xcp_metrics file (amdsmi_get_gpu_partition_metrics_info)
#get metric info only once per gpu, this will speed up data output
try:
# Get GPU Metrics table
@@ -1640,19 +1641,10 @@ class AMDSMICommands():
logging.debug("#3 - Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
gpu_metric = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
# Workaround for XCP (partition) metrics not providing num_partition in v1.0
# Confirmed with driver team that we can default to 1 if num_partition is not defined.
# Pending partitions exist, ie. partition_id > 0. See logic below.
try:
partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
partition_id = "N/A"
num_partition = gpu_metric['num_partition']
if num_partition == "N/A":
num_partition = 1 # Workaround for XCP metrics not providing num_partition in v1.0
logging.debug(f"num_partition is N/A and partition_id: {partition_id} (greater > 0).\nModified num_partition: {num_partition} to adjust for XCP metrics.")
# Workaround for XCP (partition) metrics not providing num_partition in v1.9+/v1.1+
# Provides original formatting for earlier metric versions
partition_metric_info = self.helpers._get_metric_version_and_partition_info(gpu_metric, is_partition_metrics, gpu_id, args.gpu)
num_partition = partition_metric_info['num_partition']
if self.logger.is_json_format():
values_dict['gpu'] = int(gpu_id)
@@ -2679,7 +2671,7 @@ class AMDSMICommands():
value[k][index] = self.helpers.unit_format(self.logger, activity, activity_unit)
value[k] = '[' + ", ".join(value[k]) + ']'
elif value != "N/A":
value = self.helpers.unit_format(self.logger, value, activity_unit)
throttle_status[key] = self.helpers.unit_format(self.logger, value, activity_unit)
if self.logger.is_json_format():
if isinstance(value, (list, dict)):
for k, v in value.items():
@@ -3090,7 +3082,6 @@ class AMDSMICommands():
if not self.logger.is_json_format():
self.logger.print_output(multiple_device_enabled=multiple_devices_csv_override)
def metric(self, args, multiple_devices=False, watching_output=False, gpu=None,
usage=None, watch=None, watch_time=None, iterations=None, power=None,
clock=None, temperature=None, ecc=None, ecc_blocks=None, pcie=None,
@@ -5710,6 +5701,7 @@ class AMDSMICommands():
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("#5 - Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
is_partition_metrics = False # True if we get the metrics from xcp_metrics file (amdsmi_get_gpu_partition_metrics_info)
#get metric info only once per gpu, this will speed up data output
try:
# Get GPU Metrics table
@@ -5721,25 +5713,15 @@ class AMDSMICommands():
gpu_metrics_info = amdsmi_interface._NA_amdsmi_get_gpu_metrics_info()
logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.get_error_info())
# Workaround for XCP (partition) metrics not providing num_partition in v1.0
# Confirmed with driver team that we can default to 1 if num_partition is not defined.
# Pending partitions exist, ie. partition_id > 0. See logic below.
try:
partition_id = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get current partition id for gpu %s | %s", gpu_id, e.get_error_info())
partition_id = "N/A"
# Workaround for XCP (partition) metrics not providing num_partition in v1.9+/v1.1+
# Provides original formatting for earlier metric versions
partition_metric_info = self.helpers._get_metric_version_and_partition_info(gpu_metrics_info, is_partition_metrics, gpu_id, args.gpu)
partition_id = partition_metric_info['partition_id']
num_partition = partition_metric_info['num_partition']
num_partition = gpu_metrics_info['num_partition']
if num_partition == "N/A":
num_partition = partition_id
num_xcp = num_partition # used later for XCP metrics
# Update logger for XCP display (only if applicable)
self.logger.table_header += 'XCP'.rjust(5, ' ')
self.logger.store_output(args.gpu, 'xcp', partition_id) # Starting with partition_id.
# Outputs which have xcp details
# will update this value via num_xcp.
# This value will help map to primary device.
self.logger.store_output(args.gpu, 'xcp', partition_id) # Store partition_id initially; can be updated via num_xcp
# Store the pcie_bw values due to possible increase in bandwidth due to repeated gpu_metrics calls
if args.pcie:
@@ -5979,7 +5961,7 @@ class AMDSMICommands():
"unit" : freq_unit}
except (KeyError, amdsmi_exception.AmdSmiLibraryException) as e:
monitor_values['dclock'] = "N/A"
logging.debug("Failed to get vclock on gpu %s | %s", gpu_id, e)
logging.debug("Failed to get dclock on gpu %s | %s", gpu_id, e)
self.logger.table_header += 'DCLOCK'.rjust(10)
@@ -6322,7 +6304,7 @@ class AMDSMICommands():
self.logger.store_multiple_device_output()
current_xcp += 1
else:
self.logger.store_output(args.gpu, 'xcp', num_xcp)
self.logger.store_output(args.gpu, 'xcp', partition_id)
self.logger.store_output(args.gpu, 'values', monitor_values)
# Store typical output for all commands (XCP data will be handled separately, eg. violation status)
+70 -1
@@ -1018,7 +1018,6 @@ class AMDSMIHelpers():
"""This function will format output with unit based on the logger output format
params:
args - argparser args to pass to subcommand
logger (AMDSMILogger) - Logger to print out output
value - the value to be formatted
unit - the unit to be formatted with the value
@@ -1041,6 +1040,9 @@ class AMDSMIHelpers():
return {"value": value, "unit": unit}
else:
return value
if logger.is_csv_format():
# For CSV, return the raw value (number or "N/A"), not a string
return value
if logger.is_human_readable_format():
if unit:
return f"{value} {unit}".rstrip()
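
A simplified sketch of the three output branches visible in this hunk, assuming the logger exposes exactly one active format at a time. `OutputFormat` and `unit_format_sketch` are hypothetical stand-ins for the real logger checks (`is_json_format()`, `is_csv_format()`, `is_human_readable_format()`); returning the value unchanged when no unit is given in human-readable mode is also an assumption, since that part of the function is outside the hunk.

```python
from enum import Enum
from typing import Any


class OutputFormat(Enum):
    """Hypothetical stand-in for the logger's format predicates."""
    JSON = "json"
    CSV = "csv"
    HUMAN = "human"


def unit_format_sketch(fmt: OutputFormat, value: Any, unit: str) -> Any:
    """Format a value with its unit according to the output format."""
    if fmt is OutputFormat.JSON:
        # JSON keeps value and unit as separate fields.
        return {"value": value, "unit": unit} if unit else value
    if fmt is OutputFormat.CSV:
        # CSV returns the raw value (number or "N/A"), never a joined string.
        return value
    # Human-readable output joins value and unit into one string.
    return f"{value} {unit}".rstrip() if unit else value


if __name__ == "__main__":
    for fmt in OutputFormat:
        print(fmt.name, unit_format_sketch(fmt, 42, "W"))
```

Keeping the CSV branch free of units means a column holds plain numbers, which is what the comment in the hunk describes.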
@@ -1745,3 +1747,70 @@ class AMDSMIHelpers():
# Flatten nested lists and filter integers
flat = [v for value in data for v in (value if isinstance(value, list) else [value]) if isinstance(v, int)]
return round(sum(flat) / len(flat)) if flat else "N/A"
    def _get_metric_version_and_partition_info(self, gpu_metrics_info, is_partition_metrics, gpu_id, gpu_handle):
        """
        Helper method to compute metric version, partition ID, and num_partition for dynamic metrics.
        Handles logging updates internally for reusability.
        Args:
            gpu_metrics_info (dict): GPU metrics info from amdsmi_get_gpu_metrics_info.
            is_partition_metrics (bool): Whether this is for partition metrics.
            gpu_id (int): GPU ID for logging.
            gpu_handle: GPU device handle for KFD info retrieval.
        Returns:
            dict: {
                'metric_version': float or "N/A",
                'partition_id': int or "N/A",
                'num_partition': int or "N/A",
                'num_xcp': int or "N/A" # Alias for num_partition
            }
        """
        # Compute metric version from header revisions
        metric_version = "N/A"
        format_rev = gpu_metrics_info.get('common_header.format_revision', "N/A")
        content_rev = gpu_metrics_info.get('common_header.content_revision', "N/A")
        if format_rev != "N/A" and content_rev != "N/A":
            try:
                metric_version = float(f"{format_rev}.{content_rev}")
            except ValueError:
                metric_version = "N/A" # Fallback if conversion fails
        # Retrieve partition ID from KFD info
        partition_id = "N/A"
        try:
            kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(gpu_handle)
            partition_id = kfd_info.get('current_partition_id', "N/A")
        except amdsmi_exception.AmdSmiLibraryException as e:
            logging.debug("Failed to get current partition ID for GPU %s | %s", gpu_id, e.get_error_info())
        # Determine num_partition with fallback logic for dynamic metrics
        num_partition = gpu_metrics_info.get('num_partition', "N/A")
        if metric_version != "N/A" and num_partition == "N/A":
            # Workaround: Default to 1 for newer metric versions if num_partition is missing
            # (Confirmed with driver team; applies to GPU and partition metrics)
            if not is_partition_metrics and metric_version >= 1.9:
                num_partition = 1
            elif is_partition_metrics and metric_version >= 1.1:
                num_partition = 1
            elif partition_id != "N/A" and partition_id > 0:
                # Fallback to partition_id if partitions exist but num_partition is unavailable
                num_partition = partition_id
            # Else: Remains "N/A" if no conditions match
        # Alias num_xcp for XCP metrics usage
        num_xcp = num_partition
        # Debug logging
        logging.debug(
            "GPU %s | Metric version: %s, num_partition: %s, partition_id: %s, num_xcp: %s",
            gpu_id, metric_version, num_partition, partition_id, num_xcp
        )
        return {
            'metric_version': metric_version,
            'partition_id': partition_id,
            'num_partition': num_partition,
            'num_xcp': num_xcp
        }
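
The fallback rules in the helper above can be exercised without a GPU. The sketch below restates them as pure functions, assuming the same "N/A" sentinel and the same thresholds (1.9 for gpu_metrics, 1.1 for xcp_metrics). It deliberately swaps the float version for an integer tuple: `float("1.10")` evaluates to 1.1, so a float comparison against 1.9 would place a revision 1.10 below the threshold, while a tuple compares each component numerically. The names `parse_version` and `resolve_num_partition` are hypothetical and do not exist in amd-smi.

```python
from typing import Optional, Tuple, Union

NA = "N/A"
IntOrNA = Union[int, str]


def parse_version(format_rev: IntOrNA, content_rev: IntOrNA) -> Optional[Tuple[int, int]]:
    """Build a (format, content) tuple, or None when either part is missing."""
    if format_rev == NA or content_rev == NA:
        return None
    try:
        return (int(format_rev), int(content_rev))
    except (TypeError, ValueError):
        return None


def resolve_num_partition(version: Optional[Tuple[int, int]],
                          num_partition: IntOrNA,
                          partition_id: IntOrNA,
                          is_partition_metrics: bool) -> IntOrNA:
    """Mirror the helper's fallback order for a missing num_partition."""
    if version is None or num_partition != NA:
        return num_partition  # nothing to repair, or no version to decide on
    threshold = (1, 1) if is_partition_metrics else (1, 9)
    if version >= threshold:
        return 1  # newer tables: default to a single partition
    if partition_id != NA and partition_id > 0:
        return partition_id  # older tables: fall back to the KFD partition id
    return NA


if __name__ == "__main__":
    print(resolve_num_partition(parse_version(1, 10), NA, 0, False))
    print(resolve_num_partition(parse_version(1, 8), NA, 3, False))
    print(resolve_num_partition(parse_version(1, 0), NA, 0, True))
```

Whether a content revision of 10 or higher will ever be published is not stated in this commit; the tuple form simply removes the dependency on that question.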
-1
@@ -918,7 +918,6 @@ class AMDSMIParser(argparse.ArgumentParser):
self._add_device_arguments(bad_pages_parser, required=False)
self._add_command_modifiers(bad_pages_parser)
def _add_metric_parser(self, subparsers: argparse._SubParsersAction, func):
# Subparser help text
metric_help = "Gets metric/performance information about the specified GPU"
+24
@@ -4055,6 +4055,30 @@ amdsmi_get_gpu_metrics_header_info(amdsmi_processor_handle processor_handle, amd
amdsmi_status_t amdsmi_get_gpu_metrics_info(amdsmi_processor_handle processor_handle,
amdsmi_gpu_metrics_t *pgpu_metrics);
/**
* @brief This function retrieves the partition metrics information.
*
* @ingroup tagClkPowerPerfQuery
*
* @platform{gpu_bm_linux} @platform{guest_1vf}
*
* @details Given a processor handle @p processor_handle and a pointer to a
* ::amdsmi_gpu_metrics_t structure @p pgpu_metrics, this function will populate
* @p pgpu_metrics. See ::amdsmi_gpu_metrics_t for more details.
*
* @param[in] processor_handle a processor handle
*
* @param[in,out] pgpu_metrics a pointer to an ::amdsmi_gpu_metrics_t structure
* If this parameter is nullptr, this function will return
* ::AMDSMI_STATUS_INVAL if the function is supported with the provided,
* arguments and ::AMDSMI_STATUS_NOT_SUPPORTED if it is not supported with the
* provided arguments.
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
*/
amdsmi_status_t amdsmi_get_gpu_partition_metrics_info(amdsmi_processor_handle processor_handle,
amdsmi_gpu_metrics_t *pgpu_metrics);
/**
* @brief Get the pm metrics table with provided device index.
*
+1
@@ -184,6 +184,7 @@ from .amdsmi_interface import amdsmi_get_gpu_mem_overdrive_level
from .amdsmi_interface import amdsmi_get_clk_freq
from .amdsmi_interface import amdsmi_get_gpu_od_volt_info
from .amdsmi_interface import amdsmi_get_gpu_metrics_info
from .amdsmi_interface import amdsmi_get_gpu_partition_metrics_info
from .amdsmi_interface import amdsmi_get_gpu_od_volt_curve_regions
from .amdsmi_interface import amdsmi_is_gpu_power_management_enabled
+159
@@ -4932,6 +4932,165 @@ def amdsmi_get_gpu_metrics_info(
gpu_metrics_output['xcp_stats.gfx_below_host_limit_total_acc'][xcp_index] = xcp_detail
return gpu_metrics_output
def amdsmi_get_gpu_partition_metrics_info(
processor_handle: processor_handle_t,
) -> Dict[str, Any]:
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
raise AmdSmiParameterException(
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
)
gpu_metrics = amdsmi_wrapper.amdsmi_gpu_metrics_t()
_check_res(
amdsmi_wrapper.amdsmi_get_gpu_partition_metrics_info(
processor_handle, ctypes.byref(gpu_metrics)
)
)
gpu_metrics_output = {
"common_header.structure_size": _validate_if_max_uint(gpu_metrics.common_header.structure_size, MaxUIntegerTypes.UINT16_T),
"common_header.format_revision": _validate_if_max_uint(gpu_metrics.common_header.format_revision, MaxUIntegerTypes.UINT8_T),
"common_header.content_revision": _validate_if_max_uint(gpu_metrics.common_header.content_revision, MaxUIntegerTypes.UINT8_T),
"temperature_edge": _validate_if_max_uint(gpu_metrics.temperature_edge, MaxUIntegerTypes.UINT16_T),
"temperature_hotspot": _validate_if_max_uint(gpu_metrics.temperature_hotspot, MaxUIntegerTypes.UINT16_T),
"temperature_mem": _validate_if_max_uint(gpu_metrics.temperature_mem, MaxUIntegerTypes.UINT16_T),
"temperature_vrgfx": _validate_if_max_uint(gpu_metrics.temperature_vrgfx, MaxUIntegerTypes.UINT16_T),
"temperature_vrsoc": _validate_if_max_uint(gpu_metrics.temperature_vrsoc, MaxUIntegerTypes.UINT16_T),
"temperature_vrmem": _validate_if_max_uint(gpu_metrics.temperature_vrmem, MaxUIntegerTypes.UINT16_T),
"average_gfx_activity": _validate_if_max_uint(gpu_metrics.average_gfx_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_umc_activity": _validate_if_max_uint(gpu_metrics.average_umc_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_mm_activity": _validate_if_max_uint(gpu_metrics.average_mm_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_socket_power": _validate_if_max_uint(gpu_metrics.average_socket_power, MaxUIntegerTypes.UINT16_T),
"energy_accumulator": _validate_if_max_uint(gpu_metrics.energy_accumulator, MaxUIntegerTypes.UINT64_T),
"system_clock_counter": _validate_if_max_uint(gpu_metrics.system_clock_counter, MaxUIntegerTypes.UINT64_T),
"average_gfxclk_frequency": _validate_if_max_uint(gpu_metrics.average_gfxclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_socclk_frequency": _validate_if_max_uint(gpu_metrics.average_socclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_uclk_frequency": _validate_if_max_uint(gpu_metrics.average_uclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_vclk0_frequency": _validate_if_max_uint(gpu_metrics.average_vclk0_frequency, MaxUIntegerTypes.UINT16_T),
"average_dclk0_frequency": _validate_if_max_uint(gpu_metrics.average_dclk0_frequency, MaxUIntegerTypes.UINT16_T),
"average_vclk1_frequency": _validate_if_max_uint(gpu_metrics.average_vclk1_frequency, MaxUIntegerTypes.UINT16_T),
"average_dclk1_frequency": _validate_if_max_uint(gpu_metrics.average_dclk1_frequency, MaxUIntegerTypes.UINT16_T),
"current_gfxclk": _validate_if_max_uint(gpu_metrics.current_gfxclk, MaxUIntegerTypes.UINT16_T),
"current_socclk": _validate_if_max_uint(gpu_metrics.current_socclk, MaxUIntegerTypes.UINT16_T),
"current_uclk": _validate_if_max_uint(gpu_metrics.current_uclk, MaxUIntegerTypes.UINT16_T),
"current_vclk0": _validate_if_max_uint(gpu_metrics.current_vclk0, MaxUIntegerTypes.UINT16_T),
"current_dclk0": _validate_if_max_uint(gpu_metrics.current_dclk0, MaxUIntegerTypes.UINT16_T),
"current_vclk1": _validate_if_max_uint(gpu_metrics.current_vclk1, MaxUIntegerTypes.UINT16_T),
"current_dclk1": _validate_if_max_uint(gpu_metrics.current_dclk1, MaxUIntegerTypes.UINT16_T),
"throttle_status": _validate_if_max_uint(gpu_metrics.throttle_status, MaxUIntegerTypes.UINT32_T, isBool=True),
"current_fan_speed": _validate_if_max_uint(gpu_metrics.current_fan_speed, MaxUIntegerTypes.UINT16_T),
"pcie_link_width": _validate_if_max_uint(gpu_metrics.pcie_link_width, MaxUIntegerTypes.UINT16_T),
"pcie_link_speed": _validate_if_max_uint(gpu_metrics.pcie_link_speed, MaxUIntegerTypes.UINT16_T),
"gfx_activity_acc": _validate_if_max_uint(gpu_metrics.gfx_activity_acc, MaxUIntegerTypes.UINT32_T),
"mem_activity_acc": _validate_if_max_uint(gpu_metrics.mem_activity_acc, MaxUIntegerTypes.UINT32_T),
"temperature_hbm": _validate_if_max_uint(list(gpu_metrics.temperature_hbm), MaxUIntegerTypes.UINT16_T),
"firmware_timestamp": _validate_if_max_uint(gpu_metrics.firmware_timestamp, MaxUIntegerTypes.UINT64_T),
"voltage_soc": _validate_if_max_uint(gpu_metrics.voltage_soc, MaxUIntegerTypes.UINT16_T),
"voltage_gfx": _validate_if_max_uint(gpu_metrics.voltage_gfx, MaxUIntegerTypes.UINT16_T),
"voltage_mem": _validate_if_max_uint(gpu_metrics.voltage_mem, MaxUIntegerTypes.UINT16_T),
"indep_throttle_status": _validate_if_max_uint(gpu_metrics.indep_throttle_status, MaxUIntegerTypes.UINT64_T, isBool=True),
"current_socket_power": _validate_if_max_uint(gpu_metrics.current_socket_power, MaxUIntegerTypes.UINT16_T),
"vcn_activity": _validate_if_max_uint(list(gpu_metrics.vcn_activity), MaxUIntegerTypes.UINT16_T, isActivity=True),
"gfxclk_lock_status": _validate_if_max_uint(gpu_metrics.gfxclk_lock_status, MaxUIntegerTypes.UINT32_T),
"xgmi_link_width": _validate_if_max_uint(gpu_metrics.xgmi_link_width, MaxUIntegerTypes.UINT16_T),
"xgmi_link_speed": _validate_if_max_uint(gpu_metrics.xgmi_link_speed, MaxUIntegerTypes.UINT16_T),
"pcie_bandwidth_acc": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_acc, MaxUIntegerTypes.UINT64_T),
"pcie_bandwidth_inst": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_inst, MaxUIntegerTypes.UINT64_T),
"pcie_l0_to_recov_count_acc": _validate_if_max_uint(gpu_metrics.pcie_l0_to_recov_count_acc, MaxUIntegerTypes.UINT64_T),
"pcie_replay_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_count_acc, MaxUIntegerTypes.UINT64_T),
"pcie_replay_rover_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_rover_count_acc, MaxUIntegerTypes.UINT64_T),
"xgmi_read_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_read_data_acc), MaxUIntegerTypes.UINT64_T),
"xgmi_write_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_write_data_acc), MaxUIntegerTypes.UINT64_T),
"current_gfxclks": _validate_if_max_uint(list(gpu_metrics.current_gfxclks), MaxUIntegerTypes.UINT16_T),
"current_socclks": _validate_if_max_uint(list(gpu_metrics.current_socclks), MaxUIntegerTypes.UINT16_T),
"current_vclk0s": _validate_if_max_uint(list(gpu_metrics.current_vclk0s), MaxUIntegerTypes.UINT16_T),
"current_dclk0s": _validate_if_max_uint(list(gpu_metrics.current_dclk0s), MaxUIntegerTypes.UINT16_T),
"jpeg_activity": _validate_if_max_uint(list(gpu_metrics.jpeg_activity), MaxUIntegerTypes.UINT16_T, isActivity=True),
"pcie_nak_sent_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_sent_count_acc, MaxUIntegerTypes.UINT32_T),
"pcie_nak_rcvd_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_rcvd_count_acc, MaxUIntegerTypes.UINT32_T),
"accumulation_counter": _validate_if_max_uint(gpu_metrics.accumulation_counter, MaxUIntegerTypes.UINT64_T),
"prochot_residency_acc": _validate_if_max_uint(gpu_metrics.prochot_residency_acc, MaxUIntegerTypes.UINT64_T),
"ppt_residency_acc": _validate_if_max_uint(gpu_metrics.ppt_residency_acc, MaxUIntegerTypes.UINT64_T),
"socket_thm_residency_acc": _validate_if_max_uint(gpu_metrics.socket_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"vr_thm_residency_acc": _validate_if_max_uint(gpu_metrics.vr_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"hbm_thm_residency_acc": _validate_if_max_uint(gpu_metrics.hbm_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"num_partition": _validate_if_max_uint(gpu_metrics.num_partition, MaxUIntegerTypes.UINT16_T),
"xcp_stats.gfx_busy_inst": list(gpu_metrics.xcp_stats),
"xcp_stats.jpeg_busy": list(gpu_metrics.xcp_stats),
"xcp_stats.vcn_busy": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_busy_acc": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_below_host_limit_acc": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_below_host_limit_ppt_acc": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_below_host_limit_thm_acc": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_low_utilization_acc": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_below_host_limit_total_acc": list(gpu_metrics.xcp_stats),
"pcie_lc_perf_other_end_recovery": _validate_if_max_uint(gpu_metrics.pcie_lc_perf_other_end_recovery, MaxUIntegerTypes.UINT32_T),
"vram_max_bandwidth": _validate_if_max_uint(gpu_metrics.vram_max_bandwidth, MaxUIntegerTypes.UINT64_T),
"xgmi_link_status": _validate_if_max_uint(list(gpu_metrics.xgmi_link_status), MaxUIntegerTypes.UINT16_T),
}
# Create 2d array with each XCD's stats
if 'xcp_stats.gfx_busy_inst' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_busy_inst']):
xcp_detail = []
for val in xcp_metrics.gfx_busy_inst:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT32_T, isActivity=True))
gpu_metrics_output['xcp_stats.gfx_busy_inst'][xcp_index] = xcp_detail
if 'xcp_stats.jpeg_busy' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.jpeg_busy']):
xcp_detail = []
for val in xcp_metrics.jpeg_busy:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True))
gpu_metrics_output['xcp_stats.jpeg_busy'][xcp_index] = xcp_detail
if 'xcp_stats.vcn_busy' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.vcn_busy']):
xcp_detail = []
for val in xcp_metrics.vcn_busy:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True))
gpu_metrics_output["xcp_stats.vcn_busy"][xcp_index] = xcp_detail
if 'xcp_stats.gfx_busy_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_busy_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_busy_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output["xcp_stats.gfx_busy_acc"][xcp_index] = xcp_detail
if 'xcp_stats.gfx_below_host_limit_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_below_host_limit_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_below_host_limit_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output['xcp_stats.gfx_below_host_limit_acc'][xcp_index] = xcp_detail
# new for gpu metrics v1.8
if 'xcp_stats.gfx_below_host_limit_ppt_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_below_host_limit_ppt_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_below_host_limit_ppt_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output['xcp_stats.gfx_below_host_limit_ppt_acc'][xcp_index] = xcp_detail
if 'xcp_stats.gfx_below_host_limit_thm_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_below_host_limit_thm_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_below_host_limit_thm_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output['xcp_stats.gfx_below_host_limit_thm_acc'][xcp_index] = xcp_detail
if 'xcp_stats.gfx_low_utilization_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_low_utilization_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_low_utilization_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output['xcp_stats.gfx_low_utilization_acc'][xcp_index] = xcp_detail
if 'xcp_stats.gfx_below_host_limit_total_acc' in gpu_metrics_output:
for xcp_index, xcp_metrics in enumerate(gpu_metrics_output['xcp_stats.gfx_below_host_limit_total_acc']):
xcp_detail = []
for val in xcp_metrics.gfx_below_host_limit_total_acc:
xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T))
gpu_metrics_output['xcp_stats.gfx_below_host_limit_total_acc'][xcp_index] = xcp_detail
return gpu_metrics_output
def amdsmi_get_gpu_od_volt_curve_regions(
processor_handle: processor_handle_t, num_regions: int
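
A short sketch of how a caller might consume the dictionary returned by `amdsmi_get_gpu_partition_metrics_info`, assuming the flat dotted keys and the list-of-lists layout for `xcp_stats.*` shown in the function above, and assuming unavailable readings arrive as the string "N/A". The `sample` dictionary is hand-written for illustration, not captured from hardware, and `average_busy_per_xcp` is a hypothetical helper.

```python
from typing import Any, Dict, List, Union

NA = "N/A"


def average_busy_per_xcp(metrics: Dict[str, Any],
                         key: str = "xcp_stats.gfx_busy_inst") -> List[Union[float, str]]:
    """Average the numeric readings of each XCP, or "N/A" if it has none."""
    averages: List[Union[float, str]] = []
    for xcp_values in metrics.get(key, []):
        # Keep real numbers only; skip "N/A" markers (and bools, which are ints).
        numeric = [v for v in xcp_values
                   if isinstance(v, (int, float)) and not isinstance(v, bool)]
        averages.append(sum(numeric) / len(numeric) if numeric else NA)
    return averages


# Hand-written sample shaped like the function's output (illustrative values).
sample = {
    "common_header.format_revision": 1,
    "common_header.content_revision": 1,
    "num_partition": NA,
    "xcp_stats.gfx_busy_inst": [[10, 30, NA, NA], [NA, NA, NA, NA]],
}

if __name__ == "__main__":
    print(average_busy_per_xcp(sample))
```

On a real system the dictionary would come from calling the new API with a processor handle; the helper itself only depends on the dictionary shape.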
+19 -15
@@ -964,6 +964,21 @@ amdsmi_card_form_factor_t = ctypes.c_uint32 # enum
class struct_amdsmi_pcie_info_t(Structure):
pass
class struct_pcie_static_(Structure):
pass
struct_pcie_static_._pack_ = 1 # source:False
struct_pcie_static_._fields_ = [
('max_pcie_width', ctypes.c_uint16),
('PADDING_0', ctypes.c_ubyte * 2),
('max_pcie_speed', ctypes.c_uint32),
('pcie_interface_version', ctypes.c_uint32),
('slot_type', amdsmi_card_form_factor_t),
('max_pcie_interface_version', ctypes.c_uint32),
('PADDING_1', ctypes.c_ubyte * 4),
('reserved', ctypes.c_uint64 * 9),
]
class struct_pcie_metric_(Structure):
pass
@@ -984,21 +999,6 @@ struct_pcie_metric_._fields_ = [
('reserved', ctypes.c_uint64 * 12),
]
class struct_pcie_static_(Structure):
pass
struct_pcie_static_._pack_ = 1 # source:False
struct_pcie_static_._fields_ = [
('max_pcie_width', ctypes.c_uint16),
('PADDING_0', ctypes.c_ubyte * 2),
('max_pcie_speed', ctypes.c_uint32),
('pcie_interface_version', ctypes.c_uint32),
('slot_type', amdsmi_card_form_factor_t),
('max_pcie_interface_version', ctypes.c_uint32),
('PADDING_1', ctypes.c_ubyte * 4),
('reserved', ctypes.c_uint64 * 9),
]
struct_amdsmi_pcie_info_t._pack_ = 1 # source:False
struct_amdsmi_pcie_info_t._fields_ = [
('pcie_static', struct_pcie_static_),
@@ -2630,6 +2630,9 @@ amdsmi_get_gpu_metrics_header_info.argtypes = [amdsmi_processor_handle, ctypes.P
amdsmi_get_gpu_metrics_info = _libraries['libamd_smi.so'].amdsmi_get_gpu_metrics_info
amdsmi_get_gpu_metrics_info.restype = amdsmi_status_t
amdsmi_get_gpu_metrics_info.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_gpu_metrics_t)]
amdsmi_get_gpu_partition_metrics_info = _libraries['libamd_smi.so'].amdsmi_get_gpu_partition_metrics_info
amdsmi_get_gpu_partition_metrics_info.restype = amdsmi_status_t
amdsmi_get_gpu_partition_metrics_info.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_gpu_metrics_t)]
amdsmi_get_gpu_pm_metrics_info = _libraries['libamd_smi.so'].amdsmi_get_gpu_pm_metrics_info
amdsmi_get_gpu_pm_metrics_info.restype = amdsmi_status_t
amdsmi_get_gpu_pm_metrics_info.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.POINTER(struct_amdsmi_name_value_t)), ctypes.POINTER(ctypes.c_uint32)]
@@ -3418,6 +3421,7 @@ __all__ = \
'amdsmi_get_gpu_metrics_info',
'amdsmi_get_gpu_od_volt_curve_regions',
'amdsmi_get_gpu_od_volt_info', 'amdsmi_get_gpu_overdrive_level',
'amdsmi_get_gpu_partition_metrics_info',
'amdsmi_get_gpu_pci_bandwidth',
'amdsmi_get_gpu_pci_replay_counter',
'amdsmi_get_gpu_pci_throughput', 'amdsmi_get_gpu_perf_level',
+23
@@ -3264,6 +3264,29 @@ rsmi_status_t rsmi_dev_gpu_reset(uint32_t dv_ind);
rsmi_status_t rsmi_dev_od_volt_info_get(uint32_t dv_ind,
rsmi_od_volt_freq_data_t *odv);
/**
* @brief This function retrieves the gpu partition metrics information
*
* @details Given a device index @p dv_ind and a pointer to a
* ::rsmi_gpu_metrics_t structure @p pgpu_metrics, this function will populate
* @p pgpu_metrics. See ::rsmi_gpu_metrics_t for more details.
*
* @param[in] dv_ind a device index
*
* @param[inout] pgpu_metrics a pointer to an ::rsmi_gpu_metrics_t structure
* If this parameter is nullptr, this function will return
* ::RSMI_STATUS_INVALID_ARGS if the function is supported with the provided,
* arguments and ::RSMI_STATUS_NOT_SUPPORTED if it is not supported with the
* provided arguments.
*
* @retval ::RSMI_STATUS_SUCCESS call was successful
* @retval ::RSMI_STATUS_NOT_SUPPORTED installed software or hardware does not
* support this function with the given arguments
* @retval ::RSMI_STATUS_INVALID_ARGS the provided arguments are not valid
*/
rsmi_status_t rsmi_dev_gpu_partition_metrics_info_get(uint32_t dv_ind,
rsmi_gpu_metrics_t *pgpu_metrics);
/**
* @brief This function retrieves the gpu metrics information
*
+23 -9
@@ -156,6 +156,7 @@ enum DevInfoTypes {
kDevMemPageBad,
kDevNumaNode,
kDevGpuMetrics,
kdevGpuPartitionMetrics,
kDevPmMetrics,
kDevRegMetrics,
kDevBaseBoardTempMetrics,
@@ -215,7 +216,7 @@ class Device {
int readDevInfo(DevInfoTypes type, std::vector<std::string> *retVec);
int readDevInfo(DevInfoTypes type, std::size_t b_size,
void *p_binary_data);
std::string get_sys_file_path_by_type(DevInfoTypes type) const;
std::string get_sys_file_path_by_type(DevInfoTypes type, bool getPathOnly = false) const;
// Get the property from a file which may contain multiple properties.
int readDevInfo(DevInfoTypes type, const std::string& property,
std::string& value);
@@ -254,19 +255,31 @@ class Device {
template <typename T> std::string readBootPartitionState(uint32_t dv_ind);
rsmi_status_t check_amdgpu_property_reinforcement_query(uint32_t dev_idx, AMDGpuVerbTypes_t verb_type);
void dev_set_gpu_metric(GpuMetricsBasePtr gpu_metrics_ptr) { m_gpu_metrics_ptr = std::move(gpu_metrics_ptr); };
GpuMetricsBasePtr& dev_get_gpu_metric() { return m_gpu_metrics_ptr; };
const AMDGpuMetricsHeader_v1_t& dev_get_metrics_header() {return m_gpu_metrics_header; }
rsmi_status_t setup_gpu_metrics_reading();
rsmi_status_t dev_read_gpu_metrics_header_data();
rsmi_status_t dev_read_gpu_metrics_all_data();
rsmi_status_t run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t metric_counter, AMDGpuDynamicMetricTblValues_t& values);
rsmi_status_t dev_log_gpu_metrics(std::ostringstream& outstream_metrics);
AMGpuMetricsPublicLatestTupl_t dev_copy_internal_to_external_metrics();
auto setup_gpu_metrics_reading(DevInfoTypes type = DevInfoTypes::kDevGpuMetrics)
-> rsmi_status_t;
auto dev_read_gpu_metrics_header_data(DevInfoTypes type = DevInfoTypes::kDevGpuMetrics)
-> rsmi_status_t;
auto dev_read_gpu_metrics_all_data(DevInfoTypes type = DevInfoTypes::kDevGpuMetrics)
-> rsmi_status_t;
auto run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t metric_counter,
AMDGpuDynamicMetricTblValues_t &values,
DevInfoTypes type = DevInfoTypes::kDevGpuMetrics)
-> rsmi_status_t;
auto dev_log_gpu_metrics(std::ostringstream &outstream_metrics,
DevInfoTypes type = DevInfoTypes::kDevGpuMetrics) -> rsmi_status_t;
auto dev_copy_internal_to_external_metrics(DevInfoTypes type = DevInfoTypes::kDevGpuMetrics)
-> AMGpuMetricsPublicLatestTupl_t;
static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
void set_smi_device_id(uint32_t device_id) { m_device_id = device_id; }
void set_smi_partition_id(uint32_t partition_id) { m_partition_id = partition_id; }
auto set_smi_dev_info_type(DevInfoTypes type) -> void { m_dev_info_type = type; }
auto get_smi_device_id(void) const -> uint32_t { return m_device_id; }
auto get_smi_partition_id(void) const -> uint32_t { return m_partition_id; }
auto is_smi_expecting_partition_metrics(void) const -> bool {
return m_dev_info_type == DevInfoTypes::kdevGpuPartitionMetrics;
}
static const char* get_type_string(DevInfoTypes type);
rsmi_status_t get_smi_device_identifiers(uint32_t device_id,
rsmi_device_identifiers_t *device_identifiers);
@@ -310,6 +323,7 @@ class Device {
uint64_t m_gpu_metrics_updated_timestamp;
uint32_t m_device_id;
uint32_t m_partition_id;
DevInfoTypes m_dev_info_type{DevInfoTypes::kDevGpuMetrics};
// New dynamic GPU metrics support
bool m_is_dynamic_gpu_metrics_supported = false;
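The header change above adds a `DevInfoTypes` parameter that defaults to `kDevGpuMetrics`, so existing callers keep reading the whole-GPU file while partition-aware callers opt in. A minimal sketch of that idea follows. `InfoType` and `DeviceSketch` are hypothetical names used only for illustration; the two file names are the ones this change maps in the attribute table.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the two DevInfoTypes values involved.
enum class InfoType { kGpuMetrics, kPartitionMetrics };

class DeviceSketch {
 public:
  void set_info_type(InfoType t) { info_type_ = t; }

  // True only after a caller has asked for partition (XCP) metrics.
  bool is_expecting_partition_metrics() const {
    return info_type_ == InfoType::kPartitionMetrics;
  }

  // Defaulted parameter: callers written before partition support still
  // compile unchanged and still resolve to the whole-GPU metrics file.
  std::string metrics_file(InfoType t = InfoType::kGpuMetrics) const {
    return t == InfoType::kPartitionMetrics ? "xcp/xcp_metrics" : "gpu_metrics";
  }

 private:
  InfoType info_type_{InfoType::kGpuMetrics};
};
```

Defaulting the argument keeps the old call sites source-compatible, which is presumably why the declarations were extended rather than duplicated.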
+12 -36
@@ -1,49 +1,24 @@
/*
* MIT License
*
* Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
*
* Developed by:
*
* AMD ML Software Engineering
*
* Advanced Micro Devices, Inc.
*
* www.amd.com
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* - Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimers.
* - Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimers in
* the documentation and/or other materials provided with the distribution.
* - Neither the names of Advanced Micro Devices, Inc,
* nor the names of its contributors may be used to endorse or promote
* products derived from this Software without specific prior written
* permission.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*
*
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#ifndef ROCM_SMI_ROCM_SMI_DYN_GPU_METRICS_H_
#define ROCM_SMI_ROCM_SMI_DYN_GPU_METRICS_H_
@@ -299,6 +274,7 @@ enum class AMDGpuMetricUnitType_t
QUANTITY,
STATUS_FLAG
};
using AMDGpuMetricUnitTypeTranslationTable_t = std::unordered_map<AMDGpuMetricUnitType_t, AMDGpuDynamicTranslationTextInfo_t>;
static const auto AMDGpuMetricUnitTypeToString = AMDGpuMetricUnitTypeTranslationTable_t {
+62 -8
@@ -26,6 +26,7 @@
#include "rocm_smi/rocm_smi_common.h"
#include "rocm_smi/rocm_smi.h"
#include "rocm_smi/rocm_smi_dyn_gpu_metrics.h"
#include "rocm_smi/rocm_smi_logger.h"
#include <array>
#include <algorithm>
@@ -689,6 +690,33 @@ struct AMDGpuMetrics_v17_t {
uint32_t m_pcie_lc_perf_other_end_recovery;
};
struct AMDGpuMetrics_v18_Partition_v1_0_t {
~AMDGpuMetrics_v18_Partition_v1_0_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
/* Current clocks (MHz) */
uint16_t m_current_gfxclk[kRSMI_MAX_NUM_XCC];
uint16_t m_current_socclk[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_vclk0[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_dclk0[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_uclk;
uint16_t m_padding;
/* Utilization Instantaneous (%) */
uint32_t m_gfx_busy_inst[kRSMI_MAX_NUM_XCC];
uint16_t m_jpeg_busy[kRSMI_MAX_NUM_JPEG_ENG_V1];
uint16_t m_vcn_busy[kRSMI_MAX_NUM_VCNS];
/* Utilization Accumulated (%) */
uint64_t m_gfx_busy_acc[kRSMI_MAX_NUM_XCC];
/* Total App Clock Counter Accumulated */
uint64_t m_gfx_below_host_limit_ppt_acc[kRSMI_MAX_NUM_XCC];
uint64_t m_gfx_below_host_limit_thm_acc[kRSMI_MAX_NUM_XCC];
uint64_t m_gfx_low_utilization_acc[kRSMI_MAX_NUM_XCC];
uint64_t m_gfx_below_host_limit_total_acc[kRSMI_MAX_NUM_XCC];
};
struct AMDGpuMetrics_v18_t {
~AMDGpuMetrics_v18_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -1053,8 +1081,10 @@ enum class AMDGpuMetricVersionFlags_t : AMDGpuMetricVersionFlagId_t
kGpuMetricV15 = (0x1 << 5),
kGpuMetricV16 = (0x1 << 6),
kGpuMetricV17 = (0x1 << 7),
kGpuMetricV18 = (0x1 << 8), // Added new version flag: Last static GPU Metrics
kGpuMetricV19 = (0x1 << 9), // Dyn.GPU Metrics
kGpuMetricV18 = (0x1 << 8),
kGpuXcpMetricV10 = (0x1 << 0), // Added in v1.8 for partition metrics v1.0
kGpuMetricDynV19Plus = (0x1 << 9), // Dyn. GPU Metrics v1.9+
kGpuXcpMetricDynV11Plus = (0x1 << 1), // Added in v1.9 for Dyn. partition metrics v1.1+
};
using AMDGpuMetricVersionTranslationTbl_t = std::map<uint16_t, AMDGpuMetricVersionFlags_t>;
using GpuMetricTypePtr_t = std::shared_ptr<void>;
@@ -1069,6 +1099,7 @@ class GpuMetricsBase_t {
virtual AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() = 0;
virtual void set_device_id(uint32_t device_id) { m_device_id = device_id; }
virtual void set_partition_id(uint32_t partition_id) { m_partition_id = partition_id; }
virtual void set_is_partition_metrics(bool is_partition_req) { m_is_partition_metrics = is_partition_req; }
static std::mutex s_base_tbl_mu;
virtual AMDGpuDynamicMetricsTbl_t get_metrics_dynamic_tbl() {
std::lock_guard<std::mutex> lk(s_base_tbl_mu);
@@ -1080,6 +1111,7 @@ class GpuMetricsBase_t {
uint64_t m_metrics_timestamp;
uint32_t m_device_id;
uint32_t m_partition_id;
bool m_is_partition_metrics {false};
};
using GpuMetricsBasePtr = std::shared_ptr<GpuMetricsBase_t>;
using AMDGpuMetricFactories_t = const std::map<AMDGpuMetricVersionFlags_t, GpuMetricsBasePtr>;
@@ -1293,11 +1325,31 @@ class GpuMetricsBase_v18_t final : public GpuMetricsBase_t {
}
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v18_t*){});
std::ostringstream ss;
ss << __PRETTY_FUNCTION__
<< " ==== START ==== "
<< " Initializing metrics table request: "
<< " | Partition ID: " << m_partition_id
<< " | Device ID: " << m_device_id
<< " | Is Partition Metrics: " << std::boolalpha << m_is_partition_metrics
<< " | m_gpu_metric_ptr: " << (!m_gpu_metric_ptr ? "nullptr" : "valid")
<< " | m_gpu_metric_partition_ptr: "
<< (!m_gpu_metric_partition_ptr ? "nullptr" : "valid");
LOG_DEBUG(ss);
// If m_is_partition_metrics is false, we use the main GPU metrics table.
// Otherwise, we use the partition metrics table.
// This is to avoid having two pointers to the same table.
if (m_is_partition_metrics && !m_gpu_metric_partition_ptr) {
return std::shared_ptr<AMDGpuMetrics_v18_Partition_v1_0_t>(
&m_gpu_metrics_partition_tbl, [](AMDGpuMetrics_v18_Partition_v1_0_t*){/* no-op */});
} else if (!m_is_partition_metrics && !m_gpu_metric_ptr) {
return std::shared_ptr<AMDGpuMetrics_v18_t>(
&m_gpu_metrics_tbl, [](AMDGpuMetrics_v18_t*){/* no-op */});
}
assert(m_gpu_metric_ptr != nullptr);
return m_gpu_metric_ptr;
return std::shared_ptr<AMDGpuMetrics_v18_t>(
nullptr, [](AMDGpuMetrics_v18_t*){/* no-op */}); // Return nullptr if we couldn't
// validate which metric table
// user is requesting
}
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
@@ -1310,10 +1362,12 @@ class GpuMetricsBase_v18_t final : public GpuMetricsBase_t {
private:
AMDGpuMetrics_v18_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v18_t> m_gpu_metric_ptr;
AMDGpuMetrics_v18_Partition_v1_0_t m_gpu_metrics_partition_tbl;
std::shared_ptr<AMDGpuMetrics_v18_Partition_v1_0_t> m_gpu_metric_partition_ptr;
};
class GpuMetricsBaseDynamic_t final : public GpuMetricsBase_t {
public:
public:
~GpuMetricsBaseDynamic_t() = default;
// Unused
@@ -1341,7 +1395,7 @@ class GpuMetricsBaseDynamic_t final : public GpuMetricsBase_t {
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuDynamicMetrics_t m_dyn;
details::AMDGpuDynamicMetricsHeader_v1_t m_header{};
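The `get_metrics_table()` change above returns a `std::shared_ptr` that points at a member table and carries an empty deleter, so the pointer never frees storage it does not own. A minimal sketch of that pattern is below. `MetricsTable`, `PartitionTable`, and `MetricsHolder` are hypothetical names, not types from the library.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical stand-ins for the whole-GPU and partition metric tables.
struct MetricsTable { std::uint16_t current_uclk{0}; };
struct PartitionTable { std::uint16_t current_uclk{0}; };

class MetricsHolder {
 public:
  void set_is_partition_metrics(bool v) { is_partition_ = v; }

  // Returns a type-erased, non-owning pointer to the requested table.
  // The empty deleter stops shared_ptr from deleting a member object.
  std::shared_ptr<void> get_metrics_table() {
    if (is_partition_) {
      return std::shared_ptr<PartitionTable>(&partition_tbl_,
                                             [](PartitionTable*) { /* no-op */ });
    }
    return std::shared_ptr<MetricsTable>(&gpu_tbl_,
                                         [](MetricsTable*) { /* no-op */ });
  }

  const void* gpu_table_address() const { return &gpu_tbl_; }
  const void* partition_table_address() const { return &partition_tbl_; }

 private:
  bool is_partition_{false};
  MetricsTable gpu_tbl_{};
  PartitionTable partition_tbl_{};
};
```

The returned pointer is only valid while the holder is alive, since nothing extends the lifetime of the member it refers to.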
+24 -2
@@ -114,6 +114,7 @@ static const char *kDevXGMIErrorFName = "xgmi_error";
static const char *kDevSerialNumberFName = "serial_number";
static const char *kDevNumaNodeFName = "numa_node";
static const char *kDevGpuMetricsFName = "gpu_metrics";
static const char *kDevGpuPartitionMetricsFName = "xcp/xcp_metrics";
static const char *kDevPmMetricsFName = "pm_metrics"; // PM log
static const char *kDevRegMetricsFName = "reg_state"; // register table
static const char *kDevBaseBoardTempMetricsFName = "board/baseboard_temp";
@@ -321,6 +322,7 @@ static const std::map<DevInfoTypes, const char *> kDevAttribNameMap = {
{kDevMemPageBad, kDevMemPageBadFName},
{kDevNumaNode, kDevNumaNodeFName},
{kDevGpuMetrics, kDevGpuMetricsFName},
{kdevGpuPartitionMetrics, kDevGpuPartitionMetricsFName},
{kDevPmMetrics, kDevPmMetricsFName},
{kDevSocPstate, kDevSocPstateFName},
{kDevXgmiPlpd, kDevXgmiPlpdFName},
@@ -498,6 +500,7 @@ Device::devInfoTypesStrings = {
{kDevMemPageBad, "kDevMemPageBad"},
{kDevNumaNode, "kDevNumaNode"},
{kDevGpuMetrics, "kDevGpuMetrics"},
{kdevGpuPartitionMetrics, "kdevGpuPartitionMetrics"},
{kDevPmMetrics, "kDevPmMetrics"},
{kDevRegMetrics, "kDevRegMetrics"},
{kDevBaseBoardTempMetrics, "kDevBaseBoardTempMetrics"},
@@ -747,10 +750,29 @@ int Device::openDebugFileStream(DevInfoTypes type, T *fs, const char *str) {
return 0;
}
std::string Device::get_sys_file_path_by_type(DevInfoTypes type) const {
/**
* @brief Get the sysfs file path for a given device attribute type.
*
* This function constructs the full path to a sysfs file corresponding to the specified
* device attribute type for this device instance. The path is constructed using the device's
* base path, appending "/device/" and the attribute name from kDevAttribNameMap.
*
* If getPathOnly is true, the constructed path is returned without checking for file existence.
* If getPathOnly is false, the function checks if the file exists; if not, an empty string is returned.
*
* @param type The device attribute type (DevInfoTypes) for which to get the sysfs file path.
* @param getPathOnly If true, return the constructed path without checking for file existence.
* If false, return an empty string if the file does not exist.
* @return std::string The full sysfs file path, or an empty string if the file does not exist
* and getPathOnly is false.
*/
std::string Device::get_sys_file_path_by_type(DevInfoTypes type, bool getPathOnly) const {
auto sysfs_path = path_;
sysfs_path += "/device/";
sysfs_path += kDevAttribNameMap.at(type);
if (getPathOnly) {
return sysfs_path;
}
if (access(sysfs_path.c_str(), F_OK) != 0) {
sysfs_path.clear();
@@ -1133,7 +1155,6 @@ int Device::readDevInfoBinary(DevInfoTypes type, std::size_t b_size,
// is the issue, so should remain.
const std::string key = path_ + "/device/" + kDevAttribNameMap.at(type)
+ "#" + std::to_string(b_size);
GpuMetricsCache* cache_ptr = nullptr;
{
std::lock_guard<std::mutex> map_lk(g_gpu_metrics_cache_map_mu);
@@ -1447,6 +1468,7 @@ int Device::readDevInfo(DevInfoTypes type, std::size_t b_size,
switch (type) {
case kDevGpuMetrics:
case kdevGpuPartitionMetrics:
return readDevInfoBinary(type, b_size, p_binary_data);
break;
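The doc comment on `get_sys_file_path_by_type` describes two modes: return the constructed path as-is, or return an empty string when the file is missing. A minimal sketch of that behavior as a free function follows. `sys_file_path` and its parameters are hypothetical; the real method reads the base path and attribute name from the `Device` instance.

```cpp
#include <cassert>
#include <string>
#include <unistd.h>  // access(), F_OK (POSIX)

// Hypothetical free-function version of the lookup described above.
std::string sys_file_path(const std::string& base_path,
                          const std::string& attr_name,
                          bool get_path_only = false) {
  std::string sysfs_path = base_path + "/device/" + attr_name;
  if (get_path_only) {
    return sysfs_path;  // caller only wants the string, e.g. for logging
  }
  if (access(sysfs_path.c_str(), F_OK) != 0) {
    sysfs_path.clear();  // a missing file is reported as an empty path
  }
  return sysfs_path;
}
```

The path-only mode is useful when the caller wants to name the file in a log or error message even though it does not exist on this system.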
+23 -36
@@ -1,46 +1,23 @@
/*
* MIT License
*
* Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
*
* Developed by:
*
* AMD ML Software Engineering
*
* Advanced Micro Devices, Inc.
*
* www.amd.com
*
* Permission is hereby granted, free of charge, to any person obtaining a
* copy of this software and associated documentation files (the "Software"),
* to deal in the Software without restriction, including without limitation
* the rights to use, copy, modify, merge, publish, distribute, sublicense,
* and/or sell copies of the Software, and to permit persons to whom the
* Software is furnished to do so, subject to the following conditions:
*
* - Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimers.
* - Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimers in
* the documentation and/or other materials provided with the distribution.
* - Neither the names of Advanced Micro Devices, Inc,
* nor the names of its contributors may be used to endorse or promote
* products derived from this Software without specific prior written
* permission.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
* THE COPYRIGHT HOLDER(S) OR AUTHOR(S) BE LIABLE FOR ANY CLAIM, DAMAGES OR
* OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE,
* ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR
* OTHER DEALINGS IN THE SOFTWARE.
*
*
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#include "rocm_smi/rocm_smi.h"
@@ -156,7 +133,7 @@ static inline std::optional<AMDGpuMetricAttributeValue_t> read_metric_value(Curs
auto AMDGpuDynamicMetrics_t::parse_from_buffer(const std::byte* data,
std::size_t size) noexcept -> rsmi_status_t {
std::ostringstream ss;
rsmi_status_t status = RSMI_STATUS_SUCCESS;
if (!data || (size < (sizeof(AMDGpuDynamicMetricsHeader_v1_t) + sizeof(uint32_t)))) {
return RSMI_STATUS_INSUFFICIENT_SIZE;
@@ -178,6 +155,17 @@ auto AMDGpuDynamicMetrics_t::parse_from_buffer(const std::byte* data,
if (attr_count == 0 || attr_count > size){
return RSMI_STATUS_UNEXPECTED_SIZE;
}
std::string m_header_version_str = std::to_string(static_cast<uint32_t>(hdr.m_format_revision))
+ "." +
std::to_string(static_cast<uint32_t>(hdr.m_content_revision));
ss << __PRETTY_FUNCTION__
<< " | Info: Dynamic GPU Metrics"
<< " | Attr Count: " << attr_count
<< " | Header Version: " << m_header_version_str
<< " | Header Size: " << hdr.get_size()
<< " | Total Size: " << size
<< " |";
LOG_TRACE(ss);
details::AMDGpuMetricSchemaType_t metrics_data;
metrics_data.reserve(attr_count);
@@ -212,7 +200,6 @@ auto AMDGpuDynamicMetrics_t::parse_from_buffer(const std::byte* data,
AMDGpuMetricAttributeInstance_t inst{};
status = schema_lookup_instance(attr_id, attr_type, inst);
if (status != RSMI_STATUS_SUCCESS){
std::ostringstream ss;
ss << __PRETTY_FUNCTION__
<< " | Warn: schema lookup miss"
<< " | Attr ID: " << static_cast<std::underlying_type_t<AMDGpuMetricAttributeId_t>>(attr_id)
File diff not shown because it is too large.
+15 -1
@@ -3352,13 +3352,27 @@ amdsmi_get_gpu_metrics_header_info(amdsmi_processor_handle processor_handle,
reinterpret_cast<metrics_table_header_t*>(header_value));
}
amdsmi_status_t amdsmi_get_gpu_partition_metrics_info(
amdsmi_processor_handle processor_handle,
amdsmi_gpu_metrics_t *pgpu_metrics) {
AMDSMI_CHECK_INIT();
if (pgpu_metrics != nullptr) {
*pgpu_metrics = amdsmi_gpu_metrics_t{}; // Use a default initializer for the struct
} else {
return AMDSMI_STATUS_INVAL; // Return error if pgpu_metrics is null
}
return rsmi_wrapper(rsmi_dev_gpu_partition_metrics_info_get, processor_handle, 0,
reinterpret_cast<rsmi_gpu_metrics_t*>(pgpu_metrics));
}
amdsmi_status_t amdsmi_get_gpu_metrics_info(
amdsmi_processor_handle processor_handle,
amdsmi_gpu_metrics_t *pgpu_metrics) {
AMDSMI_CHECK_INIT();
// nullptr api supported
if (pgpu_metrics != nullptr) {
*pgpu_metrics = amdsmi_gpu_metrics_t{}; // Use a default initializer for the struct
} else {
return AMDSMI_STATUS_INVAL; // Return error if pgpu_metrics is null
}
return rsmi_wrapper(rsmi_dev_gpu_metrics_info_get, processor_handle, 0,
reinterpret_cast<rsmi_gpu_metrics_t*>(pgpu_metrics));
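Both wrappers above follow the same guard: reject a null output pointer, value-initialize the caller's struct, then delegate. A minimal sketch of that sequence is below. `Status`, `Metrics`, `backend_fill`, and `get_partition_metrics` are hypothetical stand-ins, not the amdsmi types or functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical status codes and output struct.
enum class Status { kSuccess, kInval };
struct Metrics {
  std::uint16_t temperature_edge;
  std::uint32_t gfx_activity;
};

// Hypothetical backend; the real code forwards to the rocm_smi layer.
inline Status backend_fill(Metrics* out) {
  out->temperature_edge = 42;
  return Status::kSuccess;
}

Status get_partition_metrics(Metrics* out) {
  if (out == nullptr) {
    return Status::kInval;  // reject null before touching the pointer
  }
  *out = Metrics{};  // zero every field so stale caller data cannot leak through
  return backend_fill(out);
}
```

Clearing the struct first means fields the backend does not populate read as zero instead of whatever the caller left there.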
+5
@@ -52,6 +52,11 @@ include_directories(${TEST} ${CMAKE_CURRENT_SOURCE_DIR}/.. ${ROCM_INC_DIR}/..)
add_executable(${TEST} ${tstSources} ${functionalSources})
target_link_libraries(${TEST} ${AMD_SMI} GTest::gtest_main c stdc++ pthread)
if (CMAKE_CXX_COMPILER_ID STREQUAL "GNU"
AND CMAKE_CXX_COMPILER_VERSION VERSION_LESS "9")
target_link_libraries(${TEST} stdc++fs)
endif()
# Install tests
install(
TARGETS ${TEST}
+203
@@ -0,0 +1,203 @@
/*
* Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#include <amd_smi_test/test_base.h>
#include <gtest/gtest.h>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>
#include "rocm_smi/rocm_smi_gpu_metrics.h"
namespace amd::smi {
// Forward declarations of internal helpers we exercise in this unit-test.
AMDGpuMetricVersionFlags_t translate_header_to_flag_version(
const AMDGpuMetricsHeader_v1_t& metrics_header, bool is_partition_metrics,
const std::string& file_path);
GpuMetricsBasePtr amdgpu_metrics_factory(AMDGpuMetricVersionFlags_t gpu_metric_version,
bool is_partition_metrics, const std::string& file_path);
} // namespace amd::smi
namespace {
// Version helper checker
auto GetExpectedMetricVersionFlag(uint16_t major, uint16_t minor, bool is_partition_metrics)
-> amd::smi::AMDGpuMetricVersionFlags_t {
using Flag = amd::smi::AMDGpuMetricVersionFlags_t;
if (is_partition_metrics) {
if (major == 1) {
if (minor == 0) {
return Flag::kGpuXcpMetricV10;
} else if (minor >= 1) {
return Flag::kGpuXcpMetricDynV11Plus;
} else {
return Flag::kGpuMetricNone;
}
}
} else { // GPU metrics
if (major == 1) {
switch (minor) {
case 0: return Flag::kGpuMetricNone;
case 1: return Flag::kGpuMetricV11;
case 2: return Flag::kGpuMetricV12;
case 3: return Flag::kGpuMetricV13;
case 4: return Flag::kGpuMetricV14;
case 5: return Flag::kGpuMetricV15;
case 6: return Flag::kGpuMetricV16;
case 7: return Flag::kGpuMetricV17;
case 8: return Flag::kGpuMetricV18;
default: return Flag::kGpuMetricDynV19Plus;
}
}
}
return Flag::kGpuMetricNone;
}
// pass a header we want to test against
auto BuildFakeMetricsBlob(amd::smi::AMDGpuMetricsHeader_v1_t new_header) -> std::vector<uint8_t> {
if (new_header.m_structure_size < sizeof(new_header)) {
throw std::runtime_error("Header size too small");
}
amd::smi::AMDGpuMetricsHeader_v1_t header{};
header.m_structure_size = static_cast<uint16_t>(sizeof(header));
header.m_format_revision = new_header.m_format_revision;
header.m_content_revision = new_header.m_content_revision;
const uint8_t* begin = reinterpret_cast<const uint8_t*>(&header);
return std::vector<uint8_t>(begin, begin + sizeof(header));
}
auto WriteBlobToTempFile(const std::vector<uint8_t>& blob,
const std::string& filename = "amdsmi_fake_metrics.bin")
-> std::filesystem::path {
auto temp_dir = std::filesystem::temp_directory_path();
auto file_path = temp_dir / filename;
std::ofstream stream(file_path, std::ios::binary | std::ios::trunc);
stream.write(reinterpret_cast<const char*>(blob.data()),
static_cast<std::streamsize>(blob.size()));
stream.close();
return file_path;
}
} // namespace
TEST(AmdSmiDynamicMetricTest, GPUMetricDynamicVersionSupported) {
const bool is_partition_metrics = false;
for (auto ver : {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}) {
std::string test_detail = "[GPUMetric";
if (ver >= 9) {
test_detail += "Dynamic] ";
} else {
test_detail += "] ";
}
std::cout << test_detail << "Checking version 1." << ver << std::endl;
SCOPED_TRACE(testing::Message() << "Subtest for minor version: 1." << ver);
const auto blob = BuildFakeMetricsBlob(amd::smi::AMDGpuMetricsHeader_v1_t{
.m_structure_size = sizeof(amd::smi::AMDGpuMetricsHeader_v1_t),
.m_format_revision = 1,
.m_content_revision = static_cast<uint16_t>(ver), // Known minor versions
});
const auto fake_path =
WriteBlobToTempFile(blob, "amdsmi_fake_gpu_metrics_v1" + std::to_string(ver) + ".bin");
ASSERT_FALSE(blob.empty());
ASSERT_TRUE(std::filesystem::exists(fake_path));
const auto* header = reinterpret_cast<const amd::smi::AMDGpuMetricsHeader_v1_t*>(blob.data());
const auto flag = amd::smi::translate_header_to_flag_version(*header, is_partition_metrics,
fake_path.string());
EXPECT_EQ(flag, GetExpectedMetricVersionFlag(1, ver, is_partition_metrics))
<< "Version 1." << ver << " should be treated as supported";
auto gpu_metrics_ptr =
amd::smi::amdgpu_metrics_factory(flag, is_partition_metrics, fake_path.string());
if (ver != 0) {
EXPECT_NE(gpu_metrics_ptr, nullptr)
<< "Factory must create metrics object for supported version";
} else {
EXPECT_EQ(gpu_metrics_ptr, nullptr)
<< "Factory must not create metrics object for unsupported versions";
}
if (gpu_metrics_ptr) {
std::cout << test_detail << "Created valid object for version 1." << ver << std::endl;
} else {
std::cout << test_detail << "Unsupported Metric Version"
<< " | Failed to create valid object for version 1." << ver << std::endl;
}
std::filesystem::remove(fake_path);
}
}
TEST(AmdSmiDynamicMetricTest, XCPMetricDynamicVersionSupported) {
const bool is_partition_metrics = true;
for (auto ver : {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18}) {
std::string test_detail = "[XCPMetric";
if (ver >= 1) {
test_detail += "Dynamic] ";
} else {
test_detail += "] ";
}
std::cout << test_detail << "Checking version 1." << ver << std::endl;
SCOPED_TRACE(testing::Message() << "Subtest for minor version: 1." << ver);
const auto blob = BuildFakeMetricsBlob(amd::smi::AMDGpuMetricsHeader_v1_t{
.m_structure_size = sizeof(amd::smi::AMDGpuMetricsHeader_v1_t),
.m_format_revision = 1,
.m_content_revision = static_cast<uint16_t>(ver), // Known minor versions
});
const auto fake_path =
WriteBlobToTempFile(blob, "amdsmi_fake_xcp_metrics_v1" + std::to_string(ver) + ".bin");
ASSERT_FALSE(blob.empty());
ASSERT_TRUE(std::filesystem::exists(fake_path));
const auto* header = reinterpret_cast<const amd::smi::AMDGpuMetricsHeader_v1_t*>(blob.data());
const auto flag = amd::smi::translate_header_to_flag_version(*header, is_partition_metrics,
fake_path.string());
EXPECT_EQ(flag, GetExpectedMetricVersionFlag(1, ver, is_partition_metrics))
<< "Version 1." << ver << " should be treated as supported";
auto xcp_metrics_ptr =
amd::smi::amdgpu_metrics_factory(flag, is_partition_metrics, fake_path.string());
EXPECT_NE(xcp_metrics_ptr, nullptr)
<< "Factory must create metrics object for supported version";
if (xcp_metrics_ptr) {
std::cout << test_detail << "Created valid object for version 1." << ver << std::endl;
} else {
std::cout << test_detail << "Failed to create valid object for version 1." << ver
<< std::endl;
}
std::filesystem::remove(fake_path);
}
}
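The tests above build a header-only blob and then read the header back with `reinterpret_cast`. A minimal sketch of the same round trip using `std::memcpy`, which makes no alignment assumptions about the byte buffer, is shown below. `FakeHeader`, `build_blob`, and `read_header` are hypothetical, and the field widths are an assumption rather than the library's definition.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header layout mirroring the three fields the tests set.
struct FakeHeader {
  std::uint16_t structure_size;
  std::uint8_t format_revision;
  std::uint8_t content_revision;
};

// Serializes a header carrying the given major/minor revision into raw bytes.
std::vector<std::uint8_t> build_blob(std::uint8_t major, std::uint8_t minor) {
  FakeHeader h{static_cast<std::uint16_t>(sizeof(FakeHeader)), major, minor};
  std::vector<std::uint8_t> blob(sizeof(h));
  std::memcpy(blob.data(), &h, sizeof(h));
  return blob;
}

// Copies the header back out of the buffer instead of aliasing it.
FakeHeader read_header(const std::vector<std::uint8_t>& blob) {
  assert(blob.size() >= sizeof(FakeHeader));
  FakeHeader h{};
  std::memcpy(&h, blob.data(), sizeof(h));
  return h;
}
```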
+426
@@ -0,0 +1,426 @@
/*
* Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#include <cstdint>
#include <iostream>
#include <iterator>
#include <string>
#include <map>
#include <gtest/gtest.h>
#include "amd_smi/amdsmi.h"
#include "gpu_partition_metrics_read.h"
#include "../test_common.h"
#include "rocm_smi/rocm_smi_utils.h"
#include "amd_smi/impl/amd_smi_utils.h"
TestGpuPartitionMetricsRead::TestGpuPartitionMetricsRead() : TestBase() {
set_title("AMDSMI GPU Partition (XCP) Metrics Read Test");
set_description("The GPU Partition (XCP) Metrics test verifies that "
"the GPU partition metrics info can be read properly.");
}
TestGpuPartitionMetricsRead::~TestGpuPartitionMetricsRead(void) {
}
void TestGpuPartitionMetricsRead::SetUp(void) {
TestBase::SetUp();
return;
}
void TestGpuPartitionMetricsRead::DisplayTestInfo(void) {
TestBase::DisplayTestInfo();
}
void TestGpuPartitionMetricsRead::DisplayResults(void) const {
TestBase::DisplayResults();
return;
}
void TestGpuPartitionMetricsRead::Close() {
// This will close handles opened within amdsmitst utility calls and call
// amdsmi_shut_down(), so it should be done after other hsa cleanup
TestBase::Close();
}
void TestGpuPartitionMetricsRead::Run(void) {
amdsmi_status_t err;
TestBase::Run();
if (setup_failed_) {
std::cout << "** SetUp Failed for this test. Skipping.**" << std::endl;
return;
}
for (uint32_t i = 0; i < num_monitor_devs(); ++i) {
PrintDeviceHeader(processor_handles_[i]);
std::cout << "Device #" << std::to_string(i) << "\n";
IF_VERB(STANDARD) {
std::cout << "\n\n";
std::cout << "\t**GPU PARTITION METRICS: Using static struct (Backwards Compatibility):\n";
}
amdsmi_gpu_metrics_t smu = {};
err = amdsmi_get_gpu_partition_metrics_info(processor_handles_[i], &smu);
const char *status_string;
amdsmi_status_code_to_string(err, &status_string);
std::cout << "\t\t** amdsmi_get_gpu_partition_metrics_info(): " << status_string
<< "\n";
if (err != AMDSMI_STATUS_SUCCESS) {
if (err == AMDSMI_STATUS_NOT_SUPPORTED) {
IF_VERB(STANDARD) {
std::cout << "\t**" <<
"Not supported on this machine" << std::endl;
}
continue; // Skip unsupported devices regardless of verbosity
}
CHK_ERR_ASRT(err); // Anything else should be a failure
// (i.e., we are not handling the metrics correctly)
} else {
IF_VERB(STANDARD) {
std::cout << "METRIC TABLE HEADER:\n";
std::cout << "structure_size=" << std::dec
<< static_cast<uint16_t>(smu.common_header.structure_size) << "\n";
std::cout << "format_revision=" << std::dec
<< static_cast<uint16_t>(smu.common_header.format_revision) << "\n";
std::cout << "content_revision=" << std::dec
<< static_cast<uint16_t>(smu.common_header.content_revision) << "\n";
std::cout << "\n";
std::cout << "TIME STAMPS (ns):\n";
std::cout << std::dec << "system_clock_counter=" << smu.system_clock_counter << "\n";
std::cout << "firmware_timestamp (10ns resolution)=" << std::dec << smu.firmware_timestamp
<< "\n";
std::cout << "\n";
std::cout << "TEMPERATURES (C):\n";
std::cout << std::dec << "temperature_edge= " << smu.temperature_edge << "\n";
std::cout << std::dec << "temperature_hotspot= " << smu.temperature_hotspot << "\n";
std::cout << std::dec << "temperature_mem= " << smu.temperature_mem << "\n";
std::cout << std::dec << "temperature_vrgfx= " << smu.temperature_vrgfx << "\n";
std::cout << std::dec << "temperature_vrsoc= " << smu.temperature_vrsoc << "\n";
std::cout << std::dec << "temperature_vrmem= " << smu.temperature_vrmem << "\n";
std::cout << "temperature_hbm = [";
std::copy(std::begin(smu.temperature_hbm),
std::end(smu.temperature_hbm),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << "UTILIZATION (%):\n";
std::cout << std::dec << "average_gfx_activity=" << smu.average_gfx_activity << "\n";
std::cout << std::dec << "average_umc_activity=" << smu.average_umc_activity << "\n";
std::cout << std::dec << "average_mm_activity=" << smu.average_mm_activity << "\n";
std::cout << std::dec << "vcn_activity= [";
std::copy(std::begin(smu.vcn_activity),
std::end(smu.vcn_activity),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << std::dec << "jpeg_activity= [";
std::copy(std::begin(smu.jpeg_activity),
std::end(smu.jpeg_activity),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << "POWER (W)/ENERGY (15.259uJ per 1ns):\n";
std::cout << std::dec << "average_socket_power=" << smu.average_socket_power << "\n";
std::cout << std::dec << "current_socket_power=" << smu.current_socket_power << "\n";
std::cout << std::dec << "energy_accumulator=" << smu.energy_accumulator << "\n";
std::cout << "\n";
std::cout << "AVG CLOCKS (MHz):\n";
std::cout << std::dec << "average_gfxclk_frequency=" << smu.average_gfxclk_frequency
<< "\n";
std::cout << std::dec << "average_socclk_frequency=" << smu.average_socclk_frequency
<< "\n";
std::cout << std::dec << "average_uclk_frequency=" << smu.average_uclk_frequency << "\n";
std::cout << std::dec << "average_vclk0_frequency=" << smu.average_vclk0_frequency
<< "\n";
std::cout << std::dec << "average_dclk0_frequency=" << smu.average_dclk0_frequency
<< "\n";
std::cout << std::dec << "average_vclk1_frequency=" << smu.average_vclk1_frequency
<< "\n";
std::cout << std::dec << "average_dclk1_frequency=" << smu.average_dclk1_frequency
<< "\n";
std::cout << "\n";
std::cout << "CURRENT CLOCKS (MHz):\n";
std::cout << std::dec << "current_gfxclk=" << smu.current_gfxclk << "\n";
std::cout << std::dec << "current_gfxclks= [";
std::copy(std::begin(smu.current_gfxclks),
std::end(smu.current_gfxclks),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_socclk=" << smu.current_socclk << "\n";
std::cout << std::dec << "current_socclks= [";
std::copy(std::begin(smu.current_socclks),
std::end(smu.current_socclks),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_uclk=" << smu.current_uclk << "\n";
std::cout << std::dec << "current_vclk0=" << smu.current_vclk0 << "\n";
std::cout << std::dec << "current_vclk0s= [";
std::copy(std::begin(smu.current_vclk0s),
std::end(smu.current_vclk0s),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_dclk0=" << smu.current_dclk0 << "\n";
std::cout << std::dec << "current_dclk0s= [";
std::copy(std::begin(smu.current_dclk0s),
std::end(smu.current_dclk0s),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_vclk1=" << smu.current_vclk1 << "\n";
std::cout << std::dec << "current_dclk1=" << smu.current_dclk1 << "\n";
std::cout << "\n";
std::cout << "THROTTLE STATUS:\n";
std::cout << std::dec << "throttle_status=" << smu.throttle_status << "\n";
std::cout << "\n";
std::cout << "FAN SPEED:\n";
std::cout << std::dec << "current_fan_speed=" << smu.current_fan_speed << "\n";
std::cout << "\n";
std::cout << "LINK WIDTH (number of lanes) /SPEED (0.1 GT/s):\n";
std::cout << "pcie_link_width=" << smu.pcie_link_width << "\n";
std::cout << "pcie_link_speed=" << smu.pcie_link_speed << "\n";
std::cout << "xgmi_link_width=" << smu.xgmi_link_width << "\n";
std::cout << "xgmi_link_speed=" << smu.xgmi_link_speed << "\n";
std::cout << "\n";
std::cout << "Utilization Accumulated(%):\n";
std::cout << "gfx_activity_acc=" << std::dec << smu.gfx_activity_acc << "\n";
std::cout << "mem_activity_acc=" << std::dec << smu.mem_activity_acc << "\n";
std::cout << "\n";
std::cout << "XGMI ACCUMULATED DATA TRANSFER SIZE (KB):\n";
std::cout << std::dec << "xgmi_read_data_acc= [";
std::copy(std::begin(smu.xgmi_read_data_acc),
std::end(smu.xgmi_read_data_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "xgmi_write_data_acc= [";
std::copy(std::begin(smu.xgmi_write_data_acc),
std::end(smu.xgmi_write_data_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "xgmi_link_status= [";
std::copy(std::begin(smu.xgmi_link_status),
std::end(smu.xgmi_link_status),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
// Voltage (mV)
std::cout << "voltage_soc = " << std::dec << smu.voltage_soc << "\n";
std::cout << "voltage_gfx = " << std::dec << smu.voltage_gfx << "\n";
std::cout << "voltage_mem = " << std::dec << smu.voltage_mem << "\n";
std::cout << "indep_throttle_status = " << std::dec << smu.indep_throttle_status << "\n";
// Clock Lock Status. Each bit corresponds to clock instance
std::cout << "gfxclk_lock_status (in hex) = " << std::hex
<< smu.gfxclk_lock_status << std::dec <<"\n";
// Bandwidth (GB/sec)
std::cout << "pcie_bandwidth_acc=" << std::dec << smu.pcie_bandwidth_acc << "\n";
std::cout << "pcie_bandwidth_inst=" << std::dec << smu.pcie_bandwidth_inst << "\n";
// VRAM max bandwidth at max memory clock (GB/sec)
std::cout << "vram_max_bandwidth=" << std::dec << smu.vram_max_bandwidth << "\n";
// Counts
std::cout << "pcie_l0_to_recov_count_acc= " << std::dec << smu.pcie_l0_to_recov_count_acc
<< "\n";
std::cout << "pcie_replay_count_acc= " << std::dec << smu.pcie_replay_count_acc << "\n";
std::cout << "pcie_replay_rover_count_acc= " << std::dec
<< smu.pcie_replay_rover_count_acc << "\n";
std::cout << "pcie_nak_sent_count_acc= " << std::dec << smu.pcie_nak_sent_count_acc
<< "\n";
std::cout << "pcie_nak_rcvd_count_acc= " << std::dec << smu.pcie_nak_rcvd_count_acc
<< "\n";
// Accumulation cycle counter
// Accumulated throttler residencies
std::cout << "\n";
std::cout << "RESIDENCY ACCUMULATION / COUNTER:\n";
std::cout << "accumulation_counter = " << std::dec << smu.accumulation_counter << "\n";
std::cout << "prochot_residency_acc = " << std::dec << smu.prochot_residency_acc << "\n";
std::cout << "ppt_residency_acc = " << std::dec << smu.ppt_residency_acc << "\n";
std::cout << "socket_thm_residency_acc = " << std::dec << smu.socket_thm_residency_acc
<< "\n";
std::cout << "vr_thm_residency_acc = " << std::dec << smu.vr_thm_residency_acc
<< "\n";
std::cout << "hbm_thm_residency_acc = " << std::dec << smu.hbm_thm_residency_acc << "\n";
// Number of current partitions
std::cout << "num_partition = " << std::dec << smu.num_partition << "\n";
// PCIE other end recovery counter
std::cout << "pcie_lc_perf_other_end_recovery = "
<< std::dec << smu.pcie_lc_perf_other_end_recovery << "\n";
std::cout << std::dec << "xcp_stats.gfx_busy_inst = \n";
auto xcp = 0;
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_inst),
std::end(row.gfx_busy_inst),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.jpeg_busy = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.jpeg_busy),
std::end(row.jpeg_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.vcn_busy = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.vcn_busy),
std::end(row.vcn_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_busy_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_acc),
std::end(row.gfx_busy_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_below_host_limit_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_below_host_limit_acc),
std::end(row.gfx_below_host_limit_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
// new for gpu metrics v1.8
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_below_host_limit_ppt_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_below_host_limit_ppt_acc),
std::end(row.gfx_below_host_limit_ppt_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_below_host_limit_thm_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_below_host_limit_thm_acc),
std::end(row.gfx_below_host_limit_thm_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_low_utilization_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_low_utilization_acc),
std::end(row.gfx_low_utilization_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_below_host_limit_total_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_below_host_limit_total_acc),
std::end(row.gfx_below_host_limit_total_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
std::cout << "\n\n";
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
constexpr uint16_t kMAX_ITER_TEST = 10;
amdsmi_gpu_metrics_t gpu_xcp_metrics_check = {};
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_xcp_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_xcp_metrics_check.firmware_timestamp << "\n";
}
std::cout << "\n";
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_partition_metrics_info(processor_handles_[i], &gpu_xcp_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_xcp_metrics_check.system_clock_counter << "\n";
}
std::cout << "\n";
std::cout << " ** Note: Values at their type's maximum "
<< "(UINTX MAX) are unsupported for the version in question ** " << "\n\n";
}
}
// Verify api support checking functionality is working
err = amdsmi_get_gpu_partition_metrics_info(processor_handles_[i], nullptr);
if (err != AMDSMI_STATUS_INVAL) {
DISPLAY_AMDSMI_ERR(err);
}
amdsmi_status_code_to_string(err, &status_string);
std::cout << "\t\t** amdsmi_get_gpu_partition_metrics_info(nullptr check): " << status_string << "\n";
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
}
}
@@ -0,0 +1,51 @@
/*
* Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#ifndef TESTS_AMD_SMI_TEST_FUNCTIONAL_GPU_PARTITION_METRICS_READ_H_
#define TESTS_AMD_SMI_TEST_FUNCTIONAL_GPU_PARTITION_METRICS_READ_H_
#include "../test_base.h"
class TestGpuPartitionMetricsRead : public TestBase {
public:
TestGpuPartitionMetricsRead();
// @Brief: Destructor for test case of TestGpuPartitionMetricsRead
virtual ~TestGpuPartitionMetricsRead();
// @Brief: Setup the environment for measurement
virtual void SetUp();
// @Brief: Core measurement execution
virtual void Run();
// @Brief: Clean up and retrieve the resource
virtual void Close();
// @Brief: Display results
virtual void DisplayResults() const;
// @Brief: Display information about what this test does
virtual void DisplayTestInfo(void);
};
#endif // TESTS_AMD_SMI_TEST_FUNCTIONAL_GPU_PARTITION_METRICS_READ_H_
@@ -37,6 +37,7 @@
#include "functional/process_info_read.h"
#include "functional/gpu_busy_read.h"
#include "functional/gpu_metrics_read.h"
#include "functional/gpu_partition_metrics_read.h"
#include "functional/err_cnt_read.h"
#include "functional/power_read.h"
#include "functional/power_read_write.h"
@@ -224,6 +225,10 @@ TEST(amdsmitstReadOnly, TestGpuMetricsRead) {
TestGpuMetricsRead tst;
RunGenericTest(&tst);
}
TEST(amdsmitstReadOnly, TestGpuPartitionMetricsRead) {
TestGpuPartitionMetricsRead tst;
RunGenericTest(&tst);
}
TEST(amdsmitstReadOnly, TestMetricsCounterRead) {
TestMetricsCounterRead tst;
RunGenericTest(&tst);
@@ -282,7 +282,23 @@ void TestBase::PrintDeviceHeader(amdsmi_processor_handle dv_ind) {
}
}
std::cout << std::setbase(10);
amdsmi_kfd_info_t kfd_info;
err = amdsmi_get_gpu_kfd_info(dv_ind, &kfd_info);
if (err == AMDSMI_STATUS_NOT_SUPPORTED) {
IF_VERB(STANDARD) {
std::cout << "\t**KFD info: " << smi_amdgpu_get_status_string(err, false) << std::endl;
}
ASSERT_EQ(err, AMDSMI_STATUS_NOT_SUPPORTED);
} else {
CHK_ERR_ASRT(err)
IF_VERB(STANDARD) {
std::cout << "\t**KFD info: " << std::endl;
std::cout << "\t\t**GPU ID: " << std::dec << kfd_info.kfd_id << std::endl;
std::cout << "\t\t**Node ID: " << std::dec << kfd_info.node_id << std::endl;
std::cout << "\t\t**Partition ID: "
<< std::dec << kfd_info.current_partition_id << std::endl;
}
}
}
void TestBase::Run(void) {
std::string label;
@@ -1581,8 +1581,6 @@ class TestAmdSmiPython(unittest.TestCase):
def test_get_gpu_metrics_info(self):
self._print_func_name('')
if self.TODO_SKIP_FAIL:
self.skipTest("Skipping test_get_gpu_metrics_info as it fails (MI350X, AMDSMI_STATUS_UNEXPECTED_DATA).")
for i, gpu in enumerate(self.processors):
msg = f'gpu({i}):'
try:
@@ -1595,6 +1593,19 @@ class TestAmdSmiPython(unittest.TestCase):
raise self.raise_exception
return
def test_get_gpu_partition_metrics_info(self):
self._print_func_name('')
for i, gpu in enumerate(self.processors):
try:
msg = f'gpu({i}): '
ret = amdsmi.amdsmi_get_gpu_partition_metrics_info(gpu)
self._print(msg, ret)
except amdsmi.AmdSmiLibraryException as e:
if self._check_ret(msg, e, self.PASS):
self.raise_exception = e
if self.raise_exception:
raise self.raise_exception
def test_get_gpu_od_volt_curve_regions(self):
self._print_func_name('')
num_region = 10
@@ -3110,6 +3121,8 @@ class TestAmdSmiPython(unittest.TestCase):
def test_set_gpu_perf_level(self):
self._print_func_name('')
if self.TODO_SKIP_NOT_COMPLETE:
self.skipTest("Skipping test_set_gpu_perf_level as it is not complete.")
dev_perf_level_current = self.dev_perf_levels[0][1]
for i, gpu in enumerate(self.processors):
msg = f'gpu({i}):'