Fix PPT - reset calls, unit format, and get_power_cap()

Changes:
  - Simplified reset calls
  - Updated static limit N/A values to all possible data
  (helps csv format be consistent)
  - Unit format was broken on static
  - get_power_cap() had min/max values swapped, and the return
    was missing two fields
  - Updated changelog to reflect all changes

Change-Id: I23713471b984f52085372486c6e6ff852e2f42f8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
This commit is contained in:
Charis Poag
2025-11-13 15:18:36 -06:00
committed by Arif, Maisam
parent 46e5f157f0
commit 00a893d299
4 changed files with 105 additions and 88 deletions
+81 -25
View File
@@ -58,38 +58,94 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
- **Added support for PPT1 power limit information**.
- Support has been added for querying and setting the PPT (Package Power Tracking) limits
- There are two PPT limits, PPT0 has lower limit and tracks a filtered version of the input power and PPT1 has higher limit but tracks the raw input power. This is to catch spikes in the raw data.
- New API added:
- amdsmi_get_supported_power_cap(): Returns which power cap types are supported on the device (PPT0, PPT1). This will allow users to know which power cap types they can get/set.
- Original APIs remain the same but now can get/set both PPT0 and PPT1 limits (on supported hardware):
- amdsmi_get_power_cap_info()
- amdsmi_set_power_cap()
- See the Changed section for changes made to the `set` and `static` commands regarding support for PPT1.
### Changed
- **`amd-smi set --power-cap` now requires sepcification of the power cap type**.
- command now takes the form: `amd-smi set --power-cap <power-cap-type> <new-cap>`
- acceptable power cap types are "ppt0" and "ppt1"
- **`amd-smi set --power-cap` now requires specification of the power cap type**.
- Command now takes the form: `amd-smi set --power-cap <power-cap-type> <new-cap>`. Acceptable power cap types are "ppt0" and "ppt1".
Ex.
```console
$ sudo amd-smi set --power-cap ppt1 1150
GPU: 0
POWERCAP: Successfully set ppt1 power cap to 1150W
...
```
```console
$ sudo amd-smi set --power-cap ppt1 1150
GPU: 0
POWERCAP: Successfully set PPT1 power cap to 1150W
...
```
- **`amd-smi reset --power-cap` will attempt to reset both power caps**.
- When using the reset command, both PPT0 and PPT1 power caps will be reset to their default values. If a device only has PPT0, then only PPT0 will be reset.
Ex.
```console
$ sudo amd-smi reset --power-cap ppt1 1150
GPU: 0
POWERCAP:
PPT0: Successfully reset power cap to 203W
PPT1: [AMDSMI_STATUS_NOT_SUPPORTED] Unable to reset to default power cap
...
```
- **`amd-smi static --limit` now has a PPT1 section when PPT1 is available**.
```console
$ amd-smi static --limit
GPU: 0
LIMIT:
PPT0:
MAX_POWER_LIMIT: 1000
MIN_POWER_LIMIT: 0
SOCKET_POWER_LIMIT: 1000
PPT1:
MAX_POWER_LIMIT: 1300
MIN_POWER_LIMIT: 1100
SOCKET_POWER_LIMIT: 1250
SLOWDOWN_EDGE_TEMPERATURE: N/A
...
```
- The static --limit command has been updated to include PPT1 power limit information when available on the device.
```console
$ amd-smi static --limit
GPU: 0
LIMIT:
PPT0:
MAX_POWER_LIMIT: 1000
MIN_POWER_LIMIT: 0
SOCKET_POWER_LIMIT: 1000
PPT1:
MAX_POWER_LIMIT: 1300
MIN_POWER_LIMIT: 1100
SOCKET_POWER_LIMIT: 1250
SLOWDOWN_EDGE_TEMPERATURE: N/A
...
```
- JSON and CSV formats are updated to reflect this change as well.
Ex.
```console
$ amd-smi static --limit --json
{
"gpu_data": [
{
"gpu": 0,
"limit": {
"ppt0": {
"max_power_limit": {
"value": 203,
"unit": "W"
},
"min_power_limit": {
"value": 0,
"unit": "W"
},
"socket_power_limit": {
"value": 100,
"unit": "W"
}
},
"ppt1": {
"max_power_limit": "N/A",
"min_power_limit": "N/A",
"socket_power_limit": "N/A"
},
...
}
},
...
```
```console
$ amd-smi static --limit --csv
gpu,ppt0_max_power_limit,ppt0_min_power_limit,ppt0_socket_power_limit,ppt1_max_power_limit,ppt1_min_power_limit,ppt1_socket_power_limit,slowdown_edge_temperature,slowdown_hotspot_temperature,slowdown_vram_temperature,shutdown_edge_temperature,shutdown_hotspot_temperature,shutdown_vram_temperature
0,203,0,100,N/A,N/A,N/A,100,110,100,105,115,105
1,213,0,100,N/A,N/A,N/A,109,110,100,114,115,105
```
### Removed
+20 -59
View File
@@ -611,10 +611,13 @@ class AMDSMICommands():
for power_type in amdsmi_interface.AmdSmiPowerCapType:
# Strip 'AMDSMI_POWER_CAP_TYPE_' prefix and convert to lowercase
key = power_type.name.replace('AMDSMI_POWER_CAP_TYPE_', '').lower()
power_limit_types[key] = "N/A"
power_limit_types[key] = {
"max_power_limit" : "N/A",
"min_power_limit" : "N/A",
"socket_power_limit" : "N/A"
}
try:
power_limit_error = False
power_cap_types = amdsmi_interface.amdsmi_get_supported_power_cap(args.gpu)
for sensor in power_cap_types['sensor_inds']:
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu, sensor)
@@ -625,9 +628,9 @@ class AMDSMICommands():
socket_power_limit = power_cap_info['power_cap']
socket_power_limit = self.helpers.convert_SI_unit(socket_power_limit, AMDSMIHelpers.SI_Unit.MICRO)
ppt = {
"max_power_limit" : max_power_limit,
"min_power_limit" : min_power_limit,
"socket_power_limit" : socket_power_limit
"max_power_limit" : self.helpers.unit_format(self.logger, max_power_limit, 'W'),
"min_power_limit" : self.helpers.unit_format(self.logger, min_power_limit, 'W'),
"socket_power_limit" : self.helpers.unit_format(self.logger, socket_power_limit, 'W')
}
sensor_name = power_cap_types['sensor_types'][sensor]
@@ -635,7 +638,6 @@ class AMDSMICommands():
sensor_key = sensor_name.name.replace('AMDSMI_POWER_CAP_TYPE_', '').lower()
power_limit_types[sensor_key] = ppt
except amdsmi_exception.AmdSmiLibraryException as e:
power_limit_error = True
logging.debug("Failed to get power cap info for gpu %s | %s", gpu_id, e.get_error_info())
# Edge temperature limits
@@ -709,16 +711,6 @@ class AMDSMICommands():
power_unit = 'W'
temp_unit_human_readable = '\N{DEGREE SIGN}C'
temp_unit_json = 'C'
if not power_limit_error:
max_power_limit = self.helpers.unit_format(self.logger,
max_power_limit,
power_unit)
min_power_limit = self.helpers.unit_format(self.logger,
min_power_limit,
power_unit)
socket_power_limit = self.helpers.unit_format(self.logger,
socket_power_limit,
power_unit)
if self.logger.is_human_readable_format():
if not slowdown_temp_edge_limit_error:
@@ -5488,65 +5480,34 @@ class AMDSMICommands():
self.logger.clear_multiple_devices_output()
return
if args.power_cap:
final_output = {"ppt0": "[AMDSMI_STATUS_NOT_SUPPORTED] Unable to reset to default power cap", "ppt1": "[AMDSMI_STATUS_NOT_SUPPORTED] Unable to reset to default power cap"}
power_limit_types = {}
for power_type in amdsmi_interface.AmdSmiPowerCapType:
# Strip 'AMDSMI_POWER_CAP_TYPE_' prefix and convert to lowercase
key = power_type.name.replace('AMDSMI_POWER_CAP_TYPE_', '').lower()
power_limit_types[key] = "N/A"
current_sensor_num = 0
try:
power_cap_types = amdsmi_interface.amdsmi_get_supported_power_cap(args.gpu)
for sensor in power_cap_types['sensor_inds']:
current_sensor_num = sensor
power_cap_info = amdsmi_interface.amdsmi_get_power_cap_info(args.gpu, sensor)
logging.debug(f"Power cap info for gpu {gpu_id} ppt{sensor} | {power_cap_info}")
default_power_cap_in_w = power_cap_info["default_power_cap"]
default_power_cap_in_w = self.helpers.convert_SI_unit(default_power_cap_in_w, AMDSMIHelpers.SI_Unit.MICRO)
current_power_cap_in_w = power_cap_info["power_cap"]
current_power_cap_in_w = self.helpers.convert_SI_unit(current_power_cap_in_w, AMDSMIHelpers.SI_Unit.MICRO)
default_power_cap_in_mw = power_cap_info["default_power_cap"]
default_power_cap_in_w = self.helpers.convert_SI_unit(default_power_cap_in_mw, AMDSMIHelpers.SI_Unit.MICRO)
current_power_cap_in_mw = power_cap_info["power_cap"]
current_power_cap_in_w = self.helpers.convert_SI_unit(current_power_cap_in_mw, AMDSMIHelpers.SI_Unit.MICRO)
sensor_name = power_cap_types['sensor_types'][sensor]
# Strip 'AMDSMI_POWER_CAP_TYPE_' prefix and convert to lowercase
sensor_key = sensor_name.name.replace('AMDSMI_POWER_CAP_TYPE_', '').lower()
power_limit_types[sensor_key] = (default_power_cap_in_w, current_power_cap_in_w)
amdsmi_interface.amdsmi_set_power_cap(args.gpu, sensor, default_power_cap_in_mw)
final_output[f"ppt{current_sensor_num}"] = f"Successfully reset power cap to {default_power_cap_in_w}W"
except amdsmi_exception.AmdSmiLibraryException as e:
self.logger.store_output(args.gpu, 'powercap', f"[{e.get_error_info(detailed=False)}] Unable to reset power cap to default")
self.logger.print_output()
self.logger.clear_multiple_devices_output()
return
# TODO Make agnostic to number of power cap types
final_output = {"ppt0": "", "ppt1": ""}
if power_limit_types['ppt0'] == "N/A":
final_output['ppt0'] = f"PPT0 Power cap information is not available"
elif power_limit_types['ppt0'][1] == power_limit_types['ppt0'][0]:
final_output['ppt0'] = f"PPT0 Power cap is already set to {power_limit_types['ppt0'][0]}W"
else:
try:
default_ppt0_power_cap_in_uw = self.helpers.convert_SI_unit(power_limit_types['ppt0'][0],
AMDSMIHelpers.SI_Unit.BASE,
AMDSMIHelpers.SI_Unit.MICRO)
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 0, default_ppt0_power_cap_in_uw)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to reset PPT0 power cap to {power_limit_types['ppt0'][0]} on GPU {gpu_id}") from e
final_output['ppt0'] = f"Successfully reset PPT0 power cap to {power_limit_types['ppt0'][0]}W"
if power_limit_types['ppt1'] == "N/A":
final_output['ppt1'] = f"PPT1 Power cap information is not available"
elif power_limit_types['ppt1'][1] == power_limit_types['ppt1'][0]:
final_output['ppt1'] = f"PPT1 Power cap is already set to {power_limit_types['ppt1'][0]}W"
else:
try:
default_ppt1_power_cap_in_uw = self.helpers.convert_SI_unit(power_limit_types['ppt1'][0],
AMDSMIHelpers.SI_Unit.BASE,
AMDSMIHelpers.SI_Unit.MICRO)
amdsmi_interface.amdsmi_set_power_cap(args.gpu, 1, default_ppt1_power_cap_in_uw)
except amdsmi_exception.AmdSmiLibraryException as e:
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
raise ValueError(f"Unable to reset PPT1 power cap to {power_limit_types['ppt1'][0]} on GPU {gpu_id}") from e
final_output['ppt1'] = f"Successfully reset PPT1 power cap to {power_limit_types['ppt1'][0]}W"
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
raise PermissionError('Command requires elevation') from e
final_output[f"ppt{current_sensor_num}"] = f"[{e.get_error_info(detailed=False)}] Unable to reset cap to default power cap"
self.logger.store_output(args.gpu, 'powercap', final_output)
self.logger.print_output()
self.logger.clear_multiple_devices_output()
+3 -3
View File
@@ -846,12 +846,12 @@ class AMDSMIHelpers():
if power_limit_types[ppt_key]['power_cap_max'] == 0:
power_limit_types[ppt_key]['power_cap_max'] = "N/A"
ppt0_power_cap_max = self.format_power_cap(power_limit_types['ppt0']['power_cap_min'])
ppt0_power_cap_min = self.format_power_cap(power_limit_types['ppt0']['power_cap_max'])
ppt0_power_cap_max = self.format_power_cap(power_limit_types['ppt0']['power_cap_max'])
ppt0_power_cap_min = self.format_power_cap(power_limit_types['ppt0']['power_cap_min'])
ppt1_power_cap_max = self.format_power_cap(power_limit_types['ppt1']['power_cap_max'])
ppt1_power_cap_min = self.format_power_cap(power_limit_types['ppt1']['power_cap_min'])
return (ppt0_power_cap_min, ppt0_power_cap_min, ppt1_power_cap_max, ppt1_power_cap_min)
return (ppt0_power_cap_min, ppt0_power_cap_max, ppt1_power_cap_min, ppt1_power_cap_max)
def format_power_cap(self, value):
+1 -1
View File
@@ -1260,7 +1260,7 @@ class AMDSMIParser(argparse.ArgumentParser):
set_xgmi_plpd_help = f"Set the GPU XGMI per-link power down policy using policy id, an integer. Valid id's include:\n\t{xgmi_plpd_help_info}"
set_clock_freq_help = "Set one or more sclk (aka gfxclk), mclk, fclk, pcie, or socclk frequency levels.\n\tUse `amd-smi static --clock` to find acceptable levels.\n\tUse `amd-smi static --bus` to find acceptable pcie levels."
ppt0_power_cap_min, ppt0_power_cap_max, ppt1_power_cap_min, ppt1_power_cap_max = self.helpers.get_power_caps()
set_power_cap_help = f"Set either PPT0 or PPT1 power capacity limit:\n\tex: amd-smi set -o ppt0 1300\n\tPPT0 min cap: {ppt0_power_cap_min}, PPT0 max cap: {ppt0_power_cap_max}\n\tPPT1 min cap: {ppt1_power_cap_min}, PPT1 max cap: {ppt1_power_cap_max}"
set_power_cap_help = f"Set either PPT0 or PPT1 power capacity limit:\n\tEx: `amd-smi set -o ppt0 1300`\n\tPPT0 min cap: {ppt0_power_cap_min}, PPT0 max cap: {ppt0_power_cap_max}\n\tPPT1 min cap: {ppt1_power_cap_min}, PPT1 max cap: {ppt1_power_cap_max}"
set_clk_limit_help = "Sets the sclk (aka gfxclk) or mclk minimum and maximum frequencies. \n\tex: amd-smi set -L (sclk | mclk) (min | max) value"
set_process_isolation_help = "Enable or disable the GPU process isolation on a per partition basis:\n 0 for disable and 1 for enable.\n"