[SWDEV-511234] Added amdsmi_get_gpu_cper_entries & CLI implementation
Added amdsmi_get_gpu_cper_entries() to the Python and C APIs
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Co-authored-by: Saeed, Oosman <Oosman.Saeed@amd.com>
Co-authored-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
[ROCm/amdsmi commit: d81871ef16]
@@ -55,50 +55,96 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
|
||||
|
||||
### Added
|
||||
|
||||
- N/A
|
||||
- **Added dumping CPER entries from RAS tool `amdsmi_get_gpu_cper_entries()` to Python & C APIs.**
|
||||
- CPER entries consist of `amdsmi_cper_hdr_t`
|
||||
|
||||
```c
|
||||
typedef struct {
|
||||
char signature[4]; /* "CPER" */
|
||||
uint16_t revision;
|
||||
uint32_t signature_end; /* 0xFFFFFFFF */
|
||||
uint16_t sec_cnt;
|
||||
amdsmi_cper_sev_t error_severity;
|
||||
//valid_bits_t valid_bits;
|
||||
//uint32_t valid_mask;
|
||||
amdsmi_cper_valid_bits_t cper_valid_bits;
|
||||
uint32_t record_length; /* Total size of CPER Entry */
|
||||
amdsmi_cper_timestamp_t timestamp;
|
||||
char platform_id[16];
|
||||
amdsmi_cper_guid_t partition_id; /* Reserved */
|
||||
char creator_id[16];
|
||||
amdsmi_cper_guid_t notify_type; /* CMC, MCE, can use amdsmi_cper_notifiy_type_t to decode*/
|
||||
char record_id[8]; /* Unique CPER Entry ID */
|
||||
uint32_t flags; /* Reserved */
|
||||
uint64_t persistence_info; /* Reserved */
|
||||
uint8_t reserved[12]; /* Reserved */
|
||||
} amdsmi_cper_hdr_t;
|
||||
```
|
||||
|
||||
- Dumping CPER entries is also enabled in the CLI via `sudo amd-smi ras --cper`
|
||||
|
||||
```shell
|
||||
$ sudo amd-smi ras --cper
|
||||
Dumping CPER file header entries for GPU 0:
|
||||
"0": {
|
||||
"error_severity": "non_fatal_corrected",
|
||||
"notify_type": "CMC",
|
||||
"timestamp": "2025/04/08 18:23:44",
|
||||
"signature": "CPER",
|
||||
"revision": 256,
|
||||
"signature_end": "0xffffffff",
|
||||
"sec_cnt": 1,
|
||||
"record_length": 472,
|
||||
"platform_id": "0x1002:0x74A2",
|
||||
"creator_id": "amdgpu",
|
||||
"record_id": "5:1",
|
||||
"flags": 0,
|
||||
"persistence_info": 0
|
||||
}
|
||||
```
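To inspect the leading fields of a raw CPER record by hand, a minimal sketch follows. It is not the library's implementation. It assumes the first 24 bytes are packed and little-endian, in the field order of `amdsmi_cper_hdr_t` shown above, with 4-byte severity and valid-bits fields as in the UEFI CPER record header. The `parse_cper_prefix` helper and the synthetic sample buffer are hypothetical.

```python
import struct

# Assumed layout (little-endian, no padding): signature[4], revision (u16),
# signature_end (u32), sec_cnt (u16), error_severity (u32),
# valid bits (u32), record_length (u32) -> 24 bytes in total.
_PREFIX = struct.Struct("<4sHIHIII")


def parse_cper_prefix(buf: bytes) -> dict:
    """Decode the leading header fields and check both signature markers."""
    if len(buf) < _PREFIX.size:
        raise ValueError("buffer shorter than the assumed 24-byte prefix")
    (signature, revision, signature_end, sec_cnt,
     severity, valid_bits, record_length) = _PREFIX.unpack_from(buf, 0)
    if signature != b"CPER" or signature_end != 0xFFFFFFFF:
        raise ValueError("not a CPER record header")
    return {
        "signature": signature.decode("ascii"),
        "revision": revision,
        "sec_cnt": sec_cnt,
        "error_severity": severity,
        "valid_bits": valid_bits,
        "record_length": record_length,
    }


# Synthetic prefix built from the values in the CLI output above
# (severity 2 stands for non_fatal_corrected in this commit's CLI code).
sample = _PREFIX.pack(b"CPER", 256, 0xFFFFFFFF, 1, 2, 0, 472)
header = parse_cper_prefix(sample)
print(header["revision"], header["sec_cnt"], header["record_length"])
```

Checking both `signature` and `signature_end` before trusting `record_length` guards against reading a length field from a buffer that is not a CPER record.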
|
||||
### Changed
|
||||
|
||||
- **Changed `amd-smi partition --accelerator` & `amdsmi_get_gpu_accelerator_partition_profile_config()` to detect users running without root/sudo privileges**
|
||||
- Updated `amdsmi_get_gpu_accelerator_partition_profile_config()` to return `AMDSMI_STATUS_NO_PERM` immediately
|
||||
if users run without root/sudo permissions.
|
||||
- Updated `amd-smi partition --accelerator` to provide a warning for users without root/sudo permissions (see example below, ***output subject to change***).
|
||||
```shell
|
||||
$ amd-smi partition --accelerator
|
||||
|
||||
ACCELERATOR_PARTITION_PROFILES:
|
||||
```shell
|
||||
$ amd-smi partition --accelerator
|
||||
|
||||
***************************************************************************
|
||||
** WARNING: **
|
||||
** ACCELERATOR_PARTITION_PROFILES requires sudo/root permissions to run. **
|
||||
** Please run the command with sudo permissions to get accurate results. **
|
||||
***************************************************************************
|
||||
ACCELERATOR_PARTITION_PROFILES:
|
||||
|
||||
GPU_ID PROFILE_INDEX MEMORY_PARTITION_CAPS ACCELERATOR_TYPE PARTITION_ID NUM_PARTITIONS NUM_RESOURCES RESOURCE_INDEX RESOURCE_TYPE RESOURCE_INSTANCES RESOURCES_SHARED
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
***************************************************************************
|
||||
** WARNING: **
|
||||
** ACCELERATOR_PARTITION_PROFILES requires sudo/root permissions to run. **
|
||||
** Please run the command with sudo permissions to get accurate results. **
|
||||
***************************************************************************
|
||||
|
||||
ACCELERATOR_PARTITION_RESOURCES:
|
||||
RESOURCE_INDEX RESOURCE_TYPE RESOURCE_INSTANCES RESOURCES_SHARED
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
GPU_ID PROFILE_INDEX MEMORY_PARTITION_CAPS ACCELERATOR_TYPE PARTITION_ID NUM_PARTITIONS NUM_RESOURCES RESOURCE_INDEX RESOURCE_TYPE RESOURCE_INSTANCES RESOURCES_SHARED
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A 0 N/A N/A N/A N/A N/A N/A
|
||||
|
||||
ACCELERATOR_PARTITION_RESOURCES:
|
||||
RESOURCE_INDEX RESOURCE_TYPE RESOURCE_INSTANCES RESOURCES_SHARED
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
N/A N/A N/A N/A
|
||||
|
||||
|
||||
Legend:
|
||||
* = Current mode
|
||||
```
|
||||
Legend:
|
||||
* = Current mode
|
||||
```
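A script that wraps `amd-smi partition --accelerator` can run a similar privilege check itself before trusting the table. The sketch below is illustrative only: how the CLI actually detects missing privileges is not shown in this changelog, and `partition_warning` is a hypothetical helper that reuses the warning text from the example above.

```python
import os


def partition_warning(euid: int) -> str:
    """Return the warning text for non-root callers, or '' for root (UID 0)."""
    if euid == 0:
        return ""
    return (
        "WARNING: ACCELERATOR_PARTITION_PROFILES requires sudo/root permissions to run.\n"
        "Please run the command with sudo permissions to get accurate results."
    )


# os.geteuid() exists on Unix-like systems only; fall back to 0 elsewhere.
current_euid = os.geteuid() if hasattr(os, "geteuid") else 0
message = partition_warning(current_euid)
if message:
    print(message)
```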
|
||||
|
||||
- **Changed `amd-smi partition --current`, `amd-smi partition --accelerator`, and `amdsmi_get_gpu_accelerator_partition_profile()` to display partition ID for each individual partition**
|
||||
- Host setups continue to display the full array format; they do not display individual partitions the way Baremetal/Guest setups do.
|
||||
@@ -106,44 +152,46 @@ Legend:
|
||||
reflect each individual partition ID, now provided in `partition_id[0]` location (as seen in other amd-smi CLI commands).
|
||||
This change was needed for BM/Guest setups to stay consistent with related partition outputs (`amd-smi list` and `amd-smi static --partition`), which display individual logical partition devices. ***See examples below for reference.***
|
||||
|
||||
Previous output:
|
||||
```shell
|
||||
$ amd-smi partition --current
|
||||
Previous output:
|
||||
|
||||
CURRENT_PARTITION:
|
||||
GPU_ID MEMORY ACCELERATOR_TYPE ACCELERATOR_PROFILE_INDEX PARTITION_ID
|
||||
0 NPS1 CPX 3 0,1,2,3,4,5,6,7
|
||||
1 NPS1 CPX 3 N/A
|
||||
2 NPS1 CPX 3 N/A
|
||||
3 NPS1 CPX 3 N/A
|
||||
4 NPS1 CPX 3 N/A
|
||||
5 NPS1 CPX 3 N/A
|
||||
6 NPS1 CPX 3 N/A
|
||||
7 NPS1 CPX 3 N/A
|
||||
8 NPS1 CPX 3 0,1,2,3,4,5,6,7
|
||||
9 NPS1 CPX 3 N/A
|
||||
10 NPS1 CPX 3 N/A
|
||||
...
|
||||
```
|
||||
```shell
|
||||
$ amd-smi partition --current
|
||||
|
||||
New output:
|
||||
```shell
|
||||
amd-smi partition --current
|
||||
CURRENT_PARTITION:
|
||||
GPU_ID MEMORY ACCELERATOR_TYPE ACCELERATOR_PROFILE_INDEX PARTITION_ID
|
||||
0 NPS1 CPX 3 0
|
||||
1 NPS1 CPX 3 1
|
||||
2 NPS1 CPX 3 2
|
||||
3 NPS1 CPX 3 3
|
||||
4 NPS1 CPX 3 4
|
||||
5 NPS1 CPX 3 5
|
||||
6 NPS1 CPX 3 6
|
||||
7 NPS1 CPX 3 7
|
||||
8 NPS1 CPX 3 0
|
||||
9 NPS1 CPX 3 1
|
||||
10 NPS1 CPX 3 2
|
||||
...
|
||||
```
|
||||
CURRENT_PARTITION:
|
||||
GPU_ID MEMORY ACCELERATOR_TYPE ACCELERATOR_PROFILE_INDEX PARTITION_ID
|
||||
0 NPS1 CPX 3 0,1,2,3,4,5,6,7
|
||||
1 NPS1 CPX 3 N/A
|
||||
2 NPS1 CPX 3 N/A
|
||||
3 NPS1 CPX 3 N/A
|
||||
4 NPS1 CPX 3 N/A
|
||||
5 NPS1 CPX 3 N/A
|
||||
6 NPS1 CPX 3 N/A
|
||||
7 NPS1 CPX 3 N/A
|
||||
8 NPS1 CPX 3 0,1,2,3,4,5,6,7
|
||||
9 NPS1 CPX 3 N/A
|
||||
10 NPS1 CPX 3 N/A
|
||||
...
|
||||
```
|
||||
|
||||
New output:
|
||||
|
||||
```shell
|
||||
amd-smi partition --current
|
||||
CURRENT_PARTITION:
|
||||
GPU_ID MEMORY ACCELERATOR_TYPE ACCELERATOR_PROFILE_INDEX PARTITION_ID
|
||||
0 NPS1 CPX 3 0
|
||||
1 NPS1 CPX 3 1
|
||||
2 NPS1 CPX 3 2
|
||||
3 NPS1 CPX 3 3
|
||||
4 NPS1 CPX 3 4
|
||||
5 NPS1 CPX 3 5
|
||||
6 NPS1 CPX 3 6
|
||||
7 NPS1 CPX 3 7
|
||||
8 NPS1 CPX 3 0
|
||||
9 NPS1 CPX 3 1
|
||||
10 NPS1 CPX 3 2
|
||||
...
|
||||
```
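Scripts that parsed the previous `PARTITION_ID` column can translate it to the new per-partition form. The following is a rough sketch based only on the two example outputs above; `expand_partition_ids` is a hypothetical helper, not part of amd-smi.

```python
def expand_partition_ids(old_ids):
    """Map the old PARTITION_ID column to the new one-ID-per-row form."""
    new_ids = []
    group = []     # IDs listed on the first partition of the current GPU
    position = 0   # index of the current row within that group
    for value in old_ids:
        if value != "N/A":
            # First partition of a GPU: remember the full list, restart counting.
            group = value.split(",")
            position = 0
        elif position >= len(group):
            # No list seen yet, or more N/A rows than listed IDs: keep N/A.
            new_ids.append("N/A")
            continue
        new_ids.append(group[position])
        position += 1
    return new_ids


old_column = ["0,1,2,3,4,5,6,7"] + ["N/A"] * 7 + ["0,1,2,3,4,5,6,7", "N/A", "N/A"]
print(expand_partition_ids(old_column))
```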
|
||||
|
||||
### Removed
|
||||
|
||||
@@ -165,6 +213,7 @@ GPU_ID MEMORY ACCELERATOR_TYPE ACCELERATOR_PROFILE_INDEX PARTITION_ID
|
||||
|
||||
- N/A
|
||||
|
||||
|
||||
## amd_smi_lib for ROCm 6.4.0
|
||||
|
||||
### Added
|
||||
|
||||
@@ -96,7 +96,8 @@ if __name__ == "__main__":
|
||||
amd_smi_commands.monitor,
|
||||
amd_smi_commands.rocm_smi,
|
||||
amd_smi_commands.xgmi,
|
||||
amd_smi_commands.partition)
|
||||
amd_smi_commands.partition,
|
||||
amd_smi_commands.ras)
|
||||
try:
|
||||
try:
|
||||
argcomplete.autocomplete(amd_smi_parser)
|
||||
@@ -105,7 +106,7 @@ if __name__ == "__main__":
|
||||
|
||||
valid_commands = ['version', 'list', 'static', 'firmware', 'bad-pages',
|
||||
'metric', 'process', 'profile', 'event', 'topology', 'set',
|
||||
'reset', 'monitor', 'xgmi', 'partition', '--help', '-h']
|
||||
'reset', 'monitor', 'xgmi', 'partition', 'ras', '--help', '-h']
|
||||
|
||||
sys.argv = [arg.lower() if arg.startswith('--') or not arg.startswith('-')
|
||||
else arg for arg in sys.argv]
|
||||
@@ -117,11 +118,12 @@ if __name__ == "__main__":
|
||||
raise amdsmi_cli_exceptions.AmdSmiInvalidSubcommandException(sys.argv[1],amd_smi_commands.logger.destination)
|
||||
|
||||
# Handle command modifiers before subcommand execution
|
||||
if args.json:
|
||||
# human readable is the default output format
|
||||
if hasattr(args, 'json') and args.json:
|
||||
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.json.value
|
||||
if args.csv:
|
||||
if hasattr(args, 'csv') and args.csv:
|
||||
amd_smi_commands.logger.format = amd_smi_commands.logger.LoggerFormat.csv.value
|
||||
if args.file:
|
||||
if hasattr(args, 'file') and args.file:
|
||||
amd_smi_commands.logger.destination = args.file
|
||||
|
||||
# Remove previous log handlers
|
||||
|
||||
@@ -34,12 +34,12 @@ from amdsmi_helpers import AMDSMIHelpers
|
||||
from amdsmi_logger import AMDSMILogger
|
||||
from amdsmi import amdsmi_exception, amdsmi_interface
|
||||
|
||||
|
||||
class AMDSMICommands():
|
||||
"""This class contains all the commands corresponding to AMDSMIParser
|
||||
Each command function will interact with AMDSMILogger to handle
|
||||
displaying the output to the specified format and destination.
|
||||
"""
|
||||
|
||||
def __init__(self, format='human_readable', destination='stdout') -> None:
|
||||
self.helpers = AMDSMIHelpers()
|
||||
self.logger = AMDSMILogger(format=format, destination=destination)
|
||||
@@ -175,6 +175,7 @@ class AMDSMICommands():
|
||||
elif self.logger.is_json_format() or self.logger.is_csv_format():
|
||||
self.logger.print_output()
|
||||
|
||||
|
||||
def list(self, args, multiple_devices=False, gpu=None):
|
||||
"""List information for target gpu
|
||||
|
||||
@@ -6160,6 +6161,108 @@ class AMDSMICommands():
|
||||
with self.logger.destination.open('a', encoding="utf-8") as output_file:
|
||||
output_file.write(legend_output + '\n')
|
||||
|
||||
|
||||
def ras(self, args, multiple_devices=False, gpu=None, cper=None,
|
||||
severity=None, folder=None, file_limit=None, follow=None):
|
||||
"""
|
||||
Retrieve and process CPER (RAS) entries for a target GPU.
|
||||
|
||||
Expected command (all options only):
|
||||
amd-smi ras --cper --severity=nonfatal-uncorrected,fatal --folder <folder_name> --file_limit=1000 --follow
|
||||
|
||||
Since no timestamp is provided on the command line, the function starts from a default cursor of 0.
|
||||
The output file name is auto-generated using the timestamp from the CPER header data (converted from
|
||||
the header’s "YYYY/MM/DD HH:MM:SS" format), along with the GPU/platform ID and error severity.
|
||||
"""
|
||||
# GPU handle logic.
|
||||
if gpu:
|
||||
args.gpu = gpu
|
||||
if cper:
|
||||
args.cper = cper
|
||||
if severity:
|
||||
args.severity = severity
|
||||
if folder:
|
||||
args.folder = folder
|
||||
if file_limit:
|
||||
args.file_limit = file_limit
|
||||
if follow:
|
||||
args.follow = follow
|
||||
|
||||
if args.gpu is None:
|
||||
args.gpu = self.device_handles
|
||||
|
||||
self.helpers.check_required_groups()
|
||||
handled_multiple_gpus, device_handle = self.helpers.handle_gpus(args, self.logger, self.ras)
|
||||
if handled_multiple_gpus:
|
||||
return
|
||||
|
||||
args.gpu = device_handle
|
||||
|
||||
# Parse severity mask dynamically from the --severity option.
|
||||
severity_mask = 0
|
||||
# drop duplicates of args
|
||||
logging.debug(args)
|
||||
for sev in list(set(args.severity)):
|
||||
if sev == "all":
|
||||
# Set bits for NON_FATAL_UNCORRECTED (0), FATAL (1), and NON_FATAL_CORRECTED (2)
|
||||
severity_mask |= ((1 << 0) | (1 << 1) | (1 << 2))
|
||||
elif sev == "fatal":
|
||||
# Set bit corresponding to AMDSMI_CPER_SEV_FATAL (which is 1)
|
||||
severity_mask |= (1 << 1)
|
||||
elif sev in ("nonfatal", "nonfatal-uncorrected"):
|
||||
# Set bit corresponding to AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED (which is 0)
|
||||
severity_mask |= (1 << 0)
|
||||
elif sev in ("nonfatal-corrected", "corrected"):
|
||||
# Set bit corresponding to AMDSMI_CPER_SEV_NON_FATAL_CORRECTED (which is 2)
|
||||
severity_mask |= (1 << 2)
|
||||
|
||||
if args.cper:
|
||||
# Start from cursor 0 (no timestamp argument provided).
|
||||
cursor = 0
|
||||
buffer_size = 1048576
|
||||
file_limit = int(args.file_limit) if args.file_limit else 1000
|
||||
|
||||
# Print exit message only once and only when follow is set
|
||||
if self.logger.cper_exit_message() and args.follow:
|
||||
print('Press q and hit ENTER when you want to stop.')
|
||||
self.logger.set_cper_exit_message(False)
|
||||
|
||||
# Main loop: continuously retrieve CPER entries if --follow is set.
|
||||
gpu_id = self.helpers.get_gpu_id_from_device_handle(args.gpu)
|
||||
if args.folder:
|
||||
print(f'Dumping CPER file header entries for GPU {gpu_id} in folder {args.folder}\n')
|
||||
else:
|
||||
print(f'Dumping CPER file header entries for GPU {gpu_id}:\n')
|
||||
|
||||
self.stop = False
|
||||
while True:
|
||||
try:
|
||||
entries, new_cursor, cper_data = amdsmi_interface.amdsmi_get_gpu_cper_entries(
|
||||
args.gpu, severity_mask, buffer_size, cursor)
|
||||
logging.debug(f"cper_entries | entries: {entries}")
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
|
||||
raise PermissionError('Error opening CPER file. This command requires elevation') from e
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_FILE_NOT_FOUND:
|
||||
raise FileNotFoundError('Error opening CPER file. This command requires a CPER to be enabled.') from e
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_FILE_ERROR:
|
||||
raise FileExistsError('Error opening CPER file. Unable to read CPER File') from e
|
||||
else:
|
||||
logging.debug(f"Error retrieving CPER entries: {e}")
|
||||
break
|
||||
if entries:
|
||||
self.helpers.dump_entries(args.folder, entries, cper_data)
|
||||
if len(entries) == 0 or not args.follow:
|
||||
break
|
||||
cursor = new_cursor
|
||||
time.sleep(5)
|
||||
user_input = input()
|
||||
if user_input == 'q':
|
||||
print("Escape Sequence Detected; Exiting")
|
||||
self.stop = True
|
||||
break
|
||||
|
||||
|
||||
def _event_thread(self, commands, i):
|
||||
devices = commands.device_handles
|
||||
if len(devices) == 0:
|
||||
|
||||
@@ -19,18 +19,19 @@
|
||||
# IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
|
||||
# CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
|
||||
|
||||
import grp
|
||||
import json
|
||||
import logging
|
||||
import math
|
||||
import multiprocessing
|
||||
import os
|
||||
import grp
|
||||
import platform
|
||||
import re
|
||||
import sys
|
||||
import time
|
||||
import re
|
||||
import multiprocessing
|
||||
import json
|
||||
|
||||
from enum import Enum
|
||||
from pathlib import Path
|
||||
from typing import List, Set, Union
|
||||
|
||||
from amdsmi_init import *
|
||||
@@ -55,7 +56,11 @@ class AMDSMIHelpers():
|
||||
|
||||
self._is_linux = False
|
||||
self._is_windows = False
|
||||
|
||||
# Counts and Tracking variables
|
||||
self._count_of_sets_called = 0
|
||||
self._count_of_cper_files = 0
|
||||
|
||||
|
||||
# Check if the system is a virtual OS
|
||||
if self.operating_system.startswith("Linux"):
|
||||
@@ -95,6 +100,7 @@ class AMDSMIHelpers():
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
logging.debug("Unable to determine virtualization status: " + str(e.get_error_code()))
|
||||
|
||||
|
||||
def increment_set_count(self):
|
||||
self._count_of_sets_called += 1
|
||||
|
||||
@@ -103,6 +109,14 @@ class AMDSMIHelpers():
|
||||
return self._count_of_sets_called
|
||||
|
||||
|
||||
def increment_cper_count(self):
|
||||
self._count_of_cper_files += 1
|
||||
|
||||
|
||||
def get_cper_count(self):
|
||||
return self._count_of_cper_files
|
||||
|
||||
|
||||
def is_virtual_os(self):
|
||||
return self._is_virtual_os
|
||||
|
||||
@@ -116,6 +130,7 @@ class AMDSMIHelpers():
|
||||
# Returns True if system is baremetal, if system is hypervisor this should return False
|
||||
return self._is_baremetal
|
||||
|
||||
|
||||
def is_passthrough(self):
|
||||
return self._is_passthrough
|
||||
|
||||
@@ -197,7 +212,7 @@ class AMDSMIHelpers():
|
||||
"""
|
||||
cpu_choices = {}
|
||||
cpu_choices_str = ""
|
||||
#import pdb;pdb.set_trace()
|
||||
|
||||
try:
|
||||
cpu_handles = []
|
||||
# amdsmi_get_cpusocket_handles() returns the cpu socket handles stored for cpu_id
|
||||
@@ -230,6 +245,7 @@ class AMDSMIHelpers():
|
||||
|
||||
return (cpu_choices, cpu_choices_str)
|
||||
|
||||
|
||||
def get_core_choices(self):
|
||||
"""Return dictionary of possible Core choices and string of the output:
|
||||
Dictionary will be in format: coress[ID]: Device Handle)
|
||||
@@ -705,11 +721,13 @@ class AMDSMIHelpers():
|
||||
except:
|
||||
return False
|
||||
|
||||
|
||||
def get_perf_levels(self):
|
||||
perf_levels_str = [clock.name for clock in amdsmi_interface.AmdSmiDevPerfLevel]
|
||||
perf_levels_int = list(set(clock.value for clock in amdsmi_interface.AmdSmiDevPerfLevel))
|
||||
return perf_levels_str, perf_levels_int
|
||||
|
||||
|
||||
def get_accelerator_partition_profile_config(self):
|
||||
device_handles = amdsmi_interface.amdsmi_get_processor_handles()
|
||||
accelerator_partition_profiles = {'profile_indices':[], 'profile_types':[], 'memory_caps': []}
|
||||
@@ -726,6 +744,7 @@ class AMDSMIHelpers():
|
||||
break
|
||||
return accelerator_partition_profiles
|
||||
|
||||
|
||||
def get_accelerator_choices_types_indices(self):
|
||||
return_val = ("N/A", {'profile_indices':[], 'profile_types':[]})
|
||||
accelerator_partition_profiles = self.get_accelerator_partition_profile_config()
|
||||
@@ -735,6 +754,7 @@ class AMDSMIHelpers():
|
||||
return_val = (accelerator_choices, accelerator_partition_profiles)
|
||||
return return_val
|
||||
|
||||
|
||||
def get_memory_partition_types(self):
|
||||
memory_partitions_str = [partition.name for partition in amdsmi_interface.AmdSmiMemoryPartitionType]
|
||||
if 'UNKNOWN' in memory_partitions_str:
|
||||
@@ -854,6 +874,7 @@ class AMDSMIHelpers():
|
||||
else:
|
||||
sys.exit('Confirmation not given. Exiting without setting value')
|
||||
|
||||
|
||||
def confirm_changing_memory_partition_gpu_reload_warning(self, auto_respond=False):
|
||||
""" Print the warning for running outside of specification and prompt user to accept the terms.
|
||||
|
||||
@@ -879,6 +900,7 @@ class AMDSMIHelpers():
|
||||
print('Confirmation not given. Exiting without setting value')
|
||||
sys.exit(1)
|
||||
|
||||
|
||||
def is_valid_profile(self, profile):
|
||||
profile_presets = amdsmi_interface.amdsmi_wrapper.amdsmi_power_profile_preset_masks_t__enumvalues
|
||||
if profile in profile_presets:
|
||||
@@ -924,6 +946,7 @@ class AMDSMIHelpers():
|
||||
return f"{value} {unit}".rstrip()
|
||||
return f"{value}"
|
||||
|
||||
|
||||
class SI_Unit(float, Enum):
|
||||
GIGA = 1000000000 # 10^9
|
||||
MEGA = 1000000 # 10^6
|
||||
@@ -937,6 +960,7 @@ class AMDSMIHelpers():
|
||||
MICRO = 0.000001 # 10^-6
|
||||
NANO = 0.000000001 # 10^-9
|
||||
|
||||
|
||||
def convert_SI_unit(self, val: Union[int, float], unit_in: SI_Unit, unit_out = SI_Unit.BASE) -> Union[int, float]:
|
||||
"""This function will convert a value into another
|
||||
scientific (SI) unit. Defaults unit_out to SI_Unit.BASE
|
||||
@@ -956,6 +980,7 @@ class AMDSMIHelpers():
|
||||
else:
|
||||
raise TypeError("val must be an int or float")
|
||||
|
||||
|
||||
def get_pci_device_ids(self) -> Set[str]:
|
||||
pci_devices_path = "/sys/bus/pci/devices"
|
||||
pci_devices: set[str] = set()
|
||||
@@ -969,6 +994,7 @@ class AMDSMIHelpers():
|
||||
continue
|
||||
return pci_devices
|
||||
|
||||
|
||||
def progressbar(self, it, prefix="", size=60, out=sys.stdout, add_newline=False):
|
||||
count = len(it)
|
||||
if (add_newline):
|
||||
@@ -985,12 +1011,14 @@ class AMDSMIHelpers():
|
||||
show(i+1)
|
||||
print("\n\n", end='\r', flush=True, file=out)
|
||||
|
||||
|
||||
def showProgressbar(self, title="", timeInSeconds=13, add_newline=False):
|
||||
if title != "":
|
||||
title += " "
|
||||
for i in self.progressbar(range(timeInSeconds), title, 40, add_newline=add_newline):
|
||||
time.sleep(1)
|
||||
|
||||
|
||||
def check_required_groups(self):
|
||||
"""
|
||||
Check if the current user is a member of the required groups.
|
||||
@@ -1016,3 +1044,60 @@ class AMDSMIHelpers():
|
||||
) % ", ".join(sorted(missing_groups))
|
||||
print(msg)
|
||||
logging.warning(msg)
|
||||
|
||||
def hexdump(self, data, size, filepath):
|
||||
"""
|
||||
Converts binary data to a hex dump string, similar to the hexdump utility.
|
||||
"""
|
||||
def to_printable_ascii(byte):
|
||||
return chr(byte) if 32 <= byte <= 126 else "."
|
||||
|
||||
with open(filepath, 'w') as f:
|
||||
offset = 0
|
||||
while offset < size:
|
||||
chunk = data[offset:offset + 16]
|
||||
hex_values = " ".join(f"{byte:02x}" for byte in chunk)
|
||||
ascii_values = "".join(to_printable_ascii(byte) for byte in chunk)
|
||||
print(f"{offset:08x} {hex_values:<48} |{ascii_values}|", file=f)
|
||||
offset += 16
|
||||
|
||||
def dump_entries(self, folder, entries, cper_data):
|
||||
if folder:
|
||||
folder = Path(folder)
|
||||
folder.mkdir(parents=True, exist_ok=True) # Ensure folder exists
|
||||
|
||||
# Loop through all entries in the dictionary.
|
||||
for entry_index, entry in enumerate(entries.values()):
|
||||
# Assume 'entry' is a dictionary with keys: "error_severity" and "notify_type".
|
||||
error_severity = entry.get("error_severity", "Unknown")
|
||||
notify_type = entry.get("notify_type", "Unknown")
|
||||
|
||||
if error_severity == "non_fatal_uncorrected":
|
||||
prefix = "uncorrected"
|
||||
elif error_severity == "non_fatal_corrected":
|
||||
prefix = "corrected"
|
||||
elif error_severity == "fatal":
|
||||
prefix = "fatal"
|
||||
if notify_type == "BOOT":
|
||||
prefix = "boot"
|
||||
|
||||
# Construct a unique file name using the key to avoid overwriting
|
||||
entry_file = f"{prefix}_{self.get_cper_count()}.json"
|
||||
output_path = folder / entry_file
|
||||
|
||||
cper_data_file = f"{prefix}_{self.get_cper_count()}.cper"
|
||||
cper_data_file_path = folder / cper_data_file
|
||||
self.hexdump(cper_data[entry_index]["bytes"], cper_data[entry_index]["size"], cper_data_file_path)
|
||||
|
||||
try:
|
||||
with output_path.open("w") as f:
|
||||
logging.debug(f"Writing entry {self.get_cper_count()}: {entry} to {output_path}")
|
||||
# Dump the single entry as JSON, handling bytes via the lambda.
|
||||
f.write(json.dumps(entry, indent=2,
|
||||
default=lambda o: o.decode('utf-8') if isinstance(o, bytes) else o))
|
||||
except Exception as e:
|
||||
logging.error(f"Failed to write entry {self.get_cper_count()} to {output_path}: {e}")
|
||||
else:
|
||||
print(json.dumps(entries, indent=2,
|
||||
default=lambda o: o.decode('utf-8') if isinstance(o, bytes) else o))
|
||||
self.increment_cper_count()
|
||||
@@ -42,6 +42,7 @@ class AMDSMILogger():
|
||||
self.secondary_table_header = ""
|
||||
self.warning_message = ""
|
||||
self.helpers = AMDSMIHelpers()
|
||||
self._cper_exit_message = True
|
||||
|
||||
|
||||
class LoggerFormat(Enum):
|
||||
@@ -78,6 +79,26 @@ class AMDSMILogger():
|
||||
self.multiple_device_output.clear()
|
||||
|
||||
|
||||
def cper_exit_message(self):
|
||||
""" Report whether the cper exit message still needs to be printed
params:
None
|
||||
return:
|
||||
cper_exit_message (bool) - True if cper exit message is set
|
||||
"""
|
||||
return self._cper_exit_message
|
||||
|
||||
|
||||
def set_cper_exit_message(self, flag:bool):
|
||||
""" Set the cper exit message
|
||||
params:
|
||||
flag (bool) - True if cper exit message is set
|
||||
return:
|
||||
Nothing
|
||||
"""
|
||||
self._cper_exit_message = flag
|
||||
|
||||
|
||||
def _capitalize_keys(self, input_dict):
|
||||
output_dict = {}
|
||||
for key in input_dict.keys():
|
||||
|
||||
@@ -69,7 +69,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
"""
|
||||
def __init__(self, version, list, static, firmware, bad_pages, metric,
|
||||
process, profile, event, topology, set_value, reset, monitor,
|
||||
rocmsmi, xgmi, partition):
|
||||
rocmsmi, xgmi, partition, ras):
|
||||
|
||||
# Helper variables
|
||||
self.helpers = AMDSMIHelpers()
|
||||
@@ -115,7 +115,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
# Store possible subcommands & aliases for later errors
|
||||
self.possible_commands = ['version', 'list', 'static', 'firmware', 'ucode', 'bad-pages',
|
||||
'metric', 'process', 'profile', 'event', 'topology', 'set',
|
||||
'reset', 'monitor', 'dmon', 'xgmi', 'partition']
|
||||
'reset', 'monitor', 'dmon', 'xgmi', 'partition', 'ras']
|
||||
|
||||
# Add all subparsers
|
||||
self._add_version_parser(self.subparsers, version)
|
||||
@@ -134,6 +134,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
self._add_rocm_smi_parser(self.subparsers, rocmsmi)
|
||||
self._add_xgmi_parser(self.subparsers, xgmi)
|
||||
self._add_partition_parser(self.subparsers, partition)
|
||||
self._add_ras_parser(self.subparsers, ras)
|
||||
|
||||
|
||||
def _not_negative_int(self, int_value, sub_arg=None):
|
||||
@@ -241,6 +242,24 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
return AMDSMIFreqArgs
|
||||
|
||||
|
||||
def _check_folder_path(self):
|
||||
""" Argument action validator:
|
||||
Returns a path to folder from the folder path provided.
|
||||
If the path doesn't exist create it.
|
||||
"""
|
||||
class CheckOutputFilePath(argparse.Action):
|
||||
outputformat = self.helpers.get_output_format()
|
||||
# Checks the values
|
||||
def __call__(self, parser, args, values, option_string=None):
|
||||
path = Path(values)
|
||||
path.mkdir(parents=True, exist_ok=True)
|
||||
if not path.exists():
|
||||
raise amdsmi_cli_exceptions.AmdSmiInvalidFilePathException(path, CheckOutputFilePath.outputformat)
|
||||
elif path.is_dir():
|
||||
setattr(args, self.dest, path)
|
||||
return CheckOutputFilePath
|
||||
|
||||
|
||||
def _check_output_file_path(self):
|
||||
""" Argument action validator:
|
||||
Returns a path to a file from the output file path provided.
|
||||
@@ -408,7 +427,7 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
return _CoreSelectAction
|
||||
|
||||
|
||||
def _add_command_modifiers(self, subcommand_parser: argparse.ArgumentParser):
|
||||
def _add_command_modifiers(self, subcommand_parser: argparse.ArgumentParser, logging_only=False):
|
||||
json_help = "Displays output in JSON format (human readable by default)."
|
||||
csv_help = "Displays output in CSV format (human readable by default)."
|
||||
file_help = "Saves output into a file on the provided path (stdout by default)."
|
||||
@@ -418,12 +437,14 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
|
||||
command_modifier_group = subcommand_parser.add_argument_group('Command Modifiers')
|
||||
|
||||
# Output Format options
|
||||
logging_args = command_modifier_group.add_mutually_exclusive_group()
|
||||
logging_args.add_argument('--json', action='store_true', required=False, help=json_help)
|
||||
logging_args.add_argument('--csv', action='store_true', required=False, help=csv_help)
|
||||
if not logging_only:
|
||||
# Output Format options
|
||||
logging_args = command_modifier_group.add_mutually_exclusive_group()
|
||||
logging_args.add_argument('--json', action='store_true', required=False, help=json_help)
|
||||
logging_args.add_argument('--csv', action='store_true', required=False, help=csv_help)
|
||||
|
||||
command_modifier_group.add_argument('--file', action=self._check_output_file_path(), type=str, required=False, help=file_help)
|
||||
|
||||
command_modifier_group.add_argument('--file', action=self._check_output_file_path(), type=str, required=False, help=file_help)
|
||||
# Placing loglevel outside the subcommands so it can be used with any subcommand
|
||||
command_modifier_group.add_argument('--loglevel', action='store', type=str.upper, required=False, help=loglevel_help, default='ERROR', metavar='LEVEL',
|
||||
choices=loglevel_choices)
|
||||
@@ -1398,6 +1419,49 @@ class AMDSMIParser(argparse.ArgumentParser):
|
||||
self._add_command_modifiers(partition_parser)
|
||||
|
||||
|
||||
def _add_ras_parser(self, subparsers: argparse._SubParsersAction, func):
|
||||
"""
|
||||
Adds the 'ras' subcommand.
|
||||
|
||||
Expected command:
|
||||
amd-smi ras --cper --severity=nonfatal-uncorrected,fatal --folder <folder_name> --file_limit=1000 --follow
|
||||
|
||||
All parameters are provided via options; no positional arguments or optional --file/--gpu are used.
|
||||
"""
|
||||
# Subparser help text
|
||||
ras_help = "Retrieve CPER (RAS) entries from the driver"
|
||||
ras_description = (
|
||||
"Retrieve and decode CPER (RAS) entries from the kernel driver.\n"
|
||||
"Supports filtering by severity, exporting to different formats, and continuous monitoring.\n"
|
||||
"This command accepts options only; no positional arguments are required."
|
||||
)
|
||||
|
||||
# Help text for RAS arguments
|
||||
cper_help = "Trigger CPER data retrieval"
|
||||
|
||||
severity_choices = ["nonfatal-uncorrected", "fatal", "nonfatal-corrected", "all"]
|
||||
severity_choices_str = ", ".join(severity_choices)
|
||||
severity_help = f"Set the SEVERITY filters from the following:\n {severity_choices_str}"
|
||||
folder_help = "Folder to dump CPER report files"
|
||||
file_limit_help = "Maximum number of entries per output file"
|
||||
follow_help = "Continuously monitor for new entries"
|
||||
|
||||
ras_parser = subparsers.add_parser("ras", help=ras_help, description=ras_description)
|
||||
ras_parser.formatter_class = lambda prog: AMDSMISubparserHelpFormatter(prog)
|
||||
ras_parser.set_defaults(func=func)
|
||||
|
||||
# Required flags and arguments:
|
||||
ras_parser.add_argument("--cper", action="store_true", required=True, help=cper_help)
|
||||
ras_parser.add_argument("--severity", type=str.lower, nargs='+', default=['all'], help=severity_help, choices=severity_choices, metavar='SEVERITY')
|
||||
ras_parser.add_argument("--folder", type=str, action=self._check_folder_path(), default=False, help=folder_help)
|
||||
ras_parser.add_argument("--file_limit", type=self._positive_int, action='store', default=1000, help=file_limit_help)
|
||||
ras_parser.add_argument("--follow", action="store_true", default=False, help=follow_help)
|
||||
|
||||
# Add common modifiers and device selection arguments.
|
||||
self._add_device_arguments(ras_parser, required=False)
|
||||
self._add_command_modifiers(ras_parser, logging_only=True)
|
||||
|
||||
|
||||
def error(self, message):
|
||||
outputformat = self.helpers.get_output_format()
|
||||
|
||||
|
||||
@@ -1123,6 +1123,56 @@ except AmdSmiException as e:
|
||||
print(e)
|
||||
```
|
||||
|
||||
### amdsmi_get_gpu_cper_entries

Description: Retrieves the CPER entries for a given GPU and parses them using the CPER header definition from the RAS tool.

Input parameters:

* `processor_handle` device to query
* `severity_mask` the severity mask of the entries to be retrieved
* `buffer_size` size in bytes of the buffer allocated for the CPER data (defaults to `4*1048576`)
* `cursor` cursor position to start reading from (defaults to `0`)

Output: Tuple of `(entries, cursor, cper_data)`. `entries` is a dictionary keyed by entry index whose values are dictionaries with the fields below, `cursor` is the cursor to pass to the next call, and `cper_data` is a list of dictionaries holding the raw `bytes` and `size` of each entry.

Field | Description
---|---
`error_severity` | The severity of the CPER error ex: `non_fatal_uncorrected`, `fatal`, `non_fatal_corrected`.
`notify_type` | The notification type associated with the CPER entry.
`timestamp` | The time when the CPER entry was recorded, formatted as `YYYY/MM/DD HH:MM:SS`.
`signature` | A 4-byte signature identifying the entry, typically `CPER`.
`revision` | The revision number of the CPER record format.
`signature_end` | A marker value (typically `0xFFFFFFFF`) confirming the integrity of the signature.
`sec_cnt` | The count of sections included in the CPER entry.
`record_length` | The total length in bytes of the CPER entry.
`platform_id` | A character array identifying the GPU or platform.
`creator_id` | A character array indicating the creator of the CPER entry.
`record_id` | A unique identifier for the CPER entry.
`flags` | Reserved flags related to the CPER entry.
`persistence_info` | Reserved information related to persistence.
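
Since `timestamp` is returned as a string, it can be handy to convert it before sorting or comparing entries. The snippet below is a small sketch using only the standard library; `parse_cper_timestamp` is a hypothetical helper name, and the sample value is just an illustration of the `YYYY/MM/DD HH:MM:SS` layout described in the table.

```python
from datetime import datetime

# Matches the "YYYY/MM/DD HH:MM:SS" layout of the timestamp field.
CPER_TIMESTAMP_FORMAT = "%Y/%m/%d %H:%M:%S"


def parse_cper_timestamp(text):
    """Convert a CPER timestamp string into a naive datetime object."""
    return datetime.strptime(text, CPER_TIMESTAMP_FORMAT)


ts = parse_cper_timestamp("2025/04/08 18:23:44")
print(ts.isoformat())
```

The result carries no time zone information, because none is present in the formatted string.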

Exceptions that can be thrown by `amdsmi_get_gpu_cper_entries` function:

* `AmdSmiLibraryException`
* `AmdSmiParameterException`

Example:

```python
try:
    devices = amdsmi_get_processor_handles()
    buffer_size = 4 * 1048576  # same as the function's default
    initial_cursor = 0         # start from the first cached entry
    # severity_mask must be set by the caller before this point
    for device in devices:
        # The call returns the parsed entries, the next cursor, and the raw record data
        entries, new_cursor, cper_data = amdsmi_get_gpu_cper_entries(device, severity_mask, buffer_size, initial_cursor)
        print("CPER entries for device", device)
        for key, entry in entries.items():
            print("Entry", key)
            print("  Error Severity:", entry.get("error_severity", "Unknown"))
            print("  Notify Type:", entry.get("notify_type", "Unknown"))
            print("  Timestamp:", entry.get("timestamp", ""))
            print()
        print("New Cursor Position:", new_cursor)
except AmdSmiException as e:
    print(e)
```
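
Once the entries dictionary is available, ordinary dictionary processing is enough to summarize it. The sketch below groups entry keys by `error_severity`; `group_by_severity` is a hypothetical helper, and `sample_entries` is hand-written data shaped like the documented output, not real driver output.

```python
from collections import defaultdict


def group_by_severity(entries):
    """Map each error_severity value to the list of entry keys that carry it."""
    grouped = defaultdict(list)
    for key, entry in entries.items():
        grouped[entry.get("error_severity", "unknown")].append(key)
    return dict(grouped)


# Hand-written sample shaped like the dictionary described above.
sample_entries = {
    0: {"error_severity": "non_fatal_corrected", "notify_type": "CMC"},
    1: {"error_severity": "fatal", "notify_type": "MCE"},
    2: {"error_severity": "non_fatal_corrected", "notify_type": "CMC"},
}

groups = group_by_severity(sample_entries)
print(groups)
```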
|
||||
|
||||
### amdsmi_get_gpu_board_info
|
||||
|
||||
Description: Returns board info for the given GPU
|
||||
|
||||
@@ -333,6 +333,7 @@ typedef enum {
|
||||
AMDSMI_STATUS_AMDGPU_RESTART_ERR = 54, //!< AMDGPU restart failed
|
||||
AMDSMI_STATUS_SETTING_UNAVAILABLE = 55, //!< Setting is not available
|
||||
AMDSMI_STATUS_CORRUPTED_EEPROM = 56, //!< EEPROM is corrupted
|
||||
AMDSMI_STATUS_MORE_DATA = 57, //!< There is more data than the buffer size the user passed
|
||||
// General errors
|
||||
AMDSMI_STATUS_MAP_ERROR = 0xFFFFFFFE, //!< The internal library error did not map to a status code
|
||||
AMDSMI_STATUS_UNKNOWN_ERROR = 0xFFFFFFFF, //!< An unknown error occurred
|
||||
@@ -1408,6 +1409,29 @@ typedef enum {
|
||||
CLK_LIMIT_MAX //!< Clock values in MHz
|
||||
} amdsmi_clk_limit_type_t;
|
||||
|
||||
typedef enum {
|
||||
AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED = 0,
|
||||
AMDSMI_CPER_SEV_FATAL = 1,
|
||||
AMDSMI_CPER_SEV_NON_FATAL_CORRECTED = 2,
|
||||
AMDSMI_CPER_SEV_NUM = 3,
|
||||
AMDSMI_CPER_SEV_UNUSED = 10,
|
||||
} amdsmi_cper_sev_t;
|
||||
|
||||
typedef enum {
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CMC = 0x450eBDD72DCE8BB1,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CPE = 0x4a55D8434E292F96,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_MCE = 0x4cc5919CE8F56FFE,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_PCIE = 0x4dfc1A16CF93C01F,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_INIT = 0x454a9308CC5263E8,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_NMI = 0x42c9B7E65BAD89FF,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_BOOT = 0x409aAB403D61A466,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_DMAR = 0x4c27C6B3667DD791,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_SEA = 0x11E4BBE89A78788A,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_SEI = 0x4E87B0AE5C284C81,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_PEI = 0x4214520409A9D5AC,
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CXL_COMPONENT = 0x49A341DF69293BC9,
|
||||
} amdsmi_cper_notify_type_t;
|
||||
|
||||
/**
|
||||
* @brief The current ECC state
|
||||
*
|
||||
@@ -3360,6 +3384,7 @@ amdsmi_get_gpu_bad_page_info(amdsmi_processor_handle processor_handle, uint32_t
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_bad_page_threshold(amdsmi_processor_handle processor_handle, uint32_t *threshold);
|
||||
|
||||
|
||||
/**
|
||||
* @brief Verify the checksum of RAS EEPROM. It is not supported on virtual
|
||||
* machine guest
|
||||
@@ -4645,6 +4670,104 @@ amdsmi_status_t amdsmi_get_gpu_ecc_enabled(amdsmi_processor_handle processor_han
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_error_count_t *ec);
|
||||
|
||||
|
||||
#pragma pack(push, 1)
|
||||
typedef struct {
|
||||
unsigned char b[16];
|
||||
} amdsmi_cper_guid_t;
|
||||
|
||||
typedef struct {
|
||||
uint8_t seconds;
|
||||
uint8_t minutes;
|
||||
uint8_t hours;
|
||||
uint8_t flag;
|
||||
uint8_t day;
|
||||
uint8_t month;
|
||||
uint8_t year;
|
||||
uint8_t century;
|
||||
} amdsmi_cper_timestamp_t;
|
||||
|
||||
typedef struct {
|
||||
uint32_t platform_id : 1;
|
||||
uint32_t timestamp : 1;
|
||||
uint32_t partition_id : 1;
|
||||
uint32_t reserved : 29;
|
||||
} valid_bits_t;
|
||||
|
||||
typedef union {
|
||||
struct valid_bits_ {
|
||||
uint32_t platform_id : 1;
|
||||
uint32_t timestamp : 1;
|
||||
uint32_t partition_id : 1;
|
||||
uint32_t reserved : 29;
|
||||
} valid_bits;
|
||||
uint32_t valid_mask;
|
||||
} amdsmi_cper_valid_bits_t;
|
||||
|
||||
typedef struct {
|
||||
char signature[4]; /* "CPER" */
|
||||
uint16_t revision;
|
||||
uint32_t signature_end; /* 0xFFFFFFFF */
|
||||
uint16_t sec_cnt;
|
||||
amdsmi_cper_sev_t error_severity;
|
||||
|
||||
// valid_bits_t valid_bits;
|
||||
// uint32_t valid_mask;
|
||||
amdsmi_cper_valid_bits_t cper_valid_bits;
|
||||
|
||||
uint32_t record_length; /* Total size of CPER Entry */
|
||||
amdsmi_cper_timestamp_t timestamp;
|
||||
char platform_id[16];
|
||||
amdsmi_cper_guid_t partition_id; /* Reserved */
|
||||
char creator_id[16];
|
||||
amdsmi_cper_guid_t notify_type; /* CMC, MCE, can use amdsmi_cper_notify_type_t to decode */
|
||||
char record_id[8]; /* Unique CPER Entry ID */
|
||||
uint32_t flags; /* Reserved */
|
||||
uint64_t persistence_info; /* Reserved */
|
||||
uint8_t reserved[12]; /* Reserved */
|
||||
} amdsmi_cper_hdr_t;
|
||||
|
||||
#pragma pack(pop)
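
To make the packed layout concrete, the header can be mirrored with Python's `struct` module. This is a sketch under stated assumptions: little-endian byte order, a 32-bit representation for the `amdsmi_cper_sev_t` enum, and the valid-bits union treated as a single 32-bit mask. The field values in the synthetic header are illustrative only.

```python
import struct

# One format code per field of amdsmi_cper_hdr_t, in declaration order.
# "<" selects little-endian with no padding, which mirrors #pragma pack(1).
CPER_HDR_FORMAT = "<4s H I H I I I 8s 16s 16s 16s 16s 8s I Q 12s"
CPER_HDR_SIZE = struct.calcsize(CPER_HDR_FORMAT)

# seconds, minutes, hours, flag, day, month, year, century (illustrative values)
timestamp = bytes([44, 23, 18, 0, 8, 4, 25, 20])

raw = struct.pack(
    CPER_HDR_FORMAT,
    b"CPER",        # signature
    256,            # revision
    0xFFFFFFFF,     # signature_end
    1,              # sec_cnt
    2,              # error_severity (AMDSMI_CPER_SEV_NON_FATAL_CORRECTED)
    0,              # cper_valid_bits, as a plain mask
    CPER_HDR_SIZE,  # record_length (header only in this sketch)
    timestamp,
    b"", b"", b"", b"",  # platform_id, partition_id, creator_id, notify_type
    b"",            # record_id
    0,              # flags
    0,              # persistence_info
    b"",            # reserved
)

fields = struct.unpack(CPER_HDR_FORMAT, raw)
print(CPER_HDR_SIZE, fields[0], fields[6])
```

Short byte strings passed to an `s` field are padded with null bytes, so the packed record always has the full header size.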
|
||||
/**
|
||||
* @brief Retrieve CPER entries cached in the driver.
|
||||
*
|
||||
* The user passes buffers to hold the CPER data and the CPER headers. The library
* fills the data buffer with the entries that match the severity_mask the user passed. It also parses each CPER header
* and stores a pointer to it in the cper_hdrs array. The user can use cper_hdrs to read the timestamp and other header information.
* A cursor is also returned to the user, which can be used to get the next set of CPER entries.
*
* If there is more data than the buffers the user passed can hold, the library will return AMDSMI_STATUS_MORE_DATA.
* The user can call the API again with the cursor returned by the previous call to get more data.
* If the buffer is too small to hold even one entry, the library
* will return AMDSMI_STATUS_OUT_OF_RESOURCES.
*
* Even if the API returns AMDSMI_STATUS_MORE_DATA, the next call may still return entry_count == 0, because the driver
* cache may not contain any entry of the severity the user is interested in. The API should return AMDSMI_STATUS_SUCCESS in this case
* so that the user can ignore that call.
|
||||
*
|
||||
* @ingroup tagECCInfo
|
||||
*
|
||||
* @platform{gpu_bm_linux} @platform{host} @platform{guest_1vf}
|
||||
*
|
||||
* @param[in] processor_handle Handle to the processor for which CPER entries are to be retrieved.
|
||||
* @param[in] severity_mask The severity mask of the entries to be retrieved.
|
||||
* @param[in,out] cper_data Pointer to a buffer where the CPER data will be stored. The user must allocate the buffer
* and set buf_size correctly.
* @param[in,out] buf_size Pointer to a variable that specifies the size of cper_data.
* On return, it will contain the actual size of the data written to cper_data.
* @param[in,out] cper_hdrs Array of pointers to the parsed headers in cper_data. The user must allocate
* the array of pointers. The library will fill the array with pointers to the parsed
* headers. The underlying data stays in the cper_data buffer; only pointers are stored in this array.
* @param[in,out] entry_count Pointer to a variable that specifies the length of the cper_hdrs array the user allocated.
* On return, it will contain the number of entries actually written to cper_hdrs.
|
||||
* @param[in,out] cursor Pointer to a variable that will contain the cursor for the next call.
|
||||
*
|
||||
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
|
||||
*/
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_cper_entries(amdsmi_processor_handle processor_handle, uint32_t severity_mask, char *cper_data,
|
||||
uint64_t *buf_size, amdsmi_cper_hdr_t** cper_hdrs, uint64_t *entry_count, uint64_t *cursor);
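
The cursor protocol described in the comment above can be sketched without the library. Everything below is a simulation: `fake_get_cper_entries` is a hypothetical stand-in for the C call, the cache is a plain list of severity names, and a set of wanted severities replaces `severity_mask` so that no assumption is made about how the mask bits are encoded. Only the two status values are taken from the header.

```python
AMDSMI_STATUS_SUCCESS = 0
AMDSMI_STATUS_MORE_DATA = 57


def fake_get_cper_entries(cache, wanted, page_size, cursor):
    """Hypothetical stand-in: scan one page of the cache starting at cursor."""
    page = cache[cursor:cursor + page_size]
    next_cursor = cursor + len(page)
    matches = [severity for severity in page if severity in wanted]
    more = next_cursor < len(cache)
    status = AMDSMI_STATUS_MORE_DATA if more else AMDSMI_STATUS_SUCCESS
    return status, matches, next_cursor


def drain(cache, wanted, page_size=2):
    """Keep calling with the returned cursor until no more data is reported."""
    collected, cursor = [], 0
    while True:
        status, matches, cursor = fake_get_cper_entries(cache, wanted, page_size, cursor)
        collected.extend(matches)  # a page with zero matches is simply ignored
        if status != AMDSMI_STATUS_MORE_DATA:
            break
    return collected, cursor


cache = ["fatal", "non_fatal_corrected", "non_fatal_corrected",
         "non_fatal_corrected", "fatal"]
collected, cursor = drain(cache, {"fatal"})
print(collected, cursor)
```

The second page in this run contains no matching entry, which corresponds to the case where a follow-up call reports `entry_count == 0` and the caller just continues.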
|
||||
|
||||
/** @} End tagECCInfo */
|
||||
|
||||
/*****************************************************************************/
|
||||
@@ -5904,7 +6027,7 @@ amdsmi_status_t amdsmi_get_pcie_info(amdsmi_processor_handle processor_handle, a
|
||||
* @brief Returns the 'xcd_counter' from the GPU metrics associated with the device
|
||||
*
|
||||
* @ingroup tagAsicBoardInfo
|
||||
*
|
||||
*
|
||||
* @platform{gpu_bm_linux} @platform{guest_1vf} @platform{guest_mvf}
|
||||
*
|
||||
* @param[in] processor_handle Device which to query
|
||||
|
||||
@@ -124,6 +124,7 @@ from .amdsmi_interface import amdsmi_get_gpu_board_info
|
||||
# # Ras Information
|
||||
from .amdsmi_interface import amdsmi_get_gpu_ras_feature_info
|
||||
from .amdsmi_interface import amdsmi_get_gpu_ras_block_features_enabled
|
||||
from .amdsmi_interface import amdsmi_get_gpu_cper_entries
|
||||
|
||||
# # Unsupported Functions In Virtual Environment
|
||||
from .amdsmi_interface import amdsmi_set_gpu_pci_bandwidth
|
||||
|
||||
@@ -25,7 +25,7 @@ import os
|
||||
import re
|
||||
import sys
|
||||
from collections.abc import Iterable
|
||||
from enum import IntEnum
|
||||
from enum import IntEnum, Enum
|
||||
from pathlib import Path
|
||||
from time import asctime, localtime, time
|
||||
from typing import Any, Dict, List, Tuple, Union
|
||||
@@ -386,6 +386,21 @@ class AmdSmiRasErrState(IntEnum):
|
||||
INVALID = amdsmi_wrapper.AMDSMI_RAS_ERR_STATE_INVALID
|
||||
|
||||
|
||||
class AmdSmiCperNotifyType(Enum):
|
||||
CMC = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_CMC
|
||||
CPE = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_CPE
|
||||
MCE = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_MCE
|
||||
PCIE = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_PCIE
|
||||
INIT = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_INIT
|
||||
NMI = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_NMI
|
||||
BOOT = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_BOOT
|
||||
DMAr = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_DMAR
|
||||
SEA = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_SEA
|
||||
SEI = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_SEI
|
||||
PEI = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_PEI
|
||||
CXL_COMPONENT = amdsmi_wrapper.AMDSMI_CPER_NOTIFY_TYPE_CXL_COMPONENT
|
||||
|
||||
|
||||
class AmdSmiMemoryType(IntEnum):
|
||||
VRAM = amdsmi_wrapper.AMDSMI_MEM_TYPE_VRAM
|
||||
VIS_VRAM = amdsmi_wrapper.AMDSMI_MEM_TYPE_VIS_VRAM
|
||||
@@ -460,6 +475,7 @@ class AmdSmiVirtualizationMode(IntEnum):
|
||||
GUEST = amdsmi_wrapper.AMDSMI_VIRTUALIZATION_MODE_GUEST
|
||||
PASSTHROUGH = amdsmi_wrapper.AMDSMI_VIRTUALIZATION_MODE_PASSTHROUGH
|
||||
|
||||
|
||||
class AmdSmiVramType(IntEnum):
|
||||
UNKNOWN = amdsmi_wrapper.AMDSMI_VRAM_TYPE_UNKNOWN
|
||||
HBM = amdsmi_wrapper.AMDSMI_VRAM_TYPE_HBM
|
||||
@@ -478,6 +494,7 @@ class AmdSmiVramType(IntEnum):
|
||||
GDDR7 = amdsmi_wrapper.AMDSMI_VRAM_TYPE_GDDR7
|
||||
MAX = amdsmi_wrapper.AMDSMI_VRAM_TYPE__MAX
|
||||
|
||||
|
||||
class AmdSmiVramVendor(IntEnum):
|
||||
SAMSUNG = amdsmi_wrapper.AMDSMI_VRAM_VENDOR_SAMSUNG
|
||||
INFINEON = amdsmi_wrapper.AMDSMI_VRAM_VENDOR_INFINEON
|
||||
@@ -491,6 +508,7 @@ class AmdSmiVramVendor(IntEnum):
|
||||
MICRON = amdsmi_wrapper.AMDSMI_VRAM_VENDOR_MICRON
|
||||
UNKNOWN = amdsmi_wrapper.AMDSMI_VRAM_VENDOR_UNKNOWN
|
||||
|
||||
|
||||
class AmdSmiEventReader:
|
||||
def __init__(
|
||||
self, processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
|
||||
@@ -699,6 +717,21 @@ def _validate_if_max_uint(value, uint_type: MaxUIntegerTypes, isActivity=False,
|
||||
else:
|
||||
return return_val
|
||||
|
||||
|
||||
def _notifyTypeToString(notify_type_b):
    guid = []
    # Iterate over only the first 8 bytes, but backwards
    for i in notify_type_b[7::-1]:
        guid.append(format(i, '02x'))
    hex_string = "".join(guid)
    hex_value = int(hex_string, 16)
    if hex_value in AmdSmiCperNotifyType._value2member_map_:
        # Convert to the corresponding enum name
        return AmdSmiCperNotifyType(hex_value).name
    else:
        return "Unknown"
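
The byte reversal in `_notifyTypeToString` amounts to reading the first 8 GUID bytes as a little-endian integer. The standalone sketch below checks this against the CMC constant from the enum; `guid_prefix_key` is a hypothetical helper, and the trailing 8 bytes of the sample GUID are zero-filled because only the first 8 take part in the lookup.

```python
CMC_KEY = 0x450EBDD72DCE8BB1  # value of AMDSMI_CPER_NOTIFY_TYPE_CMC


def guid_prefix_key(guid_bytes):
    """Interpret the first 8 GUID bytes as a little-endian integer."""
    return int.from_bytes(bytes(guid_bytes[:8]), "little")


# First 8 bytes as they would sit in memory for the CMC key, then zero padding.
raw_guid = bytes.fromhex("b18bce2dd7bd0e45") + bytes(8)

key = guid_prefix_key(raw_guid)
# Same computation written the way the wrapper does it: reversed hex join.
joined = "".join(format(b, "02x") for b in raw_guid[7::-1])

print(hex(key), joined)
```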
|
||||
|
||||
|
||||
def amdsmi_get_socket_handles() -> List[amdsmi_wrapper.amdsmi_socket_handle]:
|
||||
"""
|
||||
Function that gets socket handles. Wraps the same named function call.
|
||||
@@ -1782,7 +1815,7 @@ def amdsmi_get_gpu_enumeration_info(processor_handle: amdsmi_wrapper.amdsmi_proc
|
||||
|
||||
# Call the C function to populate the struct
|
||||
status = amdsmi_wrapper.amdsmi_get_gpu_enumeration_info(processor_handle, ctypes.byref(enumeration_info))
|
||||
|
||||
|
||||
# Validate the status result
|
||||
_check_res(status)
|
||||
|
||||
@@ -2238,6 +2271,96 @@ def amdsmi_get_gpu_total_ecc_count(
|
||||
"deferred_count": ec.deferred_count,
|
||||
}
|
||||
|
||||
def notifyTypeToString(notify_type_b):
    # Hex-encode the first 8 GUID bytes and join them in reverse byte order
    guid = [format(i, '02x') for i in notify_type_b[:8]]
    return "".join(guid[::-1])
|
||||
|
||||
def amdsmi_get_gpu_cper_entries(processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
                                severity_mask: int,
                                buffer_size: int = 4*1048576,
                                cursor: int = 0
                                ) -> Tuple[Dict[int, Dict[str, Any]], int, List[Dict[str, Any]]]:

    if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
        raise AmdSmiParameterException(
            processor_handle, amdsmi_wrapper.amdsmi_processor_handle
        )

    # Allocate a buffer for CPER data.
    buf = ctypes.create_string_buffer(buffer_size)
    buf_size = ctypes.c_uint64(buffer_size)
    # At most 20 parsed headers are requested per call.
    entry_count = ctypes.c_uint64(20)
    cur = ctypes.c_uint64(cursor)
    # Allocate the array of pointers to CPER headers.
    cper_hdrs_array = (ctypes.POINTER(amdsmi_wrapper.amdsmi_cper_hdr_t) * 20)()
    cper_hdrs = ctypes.cast(cper_hdrs_array, ctypes.POINTER(ctypes.POINTER(amdsmi_wrapper.amdsmi_cper_hdr_t)))

    # Call the underlying AMD-SMI API.
    ret = amdsmi_wrapper.amdsmi_get_gpu_cper_entries(
        processor_handle,
        ctypes.c_uint32(severity_mask),
        buf,
        ctypes.byref(buf_size),
        cper_hdrs,
        ctypes.byref(entry_count),
        ctypes.byref(cur)
    )
    if ret != amdsmi_wrapper.AMDSMI_STATUS_SUCCESS:
        raise AmdSmiLibraryException(ret)

    entries = {}
    cper_data = []
    offset = 0
    # Iterate over each entry using its variable record_length.
    for i in range(entry_count.value):
        entry_address = ctypes.addressof(buf) + offset
        entry_ptr = ctypes.cast(entry_address, ctypes.POINTER(amdsmi_wrapper.amdsmi_cper_hdr_t))
        # Keep the raw bytes of the whole record alongside the parsed header.
        cper_data.append({
            "bytes": list((entry_ptr.contents.record_length * ctypes.c_byte).from_address(entry_address)),
            "size": entry_ptr.contents.record_length
        })
        # Extract the timestamp fields.
        year = entry_ptr.contents.timestamp.year
        # Two-digit years are assumed to belong to the 2000s.
        if year < 100:
            year += 2000
        formatted_timestamp = (
            f"{year:04d}/"
            f"{entry_ptr.contents.timestamp.month:02d}/"
            f"{entry_ptr.contents.timestamp.day:02d} "
            f"{entry_ptr.contents.timestamp.hours:02d}:"
            f"{entry_ptr.contents.timestamp.minutes:02d}:"
            f"{entry_ptr.contents.timestamp.seconds:02d}"
        )
        cper_entry = {
            "error_severity": amdsmi_wrapper.amdsmi_cper_sev_t__enumvalues.get(entry_ptr.contents.error_severity, "AMDSMI_CPER_SEV_UNUSED").replace("AMDSMI_CPER_SEV_", "").lower(),
            "notify_type": _notifyTypeToString(entry_ptr.contents.notify_type.b),
            "timestamp": formatted_timestamp,
            "signature": entry_ptr.contents.signature,
            "revision": entry_ptr.contents.revision,
            "signature_end": hex(entry_ptr.contents.signature_end),
            "sec_cnt": entry_ptr.contents.sec_cnt,
            "record_length": entry_ptr.contents.record_length,
            "platform_id": entry_ptr.contents.platform_id,
            "creator_id": entry_ptr.contents.creator_id,
            "record_id": entry_ptr.contents.record_id,
            "flags": entry_ptr.contents.flags,
            "persistence_info": entry_ptr.contents.persistence_info,
            #"reserved" : entry_ptr.contents.reserved
            #"cper_valid_bit" : entry_ptr.contents.cper_valid_bits,
            #"partition_id" : entry_ptr.contents.partition_id,
        }
        entries[i] = cper_entry.copy()
        offset += entry_ptr.contents.record_length  # Use the actual record length to advance the offset

    return entries, cur.value, cper_data
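
The offset arithmetic in the loop above, where each record's own `record_length` decides where the next record starts, can be reproduced on a synthetic buffer. This sketch assumes the packed 128-byte header layout, which places `record_length` at byte offset 20, and little-endian byte order. `make_record` and `split_records` are hypothetical helpers, and the record bodies are zero-filled placeholders.

```python
import struct

HEADER_SIZE = 128          # size of the packed header
RECORD_LENGTH_OFFSET = 20  # bytes that precede record_length in the header


def make_record(total_length):
    """Build a zero-filled record whose header announces total_length."""
    header = bytearray(HEADER_SIZE)
    header[0:4] = b"CPER"
    struct.pack_into("<I", header, RECORD_LENGTH_OFFSET, total_length)
    return bytes(header) + bytes(total_length - HEADER_SIZE)


def split_records(buffer, count):
    """Walk the buffer, advancing by each record's own record_length."""
    records, offset = [], 0
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buffer, offset + RECORD_LENGTH_OFFSET)
        records.append(buffer[offset:offset + length])
        offset += length
    return records


buffer = make_record(200) + make_record(136)
records = split_records(buffer, 2)
print([len(record) for record in records])
```

A fixed stride would not work here, since the two records have different lengths; reading the length from each header is what keeps the walk aligned.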
|
||||
|
||||
|
||||
def amdsmi_get_gpu_board_info(
|
||||
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
|
||||
@@ -2938,7 +3061,7 @@ def amdsmi_get_gpu_memory_partition_config(processor_handle: amdsmi_wrapper.amds
|
||||
raise AmdSmiParameterException(
|
||||
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
|
||||
)
|
||||
|
||||
|
||||
config = amdsmi_wrapper.amdsmi_memory_partition_config_t()
|
||||
|
||||
_check_res(
|
||||
@@ -3017,10 +3140,10 @@ def amdsmi_get_gpu_accelerator_partition_profile(
|
||||
)
|
||||
profile_type_ret = amdsmi_wrapper.amdsmi_accelerator_partition_type_t__enumvalues[profile.profile_type].replace("AMDSMI_ACCELERATOR_PARTITION_", "")
|
||||
profile_type_ret = profile_type_ret.replace("INVALID", "N/A")
|
||||
|
||||
|
||||
length = profile.num_partitions
|
||||
partition_ids = []
|
||||
|
||||
|
||||
#partition_id[0] will contain the partition id of each device
|
||||
#BM/Guest will include this logic. Host will only display primary partition ids.
|
||||
kPOSITION_OF_PARTITION_ID = 0
|
||||
@@ -3079,7 +3202,7 @@ def amdsmi_get_gpu_accelerator_partition_profile_config(processor_handle: amdsmi
|
||||
profile_type_ret = profile_type_ret.replace("INVALID", "N/A")
|
||||
resources = []
|
||||
|
||||
|
||||
|
||||
mem_caps_list = []
|
||||
if profile.memory_caps.nps_flags.nps1_cap == 1:
|
||||
mem_caps_list.append("NPS1")
|
||||
@@ -3104,7 +3227,7 @@ def amdsmi_get_gpu_accelerator_partition_profile_config(processor_handle: amdsmi
|
||||
logging.debug("\namdsmi_interface.py | amdsmi_get_gpu_accelerator_partition_profile_config | resource_profile_dict = " + str(resource_profile_dict))
|
||||
resources.append(resource_profile_dict)
|
||||
resource_idx += 1
|
||||
|
||||
|
||||
profile_dict = {
|
||||
"profile_type": profile_type_ret,
|
||||
"num_partitions": profile.num_partitions,
|
||||
|
||||
@@ -324,6 +324,7 @@ amdsmi_status_t__enumvalues = {
|
||||
54: 'AMDSMI_STATUS_AMDGPU_RESTART_ERR',
|
||||
55: 'AMDSMI_STATUS_SETTING_UNAVAILABLE',
|
||||
56: 'AMDSMI_STATUS_CORRUPTED_EEPROM',
|
||||
57: 'AMDSMI_STATUS_MORE_DATA',
|
||||
4294967294: 'AMDSMI_STATUS_MAP_ERROR',
|
||||
4294967295: 'AMDSMI_STATUS_UNKNOWN_ERROR',
|
||||
}
|
||||
@@ -369,6 +370,7 @@ AMDSMI_STATUS_ARG_PTR_NULL = 53
|
||||
AMDSMI_STATUS_AMDGPU_RESTART_ERR = 54
|
||||
AMDSMI_STATUS_SETTING_UNAVAILABLE = 55
|
||||
AMDSMI_STATUS_CORRUPTED_EEPROM = 56
|
||||
AMDSMI_STATUS_MORE_DATA = 57
|
||||
AMDSMI_STATUS_MAP_ERROR = 4294967294
|
||||
AMDSMI_STATUS_UNKNOWN_ERROR = 4294967295
|
||||
amdsmi_status_t = ctypes.c_uint32 # enum
|
||||
@@ -856,21 +858,6 @@ amdsmi_card_form_factor_t = ctypes.c_uint32 # enum
|
||||
class struct_amdsmi_pcie_info_t(Structure):
|
||||
pass
|
||||
|
||||
class struct_pcie_static_(Structure):
|
||||
pass
|
||||
|
||||
struct_pcie_static_._pack_ = 1 # source:False
|
||||
struct_pcie_static_._fields_ = [
|
||||
('max_pcie_width', ctypes.c_uint16),
|
||||
('PADDING_0', ctypes.c_ubyte * 2),
|
||||
('max_pcie_speed', ctypes.c_uint32),
|
||||
('pcie_interface_version', ctypes.c_uint32),
|
||||
('slot_type', amdsmi_card_form_factor_t),
|
||||
('max_pcie_interface_version', ctypes.c_uint32),
|
||||
('PADDING_1', ctypes.c_ubyte * 4),
|
||||
('reserved', ctypes.c_uint64 * 9),
|
||||
]
|
||||
|
||||
class struct_pcie_metric_(Structure):
|
||||
pass
|
||||
|
||||
@@ -891,6 +878,21 @@ struct_pcie_metric_._fields_ = [
|
||||
('reserved', ctypes.c_uint64 * 12),
|
||||
]
|
||||
|
||||
class struct_pcie_static_(Structure):
|
||||
pass
|
||||
|
||||
struct_pcie_static_._pack_ = 1 # source:False
|
||||
struct_pcie_static_._fields_ = [
|
||||
('max_pcie_width', ctypes.c_uint16),
|
||||
('PADDING_0', ctypes.c_ubyte * 2),
|
||||
('max_pcie_speed', ctypes.c_uint32),
|
||||
('pcie_interface_version', ctypes.c_uint32),
|
||||
('slot_type', amdsmi_card_form_factor_t),
|
||||
('max_pcie_interface_version', ctypes.c_uint32),
|
||||
('PADDING_1', ctypes.c_ubyte * 4),
|
||||
('reserved', ctypes.c_uint64 * 9),
|
||||
]
|
||||
|
||||
struct_amdsmi_pcie_info_t._pack_ = 1 # source:False
|
||||
struct_amdsmi_pcie_info_t._fields_ = [
|
||||
('pcie_static', struct_pcie_static_),
|
||||
@@ -1601,6 +1603,50 @@ CLK_LIMIT_MIN = 0
|
||||
CLK_LIMIT_MAX = 1
|
||||
amdsmi_clk_limit_type_t = ctypes.c_uint32 # enum
|
||||
|
||||
# values for enumeration 'amdsmi_cper_sev_t'
|
||||
amdsmi_cper_sev_t__enumvalues = {
|
||||
0: 'AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED',
|
||||
1: 'AMDSMI_CPER_SEV_FATAL',
|
||||
2: 'AMDSMI_CPER_SEV_NON_FATAL_CORRECTED',
|
||||
3: 'AMDSMI_CPER_SEV_NUM',
|
||||
10: 'AMDSMI_CPER_SEV_UNUSED',
|
||||
}
|
||||
AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED = 0
|
||||
AMDSMI_CPER_SEV_FATAL = 1
|
||||
AMDSMI_CPER_SEV_NON_FATAL_CORRECTED = 2
|
||||
AMDSMI_CPER_SEV_NUM = 3
|
||||
AMDSMI_CPER_SEV_UNUSED = 10
|
||||
amdsmi_cper_sev_t = ctypes.c_uint32 # enum
|
||||
|
||||
# values for enumeration 'amdsmi_cper_notify_type_t'
|
||||
amdsmi_cper_notify_type_t__enumvalues = {
|
||||
4976123370175105969: 'AMDSMI_CPER_NOTIFY_TYPE_CMC',
|
||||
5356425115412803478: 'AMDSMI_CPER_NOTIFY_TYPE_CPE',
|
||||
5531987820403847166: 'AMDSMI_CPER_NOTIFY_TYPE_MCE',
|
||||
5619395120325705759: 'AMDSMI_CPER_NOTIFY_TYPE_PCIE',
|
||||
4992964802890589160: 'AMDSMI_CPER_NOTIFY_TYPE_INIT',
|
||||
4812579876830546431: 'AMDSMI_CPER_NOTIFY_TYPE_NMI',
|
||||
4655221457236894822: 'AMDSMI_CPER_NOTIFY_TYPE_BOOT',
|
||||
5487573144795207569: 'AMDSMI_CPER_NOTIFY_TYPE_DMAR',
|
||||
1289362001033197706: 'AMDSMI_CPER_NOTIFY_TYPE_SEA',
|
||||
5658685719731260545: 'AMDSMI_CPER_NOTIFY_TYPE_SEI',
|
||||
4761520883332928940: 'AMDSMI_CPER_NOTIFY_TYPE_PEI',
|
||||
5306157213770398665: 'AMDSMI_CPER_NOTIFY_TYPE_CXL_COMPONENT',
|
||||
}
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CMC = 4976123370175105969
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CPE = 5356425115412803478
|
||||
AMDSMI_CPER_NOTIFY_TYPE_MCE = 5531987820403847166
|
||||
AMDSMI_CPER_NOTIFY_TYPE_PCIE = 5619395120325705759
|
||||
AMDSMI_CPER_NOTIFY_TYPE_INIT = 4992964802890589160
|
||||
AMDSMI_CPER_NOTIFY_TYPE_NMI = 4812579876830546431
|
||||
AMDSMI_CPER_NOTIFY_TYPE_BOOT = 4655221457236894822
|
||||
AMDSMI_CPER_NOTIFY_TYPE_DMAR = 5487573144795207569
|
||||
AMDSMI_CPER_NOTIFY_TYPE_SEA = 1289362001033197706
|
||||
AMDSMI_CPER_NOTIFY_TYPE_SEI = 5658685719731260545
|
||||
AMDSMI_CPER_NOTIFY_TYPE_PEI = 4761520883332928940
|
||||
AMDSMI_CPER_NOTIFY_TYPE_CXL_COMPONENT = 5306157213770398665
|
||||
amdsmi_cper_notify_type_t = ctypes.c_uint64 # enum
|
||||
|
||||
# values for enumeration 'amdsmi_ras_err_state_t'
|
||||
amdsmi_ras_err_state_t__enumvalues = {
|
||||
0: 'AMDSMI_RAS_ERR_STATE_NONE',
|
||||
@@ -2520,6 +2566,91 @@ amdsmi_get_gpu_ecc_enabled.argtypes = [amdsmi_processor_handle, ctypes.POINTER(c
|
||||
amdsmi_get_gpu_total_ecc_count = _libraries['libamd_smi.so'].amdsmi_get_gpu_total_ecc_count
|
||||
amdsmi_get_gpu_total_ecc_count.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_total_ecc_count.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_error_count_t)]
|
||||
class struct_amdsmi_cper_guid_t(Structure):
|
||||
pass
|
||||
|
||||
struct_amdsmi_cper_guid_t._pack_ = 1 # source:False
|
||||
struct_amdsmi_cper_guid_t._fields_ = [
|
||||
('b', ctypes.c_ubyte * 16),
|
||||
]
|
||||
|
||||
amdsmi_cper_guid_t = struct_amdsmi_cper_guid_t
|
||||
class struct_amdsmi_cper_timestamp_t(Structure):
|
||||
pass
|
||||
|
||||
struct_amdsmi_cper_timestamp_t._pack_ = 1 # source:False
|
||||
struct_amdsmi_cper_timestamp_t._fields_ = [
|
||||
('seconds', ctypes.c_ubyte),
|
||||
('minutes', ctypes.c_ubyte),
|
||||
('hours', ctypes.c_ubyte),
|
||||
('flag', ctypes.c_ubyte),
|
||||
('day', ctypes.c_ubyte),
|
||||
('month', ctypes.c_ubyte),
|
||||
('year', ctypes.c_ubyte),
|
||||
('century', ctypes.c_ubyte),
|
||||
]
|
||||
|
||||
amdsmi_cper_timestamp_t = struct_amdsmi_cper_timestamp_t
|
||||
class struct_valid_bits_t(Structure):
|
||||
pass
|
||||
|
||||
struct_valid_bits_t._pack_ = 1 # source:False
|
||||
struct_valid_bits_t._fields_ = [
|
||||
('platform_id', ctypes.c_uint32, 1),
|
||||
('timestamp', ctypes.c_uint32, 1),
|
||||
('partition_id', ctypes.c_uint32, 1),
|
||||
('reserved', ctypes.c_uint32, 29),
|
||||
]
|
||||
|
||||
valid_bits_t = struct_valid_bits_t
|
||||
class union_amdsmi_cper_valid_bits_t(Union):
|
||||
pass
|
||||
|
||||
class struct_valid_bits_(Structure):
|
||||
pass
|
||||
|
||||
struct_valid_bits_._pack_ = 1 # source:False
|
||||
struct_valid_bits_._fields_ = [
|
||||
('platform_id', ctypes.c_uint32, 1),
|
||||
('timestamp', ctypes.c_uint32, 1),
|
||||
('partition_id', ctypes.c_uint32, 1),
|
||||
('reserved', ctypes.c_uint32, 29),
|
||||
]
|
||||
|
||||
union_amdsmi_cper_valid_bits_t._pack_ = 1 # source:False
|
||||
union_amdsmi_cper_valid_bits_t._fields_ = [
|
||||
('valid_bits', struct_valid_bits_),
|
||||
('valid_mask', ctypes.c_uint32),
|
||||
]
|
||||
|
||||
amdsmi_cper_valid_bits_t = union_amdsmi_cper_valid_bits_t
|
||||
class struct_amdsmi_cper_hdr_t(Structure):
|
||||
pass
|
||||
|
||||
struct_amdsmi_cper_hdr_t._pack_ = 1 # source:False
|
||||
struct_amdsmi_cper_hdr_t._fields_ = [
|
||||
('signature', ctypes.c_char * 4),
|
||||
('revision', ctypes.c_uint16),
|
||||
('signature_end', ctypes.c_uint32),
|
||||
('sec_cnt', ctypes.c_uint16),
|
||||
('error_severity', amdsmi_cper_sev_t),
|
||||
('cper_valid_bits', amdsmi_cper_valid_bits_t),
|
||||
('record_length', ctypes.c_uint32),
|
||||
('timestamp', amdsmi_cper_timestamp_t),
|
||||
('platform_id', ctypes.c_char * 16),
|
||||
('partition_id', amdsmi_cper_guid_t),
|
||||
('creator_id', ctypes.c_char * 16),
|
||||
('notify_type', amdsmi_cper_guid_t),
|
||||
('record_id', ctypes.c_char * 8),
|
||||
('flags', ctypes.c_uint32),
|
||||
('persistence_info', ctypes.c_uint64),
|
||||
('reserved', ctypes.c_ubyte * 12),
|
||||
]
|
||||
|
||||
amdsmi_cper_hdr_t = struct_amdsmi_cper_hdr_t
|
||||
amdsmi_get_gpu_cper_entries = _libraries['libamd_smi.so'].amdsmi_get_gpu_cper_entries
|
||||
amdsmi_get_gpu_cper_entries.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_cper_entries.argtypes = [amdsmi_processor_handle, uint32_t, ctypes.POINTER(ctypes.c_char), ctypes.POINTER(ctypes.c_uint64), ctypes.POINTER(ctypes.POINTER(struct_amdsmi_cper_hdr_t)), ctypes.POINTER(ctypes.c_uint64), ctypes.POINTER(ctypes.c_uint64)]
|
||||
amdsmi_get_gpu_ecc_status = _libraries['libamd_smi.so'].amdsmi_get_gpu_ecc_status
|
||||
amdsmi_get_gpu_ecc_status.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_ecc_status.argtypes = [amdsmi_processor_handle, amdsmi_gpu_block_t, ctypes.POINTER(amdsmi_ras_err_state_t)]
|
||||
@@ -2828,7 +2959,16 @@ __all__ = \
|
||||
'AMDSMI_COMPUTE_PARTITION_INVALID',
|
||||
'AMDSMI_COMPUTE_PARTITION_QPX', 'AMDSMI_COMPUTE_PARTITION_SPX',
|
||||
'AMDSMI_COMPUTE_PARTITION_TPX', 'AMDSMI_CONTAINER_DOCKER',
|
||||
'AMDSMI_CONTAINER_LXC', 'AMDSMI_DEV_PERF_LEVEL_AUTO',
|
||||
'AMDSMI_CONTAINER_LXC', 'AMDSMI_CPER_NOTIFY_TYPE_BOOT',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_CMC', 'AMDSMI_CPER_NOTIFY_TYPE_CPE',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_CXL_COMPONENT',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_DMAR', 'AMDSMI_CPER_NOTIFY_TYPE_INIT',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_MCE', 'AMDSMI_CPER_NOTIFY_TYPE_NMI',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_PCIE', 'AMDSMI_CPER_NOTIFY_TYPE_PEI',
|
||||
'AMDSMI_CPER_NOTIFY_TYPE_SEA', 'AMDSMI_CPER_NOTIFY_TYPE_SEI',
|
||||
'AMDSMI_CPER_SEV_FATAL', 'AMDSMI_CPER_SEV_NON_FATAL_CORRECTED',
|
||||
'AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED', 'AMDSMI_CPER_SEV_NUM',
|
||||
'AMDSMI_CPER_SEV_UNUSED', 'AMDSMI_DEV_PERF_LEVEL_AUTO',
|
||||
'AMDSMI_DEV_PERF_LEVEL_DETERMINISM',
|
||||
'AMDSMI_DEV_PERF_LEVEL_FIRST', 'AMDSMI_DEV_PERF_LEVEL_HIGH',
|
||||
'AMDSMI_DEV_PERF_LEVEL_LAST', 'AMDSMI_DEV_PERF_LEVEL_LOW',
|
||||
@@ -2955,9 +3095,9 @@ __all__ = \
|
||||
'AMDSMI_STATUS_INSUFFICIENT_SIZE',
|
||||
'AMDSMI_STATUS_INTERNAL_EXCEPTION', 'AMDSMI_STATUS_INTERRUPT',
|
||||
'AMDSMI_STATUS_INVAL', 'AMDSMI_STATUS_IO',
|
||||
'AMDSMI_STATUS_MAP_ERROR', 'AMDSMI_STATUS_NON_AMD_CPU',
|
||||
'AMDSMI_STATUS_NOT_FOUND', 'AMDSMI_STATUS_NOT_INIT',
|
||||
'AMDSMI_STATUS_NOT_SUPPORTED',
|
||||
'AMDSMI_STATUS_MAP_ERROR', 'AMDSMI_STATUS_MORE_DATA',
|
||||
'AMDSMI_STATUS_NON_AMD_CPU', 'AMDSMI_STATUS_NOT_FOUND',
|
||||
'AMDSMI_STATUS_NOT_INIT', 'AMDSMI_STATUS_NOT_SUPPORTED',
|
||||
'AMDSMI_STATUS_NOT_YET_IMPLEMENTED', 'AMDSMI_STATUS_NO_DATA',
|
||||
'AMDSMI_STATUS_NO_DRV', 'AMDSMI_STATUS_NO_ENERGY_DRV',
|
||||
'AMDSMI_STATUS_NO_HSMP_DRV', 'AMDSMI_STATUS_NO_HSMP_MSG_SUP',
|
||||
@@ -3022,6 +3162,9 @@ __all__ = \
|
||||
'amdsmi_clk_limit_type_t', 'amdsmi_clk_type_t',
|
||||
'amdsmi_compute_partition_type_t', 'amdsmi_container_types_t',
|
||||
'amdsmi_counter_command_t', 'amdsmi_counter_value_t',
|
||||
'amdsmi_cper_guid_t', 'amdsmi_cper_hdr_t',
|
||||
'amdsmi_cper_notify_type_t', 'amdsmi_cper_sev_t',
|
||||
'amdsmi_cper_timestamp_t', 'amdsmi_cper_valid_bits_t',
|
||||
'amdsmi_cpu_apb_disable', 'amdsmi_cpu_apb_enable',
|
||||
'amdsmi_cpu_info_t', 'amdsmi_cpu_util_t',
|
||||
'amdsmi_cpusocket_handle', 'amdsmi_ddr_bw_metrics_t',
|
||||
@@ -3074,10 +3217,10 @@ __all__ = \
|
||||
'amdsmi_get_gpu_compute_process_gpus',
|
||||
'amdsmi_get_gpu_compute_process_info',
|
||||
'amdsmi_get_gpu_compute_process_info_by_pid',
|
||||
'amdsmi_get_gpu_device_bdf', 'amdsmi_get_gpu_device_uuid',
|
||||
'amdsmi_get_gpu_driver_info', 'amdsmi_get_gpu_ecc_count',
|
||||
'amdsmi_get_gpu_ecc_enabled', 'amdsmi_get_gpu_ecc_status',
|
||||
'amdsmi_get_gpu_enumeration_info',
|
||||
'amdsmi_get_gpu_cper_entries', 'amdsmi_get_gpu_device_bdf',
|
||||
'amdsmi_get_gpu_device_uuid', 'amdsmi_get_gpu_driver_info',
|
||||
'amdsmi_get_gpu_ecc_count', 'amdsmi_get_gpu_ecc_enabled',
|
||||
'amdsmi_get_gpu_ecc_status', 'amdsmi_get_gpu_enumeration_info',
|
||||
'amdsmi_get_gpu_event_notification', 'amdsmi_get_gpu_fan_rpms',
|
||||
'amdsmi_get_gpu_fan_speed', 'amdsmi_get_gpu_fan_speed_max',
|
||||
'amdsmi_get_gpu_id', 'amdsmi_get_gpu_kfd_info',
|
||||
@@ -3192,11 +3335,13 @@ __all__ = \
    'struct_amdsmi_accelerator_partition_resource_profile_t',
    'struct_amdsmi_asic_info_t', 'struct_amdsmi_board_info_t',
    'struct_amdsmi_clk_info_t', 'struct_amdsmi_counter_value_t',
    'struct_amdsmi_cpu_info_t', 'struct_amdsmi_cpu_util_t',
    'struct_amdsmi_ddr_bw_metrics_t', 'struct_amdsmi_dimm_power_t',
    'struct_amdsmi_dimm_thermal_t', 'struct_amdsmi_dpm_level_t',
    'struct_amdsmi_dpm_policy_entry_t', 'struct_amdsmi_dpm_policy_t',
    'struct_amdsmi_driver_info_t', 'struct_amdsmi_engine_usage_t',
    'struct_amdsmi_cper_guid_t', 'struct_amdsmi_cper_hdr_t',
    'struct_amdsmi_cper_timestamp_t', 'struct_amdsmi_cpu_info_t',
    'struct_amdsmi_cpu_util_t', 'struct_amdsmi_ddr_bw_metrics_t',
    'struct_amdsmi_dimm_power_t', 'struct_amdsmi_dimm_thermal_t',
    'struct_amdsmi_dpm_level_t', 'struct_amdsmi_dpm_policy_entry_t',
    'struct_amdsmi_dpm_policy_t', 'struct_amdsmi_driver_info_t',
    'struct_amdsmi_engine_usage_t',
    'struct_amdsmi_enumeration_info_t', 'struct_amdsmi_error_count_t',
    'struct_amdsmi_evt_notification_data_t',
    'struct_amdsmi_freq_volt_region_t', 'struct_amdsmi_frequencies_t',
@@ -3228,6 +3373,8 @@ __all__ = \
    'struct_engine_usage_', 'struct_fw_info_list_',
    'struct_memory_usage_', 'struct_nps_flags_', 'struct_numa_range_',
    'struct_pcie_metric_', 'struct_pcie_static_',
    'struct_amdsmi_bdf_t', 'uint32_t', 'uint64_t', 'uint8_t',
    'union_amdsmi_bdf_t', 'union_amdsmi_nps_caps_t']
    'struct_amdsmi_bdf_t', 'struct_valid_bits_',
    'struct_valid_bits_t', 'uint32_t', 'uint64_t', 'uint8_t',
    'union_amdsmi_bdf_t', 'union_amdsmi_cper_valid_bits_t',
    'union_amdsmi_nps_caps_t', 'valid_bits_t']
@@ -21,6 +21,7 @@
 */

#include <assert.h>
#include <cstdlib>
#include <errno.h>
#include <sys/utsname.h>
#include <stdio.h>
@@ -542,6 +543,7 @@ amdsmi_status_t amdsmi_get_processor_type(amdsmi_processor_handle processor_hand
    return AMDSMI_STATUS_SUCCESS;
}


amdsmi_status_t
amdsmi_get_gpu_device_bdf(amdsmi_processor_handle processor_handle, amdsmi_bdf_t *bdf) {

@@ -3547,6 +3549,276 @@ amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_
    return AMDSMI_STATUS_SUCCESS;
}

namespace {
static std::vector<const amdsmi_cper_hdr_t *>
amdsmi_get_gpu_cper_headers(const char *buffer, size_t buffer_sz) {

    std::ostringstream ss;
    ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__
       << "[CPER] buffer_sz: " << buffer_sz;
    LOG_DEBUG(ss);

    std::vector<const amdsmi_cper_hdr_t *> headers;
    if(!buffer) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__
           << "[CPER] buffer is null";
        LOG_ERROR(ss);
        return headers;
    }
    static constexpr char cper_signature[] = "CPER";
    static constexpr size_t cper_signature_size = sizeof(cper_signature) - 1;
    for(size_t data_idx = 0;
        buffer_sz >= cper_signature_size &&
        data_idx < buffer_sz - cper_signature_size;
        ++data_idx) {

        const amdsmi_cper_hdr_t *hdr = reinterpret_cast<const amdsmi_cper_hdr_t *>(
            &buffer[data_idx]);
        if(hdr->signature[0] != 'C' || hdr->signature[1] != 'P' ||
           hdr->signature[2] != 'E' || hdr->signature[3] != 'R' ) {
            continue;
        }
        if(hdr->signature_end != 0xFFFFFFFF) {
            continue;
        }
        if(hdr->record_length > buffer_sz) {
            continue;
        }
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__
           << "[CPER] add header at data_idx: " << data_idx
           << ", sig: " << hdr->signature[0] << hdr->signature[1] << hdr->signature[2] << hdr->signature[3];
        LOG_DEBUG(ss);
        headers.emplace_back(hdr);
    }
    return headers;
}

struct CperFileCtx {
    amdsmi_status_t status = AMDSMI_STATUS_FILE_ERROR;
    std::unique_ptr<char[]> buffer;
    long file_size = 0;
};

static auto amdsmi_read_cper_file(const std::string &filepath) {

    std::ostringstream ss;

    CperFileCtx ctx;
    ctx.status = AMDSMI_STATUS_FILE_ERROR;
    ctx.file_size = 0;

    struct stat file_stats;
    if (stat(filepath.c_str(), &file_stats) == 0) {
        if (!S_ISREG(file_stats.st_mode)) {
            ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] file is not a regular file: "
               << filepath << ", errno: " << errno << "): " << strerror(errno);
            return ctx;
        }
    } else {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] file does not exist: "
           << filepath << ", errno: " << errno << "): " << strerror(errno);
        ctx.status = AMDSMI_STATUS_FILE_NOT_FOUND;
        return ctx;
    }

    ctx.file_size = file_stats.st_size;
    ctx.buffer = std::make_unique<char[]>(ctx.file_size);
    int file = open(filepath.c_str(), O_RDONLY);
    if (file == -1) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] failed to open file: "
           << filepath << ", errno:()" << errno << "): " << strerror(errno);
        LOG_ERROR(ss);
        return ctx;
    }
    long bytes_read = read(file, ctx.buffer.get(), ctx.file_size);
    if (bytes_read <= 0) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__
           << "[CPER] failed to read complete file, read only "
           << bytes_read << " of " << ctx.file_size << " bytes";
        LOG_ERROR(ss);
        return ctx;
    }
    close(file);

    ctx.status = AMDSMI_STATUS_SUCCESS;
    ctx.file_size = bytes_read;
    return ctx;
}
}//namespace

amdsmi_status_t
amdsmi_get_gpu_cper_entries_by_path(
    const std::string &amdgpu_ring_cper_file,
    uint32_t severity_mask,
    char *cper_data,
    uint64_t *buf_size,
    amdsmi_cper_hdr_t **cper_hdrs,
    uint64_t *entry_count,
    uint64_t *cursor) {

    std::ostringstream ss;
    ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] begin\n"
       << ", amdgpu_ring_cper_file: " << amdgpu_ring_cper_file
       << ", severity_mask: " << severity_mask;
    LOG_DEBUG(ss);

    if(!cper_data) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cper_data should be a valid memory address\n";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!buf_size) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] buf_size should be a valid memory address";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!*buf_size) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] buf_size should be greater than zero";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!cper_hdrs) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cper_hdrs should be a valid memory address";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!entry_count) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] entry_count should be a valid memory address";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!*entry_count) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] entry_count should be greater than 0";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }
    else if(!cursor) {
        ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cursor should be a valid memory address";
        LOG_ERROR(ss);
        if(entry_count) {*entry_count = 0;}
        if(buf_size) { *buf_size = 0; }
        return AMDSMI_STATUS_OUT_OF_RESOURCES;
    }

    auto ctx = amdsmi_read_cper_file(amdgpu_ring_cper_file);
    if(ctx.status != AMDSMI_STATUS_SUCCESS) {
        return ctx.status;
    }

    auto headers = amdsmi_get_gpu_cper_headers(ctx.buffer.get(), ctx.file_size);
    ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] num headers: " << headers.size();
    LOG_DEBUG(ss);

    uint64_t data_idx = 0;
    uint64_t header_idx = 0;
    size_t num_headers_copied = 0;
    for(const amdsmi_cper_hdr_t *header: headers) {
        if(((1 << header->error_severity) & severity_mask) !=
            (1 << header->error_severity)) {
            ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cper header rejected with severity: 0x"
               << std::hex << (1 << header->error_severity) << ", given severity_mask: 0x"
               << std::hex << severity_mask << ", record_length:"
               << std::dec << header->record_length;
            LOG_DEBUG(ss);
            continue;
        }
        else {
            ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cper header accepted with severity: 0x"
               << std::hex << (1 << header->error_severity) << ", given severity_mask: 0x"
               << std::hex << severity_mask << ", record_length:"
               << std::dec << header->record_length;
            LOG_DEBUG(ss);
        }
        if((*buf_size - data_idx) < header->record_length ) {
            ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] buffer filled up without copying all cper entries, buf_size: " << std::dec << *buf_size;
            LOG_ERROR(ss);
            *entry_count = num_headers_copied;
            *buf_size = data_idx;
            return (data_idx == 0) ?
                AMDSMI_STATUS_OUT_OF_RESOURCES :
                AMDSMI_STATUS_MORE_DATA;
        }
        if(num_headers_copied == *entry_count) {
            ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__ << "[CPER] cper_hdrs filled up before finished with copying all header pointers, entry_count: " << std::dec << *entry_count;
            LOG_ERROR(ss);
            *entry_count = num_headers_copied;
            *buf_size = data_idx;
            return (data_idx == 0) ?
                AMDSMI_STATUS_OUT_OF_RESOURCES :
                AMDSMI_STATUS_MORE_DATA;
        }
        if(*cursor != header_idx) {
            ++header_idx;
            continue;
        }
        cper_hdrs[num_headers_copied] = reinterpret_cast<amdsmi_cper_hdr_t*>(&cper_data[data_idx]);
        ++num_headers_copied;
        *cursor = ++header_idx;
        std::memcpy(
            &cper_data[data_idx],
            reinterpret_cast<const char*>(header),
            header->record_length);
        data_idx += header->record_length;
    }
    *entry_count = num_headers_copied;
    *buf_size = data_idx;

    ss << __PRETTY_FUNCTION__ << "\n:" << __LINE__
       << "[CPER] *entry_count: " << (entry_count ? *entry_count : -1)
       << ", *cursor: " << (cursor ? *cursor : -1)
       << ", *buf_size: " << (buf_size ? *buf_size : -1);

    LOG_DEBUG(ss);
    return AMDSMI_STATUS_SUCCESS;
}

amdsmi_status_t
amdsmi_get_gpu_cper_entries(
    amdsmi_processor_handle processor_handle,
    uint32_t severity_mask,
    char *cper_data,
    uint64_t *buf_size,
    amdsmi_cper_hdr_t **cper_hdrs,
    uint64_t *entry_count,
    uint64_t *cursor) {

    AMDSMI_CHECK_INIT();
    if (!amd::smi::is_sudo_user()) {
        return AMDSMI_STATUS_NO_PERM;
    }

    amd::smi::AMDSmiGPUDevice* gpu_device = nullptr;
    amdsmi_status_t status = get_gpu_device_from_handle(processor_handle, &gpu_device);
    if (status != AMDSMI_STATUS_SUCCESS) {
        return status;
    }
    std::string path = std::string("/sys/kernel/debug/dri/") +
        std::to_string(gpu_device->get_card_from_bdf()) +
        "/amdgpu_ring_cper";


    return amdsmi_get_gpu_cper_entries_by_path(
        path,
        severity_mask,
        cper_data,
        buf_size,
        cper_hdrs,
        entry_count,
        cursor);
}

amdsmi_status_t
amdsmi_get_gpu_process_list(amdsmi_processor_handle processor_handle, uint32_t *max_processes, amdsmi_proc_info_t *list) {
    AMDSMI_CHECK_INIT();
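Review note on the header scan in the hunk above: `amdsmi_get_gpu_cper_headers()` casts every byte offset to an `amdsmi_cper_hdr_t *`, while its loop bound only guarantees that the four signature bytes remain in the buffer, and `record_length` is compared against the whole buffer size rather than the bytes left after the offset. The following is a minimal sketch of a stricter variant, not the commit's code. The names `find_cper_offsets` and `make_record` are hypothetical, and the field offsets (6 for `signature_end`, 20 for `record_length`) are assumptions based on a packed header layout.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

namespace sketch {

// Assumed byte offsets inside a packed CPER header (not taken from the commit).
constexpr std::size_t kSigEndOffset = 6;      // signature_end
constexpr std::size_t kRecordLenOffset = 20;  // record_length
constexpr std::size_t kMinHeaderBytes = 24;   // bytes needed to read both fields

// Returns the byte offsets of records whose header fields and whole record
// fit inside the remaining part of the buffer.
inline std::vector<std::size_t> find_cper_offsets(const char *buf, std::size_t size) {
    std::vector<std::size_t> offsets;
    if (buf == nullptr || size < kMinHeaderBytes) {
        return offsets;
    }
    for (std::size_t i = 0; i + kMinHeaderBytes <= size; ++i) {
        if (std::memcmp(buf + i, "CPER", 4) != 0) {
            continue;
        }
        // memcpy avoids unaligned loads through a casted struct pointer.
        std::uint32_t sig_end = 0;
        std::uint32_t rec_len = 0;
        std::memcpy(&sig_end, buf + i + kSigEndOffset, sizeof(sig_end));
        std::memcpy(&rec_len, buf + i + kRecordLenOffset, sizeof(rec_len));
        if (sig_end != 0xFFFFFFFFu) {
            continue;
        }
        // Compare against the bytes left after offset i, not the whole buffer.
        if (rec_len < kMinHeaderBytes || rec_len > size - i) {
            continue;
        }
        offsets.push_back(i);
    }
    return offsets;
}

// Builds a zero-filled synthetic record of rec_len bytes for experiments.
inline std::vector<char> make_record(std::uint32_t rec_len) {
    assert(rec_len >= kMinHeaderBytes);
    std::vector<char> rec(rec_len, 0);
    const std::uint32_t sig_end = 0xFFFFFFFFu;
    std::memcpy(rec.data(), "CPER", 4);
    std::memcpy(rec.data() + kSigEndOffset, &sig_end, sizeof(sig_end));
    std::memcpy(rec.data() + kRecordLenOffset, &rec_len, sizeof(rec_len));
    return rec;
}

}  // namespace sketch
```

Returning offsets instead of pointers keeps the result valid if the caller later moves the buffer. Whether the extra checks are needed in practice depends on how the kernel fills `amdgpu_ring_cper`, which this diff does not show.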
@@ -54,8 +54,6 @@ include_directories(${TEST} ${CMAKE_CURRENT_SOURCE_DIR}/.. ${ROCM_INC_DIR}/..)

# Build rules
add_executable(${TEST} ${tstSources} ${functionalSources})

#AMD_SMI_TARGET?
target_link_libraries(${TEST}
    ${AMD_SMI_TARGET}
    GTest::gtest_main
@@ -63,6 +61,9 @@ target_link_libraries(${TEST}
    stdc++
    pthread)

target_compile_definitions(${TEST} PRIVATE
    CPER_SYS_ROOT="${CMAKE_CURRENT_SOURCE_DIR}/cper")

# Install tests
install(
    TARGETS ${TEST}
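The `target_compile_definitions` call above injects `CPER_SYS_ROOT` as a string literal, and the tests in this commit concatenate it with the debugfs-style path of a card. A minimal sketch of that path construction follows. The helper name `cper_fixture_path` and the `"."` fallback are assumptions added so the snippet compiles outside the build; they are not part of the commit.

```cpp
#include <cassert>
#include <string>

// CPER_SYS_ROOT normally comes from target_compile_definitions; this fallback
// is an assumption so the sketch builds on its own.
#ifndef CPER_SYS_ROOT
#define CPER_SYS_ROOT "."
#endif

// Hypothetical helper: mirrors the string concatenation repeated in each test.
inline std::string cper_fixture_path(const std::string &root, unsigned card) {
    return root + "/sys/kernel/debug/dri/" + std::to_string(card) +
           "/amdgpu_ring_cper";
}
```

For example, `cper_fixture_path(CPER_SYS_ROOT, 9)` would name the fixture that most of the tests read. Factoring the concatenation into one helper is only a suggestion; the commit repeats the expression inline.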
@@ -0,0 +1,365 @@
/*
 * Copyright (c) Advanced Micro Devices, Inc. All rights reserved.
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to deal
 * in the Software without restriction, including without limitation the rights
 * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
 * copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in
 * all copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 * THE SOFTWARE.
 */

#include <cstdint>
#include <gtest/gtest.h>

#include "amd_smi/amdsmi.h"
#include "rocm_smi/rocm_smi_logger.h"

extern amdsmi_status_t
amdsmi_get_gpu_cper_entries_by_path(
    const std::string &amdgpu_ring_cper_file,
    uint32_t severity_mask,
    char *cper_data,
    uint64_t *buf_size,
    amdsmi_cper_hdr_t **cper_hdrs,
    uint64_t *entry_count,
    uint64_t *cursor);

class CperEntriesTest : public testing::Test{
    //class object public so that it is accessible
    //within the tests that are written
    public:
        CperEntriesTest() {
            setenv("CPER_SYS_ROOT", CPER_SYS_ROOT, 1);
            ROCmLogging::Logger::getInstance()->
                updateLogLevel(ROCmLogging::LogLevel::LOG_LEVEL_DEBUG);
        }
};

TEST_F(CperEntriesTest, TestNullCperData){
    uint32_t gpu_num = 9;
    uint32_t severity_mask = amdsmi_cper_sev_t::AMDSMI_CPER_SEV_FATAL;
    char *cper_data = nullptr;
    uint64_t buf_size = 0;
    amdsmi_cper_hdr_t *cper_hdrs = nullptr;
    uint64_t entry_count = 0;
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data,
        nullptr,
        &cper_hdrs,
        &entry_count,
        &cursor);
    ASSERT_EQ(err, AMDSMI_STATUS_OUT_OF_RESOURCES);
}

TEST_F(CperEntriesTest, TestNullBufferSize){
    uint32_t gpu_num = 9;
    uint32_t severity_mask = amdsmi_cper_sev_t::AMDSMI_CPER_SEV_FATAL;
    uint64_t buf_size = 0;
    auto cper_data = std::make_unique<char[]>(buf_size);
    amdsmi_cper_hdr_t *cper_hdrs = nullptr;
    uint64_t entry_count = 0;
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        nullptr,
        &cper_hdrs,
        &entry_count,
        &cursor);
    ASSERT_EQ(err, AMDSMI_STATUS_OUT_OF_RESOURCES);
}

TEST_F(CperEntriesTest, TestNullCperHeaders){
    uint32_t gpu_num = 9;
    uint32_t severity_mask = amdsmi_cper_sev_t::AMDSMI_CPER_SEV_FATAL;
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    amdsmi_cper_hdr_t *cper_hdrs = nullptr;
    uint64_t entry_count = 0;
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        &cper_hdrs,
        &entry_count,
        &cursor);
    ASSERT_EQ(err, AMDSMI_STATUS_OUT_OF_RESOURCES);
}

TEST_F(CperEntriesTest, TestNullCperHeaderEntryCount){
    uint32_t gpu_num = 9;
    uint32_t severity_mask = amdsmi_cper_sev_t::AMDSMI_CPER_SEV_FATAL;
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 0;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        nullptr,
        &cursor);
    ASSERT_EQ(err, AMDSMI_STATUS_OUT_OF_RESOURCES);
}

TEST_F(CperEntriesTest, TestNotEnoughBufferSize){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED|
        AMDSMI_CPER_SEV_NON_FATAL_CORRECTED|
        AMDSMI_CPER_SEV_FATAL;
    uint64_t buf_size = 1024;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(err, AMDSMI_STATUS_MORE_DATA);
    ASSERT_EQ(entry_count, 2);
}

TEST_F(CperEntriesTest, TestNotEnoughHeaderPtrs){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED|
        AMDSMI_CPER_SEV_NON_FATAL_CORRECTED|
        AMDSMI_CPER_SEV_FATAL;
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 4;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 4);
    ASSERT_EQ(err, AMDSMI_STATUS_MORE_DATA);
}

TEST_F(CperEntriesTest, TestGetsAllSeverityErrors){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED)|
        (1 << AMDSMI_CPER_SEV_NON_FATAL_CORRECTED)|
        (1 << AMDSMI_CPER_SEV_FATAL);
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 8);
    ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
}

TEST_F(CperEntriesTest, TestGetsCorrectableSeverityErrors){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_NON_FATAL_CORRECTED);
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 1);
    ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
}

TEST_F(CperEntriesTest, TestGetsFatalSeverityErrors){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_FATAL);
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 1);
    ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
}

TEST_F(CperEntriesTest, TestGetsUncorrectableSeverityErrors){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED);
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 0;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 6);
    ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
}

TEST_F(CperEntriesTest, TestCursor5GetsLast3HeadersGivenTotal8Headers){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED)|
        (1 << AMDSMI_CPER_SEV_NON_FATAL_CORRECTED)|
        (1 << AMDSMI_CPER_SEV_FATAL);
    uint64_t buf_size = 4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
    uint64_t cursor = 5;
    std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
    amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
        gpu,
        severity_mask,
        cper_data.get(),
        &buf_size,
        cper_hdrs.get(),
        &entry_count,
        &cursor);
    ASSERT_EQ(entry_count, 3);
    ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
}

TEST_F(CperEntriesTest, TestCursorAdvances){
    uint32_t gpu_num = 9;
    uint32_t severity_mask =
        (1 << AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED)|
        (1 << AMDSMI_CPER_SEV_NON_FATAL_CORRECTED)|
        (1 << AMDSMI_CPER_SEV_FATAL);
    uint64_t buf_size = 512;//4 * (1<<20); //4 MB;
    auto cper_data = std::make_unique<char[]>(buf_size);
    uint64_t entry_count = 10;
    auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);

    uint64_t buf_size_original = buf_size;
    uint64_t entry_count_original = entry_count;
    uint64_t cursor_idx = 0;
    uint64_t cursor = 0;
    while(true) {
        std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
        amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
            gpu,
            severity_mask,
            cper_data.get(),
            &buf_size,
            cper_hdrs.get(),
            &entry_count,
            &cursor);
        ASSERT_EQ(entry_count, 1);
        ASSERT_EQ(cursor, ++cursor_idx);
        ASSERT_TRUE(err == AMDSMI_STATUS_MORE_DATA || err == AMDSMI_STATUS_SUCCESS);
        if(err == AMDSMI_STATUS_SUCCESS) {
            break;
        }
        buf_size = buf_size_original;
        entry_count = entry_count_original;
    }
}

TEST_F(CperEntriesTest, TestGetsCorrectHeaderCountFromAllDevices) {
    //we can get these deviceids by calling:
    // ls -alh tests/amd_smi_test/cper/sys/kernel/debug/dri/
    static constexpr int deviceids[] = { 1, 9, 17, 25, 33};
    //we can get the numbers in the expected_num_headers array below by calling:
    // hexdump -C tests/amd_smi_test/cper/sys/kernel/debug/dri/<deviceid>/amdgpu_ring_cper | grep CPER|wc -l
    // where <deviceid> is one of the entries in the deviceids array above.
    static constexpr int expected_num_headers[] = { 19, 8, 7, 4, 7};

    for(int device_idx = 0;
        device_idx < sizeof(deviceids)/sizeof(deviceids[0]);
        ++device_idx) {

        uint32_t gpu_num = deviceids[device_idx];
        uint32_t severity_mask =
            (1 << AMDSMI_CPER_SEV_NON_FATAL_UNCORRECTED)|
            (1 << AMDSMI_CPER_SEV_NON_FATAL_CORRECTED)|
            (1 << AMDSMI_CPER_SEV_FATAL);
        uint64_t buf_size = 4 * (1<<20); //4 MB;
        auto cper_data = std::make_unique<char[]>(buf_size);
        uint64_t entry_count = 20;
        auto cper_hdrs = std::make_unique<amdsmi_cper_hdr_t*[]>(entry_count);
        uint64_t cursor = 0;

        std::string gpu = std::string(CPER_SYS_ROOT) + "/sys/kernel/debug/dri/" + std::to_string(gpu_num) + "/amdgpu_ring_cper";
        amdsmi_status_t err = amdsmi_get_gpu_cper_entries_by_path(
            gpu,
            severity_mask,
            cper_data.get(),
            &buf_size,
            cper_hdrs.get(),
            &entry_count,
            &cursor);
        ASSERT_EQ(err, AMDSMI_STATUS_SUCCESS);
        ASSERT_EQ(entry_count, expected_num_headers[device_idx]);
        ASSERT_EQ(cursor, expected_num_headers[device_idx]);
    }
}
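The tests above exercise two conventions of `amdsmi_get_gpu_cper_entries_by_path()`: the implementation checks `1 << error_severity` against `severity_mask`, so most tests build the mask from shifted bits, and a caller resumes after `AMDSMI_STATUS_MORE_DATA` by passing the returned cursor back in with `buf_size` and `entry_count` reset. The sketch below mimics that flow against an in-memory record list rather than the real library. All names in it (`Severity`, `Record`, `fetch_page`, `fetch_all`) are hypothetical, the numeric severity values are assumptions, and the bookkeeping is simplified compared with the implementation in the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

namespace sketch {

// Hypothetical stand-ins for amdsmi_cper_sev_t; the numeric values are assumed.
enum Severity : std::uint32_t {
    kNonFatalUncorrected = 0,
    kFatal = 1,
    kNonFatalCorrected = 2
};

// One mask bit per severity, matching the (1 << severity) test in the diff.
constexpr std::uint32_t severity_bit(Severity s) { return 1u << s; }

struct Record {
    Severity sev;
    std::uint64_t length;  // record size in bytes
};

enum class Status { Success, MoreData, OutOfResources };

// Mock of one call: delivers mask-matching records starting at *cursor until
// the byte budget or the entry budget runs out, and advances *cursor.
inline Status fetch_page(const std::vector<Record> &records, std::uint32_t mask,
                         std::uint64_t buf_size, std::uint64_t max_entries,
                         std::uint64_t *cursor, std::vector<Record> *out) {
    out->clear();
    std::uint64_t used = 0;
    std::uint64_t matched = 0;  // index among records that pass the mask
    for (const Record &r : records) {
        if ((severity_bit(r.sev) & mask) == 0) continue;
        if (matched++ < *cursor) continue;  // delivered by an earlier call
        if (r.length > buf_size - used || out->size() == max_entries) {
            return out->empty() ? Status::OutOfResources : Status::MoreData;
        }
        out->push_back(r);
        used += r.length;
        ++*cursor;
    }
    return Status::Success;
}

// Caller-side loop: keep requesting pages while the mock reports MoreData.
inline std::vector<Record> fetch_all(const std::vector<Record> &records,
                                     std::uint32_t mask, std::uint64_t buf_size,
                                     std::uint64_t max_entries) {
    std::vector<Record> all;
    std::uint64_t cursor = 0;
    for (;;) {
        std::vector<Record> page;
        const Status st = fetch_page(records, mask, buf_size, max_entries, &cursor, &page);
        all.insert(all.end(), page.begin(), page.end());
        if (st != Status::MoreData) break;
    }
    return all;
}

}  // namespace sketch
```

With 400-byte records and a 512-byte budget, each page holds one record, which is the situation `TestCursorAdvances` sets up with its 512-byte buffer. Note that `TestNotEnoughBufferSize` and `TestNotEnoughHeaderPtrs` OR the raw enum values without shifting, so they may select a different set of severities than the shifted masks used elsewhere.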
Binary files not shown (8 files).
@@ -105,7 +105,6 @@ static void RunGenericTest(TestBase *test) {
  RunCustomTestEpilog(test);
}


// TEST ENTRY TEMPLATE:
// TEST(rocrtst, Perf_<test name>) {
//  <Test Implementation class> <test_obj>;