* fix links to python apis
* add links to repo for example code
* fix `WARNING: Pygments lexer name is not known`
Signed-off-by: Peter Park <Peter.Park@amd.com>
[ROCm/amdsmi commit: 5d0a39fa9d]
Reliability, availability, serviceability (RAS)
RAS aims to increase the robustness of a system by detecting hardware errors, recording them, and correcting them where possible. See Reliability, availability, serviceability (Linux kernel) for more general information.
ECC
ECC (error-correcting code) memory automatically detects memory errors and corrects them where it can. Correctable 1-bit errors are handled by the ECC logic and logged by the hardware. Uncorrectable 2-bit errors can be detected but not reliably fixed; this is a more serious event that must be reported. See RAS Error Count sysfs Interface to learn how AMD SMI accesses error counts.
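The difference between a correctable 1-bit error and a detectable but uncorrectable 2-bit error can be illustrated with a toy SECDED (single-error-correct, double-error-detect) code. The sketch below uses an extended Hamming(8,4) code over a 4-bit value. It is only an illustration of the principle: the function names are hypothetical, and real accelerator memory uses wider codes implemented in hardware.

```python
from typing import List, Optional, Tuple


def encode(nibble: int) -> List[int]:
    """Encode 4 data bits as an 8-bit SECDED codeword (list index = bit position)."""
    code = [0] * 8
    # Data bits live at positions 3, 5, 6, 7; parity bits at positions 1, 2, 4.
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    # Position 0 holds an overall parity bit covering positions 1..7.
    code[0] = sum(code[1:]) % 2
    return code


def decode(code: List[int]) -> Tuple[str, Optional[int]]:
    """Return (status, value); status is 'ok', 'corrected', or 'uncorrectable'."""
    # The syndrome is the XOR of the positions of all set bits (0 for a valid word).
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    overall_parity = sum(code) % 2

    if syndrome == 0 and overall_parity == 0:
        status = "ok"
    elif overall_parity == 1:
        # An odd number of flips: assume one, and the syndrome names its position
        # (syndrome 0 means the overall parity bit itself flipped).
        code = list(code)
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even overall parity: two bits flipped.
        return "uncorrectable", None

    value = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, value


word = encode(0b1011)

one_flip = list(word)
one_flip[5] ^= 1          # single-bit error: corrected transparently

two_flips = list(word)
two_flips[5] ^= 1
two_flips[2] ^= 1         # double-bit error: detected, not fixable

print(decode(word))
print(decode(one_flip))
print(decode(two_flips))
```

In this toy model, the "corrected" outcome corresponds to what a correctable error counter would record, and the "uncorrectable" outcome to an event that has to be reported upward.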
While ECC is a mechanism to handle different errors, CPER is the standard used to report that the event occurred.
CPER
CPER (Common Platform Error Record) is a standard format, defined in the UEFI specification, for reporting errors to the operating system. It works as a standard error report template that different hardware components can fill out when something goes wrong. A record consists of a header and one or more section descriptors; each descriptor has an associated section containing error or informational data. See CPER (UEFI Specification) for more information.
A CPER record contains information that is vital for diagnostics, such as:
- Error source
- Error type
- Error severity
  - 0 - Recoverable (also called non-fatal uncorrected)
  - 1 - Fatal
  - 2 - Corrected
  - 3 - Informational
- Timestamp
- Other data
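The severity values listed above map naturally onto an enumeration. The following is a minimal sketch: the class name and the `describe_severity` helper are hypothetical and are not part of the AMD SMI API.

```python
from enum import IntEnum


class CperSeverity(IntEnum):
    """CPER error severity codes, using the values listed above."""
    RECOVERABLE = 0      # also called non-fatal uncorrected
    FATAL = 1
    CORRECTED = 2
    INFORMATIONAL = 3


def describe_severity(raw: int) -> str:
    """Return a readable label for a raw severity value (hypothetical helper)."""
    try:
        return CperSeverity(raw).name.lower()
    except ValueError:
        # Values outside 0..3 are not listed above; report them as-is.
        return f"unknown({raw})"


for raw in (0, 1, 2, 3, 7):
    print(raw, describe_severity(raw))
```

Note that the numeric order is not a severity ranking: `CORRECTED` (2) is less serious than `RECOVERABLE` (0) or `FATAL` (1), so code should compare against named members, not numeric magnitude.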
A CPER record might contain an AFID in its data to help map a complex error to a more actionable service task.
AFID
An AFID (AMD Field ID) is a unique numerical ID associated with a specific event or error produced by AMD Instinct accelerators. It provides a specific identifier for a known condition, which helps facilitate root cause analysis. Each AFID is associated with category, type, and severity fields. See AFID Event List for more information.
From concept to action
AMD SMI provides tools to programmatically monitor and manage these RAS features.
:::::{tab-set}
::::{tab-item} C/C++
The AMD SMI library provides APIs to query ECC error counts and manage CPER records (list, decode, and clear).

See ECC information and RAS information for available APIs.
::::

::::{tab-item} Python
See related APIs:
::::

::::{tab-item} amd-smi CLI

See `amd-smi ras --help` for details and available options.

amd-smi ras --help
::::
:::::
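As a minimal sketch of the Python route, the example below reads total ECC error counts for each GPU using the `amdsmi` package. It assumes the package is installed with a working driver, and that `amdsmi_get_gpu_total_ecc_count` returns a dictionary with `correctable_count` and `uncorrectable_count` keys. Check the AMD SMI Python API reference for the exact fields in your release. The wrapper function name is hypothetical, and it returns an empty list when the library or hardware is unavailable.

```python
def read_total_ecc_counts() -> list:
    """Return one dict of ECC totals per GPU, or [] if AMD SMI is unavailable."""
    try:
        import amdsmi  # imported lazily so the sketch also loads on machines without it
    except (ImportError, OSError):
        return []

    try:
        amdsmi.amdsmi_init()
    except amdsmi.AmdSmiException:
        return []

    results = []
    try:
        for index, handle in enumerate(amdsmi.amdsmi_get_processor_handles()):
            try:
                counts = amdsmi.amdsmi_get_gpu_total_ecc_count(handle)
            except amdsmi.AmdSmiException:
                # ECC reporting may be unsupported or disabled on this device.
                continue
            results.append({
                "gpu": index,
                "correctable": counts.get("correctable_count"),
                "uncorrectable": counts.get("uncorrectable_count"),
            })
    except amdsmi.AmdSmiException:
        pass
    finally:
        amdsmi.amdsmi_shut_down()
    return results


for entry in read_total_ecc_counts():
    print(entry)
```

A monitoring script might poll this periodically and raise an alert when the uncorrectable count increases, since that is the case the hardware cannot repair on its own.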