Merge amd-dev into amd-master 20240927

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6559163305fa1967c3d9105f2d45df9063a02f74
This commit is contained in:
Maisam Arif
2024-09-27 18:55:37 -05:00
33 changed files with 5188 additions and 2575 deletions
+2
@@ -17,6 +17,8 @@ include/amd_smi/amd_smi64Config.h
include/amd_smi/amd_smiConfig.h
rocm_smi/include/rocm_smi/rocm_smi64Config.h
docs/*.pdf
goamdsmi_shim/include/goamdsmi_shimConfig.h
goamdsmi_shim/include/goamdsmi_shim64Config.h
# Byte-compiled / optimized / DLL files
__pycache__/
+363 -86
@@ -8,17 +8,308 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
### Changes
- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**.
Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery:
- `uint64_t accumulation_counter` - used for all throttled calculations
- `uint64_t prochot_residency_acc` - Processor hot accumulator
- `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations)
- `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations)
- `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator
- `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator
- `uint16_t num_partition` - corresponds to the current total number of partitions
- `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with current GPU, provides gfx busy & accumulators, jpeg, and decoder (VCN) engine utilizations
- `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphic engine utilization (%)
- `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - jpeg engine utilization (%)
- `uint16_t vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%)
- `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - graphic engine utilization accumulated (%)
- `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter
- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**.
***Only available for MI300+ ASICs.***
Users can now retrieve violation statuses through either our Python or C++ APIs. Additionally, we have
added the capability to view these outputs conveniently through `amd-smi metric --throttle` and `amd-smi monitor --violation`.
Example outputs are listed below (below is for reference, output is subject to change):
```shell
$ amd-smi metric --throttle
GPU: 0
THROTTLE:
ACCUMULATION_COUNTER: 1226415116
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 12
SOCKET_THERMAL_ACCUMULATED: 0
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE
PPT_VIOLATION_ACTIVE: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
PROCHOT_VIOLATION_PERCENT: 0 %
PPT_VIOLATION_PERCENT: 0 %
SOCKET_THERMAL_VIOLATION_PERCENT: 0 %
VR_THERMAL_VIOLATION_PERCENT: 0 %
HBM_THERMAL_VIOLATION_PERCENT: 0 %
GPU: 1
THROTTLE:
ACCUMULATION_COUNTER: 1226415121
PROCHOT_ACCUMULATED: 0
PPT_ACCUMULATED: 12
SOCKET_THERMAL_ACCUMULATED: 0
VR_THERMAL_ACCUMULATED: 0
HBM_THERMAL_ACCUMULATED: 0
PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE
PPT_VIOLATION_ACTIVE: NOT ACTIVE
SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE
PROCHOT_VIOLATION_PERCENT: 0 %
PPT_VIOLATION_PERCENT: 0 %
SOCKET_THERMAL_VIOLATION_PERCENT: 0 %
VR_THERMAL_VIOLATION_PERCENT: 0 %
HBM_THERMAL_VIOLATION_PERCENT: 0 %
...
```
```shell
$ amd-smi monitor --violation
GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL
0 0 % 0 % 0 % 0 % 0 %
1 0 % 0 % 0 % 0 % 0 %
2 0 % 0 % 0 % 0 % 0 %
3 0 % 0 % 0 % 0 % 0 %
4 0 % 0 % 0 % 0 % 0 %
5 0 % 0 % 0 % 0 % 0 %
6 0 % 0 % 0 % 0 % 0 %
7 0 % 0 % 0 % 0 % 0 %
8 0 % 0 % 0 % 0 % 0 %
9 0 % 0 % 0 % 0 % 0 %
10 0 % 0 % 0 % 0 % 0 %
11 0 % 0 % 0 % 0 % 0 %
12 0 % 0 % 0 % 0 % 0 %
13 0 % 0 % 0 % 0 % 0 %
14 0 % 0 % 0 % 0 % 0 %
15 0 % 0 % 0 % 0 % 0 %
...
```
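The `*_VIOLATION_PERCENT` fields relate each residency accumulator to `accumulation_counter`. The following is a minimal Python sketch of one plausible way to derive such a percentage from two samples. It assumes (the changelog does not confirm this) that the percentage is the accumulator delta divided by the counter delta. The function name `violation_percent` and the sample values are hypothetical.

```python
def violation_percent(prev_acc, cur_acc, prev_counter, cur_counter):
    """Return the share of counter ticks spent in violation, as a percentage.

    Assumption (not stated in the changelog):
        percent = delta(accumulator) / delta(accumulation_counter) * 100
    """
    counter_delta = cur_counter - prev_counter
    if counter_delta <= 0:
        # No new ticks between samples (or the counter was reset).
        return None
    # Guard against an accumulator that went backwards between samples.
    acc_delta = max(cur_acc - prev_acc, 0)
    # Clamp so a noisy sample never reports more than 100 %.
    return min(100.0, 100.0 * acc_delta / counter_delta)


if __name__ == "__main__":
    # Hypothetical samples: 50 of 200 new ticks were spent in violation.
    print(violation_percent(0, 50, 1000, 1200))
```

Returning `None` when the counter has not advanced keeps "no data" distinct from a genuine 0 % reading.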
- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`**.
***Partition specific features are only available on MI300+ ASICs***
Users can now retrieve graphics utilization statistics on a per-XCP (per-partition) basis. Here all XCP activities will be listed,
but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`.
Example outputs are listed below (below is for reference, output is subject to change):
```shell
$ amd-smi metric --usage
GPU: 0
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
GPU: 1
USAGE:
GFX_ACTIVITY: 0 %
UMC_ACTIVITY: 0 %
MM_ACTIVITY: N/A
VCN_ACTIVITY: [0 %, N/A, N/A, N/A]
JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
GFX_BUSY_INST:
XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
JPEG_BUSY:
XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A,
N/A, N/A, N/A]
VCN_BUSY:
XCP_0: [0 %, N/A, N/A, N/A]
XCP_1: [0 %, N/A, N/A, N/A]
XCP_2: [0 %, N/A, N/A, N/A]
XCP_3: [0 %, N/A, N/A, N/A]
XCP_4: [0 %, N/A, N/A, N/A]
XCP_5: [0 %, N/A, N/A, N/A]
XCP_6: [0 %, N/A, N/A, N/A]
XCP_7: [0 %, N/A, N/A, N/A]
GFX_BUSY_ACC:
XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A]
...
```
- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value**.
***Feature is only available on MI300+ ASICs***
Users can now retrieve this value through `amdsmi_get_pcie_info()`, which has an updated structure:
```C
typedef struct {
...
struct pcie_metric_ {
uint16_t pcie_width; //!< current PCIe width
uint32_t pcie_speed; //!< current PCIe speed in MT/s
uint32_t pcie_bandwidth; //!< current instantaneous PCIe bandwidth in Mb/s
uint64_t pcie_replay_count; //!< total number of the replays issued on the PCIe link
uint64_t pcie_l0_to_recovery_count; //!< total number of times the PCIe link transitioned from L0 to the recovery state
uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link
uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device
uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver
uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter
uint64_t reserved[12];
} pcie_metric;
uint64_t reserved[32];
} amdsmi_pcie_info_t;
```
- Example outputs are listed below (below is for reference, output is subject to change):
```shell
$ amd-smi metric --pcie
GPU: 0
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0
GPU: 1
PCIE:
WIDTH: 16
SPEED: 32 GT/s
BANDWIDTH: 18 Mb/s
REPLAY_COUNT: 0
L0_TO_RECOVERY_COUNT: 0
REPLAY_ROLL_OVER_COUNT: 0
NAK_SENT_COUNT: 0
NAK_RECEIVED_COUNT: 0
CURRENT_BANDWIDTH_SENT: N/A
CURRENT_BANDWIDTH_RECEIVED: N/A
MAX_PACKET_SIZE: N/A
LC_PERF_OTHER_END_RECOVERY: 0
...
```
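The structure comment documents `pcie_speed` in MT/s, while the sample CLI output prints `SPEED` in GT/s. A small Python sketch of the unit conversion (1 GT/s = 1000 MT/s) follows. The helper name `format_pcie_speed` is hypothetical, and whether the CLI formats the value exactly this way is an assumption.

```python
MT_PER_GT = 1000  # 1 GT/s equals 1000 MT/s


def format_pcie_speed(speed_mt_s):
    """Format a PCIe speed given in MT/s as a GT/s string.

    Hypothetical helper: the real CLI formatting may differ.
    """
    speed_gt_s = speed_mt_s / MT_PER_GT
    # ':g' drops a trailing '.0', so 32.0 is rendered as '32'.
    return f"{speed_gt_s:g} GT/s"


if __name__ == "__main__":
    print(format_pcie_speed(32000))
```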
- **Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**.
This aligns BDF output with ROCm SMI.
The overview below shows the layout as seen from `rsmi_dev_pci_id_get()`, which now provides the partition ID; see the API documentation for more detail. Previously, bits [31:28] were reserved (right before the domain) and the partition ID was within the function bits.
- bits [63:32] = domain
- bits [31:28] = partition id
- bits [27:16] = reserved
- bits [15: 0] = pci bus/device/function
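The bit layout above can be unpacked with plain shifts and masks. The Python sketch below is illustrative only: `decode_bdf_id` is a hypothetical helper, not part of the amdsmi Python library, and the split of the low 16 bits into bus [15:8], device [7:3], and function [2:0] assumes the conventional PCI encoding.

```python
def decode_bdf_id(bdf_id):
    """Split a 64-bit BDF identifier into its documented fields.

    Layout taken from the changelog:
        bits [63:32] = domain
        bits [31:28] = partition id
        bits [27:16] = reserved
        bits [15: 0] = pci bus/device/function
    Assumption: the low 16 bits follow the conventional PCI encoding
    (bus [15:8], device [7:3], function [2:0]).
    """
    bdf = bdf_id & 0xFFFF
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,
        "partition_id": (bdf_id >> 28) & 0xF,
        "bus": (bdf >> 8) & 0xFF,
        "device": (bdf >> 3) & 0x1F,
        "function": bdf & 0x7,
    }


if __name__ == "__main__":
    # Hypothetical value: domain 0, partition 2, bus 0x23, device 0, function 0.
    example = (2 << 28) | (0x23 << 8)
    print(decode_bdf_id(example))
```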
- **Moved python tests directory path install location**.
- `/opt/<rocm-path>/share/amd_smi/pytest/..` to `/opt/<rocm-path>/share/amd_smi/tests/python_unittest/..`
- On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed.
- Removed pytest dependency, our python testing now only depends on the unittest framework.
- **Added retrieving a set of GPUs that are nearest to a given device at a specific link type level**.
- Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries.
- **Added more supported utilization count types to `amdsmi_get_utilization_count()`**.
- **Added `amd-smi set -L/--clk-limit ...` command**.
- Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency.
Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency.
- **Added Pytest functionality to test amdsmi API calls in Python**.
- **Added unittest functionality to test amdsmi API calls in Python**.
- **Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`**.
- Changes propagate forward into the Python interface as well; however, we are maintaining backwards compatibility and keeping the `power` field in the Python API until ROCm 6.4.
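Callers that must work across this rename can prefer the new field and fall back to the legacy one. The sketch below assumes the Python call returns a dict-like result holding either key. The helper name `read_energy_accumulator` and the sample dictionaries are hypothetical.

```python
def read_energy_accumulator(energy_info):
    """Return the energy accumulator from an energy-count result.

    Assumption: `energy_info` is a dict-like result that carries the new
    `energy_accumulator` key, the legacy `power` key, or both.
    """
    if "energy_accumulator" in energy_info:
        return energy_info["energy_accumulator"]
    # Legacy field name, kept for backwards compatibility until ROCm 6.4.
    return energy_info["power"]


if __name__ == "__main__":
    # Hypothetical results, for illustration only.
    print(read_energy_accumulator({"energy_accumulator": 123456}))
    print(read_energy_accumulator({"power": 123456}))
```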
@@ -27,7 +318,7 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr
- Added `amdsmi_get_gpu_mem_overdrive_level()` function to amd-smi C and Python Libraries.
- **Added retrieving connection type and P2P capabilities between two GPUs**.
- Added `amdsmi_topo_get_p2p_status` function to amd-smi C and Python Libraries.
- Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries.
- Added retrieving P2P link capabilities to CLI `amd-smi topology`.
```shell
@@ -51,7 +342,6 @@ Topology arguments:
ID: 7 | BDF: 0000:df:00.0 | UUID: <redacted>
all | Selects all devices
-a, --access Displays link accessibility between GPUs
-w, --weight Displays relative weight between GPUs
-o, --hops Displays the number of hops between GPUs
@@ -62,7 +352,6 @@ Topology arguments:
-d, --dma Display P2P direct memory access (DMA) link capability between nodes
-z, --bi-dir Display P2P bi-directional link capability between nodes
Command Modifiers:
--json Displays output in JSON format (human readable by default).
--csv Displays output in CSV format (human readable by default).
@@ -117,7 +406,6 @@ BI-DIRECTIONAL TABLE:
0000:bf:00.0 F T T T F F SELF F
0000:df:00.0 T T T F F T F SELF
Legend:
SELF = Current GPU
ENABLED / DISABLED = Link is enabled or disabled
@@ -129,7 +417,7 @@ Legend:
```
- **Created new amdsmi_kfd_info_t and added information under `amd-smi list`**.
- Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers were added in to a new `amdsmi_kfd_info_t` which gets populated via the API `amdsmi_get_gpu_kfd_info`.
- Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers were added in to a new `amdsmi_kfd_info_t` which gets populated via the API `amdsmi_get_gpu_kfd_info()`.
- This info has been added to the `amd-smi list`.
- These new fields are only available for BM/Guest Linux devices at this time.
@@ -137,7 +425,8 @@ Legend:
typedef struct {
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
uint32_t node_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[13];
uint32_t current_partition_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[12];
} amdsmi_kfd_info_t;
```
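The struct comments mark unsupported fields with all-ones sentinel values. A short Python sketch of mapping those sentinels to `None` follows. The helper `normalize_kfd_info` is hypothetical and takes the three raw integers directly, since how the Python binding exposes this struct is not shown here.

```python
U64_NOT_SUPPORTED = 0xFFFFFFFFFFFFFFFF  # sentinel for kfd_id
U32_NOT_SUPPORTED = 0xFFFFFFFF          # sentinel for node_id / current_partition_id


def normalize_kfd_info(kfd_id, node_id, current_partition_id):
    """Replace the documented 'not supported' sentinels with None."""
    return {
        "kfd_id": None if kfd_id == U64_NOT_SUPPORTED else kfd_id,
        "node_id": None if node_id == U32_NOT_SUPPORTED else node_id,
        "current_partition_id": (
            None if current_partition_id == U32_NOT_SUPPORTED
            else current_partition_id
        ),
    }


if __name__ == "__main__":
    # Supported kfd_id and node_id, unsupported partition id.
    print(normalize_kfd_info(45412, 1, 0xFFFFFFFF))
```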
@@ -213,10 +502,10 @@ GPU: 0
TARGET_GRAPHICS_VERSION: gfx942
```
- **Updated partition APIs and struct information and added partition_id to `amd-smi static --partition` & `amd-smi list`**.
- **Updated partition APIs and struct information and added partition_id to `amd-smi static --partition`**.
- As part of an overhaul to partition information, some partition information will be made available in the `amdsmi_accelerator_partition_profile_t`.
- This struct will be filled out by a new API, `amdsmi_get_gpu_accelerator_partition_profile()`.
- Future data from these APIs will eventually get added to `static --partition`.
- Future data from these APIs will eventually get added to `amd-smi partition`.
```C
#define AMDSMI_MAX_ACCELERATOR_PROFILE 32
@@ -257,7 +546,6 @@ typedef union {
uint32_t nps_cap_mask;
} amdsmi_nps_caps_t;
typedef struct {
amdsmi_accelerator_partition_type_t profile_type; // SPX, DPX, QPX, CPX and so on
uint32_t num_partitions; // On MI300X, SPX: 1, DPX: 2, QPX: 4, CPX: 8, length of resources array
@@ -276,21 +564,6 @@ GPU: 0
COMPUTE_PARTITION: CPX
MEMORY_PARTITION: NPS4
PARTITION_ID: 0
$ amd-smi list
GPU: 0
BDF: 0000:23:00.0
UUID: <redacted>
KFD_ID: 45412
NODE_ID: 1
PARTITION_ID: 0
GPU: 1
BDF: 0000:26:00.0
UUID: <redacted>
KFD_ID: 59881
NODE_ID: 2
PARTITION_ID: 0
```
### Removals
@@ -319,7 +592,7 @@ plan to eventually remove partition ID from the function portion of the BDF (Bus
- bits [7:3] = Device
- bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
Previously in non-SPX modes (ex. CPX/TPX/DPX/etc) some MI3x ASICs would not report all logical GPU devices within AMD SMI.
- Previously in non-SPX modes (ex. CPX/TPX/DPX/etc) some MI3x ASICs would not report all logical GPU devices within AMD SMI.
```shell
$ amd-smi monitor -p -t -v
@@ -360,7 +633,7 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL
- **Fixed incorrect implementation of the Python API `amdsmi_get_gpu_metrics_header_info()`**.
- **`amd-smi static --partition` will have updates with additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()`**.
- **`amdsmitst` TestGpuMetricsRead now prints metric in correct units**.
### Known issues
@@ -370,14 +643,18 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL
- **Python API for `amdsmi_get_energy_count()` will deprecate the `power` field in ROCm 6.4 and use `energy_accumulator` field instead**.
- **Added preliminary `amd-smi partition` command**.
- The new partition command can be used to display GPU information, including memory and accelerator partition information.
- The command will be at full functionality once additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()` has been implemented.
## amd_smi_lib for ROCm 6.2.1
### Additions
- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**.
Guest VMs do not support getting current ECC counts from the Host cards.
- **Added `amd-smi static --ras`on Guest VMs**.
- **Added `amd-smi static --ras`on Guest VMs**.
Guest VMs can view enabled/disabled ras features that are on Host cards.
### Optimizations
@@ -390,9 +667,9 @@ Guest VMs can view enabled/disabled ras features that are on Host cards.
- **Updated CLI error strings to handle empty and invalid GPU/CPU inputs**.
- **Fixed Guest VM showing passthrough options**.
- **Fixed Guest VM showing passthrough options**.
- **Fixed firmware formatting where leading 0s were missing**.
- **Fixed firmware formatting where leading 0s were missing**.
### Known Issues
@@ -402,9 +679,9 @@ Guest VMs can view enabled/disabled ras features that are on Host cards.
### Additions
- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**.
- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**.
- **Added optional process table under `amd-smi monitor -q`**.
- **Added optional process table under `amd-smi monitor -q`**.
The monitor subcommand within the CLI Tool now has the `-q` option to enable an optional process table underneath the original monitored output.
```shell
@@ -417,10 +694,10 @@ GPU NAME PID GTT_MEM CPU_MEM VRAM_MEM MEM_USAGE GF
0 rvs 1564865 0.0 B 0.0 B 1.1 GB 0.0 B 0 ns 0 ns
```
- **Added Handling to detect VMs with passthrough configurations in CLI Tool**.
- **Added Handling to detect VMs with passthrough configurations in CLI Tool**.
The CLI Tool had only allowed a restricted set of options for Virtual Machines with passthrough GPUs. Now we offer an expanded set of functions available to passthrough-configured GPUs.
- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**.
- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**.
VMs now have the ability to set process isolation and clear the SRAM from the CLI tool, using the following commands:
```shell
@@ -428,10 +705,10 @@ amd-smi set --process-isolation <0 or 1>
amd-smi reset --clean_local_data
```
- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**.
- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**.
Added macros to reference max size limitations for certain amdsmi functions such as max dpm policies and max fanspeed.
- **Added Ring Hang event**.
- **Added Ring Hang event**.
Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_notification_type_t` enum.
### Optimizations
@@ -443,7 +720,7 @@ $ amd-smi static --asic --gpu 123123
Can not find a device: GPU '123123' Error code: -3
```
- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**.
- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**.
Previously, if a process with elevated permissions was running, amd-smi required sudo to display all output. Now amd-smi will populate all process data and return N/A for elevated process names instead. However, if run with sudo, you will be able to see the name like so:
```shell
@@ -478,10 +755,10 @@ GPU: 0
ENC: 0 ns
```
- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**.
- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**.
Changed the naming to be more accurate to what the function was doing. This change also extends to the CLI where we changed the `clear-sram-data` command to `clean_local_data`.
- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**.
- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locked value. New struct will be in the following format:
```shell
@@ -495,7 +772,7 @@ Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locke
} amdsmi_clk_info_t;
```
- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**.
- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Multiple structures used by APIs were changed for alignment unification:
- Changed the `amdsmi_vram_info_t` field `vram_size_mb` to `vram_size`
- Updated `amdsmi_vram_type_t` to include new enums and added the `AMDSMI` prefix
@@ -503,7 +780,7 @@ Multiple structures used by APIs were changed for alignment unification:
- Added `AMDSMI_PROCESSOR_TYPE` prefix to `processor_type_t` enums
- Removed the fields structure definition in favor for an anonymous definition in `amdsmi_bdf_t`
- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**.
- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**.
Multiple structures used by APIs were changed for alignment unification. `AMDSMI` prefix was added to the following structures:
- Added AMDSMI prefix to `amdsmi_container_types_t` enums
- Added AMDSMI prefix to `amdsmi_clk_type_t` enums
@@ -513,13 +790,13 @@ Multiple structures used by APIs were changed for alignment unification. `AMDSMI
- Added AMDSMI prefix to `amdsmi_temperature_type_t` enums
- Added AMDSMI prefix to `amdsmi_fw_block_t` enums
- **Changed dpm_policy references to soc_pstate**.
- **Changed dpm_policy references to soc_pstate**.
The file structure referenced by dpm_policy changed to soc_pstate, and we have changed the APIs and CLI tool to be in line with the current structure. `amdsmi_get_dpm_policy()` and `amdsmi_set_dpm_policy()` are no longer valid; the new APIs are `amdsmi_get_soc_pstate()` and `amdsmi_set_soc_pstate()`. The CLI tool option has been changed from `--policy` to `--soc-pstate`.
- **Updated `amdsmi_get_gpu_board_info()` product_name to fallback to pciids**.
- **Updated `amdsmi_get_gpu_board_info()` product_name to fallback to pciids**.
Previously on devices without a FRU we would not populate the product name in the `amdsmi_board_info_t` structure, now we will fallback to using the name listed according to the pciids file if available.
- **Updated CLI voltage curve command output**.
- **Updated CLI voltage curve command output**.
The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point or outputs N/A for each curve point if not applicable
```shell
@@ -533,16 +810,16 @@ GPU: 0
POINT_2_VOLTAGE: 1186 mV
```
- **Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`**.
- **Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`**.
Updated sizes that work for retrieving relevant board information across AMD's
ASIC products. This requires users to update any ABIs using this structure.
### Fixes
- **Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool**.
- **Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool**.
When running `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any other subcommand that accesses the GPUs), a mutex would lock and not return, requiring either clearing the mutex in /dev/shm or rebooting the machine.
- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**.
- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**.
Multiple process outputs in the CLI tool were not being registered correctly. The json output did not handle multiple processes and is now in a new valid json format:
```shell
@@ -575,33 +852,33 @@ Multiple process outputs in the CLI tool were not being registered correctly. Th
]
```
- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**.
- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**.
Throttle status may work for older ASICs, but will be replaced with PVIOL and TVIOL metrics for future ASIC support. It remains a field in the gpu_metrics API and in `amd-smi metric --power`.
- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**.
- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**.
Previously, if there was a partial failure to retrieve character strings, we would return
garbage output to users of the API. This fix populates as many values as possible.
For any failure found along the way, `\0` is written to the `amdsmi_board_info_t`
structure's data members that cannot be populated, ensuring empty char string values.
- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**.
- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**.
The parsing of `pp_od_clk_voltage` was not dynamic enough to work with the dropping of voltage curve support on MI series cards. This propagates down to correcting the CLI's output `amd-smi metric --voltage-curve` to N/A if voltage curve is not enabled.
### Known Issues
- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not currently work and will be supported in a future release**.
- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do not currently work and will be supported in a future release**.
## amd_smi_lib for ROCm 6.1.2
### Additions
- **Added process isolation and clean shader APIs and CLI commands**.
- **Added process isolation and clean shader APIs and CLI commands**.
Added APIs and CLI commands to address LeftoverLocals security issues, allowing clearing the SRAM data and setting process isolation on a per-GPU basis. New APIs:
- `amdsmi_get_gpu_process_isolation()`
- `amdsmi_set_gpu_process_isolation()`
- `amdsmi_set_gpu_clear_sram_data()`
- **Added `MIN_POWER` to output of `amd-smi static --limit`**.
- **Added `MIN_POWER` to output of `amd-smi static --limit`**.
This change helps users identify the range to which they can change the power cap of the GPU. The change is added to simplify why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap <value in W>` or `amd-smi reset -g all --power-cap`.
```shell
@@ -633,7 +910,7 @@ GPU: 1
### Optimizations
- **Updated `amd-smi monitor --pcie` output**.
- **Updated `amd-smi monitor --pcie` output**.
The source for pcie bandwidth monitor output was a legacy file we no longer support and was causing delays within the monitor command. The output is no longer using TX/RX but instantaneous bandwidth from gpu_metrics instead; updated output:
```shell
@@ -642,13 +919,13 @@ GPU PCIE_BW
0 26 Mb/s
```
- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**.
- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**.
`amdsmi_get_power_cap_info` now returns values in uW, as originally reported by the driver. Previously `amdsmi_get_power_cap_info` returned values in W, which conflicted with our set operations and modified the values retrieved from the driver. We decided to keep the values returned from the driver untouched (in the original units, uW); the CLI then converts to watts (as previously done, no changes here). Additionally, the driver made updates to the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).
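Because the API now reports microwatts while the CLI shows watts, callers may want a small conversion step. The Python sketch below is illustrative: the helper names are hypothetical, and the treatment of equal `min_power_cap` and `max_power_cap` values as "not adjustable" follows the description above rather than any documented API behavior.

```python
MICROWATTS_PER_WATT = 1_000_000


def microwatts_to_watts(value_uw):
    """Convert a power value from microwatts (uW) to watts (W)."""
    return value_uw / MICROWATTS_PER_WATT


def power_cap_range_watts(min_power_cap_uw, max_power_cap_uw):
    """Return (min_w, max_w, adjustable) for a power cap range given in uW.

    Assumption: when min and max are equal, the cap cannot be changed,
    matching the overdrive-disabled case described in the changelog.
    """
    min_w = microwatts_to_watts(min_power_cap_uw)
    max_w = microwatts_to_watts(max_power_cap_uw)
    return min_w, max_w, min_power_cap_uw != max_power_cap_uw


if __name__ == "__main__":
    # Hypothetical values: a fixed 300 W cap reported in microwatts.
    print(power_cap_range_watts(300_000_000, 300_000_000))
```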
- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**.
- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**.
Previously, calls returned "No bad pages found." if no pages were found; now they always return a list, which can be empty.
- **Updated `amd-smi metric --ecc-blocks` output**.
- **Updated `amd-smi metric --ecc-blocks` output**.
The ECC blocks argument was outputting blocks without counters available; updated the filtering to show only blocks for which counters are available:
``` shell
@@ -685,12 +962,12 @@ GPU: 0
DEFERRED_COUNT: 0
```
- **Removed `amdsmi_get_gpu_process_info` from Python library**.
- **Removed `amdsmi_get_gpu_process_info` from Python library**.
amdsmi_get_gpu_process_info was removed from the C library in an earlier build, but the API was still in the Python interface.
### Fixes
- **Fixed `amd-smi metric --power` now provides power output for Navi2x/Navi3x/MI1x**.
- **Fixed `amd-smi metric --power` now provides power output for Navi2x/Navi3x/MI1x**.
These systems use an older version of gpu_metrics in amdgpu. This fix only updates what CLI outputs.
No change in any of our APIs.
@@ -715,10 +992,10 @@ GPU: 1
THROTTLE_STATUS: UNTHROTTLED
```
- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**.
- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**.
The updates required `amdsmi_get_power_cap_info` to return values in uW, as originally reported by the driver. Previously `amdsmi_get_power_cap_info` returned values in W, which conflicted with our set operations and modified the values retrieved from the driver. We decided to keep the values returned from the driver untouched (in the original units, uW); the CLI then converts to watts (as previously done, no changes here). Additionally, the driver made updates to the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).
- **Fixed Python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**.
- **Fixed Python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**.
Previously, Python interface calls on systems with populated bad pages resulted in a `ValueError: NULL pointer access`. This also fixes the bad-pages CLI subcommand.
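The power cap entry in this list says the API keeps the driver's microwatt values and that conversion to watts happens only at display time. A minimal sketch of that conversion, assuming a hypothetical `power_cap_to_watts` helper and made-up sample values (this is not the CLI's actual code):

```python
MICROWATTS_PER_WATT = 1_000_000


def power_cap_to_watts(cap_uw):
    """Convert a power cap from microwatts to watts for display.

    "N/A" is passed through unchanged so unsupported fields stay readable.
    """
    if cap_uw == "N/A":
        return "N/A"
    return cap_uw / MICROWATTS_PER_WATT


# Made-up values in uW, shaped like the min/max power caps mentioned above.
sample = {"min_power_cap": 203_000_000, "max_power_cap": 203_000_000}
print({name: power_cap_to_watts(value) for name, value in sample.items()})
```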
### Known Issues
@@ -729,7 +1006,7 @@ Previously Python interface calls to populated bad pages resulted in a `ValueErr
### Changes
- **Updated metrics --clocks**.
- **Updated metrics --clocks**.
Output for `amd-smi metric --clock` is updated to reflect each engine and bug fixes for the clock lock status and deep sleep status.
``` shell
@@ -840,7 +1117,7 @@ GPU: 0
DEEP_SLEEP: ENABLED
```
- **Added deferred ecc counts**.
- **Added deferred ecc counts**.
Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks`
```shell
@@ -864,7 +1141,7 @@ GPU: 0
...
```
- **Updated `amd-smi topology --json` to align with host/guest**.
- **Updated `amd-smi topology --json` to align with host/guest**.
Topology's `--json` output has been changed to align with the output on host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for the full list). See the examples shown below.
*Previous format:*
@@ -999,18 +1276,18 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json
### Fixes
- **Fix for GPU reset error on non-amdgpu cards**.
- **Fix for GPU reset error on non-amdgpu cards**.
Previously, our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix
updates the CLI to target only AMD ASICs.
- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**.
- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()` Navi32/31 cards**.
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
- **Fix for `amd-smi process`**.
- **Fix for `amd-smi process`**.
Fixed output results when getting processes running on a device.
- **Improved Error handling for `amd-smi process`**.
- **Improved Error handling for `amd-smi process`**.
Fixed Attribute Error when getting process in csv format
### Known issues
@@ -1021,7 +1298,7 @@ Fixed Attribute Error when getting process in csv format
### Additions
- **Added Monitor Command**.
- **Added Monitor Command**.
Provides users the ability to customize which GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI `rocm-smi` (no argument) and additionally allows users to customize what data is helpful for their use case.
```shell
@@ -1081,7 +1358,7 @@ GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_U
7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB
```
- **Integrated ESMI Tool**.
- **Integrated ESMI Tool**.
Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in `amd-smi static` and `amd-smi metric` commands. Only available for limited target processors. As of ROCm 6.0.2, this is listed as:
- AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh
- AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0-AFh
@@ -1231,7 +1508,7 @@ CPU: 0
RESPONSE: N/A
```
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**.
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**.
Using the AMD SMI tool, users can retrieve VCN, JPEG engine, and PCIe error metrics by calling `amd-smi metric -P` or `amd-smi metric --usage`. Depending on device support, `VCN_ACTIVITY` will update for MI3x ASICs (with 4 separate VCN engine activities); for older ASICs, `MM_ACTIVITY` will update with UVD/VCN engine activity (average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.
```shell
@@ -1264,7 +1541,7 @@ GPU: 0
```
- **Added AMDSMI Tool Version**.
- **Added AMDSMI Tool Version**.
AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version.
The AMDSMI Tool version is the CLI/tool version number with commit ID appended after `+` sign.
The AMDSMI Library version is the library package version number.
@@ -1275,7 +1552,7 @@ $ amd-smi version
AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0
```
- **Added XGMI table**.
- **Added XGMI table**.
Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (eg. MI300). Here users can view read/write data XGMI or PCIe accumulated data transfer size (in KiloBytes).
```shell
@@ -1309,7 +1586,7 @@ GPU7 0000:df:00.0 32 Gb/s 512 Gb/s XGMI
```
- **Added units of measure to JSON output**.
- **Added units of measure to JSON output**.
We added unit of measure to JSON/CSV `amd-smi metric`, `amd-smi static`, and `amd-smi monitor` commands.
Ex.
@@ -1345,7 +1622,7 @@ amd-smi metric -p --json
### Changes
- **Topology is now left-aligned with BDF of each device listed in each individual table's rows/columns**.
- **Topology is now left-aligned with BDF of each device listed in each individual table's rows/columns**.
We provided each device's BDF for every table's row/columns, then left aligned data. We want AMD SMI Tool output to be easy to understand and digest for our users. Having users scroll up to find this information made it difficult to follow, especially for devices which have many devices associated with one ASIC.
```shell
@@ -1408,9 +1685,9 @@ NUMA BW TABLE:
### Fixes
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**.
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**.
For devices which do not report (e.g. Navi3X/Navi2X/MI100), we have added checks to confirm that these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, tests now display a return string.
- **Fix for devices which have an older pyyaml installed**.
- **Fix for devices which have an older pyyaml installed**.
On platforms identified as having an older pyyaml version or pip, we now manually update both pip and pyyaml as needed. This corrects the issues identified below. The fix impacts the following CLI commands:
- `amd-smi list`
- `amd-smi static`
@@ -1422,20 +1699,20 @@ Platforms which are identified as having an older pyyaml version or pip, we no m
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'
```
- **Fix for crash when user is not a member of video/render groups**.
- **Fix for crash when user is not a member of video/render groups**.
AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user.
## amd_smi_lib for ROCm 6.0.0
### Additions
- **Integrated the E-SMI (EPYC-SMI) library**.
- **Integrated the E-SMI (EPYC-SMI) library**.
You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.
- **Added support for gfx942 metrics**.
- **Added support for gfx942 metrics**.
You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.
- **Compute and memory partition support**.
- **Compute and memory partition support**.
Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration.
### Optimizations
@@ -1444,13 +1721,13 @@ Users can now view, set, and reset partitions. The topology display can provide
### Changes
- **GPU index sorting made consistent with other tools**.
- **GPU index sorting made consistent with other tools**.
To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number.
- **Topology output is now aligned with GPU BDF table**.
- **Topology output is now aligned with GPU BDF table**.
Earlier versions of the topology output were difficult to read since each GPU was displayed linearly.
Now the information is displayed as a table by each GPU's BDF, which closer resembles rocm-smi output.
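The sorting change in this list orders GPUs by Bus:Device.Function (BDF) rather than card number. A minimal sketch of such an ordering, assuming BDF strings in the `domain:bus:device.function` hexadecimal form and a hypothetical `bdf_sort_key` helper (the library's own sorting code is not shown in this diff):

```python
def bdf_sort_key(bdf):
    """Split 'dddd:bb:dd.f' into integers so that sorting is numeric."""
    domain, bus, dev_fn = bdf.split(":")
    device, function = dev_fn.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


# Made-up card numbers and BDFs; the card order deliberately differs from BDF order.
cards = {"card1": "0000:df:00.0", "card0": "0000:0c:00.0", "card2": "0000:22:00.0"}
ordered = sorted(cards.values(), key=bdf_sort_key)
print(ordered)
```

Parsing each field as a hexadecimal integer keeps the order correct even if the strings differ in case or zero padding, which a plain string sort would not guarantee.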
### Fixes
- **Fix for driver not initialized**.
- **Fix for driver not initialized**.
If the driver module is not loaded, the user receives an error response indicating that the amdgpu module is not loaded.
+2 -8
View file
@@ -28,7 +28,7 @@ find_program(GIT NAMES git)
## Setup the package version based on git tags.
set(PKG_VERSION_GIT_TAG_PREFIX "amdsmi_pkg_ver")
get_package_version_number("24.6.5" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
get_package_version_number("24.7.0" ${PKG_VERSION_GIT_TAG_PREFIX} GIT)
message("Package version: ${PKG_VERSION_STR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MAJOR "${CPACK_PACKAGE_VERSION_MAJOR}")
set(${AMD_SMI_LIBS_TARGET}_VERSION_MINOR "${CPACK_PACKAGE_VERSION_MINOR}")
@@ -206,7 +206,7 @@ configure_package_config_file(
write_basic_package_version_file(
${CMAKE_CURRENT_BINARY_DIR}/amd_smi-config-version.cmake
VERSION
"${AMD_SMI_LIBS_TARGET_VERSION_MAJOR}.${AMD_SMI_LIBS_TARGET_VERSION_MINOR}.${AMD_SMI_LIBS_TARGET_VERSION_PATCH}"
"${CPACK_PACKAGE_VERSION}"
COMPATIBILITY SameMajorVersion)
install(
@@ -254,12 +254,9 @@ install(
add_subdirectory(goamdsmi_shim)
#Debian package specific variables
set(CPACK_DEBIAN_PACKAGE_PROVIDES "amd-smi")
set(CPACK_DEBIAN_PACKAGE_RECOMMENDS "python3-argcomplete, libdrm-dev, python3-PyYAML")
set(CPACK_DEBIAN_ASAN_PACKAGE_RECOMMENDS ${CPACK_DEBIAN_PACKAGE_RECOMMENDS})
set(CPACK_DEBIAN_DEV_PACKAGE_RECOMMENDS ${CPACK_DEBIAN_PACKAGE_RECOMMENDS})
set(CPACK_DEBIAN_ASAN_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}-asan")
set(CPACK_DEBIAN_DEV_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}")
set(CPACK_DEBIAN_PACKAGE_DEPENDS "sudo, python3 (>= 3.6.8), python3-pip")
set(CPACK_DEBIAN_ASAN_PACKAGE_DEPENDS ${CPACK_DEBIAN_PACKAGE_DEPENDS})
set(CPACK_DEBIAN_DEV_PACKAGE_DEPENDS ${CPACK_DEBIAN_PACKAGE_DEPENDS})
@@ -276,9 +273,6 @@ set(CPACK_RPM_EXCLUDE_FROM_AUTO_FILELIST_ADDITION
if(CPACK_RPM_PACKAGE_RELEASE)
set(CPACK_RPM_PACKAGE_RELEASE_DIST ON)
endif()
set(CPACK_RPM_PACKAGE_PROVIDES "amd-smi")
set(CPACK_RPM_DEV_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}")
set(CPACK_RPM_ASAN_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}-asan")
# NOTE: RPM SUGGESTS DO NOT WORK! https://bugzilla.redhat.com/show_bug.cgi?id=1811358
set(CPACK_RPM_PACKAGE_SUGGESTS "python3-argcomplete")
set(CPACK_RPM_DEV_PACKAGE_SUGGESTS ${CPACK_RPM_PACKAGE_SUGGESTS})
+1 -1
View file
@@ -81,7 +81,7 @@ AMD-SMI reports the version and current platform detected when running the comma
~$ amd-smi
usage: amd-smi [-h] ...
AMD System Management Interface | Version: 24.6.5.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.7.0.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal
options:
-h, --help show this help message and exit
+2 -2
View file
@@ -94,7 +94,8 @@ if __name__ == "__main__":
amd_smi_commands.reset,
amd_smi_commands.monitor,
amd_smi_commands.rocm_smi,
amd_smi_commands.xgmi)
amd_smi_commands.xgmi,
amd_smi_commands.partition)
try:
try:
argcomplete.autocomplete(amd_smi_parser)
@@ -128,7 +129,6 @@ if __name__ == "__main__":
sys.tracebacklimit = 10
else:
sys.tracebacklimit = -1
# Execute subcommands
args.func(args)
except amdsmi_cli_exceptions.AmdSmiException as e:
+430 -40
View file
@@ -174,17 +174,11 @@ class AMDSMICommands():
kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)
kfd_id = kfd_info['kfd_id']
node_id = kfd_info['node_id']
partition_id = kfd_info['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
kfd_id = node_id = "N/A"
logging.debug("Failed to get kfd info for gpu %s | %s", gpu_id, e.get_error_info())
try:
partition_info = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(args.gpu)
partition_id = partition_info['partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
partition_id = "N/A"
logging.debug("Failed to get partition ID for gpu %s | %s", gpu_id, e.get_error_info())
# CSV format is intentionally aligned with Host
if self.logger.is_csv_format():
self.logger.store_output(args.gpu, 'gpu_bdf', bdf)
@@ -688,8 +682,8 @@ class AMDSMICommands():
logging.debug("Failed to get memory partition info for gpu %s | %s", gpu_id, e.get_error_info())
try:
partition_info = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(args.gpu)
partition_id = partition_info['partition_id']
kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu)
partition_id = kfd_info['current_partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
partition_id = "N/A"
logging.debug("Failed to get partition ID for gpu %s | %s", gpu_id, e.get_error_info())
@@ -801,6 +795,8 @@ class AMDSMICommands():
new_cache_info.update(cache_info)
cache_info_list[index] = new_cache_info
logging.debug(f"[after update] cache_info_list = {cache_info_list}")
cache_size_unit = "KB"
if self.logger.is_human_readable_format():
cache_info_dict_format = {}
@@ -819,6 +815,7 @@ class AMDSMICommands():
cache_info_dict_format[cache_index]["cache_properties"] = ", ".join(cache_info_dict_format[cache_index]["cache_properties"])
cache_info_list = cache_info_dict_format
logging.debug(f"[human readable] cache_info_list = {cache_info_list}")
# Add cache_size_unit to json output
if self.logger.is_json_format():
@@ -1183,7 +1180,7 @@ class AMDSMICommands():
clock=None, temperature=None, ecc=None, ecc_blocks=None, pcie=None,
fan=None, voltage_curve=None, overdrive=None, perf_level=None,
xgmi_err=None, energy=None, mem_usage=None, schedule=None,
guard=None, guest_data=None, fb_usage=None, xgmi=None,):
guard=None, guest_data=None, fb_usage=None, xgmi=None, throttle=None):
"""Get Metric information for target gpu
Args:
@@ -1213,6 +1210,7 @@ class AMDSMICommands():
guest_data (bool, optional): Value override for args.guest_data. Defaults to None.
fb_usage (bool, optional): Value override for args.fb_usage. Defaults to None.
xgmi (bool, optional): Value override for args.xgmi. Defaults to None.
throttle (bool, optional): Value override for args.throttle. Defaults to None.
Raises:
IndexError: Index error if gpu list is empty
@@ -1251,8 +1249,10 @@ class AMDSMICommands():
args.temperature = temperature
if pcie:
args.pcie = pcie
current_platform_args += ["usage", "power", "clock", "temperature", "pcie"]
current_platform_values += [args.usage, args.power, args.clock, args.temperature, args.pcie]
if throttle:
args.throttle = throttle
current_platform_args += ["usage", "power", "clock", "temperature", "pcie", "throttle"]
current_platform_values += [args.usage, args.power, args.clock, args.temperature, args.pcie, args.throttle]
# Only args that are applicable to Hypervisors and BM Linux
if self.helpers.is_hypervisor() or (self.helpers.is_baremetal() and self.helpers.is_linux()):
@@ -1342,13 +1342,16 @@ class AMDSMICommands():
gpu_metric_version_info = amdsmi_interface.amdsmi_get_gpu_metrics_header_info(args.gpu)
gpu_metric_version_str = json.dumps(gpu_metric_version_info, indent=4)
logging.debug("GPU Metrics table Version for GPU %s | %s", gpu_id, gpu_metric_version_str)
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Unable to load GPU Metrics table version for %s | %s", gpu_id, e.err_info)
try:
# Get GPU Metrics table
gpu_metric_debug_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)
gpu_metric_str = json.dumps(gpu_metric_debug_info, indent=4)
logging.debug("GPU Metrics table for GPU %s | %s", gpu_id, gpu_metric_str)
logging.debug("GPU Metrics table for GPU %s | %s", gpu_id, str(gpu_metric_str))
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Unabled to load GPU Metrics table for %s | %s", gpu_id, e.err_info)
logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.err_info)
logging.debug(f"Metric Arg information for GPU {gpu_id} on {self.helpers.os_info()}")
logging.debug(f"Args: {current_platform_args}")
@@ -1362,6 +1365,13 @@ class AMDSMICommands():
# Add timestamp and store values for specified arguments
values_dict = {}
#get metric info only once per gpu, this will speed up data output
try:
# Get GPU Metrics table
gpu_metric = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.err_info)
# Populate the pcie_dict first due to multiple gpu metrics calls incorrectly increasing bandwidth
if "pcie" in current_platform_args:
if args.pcie:
@@ -1375,7 +1385,8 @@ class AMDSMICommands():
"nak_received_count" : "N/A",
"current_bandwidth_sent": "N/A",
"current_bandwidth_received": "N/A",
"max_packet_size": "N/A"}
"max_packet_size": "N/A",
"lc_perf_other_end_recovery": "N/A"}
try:
pcie_metric = amdsmi_interface.amdsmi_get_pcie_info(args.gpu)['pcie_metric']
@@ -1396,6 +1407,7 @@ class AMDSMICommands():
pcie_dict['replay_roll_over_count'] = pcie_metric['pcie_replay_roll_over_count']
pcie_dict['nak_received_count'] = pcie_metric['pcie_nak_received_count']
pcie_dict['nak_sent_count'] = pcie_metric['pcie_nak_sent_count']
pcie_dict['lc_perf_other_end_recovery'] = pcie_metric['pcie_lc_perf_other_end_recovery_count']
pcie_speed_unit = 'GT/s'
pcie_bw_unit = 'Mb/s'
@@ -1448,11 +1460,40 @@ class AMDSMICommands():
if args.usage:
try:
engine_usage = amdsmi_interface.amdsmi_get_gpu_activity(args.gpu)
logging.debug(f"engine_usage dictionary = {engine_usage}")
# TODO: move vcn_activity and jpeg_activity into amdsmi_get_gpu_activity
gpu_metric_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)
engine_usage['vcn_activity'] = gpu_metric_info.pop('vcn_activity')
engine_usage['jpeg_activity'] = gpu_metric_info.pop('jpeg_activity')
engine_usage['vcn_activity'] = gpu_metric['vcn_activity']
engine_usage['jpeg_activity'] = gpu_metric['jpeg_activity']
num_partition = gpu_metric['num_partition']
engine_usage['gfx_busy_inst'] = "N/A"
engine_usage['jpeg_busy'] = "N/A"
engine_usage['vcn_busy'] = "N/A"
engine_usage['gfx_busy_acc'] = "N/A"
if num_partition != "N/A":
# these are one after another, in order to display each in sub-sections
new_xcp_dict = {}
for current_xcp in range(num_partition):
new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.gfx_busy_inst'][current_xcp]
engine_usage['gfx_busy_inst'] = new_xcp_dict
new_xcp_dict = {}
for current_xcp in range(num_partition):
new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.jpeg_busy'][current_xcp]
engine_usage['jpeg_busy'] = new_xcp_dict
new_xcp_dict = {}
for current_xcp in range(num_partition):
new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.vcn_busy'][current_xcp]
engine_usage['vcn_busy'] = new_xcp_dict
new_xcp_dict = {}
for current_xcp in range(num_partition):
new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.gfx_busy_acc'][current_xcp]
engine_usage['gfx_busy_acc'] = new_xcp_dict
logging.debug(f"After updates to engine_usage dictionary = {engine_usage}")
for key, value in engine_usage.items():
activity_unit = '%'
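The hunk above fills four per-partition dictionaries (`gfx_busy_inst`, `jpeg_busy`, `vcn_busy`, `gfx_busy_acc`) with one loop each. The same result could be sketched with a single loop over the field names. This assumes the `gpu_metric` layout used in the hunk (a `num_partition` count plus `xcp_stats.<field>` lists indexed by partition); `build_xcp_usage` is a hypothetical helper, not part of the CLI:

```python
XCP_FIELDS = ("gfx_busy_inst", "jpeg_busy", "vcn_busy", "gfx_busy_acc")


def build_xcp_usage(gpu_metric):
    """Return {field: {"xcp_<n>": values}} or "N/A" per field."""
    usage = {field: "N/A" for field in XCP_FIELDS}
    num_partition = gpu_metric.get("num_partition", "N/A")
    if num_partition == "N/A":
        return usage
    for field in XCP_FIELDS:
        per_xcp = gpu_metric[f"xcp_stats.{field}"]
        usage[field] = {f"xcp_{i}": per_xcp[i] for i in range(num_partition)}
    return usage


# Made-up metrics for a GPU with two partitions.
sample = {
    "num_partition": 2,
    "xcp_stats.gfx_busy_inst": [[10, 20], [30, 40]],
    "xcp_stats.jpeg_busy": [[0], [1]],
    "xcp_stats.vcn_busy": [[2], [3]],
    "xcp_stats.gfx_busy_acc": [[100, 200], [300, 400]],
}
print(build_xcp_usage(sample))
```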
@@ -1463,6 +1504,13 @@ class AMDSMICommands():
engine_usage[key][index] = f"{activity} {activity_unit}"
# Convert list to a string for human readable format
engine_usage[key] = '[' + ", ".join(engine_usage[key]) + ']'
elif isinstance(value, dict):
for k, v in value.items():
for index, activity in enumerate(v):
if activity != "N/A":
value[k][index] = f"{activity} {activity_unit}"
# Convert list to a string for human readable format
value[k] = '[' + ", ".join(value[k]) + ']'
elif value != "N/A":
engine_usage[key] = f"{value} {activity_unit}"
if self.logger.is_json_format():
@@ -1471,14 +1519,20 @@ class AMDSMICommands():
if activity != "N/A":
engine_usage[key][index] = {"value" : activity,
"unit" : activity_unit}
elif isinstance(value, dict):
for k, v in value.items():
for index, activity in enumerate(v):
if activity != "N/A":
value[k][index] = {"value" : activity,
"unit" : activity_unit}
elif value != "N/A":
engine_usage[key] = {"value" : value,
"unit" : activity_unit}
values_dict['usage'] = engine_usage
except amdsmi_exception.AmdSmiLibraryException as e:
except Exception as e:
values_dict['usage'] = "N/A"
logging.debug("Failed to get gpu activity for gpu %s | %s", gpu_id, e.get_error_info())
logging.debug("Failed to get gpu activity for gpu %s | %s", gpu_id, e)
if "power" in current_platform_args:
if args.power:
power_dict = {'socket_power': "N/A",
@@ -1527,14 +1581,14 @@ class AMDSMICommands():
try:
power_dict['throttle_status'] = "N/A"
throttle_status = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)['throttle_status']
throttle_status = gpu_metric['throttle_status']
if throttle_status != "N/A":
if throttle_status:
power_dict['throttle_status'] = "THROTTLED"
else:
power_dict['throttle_status'] = "UNTHROTTLED"
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get throttle status for gpu %s | %s", gpu_id, e.get_error_info())
except Exception as e:
logging.debug("Failed to get throttle status for gpu %s | %s", gpu_id, e)
values_dict['power'] = power_dict
if "clock" in current_platform_args:
@@ -1578,10 +1632,8 @@ class AMDSMICommands():
# Populate clock values from gpu_metrics_info
try:
gpu_metrics_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)
# Populate GFX clock values
current_gfx_clocks = gpu_metrics_info["current_gfxclks"]
current_gfx_clocks = gpu_metric["current_gfxclks"]
for clock_index, current_gfx_clock in enumerate(current_gfx_clocks):
# If the current clock is N/A then nothing else applies
if current_gfx_clock == "N/A":
@@ -1593,9 +1645,9 @@ class AMDSMICommands():
clock_unit)
# Populate clock locked status
if gpu_metrics_info["gfxclk_lock_status"] != "N/A":
if gpu_metric["gfxclk_lock_status"] != "N/A":
gfx_clock_lock_flag = 1 << clock_index # This is the position of the clock lock flag
if gpu_metrics_info["gfxclk_lock_status"] & gfx_clock_lock_flag:
if gpu_metric["gfxclk_lock_status"] & gfx_clock_lock_flag:
clocks[gfx_index]["clk_locked"] = "ENABLED"
else:
clocks[gfx_index]["clk_locked"] = "DISABLED"
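The lock check in this hunk treats `gfxclk_lock_status` as a bitmask in which bit `clock_index` marks whether that GFX clock is locked. A minimal sketch of the decoding, assuming a hypothetical `decode_clock_locks` helper and a made-up mask value:

```python
def decode_clock_locks(lock_status, num_clocks):
    """Map each clock index to ENABLED/DISABLED from a lock bitmask."""
    if lock_status == "N/A":
        return ["N/A"] * num_clocks
    return [
        "ENABLED" if lock_status & (1 << index) else "DISABLED"
        for index in range(num_clocks)
    ]


# Bits 0 and 2 set: clocks 0 and 2 are locked.
print(decode_clock_locks(0b0101, 4))
```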
@@ -1607,7 +1659,7 @@ class AMDSMICommands():
clocks[gfx_index]["deep_sleep"] = "DISABLED"
# Populate MEM clock value
current_mem_clock = gpu_metrics_info["current_uclk"] # single value
current_mem_clock = gpu_metric["current_uclk"] # single value
if current_mem_clock != "N/A":
clocks["mem_0"]["clk"] = self.helpers.unit_format(self.logger,
current_mem_clock,
@@ -1619,7 +1671,7 @@ class AMDSMICommands():
clocks["mem_0"]["deep_sleep"] = "DISABLED"
# Populate VCLK clock values
current_vclk_clocks = gpu_metrics_info["current_vclk0s"]
current_vclk_clocks = gpu_metric["current_vclk0s"]
for clock_index, current_vclk_clock in enumerate(current_vclk_clocks):
# If the current clock is N/A then nothing else applies
if current_vclk_clock == "N/A":
@@ -1636,7 +1688,7 @@ class AMDSMICommands():
clocks[vclk_index]["deep_sleep"] = "DISABLED"
# Populate DCLK clock values
current_dclk_clocks = gpu_metrics_info["current_dclk0s"]
current_dclk_clocks = gpu_metric["current_dclk0s"]
for clock_index, current_dclk_clock in enumerate(current_dclk_clocks):
# If the current clock is N/A then nothing else applies
if current_dclk_clock == "N/A":
@@ -1651,8 +1703,8 @@ class AMDSMICommands():
clocks[dclk_index]["deep_sleep"] = "ENABLED"
else:
clocks[dclk_index]["deep_sleep"] = "DISABLED"
except amdsmi_exception.AmdSmiLibraryException as e:
logging.debug("Failed to get gpu_metrics_info for gpu %s | %s", gpu_id, e.get_error_info())
except Exception as e:
logging.debug("Failed to get gpu_metrics_info for gpu %s | %s", gpu_id, e)
# Populate the max and min clock values from sysfs
# Min and Max values are per clock type, not per clock engine
@@ -2036,6 +2088,92 @@ class AMDSMICommands():
"unit" : memory_unit}
values_dict['mem_usage'] = memory_usage
if "throttle" in current_platform_args:
if args.throttle:
throttle_status = {
# gpu metric values
'accumulation_counter': "N/A",
'prochot_accumulated': "N/A",
'ppt_accumulated': "N/A",
'socket_thermal_accumulated': "N/A",
'vr_thermal_accumulated': "N/A",
'hbm_thermal_accumulated': "N/A",
# violation status values - active
'prochot_violation_active': "N/A",
'ppt_violation_active': "N/A",
'socket_thermal_violation_active': "N/A",
'vr_thermal_violation_active': "N/A",
'hbm_thermal_violation_active': "N/A",
# violation status values - percent
'prochot_violation_percent': "N/A",
'ppt_violation_percent': "N/A",
'socket_thermal_violation_percent': "N/A",
'vr_thermal_violation_percent': "N/A",
'hbm_thermal_violation_percent': "N/A"
}
try:
throttle_status['accumulation_counter'] = gpu_metric['accumulation_counter']
throttle_status['prochot_accumulated'] = gpu_metric['prochot_residency_acc']
throttle_status['ppt_accumulated'] = gpu_metric['ppt_residency_acc']
throttle_status['socket_thermal_accumulated'] = gpu_metric['socket_thm_residency_acc']
throttle_status['vr_thermal_accumulated'] = gpu_metric['vr_thm_residency_acc']
throttle_status['hbm_thermal_accumulated'] = gpu_metric['hbm_thm_residency_acc']
except Exception as e:
values_dict['throttle'] = throttle_status
logging.debug("Failed to get gpu metric information for throttle status' for gpu %s | %s", gpu_id, e)
try:
violation_status = amdsmi_interface.amdsmi_get_violation_status(args.gpu)
throttle_status['prochot_violation_active'] = violation_status['active_prochot_thrm']
throttle_status['ppt_violation_active'] = violation_status['active_ppt_pwr']
throttle_status['socket_thermal_violation_active'] = violation_status['active_socket_thrm']
throttle_status['vr_thermal_violation_active'] = violation_status['active_vr_thrm']
throttle_status['hbm_thermal_violation_active'] = violation_status['active_hbm_thrm']
throttle_status['prochot_violation_percent'] = violation_status['per_prochot_thrm']
throttle_status['ppt_violation_percent'] = violation_status['per_ppt_pwr']
throttle_status['socket_thermal_violation_percent'] = violation_status['per_socket_thrm']
throttle_status['vr_thermal_violation_percent'] = violation_status['per_vr_thrm']
throttle_status['hbm_thermal_violation_percent'] = violation_status['per_hbm_thrm']
except amdsmi_exception.AmdSmiLibraryException as e:
values_dict['throttle'] = throttle_status
logging.debug("Failed to get violation status' for gpu %s | %s", gpu_id, e.get_error_info())
for key, value in throttle_status.items():
if "active" in key:
throttle_status[key] = "NOT ACTIVE"
if value:
throttle_status[key] = "ACTIVE"
continue
if "percent" not in key:
continue
activity_unit = '%'
if self.logger.is_human_readable_format():
if isinstance(value, list):
for index, activity in enumerate(value):
if activity != "N/A":
throttle_status[key][index] = f"{activity} {activity_unit}"
# Convert list to a string for human readable format
throttle_status[key] = '[' + ", ".join(throttle_status[key]) + ']'
elif value != "N/A":
throttle_status[key] = f"{value} {activity_unit}"
if self.logger.is_json_format():
if isinstance(value, list):
for index, activity in enumerate(value):
if activity != "N/A":
throttle_status[key][index] = {"value" : activity,
"unit" : activity_unit}
elif value != "N/A":
throttle_status[key] = {"value" : value,
"unit" : activity_unit}
values_dict['throttle'] = throttle_status
# Store timestamp first if watching_output is enabled
if watching_output:
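The post-processing loop in this hunk maps `*_active` entries to ACTIVE / NOT ACTIVE and attaches a `%` unit to `*_percent` entries. As written, an `*_active` entry still holding the default "N/A" string is truthy, so it would appear to be reported as ACTIVE when the violation query fails. The sketch below is one possible variant that passes "N/A" through unchanged; `format_violation` is a hypothetical helper, not part of the CLI:

```python
def format_violation(key, value, unit="%"):
    """Format one throttle entry for human-readable output."""
    if value == "N/A":
        return "N/A"  # keep unsupported or failed reads visible as N/A
    if "active" in key:
        return "ACTIVE" if value else "NOT ACTIVE"
    if "percent" in key:
        return f"{value} {unit}"
    return value  # accumulators are passed through untouched


# Made-up values using key names from the hunk above.
sample = {
    "accumulation_counter": 5,
    "ppt_violation_active": 1,
    "vr_thermal_violation_active": "N/A",
    "ppt_violation_percent": 12,
}
print({key: format_violation(key, value) for key, value in sample.items()})
```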
@@ -2438,7 +2576,7 @@ class AMDSMICommands():
cpu_temp=None, cpu_dimm_temp_range_rate=None, cpu_dimm_pow_consumption=None,
cpu_dimm_thermal_sensor=None,
core=None, core_boost_limit=None, core_curr_active_freq_core_limit=None,
core_energy=None):
core_energy=None, throttle=None):
"""Get Metric information for target gpu
Args:
@@ -2513,7 +2651,7 @@ class AMDSMICommands():
gpu_attributes = ["usage", "watch", "watch_time", "iterations", "power", "clock",
"temperature", "ecc", "ecc_blocks", "pcie", "fan", "voltage_curve",
"overdrive", "perf_level", "xgmi_err", "energy", "mem_usage", "schedule",
"guard", "guest_data", "fb_usage", "xgmi"]
"guard", "guest_data", "fb_usage", "xgmi", "throttle"]
for attr in gpu_attributes:
if hasattr(args, attr):
if getattr(args, attr):
@@ -2586,7 +2724,7 @@ class AMDSMICommands():
clock, temperature, ecc, ecc_blocks, pcie,
fan, voltage_curve, overdrive, perf_level,
xgmi_err, energy, mem_usage, schedule,
guard, guest_data, fb_usage, xgmi)
guard, guest_data, fb_usage, xgmi, throttle)
elif self.helpers.is_amd_hsmp_initialized(): # Only CPU is initialized
if args.cpu == None and args.core == None:
# If no args are set, print out all CPU and Core metrics info
@@ -2620,7 +2758,7 @@ class AMDSMICommands():
usage, watch, watch_time, iterations, power,
clock, temperature, ecc, ecc_blocks, pcie,
fan, voltage_curve, overdrive, perf_level,
xgmi_err, energy, mem_usage, schedule)
xgmi_err, energy, mem_usage, schedule, throttle)
def process(self, args, multiple_devices=False, watching_output=False,
@@ -4301,7 +4439,7 @@ class AMDSMICommands():
def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None,
watch=None, watch_time=None, iterations=None, power_usage=None,
temperature=None, gfx_util=None, mem_util=None, encoder=None, decoder=None,
ecc=None, vram_usage=None, pcie=None, process=None):
ecc=None, vram_usage=None, pcie=None, process=None, violation=None):
""" Populate a table with each GPU as an index to rows of targeted data
Args:
@@ -4321,6 +4459,7 @@ class AMDSMICommands():
vram_usage (bool, optional): Value override for args.vram_usage. Defaults to None.
pcie (bool, optional): Value override for args.pcie. Defaults to None.
process (bool, optional): Value override for args.process. Defaults to None.
violation (bool, optional): Value override for args.violation. Defaults to None.
Raises:
ValueError: Value error if no gpu value is provided
@@ -4360,6 +4499,8 @@ class AMDSMICommands():
args.pcie = pcie
if process:
args.process = process
if violation:
args.violation = violation
# Handle No GPU passed
if args.gpu == None:
@@ -4369,10 +4510,10 @@ class AMDSMICommands():
# Don't include process in this logic as it's an optional edge case
if not any([args.power_usage, args.temperature, args.gfx, args.mem,
args.encoder, args.decoder, args.ecc,
args.vram_usage, args.pcie]):
args.vram_usage, args.pcie, args.violation]):
args.power_usage = args.temperature = args.gfx = args.mem = \
args.encoder = args.decoder = args.ecc = \
args.vram_usage = args.pcie = True
args.vram_usage = args.pcie = args.violation = True
# Handle watch logic, will only enter this block once
if args.watch:
@@ -4684,6 +4825,50 @@ class AMDSMICommands():
self.logger.table_header += 'PCIE_BW'.rjust(12)
if args.violation:
violation_status = {
"pviol": "N/A",
"tviol": "N/A",
"phot_tviol": "N/A",
"vr_tviol": "N/A",
"hbm_tviol": "N/A",
}
try:
violations = amdsmi_interface.amdsmi_get_violation_status(args.gpu)
violation_status['pviol'] = violations['per_ppt_pwr']
violation_status['tviol'] = violations['per_socket_thrm']
violation_status['phot_tviol'] = violations['per_prochot_thrm']
violation_status['vr_tviol'] = violations['per_vr_thrm']
violation_status['hbm_tviol'] = violations['per_hbm_thrm']
except amdsmi_exception.AmdSmiLibraryException as e:
monitor_values['pviol'] = violation_status['pviol']
monitor_values['tviol'] = violation_status['tviol']
monitor_values['phot_tviol'] = violation_status['phot_tviol']
monitor_values['vr_tviol'] = violation_status['vr_tviol']
monitor_values['hbm_tviol'] = violation_status['hbm_tviol']
logging.debug("Failed to get violation status on gpu %s | %s", gpu_id, e.get_error_info())
violation_status_unit = "%"
kTVIOL_MAX_WIDTH = 10
kPVIOL_MAX_WIDTH = 10
kPHOT_MAX_WIDTH = 12
kVR_MAX_WIDTH = 10
kHBM_MAX_WIDTH = 11
for key, value in violation_status.items():
monitor_values[key] = self.helpers.unit_format(self.logger, violation_status[key], violation_status_unit)
if self.logger.is_human_readable_format():
monitor_values['pviol'] = monitor_values['pviol'].rjust(kPVIOL_MAX_WIDTH, ' ')
monitor_values['tviol'] = monitor_values['tviol'].rjust(kTVIOL_MAX_WIDTH, ' ')
monitor_values['phot_tviol'] = monitor_values['phot_tviol'].rjust(kPHOT_MAX_WIDTH, ' ')
monitor_values['vr_tviol'] = monitor_values['vr_tviol'].rjust(kVR_MAX_WIDTH, ' ')
monitor_values['hbm_tviol'] = monitor_values['hbm_tviol'].rjust(kHBM_MAX_WIDTH, ' ')
self.logger.table_header += 'PVIOL'.rjust(kPVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'TVIOL'.rjust(kTVIOL_MAX_WIDTH, ' ')
self.logger.table_header += 'PHOT_TVIOL'.rjust(kPHOT_MAX_WIDTH, ' ')
self.logger.table_header += 'VR_TVIOL'.rjust(kVR_MAX_WIDTH, ' ')
self.logger.table_header += 'HBM_TVIOL'.rjust(kHBM_MAX_WIDTH, ' ')
self.logger.store_output(args.gpu, 'values', monitor_values)
# initialize dual_csv_format; applicable to process only
@@ -4858,6 +5043,8 @@ class AMDSMICommands():
bitrate = pcie_speed_GTs_value
max_bandwidth = bitrate * pcie_static['max_pcie_width']
except amdsmi_exception.AmdSmiLibraryException as e:
bitrate = "N/A"
max_bandwidth = "N/A"
logging.debug("Failed to get bitrate and bandwidth for GPU %s | %s", src_gpu_id,
e.get_error_info())
@@ -4899,6 +5086,8 @@ class AMDSMICommands():
read = metrics_info['xgmi_read_data_acc'][dest_gpu_id]
write = metrics_info['xgmi_write_data_acc'][dest_gpu_id]
except amdsmi_exception.AmdSmiLibraryException as e:
read = "N/A"
write = "N/A"
logging.debug("Failed to get read data for %s to %s | %s",
self.helpers.get_gpu_id_from_device_handle(src_gpu),
self.helpers.get_gpu_id_from_device_handle(dest_gpu),
@@ -4987,6 +5176,207 @@ class AMDSMICommands():
self.logger.print_output(multiple_device_enabled=True)
def partition(self, args, multiple_devices=False, gpu=None, current=None, memory=None, accelerator=None):
""" Display partition information for the target GPU
param:
args - argparser args to pass to subcommand
multiple_devices (bool) - True if checking for multiple devices
gpu (device_handle) - device_handle for target device
current - boolean which dictates whether the current partition information is shown
memory - boolean which dictates whether the memory partition information is shown
accelerator - boolean which dictates whether the accelerator partition information is shown
returns:
nothing
"""
if gpu:
args.gpu = gpu
if args.gpu is None:
args.gpu = self.device_handles
if not isinstance(args.gpu, list):
args.gpu = [args.gpu]
if current:
args.current = current
if memory:
args.memory = memory
if accelerator:
args.accelerator = accelerator
# if no args are present, then everything should be displayed
if not args.current and not args.memory and not args.accelerator:
args.current = True
args.memory = True
args.accelerator = True
if args.current:
self.logger.table_header = ''.rjust(7)
current_header = "GPU_ID".ljust(13) + \
"MEMORY".ljust(8) + \
"ACCELERATOR_TYPE".ljust(18) + \
"ACCELERATOR_PROFILE_INDEX".ljust(27) + \
"PARTITION_ID".ljust(14)
self.logger.table_header = current_header + self.logger.table_header.strip()
tabular_output = []
for gpu in args.gpu:
gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu)
try:
partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu)
profile_type = partition_dict['partition_profile']['profile_type']
profile_index = partition_dict['partition_profile']['profile_index']
partition_id = partition_dict['partition_id']
except amdsmi_exception.AmdSmiLibraryException as e:
profile_type = "N/A"
profile_index = "N/A"
partition_id = "N/A"
logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info())
try:
current_mem_cap = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu)
except amdsmi_exception.AmdSmiLibraryException as e:
current_mem_cap = "N/A"
logging.debug("Failed to get current memory partition capabilities for GPU %s | %s", gpu_id, e.get_error_info())
tabular_output_dict = {"gpu_id": gpu_id,
"memory": current_mem_cap,
"accelerator_type": profile_type,
"accelerator_profile_index": profile_index,
"partition_id": partition_id}
tabular_output.append(tabular_output_dict)
self.logger.multiple_device_output = tabular_output
self.logger.table_title = "CURRENT_PARTITION"
self.logger.print_output(multiple_device_enabled=True, tabular=True)
self.logger.clear_multiple_devices_ouput()
if args.memory:
for gpu in args.gpu:
gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu)
try:
memory_partition = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu) # this info likely actually comes from different apis than used here
except amdsmi_exception.AmdSmiLibraryException as e:
memory_partition = "N/A"
logging.debug("Failed to get current memory partition for GPU %s | %s", gpu_id, e.get_error_info())
try:
partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu)
temp_mem_caps = partition_dict['partition_profile']['memory_caps']
if temp_mem_caps.amdsmi_nps_flags_t is None:
mem_caps = temp_mem_caps.nps_cap_mask
mem_caps_list = []
if mem_caps & 1 == 1:
mem_caps_list.append("NPS1")
if mem_caps & 2 == 2:
mem_caps_list.append("NPS2")
if mem_caps & 4 == 4:
mem_caps_list.append("NPS4")
if mem_caps & 8 == 8:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
else:
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
mem_caps_list = []
if mem_caps.nps1_cap == 1:
mem_caps_list.append("NPS1")
if mem_caps.nps2_cap == 1:
mem_caps_list.append("NPS2")
if mem_caps.nps4_cap == 1:
mem_caps_list.append("NPS4")
if mem_caps.nps8_cap == 1:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
if mem_caps_str == "":
mem_caps_str = "N/A"
except amdsmi_exception.AmdSmiLibraryException as e:
mem_caps_str = "N/A"
logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info())
memory_dict = {'caps': mem_caps_str, 'current': memory_partition}
self.logger.store_output(gpu, 'memory_partition', memory_dict)
self.logger.store_multiple_device_output()
self.logger.print_output(multiple_device_enabled=True)
self.logger.clear_multiple_devices_ouput()
if args.accelerator:
self.logger.table_header = ''.rjust(7)
current_header = "GPU_ID".ljust(13) + \
"PROFILE_INDEX".ljust(15) + \
"MEMORY_PARTITION_CAPS".ljust(23) + \
"ACCELERATOR_TYPE".ljust(18) + \
"PARTITION_ID".ljust(14) + \
"NUM_PARTITIONS".ljust(16) + \
"NUM_RESOURCES".ljust(15) + \
"RESOURCE_INDEX".ljust(16) + \
"RESOURCE_TYPE".ljust(15) + \
"RESOURCE_INSTANCES".ljust(20) + \
"RESOURCES_SHARED".ljust(18)
self.logger.table_header = current_header + self.logger.table_header.strip()
tabular_output = []
for gpu in args.gpu:
gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu)
try:
partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu)
profile_type = partition_dict['partition_profile']['profile_type']
profile_index = partition_dict['partition_profile']['profile_index']
temp_mem_caps = partition_dict['partition_profile']['memory_caps']
parition_id = partition_dict['partition_id']
num_resources = partition_dict['partition_profile']['num_resources']
resources = partition_dict['partition_profile']['resources']
if temp_mem_caps.amdsmi_nps_flags_t is None:
mem_caps = temp_mem_caps.nps_cap_mask
mem_caps_list = []
if mem_caps & 1 == 1:
mem_caps_list.append("NPS1")
if mem_caps & 2 == 2:
mem_caps_list.append("NPS2")
if mem_caps & 4 == 4:
mem_caps_list.append("NPS4")
if mem_caps & 8 == 8:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
else:
mem_caps = temp_mem_caps.amdsmi_nps_flags_t
mem_caps_list = []
if mem_caps.nps1_cap == 1:
mem_caps_list.append("NPS1")
if mem_caps.nps2_cap == 1:
mem_caps_list.append("NPS2")
if mem_caps.nps4_cap == 1:
mem_caps_list.append("NPS4")
if mem_caps.nps8_cap == 1:
mem_caps_list.append("NPS8")
mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "")
if mem_caps_str == "":
mem_caps_str = "N/A"
except amdsmi_exception.AmdSmiLibraryException as e:
profile_type = "N/A"
profile_index = "N/A"
temp_mem_caps = "N/A"
parition_id = "N/A"
num_resources = "N/A"
resources = "N/A"
mem_caps_str = "N/A"
logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info())
tabular_output_dict = {"gpu_id": gpu_id,
"profile_index": profile_index,
"memory_partition_caps": mem_caps_str,
"accelerator_type": profile_type,
"partition_id": parition_id,
"num_partitions": 0,
"num_resources": num_resources,
"resource_index": resources,
"resource_type": resources,
"resource_instances": resources,
"resources_shared": resources}
tabular_output.append(tabular_output_dict)
self.logger.multiple_device_output = tabular_output
self.logger.table_title = "ACCELERATOR_PARTITION_PROFILES"
self.logger.print_output(multiple_device_enabled=True, tabular=True)
self.logger.clear_multiple_devices_ouput()
def _event_thread(self, commands, i):
devices = commands.device_handles
if len(devices) == 0:
+25 -1
@@ -150,8 +150,32 @@ class AMDSMILogger():
table_values += string_value.ljust(14)
elif key == "link_type":
table_values += string_value.ljust(10)
elif key == "memory":
table_values += string_value.ljust(8)
elif key == "accelerator_type":
table_values += string_value.ljust(18)
elif key == "partition_id":
table_values += string_value.ljust(14)
elif key == "accelerator_profile_index":
table_values += string_value.ljust(27)
elif key == "profile_index":
table_values += string_value.ljust(15)
elif key == "memory_partition_caps":
table_values += string_value.ljust(23)
elif key == "num_partitions":
table_values += string_value.ljust(16)
elif key == "num_resources":
table_values += string_value.ljust(15)
elif key == "resource_index":
table_values += string_value.ljust(16)
elif key == "resource_type":
table_values += string_value.ljust(15)
elif key == "resource_instances":
table_values += string_value.ljust(20)
elif key == "resources_shared":
table_values += string_value.ljust(18)
elif key == "RW":
table_values += " " + string_value.ljust(52)
table_values += string_value.ljust(52)
elif key == "process_list":
#Add an additional padding between the first instance of GPU and NAME
table_values += ' '
+41 -2
@@ -71,7 +71,7 @@ class AMDSMIParser(argparse.ArgumentParser):
"""
def __init__(self, version, list, static, firmware, bad_pages, metric,
process, profile, event, topology, set_value, reset, monitor,
rocmsmi, xgmi):
rocmsmi, xgmi, partition):
# Helper variables
self.helpers = AMDSMIHelpers()
@@ -117,7 +117,7 @@ class AMDSMIParser(argparse.ArgumentParser):
# Store possible subcommands & aliases for later errors
self.possible_commands = ['version', 'list', 'static', 'firmware', 'ucode', 'bad-pages',
'metric', 'process', 'profile', 'event', 'topology', 'set',
'reset', 'monitor', 'dmon', 'xgmi']
'reset', 'monitor', 'dmon', 'xgmi', 'partition']
# Add all subparsers
self._add_version_parser(self.subparsers, version)
@@ -135,6 +135,7 @@ class AMDSMIParser(argparse.ArgumentParser):
self._add_monitor_parser(self.subparsers, monitor)
self._add_rocm_smi_parser(self.subparsers, rocmsmi)
self._add_xgmi_parser(self.subparsers, xgmi)
self._add_partition_parser(self.subparsers, partition)
def _not_negative_int(self, int_value):
@@ -758,6 +759,7 @@ class AMDSMIParser(argparse.ArgumentParser):
perf_level_help = "Current DPM performance level"
xgmi_err_help = "XGMI error information since last read"
energy_help = "Amount of energy consumed"
throttle_help = "Displays throttle accumulators; Only available for MI300 or newer ASICs"
# Help text for Arguments only on Hypervisors
schedule_help = "All scheduling information"
@@ -832,6 +834,7 @@ class AMDSMIParser(argparse.ArgumentParser):
metric_parser.add_argument('-l', '--perf-level', action='store_true', required=False, help=perf_level_help)
metric_parser.add_argument('-x', '--xgmi-err', action='store_true', required=False, help=xgmi_err_help)
metric_parser.add_argument('-E', '--energy', action='store_true', required=False, help=energy_help)
metric_parser.add_argument('-T', '--throttle', action='store_true', required=False, help=throttle_help)
# Options to only display to Hypervisors
if self.helpers.is_hypervisor():
@@ -1184,6 +1187,7 @@ class AMDSMIParser(argparse.ArgumentParser):
mem_usage_help = "Monitor memory usage in MB"
pcie_bandwidth_help = "Monitor PCIe bandwidth in Mb/s"
process_help = "Enable Process information table below monitor output"
violation_help = "Monitor power and thermal violation status (%%); Only available for MI300 or newer ASICs"
# Create monitor subparser
monitor_parser = subparsers.add_parser('monitor', help=monitor_help, description=monitor_subcommand_help, aliases=["dmon"])
@@ -1207,6 +1211,7 @@ class AMDSMIParser(argparse.ArgumentParser):
monitor_parser.add_argument('-v', '--vram-usage', action='store_true', required=False, help=mem_usage_help)
monitor_parser.add_argument('-r', '--pcie', action='store_true', required=False, help=pcie_bandwidth_help)
monitor_parser.add_argument('-q', '--process', action='store_true', required=False, help=process_help)
monitor_parser.add_argument('-V', '--violation', action='store_true', required=False, help=violation_help)
def _add_rocm_smi_parser(self, subparsers, func):
@@ -1282,6 +1287,40 @@ class AMDSMIParser(argparse.ArgumentParser):
xgmi_parser.add_argument('-m', '--metric', action='store_true', required=False, help=metrics_help)
def _add_partition_parser(self, subparsers, func):
if not self.helpers.is_amdgpu_initialized():
# The partition subcommand is only applicable to systems with amdgpu initialized
return
# Subparser help text
partition_help = "Displays partition information of the devices"
partition_subcommand_help = "If no GPU is specified, returns information for all GPUs on the system.\
\nIf no partition argument is provided all partition information will be displayed."
partition_optionals_title = "partition arguments"
# Options help text
current_help = "display the current partition information"
memory_help = "display the current memory partition mode and capabilities"
accelerator_help = "display accelerator partition information"
# Create partition subparser
partition_parser = subparsers.add_parser('partition', help=partition_help, description=partition_subcommand_help)
partition_parser._optionals.title = partition_optionals_title
partition_parser.formatter_class=lambda prog: AMDSMISubparserHelpFormatter(prog)
partition_parser.set_defaults(func=func)
# Add Universal Arguments
self._add_device_arguments(partition_parser, required=False)
# Handle GPU Options
partition_parser.add_argument('-c', '--current', action='store_true', required=False, help=current_help)
partition_parser.add_argument('-m', '--memory', action='store_true', required=False, help=memory_help)
partition_parser.add_argument('-a', '--accelerator', action='store_true', required=False, help=accelerator_help)
# Add command modifiers to the bottom
self._add_command_modifiers(partition_parser)
def error(self, message):
outputformat = self.helpers.get_output_format()
+1 -1
@@ -48,7 +48,7 @@ PROJECT_NAME = AMD SMI
# could be handy for archiving the generated documentation or if some version
# control system is used.
PROJECT_NUMBER = "24.6.5.0"
PROJECT_NUMBER = "24.7.0.0"
# Using the PROJECT_BRIEF tag one can provide an optional one line description
# for a project that appears at the top of each page and should give viewer a
+1 -1
@@ -8,7 +8,7 @@ AMD-SMI reports the version and current platform detected when running the comma
~$ amd-smi
usage: amd-smi [-h] ...
AMD System Management Interface | Version: 24.6.5.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal
AMD System Management Interface | Version: 24.7.0.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal
options:
-h, --help show this help message and exit
@@ -3867,6 +3867,55 @@ except AmdSmiException as e:
print(e)
```
### amdsmi_get_link_topology_nearest
Description: Retrieve the set of GPUs that are nearest to a given device
at a specific interconnectivity level.
Input parameters:
* `processor_handle` The identifier of the given device.
* `link_type` The AmdSmiLinkType level to search for nearest devices
Output: Dictionary holding the following fields.
* `count` number of nearest devices found based on given topology level
* `processor_list` list of all nearest device handles found
Exceptions that can be thrown by `amdsmi_get_link_topology_nearest` function:
* `AmdSmiLibraryException`
Example:
```python
try:
    amdsmi_init()
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs found on machine")
        exit()
    else:
        print(amdsmi_get_gpu_device_uuid(devices[0]))
        nearest_gpus = amdsmi_topology_nearest_t()
        nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType(2))
        if (nearest_gpus['count']) == 0:
            print("No nearest GPUs found on machine")
        else:
            print("Nearest GPUs")
            for gpu in nearest_gpus['processor_list']:
                print(amdsmi_get_gpu_device_uuid(gpu))
except AmdSmiException as e:
    print(e)
finally:
    try:
        amdsmi_shut_down()
    except AmdSmiException as e:
        print(e)
```
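
The dictionary described above can also be post-processed without touching the library. The following is a minimal sketch, assuming only the documented `count` and `processor_list` fields. The helper name `describe_nearest` and its `label_of` parameter are hypothetical and not part of the AMD SMI API, and the sample data uses plain strings in place of real processor handles. Truncating the list to `count` entries is a defensive assumption, not documented behavior.

```python
from typing import Any, Callable, Dict, List


def describe_nearest(result: Dict[str, Any],
                     label_of: Callable[[Any], str] = str) -> List[str]:
    """Turn a {'count': int, 'processor_list': list} dictionary into printable lines.

    Hypothetical helper: `label_of` converts one handle into a display string.
    """
    count = result.get("count", 0)
    handles = result.get("processor_list", [])
    if count == 0 or not handles:
        return ["No nearest GPUs found on machine"]
    lines = ["Nearest GPUs"]
    # Assumption: only the first `count` entries of the list are meaningful.
    for handle in handles[:count]:
        lines.append(label_of(handle))
    return lines


# Stand-in data: plain strings instead of real processor handles.
sample = {"count": 2, "processor_list": ["gpu-a", "gpu-b", "unused"]}
for line in describe_nearest(sample):
    print(line)
```

With real handles, a function such as `amdsmi_get_gpu_device_uuid` from the example above could be passed as `label_of`, which keeps the formatting logic testable on machines without a GPU.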
## CPU APIs
### amdsmi_get_processor_info
+362 -149
@@ -62,7 +62,7 @@
const char *err_str; \
std::cout << "AMDSMI call returned " << RET << " at line " \
<< __LINE__ << std::endl; \
amdsmi_status_code_to_string(RET, &err_str); \
amdsmi_status_code_to_string(RET, &err_str); \
std::cout << err_str << std::endl; \
return RET; \
} \
@@ -264,6 +264,8 @@ int main() {
&device_count, &processor_handles[0]);
CHK_AMDSMI_RET(ret)
std::cout << "Processor Count: " << device_count << std::endl;
// For each device of the socket, get name and temperature.
for (uint32_t j = 0; j < device_count; j++) {
// Get device type. Since the amdsmi is initialized with
@@ -494,7 +496,10 @@ int main() {
block = (amdsmi_gpu_block_t)(block * 2)) {
ret = amdsmi_get_gpu_ras_block_features_enabled(processor_handles[j], block,
&state);
CHK_AMDSMI_RET(ret)
if (ret != AMDSMI_STATUS_API_FAILED) {
CHK_AMDSMI_RET(ret)
}
printf("\tBlock: %s\n", block_names[index]);
printf("\tStatus: %s\n", status_names[state]);
index++;
@@ -507,7 +512,9 @@ int main() {
uint32_t num_pages = 0;
ret = amdsmi_get_gpu_bad_page_info(processor_handles[j], &num_pages,
nullptr);
CHK_AMDSMI_RET(ret)
if (ret != AMDSMI_STATUS_NOT_SUPPORTED) {
CHK_AMDSMI_RET(ret)
}
printf(" Output of amdsmi_get_gpu_bad_page_info:\n");
if (!num_pages) {
printf("\tNo bad pages found.\n");
@@ -684,8 +691,8 @@ int main() {
/// Get GPU Metrics info
std::cout << "\n\n";
amdsmi_gpu_metrics_t gpu_metrics;
ret = amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics);
amdsmi_gpu_metrics_t smu;
ret = amdsmi_get_gpu_metrics_info(processor_handles[j], &smu);
CHK_AMDSMI_RET(ret)
printf(" Output of amdsmi_get_gpu_metrics_info:\n");
printf("\tDevice[%d] BDF %04lx:%02x:%02x.%d\n\n", i,
@@ -694,165 +701,371 @@ int main() {
bdf.device_number,
bdf.function_number);
std::cout << "\t**.common_header.format_revision : "
<< print_unsigned_int(gpu_metrics.common_header.format_revision) << "\n";
std::cout << "\t**.common_header.content_revision : "
<< print_unsigned_int(gpu_metrics.common_header.content_revision) << "\n";
std::cout << "\t**.temperature_edge : " << std::dec
<< gpu_metrics.temperature_edge << "\n";
std::cout << "\t**.temperature_hotspot : " << std::dec
<< gpu_metrics.temperature_hotspot << "\n";
std::cout << "\t**.temperature_mem : " << std::dec
<< gpu_metrics.temperature_mem << "\n";
std::cout << "\t**.temperature_vrgfx : " << std::dec
<< gpu_metrics.temperature_vrgfx << "\n";
std::cout << "\t**.temperature_vrsoc : " << std::dec
<< gpu_metrics.temperature_vrsoc << "\n";
std::cout << "\t**.temperature_vrmem : " << std::dec
<< gpu_metrics.temperature_vrmem << "\n";
std::cout << "\t**.average_gfx_activity : " << std::dec
<< gpu_metrics.average_gfx_activity << "\n";
std::cout << "\t**.average_umc_activity : " << std::dec
<< gpu_metrics.average_umc_activity << "\n";
std::cout << "\t**.average_mm_activity : " << std::dec
<< gpu_metrics.average_mm_activity << "\n";
std::cout << "\t**.average_socket_power : " << std::dec
<< gpu_metrics.average_socket_power << "\n";
std::cout << "\t**.energy_accumulator : " << std::dec
<< gpu_metrics.energy_accumulator << "\n";
std::cout << "\t**.system_clock_counter : " << std::dec
<< gpu_metrics.system_clock_counter << "\n";
std::cout << "\t**.average_gfxclk_frequency : " << std::dec
<< gpu_metrics.average_gfxclk_frequency << "\n";
std::cout << "\t**.average_socclk_frequency : " << std::dec
<< gpu_metrics.average_socclk_frequency << "\n";
std::cout << "\t**.average_uclk_frequency : " << std::dec
<< gpu_metrics.average_uclk_frequency << "\n";
std::cout << "\t**.average_vclk0_frequency : " << std::dec
<< gpu_metrics.average_vclk0_frequency<< "\n";
std::cout << "\t**.average_dclk0_frequency : " << std::dec
<< gpu_metrics.average_dclk0_frequency << "\n";
std::cout << "\t**.average_vclk1_frequency : " << std::dec
<< gpu_metrics.average_vclk1_frequency << "\n";
std::cout << "\t**.average_dclk1_frequency : " << std::dec
<< gpu_metrics.average_dclk1_frequency << "\n";
std::cout << "\t**.current_gfxclk : " << std::dec
<< gpu_metrics.current_gfxclk << "\n";
std::cout << "\t**.current_socclk : " << std::dec
<< gpu_metrics.current_socclk << "\n";
std::cout << "\t**.current_uclk : " << std::dec
<< gpu_metrics.current_uclk << "\n";
std::cout << "\t**.current_vclk0 : " << std::dec
<< gpu_metrics.current_vclk0 << "\n";
std::cout << "\t**.current_dclk0 : " << std::dec
<< gpu_metrics.current_dclk0 << "\n";
std::cout << "\t**.current_vclk1 : " << std::dec
<< gpu_metrics.current_vclk1 << "\n";
std::cout << "\t**.current_dclk1 : " << std::dec
<< gpu_metrics.current_dclk1 << "\n";
std::cout << "\t**.throttle_status : " << std::dec
<< gpu_metrics.throttle_status << "\n";
std::cout << "\t**.current_fan_speed : " << std::dec
<< gpu_metrics.current_fan_speed << "\n";
std::cout << "\t**.pcie_link_width : " << std::dec
<< gpu_metrics.pcie_link_width << "\n";
std::cout << "\t**.pcie_link_speed : " << std::dec
<< gpu_metrics.pcie_link_speed << "\n";
std::cout << "\t**.gfx_activity_acc : " << std::dec
<< gpu_metrics.gfx_activity_acc << "\n";
std::cout << "\t**.mem_activity_acc : " << std::dec
<< gpu_metrics.mem_activity_acc << "\n";
std::cout << "\t**.firmware_timestamp : " << std::dec
<< gpu_metrics.firmware_timestamp << "\n";
std::cout << "\t**.voltage_soc : " << std::dec
<< gpu_metrics.voltage_soc << "\n";
std::cout << "\t**.voltage_gfx : " << std::dec
<< gpu_metrics.voltage_gfx << "\n";
std::cout << "\t**.voltage_mem : " << std::dec
<< gpu_metrics.voltage_mem << "\n";
std::cout << "\t**.indep_throttle_status : " << std::dec
<< gpu_metrics.indep_throttle_status << "\n";
std::cout << "\t**.current_socket_power : " << std::dec
<< gpu_metrics.current_socket_power << "\n";
std::cout << "\t**.gfxclk_lock_status : " << std::dec
<< gpu_metrics.gfxclk_lock_status << "\n";
std::cout << "\t**.xgmi_link_width : " << std::dec
<< gpu_metrics.xgmi_link_width << "\n";
std::cout << "\t**.xgmi_link_speed : " << std::dec
<< gpu_metrics.xgmi_link_speed << "\n";
std::cout << "\t**.pcie_bandwidth_acc : " << std::dec
<< gpu_metrics.pcie_bandwidth_acc << "\n";
std::cout << "\t**.pcie_bandwidth_inst : " << std::dec
<< gpu_metrics.pcie_bandwidth_inst << "\n";
std::cout << "\t**.pcie_l0_to_recov_count_acc : " << std::dec
<< gpu_metrics.pcie_l0_to_recov_count_acc << "\n";
std::cout << "\t**.pcie_replay_count_acc : " << std::dec
<< gpu_metrics.pcie_replay_count_acc << "\n";
std::cout << "\t**.pcie_replay_rover_count_acc : " << std::dec
<< gpu_metrics.pcie_replay_rover_count_acc << "\n";
std::cout << "METRIC TABLE HEADER:\n";
std::cout << "structure_size=" << std::dec
<< static_cast<uint16_t>(smu.common_header.structure_size) << "\n";
std::cout << "\tformat_revision=" << std::dec
<< static_cast<uint16_t>(smu.common_header.format_revision) << "\n";
std::cout << "\tcontent_revision=" << std::dec
<< static_cast<uint16_t>(smu.common_header.content_revision) << "\n";
std::cout << "\t**.temperature_hbm[] : " << std::dec << "\n";
for (const auto& temp : gpu_metrics.temperature_hbm) {
std::cout << "\t -> " << std::dec << temp << "\n";
}
std::cout << "\n";
std::cout << "TIME STAMPS (ns):\n";
std::cout << std::dec << "\tsystem_clock_counter=" << smu.system_clock_counter << "\n";
std::cout << "\tfirmware_timestamp (10ns resolution)=" << std::dec << smu.firmware_timestamp
<< "\n";
std::cout << "\t**.vcn_activity[] : " << std::dec << "\n";
for (const auto& vcn : gpu_metrics.vcn_activity) {
std::cout << "\t -> " << std::dec << vcn << "\n";
}
std::cout << "\t**.xgmi_read_data_acc[] : " << std::dec << "\n";
for (const auto& read_data : gpu_metrics.xgmi_read_data_acc) {
std::cout << "\t -> " << std::dec << read_data << "\n";
}
std::cout << "\t**.xgmi_write_data_acc[] : " << std::dec << "\n";
for (const auto& write_data : gpu_metrics.xgmi_write_data_acc) {
std::cout << "\t -> " << std::dec << write_data << "\n";
}
std::cout << "\t**.current_gfxclks[] : " << std::dec << "\n";
for (const auto& gfxclk : gpu_metrics.current_gfxclks) {
std::cout << "\t -> " << std::dec << gfxclk << "\n";
}
std::cout << "\t**.current_socclks[] : " << std::dec << "\n";
for (const auto& socclk : gpu_metrics.current_socclks) {
std::cout << "\t -> " << std::dec << socclk << "\n";
}
std::cout << "\t**.current_vclk0s[] : " << std::dec << "\n";
for (const auto& vclk : gpu_metrics.current_vclk0s) {
std::cout << "\t -> " << std::dec << vclk << "\n";
}
std::cout << "\t**.current_dclk0s[] : " << std::dec << "\n";
for (const auto& dclk : gpu_metrics.current_dclk0s) {
std::cout << "\t -> " << std::dec << dclk << "\n";
std::cout << "\n";
std::cout << "TEMPERATURES (C):\n";
std::cout << std::dec << "\ttemperature_edge= " << smu.temperature_edge << "\n";
std::cout << std::dec << "\ttemperature_hotspot= " << smu.temperature_hotspot << "\n";
std::cout << std::dec << "\ttemperature_mem= " << smu.temperature_mem << "\n";
std::cout << std::dec << "\ttemperature_vrgfx= " << smu.temperature_vrgfx << "\n";
std::cout << std::dec << "\ttemperature_vrsoc= " << smu.temperature_vrsoc << "\n";
std::cout << std::dec << "\ttemperature_vrmem= " << smu.temperature_vrmem << "\n";
std::cout << "\ttemperature_hbm = [";
auto idx = 0;
for (const auto& temp : smu.temperature_hbm) {
std::cout << temp;
if ((idx + 1) != std::size(smu.temperature_hbm)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << "\n";
std::cout << "UTILIZATION (%):\n";
std::cout << std::dec << "\taverage_gfx_activity=" << smu.average_gfx_activity << "\n";
std::cout << std::dec << "\taverage_umc_activity=" << smu.average_umc_activity << "\n";
std::cout << std::dec << "\taverage_mm_activity=" << smu.average_mm_activity << "\n";
std::cout << std::dec << "\tvcn_activity= [";
idx = 0;
for (const auto& temp : smu.vcn_activity) {
std::cout << temp;
if ((idx + 1) != std::size(smu.vcn_activity)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << "\n";
std::cout << std::dec << "\tjpeg_activity= [";
idx = 0;
for (const auto& temp : smu.jpeg_activity) {
std::cout << temp;
if ((idx + 1) != std::size(smu.jpeg_activity)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << "\n";
std::cout << "POWER (W)/ENERGY (15.259uJ per 1ns):\n";
std::cout << std::dec << "\taverage_socket_power=" << smu.average_socket_power << "\n";
std::cout << std::dec << "\tcurrent_socket_power=" << smu.current_socket_power << "\n";
std::cout << std::dec << "\tenergy_accumulator=" << smu.energy_accumulator << "\n";
std::cout << "\n";
std::cout << "AVG CLOCKS (MHz):\n";
std::cout << std::dec << "\taverage_gfxclk_frequency=" << smu.average_gfxclk_frequency
<< "\n";
std::cout << std::dec << "\taverage_socclk_frequency=" << smu.average_socclk_frequency
<< "\n";
std::cout << std::dec << "\taverage_uclk_frequency=" << smu.average_uclk_frequency << "\n";
std::cout << std::dec << "\taverage_vclk0_frequency=" << smu.average_vclk0_frequency
<< "\n";
std::cout << std::dec << "\taverage_dclk0_frequency=" << smu.average_dclk0_frequency
<< "\n";
std::cout << std::dec << "\taverage_vclk1_frequency=" << smu.average_vclk1_frequency
<< "\n";
std::cout << std::dec << "\taverage_dclk1_frequency=" << smu.average_dclk1_frequency
<< "\n";
std::cout << "\n";
std::cout << "CURRENT CLOCKS (MHz):\n";
std::cout << std::dec << "\tcurrent_gfxclk=" << smu.current_gfxclk << "\n";
std::cout << std::dec << "\tcurrent_gfxclks= [";
idx = 0;
for (const auto& temp : smu.current_gfxclks) {
std::cout << temp;
if ((idx + 1) != std::size(smu.current_gfxclks)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << std::dec << "\tcurrent_socclk=" << smu.current_socclk << "\n";
std::cout << std::dec << "\tcurrent_socclks= [";
idx = 0;
for (const auto& temp : smu.current_socclks) {
std::cout << temp;
if ((idx + 1) != std::size(smu.current_socclks)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << std::dec << "\tcurrent_uclk=" << smu.current_uclk << "\n";
std::cout << std::dec << "\tcurrent_vclk0=" << smu.current_vclk0 << "\n";
std::cout << std::dec << "\tcurrent_vclk0s= [";
idx = 0;
for (const auto& temp : smu.current_vclk0s) {
std::cout << temp;
if ((idx + 1) != std::size(smu.current_vclk0s)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << std::dec << "\tcurrent_dclk0=" << smu.current_dclk0 << "\n";
std::cout << std::dec << "\tcurrent_dclk0s= [";
idx = 0;
for (const auto& temp : smu.current_dclk0s) {
std::cout << temp;
if ((idx + 1) != std::size(smu.current_dclk0s)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << std::dec << "\tcurrent_vclk1=" << smu.current_vclk1 << "\n";
std::cout << std::dec << "\tcurrent_dclk1=" << smu.current_dclk1 << "\n";
std::cout << "\n";
std::cout << "THROTTLE STATUS:\n";
std::cout << std::dec << "\tthrottle_status=" << smu.throttle_status << "\n";
std::cout << "\n";
std::cout << "FAN SPEED:\n";
std::cout << std::dec << "\tcurrent_fan_speed=" << smu.current_fan_speed << "\n";
std::cout << "\n";
std::cout << "LINK WIDTH (number of lanes) /SPEED (0.1 GT/s):\n";
std::cout << "\tpcie_link_width=" << smu.pcie_link_width << "\n";
std::cout << "\tpcie_link_speed=" << smu.pcie_link_speed << "\n";
std::cout << "\txgmi_link_width=" << smu.xgmi_link_width << "\n";
std::cout << "\txgmi_link_speed=" << smu.xgmi_link_speed << "\n";
std::cout << "\n";
std::cout << "Utilization Accumulated(%):\n";
std::cout << "\tgfx_activity_acc=" << std::dec << smu.gfx_activity_acc << "\n";
std::cout << "\tmem_activity_acc=" << std::dec << smu.mem_activity_acc << "\n";
std::cout << "\n";
std::cout << "XGMI ACCUMULATED DATA TRANSFER SIZE (KB):\n";
std::cout << std::dec << "\txgmi_read_data_acc= [";
idx = 0;
for (const auto& temp : smu.xgmi_read_data_acc) {
std::cout << temp;
if ((idx + 1) != std::size(smu.xgmi_read_data_acc)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
std::cout << std::dec << "\txgmi_write_data_acc= [";
idx = 0;
for (const auto& temp : smu.xgmi_write_data_acc) {
std::cout << temp;
if ((idx + 1) != std::size(smu.xgmi_write_data_acc)) {
std::cout << ", ";
} else {
std::cout << "]\n";
}
++idx;
}
// Voltage (mV)
std::cout << "\tvoltage_soc = " << std::dec << smu.voltage_soc << "\n";
std::cout << "\tvoltage_gfx = " << std::dec << smu.voltage_gfx << "\n";
std::cout << "\tvoltage_mem = " << std::dec << smu.voltage_mem << "\n";
std::cout << "\tindep_throttle_status = " << std::dec << smu.indep_throttle_status << "\n";
// Clock Lock Status. Each bit corresponds to clock instance
std::cout << "\tgfxclk_lock_status (in hex) = " << std::hex
<< smu.gfxclk_lock_status << std::dec <<"\n";
// Bandwidth (GB/sec)
std::cout << "\tpcie_bandwidth_acc=" << std::dec << smu.pcie_bandwidth_acc << "\n";
std::cout << "\tpcie_bandwidth_inst=" << std::dec << smu.pcie_bandwidth_inst << "\n";
// Counts
std::cout << "\tpcie_l0_to_recov_count_acc= " << std::dec << smu.pcie_l0_to_recov_count_acc
<< "\n";
std::cout << "\tpcie_replay_count_acc= " << std::dec << smu.pcie_replay_count_acc << "\n";
std::cout << "\tpcie_replay_rover_count_acc= " << std::dec
<< smu.pcie_replay_rover_count_acc << "\n";
std::cout << "\tpcie_nak_sent_count_acc= " << std::dec << smu.pcie_nak_sent_count_acc
<< "\n";
std::cout << "\tpcie_nak_rcvd_count_acc= " << std::dec << smu.pcie_nak_rcvd_count_acc
<< "\n";
// Accumulation cycle counter
// Accumulated throttler residencies
std::cout << "\n";
std::cout << "RESIDENCY ACCUMULATION / COUNTER:\n";
std::cout << "\taccumulation_counter = " << std::dec << smu.accumulation_counter << "\n";
std::cout << "\tprochot_residency_acc = " << std::dec << smu.prochot_residency_acc << "\n";
std::cout << "\tppt_residency_acc = " << std::dec << smu.ppt_residency_acc << "\n";
std::cout << "\tsocket_thm_residency_acc = " << std::dec << smu.socket_thm_residency_acc
<< "\n";
std::cout << "\tvr_thm_residency_acc = " << std::dec << smu.vr_thm_residency_acc
<< "\n";
std::cout << "\thbm_thm_residency_acc = " << std::dec << smu.hbm_thm_residency_acc << "\n";
// Number of current partitions
std::cout << "\tnum_partition = " << std::dec << smu.num_partition << "\n";
// PCIE other end recovery counter
std::cout << "\tpcie_lc_perf_other_end_recovery = "
<< std::dec << smu.pcie_lc_perf_other_end_recovery << "\n";
idx = 0;
auto idy = 0;
std::cout << "\txcp_stats.gfx_busy_inst: " << "\n";
for (auto& row : smu.xcp_stats) {
std::cout << "\t XCP [" << idx << "] : [";
for (auto& col : row.gfx_busy_inst) {
if ((idy + 1) != std::size(row.gfx_busy_inst)) {
std::cout << col << ", ";
} else {
std::cout << col;
}
idy++;
}
std::cout << "]\n";
idy = 0;
idx++;
}
idx = 0;
idy = 0;
std::cout << "\txcp_stats.vcn_busy: " << "\n";
for (auto& row : smu.xcp_stats) {
std::cout << "\t XCP [" << idx << "] : [";
for (auto& col : row.vcn_busy) {
if ((idy + 1) != std::size(row.vcn_busy)) {
std::cout << col << ", ";
} else {
std::cout << col;
}
idy++;
}
std::cout << "]\n";
idy = 0;
idx++;
}
idx = 0;
idy = 0;
std::cout << "\txcp_stats.jpeg_busy: " << "\n";
for (auto& row : smu.xcp_stats) {
std::cout << "\t XCP [" << idx << "] : [";
for (auto& col : row.jpeg_busy) {
if ((idy + 1) != std::size(row.jpeg_busy)) {
std::cout << col << ", ";
} else {
std::cout << col;
}
idy++;
}
std::cout << "]\n";
idy = 0;
idx++;
}
idx = 0;
idy = 0;
std::cout << "\txcp_stats.gfx_busy_acc: " << "\n";
for (auto& row : smu.xcp_stats) {
std::cout << "\t XCP [" << idx << "] : [";
for (auto& col : row.gfx_busy_acc) {
if ((idy + 1) != std::size(row.gfx_busy_acc)) {
std::cout << col << ", ";
} else {
std::cout << col;
}
idy++;
}
std::cout << "]\n";
idy = 0;
idx++;
}
std::cout << "\n\n";
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
constexpr uint16_t kMAX_ITER_TEST = 10;
amdsmi_gpu_metrics_t gpu_metrics_check;
amdsmi_gpu_metrics_t gpu_metrics_check = {};
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n";
amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_metrics_check.firmware_timestamp << "\n";
}
std::cout << "\n";
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n";
amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_metrics_check.system_clock_counter << "\n";
}
std::cout << "\n";
std::cout << "\n";
std::cout << "\t ** Note: Values MAX'ed out (UINTX MAX are unsupported for the version in question) ** " << "\n";
std::cout << "\n";
std::cout << "+=======+==================+============+=============="
<< "+=============+=============+=============+============+\n";
}
std::cout << " ** Note: Values MAX'ed out "
<< "(UINTX MAX are unsupported for the version in question) ** " << "\n\n";
// Get nearest GPUs
const char *topology_link_type_str[] = {
"AMDSMI_LINK_TYPE_INTERNAL",
"AMDSMI_LINK_TYPE_XGMI",
"AMDSMI_LINK_TYPE_PCIE",
"AMDSMI_LINK_TYPE_NOT_APPLICABLE",
"AMDSMI_LINK_TYPE_UNKNOWN",
};
printf("\tOutput of amdsmi_get_link_topology_nearest:\n");
for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) {
auto topology_nearest_info = amdsmi_topology_nearest_t();
ret = amdsmi_get_link_topology_nearest(processor_handles[j],
static_cast<amdsmi_link_type_t>(topo_link_type),
nullptr);
if (ret != AMDSMI_STATUS_INVAL) {
CHK_AMDSMI_RET(ret);
}
ret = amdsmi_get_link_topology_nearest(processor_handles[j],
static_cast<amdsmi_link_type_t>(topo_link_type),
&topology_nearest_info);
if (ret != AMDSMI_STATUS_INVAL) {
CHK_AMDSMI_RET(ret);
}
printf("\tNearest GPUs found at %s\n", topology_link_type_str[topo_link_type]);
printf("\tNearest Count: %d\n", topology_nearest_info.count);
for (uint32_t k = 0; k < topology_nearest_info.count; k++) {
amdsmi_bdf_t bdf = {};
ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf);
CHK_AMDSMI_RET(ret)
printf("\t\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number,
bdf.bus_number, bdf.device_number, bdf.function_number);
}
}
}
}
// Clean up resources allocated at amdsmi_init. It will invalidate sockets
+33 -2
View file
@@ -61,7 +61,7 @@
const char *err_str; \
std::cout << "AMDSMI call returned " << RET << " at line " \
<< __LINE__ << std::endl; \
amdsmi_status_code_to_string(RET, &err_str); \
amdsmi_status_code_to_string(RET, &err_str); \
std::cout << err_str << std::endl; \
return RET; \
} \
@@ -262,8 +262,10 @@ int main() {
char bad_page_status_names[3][15] = {"RESERVED", "PENDING",
"UNRESERVABLE"};
uint32_t num_pages = 0;
std::vector<amdsmi_retired_page_record_t> bad_page_info(num_pages);
ret = amdsmi_get_gpu_bad_page_info(processor_handles[j], &num_pages,
nullptr);
bad_page_info.data());
std::cout << "num_pages = " << num_pages << "\n";
CHK_AMDSMI_RET(ret)
printf(" Output of amdsmi_get_gpu_bad_page_info:\n");
if (!num_pages) {
@@ -344,6 +346,35 @@ int main() {
<<"," << policy.policies[x].policy_description << ")\n";
}
}
// Get nearest GPUs
const char *topology_link_type_str[] = {
"AMDSMI_LINK_TYPE_INTERNAL",
"AMDSMI_LINK_TYPE_XGMI",
"AMDSMI_LINK_TYPE_PCIE",
"AMDSMI_LINK_TYPE_NOT_APPLICABLE",
"AMDSMI_LINK_TYPE_UNKNOWN",
};
printf("\tOutput of amdsmi_get_link_topology_nearest:\n");
for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) {
auto topology_nearest_info = amdsmi_topology_nearest_t();
ret = amdsmi_get_link_topology_nearest(processor_handles[j],
static_cast<amdsmi_link_type_t>(topo_link_type),
nullptr);
CHK_AMDSMI_RET(ret);
ret = amdsmi_get_link_topology_nearest(processor_handles[j],
static_cast<amdsmi_link_type_t>(topo_link_type),
&topology_nearest_info);
CHK_AMDSMI_RET(ret);
printf("\tNearest GPUs found at %s\n", topology_link_type_str[topo_link_type]);
for (uint32_t k = 0; k < topology_nearest_info.count; k++) {
amdsmi_bdf_t bdf = {};
ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf);
CHK_AMDSMI_RET(ret)
printf("\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number,
bdf.bus_number, bdf.device_number, bdf.function_number);
}
}
}
}
+179 -12
View file
@@ -142,6 +142,29 @@ typedef enum {
*/
#define AMDSMI_MAX_NUM_JPEG 32
/**
* @brief This should match AMDSMI_MAX_NUM_XCC;
* XCC - Accelerated Compute Core, the collection of compute units,
* ACE (Asynchronous Compute Engines), caches,
* and global resources organized as one unit.
*
* Refer to amd.com documentation for more detail:
* https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
*/
#define AMDSMI_MAX_NUM_XCC 8
/**
* @brief This should match AMDSMI_MAX_NUM_XCP;
* XCP - Accelerated Compute Processor,
* also referred to as the Graphics Compute Partitions.
* Each physical GPU can have up to 8 separate partitions
* (depending on ASIC support).
*
* Refer to amd.com documentation for more detail:
* https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
*/
#define AMDSMI_MAX_NUM_XCP 8
/* string format */
#define AMDSMI_TIME_FORMAT "%02d:%02d:%02d.%03d"
#define AMDSMI_DATE_FORMAT "%04d-%02d-%02d:%02d:%02d:%02d.%03d"
@@ -154,10 +177,10 @@ typedef enum {
#define AMDSMI_LIB_VERSION_YEAR 24
//! Major version should be changed for every header change (adding/deleting APIs, changing names, fields of structures, etc.)
#define AMDSMI_LIB_VERSION_MAJOR 6
#define AMDSMI_LIB_VERSION_MAJOR 7
//! Minor version should be updated for each API change, but without changing headers
#define AMDSMI_LIB_VERSION_MINOR 5
#define AMDSMI_LIB_VERSION_MINOR 0
//! Release version should be set to 0 as default and can be updated by the PMs for each CSP point release
#define AMDSMI_LIB_VERSION_RELEASE 0
@@ -503,7 +526,25 @@ typedef struct {
uint32_t vram_used;
uint32_t reserved[2];
} amdsmi_vram_usage_t;
/**
* @brief This structure holds violation status information.
* Note: supported on MI3x ASICs and newer; older ASICs report these fields as unsupported.
*/
typedef struct {
uint64_t reference_timestamp; //!< Represents CPU timestamp in microseconds (uS)
uint64_t violation_timestamp; //!< Violation time in milliseconds (ms)
uint64_t per_prochot_thrm; //!< Processor hot violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_ppt_pwr; //!< PVIOL; Package Power Tracking (PPT) violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_socket_thrm; //!< TVIOL; Socket thermal violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_vr_thrm; //!< Voltage regulator violation % (greater than 0% is a violation); Max uint64 means unsupported
uint64_t per_hbm_thrm; //!< High Bandwidth Memory (HBM) thermal violation % (greater than 0% is a violation); Max uint64 means unsupported
uint8_t active_prochot_thrm; //!< Processor hot violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_ppt_pwr; //!< Package Power Tracking (PPT) violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_socket_thrm; //!< Socket thermal violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_vr_thrm; //!< Voltage regulator violation; 1 = active 0 = not active; Max uint8 means unsupported
uint8_t active_hbm_thrm; //!< High Bandwidth Memory (HBM) thermal violation; 1 = active 0 = not active; Max uint8 means unsupported
uint64_t reserved[24]; // Reserved for new violation info
} amdsmi_violation_status_t;
typedef struct {
amdsmi_range_t supported_freq_range;
amdsmi_range_t current_freq_range;
@@ -544,7 +585,8 @@ typedef struct {
uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link
uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device
uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver
uint64_t reserved[13];
uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter
uint64_t reserved[12];
} pcie_metric;
uint64_t reserved[32];
} amdsmi_pcie_info_t;
@@ -617,7 +659,8 @@ typedef struct {
typedef struct {
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
uint32_t node_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[13];
uint32_t current_partition_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[12];
} amdsmi_kfd_info_t;
/**
@@ -651,8 +694,9 @@ typedef struct {
} amdsmi_accelerator_partition_profile_t;
typedef enum {
AMDSMI_LINK_TYPE_PCIE,
AMDSMI_LINK_TYPE_INTERNAL,
AMDSMI_LINK_TYPE_XGMI,
AMDSMI_LINK_TYPE_PCIE,
AMDSMI_LINK_TYPE_NOT_APPLICABLE,
AMDSMI_LINK_TYPE_UNKNOWN
} amdsmi_link_type_t;
@@ -1382,6 +1426,21 @@ typedef struct {
/// \endcond
} amd_metrics_table_header_t;
/**
* @brief The following structures hold the gpu statistics for a device.
*/
struct amdsmi_gpu_xcp_metrics_t {
/* Utilization Instantaneous (%) */
uint32_t gfx_busy_inst[AMDSMI_MAX_NUM_XCC];
uint16_t jpeg_busy[AMDSMI_MAX_NUM_JPEG];
uint16_t vcn_busy[AMDSMI_MAX_NUM_VCN];
/* Utilization Accumulated (%) */
uint64_t gfx_busy_acc[AMDSMI_MAX_NUM_XCC];
};
typedef struct {
// TODO(amd) Doxygen documents
// Note: This structure is extended to fit the needs of different GPU metric
@@ -1401,6 +1460,7 @@ typedef struct {
/*
* v1.0 Base
*/
// Temperature (C)
uint16_t temperature_edge;
uint16_t temperature_hotspot;
@@ -1493,10 +1553,10 @@ typedef struct {
uint16_t xgmi_link_width;
uint16_t xgmi_link_speed;
// PCIe accumulated bandwidth (GB/sec)
// PCIE accumulated bandwidth (GB/sec)
uint64_t pcie_bandwidth_acc;
// PCIe instantaneous bandwidth (GB/sec)
// PCIE instantaneous bandwidth (GB/sec)
uint64_t pcie_bandwidth_inst;
// PCIE L0 to recovery state transition accumulated count
@@ -1508,20 +1568,20 @@ typedef struct {
// PCIE replay rollover accumulated count
uint64_t pcie_replay_rover_count_acc;
// XGMI accumulated data transfer size (KB)
// XGMI accumulated data transfer size(KiloBytes)
uint64_t xgmi_read_data_acc[AMDSMI_MAX_NUM_XGMI_LINKS];
uint64_t xgmi_write_data_acc[AMDSMI_MAX_NUM_XGMI_LINKS];
// Current clock frequencies (MHz)
// XGMI accumulated data transfer size(KiloBytes)
uint16_t current_gfxclks[AMDSMI_MAX_NUM_GFX_CLKS];
uint16_t current_socclks[AMDSMI_MAX_NUM_CLKS];
uint16_t current_vclk0s[AMDSMI_MAX_NUM_CLKS];
uint16_t current_dclk0s[AMDSMI_MAX_NUM_CLKS];
/*
/*
* v1.5 additions
*/
// JPEG activity % per AID
// JPEG activity percent (encode/decode)
uint16_t jpeg_activity[AMDSMI_MAX_NUM_JPEG];
// PCIE NAK sent accumulated count
@@ -1529,6 +1589,59 @@ typedef struct {
// PCIE NAK received accumulated count
uint32_t pcie_nak_rcvd_count_acc;
/*
* v1.6 additions
*/
/* Accumulation cycle counter */
uint64_t accumulation_counter;
/**
* Accumulated throttler residencies
*/
uint64_t prochot_residency_acc;
/**
* Accumulated throttler residencies
*
* Prochot (thermal) - PPT (power)
* Package Power Tracking (PPT) violation % (greater than 0% is a violation);
* aka PVIOL
*
* Ex. PVIOL/TVIOL calculations
* Where A and B are measurements recorded at prior points in time.
* Typically A is the earlier measured value and B is the latest measured value.
*
* PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A))
* TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A))
*/
uint64_t ppt_residency_acc;
/**
* Accumulated throttler residencies
*
* Socket (thermal) -
* Socket thermal violation % (greater than 0% is a violation);
* aka TVIOL
*
* Ex. PVIOL/TVIOL calculations
* Where A and B are measurements recorded at prior points in time.
* Typically A is the earlier measured value and B is the latest measured value.
*
* PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A))
* TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A))
*/
uint64_t socket_thm_residency_acc;
uint64_t vr_thm_residency_acc;
uint64_t hbm_thm_residency_acc;
/* Number of current partitions */
uint16_t num_partition;
/* XCP (Graphics Compute Partitions) metrics stats */
struct amdsmi_gpu_xcp_metrics_t xcp_stats[AMDSMI_MAX_NUM_XCP];
/* PCIE other end recovery counter */
uint32_t pcie_lc_perf_other_end_recovery;
/// \endcond
} amdsmi_gpu_metrics_t;
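The PVIOL/TVIOL formulas in the comments above can be worked through with a small Python sketch. It assumes two snapshots of the v1.6 accumulators are available as plain dictionaries; the helper name `violation_percent` and the snapshot values are illustrative only and are not part of the AMD SMI API.

```python
def violation_percent(residency_a, residency_b, counter_a, counter_b):
    """Percent of accumulation cycles spent throttled between snapshots A and B.

    Implements (residency(B) - residency(A)) * 100 / (counter(B) - counter(A)).
    Returns None when the accumulation counter did not advance.
    """
    elapsed = counter_b - counter_a
    if elapsed <= 0:
        return None
    return (residency_b - residency_a) * 100 / elapsed


# Hypothetical snapshots; the numbers are made up for illustration.
snapshot_a = {"accumulation_counter": 1000,
              "ppt_residency_acc": 200,
              "socket_thm_residency_acc": 50}
snapshot_b = {"accumulation_counter": 2000,
              "ppt_residency_acc": 450,
              "socket_thm_residency_acc": 50}

pviol = violation_percent(snapshot_a["ppt_residency_acc"],
                          snapshot_b["ppt_residency_acc"],
                          snapshot_a["accumulation_counter"],
                          snapshot_b["accumulation_counter"])
tviol = violation_percent(snapshot_a["socket_thm_residency_acc"],
                          snapshot_b["socket_thm_residency_acc"],
                          snapshot_a["accumulation_counter"],
                          snapshot_b["accumulation_counter"])

print(f"PVIOL = {pviol:.1f}%, TVIOL = {tviol:.1f}%")  # PVIOL = 25.0%, TVIOL = 0.0%
```

The sketch does not handle the max-value sentinel that marks a field as unsupported; a caller would likely want to check for it before subtracting.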
@@ -1585,6 +1698,14 @@ typedef struct {
uint32_t cu_occupancy; //!< Compute Unit usage in percent
} amdsmi_process_info_t;
typedef struct {
uint32_t count;
amdsmi_processor_handle processor_list[AMDSMI_MAX_DEVICES];
uint64_t reserved[15];
} amdsmi_topology_nearest_t;
//! Place-holder "variant" for functions that don't have any variants,
//! but do have monitors or sensors.
#define AMDSMI_DEFAULT_VARIANT 0xFFFFFFFFFFFFFFFF
@@ -5013,6 +5134,23 @@ amdsmi_get_clock_info(amdsmi_processor_handle processor_handle, amdsmi_clk_type_
amdsmi_status_t
amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_handle, amdsmi_vram_usage_t *info);
/**
* @brief Returns the violations for a processor
*
* @platform{gpu_bm_linux} @platform{host} @platform{guest_1vf} @platform{guest_mvf}
*
* @param[in] processor_handle Device which to query
*
*
* @param[in,out] info Reference to all violation status details available.
* Must be allocated by user.
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
*/
amdsmi_status_t
amdsmi_get_violation_status(amdsmi_processor_handle processor_handle,
amdsmi_violation_status_t *info);
/** @} End gpumon */
@@ -5097,6 +5235,35 @@ amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_
/** @} End eccinfo */
/**
* @brief Retrieve the set of GPUs that are nearest to a given device
* at a specific interconnectivity level.
*
* @platform{gpu_bm_linux} @platform{host}
*
* @details Once called topology_nearest_info will get populated with a list of
* all nearest devices for a given link_type. The list has a count of
* the number of devices found and their respective handles/identifiers.
*
* @param[in] processor_handle The identifier of the given device.
*
* @param[in] link_type The amdsmi_link_type_t level to search for nearest GPUs.
*
* @param[in,out] topology_nearest_info
* .count;
* - When zero, is set to the number of matching GPUs such that .processor_list can
* be malloc'd.
* - When non-zero, .processor_list will be filled with count number of processor_handle.
*
* @param[out] .processor_list An array of processor_handle for GPUs found at level.
*
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail.
*/
amdsmi_status_t
amdsmi_get_link_topology_nearest(amdsmi_processor_handle processor_handle,
amdsmi_link_type_t link_type,
amdsmi_topology_nearest_t* topology_nearest_info);
#ifdef ENABLE_ESMI_LIB
/*****************************************************************************/
+48
View file
@@ -3906,6 +3906,54 @@ except AmdSmiException as e:
print(e)
```
### amdsmi_get_link_topology_nearest
Description: Retrieve the set of GPUs that are nearest to a given device
at a specific interconnectivity level.
Input parameters:
* `processor_handle` The identifier of the given device.
* `link_type` The AmdSmiLinkType level to search for nearest devices
Output: Dictionary holding the following fields.
* `count` number of nearest devices found based on given topology level
* `processor_list` list of all nearest device handles found
Exceptions that can be thrown by `amdsmi_get_link_topology_nearest` function:
* `AmdSmiLibraryException`
Example:
```python
try:
    amdsmi_init()
    devices = amdsmi_get_processor_handles()
    if len(devices) == 0:
        print("No GPUs found on machine")
        exit()
    else:
        print(amdsmi_get_gpu_device_uuid(devices[0]))
        nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType.AMDSMI_LINK_TYPE_PCIE)
        if (nearest_gpus['count']) == 0:
            print("No nearest GPUs found on machine")
        else:
            print("Nearest GPUs")
            for gpu in nearest_gpus['processor_list']:
                print(amdsmi_get_gpu_device_uuid(gpu))
except AmdSmiException as e:
    print(e)
finally:
    try:
        amdsmi_shut_down()
    except AmdSmiException as e:
        print(e)
```
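To survey several interconnect levels in one pass, the call can be wrapped in a small loop. The sketch below is not part of the library: `collect_nearest` and `fake_query` are hypothetical names, and the query function is passed in as a parameter so the block runs without a GPU. In real use, `query` would presumably be `amdsmi_get_link_topology_nearest` and `link_types` the `AmdSmiLinkType` members.

```python
def collect_nearest(device, link_types, query):
    """Map each link type to the list of nearest processors reported for it.

    `query(device, link_type)` is expected to return a dictionary with the
    `count` and `processor_list` fields described above.
    """
    nearest = {}
    for link_type in link_types:
        info = query(device, link_type)
        # Trust `count` over the list length in case the list is padded.
        nearest[link_type] = list(info["processor_list"][:info["count"]])
    return nearest


def fake_query(device, link_type):
    """Stand-in for the real API call; returns canned data for the demo."""
    canned = {
        "XGMI": {"count": 2, "processor_list": ["gpu1", "gpu2"]},
        "PCIE": {"count": 0, "processor_list": []},
    }
    return canned[link_type]


result = collect_nearest("gpu0", ["XGMI", "PCIE"], fake_query)
print(result)  # {'XGMI': ['gpu1', 'gpu2'], 'PCIE': []}
```

With the real binding, a caller may want to catch `AmdSmiLibraryException` per link type so that one unsupported level does not abort the whole survey.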
## CPU APIs
### amdsmi_get_processor_info
+3
View file
@@ -106,6 +106,7 @@ from .amdsmi_interface import amdsmi_get_clock_info
from .amdsmi_interface import amdsmi_get_pcie_info
from .amdsmi_interface import amdsmi_get_gpu_bad_page_info
from .amdsmi_interface import amdsmi_get_violation_status
# # Process Information
from .amdsmi_interface import amdsmi_get_gpu_process_list
@@ -216,6 +217,7 @@ from .amdsmi_interface import amdsmi_topo_get_link_type
from .amdsmi_interface import amdsmi_topo_get_p2p_status
from .amdsmi_interface import amdsmi_is_P2P_accessible
from .amdsmi_interface import amdsmi_get_xgmi_info
from .amdsmi_interface import amdsmi_get_link_topology_nearest
# # Partition Functions
from .amdsmi_interface import amdsmi_get_gpu_compute_partition
@@ -255,6 +257,7 @@ from .amdsmi_interface import AmdSmiFreqInd
from .amdsmi_interface import AmdSmiXgmiStatus
from .amdsmi_interface import AmdSmiMemoryPageStatus
from .amdsmi_interface import AmdSmiIoLinkType
from .amdsmi_interface import AmdSmiLinkType
from .amdsmi_interface import AmdSmiUtilizationCounterType
from .amdsmi_interface import AmdSmiProcessorType
+179 -133
View file
@@ -42,6 +42,8 @@ AMDSMI_MAX_NUM_GFX_CLKS = 8
AMDSMI_MAX_AID = 4
AMDSMI_MAX_ENGINES = 8
AMDSMI_MAX_NUM_JPEG = 32
AMDSMI_MAX_NUM_XCC = 8
AMDSMI_MAX_NUM_XCP = 8
# Max number of DPM policies
AMDSMI_MAX_NUM_PM_POLICIES = 32
@@ -383,6 +385,14 @@ class AmdSmiIoLinkType(IntEnum):
SIZE = amdsmi_wrapper.AMDSMI_IOLINK_TYPE_SIZE
class AmdSmiLinkType(IntEnum):
AMDSMI_LINK_TYPE_INTERNAL = amdsmi_wrapper.AMDSMI_LINK_TYPE_INTERNAL
AMDSMI_LINK_TYPE_XGMI = amdsmi_wrapper.AMDSMI_LINK_TYPE_XGMI
AMDSMI_LINK_TYPE_PCIE = amdsmi_wrapper.AMDSMI_LINK_TYPE_PCIE
AMDSMI_LINK_TYPE_NOT_APPLICABLE = amdsmi_wrapper.AMDSMI_LINK_TYPE_NOT_APPLICABLE
AMDSMI_LINK_TYPE_UNKNOWN = amdsmi_wrapper.AMDSMI_LINK_TYPE_UNKNOWN
class AmdSmiUtilizationCounterType(IntEnum):
COARSE_GRAIN_GFX_ACTIVITY = amdsmi_wrapper.AMDSMI_COARSE_GRAIN_GFX_ACTIVITY
COARSE_GRAIN_MEM_ACTIVITY = amdsmi_wrapper.AMDSMI_COARSE_GRAIN_MEM_ACTIVITY
@@ -596,19 +606,27 @@ class MaxUIntegerTypes(IntEnum):
UINT32_T = 0xFFFFFFFF
UINT64_T = 0xFFFFFFFFFFFFFFFF
def _validate_if_max_uint(value, uint_type: MaxUIntegerTypes):
def _validate_if_max_uint(value, uint_type: MaxUIntegerTypes, isActivity=False, isBool=False):
return_val = "N/A"
if not isinstance(value, list):
if value == uint_type:
if (value == uint_type) or (isActivity and value > 100):
return return_val
else:
return value
if isBool:
return bool(value)
else:
return value
else:
return_val = value
for idx, v in enumerate(value):
if v == uint_type:
return_val[idx] = "N/A"
return return_val
return_val = []
for _, v in enumerate(value):
if (v == uint_type) or (isActivity and v > 100):
return_val.append("N/A")
else:
return_val.append(v)
if isBool:
return [v if v == "N/A" else bool(v) for v in return_val]
else:
return return_val
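Because the old and new lines of `_validate_if_max_uint` are interleaved in this view, a standalone sketch of the sentinel-masking idea may help. It is a simplified illustration rather than the library's code: `mask_unsupported` is a hypothetical name, and the `isBool` conversion is left out.

```python
UINT16_MAX = 0xFFFF


def mask_unsupported(value, max_value, is_activity=False):
    """Replace 'unsupported' sentinels with "N/A".

    A value counts as unsupported when it equals the maximum of its unsigned
    type, or, for activity percentages, when it exceeds 100.
    Lists are masked element by element and returned as a new list.
    """
    def mask_one(v):
        if v == max_value or (is_activity and v > 100):
            return "N/A"
        return v

    if isinstance(value, list):
        return [mask_one(v) for v in value]
    return mask_one(value)


print(mask_unsupported(45, UINT16_MAX))                      # 45
print(mask_unsupported(UINT16_MAX, UINT16_MAX))              # N/A
print(mask_unsupported([30, UINT16_MAX, 250], UINT16_MAX,
                       is_activity=True))                    # [30, 'N/A', 'N/A']
```

Building a new list, as the added lines do, avoids mutating the caller's input while iterating over it.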
def amdsmi_get_socket_handles() -> List[amdsmi_wrapper.amdsmi_socket_handle]:
@@ -1717,8 +1735,9 @@ def amdsmi_get_gpu_kfd_info(
)
kfd_info = {
"kfd_id": _validate_if_max_uint(kfd_info_struct.kfd_id, MaxUIntegerTypes.UINT32_T),
"node_id": _validate_if_max_uint(kfd_info_struct.node_id, MaxUIntegerTypes.UINT64_T)
"kfd_id": _validate_if_max_uint(kfd_info_struct.kfd_id, MaxUIntegerTypes.UINT64_T),
"node_id": _validate_if_max_uint(kfd_info_struct.node_id, MaxUIntegerTypes.UINT32_T),
"current_partition_id": _validate_if_max_uint(kfd_info_struct.current_partition_id, MaxUIntegerTypes.UINT32_T)
}
return kfd_info
@@ -1984,6 +2003,35 @@ def amdsmi_get_gpu_bad_page_info(
return _format_bad_page_info(bad_pages, num_pages)
def amdsmi_get_violation_status(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> Dict[str, Any]:
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
raise AmdSmiParameterException(
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
)
violation_status = amdsmi_wrapper.amdsmi_violation_status_t()
_check_res(
amdsmi_wrapper.amdsmi_get_violation_status(
processor_handle, ctypes.byref(violation_status))
)
return {
"reference_timestamp": _validate_if_max_uint(violation_status.reference_timestamp, MaxUIntegerTypes.UINT64_T),
"violation_timestamp": _validate_if_max_uint(violation_status.violation_timestamp, MaxUIntegerTypes.UINT64_T),
"per_prochot_thrm": _validate_if_max_uint(violation_status.per_prochot_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True),
"per_ppt_pwr": _validate_if_max_uint(violation_status.per_ppt_pwr, MaxUIntegerTypes.UINT64_T, isActivity=True), #PVIOL
"per_socket_thrm": _validate_if_max_uint(violation_status.per_socket_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True), #TVIOL
"per_vr_thrm": _validate_if_max_uint(violation_status.per_vr_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True),
"per_hbm_thrm": _validate_if_max_uint(violation_status.per_hbm_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True),
"active_prochot_thrm": _validate_if_max_uint(violation_status.active_prochot_thrm, MaxUIntegerTypes.UINT8_T, isBool=True),
"active_ppt_pwr": _validate_if_max_uint(violation_status.active_ppt_pwr, MaxUIntegerTypes.UINT8_T, isBool=True), #PVIOL
"active_socket_thrm": _validate_if_max_uint(violation_status.active_socket_thrm, MaxUIntegerTypes.UINT8_T, isBool=True), #TVIOL
"active_vr_thrm": _validate_if_max_uint(violation_status.active_vr_thrm, MaxUIntegerTypes.UINT8_T, isBool=True),
"active_hbm_thrm": _validate_if_max_uint(violation_status.active_hbm_thrm, MaxUIntegerTypes.UINT8_T, isBool=True)
}
def amdsmi_get_gpu_total_ecc_count(
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
) -> Dict[str, Any]:
@@ -2337,6 +2385,7 @@ def amdsmi_get_pcie_info(
"pcie_replay_roll_over_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_replay_roll_over_count, MaxUIntegerTypes.UINT64_T),
"pcie_nak_sent_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_nak_sent_count, MaxUIntegerTypes.UINT64_T),
"pcie_nak_received_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_nak_received_count, MaxUIntegerTypes.UINT64_T),
"pcie_lc_perf_other_end_recovery_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_lc_perf_other_end_recovery_count, MaxUIntegerTypes.UINT32_T)
}
}
@@ -2739,7 +2788,7 @@ def amdsmi_get_gpu_accelerator_partition_profile(
"profile_type" : profile.profile_type,
"num_partitions" : profile.num_partitions,
"profile_index" : profile.profile_index,
"memory_caps" : "N/A",
"memory_caps" : profile.memory_caps,
"num_resources" : profile.num_resources,
"resources" : "N/A"
}
@@ -3765,130 +3814,104 @@ def amdsmi_get_gpu_metrics_info(
)
gpu_metrics_output = {
"temperature_edge": gpu_metrics.temperature_edge,
"temperature_hotspot": gpu_metrics.temperature_hotspot,
"temperature_mem": gpu_metrics.temperature_mem,
"temperature_vrgfx": gpu_metrics.temperature_vrgfx,
"temperature_vrsoc": gpu_metrics.temperature_vrsoc,
"temperature_vrmem": gpu_metrics.temperature_vrmem,
"average_gfx_activity": gpu_metrics.average_gfx_activity,
"average_umc_activity": gpu_metrics.average_umc_activity,
"average_mm_activity": gpu_metrics.average_mm_activity,
"average_socket_power": gpu_metrics.average_socket_power,
"energy_accumulator": gpu_metrics.energy_accumulator,
"system_clock_counter": gpu_metrics.system_clock_counter,
"average_gfxclk_frequency": gpu_metrics.average_gfxclk_frequency,
"average_socclk_frequency": gpu_metrics.average_socclk_frequency,
"average_uclk_frequency": gpu_metrics.average_uclk_frequency,
"average_vclk0_frequency": gpu_metrics.average_vclk0_frequency,
"average_dclk0_frequency": gpu_metrics.average_dclk0_frequency,
"average_vclk1_frequency": gpu_metrics.average_vclk1_frequency,
"average_dclk1_frequency": gpu_metrics.average_dclk1_frequency,
"current_gfxclk": gpu_metrics.current_gfxclk,
"current_socclk": gpu_metrics.current_socclk,
"current_uclk": gpu_metrics.current_uclk,
"current_vclk0": gpu_metrics.current_vclk0,
"current_dclk0": gpu_metrics.current_dclk0,
"current_vclk1": gpu_metrics.current_vclk1,
"current_dclk1": gpu_metrics.current_dclk1,
"throttle_status": gpu_metrics.throttle_status,
"current_fan_speed": gpu_metrics.current_fan_speed,
"pcie_link_width": gpu_metrics.pcie_link_width,
"pcie_link_speed": gpu_metrics.pcie_link_speed,
"gfx_activity_acc": gpu_metrics.gfx_activity_acc,
"mem_activity_acc": gpu_metrics.mem_activity_acc,
"temperature_hbm": list(gpu_metrics.temperature_hbm),
"firmware_timestamp": gpu_metrics.firmware_timestamp,
"voltage_soc": gpu_metrics.voltage_soc,
"voltage_gfx": gpu_metrics.voltage_gfx,
"voltage_mem": gpu_metrics.voltage_mem,
"indep_throttle_status": gpu_metrics.indep_throttle_status,
"current_socket_power": gpu_metrics.current_socket_power,
"vcn_activity": list(gpu_metrics.vcn_activity),
"gfxclk_lock_status": gpu_metrics.gfxclk_lock_status,
"xgmi_link_width": gpu_metrics.xgmi_link_width,
"xgmi_link_speed": gpu_metrics.xgmi_link_speed,
"pcie_bandwidth_acc": gpu_metrics.pcie_bandwidth_acc,
"pcie_bandwidth_inst": gpu_metrics.pcie_bandwidth_inst,
"pcie_l0_to_recov_count_acc": gpu_metrics.pcie_l0_to_recov_count_acc,
"pcie_replay_count_acc": gpu_metrics.pcie_replay_count_acc,
"pcie_replay_rover_count_acc": gpu_metrics.pcie_replay_rover_count_acc,
"xgmi_read_data_acc": list(gpu_metrics.xgmi_read_data_acc),
"xgmi_write_data_acc": list(gpu_metrics.xgmi_write_data_acc),
"current_gfxclks": list(gpu_metrics.current_gfxclks),
"current_socclks": list(gpu_metrics.current_socclks),
"current_vclk0s": list(gpu_metrics.current_vclk0s),
"current_dclk0s": list(gpu_metrics.current_dclk0s),
"pcie_nak_sent_count_acc": gpu_metrics.pcie_nak_sent_count_acc,
"pcie_nak_rcvd_count_acc": gpu_metrics.pcie_nak_rcvd_count_acc,
"jpeg_activity": list(gpu_metrics.jpeg_activity),
"temperature_edge": _validate_if_max_uint(gpu_metrics.temperature_edge, MaxUIntegerTypes.UINT16_T),
"temperature_hotspot": _validate_if_max_uint(gpu_metrics.temperature_hotspot, MaxUIntegerTypes.UINT16_T),
"temperature_mem": _validate_if_max_uint(gpu_metrics.temperature_mem, MaxUIntegerTypes.UINT16_T),
"temperature_vrgfx": _validate_if_max_uint(gpu_metrics.temperature_vrgfx, MaxUIntegerTypes.UINT16_T),
"temperature_vrsoc": _validate_if_max_uint(gpu_metrics.temperature_vrsoc, MaxUIntegerTypes.UINT16_T),
"temperature_vrmem": _validate_if_max_uint(gpu_metrics.temperature_vrmem, MaxUIntegerTypes.UINT16_T),
"average_gfx_activity": _validate_if_max_uint(gpu_metrics.average_gfx_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_umc_activity": _validate_if_max_uint(gpu_metrics.average_umc_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_mm_activity": _validate_if_max_uint(gpu_metrics.average_mm_activity, MaxUIntegerTypes.UINT16_T, isActivity=True),
"average_socket_power": _validate_if_max_uint(gpu_metrics.average_socket_power, MaxUIntegerTypes.UINT16_T),
"energy_accumulator": _validate_if_max_uint(gpu_metrics.energy_accumulator, MaxUIntegerTypes.UINT64_T),
"system_clock_counter": _validate_if_max_uint(gpu_metrics.system_clock_counter, MaxUIntegerTypes.UINT64_T),
"average_gfxclk_frequency": _validate_if_max_uint(gpu_metrics.average_gfxclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_socclk_frequency": _validate_if_max_uint(gpu_metrics.average_socclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_uclk_frequency": _validate_if_max_uint(gpu_metrics.average_uclk_frequency, MaxUIntegerTypes.UINT16_T),
"average_vclk0_frequency": _validate_if_max_uint(gpu_metrics.average_vclk0_frequency, MaxUIntegerTypes.UINT16_T),
"average_dclk0_frequency": _validate_if_max_uint(gpu_metrics.average_dclk0_frequency, MaxUIntegerTypes.UINT16_T),
"average_vclk1_frequency": _validate_if_max_uint(gpu_metrics.average_vclk1_frequency, MaxUIntegerTypes.UINT16_T),
"average_dclk1_frequency": _validate_if_max_uint(gpu_metrics.average_dclk1_frequency, MaxUIntegerTypes.UINT16_T),
"current_gfxclk": _validate_if_max_uint(gpu_metrics.current_gfxclk, MaxUIntegerTypes.UINT16_T),
"current_socclk": _validate_if_max_uint(gpu_metrics.current_socclk, MaxUIntegerTypes.UINT16_T),
"current_uclk": _validate_if_max_uint(gpu_metrics.current_uclk, MaxUIntegerTypes.UINT16_T),
"current_vclk0": _validate_if_max_uint(gpu_metrics.current_vclk0, MaxUIntegerTypes.UINT16_T),
"current_dclk0": _validate_if_max_uint(gpu_metrics.current_dclk0, MaxUIntegerTypes.UINT16_T),
"current_vclk1": _validate_if_max_uint(gpu_metrics.current_vclk1, MaxUIntegerTypes.UINT16_T),
"current_dclk1": _validate_if_max_uint(gpu_metrics.current_dclk1, MaxUIntegerTypes.UINT16_T),
"throttle_status": _validate_if_max_uint(gpu_metrics.throttle_status, MaxUIntegerTypes.UINT32_T, isBool=True),
"current_fan_speed": _validate_if_max_uint(gpu_metrics.current_fan_speed, MaxUIntegerTypes.UINT16_T),
"pcie_link_width": _validate_if_max_uint(gpu_metrics.pcie_link_width, MaxUIntegerTypes.UINT16_T),
"pcie_link_speed": _validate_if_max_uint(gpu_metrics.pcie_link_speed, MaxUIntegerTypes.UINT16_T),
"gfx_activity_acc": _validate_if_max_uint(gpu_metrics.gfx_activity_acc, MaxUIntegerTypes.UINT32_T),
"mem_activity_acc": _validate_if_max_uint(gpu_metrics.mem_activity_acc, MaxUIntegerTypes.UINT32_T),
"temperature_hbm": _validate_if_max_uint(list(gpu_metrics.temperature_hbm), MaxUIntegerTypes.UINT16_T),
"firmware_timestamp": _validate_if_max_uint(gpu_metrics.firmware_timestamp, MaxUIntegerTypes.UINT64_T),
"voltage_soc": _validate_if_max_uint(gpu_metrics.voltage_soc, MaxUIntegerTypes.UINT16_T),
"voltage_gfx": _validate_if_max_uint(gpu_metrics.voltage_gfx, MaxUIntegerTypes.UINT16_T),
"voltage_mem": _validate_if_max_uint(gpu_metrics.voltage_mem, MaxUIntegerTypes.UINT16_T),
"indep_throttle_status": _validate_if_max_uint(gpu_metrics.indep_throttle_status, MaxUIntegerTypes.UINT64_T, isBool=True),
"current_socket_power": _validate_if_max_uint(gpu_metrics.current_socket_power, MaxUIntegerTypes.UINT16_T),
"vcn_activity": _validate_if_max_uint(list(gpu_metrics.vcn_activity), MaxUIntegerTypes.UINT16_T, isActivity=True),
"gfxclk_lock_status": _validate_if_max_uint(gpu_metrics.gfxclk_lock_status, MaxUIntegerTypes.UINT32_T),
"xgmi_link_width": _validate_if_max_uint(gpu_metrics.xgmi_link_width, MaxUIntegerTypes.UINT16_T),
"xgmi_link_speed": _validate_if_max_uint(gpu_metrics.xgmi_link_speed, MaxUIntegerTypes.UINT16_T),
"pcie_bandwidth_acc": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_acc, MaxUIntegerTypes.UINT64_T),
"pcie_bandwidth_inst": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_inst, MaxUIntegerTypes.UINT64_T),
"pcie_l0_to_recov_count_acc": _validate_if_max_uint(gpu_metrics.pcie_l0_to_recov_count_acc, MaxUIntegerTypes.UINT64_T),
"pcie_replay_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_count_acc, MaxUIntegerTypes.UINT64_T),
"pcie_replay_rover_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_rover_count_acc, MaxUIntegerTypes.UINT64_T),
"xgmi_read_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_read_data_acc), MaxUIntegerTypes.UINT64_T),
"xgmi_write_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_write_data_acc), MaxUIntegerTypes.UINT64_T),
"current_gfxclks": _validate_if_max_uint(list(gpu_metrics.current_gfxclks), MaxUIntegerTypes.UINT16_T),
"current_socclks": _validate_if_max_uint(list(gpu_metrics.current_socclks), MaxUIntegerTypes.UINT16_T),
"current_vclk0s": _validate_if_max_uint(list(gpu_metrics.current_vclk0s), MaxUIntegerTypes.UINT16_T),
"current_dclk0s": _validate_if_max_uint(list(gpu_metrics.current_dclk0s), MaxUIntegerTypes.UINT16_T),
"jpeg_activity": _validate_if_max_uint(list(gpu_metrics.jpeg_activity), MaxUIntegerTypes.UINT16_T, isActivity=True),
"pcie_nak_sent_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_sent_count_acc, MaxUIntegerTypes.UINT32_T),
"pcie_nak_rcvd_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_rcvd_count_acc, MaxUIntegerTypes.UINT32_T),
"accumulation_counter": _validate_if_max_uint(gpu_metrics.accumulation_counter, MaxUIntegerTypes.UINT64_T),
"prochot_residency_acc": _validate_if_max_uint(gpu_metrics.prochot_residency_acc, MaxUIntegerTypes.UINT64_T),
"ppt_residency_acc": _validate_if_max_uint(gpu_metrics.ppt_residency_acc, MaxUIntegerTypes.UINT64_T),
"socket_thm_residency_acc": _validate_if_max_uint(gpu_metrics.socket_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"vr_thm_residency_acc": _validate_if_max_uint(gpu_metrics.vr_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"hbm_thm_residency_acc": _validate_if_max_uint(gpu_metrics.hbm_thm_residency_acc, MaxUIntegerTypes.UINT64_T),
"num_partition": _validate_if_max_uint(gpu_metrics.num_partition, MaxUIntegerTypes.UINT16_T),
"xcp_stats.gfx_busy_inst": list(gpu_metrics.xcp_stats),
"xcp_stats.jpeg_busy": list(gpu_metrics.xcp_stats),
"xcp_stats.vcn_busy": list(gpu_metrics.xcp_stats),
"xcp_stats.gfx_busy_acc": list(gpu_metrics.xcp_stats),
"pcie_lc_perf_other_end_recovery": _validate_if_max_uint(gpu_metrics.pcie_lc_perf_other_end_recovery, MaxUIntegerTypes.UINT32_T),
}
# Validate support for each gpu_metric
uint_16_metrics = ['temperature_edge', 'temperature_hotspot', 'temperature_mem',
                   'temperature_vrgfx', 'temperature_vrsoc', 'temperature_vrmem',
                   'average_gfx_activity', 'average_umc_activity', 'average_mm_activity',
                   'average_socket_power', 'average_gfxclk_frequency', 'average_socclk_frequency',
                   'average_uclk_frequency', 'average_vclk0_frequency', 'average_dclk0_frequency',
                   'average_vclk1_frequency', 'average_dclk1_frequency', 'current_gfxclk',
                   'current_socclk', 'current_uclk', 'current_vclk0', 'current_dclk0',
                   'current_vclk1', 'current_dclk1', 'current_fan_speed', 'pcie_link_width',
                   'pcie_link_speed', 'voltage_soc', 'voltage_gfx', 'voltage_mem',
                   'current_socket_power', 'xgmi_link_width', 'xgmi_link_speed']
for metric in uint_16_metrics:
    if gpu_metrics_output[metric] == 0xFFFF:
        gpu_metrics_output[metric] = "N/A"
uint_32_metrics = ['gfx_activity_acc', 'mem_activity_acc', 'pcie_nak_sent_count_acc', 'pcie_nak_rcvd_count_acc', 'gfxclk_lock_status']
for metric in uint_32_metrics:
    if gpu_metrics_output[metric] == 0xFFFFFFFF:
        gpu_metrics_output[metric] = "N/A"
uint_64_metrics = ['energy_accumulator', 'system_clock_counter', 'firmware_timestamp',
                   'pcie_bandwidth_acc', 'pcie_bandwidth_inst',
                   'pcie_l0_to_recov_count_acc', 'pcie_replay_count_acc',
                   'pcie_replay_rover_count_acc']
for metric in uint_64_metrics:
    if gpu_metrics_output[metric] == 0xFFFFFFFFFFFFFFFF:
        gpu_metrics_output[metric] = "N/A"
# Custom validation for metrics in a bool format
uint_32_bool_metrics = ['throttle_status']
for metric in uint_32_bool_metrics:
    if gpu_metrics_output[metric] == 0xFFFFFFFF:
        gpu_metrics_output[metric] = "N/A"
    else:
        gpu_metrics_output[metric] = bool(gpu_metrics_output[metric])
# Custom validation for metrics in a list format
uint_16_clock_list_metrics = ['current_gfxclks', 'current_socclks', 'current_vclk0s', 'current_dclk0s']
for clock in uint_16_clock_list_metrics:
    for index, clk in enumerate(gpu_metrics_output[clock]):
        if clk == 0xFFFF:
            gpu_metrics_output[clock][index] = "N/A"
uint_16_activity_list_metrics = ['vcn_activity', 'jpeg_activity']
for activity_metric in uint_16_activity_list_metrics:
    for index, activity in enumerate(gpu_metrics_output[activity_metric]):
        if activity == 0xFFFF or activity > 110:
            gpu_metrics_output[activity_metric][index] = "N/A"
uint_64_xgmi_metrics = ['xgmi_read_data_acc', 'xgmi_write_data_acc']
for metric in uint_64_xgmi_metrics:
    for index, data in enumerate(gpu_metrics_output[metric]):
        if data == 0xFFFFFFFFFFFFFFFF:
            gpu_metrics_output[metric][index] = "N/A"
# Custom validation for specific gpu_metrics
for index, temp in enumerate(gpu_metrics_output['temperature_hbm']):
    if temp == 0xFFFF:
        gpu_metrics_output['temperature_hbm'][index] = "N/A"
if gpu_metrics_output['indep_throttle_status'] == 0xFFFFFFFFFFFFFFFF:
    gpu_metrics_output['indep_throttle_status'] = "N/A"
else:
    gpu_metrics_output['indep_throttle_status'] = bool(gpu_metrics_output['indep_throttle_status'])
# Create 2d array with each XCD's stats
for k,v in gpu_metrics_output.items():
    if 'xcp_stats' in k:
        if 'xcp_stats.gfx_busy_inst' in k:
            for curr_xcp, item in enumerate(v):
                print_xcp_detail = []
                for val in item.gfx_busy_inst:
                    print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT32_T, isActivity=True))
                gpu_metrics_output[k][curr_xcp] = print_xcp_detail
        if 'xcp_stats.jpeg_busy' in k:
            for curr_xcp, item in enumerate(v):
                print_xcp_detail = []
                for val in item.jpeg_busy:
                    print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True))
                gpu_metrics_output[k][curr_xcp] = print_xcp_detail
        if 'xcp_stats.vcn_busy' in k:
            for curr_xcp, item in enumerate(v):
                print_xcp_detail = []
                for val in item.vcn_busy:
                    print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True))
                gpu_metrics_output[k][curr_xcp] = print_xcp_detail
        if 'xcp_stats.gfx_busy_acc' in k:
            for curr_xcp, item in enumerate(v):
                print_xcp_detail = []
                for val in item.gfx_busy_acc:
                    print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True))
                gpu_metrics_output[k][curr_xcp] = print_xcp_detail
return gpu_metrics_output
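The validation in this hunk treats an all-ones value as "unsupported" for each integer width (0xFFFF, 0xFFFFFFFF, 0xFFFFFFFFFFFFFFFF). A minimal sketch of that idea follows. It assumes a hypothetical helper named `mask_unsupported`; the real `_validate_if_max_uint` is not shown in this diff and may behave differently (for example, its `isActivity` and `isBool` handling is not reproduced here).

```python
from typing import List, Union

NA = "N/A"


def mask_unsupported(value: Union[int, List[int]], bits: int):
    """Replace all-ones sentinels of the given bit width with "N/A".

    Hypothetical stand-in for illustration; not the library's helper.
    """
    sentinel = (1 << bits) - 1  # 0xFFFF for 16 bits, 0xFFFFFFFF for 32 bits
    if isinstance(value, list):
        # List metrics (clocks, HBM temperatures) are checked per element.
        return [NA if v == sentinel else v for v in value]
    return NA if value == sentinel else value


print(mask_unsupported(0xFFFF, 16))
print(mask_unsupported([1500, 0xFFFF], 16))
print(mask_unsupported(0xFFFF, 32))
```

Note that the width matters: 0xFFFF is a sentinel for a 16-bit field but an ordinary value for a 32-bit one, which is why each metric is paired with its integer type.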
@@ -4174,3 +4197,26 @@ def amdsmi_get_gpu_metrics_header_info(
"format_revision": header_info.format_revision,
"content_revision": header_info.content_revision
}
def amdsmi_get_link_topology_nearest(
    processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
    link_type: AmdSmiLinkType,
) -> Dict[str, Any]:
    topology_nearest_list = amdsmi_wrapper.amdsmi_topology_nearest_t()
    _check_res(
        amdsmi_wrapper.amdsmi_get_link_topology_nearest(
            processor_handle,
            link_type,
            ctypes.byref(topology_nearest_list)
        )
    )
    device_list = []
    for index in range(topology_nearest_list.count):
        device_list.append(topology_nearest_list.processor_list[index])
    return {
        'count': topology_nearest_list.count,
        'processor_list': device_list
    }
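The new wrapper copies only the first `count` entries of the fixed 32-slot `processor_list` into a Python list. The sketch below reproduces that step without the library or a GPU. `TopologyNearest` and `to_dict` are hypothetical local names that mirror the field layout from the generated binding, the handle values are made up, and the clamp to 32 entries is a defensive addition that the source does not contain.

```python
import ctypes

MAX_NEAREST = 32  # matches the 32-entry processor_list in the generated binding


class TopologyNearest(ctypes.Structure):
    """Hypothetical local mirror of amdsmi_topology_nearest_t for illustration."""
    _fields_ = [
        ("count", ctypes.c_uint32),
        ("padding", ctypes.c_ubyte * 4),
        ("processor_list", ctypes.c_void_p * MAX_NEAREST),
        ("reserved", ctypes.c_uint64 * 15),
    ]


def to_dict(topology: TopologyNearest) -> dict:
    # Copy only the first `count` handles; the remaining slots are unused.
    count = min(topology.count, MAX_NEAREST)
    return {
        "count": count,
        "processor_list": [topology.processor_list[i] for i in range(count)],
    }


topology = TopologyNearest()
topology.count = 2
topology.processor_list[0] = 0x1000  # fake handle values for the demo
topology.processor_list[1] = 0x2000
result = to_dict(topology)
print(result)
```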
+104 -34
@@ -727,6 +727,28 @@ struct_amdsmi_vram_usage_t._fields_ = [
]
amdsmi_vram_usage_t = struct_amdsmi_vram_usage_t
class struct_amdsmi_violation_status_t(Structure):
pass
struct_amdsmi_violation_status_t._pack_ = 1 # source:False
struct_amdsmi_violation_status_t._fields_ = [
('reference_timestamp', ctypes.c_uint64),
('violation_timestamp', ctypes.c_uint64),
('per_prochot_thrm', ctypes.c_uint64),
('per_ppt_pwr', ctypes.c_uint64),
('per_socket_thrm', ctypes.c_uint64),
('per_vr_thrm', ctypes.c_uint64),
('per_hbm_thrm', ctypes.c_uint64),
('active_prochot_thrm', ctypes.c_ubyte),
('active_ppt_pwr', ctypes.c_ubyte),
('active_socket_thrm', ctypes.c_ubyte),
('active_vr_thrm', ctypes.c_ubyte),
('active_hbm_thrm', ctypes.c_ubyte),
('PADDING_0', ctypes.c_ubyte * 3),
('reserved', ctypes.c_uint64 * 24),
]
amdsmi_violation_status_t = struct_amdsmi_violation_status_t
class struct_amdsmi_frequency_range_t(Structure):
pass
@@ -804,7 +826,9 @@ struct_pcie_metric_._fields_ = [
('pcie_replay_roll_over_count', ctypes.c_uint64),
('pcie_nak_sent_count', ctypes.c_uint64),
('pcie_nak_received_count', ctypes.c_uint64),
-    ('reserved', ctypes.c_uint64 * 13),
+    ('pcie_lc_perf_other_end_recovery_count', ctypes.c_uint32),
+    ('PADDING_2', ctypes.c_ubyte * 4),
+    ('reserved', ctypes.c_uint64 * 12),
]
struct_amdsmi_pcie_info_t._pack_ = 1 # source:False
@@ -933,7 +957,8 @@ struct_amdsmi_kfd_info_t._pack_ = 1 # source:False
struct_amdsmi_kfd_info_t._fields_ = [
('kfd_id', ctypes.c_uint64),
('node_id', ctypes.c_uint32),
-    ('reserved', ctypes.c_uint32 * 13),
+    ('current_partition_id', ctypes.c_uint32),
+    ('reserved', ctypes.c_uint32 * 12),
]
amdsmi_kfd_info_t = struct_amdsmi_kfd_info_t
@@ -978,15 +1003,17 @@ amdsmi_accelerator_partition_profile_t = struct_amdsmi_accelerator_partition_pro
# values for enumeration 'amdsmi_link_type_t'
amdsmi_link_type_t__enumvalues = {
-    0: 'AMDSMI_LINK_TYPE_PCIE',
+    0: 'AMDSMI_LINK_TYPE_INTERNAL',
     1: 'AMDSMI_LINK_TYPE_XGMI',
-    2: 'AMDSMI_LINK_TYPE_NOT_APPLICABLE',
-    3: 'AMDSMI_LINK_TYPE_UNKNOWN',
+    2: 'AMDSMI_LINK_TYPE_PCIE',
+    3: 'AMDSMI_LINK_TYPE_NOT_APPLICABLE',
+    4: 'AMDSMI_LINK_TYPE_UNKNOWN',
 }
-AMDSMI_LINK_TYPE_PCIE = 0
+AMDSMI_LINK_TYPE_INTERNAL = 0
 AMDSMI_LINK_TYPE_XGMI = 1
-AMDSMI_LINK_TYPE_NOT_APPLICABLE = 2
-AMDSMI_LINK_TYPE_UNKNOWN = 3
+AMDSMI_LINK_TYPE_PCIE = 2
+AMDSMI_LINK_TYPE_NOT_APPLICABLE = 3
+AMDSMI_LINK_TYPE_UNKNOWN = 4
amdsmi_link_type_t = ctypes.c_uint32 # enum
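Reading this hunk as a renumbering (the hunk header gives 15 old lines and 17 new ones), `AMDSMI_LINK_TYPE_INTERNAL` takes value 0 and every member except XGMI shifts. Callers that stored or compared raw integers would therefore see different meanings after the merge. The sketch below lists the members whose value moved; `LinkType`, `OLD_VALUES`, and `changed_members` are hypothetical names used only for illustration.

```python
from enum import IntEnum


class LinkType(IntEnum):
    """Hypothetical mirror of the new amdsmi_link_type_t values."""
    INTERNAL = 0
    XGMI = 1
    PCIE = 2
    NOT_APPLICABLE = 3
    UNKNOWN = 4


# Values as they appear on the removed side of the hunk.
OLD_VALUES = {"PCIE": 0, "XGMI": 1, "NOT_APPLICABLE": 2, "UNKNOWN": 3}


def changed_members() -> dict:
    """Return {name: (old, new)} for every member whose integer value moved."""
    return {
        name: (old, int(LinkType[name]))
        for name, old in OLD_VALUES.items()
        if old != LinkType[name]
    }


print(changed_members())
```

Comparing against the symbolic names rather than literal integers avoids this class of breakage.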
class struct_amdsmi_link_metrics_t(Structure):
pass
@@ -1100,16 +1127,6 @@ amdsmi_process_handle_t = ctypes.c_uint32
class struct_amdsmi_proc_info_t(Structure):
pass
class struct_engine_usage_(Structure):
pass
struct_engine_usage_._pack_ = 1 # source:False
struct_engine_usage_._fields_ = [
('gfx', ctypes.c_uint64),
('enc', ctypes.c_uint64),
('reserved', ctypes.c_uint32 * 12),
]
class struct_memory_usage_(Structure):
pass
@@ -1121,6 +1138,16 @@ struct_memory_usage_._fields_ = [
('reserved', ctypes.c_uint32 * 10),
]
class struct_engine_usage_(Structure):
pass
struct_engine_usage_._pack_ = 1 # source:False
struct_engine_usage_._fields_ = [
('gfx', ctypes.c_uint64),
('enc', ctypes.c_uint64),
('reserved', ctypes.c_uint32 * 12),
]
struct_amdsmi_proc_info_t._pack_ = 1 # source:False
struct_amdsmi_proc_info_t._fields_ = [
('name', ctypes.c_char * 32),
@@ -1711,6 +1738,17 @@ struct_amd_metrics_table_header_t._fields_ = [
]
amd_metrics_table_header_t = struct_amd_metrics_table_header_t
class struct_amdsmi_gpu_xcp_metrics_t(Structure):
pass
struct_amdsmi_gpu_xcp_metrics_t._pack_ = 1 # source:False
struct_amdsmi_gpu_xcp_metrics_t._fields_ = [
('gfx_busy_inst', ctypes.c_uint32 * 8),
('jpeg_busy', ctypes.c_uint16 * 32),
('vcn_busy', ctypes.c_uint16 * 4),
('gfx_busy_acc', ctypes.c_uint64 * 8),
]
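The per-partition struct needs no explicit padding because its fields already fall on naturally aligned offsets. The sketch below checks that with a hypothetical local mirror (`XcpMetrics`) built from the array sizes in the binding above; it does not load the library.

```python
import ctypes


class XcpMetrics(ctypes.Structure):
    """Hypothetical local mirror of struct_amdsmi_gpu_xcp_metrics_t."""
    _fields_ = [
        ("gfx_busy_inst", ctypes.c_uint32 * 8),   # 32 bytes
        ("jpeg_busy", ctypes.c_uint16 * 32),      # 64 bytes
        ("vcn_busy", ctypes.c_uint16 * 4),        # 8 bytes
        ("gfx_busy_acc", ctypes.c_uint64 * 8),    # 64 bytes
    ]


MAX_NUM_XCP = 8  # number of xcp_stats entries in the metrics struct
xcp_size = ctypes.sizeof(XcpMetrics)
print(xcp_size, xcp_size * MAX_NUM_XCP)
print(XcpMetrics.gfx_busy_acc.offset)
```

The 64-bit accumulator array starts at byte 104, a multiple of 8, so the packed and naturally aligned layouts coincide for this struct.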
class struct_amdsmi_gpu_metrics_t(Structure):
pass
@@ -1778,6 +1816,17 @@ struct_amdsmi_gpu_metrics_t._fields_ = [
('jpeg_activity', ctypes.c_uint16 * 32),
('pcie_nak_sent_count_acc', ctypes.c_uint32),
('pcie_nak_rcvd_count_acc', ctypes.c_uint32),
('accumulation_counter', ctypes.c_uint64),
('prochot_residency_acc', ctypes.c_uint64),
('ppt_residency_acc', ctypes.c_uint64),
('socket_thm_residency_acc', ctypes.c_uint64),
('vr_thm_residency_acc', ctypes.c_uint64),
('hbm_thm_residency_acc', ctypes.c_uint64),
('num_partition', ctypes.c_uint16),
('PADDING_4', ctypes.c_ubyte * 6),
('xcp_stats', struct_amdsmi_gpu_xcp_metrics_t * 8),
('pcie_lc_perf_other_end_recovery', ctypes.c_uint32),
('PADDING_5', ctypes.c_ubyte * 4),
]
amdsmi_gpu_metrics_t = struct_amdsmi_gpu_metrics_t
@@ -1842,6 +1891,18 @@ struct_amdsmi_process_info_t._fields_ = [
]
amdsmi_process_info_t = struct_amdsmi_process_info_t
class struct_amdsmi_topology_nearest_t(Structure):
pass
struct_amdsmi_topology_nearest_t._pack_ = 1 # source:False
struct_amdsmi_topology_nearest_t._fields_ = [
('count', ctypes.c_uint32),
('PADDING_0', ctypes.c_ubyte * 4),
('processor_list', ctypes.POINTER(None) * 32),
('reserved', ctypes.c_uint64 * 15),
]
amdsmi_topology_nearest_t = struct_amdsmi_topology_nearest_t
class struct_amdsmi_smu_fw_version_t(Structure):
pass
@@ -2370,12 +2431,18 @@ amdsmi_get_clock_info.argtypes = [amdsmi_processor_handle, amdsmi_clk_type_t, ct
amdsmi_get_gpu_vram_usage = _libraries['libamd_smi.so'].amdsmi_get_gpu_vram_usage
amdsmi_get_gpu_vram_usage.restype = amdsmi_status_t
amdsmi_get_gpu_vram_usage.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_vram_usage_t)]
amdsmi_get_violation_status = _libraries['libamd_smi.so'].amdsmi_get_violation_status
amdsmi_get_violation_status.restype = amdsmi_status_t
amdsmi_get_violation_status.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_violation_status_t)]
amdsmi_get_gpu_process_list = _libraries['libamd_smi.so'].amdsmi_get_gpu_process_list
amdsmi_get_gpu_process_list.restype = amdsmi_status_t
amdsmi_get_gpu_process_list.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint32), ctypes.POINTER(struct_amdsmi_proc_info_t)]
amdsmi_get_gpu_total_ecc_count = _libraries['libamd_smi.so'].amdsmi_get_gpu_total_ecc_count
amdsmi_get_gpu_total_ecc_count.restype = amdsmi_status_t
amdsmi_get_gpu_total_ecc_count.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_error_count_t)]
amdsmi_get_link_topology_nearest = _libraries['libamd_smi.so'].amdsmi_get_link_topology_nearest
amdsmi_get_link_topology_nearest.restype = amdsmi_status_t
amdsmi_get_link_topology_nearest.argtypes = [amdsmi_processor_handle, amdsmi_link_type_t, ctypes.POINTER(struct_amdsmi_topology_nearest_t)]
amdsmi_get_cpu_core_energy = _libraries['libamd_smi.so'].amdsmi_get_cpu_core_energy
amdsmi_get_cpu_core_energy.restype = amdsmi_status_t
amdsmi_get_cpu_core_energy.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint64)]
@@ -2616,11 +2683,11 @@ __all__ = \
'AMDSMI_IOLINK_TYPE_NUMIOLINKTYPES',
'AMDSMI_IOLINK_TYPE_PCIEXPRESS', 'AMDSMI_IOLINK_TYPE_SIZE',
'AMDSMI_IOLINK_TYPE_UNDEFINED', 'AMDSMI_IOLINK_TYPE_XGMI',
'AMDSMI_LINK_TYPE_NOT_APPLICABLE', 'AMDSMI_LINK_TYPE_PCIE',
'AMDSMI_LINK_TYPE_UNKNOWN', 'AMDSMI_LINK_TYPE_XGMI',
'AMDSMI_MEMORY_PARTITION_NPS1', 'AMDSMI_MEMORY_PARTITION_NPS2',
'AMDSMI_MEMORY_PARTITION_NPS4', 'AMDSMI_MEMORY_PARTITION_NPS8',
'AMDSMI_MEMORY_PARTITION_UNKNOWN',
'AMDSMI_LINK_TYPE_INTERNAL', 'AMDSMI_LINK_TYPE_NOT_APPLICABLE',
'AMDSMI_LINK_TYPE_PCIE', 'AMDSMI_LINK_TYPE_UNKNOWN',
'AMDSMI_LINK_TYPE_XGMI', 'AMDSMI_MEMORY_PARTITION_NPS1',
'AMDSMI_MEMORY_PARTITION_NPS2', 'AMDSMI_MEMORY_PARTITION_NPS4',
'AMDSMI_MEMORY_PARTITION_NPS8', 'AMDSMI_MEMORY_PARTITION_UNKNOWN',
'AMDSMI_MEM_PAGE_STATUS_PENDING',
'AMDSMI_MEM_PAGE_STATUS_RESERVED',
'AMDSMI_MEM_PAGE_STATUS_UNRESERVABLE', 'AMDSMI_MEM_TYPE_FIRST',
@@ -2798,7 +2865,7 @@ __all__ = \
'amdsmi_get_gpu_vram_info', 'amdsmi_get_gpu_vram_usage',
'amdsmi_get_gpu_vram_vendor', 'amdsmi_get_hsmp_metrics_table',
'amdsmi_get_hsmp_metrics_table_version', 'amdsmi_get_lib_version',
'amdsmi_get_link_metrics',
'amdsmi_get_link_metrics', 'amdsmi_get_link_topology_nearest',
'amdsmi_get_minmax_bandwidth_between_processors',
'amdsmi_get_pcie_info', 'amdsmi_get_power_cap_info',
'amdsmi_get_power_info',
@@ -2810,9 +2877,9 @@ __all__ = \
'amdsmi_get_soc_pstate', 'amdsmi_get_socket_handles',
'amdsmi_get_socket_info', 'amdsmi_get_temp_metric',
'amdsmi_get_threads_per_core', 'amdsmi_get_utilization_count',
'amdsmi_get_xgmi_info', 'amdsmi_get_xgmi_plpd',
'amdsmi_gpu_block_t', 'amdsmi_gpu_cache_info_t',
'amdsmi_gpu_control_counter',
'amdsmi_get_violation_status', 'amdsmi_get_xgmi_info',
'amdsmi_get_xgmi_plpd', 'amdsmi_gpu_block_t',
'amdsmi_gpu_cache_info_t', 'amdsmi_gpu_control_counter',
'amdsmi_gpu_counter_group_supported', 'amdsmi_gpu_create_counter',
'amdsmi_gpu_destroy_counter', 'amdsmi_gpu_metrics_t',
'amdsmi_gpu_read_counter', 'amdsmi_gpu_xgmi_error_status',
@@ -2862,11 +2929,12 @@ __all__ = \
'amdsmi_temp_range_refresh_rate_t', 'amdsmi_temperature_metric_t',
'amdsmi_temperature_type_t', 'amdsmi_topo_get_link_type',
'amdsmi_topo_get_link_weight', 'amdsmi_topo_get_numa_node_number',
'amdsmi_topo_get_p2p_status', 'amdsmi_utilization_counter_t',
'amdsmi_topo_get_p2p_status', 'amdsmi_topology_nearest_t',
'amdsmi_utilization_counter_t',
'amdsmi_utilization_counter_type_t', 'amdsmi_vbios_info_t',
'amdsmi_version_t', 'amdsmi_voltage_metric_t',
'amdsmi_voltage_type_t', 'amdsmi_vram_info_t',
'amdsmi_vram_type_t', 'amdsmi_vram_usage_t',
'amdsmi_version_t', 'amdsmi_violation_status_t',
'amdsmi_voltage_metric_t', 'amdsmi_voltage_type_t',
'amdsmi_vram_info_t', 'amdsmi_vram_type_t', 'amdsmi_vram_usage_t',
'amdsmi_vram_vendor_type_t', 'amdsmi_xgmi_info_t',
'amdsmi_xgmi_status_t', 'processor_type_t', 'size_t',
'struct__links', 'struct_amd_metrics_table_header_t',
@@ -2882,6 +2950,7 @@ __all__ = \
'struct_amdsmi_freq_volt_region_t', 'struct_amdsmi_frequencies_t',
'struct_amdsmi_frequency_range_t', 'struct_amdsmi_fw_info_t',
'struct_amdsmi_gpu_cache_info_t', 'struct_amdsmi_gpu_metrics_t',
'struct_amdsmi_gpu_xcp_metrics_t',
'struct_amdsmi_hsmp_metrics_table_t', 'struct_amdsmi_kfd_info_t',
'struct_amdsmi_link_id_bw_type_t', 'struct_amdsmi_link_metrics_t',
'struct_amdsmi_name_value_t', 'struct_amdsmi_od_vddc_point_t',
@@ -2896,11 +2965,12 @@ __all__ = \
'struct_amdsmi_retired_page_record_t',
'struct_amdsmi_smu_fw_version_t',
'struct_amdsmi_temp_range_refresh_rate_t',
'struct_amdsmi_topology_nearest_t',
'struct_amdsmi_utilization_counter_t',
'struct_amdsmi_vbios_info_t', 'struct_amdsmi_version_t',
'struct_amdsmi_vram_info_t', 'struct_amdsmi_vram_usage_t',
'struct_amdsmi_xgmi_info_t', 'struct_cache_',
'struct_engine_usage_', 'struct_fw_info_list_',
'struct_amdsmi_violation_status_t', 'struct_amdsmi_vram_info_t',
'struct_amdsmi_vram_usage_t', 'struct_amdsmi_xgmi_info_t',
'struct_cache_', 'struct_engine_usage_', 'struct_fw_info_list_',
'struct_memory_usage_', 'struct_nps_flags_',
'struct_pcie_metric_', 'struct_pcie_static_',
'struct_amdsmi_bdf_t','uint32_t', 'uint64_t', 'uint8_t',
+60
@@ -937,6 +937,22 @@ int main() {
<< gpu_metrics.pcie_replay_count_acc << "\n";
std::cout << "\t**.pcie_replay_rover_count_acc : " << std::dec
<< gpu_metrics.pcie_replay_rover_count_acc << "\n";
std::cout << "\t**.accumulation_counter : " << std::dec
<< gpu_metrics.accumulation_counter << "\n";
std::cout << "\t**.prochot_residency_acc : " << std::dec
<< gpu_metrics.prochot_residency_acc << "\n";
std::cout << "\t**.ppt_residency_acc : " << std::dec
<< gpu_metrics.ppt_residency_acc << "\n";
std::cout << "\t**.socket_thm_residency_acc : " << std::dec
<< gpu_metrics.socket_thm_residency_acc << "\n";
std::cout << "\t**.vr_thm_residency_acc : " << std::dec
<< gpu_metrics.vr_thm_residency_acc << "\n";
std::cout << "\t**.hbm_thm_residency_acc : " << std::dec
<< gpu_metrics.hbm_thm_residency_acc << "\n";
std::cout << "\t**.num_partition: " << std::dec
<< gpu_metrics.num_partition << "\n";
std::cout << "\t**.pcie_lc_perf_other_end_recovery: "
<< gpu_metrics.pcie_lc_perf_other_end_recovery << "\n";
std::cout << "\t**.temperature_hbm[] : " << std::dec << "\n";
for (const auto& temp : gpu_metrics.temperature_hbm) {
@@ -978,6 +994,50 @@ int main() {
std::cout << "\t -> " << std::dec << dclk << "\n";
}
std::cout << std::dec << "xcp_stats.gfx_busy_inst = \n";
auto xcp = 0;
for (auto& row : gpu_metrics.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_inst),
std::end(row.gfx_busy_inst),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.jpeg_busy = \n";
for (auto& row : gpu_metrics.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.jpeg_busy),
std::end(row.jpeg_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.vcn_busy = \n";
for (auto& row : gpu_metrics.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.vcn_busy),
std::end(row.vcn_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_busy_acc = \n";
for (auto& row : gpu_metrics.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_acc),
std::end(row.gfx_busy_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
std::cout << "\n";
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
constexpr uint16_t kMAX_ITER_TEST = 10;
+87
@@ -1071,6 +1071,41 @@ typedef struct metrics_table_header_t metrics_table_header_t;
*/
#define RSMI_MAX_NUM_GFX_CLKS 8
/**
* @brief This should match kRSMI_MAX_NUM_XCC;
* XCC - Accelerated Compute Core, the collection of compute units,
* ACE (Asynchronous Compute Engines), caches,
* and global resources organized as one unit.
*
* Refer to amd.com documentation for more detail:
* https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
*/
#define RSMI_MAX_NUM_XCC 8
/**
* @brief This should match kRSMI_MAX_NUM_XCP;
* XCP - Accelerated Compute Processor,
* also referred to as the Graphics Compute Partitions.
* Each physical GPU can expose a maximum of 8 separate partitions
* (depending on ASIC support).
*
* Refer to amd.com documentation for more detail:
* https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf
*/
#define RSMI_MAX_NUM_XCP 8
/**
* @brief The following structures hold the gpu statistics for a device.
*/
struct amdgpu_xcp_metrics_t {
/* Utilization Instantaneous (%) */
uint32_t gfx_busy_inst[RSMI_MAX_NUM_XCC];
uint16_t jpeg_busy[RSMI_MAX_NUM_JPEG_ENGS];
uint16_t vcn_busy[RSMI_MAX_NUM_VCNS];
/* Utilization Accumulated (%) */
uint64_t gfx_busy_acc[RSMI_MAX_NUM_XCC];
};
typedef struct {
// TODO(amd) Doxygen documents
@@ -1221,6 +1256,57 @@ typedef struct {
// PCIE NAK received accumulated count
uint32_t pcie_nak_rcvd_count_acc;
/*
* v1.6 additions
*/
/* Accumulation cycle counter */
uint64_t accumulation_counter;
/**
* Accumulated throttler residencies
*/
uint64_t prochot_residency_acc;
/**
* Accumulated throttler residencies
*
* Prochot (thermal) - PPT (power)
* Package Power Tracking (PPT) violation % (greater than 0% is a violation);
* aka PVIOL
*
* Ex. PVIOL/TVIOL calculations
* Where A and B are measurements recorded at prior points in time.
* Typically A is the earlier measured value and B is the latest measured value.
*
* PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A))
* TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A))
*/
uint64_t ppt_residency_acc;
/**
* Accumulated throttler residencies
*
* Socket (thermal) -
* Socket thermal violation % (greater than 0% is a violation);
* aka TVIOL
*
* Ex. PVIOL/TVIOL calculations
* Where A and B are measurements recorded at prior points in time.
* Typically A is the earlier measured value and B is the latest measured value.
*
* PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A))
* TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A))
*/
uint64_t socket_thm_residency_acc;
uint64_t vr_thm_residency_acc;
uint64_t hbm_thm_residency_acc;
/* Current total number of partitions */
uint16_t num_partition;
/* XCP (Graphics Compute Partitions) metrics stats */
struct amdgpu_xcp_metrics_t xcp_stats[RSMI_MAX_NUM_XCP];
/* PCIE other end recovery counter */
uint32_t pcie_lc_perf_other_end_recovery;
/// \endcond
} rsmi_gpu_metrics_t;
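The PVIOL/TVIOL formula in the comments above divides the change in a residency accumulator by the change in `accumulation_counter` between two samples A and B. A minimal sketch of that arithmetic follows, in Python for brevity. `violation_percent` is a hypothetical helper, the sample values are made up, and returning `None` when the counter has not advanced is an assumption rather than documented library behavior.

```python
from typing import Optional


def violation_percent(residency_a: int, residency_b: int,
                      counter_a: int, counter_b: int) -> Optional[float]:
    """Percent of accumulation cycles spent throttled between samples A and B."""
    cycles = counter_b - counter_a
    if cycles <= 0:
        return None  # no new cycles (or a counter reset); the ratio is undefined
    return (residency_b - residency_a) * 100 / cycles


# Made-up sample values for illustration only.
pviol = violation_percent(100, 150, 1000, 1200)  # ppt_residency_acc samples
tviol = violation_percent(40, 40, 1000, 1200)    # socket_thm_residency_acc samples
print(pviol, tviol)
```

Passing `ppt_residency_acc` gives PVIOL and `socket_thm_residency_acc` gives TVIOL; any result above 0% indicates a violation during the sampled interval.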
@@ -3081,6 +3167,7 @@ rsmi_status_t rsmi_dev_reg_table_info_get(uint32_t dv_ind,
rsmi_name_value_t** reg_metrics,
uint32_t *num_of_metrics);
/**
* @brief This function sets the clock range information
*
@@ -225,7 +225,6 @@ class Device {
void set_drm_render_minor(uint32_t minor) {drm_render_minor_ = minor;}
static rsmi_dev_perf_level perfLvlStrToEnum(std::string s);
uint64_t bdfid(void) const {return bdfid_;}
int get_partition_id() const {return (bdfid_ >> 28) & 0xf; } // location_id[31:28]
void set_bdfid(uint64_t val) {bdfid_ = val;}
pthread_mutex_t *mutex(void) {return mutex_.ptr;}
evt::dev_evt_grp_set_t* supported_event_groups(void) {
@@ -261,6 +260,8 @@ class Device {
AMGpuMetricsPublicLatestTupl_t dev_copy_internal_to_external_metrics();
static const std::map<DevInfoTypes, const char*> devInfoTypesStrings;
void set_smi_device_id(uint32_t device_id) { m_device_id = device_id; }
void set_smi_partition_id(uint32_t partition_id) { m_partition_id = partition_id; }
static const char* get_type_string(DevInfoTypes type);
private:
@@ -298,6 +299,8 @@ class Device {
GpuMetricsBasePtr m_gpu_metrics_ptr;
AMDGpuMetricsHeader_v1_t m_gpu_metrics_header;
uint64_t m_gpu_metrics_updated_timestamp;
uint32_t m_device_id;
uint32_t m_partition_id;
};
@@ -52,6 +52,7 @@
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <map>
#include <memory>
#include <type_traits>
@@ -64,21 +65,19 @@
* All 1.4 and newer GPU metrics are now defined in this header.
*
*/
namespace amd::smi
{
namespace amd::smi {
constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1 = 1;
constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_1 = 1;
constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_2 = 2;
constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_3 = 3;
constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4 = 4;
constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MAJOR_VER = kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1;
constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MINON_VER = kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4;
constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MAJOR_VER
= kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1;
constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MINON_VER
= kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4;
// Note: As gpu metrics are updating
constexpr uint32_t kRSMI_GPU_METRICS_EXPIRATION_SECS = 5;
// Note: This *must* match NUM_HBM_INSTANCES
constexpr uint32_t kRSMI_MAX_NUM_HBM_INSTANCES = 4;
@@ -97,23 +96,36 @@ constexpr uint32_t kRSMI_MAX_NUM_VCNS = 4;
// Note: This *must* match NUM_JPEG_ENG
constexpr uint32_t kRSMI_MAX_JPEG_ENGINES = 32;
// Note: This *must* match MAX_XCC
constexpr uint32_t kRSMI_MAX_NUM_XCC = 8;
struct AMDGpuMetricsHeader_v1_t
{
// Note: This *must* match MAX_XCP
constexpr uint32_t kRSMI_MAX_NUM_XCP = 8;
struct AMDGpuMetricsHeader_v1_t {
uint16_t m_structure_size;
uint8_t m_format_revision;
uint8_t m_content_revision;
};
struct AMDGpuMetricsBase_t
{
struct amdgpu_xcp_metrics {
/* Utilization Instantaneous (%) */
uint32_t gfx_busy_inst[kRSMI_MAX_NUM_XCC];
uint16_t jpeg_busy[kRSMI_MAX_JPEG_ENGINES];
uint16_t vcn_busy[kRSMI_MAX_NUM_VCNS];
/* Utilization Accumulated (%) */
uint64_t gfx_busy_acc[kRSMI_MAX_NUM_XCC];
};
struct AMDGpuMetricsBase_t {
virtual ~AMDGpuMetricsBase_t() = default;
};
using AMDGpuMetricsBaseRef = AMDGpuMetricsBase_t&;
struct AMDGpuMetrics_v11_t
{
struct AMDGpuMetrics_v11_t {
~AMDGpuMetrics_v11_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -174,8 +186,7 @@ struct AMDGpuMetrics_v11_t
uint16_t m_temperature_hbm[kRSMI_MAX_NUM_HBM_INSTANCES];
};
struct AMDGpuMetrics_v12_t
{
struct AMDGpuMetrics_v12_t {
~AMDGpuMetrics_v12_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -238,8 +249,7 @@ struct AMDGpuMetrics_v12_t
uint64_t m_firmware_timestamp;
};
struct AMDGpuMetrics_v13_t
{
struct AMDGpuMetrics_v13_t {
~AMDGpuMetrics_v13_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -298,7 +308,7 @@ struct AMDGpuMetrics_v13_t
uint32_t m_mem_activity_acc; // new in v1
uint16_t m_temperature_hbm[kRSMI_MAX_NUM_HBM_INSTANCES]; // new in v1
// PMFW attached timestamp (10ns resolution)
// PMFW attached timestamp (10ns resolution)
uint64_t m_firmware_timestamp;
// Voltage (mV)
@@ -312,8 +322,7 @@ struct AMDGpuMetrics_v13_t
uint64_t m_indep_throttle_status;
};
struct AMDGpuMetrics_v14_t
{
struct AMDGpuMetrics_v14_t {
~AMDGpuMetrics_v14_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -329,7 +338,7 @@ struct AMDGpuMetrics_v14_t
// Utilization (%)
uint16_t m_average_gfx_activity;
uint16_t m_average_umc_activity; // memory controller
uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode)
uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode)
// Energy (15.259uJ (2^-16) units)
uint64_t m_energy_accumulator;
@@ -345,9 +354,9 @@ struct AMDGpuMetrics_v14_t
// Link width (number of lanes) and speed (in 0.1 GT/s)
uint16_t m_pcie_link_width;
uint16_t m_pcie_link_speed; // in 0.1 GT/s
uint16_t m_pcie_link_speed; // in 0.1 GT/s
// XGMI bus width and bitrate (in Gbps)
// XGMI bus width and bitrate (in Gbps)
uint16_t m_xgmi_link_width;
uint16_t m_xgmi_link_speed;
@@ -358,7 +367,7 @@ struct AMDGpuMetrics_v14_t
// PCIE accumulated bandwidth (GB/sec)
uint64_t m_pcie_bandwidth_acc;
// PCIE instantaneous bandwidth (GB/sec)
// PCIE instantaneous bandwidth (GB/sec)
uint64_t m_pcie_bandwidth_inst;
// PCIE L0 to recovery state transition accumulated count
@@ -387,8 +396,7 @@ struct AMDGpuMetrics_v14_t
uint16_t m_padding;
};
struct AMDGpuMetrics_v15_t
{
struct AMDGpuMetrics_v15_t {
~AMDGpuMetrics_v15_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
@@ -404,7 +412,7 @@ struct AMDGpuMetrics_v15_t
// Utilization (%)
uint16_t m_average_gfx_activity;
uint16_t m_average_umc_activity; // memory controller
uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode)
uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode)
uint16_t m_jpeg_activity[kRSMI_MAX_JPEG_ENGINES]; // JPEG activity percent (encode/decode)
// Energy (15.259uJ (2^-16) units)
@@ -421,7 +429,7 @@ struct AMDGpuMetrics_v15_t
// Link width (number of lanes) and speed (in 0.1 GT/s)
uint16_t m_pcie_link_width;
uint16_t m_pcie_link_speed; // in 0.1 GT/s
uint16_t m_pcie_link_speed; // in 0.1 GT/s
// XGMI bus width and bitrate (in Gbps)
uint16_t m_xgmi_link_width;
@@ -468,7 +476,103 @@ struct AMDGpuMetrics_v15_t
uint16_t m_padding;
};
using AMGpuMetricsLatest_t = AMDGpuMetrics_v15_t;
struct AMDGpuMetrics_v16_t {
~AMDGpuMetrics_v16_t() = default;
struct AMDGpuMetricsHeader_v1_t m_common_header;
// Temperature (Celsius). It will be zero (0) if unsupported.
uint16_t m_temperature_hotspot;
uint16_t m_temperature_mem;
uint16_t m_temperature_vrsoc;
// Power (Watts)
uint16_t m_current_socket_power;
// Utilization (%)
uint16_t m_average_gfx_activity;
uint16_t m_average_umc_activity; // memory controller
// Energy (15.259uJ (2^-16) units)
uint64_t m_energy_accumulator;
// Driver attached timestamp (in ns)
uint64_t m_system_clock_counter;
/*
* Important: the public structure exposes the following fields as uint64_t
* because of a planned size increase for newer ASICs
*/
/* Accumulation cycle counter */
uint32_t m_accumulation_counter;
/* Accumulated throttler residencies */
uint32_t m_prochot_residency_acc;
uint32_t m_ppt_residency_acc;
uint32_t m_socket_thm_residency_acc;
uint32_t m_vr_thm_residency_acc;
uint32_t m_hbm_thm_residency_acc;
// Clock Lock Status. Each bit corresponds to clock instance
uint32_t m_gfxclk_lock_status;
// Link width (number of lanes) and speed (in 0.1 GT/s)
uint16_t m_pcie_link_width;
uint16_t m_pcie_link_speed; // in 0.1 GT/s
// XGMI bus width and bitrate (in Gbps)
uint16_t m_xgmi_link_width;
uint16_t m_xgmi_link_speed;
// Utilization Accumulated (%)
uint32_t m_gfx_activity_acc;
uint32_t m_mem_activity_acc;
// PCIE accumulated bandwidth (GB/sec)
uint64_t m_pcie_bandwidth_acc;
// PCIE instantaneous bandwidth (GB/sec)
uint64_t m_pcie_bandwidth_inst;
// PCIE L0 to recovery state transition accumulated count
uint64_t m_pcie_l0_to_recov_count_acc;
// PCIE replay accumulated count
uint64_t m_pcie_replay_count_acc;
// PCIE replay rollover accumulated count
uint64_t m_pcie_replay_rover_count_acc;
// PCIE NAK sent accumulated count
uint32_t m_pcie_nak_sent_count_acc;
// PCIE NAK received accumulated count
uint32_t m_pcie_nak_rcvd_count_acc;
// XGMI accumulated data transfer size (KiloBytes)
uint64_t m_xgmi_read_data_acc[kRSMI_MAX_NUM_XGMI_LINKS];
uint64_t m_xgmi_write_data_acc[kRSMI_MAX_NUM_XGMI_LINKS];
// PMFW attached timestamp (10ns resolution)
uint64_t m_firmware_timestamp;
// Current clocks (MHz)
uint16_t m_current_gfxclk[kRSMI_MAX_NUM_GFX_CLKS];
uint16_t m_current_socclk[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_vclk0[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_dclk0[kRSMI_MAX_NUM_CLKS];
uint16_t m_current_uclk;
/* Current number of partitions */
uint16_t m_num_partition;
/* XCP (Graphic Cluster Partitions) metrics stats */
struct amdgpu_xcp_metrics m_xcp_stats[kRSMI_MAX_NUM_XCP];
/* PCIE other end recovery counter */
uint32_t m_pcie_lc_perf_other_end_recovery;
};
using AMGpuMetricsLatest_t = AMDGpuMetrics_v16_t;
/**
* This is GPU Metrics version that gets to public access.
@@ -555,8 +659,7 @@ using AMDGpuMetricVersionFlagId_t = uint32_t;
* Each Metric Unit (or a set of them) is related to a Metric class.
*
*/
enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t
{
enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t {
kGpuMetricHeader,
kGpuMetricTemperature,
kGpuMetricUtilization,
@@ -569,6 +672,9 @@ enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t
kGpuMetricLinkWidthSpeed,
kGpuMetricVoltage,
kGpuMetricTimestamp,
kGpuMetricThrottleResidency,
kGpuMetricPartition,
kGpuMetricXcpStats,
};
using AMDGpuMetricsClassIdTranslationTbl_t = std::map<AMDGpuMetricsClassId_t, std::string>;
@@ -605,8 +711,8 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t
kMetricAvgMmActivity,
kMetricGfxActivityAccumulator,
kMetricMemActivityAccumulator,
kMetricVcnActivity, //v1.4
kMetricJpegActivity, //v1.5
kMetricVcnActivity, // v1.4
kMetricJpegActivity, // v1.5
// kGpuMetricAverageClock counters
kMetricAvgGfxClockFrequency,
@@ -618,11 +724,11 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t
kMetricAvgDClock1Frequency,
// kGpuMetricCurrentClock counters
kMetricCurrGfxClock, //v1.4: Changed to multi-valued
kMetricCurrSocClock, //v1.4: Changed to multi-valued
kMetricCurrGfxClock, // v1.4: Changed to multi-valued
kMetricCurrSocClock, // v1.4: Changed to multi-valued
kMetricCurrUClock,
kMetricCurrVClock0, //v1.4: Changed to multi-valued
kMetricCurrDClock0, //v1.4: Changed to multi-valued
kMetricCurrVClock0, // v1.4: Changed to multi-valued
kMetricCurrDClock0, // v1.4: Changed to multi-valued
kMetricCurrVClock1,
kMetricCurrDClock1,
@@ -631,7 +737,7 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t
kMetricIndepThrottleStatus,
// kGpuMetricGfxClkLockStatus counters
kMetricGfxClkLockStatus, //v1.4
kMetricGfxClkLockStatus, // v1.4
// kGpuMetricCurrentFanSpeed counters
kMetricCurrFanSpeed,
@@ -639,31 +745,50 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t
// kGpuMetricLinkWidthSpeed counters
kMetricPcieLinkWidth,
kMetricPcieLinkSpeed,
kMetricPcieBandwidthAccumulator, //v1.4
kMetricPcieBandwidthInst, //v1.4
kMetricXgmiLinkWidth, //v1.4
kMetricXgmiLinkSpeed, //v1.4
kMetricXgmiReadDataAccumulator, //v1.4
kMetricXgmiWriteDataAccumulator, //v1.4
kMetricPcieL0RecovCountAccumulator, //v1.4
kMetricPcieReplayCountAccumulator, //v1.4
kMetricPcieReplayRollOverCountAccumulator, //v1.4
kMetricPcieNakSentCountAccumulator, //v1.5
kMetricPcieNakReceivedCountAccumulator, //v1.5
kMetricPcieBandwidthAccumulator, // v1.4
kMetricPcieBandwidthInst, // v1.4
kMetricXgmiLinkWidth, // v1.4
kMetricXgmiLinkSpeed, // v1.4
kMetricXgmiReadDataAccumulator, // v1.4
kMetricXgmiWriteDataAccumulator, // v1.4
kMetricPcieL0RecovCountAccumulator, // v1.4
kMetricPcieReplayCountAccumulator, // v1.4
kMetricPcieReplayRollOverCountAccumulator, // v1.4
kMetricPcieNakSentCountAccumulator, // v1.5
kMetricPcieNakReceivedCountAccumulator, // v1.5
// kGpuMetricPowerEnergy counters
kMetricAvgSocketPower,
kMetricCurrSocketPower, //v1.4
kMetricEnergyAccumulator, //v1.4
kMetricCurrSocketPower, // v1.4
kMetricEnergyAccumulator, // v1.4
// kGpuMetricVoltage counters
kMetricVoltageSoc, //v1.3
kMetricVoltageGfx, //v1.3
kMetricVoltageMem, //v1.3
kMetricVoltageSoc, // v1.3
kMetricVoltageGfx, // v1.3
kMetricVoltageMem, // v1.3
// kGpuMetricTimestamp counters
kMetricTSClockCounter,
kMetricTSFirmware,
// kGpuMetricThrottleResidency counters
kMetricAccumulationCounter, // v1.6
kMetricProchotResidencyAccumulator, // v1.6
kMetricPPTResidencyAccumulator, // v1.6
kMetricSocketThmResidencyAccumulator, // v1.6
kMetricVRThmResidencyAccumulator, // v1.6
kMetricHBMThmResidencyAccumulator, // v1.6
// kGpuMetricPartition
kGpuMetricNumPartition, // v1.6
// kGpuMetricXcpStats
kMetricGfxBusyInst, // v1.6
kMetricJpegBusy, // v1.6
kMetricVcnBusy, // v1.6
kMetricGfxBusyAcc, // v1.6
kMetricPcieLCPerfOtherEndRecov, // v1.6
};
using AMDGpuMetricsUnitTypeTranslationTbl_t = std::map<AMDGpuMetricsUnitType_t, std::string>;
@@ -676,14 +801,14 @@ enum class AMDGpuMetricsDataType_t : AMDGpuMetricsDataTypeId_t
kUInt64,
};
struct AMDGpuDynamicMetricsValue_t
{
struct AMDGpuDynamicMetricsValue_t {
uint64_t m_value;
std::string m_info;
AMDGpuMetricsDataType_t m_original_type;
};
using AMDGpuDynamicMetricTblValues_t = std::vector<AMDGpuDynamicMetricsValue_t>;
using AMDGpuDynamicMetricsTbl_t = std::map<AMDGpuMetricsClassId_t, std::map<AMDGpuMetricsUnitType_t, AMDGpuDynamicMetricTblValues_t>>;
using AMDGpuDynamicMetricsTbl_t = std::map<AMDGpuMetricsClassId_t,
std::map<AMDGpuMetricsUnitType_t, AMDGpuDynamicMetricTblValues_t>>;
/*
@@ -700,13 +825,13 @@ enum class AMDGpuMetricVersionFlags_t : AMDGpuMetricVersionFlagId_t
kGpuMetricV13 = (0x1 << 3),
kGpuMetricV14 = (0x1 << 4),
kGpuMetricV15 = (0x1 << 5),
kGpuMetricV16 = (0x1 << 6),
};
using AMDGpuMetricVersionTranslationTbl_t = std::map<uint16_t, AMDGpuMetricVersionFlags_t>;
using GpuMetricTypePtr_t = std::shared_ptr<void>;
class GpuMetricsBase_t
{
public:
class GpuMetricsBase_t {
public:
virtual ~GpuMetricsBase_t() = default;
virtual size_t sizeof_metric_table() = 0;
virtual GpuMetricTypePtr_t get_metrics_table() = 0;
@@ -714,30 +839,32 @@ class GpuMetricsBase_t
virtual AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() = 0;
virtual rsmi_status_t populate_metrics_dynamic_tbl() = 0;
virtual AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() = 0;
virtual void set_device_id(uint32_t device_id) { m_device_id = device_id; }
virtual void set_partition_id(uint32_t partition_id) { m_partition_id = partition_id; }
virtual AMDGpuDynamicMetricsTbl_t get_metrics_dynamic_tbl() {
return m_metrics_dynamic_tbl;
}
protected:
protected:
AMDGpuDynamicMetricsTbl_t m_metrics_dynamic_tbl;
uint64_t m_metrics_timestamp;
uint32_t m_device_id;
uint32_t m_partition_id;
};
using GpuMetricsBasePtr = std::shared_ptr<GpuMetricsBase_t>;
using AMDGpuMetricFactories_t = const std::map<AMDGpuMetricVersionFlags_t, GpuMetricsBasePtr>;
class GpuMetricsBase_v11_t final : public GpuMetricsBase_t
{
public:
class GpuMetricsBase_v11_t final : public GpuMetricsBase_t {
public:
virtual ~GpuMetricsBase_v11_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v11_t);
}
GpuMetricTypePtr_t get_metrics_table() override
{
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v11_t*){});
}
@@ -745,13 +872,11 @@ class GpuMetricsBase_v11_t final : public GpuMetricsBase_t
return m_gpu_metric_ptr;
}
void dump_internal_metrics_table() override
{
void dump_internal_metrics_table() override {
return;
}
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override
{
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV11;
}
@@ -759,23 +884,20 @@ class GpuMetricsBase_v11_t final : public GpuMetricsBase_t
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuMetrics_v11_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v11_t> m_gpu_metric_ptr;
};
class GpuMetricsBase_v12_t final : public GpuMetricsBase_t
{
public:
class GpuMetricsBase_v12_t final : public GpuMetricsBase_t {
public:
~GpuMetricsBase_v12_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v12_t);
}
GpuMetricTypePtr_t get_metrics_table() override
{
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v12_t*){});
}
@@ -783,36 +905,31 @@ class GpuMetricsBase_v12_t final : public GpuMetricsBase_t
return m_gpu_metric_ptr;
}
void dump_internal_metrics_table() override
{
void dump_internal_metrics_table() override {
return;
}
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override
{
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV12;
}
rsmi_status_t populate_metrics_dynamic_tbl() override;
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuMetrics_v12_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v12_t> m_gpu_metric_ptr;
};
class GpuMetricsBase_v13_t final : public GpuMetricsBase_t
{
public:
class GpuMetricsBase_v13_t final : public GpuMetricsBase_t {
public:
~GpuMetricsBase_v13_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v13_t);
}
GpuMetricTypePtr_t get_metrics_table() override
{
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v13_t*){});
}
@@ -822,8 +939,7 @@ class GpuMetricsBase_v13_t final : public GpuMetricsBase_t
void dump_internal_metrics_table() override;
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override
{
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV13;
}
@@ -831,23 +947,20 @@ class GpuMetricsBase_v13_t final : public GpuMetricsBase_t
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuMetrics_v13_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v13_t> m_gpu_metric_ptr;
};
class GpuMetricsBase_v14_t final : public GpuMetricsBase_t
{
public:
class GpuMetricsBase_v14_t final : public GpuMetricsBase_t {
public:
~GpuMetricsBase_v14_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v14_t);
}
GpuMetricTypePtr_t get_metrics_table() override
{
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v14_t*){});
}
@@ -857,8 +970,7 @@ class GpuMetricsBase_v14_t final : public GpuMetricsBase_t
void dump_internal_metrics_table() override;
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override
{
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV14;
}
@@ -866,23 +978,20 @@ class GpuMetricsBase_v14_t final : public GpuMetricsBase_t
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuMetrics_v14_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v14_t> m_gpu_metric_ptr;
};
class GpuMetricsBase_v15_t final : public GpuMetricsBase_t
{
public:
class GpuMetricsBase_v15_t final : public GpuMetricsBase_t {
public:
~GpuMetricsBase_v15_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v15_t);
}
GpuMetricTypePtr_t get_metrics_table() override
{
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v15_t*){});
}
@@ -892,8 +1001,7 @@ class GpuMetricsBase_v15_t final : public GpuMetricsBase_t
void dump_internal_metrics_table() override;
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override
{
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV15;
}
@@ -901,20 +1009,51 @@ class GpuMetricsBase_v15_t final : public GpuMetricsBase_t
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
private:
AMDGpuMetrics_v15_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v15_t> m_gpu_metric_ptr;
};
class GpuMetricsBase_v16_t final : public GpuMetricsBase_t {
public:
~GpuMetricsBase_v16_t() = default;
size_t sizeof_metric_table() override {
return sizeof(AMDGpuMetrics_v16_t);
}
GpuMetricTypePtr_t get_metrics_table() override {
if (!m_gpu_metric_ptr) {
m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v16_t*){});
}
assert(m_gpu_metric_ptr != nullptr);
return m_gpu_metric_ptr;
}
void dump_internal_metrics_table() override;
AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override {
return AMDGpuMetricVersionFlags_t::kGpuMetricV16;
}
rsmi_status_t populate_metrics_dynamic_tbl() override;
AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override;
private:
AMDGpuMetrics_v16_t m_gpu_metrics_tbl;
std::shared_ptr<AMDGpuMetrics_v16_t> m_gpu_metric_ptr;
};
template<typename T>
rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnitType_t metric_counter, T& metric_value);
rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind,
AMDGpuMetricsUnitType_t metric_counter, T& metric_value);
} // namespace amd::smi
} // namespace amd::smi
rsmi_status_t
rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind, metrics_table_header_t& header_value);
rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind,
metrics_table_header_t& header_value);
#endif // ROCM_SMI_ROCM_SMI_GPU_METRICS_H_
#endif // ROCM_SMI_ROCM_SMI_GPU_METRICS_H_
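Each `GpuMetricsBase_vNN_t` class above hands out its internal table through a `std::shared_ptr` that is reset with a no-op deleter, so the pointer aliases a member rather than owning heap memory. The following is a minimal sketch of that pattern only; `DemoMetrics`, `DemoMetricsHolder`, and `fill_from_bytes` are hypothetical names for illustration and are not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>

// Hypothetical two-field table standing in for an AMDGpuMetrics_vNN_t struct.
struct DemoMetrics {
  std::uint16_t temperature;
  std::uint16_t power;
};

// Hypothetical holder mirroring the shape of the GpuMetricsBase_vNN_t classes.
class DemoMetricsHolder {
 public:
  std::size_t sizeof_metric_table() const { return sizeof(DemoMetrics); }

  // Returns a type-erased, non-owning pointer to the member table.
  // The no-op deleter keeps shared_ptr from ever freeing the member.
  std::shared_ptr<void> get_metrics_table() {
    if (!m_ptr) {
      m_ptr.reset(&m_tbl, [](DemoMetrics*) {});
    }
    assert(m_ptr != nullptr);
    return m_ptr;
  }

  const DemoMetrics& table() const { return m_tbl; }

 private:
  DemoMetrics m_tbl{};
  std::shared_ptr<DemoMetrics> m_ptr;
};

// Copies a raw byte buffer (for example, one read from a binary file)
// into the holder's table through the type-erased pointer.
inline void fill_from_bytes(DemoMetricsHolder& holder, const void* src,
                            std::size_t len) {
  assert(len >= holder.sizeof_metric_table());
  std::memcpy(holder.get_metrics_table().get(), src,
              holder.sizeof_metric_table());
}
```

Because the deleter does nothing, the returned pointer is only valid while the holder object is alive; callers that keep the `shared_ptr` longer would be left with a dangling pointer.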
@@ -59,6 +59,7 @@
#include <tuple>
#include <type_traits>
#include <vector>
#include <utility>
#include "rocm_smi/rocm_smi_device.h"
@@ -604,6 +605,74 @@ using TextFileTagContents_t = TagTextContents_t<std::string, std::string,
std::string, std::string>;
//
// Note: Output iterator that inserts a delimiter between elements.
//
template<typename DelimiterType, typename CharType = char,
typename TraitsType = std::char_traits<CharType>>
class ostream_joiner {
public:
using Char_t = CharType;
using Traits_t = TraitsType;
using Ostream_t = std::basic_ostream<Char_t, Traits_t>;
using iterator_category = std::output_iterator_tag;
using value_type = void;
using difference_type = void;
using pointer = void;
using reference = void;
ostream_joiner(Ostream_t* outstream,
const DelimiterType& delimiter) noexcept
(std::is_nothrow_copy_constructible_v<DelimiterType>)
: m_outstream(outstream), m_delimiter(delimiter) {}
ostream_joiner(Ostream_t* outstream, DelimiterType&& delimiter) noexcept
(std::is_nothrow_move_constructible_v<DelimiterType>)
: m_outstream(outstream), m_delimiter(std::move(delimiter)) {}
template<typename ValueType> ostream_joiner& operator=(const ValueType& value) {
if (!m_is_first) {
*m_outstream << m_delimiter;
}
this->m_is_first = false;
this->m_value_count++;
if ((m_value_count % kMAX_VALUES_PER_LINE) == 0) {
*m_outstream << "\n" << value;
this->m_value_count = 0;
} else {
*m_outstream << value;
}
return *this;
}
ostream_joiner& operator*() noexcept { return *this; }
ostream_joiner& operator++() noexcept { return *this; }
ostream_joiner& operator++(int) noexcept { return *this; }
private:
Ostream_t* m_outstream;
DelimiterType m_delimiter;
bool m_is_first = true;
uint32_t m_value_count = 0;
const uint32_t kMAX_VALUES_PER_LINE = 9;
};
/// Object generator for ostream_joiner.
template<typename CharType, typename TraitsType, typename DelimiterType>
inline ostream_joiner<std::decay_t<DelimiterType>, CharType, TraitsType>
make_ostream_joiner(std::basic_ostream<CharType, TraitsType>* outstream,
DelimiterType&& delimiter) {
return {
outstream,
std::forward<DelimiterType>(delimiter)
};
}
} // namespace smi
} // namespace amd
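The `ostream_joiner` added above is an output iterator, so it can be passed straight to standard algorithms such as `std::copy`. The sketch below shows the idea with a simplified stand-in; `simple_joiner` and `join_values` are hypothetical names, and the values-per-line limit is a constructor parameter here rather than the fixed constant of 9 used in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the ostream_joiner in the diff.
// Writes the delimiter between values and puts a line break in front of
// every `values_per_line`-th value.
class simple_joiner {
 public:
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = void;
  using pointer = void;
  using reference = void;

  simple_joiner(std::ostream* out, std::string delimiter,
                std::uint32_t values_per_line)
      : m_out(out), m_delimiter(std::move(delimiter)),
        m_values_per_line(values_per_line) {
    assert(values_per_line > 0);
  }

  template <typename ValueType>
  simple_joiner& operator=(const ValueType& value) {
    if (!m_is_first) {
      *m_out << m_delimiter;
    }
    m_is_first = false;
    ++m_count;
    if ((m_count % m_values_per_line) == 0) {
      *m_out << "\n" << value;
      m_count = 0;
    } else {
      *m_out << value;
    }
    return *this;
  }

  // Dereference and increment are no-ops; assignment does the writing.
  simple_joiner& operator*() noexcept { return *this; }
  simple_joiner& operator++() noexcept { return *this; }
  simple_joiner& operator++(int) noexcept { return *this; }

 private:
  std::ostream* m_out;
  std::string m_delimiter;
  std::uint32_t m_values_per_line;
  bool m_is_first = true;
  std::uint32_t m_count = 0;
};

// Joins the values into one string using the joiner as the copy target.
inline std::string join_values(const std::vector<int>& values,
                               const std::string& delimiter,
                               std::uint32_t values_per_line) {
  std::ostringstream out;
  std::copy(values.begin(), values.end(),
            simple_joiner(&out, delimiter, values_per_line));
  return out.str();
}
```

With this logic the delimiter is written before the line break, so a wrapped line ends with the delimiter; the sketch keeps that behavior to stay close to the code in the diff.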
+10 -7
View file
@@ -1006,6 +1006,7 @@ const char* Device::get_type_string(DevInfoTypes type) {
return "Unknown";
}
int Device::readDevInfoBinary(DevInfoTypes type, std::size_t b_size,
void *p_binary_data) {
auto sysfs_path = path_;
@@ -1043,15 +1044,17 @@ int Device::readDevInfoBinary(DevInfoTypes type, std::size_t b_size,
LOG_ERROR(ss);
return ENOENT;
}
ss << "Successfully read DevInfoBinary for DevInfoType ("
<< get_type_string(type) << ") - SYSFS ("
<< sysfs_path << "), returning binaryData = " << p_binary_data
<< "; byte_size = " << std::dec << static_cast<int>(b_size);
if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) {
ss << "Successfully read DevInfoBinary for DevInfoType ("
<< get_type_string(type) << ") - SYSFS ("
<< sysfs_path << "), returning binaryData = " << p_binary_data
<< "; byte_size = " << std::dec << static_cast<int>(b_size);
std::string metricDescription = "AMD SMI GPU METRICS (16-byte width), "
std::string metricDescription = "AMD SMI GPU METRICS (16-byte width), "
+ sysfs_path;
logHexDump(metricDescription.c_str(), p_binary_data, b_size, 16);
LOG_INFO(ss);
logHexDump(metricDescription.c_str(), p_binary_data, b_size, 16);
LOG_INFO(ss);
}
return 0;
}
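The change above builds the log message and hex dump only when the logger is enabled, which avoids formatting work on the hot path. A rough sketch of that idea follows, assuming a hypothetical `hex_dump` helper and a plain `bool` flag in place of the real logger query; it is not the library's `logHexDump`.

```cpp
#include <cassert>
#include <cstddef>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats `size` bytes as two-digit lowercase hex,
// `width` bytes per row, one row per line.
inline std::string hex_dump(const void* data, std::size_t size,
                            std::size_t width) {
  assert(data != nullptr || size == 0);
  if (width == 0) {
    width = 16;  // fall back to the 16-byte rows used in the diff
  }
  const auto* bytes = static_cast<const unsigned char*>(data);
  std::ostringstream out;
  out << std::hex << std::setfill('0');
  for (std::size_t i = 0; i < size; ++i) {
    out << std::setw(2) << static_cast<unsigned int>(bytes[i]);
    const bool end_of_row = ((i + 1) % width == 0) || (i + 1 == size);
    out << (end_of_row ? '\n' : ' ');
  }
  return out.str();
}

// Only pays for the formatting when logging is enabled.
// `logging_enabled` stands in for a real logger query.
inline std::string dump_if_enabled(bool logging_enabled, const void* data,
                                   std::size_t size) {
  if (!logging_enabled) {
    return {};
  }
  return hex_dump(data, size, 16);
}
```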
File diff is too large
-1
View file
@@ -395,7 +395,6 @@ Monitor::setVoltSensorLabelMap(void) {
volt_type_index_map_[t_type] = file_index;
index_volt_type_map_.insert({file_index, t_type});
}
return 0;
};
+467 -12
View file
@@ -51,11 +51,13 @@
#include <iomanip>
#include <iostream>
#include <fstream>
#include <queue>
#include <vector>
#include <set>
#include <map>
#include <memory>
#include <limits>
#include <functional>
#include <xf86drm.h>
#include "amd_smi/amdsmi.h"
#include "amd_smi/impl/fdinfo.h"
@@ -567,7 +569,9 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand
amd::smi::AMDSmiProcessor* device = nullptr;
amdsmi_status_t ret = amd::smi::AMDSmiSystem::getInstance()
.handle_to_processor(processor_handle, &device);
if (ret != AMDSMI_STATUS_SUCCESS) return ret;
if (ret != AMDSMI_STATUS_SUCCESS) {
return ret;
}
if (device->get_processor_type() != AMDSMI_PROCESSOR_TYPE_AMD_GPU) {
return AMDSMI_STATUS_NOT_SUPPORTED;
@@ -575,8 +579,9 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand
amd::smi::AMDSmiGPUDevice* gpu_device = nullptr;
amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device);
if (r != AMDSMI_STATUS_SUCCESS)
if (r != AMDSMI_STATUS_SUCCESS) {
return r;
}
struct drm_amdgpu_info_vram_gtt gtt;
uint64_t vram_used = 0;
@@ -590,13 +595,282 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand
r = gpu_device->amdgpu_query_info(AMDGPU_INFO_VRAM_USAGE,
sizeof(vram_used), &vram_used);
if (r != AMDSMI_STATUS_SUCCESS) return r;
if (r != AMDSMI_STATUS_SUCCESS) {
return r;
}
vram_info->vram_used = static_cast<uint32_t>(vram_used / (1024 * 1024));
return AMDSMI_STATUS_SUCCESS;
}
static void system_wait(int milli_seconds) {
std::ostringstream ss;
auto start = std::chrono::high_resolution_clock::now();
// 1 ms = 1000 us
int waitTime = milli_seconds * 1000;
ss << __PRETTY_FUNCTION__ << " | "
<< "** Waiting for " << std::dec << waitTime
<< " us (" << waitTime/1000 << " ms) **";
LOG_DEBUG(ss);
usleep(waitTime);
auto stop = std::chrono::high_resolution_clock::now();
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
ss << __PRETTY_FUNCTION__ << " | "
<< "** Waiting took " << duration.count() / 1000
<< " milliseconds **";
LOG_DEBUG(ss);
}
amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_handle,
amdsmi_violation_status_t *violation_status) {
AMDSMI_CHECK_INIT();
std::ostringstream ss;
if (violation_status == nullptr) {
return AMDSMI_STATUS_INVAL;
}
// 1 sec = 1000 ms = 1000000 us
constexpr uint64_t kFASTEST_POLL_TIME_MS = 1; // fastest SMU FW sample time is 1ms
violation_status->reference_timestamp = std::numeric_limits<uint64_t>::max();
violation_status->violation_timestamp = std::numeric_limits<uint64_t>::max();
violation_status->per_prochot_thrm = std::numeric_limits<uint64_t>::max();
violation_status->per_ppt_pwr = std::numeric_limits<uint64_t>::max();
violation_status->per_socket_thrm = std::numeric_limits<uint64_t>::max();
violation_status->per_vr_thrm = std::numeric_limits<uint64_t>::max();
violation_status->per_hbm_thrm = std::numeric_limits<uint64_t>::max();
violation_status->active_prochot_thrm = std::numeric_limits<uint8_t>::max();
violation_status->active_ppt_pwr = std::numeric_limits<uint8_t>::max();
violation_status->active_socket_thrm = std::numeric_limits<uint8_t>::max();
violation_status->active_vr_thrm = std::numeric_limits<uint8_t>::max();
violation_status->active_hbm_thrm = std::numeric_limits<uint8_t>::max();
const auto p1 = std::chrono::system_clock::now();
auto current_time = std::chrono::duration_cast<std::chrono::microseconds>(
p1.time_since_epoch()).count();
violation_status->reference_timestamp = current_time;
amd::smi::AMDSmiProcessor* device = nullptr;
amdsmi_status_t ret = amd::smi::AMDSmiSystem::getInstance()
.handle_to_processor(processor_handle, &device);
if (ret != AMDSMI_STATUS_SUCCESS) {
return ret;
}
if (device->get_processor_type() != AMDSMI_PROCESSOR_TYPE_AMD_GPU) {
return AMDSMI_STATUS_NOT_SUPPORTED;
}
amd::smi::AMDSmiGPUDevice* gpu_device = nullptr;
amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device);
if (r != AMDSMI_STATUS_SUCCESS) {
return r;
}
amdsmi_gpu_metrics_t metric_info_a = {};
amdsmi_status_t status = amdsmi_get_gpu_metrics_info(
processor_handle, &metric_info_a);
if (status != AMDSMI_STATUS_SUCCESS) {
return status;
}
// if all of these values are "undefined" then the feature is not supported on the ASIC
if (metric_info_a.accumulation_counter == std::numeric_limits<uint64_t>::max()
&& metric_info_a.prochot_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.ppt_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.socket_thm_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.vr_thm_residency_acc == std::numeric_limits<uint64_t>::max()
&& metric_info_a.hbm_thm_residency_acc == std::numeric_limits<uint64_t>::max()) {
ss << __PRETTY_FUNCTION__
<< " | ASIC does not support throttle violations, "
<< "returning AMDSMI_STATUS_NOT_SUPPORTED";
LOG_INFO(ss);
return AMDSMI_STATUS_NOT_SUPPORTED;
}
// wait 1ms before reading again
system_wait(static_cast<int>(kFASTEST_POLL_TIME_MS));
amdsmi_gpu_metrics_t metric_info_b = {};
status = amdsmi_get_gpu_metrics_info(
processor_handle, &metric_info_b);
if (status != AMDSMI_STATUS_SUCCESS) {
return status;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "[gpu_metrics A] metric_info_a.accumulation_counter: " << std::dec
<< metric_info_a.accumulation_counter
<< "; metric_info_a.prochot_residency_acc: " << std::dec
<< metric_info_a.prochot_residency_acc
<< "; metric_info_a.ppt_residency_acc (pviol): " << std::dec
<< metric_info_a.ppt_residency_acc
<< "; metric_info_a.socket_thm_residency_acc (tviol): " << std::dec
<< metric_info_a.socket_thm_residency_acc
<< "; metric_info_a.vr_thm_residency_acc: " << std::dec
<< metric_info_a.vr_thm_residency_acc
<< "; metric_info_a.hbm_thm_residency_acc: " << std::dec
<< metric_info_a.hbm_thm_residency_acc << "\n"
<< " [gpu_metrics B] metric_info_b.accumulation_counter: " << std::dec
<< metric_info_b.accumulation_counter
<< "; metric_info_b.prochot_residency_acc: " << std::dec
<< metric_info_b.prochot_residency_acc
<< "; metric_info_b.ppt_residency_acc (pviol): " << std::dec
<< metric_info_b.ppt_residency_acc
<< "; metric_info_b.socket_thm_residency_acc (tviol): " << std::dec
<< metric_info_b.socket_thm_residency_acc
<< "; metric_info_b.vr_thm_residency_acc: " << std::dec
<< metric_info_b.vr_thm_residency_acc
<< "; metric_info_b.hbm_thm_residency_acc: " << std::dec
<< metric_info_b.hbm_thm_residency_acc
<< "\n";
LOG_DEBUG(ss);
if ( (metric_info_b.prochot_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.prochot_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.prochot_residency_acc >= metric_info_a.prochot_residency_acc)
&& ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) {
violation_status->per_prochot_thrm =
(((metric_info_b.prochot_residency_acc - metric_info_a.prochot_residency_acc) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_prochot_thrm > 0) {
violation_status->active_prochot_thrm = 1;
violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS;
} else {
violation_status->active_prochot_thrm = 0;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED prochot_residency_acc | per_prochot_thrm: " << std::dec
<< violation_status->per_prochot_thrm
<< "%; active_prochot_thrm = " << std::dec
<< static_cast<int>(violation_status->active_prochot_thrm) << "\n";
LOG_DEBUG(ss);
}
if ( (metric_info_b.ppt_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.ppt_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.ppt_residency_acc >= metric_info_a.ppt_residency_acc)
&& ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) {
violation_status->per_ppt_pwr =
(((metric_info_b.ppt_residency_acc - metric_info_a.ppt_residency_acc) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_ppt_pwr > 0) {
violation_status->active_ppt_pwr = 1;
violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS;
} else {
violation_status->active_ppt_pwr = 0;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED ppt_residency_acc | per_ppt_pwr: " << std::dec
<< violation_status->per_ppt_pwr
<< "%; active_ppt_pwr = " << std::dec
<< static_cast<int>(violation_status->active_ppt_pwr) << "\n";
LOG_DEBUG(ss);
}
if ( (metric_info_b.socket_thm_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.socket_thm_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.socket_thm_residency_acc >= metric_info_a.socket_thm_residency_acc)
&& ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) {
violation_status->per_socket_thrm =
(((metric_info_b.socket_thm_residency_acc -
metric_info_a.socket_thm_residency_acc) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_socket_thrm > 0) {
violation_status->active_socket_thrm = 1;
violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS;
} else {
violation_status->active_socket_thrm = 0;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED socket_thm_residency_acc | per_socket_thrm: " << std::dec
<< violation_status->per_socket_thrm
<< "%; active_socket_thrm = " << std::dec
<< static_cast<int>(violation_status->active_socket_thrm) << "\n";
LOG_DEBUG(ss);
}
if ( (metric_info_b.vr_thm_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.vr_thm_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.vr_thm_residency_acc >= metric_info_a.vr_thm_residency_acc)
&& ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) {
violation_status->per_vr_thrm =
(((metric_info_b.vr_thm_residency_acc -
metric_info_a.vr_thm_residency_acc) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_vr_thrm > 0) {
violation_status->active_vr_thrm = 1;
violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS;
} else {
violation_status->active_vr_thrm = 0;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED vr_thm_residency_acc | per_vr_thrm: " << std::dec
<< violation_status->per_vr_thrm
<< "%; active_vr_thrm = " << std::dec
<< static_cast<int>(violation_status->active_vr_thrm) << "\n";
LOG_DEBUG(ss);
}
if ( (metric_info_b.hbm_thm_residency_acc != std::numeric_limits<uint64_t>::max()
|| metric_info_a.hbm_thm_residency_acc != std::numeric_limits<uint64_t>::max())
&& (metric_info_b.hbm_thm_residency_acc >= metric_info_a.hbm_thm_residency_acc)
&& ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0) ) {
violation_status->per_hbm_thrm =
(((metric_info_b.hbm_thm_residency_acc -
metric_info_a.hbm_thm_residency_acc) * 100) /
(metric_info_b.accumulation_counter - metric_info_a.accumulation_counter));
if (violation_status->per_hbm_thrm > 0) {
violation_status->active_hbm_thrm = 1;
violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS;
} else {
violation_status->active_hbm_thrm = 0;
}
ss << __PRETTY_FUNCTION__ << " | "
<< "ENTERED hbm_thm_residency_acc | per_hbm_thrm: " << std::dec
<< violation_status->per_hbm_thrm
<< "%; active_hbm_thrm = " << std::dec
<< static_cast<int>(violation_status->active_hbm_thrm) << "\n";
LOG_DEBUG(ss);
}
ss << __PRETTY_FUNCTION__ << " | "
<< "RETURNING AMDSMI_STATUS_SUCCESS | "
<< "violation_status->reference_timestamp (time since epoch): " << std::dec
<< violation_status->reference_timestamp
<< "; violation_status->violation_timestamp (ms): " << std::dec
<< violation_status->violation_timestamp
<< "; violation_status->per_prochot_thrm (%): " << std::dec
<< violation_status->per_prochot_thrm
<< "; violation_status->per_ppt_pwr (%): " << std::dec
<< violation_status->per_ppt_pwr
<< "; violation_status->per_socket_thrm (%): " << std::dec
<< violation_status->per_socket_thrm
<< "; violation_status->per_vr_thrm (%): " << std::dec
<< violation_status->per_vr_thrm
<< "; violation_status->per_hbm_thrm (%): " << std::dec
<< violation_status->per_hbm_thrm
<< "; violation_status->active_prochot_thrm (bool): " << std::dec
<< static_cast<int>(violation_status->active_prochot_thrm)
<< "; violation_status->active_ppt_pwr (bool): " << std::dec
<< static_cast<int>(violation_status->active_ppt_pwr)
<< "; violation_status->active_socket_thrm (bool): " << std::dec
<< static_cast<int>(violation_status->active_socket_thrm)
<< "; violation_status->active_vr_thrm (bool): " << std::dec
<< static_cast<int>(violation_status->active_vr_thrm)
<< "; violation_status->active_hbm_thrm (bool): " << std::dec
<< static_cast<int>(violation_status->active_hbm_thrm)
<< "\n";
LOG_INFO(ss);
return AMDSMI_STATUS_SUCCESS;
}
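The five per-throttler blocks in `amdsmi_get_violation_status()` all apply the same formula: the change in a residency accumulator between two samples, times 100, divided by the change in the accumulation counter. A sketch of that calculation as one reusable function is shown below. `residency_percent` and `kUndefined` are hypothetical names, and the guards are deliberately stricter than in the diff: both samples must be defined, and the counter must strictly increase.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Sentinel the diff uses for "not supported" 64-bit metric fields.
constexpr std::uint64_t kUndefined = std::numeric_limits<std::uint64_t>::max();

// Hypothetical helper: percentage of accumulation cycles spent in a
// throttled state between sample A (earlier) and sample B (later).
// Returns std::nullopt when the inputs cannot produce a meaningful value.
inline std::optional<std::uint64_t> residency_percent(
    std::uint64_t acc_a, std::uint64_t acc_b,
    std::uint64_t counter_a, std::uint64_t counter_b) {
  if (acc_a == kUndefined || acc_b == kUndefined ||
      counter_a == kUndefined || counter_b == kUndefined) {
    return std::nullopt;  // metric not reported by this ASIC
  }
  if (acc_b < acc_a || counter_b <= counter_a) {
    return std::nullopt;  // wrapped or stalled counters
  }
  const std::uint64_t counter_delta = counter_b - counter_a;
  assert(counter_delta != 0);  // guaranteed by the check above
  return ((acc_b - acc_a) * 100) / counter_delta;
}
```

A caller could then set the matching `per_*` field from the returned value and derive the `active_*` flag from whether the value is greater than zero, leaving both at their "undefined" defaults when the function returns `std::nullopt`.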
amdsmi_status_t amdsmi_get_gpu_fan_rpms(amdsmi_processor_handle processor_handle,
uint32_t sensor_ind, int64_t *speed) {
return rsmi_wrapper(rsmi_dev_fan_rpms_get, processor_handle, sensor_ind,
@@ -753,7 +1027,8 @@ amdsmi_get_gpu_asic_info(amdsmi_processor_handle processor_handle, amdsmi_asic_i
// default to 0xffff as not supported
info->oam_id = std::numeric_limits<uint16_t>::max();
uint16_t tmp_oam_id = 0;
status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle, &(tmp_oam_id));
status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle,
&(tmp_oam_id));
info->oam_id = tmp_oam_id;
// default to 0xffffffff as not supported
@@ -790,9 +1065,9 @@ amdsmi_status_t amdsmi_get_gpu_kfd_info(amdsmi_processor_handle processor_handle
info->kfd_id = std::numeric_limits<uint64_t>::max();
auto tmp_kfd_id = uint64_t(0);
status = rsmi_wrapper(rsmi_dev_guid_get, processor_handle, &(tmp_kfd_id));
if (status != AMDSMI_STATUS_SUCCESS) {
return status;
} else {
// Do not return early if this value fails
// continue to try getting all info
if (status == AMDSMI_STATUS_SUCCESS) {
info->kfd_id = tmp_kfd_id;
}
@@ -800,12 +1075,22 @@ amdsmi_status_t amdsmi_get_gpu_kfd_info(amdsmi_processor_handle processor_handle
info->node_id = std::numeric_limits<uint32_t>::max();
auto tmp_node_id = uint32_t(0);
status = rsmi_wrapper(rsmi_dev_node_id_get, processor_handle, &(tmp_node_id));
if (status != AMDSMI_STATUS_SUCCESS) {
return status;
} else {
// Do not return early if this value fails
// continue to try getting all info
if (status == AMDSMI_STATUS_SUCCESS) {
info->node_id = tmp_node_id;
}
// default to 0xffffffff as not supported
info->current_partition_id = std::numeric_limits<uint32_t>::max();
auto tmp_current_partition_id = uint32_t(0);
status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &(tmp_current_partition_id));
// Do not return early if this value fails
// continue to try getting all info
if (status == AMDSMI_STATUS_SUCCESS) {
info->current_partition_id = tmp_current_partition_id;
}
return AMDSMI_STATUS_SUCCESS;
}
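The hunks above switch `amdsmi_get_gpu_kfd_info` from fail-fast to best-effort: each field starts at its type's maximum value as a "not supported" sentinel and is overwritten only when its query succeeds. A minimal sketch of that pattern follows. `Status`, `KfdInfo`, `query_or_sentinel`, and the callable-based queries are hypothetical stand-ins for illustration, not AMD SMI API.

```cpp
#include <cstdint>
#include <functional>
#include <limits>

// Hypothetical stand-ins for amdsmi_status_t and amdsmi_kfd_info_t.
enum class Status { kSuccess, kNotSupported };

struct KfdInfo {
  uint64_t kfd_id;
  uint32_t node_id;
  uint32_t current_partition_id;
};

// Sets *field to the "not supported" sentinel (max value of T), then
// overwrites it only if the query succeeds. A failure is not propagated.
template <typename T>
void query_or_sentinel(const std::function<Status(T*)>& query, T* field) {
  *field = std::numeric_limits<T>::max();
  T tmp{};
  if (query(&tmp) == Status::kSuccess) {
    *field = tmp;
  }
}

// Fills every field independently, so one failing query does not prevent
// the remaining fields from being reported.
Status get_kfd_info(const std::function<Status(uint64_t*)>& kfd_query,
                    const std::function<Status(uint32_t*)>& node_query,
                    const std::function<Status(uint32_t*)>& partition_query,
                    KfdInfo* info) {
  query_or_sentinel<uint64_t>(kfd_query, &info->kfd_id);
  query_or_sentinel<uint32_t>(node_query, &info->node_id);
  query_or_sentinel<uint32_t>(partition_query, &info->current_partition_id);
  return Status::kSuccess;
}
```

With this shape, callers would need to compare each field against its sentinel rather than rely on the return code alone, since the function reports success even when individual queries fail.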
@@ -1277,8 +1562,11 @@ amdsmi_status_t amdsmi_get_gpu_metrics_info(
amdsmi_gpu_metrics_t *pgpu_metrics) {
AMDSMI_CHECK_INIT();
// nullptr api supported
if (pgpu_metrics != nullptr) {
*pgpu_metrics = {};
}
return rsmi_wrapper(rsmi_dev_gpu_metrics_info_get, processor_handle,
reinterpret_cast<rsmi_gpu_metrics_t*>(pgpu_metrics));
reinterpret_cast<rsmi_gpu_metrics_t*>(pgpu_metrics));
}
@@ -1447,7 +1735,6 @@ amdsmi_status_t amdsmi_get_clk_freq(amdsmi_processor_handle processor_handle,
clk_type == AMDSMI_CLK_TYPE_VCLK1 ||
clk_type == AMDSMI_CLK_TYPE_DCLK0 ||
clk_type == AMDSMI_CLK_TYPE_DCLK1 ) {
// when f == nullptr -> check if metrics are supported
amdsmi_gpu_metrics_t metric_info;
amdsmi_gpu_metrics_t * metric_info_p = nullptr;
@@ -2264,6 +2551,14 @@ amdsmi_status_t amdsmi_get_pcie_info(amdsmi_processor_handle processor_handle, a
*/
info->pcie_metric.pcie_nak_sent_count = translate_umax_or_assign_value<decltype(info->pcie_metric.pcie_nak_sent_count)>
(metric_info.pcie_nak_sent_count_acc, (metric_info.pcie_nak_sent_count_acc));
/**
* pcie_metric.pcie_lc_perf_other_end_recovery_count: (uint32_t)
*/
info->pcie_metric.pcie_lc_perf_other_end_recovery_count =
translate_umax_or_assign_value<decltype(
info->pcie_metric.pcie_lc_perf_other_end_recovery_count)> (
metric_info.pcie_lc_perf_other_end_recovery,
(metric_info.pcie_lc_perf_other_end_recovery));
return AMDSMI_STATUS_SUCCESS;
}
@@ -2321,6 +2616,166 @@ amdsmi_status_t amdsmi_get_processor_handle_from_bdf(amdsmi_bdf_t bdf,
return AMDSMI_STATUS_API_FAILED;
}
amdsmi_status_t
amdsmi_get_link_topology_nearest(amdsmi_processor_handle processor_handle,
amdsmi_link_type_t link_type,
amdsmi_topology_nearest_t* topology_nearest_info)
{
if (topology_nearest_info == nullptr) {
return amdsmi_status_t::AMDSMI_STATUS_INVAL;
}
if (link_type < amdsmi_link_type_t::AMDSMI_LINK_TYPE_INTERNAL ||
link_type > amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN) {
return amdsmi_status_t::AMDSMI_STATUS_INVAL;
}
auto status(amdsmi_status_t::AMDSMI_STATUS_SUCCESS);
constexpr auto kKFD_CRAT_INTRA_SOCKET_WEIGHT = uint32_t(13);
constexpr auto kKFD_CRAT_XGMI_WEIGHT = uint32_t(15);
/*
* Note: This will need to be eventually consolidated within a unique link type.
*/
static const std::map<amdsmi_link_type_t, amdsmi_io_link_type_t> kLinkToIoLinkTypeTranslationTable =
{
{amdsmi_link_type_t::AMDSMI_LINK_TYPE_INTERNAL, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED},
{amdsmi_link_type_t::AMDSMI_LINK_TYPE_XGMI, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_XGMI},
{amdsmi_link_type_t::AMDSMI_LINK_TYPE_PCIE, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_PCIEXPRESS},
{amdsmi_link_type_t::AMDSMI_LINK_TYPE_NOT_APPLICABLE, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED},
{amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED}
};
auto translated_link_type = [&](amdsmi_link_type_t link_type) {
auto io_link_type(amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED);
if (kLinkToIoLinkTypeTranslationTable.find(link_type) != kLinkToIoLinkTypeTranslationTable.end()) {
io_link_type = kLinkToIoLinkTypeTranslationTable.at(link_type);
}
return io_link_type;
};
auto translated_io_link_type = [&](amdsmi_io_link_type_t io_link_type) {
auto link_type(amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN);
for (const auto& [key, value] : kLinkToIoLinkTypeTranslationTable) {
if (value == io_link_type) {
link_type = key;
break;
}
}
return link_type;
};
//
struct LinkTopolyInfo_t
{
amdsmi_processor_handle target_processor_handle;
amdsmi_link_type_t link_type;
bool is_accessible;
uint64_t num_hops;
uint64_t link_weight;
};
using LinkTopogyOrderPair_t = std::pair<uint64_t, uint64_t>;
/*
* Note: The link topology table is sorted by the number of hops and link weight.
*/
struct LinkTopogyOrderCmp_t {
constexpr bool operator()(const LinkTopolyInfo_t& left,
const LinkTopolyInfo_t& right) const noexcept
{
// Fewer hops ranks nearer; link weight breaks ties between equal hop counts.
if (left.num_hops != right.num_hops) {
return (left.num_hops > right.num_hops);
}
else {
return (left.link_weight > right.link_weight);
}
}
};
std::priority_queue<LinkTopolyInfo_t,
std::vector<LinkTopolyInfo_t>,
LinkTopogyOrderCmp_t> link_topology_order{};
//
AMDSMI_CHECK_INIT();
auto socket_counter = uint32_t(0);
if (auto api_status = amdsmi_get_socket_handles(&socket_counter, nullptr);
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) {
return api_status;
}
// Use std::vector rather than a variable-length array, which is not standard C++.
std::vector<amdsmi_socket_handle> socket_list(socket_counter);
if (auto api_status = amdsmi_get_socket_handles(&socket_counter, socket_list.data());
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) {
return api_status;
}
uint32_t device_counter(AMDSMI_MAX_DEVICES);
amdsmi_processor_handle device_list[AMDSMI_MAX_DEVICES];
for (auto socket_idx = uint32_t(0); socket_idx < socket_counter; ++socket_idx) {
if (auto api_status = amdsmi_get_processor_handles(socket_list[socket_idx], &device_counter, device_list);
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) {
return api_status;
}
for (auto device_idx = uint32_t(0); device_idx < device_counter; ++device_idx) {
/* Note: Skip the processor handle that is being queried. */
if (processor_handle != device_list[device_idx]) {
// Accessibility?
auto is_accessible(false);
if (auto api_status = amdsmi_is_P2P_accessible(processor_handle, device_list[device_idx], &is_accessible);
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) || !is_accessible) {
continue;
}
// Link type matches what we are searching for?
auto io_link_type = translated_link_type(link_type);
auto io_link_type_bck(io_link_type);
auto num_hops = uint64_t(0);
if (auto api_status = amdsmi_topo_get_link_type(processor_handle, device_list[device_idx], &num_hops, &io_link_type);
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) || (translated_io_link_type(io_link_type) != link_type)) {
continue;
}
// Link weights
auto link_weight = uint64_t(0);
if (auto api_status = amdsmi_topo_get_link_weight(processor_handle, device_list[device_idx], &link_weight);
(api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) {
continue;
}
// Topology nearest info
LinkTopolyInfo_t link_info = {
.target_processor_handle = device_list[device_idx],
.link_type = translated_io_link_type(io_link_type),
.is_accessible = is_accessible,
.num_hops = num_hops,
.link_weight = link_weight
};
link_topology_order.push(link_info);
}
}
}
/*
* Note: The link topology table is sorted by the number of hops and link weight.
*/
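// Clear every slot; indexing with AMDSMI_MAX_DEVICES would write one element past the end.
for (auto& nearest_handle : topology_nearest_info->processor_list) {
nearest_handle = nullptr;
}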
topology_nearest_info->count = link_topology_order.size();
auto topology_nearest_counter = uint32_t(0);
while (!link_topology_order.empty()) {
auto link_info = link_topology_order.top();
link_topology_order.pop();
if (topology_nearest_counter < AMDSMI_MAX_DEVICES) {
topology_nearest_info->processor_list[topology_nearest_counter++] = link_info.target_processor_handle;
}
}
return status;
}
#ifdef ENABLE_ESMI_LIB
static amdsmi_status_t amdsmi_errno_to_esmi_status(amdsmi_status_t status)
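The comments in `amdsmi_get_link_topology_nearest` state that the topology table is ordered by number of hops and link weight. A minimal sketch of one such ordering with `std::priority_queue` follows, assuming that fewer hops means nearer and that a lower link weight breaks ties (an assumption, not confirmed by the source). `LinkInfo`, `FartherThan`, and `nearest_first` are hypothetical names, and an `int` stands in for the processor handle.

```cpp
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical, simplified stand-in for the per-link record in the diff.
struct LinkInfo {
  int device_id;         // stands in for the processor handle
  uint64_t num_hops;
  uint64_t link_weight;
};

// std::priority_queue keeps the "largest" element on top according to the
// comparator, so a greater-than comparison puts the smallest (nearest) link
// on top. Hops are compared first; weight only breaks ties.
struct FartherThan {
  bool operator()(const LinkInfo& left, const LinkInfo& right) const noexcept {
    if (left.num_hops != right.num_hops) {
      return left.num_hops > right.num_hops;
    }
    return left.link_weight > right.link_weight;
  }
};

// Returns device ids ordered from nearest to farthest.
std::vector<int> nearest_first(const std::vector<LinkInfo>& links) {
  std::priority_queue<LinkInfo, std::vector<LinkInfo>, FartherThan> order(
      FartherThan{}, links);
  std::vector<int> result;
  result.reserve(links.size());
  while (!order.empty()) {
    result.push_back(order.top().device_id);
    order.pop();
  }
  return result;
}
```

The comparator must be a strict weak ordering: it has to return `false` when both arguments are equivalent, which is why the comparisons use `>` rather than `>=`.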
+15 -2
@@ -237,7 +237,20 @@ amdsmi_status_t AMDSmiSystem::get_gpu_socket_id(uint32_t index,
return amd::smi::rsmi_to_amdsmi_status(ret);
}
/**
* | Name         | Field   | KFD property  | KFD -> PCIe ID (uint64_t)   |
* | ------------ | ------- | ------------- | --------------------------- |
* | Domain       | [63:32] | "domain"      | (DOMAIN & 0xFFFFFFFF) << 32 |
* | Partition id | [31:28] | "location id" | (LOCATION & 0xF0000000)     |
* | Reserved     | [27:16] | "location id" | N/A                         |
* | Bus          | [15: 8] | "location id" | (LOCATION & 0xFF00)         |
* | Device       | [ 7: 3] | "location id" | (LOCATION & 0xF8)           |
* | Function     | [ 2: 0] | "location id" | (LOCATION & 0x7)            |
*/
uint64_t domain = (bdfid >> 32) & 0xffffffff;
// may need to identify with partition_id in the future as well... TBD
uint64_t partition_id = (bdfid >> 28) & 0xf;
uint64_t bus = (bdfid >> 8) & 0xff;
uint64_t device_id = (bdfid >> 3) & 0x1f;
uint64_t function = bdfid & 0x7;
@@ -246,8 +259,8 @@ amdsmi_status_t AMDSmiSystem::get_gpu_socket_id(uint32_t index,
// represents a physical device.
std::stringstream ss;
ss << std::setfill('0') << std::uppercase << std::hex
<< std::setw(4) << domain << ":" << std::setw(2) << bus << ":"
<< std::setw(2) << device_id;
<< std::setw(4) << domain << ":" << std::setw(2) << bus << ":"
<< std::setw(2) << device_id;
socket_id = ss.str();
return AMDSMI_STATUS_SUCCESS;
}
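The bit layout documented in the comment table of `get_gpu_socket_id` can be exercised in isolation. The sketch below decodes a packed 64-bit BDF id with the same shifts and masks as the diff and builds the same `DOMAIN:BUS:DEVICE` string. `DecodedBdf`, `decode_bdf`, and `socket_id_string` are hypothetical helper names, not part of AMD SMI.

```cpp
#include <cstdint>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical holder for the decoded fields; not an AMD SMI type.
struct DecodedBdf {
  uint64_t domain;
  uint64_t partition_id;
  uint64_t bus;
  uint64_t device;
  uint64_t function;
};

// Splits the packed id using the bit ranges from the comment table:
// domain [63:32], partition id [31:28], bus [15:8], device [7:3],
// function [2:0]. Bits [27:16] are reserved and ignored.
DecodedBdf decode_bdf(uint64_t bdfid) {
  DecodedBdf out{};
  out.domain = (bdfid >> 32) & 0xffffffff;
  out.partition_id = (bdfid >> 28) & 0xf;
  out.bus = (bdfid >> 8) & 0xff;
  out.device = (bdfid >> 3) & 0x1f;
  out.function = bdfid & 0x7;
  return out;
}

// Formats "DDDD:BB:DD" in zero-padded upper-case hex. Function and
// partition id are left out, so all functions of one device share an id.
std::string socket_id_string(const DecodedBdf& bdf) {
  std::stringstream ss;
  ss << std::setfill('0') << std::uppercase << std::hex
     << std::setw(4) << bdf.domain << ":" << std::setw(2) << bdf.bus << ":"
     << std::setw(2) << bdf.device;
  return ss.str();
}
```

Because `std::setw` applies only to the next inserted value, it is repeated before each field, while `std::setfill`, `std::uppercase`, and `std::hex` stay in effect for the whole stream.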
@@ -46,7 +46,10 @@
#include <stdint.h>
#include <stddef.h>
#include <algorithm>
#include <iostream>
#include <iterator>
#include <sstream>
#include <string>
#include <map>
@@ -54,6 +57,7 @@
#include "amd_smi/amdsmi.h"
#include "gpu_metrics_read.h"
#include "../test_common.h"
#include "rocm_smi/rocm_smi_utils.h"
TestGpuMetricsRead::TestGpuMetricsRead() : TestBase() {
@@ -87,6 +91,7 @@ void TestGpuMetricsRead::Close() {
}
void TestGpuMetricsRead::Run(void) {
amdsmi_status_t err;
@@ -101,9 +106,10 @@ void TestGpuMetricsRead::Run(void) {
std::cout << "Device #" << std::to_string(i) << "\n";
IF_VERB(STANDARD) {
std::cout << "\n\n";
std::cout << "\t**GPU METRICS: Using static struct (Backwards Compatibility):\n";
}
amdsmi_gpu_metrics_t smu;
amdsmi_gpu_metrics_t smu = {};
err = amdsmi_get_gpu_metrics_info(processor_handles_[i], &smu);
const char *status_string;
amdsmi_status_code_to_string(err, &status_string);
@@ -122,250 +128,250 @@ void TestGpuMetricsRead::Run(void) {
IF_VERB(STANDARD) {
std::cout << "METRIC TABLE HEADER:\n";
std::cout << "structure_size=" << std::dec
<< static_cast<int>(smu.common_header.structure_size) << '\n';
<< static_cast<uint16_t>(smu.common_header.structure_size) << "\n";
std::cout << "format_revision=" << std::dec
<< static_cast<int>(smu.common_header.format_revision) << '\n';
<< static_cast<uint16_t>(smu.common_header.format_revision) << "\n";
std::cout << "content_revision=" << std::dec
<< static_cast<int>(smu.common_header.content_revision) << '\n';
<< static_cast<uint16_t>(smu.common_header.content_revision) << "\n";
std::cout << "\n";
std::cout << "TIME STAMPS (ns):\n";
std::cout << std::dec << "system_clock_counter="
<< smu.system_clock_counter << '\n';
std::cout << "firmware_timestamp (10ns resolution)=" << std::dec
<< smu.firmware_timestamp << '\n';
std::cout << std::dec << "system_clock_counter=" << smu.system_clock_counter << "\n";
std::cout << "firmware_timestamp (10ns resolution)=" << std::dec << smu.firmware_timestamp
<< "\n";
std::cout << "\n";
std::cout << "TEMPERATURES (C):\n";
std::cout << std::dec << "temperature_edge= "
<< static_cast<uint16_t>(smu.temperature_edge) << '\n';
std::cout << std::dec << "temperature_hotspot= "
<< static_cast<uint16_t>(smu.temperature_hotspot) << '\n';
std::cout << std::dec << "temperature_mem= "
<< static_cast<uint16_t>(smu.temperature_mem) << '\n';
std::cout << std::dec << "temperature_vrgfx= "
<< static_cast<uint16_t>(smu.temperature_vrgfx) << '\n';
std::cout << std::dec << "temperature_vrsoc= "
<< static_cast<uint16_t>(smu.temperature_vrsoc) << '\n';
std::cout << std::dec << "temperature_vrmem= "
<< static_cast<uint16_t>(smu.temperature_vrmem) << '\n';
for (int i = 0; i < AMDSMI_NUM_HBM_INSTANCES; ++i) {
std::cout << "temperature_hbm[" << i << "]= " << std::dec
<< static_cast<uint16_t>(smu.temperature_hbm[i]) << '\n';
}
std::cout << std::dec << "temperature_edge= " << smu.temperature_edge << "\n";
std::cout << std::dec << "temperature_hotspot= " << smu.temperature_hotspot << "\n";
std::cout << std::dec << "temperature_mem= " << smu.temperature_mem << "\n";
std::cout << std::dec << "temperature_vrgfx= " << smu.temperature_vrgfx << "\n";
std::cout << std::dec << "temperature_vrsoc= " << smu.temperature_vrsoc << "\n";
std::cout << std::dec << "temperature_vrmem= " << smu.temperature_vrmem << "\n";
std::cout << "temperature_hbm = [";
std::copy(std::begin(smu.temperature_hbm),
std::end(smu.temperature_hbm),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << "UTILIZATION (%):\n";
std::cout << std::dec << "average_gfx_activity="
<< static_cast<uint16_t>(smu.average_gfx_activity) << '\n';
std::cout << std::dec << "average_umc_activity="
<< static_cast<uint16_t>(smu.average_umc_activity) << '\n';
std::cout << std::dec << "average_mm_activity="
<< static_cast<uint16_t>(smu.average_mm_activity) << '\n';
std::cout << std::dec << "average_gfx_activity=" << smu.average_gfx_activity << "\n";
std::cout << std::dec << "average_umc_activity=" << smu.average_umc_activity << "\n";
std::cout << std::dec << "average_mm_activity=" << smu.average_mm_activity << "\n";
std::cout << std::dec << "vcn_activity= [";
uint16_t size = static_cast<uint16_t>(
sizeof(smu.vcn_activity)/sizeof(smu.vcn_activity[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.vcn_activity[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.vcn_activity[i]);
}
}
std::copy(std::begin(smu.vcn_activity),
std::end(smu.vcn_activity),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << std::dec << "jpeg_activity= [";
size = static_cast<uint16_t>(
sizeof(smu.jpeg_activity)/sizeof(smu.jpeg_activity[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.jpeg_activity[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.jpeg_activity[i]);
}
}
std::copy(std::begin(smu.jpeg_activity),
std::end(smu.jpeg_activity),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << "\n";
std::cout << "POWER (W)/ENERGY (15.259uJ per 1ns):\n";
std::cout << std::dec << "average_socket_power="
<< static_cast<uint16_t>(smu.average_socket_power) << '\n';
std::cout << std::dec << "current_socket_power="
<< static_cast<uint16_t>(smu.current_socket_power) << '\n';
std::cout << std::dec << "energy_accumulator="
<< static_cast<uint16_t>(smu.energy_accumulator) << '\n';
std::cout << std::dec << "average_socket_power=" << smu.average_socket_power << "\n";
std::cout << std::dec << "current_socket_power=" << smu.current_socket_power << "\n";
std::cout << std::dec << "energy_accumulator=" << smu.energy_accumulator << "\n";
std::cout << "\n";
std::cout << "AVG CLOCKS (MHz):\n";
std::cout << std::dec << "average_gfxclk_frequency="
<< static_cast<uint16_t>(smu.average_gfxclk_frequency) << '\n';
std::cout << std::dec << "average_gfxclk_frequency="
<< static_cast<uint16_t>(smu.average_gfxclk_frequency) << '\n';
std::cout << std::dec << "average_uclk_frequency="
<< static_cast<uint16_t>(smu.average_uclk_frequency) << '\n';
std::cout << std::dec << "average_vclk0_frequency="
<< static_cast<uint16_t>(smu.average_vclk0_frequency) << '\n';
std::cout << std::dec << "average_dclk0_frequency="
<< static_cast<uint16_t>(smu.average_dclk0_frequency) << '\n';
std::cout << std::dec << "average_vclk1_frequency="
<< static_cast<uint16_t>(smu.average_vclk1_frequency) << '\n';
std::cout << std::dec << "average_dclk1_frequency="
<< static_cast<uint16_t>(smu.average_dclk1_frequency) << '\n';
std::cout << std::dec << "average_gfxclk_frequency=" << smu.average_gfxclk_frequency
<< "\n";
std::cout << std::dec << "average_socclk_frequency=" << smu.average_socclk_frequency
<< "\n";
std::cout << std::dec << "average_uclk_frequency=" << smu.average_uclk_frequency << "\n";
std::cout << std::dec << "average_vclk0_frequency=" << smu.average_vclk0_frequency
<< "\n";
std::cout << std::dec << "average_dclk0_frequency=" << smu.average_dclk0_frequency
<< "\n";
std::cout << std::dec << "average_vclk1_frequency=" << smu.average_vclk1_frequency
<< "\n";
std::cout << std::dec << "average_dclk1_frequency=" << smu.average_dclk1_frequency
<< "\n";
std::cout << "\n";
std::cout << "CURRENT CLOCKS (MHz):\n";
std::cout << std::dec << "current_gfxclk="
<< smu.current_gfxclk << '\n';
std::cout << std::dec << "current_gfxclk=" << smu.current_gfxclk << "\n";
std::cout << std::dec << "current_gfxclks= [";
size = static_cast<uint16_t>(
sizeof(smu.current_gfxclks)/sizeof(smu.current_gfxclks[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.current_gfxclks[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.current_gfxclks[i]);
}
}
std::copy(std::begin(smu.current_gfxclks),
std::end(smu.current_gfxclks),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_socclk="
<< smu.current_socclk << '\n';
std::cout << std::dec << "current_socclk=" << smu.current_socclk << "\n";
std::cout << std::dec << "current_socclks= [";
size = static_cast<uint16_t>(
sizeof(smu.current_socclks)/sizeof(smu.current_socclks[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.current_socclks[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.current_socclks[i]);
}
}
std::copy(std::begin(smu.current_socclks),
std::end(smu.current_socclks),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_uclk="
<< static_cast<uint16_t>(smu.current_uclk) << '\n';
std::cout << std::dec << "current_vclk0="
<< static_cast<uint16_t>(smu.current_vclk0) << '\n';
std::cout << std::dec << "current_uclk=" << smu.current_uclk << "\n";
std::cout << std::dec << "current_vclk0=" << smu.current_vclk0 << "\n";
std::cout << std::dec << "current_vclk0s= [";
size = static_cast<uint16_t>(
sizeof(smu.current_vclk0s)/sizeof(smu.current_vclk0s[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.current_vclk0s[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.current_vclk0s[i]);
}
}
std::copy(std::begin(smu.current_vclk0s),
std::end(smu.current_vclk0s),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_dclk0="
<< smu.current_dclk0 << '\n';
std::cout << std::dec << "current_dclk0=" << smu.current_dclk0 << "\n";
std::cout << std::dec << "current_dclk0s= [";
size = static_cast<uint16_t>(
sizeof(smu.current_dclk0s)/sizeof(smu.current_dclk0s[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint16_t>(smu.current_dclk0s[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint16_t>(smu.current_dclk0s[i]);
}
}
std::copy(std::begin(smu.current_dclk0s),
std::end(smu.current_dclk0s),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "current_vclk1="
<< static_cast<uint16_t>(smu.current_vclk1) << '\n';
std::cout << std::dec << "current_dclk1="
<< static_cast<uint16_t>(smu.current_dclk1) << '\n';
std::cout << std::dec << "current_vclk1=" << smu.current_vclk1 << "\n";
std::cout << std::dec << "current_dclk1=" << smu.current_dclk1 << "\n";
std::cout << "\n";
std::cout << "THROTTLE STATUS:\n";
std::cout << std::dec << "throttle_status="
<< static_cast<uint32_t>(smu.throttle_status) << '\n';
std::cout << std::dec << "throttle_status=" << smu.throttle_status << "\n";
std::cout << "\n";
std::cout << "FAN SPEED:\n";
std::cout << std::dec << "current_fan_speed="
<< static_cast<uint16_t>(smu.current_fan_speed) << '\n';
std::cout << std::dec << "current_fan_speed=" << smu.current_fan_speed << "\n";
std::cout << "\n";
std::cout << "LINK WIDTH (number of lanes) /SPEED (0.1 GT/s):\n";
std::cout << "pcie_link_width="
<< std::to_string(smu.pcie_link_width) << '\n';
std::cout << "pcie_link_speed="
<< std::to_string(smu.pcie_link_speed) << '\n';
std::cout << "xgmi_link_width="
<< std::to_string(smu.xgmi_link_width) << '\n';
std::cout << "xgmi_link_speed="
<< std::to_string(smu.xgmi_link_speed) << '\n';
std::cout << "pcie_link_width=" << smu.pcie_link_width << "\n";
std::cout << "pcie_link_speed=" << smu.pcie_link_speed << "\n";
std::cout << "xgmi_link_width=" << smu.xgmi_link_width << "\n";
std::cout << "xgmi_link_speed=" << smu.xgmi_link_speed << "\n";
std::cout << "\n";
std::cout << "Utilization Accumulated(%):\n";
std::cout << "gfx_activity_acc="
<< std::dec << static_cast<uint32_t>(smu.gfx_activity_acc) << '\n';
std::cout << "mem_activity_acc="
<< std::dec << static_cast<uint32_t>(smu.mem_activity_acc) << '\n';
std::cout << "gfx_activity_acc=" << std::dec << smu.gfx_activity_acc << "\n";
std::cout << "mem_activity_acc=" << std::dec << smu.mem_activity_acc << "\n";
std::cout << "\n";
std::cout << "XGMI ACCUMULATED DATA TRANSFER SIZE (KB):\n";
std::cout << std::dec << "xgmi_read_data_acc= [";
size = static_cast<uint16_t>(
sizeof(smu.xgmi_read_data_acc)/sizeof(smu.xgmi_read_data_acc[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint64_t>(smu.xgmi_read_data_acc[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint64_t>(smu.xgmi_read_data_acc[i]);
}
}
std::copy(std::begin(smu.xgmi_read_data_acc),
std::end(smu.xgmi_read_data_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
std::cout << std::dec << "xgmi_write_data_acc= [";
size = static_cast<uint16_t>(
sizeof(smu.xgmi_write_data_acc)/sizeof(smu.xgmi_write_data_acc[0]));
for (uint16_t i= 0; i < size; i++) {
if (i+1 < size) {
std::cout << std::dec << static_cast<uint64_t>(smu.xgmi_write_data_acc[i]) << ", ";
} else {
std::cout << std::dec << static_cast<uint64_t>(smu.xgmi_write_data_acc[i]);
}
}
std::copy(std::begin(smu.xgmi_write_data_acc),
std::end(smu.xgmi_write_data_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << std::dec << "]\n";
// Voltage (mV)
std::cout << "voltage_soc = "
<< std::dec << static_cast<uint16_t>(smu.voltage_soc) << "\n";
std::cout << "voltage_soc = "
<< std::dec << static_cast<uint16_t>(smu.voltage_gfx) << "\n";
std::cout << "voltage_mem = "
<< std::dec << static_cast<uint16_t>(smu.voltage_mem) << "\n";
std::cout << "voltage_soc = " << std::dec << smu.voltage_soc << "\n";
std::cout << "voltage_gfx = " << std::dec << smu.voltage_gfx << "\n";
std::cout << "voltage_mem = " << std::dec << smu.voltage_mem << "\n";
std::cout << "indep_throttle_status = "
<< std::dec << static_cast<uint64_t>(smu.indep_throttle_status) << "\n";
std::cout << "indep_throttle_status = " << std::dec << smu.indep_throttle_status << "\n";
// Clock Lock Status. Each bit corresponds to clock instance
std::cout << "gfxclk_lock_status (in hex) = "
<< std::hex << static_cast<uint32_t>(smu.gfxclk_lock_status) << std::dec <<"\n";
std::cout << "gfxclk_lock_status (in hex) = " << std::hex
<< smu.gfxclk_lock_status << std::dec <<"\n";
// Bandwidth (GB/sec)
std::cout << "pcie_bandwidth_acc=" << std::dec
<< static_cast<uint64_t>(smu.pcie_bandwidth_acc) << "\n";
std::cout << "pcie_bandwidth_inst=" << std::dec
<< static_cast<uint64_t>(smu.pcie_bandwidth_inst) << "\n";
std::cout << "pcie_bandwidth_acc=" << std::dec << smu.pcie_bandwidth_acc << "\n";
std::cout << "pcie_bandwidth_inst=" << std::dec << smu.pcie_bandwidth_inst << "\n";
// Counts
std::cout << "pcie_l0_to_recov_count_acc= " << std::dec
<< static_cast<uint64_t>(smu.pcie_l0_to_recov_count_acc) << "\n";
std::cout << "pcie_replay_count_acc= " << std::dec
<< static_cast<uint64_t>(smu.pcie_replay_count_acc) << "\n";
std::cout << "pcie_l0_to_recov_count_acc= " << std::dec << smu.pcie_l0_to_recov_count_acc
<< "\n";
std::cout << "pcie_replay_count_acc= " << std::dec << smu.pcie_replay_count_acc << "\n";
std::cout << "pcie_replay_rover_count_acc= " << std::dec
<< static_cast<uint64_t>(smu.pcie_replay_rover_count_acc) << "\n";
std::cout << "pcie_nak_rcvd_count_acc= " << std::dec
<< static_cast<uint32_t>(smu.pcie_nak_rcvd_count_acc) << "\n";
std::cout << "pcie_replay_rover_count_acc= " << std::dec
<< static_cast<uint64_t>(smu.pcie_replay_rover_count_acc) << "\n";
<< smu.pcie_replay_rover_count_acc << "\n";
std::cout << "pcie_nak_sent_count_acc= " << std::dec << smu.pcie_nak_sent_count_acc
<< "\n";
std::cout << "pcie_nak_rcvd_count_acc= " << std::dec << smu.pcie_nak_rcvd_count_acc
<< "\n";
// Check for constant changes/refresh metrics
// Accumulation cycle counter
// Accumulated throttler residencies
std::cout << "\n";
std::cout << "RESIDENCY ACCUMULATION / COUNTER:\n";
std::cout << "accumulation_counter = " << std::dec << smu.accumulation_counter << "\n";
std::cout << "prochot_residency_acc = " << std::dec << smu.prochot_residency_acc << "\n";
std::cout << "ppt_residency_acc = " << std::dec << smu.ppt_residency_acc << "\n";
std::cout << "socket_thm_residency_acc = " << std::dec << smu.socket_thm_residency_acc
<< "\n";
std::cout << "vr_thm_residency_acc = " << std::dec << smu.vr_thm_residency_acc
<< "\n";
std::cout << "hbm_thm_residency_acc = " << std::dec << smu.hbm_thm_residency_acc << "\n";
// Number of current partitions
std::cout << "num_partition = " << std::dec << smu.num_partition << "\n";
// PCIE other end recovery counter
std::cout << "pcie_lc_perf_other_end_recovery = "
<< std::dec << smu.pcie_lc_perf_other_end_recovery << "\n";
std::cout << std::dec << "xcp_stats.gfx_busy_inst = \n";
auto xcp = 0;
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_inst),
std::end(row.gfx_busy_inst),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.jpeg_busy = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.jpeg_busy),
std::end(row.jpeg_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.vcn_busy = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.vcn_busy),
std::end(row.vcn_busy),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
xcp = 0;
std::cout << std::dec << "xcp_stats.gfx_busy_acc = \n";
for (auto& row : smu.xcp_stats) {
std::cout << "XCP[" << xcp << "] = " << "[ ";
std::copy(std::begin(row.gfx_busy_acc),
std::end(row.gfx_busy_acc),
amd::smi::make_ostream_joiner(&std::cout, ", "));
std::cout << " ]\n";
xcp++;
}
std::cout << "\n\n";
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
constexpr uint16_t kMAX_ITER_TEST = 10;
amdsmi_gpu_metrics_t gpu_metrics_check;
amdsmi_gpu_metrics_t gpu_metrics_check = {};
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n";
amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_metrics_check.firmware_timestamp << "\n";
}
std::cout << "\n";
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n";
amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: "
<< gpu_metrics_check.system_clock_counter << "\n";
}
std::cout << "\n";
std::cout << " ** Note: Values MAX'ed out (UINTX MAX) "
<< "are unsupported for the version in question ** " << "\n\n";
}
}
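The rewritten test replaces the hand-written "print a comma unless this is the last element" loops with `std::copy` into `amd::smi::make_ostream_joiner(&std::cout, ", ")`. The helper's implementation is not shown in this diff, so the sketch below is only an assumption of how such a joining output iterator can work, in the spirit of `std::experimental::ostream_joiner`. `JoinerIterator` and `join_to_string` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical output iterator that writes a delimiter between elements,
// but not before the first or after the last one.
class JoinerIterator {
 public:
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  JoinerIterator(std::ostream* out, std::string delimiter)
      : out_(out), delimiter_(std::move(delimiter)) {}

  // Assignment through the iterator performs the actual write.
  template <typename T>
  JoinerIterator& operator=(const T& value) {
    if (!first_) {
      *out_ << delimiter_;
    }
    first_ = false;
    *out_ << value;
    return *this;
  }

  // Dereference and increment are no-ops, as with std::ostream_iterator.
  JoinerIterator& operator*() { return *this; }
  JoinerIterator& operator++() { return *this; }
  JoinerIterator& operator++(int) { return *this; }

 private:
  std::ostream* out_;
  std::string delimiter_;
  bool first_ = true;
};

// Joins any range usable with std::begin/std::end into a single string.
template <typename Range>
std::string join_to_string(const Range& values, const std::string& delimiter) {
  std::ostringstream out;
  std::copy(std::begin(values), std::end(values),
            JoinerIterator(&out, delimiter));
  return out.str();
}
```

One caveat with dropping the `static_cast<uint16_t>` calls: any element of type `uint8_t` would be streamed as a character rather than as a number, so this approach fits the 16-, 32-, and 64-bit metric arrays but would need a cast for 8-bit fields.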
@@ -377,5 +383,13 @@ void TestGpuMetricsRead::Run(void) {
amdsmi_status_code_to_string(err, &status_string);
std::cout << "\t\t** amdsmi_get_gpu_metrics_info(nullptr check): " << status_string << "\n";
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
// TODO(AMD_SMI_team): add xcd_counter_get for amd smi
// auto temp_xcd_counter_value = uint16_t(0);
// err = rsmi_dev_metrics_xcd_counter_get(i, &temp_xcd_counter_value);
// if (err != RSMI_STATUS_NOT_SUPPORTED) {
// CHK_ERR_ASRT(err);
// }
}
}
@@ -456,4 +456,58 @@ void TestHWTopologyRead::Run(void) {
std::cout << std::endl;
}
std::cout << std::endl;
const char *topology_link_type_str[] = {
"AMDSMI_LINK_TYPE_INTERNAL",
"AMDSMI_LINK_TYPE_XGMI",
"AMDSMI_LINK_TYPE_PCIE",
"AMDSMI_LINK_TYPE_NOT_APPLICABLE",
"AMDSMI_LINK_TYPE_UNKNOWN",
};
auto ret(amdsmi_status_t::AMDSMI_STATUS_SUCCESS);
for (uint32_t dv_ind_src = 0; dv_ind_src < num_devices; dv_ind_src++) {
std::cout <<"** Nearest GPUs for GPU" << dv_ind_src << " **" << "\n";
for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) {
/*
* Note: We should get AMDSMI_STATUS_INVAL for the first call with amdsmi_topology_nearest_t = nullptr
*/
ret = amdsmi_get_link_topology_nearest(processor_handles_[dv_ind_src],
static_cast<amdsmi_link_type_t>(topo_link_type),
nullptr);
ASSERT_EQ(ret, amdsmi_status_t::AMDSMI_STATUS_INVAL);
/*
* Note: With a valid amdsmi_topology_nearest_t, the call should fill in the nearest GPU list.
*/
auto topology_nearest_info = amdsmi_topology_nearest_t();
ret = amdsmi_get_link_topology_nearest(processor_handles_[dv_ind_src],
static_cast<amdsmi_link_type_t>(topo_link_type),
&topology_nearest_info);
if (ret != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
continue;
}
std::cout <<"Nearest GPUs found for Link Type: " << topology_link_type_str[topo_link_type] << "\n";
if (topology_nearest_info.count > 0) {
for (uint32_t k = 0; k < topology_nearest_info.count; k++) {
amdsmi_bdf_t bdf = {};
ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf);
if (ret != AMDSMI_STATUS_SUCCESS) {
continue;
}
printf("\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number,
bdf.bus_number, bdf.device_number, bdf.function_number);
}
}
else {
std::cout << "\tNot found" << "\n";
}
}
std::cout << "\n";
}
}
@@ -183,22 +183,26 @@ void TestSysInfoRead::Run(void) {
}
}
// kfd_id, node_id
// kfd_id, node_id, current_partition_id
amdsmi_kfd_info_t kfd_info = {};
err = amdsmi_get_gpu_kfd_info(processor_handles_[i], &kfd_info);
if (err != AMDSMI_STATUS_SUCCESS) {
EXPECT_EQ(kfd_info.kfd_id, std::numeric_limits<uint64_t>::max());
EXPECT_EQ(kfd_info.node_id, std::numeric_limits<uint32_t>::max());
EXPECT_EQ(kfd_info.current_partition_id, std::numeric_limits<uint32_t>::max());
} else {
IF_VERB(STANDARD) {
std::cout << "\t**KFD ID: " << std::dec
<< kfd_info.kfd_id << "\n";
std::cout << "\t**Node ID: " << std::dec
<< kfd_info.node_id << "\n";
std::cout << "\t**Current Partition ID: " << std::dec
<< kfd_info.current_partition_id << "\n";
}
EXPECT_EQ(err, AMDSMI_STATUS_SUCCESS);
EXPECT_NE(kfd_info.kfd_id, std::numeric_limits<uint64_t>::max());
EXPECT_NE(kfd_info.node_id, std::numeric_limits<uint32_t>::max());
EXPECT_NE(kfd_info.current_partition_id, std::numeric_limits<uint32_t>::max());
}
// Verify api support checking functionality is working
err = amdsmi_get_gpu_kfd_info(processor_handles_[i], nullptr);
File diff suppressed because it is too large
+559 -114
@@ -32,6 +32,9 @@ import threading
import multiprocessing
from datetime import datetime
# Note: amdsmi_status_code_to_string is not tested due to the nature and functionality of the AMDSMI Python wrapper.
# The function is to be tested in the future after the wrapper is updated to return status codes after API calls.
def handle_exceptions(func):
"""Exposes, silences, and logs AMD SMI exceptions to users what exception was raised.
@@ -46,15 +49,19 @@ def handle_exceptions(func):
return func(*args, **kwargs)
except amdsmi.AmdSmiRetryException as e:
print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiRetryException: {}".format(e))
amdsmi.amdsmi_shut_down()
pass
except amdsmi.AmdSmiTimeoutException as e:
print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiTimeoutException: {}".format(e))
amdsmi.amdsmi_shut_down()
pass
except amdsmi.AmdSmiLibraryException as e:
print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiLibraryException: {}".format(e))
amdsmi.amdsmi_shut_down()
pass
except Exception as e:
print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught unknown exception: {}".format(e))
amdsmi.amdsmi_shut_down()
pass
return wrapper
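The decorator in this hunk logs the exception, shuts the library down, and lets the test continue. The sketch below shows the same log-cleanup-swallow pattern in a self-contained form. It is an illustration under assumptions, not the file's implementation: `FakeSmiError` stands in for the AMD SMI exception classes, the `cleanup` callback stands in for `amdsmi.amdsmi_shut_down()`, and the several `except` clauses are collapsed into one.

```python
import functools


class FakeSmiError(Exception):
    """Hypothetical stand-in for an AMD SMI exception type."""


def handle_exceptions_with(cleanup):
    """Build a decorator that logs any exception, runs cleanup, and returns None."""
    def decorator(func):
        @functools.wraps(func)  # keep the test's name for the log line and for unittest
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                print("**** [ERROR] | Test: {} | Caught {}: {}".format(
                    func.__name__, type(e).__name__, e))
                cleanup()
        return wrapper
    return decorator


calls = []


@handle_exceptions_with(cleanup=lambda: calls.append("shut_down"))
def failing_test():
    raise FakeSmiError("not supported on this ASIC")
```

Calling `failing_test()` prints one error line and returns `None` instead of raising, which is why a decorated test is reported as passing even on hardware that lacks the feature.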
@@ -68,13 +75,54 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
@handle_exceptions
def setUp(self):
amdsmi.amdsmi_init()
@handle_exceptions
def tearDown(self):
amdsmi.amdsmi_shut_down()
# Bad page is not supported in Navi21 and Navi31
def test_asic_kfd_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_asic_info \n")
asic_info = amdsmi.amdsmi_get_gpu_asic_info(processors[i])
print(" asic_info['market_name'] is: {}".format(
asic_info['market_name']))
print(" asic_info['vendor_id'] is: {}".format(
asic_info['vendor_id']))
print(" asic_info['vendor_name'] is: {}".format(
asic_info['vendor_name']))
print(" asic_info['device_id'] is: {}".format(
asic_info['device_id']))
print(" asic_info['rev_id'] is: {}".format(
asic_info['rev_id']))
print(" asic_info['asic_serial'] is: {}".format(
asic_info['asic_serial']))
print(" asic_info['oam_id'] is: {}".format(
asic_info['oam_id']))
print(" asic_info['target_graphics_version'] is: {}".format(
asic_info['target_graphics_version']))
print(" asic_info['num_compute_units'] is: {}".format(
asic_info['num_compute_units']))
print("\n###Test amdsmi_get_gpu_kfd_info \n")
kfd_info = amdsmi.amdsmi_get_gpu_kfd_info(processors[i])
print(" kfd_info['kfd_id'] is: {}".format(
kfd_info['kfd_id']))
print(" kfd_info['node_id'] is: {}".format(
kfd_info['node_id']))
print(" kfd_info['current_partition_id'] is: {}\n".format(
kfd_info['current_partition_id']))
print()
self.tearDown()
# amdsmi_get_gpu_bad_page_info is not supported in Navi2x, Navi3x
@handle_exceptions
def test_bad_page_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -96,14 +144,17 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print()
j += 1
print()
self.tearDown()
def test_bdf_device_id(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_processor_handle_from_bdf \n")
processor = amdsmi.amdsmi_get_processor_handle_from_bdf(bdf)
print("\n###Test amdsmi_get_gpu_vbios_info \n")
vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processor)
@@ -119,49 +170,83 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
uuid = amdsmi.amdsmi_get_gpu_device_uuid(processor)
print(" uuid is: {}".format(uuid))
print()
self.tearDown()
def test_ecc(self):
def test_board_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_total_ecc_count \n")
ecc_info = amdsmi.amdsmi_get_gpu_total_ecc_count(processors[i])
print("Number of uncorrectable errors: {}".format(
ecc_info['uncorrectable_count']))
print("Number of correctable errors: {}".format(
ecc_info['correctable_count']))
print("Number of deferred errors: {}".format(
ecc_info['deferred_count']))
self.assertGreaterEqual(ecc_info['uncorrectable_count'], 0)
self.assertGreaterEqual(ecc_info['correctable_count'], 0)
self.assertGreaterEqual(ecc_info['deferred_count'], 0)
print("\n###Test amdsmi_get_gpu_board_info \n")
board_info = amdsmi.amdsmi_get_gpu_board_info(processors[i])
print(" board_info['model_number'] is: {}".format(
board_info['model_number']))
print(" board_info['product_serial'] is: {}".format(
board_info['product_serial']))
print(" board_info['fru_id'] is: {}".format(
board_info['fru_id']))
print(" board_info['manufacturer_name'] is: {}".format(
board_info['manufacturer_name']))
print(" board_info['product_name'] is: {}".format(
board_info['product_name']))
print()
self.tearDown()
# RAS is not supported in Navi21 and Navi31
def test_clock_frequency(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_clk_freq \n")
clock_frequency = amdsmi.amdsmi_get_clk_freq(
processors[i], amdsmi.AmdSmiClkType.SYS)
print(" SYS clock_frequency['num_supported']: {}".format(
clock_frequency['num_supported']))
print(" SYS clock_frequency['current']: {}".format(
clock_frequency['current']))
print(" SYS clock_frequency['frequency']: {}".format(
clock_frequency['frequency']))
clock_frequency = amdsmi.amdsmi_get_clk_freq(
processors[i], amdsmi.AmdSmiClkType.DF)
print(" DF clock_frequency['num_supported']: {}".format(
clock_frequency['num_supported']))
print(" DF clock_frequency['current']: {}".format(
clock_frequency['current']))
print(" DF clock_frequency['frequency']: {}".format(
clock_frequency['frequency']))
print()
self.tearDown()
# amdsmi_get_clk_freq with AmdSmiClkType.DCEF is not supported in MI210, MI300A
@handle_exceptions
def test_ras(self):
def test_clock_frequency_DCEF(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_ras_feature_info \n")
ras_feature = amdsmi.amdsmi_get_gpu_ras_feature_info(processors[i])
print("ras_feature: " + str(ras_feature))
if ras_feature != None:
print("ras_feature: " + str(ras_feature))
print("RAS eeprom version: {}".format(ras_feature['eeprom_version']))
print("RAS parity schema: {}".format(ras_feature['parity_schema']))
print("RAS single bit schema: {}".format(ras_feature['single_bit_schema']))
print("RAS double bit schema: {}".format(ras_feature['double_bit_schema']))
print("Poisioning supported: {}".format(ras_feature['poison_schema']))
print("\n###Test amdsmi_get_clk_freq \n")
clock_frequency = amdsmi.amdsmi_get_clk_freq(
processors[i], amdsmi.AmdSmiClkType.DCEF)
print(" DCEF clock_frequency['num_supported']: {}".format(
clock_frequency['num_supported']))
print(" DCEF clock_frequency['current']: {}".format(
clock_frequency['current']))
print(" DCEF clock_frequency['frequency']: {}".format(
clock_frequency['frequency']))
print()
self.tearDown()
def test_clock_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -192,10 +277,12 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print(" Is MEM clock in deep sleep: {}".format(
clock_measure['clk_deep_sleep']))
print()
self.tearDown()
# VCLK0 and DCLK0 are not supported in MI210
# AmdSmiClkType.VCLK0 and DCLK0 are not supported in MI210
@handle_exceptions
def test_gpu_clock_vclk0_dclk0(self):
def test_clock_info_vclk0_dclk0(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -224,10 +311,12 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print(" Is DCLK0 clock in deep sleep: {}".format(
clock_measure['clk_deep_sleep']))
print()
self.tearDown()
# VCLK1 and DCLK1 are not supported in Navi 31, MI210, and MI300
# AmdSmiClkType.VCLK1 and DCLK1 are not supported in MI210, MI300A, MI300X
@handle_exceptions
def test_gpu_clock_vclk1_dclk1(self):
def test_clock_info_vclk1_dclk1(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -256,8 +345,118 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print(" Is DCLK1 clock in deep sleep: {}".format(
clock_measure['clk_deep_sleep']))
print()
self.tearDown()
def test_driver_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_driver_info \n")
driver_info = amdsmi.amdsmi_get_gpu_driver_info(processors[i])
print("Driver info: {}".format(driver_info))
print()
self.tearDown()
# amdsmi_get_gpu_ecc_count is not supported in Navi2x, Navi3x, MI210, MI300A
@handle_exceptions
def test_ecc_count_block(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
gpu_blocks = {
"INVALID": amdsmi.AmdSmiGpuBlock.INVALID,
"UMC": amdsmi.AmdSmiGpuBlock.UMC,
"SDMA": amdsmi.AmdSmiGpuBlock.SDMA,
"GFX": amdsmi.AmdSmiGpuBlock.GFX,
"MMHUB": amdsmi.AmdSmiGpuBlock.MMHUB,
"ATHUB": amdsmi.AmdSmiGpuBlock.ATHUB,
"PCIE_BIF": amdsmi.AmdSmiGpuBlock.PCIE_BIF,
"HDP": amdsmi.AmdSmiGpuBlock.HDP,
"XGMI_WAFL": amdsmi.AmdSmiGpuBlock.XGMI_WAFL,
"DF": amdsmi.AmdSmiGpuBlock.DF,
"SMN": amdsmi.AmdSmiGpuBlock.SMN,
"SEM": amdsmi.AmdSmiGpuBlock.SEM,
"MP0": amdsmi.AmdSmiGpuBlock.MP0,
"MP1": amdsmi.AmdSmiGpuBlock.MP1,
"FUSE": amdsmi.AmdSmiGpuBlock.FUSE,
"MCA": amdsmi.AmdSmiGpuBlock.MCA,
"VCN": amdsmi.AmdSmiGpuBlock.VCN,
"JPEG": amdsmi.AmdSmiGpuBlock.JPEG,
"IH": amdsmi.AmdSmiGpuBlock.IH,
"MPIO": amdsmi.AmdSmiGpuBlock.MPIO,
"RESERVED": amdsmi.AmdSmiGpuBlock.RESERVED
}
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_ecc_count \n")
for block_name, block_code in gpu_blocks.items():
ecc_count = amdsmi.amdsmi_get_gpu_ecc_count(
processors[i], block_code)
print(" Number of uncorrectable errors for {}: {}".format(
block_name, ecc_count['uncorrectable_count']))
print(" Number of correctable errors for {}: {}".format(
block_name, ecc_count['correctable_count']))
print(" Number of deferred errors for {}: {}".format(
block_name, ecc_count['deferred_count']))
self.assertGreaterEqual(ecc_count['uncorrectable_count'], 0)
self.assertGreaterEqual(ecc_count['correctable_count'], 0)
self.assertGreaterEqual(ecc_count['deferred_count'], 0)
print()
print()
self.tearDown()
def test_ecc_count_total(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_total_ecc_count \n")
ecc_info = amdsmi.amdsmi_get_gpu_total_ecc_count(processors[i])
print("Number of uncorrectable errors: {}".format(
ecc_info['uncorrectable_count']))
print("Number of correctable errors: {}".format(
ecc_info['correctable_count']))
print("Number of deferred errors: {}".format(
ecc_info['deferred_count']))
self.assertGreaterEqual(ecc_info['uncorrectable_count'], 0)
self.assertGreaterEqual(ecc_info['correctable_count'], 0)
self.assertGreaterEqual(ecc_info['deferred_count'], 0)
print()
self.tearDown()
def test_fw_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_fw_info \n")
fw_info = amdsmi.amdsmi_get_fw_info(processors[i])
fw_num = len(fw_info['fw_list'])
self.assertLessEqual(fw_num, len(amdsmi.AmdSmiFwBlock))
for j in range(0, fw_num):
fw = fw_info['fw_list'][j]
if fw['fw_version'] != 0:
print(" FW name: {}".format(
fw['fw_name'].name))
print(" FW version: {}".format(
fw['fw_version']))
print()
self.tearDown()
def test_gpu_activity(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
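The `gpu_blocks` table in `test_ecc_count_block` above repeats every enum member by hand. A minimal sketch of deriving such a table from the enum itself follows, assuming `AmdSmiGpuBlock` behaves like a standard Python `Enum`. `FakeGpuBlock` and its values are hypothetical stand-ins so that the example runs without the library.

```python
from enum import IntEnum


class FakeGpuBlock(IntEnum):
    """Hypothetical stand-in for amdsmi.AmdSmiGpuBlock; values are illustrative."""
    INVALID = 0
    UMC = 1
    SDMA = 2
    GFX = 4


# __members__ maps each member name to the member, in definition order.
gpu_blocks = dict(FakeGpuBlock.__members__)

if __name__ == "__main__":
    for block_name, block_code in gpu_blocks.items():
        print("{}: {}".format(block_name, int(block_code)))
```

A hand-written table has the advantage of making the tested blocks explicit in the diff, so the derived form is a trade-off rather than a strict improvement.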
@@ -273,8 +472,31 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print(" engine_usage['mm_activity'] is: {} %".format(
engine_usage['mm_activity']))
print()
self.tearDown()
def test_pcie(self):
def test_memory_usage(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_memory_usage \n")
memory_usage = amdsmi.amdsmi_get_gpu_memory_usage(
processors[i], amdsmi.AmdSmiMemoryType.VRAM)
print(" memory_usage for VRAM is: {}".format(memory_usage))
memory_usage = amdsmi.amdsmi_get_gpu_memory_usage(
processors[i], amdsmi.AmdSmiMemoryType.VIS_VRAM)
print(" memory_usage for VIS_VRAM is: {}".format(memory_usage))
memory_usage = amdsmi.amdsmi_get_gpu_memory_usage(
processors[i], amdsmi.AmdSmiMemoryType.GTT)
print(" memory_usage for GTT is: {}".format(memory_usage))
print()
self.tearDown()
def test_pcie_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -307,9 +529,13 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
pcie_info['pcie_metric']['pcie_nak_sent_count']))
print(" pcie_info['pcie_metric']['pcie_nak_received_count'] is: {}".format(
pcie_info['pcie_metric']['pcie_nak_received_count']))
print(" pcie_info['pcie_metric']['pcie_lc_perf_other_end_recovery_count'] is: {}".format(
pcie_info['pcie_metric']['pcie_lc_perf_other_end_recovery_count']))
print()
self.tearDown()
def test_power(self):
def test_power_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -330,13 +556,99 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
power_info['mem_voltage']))
print(" power_info['power_limit'] is: {}".format(
power_info['power_limit']))
print("\n###Test amdsmi_get_power_cap_info \n")
power_cap_info = amdsmi.amdsmi_get_power_cap_info(processors[i])
print(" power_cap_info['dpm_cap'] is: {}".format(
power_cap_info['dpm_cap']))
print(" power_cap_info['power_cap'] is: {}".format(
power_cap_info['power_cap']))
print("\n###Test amdsmi_is_gpu_power_management_enabled \n")
is_power_management_enabled = amdsmi.amdsmi_is_gpu_power_management_enabled(processors[i])
print(" Is power management enabled is: {}".format(
print(" Power management enabled: {}".format(
is_power_management_enabled))
print()
self.tearDown()
def test_temperature(self):
def test_process_list(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_process_list \n")
process_list = amdsmi.amdsmi_get_gpu_process_list(processors[i])
print(" Process list: {}".format(process_list))
print()
self.tearDown()
def test_processor_type(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_processor_type \n")
processor_type = amdsmi.amdsmi_get_processor_type(processors[i])
print(" Processor type is: {}".format(processor_type['processor_type']))
print()
self.tearDown()
# amdsmi_get_gpu_ras_block_features_enabled is not supported in Navi2x, Navi3x
@handle_exceptions
def test_ras_block_features_enabled(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_ras_block_features_enabled \n")
ras_enabled = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(processors[i])
for j in range(0, len(ras_enabled)):
print(" RAS status for {} is: {}".format(ras_enabled[j]['block'], ras_enabled[j]['status']))
print()
self.tearDown()
# amdsmi_get_gpu_ras_feature_info is not supported in Navi2x, Navi3x
@handle_exceptions
def test_ras_feature_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_ras_feature_info \n")
ras_feature = amdsmi.amdsmi_get_gpu_ras_feature_info(processors[i])
if ras_feature is not None:
print("RAS eeprom version: {}".format(ras_feature['eeprom_version']))
print("RAS parity schema: {}".format(ras_feature['parity_schema']))
print("RAS single bit schema: {}".format(ras_feature['single_bit_schema']))
print("RAS double bit schema: {}".format(ras_feature['double_bit_schema']))
print("Poisoning supported: {}".format(ras_feature['poison_schema']))
print()
self.tearDown()
def test_socket_info(self):
self.setUp()
print("\n\n###Test amdsmi_get_socket_handles")
sockets = amdsmi.amdsmi_get_socket_handles()
for i in range(0, len(sockets)):
print("\n\n###Test Socket {}".format(i))
print("\n###Test amdsmi_get_socket_info \n")
socket_name = amdsmi.amdsmi_get_socket_info(sockets[i])
print(" Socket: {}".format(socket_name))
print()
self.tearDown()
def test_temperature_metric(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -371,10 +683,12 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print(" Shutdown (emergency) temperature for VRAM is: {}".format(
temperature_measure))
print()
self.tearDown()
# Edge temperature is not supported in MI300
# AmdSmiTemperatureType.EDGE is not supported in MI300A, MI300X
@handle_exceptions
def test_temperature_edge(self):
def test_temperature_metric_edge(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
@@ -383,21 +697,227 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_temp_metric \n")
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CURRENT) # current
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CURRENT)
print(" Current temperature for EDGE is: {}".format(
temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CRITICAL) # slowdown/limit
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CRITICAL)
print(" Limit (critical) temperature for EDGE is: {}".format(
temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.EMERGENCY) # shutdown
processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.EMERGENCY)
print(" Shutdown (emergency) temperature for EDGE is: {}".format(
temperature_measure))
print()
self.tearDown()
def test_temperature_metric_plx(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_temp_metric \n")
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.CURRENT)
print(" Current temperature for PLX is: {}".format(
temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.CRITICAL)
print(" Limit (critical) temperature for PLX is: {}".format(
temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.EMERGENCY)
print(" Shutdown (emergency) temperature for PLX is: {}".format(
temperature_measure))
print()
self.tearDown()
# AmdSmiTemperatureType.HBM_0, HBM_1, HBM_2, HBM_3 are not supported in Navi2x, Navi3x, MI210, MI300A
@handle_exceptions
def test_temperature_metric_hbm(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
temp_types = {
"HBM_0": amdsmi.AmdSmiTemperatureType.HBM_0,
"HBM_1": amdsmi.AmdSmiTemperatureType.HBM_1,
"HBM_2": amdsmi.AmdSmiTemperatureType.HBM_2,
"HBM_3": amdsmi.AmdSmiTemperatureType.HBM_3,
}
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_temp_metric \n")
for temp_type_name, temp_type_code in temp_types.items():
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], temp_type_code, amdsmi.AmdSmiTemperatureMetric.CURRENT)
print(" Current temperature for {} is: {}".format(
temp_type_name, temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], temp_type_code, amdsmi.AmdSmiTemperatureMetric.CRITICAL)
print(" Limit (critical) temperature for {} is: {}".format(
temp_type_name, temperature_measure))
temperature_measure = amdsmi.amdsmi_get_temp_metric(
processors[i], temp_type_code, amdsmi.AmdSmiTemperatureMetric.EMERGENCY)
print(" Shutdown (emergency) temperature for {} is: {}".format(
temp_type_name, temperature_measure))
print()
self.tearDown()
def test_utilization_count(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_utilization_count \n")
utilization_counter_types = [
amdsmi.AmdSmiUtilizationCounterType.COARSE_GRAIN_GFX_ACTIVITY,
amdsmi.AmdSmiUtilizationCounterType.COARSE_GRAIN_MEM_ACTIVITY,
amdsmi.AmdSmiUtilizationCounterType.COARSE_DECODER_ACTIVITY
]
utilization_count = amdsmi.amdsmi_get_utilization_count(
processors[i], utilization_counter_types)
print(" Timestamp: {}".format(
utilization_count[0]['timestamp']))
print(" Utilization count for {} is: {}".format(
utilization_count[1]['type'], utilization_count[1]['value']))
print(" Utilization count for {} is: {}".format(
utilization_count[2]['type'], utilization_count[2]['value']))
print(" Utilization count for {} is: {}".format(
utilization_count[3]['type'], utilization_count[3]['value']))
self.assertLessEqual(len(processors), 32)
print()
utilization_counter_types = [
amdsmi.AmdSmiUtilizationCounterType.FINE_GRAIN_GFX_ACTIVITY,
amdsmi.AmdSmiUtilizationCounterType.FINE_GRAIN_MEM_ACTIVITY,
amdsmi.AmdSmiUtilizationCounterType.FINE_DECODER_ACTIVITY
]
utilization_count = amdsmi.amdsmi_get_utilization_count(
processors[i], utilization_counter_types)
print(" Timestamp: {}".format(
utilization_count[0]['timestamp']))
print(" Utilization count for {} is: {}".format(
utilization_count[1]['type'], utilization_count[1]['value']))
print(" Utilization count for {} is: {}".format(
utilization_count[2]['type'], utilization_count[2]['value']))
print(" Utilization count for {} is: {}".format(
utilization_count[3]['type'], utilization_count[3]['value']))
print()
self.tearDown()
def test_vbios_info(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_vbios_info \n")
vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processors[i])
print(" vbios_info['part_number'] is: {}".format(
vbios_info['part_number']))
print(" vbios_info['build_date'] is: {}".format(
vbios_info['build_date']))
print(" vbios_info['name'] is: {}".format(
vbios_info['name']))
print(" vbios_info['version'] is: {}".format(
vbios_info['version']))
print()
self.tearDown()
def test_vendor_name(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_vendor_name \n")
vendor_name = amdsmi.amdsmi_get_gpu_vendor_name(processors[i])
print(" Vendor name is: {}".format(vendor_name))
print()
self.tearDown()
# @unittest.SkipTest
@handle_exceptions
def test_accelerator_partition_profile(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_gpu_accelerator_partition_profile \n")
accelerator_partition = amdsmi.amdsmi_get_gpu_accelerator_partition_profile(processors[i])
print(" Current partition id: {}".format(
accelerator_partition['partition_id']))
print()
self.tearDown()
# Only supported on MI300+ ASICs
@handle_exceptions
def test_get_violation_status(self):
self.setUp()
processors = amdsmi.amdsmi_get_processor_handles()
self.assertGreaterEqual(len(processors), 1)
self.assertLessEqual(len(processors), 32)
for i in range(0, len(processors)):
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("\n\n###Test Processor {}, bdf: {}".format(i, bdf))
print("\n###Test amdsmi_get_violation_status \n")
violation_status = amdsmi.amdsmi_get_violation_status(processors[i])
print(" Reference Timestamp: {}".format(
violation_status['reference_timestamp']))
print(" Violation Timestamp: {}".format(
violation_status['violation_timestamp']))
print(" Prochot Thrm Violation (%): {}".format(
violation_status['per_prochot_thrm']))
print(" PVIOL (per_ppt_pwr) (%): {}".format(
violation_status['per_ppt_pwr']))
print(" TVIOL (per_socket_thrm) (%): {}".format(
violation_status['per_socket_thrm']))
print(" VR_THRM Violation (%): {}".format(
violation_status['per_vr_thrm']))
print(" HBM Thrm Violation (%): {}".format(
violation_status['per_hbm_thrm']))
print(" Prochot Thrm Violation (bool): {}".format(
violation_status['active_prochot_thrm']))
print(" PVIOL (active_ppt_pwr) (bool): {}".format(
violation_status['active_ppt_pwr']))
print(" TVIOL (active_socket_thrm) (bool): {}".format(
violation_status['active_socket_thrm']))
print(" VR_THRM Violation (bool): {}".format(
violation_status['active_vr_thrm']))
print(" HBM Thrm Violation (bool): {}".format(
violation_status['active_hbm_thrm']))
print()
self.tearDown()
def test_walkthrough(self):
walk_through(self)
print("\n\n#######################################################################")
print("========> test_walkthrough start <========\n")
self.test_asic_kfd_info()
self.test_power_info()
self.test_vbios_info()
self.test_board_info()
self.test_fw_info()
self.test_driver_info()
print("\n========> test_walkthrough end <========")
print("#######################################################################\n")
# Unstable on workstation cards
# @handle_exceptions
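`test_utilization_count` in this hunk indexes the returned list positionally: element 0 carries the timestamp and the following elements carry one `type`/`value` pair each. A small sketch of unpacking that shape by name follows. The `split_utilization_count` helper is hypothetical and the sample data is made up to match the indexing used in the test.

```python
def split_utilization_count(utilization_count):
    """Split the list into (timestamp, {counter_type: value})."""
    timestamp = utilization_count[0]["timestamp"]
    counters = {entry["type"]: entry["value"] for entry in utilization_count[1:]}
    return timestamp, counters


# Fabricated sample shaped like the list indexed in the test; not real readings.
sample = [
    {"timestamp": 1000},
    {"type": "COARSE_GRAIN_GFX_ACTIVITY", "value": 10},
    {"type": "COARSE_GRAIN_MEM_ACTIVITY", "value": 20},
    {"type": "COARSE_DECODER_ACTIVITY", "value": 30},
]

if __name__ == "__main__":
    ts, counters = split_utilization_count(sample)
    print("Timestamp: {}".format(ts))
    for name, value in counters.items():
        print("Utilization count for {} is: {}".format(name, value))
```

Looking counters up by type avoids relying on the order in which the counter types were requested.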
@@ -486,80 +1006,5 @@ class TestAmdSmiPythonInterface(unittest.TestCase):
# # t3.join()
# print("\n========> test_z_gpureset_asicinfo_multithread end <========\n")
def walk_through(self):
print("\n###Test amdsmi_get_processor_handles() \n")
processors = amdsmi.amdsmi_get_processor_handles()
for i in range(0, len(processors)):
print("\n###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = " + str(i) + "\n")
bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i])
print("###Test Processor {}, bdf: {} ".format(i, bdf))
print("\n###Test amdsmi_get_gpu_asic_info \n")
asic_info = amdsmi.amdsmi_get_gpu_asic_info(processors[i])
print(" asic_info['market_name'] is: {}".format(
asic_info['market_name']))
print(" asic_info['vendor_id'] is: {}".format(
asic_info['vendor_id']))
print(" asic_info['vendor_name'] is: {}".format(
asic_info['vendor_name']))
print(" asic_info['device_id'] is: {}".format(
asic_info['device_id']))
print(" asic_info['rev_id'] is: {}\n".format(
asic_info['rev_id']))
print(" asic_info['asic_serial'] is: {}\n".format(
asic_info['asic_serial']))
print(" asic_info['oam_id'] is: {}\n".format(
asic_info['oam_id']))
print(" asic_info['target_graphics_version'] is: {}\n".format(
asic_info['target_graphics_version']))
print("\n###Test amdsmi_get_gpu_kfd_info \n")
kfd_info = amdsmi.amdsmi_get_gpu_kfd_info(processors[i])
print(" kfd_info['kfd_id'] is: {}\n".format(
kfd_info['kfd_id']))
print(" kfd_info['node_id'] is: {}\n".format(
kfd_info['node_id']))
print("###Test amdsmi_get_power_cap_info \n")
power_info = amdsmi.amdsmi_get_power_cap_info(processors[i])
print(" power_info['dpm_cap'] is: {}".format(
power_info['dpm_cap']))
print(" power_info['power_cap'] is: {}\n".format(
power_info['power_cap']))
print("###Test amdsmi_get_gpu_vbios_info \n")
vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processors[i])
print(" vbios_info['part_number'] is: {}".format(
vbios_info['part_number']))
print(" vbios_info['build_date'] is: {}".format(
vbios_info['build_date']))
print(" vbios_info['name'] is: {}\n".format(
vbios_info['name']))
print(" vbios_info['version'] is: {}\n".format(
vbios_info['version']))
print("###Test amdsmi_get_gpu_board_info \n")
board_info = amdsmi.amdsmi_get_gpu_board_info(processors[i])
print(" board_info['model_number'] is: {}\n".format(
board_info['model_number']))
print(" board_info['product_serial'] is: {}\n".format(
board_info['product_serial']))
print(" board_info['fru_id'] is: {}\n".format(
board_info['fru_id']))
print(" board_info['manufacturer_name'] is: {}\n".format(
board_info['manufacturer_name']))
print(" board_info['product_name'] is: {}\n".format(
board_info['product_name']))
print("###Test amdsmi_get_fw_info \n")
fw_info = amdsmi.amdsmi_get_fw_info(processors[i])
fw_num = len(fw_info['fw_list'])
self.assertLessEqual(fw_num, len(amdsmi.AmdSmiFwBlock))
for j in range(0, fw_num):
fw = fw_info['fw_list'][j]
if fw['fw_version'] != 0:
print("FW name: {}".format(
fw['fw_name'].name))
print("FW version: {}".format(
fw['fw_version']))
print("\n###Test amdsmi_get_gpu_driver_info \n")
driver_info = amdsmi.amdsmi_get_gpu_driver_info(processors[i])
print("Driver info: {}".format(driver_info))
print("\n###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = " + str(i) + "\n")
if __name__ == '__main__':
unittest.main()
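The tests in this file call `self.setUp()` and `self.tearDown()` explicitly, presumably so that `test_walkthrough` can invoke other test methods directly. For reference, the sketch below shows that `unittest` already runs both hooks around every test method by itself. It is self-contained and does not import `amdsmi`: the list appends stand in for `amdsmi_init()` and `amdsmi_shut_down()`, and `LifecycleDemo` and `run_demo` are hypothetical names.

```python
import unittest

events = []


class LifecycleDemo(unittest.TestCase):
    def setUp(self):
        events.append("init")        # stands in for amdsmi.amdsmi_init()

    def tearDown(self):
        events.append("shut_down")   # stands in for amdsmi.amdsmi_shut_down()

    def test_query(self):
        events.append("query")


def run_demo():
    """Run the demo case and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LifecycleDemo)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()


if __name__ == "__main__":
    run_demo()
    print(events)
```

With explicit calls added inside each test, the init and shutdown stand-ins would each run twice per test, so the real library would need to tolerate repeated initialization and shutdown.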