SWDEV-452739 - Add CEM slot type to amd-smi

Updated CHANGELOG.md and re-added spaces after bolded lines

Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ic728b3e9b083c62fe4c9791b8ede991f5dacc1ca


[ROCm/amdsmi commit: 51b3f8cccb]
This commit is contained in:
Maisam Arif
2024-03-27 00:45:53 -05:00
committed by Maisam Arif
parent 980da3b329
commit 144ddec250
5 changed files with 48 additions and 25 deletions
+33 -16
@@ -4,11 +4,11 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/](
***All information listed below is for reference and subject to change.***
## amd_smi_lib for ROCm 6.2.0
## amd_smi_lib for ROCm 6.1.1
### Changed
- **Updated metrics --clocks**
- **Updated metrics --clocks**
Output for `amd-smi metric --clock` is updated to reflect each engine and bug fixes for the clock lock status and deep sleep status.
``` shell
@@ -119,7 +119,7 @@ GPU: 0
DEEP_SLEEP: ENABLED
```
- **Added deferred ecc counts**
- **Added deferred ecc counts**
Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks`
```shell
@@ -143,11 +143,28 @@ GPU: 0
...
```
### Fixed
- **Fix for GPU reset error on non-amdgpu cards**
Previously, our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix
updates the CLI to target only AMD ASICs.
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()` on Navi32/31 cards**
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, these cards would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
- **Improved Error handling for `amd-smi process`**
Fixed an Attribute Error raised when getting process information in CSV format.
### Known issues
- `amd-smi bad-pages` can result in "ValueError: NULL pointer access" with certain PM FW versions
## amd_smi_lib for ROCm 6.1.0
### Added
- **Added Monitor Command**
- **Added Monitor Command**
Provides users the ability to customize GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI `rocm-smi` (no argument), and additionally allows users to customize what data is helpful for their use case.
```shell
@@ -207,7 +224,7 @@ GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_U
7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB
```
- **Integrated ESMI Tool**
- **Integrated ESMI Tool**
Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in `amd-smi static` and `amd-smi metric` commands. Only available for limited target processors. As of ROCm 6.0.2, this is listed as:
- AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh
- AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0h-AFh
@@ -357,7 +374,7 @@ CPU: 0
RESPONSE: N/A
```
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**
Using the AMD SMI tool, users can retrieve VCN, JPEG engine, and PCIe error metrics by calling `amd-smi metric -P` or `amd-smi metric --usage`. Depending on device support, `VCN_ACTIVITY` updates for MI3x ASICs (with 4 separate VCN engine activities); older ASICs report `MM_ACTIVITY` with UVD/VCN engine activity (average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.
```shell
@@ -401,7 +418,7 @@ $ amd-smi version
AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0
```
- **Added XGMI table**
- **Added XGMI table**
Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (e.g., MI300). Here users can view the accumulated read/write data transfer size over XGMI or PCIe (in kilobytes).
```shell
@@ -538,9 +555,9 @@ NUMA BW TABLE:
### Fixed
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**
For devices which do not report this data (e.g., Navi3X/Navi2X/MI100), we have added checks to confirm these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, tests now display a return string.
- **Fix for devices which have an older pyyaml installed**
- **Fix for devices which have an older pyyaml installed**
On platforms identified as having an older pyyaml version or pip, we now manually update both pip and pyyaml as needed. This corrects the issues identified below. The fix impacts the following CLI commands:
- `amd-smi list`
- `amd-smi static`
@@ -552,7 +569,7 @@ Platforms which are identified as having an older pyyaml version or pip, we no m
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'
```
- **Fix for crash when user is not a member of video/render groups**
- **Fix for crash when user is not a member of video/render groups**
AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user.
### Known Issues
@@ -563,20 +580,20 @@ AMD SMI now uses same mutex handler for devices as rocm-smi. This helps avoid cr
### Added
- **Integrated the E-SMI (EPYC-SMI) library**
- **Integrated the E-SMI (EPYC-SMI) library**
You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.
- **Added support for gfx942 metrics**
- **Added support for gfx942 metrics**
You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.
- **Compute and memory partition support**
- **Compute and memory partition support**
Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration.
### Changed
- **GPU index sorting made consistent with other tools**
- **GPU index sorting made consistent with other tools**
To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number.
- **Topology output is now aligned with GPU BDF table**
- **Topology output is now aligned with GPU BDF table**
Earlier versions of the topology output were difficult to read since each GPU was displayed linearly.
Now the information is displayed as a table by each GPU's BDF, which more closely resembles rocm-smi output.
@@ -586,7 +603,7 @@ Now the information is displayed as a table by each GPU's BDF, which closer rese
### Fixed
- **Fix for driver not initialized**
- **Fix for driver not initialized**
If the driver module is not loaded, the user receives an error response indicating that the amdgpu module is not loaded.
### Known Issues
+1
@@ -495,6 +495,7 @@ typedef union {
 typedef enum {
     AMDSMI_CARD_FORM_FACTOR_PCIE,
     AMDSMI_CARD_FORM_FACTOR_OAM,
+    AMDSMI_CARD_FORM_FACTOR_CEM,
     AMDSMI_CARD_FORM_FACTOR_UNKNOWN
 } amdsmi_card_form_factor_t;
+1 -1
@@ -809,7 +809,7 @@ Output: Dictionary with 2 fields `pcie_static` and `pcie_metric`
Fields | Description
---|---
-`pcie_static` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`max_pcie_width`</td><td>Maximum number of pcie lanes available</td></tr><tr><td>`max_pcie_speed`</td><td>Maximum capable pcie speed in GT/s</td></tr><tr><td>`pcie_interface_version`</td><td>PCIe generation ie. 3,4,5...</td></tr><tr><td>`slot_type`</td><td>The type of form factor of the slot: PCIE, OAM, or Unknown</td></tr></tbody></table>
+`pcie_static` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`max_pcie_width`</td><td>Maximum number of pcie lanes available</td></tr><tr><td>`max_pcie_speed`</td><td>Maximum capable pcie speed in GT/s</td></tr><tr><td>`pcie_interface_version`</td><td>PCIe generation ie. 3,4,5...</td></tr><tr><td>`slot_type`</td><td>The type of form factor of the slot: OAM, PCIE, CEM, or Unknown</td></tr></tbody></table>
`pcie_metric` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`pcie_width`</td><td>Current number of pcie lanes available</td></tr><tr><td>`pcie_speed`</td><td>Current pcie speed capable in GT/s</td></tr><tr><td>`pcie_bandwidth`</td><td>Current instantaneous bandwidth usage in Mb/s</td></tr><tr><td>`pcie_replay_count`</td><td>Total number of PCIe replays (NAKs)</td></tr><tr><td>`pcie_l0_to_recovery_count`</td><td>PCIE L0 to recovery state transition accumulated count</td></tr><tr><td>`pcie_replay_roll_over_count`</td><td>PCIe Replay accumulated count</td></tr><tr><td>`pcie_nak_sent_count`</td><td>PCIe NAK sent accumulated count</td></tr><tr><td>`pcie_nak_received_count`</td><td>PCIe NAK received accumulated count</td></tr></tbody></table>
Exceptions that can be thrown by `amdsmi_get_pcie_info` function:
+8 -6
@@ -737,11 +737,13 @@ amdsmi_bdf_t = union_amdsmi_bdf_t
 amdsmi_card_form_factor_t__enumvalues = {
     0: 'AMDSMI_CARD_FORM_FACTOR_PCIE',
     1: 'AMDSMI_CARD_FORM_FACTOR_OAM',
-    2: 'AMDSMI_CARD_FORM_FACTOR_UNKNOWN',
+    2: 'AMDSMI_CARD_FORM_FACTOR_CEM',
+    3: 'AMDSMI_CARD_FORM_FACTOR_UNKNOWN',
 }
 AMDSMI_CARD_FORM_FACTOR_PCIE = 0
 AMDSMI_CARD_FORM_FACTOR_OAM = 1
-AMDSMI_CARD_FORM_FACTOR_UNKNOWN = 2
+AMDSMI_CARD_FORM_FACTOR_CEM = 2
+AMDSMI_CARD_FORM_FACTOR_UNKNOWN = 3
 amdsmi_card_form_factor_t = ctypes.c_uint32 # enum
 class struct_amdsmi_pcie_info_t(Structure):
     pass
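Because `AMDSMI_CARD_FORM_FACTOR_CEM` is inserted ahead of `AMDSMI_CARD_FORM_FACTOR_UNKNOWN`, the numeric value of `UNKNOWN` moves from 2 to 3. The following is a minimal, self-contained sketch of why callers may prefer to compare against the symbolic names rather than hard-coded integers. The `CardFormFactor` class and the `slot_type_name` helper are hypothetical stand-ins for illustration and are not part of the amdsmi wrapper.

```python
from enum import IntEnum


class CardFormFactor(IntEnum):
    """Hypothetical mirror of amdsmi_card_form_factor_t after this change."""
    PCIE = 0
    OAM = 1
    CEM = 2      # new in this commit
    UNKNOWN = 3  # was 2 before CEM was inserted


def slot_type_name(raw_value: int) -> str:
    """Translate a raw enum value into a short slot-type label.

    Values outside the known range fall back to "UNKNOWN".
    """
    try:
        return CardFormFactor(raw_value).name
    except ValueError:
        return CardFormFactor.UNKNOWN.name


# Code that hard-coded 2 to mean "unknown" would now see CEM instead.
print(slot_type_name(2))
print(slot_type_name(3))
print(slot_type_name(99))
```

Under these assumptions, any consumer that stored or compared the raw integer `2` would need to be revisited after picking up this change.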
@@ -2347,10 +2349,10 @@ __all__ = \
     'AMDSMI_CACHE_PROPERTY_DATA_CACHE',
     'AMDSMI_CACHE_PROPERTY_ENABLED',
     'AMDSMI_CACHE_PROPERTY_INST_CACHE',
-    'AMDSMI_CACHE_PROPERTY_SIMD_CACHE', 'AMDSMI_CARD_FORM_FACTOR_OAM',
-    'AMDSMI_CARD_FORM_FACTOR_PCIE', 'AMDSMI_CARD_FORM_FACTOR_UNKNOWN',
-    'AMDSMI_CNTR_CMD_START', 'AMDSMI_CNTR_CMD_STOP',
-    'AMDSMI_COARSE_GRAIN_GFX_ACTIVITY',
+    'AMDSMI_CACHE_PROPERTY_SIMD_CACHE', 'AMDSMI_CARD_FORM_FACTOR_CEM',
+    'AMDSMI_CARD_FORM_FACTOR_OAM', 'AMDSMI_CARD_FORM_FACTOR_PCIE',
+    'AMDSMI_CARD_FORM_FACTOR_UNKNOWN', 'AMDSMI_CNTR_CMD_START',
+    'AMDSMI_CNTR_CMD_STOP', 'AMDSMI_COARSE_GRAIN_GFX_ACTIVITY',
     'AMDSMI_COARSE_GRAIN_MEM_ACTIVITY', 'AMDSMI_CURRENT_POWER',
     'AMDSMI_DEV_PERF_LEVEL_AUTO', 'AMDSMI_DEV_PERF_LEVEL_DETERMINISM',
     'AMDSMI_DEV_PERF_LEVEL_FIRST', 'AMDSMI_DEV_PERF_LEVEL_HIGH',
+5 -2
@@ -2050,11 +2050,14 @@ amdsmi_status_t amdsmi_get_pcie_info(amdsmi_processor_handle processor_handle, a
             processor_handle, &slot_type);
     if (status == AMDSMI_STATUS_SUCCESS) {
         switch (slot_type) {
+            case RSMI_PCIE_SLOT_PCIE:
+                info->pcie_static.slot_type = AMDSMI_CARD_FORM_FACTOR_PCIE;
+                break;
             case RSMI_PCIE_SLOT_OAM:
                 info->pcie_static.slot_type = AMDSMI_CARD_FORM_FACTOR_OAM;
                 break;
-            case RSMI_PCIE_SLOT_PCIE:
-                info->pcie_static.slot_type = AMDSMI_CARD_FORM_FACTOR_PCIE;
+            case RSMI_PCIE_SLOT_CEM:
+                info->pcie_static.slot_type = AMDSMI_CARD_FORM_FACTOR_CEM;
                 break;
             default:
                 info->pcie_static.slot_type = AMDSMI_CARD_FORM_FACTOR_UNKNOWN;
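The slot-type translation in this hunk amounts to a lookup with an `UNKNOWN` fallback. The sketch below restates that idea in Python for illustration only. The string keys stand in for the `RSMI_PCIE_SLOT_*` constants (their numeric values are not shown in this diff), and `map_slot_type` is a hypothetical helper, not a function in the library.

```python
# Hypothetical stand-ins: keys name the rocm-smi slot types, values name the
# AMD SMI form factors assigned by the switch statement.
SLOT_TO_FORM_FACTOR = {
    "RSMI_PCIE_SLOT_PCIE": "AMDSMI_CARD_FORM_FACTOR_PCIE",
    "RSMI_PCIE_SLOT_OAM": "AMDSMI_CARD_FORM_FACTOR_OAM",
    "RSMI_PCIE_SLOT_CEM": "AMDSMI_CARD_FORM_FACTOR_CEM",
}


def map_slot_type(rsmi_slot: str) -> str:
    """Return the form factor for a slot type, or UNKNOWN if unmapped."""
    # dict.get plays the role of the switch's default branch.
    return SLOT_TO_FORM_FACTOR.get(rsmi_slot, "AMDSMI_CARD_FORM_FACTOR_UNKNOWN")


print(map_slot_type("RSMI_PCIE_SLOT_CEM"))
print(map_slot_type("SOME_FUTURE_SLOT_TYPE"))
```

Before this change, a CEM slot had no entry of its own and therefore fell through to the fallback, which matches the "UNKNOWN" report described in the changelog.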