Added #defines from amdsmi.h to python interface
Signed-off-by: Maisam Arif <maisarif@amd.com>
Change-Id: Ic1a17d20f9f1f76e55813db8e2fe287279cb231e
[ROCm/amdsmi commit: 8f8d88416f]
@@ -8,16 +8,19 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr

### Additions

- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**.
Added macros that expose the maximum size limits used by certain amdsmi functions, such as the maximum number of DPM policies and the maximum fan speed.

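As a rough illustration of how such a limit might be used, the sketch below scales a raw fan speed against the maximum. The constant value is copied from the `amdsmi_interface.py` hunk of this commit and defined locally so the snippet is self-contained. The helper `fan_speed_to_percent` is hypothetical, and treating the raw speed as a 0 to 255 scale is an assumption.

```python
# Local copy of a value from the amdsmi_interface.py diff; in real code it
# would come from the library rather than being redefined here.
AMDSMI_MAX_FAN_SPEED = 255


def fan_speed_to_percent(raw_speed):
    """Convert a raw fan speed (assumed 0..AMDSMI_MAX_FAN_SPEED) to a percentage.

    Hypothetical helper, not part of the amdsmi API.
    """
    if not 0 <= raw_speed <= AMDSMI_MAX_FAN_SPEED:
        raise ValueError(f"fan speed must be in 0..{AMDSMI_MAX_FAN_SPEED}")
    return raw_speed * 100 / AMDSMI_MAX_FAN_SPEED


print(fan_speed_to_percent(255))
print(fan_speed_to_percent(0))
```
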
- **Added Ring Hang event**.
Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_notification_type_t` enum.

- **Added process isolation and clean shader APIs and CLI commands**.
Added APIs and CLI commands to address the LeftoverLocals security issue. They allow clearing the SRAM data and setting process isolation on a per-GPU basis. New APIs:
- `amdsmi_get_gpu_process_isolation()`
- `amdsmi_set_gpu_process_isolation()`
- `amdsmi_set_gpu_clear_sram_data()`

- **Added `MIN_POWER` to output of `amd-smi static --limit`**.
This change helps users identify the range within which they can change the power cap of the GPU. It also makes it clearer why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap <value in W>` or `amd-smi reset -g all --power-cap`.

```shell
@@ -49,7 +52,7 @@ GPU: 1

### Optimizations

- **Updated `amd-smi monitor --pcie` output**.
The PCIe bandwidth monitor output was sourced from a legacy file that we no longer support and that was causing delays in the monitor command. The output no longer uses TX/RX; it now reports instantaneous bandwidth from gpu_metrics instead. Updated output:

```shell
@@ -64,13 +67,13 @@ GPU PCIE_BW
Updated sizes that work for retrieving relevant board information across AMD's
ASIC products. This requires users to update any ABIs using this structure.

- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**.
`amdsmi_get_power_cap_info` now returns values in uW, as originally reported by the driver. Previously, `amdsmi_get_power_cap_info` returned values in W, which conflicted with our set operations and modified the values retrieved from the driver. We decided to keep the values returned by the driver untouched (in their original units, uW). The CLI then converts them to watts (as previously done, no changes here). Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).

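For callers that previously consumed watts directly, the unit change means a conversion step on their side. The sketch below shows one way to do it. The dictionary shape and the sample values are assumptions for illustration (only the field names `min_power_cap` and `max_power_cap` come from the text above), and `power_cap_info_to_watts` is a hypothetical helper, not part of the amdsmi API.

```python
MICROWATTS_PER_WATT = 1_000_000


def power_cap_info_to_watts(info_uw):
    """Return a copy of a power-cap mapping with every value converted from uW to W."""
    return {key: value / MICROWATTS_PER_WATT for key, value in info_uw.items()}


# Hypothetical values in uW; min and max match, as described for the
# overdrive-disabled case.
sample_uw = {"min_power_cap": 203_000_000, "max_power_cap": 203_000_000}
print(power_cap_info_to_watts(sample_uw))
```
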
- **Updated Python Library return types for `amdsmi_get_gpu_memory_reserved_pages` & `amdsmi_get_gpu_bad_page_info`**.
Previously, these calls returned the string "No bad pages found." when no pages were found. They now always return a list, which may be empty.

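Code that has to run against both the old and the new behavior could normalize the result before using it. This is a minimal sketch under that assumption: `normalize_bad_pages` is a hypothetical helper, and the record key `page_address` is made up for the example.

```python
def normalize_bad_pages(result):
    """Return a list of bad-page records for both the old and new return styles."""
    if isinstance(result, str):
        # Old behavior: a message string such as "No bad pages found."
        return []
    # New behavior: always a list, possibly empty.
    return list(result)


print(normalize_bad_pages("No bad pages found."))
print(normalize_bad_pages([]))
print(len(normalize_bad_pages([{"page_address": 0x1000}])))  # hypothetical record
```

With the new return type, a plain `if not pages:` check is enough to detect the "no bad pages" case.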
- **Updated `amd-smi metric --ecc-blocks` output**.
The `--ecc-blocks` argument was outputting blocks that have no counters available. The filtering was updated to show only blocks for which counters are available:

``` shell
@@ -107,18 +110,18 @@ GPU: 0
DEFERRED_COUNT: 0
```

- **Removed `amdsmi_get_gpu_process_info` from Python library**.
`amdsmi_get_gpu_process_info` was removed from the C library in an earlier build, but the API was still in the Python interface.

### Fixes

- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**.
Previously, if there was a partial failure to retrieve character strings, we would return
garbage output to users of the API. This fix populates as many values as possible.
For any failure found along the way, `\0` is written to the `amdsmi_board_info_t`
structure's data members that cannot be populated, ensuring empty char string values.

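To illustrate why a leading `\0` yields an empty string, the sketch below decodes fixed-size C char buffers up to their first NUL byte using `ctypes`. The buffer size and the board name are hypothetical, and `decode_c_string` is an illustrative helper rather than anything from the amdsmi library.

```python
import ctypes


def decode_c_string(buffer):
    """Decode a fixed-size C char buffer up to its first NUL byte."""
    raw = buffer.raw  # full contents, including any trailing NUL padding
    return raw.split(b"\0", 1)[0].decode("utf-8", errors="replace")


FIELD_SIZE = 32  # hypothetical field width, not taken from amdsmi.h

populated = ctypes.create_string_buffer(b"EXAMPLE-BOARD", FIELD_SIZE)
unpopulated = ctypes.create_string_buffer(FIELD_SIZE)  # every byte is \0

print(repr(decode_c_string(populated)))
print(repr(decode_c_string(unpopulated)))
```
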
- **Fixed `amd-smi metric --power` to provide power output for Navi2x/Navi3x/MI1x**.
These systems use an older version of gpu_metrics in amdgpu. This fix only updates what the CLI outputs.
No change in any of our APIs.

@@ -143,10 +146,10 @@ GPU: 1
THROTTLE_STATUS: UNTHROTTLED
```

- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**.
The updates required `amdsmi_get_power_cap_info` to return values in uW, as originally reported by the driver. Previously, `amdsmi_get_power_cap_info` returned values in W, which conflicted with our set operations and modified the values retrieved from the driver. We decided to keep the values returned by the driver untouched (in their original units, uW). The CLI then converts them to watts (as previously done, no changes here). Additionally, the driver updated the min power cap displayed for devices when overdrive is disabled, which prompted this change (in this case min_power_cap and max_power_cap are the same).

- **Fixed Python interface calls `amdsmi_get_gpu_memory_reserved_pages` & `amdsmi_get_gpu_bad_page_info`**.
Previously, Python interface calls on devices with populated bad pages resulted in a `ValueError: NULL pointer access`. This also fixes the `bad-pages` CLI subcommand.

### Known Issues
@@ -157,7 +160,7 @@ Previously Python interface calls to populated bad pages resulted in a `ValueErr

### Changes

- **Updated metrics --clocks**.
Output for `amd-smi metric --clock` now reflects each engine and includes bug fixes for the clock lock status and deep sleep status.

``` shell
@@ -268,7 +271,7 @@ GPU: 0
DEEP_SLEEP: ENABLED
```

- **Added deferred ecc counts**.
Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks`.

```shell
@@ -292,7 +295,7 @@ GPU: 0
...
```

- **Updated `amd-smi topology --json` to align with host/guest**.
Topology's `--json` output has changed to align with the output on host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for the full list). See the examples shown below.

*Previous format:*
@@ -427,18 +430,18 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json

### Fixes

- **Fix for GPU reset error on non-amdgpu cards**.
Previously, our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix
updates the CLI to target only AMD ASICs.

- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()` on Navi32/31 cards**.
Updated the API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).

- **Fix for `amd-smi process`**.
Fixed output results when getting processes running on a device.

- **Improved error handling for `amd-smi process`**.
Fixed an `AttributeError` when getting processes in CSV format.

### Known issues
@@ -449,7 +452,7 @@ Fixed Attribute Error when getting process in csv format

### Additions

- **Added Monitor Command**.
Lets users customize which GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI's `rocm-smi` (no argument) and additionally allows users to choose which data is helpful for their use case.

```shell
@@ -509,7 +512,7 @@ GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_U
7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB
```

- **Integrated ESMI Tool**.
Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in the `amd-smi static` and `amd-smi metric` commands. It is only available for a limited set of target processors. As of ROCm 6.0.2, these are:
- AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh
- AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0h-AFh
@@ -659,7 +662,7 @@ CPU: 0
RESPONSE: N/A
```

- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**.
Using the AMD SMI tool, users can retrieve VCN, JPEG engine, and PCIe error metrics by calling `amd-smi metric -P` or `amd-smi metric --usage`. Depending on device support, `VCN_ACTIVITY` is updated for MI3x ASICs (with 4 separate VCN engine activities); for older ASICs, `MM_ACTIVITY` reports UVD/VCN engine activity (the average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.

```shell
@@ -692,7 +695,7 @@ GPU: 0

```

- **Added AMDSMI Tool Version**.
AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version.
The AMDSMI Tool version is the CLI/tool version number with the commit ID appended after the `+` sign.
The AMDSMI Library version is the library package version number.
@@ -703,7 +706,7 @@ $ amd-smi version
AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0
```
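A script that consumes this line could split it into its three versions and separate the tool version from its commit ID. The sketch below assumes the single-line `label: value | label: value` layout shown in the sample output above. Both helper functions are hypothetical and not part of AMD SMI.

```python
def parse_version_line(line):
    """Split an `amd-smi version` style line into a dict of label -> version."""
    versions = {}
    for field in line.split("|"):
        label, _, value = field.partition(":")
        versions[label.strip()] = value.strip()
    return versions


def split_tool_version(tool_version):
    """Separate the tool version number from the commit ID after the '+' sign."""
    number, _, commit_id = tool_version.partition("+")
    return number, commit_id


line = "AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0"
versions = parse_version_line(line)
print(versions)
print(split_tool_version(versions["AMDSMI Tool"]))
```
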

- **Added XGMI table**.
Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (e.g., MI300). Here users can view the accumulated read/write data transfer size over XGMI or PCIe (in kilobytes).

```shell
@@ -737,7 +740,7 @@ GPU7 0000:df:00.0 32 Gb/s 512 Gb/s XGMI

```

- **Added units of measure to JSON output**.
We added units of measure to the JSON/CSV output of the `amd-smi metric`, `amd-smi static`, and `amd-smi monitor` commands.

Ex.
@@ -773,7 +776,7 @@ amd-smi metric -p --json

### Changes

- **Topology is now left-aligned, with the BDF of each device listed in each individual table's rows/columns**.
We now provide each device's BDF in every table's rows/columns and left-align the data. We want AMD SMI Tool output to be easy to understand and digest for our users. Having users scroll up to find this information made the output difficult to follow, especially for systems which have many devices associated with one ASIC.

```shell
@@ -836,9 +839,9 @@ NUMA BW TABLE:

### Fixes

- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**.
For devices which do not report PCI bandwidth (e.g., Navi3X/Navi2X/MI100), we added checks to confirm that these devices return `AMDSMI_STATUS_NOT_SUPPORTED`. Otherwise, tests now display a return string.
- **Fix for devices which have an older pyyaml installed**.
On platforms identified as having an older pyyaml version or pip, we now manually update both pip and pyyaml as needed. This corrects the issues identified below. The fix impacts the following CLI commands:
- `amd-smi list`
- `amd-smi static`
@@ -850,20 +853,20 @@ Platforms which are identified as having an older pyyaml version or pip, we no m
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'
```

- **Fix for crash when user is not a member of video/render groups**.
AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user.

## amd_smi_lib for ROCm 6.0.0

### Additions

- **Integrated the E-SMI (EPYC-SMI) library**.
You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.

- **Added support for gfx942 metrics**.
You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.

- **Compute and memory partition support**.
Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration.

### Optimizations
@@ -872,13 +875,13 @@ Users can now view, set, and reset partitions. The topology display can provide

### Changes

- **GPU index sorting made consistent with other tools**.
To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number.
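The idea of ordering by BDF instead of card number can be sketched as follows. This is only an illustration, not the tool's actual implementation: the device list and card numbers are hypothetical, and `bdf_sort_key` assumes the `domain:bus:device.function` hexadecimal layout used in the tool's output (for example `0000:df:00.0`).

```python
def bdf_sort_key(bdf):
    """Turn 'domain:bus:device.function' (hex fields) into a sortable tuple of ints."""
    domain, bus, device_function = bdf.split(":")
    device, function = device_function.split(".")
    return tuple(int(part, 16) for part in (domain, bus, device, function))


# Hypothetical devices whose card numbers do not follow BDF order.
devices = [
    {"card": 0, "bdf": "0000:df:00.0"},
    {"card": 1, "bdf": "0000:0c:00.0"},
    {"card": 2, "bdf": "0000:9d:00.0"},
]

by_bdf = sorted(devices, key=lambda device: bdf_sort_key(device["bdf"]))
for gpu_index, device in enumerate(by_bdf):
    print(gpu_index, device["bdf"], "card", device["card"])
```
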
- **Topology output is now aligned with GPU BDF table**.
Earlier versions of the topology output were difficult to read since each GPU was displayed linearly.
Now the information is displayed as a table by each GPU's BDF, which more closely resembles rocm-smi output.

### Fixes

- **Fix for driver not initialized**.
If the driver module is not loaded, users receive an error response indicating that the amdgpu module is not loaded.

@@ -39,6 +39,37 @@ AMDSMI_MAX_NUM_VCN = 4
AMDSMI_MAX_NUM_CLKS = 4
AMDSMI_MAX_NUM_XGMI_LINKS = 8
AMDSMI_MAX_NUM_GFX_CLKS = 8
AMDSMI_MAX_AID = 4
AMDSMI_MAX_ENGINES = 8
AMDSMI_MAX_NUM_JPEG = 32

# Max number of DPM policies
AMDSMI_MAX_NUM_PM_POLICIES = 32

# Max supported frequencies
AMDSMI_MAX_NUM_FREQUENCIES = 33

# Max Fan speed
AMDSMI_MAX_FAN_SPEED = 255

# Max Voltage Curve Points
AMDSMI_NUM_VOLTAGE_CURVE_POINTS = 3

# Max size definitions
AMDSMI_MAX_MM_IP_COUNT = 8
AMDSMI_MAX_DATE_LENGTH = 32
AMDSMI_MAX_STRING_LENGTH = 64
AMDSMI_NORMAL_STRING_LENGTH = 32
AMDSMI_MAX_DEVICES = 32
AMDSMI_MAX_NAME = 32
AMDSMI_MAX_DRIVER_VERSION_LENGTH = 80
AMDSMI_256_LENGTH = 256
AMDSMI_MAX_CONTAINER_TYPE = 2
AMDSMI_MAX_CACHE_TYPES = 10
AMDSMI_MAX_NUM_XGMI_PHYSICAL_LINK = 64
AMDSMI_GPU_UUID_SIZE = 38
MAX_AMDSMI_NAME_LENGTH = 64
MAX_EVENT_NOTIFICATION_MSG_SIZE = 64


class AmdSmiInitFlags(IntEnum):
@@ -224,7 +255,6 @@ class AmdSmiEvtNotificationType(IntEnum):
    GPU_POST_RESET = amdsmi_wrapper.AMDSMI_EVT_NOTIF_GPU_POST_RESET
    RING_HANG = amdsmi_wrapper.AMDSMI_EVT_NOTIF_RING_HANG


class AmdSmiTemperatureMetric(IntEnum):
    CURRENT = amdsmi_wrapper.AMDSMI_TEMP_CURRENT
    MAX = amdsmi_wrapper.AMDSMI_TEMP_MAX