fix: [SWDEV-442525] [rocm/amd_smi_lib]
Fixes gpu_process_list

Code changes related to the following:
* amdsmi_get_gpu_process_list()
* CLI
* Examples
* Unit tests
* Changelog
* Readme
* rocm_smi_lib commit:677433b367

Change-Id: I9210fbca7a5da92d0a8b472b72ca82597c8e4fb5
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/amdsmi commit:08e2e21bab]
@@ -8,7 +8,7 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/](
|
||||
|
||||
### Changed
|
||||
|
||||
- **Updated metrics --clocks**
|
||||
- **Updated metrics --clocks**
|
||||
Output for `amd-smi metric --clock` is updated to report each engine and includes bug fixes for the clock lock status and deep sleep status.
|
||||
|
||||
``` shell
|
||||
@@ -119,7 +119,7 @@ GPU: 0
|
||||
DEEP_SLEEP: ENABLED
|
||||
```
|
||||
|
||||
- **Added deferred ecc counts**
|
||||
- **Added deferred ecc counts**
|
||||
Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks`
|
||||
|
||||
```shell
|
||||
@@ -149,11 +149,14 @@ GPU: 0
|
||||
Previously, our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. This fix
updates the CLI to target only AMD ASICs.
|
||||
|
||||
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**
|
||||
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**
|
||||
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
|
||||
|
||||
- **Improved Error handling for `amd-smi process`**
|
||||
- **Fix for `amd-smi process`**
|
||||
Fixed output results when getting processes running on a device.
|
||||
|
||||
- **Improved Error handling for `amd-smi process`**
|
||||
Fixed Attribute Error when getting process in csv format
|
||||
|
||||
### Known issues
|
||||
@@ -164,7 +167,7 @@ Fixed Attribute Error when getting process in csv format
|
||||
|
||||
### Added
|
||||
|
||||
- **Added Monitor Command**
|
||||
- **Added Monitor Command**
|
||||
Provides users the ability to customize which GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns more closely with ROCm SMI `rocm-smi` (no argument), and additionally allows users to customize what data is helpful for their use case.
|
||||
|
||||
```shell
|
||||
@@ -224,7 +227,7 @@ GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_U
|
||||
7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB
|
||||
```
|
||||
|
||||
- **Integrated ESMI Tool**
|
||||
- **Integrated ESMI Tool**
|
||||
Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in `amd-smi static` and `amd-smi metric` commands. Only available for limited target processors. As of ROCm 6.0.2, this is listed as:
|
||||
- AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh
|
||||
- AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0h-AFh
|
||||
@@ -374,7 +377,7 @@ CPU: 0
|
||||
RESPONSE: N/A
|
||||
```
|
||||
|
||||
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**
|
||||
- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**
|
||||
Using the AMD SMI tool, users can retrieve VCN, JPEG engines, and PCIe errors by calling `amd-smi metric -P` or `amd-smi metric --usage`. Depending on device support, `VCN_ACTIVITY` will update for MI3x ASICs (with 4 separate VCN engine activities); for older ASICs, `MM_ACTIVITY` will update with UVD/VCN engine activity (average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where a device can support up to 32 JPEG engine activities. See our documentation for a more in-depth understanding of these new fields.
|
||||
|
||||
```shell
|
||||
@@ -407,10 +410,10 @@ GPU: 0
|
||||
|
||||
```
|
||||
|
||||
- **Added AMDSMI Tool Version**
|
||||
AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version.
|
||||
The AMDSMI Tool version is the CLI/tool version number with commit ID appended after `+` sign.
|
||||
The AMDSMI Library version is the library package version number.
|
||||
- **Added AMDSMI Tool Version**
|
||||
AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version.
|
||||
The AMDSMI Tool version is the CLI/tool version number with commit ID appended after `+` sign.
|
||||
The AMDSMI Library version is the library package version number.
|
||||
The ROCm version is the system's installed ROCm version, if ROCm is not installed it will report N/A.
|
||||
|
||||
```shell
|
||||
@@ -418,7 +421,7 @@ $ amd-smi version
|
||||
AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0
|
||||
```
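The version line above can be split programmatically. The following is a minimal sketch, assuming the output keeps the exact `label: value | label: value` layout shown in the sample; `parse_version_line` is a hypothetical helper, not part of AMD SMI.

```python
def parse_version_line(line: str) -> dict:
    """Split an `amd-smi version` line into its three reported versions."""
    fields = {}
    for part in line.split("|"):
        # Each part looks like "label: value".
        label, _, value = part.partition(":")
        fields[label.strip()] = value.strip()

    # The tool version carries the commit ID after a '+' sign.
    tool_version, _, commit_id = fields["AMDSMI Tool"].partition("+")
    return {
        "tool_version": tool_version,
        "tool_commit": commit_id or None,
        "library_version": fields["AMDSMI Library version"],
        # Reported as "N/A" when ROCm is not installed.
        "rocm_version": fields["ROCm version"],
    }


sample = ("AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 "
          "| ROCm version: 6.1.0")
parsed = parse_version_line(sample)
print(parsed)
```

Keeping the commit ID separate from the tool version makes it easier to compare releases without string matching on the `+` suffix.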
|
||||
|
||||
- **Added XGMI table**
|
||||
- **Added XGMI table**
|
||||
Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (e.g., MI300). Here users can view the accumulated read/write data transfer size (in kilobytes) over XGMI or PCIe.
|
||||
|
||||
```shell
|
||||
@@ -452,10 +455,10 @@ GPU7 0000:df:00.0 32 Gb/s 512 Gb/s XGMI
|
||||
|
||||
```
|
||||
|
||||
- **Added units of measure to JSON output.**
|
||||
- **Added units of measure to JSON output.**
|
||||
We added units of measure to the JSON/CSV output of the `amd-smi metric`, `amd-smi static`, and `amd-smi monitor` commands.
|
||||
|
||||
Ex.
|
||||
Ex.
|
||||
|
||||
```shell
|
||||
amd-smi metric -p --json
|
||||
@@ -488,7 +491,7 @@ amd-smi metric -p --json
|
||||
|
||||
### Changed
|
||||
|
||||
- **Topology is now left-aligned, with the BDF of each device listed in each table's rows/columns.**
- **Topology is now left-aligned, with the BDF of each device listed in each table's rows/columns.**
|
||||
We now provide each device's BDF in every table's rows/columns and left-align the data. We want AMD SMI Tool output to be easy for our users to understand and digest. Having users scroll up to find this information made the output difficult to follow, especially for systems with many devices associated with one ASIC.
|
||||
|
||||
```shell
|
||||
@@ -555,9 +558,9 @@ NUMA BW TABLE:
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**
|
||||
- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**
|
||||
For devices which do not report PCIe bandwidth (e.g., Navi3X/Navi2X/MI100), we have added checks to confirm that these devices return `AMDSMI_STATUS_NOT_SUPPORTED`. Otherwise, tests now display a return string.
|
||||
- **Fix for devices which have an older pyyaml installed**
|
||||
- **Fix for devices which have an older pyyaml installed**
|
||||
On platforms identified as having an older pyyaml version or pip, we now manually update both pip and pyyaml as needed. This corrects the issues identified below. The fix impacts the following CLI commands:
|
||||
- `amd-smi list`
|
||||
- `amd-smi static`
|
||||
@@ -569,7 +572,7 @@ Platforms which are identified as having an older pyyaml version or pip, we no m
|
||||
TypeError: dump_all() got an unexpected keyword argument 'sort_keys'
|
||||
```
|
||||
|
||||
- **Fix for crash when user is not a member of video/render groups**
|
||||
- **Fix for crash when user is not a member of video/render groups**
|
||||
AMD SMI now uses the same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessible to the logged-in user.
|
||||
|
||||
### Known Issues
|
||||
@@ -580,20 +583,20 @@ AMD SMI now uses same mutex handler for devices as rocm-smi. This helps avoid cr
|
||||
|
||||
### Added
|
||||
|
||||
- **Integrated the E-SMI (EPYC-SMI) library**
|
||||
- **Integrated the E-SMI (EPYC-SMI) library**
|
||||
You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details.
|
||||
|
||||
- **Added support for gfx942 metrics**
|
||||
- **Added support for gfx942 metrics**
|
||||
You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance.
|
||||
|
||||
- **Compute and memory partition support**
|
||||
- **Compute and memory partition support**
|
||||
Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration.
|
||||
|
||||
### Changed
|
||||
|
||||
- **GPU index sorting made consistent with other tools**
|
||||
- **GPU index sorting made consistent with other tools**
|
||||
To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number.
|
||||
- **Topology output is now aligned with GPU BDF table**
|
||||
- **Topology output is now aligned with GPU BDF table**
|
||||
Earlier versions of the topology output were difficult to read since each GPU was displayed linearly.
|
||||
Now the information is displayed as a table by each GPU's BDF, which more closely resembles rocm-smi output.
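To illustrate what ordering GPUs by Bus:Device.Function means in practice, here is a minimal sketch. It assumes BDF strings in the `domain:bus:device.function` hexadecimal form seen elsewhere in this changelog (for example `0000:df:00.0`). The card names and the `bdf_sort_key` helper are hypothetical, and AMD SMI's own implementation is not shown in this commit.

```python
def bdf_sort_key(bdf: str) -> tuple:
    """Turn 'dddd:bb:dd.f' into a tuple of integers that sorts numerically."""
    domain, bus, device_function = bdf.split(":")
    device, function = device_function.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


# Hypothetical mapping of card name to BDF, deliberately out of order.
gpus = {
    "card0": "0000:df:00.0",
    "card1": "0000:0c:00.0",
    "card2": "0000:9d:00.0",
}

# Sort by BDF rather than by card number, then assign GPU indices.
ordered = sorted(gpus.items(), key=lambda item: bdf_sort_key(item[1]))
for index, (card, bdf) in enumerate(ordered):
    print(f"GPU {index}: {bdf} ({card})")
```

Parsing each component as a hexadecimal integer avoids the pitfalls of comparing the raw strings, which would misorder values of different widths.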
|
||||
|
||||
@@ -603,7 +606,7 @@ Now the information is displayed as a table by each GPU's BDF, which closer rese
|
||||
|
||||
### Fixed
|
||||
|
||||
- **Fix for driver not initialized**
|
||||
- **Fix for driver not initialized**
|
||||
If the driver module is not loaded, the user receives an error response indicating that the amdgpu module is not loaded.
|
||||
|
||||
### Known Issues
|
||||
|
||||
@@ -2570,16 +2570,18 @@ class AMDSMICommands():
|
||||
try:
|
||||
process_list = amdsmi_interface.amdsmi_get_gpu_process_list(args.gpu)
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
if e.get_error_code() == amdsmi_interface.amdsmi_wrapper.AMDSMI_STATUS_NO_PERM:
|
||||
raise PermissionError('Command requires elevation') from e
|
||||
logging.debug("Failed to get process list for gpu %s | %s", gpu_id, e.get_error_info())
|
||||
raise e
|
||||
|
||||
filtered_process_values = []
|
||||
for process_handle in process_list:
|
||||
for process in process_list:
|
||||
try:
|
||||
process_info = amdsmi_interface.amdsmi_get_gpu_process_info(args.gpu, process_handle)
|
||||
process_info = amdsmi_interface.amdsmi_get_gpu_process_info(args.gpu, process)
|
||||
except amdsmi_exception.AmdSmiLibraryException as e:
|
||||
process_info = "N/A"
|
||||
logging.debug("Failed to get process info for gpu %s on process_handle %s | %s", gpu_id, process_handle, e.get_error_info())
|
||||
logging.debug("Failed to get process info for process %s on gpu %s | %s", process, gpu_id, e.get_error_info())
|
||||
filtered_process_values.append({'process_info': process_info})
|
||||
continue
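The error handling in this hunk follows a common pattern: translate a permission status into `PermissionError`, and log and re-raise anything else. A minimal standalone sketch of that pattern follows. `FakeLibraryException`, `fetch_process_list`, and the numeric status values are stand-ins invented for illustration; they are not the real `amdsmi_exception` or `amdsmi_wrapper` objects.

```python
import logging

# Placeholder values for illustration only; the CLI compares against
# amdsmi_wrapper.AMDSMI_STATUS_NO_PERM instead.
STATUS_NO_PERM = 10
STATUS_OTHER = 99  # hypothetical "any other failure" code


class FakeLibraryException(Exception):
    """Stand-in for AmdSmiLibraryException with the two accessors used above."""

    def __init__(self, code, info):
        super().__init__(info)
        self._code = code
        self._info = info

    def get_error_code(self):
        return self._code

    def get_error_info(self):
        return self._info


def fetch_process_list(query, gpu_id):
    """Call query(gpu_id); map a permission failure to PermissionError."""
    try:
        return query(gpu_id)
    except FakeLibraryException as e:
        if e.get_error_code() == STATUS_NO_PERM:
            # Chain the original exception so the cause stays visible.
            raise PermissionError("Command requires elevation") from e
        logging.debug("Failed to get process list for gpu %s | %s",
                      gpu_id, e.get_error_info())
        raise


def denied(gpu_id):
    raise FakeLibraryException(STATUS_NO_PERM, "permission denied")


try:
    fetch_process_list(denied, 0)
except PermissionError as err:
    outcome = f"{err} (cause: {err.__cause__.get_error_info()})"
print(outcome)
```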
|
||||
|
||||
|
||||
@@ -432,7 +432,9 @@ int main() {
|
||||
ret = amdsmi_get_temp_metric(
|
||||
processor_handles[j], TEMPERATURE_TYPE_EDGE,
|
||||
AMDSMI_TEMP_CRITICAL, &temperature);
|
||||
CHK_AMDSMI_RET(ret)
|
||||
if (ret != amdsmi_status_t::AMDSMI_STATUS_NOT_SUPPORTED) {
|
||||
CHK_AMDSMI_RET(ret)
|
||||
}
|
||||
printf("\tGPU GFX temp limit: %ld\n\n", temperature);
|
||||
|
||||
// Get temperature measurements
|
||||
@@ -447,7 +449,9 @@ int main() {
|
||||
processor_handles[j], temp_type,
|
||||
AMDSMI_TEMP_CURRENT,
|
||||
&temp_measurements[(int)(temp_type)]);
|
||||
CHK_AMDSMI_RET(ret)
|
||||
if (ret != amdsmi_status_t::AMDSMI_STATUS_NOT_SUPPORTED) {
|
||||
CHK_AMDSMI_RET(ret)
|
||||
}
|
||||
}
|
||||
printf(" Output of amdsmi_get_temp_metric:\n");
|
||||
printf("\tGPU Edge temp measurement: %ld\n",
|
||||
@@ -526,14 +530,13 @@ int main() {
|
||||
};
|
||||
|
||||
uint32_t num_process = 0;
|
||||
ret = amdsmi_get_gpu_process_list(processor_handles[j], &num_process,
|
||||
nullptr);
|
||||
ret = amdsmi_get_gpu_process_list(processor_handles[j], &num_process, nullptr);
|
||||
CHK_AMDSMI_RET(ret)
|
||||
if (!num_process) {
|
||||
printf("No processes found.\n");
|
||||
} else {
|
||||
amdsmi_process_handle_t process_list[num_process];
|
||||
amdsmi_proc_info_t info_list[num_process];
|
||||
std::cout << "Processes found: " << num_process << "\n";
|
||||
amdsmi_proc_info_t process_info_list[num_process];
|
||||
amdsmi_proc_info_t process = {};
|
||||
uint64_t mem = 0, gtt_mem = 0, cpu_mem = 0, vram_mem = 0;
|
||||
uint64_t gfx = 0, enc = 0;
|
||||
@@ -544,24 +547,14 @@ int main() {
|
||||
bdf.fields.device_number,
|
||||
bdf.fields.function_number);
|
||||
int num = 0;
|
||||
ret = amdsmi_get_gpu_process_list(processor_handles[j], &num_process,
|
||||
process_list);
|
||||
CHK_AMDSMI_RET(ret)
|
||||
for (uint32_t it = 0; it < num_process; it += 1) {
|
||||
if (getpid() == process_list[it]) {
|
||||
continue;
|
||||
}
|
||||
ret = amdsmi_get_gpu_process_info(processor_handles[j],
|
||||
process_list[it], &process);
|
||||
if (ret != AMDSMI_STATUS_SUCCESS) {
|
||||
printf("amdsmi_get_gpu_process_info() failed for "
|
||||
"process_list[%d], returned %d\n",
|
||||
it, ret);
|
||||
continue;
|
||||
}
|
||||
info_list[num++] = process;
|
||||
ret = amdsmi_get_gpu_process_list(processor_handles[j], &num_process, process_info_list);
|
||||
std::cout << "Allocation size for process list: " << num_process << "\n";
|
||||
CHK_AMDSMI_RET(ret);
|
||||
for (auto idx = uint32_t(0); idx < num_process; ++idx) {
|
||||
process = static_cast<amdsmi_proc_info_t>(process_info_list[idx]);
|
||||
printf("\t *Process id: %ld / Name: %s / VRAM: %lld \n", process.pid, process.name, process.memory_usage.vram_mem);
|
||||
}
|
||||
qsort(info_list, num, sizeof(info_list[0]), compare);
|
||||
|
||||
printf("+=======+==================+============+=============="
|
||||
"+=============+=============+=============+============"
|
||||
"==+=========================================+\n");
|
||||
@@ -575,41 +568,41 @@ int main() {
|
||||
printf("+=======+"
|
||||
"+=============+=============+=============+============"
|
||||
"==+=========================================+\n");
|
||||
for (int it = 0; it < num; it++) {
|
||||
for (int it = 0; it < num_process; it++) {
|
||||
char command[30];
|
||||
struct passwd *pwd = nullptr;
|
||||
struct stat st;
|
||||
|
||||
sprintf(command, "/proc/%d", info_list[it].pid);
|
||||
sprintf(command, "/proc/%d", process_info_list[it].pid);
|
||||
if (stat(command, &st))
|
||||
continue;
|
||||
pwd = getpwuid(st.st_uid);
|
||||
if (!pwd)
|
||||
printf("| %5d | %16s | %10d | %s | %7ld KiB | %7ld KiB "
|
||||
"| %7ld KiB | %7ld KiB | %lu %lu |\n",
|
||||
info_list[it].pid, info_list[it].name, st.st_uid,
|
||||
bdf_str, info_list[it].mem / 1024,
|
||||
info_list[it].memory_usage.gtt_mem / 1024,
|
||||
info_list[it].memory_usage.cpu_mem / 1024,
|
||||
info_list[it].memory_usage.vram_mem / 1024,
|
||||
info_list[it].engine_usage.gfx,
|
||||
info_list[it].engine_usage.enc);
|
||||
process_info_list[it].pid, process_info_list[it].name, st.st_uid,
|
||||
bdf_str, process_info_list[it].mem / 1024,
|
||||
process_info_list[it].memory_usage.gtt_mem / 1024,
|
||||
process_info_list[it].memory_usage.cpu_mem / 1024,
|
||||
process_info_list[it].memory_usage.vram_mem / 1024,
|
||||
process_info_list[it].engine_usage.gfx,
|
||||
process_info_list[it].engine_usage.enc);
|
||||
else
|
||||
printf("| %5d | %16s | %10s | %s | %7ld KiB | %7ld KiB "
|
||||
"| %7ld KiB | %7ld KiB | %lu %lu |\n",
|
||||
info_list[it].pid, info_list[it].name,
|
||||
pwd->pw_name, bdf_str, info_list[it].mem / 1024,
|
||||
info_list[it].memory_usage.gtt_mem / 1024,
|
||||
info_list[it].memory_usage.cpu_mem / 1024,
|
||||
info_list[it].memory_usage.vram_mem / 1024,
|
||||
info_list[it].engine_usage.gfx,
|
||||
info_list[it].engine_usage.enc);
|
||||
mem += info_list[it].mem / 1024;
|
||||
gtt_mem += info_list[it].memory_usage.gtt_mem / 1024;
|
||||
cpu_mem += info_list[it].memory_usage.cpu_mem / 1024;
|
||||
vram_mem += info_list[it].memory_usage.vram_mem / 1024;
|
||||
gfx = info_list[it].engine_usage.gfx;
|
||||
enc = info_list[it].engine_usage.enc;
|
||||
process_info_list[it].pid, process_info_list[it].name,
|
||||
pwd->pw_name, bdf_str, process_info_list[it].mem / 1024,
|
||||
process_info_list[it].memory_usage.gtt_mem / 1024,
|
||||
process_info_list[it].memory_usage.cpu_mem / 1024,
|
||||
process_info_list[it].memory_usage.vram_mem / 1024,
|
||||
process_info_list[it].engine_usage.gfx,
|
||||
process_info_list[it].engine_usage.enc);
|
||||
mem += process_info_list[it].mem / 1024;
|
||||
gtt_mem += process_info_list[it].memory_usage.gtt_mem / 1024;
|
||||
cpu_mem += process_info_list[it].memory_usage.cpu_mem / 1024;
|
||||
vram_mem += process_info_list[it].memory_usage.vram_mem / 1024;
|
||||
gfx = process_info_list[it].engine_usage.gfx;
|
||||
enc = process_info_list[it].engine_usage.enc;
|
||||
printf(
|
||||
"+-------+------------------+------------+-------------"
|
||||
"-+-------------+-------------+-------------+----------"
|
||||
@@ -644,7 +637,9 @@ int main() {
|
||||
int64_t val_i64 = 0;
|
||||
ret = amdsmi_get_temp_metric(processor_handles[j], TEMPERATURE_TYPE_EDGE,
|
||||
AMDSMI_TEMP_CURRENT, &val_i64);
|
||||
CHK_AMDSMI_RET(ret)
|
||||
if (ret != amdsmi_status_t::AMDSMI_STATUS_NOT_SUPPORTED) {
|
||||
CHK_AMDSMI_RET(ret)
|
||||
}
|
||||
printf(" Output of amdsmi_get_temp_metric:\n");
|
||||
std::cout << "\t\tTemperature: " << val_i64 << "C"
|
||||
<< "\n\n";
|
||||
|
||||
@@ -657,9 +657,9 @@ typedef struct {
|
||||
uint32_t mm_activity;
|
||||
uint32_t reserved[13];
|
||||
} amdsmi_engine_usage_t;
|
||||
|
||||
typedef uint32_t amdsmi_process_handle_t;
|
||||
|
||||
|
||||
typedef struct {
|
||||
char name[AMDSMI_NORMAL_STRING_LENGTH];
|
||||
amdsmi_process_handle_t pid;
|
||||
@@ -679,6 +679,7 @@ typedef struct {
|
||||
uint32_t reserved[4];
|
||||
} amdsmi_proc_info_t;
|
||||
|
||||
|
||||
//! Guaranteed maximum possible number of supported frequencies
|
||||
#define AMDSMI_MAX_NUM_FREQUENCIES 33
|
||||
|
||||
@@ -4743,33 +4744,39 @@ amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_handle, amdsmi_vram_
|
||||
* number of processes currently running,
|
||||
* AMDSMI_STATUS_OUT_OF_RESOURCES will be returned.
|
||||
*
|
||||
* When max_processes is not zero (0), it specifies the list's size limit,
* that is, the maximum number of entries the list can hold. After the list is
* built internally, the return status is AMDSMI_STATUS_OUT_OF_RESOURCES when
* the original size limit is smaller than the actual number of running processes,
* so the caller knows the list needs to be resized; otherwise the status is
* AMDSMI_STATUS_SUCCESS.
* Keeping a copy of max_processes before it is passed in is helpful for monitoring
* the allocations done upon each call, since max_processes is always overwritten
* to reflect the actual number of processes running.
* Note: When the return status is AMDSMI_STATUS_NO_PERM, the process list and
* its size are still valid, but some process details could not be
* fully retrieved due to permissions.
|
||||
*
|
||||
*
|
||||
* @param[out] list Reference to a user-provided buffer where the process
|
||||
* list will be returned. This buffer must contain at least
|
||||
* max_processes entries of type smi_process_handle. Must be allocated
|
||||
* max_processes entries of type amdsmi_proc_info_t. Must be allocated
|
||||
* by user.
|
||||
*
|
||||
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
|
||||
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success,
|
||||
* | ::AMDSMI_STATUS_NO_PERM on success, but not all details from process retrieved,
|
||||
* | ::AMDSMI_STATUS_OUT_OF_RESOURCES, filled list buffer with data, but number of
|
||||
* actual running processes is larger than the size provided.
|
||||
*
|
||||
*/
|
||||
// Note: If the reserved size for processes is smaller than the number of
// processes actually running, AMDSMI_STATUS_OUT_OF_RESOURCES indicates
// that the caller should handle the situation (resize).
// max_processes is always updated to reflect the actual number of
// running processes, so the caller knows how large the list needs to be.
|
||||
//
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_process_list(amdsmi_processor_handle processor_handle, uint32_t *max_processes, amdsmi_process_handle_t *list);
|
||||
|
||||
/**
|
||||
* @brief Returns the process information of a given process.
|
||||
* Engine usage show how much time the process spend using these engines in ns.
|
||||
*
|
||||
* @platform{gpu_bm_linux} @platform{guest_1vf} @platform{guest_mvf} @platform{guest_windows}
|
||||
*
|
||||
* @param[in] processor_handle Device which to query
|
||||
*
|
||||
* @param[in] process Handle of process to query.
|
||||
*
|
||||
* @param[out] info Reference to a process information structure where to return
|
||||
* information. Must be allocated by user.
|
||||
*
|
||||
* @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail
|
||||
*/
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_process_info(amdsmi_processor_handle processor_handle, amdsmi_process_handle_t process, amdsmi_proc_info_t *info);
|
||||
amdsmi_get_gpu_process_list(amdsmi_processor_handle processor_handle, uint32_t *max_processes, amdsmi_proc_info_t *list);
|
||||
|
||||
/** @} End processinfo */
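The resize contract described in the comment above can be sketched as a retry loop. The simulation below is illustrative only: `fake_get_gpu_process_list` is a stand-in that mimics the documented behavior (fill up to the given capacity, always report the actual count, and signal when the buffer was too small), and the status values are placeholders rather than the real enum.

```python
STATUS_SUCCESS = 0
STATUS_OUT_OF_RESOURCES = 1  # placeholder, not the real enum value


def fake_get_gpu_process_list(running, capacity):
    """Stand-in for the C call: returns (status, filled entries, actual count)."""
    filled = list(running[:capacity])
    if capacity >= len(running):
        status = STATUS_SUCCESS
    else:
        status = STATUS_OUT_OF_RESOURCES
    return status, filled, len(running)


def collect_processes(running, initial_capacity=2, max_attempts=4):
    """Grow the buffer to the reported count until the whole list fits."""
    capacity = initial_capacity
    for attempt in range(1, max_attempts + 1):
        requested = capacity  # keep a copy: the call overwrites the size
        status, entries, actual = fake_get_gpu_process_list(running, capacity)
        print(f"attempt {attempt}: requested {requested}, running {actual}")
        if status == STATUS_SUCCESS:
            return entries
        # Buffer was too small: resize to the reported count and retry.
        capacity = actual
    raise RuntimeError("process list did not fit after resizing")


pids = [101, 102, 103, 104, 105]
result = collect_processes(pids)
print(result)
```

The attempt limit guards against a process list that keeps growing between calls, which a real caller would also need to consider.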
|
||||
|
||||
|
||||
@@ -53,7 +53,20 @@
|
||||
namespace amd {
|
||||
namespace smi {
|
||||
|
||||
|
||||
// PID, amdsmi_proc_info_t
|
||||
using GPUComputeProcessList_t = std::map<amdsmi_process_handle_t, amdsmi_proc_info_t>;
|
||||
using ComputeProcessListClassType_t = uint16_t;
|
||||
|
||||
enum class ComputeProcessListType_t : ComputeProcessListClassType_t
|
||||
{
|
||||
kAllProcesses,
|
||||
kAllProcessesOnDevice,
|
||||
};
|
||||
|
||||
|
||||
class AMDSmiGPUDevice: public AMDSmiProcessor {
|
||||
|
||||
public:
|
||||
AMDSmiGPUDevice(uint32_t gpu_id, uint32_t fd, std::string path, amdsmi_bdf_t bdf, AMDSmiDrm& drm):
|
||||
AMDSmiProcessor(AMD_GPU), gpu_id_(gpu_id), fd_(fd), path_(path), bdf_(bdf), drm_(drm) {}
|
||||
@@ -73,6 +86,10 @@ class AMDSmiGPUDevice: public AMDSmiProcessor {
|
||||
amdsmi_bdf_t get_bdf();
|
||||
bool check_if_drm_is_supported() { return drm_.check_if_drm_is_supported(); }
|
||||
uint32_t get_vendor_id();
|
||||
const GPUComputeProcessList_t& amdgpu_get_compute_process_list(ComputeProcessListType_t list_type = ComputeProcessListType_t::kAllProcessesOnDevice);
|
||||
const GPUComputeProcessList_t& amdgpu_get_all_compute_process_list() {
|
||||
return amdgpu_get_compute_process_list(ComputeProcessListType_t::kAllProcesses);
|
||||
}
|
||||
|
||||
amdsmi_status_t amdgpu_query_info(unsigned info_id,
|
||||
unsigned size, void *value) const;
|
||||
@@ -83,6 +100,7 @@ class AMDSmiGPUDevice: public AMDSmiProcessor {
|
||||
amdsmi_status_t amdgpu_query_vbios(void *info) const;
|
||||
amdsmi_status_t amdgpu_query_driver_name(std::string& name) const;
|
||||
amdsmi_status_t amdgpu_query_driver_date(std::string& date) const;
|
||||
|
||||
private:
|
||||
uint32_t gpu_id_;
|
||||
uint32_t fd_;
|
||||
@@ -90,6 +108,10 @@ class AMDSmiGPUDevice: public AMDSmiProcessor {
|
||||
amdsmi_bdf_t bdf_;
|
||||
uint32_t vendor_id_;
|
||||
AMDSmiDrm& drm_;
|
||||
GPUComputeProcessList_t compute_process_list_;
|
||||
int32_t get_compute_process_list_impl(GPUComputeProcessList_t& compute_process_list,
|
||||
ComputeProcessListType_t list_type);
|
||||
|
||||
};
|
||||
|
||||
|
||||
|
||||
@@ -882,13 +882,24 @@ except AmdSmiException as e:
|
||||
|
||||
### amdsmi_get_gpu_process_list
|
||||
|
||||
Description: Returns the list of processes for the given GPU
|
||||
Description: Returns the list of processes for the given GPU.
|
||||
The list is of type `amdsmi_proc_info_t` and holds information about the running process.
|
||||
|
||||
Input parameters:
|
||||
|
||||
* `processor_handle` device which to query
|
||||
|
||||
Output: List of process handles found
|
||||
Output: List of processes with fields
|
||||
|
||||
Output: Dictionary with fields
|
||||
|
||||
Field | Description
|
||||
---|---
|
||||
`name` | Name of process
|
||||
`pid` | Process ID
|
||||
`mem` | Process memory usage
|
||||
`engine_usage` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`gfx`</td><td>GFX engine usage in ns</td></tr><tr><td>`enc`</td><td>Encode engine usage in ns</td></tr></tbody></table>
|
||||
`memory_usage` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`gtt_mem`</td><td>GTT memory usage</td></tr><tr><td>`cpu_mem`</td><td>CPU memory usage</td></tr><tr><td>`vram_mem`</td><td>VRAM memory usage</td></tr> </tbody></table>
|
||||
|
||||
Exceptions that can be thrown by `amdsmi_get_gpu_process_list` function:
|
||||
|
||||
@@ -906,48 +917,11 @@ try:
|
||||
else:
|
||||
for device in devices:
|
||||
processes = amdsmi_get_gpu_process_list(device)
|
||||
print(processes)
|
||||
except AmdSmiException as e:
|
||||
print(e)
|
||||
```
|
||||
|
||||
### amdsmi_get_gpu_process_info
|
||||
|
||||
Description: Returns the info for the given process
|
||||
|
||||
Input parameters:
|
||||
|
||||
* `processor_handle` device which to query
|
||||
* `process_handle` process which to query
|
||||
|
||||
Output: Dictionary with fields
|
||||
|
||||
Field | Description
|
||||
---|---
|
||||
`name` | Name of process
|
||||
`pid` | Process ID
|
||||
`mem` | Process memory usage
|
||||
`engine_usage` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`gfx`</td><td>GFX engine usage in ns</td></tr><tr><td>`enc`</td><td>Encode engine usage in ns</td></tr></tbody></table>
|
||||
`memory_usage` | <table><thead><tr> <th> Subfield </th> <th> Description</th> </tr></thead><tbody><tr><td>`gtt_mem`</td><td>GTT memory usage</td></tr><tr><td>`cpu_mem`</td><td>CPU memory usage</td></tr><tr><td>`vram_mem`</td><td>VRAM memory usage</td></tr> </tbody></table>
|
||||
|
||||
Exceptions that can be thrown by `amdsmi_get_gpu_process_info` function:
|
||||
|
||||
* `AmdSmiLibraryException`
|
||||
* `AmdSmiRetryException`
|
||||
* `AmdSmiParameterException`
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
try:
|
||||
devices = amdsmi_get_processor_handles()
|
||||
if len(devices) == 0:
|
||||
print("No GPUs on machine")
|
||||
else:
|
||||
for device in devices:
|
||||
processes = amdsmi_get_gpu_process_list(device)
|
||||
for process in processes:
|
||||
print(amdsmi_get_gpu_process_info(device, process))
|
||||
if len(processes) == 0:
|
||||
print("No processes running on this GPU")
|
||||
else:
|
||||
for process in processes:
|
||||
print(process)
|
||||
except AmdSmiException as e:
|
||||
print(e)
|
||||
```
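Once the per-process entries are available as dictionaries with the fields documented above, they can be summarized without further library calls. The sketch below uses made-up sample data and hypothetical helper names. It assumes memory values are in bytes, which the C++ example's division by 1024 into KiB suggests but this README does not state.

```python
# Made-up sample entries using the documented field names.
sample_processes = [
    {
        "name": "train.py", "pid": 4101, "mem": 3 * 1024 * 1024,
        "engine_usage": {"gfx": 1200, "enc": 0},
        "memory_usage": {"gtt_mem": 512 * 1024, "cpu_mem": 512 * 1024,
                         "vram_mem": 2048 * 1024},
    },
    {
        "name": "render", "pid": 4202, "mem": 2 * 1024 * 1024,
        "engine_usage": {"gfx": 300, "enc": 50},
        "memory_usage": {"gtt_mem": 256 * 1024, "cpu_mem": 768 * 1024,
                         "vram_mem": 1024 * 1024},
    },
]


def total_vram_kib(processes):
    """Sum VRAM usage across processes, converted from bytes to KiB."""
    return sum(p["memory_usage"]["vram_mem"] for p in processes) // 1024


def top_by_vram(processes, count=1):
    """Return the `count` processes using the most VRAM, largest first."""
    return sorted(processes,
                  key=lambda p: p["memory_usage"]["vram_mem"],
                  reverse=True)[:count]


print("Total VRAM (KiB):", total_vram_kib(sample_processes))
for proc in top_by_vram(sample_processes):
    print("Largest consumer:", proc["pid"], proc["name"])
```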
|
||||
|
||||
@@ -1923,15 +1923,16 @@ def amdsmi_get_gpu_ras_block_features_enabled(
|
||||
|
||||
def amdsmi_get_gpu_process_list(
|
||||
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
|
||||
) -> List[amdsmi_wrapper.amdsmi_process_handle_t]:
|
||||
) -> List[amdsmi_wrapper.amdsmi_proc_info_t]:
|
||||
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
|
||||
raise AmdSmiParameterException(
|
||||
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
|
||||
)
|
||||
|
||||
# This will get populated with the number of processes found
|
||||
max_processes = ctypes.c_uint32(MAX_NUM_PROCESSES)
|
||||
|
||||
process_list = (amdsmi_wrapper.amdsmi_process_handle_t *
|
||||
process_list = (amdsmi_wrapper.amdsmi_proc_info_t *
|
||||
max_processes.value)()
|
||||
_check_res(
|
||||
amdsmi_wrapper.amdsmi_get_gpu_process_list(
|
||||
@@ -1939,42 +1940,37 @@ def amdsmi_get_gpu_process_list(
|
||||
)
|
||||
)
|
||||
|
||||
return [amdsmi_wrapper.amdsmi_process_handle_t(process_list[x])\
|
||||
for x in range(0, max_processes.value)]
|
||||
result = []
|
||||
for index in range(max_processes.value):
|
||||
result.append(process_list[index])
|
||||
return result
|
||||
|
||||
|
||||
def amdsmi_get_gpu_process_info(
|
||||
processor_handle: amdsmi_wrapper.amdsmi_processor_handle,
|
||||
process: amdsmi_wrapper.amdsmi_process_handle_t,
|
||||
process: amdsmi_wrapper.amdsmi_proc_info_t,
|
||||
) -> Dict[str, Any]:
|
||||
if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle):
|
||||
raise AmdSmiParameterException(
|
||||
processor_handle, amdsmi_wrapper.amdsmi_processor_handle
|
||||
)
|
||||
|
||||
if not isinstance(process, amdsmi_wrapper.amdsmi_process_handle_t):
|
||||
if not isinstance(process, amdsmi_wrapper.amdsmi_proc_info_t):
|
||||
raise AmdSmiParameterException(
|
||||
process, amdsmi_wrapper.amdsmi_process_handle_t)
|
||||
|
||||
info = amdsmi_wrapper.amdsmi_proc_info_t()
|
||||
_check_res(
|
||||
amdsmi_wrapper.amdsmi_get_gpu_process_info(
|
||||
processor_handle, process, ctypes.byref(info)
|
||||
)
|
||||
)
|
||||
process, amdsmi_wrapper.amdsmi_proc_info_t)
|
||||
|
||||
return {
|
||||
"name": info.name.decode("utf-8"),
|
||||
"pid": info.pid,
|
||||
"mem": info.mem,
|
||||
"name": process.name.decode("utf-8"),
|
||||
"pid": process.pid,
|
||||
"mem": process.mem,
|
||||
"engine_usage": {
|
||||
"gfx": info.engine_usage.gfx,
|
||||
"enc": info.engine_usage.enc
|
||||
"gfx": process.engine_usage.gfx,
|
||||
"enc": process.engine_usage.enc
|
||||
},
|
||||
"memory_usage": {
|
||||
"gtt_mem": info.memory_usage.gtt_mem,
|
||||
"cpu_mem": info.memory_usage.cpu_mem,
|
||||
"vram_mem": info.memory_usage.vram_mem,
|
||||
"gtt_mem": process.memory_usage.gtt_mem,
|
||||
"cpu_mem": process.memory_usage.cpu_mem,
|
||||
"vram_mem": process.memory_usage.vram_mem,
|
||||
},
|
||||
}
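The conversion above reads fields directly from a `ctypes` structure taken from a caller-allocated array. The standalone sketch below mirrors that shape so it can run without `libamd_smi.so`. The structure definitions are simplified assumptions (field order, integer widths, and the 32-byte name length are guesses for illustration), not the generated `amdsmi_wrapper` types.

```python
import ctypes


class EngineUsage(ctypes.Structure):
    _fields_ = [("gfx", ctypes.c_uint64), ("enc", ctypes.c_uint64)]


class MemoryUsage(ctypes.Structure):
    _fields_ = [("gtt_mem", ctypes.c_uint64),
                ("cpu_mem", ctypes.c_uint64),
                ("vram_mem", ctypes.c_uint64)]


class ProcInfo(ctypes.Structure):
    # Simplified stand-in for amdsmi_proc_info_t.
    _fields_ = [("name", ctypes.c_char * 32),
                ("pid", ctypes.c_uint32),
                ("mem", ctypes.c_uint64),
                ("engine_usage", EngineUsage),
                ("memory_usage", MemoryUsage)]


def proc_info_to_dict(process: ProcInfo) -> dict:
    """Copy the structure's fields into plain Python types."""
    return {
        "name": process.name.decode("utf-8"),
        "pid": process.pid,
        "mem": process.mem,
        "engine_usage": {"gfx": process.engine_usage.gfx,
                         "enc": process.engine_usage.enc},
        "memory_usage": {"gtt_mem": process.memory_usage.gtt_mem,
                         "cpu_mem": process.memory_usage.cpu_mem,
                         "vram_mem": process.memory_usage.vram_mem},
    }


# Allocate a buffer the way the wrapper does: element type * count.
buffer = (ProcInfo * 2)()
buffer[0].name = b"rocm-app"
buffer[0].pid = 1234
buffer[0].mem = 8192
buffer[0].memory_usage.vram_mem = 4096

converted = [proc_info_to_dict(buffer[i]) for i in range(len(buffer))]
print(converted[0])
```

Converting to dictionaries right after the call means the results no longer depend on the lifetime of the `ctypes` buffer.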
|
||||
|
||||
|
||||
@@ -2212,10 +2212,7 @@ amdsmi_get_gpu_vram_usage.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_vram_usage.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_vram_usage_t)]
|
||||
amdsmi_get_gpu_process_list = _libraries['libamd_smi.so'].amdsmi_get_gpu_process_list
|
||||
amdsmi_get_gpu_process_list.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_process_list.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint32), ctypes.POINTER(ctypes.c_uint32)]
|
||||
amdsmi_get_gpu_process_info = _libraries['libamd_smi.so'].amdsmi_get_gpu_process_info
|
||||
amdsmi_get_gpu_process_info.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_process_info.argtypes = [amdsmi_processor_handle, amdsmi_process_handle_t, ctypes.POINTER(struct_amdsmi_proc_info_t)]
|
||||
amdsmi_get_gpu_process_list.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint32), ctypes.POINTER(struct_amdsmi_proc_info_t)]
|
||||
amdsmi_get_gpu_total_ecc_count = _libraries['libamd_smi.so'].amdsmi_get_gpu_total_ecc_count
|
||||
amdsmi_get_gpu_total_ecc_count.restype = amdsmi_status_t
|
||||
amdsmi_get_gpu_total_ecc_count.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_error_count_t)]
|
||||
@@ -2580,7 +2577,7 @@ __all__ = \
|
||||
'amdsmi_get_gpu_pci_throughput', 'amdsmi_get_gpu_perf_level',
|
||||
'amdsmi_get_gpu_pm_metrics_info',
|
||||
'amdsmi_get_gpu_power_profile_presets',
|
||||
'amdsmi_get_gpu_process_info', 'amdsmi_get_gpu_process_list',
|
||||
'amdsmi_get_gpu_process_list',
|
||||
'amdsmi_get_gpu_ras_block_features_enabled',
|
||||
'amdsmi_get_gpu_ras_feature_info',
|
||||
'amdsmi_get_gpu_reg_table_info', 'amdsmi_get_gpu_revision',
|
||||
|
||||
@@ -902,7 +902,7 @@ typedef struct {
|
||||
struct {
|
||||
uint32_t cache_size_kb; /* In KB */
|
||||
uint32_t cache_level;
|
||||
/*
|
||||
/*
|
||||
HSA_CACHE_TYPE_DATA 0x00000001
|
||||
HSA_CACHE_TYPE_INSTRUCTION 0x00000002
|
||||
HSA_CACHE_TYPE_CPU 0x00000004
|
||||
@@ -1248,12 +1248,14 @@ typedef struct {
|
||||
*/
|
||||
typedef struct {
|
||||
uint32_t process_id; //!< Process ID
|
||||
uint32_t pasid; //!< PASID
|
||||
uint32_t pasid; //!< PASID: (Process Address Space ID)
|
||||
uint64_t vram_usage; //!< VRAM usage
|
||||
uint64_t sdma_usage; //!< SDMA usage in microseconds
|
||||
uint32_t cu_occupancy; //!< Compute Unit usage in percent
|
||||
} rsmi_process_info_t;
|
||||
|
||||
//! CU occupancy invalidation value for the GFX revisions not providing cu_occupancy debugfs method
|
||||
#define CU_OCCUPANCY_INVALID 0xFFFFFFFF
|
||||
|
||||
/**
|
||||
* @brief Opaque handle to function-support object
|
||||
@@ -1447,7 +1449,7 @@ rsmi_status_t rsmi_dev_vendor_id_get(uint32_t dv_ind, uint16_t *id);
|
||||
*
|
||||
* @details Given a device index @p dv_ind, a pointer to a caller provided
|
||||
* char buffer @p name, and a length of this buffer @p len, this function will
|
||||
* write the name of the PCIe vendor (up to @p len characters) buffer @p name.
|
||||
* write the name of the PCIe vendor (up to @p len characters) buffer @p name.
|
||||
*
|
||||
* If the integer ID associated with the PCIe vendor is not found in one of the
|
||||
* system files containing device name information (e.g.
|
||||
@@ -2294,9 +2296,9 @@ rsmi_dev_memory_total_get(uint32_t dv_ind, rsmi_memory_type_t mem_type,
|
||||
|
||||
/**
|
||||
* @brief Get gpu cache info.
|
||||
*
|
||||
* @details Given a device index @p dv_ind, and a pointer to a cache
|
||||
* info @p info, this function will write the cache size and level
|
||||
*
|
||||
* @details Given a device index @p dv_ind, and a pointer to a cache
|
||||
* info @p info, this function will write the cache size and level
|
||||
* to the location pointed to by @p info.
|
||||
* @param[in] dv_ind a device index
|
||||
*
|
||||
@@ -2930,16 +2932,16 @@ rsmi_status_t rsmi_dev_gpu_metrics_info_get(uint32_t dv_ind,
|
||||
* @brief Get the pm metrics table with provided device index.
|
||||
*
|
||||
* @details Given a device index @p dv_ind, @p pm_metrics pointer,
|
||||
* and @p num_of_metrics pointer,
|
||||
* and @p num_of_metrics pointer,
|
||||
* this function will write the pm metrics name value pair
|
||||
* to the array at @p pm_metrics and the number of metrics retrieved to @p num_of_metrics
|
||||
* Note: the library allocated memory for pm_metrics, and user must call
|
||||
* free(pm_metrics) to free it after use.
|
||||
*
|
||||
*
|
||||
* @param[in] dv_ind a device index
|
||||
*
|
||||
* @param[inout] pm_metrics A pointer to an array to hold multiple PM metrics. On success,
|
||||
* the library will allocate memory of pm_metrics and write metrics to this array.
|
||||
* the library will allocate memory of pm_metrics and write metrics to this array.
|
||||
* The caller must free this memory after usage to avoid memory leak.
|
||||
*
|
||||
* @param[inout] num_of_metrics a pointer to uint32_t to which the number of
|
||||
@@ -2964,18 +2966,18 @@ rsmi_status_t rsmi_dev_pm_metrics_info_get(uint32_t dv_ind,
|
||||
* @brief Get the register metrics table with provided device index and register type.
|
||||
*
|
||||
* @details Given a device index @p dv_ind, @p reg_type, @p reg_metrics pointer,
|
||||
* and @p num_of_metrics pointer,
|
||||
* and @p num_of_metrics pointer,
|
||||
* this function will write the register metrics name value pair
|
||||
* to the array at @p reg_metrics and the number of metrics retrieved to @p num_of_metrics
|
||||
* Note: the library allocated memory for reg_metrics, and user must call
|
||||
* free(reg_metrics) to free it after use.
|
||||
*
|
||||
*
|
||||
* @param[in] dv_ind a device index
|
||||
*
|
||||
*
|
||||
* @param[in] reg_type The register type
|
||||
*
|
||||
* @param[inout] reg_metrics A pointer to an array to hold multiple register metrics. On success,
|
||||
* the library will allocate memory of reg_metrics and write metrics to this array.
|
||||
* the library will allocate memory of reg_metrics and write metrics to this array.
|
||||
* The caller must free this memory after usage to avoid memory leak.
|
||||
*
|
||||
* @param[inout] num_of_metrics a pointer to uint32_t to which the number of
|
||||
|
||||
@@ -526,7 +526,9 @@ int GetProcessInfoForPID(uint32_t pid, rsmi_process_info_t *proc,
|
||||
// Collect count of compute units
|
||||
cu_count += kfd_node_map[gpu_id]->cu_count();
|
||||
} else {
|
||||
return err;
|
||||
//Some GFX revisions do not provide cu_occupancy debugfs method
|
||||
proc->cu_occupancy = CU_OCCUPANCY_INVALID;
|
||||
cu_count = 0;
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
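With this hunk, a missing `cu_occupancy` debugfs entry no longer aborts `GetProcessInfoForPID`; the field is set to `CU_OCCUPANCY_INVALID` instead. A caller therefore needs to check for that sentinel before treating the value as a percentage. The following is a minimal sketch of such a check, not code from this commit: `kCuOccupancyInvalid`, `cu_occupancy_percent`, and `format_cu_occupancy` are hypothetical names introduced for illustration, and only the sentinel value `0xFFFFFFFF` comes from the diff.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Same value as CU_OCCUPANCY_INVALID in the header hunk; redefined here
// (under a hypothetical name) so the sketch compiles on its own.
constexpr uint32_t kCuOccupancyInvalid = 0xFFFFFFFF;

// Hypothetical helper: maps the raw field to "no value" when the sentinel
// is present, otherwise passes the percentage through unchanged.
std::optional<uint32_t> cu_occupancy_percent(uint32_t raw) {
  if (raw == kCuOccupancyInvalid) {
    return std::nullopt;
  }
  return raw;
}

// Hypothetical helper: renders the field for display, using "N/A" when the
// GFX revision does not expose cu_occupancy.
std::string format_cu_occupancy(uint32_t raw) {
  const auto pct = cu_occupancy_percent(raw);
  return pct ? std::to_string(*pct) + "%" : std::string("N/A");
}
```

A display layer could call `format_cu_occupancy(proc.cu_occupancy)` so that unsupported hardware shows "N/A" rather than a very large percentage.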
@@ -1785,76 +1785,55 @@ amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_
|
||||
}
|
||||
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_process_list(amdsmi_processor_handle processor_handle, uint32_t *max_processes, amdsmi_process_handle_t *list) {
|
||||
amdsmi_get_gpu_process_list(amdsmi_processor_handle processor_handle, uint32_t *max_processes, amdsmi_proc_info_t *list) {
|
||||
AMDSMI_CHECK_INIT();
|
||||
|
||||
if (max_processes == nullptr) {
|
||||
if (!max_processes) {
|
||||
return AMDSMI_STATUS_INVAL;
|
||||
}
|
||||
|
||||
std::vector<long int> pids;
|
||||
uint32_t i = 0;
|
||||
uint64_t size = 0;
|
||||
amdsmi_status_t status;
|
||||
amd::smi::AMDSmiGPUDevice* gpu_device = nullptr;
|
||||
amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device);
|
||||
if (r != AMDSMI_STATUS_SUCCESS)
|
||||
return r;
|
||||
amdsmi_status_t status_code = get_gpu_device_from_handle(processor_handle, &gpu_device);
|
||||
if (status_code != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
return status_code;
|
||||
}
|
||||
|
||||
if (gpu_device->check_if_drm_is_supported()){
|
||||
amdsmi_bdf_t bdf = gpu_device->get_bdf();
|
||||
status = gpuvsmi_get_pids(bdf, pids, &size);
|
||||
if (status != AMDSMI_STATUS_SUCCESS) {
|
||||
return status;
|
||||
}
|
||||
if (*max_processes == 0 || (pids.size() == 0)) {
|
||||
*max_processes = (uint32_t)pids.size();
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
if (!list) {
|
||||
return AMDSMI_STATUS_INVAL;
|
||||
}
|
||||
if (*max_processes < pids.size()) {
|
||||
return AMDSMI_STATUS_OUT_OF_RESOURCES;
|
||||
}
|
||||
for (auto &pid : pids) {
|
||||
if (i >= *max_processes) {
|
||||
break;
|
||||
auto compute_process_list = gpu_device->amdgpu_get_compute_process_list();
|
||||
if ((*max_processes == 0) || compute_process_list.empty()) {
|
||||
*max_processes = static_cast<uint32_t>(compute_process_list.size());
|
||||
return amdsmi_status_t::AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
if (!list) {
|
||||
return amdsmi_status_t::AMDSMI_STATUS_INVAL;
|
||||
}
|
||||
|
||||
const auto max_processes_original_size(*max_processes);
|
||||
auto idx = uint32_t(0);
|
||||
auto is_required_previlegies_required(false);
|
||||
for (auto& process : compute_process_list) {
|
||||
if (idx < *max_processes) {
|
||||
list[idx++] = static_cast<amdsmi_proc_info_t>(process.second);
|
||||
// Note: If we could not read the process info for an existing process,
|
||||
// that is likely a permission error.
|
||||
if (!is_required_previlegies_required && std::string(process.second.name).empty()) {
|
||||
is_required_previlegies_required = true;
|
||||
}
|
||||
list[i++] = (uint32_t)pid;
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
*max_processes = (uint32_t)pids.size();
|
||||
}
|
||||
else {
|
||||
// rocm
|
||||
}
|
||||
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
|
||||
amdsmi_status_t
|
||||
amdsmi_get_gpu_process_info(amdsmi_processor_handle processor_handle, amdsmi_process_handle_t process, amdsmi_proc_info_t *info) {
|
||||
AMDSMI_CHECK_INIT();
|
||||
|
||||
if (info == nullptr) {
|
||||
return AMDSMI_STATUS_INVAL;
|
||||
}
|
||||
|
||||
amd::smi::AMDSmiGPUDevice* gpu_device = nullptr;
|
||||
amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device);
|
||||
if (r != AMDSMI_STATUS_SUCCESS)
|
||||
return r;
|
||||
|
||||
amdsmi_status_t status;
|
||||
if (gpu_device->check_if_drm_is_supported()) {
|
||||
status = gpuvsmi_get_pid_info(gpu_device->get_bdf(), process, *info);
|
||||
if (status != AMDSMI_STATUS_SUCCESS) return status;
|
||||
}
|
||||
else {
|
||||
// rocm
|
||||
}
|
||||
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
// Note: If the reserved size for processes is smaller than the number of
|
||||
// actual processes running, AMDSMI_STATUS_OUT_OF_RESOURCES is
|
||||
// an indication the caller should handle the situation (resize).
|
||||
// The max_processes is always changed to reflect the actual size of
|
||||
// list of processes running, so the caller knows where it is at.
|
||||
// Holding a copy of max_process before it is passed in will be helpful
|
||||
// for the caller.
|
||||
status_code = is_required_previlegies_required
|
||||
? amdsmi_status_t::AMDSMI_STATUS_NO_PERM : AMDSMI_STATUS_SUCCESS;
|
||||
*max_processes = static_cast<uint32_t>(compute_process_list.size());
|
||||
return (max_processes_original_size >= static_cast<uint32_t>(compute_process_list.size()))
|
||||
? status_code : amdsmi_status_t::AMDSMI_STATUS_OUT_OF_RESOURCES;
|
||||
}
|
||||
|
||||
amdsmi_status_t
|
||||
|
||||
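The comment at the end of the new `amdsmi_get_gpu_process_list` describes a caller-side contract: passing `*max_processes == 0` returns the required size, and a buffer that is too small yields `AMDSMI_STATUS_OUT_OF_RESOURCES` with `*max_processes` updated to the actual count. The sketch below illustrates that query-then-resize pattern against a mock. Everything in it is an assumption for illustration: `Status`, `ProcInfo`, `mock_get_process_list`, and `fetch_all` are hypothetical stand-ins, not the library's types or functions, and the `AMDSMI_STATUS_NO_PERM` path is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for amdsmi_status_t and amdsmi_proc_info_t.
enum class Status { kSuccess, kInval, kOutOfResources };
struct ProcInfo { uint32_t pid; };

// Hypothetical mock following the contract described in the diff:
//  - *max_processes == 0 (or nothing running): report the size, succeed
//  - buffer too small: fill what fits, report the actual size,
//    return kOutOfResources
Status mock_get_process_list(const std::vector<ProcInfo>& running,
                             uint32_t* max_processes, ProcInfo* list) {
  if (!max_processes) return Status::kInval;
  const auto actual = static_cast<uint32_t>(running.size());
  if (*max_processes == 0 || running.empty()) {
    *max_processes = actual;
    return Status::kSuccess;
  }
  if (!list) return Status::kInval;
  const uint32_t capacity = *max_processes;  // keep the caller's original size
  for (uint32_t i = 0; i < actual && i < capacity; ++i) list[i] = running[i];
  *max_processes = actual;
  return capacity >= actual ? Status::kSuccess : Status::kOutOfResources;
}

// Caller-side pattern: ask for the size, allocate, and grow the buffer
// whenever the call reports that more processes exist than were reserved.
std::vector<ProcInfo> fetch_all(const std::vector<ProcInfo>& running) {
  uint32_t count = 0;
  if (mock_get_process_list(running, &count, nullptr) != Status::kSuccess) {
    return {};
  }
  std::vector<ProcInfo> out(count);
  while (count > 0) {
    count = static_cast<uint32_t>(out.size());
    const Status s = mock_get_process_list(running, &count, out.data());
    if (s == Status::kSuccess) {
      out.resize(count);  // shrink if fewer processes are running now
      break;
    }
    if (s != Status::kOutOfResources) return {};
    out.resize(count);  // count now holds the actual size; retry
  }
  return out;
}
```

Keeping a copy of the reserved capacity before the call, as the source comment suggests, is what lets the caller tell how many entries were actually written when the status is out-of-resources.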
@@ -41,10 +41,16 @@
|
||||
*
|
||||
*/
|
||||
|
||||
#include <functional>
|
||||
#include "amd_smi/impl/amd_smi_gpu_device.h"
|
||||
#include "amd_smi/impl/amd_smi_common.h"
|
||||
#include "amd_smi/impl/fdinfo.h"
|
||||
#include "rocm_smi/rocm_smi_kfd.h"
|
||||
#include "rocm_smi/rocm_smi_utils.h"
|
||||
|
||||
#include <functional>
|
||||
#include <map>
|
||||
#include <memory>
|
||||
#include <unordered_set>
|
||||
|
||||
namespace amd {
|
||||
namespace smi {
|
||||
@@ -148,6 +154,153 @@ amdsmi_status_t AMDSmiGPUDevice::amdgpu_query_vbios(void *info) const {
|
||||
return drm_.amdgpu_query_vbios(fd, info);
|
||||
}
|
||||
|
||||
|
||||
int32_t AMDSmiGPUDevice::get_compute_process_list_impl(GPUComputeProcessList_t& compute_process_list,
|
||||
ComputeProcessListType_t list_type)
|
||||
{
|
||||
/**
|
||||
* The first call to GetProcessInfo() helps to find the size it needs,
|
||||
* so we can create a tailored size list.
|
||||
*/
|
||||
auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS);
|
||||
auto list_process_running_size = uint32_t(0);
|
||||
auto list_process_allocation_size = uint32_t(0);
|
||||
|
||||
status_code = rsmi_compute_process_info_get(nullptr, &list_process_running_size);
|
||||
if ((status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) || (list_process_running_size <= 0)) {
|
||||
return status_code;
|
||||
}
|
||||
|
||||
/**
|
||||
* The second call to GetProcessInfo() helps to set proper sizes for both,
|
||||
* the raw array of processes (rsmi_process_info_t) and list of processes (amdsmi_proc_info_t).
|
||||
*/
|
||||
using RsmiDeviceList_t = uint32_t[];
|
||||
using RsmiProcessList_t = rsmi_process_info_t[];
|
||||
std::unique_ptr<RsmiProcessList_t> list_all_processes_ptr = std::make_unique<RsmiProcessList_t>(list_process_running_size);
|
||||
|
||||
list_process_allocation_size = list_process_running_size;
|
||||
status_code = rsmi_compute_process_info_get(list_all_processes_ptr.get(), &list_process_allocation_size);
|
||||
if (status_code) {
|
||||
return status_code;
|
||||
}
|
||||
|
||||
// Restore the original size to read
|
||||
list_process_running_size = list_process_allocation_size;
|
||||
if (list_process_running_size <= 0) {
|
||||
return rsmi_status_t::RSMI_STATUS_NOT_FOUND;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Setup for the cases where the process list is by device.
|
||||
*/
|
||||
auto list_device_running_size = uint32_t(0);
|
||||
auto list_device_allocation_size = uint32_t(0);
|
||||
status_code = rsmi_num_monitor_devices(&list_device_running_size);
|
||||
if ((status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) || (list_device_running_size <= 0)) {
|
||||
return status_code;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Complete the process information
|
||||
*/
|
||||
auto get_process_info = [&](const rsmi_process_info_t& rsmi_proc_info, amdsmi_proc_info_t& asmi_proc_info) {
|
||||
auto status_code = gpuvsmi_get_pid_info(get_bdf(), rsmi_proc_info.process_id, asmi_proc_info);
|
||||
// If we cannot get the info from sysfs, save the minimum info
|
||||
if (status_code != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
asmi_proc_info.pid = rsmi_proc_info.process_id;
|
||||
asmi_proc_info.memory_usage.vram_mem = rsmi_proc_info.vram_usage;
|
||||
}
|
||||
|
||||
return status_code;
|
||||
};
|
||||
|
||||
/**
|
||||
* Get process information
|
||||
*/
|
||||
auto update_list_by_running_process = [&](const uint32_t process_id) {
|
||||
auto status_result(true);
|
||||
rsmi_process_info_t rsmi_proc_info{};
|
||||
auto status_code = rsmi_compute_process_info_by_pid_get(process_id, &rsmi_proc_info);
|
||||
if (status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) {
|
||||
status_result = false;
|
||||
return status_result;
|
||||
}
|
||||
|
||||
amdsmi_proc_info_t tmp_asmi_proc_info{};
|
||||
get_process_info(rsmi_proc_info, tmp_asmi_proc_info);
|
||||
compute_process_list.emplace(process_id, tmp_asmi_proc_info);
|
||||
|
||||
return status_result;
|
||||
};
|
||||
|
||||
|
||||
/**
|
||||
* Devices used by a process.
|
||||
*/
|
||||
auto update_list_by_running_device = [&](const uint32_t process_id,
|
||||
const uint32_t proc_addr_id) {
|
||||
// Get all devices running this process
|
||||
auto status_result(true);
|
||||
std::unique_ptr<RsmiDeviceList_t> list_device_ptr = std::make_unique<RsmiDeviceList_t>(list_device_running_size);
|
||||
list_device_allocation_size = list_device_running_size;
|
||||
auto status_code = rsmi_compute_process_gpus_get(process_id, list_device_ptr.get(), &list_device_allocation_size);
|
||||
if (status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) {
|
||||
status_result = false;
|
||||
return status_result;
|
||||
}
|
||||
|
||||
for (auto device_idx = uint32_t(0); device_idx < list_device_allocation_size; ++device_idx) {
|
||||
// Is this device running this process?
|
||||
if (list_device_ptr[device_idx] == get_gpu_id()) {
|
||||
rsmi_process_info_t rsmi_dev_proc_info{};
|
||||
auto status_code = rsmi_compute_process_info_by_device_get(process_id, list_device_ptr[device_idx], &rsmi_dev_proc_info);
|
||||
if ((status_code == rsmi_status_t::RSMI_STATUS_SUCCESS) &&
|
||||
((rsmi_dev_proc_info.process_id == process_id) && (rsmi_dev_proc_info.pasid == proc_addr_id))) {
|
||||
amdsmi_proc_info_t tmp_asmi_proc_info{};
|
||||
get_process_info(rsmi_dev_proc_info, tmp_asmi_proc_info);
|
||||
compute_process_list.emplace(process_id, tmp_asmi_proc_info);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return status_result;
|
||||
};
|
||||
|
||||
|
||||
/**
|
||||
* Transfer/Save the ones linked to this device.
|
||||
*/
|
||||
compute_process_list.clear();
|
||||
for (auto process_idx = uint32_t(0); process_idx < list_process_running_size; ++process_idx) {
|
||||
if (list_type == ComputeProcessListType_t::kAllProcesses) {
|
||||
if (update_list_by_running_process(list_all_processes_ptr[process_idx].process_id)) {
|
||||
}
|
||||
}
|
||||
|
||||
if (list_type == ComputeProcessListType_t::kAllProcessesOnDevice) {
|
||||
if (update_list_by_running_device(list_all_processes_ptr[process_idx].process_id,
|
||||
list_all_processes_ptr[process_idx].pasid)) {
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
return status_code;
|
||||
}
|
||||
|
||||
const GPUComputeProcessList_t& AMDSmiGPUDevice::amdgpu_get_compute_process_list(ComputeProcessListType_t list_type)
|
||||
{
|
||||
auto error_code = get_compute_process_list_impl(compute_process_list_, list_type);
|
||||
if (error_code) {
|
||||
compute_process_list_.clear();
|
||||
}
|
||||
|
||||
return compute_process_list_;
|
||||
}
|
||||
|
||||
|
||||
} // namespace smi
|
||||
} // namespace amd
|
||||
|
||||
|
||||
@@ -220,12 +220,10 @@ amdsmi_status_t gpuvsmi_get_pid_info(const amdsmi_bdf_t &bdf, long int pid,
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
closedir(d);
|
||||
|
||||
if (!pasids.size())
|
||||
return AMDSMI_STATUS_NOT_FOUND;
|
||||
|
||||
// Note: If possible at all, try to get the name of the process/container.
|
||||
// In case the other info fails, get at least something.
|
||||
std::ifstream filename(name_path.c_str());
|
||||
std::string name;
|
||||
|
||||
@@ -252,9 +250,12 @@ amdsmi_status_t gpuvsmi_get_pid_info(const amdsmi_bdf_t &bdf, long int pid,
|
||||
if (strlen(info.container_name) > 0)
|
||||
break;
|
||||
}
|
||||
|
||||
info.pid = (uint32_t)pid;
|
||||
|
||||
if (!pasids.size()) {
|
||||
return AMDSMI_STATUS_NOT_FOUND;
|
||||
}
|
||||
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
@@ -226,4 +226,5 @@ void TestProcInfoRead::Run(void) {
|
||||
}
|
||||
}
|
||||
delete []procs;
|
||||
|
||||
}
|
||||
|
||||