[SWDEV-483526] Fix MI3x partitions not showing all logical nodes
Changes:
- Updates to amdsmi_asic_info_t structure to include:
target_graphics_version, kfd_id, node_id, partition_id
- Updates to amd-smi static --asic to display new
amdsmi_asic_info_t fields
- Updates to gpu enumeration during amdsmi_init()
to discover all logical GPUs when in a non-SPX mode
(ex. DPX, TPX, QPX, or CPX)
- Updates to amdsmi_get_gpu_bdf_id(..) to include
partition_id details when in BDF or optional bits.
- bits [63:32] = domain
- bits [31:28] or bits [2:0] = partition id
- bits [27:16] = reserved
- bits [15:8] = Bus
- bits [7:3] = Device
- bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
- C++/Python tests updated to reflect these outputs
Change-Id: I4be0ea35bb98f3109ae2ca9e82f6b21baa38de29
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/amdsmi commit: a33e4c9e14]
This commit is contained in:
committed by
Maisam Arif
parent
202ddc01aa
commit
df9d5d3ee5
@@ -175,6 +175,98 @@ Legend:
|
||||
64,32 = 64 bit and 32 bit atomic support
|
||||
<BW from>-<BW to>
|
||||
```
|
||||
- **Added `target_graphics_version`, `kfd_id`, `node_id`, and `partition_id` to `amd-smi static --asic`**.
|
||||
Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers
|
||||
were placed within the `amdsmi_asic_info_t` struct. These new fields are only available for BM/Guest Linux
|
||||
devices at this time.
|
||||
|
||||
```C
|
||||
typedef struct {
|
||||
char market_name[AMDSMI_256_LENGTH];
|
||||
uint32_t vendor_id; //< Use 32 bit to be compatible with other platform.
|
||||
char vendor_name[AMDSMI_MAX_STRING_LENGTH];
|
||||
uint32_t subvendor_id; //< The subsystem vendor id
|
||||
uint64_t device_id; //< The device id of a GPU
|
||||
uint32_t rev_id;
|
||||
char asic_serial[AMDSMI_NORMAL_STRING_LENGTH];
|
||||
uint32_t oam_id; //< 0xFFFF if not supported
|
||||
uint32_t num_of_compute_units; //< 0xFFFFFFFF if not supported
|
||||
uint64_t target_graphics_version; //< 0xFFFFFFFFFFFFFFFF if not supported
|
||||
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
|
||||
uint32_t node_id; //< 0xFFFFFFFF if not supported
|
||||
uint32_t partition_id; //< 0xFFFFFFFF if not supported
|
||||
uint32_t reserved[17];
|
||||
} amdsmi_asic_info_t;
|
||||
```
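The struct comments mark unsupported fields with all-ones sentinel values. As a minimal sketch of how a caller might turn those sentinels into a readable `N/A`, assuming the values have already been read into plain integers: `field_or_na` and `SENTINELS` are hypothetical names and are not part of AMD SMI.

```python
# Hypothetical helper, not part of the AMD SMI API: maps the all-ones
# "not supported" sentinels documented in amdsmi_asic_info_t to "N/A".
UINT32_MAX = 0xFFFFFFFF
UINT64_MAX = 0xFFFFFFFFFFFFFFFF

# Sentinel per new field, taken from the struct comments.
SENTINELS = {
    "target_graphics_version": UINT64_MAX,
    "kfd_id": UINT64_MAX,
    "node_id": UINT32_MAX,
    "partition_id": UINT32_MAX,
}


def field_or_na(name, value):
    """Return "N/A" when value equals the field's sentinel, else the value."""
    if SENTINELS.get(name) == value:
        return "N/A"
    return value


# Illustrative input: kfd_id and node_id are ordinary values, while
# partition_id is deliberately set to its sentinel to show the mapping.
sample = {"kfd_id": 24248, "node_id": 2, "partition_id": UINT32_MAX}
readable = {name: field_or_na(name, value) for name, value in sample.items()}
print(readable)
```

Keeping the sentinel table next to the field names makes it harder to compare a 64-bit field against the 32-bit sentinel by mistake.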
|
||||
|
||||
```shell
|
||||
$ amd-smi static --asic --board --bus --partition
|
||||
GPU: 0
|
||||
ASIC:
|
||||
MARKET_NAME: MI308X
|
||||
VENDOR_ID: 0x1002
|
||||
VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
|
||||
SUBVENDOR_ID: 0x1002
|
||||
DEVICE_ID: 0x74a2
|
||||
TARGET_GRAPHICS_VERSION: gfx942
|
||||
KFD_ID: 24248
|
||||
NODE_ID: 2
|
||||
PARTITION_ID: 0
|
||||
SUBSYSTEM_ID: 0x74a2
|
||||
REV_ID: 0x00
|
||||
ASIC_SERIAL: <redacted>
|
||||
OAM_ID: 5
|
||||
NUM_COMPUTE_UNITS: 20
|
||||
BUS:
|
||||
BDF: 0000:0A:00.0
|
||||
MAX_PCIE_WIDTH: 16
|
||||
MAX_PCIE_SPEED: 32 GT/s
|
||||
PCIE_INTERFACE_VERSION: Gen 5
|
||||
SLOT_TYPE: PCIE
|
||||
BOARD:
|
||||
MODEL_NUMBER: 102-G30218-00
|
||||
PRODUCT_SERIAL: 692432000576
|
||||
FRU_ID: 113-AMDG302180002-0000000000000
|
||||
PRODUCT_NAME: AMD Instinct MI308X OAM
|
||||
MANUFACTURER_NAME: AMD
|
||||
PARTITION:
|
||||
COMPUTE_PARTITION: CPX
|
||||
MEMORY_PARTITION: NPS4
|
||||
|
||||
GPU: 1
|
||||
ASIC:
|
||||
MARKET_NAME: MI308X
|
||||
VENDOR_ID: 0x1002
|
||||
VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
|
||||
SUBVENDOR_ID: 0x1002
|
||||
DEVICE_ID: 0x74a2
|
||||
TARGET_GRAPHICS_VERSION: gfx942
|
||||
KFD_ID: 41657
|
||||
NODE_ID: 3
|
||||
PARTITION_ID: 1
|
||||
SUBSYSTEM_ID: 0x74a2
|
||||
REV_ID: 0x00
|
||||
ASIC_SERIAL: <redacted>
|
||||
OAM_ID: 5
|
||||
NUM_COMPUTE_UNITS: 20
|
||||
BUS:
|
||||
BDF: 0000:0A:00.1
|
||||
MAX_PCIE_WIDTH: 16
|
||||
MAX_PCIE_SPEED: 32 GT/s
|
||||
PCIE_INTERFACE_VERSION: Gen 5
|
||||
SLOT_TYPE: PCIE
|
||||
BOARD:
|
||||
MODEL_NUMBER: 102-G30218-00
|
||||
PRODUCT_SERIAL: 692432000576
|
||||
FRU_ID: 113-AMDG302180002-0000000000000
|
||||
PRODUCT_NAME: AMD Instinct MI308X OAM
|
||||
MANUFACTURER_NAME: AMD
|
||||
PARTITION:
|
||||
COMPUTE_PARTITION: CPX
|
||||
MEMORY_PARTITION: NPS4
|
||||
...
|
||||
```
|
||||
|
||||
|
||||
### Removals
|
||||
|
||||
@@ -186,7 +278,58 @@ Legend:
|
||||
|
||||
### Resolved issues
|
||||
|
||||
- N/A
|
||||
- **Fixed CPX not showing total number of logical GPUs**.
|
||||
Updates were made to `amdsmi_init()` and `amdsmi_get_gpu_bdf_id(..)`. To display all logical devices, we needed a way to order the enumerated GPUs. This was done
by adding a partition_id within the BDF optional pci_id bits.

Due to driver changes in KFD, some devices may report the partition ID in bits [31:28] or in bits [2:0]. With the updated `amdsmi_get_gpu_bdf_id(..)`, we provide this fallback to properly retrieve the partition ID. We
plan to eventually remove the partition ID from the function portion of the BDF (Bus Device Function). See below for the PCI ID description.
|
||||
|
||||
- bits [63:32] = domain
|
||||
- bits [31:28] or bits [2:0] = partition id
|
||||
- bits [27:16] = reserved
|
||||
- bits [15:8] = Bus
|
||||
- bits [7:3] = Device
|
||||
- bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
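The bit layout listed above can be unpacked with plain shifts and masks. The following is a minimal sketch, assuming the 64-bit value returned by `amdsmi_get_gpu_bdf_id(..)` follows exactly that layout. `decode_bdf_id` is a hypothetical helper, and it reports both candidate locations of the partition ID rather than guessing which one a given driver uses.

```python
def decode_bdf_id(bdf_id):
    """Split a 64-bit BDF ID into its fields (hypothetical helper)."""
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,  # bits [63:32]
        "partition_id": (bdf_id >> 28) & 0xF,   # bits [31:28]
        "bus": (bdf_id >> 8) & 0xFF,            # bits [15:8]
        "device": (bdf_id >> 3) & 0x1F,         # bits [7:3]
        "function": bdf_id & 0x7,               # bits [2:0], may carry the partition ID
    }


# Hand-built example: domain 0, partition 1, bus 0x0A, device 0, function 1.
example = (1 << 28) | (0x0A << 8) | 1
fields = decode_bdf_id(example)

# Prints 0000:0A:00.1
print("{domain:04X}:{bus:02X}:{device:02X}.{function}".format(**fields))
```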
|
||||
|
||||
Previously, in non-SPX modes (e.g., CPX, TPX, or DPX), some MI3x ASICs would not report all logical GPU devices within AMD SMI.
|
||||
|
||||
```shell
|
||||
$ amd-smi monitor -p -t -v
|
||||
GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL
|
||||
0 248 W 55 °C 48 °C 283 MB 196300 MB
|
||||
1 247 W 55 °C 48 °C 283 MB 196300 MB
|
||||
2 247 W 55 °C 48 °C 283 MB 196300 MB
|
||||
3 247 W 55 °C 48 °C 283 MB 196300 MB
|
||||
4 221 W 50 °C 42 °C 283 MB 196300 MB
|
||||
5 221 W 50 °C 42 °C 283 MB 196300 MB
|
||||
6 222 W 50 °C 42 °C 283 MB 196300 MB
|
||||
7 221 W 50 °C 42 °C 283 MB 196300 MB
|
||||
8 239 W 53 °C 46 °C 283 MB 196300 MB
|
||||
9 239 W 53 °C 46 °C 283 MB 196300 MB
|
||||
10 239 W 53 °C 46 °C 283 MB 196300 MB
|
||||
11 239 W 53 °C 46 °C 283 MB 196300 MB
|
||||
12 219 W 51 °C 48 °C 283 MB 196300 MB
|
||||
13 219 W 51 °C 48 °C 283 MB 196300 MB
|
||||
14 219 W 51 °C 48 °C 283 MB 196300 MB
|
||||
15 219 W 51 °C 48 °C 283 MB 196300 MB
|
||||
16 222 W 51 °C 47 °C 283 MB 196300 MB
|
||||
17 222 W 51 °C 47 °C 283 MB 196300 MB
|
||||
18 222 W 51 °C 47 °C 283 MB 196300 MB
|
||||
19 222 W 51 °C 48 °C 283 MB 196300 MB
|
||||
20 241 W 55 °C 48 °C 283 MB 196300 MB
|
||||
21 241 W 55 °C 48 °C 283 MB 196300 MB
|
||||
22 241 W 55 °C 48 °C 283 MB 196300 MB
|
||||
23 240 W 55 °C 48 °C 283 MB 196300 MB
|
||||
24 211 W 51 °C 45 °C 283 MB 196300 MB
|
||||
25 211 W 51 °C 45 °C 283 MB 196300 MB
|
||||
26 211 W 51 °C 45 °C 283 MB 196300 MB
|
||||
27 211 W 51 °C 45 °C 283 MB 196300 MB
|
||||
28 227 W 51 °C 49 °C 283 MB 196300 MB
|
||||
29 227 W 51 °C 49 °C 283 MB 196300 MB
|
||||
30 227 W 51 °C 49 °C 283 MB 196300 MB
|
||||
31 227 W 51 °C 49 °C 283 MB 196300 MB
|
||||
```
|
||||
|
||||
### Known issues
|
||||
|
||||
@@ -829,7 +972,7 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json
|
||||
Previously, our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. The fix
updates the CLI to target only AMD ASICs.
|
||||
|
||||
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()` on Navi32/31 cards**.
- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()` on Navi32/31 cards**.
|
||||
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
|
||||
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
|
||||
|
||||
|
||||
@@ -281,16 +281,16 @@ typedef enum {
|
||||
*/
|
||||
typedef enum {
|
||||
AMDSMI_COMPUTE_PARTITION_INVALID = 0,
|
||||
AMDSMI_COMPUTE_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with
|
||||
//!< shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work
|
||||
//!< together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work
|
||||
//!< together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs
|
||||
//!< work together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_QPX //!< Quad GPU mode (QPX)- Quarter XCCs
|
||||
//!< work together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work
|
||||
//!< together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work
|
||||
//!< together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs
|
||||
//!< work together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_QPX, //!< Quad GPU mode (QPX)- Quarter XCCs
|
||||
//!< work together with shared memory
|
||||
AMDSMI_COMPUTE_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with
|
||||
//!< shared memory
|
||||
} amdsmi_compute_partition_type_t;
|
||||
|
||||
/**
|
||||
@@ -589,7 +589,11 @@ typedef struct {
|
||||
char asic_serial[AMDSMI_NORMAL_STRING_LENGTH];
|
||||
uint32_t oam_id; //< 0xFFFF if not supported
|
||||
uint32_t num_of_compute_units; //< 0xFFFFFFFF if not supported
|
||||
uint32_t reserved[17];
|
||||
uint64_t target_graphics_version; //< 0xFFFFFFFFFFFFFFFF if not supported
|
||||
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
|
||||
uint32_t node_id; //< 0xFFFFFFFF if not supported
|
||||
uint32_t partition_id; //< 0xFFFFFFFF if not supported
|
||||
uint32_t reserved[11];
|
||||
} amdsmi_asic_info_t;
|
||||
|
||||
typedef enum {
|
||||
@@ -2233,16 +2237,18 @@ amdsmi_get_gpu_pci_bandwidth(amdsmi_processor_handle processor_handle,
|
||||
*
|
||||
* The format of @p bdfid will be as follows:
|
||||
*
|
||||
* BDFID = ((DOMAIN & 0xffffffff) << 32) | ((BUS & 0xff) << 8) |
|
||||
* ((DEVICE & 0x1f) <<3 ) | (FUNCTION & 0x7)
|
||||
* BDFID = ((DOMAIN & 0xFFFFFFFF) << 32) | ((Partition & 0xF) << 28)
|
||||
* | ((BUS & 0xFF) << 8) | ((DEVICE & 0x1F) <<3 )
|
||||
* | (FUNCTION & 0x7)
|
||||
*
|
||||
* | Name | Field |
|
||||
* ---------- | ------- |
|
||||
* | Domain | [64:32] |
|
||||
* | Reserved | [31:16] |
|
||||
* | Bus | [15: 8] |
|
||||
* | Device | [ 7: 3] |
|
||||
* | Function | [ 2: 0] |
|
||||
* | Name | Field | KFD property | KFD -> PCIe ID (uint64_t) |
* | -------------- | ------- | ---------------- | ---------------------------- |
|
||||
* | Domain | [63:32] | "domain" | (DOMAIN & 0xFFFFFFFF) << 32 |
|
||||
* | Partition id | [31:28] | "location id" | (LOCATION & 0xF0000000) |
|
||||
* | Reserved | [27:16] | "location id" | N/A |
|
||||
* | Bus | [15: 8] | "location id" | (LOCATION & 0xFF00) |
|
||||
* | Device | [ 7: 3] | "location id" | (LOCATION & 0xF8) |
|
||||
* | Function | [ 2: 0] | "location id" | (LOCATION & 0x7) |
|
||||
*
|
||||
* @param[in] processor_handle a processor handle
|
||||
*
|
||||
|
||||
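The last column of the table above describes how each field is taken from the KFD `domain` and `location id` properties. Below is a minimal sketch of that composition, assuming both properties have already been read as integers. `pcie_id_from_kfd` is a hypothetical name and not a function in AMD SMI or ROCm SMI.

```python
def pcie_id_from_kfd(domain, location_id):
    """Compose the 64-bit PCIe ID from KFD properties (hypothetical helper)."""
    return (
        ((domain & 0xFFFFFFFF) << 32)   # Domain       -> bits [63:32]
        | (location_id & 0xF0000000)    # Partition id -> bits [31:28]
        | (location_id & 0xFF00)        # Bus          -> bits [15:8]
        | (location_id & 0xF8)          # Device       -> bits [7:3]
        | (location_id & 0x7)           # Function     -> bits [2:0]
    )


# Hand-made input: reserved bits [27:16] are set on purpose to show that
# the masks drop them.
bdf_id = pcie_id_from_kfd(domain=0, location_id=0x1FFF0A01)
print(hex(bdf_id))  # prints 0x10000a01
```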
@@ -1664,7 +1664,11 @@ def amdsmi_get_gpu_asic_info(
|
||||
"rev_id": _padHexValue(hex(asic_info_struct.rev_id), 2),
|
||||
"asic_serial": asic_info_struct.asic_serial.decode("utf-8"),
|
||||
"oam_id": asic_info_struct.oam_id,
|
||||
"num_compute_units": asic_info_struct.num_of_compute_units
|
||||
"num_compute_units": asic_info_struct.num_of_compute_units,
|
||||
"target_graphics_version": "gfx" + str(asic_info_struct.target_graphics_version),
|
||||
"kfd_id": asic_info_struct.kfd_id,
|
||||
"node_id": asic_info_struct.node_id,
|
||||
"partition_id": asic_info_struct.partition_id
|
||||
}
|
||||
|
||||
string_values = ["market_name", "vendor_name"]
|
||||
|
||||
@@ -380,18 +380,18 @@ amdsmi_clk_type_t = ctypes.c_uint32 # enum
|
||||
# values for enumeration 'amdsmi_compute_partition_type_t'
|
||||
amdsmi_compute_partition_type_t__enumvalues = {
|
||||
0: 'AMDSMI_COMPUTE_PARTITION_INVALID',
|
||||
1: 'AMDSMI_COMPUTE_PARTITION_CPX',
|
||||
2: 'AMDSMI_COMPUTE_PARTITION_SPX',
|
||||
3: 'AMDSMI_COMPUTE_PARTITION_DPX',
|
||||
4: 'AMDSMI_COMPUTE_PARTITION_TPX',
|
||||
5: 'AMDSMI_COMPUTE_PARTITION_QPX',
|
||||
1: 'AMDSMI_COMPUTE_PARTITION_SPX',
|
||||
2: 'AMDSMI_COMPUTE_PARTITION_DPX',
|
||||
3: 'AMDSMI_COMPUTE_PARTITION_TPX',
|
||||
4: 'AMDSMI_COMPUTE_PARTITION_QPX',
|
||||
5: 'AMDSMI_COMPUTE_PARTITION_CPX',
|
||||
}
|
||||
AMDSMI_COMPUTE_PARTITION_INVALID = 0
|
||||
AMDSMI_COMPUTE_PARTITION_CPX = 1
|
||||
AMDSMI_COMPUTE_PARTITION_SPX = 2
|
||||
AMDSMI_COMPUTE_PARTITION_DPX = 3
|
||||
AMDSMI_COMPUTE_PARTITION_TPX = 4
|
||||
AMDSMI_COMPUTE_PARTITION_QPX = 5
|
||||
AMDSMI_COMPUTE_PARTITION_SPX = 1
|
||||
AMDSMI_COMPUTE_PARTITION_DPX = 2
|
||||
AMDSMI_COMPUTE_PARTITION_TPX = 3
|
||||
AMDSMI_COMPUTE_PARTITION_QPX = 4
|
||||
AMDSMI_COMPUTE_PARTITION_CPX = 5
|
||||
amdsmi_compute_partition_type_t = ctypes.c_uint32 # enum
|
||||
|
||||
# values for enumeration 'amdsmi_memory_partition_type_t'
|
||||
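The hunk above renumbers `amdsmi_compute_partition_type_t`, so a raw integer stored under the old numbering names a different mode under the new one. Here is a small sketch of translating old values to new ones. The `OLD` and `NEW` tables are copied from the removed and added lines of the hunk, while `remap_old_value` is a hypothetical helper.

```python
# Numbering before this commit (removed lines).
OLD = {0: "INVALID", 1: "CPX", 2: "SPX", 3: "DPX", 4: "TPX", 5: "QPX"}
# Numbering after this commit (added lines).
NEW = {0: "INVALID", 1: "SPX", 2: "DPX", 3: "TPX", 4: "QPX", 5: "CPX"}

# Reverse lookup: partition name -> new numeric value.
NAME_TO_NEW = {name: value for value, name in NEW.items()}


def remap_old_value(old_value):
    """Translate a value from the old numbering into the new numbering."""
    return NAME_TO_NEW[OLD[old_value]]


print([remap_old_value(v) for v in range(6)])  # prints [0, 5, 1, 2, 3, 4]
```

Comparing against the symbolic names rather than stored integers avoids the need for such a translation in the first place.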
@@ -902,7 +902,13 @@ struct_amdsmi_asic_info_t._fields_ = [
|
||||
('asic_serial', ctypes.c_char * 32),
|
||||
('oam_id', ctypes.c_uint32),
|
||||
('num_of_compute_units', ctypes.c_uint32),
|
||||
('PADDING_0', ctypes.c_ubyte * 4),
|
||||
('target_graphics_version', ctypes.c_uint64),
|
||||
('kfd_id', ctypes.c_uint64),
|
||||
('node_id', ctypes.c_uint32),
|
||||
('partition_id', ctypes.c_uint32),
|
||||
('reserved', ctypes.c_uint32 * 17),
|
||||
('PADDING_1', ctypes.c_ubyte * 4),
|
||||
]
|
||||
|
||||
amdsmi_asic_info_t = struct_amdsmi_asic_info_t
|
||||
|
||||
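The `PADDING_0` and `PADDING_1` entries in the hunk above appear to be alignment filler written out explicitly by the binding generator. The sketch below shows the same effect on a hypothetical two-field struct, which is not the real `amdsmi_asic_info_t` layout: ctypes aligns a `c_uint64` member to its natural boundary, which leaves a gap after a preceding `c_uint32`.

```python
import ctypes


class Demo(ctypes.Structure):
    """Hypothetical struct used only to show implicit alignment padding."""
    _fields_ = [
        ("count", ctypes.c_uint32),  # 4 bytes at offset 0
        ("wide", ctypes.c_uint64),   # placed on the next aligned offset
    ]


align = ctypes.alignment(ctypes.c_uint64)
# On common 64-bit platforms this reports offset 8 and size 16, that is,
# 4 bytes of padding between the two fields.
print(Demo.wide.offset, ctypes.sizeof(Demo), align)
```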
@@ -509,6 +509,14 @@ def walk_through(self):
|
||||
asic_info['asic_serial']))
|
||||
print(" asic_info['oam_id'] is: {}\n".format(
|
||||
asic_info['oam_id']))
|
||||
print(" asic_info['target_graphics_version'] is: {}\n".format(
|
||||
asic_info['target_graphics_version']))
|
||||
print(" asic_info['kfd_id'] is: {}\n".format(
|
||||
asic_info['kfd_id']))
|
||||
print(" asic_info['node_id'] is: {}\n".format(
|
||||
asic_info['node_id']))
|
||||
print(" asic_info['partition_id'] is: {}\n".format(
|
||||
asic_info['partition_id']))
|
||||
print("###Test amdsmi_get_power_cap_info \n")
|
||||
power_info = amdsmi.amdsmi_get_power_cap_info(processors[i])
|
||||
print(" power_info['dpm_cap'] is: {}".format(
|
||||
|
||||
@@ -53,6 +53,7 @@
|
||||
#include <map>
|
||||
#include <vector>
|
||||
#include <type_traits>
|
||||
#include <cstring>
|
||||
|
||||
#include "rocm_smi/rocm_smi.h"
|
||||
#include "rocm_smi/rocm_smi_utils.h"
|
||||
@@ -730,30 +731,6 @@ template<typename T> constexpr float convert_mw_to_w(T mw) {
|
||||
return static_cast<float>(mw / 1000.0);
|
||||
}
|
||||
|
||||
template <typename T>
|
||||
auto print_error_or_value(rsmi_status_t status_code, const T& metric) {
|
||||
if (status_code == rsmi_status_t::RSMI_STATUS_SUCCESS) {
|
||||
if constexpr (std::is_array_v<T>) {
|
||||
auto idx = uint16_t(0);
|
||||
auto str_values = std::string();
|
||||
const auto num_elems = static_cast<uint16_t>(std::end(metric) - std::begin(metric));
|
||||
str_values = ("\n\t\t num of values: " + std::to_string(num_elems) + "\n");
|
||||
for (const auto& el : metric) {
|
||||
str_values += "\t\t [" + std::to_string(idx) + "]: " + std::to_string(el) + "\n";
|
||||
++idx;
|
||||
}
|
||||
return str_values;
|
||||
}
|
||||
else if constexpr ((std::is_same_v<T, std::uint16_t>) ||
|
||||
(std::is_same_v<T, std::uint32_t>) ||
|
||||
(std::is_same_v<T, std::uint64_t>)) {
|
||||
return std::to_string(metric);
|
||||
}
|
||||
}
|
||||
else {
|
||||
return ("\n\t\tStatus: [" + std::to_string(status_code) + "] " + "-> " + amd::smi::getRSMIStatusString(status_code));
|
||||
}
|
||||
};
|
||||
|
||||
template <typename T>
|
||||
std::string print_unsigned_int(T value) {
|
||||
@@ -780,6 +757,7 @@ int main() {
|
||||
uint32_t num_monitor_devs = 0;
|
||||
rsmi_gpu_metrics_t gpu_metrics;
|
||||
std::string val_str;
|
||||
|
||||
RSMI_POWER_TYPE power_type = RSMI_INVALID_POWER;
|
||||
|
||||
rsmi_num_monitor_devices(&num_monitor_devs);
|
||||
@@ -791,13 +769,23 @@ int main() {
|
||||
ret = rsmi_dev_revision_get(i, &val_ui16);
|
||||
CHK_RSMI_RET_I(ret)
|
||||
std::cout << "\t**Dev.Rev.ID: 0x" << std::hex << val_ui16 << "\n";
|
||||
ret = amd::smi::rsmi_get_gfx_target_version(i , &val_str);
|
||||
std::cout << "\t**Target Graphics Version: " << val_str << "\n";
|
||||
|
||||
char pcie_vendor_name[256];
|
||||
ret = rsmi_dev_pcie_vendor_name_get(i, pcie_vendor_name, 256);
|
||||
CHK_RSMI_RET_I(ret)
|
||||
std::cout << "\t**PCIe vendor name: " << pcie_vendor_name << std::endl;
|
||||
ret = rsmi_dev_target_graphics_version_get(i, &val_ui64);
|
||||
std::cout << "\t**Target Graphics Version: " << std::dec
|
||||
<< static_cast<uint64_t>(val_ui64) << "\n";
|
||||
ret = rsmi_dev_guid_get(i, &val_ui64);
|
||||
std::cout << "\t**GUID: " << std::dec
|
||||
<< static_cast<uint64_t>(val_ui64) << "\n";
|
||||
ret = rsmi_dev_node_id_get(i, &val_ui32);
|
||||
std::cout << "\t**Node ID: " << std::dec
|
||||
<< static_cast<uint32_t>(val_ui32) << "\n";
|
||||
char vbios_version[256];
|
||||
ret = rsmi_dev_vbios_version_get(i, vbios_version, 256);
|
||||
if (ret == RSMI_STATUS_SUCCESS) {
|
||||
std::cout << "\t**VBIOS Version: " << vbios_version << "\n";
|
||||
} else {
|
||||
std::cout << "\t**VBIOS Version: "
|
||||
<< amd::smi::getRSMIStatusString(ret, false) << "\n";
|
||||
}
|
||||
|
||||
char current_compute_partition[256];
|
||||
current_compute_partition[0] = '\0';
|
||||
@@ -848,8 +836,9 @@ int main() {
|
||||
//
|
||||
std::cout << "\n";
|
||||
print_test_header("GPU METRICS: Using static struct (Backwards Compatibility) ", i);
|
||||
print_function_header_with_rsmi_ret(ret, "rsmi_dev_gpu_metrics_info_get(" + std::to_string(i) + ", &gpu_metrics)");
|
||||
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics);
|
||||
ret = rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics);
|
||||
print_function_header_with_rsmi_ret(ret, "rsmi_dev_gpu_metrics_info_get("
|
||||
+ std::to_string(i) + ", &gpu_metrics)");
|
||||
|
||||
std::cout << "\t**.common_header.format_revision : "
|
||||
<< print_unsigned_int(gpu_metrics.common_header.format_revision) << "\n";
|
||||
@@ -988,173 +977,58 @@ int main() {
|
||||
for (const auto& dclk : gpu_metrics.current_dclk0s) {
|
||||
std::cout << "\t -> " << std::dec << dclk << "\n";
|
||||
}
|
||||
std::cout << " ** Note: Values MAX'ed out (UINTX MAX are unsupported for the version in question) ** " << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
|
||||
constexpr uint16_t kMAX_ITER_TEST = 10;
|
||||
rsmi_gpu_metrics_t gpu_metrics_check;
|
||||
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
|
||||
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics_check);
|
||||
std::cout << "\t\t -> firmware_timestamp [" << idx
|
||||
<< "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n";
|
||||
}
|
||||
|
||||
std::cout << "\n";
|
||||
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
|
||||
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics_check);
|
||||
std::cout << "\t\t -> system_clock_counter [" << idx
|
||||
<< "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n";
|
||||
}
|
||||
|
||||
std::cout << "\n\n";
|
||||
std::cout << " ** Note: Values MAX'ed out "
|
||||
"(UINTX MAX are unsupported for the version in question) ** " << "\n";
|
||||
|
||||
|
||||
std::cout << "\n\n";
|
||||
print_test_header("GPU METRICS: Using direct APIs (newer)", i);
|
||||
metrics_table_header_t header_values;
|
||||
GPUMetricTempHbm_t hbm_values;
|
||||
GPUMetricVcnActivity_t vcn_values;
|
||||
GPUMetricXgmiReadDataAcc_t xgmi_read_values;
|
||||
GPUMetricXgmiWriteDataAcc_t xgmi_write_values;
|
||||
GPUMetricCurrGfxClk_t curr_gfxclk_values;
|
||||
GPUMetricCurrSocClk_t curr_socclk_values;
|
||||
GPUMetricCurrVClk0_t curr_vclk0_values;
|
||||
GPUMetricCurrDClk0_t curr_dclk0_values;
|
||||
|
||||
ret = rsmi_dev_metrics_header_info_get(i, &header_values);
|
||||
std::cout << "\t[Metrics Header]" << "\n";
|
||||
std::cout << "\t -> format_revision : " << print_unsigned_int(header_values.format_revision) << "\n";
|
||||
std::cout << "\t -> content_revision : " << print_unsigned_int(header_values.content_revision) << "\n";
|
||||
std::cout << "\t -> format_revision : "
|
||||
<< print_unsigned_int(header_values.format_revision) << "\n";
|
||||
std::cout << "\t -> content_revision : "
|
||||
<< print_unsigned_int(header_values.content_revision) << "\n";
|
||||
std::cout << "\t--------------------" << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Temperature]" << "\n";
|
||||
ret = rsmi_dev_metrics_temp_edge_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_edge(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_hotspot_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_hotspot(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_mem_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_mem(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_vrgfx_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_vrgfx(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_vrsoc_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_vrsoc(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_vrmem_get(i, &val_ui16);
|
||||
std::cout << "\t -> temp_vrmem(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_temp_hbm_get(i, &hbm_values);
|
||||
std::cout << "\t -> temp_hbm(): " << print_error_or_value(ret, hbm_values) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Power/Energy]" << "\n";
|
||||
ret = rsmi_dev_metrics_curr_socket_power_get(i, &val_ui16);
|
||||
std::cout << "\t -> current_socket_power(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_energy_acc_get(i, &val_ui64);
|
||||
std::cout << "\t -> energy_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_socket_power_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_socket_power(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Utilization]" << "\n";
|
||||
ret = rsmi_dev_metrics_avg_gfx_activity_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_gfx_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_umc_activity_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_umc_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_mm_activity_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_mm_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_vcn_activity_get(i, &vcn_values);
|
||||
std::cout << "\t -> vcn_activity(): " << print_error_or_value(ret, vcn_values) << "\n";
|
||||
ret = rsmi_dev_metrics_mem_activity_acc_get(i, &val_ui32);
|
||||
std::cout << "\t -> mem_activity_accum(): " << print_error_or_value(ret, val_ui32) << "\n";
|
||||
ret = rsmi_dev_metrics_gfx_activity_acc_get(i, &val_ui32);
|
||||
std::cout << "\t -> gfx_activity_accum(): " << print_error_or_value(ret, val_ui32) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Average Clock]" << "\n";
|
||||
ret = rsmi_dev_metrics_avg_gfx_clock_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_gfx_clock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_soc_clock_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_soc_clock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_uclock_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_uclock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_vclock0_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_vclock0_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_dclock0_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_dclock0_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_vclock1_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_vclock1_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_avg_dclock1_frequency_get(i, &val_ui16);
|
||||
std::cout << "\t -> average_dclock1_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Current Clock]" << "\n";
|
||||
ret = rsmi_dev_metrics_curr_vclk1_get(i, &val_ui16);
|
||||
std::cout << "\t -> current_vclock1(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_dclk1_get(i, &val_ui16);
|
||||
std::cout << "\t -> current_dclock1(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_uclk_get(i, &val_ui16);
|
||||
std::cout << "\t -> current_uclock(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_dclk0_get(i, &curr_dclk0_values);
|
||||
std::cout << "\t -> current_dclk0(): " << print_error_or_value(ret, curr_dclk0_values) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_gfxclk_get(i, &curr_gfxclk_values);
|
||||
std::cout << "\t -> current_gfxclk(): " << print_error_or_value(ret, curr_gfxclk_values) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_socclk_get(i, &curr_socclk_values);
|
||||
std::cout << "\t -> current_soc_clock(): " << print_error_or_value(ret, curr_socclk_values) << "\n";
|
||||
ret = rsmi_dev_metrics_curr_vclk0_get(i, &curr_vclk0_values);
|
||||
std::cout << "\t -> current_vclk0(): " << print_error_or_value(ret, curr_vclk0_values) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Throttle]" << "\n";
|
||||
ret = rsmi_dev_metrics_indep_throttle_status_get(i, &val_ui64);
|
||||
std::cout << "\t -> indep_throttle_status(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_throttle_status_get(i, &val_ui32);
|
||||
std::cout << "\t -> throttle_status(): " << print_error_or_value(ret, val_ui32) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Gfx Clock Lock]" << "\n";
|
||||
ret = rsmi_dev_metrics_gfxclk_lock_status_get(i, &val_ui32);
|
||||
std::cout << "\t -> gfxclk_lock_status(): " << print_error_or_value(ret, val_ui32) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Current Fan Speed]" << "\n";
|
||||
ret = rsmi_dev_metrics_curr_fan_speed_get(i, &val_ui16);
|
||||
std::cout << "\t -> current_fan_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Link/Bandwidth/Speed]" << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_link_width_get(i, &val_ui16);
|
||||
std::cout << "\t -> pcie_link_width(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_link_speed_get(i, &val_ui16);
|
||||
std::cout << "\t -> pcie_link_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_bandwidth_acc_get(i, &val_ui64);
|
||||
std::cout << "\t -> pcie_bandwidth_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_bandwidth_inst_get(i, &val_ui64);
|
||||
std::cout << "\t -> pcie_bandwidth_inst(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_l0_recov_count_acc_get(i, &val_ui64);
|
||||
std::cout << "\t -> pcie_l0_recov_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_replay_count_acc_get(i, &val_ui64);
|
||||
std::cout << "\t -> pcie_replay_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_pcie_replay_rover_count_acc_get(i, &val_ui64);
|
||||
std::cout << "\t -> pcie_replay_rollover_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_xgmi_link_width_get(i, &val_ui16);
|
||||
std::cout << "\t -> xgmi_link_width(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_xgmi_link_speed_get(i, &val_ui16);
|
||||
std::cout << "\t -> xgmi_link_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_xgmi_read_data_get(i, &xgmi_read_values);
|
||||
std::cout << "\t -> xgmi_read_data(): " << print_error_or_value(ret, xgmi_read_values) << "\n";
|
||||
ret = rsmi_dev_metrics_xgmi_write_data_get(i, &xgmi_write_values);
|
||||
std::cout << "\t -> xgmi_write_data(): " << print_error_or_value(ret, xgmi_write_values) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Voltage]" << "\n";
|
||||
ret = rsmi_dev_metrics_volt_soc_get(i, &val_ui16);
|
||||
std::cout << "\t -> voltage_soc(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_volt_gfx_get(i, &val_ui16);
|
||||
std::cout << "\t -> voltage_gfx(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
ret = rsmi_dev_metrics_volt_mem_get(i, &val_ui16);
|
||||
std::cout << "\t -> voltage_mem(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[Timestamp]" << "\n";
|
||||
ret = rsmi_dev_metrics_system_clock_counter_get(i, &val_ui64);
|
||||
std::cout << "\t -> system_clock_counter(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
ret = rsmi_dev_metrics_firmware_timestamp_get(i, &val_ui64);
|
||||
std::cout << "\t -> firmware_timestamp(): " << print_error_or_value(ret, val_ui64) << "\n";
|
||||
|
||||
std::cout << "\n";
|
||||
std::cout << "\t[XCD CounterVoltage]" << "\n";
|
||||
ret = rsmi_dev_metrics_xcd_counter_get(i, &val_ui16);
|
||||
std::cout << "\t -> xcd_counter(): " << print_error_or_value(ret, val_ui16) << "\n";
|
||||
std::cout << "\t -> xcd_counter(): " << val_ui16;
|
||||
std::cout << "\n\n";
|
||||
|
||||
|
||||
ret = rsmi_dev_perf_level_get(i, &pfl);
|
||||
CHK_AND_PRINT_RSMI_ERR_RET(ret)
|
||||
std::cout << "\t**Performance Level:" <<
|
||||
perf_level_string(pfl) << "\n";
|
||||
ret = rsmi_dev_overdrive_level_get(i, &val_ui32);
|
||||
CHK_AND_PRINT_RSMI_ERR_RET(ret)
|
||||
std::cout << "\t**OverDrive Level:" << val_ui32 << "\n";
|
||||
std::cout << "\t**OverDrive Level: ";
|
||||
if (ret == RSMI_STATUS_SUCCESS) {
|
||||
std::cout << val_ui32 << "\n";
|
||||
} else {
|
||||
CHK_RSMI_NOT_SUPPORTED_OR_UNEXPECTED_DATA_RET(ret)
|
||||
}
|
||||
|
||||
print_test_header("GPU Clocks", i);
|
||||
for (int clkType = static_cast<int>(RSMI_CLK_TYPE_SYS);
|
||||
@@ -1271,9 +1145,6 @@ int main() {
|
||||
}
|
||||
|
||||
for (uint32_t i = 0; i < num_monitor_devs; ++i) {
|
||||
ret = test_set_overdrive(i);
|
||||
CHK_AND_PRINT_RSMI_ERR_RET(ret)
|
||||
|
||||
ret = test_set_perf_level(i);
|
||||
CHK_AND_PRINT_RSMI_ERR_RET(ret)
|
||||
|
||||
@@ -1294,6 +1165,9 @@ int main() {
|
||||
|
||||
ret = test_set_memory_partition(i);
|
||||
CHK_AND_PRINT_RSMI_ERR_RET(ret)
|
||||
|
||||
ret = test_set_overdrive(i);
|
||||
CHK_RSMI_NOT_SUPPORTED_RET(ret)
|
||||
}
|
||||
|
||||
return 0;
|
||||
|
||||
File diff not shown because it is too large
@@ -94,6 +94,11 @@ class KFDNode {
|
||||
int32_t get_simd_per_cu(uint64_t* simd_per_cu) const;
|
||||
int32_t get_simd_count(uint64_t* simd_count) const;
|
||||
|
||||
// Get gpu_id (AKA GUID) version from kfd
|
||||
int get_gpu_id(uint64_t *gpu_id);
|
||||
// Get node id from kfd
|
||||
int get_node_id(uint32_t *node_id);
|
||||
|
||||
private:
|
||||
uint32_t node_indx_;
|
||||
uint32_t amdgpu_dev_index_;
|
||||
|
||||
@@ -48,8 +48,11 @@
|
||||
#include <algorithm>
|
||||
#include <cstdint>
|
||||
#include <iomanip>
|
||||
#include <iosfwd>
|
||||
#include <iostream>
|
||||
#include <iterator>
|
||||
#include <limits>
|
||||
#include <ostream>
|
||||
#include <queue>
|
||||
#include <sstream>
|
||||
#include <string>
|
||||
@@ -594,6 +597,7 @@ class TagTextContents_t
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
};
|
||||
|
||||
using TextFileTagContents_t = TagTextContents_t<std::string, std::string,
|
||||
|
||||
File diff suppressed because it is too large
@@ -490,7 +490,7 @@ static const std::map<const char *, dev_depends_t> kDevFuncDependsMap = {
|
||||
// Functions with only mandatory dependencies
|
||||
{"rsmi_dev_vram_vendor_get", {{kDevVramVendorFName}, {}}},
|
||||
{"rsmi_dev_id_get", {{kDevDevIDFName}, {}}},
|
||||
{"rsmi_dev_oam_id_get", {{kDevXGMIPhysicalIDFName}, {}}},
|
||||
{"rsmi_dev_xgmi_physical_id_get", {{kDevXGMIPhysicalIDFName}, {}}},
|
||||
{"rsmi_dev_revision_get", {{kDevDevRevIDFName}, {}}},
|
||||
{"rsmi_dev_vendor_id_get", {{kDevVendorIDFName}, {}}},
|
||||
{"rsmi_dev_name_get", {{kDevVendorIDFName,
|
||||
|
||||
@@ -526,7 +526,7 @@ int GetProcessInfoForPID(uint32_t pid, rsmi_process_info_t *proc,
|
||||
// Collect count of compute units
|
||||
cu_count += kfd_node_map[gpu_id]->cu_count();
|
||||
} else {
|
||||
//Some GFX revisions do not provide cu_occupancy debugfs method
|
||||
// Some GFX revisions do not provide cu_occupancy debugfs method
|
||||
proc->cu_occupancy = CU_OCCUPANCY_INVALID;
|
||||
cu_count = 0;
|
||||
}
|
||||
@@ -1067,18 +1067,18 @@ int KFDNode::get_gfx_target_version(uint64_t *gfx_target_version) {
|
||||
*gfx_target_version = gfx_version;
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | File: " << properties_path
|
||||
<< " | Successfully read node #" << std::to_string(this->node_indx_)
|
||||
<< " | Read node: " << std::to_string(this->node_indx_)
|
||||
<< " for gfx_target_version"
|
||||
<< " | Data (gfx_target_version) *gfx_target_version = "
|
||||
<< " | Data (*gfx_target_version): "
|
||||
<< std::to_string(*gfx_target_version)
|
||||
<< " | return = " << std::to_string(ret)
|
||||
<< " | Return: "
|
||||
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
|
||||
<< " | ";
|
||||
LOG_DEBUG(ss);
|
||||
return ret;
|
||||
}
|
||||
|
||||
int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const
|
||||
{
|
||||
int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const {
|
||||
const std::string properties_path("/sys/class/kfd/kfd/topology/nodes/" +
|
||||
std::to_string(this->node_indx_) +
|
||||
"/properties");
|
||||
@@ -1090,8 +1090,7 @@ int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const
|
||||
return ret;
|
||||
}
|
||||
|
||||
int32_t KFDNode::get_simd_count(uint64_t* simd_count) const
|
||||
{
|
||||
int32_t KFDNode::get_simd_count(uint64_t* simd_count) const {
|
||||
const std::string properties_path("/sys/class/kfd/kfd/topology/nodes/" +
|
||||
std::to_string(this->node_indx_) +
|
||||
"/properties");
|
||||
@@ -1103,6 +1102,62 @@ int32_t KFDNode::get_simd_count(uint64_t* simd_count) const
|
||||
return ret;
|
||||
}
|
||||
|
||||
// Public interface for device
|
||||
// /sys/class/kfd/kfd/topology/nodes/*/gpu_id
|
||||
int KFDNode::get_gpu_id(uint64_t *gpu_id) {
|
||||
std::ostringstream ss;
|
||||
std::string gpuid_path = "/sys/class/kfd/kfd/topology/nodes/"
|
||||
+ std::to_string(this->node_indx_) + "/gpu_id";
|
||||
const uint64_t undefined_gpu_id = std::numeric_limits<uint64_t>::max();
|
||||
std::string gpu_id_string = "";
|
||||
*gpu_id = undefined_gpu_id;
|
||||
int ret = ReadSysfsStr(gpuid_path, &gpu_id_string);
|
||||
if (ret != 0 || gpu_id_string.empty()) {
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | File: " << gpuid_path
|
||||
<< " | Data (*gpu_id): empty or nullptr"
|
||||
<< " | Issue: Could not read node #" << std::to_string(this->node_indx_)
|
||||
<< ". KFD node was an unsupported node or value read was empty."
|
||||
<< " | Return: "
|
||||
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
|
||||
<< " | ";
|
||||
LOG_ERROR(ss);
|
||||
return ret;
|
||||
}
|
||||
*gpu_id = std::stoull(gpu_id_string);
|
||||
if (*gpu_id == 0) { // CPU node - return not supported
|
||||
*gpu_id = undefined_gpu_id;
|
||||
ret = ENOENT; // map to RSMI_STATUS_NOT_SUPPORTED
|
||||
}
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | File: " << gpuid_path
|
||||
<< " | Read node #: " << std::to_string(this->node_indx_)
|
||||
<< " | Data (*gpu_id): " << std::to_string(*gpu_id)
|
||||
<< " | Return: "
|
||||
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
|
||||
<< " | ";
|
||||
LOG_DEBUG(ss);
|
||||
return ret;
|
||||
}
|
||||
|
||||
// Public interface for device
|
||||
// /sys/class/kfd/kfd/topology/nodes/<node_id>
|
||||
int KFDNode::get_node_id(uint32_t *node_id) {
|
||||
std::ostringstream ss;
|
||||
int ret = 0;
|
||||
std::string nodeid_path = "/sys/class/kfd/kfd/topology/nodes/"
|
||||
+ std::to_string(this->node_indx_);
|
||||
// Assign the output first so the log below never reads an unset *node_id
*node_id = this->node_indx_;
ss << __PRETTY_FUNCTION__
<< " | File: " << nodeid_path
<< " | Read node #: " << std::to_string(this->node_indx_)
<< " | Data (*node_id): " << std::to_string(*node_id)
<< " | Return: "
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
<< " | ";
|
||||
LOG_DEBUG(ss);
|
||||
return ret;
|
||||
}
|
||||
|
||||
} // namespace smi
|
||||
} // namespace amd
|
||||
|
||||
@@ -235,15 +235,7 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
int i_ret;
|
||||
std::ostringstream ss;
|
||||
|
||||
LOG_ALWAYS("=============== ROCM SMI initialize ================");
|
||||
ROCmLogging::Logger::getInstance()->enableAllLogLevels();
|
||||
// Leaving below to allow developers to check current log settings
|
||||
// std::string logSettings = Logger::getInstance()->getLogSettings();
|
||||
// std::cout << "Current log settings:\n" << logSettings << std::endl;
|
||||
|
||||
if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) {
|
||||
logSystemDetails();
|
||||
}
|
||||
|
||||
assert(ref_count_ == 1);
|
||||
if (ref_count_ != 1) {
|
||||
@@ -259,6 +251,15 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
// To help debug env variable issues
|
||||
// debugRSMIEnvVarInfo();
|
||||
|
||||
if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) {
|
||||
ROCmLogging::Logger::getInstance()->enableAllLogLevels();
|
||||
LOG_ALWAYS("=============== ROCM SMI initialize ================");
|
||||
logSystemDetails();
|
||||
}
|
||||
// Leaving below to allow developers to check current log settings
|
||||
// std::string logSettings = ROCmLogging::Logger::getInstance()->getLogSettings();
|
||||
// std::cout << "Current log settings:\n" << logSettings << std::endl;
|
||||
|
||||
while (!std::string(kAMDMonitorTypes[i]).empty()) {
|
||||
amd_monitor_types_.insert(kAMDMonitorTypes[i]);
|
||||
++i;
|
||||
@@ -283,6 +284,7 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
<< " | [before] device->path() = " << device->path()
|
||||
<< "\n | bdfid = " << bdfid
|
||||
<< "\n | device->bdfid() = " << device->bdfid()
|
||||
<< " (" << print_int_as_hex(device->bdfid()) << ")"
|
||||
<< "\n | (xgmi node) setting to setting "
|
||||
<< "device->set_bdfid(device->bdfid())";
|
||||
LOG_TRACE(ss);
|
||||
@@ -293,6 +295,7 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
<< " | [before] device->path() = " << device->path()
|
||||
<< "\n | bdfid = " << bdfid
|
||||
<< "\n | device->bdfid() = " << device->bdfid()
|
||||
<< " (" << print_int_as_hex(device->bdfid()) << ")"
|
||||
<< "\n | (legacy/pcie card) setting device->set_bdfid(bdfid)";
|
||||
LOG_TRACE(ss);
|
||||
device->set_bdfid(bdfid);
|
||||
@@ -301,6 +304,7 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
<< " | [after] device->path() = " << device->path()
|
||||
<< "\n | bdfid = " << bdfid
|
||||
<< "\n | device->bdfid() = " << device->bdfid()
|
||||
<< " (" << print_int_as_hex(device->bdfid()) << ")"
|
||||
<< "\n | final update: device->bdfid() holds correct device bdf";
|
||||
LOG_TRACE(ss);
|
||||
}
|
||||
@@ -312,8 +316,11 @@ RocmSMI::Initialize(uint64_t flags) {
|
||||
for (uint32_t dv_ind = 0; dv_ind < devices_.size(); ++dv_ind) {
|
||||
dev = devices_[dv_ind];
|
||||
uint64_t bdfid = dev->bdfid();
|
||||
bdfid = bdfid & 0xFFFFFFFF0FFFFFFF; // clear out partition id in bdf
|
||||
// NOTE: partition_id is not part of bdf (but is part of pci_id)
|
||||
// which is why it is removed in sorting
|
||||
dv_to_id.push_back({bdfid, dev});
|
||||
}
|
||||
}
|
||||
ss << __PRETTY_FUNCTION__ << " Sort index based on BDF.";
|
||||
LOG_DEBUG(ss);
|
||||
|
||||
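The sorting hunk above clears bits [31:28] before using the BDF as a sort key, so that logical partitions of one physical device compare equal. A minimal sketch of that idea follows. The helper names `strip_partition_id` and `sort_by_bdf` are hypothetical and not part of the library, and the use of `std::stable_sort` is an assumption made here to keep partitions in their enumeration order; the sort call the commit actually uses is not shown in this hunk.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Same mask as in the hunk: clears bits [31:28], where the partition id lives.
constexpr uint64_t kPartitionMask = 0xFFFFFFFF0FFFFFFFULL;

// Hypothetical helper: return the BDF with the partition id removed.
inline uint64_t strip_partition_id(uint64_t bdfid) {
  return bdfid & kPartitionMask;
}

// Hypothetical helper: return device indices ordered by BDF, ignoring the
// partition id. stable_sort (an assumption) keeps devices that share a BDF
// in their original relative order.
inline std::vector<uint32_t> sort_by_bdf(const std::vector<uint64_t> &bdfids) {
  std::vector<std::pair<uint64_t, uint32_t>> keyed;
  keyed.reserve(bdfids.size());
  for (uint32_t i = 0; i < bdfids.size(); ++i) {
    keyed.emplace_back(strip_partition_id(bdfids[i]), i);
  }
  std::stable_sort(keyed.begin(), keyed.end(),
                   [](const std::pair<uint64_t, uint32_t> &a,
                      const std::pair<uint64_t, uint32_t> &b) {
                     return a.first < b.first;
                   });
  std::vector<uint32_t> order;
  order.reserve(keyed.size());
  for (const auto &entry : keyed) order.push_back(entry.second);
  return order;
}
```

For example, `sort_by_bdf({0x10000C00, 0x500, 0xC00})` places index 1 first and then keeps indices 0 and 2 adjacent, because both reduce to the same key `0xC00`.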
@@ -734,7 +741,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
continue;
|
||||
sscanf(&dentry->d_name[strlen(kDeviceNamePrefix)], "%d", &cardId);
|
||||
if (cardId > max_cardId)
|
||||
max_cardId = cardId;
|
||||
max_cardId = cardId;
|
||||
count++;
|
||||
}
|
||||
dentry = readdir(drm_dir);
|
||||
@@ -748,23 +755,47 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
uint64_t s_gpu_id = 0;
|
||||
uint64_t s_unique_id = 0;
|
||||
uint64_t s_location_id = 0;
|
||||
uint64_t s_bdf = 0;
|
||||
uint64_t s_domain = 0;
|
||||
uint8_t s_bus = 0;
|
||||
uint8_t s_device = 0;
|
||||
uint8_t s_function = 0;
|
||||
uint8_t s_partition_id = 0;
|
||||
uint64_t padding = 0; // padding added in case new changes in future
|
||||
};
|
||||
// allSystemNodes[key = unique_id] => {node_id, gpu_id, unique_id,
|
||||
// location_id}
|
||||
// location_id, bdf, domain, bus, device,
|
||||
// partition_id}
|
||||
std::multimap<uint64_t, systemNode> allSystemNodes;
|
||||
uint32_t node_id = 0;
|
||||
static const int BYTE = 8;
|
||||
while (true) {
|
||||
uint64_t gpu_id = 0, unique_id = 0, location_id = 0;
|
||||
uint64_t gpu_id = 0, unique_id = 0, location_id = 0, domain = 0;
|
||||
int ret_gpu_id = get_gpu_id(node_id, &gpu_id);
|
||||
int ret_unique_id = read_node_properties(node_id, "unique_id", &unique_id);
|
||||
int ret_loc_id =
|
||||
read_node_properties(node_id, "location_id", &location_id);
|
||||
if (ret_gpu_id == 0 || ret_unique_id == 0 || ret_loc_id == 0) {
|
||||
int ret_domain =
|
||||
read_node_properties(node_id, "domain", &domain);
|
||||
if (ret_gpu_id == 0 &&
|
||||
!(ret_unique_id != 0 || ret_loc_id != 0)) {
|
||||
// Do not try to build a node if one of these fields
|
||||
// do not exist in KFD (0 as values okay)
|
||||
systemNode myNode;
|
||||
myNode.s_node_id = node_id;
|
||||
myNode.s_gpu_id = gpu_id;
|
||||
myNode.s_unique_id = unique_id;
|
||||
myNode.s_location_id = location_id;
|
||||
myNode.s_domain = domain & 0xFFFFFFFF;
|
||||
myNode.s_bdf = (myNode.s_domain << 32) | (myNode.s_location_id);
|
||||
myNode.s_location_id = myNode.s_bdf;
|
||||
myNode.s_bdf |= ((domain & 0xFFFFFFFF) << 32);
|
||||
myNode.s_location_id = myNode.s_bdf;
|
||||
myNode.s_domain = myNode.s_location_id >> 32;
|
||||
myNode.s_bus = ((myNode.s_location_id >> 8) & 0xFF);
|
||||
myNode.s_device = ((myNode.s_location_id >> 3) & 0x1F);
|
||||
myNode.s_function = myNode.s_location_id & 0x7;
|
||||
myNode.s_partition_id = ((myNode.s_location_id >> 28) & 0xF);
|
||||
if (gpu_id != 0) { // only add gpu nodes, 0 = CPU
|
||||
allSystemNodes.emplace(unique_id, myNode);
|
||||
}
|
||||
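The field extraction in the hunk above follows the bit layout given in the commit message (domain in [63:32], partition id in [31:28], bus in [15:8], device in [7:3], function in [2:0]). The sketch below reuses the same shifts and masks in a standalone form. `DecodedBdf`, `decode_bdf`, and `effective_partition_id` are hypothetical names introduced for illustration, and the single `is_spx` flag is a simplification of the `doesDeviceSupportPartitions && strCompPartition != "SPX"` check in the later hunks.

```cpp
#include <cstdint>

// Hypothetical holder for the decoded fields; not a type from the library.
struct DecodedBdf {
  uint32_t domain;
  uint8_t partition_id;
  uint8_t bus;
  uint8_t device;
  uint8_t function;
};

// Split a 64-bit location id with the same shifts and masks as the hunk.
inline DecodedBdf decode_bdf(uint64_t location_id) {
  DecodedBdf out{};
  out.domain = static_cast<uint32_t>(location_id >> 32);
  out.partition_id = static_cast<uint8_t>((location_id >> 28) & 0xF);
  out.bus = static_cast<uint8_t>((location_id >> 8) & 0xFF);
  out.device = static_cast<uint8_t>((location_id >> 3) & 0x1F);
  out.function = static_cast<uint8_t>(location_id & 0x7);
  return out;
}

// Fallback described in the commit message: in a non-SPX mode, when bits
// [31:28] are zero, the partition id may be carried in the function bits.
inline uint8_t effective_partition_id(const DecodedBdf &bdf, bool is_spx) {
  if (!is_spx && bdf.partition_id == 0) return bdf.function;
  return bdf.partition_id;
}
```

With a location id of `0xC03` (bus `0x0C`, device 0, function 3, no partition bits set), the fallback reports partition 3 in a non-SPX mode and partition 0 in SPX.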
@@ -780,6 +811,12 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
<< "; gpu_id = " << std::to_string(i.second.s_gpu_id)
|
||||
<< "; unique_id = " << std::to_string(i.second.s_unique_id)
|
||||
<< "; location_id = " << std::to_string(i.second.s_location_id)
|
||||
<< "; bdf = " << print_int_as_hex(i.second.s_bdf)
|
||||
<< "; domain = " << print_int_as_hex(i.second.s_domain, true, 2*BYTE)
|
||||
<< "; bus = " << print_int_as_hex(i.second.s_bus, true, BYTE)
|
||||
<< "; device = " << print_int_as_hex(i.second.s_device, true, BYTE)
|
||||
<< "; function = " << std::to_string(i.second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i.second.s_partition_id)
|
||||
<< "], ";
|
||||
}
|
||||
ss << "}";
|
||||
@@ -817,13 +854,67 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
rsmi_status_t ret_unique_id =
|
||||
rsmi_dev_unique_id_get(cardAdded, &device_uuid);
|
||||
auto temp_numb_nodes = allSystemNodes.count(device_uuid);
|
||||
auto it = allSystemNodes.lower_bound(device_uuid);
|
||||
if (it != allSystemNodes.end() && doesDeviceSupportPartitions && temp_numb_nodes > 1
|
||||
auto primaryBdfId =
|
||||
allSystemNodes.lower_bound(device_uuid)->second.s_location_id;
|
||||
auto i = allSystemNodes.lower_bound(device_uuid);
|
||||
if (doesDeviceSupportPartitions && temp_numb_nodes > 1
|
||||
&& ret_unique_id == RSMI_STATUS_SUCCESS) {
|
||||
auto primaryBdfId = it->second.s_location_id;
|
||||
// helps identify xgmi nodes (secondary nodes) easier
|
||||
ss << __PRETTY_FUNCTION__ << " | secondary node add ; "
|
||||
<< " BDF = " << std::to_string(primaryBdfId)
|
||||
<< " (" << print_int_as_hex(primaryBdfId) << ")";
|
||||
LOG_DEBUG(ss);
|
||||
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
|
||||
&& i->second.s_partition_id == 0) {
|
||||
i->second.s_partition_id = i->second.s_function;
|
||||
ss << __PRETTY_FUNCTION__ << " | (secondary node add) fall back - "
|
||||
<< "detected !SPX && partition_id == 0"
|
||||
<< "; function = " << std::to_string(i->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i->second.s_partition_id);
|
||||
LOG_DEBUG(ss);
|
||||
}
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | (secondary node add) B4 AddToDeviceList() -->"
|
||||
<< "\n[node_id = " << std::to_string(i->second.s_node_id)
|
||||
<< "; gpu_id = " << std::to_string(i->second.s_gpu_id)
|
||||
<< "; unique_id = " << std::to_string(i->second.s_unique_id)
|
||||
<< "; location_id = " << std::to_string(i->second.s_location_id)
|
||||
<< "; bdf = " << print_int_as_hex(i->second.s_bdf)
|
||||
<< "; domain = " << print_int_as_hex(i->second.s_domain, true, 2*BYTE)
|
||||
<< "; bus = " << print_int_as_hex(i->second.s_bus, true, BYTE)
|
||||
<< "; device = " << print_int_as_hex(i->second.s_device, true, BYTE)
|
||||
<< "; function = " << std::to_string(i->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i->second.s_partition_id)
|
||||
<< "], ";
|
||||
LOG_DEBUG(ss);
|
||||
AddToDeviceList(d_name, primaryBdfId);
|
||||
} else {
|
||||
ss << __PRETTY_FUNCTION__ << " | primary node add ; "
|
||||
<< " BDF = " << std::to_string(UINT64_MAX);
|
||||
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
|
||||
&& i->second.s_partition_id == 0) {
|
||||
i->second.s_partition_id = i->second.s_function;
|
||||
ss << __PRETTY_FUNCTION__ << " | (primary node add) fall back - "
|
||||
<< "detected !SPX && partition_id == 0"
|
||||
<< "; function = " << std::to_string(i->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i->second.s_partition_id);
|
||||
LOG_DEBUG(ss);
|
||||
}
|
||||
LOG_DEBUG(ss);
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | (primary node add) After AddToDeviceList() -->"
|
||||
<< "\n[node_id = " << std::to_string(i->second.s_node_id)
|
||||
<< "; gpu_id = " << std::to_string(i->second.s_gpu_id)
|
||||
<< "; unique_id = " << std::to_string(i->second.s_unique_id)
|
||||
<< "; location_id = " << std::to_string(i->second.s_location_id)
|
||||
<< "; bdf = " << print_int_as_hex(i->second.s_bdf)
|
||||
<< "; domain = " << print_int_as_hex(i->second.s_domain, true, 2*BYTE)
|
||||
<< "; bus = " << print_int_as_hex(i->second.s_bus, true, BYTE)
|
||||
<< "; device = " << print_int_as_hex(i->second.s_device, true, BYTE)
|
||||
<< "; function = " << std::to_string(i->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i->second.s_partition_id)
|
||||
<< "], ";
|
||||
LOG_DEBUG(ss);
|
||||
AddToDeviceList(d_name, UINT64_MAX);
|
||||
}
|
||||
|
||||
@@ -834,6 +925,12 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
<< "; gpu_id = " << std::to_string(i.second.s_gpu_id)
|
||||
<< "; unique_id = " << std::to_string(i.second.s_unique_id)
|
||||
<< "; location_id = " << std::to_string(i.second.s_location_id)
|
||||
<< "; bdf = " << print_int_as_hex(i.second.s_bdf)
|
||||
<< "; domain = " << print_int_as_hex(i.second.s_domain, true, 2*BYTE)
|
||||
<< "; bus = " << print_int_as_hex(i.second.s_bus, true, BYTE)
|
||||
<< "; device = " << print_int_as_hex(i.second.s_device, true, BYTE)
|
||||
<< "; function = " << std::to_string(i.second.s_function)
|
||||
<< "; partition_id = " << std::to_string(i.second.s_partition_id)
|
||||
<< "], ";
|
||||
}
|
||||
ss << "}";
|
||||
@@ -909,6 +1006,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
auto removalGpuId = it->second.s_gpu_id;
|
||||
auto removalUniqueId = it->second.s_unique_id;
|
||||
auto removalLocId = it->second.s_location_id;
|
||||
auto removaldomain = it->second.s_domain;
|
||||
auto nodesErased = 1;
|
||||
primary_location_id = removalLocId;
|
||||
allSystemNodes.erase(it++);
|
||||
@@ -919,6 +1017,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
<< "; gpu_id = " << std::to_string(removalGpuId)
|
||||
<< "; unique_id = " << std::to_string(removalUniqueId)
|
||||
<< "; location_id = " << std::to_string(removalLocId)
|
||||
<< "; removaldomain = " << std::to_string(removaldomain)
|
||||
<< "]";
|
||||
LOG_DEBUG(ss);
|
||||
}
|
||||
@@ -926,15 +1025,34 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
|
||||
break;
|
||||
}
|
||||
auto myBdfId = it->second.s_location_id;
|
||||
AddToDeviceList(secNode, myBdfId);
|
||||
ss << __PRETTY_FUNCTION__ << " | secondary node add #2; "
|
||||
<< " BDF = " << std::to_string(myBdfId)
|
||||
<< " (" << print_int_as_hex(myBdfId) << ")";
|
||||
LOG_DEBUG(ss);
|
||||
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
|
||||
&& it->second.s_partition_id == 0) {
|
||||
it->second.s_partition_id = it->second.s_function;
|
||||
ss << __PRETTY_FUNCTION__ << " | (secondary node add #2) fall back - "
|
||||
<< "detected !SPX && partition_id == 0"
|
||||
<< "; function = " << std::to_string(it->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(it->second.s_partition_id);
|
||||
LOG_DEBUG(ss);
|
||||
}
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< "\nSECONDARY --> After adding new node; ERASING -> [node_id = "
|
||||
<< std::to_string(it->second.s_node_id)
|
||||
<< " | (secondary node add #2) B4 AddToDeviceList() -->"
|
||||
<< "\n[node_id = " << std::to_string(it->second.s_node_id)
|
||||
<< "; gpu_id = " << std::to_string(it->second.s_gpu_id)
|
||||
<< "; unique_id = " << std::to_string(it->second.s_unique_id)
|
||||
<< "; location_id = " << std::to_string(it->second.s_location_id)
|
||||
<< "]";
|
||||
<< "; bdf = " << print_int_as_hex(it->second.s_bdf)
|
||||
<< "; domain = " << print_int_as_hex(it->second.s_domain, true, 2*BYTE)
|
||||
<< "; bus = " << print_int_as_hex(it->second.s_bus, true, BYTE)
|
||||
<< "; device = " << print_int_as_hex(it->second.s_device, true, BYTE)
|
||||
<< "; function = " << std::to_string(it->second.s_function)
|
||||
<< "; partition_id = " << std::to_string(it->second.s_partition_id)
|
||||
<< "], ";
|
||||
LOG_DEBUG(ss);
|
||||
AddToDeviceList(secNode, myBdfId);
|
||||
allSystemNodes.erase(it++);
|
||||
numb_nodes--;
|
||||
cardAdded++;
|
||||
|
||||
@@ -1113,6 +1113,7 @@ static std::string print_pnt(rsmi_od_vddc_point_t *pt) {
|
||||
ss << "\t\t** Voltage: " << pt->voltage << " mV\n";
|
||||
return ss.str();
|
||||
}
|
||||
|
||||
static std::string pt_vddc_curve(rsmi_od_volt_curve *c) {
|
||||
std::ostringstream ss;
|
||||
if (c == nullptr) {
|
||||
@@ -1182,16 +1183,31 @@ bool is_sudo_user() {
|
||||
return isRunningWithSudo;
|
||||
}
|
||||
|
||||
rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind,
|
||||
std::string *gfx_version) {
|
||||
// string output of gfx_<version>
|
||||
rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind, std::string *gfx_version) {
|
||||
std::ostringstream ss;
|
||||
uint64_t kfd_gfx_version = 0;
|
||||
GET_DEV_AND_KFDNODE_FROM_INDX
|
||||
|
||||
int ret = kfd_node->get_gfx_target_version(&kfd_gfx_version);
|
||||
uint64_t orig_target_version = 0;
|
||||
uint64_t major = 0;
|
||||
uint64_t minor = 0;
|
||||
uint64_t rev = 0;
|
||||
if (ret == 0) {
|
||||
ss << "gfx" << kfd_gfx_version;
|
||||
*gfx_version = ss.str();
|
||||
orig_target_version = std::stoull(std::to_string(kfd_gfx_version));
|
||||
// separate out parts -> put back into normal graphics version format
|
||||
major = static_cast<uint64_t>((orig_target_version / 10000) * 100);
|
||||
minor = static_cast<uint64_t>((orig_target_version % 10000 / 100) * 10);
|
||||
if (minor == 0) major *= 10; // 0 as a minor is correct, but bump up by 10
|
||||
rev = static_cast<uint64_t>(orig_target_version % 100);
|
||||
*gfx_version = "gfx" + std::to_string(major + minor + rev);
|
||||
ss << __PRETTY_FUNCTION__
|
||||
<< " | " << std::dec << "kfd_target_version = " << orig_target_version
|
||||
<< "; major = " << major << "; minor = " << minor << "; rev = "
|
||||
<< rev << "\nReporting rsmi_get_gfx_target_version = " << *gfx_version
|
||||
<< "\n";
|
||||
LOG_INFO(ss);
|
||||
return RSMI_STATUS_SUCCESS;
|
||||
} else {
|
||||
*gfx_version = "Unknown";
|
||||
|
||||
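The arithmetic in the hunk above turns the decimal `gfx_target_version` value read from KFD into a `gfx<N>` string. The sketch below isolates that arithmetic so it can be checked by hand. `format_gfx_target_version` is a hypothetical name, and the function only mirrors the lines shown in the hunk; it does not model the error path or the logging.

```cpp
#include <cstdint>
#include <string>

// Hypothetical standalone copy of the conversion shown in the hunk.
// The KFD value is read as major * 10000 + minor * 100 + rev.
inline std::string format_gfx_target_version(uint64_t kfd_gfx_version) {
  uint64_t major = (kfd_gfx_version / 10000) * 100;
  uint64_t minor = (kfd_gfx_version % 10000 / 100) * 10;
  if (minor == 0) major *= 10;  // same zero-minor adjustment as the hunk
  uint64_t rev = kfd_gfx_version % 100;
  return "gfx" + std::to_string(major + minor + rev);
}
```

For instance, `90402` yields `gfx942` and `100300` yields `gfx1030`. Inputs whose minor field is zero go through the multiply-by-ten branch, so it is worth checking those outputs against the names you expect.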
@@ -753,18 +753,54 @@ amdsmi_get_gpu_asic_info(amdsmi_processor_handle processor_handle, amdsmi_asic_i
|
||||
// default to 0xffff as not supported
|
||||
info->oam_id = std::numeric_limits<uint16_t>::max();
|
||||
uint16_t tmp_oam_id = 0;
|
||||
status = rsmi_wrapper(rsmi_dev_oam_id_get, processor_handle, &(tmp_oam_id));
|
||||
status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle, &(tmp_oam_id));
|
||||
info->oam_id = tmp_oam_id;
|
||||
|
||||
// default to 0xffffffff as not supported
|
||||
info->num_of_compute_units = std::numeric_limits<uint32_t>::max();
|
||||
auto tmp_num_of_compute_units = uint32_t(0);
|
||||
status = rsmi_wrapper(amd::smi::rsmi_dev_number_of_computes_get, processor_handle,
|
||||
&tmp_num_of_compute_units);
|
||||
&(tmp_num_of_compute_units));
|
||||
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
info->num_of_compute_units = tmp_num_of_compute_units;
|
||||
}
|
||||
|
||||
// default to 0xffffffffffffffff as not supported
|
||||
info->target_graphics_version = std::numeric_limits<uint64_t>::max();
|
||||
auto tmp_target_gfx_version = uint64_t(0);
|
||||
status = rsmi_wrapper(rsmi_dev_target_graphics_version_get, processor_handle,
|
||||
&(tmp_target_gfx_version));
|
||||
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
info->target_graphics_version = tmp_target_gfx_version;
|
||||
}
|
||||
|
||||
// default to 0xffffffffffffffff as not supported
|
||||
info->kfd_id = std::numeric_limits<uint64_t>::max();
|
||||
auto tmp_kfd_id = uint64_t(0);
|
||||
status = rsmi_wrapper(rsmi_dev_guid_get, processor_handle,
|
||||
&(tmp_kfd_id));
|
||||
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
info->kfd_id = tmp_kfd_id;
|
||||
}
|
||||
|
||||
// default to 0xffffffff as not supported
|
||||
info->node_id = std::numeric_limits<uint32_t>::max();
|
||||
auto tmp_node_id = uint32_t(0);
|
||||
status = rsmi_wrapper(rsmi_dev_node_id_get, processor_handle,
|
||||
&(tmp_node_id));
|
||||
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
info->node_id = tmp_node_id;
|
||||
}
|
||||
|
||||
// default to 0xffffffff as not supported
|
||||
info->partition_id = std::numeric_limits<uint32_t>::max();
|
||||
auto tmp_partition_id = uint32_t(0);
|
||||
status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle,
|
||||
&(tmp_partition_id));
|
||||
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
|
||||
info->partition_id = tmp_partition_id;
|
||||
}
|
||||
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
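Each new field in the hunk above follows one pattern: preset the field to the all-ones value of its type as a "not supported" marker, run the query into a temporary, and overwrite the field only on success. A minimal sketch of that pattern, assuming a hypothetical `Status` enum in place of `amdsmi_status_t` and a hypothetical `query_or_sentinel` helper that the library does not provide:

```cpp
#include <cstdint>
#include <limits>

// Hypothetical status codes standing in for amdsmi_status_t.
enum class Status { kSuccess, kNotSupported };

// Hypothetical helper: run `query` into a temporary and return its value on
// success; otherwise return the all-ones sentinel for T, which the hunk
// treats as "not supported".
template <typename T, typename Query>
T query_or_sentinel(Query query) {
  T value = 0;
  if (query(&value) == Status::kSuccess) return value;
  return std::numeric_limits<T>::max();
}
```

A caller can then compare the result against `std::numeric_limits<T>::max()` to tell an unsupported field from a real value, which is what the updated `TestSysInfoRead` checks do for `target_graphics_version`, `kfd_id`, `node_id`, and `partition_id`.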
@@ -52,6 +52,8 @@
|
||||
#include "amd_smi/impl/amd_smi_common.h"
|
||||
#include "rocm_smi/rocm_smi.h"
|
||||
#include "rocm_smi/rocm_smi_main.h"
|
||||
#include "rocm_smi/rocm_smi_utils.h"
|
||||
#include "rocm_smi/rocm_smi_logger.h"
|
||||
|
||||
namespace amd {
|
||||
namespace smi {
|
||||
@@ -173,10 +175,26 @@ amdsmi_status_t AMDSmiDrm::init() {
|
||||
}
|
||||
|
||||
has_valid_fds = true;
|
||||
bdf.function_number = device->businfo.pci->func;
|
||||
bdf.device_number = device->businfo.pci->dev;
|
||||
bdf.bus_number = device->businfo.pci->bus;
|
||||
bdf.domain_number = device->businfo.pci->domain;
|
||||
std::ostringstream ss;
|
||||
uint64_t bdf_rocm = 0;
|
||||
rsmi_dev_pci_id_get(i, &bdf_rocm);
|
||||
ss << __PRETTY_FUNCTION__ << " | "
|
||||
<< "bdf_rocm | Received bdf: "
|
||||
<< "\nWhole BDF: " << amd::smi::print_unsigned_hex_and_int(bdf_rocm)
|
||||
<< "\nDomain = "
|
||||
<< amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xFFFFFFFF00000000) >> 32)
|
||||
<< "; \nBus# = " << amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xFF00) >> 8)
|
||||
<< "; \nDevice# = "<< amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xF8) >> 3)
|
||||
<< "; \nFunction# = " << amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0x7));
|
||||
LOG_INFO(ss);
|
||||
bdf.function_number = ((bdf_rocm & 0x7));
|
||||
bdf.device_number = ((bdf_rocm & 0xF8) >> 3);
|
||||
bdf.bus_number = ((bdf_rocm & 0xFF00) >> 8);
|
||||
bdf.domain_number = ((bdf_rocm & 0xFFFFFFFF00000000) >> 32);
|
||||
ss << __PRETTY_FUNCTION__ << " | " << "Received bdf: Domain = " << bdf.domain_number
|
||||
<< "; Bus# = " << bdf.bus_number << "; Device# = "<< bdf.device_number
|
||||
<< "; Function# = " << bdf.function_number;
|
||||
LOG_INFO(ss);
|
||||
|
||||
vendor_id = device->deviceinfo.pci->vendor_id;
|
||||
|
||||
@@ -309,6 +327,14 @@ amdsmi_status_t AMDSmiDrm::get_drm_fd_by_index(uint32_t gpu_index, uint32_t *fd_
|
||||
amdsmi_status_t AMDSmiDrm::get_bdf_by_index(uint32_t gpu_index, amdsmi_bdf_t *bdf_info) const {
|
||||
if (gpu_index + 1 > drm_bdfs_.size()) return AMDSMI_STATUS_NOT_SUPPORTED;
|
||||
*bdf_info = drm_bdfs_[gpu_index];
|
||||
std::ostringstream ss;
|
||||
ss << __PRETTY_FUNCTION__ << " | gpu_index = " << gpu_index
|
||||
<< "; \nreceived bdf: Domain = " << bdf_info->domain_number
|
||||
<< "; \nBus# = " << bdf_info->bus_number
|
||||
<< "; \nDevice# = " << bdf_info->device_number
|
||||
<< "; \nFunction# = " << bdf_info->function_number
|
||||
<< "\nReturning = AMDSMI_STATUS_SUCCESS";
|
||||
LOG_INFO(ss);
|
||||
return AMDSMI_STATUS_SUCCESS;
|
||||
}
|
||||
|
||||
|
||||
@@ -48,6 +48,7 @@
|
||||
|
||||
#include <iostream>
|
||||
#include <string>
|
||||
#include <limits>
|
||||
|
||||
#include <gtest/gtest.h>
|
||||
#include "amd_smi/amdsmi.h"
|
||||
@@ -58,7 +59,9 @@
|
||||
TestSysInfoRead::TestSysInfoRead() : TestBase() {
|
||||
set_title("AMDSMI System Info Read Test");
|
||||
set_description("This test verifies that system information such as the "
|
||||
"BDFID, AMDSMI version, VBIOS version, etc. can be read properly.");
|
||||
"BDFID, AMDSMI version, VBIOS version, "
|
||||
"vendor_id, unique_id, target_gfx_version, kfd_id, node_id, partition_id, etc. "
|
||||
"can be read properly.");
|
||||
}
|
||||
|
||||
TestSysInfoRead::~TestSysInfoRead(void) {
|
||||
@@ -150,22 +153,39 @@ void TestSysInfoRead::Run(void) {
|
||||
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
|
||||
|
||||
|
||||
// vendor_id, unique_id
|
||||
amdsmi_asic_info_t asci_info;
|
||||
err = amdsmi_get_gpu_asic_info(processor_handles_[0], &asci_info);
|
||||
// vendor_id, unique_id, target_gfx_version, kfd_id, node_id, partition_id
|
||||
amdsmi_asic_info_t asci_info = {};
|
||||
err = amdsmi_get_gpu_asic_info(processor_handles_[i], &asci_info);
|
||||
if (err == AMDSMI_STATUS_NOT_SUPPORTED) {
|
||||
std::cout <<
|
||||
"\t**amdsmi_dev_unique_id() is not supported"
|
||||
" on this machine" << std::endl;
|
||||
EXPECT_EQ(asci_info.target_graphics_version, std::numeric_limits<uint64_t>::max());
|
||||
EXPECT_EQ(asci_info.kfd_id, std::numeric_limits<uint64_t>::max());
|
||||
EXPECT_EQ(asci_info.node_id, std::numeric_limits<uint32_t>::max());
|
||||
EXPECT_EQ(asci_info.partition_id, std::numeric_limits<uint32_t>::max());
|
||||
// Verify api support checking functionality is working
|
||||
err = amdsmi_get_gpu_asic_info(processor_handles_[i], nullptr);
|
||||
ASSERT_EQ(err, AMDSMI_STATUS_NOT_SUPPORTED);
|
||||
} else {
|
||||
if (err == AMDSMI_STATUS_SUCCESS) {
|
||||
IF_VERB(STANDARD) {
|
||||
std:: cout << "\t**GPU PCIe Vendor : "
|
||||
std:: cout << "\t**GPU PCIe Vendor : "
|
||||
<< asci_info.vendor_name << std::endl;
|
||||
std::cout << "\t**Target GFX version: " << std::dec
|
||||
<< asci_info.target_graphics_version << "\n";
|
||||
std::cout << "\t**KFD ID: " << std::dec
|
||||
<< asci_info.kfd_id << "\n";
|
||||
std::cout << "\t**Node ID: " << std::dec
|
||||
<< asci_info.node_id << "\n";
|
||||
std::cout << "\t**Partition ID: " << std::dec
|
||||
<< asci_info.partition_id << "\n";
|
||||
}
|
||||
EXPECT_EQ(err, AMDSMI_STATUS_SUCCESS);
|
||||
EXPECT_NE(asci_info.target_graphics_version, std::numeric_limits<uint64_t>::max());
|
||||
EXPECT_NE(asci_info.kfd_id, std::numeric_limits<uint64_t>::max());
|
||||
EXPECT_NE(asci_info.node_id, std::numeric_limits<uint32_t>::max());
|
||||
EXPECT_NE(asci_info.partition_id, std::numeric_limits<uint32_t>::max());
|
||||
// Verify api support checking functionality is working
|
||||
err = amdsmi_get_gpu_asic_info(processor_handles_[i], nullptr);
|
||||
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
|
||||
|
||||
@@ -137,8 +137,7 @@ void TestTempRead::Run(void) {
|
||||
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
|
||||
|
||||
IF_VERB(STANDARD) {
|
||||
std::cout << "\t**" << label << ": " << val_i64/1000 <<
|
||||
"C" << std::endl;
|
||||
std::cout << "\t**" << label << ": " << val_i64 << "C" << std::endl;
|
||||
}
|
||||
};
|
||||
for (type = AMDSMI_TEMPERATURE_TYPE_FIRST; type <= AMDSMI_TEMPERATURE_TYPE__MAX; ++type) {
|
||||
|
||||