[SWDEV-483526] Fix MI3x partitions not showing all logical nodes

Changes:
- Updates to amdsmi_asic_info_t structure to include:
  target_graphics_version, kfd_id, node_id, partition_id
- Updates to amd-smi static --asic to display new
  amdsmi_asic_info_t fields
- Updates to gpu enumeration during amdsmi_init()
  to discover all logical GPUs when in a non-SPX mode
  (ex. DPX, TPX, QPX, or CPX)
 - Updates to amdsmi_get_gpu_bdf_id(..) to include
   partition_id details when in BDF or optional bits.
     - bits [63:32] = domain
     - bits [31:28] or bits [2:0] = partition id
     - bits [27:16] = reserved
     - bits [15:8]  = Bus
     - bits [7:3] = Device
     - bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes

- C++/Python tests updated to reflect these outputs

Change-Id: I4be0ea35bb98f3109ae2ca9e82f6b21baa38de29
Signed-off-by: Charis Poag <Charis.Poag@amd.com>


[ROCm/amdsmi commit: a33e4c9e14]
This commit is contained in:
Charis Poag
2024-09-11 09:42:32 -05:00
committed by Maisam Arif
parent 202ddc01aa
commit df9d5d3ee5
18 changed files with 971 additions and 2774 deletions
+145 -2
@@ -175,6 +175,98 @@ Legend:
64,32 = 64 bit and 32 bit atomic support
<BW from>-<BW to>
```
- **Added Target_Graphics_Version, KFD_ID, Node_id, and partition id to `amd-smi static --asic`**.
Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers
were placed within the `amdsmi_asic_info_t` struct. These new fields are only available for BM/Guest Linux
devices at this time.
```C
typedef struct {
char market_name[AMDSMI_256_LENGTH];
uint32_t vendor_id; //< Use 32 bit to be compatible with other platform.
char vendor_name[AMDSMI_MAX_STRING_LENGTH];
uint32_t subvendor_id; //< The subsystem vendor id
uint64_t device_id; //< The device id of a GPU
uint32_t rev_id;
char asic_serial[AMDSMI_NORMAL_STRING_LENGTH];
uint32_t oam_id; //< 0xFFFF if not supported
uint32_t num_of_compute_units; //< 0xFFFFFFFF if not supported
uint64_t target_graphics_version; //< 0xFFFFFFFFFFFFFFFF if not supported
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
uint32_t node_id; //< 0xFFFFFFFF if not supported
uint32_t partition_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[17];
} amdsmi_asic_info_t;
```
```shell
$ amd-smi static --asic --board --bus --partition
GPU: 0
ASIC:
MARKET_NAME: MI308X
VENDOR_ID: 0x1002
VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
SUBVENDOR_ID: 0x1002
DEVICE_ID: 0x74a2
TARGET_GRAPHICS_VERSION: gfx942
KFD_ID: 24248
NODE_ID: 2
PARTITION_ID: 0
SUBSYSTEM_ID: 0x74a2
REV_ID: 0x00
ASIC_SERIAL: <redacted>
OAM_ID: 5
NUM_COMPUTE_UNITS: 20
BUS:
BDF: 0000:0A:00.0
MAX_PCIE_WIDTH: 16
MAX_PCIE_SPEED: 32 GT/s
PCIE_INTERFACE_VERSION: Gen 5
SLOT_TYPE: PCIE
BOARD:
MODEL_NUMBER: 102-G30218-00
PRODUCT_SERIAL: 692432000576
FRU_ID: 113-AMDG302180002-0000000000000
PRODUCT_NAME: AMD Instinct MI308X OAM
MANUFACTURER_NAME: AMD
PARTITION:
COMPUTE_PARTITION: CPX
MEMORY_PARTITION: NPS4
GPU: 1
ASIC:
MARKET_NAME: MI308X
VENDOR_ID: 0x1002
VENDOR_NAME: Advanced Micro Devices Inc. [AMD/ATI]
SUBVENDOR_ID: 0x1002
DEVICE_ID: 0x74a2
TARGET_GRAPHICS_VERSION: gfx942
KFD_ID: 41657
NODE_ID: 3
PARTITION_ID: 1
SUBSYSTEM_ID: 0x74a2
REV_ID: 0x00
ASIC_SERIAL: <redacted>
OAM_ID: 5
NUM_COMPUTE_UNITS: 20
BUS:
BDF: 0000:0A:00.1
MAX_PCIE_WIDTH: 16
MAX_PCIE_SPEED: 32 GT/s
PCIE_INTERFACE_VERSION: Gen 5
SLOT_TYPE: PCIE
BOARD:
MODEL_NUMBER: 102-G30218-00
PRODUCT_SERIAL: 692432000576
FRU_ID: 113-AMDG302180002-0000000000000
PRODUCT_NAME: AMD Instinct MI308X OAM
MANUFACTURER_NAME: AMD
PARTITION:
COMPUTE_PARTITION: CPX
MEMORY_PARTITION: NPS4
...
```
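The struct comments above give a "not supported" sentinel for each new field (`0xFFFFFFFF` for the 32-bit fields, `0xFFFFFFFFFFFFFFFF` for the 64-bit ones). A minimal sketch of how a caller might filter those sentinels before display follows. The helper name `normalize_asic_fields`, the plain-dictionary input, and the `"N/A"` placeholder are assumptions made for illustration and are not part of the AMD SMI API.

```python
# Sentinel values are taken from the amdsmi_asic_info_t comments above.
# Everything else here (helper name, dict layout, "N/A") is hypothetical.
UNSUPPORTED_U32 = 0xFFFFFFFF
UNSUPPORTED_U64 = 0xFFFFFFFFFFFFFFFF

SENTINELS = {
    "target_graphics_version": UNSUPPORTED_U64,
    "kfd_id": UNSUPPORTED_U64,
    "node_id": UNSUPPORTED_U32,
    "partition_id": UNSUPPORTED_U32,
}


def normalize_asic_fields(raw):
    """Return a copy of raw with sentinel values replaced by "N/A"."""
    result = dict(raw)
    for name, sentinel in SENTINELS.items():
        if result.get(name) == sentinel:
            result[name] = "N/A"
    return result


example = normalize_asic_fields({
    "target_graphics_version": UNSUPPORTED_U64,  # field not reported
    "kfd_id": 24248,
    "node_id": 2,
    "partition_id": UNSUPPORTED_U32,             # field not reported
})
print(example)
```

Keeping the sentinel table next to the field names makes it easy to extend if more identifiers are added to the struct later.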
### Removals
@@ -186,7 +278,58 @@ Legend:
### Resolved issues
- N/A
- **Fixed CPX not showing total number of logical GPUs**.
Updates were made to `amdsmi_init()` and `amdsmi_get_gpu_bdf_id(..)`. To display all logical devices, we needed a way to order the enumerated GPUs. This was done
by adding a partition_id within the BDF optional pci_id bits.
Due to driver changes in KFD, some devices may report the partition ID in bits [31:28] or in bits [2:0]. The updated `amdsmi_get_gpu_bdf_id(..)` provides this fallback to properly retrieve the partition ID. We
plan to eventually remove partition ID from the function portion of the BDF (Bus Device Function). See below for PCI ID description.
- bits [63:32] = domain
- bits [31:28] or bits [2:0] = partition id
- bits [27:16] = reserved
- bits [15:8] = Bus
- bits [7:3] = Device
- bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes
Previously in non-SPX modes (ex. CPX/TPX/DPX/etc) some MI3x ASICs would not report all logical GPU devices within AMD SMI.
```shell
$ amd-smi monitor -p -t -v
GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL
0 248 W 55 °C 48 °C 283 MB 196300 MB
1 247 W 55 °C 48 °C 283 MB 196300 MB
2 247 W 55 °C 48 °C 283 MB 196300 MB
3 247 W 55 °C 48 °C 283 MB 196300 MB
4 221 W 50 °C 42 °C 283 MB 196300 MB
5 221 W 50 °C 42 °C 283 MB 196300 MB
6 222 W 50 °C 42 °C 283 MB 196300 MB
7 221 W 50 °C 42 °C 283 MB 196300 MB
8 239 W 53 °C 46 °C 283 MB 196300 MB
9 239 W 53 °C 46 °C 283 MB 196300 MB
10 239 W 53 °C 46 °C 283 MB 196300 MB
11 239 W 53 °C 46 °C 283 MB 196300 MB
12 219 W 51 °C 48 °C 283 MB 196300 MB
13 219 W 51 °C 48 °C 283 MB 196300 MB
14 219 W 51 °C 48 °C 283 MB 196300 MB
15 219 W 51 °C 48 °C 283 MB 196300 MB
16 222 W 51 °C 47 °C 283 MB 196300 MB
17 222 W 51 °C 47 °C 283 MB 196300 MB
18 222 W 51 °C 47 °C 283 MB 196300 MB
19 222 W 51 °C 48 °C 283 MB 196300 MB
20 241 W 55 °C 48 °C 283 MB 196300 MB
21 241 W 55 °C 48 °C 283 MB 196300 MB
22 241 W 55 °C 48 °C 283 MB 196300 MB
23 240 W 55 °C 48 °C 283 MB 196300 MB
24 211 W 51 °C 45 °C 283 MB 196300 MB
25 211 W 51 °C 45 °C 283 MB 196300 MB
26 211 W 51 °C 45 °C 283 MB 196300 MB
27 211 W 51 °C 45 °C 283 MB 196300 MB
28 227 W 51 °C 49 °C 283 MB 196300 MB
29 227 W 51 °C 49 °C 283 MB 196300 MB
30 227 W 51 °C 49 °C 283 MB 196300 MB
31 227 W 51 °C 49 °C 283 MB 196300 MB
```
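The PCI ID bit layout described in this entry can be sketched as a pair of pack/unpack helpers. This is an illustration only: the function names and the `BdfFields` tuple are hypothetical, and the fallback rule (use the function bits when bits [31:28] are zero) is one assumed reading of the fallback described above, not the library's confirmed logic.

```python
from typing import NamedTuple


class BdfFields(NamedTuple):
    """Hypothetical container for the decoded 64-bit BDF ID."""
    domain: int
    partition_id: int
    bus: int
    device: int
    function: int


def pack_bdf_id(domain, partition_id, bus, device, function):
    """Pack fields using the layout listed above (bits [27:16] stay reserved)."""
    return (((domain & 0xFFFFFFFF) << 32)
            | ((partition_id & 0xF) << 28)
            | ((bus & 0xFF) << 8)
            | ((device & 0x1F) << 3)
            | (function & 0x7))


def unpack_bdf_id(bdf_id):
    """Split a BDF ID into its fields, with an assumed partition fallback."""
    function = bdf_id & 0x7
    partition_id = (bdf_id >> 28) & 0xF
    if partition_id == 0:
        # Assumption: when bits [31:28] are empty, the partition ID may be
        # carried in the function bits [2:0] instead.
        partition_id = function
    return BdfFields(
        domain=(bdf_id >> 32) & 0xFFFFFFFF,
        partition_id=partition_id,
        bus=(bdf_id >> 8) & 0xFF,
        device=(bdf_id >> 3) & 0x1F,
        function=function,
    )


packed = pack_bdf_id(domain=0, partition_id=1, bus=0x0A, device=0, function=0)
print(hex(packed), unpack_bdf_id(packed))
```

With this fallback, a device at `0000:0A:00.1` that leaves bits [31:28] empty would still decode to partition 1, which matches the `GPU: 1` entry in the `amd-smi static` sample earlier in this changelog. In SPX mode both locations are zero, so the decoded partition ID is 0.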
### Known issues
@@ -829,7 +972,7 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json
Previously our reset could attempt to reset non-AMD GPUs, resulting in an "Unable to reset non-amd GPU" error. Fix
updates CLI to target only AMD ASICs.
- **Fix for `amd-smi metric --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**.
- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**.
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Previously, this would report "UNKNOWN". This fix
provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards).
+26 -20
@@ -281,16 +281,16 @@ typedef enum {
*/
typedef enum {
AMDSMI_COMPUTE_PARTITION_INVALID = 0,
AMDSMI_COMPUTE_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with
//!< shared memory
AMDSMI_COMPUTE_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work
//!< together with shared memory
AMDSMI_COMPUTE_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work
//!< together with shared memory
AMDSMI_COMPUTE_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs
//!< work together with shared memory
AMDSMI_COMPUTE_PARTITION_QPX //!< Quad GPU mode (QPX)- Quarter XCCs
//!< work together with shared memory
AMDSMI_COMPUTE_PARTITION_SPX, //!< Single GPU mode (SPX)- All XCCs work
//!< together with shared memory
AMDSMI_COMPUTE_PARTITION_DPX, //!< Dual GPU mode (DPX)- Half XCCs work
//!< together with shared memory
AMDSMI_COMPUTE_PARTITION_TPX, //!< Triple GPU mode (TPX)- One-third XCCs
//!< work together with shared memory
AMDSMI_COMPUTE_PARTITION_QPX, //!< Quad GPU mode (QPX)- Quarter XCCs
//!< work together with shared memory
AMDSMI_COMPUTE_PARTITION_CPX, //!< Core mode (CPX)- Per-chip XCC with
//!< shared memory
} amdsmi_compute_partition_type_t;
/**
@@ -589,7 +589,11 @@ typedef struct {
char asic_serial[AMDSMI_NORMAL_STRING_LENGTH];
uint32_t oam_id; //< 0xFFFF if not supported
uint32_t num_of_compute_units; //< 0xFFFFFFFF if not supported
uint32_t reserved[17];
uint64_t target_graphics_version; //< 0xFFFFFFFFFFFFFFFF if not supported
uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported
uint32_t node_id; //< 0xFFFFFFFF if not supported
uint32_t partition_id; //< 0xFFFFFFFF if not supported
uint32_t reserved[11];
} amdsmi_asic_info_t;
typedef enum {
@@ -2233,16 +2237,18 @@ amdsmi_get_gpu_pci_bandwidth(amdsmi_processor_handle processor_handle,
*
* The format of @p bdfid will be as follows:
*
* BDFID = ((DOMAIN & 0xffffffff) << 32) | ((BUS & 0xff) << 8) |
* ((DEVICE & 0x1f) <<3 ) | (FUNCTION & 0x7)
* BDFID = ((DOMAIN & 0xFFFFFFFF) << 32) | ((Partition & 0xF) << 28)
* | ((BUS & 0xFF) << 8) | ((DEVICE & 0x1F) <<3 )
* | (FUNCTION & 0x7)
*
* | Name | Field |
* ---------- | ------- |
* | Domain | [64:32] |
* | Reserved | [31:16] |
* | Bus | [15: 8] |
* | Device | [ 7: 3] |
* | Function | [ 2: 0] |
* | Name | Field | KFD property KFD -> PCIe ID (uint64_t)
* -------------- | ------- | ---------------- | ---------------------------- |
* | Domain | [63:32] | "domain" | (DOMAIN & 0xFFFFFFFF) << 32 |
* | Partition id | [31:28] | "location id" | (LOCATION & 0xF0000000) |
* | Reserved | [27:16] | "location id" | N/A |
* | Bus | [15: 8] | "location id" | (LOCATION & 0xFF00) |
* | Device | [ 7: 3] | "location id" | (LOCATION & 0xF8) |
* | Function | [ 2: 0] | "location id" | (LOCATION & 0x7) |
*
* @param[in] processor_handle a processor handle
*
+5 -1
@@ -1664,7 +1664,11 @@ def amdsmi_get_gpu_asic_info(
"rev_id": _padHexValue(hex(asic_info_struct.rev_id), 2),
"asic_serial": asic_info_struct.asic_serial.decode("utf-8"),
"oam_id": asic_info_struct.oam_id,
"num_compute_units": asic_info_struct.num_of_compute_units
"num_compute_units": asic_info_struct.num_of_compute_units,
"target_graphics_version": "gfx" + str(asic_info_struct.target_graphics_version),
"kfd_id": asic_info_struct.kfd_id,
"node_id": asic_info_struct.node_id,
"partition_id": asic_info_struct.partition_id
}
string_values = ["market_name", "vendor_name"]
+16 -10
@@ -380,18 +380,18 @@ amdsmi_clk_type_t = ctypes.c_uint32 # enum
# values for enumeration 'amdsmi_compute_partition_type_t'
amdsmi_compute_partition_type_t__enumvalues = {
0: 'AMDSMI_COMPUTE_PARTITION_INVALID',
1: 'AMDSMI_COMPUTE_PARTITION_CPX',
2: 'AMDSMI_COMPUTE_PARTITION_SPX',
3: 'AMDSMI_COMPUTE_PARTITION_DPX',
4: 'AMDSMI_COMPUTE_PARTITION_TPX',
5: 'AMDSMI_COMPUTE_PARTITION_QPX',
1: 'AMDSMI_COMPUTE_PARTITION_SPX',
2: 'AMDSMI_COMPUTE_PARTITION_DPX',
3: 'AMDSMI_COMPUTE_PARTITION_TPX',
4: 'AMDSMI_COMPUTE_PARTITION_QPX',
5: 'AMDSMI_COMPUTE_PARTITION_CPX',
}
AMDSMI_COMPUTE_PARTITION_INVALID = 0
AMDSMI_COMPUTE_PARTITION_CPX = 1
AMDSMI_COMPUTE_PARTITION_SPX = 2
AMDSMI_COMPUTE_PARTITION_DPX = 3
AMDSMI_COMPUTE_PARTITION_TPX = 4
AMDSMI_COMPUTE_PARTITION_QPX = 5
AMDSMI_COMPUTE_PARTITION_SPX = 1
AMDSMI_COMPUTE_PARTITION_DPX = 2
AMDSMI_COMPUTE_PARTITION_TPX = 3
AMDSMI_COMPUTE_PARTITION_QPX = 4
AMDSMI_COMPUTE_PARTITION_CPX = 5
amdsmi_compute_partition_type_t = ctypes.c_uint32 # enum
# values for enumeration 'amdsmi_memory_partition_type_t'
@@ -902,7 +902,13 @@ struct_amdsmi_asic_info_t._fields_ = [
('asic_serial', ctypes.c_char * 32),
('oam_id', ctypes.c_uint32),
('num_of_compute_units', ctypes.c_uint32),
('PADDING_0', ctypes.c_ubyte * 4),
('target_graphics_version', ctypes.c_uint64),
('kfd_id', ctypes.c_uint64),
('node_id', ctypes.c_uint32),
('partition_id', ctypes.c_uint32),
('reserved', ctypes.c_uint32 * 17),
('PADDING_1', ctypes.c_ubyte * 4),
]
amdsmi_asic_info_t = struct_amdsmi_asic_info_t
+8
@@ -509,6 +509,14 @@ def walk_through(self):
asic_info['asic_serial']))
print(" asic_info['oam_id'] is: {}\n".format(
asic_info['oam_id']))
print(" asic_info['target_graphics_version'] is: {}\n".format(
asic_info['target_graphics_version']))
print(" asic_info['kfd_id'] is: {}\n".format(
asic_info['kfd_id']))
print(" asic_info['node_id'] is: {}\n".format(
asic_info['node_id']))
print(" asic_info['partition_id'] is: {}\n".format(
asic_info['partition_id']))
print("###Test amdsmi_get_power_cap_info \n")
power_info = amdsmi.amdsmi_get_power_cap_info(processors[i])
print(" power_info['dpm_cap'] is: {}".format(
+58 -184
@@ -53,6 +53,7 @@
#include <map>
#include <vector>
#include <type_traits>
#include <cstring>
#include "rocm_smi/rocm_smi.h"
#include "rocm_smi/rocm_smi_utils.h"
@@ -730,30 +731,6 @@ template<typename T> constexpr float convert_mw_to_w(T mw) {
return static_cast<float>(mw / 1000.0);
}
template <typename T>
auto print_error_or_value(rsmi_status_t status_code, const T& metric) {
if (status_code == rsmi_status_t::RSMI_STATUS_SUCCESS) {
if constexpr (std::is_array_v<T>) {
auto idx = uint16_t(0);
auto str_values = std::string();
const auto num_elems = static_cast<uint16_t>(std::end(metric) - std::begin(metric));
str_values = ("\n\t\t num of values: " + std::to_string(num_elems) + "\n");
for (const auto& el : metric) {
str_values += "\t\t [" + std::to_string(idx) + "]: " + std::to_string(el) + "\n";
++idx;
}
return str_values;
}
else if constexpr ((std::is_same_v<T, std::uint16_t>) ||
(std::is_same_v<T, std::uint32_t>) ||
(std::is_same_v<T, std::uint64_t>)) {
return std::to_string(metric);
}
}
else {
return ("\n\t\tStatus: [" + std::to_string(status_code) + "] " + "-> " + amd::smi::getRSMIStatusString(status_code));
}
};
template <typename T>
std::string print_unsigned_int(T value) {
@@ -780,6 +757,7 @@ int main() {
uint32_t num_monitor_devs = 0;
rsmi_gpu_metrics_t gpu_metrics;
std::string val_str;
RSMI_POWER_TYPE power_type = RSMI_INVALID_POWER;
rsmi_num_monitor_devices(&num_monitor_devs);
@@ -791,13 +769,23 @@ int main() {
ret = rsmi_dev_revision_get(i, &val_ui16);
CHK_RSMI_RET_I(ret)
std::cout << "\t**Dev.Rev.ID: 0x" << std::hex << val_ui16 << "\n";
ret = amd::smi::rsmi_get_gfx_target_version(i , &val_str);
std::cout << "\t**Target Graphics Version: " << val_str << "\n";
char pcie_vendor_name[256];
ret = rsmi_dev_pcie_vendor_name_get(i, pcie_vendor_name, 256);
CHK_RSMI_RET_I(ret)
std::cout << "\t**PCIe vendor name: " << pcie_vendor_name << std::endl;
ret = rsmi_dev_target_graphics_version_get(i, &val_ui64);
std::cout << "\t**Target Graphics Version: " << std::dec
<< static_cast<uint64_t>(val_ui64) << "\n";
ret = rsmi_dev_guid_get(i, &val_ui64);
std::cout << "\t**GUID: " << std::dec
<< static_cast<uint64_t>(val_ui64) << "\n";
ret = rsmi_dev_node_id_get(i, &val_ui32);
std::cout << "\t**Node ID: " << std::dec
<< static_cast<uint32_t>(val_ui32) << "\n";
char vbios_version[256];
ret = rsmi_dev_vbios_version_get(i, vbios_version, 256);
if (ret == RSMI_STATUS_SUCCESS) {
std::cout << "\t**VBIOS Version: " << vbios_version << "\n";
} else {
std::cout << "\t**VBIOS Version: "
<< amd::smi::getRSMIStatusString(ret, false) << "\n";
}
char current_compute_partition[256];
current_compute_partition[0] = '\0';
@@ -848,8 +836,9 @@ int main() {
//
std::cout << "\n";
print_test_header("GPU METRICS: Using static struct (Backwards Compatibility) ", i);
print_function_header_with_rsmi_ret(ret, "rsmi_dev_gpu_metrics_info_get(" + std::to_string(i) + ", &gpu_metrics)");
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics);
ret = rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics);
print_function_header_with_rsmi_ret(ret, "rsmi_dev_gpu_metrics_info_get("
+ std::to_string(i) + ", &gpu_metrics)");
std::cout << "\t**.common_header.format_revision : "
<< print_unsigned_int(gpu_metrics.common_header.format_revision) << "\n";
@@ -988,173 +977,58 @@ int main() {
for (const auto& dclk : gpu_metrics.current_dclk0s) {
std::cout << "\t -> " << std::dec << dclk << "\n";
}
std::cout << " ** Note: Values MAX'ed out (UINTX MAX are unsupported for the version in question) ** " << "\n";
std::cout << "\n";
std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n";
constexpr uint16_t kMAX_ITER_TEST = 10;
rsmi_gpu_metrics_t gpu_metrics_check;
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics_check);
std::cout << "\t\t -> firmware_timestamp [" << idx
<< "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n";
}
std::cout << "\n";
for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) {
rsmi_dev_gpu_metrics_info_get(i, &gpu_metrics_check);
std::cout << "\t\t -> system_clock_counter [" << idx
<< "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n";
}
std::cout << "\n\n";
std::cout << " ** Note: Values MAX'ed out "
"(UINTX MAX are unsupported for the version in question) ** " << "\n";
std::cout << "\n\n";
print_test_header("GPU METRICS: Using direct APIs (newer)", i);
metrics_table_header_t header_values;
GPUMetricTempHbm_t hbm_values;
GPUMetricVcnActivity_t vcn_values;
GPUMetricXgmiReadDataAcc_t xgmi_read_values;
GPUMetricXgmiWriteDataAcc_t xgmi_write_values;
GPUMetricCurrGfxClk_t curr_gfxclk_values;
GPUMetricCurrSocClk_t curr_socclk_values;
GPUMetricCurrVClk0_t curr_vclk0_values;
GPUMetricCurrDClk0_t curr_dclk0_values;
ret = rsmi_dev_metrics_header_info_get(i, &header_values);
std::cout << "\t[Metrics Header]" << "\n";
std::cout << "\t -> format_revision : " << print_unsigned_int(header_values.format_revision) << "\n";
std::cout << "\t -> content_revision : " << print_unsigned_int(header_values.content_revision) << "\n";
std::cout << "\t -> format_revision : "
<< print_unsigned_int(header_values.format_revision) << "\n";
std::cout << "\t -> content_revision : "
<< print_unsigned_int(header_values.content_revision) << "\n";
std::cout << "\t--------------------" << "\n";
std::cout << "\n";
std::cout << "\t[Temperature]" << "\n";
ret = rsmi_dev_metrics_temp_edge_get(i, &val_ui16);
std::cout << "\t -> temp_edge(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_hotspot_get(i, &val_ui16);
std::cout << "\t -> temp_hotspot(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_mem_get(i, &val_ui16);
std::cout << "\t -> temp_mem(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_vrgfx_get(i, &val_ui16);
std::cout << "\t -> temp_vrgfx(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_vrsoc_get(i, &val_ui16);
std::cout << "\t -> temp_vrsoc(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_vrmem_get(i, &val_ui16);
std::cout << "\t -> temp_vrmem(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_temp_hbm_get(i, &hbm_values);
std::cout << "\t -> temp_hbm(): " << print_error_or_value(ret, hbm_values) << "\n";
std::cout << "\n";
std::cout << "\t[Power/Energy]" << "\n";
ret = rsmi_dev_metrics_curr_socket_power_get(i, &val_ui16);
std::cout << "\t -> current_socket_power(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_energy_acc_get(i, &val_ui64);
std::cout << "\t -> energy_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_avg_socket_power_get(i, &val_ui16);
std::cout << "\t -> average_socket_power(): " << print_error_or_value(ret, val_ui16) << "\n";
std::cout << "\n";
std::cout << "\t[Utilization]" << "\n";
ret = rsmi_dev_metrics_avg_gfx_activity_get(i, &val_ui16);
std::cout << "\t -> average_gfx_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_umc_activity_get(i, &val_ui16);
std::cout << "\t -> average_umc_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_mm_activity_get(i, &val_ui16);
std::cout << "\t -> average_mm_activity(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_vcn_activity_get(i, &vcn_values);
std::cout << "\t -> vcn_activity(): " << print_error_or_value(ret, vcn_values) << "\n";
ret = rsmi_dev_metrics_mem_activity_acc_get(i, &val_ui32);
std::cout << "\t -> mem_activity_accum(): " << print_error_or_value(ret, val_ui32) << "\n";
ret = rsmi_dev_metrics_gfx_activity_acc_get(i, &val_ui32);
std::cout << "\t -> gfx_activity_accum(): " << print_error_or_value(ret, val_ui32) << "\n";
std::cout << "\n";
std::cout << "\t[Average Clock]" << "\n";
ret = rsmi_dev_metrics_avg_gfx_clock_frequency_get(i, &val_ui16);
std::cout << "\t -> average_gfx_clock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_soc_clock_frequency_get(i, &val_ui16);
std::cout << "\t -> average_soc_clock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_uclock_frequency_get(i, &val_ui16);
std::cout << "\t -> average_uclock_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_vclock0_frequency_get(i, &val_ui16);
std::cout << "\t -> average_vclock0_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_dclock0_frequency_get(i, &val_ui16);
std::cout << "\t -> average_dclock0_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_vclock1_frequency_get(i, &val_ui16);
std::cout << "\t -> average_vclock1_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_avg_dclock1_frequency_get(i, &val_ui16);
std::cout << "\t -> average_dclock1_frequency(): " << print_error_or_value(ret, val_ui16) << "\n";
std::cout << "\n";
std::cout << "\t[Current Clock]" << "\n";
ret = rsmi_dev_metrics_curr_vclk1_get(i, &val_ui16);
std::cout << "\t -> current_vclock1(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_curr_dclk1_get(i, &val_ui16);
std::cout << "\t -> current_dclock1(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_curr_uclk_get(i, &val_ui16);
std::cout << "\t -> current_uclock(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_curr_dclk0_get(i, &curr_dclk0_values);
std::cout << "\t -> current_dclk0(): " << print_error_or_value(ret, curr_dclk0_values) << "\n";
ret = rsmi_dev_metrics_curr_gfxclk_get(i, &curr_gfxclk_values);
std::cout << "\t -> current_gfxclk(): " << print_error_or_value(ret, curr_gfxclk_values) << "\n";
ret = rsmi_dev_metrics_curr_socclk_get(i, &curr_socclk_values);
std::cout << "\t -> current_soc_clock(): " << print_error_or_value(ret, curr_socclk_values) << "\n";
ret = rsmi_dev_metrics_curr_vclk0_get(i, &curr_vclk0_values);
std::cout << "\t -> current_vclk0(): " << print_error_or_value(ret, curr_vclk0_values) << "\n";
std::cout << "\n";
std::cout << "\t[Throttle]" << "\n";
ret = rsmi_dev_metrics_indep_throttle_status_get(i, &val_ui64);
std::cout << "\t -> indep_throttle_status(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_throttle_status_get(i, &val_ui32);
std::cout << "\t -> throttle_status(): " << print_error_or_value(ret, val_ui32) << "\n";
std::cout << "\n";
std::cout << "\t[Gfx Clock Lock]" << "\n";
ret = rsmi_dev_metrics_gfxclk_lock_status_get(i, &val_ui32);
std::cout << "\t -> gfxclk_lock_status(): " << print_error_or_value(ret, val_ui32) << "\n";
std::cout << "\n";
std::cout << "\t[Current Fan Speed]" << "\n";
ret = rsmi_dev_metrics_curr_fan_speed_get(i, &val_ui16);
std::cout << "\t -> current_fan_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
std::cout << "\n";
std::cout << "\t[Link/Bandwidth/Speed]" << "\n";
ret = rsmi_dev_metrics_pcie_link_width_get(i, &val_ui16);
std::cout << "\t -> pcie_link_width(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_pcie_link_speed_get(i, &val_ui16);
std::cout << "\t -> pcie_link_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_pcie_bandwidth_acc_get(i, &val_ui64);
std::cout << "\t -> pcie_bandwidth_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_pcie_bandwidth_inst_get(i, &val_ui64);
std::cout << "\t -> pcie_bandwidth_inst(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_pcie_l0_recov_count_acc_get(i, &val_ui64);
std::cout << "\t -> pcie_l0_recov_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_pcie_replay_count_acc_get(i, &val_ui64);
std::cout << "\t -> pcie_replay_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_pcie_replay_rover_count_acc_get(i, &val_ui64);
std::cout << "\t -> pcie_replay_rollover_count_accum(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_xgmi_link_width_get(i, &val_ui16);
std::cout << "\t -> xgmi_link_width(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_xgmi_link_speed_get(i, &val_ui16);
std::cout << "\t -> xgmi_link_speed(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_xgmi_read_data_get(i, &xgmi_read_values);
std::cout << "\t -> xgmi_read_data(): " << print_error_or_value(ret, xgmi_read_values) << "\n";
ret = rsmi_dev_metrics_xgmi_write_data_get(i, &xgmi_write_values);
std::cout << "\t -> xgmi_write_data(): " << print_error_or_value(ret, xgmi_write_values) << "\n";
std::cout << "\n";
std::cout << "\t[Voltage]" << "\n";
ret = rsmi_dev_metrics_volt_soc_get(i, &val_ui16);
std::cout << "\t -> voltage_soc(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_volt_gfx_get(i, &val_ui16);
std::cout << "\t -> voltage_gfx(): " << print_error_or_value(ret, val_ui16) << "\n";
ret = rsmi_dev_metrics_volt_mem_get(i, &val_ui16);
std::cout << "\t -> voltage_mem(): " << print_error_or_value(ret, val_ui16) << "\n";
std::cout << "\n";
std::cout << "\t[Timestamp]" << "\n";
ret = rsmi_dev_metrics_system_clock_counter_get(i, &val_ui64);
std::cout << "\t -> system_clock_counter(): " << print_error_or_value(ret, val_ui64) << "\n";
ret = rsmi_dev_metrics_firmware_timestamp_get(i, &val_ui64);
std::cout << "\t -> firmware_timestamp(): " << print_error_or_value(ret, val_ui64) << "\n";
std::cout << "\n";
std::cout << "\t[XCD CounterVoltage]" << "\n";
ret = rsmi_dev_metrics_xcd_counter_get(i, &val_ui16);
std::cout << "\t -> xcd_counter(): " << print_error_or_value(ret, val_ui16) << "\n";
std::cout << "\t -> xcd_counter(): " << val_ui16;
std::cout << "\n\n";
ret = rsmi_dev_perf_level_get(i, &pfl);
CHK_AND_PRINT_RSMI_ERR_RET(ret)
std::cout << "\t**Performance Level:" <<
perf_level_string(pfl) << "\n";
ret = rsmi_dev_overdrive_level_get(i, &val_ui32);
CHK_AND_PRINT_RSMI_ERR_RET(ret)
std::cout << "\t**OverDrive Level:" << val_ui32 << "\n";
std::cout << "\t**OverDrive Level: ";
if (ret == RSMI_STATUS_SUCCESS) {
std::cout << val_ui32 << "\n";
} else {
CHK_RSMI_NOT_SUPPORTED_OR_UNEXPECTED_DATA_RET(ret)
}
print_test_header("GPU Clocks", i);
for (int clkType = static_cast<int>(RSMI_CLK_TYPE_SYS);
@@ -1271,9 +1145,6 @@ int main() {
}
for (uint32_t i = 0; i < num_monitor_devs; ++i) {
ret = test_set_overdrive(i);
CHK_AND_PRINT_RSMI_ERR_RET(ret)
ret = test_set_perf_level(i);
CHK_AND_PRINT_RSMI_ERR_RET(ret)
@@ -1294,6 +1165,9 @@ int main() {
ret = test_set_memory_partition(i);
CHK_AND_PRINT_RSMI_ERR_RET(ret)
ret = test_set_overdrive(i);
CHK_RSMI_NOT_SUPPORTED_RET(ret)
}
return 0;
File diff suppressed because it is too large
+5
@@ -94,6 +94,11 @@ class KFDNode {
int32_t get_simd_per_cu(uint64_t* simd_per_cu) const;
int32_t get_simd_count(uint64_t* simd_count) const;
// Get gpu_id (AKA GUID) version from kfd
int get_gpu_id(uint64_t *gpu_id);
// Get node id from kfd
int get_node_id(uint32_t *node_id);
private:
uint32_t node_indx_;
uint32_t amdgpu_dev_index_;
+4
@@ -48,8 +48,11 @@
#include <algorithm>
#include <cstdint>
#include <iomanip>
#include <iosfwd>
#include <iostream>
#include <iterator>
#include <limits>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
@@ -594,6 +597,7 @@ class TagTextContents_t
}
}
}
};
using TextFileTagContents_t = TagTextContents_t<std::string, std::string,
File diff suppressed because it is too large
+1 -1
@@ -490,7 +490,7 @@ static const std::map<const char *, dev_depends_t> kDevFuncDependsMap = {
// Functions with only mandatory dependencies
{"rsmi_dev_vram_vendor_get", {{kDevVramVendorFName}, {}}},
{"rsmi_dev_id_get", {{kDevDevIDFName}, {}}},
{"rsmi_dev_oam_id_get", {{kDevXGMIPhysicalIDFName}, {}}},
{"rsmi_dev_xgmi_physical_id_get", {{kDevXGMIPhysicalIDFName}, {}}},
{"rsmi_dev_revision_get", {{kDevDevRevIDFName}, {}}},
{"rsmi_dev_vendor_id_get", {{kDevVendorIDFName}, {}}},
{"rsmi_dev_name_get", {{kDevVendorIDFName,
+63 -8
@@ -526,7 +526,7 @@ int GetProcessInfoForPID(uint32_t pid, rsmi_process_info_t *proc,
// Collect count of compute units
cu_count += kfd_node_map[gpu_id]->cu_count();
} else {
//Some GFX revisions do not provide cu_occupancy debugfs method
// Some GFX revisions do not provide cu_occupancy debugfs method
proc->cu_occupancy = CU_OCCUPANCY_INVALID;
cu_count = 0;
}
@@ -1067,18 +1067,18 @@ int KFDNode::get_gfx_target_version(uint64_t *gfx_target_version) {
*gfx_target_version = gfx_version;
ss << __PRETTY_FUNCTION__
<< " | File: " << properties_path
<< " | Successfully read node #" << std::to_string(this->node_indx_)
<< " | Read node: " << std::to_string(this->node_indx_)
<< " for gfx_target_version"
<< " | Data (gfx_target_version) *gfx_target_version = "
<< " | Data (*gfx_target_version): "
<< std::to_string(*gfx_target_version)
<< " | return = " << std::to_string(ret)
<< " | Return: "
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
<< " | ";
LOG_DEBUG(ss);
return ret;
}
int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const
{
int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const {
const std::string properties_path("/sys/class/kfd/kfd/topology/nodes/" +
std::to_string(this->node_indx_) +
"/properties");
@@ -1090,8 +1090,7 @@ int32_t KFDNode::get_simd_per_cu(uint64_t* simd_per_cu) const
return ret;
}
int32_t KFDNode::get_simd_count(uint64_t* simd_count) const
{
int32_t KFDNode::get_simd_count(uint64_t* simd_count) const {
const std::string properties_path("/sys/class/kfd/kfd/topology/nodes/" +
std::to_string(this->node_indx_) +
"/properties");
@@ -1103,6 +1102,62 @@ int32_t KFDNode::get_simd_count(uint64_t* simd_count) const
return ret;
}
// Public interface for device
// /sys/class/kfd/kfd/topology/nodes/*/gpu_id
int KFDNode::get_gpu_id(uint64_t *gpu_id) {
std::ostringstream ss;
std::string gpuid_path = "/sys/class/kfd/kfd/topology/nodes/"
+ std::to_string(this->node_indx_) + "/gpu_id";
const uint64_t undefined_gpu_id = std::numeric_limits<uint64_t>::max();
std::string gpu_id_string = "";
*gpu_id = undefined_gpu_id;
int ret = ReadSysfsStr(gpuid_path, &gpu_id_string);
if (ret != 0 || gpu_id_string.empty()) {
ss << __PRETTY_FUNCTION__
<< " | File: " << gpuid_path
<< " | Data (*gpu_id): empty or nullptr"
<< " | Issue: Could not read node #" << std::to_string(this->node_indx_)
<< ". KFD node was an unsupported node or value read was empty."
<< " | Return: "
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
<< " | ";
LOG_ERROR(ss);
return ret;
}
*gpu_id = std::stoull(gpu_id_string);
if (*gpu_id == 0) { // CPU node - return not supported
*gpu_id = undefined_gpu_id;
ret = ENOENT; // map to RSMI_STATUS_NOT_SUPPORTED
}
ss << __PRETTY_FUNCTION__
<< " | File: " << gpuid_path
<< " | Read node #: " << std::to_string(this->node_indx_)
<< " | Data (*gpu_id): " << std::to_string(*gpu_id)
<< " | Return: "
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
<< " | ";
LOG_DEBUG(ss);
return ret;
}
// Public interface for device
// /sys/class/kfd/kfd/topology/nodes/<node_id>
int KFDNode::get_node_id(uint32_t *node_id) {
std::ostringstream ss;
int ret = 0;
std::string nodeid_path = "/sys/class/kfd/kfd/topology/nodes/"
+ std::to_string(this->node_indx_);
// Assign before logging so the debug message reports the value returned
*node_id = this->node_indx_;
ss << __PRETTY_FUNCTION__
<< " | File: " << nodeid_path
<< " | Read node #: " << std::to_string(this->node_indx_)
<< " | Data (*node_id): " << std::to_string(*node_id)
<< " | Return: "
<< getRSMIStatusString(amd::smi::ErrnoToRsmiStatus(ret), false)
<< " | ";
LOG_DEBUG(ss);
return ret;
}
} // namespace smi
} // namespace amd
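The `get_gpu_id()` hunk above parses the sysfs string with `std::stoull` and treats a value of 0 as a CPU node. A minimal sketch of that parsing step in isolation follows. The helper name `parse_gpu_id` is hypothetical and not part of the commit. Mapping an empty or non-numeric string to `ENOENT` is an assumption made for this sketch, not behavior confirmed by the diff.

```cpp
#include <cerrno>
#include <cstdint>
#include <exception>
#include <limits>
#include <string>

// Hypothetical helper (not in the commit): parse the text read from
// /sys/class/kfd/kfd/topology/nodes/<n>/gpu_id.
// Returns 0 on success. Returns ENOENT for a CPU node (gpu_id == 0) and,
// as an assumption of this sketch, for empty or non-numeric input.
inline int parse_gpu_id(const std::string &text, uint64_t *gpu_id) {
  const uint64_t undefined_gpu_id = std::numeric_limits<uint64_t>::max();
  *gpu_id = undefined_gpu_id;
  if (text.empty()) {
    return ENOENT;
  }
  try {
    // std::stoull skips leading whitespace and stops at the first non-digit.
    const uint64_t value = std::stoull(text);
    if (value == 0) {  // CPU node
      return ENOENT;
    }
    *gpu_id = value;
    return 0;
  } catch (const std::exception &) {
    // std::invalid_argument or std::out_of_range
    return ENOENT;
  }
}
```

Catching the `std::stoull` exceptions keeps a malformed sysfs entry from propagating an exception into the caller; whether the library wants that behavior is a design decision outside this sketch.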
+138 -20
@@ -235,15 +235,7 @@ RocmSMI::Initialize(uint64_t flags) {
int i_ret;
std::ostringstream ss;
LOG_ALWAYS("=============== ROCM SMI initialize ================");
ROCmLogging::Logger::getInstance()->enableAllLogLevels();
// Leaving below to allow developers to check current log settings
// std::string logSettings = Logger::getInstance()->getLogSettings();
// std::cout << "Current log settings:\n" << logSettings << std::endl;
if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) {
logSystemDetails();
}
assert(ref_count_ == 1);
if (ref_count_ != 1) {
@@ -259,6 +251,15 @@ RocmSMI::Initialize(uint64_t flags) {
// To help debug env variable issues
// debugRSMIEnvVarInfo();
if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) {
ROCmLogging::Logger::getInstance()->enableAllLogLevels();
LOG_ALWAYS("=============== ROCM SMI initialize ================");
logSystemDetails();
}
// Leaving below to allow developers to check current log settings
// std::string logSettings = ROCmLogging::Logger::getInstance()->getLogSettings();
// std::cout << "Current log settings:\n" << logSettings << std::endl;
while (!std::string(kAMDMonitorTypes[i]).empty()) {
amd_monitor_types_.insert(kAMDMonitorTypes[i]);
++i;
@@ -283,6 +284,7 @@ RocmSMI::Initialize(uint64_t flags) {
<< " | [before] device->path() = " << device->path()
<< "\n | bdfid = " << bdfid
<< "\n | device->bdfid() = " << device->bdfid()
<< " (" << print_int_as_hex(device->bdfid()) << ")"
<< "\n | (xgmi node) setting "
<< "device->set_bdfid(device->bdfid())";
LOG_TRACE(ss);
@@ -293,6 +295,7 @@ RocmSMI::Initialize(uint64_t flags) {
<< " | [before] device->path() = " << device->path()
<< "\n | bdfid = " << bdfid
<< "\n | device->bdfid() = " << device->bdfid()
<< " (" << print_int_as_hex(device->bdfid()) << ")"
<< "\n | (legacy/pcie card) setting device->set_bdfid(bdfid)";
LOG_TRACE(ss);
device->set_bdfid(bdfid);
@@ -301,6 +304,7 @@ RocmSMI::Initialize(uint64_t flags) {
<< " | [after] device->path() = " << device->path()
<< "\n | bdfid = " << bdfid
<< "\n | device->bdfid() = " << device->bdfid()
<< " (" << print_int_as_hex(device->bdfid()) << ")"
<< "\n | final update: device->bdfid() holds correct device bdf";
LOG_TRACE(ss);
}
@@ -312,8 +316,11 @@ RocmSMI::Initialize(uint64_t flags) {
for (uint32_t dv_ind = 0; dv_ind < devices_.size(); ++dv_ind) {
dev = devices_[dv_ind];
uint64_t bdfid = dev->bdfid();
bdfid = bdfid & 0xFFFFFFFF0FFFFFFF; // clear out partition id in bdf
// NOTE: partition_id is not part of bdf (but is part of pci_id)
// which is why it is removed in sorting
dv_to_id.push_back({bdfid, dev});
}
}
ss << __PRETTY_FUNCTION__ << " Sort index based on BDF.";
LOG_DEBUG(ss);
@@ -734,7 +741,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
continue;
sscanf(&dentry->d_name[strlen(kDeviceNamePrefix)], "%d", &cardId);
if (cardId > max_cardId)
max_cardId = cardId;
max_cardId = cardId;
count++;
}
dentry = readdir(drm_dir);
@@ -748,23 +755,47 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
uint64_t s_gpu_id = 0;
uint64_t s_unique_id = 0;
uint64_t s_location_id = 0;
uint64_t s_bdf = 0;
uint64_t s_domain = 0;
uint8_t s_bus = 0;
uint8_t s_device = 0;
uint8_t s_function = 0;
uint8_t s_partition_id = 0;
uint64_t padding = 0; // padding added in case new changes in future
};
// allSystemNodes[key = unique_id] => {node_id, gpu_id, unique_id,
// location_id}
// location_id, bdf, domain, bus, device,
// partition_id}
std::multimap<uint64_t, systemNode> allSystemNodes;
uint32_t node_id = 0;
static const int BYTE = 8;
while (true) {
uint64_t gpu_id = 0, unique_id = 0, location_id = 0;
uint64_t gpu_id = 0, unique_id = 0, location_id = 0, domain = 0;
int ret_gpu_id = get_gpu_id(node_id, &gpu_id);
int ret_unique_id = read_node_properties(node_id, "unique_id", &unique_id);
int ret_loc_id =
read_node_properties(node_id, "location_id", &location_id);
if (ret_gpu_id == 0 || ret_unique_id == 0 || ret_loc_id == 0) {
int ret_domain =
read_node_properties(node_id, "domain", &domain);
if (ret_gpu_id == 0 &&
!(ret_unique_id != 0 || ret_loc_id != 0 || ret_domain != 0)) {
// Do not try to build a node if any of these fields
// does not exist in KFD (0 as a value is okay)
systemNode myNode;
myNode.s_node_id = node_id;
myNode.s_gpu_id = gpu_id;
myNode.s_unique_id = unique_id;
myNode.s_location_id = location_id;
myNode.s_domain = domain & 0xFFFFFFFF;
myNode.s_bdf = (myNode.s_domain << 32) | (myNode.s_location_id);
myNode.s_location_id = myNode.s_bdf;
myNode.s_bdf |= ((domain & 0xFFFFFFFF) << 32);
myNode.s_location_id = myNode.s_bdf;
myNode.s_domain = myNode.s_location_id >> 32;
myNode.s_bus = ((myNode.s_location_id >> 8) & 0xFF);
myNode.s_device = ((myNode.s_location_id >> 3) & 0x1F);
myNode.s_function = myNode.s_location_id & 0x7;
myNode.s_partition_id = ((myNode.s_location_id >> 28) & 0xF);
if (gpu_id != 0) { // only add gpu nodes, 0 = CPU
allSystemNodes.emplace(unique_id, myNode);
}
@@ -780,6 +811,12 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
<< "; gpu_id = " << std::to_string(i.second.s_gpu_id)
<< "; unique_id = " << std::to_string(i.second.s_unique_id)
<< "; location_id = " << std::to_string(i.second.s_location_id)
<< "; bdf = " << print_int_as_hex(i.second.s_bdf)
<< "; domain = " << print_int_as_hex(i.second.s_domain, true, 2*BYTE)
<< "; bus = " << print_int_as_hex(i.second.s_bus, true, BYTE)
<< "; device = " << print_int_as_hex(i.second.s_device, true, BYTE)
<< "; function = " << std::to_string(i.second.s_function)
<< "; partition_id = " << std::to_string(i.second.s_partition_id)
<< "], ";
}
ss << "}";
@@ -817,13 +854,67 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
rsmi_status_t ret_unique_id =
rsmi_dev_unique_id_get(cardAdded, &device_uuid);
auto temp_numb_nodes = allSystemNodes.count(device_uuid);
auto it = allSystemNodes.lower_bound(device_uuid);
if (it != allSystemNodes.end() && doesDeviceSupportPartitions && temp_numb_nodes > 1
auto primaryBdfId =
allSystemNodes.lower_bound(device_uuid)->second.s_location_id;
auto i = allSystemNodes.lower_bound(device_uuid);
if (doesDeviceSupportPartitions && temp_numb_nodes > 1
&& ret_unique_id == RSMI_STATUS_SUCCESS) {
auto primaryBdfId = it->second.s_location_id;
// helps identify xgmi nodes (secondary nodes) easier
ss << __PRETTY_FUNCTION__ << " | secondary node add ; "
<< " BDF = " << std::to_string(primaryBdfId)
<< " (" << print_int_as_hex(primaryBdfId) << ")";
LOG_DEBUG(ss);
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
&& i->second.s_partition_id == 0) {
i->second.s_partition_id = i->second.s_function;
ss << __PRETTY_FUNCTION__ << " | (secondary node add) fall back - "
<< "detected !SPX && partition_id == 0"
<< "; function = " << std::to_string(i->second.s_function)
<< "; partition_id = " << std::to_string(i->second.s_partition_id);
LOG_DEBUG(ss);
}
ss << __PRETTY_FUNCTION__
<< " | (secondary node add) B4 AddToDeviceList() -->"
<< "\n[node_id = " << std::to_string(i->second.s_node_id)
<< "; gpu_id = " << std::to_string(i->second.s_gpu_id)
<< "; unique_id = " << std::to_string(i->second.s_unique_id)
<< "; location_id = " << std::to_string(i->second.s_location_id)
<< "; bdf = " << print_int_as_hex(i->second.s_bdf)
<< "; domain = " << print_int_as_hex(i->second.s_domain, true, 2*BYTE)
<< "; bus = " << print_int_as_hex(i->second.s_bus, true, BYTE)
<< "; device = " << print_int_as_hex(i->second.s_device, true, BYTE)
<< "; function = " << std::to_string(i->second.s_function)
<< "; partition_id = " << std::to_string(i->second.s_partition_id)
<< "], ";
LOG_DEBUG(ss);
AddToDeviceList(d_name, primaryBdfId);
} else {
ss << __PRETTY_FUNCTION__ << " | primary node add ; "
<< " BDF = " << std::to_string(UINT64_MAX);
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
&& i->second.s_partition_id == 0) {
i->second.s_partition_id = i->second.s_function;
ss << __PRETTY_FUNCTION__ << " | (primary node add) fall back - "
<< "detected !SPX && partition_id == 0"
<< "; function = " << std::to_string(i->second.s_function)
<< "; partition_id = " << std::to_string(i->second.s_partition_id);
LOG_DEBUG(ss);
}
LOG_DEBUG(ss);
ss << __PRETTY_FUNCTION__
<< " | (primary node add) B4 AddToDeviceList() -->"
<< "\n[node_id = " << std::to_string(i->second.s_node_id)
<< "; gpu_id = " << std::to_string(i->second.s_gpu_id)
<< "; unique_id = " << std::to_string(i->second.s_unique_id)
<< "; location_id = " << std::to_string(i->second.s_location_id)
<< "; bdf = " << print_int_as_hex(i->second.s_bdf)
<< "; domain = " << print_int_as_hex(i->second.s_domain, true, 2*BYTE)
<< "; bus = " << print_int_as_hex(i->second.s_bus, true, BYTE)
<< "; device = " << print_int_as_hex(i->second.s_device, true, BYTE)
<< "; function = " << std::to_string(i->second.s_function)
<< "; partition_id = " << std::to_string(i->second.s_partition_id)
<< "], ";
LOG_DEBUG(ss);
AddToDeviceList(d_name, UINT64_MAX);
}
@@ -834,6 +925,12 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
<< "; gpu_id = " << std::to_string(i.second.s_gpu_id)
<< "; unique_id = " << std::to_string(i.second.s_unique_id)
<< "; location_id = " << std::to_string(i.second.s_location_id)
<< "; bdf = " << print_int_as_hex(i.second.s_bdf)
<< "; domain = " << print_int_as_hex(i.second.s_domain, true, 2*BYTE)
<< "; bus = " << print_int_as_hex(i.second.s_bus, true, BYTE)
<< "; device = " << print_int_as_hex(i.second.s_device, true, BYTE)
<< "; function = " << std::to_string(i.second.s_function)
<< "; partition_id = " << std::to_string(i.second.s_partition_id)
<< "], ";
}
ss << "}";
@@ -909,6 +1006,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
auto removalGpuId = it->second.s_gpu_id;
auto removalUniqueId = it->second.s_unique_id;
auto removalLocId = it->second.s_location_id;
auto removaldomain = it->second.s_domain;
auto nodesErased = 1;
primary_location_id = removalLocId;
allSystemNodes.erase(it++);
@@ -919,6 +1017,7 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
<< "; gpu_id = " << std::to_string(removalGpuId)
<< "; unique_id = " << std::to_string(removalUniqueId)
<< "; location_id = " << std::to_string(removalLocId)
<< "; removaldomain = " << std::to_string(removaldomain)
<< "]";
LOG_DEBUG(ss);
}
@@ -926,15 +1025,34 @@ uint32_t RocmSMI::DiscoverAmdgpuDevices(void) {
break;
}
auto myBdfId = it->second.s_location_id;
AddToDeviceList(secNode, myBdfId);
ss << __PRETTY_FUNCTION__ << " | secondary node add #2; "
<< " BDF = " << std::to_string(myBdfId)
<< " (" << print_int_as_hex(myBdfId) << ")";
LOG_DEBUG(ss);
if (doesDeviceSupportPartitions && strCompPartition != "SPX"
&& it->second.s_partition_id == 0) {
it->second.s_partition_id = it->second.s_function;
ss << __PRETTY_FUNCTION__ << " | (secondary node add #2) fall back - "
<< "detected !SPX && partition_id == 0"
<< "; function = " << std::to_string(it->second.s_function)
<< "; partition_id = " << std::to_string(it->second.s_partition_id);
LOG_DEBUG(ss);
}
ss << __PRETTY_FUNCTION__
<< "\nSECONDARY --> After adding new node; ERASING -> [node_id = "
<< std::to_string(it->second.s_node_id)
<< " | (secondary node add #2) B4 AddToDeviceList() -->"
<< "\n[node_id = " << std::to_string(it->second.s_node_id)
<< "; gpu_id = " << std::to_string(it->second.s_gpu_id)
<< "; unique_id = " << std::to_string(it->second.s_unique_id)
<< "; location_id = " << std::to_string(it->second.s_location_id)
<< "]";
<< "; bdf = " << print_int_as_hex(it->second.s_bdf)
<< "; domain = " << print_int_as_hex(it->second.s_domain, true, 2*BYTE)
<< "; bus = " << print_int_as_hex(it->second.s_bus, true, BYTE)
<< "; device = " << print_int_as_hex(it->second.s_device, true, BYTE)
<< "; function = " << std::to_string(it->second.s_function)
<< "; partition_id = " << std::to_string(it->second.s_partition_id)
<< "], ";
LOG_DEBUG(ss);
AddToDeviceList(secNode, myBdfId);
allSystemNodes.erase(it++);
numb_nodes--;
cardAdded++;
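The bit layout used throughout this file's hunks (domain in bits [63:32], partition id in bits [31:28], bus in [15:8], device in [7:3], function in [2:0]) can be sketched as a standalone decoder. The struct and function names below are hypothetical and chosen for illustration. The fallback mirrors the diff's `strCompPartition != "SPX" && s_partition_id == 0` check but leaves out the `doesDeviceSupportPartitions` condition for brevity.

```cpp
#include <cstdint>
#include <string>

// Hypothetical container for the fields packed into the 64-bit BDF value.
struct DecodedBdf {
  uint32_t domain;       // bits [63:32]
  uint8_t partition_id;  // bits [31:28]
  uint8_t bus;           // bits [15:8]
  uint8_t device;        // bits [7:3]
  uint8_t function;      // bits [2:0]
};

// Split a packed value using the same shifts and masks as the diff.
inline DecodedBdf decode_bdf(uint64_t bdf) {
  DecodedBdf d{};
  d.domain = static_cast<uint32_t>(bdf >> 32);
  d.partition_id = static_cast<uint8_t>((bdf >> 28) & 0xF);
  d.bus = static_cast<uint8_t>((bdf >> 8) & 0xFF);
  d.device = static_cast<uint8_t>((bdf >> 3) & 0x1F);
  d.function = static_cast<uint8_t>(bdf & 0x7);
  return d;
}

// Fallback from the diff: outside SPX mode, a partition id of 0 is
// replaced by the function number.
inline uint8_t effective_partition_id(const DecodedBdf &d,
                                      const std::string &compute_partition) {
  if (compute_partition != "SPX" && d.partition_id == 0) {
    return d.function;
  }
  return d.partition_id;
}

// Sort key with bits [31:28] cleared, so all partitions of one physical
// device compare equal on BDF.
inline uint64_t bdf_sort_key(uint64_t bdf) {
  return bdf & 0xFFFFFFFF0FFFFFFFULL;
}
```

Clearing the partition bits before sorting matches the `bdfid & 0xFFFFFFFF0FFFFFFF` line in `RocmSMI::Initialize()`, which keeps the device ordering based on domain, bus, device, and function only.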
+20 -4
@@ -1113,6 +1113,7 @@ static std::string print_pnt(rsmi_od_vddc_point_t *pt) {
ss << "\t\t** Voltage: " << pt->voltage << " mV\n";
return ss.str();
}
static std::string pt_vddc_curve(rsmi_od_volt_curve *c) {
std::ostringstream ss;
if (c == nullptr) {
@@ -1182,16 +1183,31 @@ bool is_sudo_user() {
return isRunningWithSudo;
}
rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind,
std::string *gfx_version) {
// string output of gfx_<version>
rsmi_status_t rsmi_get_gfx_target_version(uint32_t dv_ind, std::string *gfx_version) {
std::ostringstream ss;
uint64_t kfd_gfx_version = 0;
GET_DEV_AND_KFDNODE_FROM_INDX
int ret = kfd_node->get_gfx_target_version(&kfd_gfx_version);
uint64_t orig_target_version = 0;
uint64_t major = 0;
uint64_t minor = 0;
uint64_t rev = 0;
if (ret == 0) {
ss << "gfx" << kfd_gfx_version;
*gfx_version = ss.str();
orig_target_version = kfd_gfx_version;
// separate out parts -> put back into normal graphics version format
major = static_cast<uint64_t>((orig_target_version / 10000) * 100);
minor = static_cast<uint64_t>((orig_target_version % 10000 / 100) * 10);
if (minor == 0) major *= 10; // 0 as a minor is correct, but bump up by 10
rev = static_cast<uint64_t>(orig_target_version % 100);
*gfx_version = "gfx" + std::to_string(major + minor + rev);
ss << __PRETTY_FUNCTION__
<< " | " << std::dec << "kfd_target_version = " << orig_target_version
<< "; major = " << major << "; minor = " << minor << "; rev = "
<< rev << "\nReporting rsmi_get_gfx_target_version = " << *gfx_version
<< "\n";
LOG_INFO(ss);
return RSMI_STATUS_SUCCESS;
} else {
*gfx_version = "Unknown";
+38 -2
@@ -753,18 +753,54 @@ amdsmi_get_gpu_asic_info(amdsmi_processor_handle processor_handle, amdsmi_asic_i
// default to 0xffff as not supported
info->oam_id = std::numeric_limits<uint16_t>::max();
uint16_t tmp_oam_id = 0;
status = rsmi_wrapper(rsmi_dev_oam_id_get, processor_handle, &(tmp_oam_id));
status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle, &(tmp_oam_id));
info->oam_id = tmp_oam_id;
// default to 0xffffffff as not supported
info->num_of_compute_units = std::numeric_limits<uint32_t>::max();
auto tmp_num_of_compute_units = uint32_t(0);
status = rsmi_wrapper(amd::smi::rsmi_dev_number_of_computes_get, processor_handle,
&tmp_num_of_compute_units);
&(tmp_num_of_compute_units));
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
info->num_of_compute_units = tmp_num_of_compute_units;
}
// default to 0xffffffffffffffff as not supported
info->target_graphics_version = std::numeric_limits<uint64_t>::max();
auto tmp_target_gfx_version = uint64_t(0);
status = rsmi_wrapper(rsmi_dev_target_graphics_version_get, processor_handle,
&(tmp_target_gfx_version));
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
info->target_graphics_version = tmp_target_gfx_version;
}
// default to 0xffffffffffffffff as not supported
info->kfd_id = std::numeric_limits<uint64_t>::max();
auto tmp_kfd_id = uint64_t(0);
status = rsmi_wrapper(rsmi_dev_guid_get, processor_handle,
&(tmp_kfd_id));
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
info->kfd_id = tmp_kfd_id;
}
// default to 0xffffffff as not supported
info->node_id = std::numeric_limits<uint32_t>::max();
auto tmp_node_id = uint32_t(0);
status = rsmi_wrapper(rsmi_dev_node_id_get, processor_handle,
&(tmp_node_id));
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
info->node_id = tmp_node_id;
}
// default to 0xffffffff as not supported
info->partition_id = std::numeric_limits<uint32_t>::max();
auto tmp_partition_id = uint32_t(0);
status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle,
&(tmp_partition_id));
if (status == amdsmi_status_t::AMDSMI_STATUS_SUCCESS) {
info->partition_id = tmp_partition_id;
}
return AMDSMI_STATUS_SUCCESS;
}
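Each new field in `amdsmi_get_gpu_asic_info()` follows the same pattern: preset the field to the type's maximum value as a "not supported" sentinel, then overwrite it only if the underlying query succeeds. A minimal generic sketch of that pattern follows. The `Status` enum and `read_or_sentinel` helper are hypothetical stand-ins and do not exist in the AMD SMI API.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical status type standing in for amdsmi_status_t.
enum class Status { kSuccess, kNotSupported };

// Returns the queried value on success, otherwise the all-ones sentinel
// (std::numeric_limits<T>::max()) that marks the field as not supported.
// `getter` is any callable taking T* and returning Status.
template <typename T, typename Getter>
T read_or_sentinel(Getter getter) {
  T value = std::numeric_limits<T>::max();
  T tmp{};
  if (getter(&tmp) == Status::kSuccess) {
    value = tmp;
  }
  return value;
}
```

Callers can then compare a field against `std::numeric_limits<T>::max()` to detect an unsupported query, which is what the updated `TestSysInfoRead` test does with `EXPECT_EQ` and `EXPECT_NE`.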
+30 -4
@@ -52,6 +52,8 @@
#include "amd_smi/impl/amd_smi_common.h"
#include "rocm_smi/rocm_smi.h"
#include "rocm_smi/rocm_smi_main.h"
#include "rocm_smi/rocm_smi_utils.h"
#include "rocm_smi/rocm_smi_logger.h"
namespace amd {
namespace smi {
@@ -173,10 +175,26 @@ amdsmi_status_t AMDSmiDrm::init() {
}
has_valid_fds = true;
bdf.function_number = device->businfo.pci->func;
bdf.device_number = device->businfo.pci->dev;
bdf.bus_number = device->businfo.pci->bus;
bdf.domain_number = device->businfo.pci->domain;
std::ostringstream ss;
uint64_t bdf_rocm = 0;
rsmi_dev_pci_id_get(i, &bdf_rocm);
ss << __PRETTY_FUNCTION__ << " | "
<< "bdf_rocm | Received bdf: "
<< "\nWhole BDF: " << amd::smi::print_unsigned_hex_and_int(bdf_rocm)
<< "\nDomain = "
<< amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xFFFFFFFF00000000) >> 32)
<< "; \nBus# = " << amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xFF00) >> 8)
<< "; \nDevice# = "<< amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0xF8) >> 3)
<< "; \nFunction# = " << amd::smi::print_unsigned_hex_and_int((bdf_rocm & 0x7));
LOG_INFO(ss);
bdf.function_number = ((bdf_rocm & 0x7));
bdf.device_number = ((bdf_rocm & 0xF8) >> 3);
bdf.bus_number = ((bdf_rocm & 0xFF00) >> 8);
bdf.domain_number = ((bdf_rocm & 0xFFFFFFFF00000000) >> 32);
ss << __PRETTY_FUNCTION__ << " | " << "Received bdf: Domain = " << bdf.domain_number
<< "; Bus# = " << bdf.bus_number << "; Device# = "<< bdf.device_number
<< "; Function# = " << bdf.function_number;
LOG_INFO(ss);
vendor_id = device->deviceinfo.pci->vendor_id;
@@ -309,6 +327,14 @@ amdsmi_status_t AMDSmiDrm::get_drm_fd_by_index(uint32_t gpu_index, uint32_t *fd_
amdsmi_status_t AMDSmiDrm::get_bdf_by_index(uint32_t gpu_index, amdsmi_bdf_t *bdf_info) const {
if (gpu_index + 1 > drm_bdfs_.size()) return AMDSMI_STATUS_NOT_SUPPORTED;
*bdf_info = drm_bdfs_[gpu_index];
std::ostringstream ss;
ss << __PRETTY_FUNCTION__ << " | gpu_index = " << gpu_index
<< "; \nreceived bdf: Domain = " << bdf_info->domain_number
<< "; \nBus# = " << bdf_info->bus_number
<< "; \nDevice# = " << bdf_info->device_number
<< "; \nFunction# = " << bdf_info->function_number
<< "\nReturning = AMDSMI_STATUS_SUCCESS";
LOG_INFO(ss);
return AMDSMI_STATUS_SUCCESS;
}
+25 -5
@@ -48,6 +48,7 @@
#include <iostream>
#include <string>
#include <limits>
#include <gtest/gtest.h>
#include "amd_smi/amdsmi.h"
@@ -58,7 +59,9 @@
TestSysInfoRead::TestSysInfoRead() : TestBase() {
set_title("AMDSMI System Info Read Test");
set_description("This test verifies that system information such as the "
"BDFID, AMDSMI version, VBIOS version, etc. can be read properly.");
"BDFID, AMDSMI version, VBIOS version, "
"vendor_id, unique_id, target_gfx_version, kfd_id, node_id, partition_id, etc. "
"can be read properly.");
}
TestSysInfoRead::~TestSysInfoRead(void) {
@@ -150,22 +153,39 @@ void TestSysInfoRead::Run(void) {
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
// vendor_id, unique_id
amdsmi_asic_info_t asci_info;
err = amdsmi_get_gpu_asic_info(processor_handles_[0], &asci_info);
// vendor_id, unique_id, target_gfx_version, kfd_id, node_id, partition_id
amdsmi_asic_info_t asci_info = {};
err = amdsmi_get_gpu_asic_info(processor_handles_[i], &asci_info);
if (err == AMDSMI_STATUS_NOT_SUPPORTED) {
std::cout <<
"\t**amdsmi_get_gpu_asic_info() is not supported"
" on this machine" << std::endl;
EXPECT_EQ(asci_info.target_graphics_version, std::numeric_limits<uint64_t>::max());
EXPECT_EQ(asci_info.kfd_id, std::numeric_limits<uint64_t>::max());
EXPECT_EQ(asci_info.node_id, std::numeric_limits<uint32_t>::max());
EXPECT_EQ(asci_info.partition_id, std::numeric_limits<uint32_t>::max());
// Verify api support checking functionality is working
err = amdsmi_get_gpu_asic_info(processor_handles_[i], nullptr);
ASSERT_EQ(err, AMDSMI_STATUS_NOT_SUPPORTED);
} else {
if (err == AMDSMI_STATUS_SUCCESS) {
IF_VERB(STANDARD) {
std:: cout << "\t**GPU PCIe Vendor : "
std:: cout << "\t**GPU PCIe Vendor : "
<< asci_info.vendor_name << std::endl;
std::cout << "\t**Target GFX version: " << std::dec
<< asci_info.target_graphics_version << "\n";
std::cout << "\t**KFD ID: " << std::dec
<< asci_info.kfd_id << "\n";
std::cout << "\t**Node ID: " << std::dec
<< asci_info.node_id << "\n";
std::cout << "\t**Partition ID: " << std::dec
<< asci_info.partition_id << "\n";
}
EXPECT_EQ(err, AMDSMI_STATUS_SUCCESS);
EXPECT_NE(asci_info.target_graphics_version, std::numeric_limits<uint64_t>::max());
EXPECT_NE(asci_info.kfd_id, std::numeric_limits<uint64_t>::max());
EXPECT_NE(asci_info.node_id, std::numeric_limits<uint32_t>::max());
EXPECT_NE(asci_info.partition_id, std::numeric_limits<uint32_t>::max());
// Verify api support checking functionality is working
err = amdsmi_get_gpu_asic_info(processor_handles_[i], nullptr);
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
+1 -2
@@ -137,8 +137,7 @@ void TestTempRead::Run(void) {
ASSERT_EQ(err, AMDSMI_STATUS_INVAL);
IF_VERB(STANDARD) {
std::cout << "\t**" << label << ": " << val_i64/1000 <<
"C" << std::endl;
std::cout << "\t**" << label << ": " << val_i64 << "C" << std::endl;
}
};
for (type = AMDSMI_TEMPERATURE_TYPE_FIRST; type <= AMDSMI_TEMPERATURE_TYPE__MAX; ++type) {