From 9979be85126260361fa12922fa715294610158ad Mon Sep 17 00:00:00 2001
From: Ryo Ficano
Date: Mon, 16 Sep 2024 16:50:20 -0500
Subject: [PATCH 1/8] [SWDEV-482963] [Test updates] Add new tests for p0 items - BM v2

Updates:
- Added tests for these API calls:
    amdsmi_get_socket_handles
    amdsmi_get_processor_type
    amdsmi_get_clk_freq
    amdsmi_get_gpu_process_info
    amdsmi_get_gpu_ras_block_features_enabled
    amdsmi_get_gpu_ecc_count
    amdsmi_get_gpu_memory_usage
    amdsmi_get_gpu_vendor_name
    amdsmi_get_utilization_count
- Added amdsmi_init() and amdsmi_shut_down() before and after each test.
- Updated README and removed all pytest references.

Change-Id: Ida0c165a466571b1df36c413161bd95c070f6ff1
Signed-off-by: Ryo Ficano
---
 .gitignore                                |    1 +
 tests/python_unittest/README.md           | 2027 +++++++--------------
 tests/python_unittest/integration_test.py |  610 +++++--
 3 files changed, 1113 insertions(+), 1525 deletions(-)

diff --git a/.gitignore b/.gitignore
index d126460b84..b01b2b7fdd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,6 +17,7 @@ include/amd_smi/amd_smi64Config.h
 include/amd_smi/amd_smiConfig.h
 rocm_smi/include/rocm_smi/rocm_smi64Config.h
 docs/*.pdf
+goamdsmi_shim/include/goamdsmi_shimConfig.h
 
 # Byte-compiled / optimized / DLL files
 __pycache__/
diff --git a/tests/python_unittest/README.md b/tests/python_unittest/README.md
index 26e9fb7dbd..4fcf240513 100644
--- a/tests/python_unittest/README.md
+++ b/tests/python_unittest/README.md
@@ -1,6 +1,6 @@
 # How to Python Unit Tests
 ## Overview
-We use Python's default Python unittest testing framework. You can read more about it here [Python unittest v3.8](https://docs.python.org/3.8/library/unittest.html). Alternatively, you can read up on pytest through here [Pytest how-to usage](https://docs.pytest.org/en/latest/index.html).
+We use Python's default Python unittest testing framework. You can read more about it here [Python unittest v3.8](https://docs.python.org/3.8/library/unittest.html).
## Warning to Users AMD SMI Python API tests are subject to change. These tests are currently a work in progress and may not work on your system. @@ -13,416 +13,42 @@ Follow our install/build guides to ensure the Python API is installed correctly ## How to Run ### Basic How To The 2 tests are in this PATH: -```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py``` -```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py``` +```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py``` +```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py``` -The recommended method to run the tests: +The recommended method to run the tests: +Unittest only (not verbose) +```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -b -v``` +```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -b -v``` Unittest verbose ```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v``` ```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v``` -Unittest only (not verbose) +Unittest filter and verbose +```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -k "testname" -v``` +```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "testname" -v``` + +## Unittest Run Options +The Unittest Run calls the tests directly. The cache provider will always be used. + +options: + - -h, --help show this help message and exit + - -v, --verbose Verbose output + - -q, --quiet Quiet output + - -b, --buffer Buffer stdout and stderr during tests + - -k "testname" Only run tests which match the given substring + +### Unittest: not verbose +Runs all tests. Silence print statements to stdout. Lists tests results. +This is also the best way to list all tests available. + ```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -b -v``` ```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -b -v``` -See sections below for more detailed options with examples. 
- -## Unittest Run Options -### Unittest Run: Verbose on -Helpful to see print outs of Python. - -```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v``` - -```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v``` - ex.
- Click for example: Unittest run: verbose on - -~~~shell -/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v -test_init (__main__.TestAmdSmiInit) ... ok -test_bad_page_info (__main__.TestAmdSmiPythonInterface) ... ###Test amdsmi_get_gpu_bad_page_info - -**** [ERROR] | Test: test_bad_page_info | Caught AmdSmiLibraryException -ok -test_bdf_device_id (__main__.TestAmdSmiPythonInterface) ... ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['version'] is: 020.001.000.038.015697 - - vbios_info['name'] is: NAVI21 Gaming XL D412 - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 81ff73bf-0000-1000-80c1-6890a5911040 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['version'] is: 020.001.000.060.016898 - - vbios_info['name'] is: NAVI21 D43001 GLXL - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 1fff73a3-0000-1000-8075-223e5e64eac1 -ok -test_ecc (__main__.TestAmdSmiPythonInterface) ... ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_ras_feature_info - -**** [ERROR] | Test: test_ecc | Caught AmdSmiLibraryException -ok -test_gpu_performance (__main__.TestAmdSmiPythonInterface) ... 
###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 3 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 8 - power_info['gfx_voltage'] is: 768 - power_info['soc_voltage'] is: 918 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 203000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 42 - Current temperature for HOTSPOT is: 42 - Current temperature for VRAM is: 38 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 100 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 105 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2475 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test 
amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 4 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 5000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 0 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 13 - power_info['gfx_voltage'] is: 781 - power_info['soc_voltage'] is: 812 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 213000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 34 - Current temperature for HOTSPOT is: 38 - Current temperature for VRAM is: 36 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 109 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 114 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2555 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM 
is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 16 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 8000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -ok -test_walkthrough (__main__.TestAmdSmiPythonInterface) ... ###Test amdsmi_get_processor_handles() -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 0 -###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: NAVI21 - asic_info['vendor_id'] is: 0x1002 - asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] - asic_info['device_id'] is: 0x73bf - asic_info['rev_id'] is: 0xc3 - - asic_info['asic_serial'] is: 0x81C16890A5911040 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 203000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['name'] is: NAVI21 Gaming XL D412 - - vbios_info['version'] is: 020.001.000.038.015697 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. [AMD/ATI] - - board_info['product_name'] is: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 0 -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 1 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - asic_info['vendor_id'] is: 0x1002 - 
asic_info['vendor_name'] is: Advanced Micro Devices Inc. [AMD/ATI] - asic_info['device_id'] is: 0x73a3 - asic_info['rev_id'] is: 0x00 - - asic_info['asic_serial'] is: 0x1F75223E5E64EAC1 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 213000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['name'] is: NAVI21 D43001 GLXL - - vbios_info['version'] is: 020.001.000.060.016898 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. [AMD/ATI] - - board_info['product_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 1 -ok - ----------------------------------------------------------------------- -Ran 6 tests in 0.083s - -OK - -~~~ - -
- - -### Unittest Run: Verbose on + Filter (or exclude) a test - -```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "test_walkthrough" -v``` - -```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "not test_walkthrough" -v``` - -ex. -
- Click for example: Unittest Run: Verbose on + Filter (or exclude) a Test - -~~~shell -> /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "test_bdf_device_id" -v -test_bdf_device_id (__main__.TestAmdSmiPythonInterface) ... ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['version'] is: 020.001.000.038.015697 - - vbios_info['name'] is: NAVI21 Gaming XL D412 - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 81ff73bf-0000-1000-80c1-6890a5911040 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['version'] is: 020.001.000.060.016898 - - vbios_info['name'] is: NAVI21 D43001 GLXL - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 1fff73a3-0000-1000-8075-223e5e64eac1 -ok - ----------------------------------------------------------------------- -Ran 1 test in 0.012s - -OK -~~~ -
- - -### Unittest Run: Silence stdout (print statements) and run all tests - Runs all tests. Silence print statements to stdout. Lists tests results. - This is also the best way to list all tests available. - -```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -b -v``` - -```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -b -v``` - -ex. -
- Click for example: Unittest Run: Silence stdout (print statements) and run all tests + Click for example: Unittest: not verbose ~~~shell /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -b -v @@ -438,425 +64,591 @@ OK
-## Pytest Run Options -### Pytest: List tests -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py --co``` +### Unittest: verbose (with print statements) +Helpful to see print outs of Python. + +```/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v``` +```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v``` -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py --co``` ex.
- Click for example: Pytest: List tests + Click for example: Unittest: verbose (with print statements) ~~~shell -python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py --co -===================================================== test session starts ===================================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 6 items +/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v +test_init (__main__.TestAmdSmiInit) ... ok +test_asic_kfd_info (__main__.TestAmdSmiPythonInterface) ... - - - - - - - - - - - - - -================================================= 6 tests collected in 0.04s ================================================== -~~~ -
- -### Pytest Run: Verbose on -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v``` - -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v``` - -ex. -
- Click for example: Pytest Run: verbose on - -~~~shell - python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v -===================================================== test session starts ===================================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /usr/bin/python3 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 3 items - -../../opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py::TestAmdSmiPythonBDF::test_check_res PASSED [ 33%] -../../opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py::TestAmdSmiPythonBDF::test_format_bdf PASSED [ 66%] -../../opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py::TestAmdSmiPythonBDF::test_parse_bdf PASSED [100%] - -====================================================== 3 passed in 0.04s ====================================================== -~~~ -
- -### Pytest Run: Verbose on + stdout (print statements) -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -s -v``` - -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -s -v``` - -ex. -
- Click for example: Pytest Run: verbose on + stdout (print statements) - -~~~shell -python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -s -v -===================================================== test session starts ===================================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /usr/bin/python3 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 6 items - -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiInit::test_init PASSED -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_bad_page_info ###Test amdsmi_get_gpu_bad_page_info - -**** [ERROR] | Test: test_bad_page_info | Caught AmdSmiLibraryException -PASSED -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_bdf_device_id ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['version'] is: 020.001.000.038.015697 - - vbios_info['name'] is: NAVI21 Gaming XL D412 - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 81ff73bf-0000-1000-80c1-6890a5911040 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['version'] is: 020.001.000.060.016898 - - vbios_info['name'] is: NAVI21 D43001 GLXL - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 1fff73a3-0000-1000-8075-223e5e64eac1 -PASSED -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_ecc ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_ras_feature_info - -**** [ERROR] | Test: test_ecc | Caught AmdSmiLibraryException -PASSED 
-../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_gpu_performance ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 1 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 8 - power_info['gfx_voltage'] is: 768 - power_info['soc_voltage'] is: 918 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 203000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 42 - Current temperature for HOTSPOT is: 43 - Current temperature for VRAM is: 38 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 100 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 105 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2475 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 
is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 4 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 5000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 0 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 13 - power_info['gfx_voltage'] is: 787 - power_info['soc_voltage'] is: 806 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 213000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 34 - Current temperature for HOTSPOT is: 37 - Current temperature for VRAM is: 36 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 109 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 114 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2555 - Min 
clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 16 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 8000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -PASSED -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_walkthrough ###Test amdsmi_get_processor_handles() -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 0 -###Test Processor 0, bdf: 0000:08:00.0 +###Test Processor 0, bdf: 0000:43:00.0 ###Test amdsmi_get_gpu_asic_info + asic_info['market_name'] is: NAVI21 asic_info['vendor_id'] is: 0x1002 asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] asic_info['device_id'] is: 0x73bf - asic_info['rev_id'] is: 0xc3 - - asic_info['asic_serial'] is: 0x81C16890A5911040 - + asic_info['rev_id'] is: 0xc1 + asic_info['asic_serial'] is: 0xF8FFEB47A027DE4D asic_info['oam_id'] is: N/A + asic_info['target_graphics_version'] is: gfx1030 + asic_info['num_compute_units'] is: 72 + +###Test amdsmi_get_gpu_kfd_info + + kfd_info['kfd_id'] is: 16970 + kfd_info['node_id'] is: 1 + +ok +test_bad_page_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_bad_page_info + +**** [ERROR] | Test: test_bad_page_info | Caught AmdSmiLibraryException: Error code: + 2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported +ok +test_bdf_device_id (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_processor_handle_from_bdf -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 203000000 ###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['name'] is: NAVI21 Gaming XL D412 - vbios_info['version'] is: 020.001.000.038.015697 + vbios_info['part_number'] is: 113-V395TRIO-2OC + vbios_info['build_date'] is: 2021/03/28 21:35 + vbios_info['version'] is: 020.001.000.060.000000 + vbios_info['name'] is: 113-MSITV395MH.132 + +###Test amdsmi_get_gpu_device_uuid + + uuid is: f8ff73bf-0000-1000-80ff-eb47a027de4d + +ok +test_board_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 ###Test amdsmi_get_gpu_board_info + board_info['model_number'] is: N/A - board_info['product_serial'] is: N/A - board_info['fru_id'] is: N/A - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. 
[AMD/ATI] - board_info['product_name'] is: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 +ok +test_clock_frequency (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_clk_freq + + SYS clock_frequency['num_supported']: 2 + SYS clock_frequency['current']: 0 + SYS clock_frequency['frequency']: [500000000, 2575000000] + DF clock_frequency['num_supported']: 3 + DF clock_frequency['current']: 1 + DF clock_frequency['frequency']: [500000000, 666000000, 1941000000] + +ok +test_clock_frequency_DCEF (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_clk_freq + + DCEF clock_frequency['num_supported']: 2 + DCEF clock_frequency['current']: 0 + DCEF clock_frequency['frequency']: [417000000, 1200000000] + +ok +test_clock_info (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_clock_info + + Current clock for domain GFX is: 500 + Max clock for domain GFX is: 2575 + Min clock for domain GFX is: 500 + Is GFX clock locked: 0 + Is GFX clock in deep sleep: 255 + Current clock for domain MEM is: 96 + Max clock for domain MEM is: 1000 + Min clock for domain MEM is: 96 + Is MEM clock in deep sleep: 255 + +ok +test_clock_info_vclk0_dclk0 (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_clock_info + + Current clock for domain VCLK0 is: 1400 + Max clock for domain VCLK0 is: 1400 + Min clock for domain VCLK0 is: 0 + Is VCLK0 clock in deep sleep: 255 + Current clock for domain DCLK0 is: 1225 + Max clock for domain DCLK0 is: 1225 + Min clock for domain DCLK0 is: 0 + Is DCLK0 clock in deep sleep: 255 + +ok +test_clock_info_vclk1_dclk1 (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_clock_info + + Current clock for domain VCLK1 is: 362 + Max clock for domain VCLK1 is: 362 + Min clock for domain VCLK1 is: 0 + Is VCLK1 clock in deep sleep: 255 + Current clock for domain DCLK1 is: 316 + Max clock for domain DCLK1 is: 316 + Min clock for domain DCLK1 is: 0 + Is DCLK1 clock in deep sleep: 255 + +ok +test_driver_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + ###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 0 -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 1 -###Test Processor 1, bdf: 0000:44:00.0 -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - asic_info['vendor_id'] is: 0x1002 - asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] - asic_info['device_id'] is: 0x73a3 - asic_info['rev_id'] is: 0x00 +Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.9.2', 'driver_date': '2015/01/01 00:00'} - asic_info['asic_serial'] is: 0x1F75223E5E64EAC1 +ok +test_ecc_count_block (__main__.TestAmdSmiPythonInterface) ... - asic_info['oam_id'] is: N/A +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_ecc_count + +**** [ERROR] | Test: test_ecc_count_block | Caught AmdSmiLibraryException: Error code: + 2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported +ok +test_ecc_count_total (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_total_ecc_count + +Number of uncorrectable errors: 0 +Number of correctable errors: 0 +Number of deferred errors: 0 + +ok +test_fw_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_fw_info + + FW name: AMDSMI_FW_ID_CP_CE + FW version: 37 + FW name: AMDSMI_FW_ID_CP_PFP + FW version: 98 + FW name: AMDSMI_FW_ID_CP_ME + FW version: 64 + FW name: AMDSMI_FW_ID_CP_MEC1 + FW version: 118 + FW name: AMDSMI_FW_ID_CP_MEC2 + FW version: 118 + FW name: AMDSMI_FW_ID_RLC + FW version: 96 + FW name: AMDSMI_FW_ID_SDMA0 + FW version: 83 + FW name: AMDSMI_FW_ID_SDMA1 + FW version: 83 + FW name: AMDSMI_FW_ID_VCN + FW version: 04.11.F0.00 + FW name: AMDSMI_FW_ID_PSP_SOSDRV + FW version: 00.21.0E.64 + FW name: AMDSMI_FW_ID_ASD + FW version: 553648350 + FW name: AMDSMI_FW_ID_TA_RAS + FW version: 1B.00.01.3E + FW name: AMDSMI_FW_ID_TA_XGMI + FW version: 20.00.00.0F + FW name: AMDSMI_FW_ID_PM + FW version: 00.58.90.00 + +ok +test_gpu_activity (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_activity + + engine_usage['gfx_activity'] is: 9 % + engine_usage['umc_activity'] is: 0 % + engine_usage['mm_activity'] is: 22 % + +ok +test_memory_usage (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_memory_usage + + memory_usage for VRAM is: 17104896 + memory_usage for VIS_VRAM is: 17104896 + memory_usage for GTT is: 15065088 + +ok +test_pcie_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_pcie_info + + pcie_info['pcie_metric']['pcie_width'] is: 16 + pcie_info['pcie_static']['max_pcie_width'] is: 16 + pcie_info['pcie_metric']['pcie_speed'] is: 16000 MT/s + pcie_info['pcie_static']['max_pcie_speed'] is: 16000 + pcie_info['pcie_static']['pcie_interface_version'] is: 4 + pcie_info['pcie_static']['slot_type'] is: CEM + pcie_info['pcie_metric']['pcie_replay_count'] is: N/A + pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A + pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A + pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A + pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A + pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A + +ok +test_power_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_power_info + + power_info['current_socket_power'] is: N/A + power_info['average_socket_power'] is: 39 + power_info['gfx_voltage'] is: 856 + power_info['soc_voltage'] is: 937 + power_info['mem_voltage'] is: 843 + power_info['power_limit'] is: 272000000 ###Test amdsmi_get_power_cap_info + power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 213000000 + power_info['power_cap'] is: 272000000 + +###Test amdsmi_is_gpu_power_management_enabled + + Power management enabled: True + +ok +test_process_list (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_process_list + + Process list: [] + +ok +test_processor_type (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_processor_type + + Processor type is: AMDSMI_PROCESSOR_TYPE_AMD_GPU + +ok +test_ras_block_features_enabled (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_ras_block_features_enabled + +**** [ERROR] | Test: test_ras_block_features_enabled | Caught AmdSmiLibraryException: Error code: + 7 | AMDSMI_STATUS_API_FAILED - API call failed +ok +test_ras_feature_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_ras_feature_info + +**** [ERROR] | Test: test_ras_feature_info | Caught AmdSmiLibraryException: Error code: + 2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported +ok +test_socket_info (__main__.TestAmdSmiPythonInterface) ... + +###Test amdsmi_get_socket_handles + + +###Test Socket 0 + +###Test amdsmi_get_socket_info + + Socket: 0000:43:00 + +ok +test_temperature_metric (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_temp_metric + + Current temperature for HOTSPOT is: 35 + Current temperature for VRAM is: 32 + +###Test amdsmi_get_temp_metric + + Limit (critical) temperature for HOTSPOT is: 110 + Limit (critical) temperature for VRAM is: 100 + +###Test amdsmi_get_temp_metric + + Shutdown (emergency) temperature for HOTSPOT is: 115 + Shutdown (emergency) temperature for VRAM is: 105 + +ok +test_temperature_metric_edge (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_temp_metric + + Current temperature for EDGE is: 31 + Limit (critical) temperature for EDGE is: 100 + Shutdown (emergency) temperature for EDGE is: 105 + +ok +test_temperature_metric_hbm (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_temp_metric + +**** [ERROR] | Test: test_temperature_metric_hbm | Caught AmdSmiLibraryException: Error code: + 2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported +ok +test_temperature_metric_plx (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_temp_metric + + Current temperature for PLX is: 30 + Limit (critical) temperature for PLX is: 30 + Shutdown (emergency) temperature for PLX is: 30 + +ok +test_utilization_count (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_utilization_count + + Timestamp: 2570588934628027 + Utilization count for AMDSMI_COARSE_GRAIN_GFX_ACTIVITY is: 4294967295 + Utilization count for AMDSMI_COARSE_GRAIN_MEM_ACTIVITY is: 4294967295 + Utilization count for AMDSMI_COARSE_DECODER_ACTIVITY is: 0 + + Timestamp: 2570588935626503 + Utilization count for AMDSMI_FINE_GRAIN_GFX_ACTIVITY is: 4294967295 + Utilization count for AMDSMI_FINE_GRAIN_MEM_ACTIVITY is: 4294967295 + Utilization count for AMDSMI_FINE_DECODER_ACTIVITY is: 0 + +ok +test_vbios_info (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 ###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['name'] is: NAVI21 D43001 GLXL - vbios_info['version'] is: 020.001.000.060.016898 + vbios_info['part_number'] is: 113-V395TRIO-2OC + vbios_info['build_date'] is: 2021/03/28 21:35 + vbios_info['name'] is: 113-MSITV395MH.132 + vbios_info['version'] is: 020.001.000.060.000000 + +ok +test_vendor_name (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_vendor_name + + Vendor name is: Advanced Micro Devices, Inc. [AMD/ATI] + +ok +test_walkthrough (__main__.TestAmdSmiPythonInterface) ... 
+ +####################################################################### +========> test_walkthrough start <======== + + + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_asic_info + + asic_info['market_name'] is: NAVI21 + asic_info['vendor_id'] is: 0x1002 + asic_info['vendor_name'] is: Advanced Micro Devices Inc. [AMD/ATI] + asic_info['device_id'] is: 0x73bf + asic_info['rev_id'] is: 0xc1 + asic_info['asic_serial'] is: 0xF8FFEB47A027DE4D + asic_info['oam_id'] is: N/A + asic_info['target_graphics_version'] is: gfx1030 + asic_info['num_compute_units'] is: 72 + +###Test amdsmi_get_gpu_kfd_info + + kfd_info['kfd_id'] is: 16970 + kfd_info['node_id'] is: 1 + + + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_power_info + + power_info['current_socket_power'] is: N/A + power_info['average_socket_power'] is: 31 + power_info['gfx_voltage'] is: 856 + power_info['soc_voltage'] is: 937 + power_info['mem_voltage'] is: 837 + power_info['power_limit'] is: 272000000 + +###Test amdsmi_get_power_cap_info + + power_info['dpm_cap'] is: 1 + power_info['power_cap'] is: 272000000 + +###Test amdsmi_is_gpu_power_management_enabled + + Power management enabled: True + + + +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_vbios_info + + vbios_info['part_number'] is: 113-V395TRIO-2OC + vbios_info['build_date'] is: 2021/03/28 21:35 + vbios_info['name'] is: 113-MSITV395MH.132 + vbios_info['version'] is: 020.001.000.060.000000 + + + +###Test Processor 0, bdf: 0000:43:00.0 ###Test amdsmi_get_gpu_board_info + board_info['model_number'] is: N/A - board_info['product_serial'] is: N/A - board_info['fru_id'] is: N/A - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. 
[AMD/ATI] + board_info['product_name'] is: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] - board_info['product_name'] is: Navi 21 GL-XL [Radeon PRO W6800] + + +###Test Processor 0, bdf: 0000:43:00.0 ###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 1 -PASSED -====================================================== 6 passed in 0.13s ====================================================== + FW name: AMDSMI_FW_ID_CP_CE + FW version: 37 + FW name: AMDSMI_FW_ID_CP_PFP + FW version: 98 + FW name: AMDSMI_FW_ID_CP_ME + FW version: 64 + FW name: AMDSMI_FW_ID_CP_MEC1 + FW version: 118 + FW name: AMDSMI_FW_ID_CP_MEC2 + FW version: 118 + FW name: AMDSMI_FW_ID_RLC + FW version: 96 + FW name: AMDSMI_FW_ID_SDMA0 + FW version: 83 + FW name: AMDSMI_FW_ID_SDMA1 + FW version: 83 + FW name: AMDSMI_FW_ID_VCN + FW version: 04.11.F0.00 + FW name: AMDSMI_FW_ID_PSP_SOSDRV + FW version: 00.21.0E.64 + FW name: AMDSMI_FW_ID_ASD + FW version: 553648350 + FW name: AMDSMI_FW_ID_TA_RAS + FW version: 1B.00.01.3E + FW name: AMDSMI_FW_ID_TA_XGMI + FW version: 20.00.00.0F + FW name: AMDSMI_FW_ID_PM + FW version: 00.58.90.00 + + + +###Test Processor 0, 
bdf: 0000:43:00.0 + +###Test amdsmi_get_gpu_driver_info + +Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.9.2', 'driver_date': '2015/01/01 00:00'} + + +========> test_walkthrough end <======== +####################################################################### + +ok + +---------------------------------------------------------------------- +Ran 31 tests in 0.592s + +OK ~~~ +
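In the run above, several tests print an `[ERROR] ... Caught AmdSmiLibraryException` line and are still reported as `ok`. The sketch below shows one way that can happen in a `unittest` test: the test catches the exception, logs it, and returns normally. It is an illustration under stated assumptions, not the contents of `integration_test.py`. `StandInLibraryError`, `query_unsupported_feature`, and `DemoInterfaceTests` are hypothetical names, and the `setUp`/`tearDown` hooks only record where a per-test init and shut-down call would go.

```python
import io
import unittest


class StandInLibraryError(Exception):
    """Hypothetical stand-in for a library exception; not the real amdsmi class."""


def query_unsupported_feature():
    # Hypothetical query that the (imaginary) device does not support.
    raise StandInLibraryError("2 | NOT_SUPPORTED - Feature not supported")


class DemoInterfaceTests(unittest.TestCase):
    events = []  # records the per-test set-up and tear-down calls

    def setUp(self):
        # Placeholder for initialising the library before each test.
        self.events.append("init")

    def tearDown(self):
        # Placeholder for shutting the library down after each test.
        self.events.append("shut_down")

    def test_unsupported_feature(self):
        try:
            query_unsupported_feature()
        except StandInLibraryError as err:
            # The error is logged and swallowed, so unittest reports "ok".
            print(f"**** [ERROR] | Test: {self._testMethodName} | Caught: {err}")


def run_demo():
    """Run the demo test case and return the result plus the runner's report."""
    DemoInterfaceTests.events.clear()
    suite = unittest.TestLoader().loadTestsFromTestCase(DemoInterfaceTests)
    stream = io.StringIO()
    result = unittest.TextTestRunner(stream=stream, verbosity=2).run(suite)
    return result, stream.getvalue()
```

With this pattern an `ok` result means the call was exercised without an unhandled failure; it does not by itself mean the feature is supported on the device.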
-### Pytest Run: Verbose on + Filter (or exclude) a Test -Use [Pytest: List tests](###-Pytest:-List-tests) then either exclude (with "not") or only run the specified test. -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "test_gpu_performance" -v``` +### Unittest: filter and verbose +Allow filtering based on common or specific test names. -```python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "not test_gpu_performance" -v``` +```/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "test_walkthrough" -v``` ex.
- Click for example: Pytest Run: Verbose on + Filter (or exclude) a Test + Click for example: Unittest: filter and verbose ~~~shell -python3 -m pytest -p no:cacheprovider /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "not test_gpu_performance" -v -===================================================== test session starts ===================================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /usr/bin/python3 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 6 items / 1 deselected / 5 selected +/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "test_asic_kfd_info" -v +test_asic_kfd_info (__main__.TestAmdSmiPythonInterface) ... -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiInit::test_init PASSED [ 20%] -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_bad_page_info PASSED [ 40%] -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_bdf_device_id PASSED [ 60%] -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_ecc PASSED [ 80%] -../../opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py::TestAmdSmiPythonInterface::test_walkthrough PASSED [100%] +###Test Processor 0, bdf: 0000:43:00.0 -=============================================== 5 passed, 1 deselected in 0.09s =============================================== +###Test amdsmi_get_gpu_asic_info + + asic_info['market_name'] is: NAVI21 + asic_info['vendor_id'] is: 0x1002 + asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] + asic_info['device_id'] is: 0x73bf + asic_info['rev_id'] is: 0xc1 + asic_info['asic_serial'] is: 0xF8FFEB47A027DE4D + asic_info['oam_id'] is: N/A + asic_info['target_graphics_version'] is: gfx1030 + asic_info['num_compute_units'] is: 72 + +###Test amdsmi_get_gpu_kfd_info + + kfd_info['kfd_id'] is: 16970 + kfd_info['node_id'] is: 1 + +ok + +---------------------------------------------------------------------- +Ran 1 test in 0.453s + +OK ~~~
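The `-k` filter used above can be reproduced in-process, which may help when checking which tests a substring would select. This is a minimal sketch, assuming Python 3.7 or newer (where `unittest.TestLoader.testNamePatterns` is available). `DemoTests` and `run_filtered` are hypothetical names and are not part of the AMD SMI test suite. On the command line, a `-k` value without a `*` is treated as a substring match, which the sketch mimics by wrapping the substring in wildcards.

```python
import io
import unittest


class DemoTests(unittest.TestCase):
    """Hypothetical test case used only to illustrate name filtering."""

    def test_temperature_metric(self):
        self.assertTrue(True)

    def test_temperature_metric_edge(self):
        self.assertTrue(True)

    def test_walkthrough(self):
        self.assertTrue(True)


def run_filtered(substring):
    """Run only the DemoTests methods whose full name contains `substring`."""
    loader = unittest.TestLoader()
    # Patterns are matched against "module.Class.method" with fnmatch rules.
    loader.testNamePatterns = [f"*{substring}*"]
    suite = loader.loadTestsFromTestCase(DemoTests)
    # Collect the selected method names before running the suite.
    names = sorted(test.id().split(".")[-1] for test in suite)
    result = unittest.TextTestRunner(stream=io.StringIO(), verbosity=2).run(suite)
    return names, result
```

Calling `run_filtered("temperature")` selects the two temperature tests and skips `test_walkthrough`, in the same way that `-k "temperature"` narrows a command-line run.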
@@ -865,682 +657,95 @@ collected 6 items / 1 deselected / 5 selected Please refer to Python's UnitTest documentation for better overview of commands to run. ```shell -python3 /opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v -test_check_res (tests.amd_smi_test.py-test.unit_tests.TestAmdSmiPythonBDF) ... ok -test_format_bdf (tests.amd_smi_test.py-test.unit_tests.TestAmdSmiPythonBDF) ... ok -test_parse_bdf (tests.amd_smi_test.py-test.unit_tests.TestAmdSmiPythonBDF) ... ok -``` - -```shell -python3 /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -v -test_init (__main__.TestAmdSmiInit) ... ok -test_bad_page_info (__main__.TestAmdSmiPythonInterface) ... ###Test amdsmi_get_gpu_bad_page_info - -**** [ERROR] | Test: test_bad_page_info | Caught AmdSmiLibraryException -ok -test_bdf_device_id (__main__.TestAmdSmiPythonInterface) ... ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['version'] is: 020.001.000.038.015697 - - vbios_info['name'] is: NAVI21 Gaming XL D412 - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 81ff73bf-0000-1000-80c1-6890a5911040 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['version'] is: 020.001.000.060.016898 - - vbios_info['name'] is: NAVI21 D43001 GLXL - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 1fff73a3-0000-1000-8075-223e5e64eac1 -ok -test_ecc (__main__.TestAmdSmiPythonInterface) ... ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_ras_feature_info - -**** [ERROR] | Test: test_ecc | Caught AmdSmiLibraryException -ok -test_gpu_performance (__main__.TestAmdSmiPythonInterface) ... 
###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 5 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 8 - power_info['gfx_voltage'] is: 768 - power_info['soc_voltage'] is: 918 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 203000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 41 - Current temperature for HOTSPOT is: 42 - Current temperature for VRAM is: 38 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 100 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 105 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2475 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test 
amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 4 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 5000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 0 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 12 - power_info['gfx_voltage'] is: 787 - power_info['soc_voltage'] is: 806 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 213000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True -###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 33 - Current temperature for HOTSPOT is: 37 - Current temperature for VRAM is: 36 -###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 109 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 114 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2555 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM 
is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 16 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 8000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -ok -test_walkthrough (__main__.TestAmdSmiPythonInterface) ... ###Test amdsmi_get_processor_handles() -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 0 -###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: NAVI21 - asic_info['vendor_id'] is: 0x1002 - asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] - asic_info['device_id'] is: 0x73bf - asic_info['rev_id'] is: 0xc3 - - asic_info['asic_serial'] is: 0x81C16890A5911040 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 203000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['name'] is: NAVI21 Gaming XL D412 - - vbios_info['version'] is: 020.001.000.038.015697 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. [AMD/ATI] - - board_info['product_name'] is: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 0 -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 1 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - asic_info['vendor_id'] is: 0x1002 - 
asic_info['vendor_name'] is: Advanced Micro Devices Inc. [AMD/ATI] - asic_info['device_id'] is: 0x73a3 - asic_info['rev_id'] is: 0x00 - - asic_info['asic_serial'] is: 0x1F75223E5E64EAC1 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 213000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['name'] is: NAVI21 D43001 GLXL - - vbios_info['version'] is: 020.001.000.060.016898 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. [AMD/ATI] - - board_info['product_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 1 -ok +/opt/rocm/share/amd_smi/tests/python_unittest/unit_tests.py -v +test_check_res (__main__.TestAmdSmiPythonBDF) ... ok +test_format_bdf (__main__.TestAmdSmiPythonBDF) ... 
ok +test_parse_bdf (__main__.TestAmdSmiPythonBDF) ... ok ---------------------------------------------------------------------- -Ran 6 tests in 0.077s +Ran 3 tests in 0.001s OK ``` ```shell -(Tue Jul-7 12:07:47am)-(CPU 0.3%:0:Net 18)-(charpoag@mlsetools2:/opt/rocm/share/amd_smi/tests/python_unittest)-(44K:3) -> python3 -m pytest -s -ra -vvv -p no:cacheprovider -==================================== test session starts ===================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /usr/bin/python3 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 6 items +/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "temperature" -v +test_temperature_metric (__main__.TestAmdSmiPythonInterface) ... -integration_test.py::TestAmdSmiInit::test_init PASSED -integration_test.py::TestAmdSmiPythonInterface::test_bad_page_info ###Test amdsmi_get_gpu_bad_page_info +###Test Processor 0, bdf: 0000:43:00.0 -**** [ERROR] | Test: test_bad_page_info | Caught AmdSmiLibraryException -PASSED -integration_test.py::TestAmdSmiPythonInterface::test_bdf_device_id ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['version'] is: 020.001.000.038.015697 - - vbios_info['name'] is: NAVI21 Gaming XL D412 - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 81ff73bf-0000-1000-80c1-6890a5911040 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_vbios_info - - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['version'] is: 020.001.000.060.016898 - - vbios_info['name'] is: NAVI21 D43001 GLXL - -###Test amdsmi_get_gpu_device_uuid - - uuid is: 1fff73a3-0000-1000-8075-223e5e64eac1 -PASSED -integration_test.py::TestAmdSmiPythonInterface::test_ecc ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_ras_feature_info - 
-**** [ERROR] | Test: test_ecc | Caught AmdSmiLibraryException -PASSED -integration_test.py::TestAmdSmiPythonInterface::test_gpu_performance ###Test Processor 0, bdf: 0000:08:00.0 - -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 3 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % - -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 8 - power_info['gfx_voltage'] is: 768 - power_info['soc_voltage'] is: 918 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 203000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True ###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 44 - Current temperature for HOTSPOT is: 45 - Current temperature for VRAM is: 40 + + Current temperature for HOTSPOT is: 33 + Current temperature for VRAM is: 32 + ###Test amdsmi_get_temp_metric + + Limit (critical) temperature for HOTSPOT is: 110 + Limit (critical) temperature for VRAM is: 100 + +###Test amdsmi_get_temp_metric + + Shutdown (emergency) temperature for HOTSPOT is: 115 + Shutdown (emergency) temperature for VRAM is: 105 + +ok +test_temperature_metric_edge (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + +###Test amdsmi_get_temp_metric + + Current temperature for EDGE is: 31 Limit (critical) temperature for EDGE is: 100 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric Shutdown (emergency) temperature for EDGE is: 105 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2475 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 4 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 5000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - 
pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -###Test Processor 1, bdf: 0000:44:00.0 -###Test amdsmi_get_gpu_activity - engine_usage['gfx_activity'] is: 0 % - engine_usage['umc_activity'] is: 0 % - engine_usage['mm_activity'] is: 0 % +ok +test_temperature_metric_hbm (__main__.TestAmdSmiPythonInterface) ... + +###Test Processor 0, bdf: 0000:43:00.0 -###Test amdsmi_get_power_info - power_info['current_socket_power'] is: N/A - power_info['average_socket_power'] is: 13 - power_info['gfx_voltage'] is: 781 - power_info['soc_voltage'] is: 806 - power_info['mem_voltage'] is: 1250 - power_info['power_limit'] is: 213000000 -###Test amdsmi_is_gpu_power_management_enabled - Is power management enabled is: True ###Test amdsmi_get_temp_metric - Current temperature for EDGE is: 36 - Current temperature for HOTSPOT is: 39 - Current temperature for VRAM is: 38 + +**** [ERROR] | Test: test_temperature_metric_hbm | Caught AmdSmiLibraryException: Error code: + 2 | AMDSMI_STATUS_NOT_SUPPORTED - Feature not supported +ok +test_temperature_metric_plx (__main__.TestAmdSmiPythonInterface) ... 
+ +###Test Processor 0, bdf: 0000:43:00.0 + ###Test amdsmi_get_temp_metric - Limit (critical) temperature for EDGE is: 109 - Limit (critical) temperature for HOTSPOT is: 110 - Limit (critical) temperature for VRAM is: 100 -###Test amdsmi_get_temp_metric - Shutdown (emergency) temperature for EDGE is: 114 - Shutdown (emergency) temperature for HOTSPOT is: 115 - Shutdown (emergency) temperature for VRAM is: 105 -###Test amdsmi_get_clock_info - Current clock for domain GFX is: 500 - Max clock for domain GFX is: 2555 - Min clock for domain GFX is: 500 - Is GFX clock locked: 0 - Is GFX clock in deep sleep: 255 - Current clock for domain MEM is: 96 - Max clock for domain MEM is: 1000 - Min clock for domain MEM is: 96 - Is MEM clock in deep sleep: 255 - Current clock for domain VCLK0 is: 0 - Max clock for domain VCLK0 is: 0 - Min clock for domain VCLK0 is: 0 - Is VCLK0 clock in deep sleep: 255 - Current clock for domain VCLK1 is: 0 - Max clock for domain VCLK1 is: 0 - Min clock for domain VCLK1 is: 0 - Is VCLK1 clock in deep sleep: 255 - Current clock for domain DCLK0 is: 0 - Max clock for domain DCLK0 is: 0 - Min clock for domain DCLK0 is: 0 - Is DCLK0 clock in deep sleep: 255 - Current clock for domain DCLK1 is: 0 - Max clock for domain DCLK1 is: 0 - Min clock for domain DCLK1 is: 0 - Is DCLK1 clock in deep sleep: 255 -###Test amdsmi_get_pcie_info - pcie_info['pcie_metric']['pcie_width'] is: 16 - pcie_info['pcie_static']['max_pcie_width'] is: 16 - pcie_info['pcie_metric']['pcie_speed'] is: 8000 MT/s - pcie_info['pcie_static']['max_pcie_speed'] is: 16000 - pcie_info['pcie_static']['pcie_interface_version'] is: 4 - pcie_info['pcie_static']['slot_type'] is: CEM - pcie_info['pcie_metric']['pcie_replay_count'] is: N/A - pcie_info['pcie_metric']['pcie_bandwidth'] is: N/A - pcie_info['pcie_metric']['pcie_l0_to_recovery_count'] is: N/A - pcie_info['pcie_metric']['pcie_replay_roll_over_count'] is: N/A - pcie_info['pcie_metric']['pcie_nak_sent_count'] is: N/A - 
pcie_info['pcie_metric']['pcie_nak_received_count'] is: N/A -PASSED -integration_test.py::TestAmdSmiPythonInterface::test_walkthrough ###Test amdsmi_get_processor_handles() -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 0 -###Test Processor 0, bdf: 0000:08:00.0 -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: NAVI21 - asic_info['vendor_id'] is: 0x1002 - asic_info['vendor_name'] is: Advanced Micro Devices Inc. [AMD/ATI] - asic_info['device_id'] is: 0x73bf - asic_info['rev_id'] is: 0xc3 + Current temperature for PLX is: 30 + Limit (critical) temperature for PLX is: 30 + Shutdown (emergency) temperature for PLX is: 30 - asic_info['asic_serial'] is: 0x81C16890A5911040 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 203000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D41207XL-038 - vbios_info['build_date'] is: 2020/10/06 17:59 - vbios_info['name'] is: NAVI21 Gaming XL D412 - - vbios_info['version'] is: 020.001.000.038.015697 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. 
[AMD/ATI] - - board_info['product_name'] is: Navi 21 [Radeon RX 6800/6800 XT / 6900 XT] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 0 -###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = 1 -###Test Processor 1, bdf: 0000:44:00.0 - -###Test amdsmi_get_gpu_asic_info - asic_info['market_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - asic_info['vendor_id'] is: 0x1002 - asic_info['vendor_name'] is: Advanced Micro Devices Inc. 
[AMD/ATI] - asic_info['device_id'] is: 0x73a3 - asic_info['rev_id'] is: 0x00 - - asic_info['asic_serial'] is: 0x1F75223E5E64EAC1 - - asic_info['oam_id'] is: N/A - -###Test amdsmi_get_power_cap_info - power_info['dpm_cap'] is: 1 - power_info['power_cap'] is: 213000000 - -###Test amdsmi_get_gpu_vbios_info - vbios_info['part_number'] is: 113-D4300100-100 - vbios_info['build_date'] is: 2021/04/22 09:34 - vbios_info['name'] is: NAVI21 D43001 GLXL - - vbios_info['version'] is: 020.001.000.060.016898 - -###Test amdsmi_get_gpu_board_info - board_info['model_number'] is: N/A - - board_info['product_serial'] is: N/A - - board_info['fru_id'] is: N/A - - board_info['manufacturer_name'] is: Advanced Micro Devices, Inc. [AMD/ATI] - - board_info['product_name'] is: Navi 21 GL-XL [Radeon PRO W6800] - -###Test amdsmi_get_fw_info -FW name: AMDSMI_FW_ID_CP_CE -FW version: 37 -FW name: AMDSMI_FW_ID_CP_PFP -FW version: 98 -FW name: AMDSMI_FW_ID_CP_ME -FW version: 64 -FW name: AMDSMI_FW_ID_CP_MEC1 -FW version: 118 -FW name: AMDSMI_FW_ID_CP_MEC2 -FW version: 118 -FW name: AMDSMI_FW_ID_RLC -FW version: 96 -FW name: AMDSMI_FW_ID_SDMA0 -FW version: 83 -FW name: AMDSMI_FW_ID_SDMA1 -FW version: 83 -FW name: AMDSMI_FW_ID_VCN -FW version: 31.1E.00.8 -FW name: AMDSMI_FW_ID_PSP_SOSDRV -FW version: 21.0E.64 -FW name: AMDSMI_FW_ID_ASD -FW version: 553648340 -FW name: AMDSMI_FW_ID_TA_RAS -FW version: 1B.00.01.3E -FW name: AMDSMI_FW_ID_TA_XGMI -FW version: 20.00.00.0F -FW name: AMDSMI_FW_ID_PM -FW version: 58.89.0 -###Test amdsmi_get_gpu_driver_info -Driver info: {'driver_name': 'amdgpu', 'driver_version': '6.7.8', 'driver_date': '2015/01/01 00:00'} -###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = 1 -PASSED - -===================================== 6 passed in 0.10s ====================================== -``` - -```shell -$ python3 /opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "*test_init" -vvv -test_init (__main__.TestAmdSmiInit) ... 
ok +ok ---------------------------------------------------------------------- -Ran 1 test in 0.009s +Ran 4 tests in 0.466s OK - ``` ```shell -(Tue Jul-7 12:10:10am)-(CPU 0.3%:0:Net 16)-(charpoag@mlsetools2:/opt/rocm/share/amd_smi/tests/python_unittest)-(44K:3) -> python3 -m pytest -ra -vvv -p no:cacheprovider -==================================== test session starts ===================================== -platform linux -- Python 3.8.10, pytest-8.2.2, pluggy-1.5.0 -- /usr/bin/python3 -rootdir: /opt/rocm/share/amd_smi -configfile: pyproject.toml -collected 6 items +/opt/rocm/share/amd_smi/tests/python_unittest/integration_test.py -k "info" -b -v +test_asic_kfd_info (__main__.TestAmdSmiPythonInterface) ... ok +test_bad_page_info (__main__.TestAmdSmiPythonInterface) ... ok +test_board_info (__main__.TestAmdSmiPythonInterface) ... ok +test_clock_info (__main__.TestAmdSmiPythonInterface) ... ok +test_clock_info_vclk0_dclk0 (__main__.TestAmdSmiPythonInterface) ... ok +test_clock_info_vclk1_dclk1 (__main__.TestAmdSmiPythonInterface) ... ok +test_driver_info (__main__.TestAmdSmiPythonInterface) ... ok +test_fw_info (__main__.TestAmdSmiPythonInterface) ... ok +test_pcie_info (__main__.TestAmdSmiPythonInterface) ... ok +test_power_info (__main__.TestAmdSmiPythonInterface) ... ok +test_ras_feature_info (__main__.TestAmdSmiPythonInterface) ... ok +test_socket_info (__main__.TestAmdSmiPythonInterface) ... ok +test_vbios_info (__main__.TestAmdSmiPythonInterface) ... 
ok -integration_test.py::TestAmdSmiInit::test_init PASSED [ 16%] -integration_test.py::TestAmdSmiPythonInterface::test_bad_page_info PASSED [ 33%] -integration_test.py::TestAmdSmiPythonInterface::test_bdf_device_id PASSED [ 50%] -integration_test.py::TestAmdSmiPythonInterface::test_ecc PASSED [ 66%] -integration_test.py::TestAmdSmiPythonInterface::test_gpu_performance PASSED [ 83%] -integration_test.py::TestAmdSmiPythonInterface::test_walkthrough PASSED [100%] +---------------------------------------------------------------------- +Ran 13 tests in 0.506s -===================================== 6 passed in 0.11s ====================================== +OK ``` \ No newline at end of file diff --git a/tests/python_unittest/integration_test.py b/tests/python_unittest/integration_test.py index 7c829e87bc..51cd8e08a0 100755 --- a/tests/python_unittest/integration_test.py +++ b/tests/python_unittest/integration_test.py @@ -32,6 +32,9 @@ import threading import multiprocessing from datetime import datetime +# Note: amdsmi_status_code_to_string is not tested due to the nature and functionality of the AMDSMI Python wrapper. +# The function is to be tested in the future after the wrapper is updated to return status codes after API calls. + def handle_exceptions(func): """Exposes, silences, and logs AMD SMI exceptions to users what exception was raised. 
@@ -46,15 +49,19 @@ def handle_exceptions(func): return func(*args, **kwargs) except amdsmi.AmdSmiRetryException as e: print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiRetryException: {}".format(e)) + amdsmi.amdsmi_shut_down() pass except amdsmi.AmdSmiTimeoutException as e: print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiTimeoutException: {}".format(e)) + amdsmi.amdsmi_shut_down() pass except amdsmi.AmdSmiLibraryException as e: print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught AmdSmiLibraryException: {}".format(e)) + amdsmi.amdsmi_shut_down() pass except Exception as e: print("**** [ERROR] | Test: " + str(func.__name__) + " | Caught unknown exception: {}".format(e)) + amdsmi.amdsmi_shut_down() pass return wrapper @@ -68,13 +75,52 @@ class TestAmdSmiPythonInterface(unittest.TestCase): @handle_exceptions def setUp(self): amdsmi.amdsmi_init() + @handle_exceptions def tearDown(self): amdsmi.amdsmi_shut_down() - # Bad page is not supported in Navi21 and Navi31 + def test_asic_kfd_info(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_asic_info \n") + asic_info = amdsmi.amdsmi_get_gpu_asic_info(processors[i]) + print(" asic_info['market_name'] is: {}".format( + asic_info['market_name'])) + print(" asic_info['vendor_id'] is: {}".format( + asic_info['vendor_id'])) + print(" asic_info['vendor_name'] is: {}".format( + asic_info['vendor_name'])) + print(" asic_info['device_id'] is: {}".format( + asic_info['device_id'])) + print(" asic_info['rev_id'] is: {}".format( + asic_info['rev_id'])) + print(" asic_info['asic_serial'] is: {}".format( + asic_info['asic_serial'])) + print(" asic_info['oam_id'] is: {}".format( + 
asic_info['oam_id'])) + print(" asic_info['target_graphics_version'] is: {}".format( + asic_info['target_graphics_version'])) + print(" asic_info['num_compute_units'] is: {}".format( + asic_info['num_compute_units'])) + print("\n###Test amdsmi_get_gpu_kfd_info \n") + kfd_info = amdsmi.amdsmi_get_gpu_kfd_info(processors[i]) + print(" kfd_info['kfd_id'] is: {}".format( + kfd_info['kfd_id'])) + print(" kfd_info['node_id'] is: {}".format( + kfd_info['node_id'])) + print() + self.tearDown() + + # amdsmi_get_gpu_bad_page_info is not supported in Navi2x, Navi3x @handle_exceptions def test_bad_page_info(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -96,14 +142,17 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print() j += 1 print() + self.tearDown() def test_bdf_device_id(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) for i in range(0, len(processors)): bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_processor_handle_from_bdf \n") processor = amdsmi.amdsmi_get_processor_handle_from_bdf(bdf) print("\n###Test amdsmi_get_gpu_vbios_info \n") vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processor) @@ -119,49 +168,83 @@ class TestAmdSmiPythonInterface(unittest.TestCase): uuid = amdsmi.amdsmi_get_gpu_device_uuid(processor) print(" uuid is: {}".format(uuid)) print() + self.tearDown() - def test_ecc(self): + def test_board_info(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) for i in range(0, len(processors)): bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) - print("\n###Test 
amdsmi_get_gpu_total_ecc_count \n") - ecc_info = amdsmi.amdsmi_get_gpu_total_ecc_count(processors[i]) - print("Number of uncorrectable errors: {}".format( - ecc_info['uncorrectable_count'])) - print("Number of correctable errors: {}".format( - ecc_info['correctable_count'])) - print("Number of deferred errors: {}".format( - ecc_info['deferred_count'])) - self.assertGreaterEqual(ecc_info['uncorrectable_count'], 0) - self.assertGreaterEqual(ecc_info['correctable_count'], 0) - self.assertGreaterEqual(ecc_info['deferred_count'], 0) + print("\n###Test amdsmi_get_gpu_board_info \n") + board_info = amdsmi.amdsmi_get_gpu_board_info(processors[i]) + print(" board_info['model_number'] is: {}".format( + board_info['model_number'])) + print(" board_info['product_serial'] is: {}".format( + board_info['product_serial'])) + print(" board_info['fru_id'] is: {}".format( + board_info['fru_id'])) + print(" board_info['manufacturer_name'] is: {}".format( + board_info['manufacturer_name'])) + print(" board_info['product_name'] is: {}".format( + board_info['product_name'])) print() + self.tearDown() - # RAS is not supported in Navi21 and Navi31 + def test_clock_frequency(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_clk_freq \n") + clock_frequency = amdsmi.amdsmi_get_clk_freq( + processors[i], amdsmi.AmdSmiClkType.SYS) + print(" SYS clock_frequency['num_supported']: {}".format( + clock_frequency['num_supported'])) + print(" SYS clock_frequency['current']: {}".format( + clock_frequency['current'])) + print(" SYS clock_frequency['frequency']: {}".format( + clock_frequency['frequency'])) + clock_frequency = amdsmi.amdsmi_get_clk_freq( + processors[i], amdsmi.AmdSmiClkType.DF) + print(" DF 
clock_frequency['num_supported']: {}".format( + clock_frequency['num_supported'])) + print(" DF clock_frequency['current']: {}".format( + clock_frequency['current'])) + print(" DF clock_frequency['frequency']: {}".format( + clock_frequency['frequency'])) + print() + self.tearDown() + + # amdsmi_get_clk_freq with AmdSmiClkType.DCEF is not supported in MI210, MI300A @handle_exceptions - def test_ras(self): + def test_clock_frequency_DCEF(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) for i in range(0, len(processors)): bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) - print("\n###Test amdsmi_get_gpu_ras_feature_info \n") - ras_feature = amdsmi.amdsmi_get_gpu_ras_feature_info(processors[i]) - print("ras_feature: " + str(ras_feature)) - if ras_feature != None: - print("ras_feature: " + str(ras_feature)) - print("RAS eeprom version: {}".format(ras_feature['eeprom_version'])) - print("RAS parity schema: {}".format(ras_feature['parity_schema'])) - print("RAS single bit schema: {}".format(ras_feature['single_bit_schema'])) - print("RAS double bit schema: {}".format(ras_feature['double_bit_schema'])) - print("Poisioning supported: {}".format(ras_feature['poison_schema'])) + print("\n###Test amdsmi_get_clk_freq \n") + clock_frequency = amdsmi.amdsmi_get_clk_freq( + processors[i], amdsmi.AmdSmiClkType.DCEF) + print(" DCEF clock_frequency['num_supported']: {}".format( + clock_frequency['num_supported'])) + print(" DCEF clock_frequency['current']: {}".format( + clock_frequency['current'])) + print(" DCEF clock_frequency['frequency']: {}".format( + clock_frequency['frequency'])) print() + self.tearDown() def test_clock_info(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -192,10 +275,12 @@ 
class TestAmdSmiPythonInterface(unittest.TestCase): print(" Is MEM clock in deep sleep: {}".format( clock_measure['clk_deep_sleep'])) print() + self.tearDown() - # VCLK0 and DCLK0 are not supported in MI210 + # AmdSmiClkType.VCLK0 and DCLK0 are not supported in MI210 @handle_exceptions - def test_gpu_clock_vclk0_dclk0(self): + def test_clock_info_vclk0_dclk0(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -224,10 +309,12 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print(" Is DCLK0 clock in deep sleep: {}".format( clock_measure['clk_deep_sleep'])) print() + self.tearDown() - # VCLK1 and DCLK1 are not supported in Navi 31, MI210, and MI300 + # AmdSmiClkType.VCLK1 and DCLK1 are not supported in MI210, MI300A, MI300X @handle_exceptions - def test_gpu_clock_vclk1_dclk1(self): + def test_clock_info_vclk1_dclk1(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -256,8 +343,118 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print(" Is DCLK1 clock in deep sleep: {}".format( clock_measure['clk_deep_sleep'])) print() + self.tearDown() + + def test_driver_info(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_driver_info \n") + driver_info = amdsmi.amdsmi_get_gpu_driver_info(processors[i]) + print("Driver info: {}".format(driver_info)) + print() + self.tearDown() + + # amdsmi_get_gpu_ecc_count is not supported in Navi2x, Navi3x, MI210, MI300A + @handle_exceptions + def test_ecc_count_block(self): + self.setUp() + processors = 
amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + gpu_blocks = { + "INVALID": amdsmi.AmdSmiGpuBlock.INVALID, + "UMC": amdsmi.AmdSmiGpuBlock.UMC, + "SDMA": amdsmi.AmdSmiGpuBlock.SDMA, + "GFX": amdsmi.AmdSmiGpuBlock.GFX, + "MMHUB": amdsmi.AmdSmiGpuBlock.MMHUB, + "ATHUB": amdsmi.AmdSmiGpuBlock.ATHUB, + "PCIE_BIF": amdsmi.AmdSmiGpuBlock.PCIE_BIF, + "HDP": amdsmi.AmdSmiGpuBlock.HDP, + "XGMI_WAFL": amdsmi.AmdSmiGpuBlock.XGMI_WAFL, + "DF": amdsmi.AmdSmiGpuBlock.DF, + "SMN": amdsmi.AmdSmiGpuBlock.SMN, + "SEM": amdsmi.AmdSmiGpuBlock.SEM, + "MP0": amdsmi.AmdSmiGpuBlock.MP0, + "MP1": amdsmi.AmdSmiGpuBlock.MP1, + "FUSE": amdsmi.AmdSmiGpuBlock.FUSE, + "MCA": amdsmi.AmdSmiGpuBlock.MCA, + "VCN": amdsmi.AmdSmiGpuBlock.VCN, + "JPEG": amdsmi.AmdSmiGpuBlock.JPEG, + "IH": amdsmi.AmdSmiGpuBlock.IH, + "MPIO": amdsmi.AmdSmiGpuBlock.MPIO, + "RESERVED": amdsmi.AmdSmiGpuBlock.RESERVED + } + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_ecc_count \n") + for block_name, block_code in gpu_blocks.items(): + ecc_count = amdsmi.amdsmi_get_gpu_ecc_count( + processors[i], block_code) + print(" Number of uncorrectable errors for {}: {}".format( + block_name, ecc_count['uncorrectable_count'])) + print(" Number of correctable errors for {}: {}".format( + block_name, ecc_count['correctable_count'])) + print(" Number of deferred errors for {}: {}".format( + block_name, ecc_count['deferred_count'])) + self.assertGreaterEqual(ecc_count['uncorrectable_count'], 0) + self.assertGreaterEqual(ecc_count['correctable_count'], 0) + self.assertGreaterEqual(ecc_count['deferred_count'], 0) + print() + print() + self.tearDown() + + def test_ecc_count_total(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + 
self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_total_ecc_count \n") + ecc_info = amdsmi.amdsmi_get_gpu_total_ecc_count(processors[i]) + print("Number of uncorrectable errors: {}".format( + ecc_info['uncorrectable_count'])) + print("Number of correctable errors: {}".format( + ecc_info['correctable_count'])) + print("Number of deferred errors: {}".format( + ecc_info['deferred_count'])) + self.assertGreaterEqual(ecc_info['uncorrectable_count'], 0) + self.assertGreaterEqual(ecc_info['correctable_count'], 0) + self.assertGreaterEqual(ecc_info['deferred_count'], 0) + print() + self.tearDown() + + def test_fw_info(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_fw_info \n") + fw_info = amdsmi.amdsmi_get_fw_info(processors[i]) + fw_num = len(fw_info['fw_list']) + self.assertLessEqual(fw_num, len(amdsmi.AmdSmiFwBlock)) + for j in range(0, fw_num): + fw = fw_info['fw_list'][j] + if fw['fw_version'] != 0: + print(" FW name: {}".format( + fw['fw_name'].name)) + print(" FW version: {}".format( + fw['fw_version'])) + print() + self.tearDown() def test_gpu_activity(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -273,8 +470,31 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print(" engine_usage['mm_activity'] is: {} %".format( engine_usage['mm_activity'])) print() + self.tearDown() - def test_pcie(self): + def test_memory_usage(self): + self.setUp() + processors = 
amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_memory_usage \n") + memory_usage = amdsmi.amdsmi_get_gpu_memory_usage( + processors[i], amdsmi.AmdSmiMemoryType.VRAM) + print(" memory_usage for VRAM is: {}".format(memory_usage)) + memory_usage = amdsmi.amdsmi_get_gpu_memory_usage( + processors[i], amdsmi.AmdSmiMemoryType.VIS_VRAM) + print(" memory_usage for VIS_VRAM is: {}".format(memory_usage)) + memory_usage = amdsmi.amdsmi_get_gpu_memory_usage( + processors[i], amdsmi.AmdSmiMemoryType.GTT) + print(" memory_usage for GTT is: {}".format(memory_usage)) + print() + self.tearDown() + + def test_pcie_info(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -308,8 +528,10 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print(" pcie_info['pcie_metric']['pcie_nak_received_count'] is: {}".format( pcie_info['pcie_metric']['pcie_nak_received_count'])) print() + self.tearDown() - def test_power(self): + def test_power_info(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -330,13 +552,99 @@ class TestAmdSmiPythonInterface(unittest.TestCase): power_info['mem_voltage'])) print(" power_info['power_limit'] is: {}".format( power_info['power_limit'])) + print("\n###Test amdsmi_get_power_cap_info \n") + power_cap_info = amdsmi.amdsmi_get_power_cap_info(processors[i]) + print(" power_cap_info['dpm_cap'] is: {}".format( + power_cap_info['dpm_cap'])) + print(" power_cap_info['power_cap'] is: {}".format( + power_cap_info['power_cap'])) print("\n###Test amdsmi_is_gpu_power_management_enabled \n") 
is_power_management_enabled = amdsmi.amdsmi_is_gpu_power_management_enabled(processors[i]) - print(" Is power management enabled is: {}".format( + print(" Power management enabled: {}".format( is_power_management_enabled)) print() + self.tearDown() - def test_temperature(self): + def test_process_list(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_process_list \n") + process_list = amdsmi.amdsmi_get_gpu_process_list(processors[i]) + print(" Process list: {}".format(process_list)) + print() + self.tearDown() + + def test_processor_type(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_processor_type \n") + processor_type = amdsmi.amdsmi_get_processor_type(processors[i]) + print(" Processor type is: {}".format(processor_type['processor_type'])) + print() + self.tearDown() + + # amdsmi_get_gpu_ras_block_features_enabled is not supported in Navi2x, Navi3x + @handle_exceptions + def test_ras_block_features_enabled(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_ras_block_features_enabled \n") + ras_enabled = amdsmi.amdsmi_get_gpu_ras_block_features_enabled(processors[i]) + for 
j in range(0, len(ras_enabled)): + print(" RAS status for {} is: {}".format(ras_enabled[j]['block'], ras_enabled[j]['status'])) + print() + self.tearDown() + + # amdsmi_get_gpu_ras_feature_info is not supported in Navi2x, Navi3x + @handle_exceptions + def test_ras_feature_info(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_ras_feature_info \n") + ras_feature = amdsmi.amdsmi_get_gpu_ras_feature_info(processors[i]) + if ras_feature is not None: + print("RAS eeprom version: {}".format(ras_feature['eeprom_version'])) + print("RAS parity schema: {}".format(ras_feature['parity_schema'])) + print("RAS single bit schema: {}".format(ras_feature['single_bit_schema'])) + print("RAS double bit schema: {}".format(ras_feature['double_bit_schema'])) + print("Poisoning supported: {}".format(ras_feature['poison_schema'])) + print() + self.tearDown() + + def test_socket_info(self): + self.setUp() + print("\n\n###Test amdsmi_get_socket_handles") + sockets = amdsmi.amdsmi_get_socket_handles() + for i in range(0, len(sockets)): + print("\n\n###Test Socket {}".format(i)) + print("\n###Test amdsmi_get_socket_info \n") + socket_name = amdsmi.amdsmi_get_socket_info(sockets[i]) + print(" Socket: {}".format(socket_name)) + print() + self.tearDown() + + def test_temperature_metric(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -371,10 +679,12 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print(" Shutdown (emergency) temperature for VRAM is: {}".format( temperature_measure)) print() + self.tearDown() - # Edge temperature is not supported in MI300 + # 
AmdSmiTemperatureType.EDGE is not supported in MI300A, MI300X @handle_exceptions - def test_temperature_edge(self): + def test_temperature_metric_edge(self): + self.setUp() processors = amdsmi.amdsmi_get_processor_handles() self.assertGreaterEqual(len(processors), 1) self.assertLessEqual(len(processors), 32) @@ -383,21 +693,168 @@ class TestAmdSmiPythonInterface(unittest.TestCase): print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) print("\n###Test amdsmi_get_temp_metric \n") temperature_measure = amdsmi.amdsmi_get_temp_metric( - processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CURRENT) # current + processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CURRENT) print(" Current temperature for EDGE is: {}".format( temperature_measure)) temperature_measure = amdsmi.amdsmi_get_temp_metric( - processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CRITICAL) # slowdown/limit + processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.CRITICAL) print(" Limit (critical) temperature for EDGE is: {}".format( temperature_measure)) temperature_measure = amdsmi.amdsmi_get_temp_metric( - processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.EMERGENCY) # shutdown + processors[i], amdsmi.AmdSmiTemperatureType.EDGE, amdsmi.AmdSmiTemperatureMetric.EMERGENCY) print(" Shutdown (emergency) temperature for EDGE is: {}".format( temperature_measure)) print() + self.tearDown() + def test_temperature_metric_plx(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_temp_metric \n") + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], 
amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.CURRENT) + print(" Current temperature for PLX is: {}".format( + temperature_measure)) + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.CRITICAL) + print(" Limit (critical) temperature for PLX is: {}".format( + temperature_measure)) + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], amdsmi.AmdSmiTemperatureType.PLX, amdsmi.AmdSmiTemperatureMetric.EMERGENCY) + print(" Shutdown (emergency) temperature for PLX is: {}".format( + temperature_measure)) + print() + self.tearDown() + + # AmdSmiTemperatureType.HBM_0, HBM_1, HBM_2, HBM_3 are not supported in Navi2x, Navi3x, MI210, MI300A + @handle_exceptions + def test_temperature_metric_hbm(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + temp_types = { + "HBM_0": amdsmi.AmdSmiTemperatureType.HBM_0, + "HBM_1": amdsmi.AmdSmiTemperatureType.HBM_1, + "HBM_2": amdsmi.AmdSmiTemperatureType.HBM_2, + "HBM_3": amdsmi.AmdSmiTemperatureType.HBM_3, + } + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_temp_metric \n") + for temp_type_name, temp_type_code in temp_types.items(): + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], temp_type_code, amdsmi.AmdSmiTemperatureMetric.CURRENT) + print(" Current temperature for {} is: {}".format( + temp_type_name, temperature_measure)) + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], temp_type_code, amdsmi.AmdSmiTemperatureMetric.CRITICAL) + print(" Limit (critical) temperature for {} is: {}".format( + temp_type_name, temperature_measure)) + temperature_measure = amdsmi.amdsmi_get_temp_metric( + processors[i], temp_type_code, 
amdsmi.AmdSmiTemperatureMetric.EMERGENCY) + print(" Shutdown (emergency) temperature for {} is: {}".format( + temp_type_name, temperature_measure)) + print() + self.tearDown() + + def test_utilization_count(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_utilization_count \n") + utilization_counter_types = [ + amdsmi.AmdSmiUtilizationCounterType.COARSE_GRAIN_GFX_ACTIVITY, + amdsmi.AmdSmiUtilizationCounterType.COARSE_GRAIN_MEM_ACTIVITY, + amdsmi.AmdSmiUtilizationCounterType.COARSE_DECODER_ACTIVITY + ] + utilization_count = amdsmi.amdsmi_get_utilization_count( + processors[i], utilization_counter_types) + print(" Timestamp: {}".format( + utilization_count[0]['timestamp'])) + print(" Utilization count for {} is: {}".format( + utilization_count[1]['type'], utilization_count[1]['value'])) + print(" Utilization count for {} is: {}".format( + utilization_count[2]['type'], utilization_count[2]['value'])) + print(" Utilization count for {} is: {}".format( + utilization_count[3]['type'], utilization_count[3]['value'])) + self.assertLessEqual(len(processors), 32) + print() + utilization_counter_types = [ + amdsmi.AmdSmiUtilizationCounterType.FINE_GRAIN_GFX_ACTIVITY, + amdsmi.AmdSmiUtilizationCounterType.FINE_GRAIN_MEM_ACTIVITY, + amdsmi.AmdSmiUtilizationCounterType.FINE_DECODER_ACTIVITY + ] + utilization_count = amdsmi.amdsmi_get_utilization_count( + processors[i], utilization_counter_types) + print(" Timestamp: {}".format( + utilization_count[0]['timestamp'])) + print(" Utilization count for {} is: {}".format( + utilization_count[1]['type'], utilization_count[1]['value'])) + print(" Utilization count for {} is: {}".format( + utilization_count[2]['type'], 
utilization_count[2]['value'])) + print(" Utilization count for {} is: {}".format( + utilization_count[3]['type'], utilization_count[3]['value'])) + print() + self.tearDown() + + def test_vbios_info(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_vbios_info \n") + vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processors[i]) + print(" vbios_info['part_number'] is: {}".format( + vbios_info['part_number'])) + print(" vbios_info['build_date'] is: {}".format( + vbios_info['build_date'])) + print(" vbios_info['name'] is: {}".format( + vbios_info['name'])) + print(" vbios_info['version'] is: {}".format( + vbios_info['version'])) + print() + self.tearDown() + + def test_vendor_name(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_vendor_name \n") + vendor_name = amdsmi.amdsmi_get_gpu_vendor_name(processors[i]) + print(" Vendor name is: {}".format(vendor_name)) + print() + self.tearDown() + + # @unittest.SkipTest def test_walkthrough(self): - walk_through(self) + print("\n\n#######################################################################") + print("========> test_walkthrough start <========\n") + self.test_asic_kfd_info() + self.test_power_info() + self.test_vbios_info() + self.test_board_info() + self.test_fw_info() + self.test_driver_info() + print("\n========> test_walkthrough end <========") + 
print("#######################################################################\n") # Unstable on workstation cards # @handle_exceptions @@ -486,80 +943,5 @@ class TestAmdSmiPythonInterface(unittest.TestCase): # # t3.join() # print("\n========> test_z_gpureset_asicinfo_multithread end <========\n") -def walk_through(self): - print("\n###Test amdsmi_get_processor_handles() \n") - processors = amdsmi.amdsmi_get_processor_handles() - for i in range(0, len(processors)): - print("\n###Test amdsmi_get_gpu_device_bdf() | START walk_through | processor i = " + str(i) + "\n") - bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) - print("###Test Processor {}, bdf: {} ".format(i, bdf)) - print("\n###Test amdsmi_get_gpu_asic_info \n") - asic_info = amdsmi.amdsmi_get_gpu_asic_info(processors[i]) - print(" asic_info['market_name'] is: {}".format( - asic_info['market_name'])) - print(" asic_info['vendor_id'] is: {}".format( - asic_info['vendor_id'])) - print(" asic_info['vendor_name'] is: {}".format( - asic_info['vendor_name'])) - print(" asic_info['device_id'] is: {}".format( - asic_info['device_id'])) - print(" asic_info['rev_id'] is: {}\n".format( - asic_info['rev_id'])) - print(" asic_info['asic_serial'] is: {}\n".format( - asic_info['asic_serial'])) - print(" asic_info['oam_id'] is: {}\n".format( - asic_info['oam_id'])) - print(" asic_info['target_graphics_version'] is: {}\n".format( - asic_info['target_graphics_version'])) - print("\n###Test amdsmi_get_gpu_kfd_info \n") - kfd_info = amdsmi.amdsmi_get_gpu_kfd_info(processors[i]) - print(" kfd_info['kfd_id'] is: {}\n".format( - kfd_info['kfd_id'])) - print(" kfd_info['node_id'] is: {}\n".format( - kfd_info['node_id'])) - print("###Test amdsmi_get_power_cap_info \n") - power_info = amdsmi.amdsmi_get_power_cap_info(processors[i]) - print(" power_info['dpm_cap'] is: {}".format( - power_info['dpm_cap'])) - print(" power_info['power_cap'] is: {}\n".format( - power_info['power_cap'])) - print("###Test amdsmi_get_gpu_vbios_info 
\n") - vbios_info = amdsmi.amdsmi_get_gpu_vbios_info(processors[i]) - print(" vbios_info['part_number'] is: {}".format( - vbios_info['part_number'])) - print(" vbios_info['build_date'] is: {}".format( - vbios_info['build_date'])) - print(" vbios_info['name'] is: {}\n".format( - vbios_info['name'])) - print(" vbios_info['version'] is: {}\n".format( - vbios_info['version'])) - print("###Test amdsmi_get_gpu_board_info \n") - board_info = amdsmi.amdsmi_get_gpu_board_info(processors[i]) - print(" board_info['model_number'] is: {}\n".format( - board_info['model_number'])) - print(" board_info['product_serial'] is: {}\n".format( - board_info['product_serial'])) - print(" board_info['fru_id'] is: {}\n".format( - board_info['fru_id'])) - print(" board_info['manufacturer_name'] is: {}\n".format( - board_info['manufacturer_name'])) - print(" board_info['product_name'] is: {}\n".format( - board_info['product_name'])) - print("###Test amdsmi_get_fw_info \n") - fw_info = amdsmi.amdsmi_get_fw_info(processors[i]) - fw_num = len(fw_info['fw_list']) - self.assertLessEqual(fw_num, len(amdsmi.AmdSmiFwBlock)) - for j in range(0, fw_num): - fw = fw_info['fw_list'][j] - if fw['fw_version'] != 0: - print("FW name: {}".format( - fw['fw_name'].name)) - print("FW version: {}".format( - fw['fw_version'])) - print("\n###Test amdsmi_get_gpu_driver_info \n") - driver_info = amdsmi.amdsmi_get_gpu_driver_info(processors[i]) - print("Driver info: {}".format(driver_info)) - print("\n###Test amdsmi_get_gpu_driver_info() | END walk_through | processor i = " + str(i) + "\n") - if __name__ == '__main__': unittest.main() \ No newline at end of file From f00a03ed2b908af2cbb09726c8667529092591c7 Mon Sep 17 00:00:00 2001 From: Ranjith Ramakrishnan Date: Wed, 25 Sep 2024 15:58:51 -0700 Subject: [PATCH 2/8] Remove package provides field from RPM and DEB package The provides tag is required when the package provides a virtual package. 
Package name along with version will be provided by default and the provides tag is not required for this. Change-Id: I6d42cd1a6e2247e33708a1fa2627897e86099815 --- CMakeLists.txt | 6 ------ 1 file changed, 6 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index b5c692cc7b..97793e0290 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -254,12 +254,9 @@ install( add_subdirectory(goamdsmi_shim) #Debian package specific variables -set(CPACK_DEBIAN_PACKAGE_PROVIDES "amd-smi") set(CPACK_DEBIAN_PACKAGE_RECOMMENDS "python3-argcomplete, libdrm-dev, python3-PyYAML") set(CPACK_DEBIAN_ASAN_PACKAGE_RECOMMENDS ${CPACK_DEBIAN_PACKAGE_RECOMMENDS}) set(CPACK_DEBIAN_DEV_PACKAGE_RECOMMENDS ${CPACK_DEBIAN_PACKAGE_RECOMMENDS}) -set(CPACK_DEBIAN_ASAN_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}-asan") -set(CPACK_DEBIAN_DEV_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}") set(CPACK_DEBIAN_PACKAGE_DEPENDS "sudo, python3 (>= 3.6.8), python3-pip") set(CPACK_DEBIAN_ASAN_PACKAGE_DEPENDS ${CPACK_DEBIAN_PACKAGE_DEPENDS}) set(CPACK_DEBIAN_DEV_PACKAGE_DEPENDS ${CPACK_DEBIAN_PACKAGE_DEPENDS}) @@ -276,9 +273,6 @@ set(CPACK_RPM_EXCLUDE_FROM_AUTO_FILELIST_ADDITION if(CPACK_RPM_PACKAGE_RELEASE) set(CPACK_RPM_PACKAGE_RELEASE_DIST ON) endif() -set(CPACK_RPM_PACKAGE_PROVIDES "amd-smi") -set(CPACK_RPM_DEV_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}") -set(CPACK_RPM_ASAN_PACKAGE_PROVIDES "${AMD_SMI_PACKAGE}-asan") # NOTE: RPM SUGGESTS DO NOT WORK! https://bugzilla.redhat.com/show_bug.cgi?id=1811358 set(CPACK_RPM_PACKAGE_SUGGESTS "python3-argcomplete") set(CPACK_RPM_DEV_PACKAGE_SUGGESTS ${CPACK_RPM_PACKAGE_SUGGESTS}) From 7a557b1c508fc920f3fe2193c377d7d5026c4d74 Mon Sep 17 00:00:00 2001 From: Lang Yu Date: Mon, 26 Aug 2024 05:29:24 -0400 Subject: [PATCH 3/8] SWDEV-463405: Add amdsmi_get_link_topology_nearest support amdsmi_get_link_topology_nearest() is used to retrieve the set of GPUs that are nearest to a given device at a specific interconnectivity level. 
Code changes related to the following: * API * CLI * Unit tests * Examples Header Unification Change: "/amdsmi/+/1122408" Change-Id: Id0317797c652c267742513936d321677793ec634 Signed-off-by: Lang Yu --- CHANGELOG.md | 113 ++++++------ docs/how-to/using-amdsmi-for-python.md | 49 ++++++ example/amd_smi_drm_example.cc | 30 ++++ example/amd_smi_nodrm_example.cc | 29 ++++ include/amd_smi/amdsmi.h | 40 ++++- py-interface/README.md | 48 ++++++ py-interface/__init__.py | 2 + py-interface/amdsmi_interface.py | 31 ++++ py-interface/amdsmi_wrapper.py | 46 +++-- src/amd_smi/amd_smi.cc | 162 ++++++++++++++++++ .../functional/hw_topology_read.cc | 54 ++++++ 11 files changed, 535 insertions(+), 69 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f09568d169..c6a693663c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -13,6 +13,9 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr - On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed. - Removed pytest dependency, our python testing now only depends on the unittest framework. +- **Added retrieving a set of GPUs that are nearest to a given device at a specific link type level**. + - Added `amdsmi_get_link_topology_nearest()` function to amd-smi C and Python Libraries. + - **Added more supported utilization count types to `amdsmi_get_utilization_count()`**. - **Added `amd-smi set -L/--clk-limit ...` command**. @@ -27,7 +30,7 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr - Added `amdsmi_get_gpu_mem_overdrive_level()` function to amd-smi C and Python Libraries. - **Added retrieving connection type and P2P capabilities between two GPUs**. - - Added `amdsmi_topo_get_p2p_status` function to amd-smi C and Python Libraries. + - Added `amdsmi_topo_get_p2p_status()` function to amd-smi C and Python Libraries. - Added retrieving P2P link capabilities to CLI `amd-smi topology`. 
```shell @@ -129,7 +132,7 @@ Legend: ``` - **Created new amdsmi_kfd_info_t and added information under `amd-smi list`**. - - Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers were added in to a new `amdsmi_kfd_info_t` which gets populated via the API `amdsmi_get_gpu_kfd_info`. + - Due to fixes needed to properly enumerate all logical GPUs in CPX, new device identifiers were added in to a new `amdsmi_kfd_info_t` which gets populated via the API `amdsmi_get_gpu_kfd_info()`. - This info has been added to the `amd-smi list`. - These new fields are only available for BM/Guest Linux devices at this time. @@ -402,9 +405,9 @@ Guest VMs can view enabled/disabled ras features that are on Host cards. ### Additions -- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**. +- **`amd-smi dmon` is now available as an alias to `amd-smi monitor`**. -- **Added optional process table under `amd-smi monitor -q`**. +- **Added optional process table under `amd-smi monitor -q`**. The monitor subcommand within the CLI Tool now has the `-q` option to enable an optional process table underneath the original monitored output. ```shell @@ -417,10 +420,10 @@ GPU NAME PID GTT_MEM CPU_MEM VRAM_MEM MEM_USAGE GF 0 rvs 1564865 0.0 B 0.0 B 1.1 GB 0.0 B 0 ns 0 ns ``` -- **Added Handling to detect VMs with passthrough configurations in CLI Tool**. +- **Added Handling to detect VMs with passthrough configurations in CLI Tool**. CLI Tool had only allowed a restricted set of options for Virtual Machines with passthrough GPUs. Now we offer an expanded set of functions availble to passthrough configured GPUs. -- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**. +- **Added Process Isolation and Clear SRAM functionality to the CLI Tool for VMs**. VMs now have the ability to set the process isolation and clear the sram from the CLI tool. 
Using the following commands ```shell @@ -428,10 +431,10 @@ amd-smi set --process-isolation <0 or 1> amd-smi reset --clean_local_data ``` -- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**. +- **Added macros that were in `amdsmi.h` to the amdsmi Python library `amdsmi_interface.py`**. Added macros to reference max size limitations for certain amdsmi functions such as max dpm policies and max fanspeed. -- **Added Ring Hang event**. +- **Added Ring Hang event**. Added `AMDSMI_EVT_NOTIF_RING_HANG` to the possible events in the `amdsmi_evt_notification_type_t` enum. ### Optimizations @@ -443,7 +446,7 @@ $ amd-smi static --asic --gpu 123123 Can not find a device: GPU '123123' Error code: -3 ``` -- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**. +- **Removed elevated permission requirements for `amdsmi_get_gpu_process_list()`**. Previously if a processes with elevated permissions was running amd-smi would required sudo to display all output. Now amd-smi will populate all process data and return N/A for elevated process names instead. However if ran with sudo you will be able to see the name like so: ```shell @@ -478,10 +481,10 @@ GPU: 0 ENC: 0 ns ``` -- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**. +- **Updated naming for `amdsmi_set_gpu_clear_sram_data()` to `amdsmi_clean_gpu_local_data()`**. Changed the naming to be more accurate to what the function was doing. This change also extends to the CLI where we changed the `clear-sram-data` command to `clean_local_data`. -- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**. +- **Updated `amdsmi_clk_info_t` struct in amdsmi.h and amdsmi_interface.py to align with host/guest**. Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locked value. 
New struct will be in the following format: ```shell @@ -495,7 +498,7 @@ Changed cur_clk to clk, changed sleep_clk to clk_deep_sleep, and added clk_locke } amdsmi_clk_info_t; ``` -- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**. +- **Multiple structure updates in amdsmi.h and amdsmi_interface.py to align with host/guest**. Multiple structures used by APIs were changed for alignment unification: - Changed `amdsmi_vram_info_t` `vram_size_mb` field changed to to `vram_size` - Updated `amdsmi_vram_type_t` struct updated to include new enums and added `AMDSMI` prefix @@ -503,7 +506,7 @@ Multiple structures used by APIs were changed for alignment unification: - Added `AMDSMI_PROCESSOR_TYPE` prefix to `processor_type_t` enums - Removed the fields structure definition in favor for an anonymous definition in `amdsmi_bdf_t` -- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**. +- **Added `AMDSMI` prefix in amdsmi.h and amdsmi_interface.py to align with host/guest**. Multiple structures used by APIs were changed for alignment unification. `AMDSMI` prefix was added to the following structures: - Added AMDSMI prefix to `amdsmi_container_types_t` enums - Added AMDSMI prefix to `amdsmi_clk_type_t` enums @@ -513,13 +516,13 @@ Multiple structures used by APIs were changed for alignment unification. `AMDSMI - Added AMDSMI prefix to `amdsmi_temperature_type_t` enums - Added AMDSMI prefix to `amdsmi_fw_block_t` enums -- **Changed dpm_policy references to soc_pstate**. +- **Changed dpm_policy references to soc_pstate**. The file structure referenced to dpm_policy changed to soc_pstate and we have changed the APIs and CLI tool to be inline with the current structure. `amdsmi_get_dpm_policy()` and `amdsmi_set_dpm_policy()` is no longer valid with the new API being `amdsmi_get_soc_pstate()` and `amdsmi_set_soc_pstate()`. 
The CLI tool has been changed from `--policy` to `--soc-pstate` -- **Updated `amdsmi_get_gpu_board_info()` product_name to fallback to pciids**. +- **Updated `amdsmi_get_gpu_board_info()` product_name to fallback to pciids**. Previously on devices without a FRU we would not populate the product name in the `amdsmi_board_info_t` structure, now we will fallback to using the name listed according to the pciids file if available. -- **Updated CLI voltage curve command output**. +- **Updated CLI voltage curve command output**. The output for `amd-smi metric --voltage-curve` now splits the frequency and voltage output by curve point or outputs N/A for each curve point if not applicable ```shell @@ -533,16 +536,16 @@ GPU: 0 POINT_2_VOLTAGE: 1186 mV ``` -- **Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`**. +- **Updated `amdsmi_get_gpu_board_info()` now has larger structure sizes for `amdsmi_board_info_t`**. Updated sizes that work for retreiving relavant board information across AMD's ASIC products. This requires users to update any ABIs using this structure. ### Fixes -- **Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool**. +- **Fixed Leftover Mutex deadlock when running multiple instances of the CLI tool**. When running `amd-smi reset --gpureset --gpu all` and then running an instance of `amd-smi static` (or any other subcommand that access the GPUs) a mutex would lock and not return requiring either a clear of the mutex in /dev/shm or rebooting the machine. -- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**. +- **Fixed multiple processes not being registered in `amd-smi process` with json and csv format**. Multiple process outputs in the CLI tool were not being registered correctly. 
The json output did not handle multiple processes and is now in a new valid json format: ```shell @@ -575,33 +578,33 @@ Multiple process outputs in the CLI tool were not being registered correctly. Th ] ``` -- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**. +- **Removed `throttle-status` from `amd-smi monitor` as it is no longer reliably supported**. Throttle status may work for older ASICs, but will be replaced with PVIOL and TVIOL metrics for future ASIC support. It remains a field in the gpu_metrics API and in `amd-smi metric --power`. -- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**. +- **`amdsmi_get_gpu_board_info()` no longer returns junk char strings**. Previously if there was a partial failure to retrieve character strings, we would return garbage output to users using the API. This fix intends to populate as many values as possible. Then any failure(s) found along the way, `\0` is provided to `amdsmi_board_info_t` structures data members which cannot be populated. Ensuring empty char string values. -- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**. +- **Fixed parsing of `pp_od_clk_voltage` within `amdsmi_get_gpu_od_volt_info`**. The parsing of `pp_od_clk_voltage` was not dynamic enough to work with the dropping of voltage curve support on MI series cards. This propagates down to correcting the CLI's output `amd-smi metric --voltage-curve` to N/A if voltage curve is not enabled. ### Known Issues -- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do no currently work and will be supported in a future release**. +- **`amdsmi_get_gpu_process_isolation` and `amdsmi_clean_gpu_local_data` commands do no currently work and will be supported in a future release**. ## amd_smi_lib for ROCm 6.1.2 ### Additions -- **Added process isolation and clean shader APIs and CLI commands**. 
+- **Added process isolation and clean shader APIs and CLI commands**. Added APIs CLI and APIs to address LeftoverLocals security issues. Allowing clearing the sram data and setting process isolation on a per GPU basis. New APIs: - `amdsmi_get_gpu_process_isolation()` - `amdsmi_set_gpu_process_isolation()` - `amdsmi_set_gpu_clear_sram_data()` -- **Added `MIN_POWER` to output of `amd-smi static --limit`**. +- **Added `MIN_POWER` to output of `amd-smi static --limit`**. This change helps users identify the range to which they can change the power cap of the GPU. The change is added to simplify why a device supports (or does not support) power capping (also known as overdrive). See `amd-smi set -g all --power-cap ` or `amd-smi reset -g all --power-cap`. ```shell @@ -633,7 +636,7 @@ GPU: 1 ### Optimizations -- **Updated `amd-smi monitor --pcie` output**. +- **Updated `amd-smi monitor --pcie` output**. The source for pcie bandwidth monitor output was a legacy file we no longer support and was causing delays within the monitor command. The output is no longer using TX/RX but instantaneous bandwidth from gpu_metrics instead; updated output: ```shell @@ -642,13 +645,13 @@ GPU PCIE_BW 0 26 Mb/s ``` -- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**. +- **`amdsmi_get_power_cap_info` now returns values in uW instead of W**. `amdsmi_get_power_cap_info` will return in uW as originally reflected by driver. Previously `amdsmi_get_power_cap_info` returned W values, this conflicts with our sets and modifies values retrieved from driver. We decided to keep the values returned from driver untouched (in original units, uW). Then in CLI we will convert to watts (as previously done - no changes here). Additionally, driver made updates to min power cap displayed for devices when overdrive is disabled which prompted for this change (in this case min_power_cap and max_power_cap are the same). 
-- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. +- **Updated Python Library return types for amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. Previously calls were returning "No bad pages found." if no pages were found, now it only returns the list type and can be empty. -- **Updated `amd-smi metric --ecc-blocks` output**. +- **Updated `amd-smi metric --ecc-blocks` output**. The ecc blocks argument was outputing blocks without counters available, updated the filtering show blocks that counters are available for: ``` shell @@ -685,12 +688,12 @@ GPU: 0 DEFERRED_COUNT: 0 ``` -- **Removed `amdsmi_get_gpu_process_info` from Python library**. +- **Removed `amdsmi_get_gpu_process_info` from Python library**. amdsmi_get_gpu_process_info was removed from the C library in an earlier build, but the API was still in the Python interface. ### Fixes -- **Fixed `amd-smi metric --power` now provides power output for Navi2x/Navi3x/MI1x**. +- **Fixed `amd-smi metric --power` now provides power output for Navi2x/Navi3x/MI1x**. These systems use an older version of gpu_metrics in amdgpu. This fix only updates what CLI outputs. No change in any of our APIs. @@ -715,10 +718,10 @@ GPU: 1 THROTTLE_STATUS: UNTHROTTLED ``` -- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**. +- **Fixed `amdsmitstReadWrite.TestPowerCapReadWrite` test for Navi3X, Navi2X, MI100**. Updates required `amdsmi_get_power_cap_info` to return in uW as originally reflected by driver. Previously `amdsmi_get_power_cap_info` returned W values, this conflicts with our sets and modifies values retrieved from driver. We decided to keep the values returned from driver untouched (in original units, uW). Then in CLI we will convert to watts (as previously done - no changes here). 
Additionally, driver made updates to min power cap displayed for devices when overdrive is disabled which prompted for this change (in this case min_power_cap and max_power_cap are the same). -- **Fixed Python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. +- **Fixed Python interface call amdsmi_get_gpu_memory_reserved_pages & amdsmi_get_gpu_bad_page_info**. Previously Python interface calls to populated bad pages resulted in a `ValueError: NULL pointer access`. This fixes the bad-pages subcommand CLI subcommand as well. ### Known Issues @@ -729,7 +732,7 @@ Previously Python interface calls to populated bad pages resulted in a `ValueErr ### Changes -- **Updated metrics --clocks**. +- **Updated metrics --clocks**. Output for `amd-smi metric --clock` is updated to reflect each engine and bug fixes for the clock lock status and deep sleep status. ``` shell @@ -840,7 +843,7 @@ GPU: 0 DEEP_SLEEP: ENABLED ``` -- **Added deferred ecc counts**. +- **Added deferred ecc counts**. Added deferred error correctable counts to `amd-smi metric --ecc --ecc-blocks` ```shell @@ -864,7 +867,7 @@ GPU: 0 ... ``` -- **Updated `amd-smi topology --json` to align with host/guest**. +- **Updated `amd-smi topology --json` to align with host/guest**. Topology's `--json` output now is changed to align with output host/guest systems. Additionally, users can select/filter specific topology details as desired (refer to `amd-smi topology -h` for full list). See examples shown below. *Previous format:* @@ -999,7 +1002,7 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json ### Fixes -- **Fix for GPU reset error on non-amdgpu cards**. +- **Fix for GPU reset error on non-amdgpu cards**. Previously our reset could attempting to reset non-amd GPUS- resuting in "Unable to reset non-amd GPU" error. Fix updates CLI to target only AMD ASICs. @@ -1007,10 +1010,10 @@ updates CLI to target only AMD ASICs. 
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Prevously, this would report "UNKNOWN". This fix provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards). -- **Fix for `amd-smi process`**. +- **Fix for `amd-smi process`**. Fixed output results when getting processes running on a device. -- **Improved Error handling for `amd-smi process`**. +- **Improved Error handling for `amd-smi process`**. Fixed Attribute Error when getting process in csv format ### Known issues @@ -1021,7 +1024,7 @@ Fixed Attribute Error when getting process in csv format ### Additions -- **Added Monitor Command**. +- **Added Monitor Command**. Provides users the ability to customize GPU metrics to capture, collect, and observe. Output is provided in a table view. This aligns closer to ROCm SMI `rocm-smi` (no argument), additionally allows uers to customize what data is helpful for their use-case. ```shell @@ -1081,7 +1084,7 @@ GPU POWER GPU_TEMP MEM_TEMP GFX_UTIL GFX_CLOCK MEM_UTIL MEM_CLOCK VRAM_U 7 175 W 34 °C 32 °C 0 % 113 MHz 0 % 900 MHz 283 MB 196300 MB ``` -- **Integrated ESMI Tool**. +- **Integrated ESMI Tool**. Users can get CPU metrics and telemetry through our API and CLI tools. This information can be seen in `amd-smi static` and `amd-smi metric` commands. Only available for limited target processors. As of ROCm 6.0.2, this is listed as: - AMD Zen3 based CPU Family 19h Models 0h-Fh and 30h-3Fh - AMD Zen4 based CPU Family 19h Models 10h-1Fh and A0-AFh @@ -1231,7 +1234,7 @@ CPU: 0 RESPONSE: N/A ``` -- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**. +- **Added support for new metrics: VCN, JPEG engines, and PCIe errors**. Using the AMD SMI tool, users can retreive VCN, JPEG engines, and PCIe errors by calling `amd-smi metric -P` or `amd-smi metric --usage`. 
Depending on device support, `VCN_ACTIVITY` will update for MI3x ASICs (with 4 separate VCN engine activities) for older asics `MM_ACTIVITY` with UVD/VCN engine activity (average of all engines). `JPEG_ACTIVITY` is a new field for MI3x ASICs, where device can support up to 32 JPEG engine activities. See our documentation for more in-depth understanding of these new fields. ```shell @@ -1264,7 +1267,7 @@ GPU: 0 ``` -- **Added AMDSMI Tool Version**. +- **Added AMDSMI Tool Version**. AMD SMI will report ***three versions***: AMDSMI Tool, AMDSMI Library version, and ROCm version. The AMDSMI Tool version is the CLI/tool version number with commit ID appended after `+` sign. The AMDSMI Library version is the library package version number. @@ -1275,7 +1278,7 @@ $ amd-smi version AMDSMI Tool: 23.4.2+505b858 | AMDSMI Library version: 24.2.0.0 | ROCm version: 6.1.0 ``` -- **Added XGMI table**. +- **Added XGMI table**. Displays XGMI information for AMD GPU devices in a table format. Only available on supported ASICs (eg. MI300). Here users can view read/write data XGMI or PCIe accumulated data transfer size (in KiloBytes). ```shell @@ -1309,7 +1312,7 @@ GPU7 0000:df:00.0 32 Gb/s 512 Gb/s XGMI ``` -- **Added units of measure to JSON output**. +- **Added units of measure to JSON output**. We added unit of measure to JSON/CSV `amd-smi metric`, `amd-smi static`, and `amd-smi monitor` commands. Ex. @@ -1345,7 +1348,7 @@ amd-smi metric -p --json ### Changes -- **Topology is now left-aligned with BDF of each device listed individual table's row/coloumns**. +- **Topology is now left-aligned with BDF of each device listed individual table's row/coloumns**. We provided each device's BDF for every table's row/columns, then left aligned data. We want AMD SMI Tool output to be easy to understand and digest for our users. Having users scroll up to find this information made it difficult to follow, especially for devices which have many devices associated with one ASIC. 
```shell @@ -1408,9 +1411,9 @@ NUMA BW TABLE: ### Fixes -- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**. +- **Fix for Navi3X/Navi2X/MI100 `amdsmi_get_gpu_pci_bandwidth()` in frequencies_read tests**. Devices which do not report (eg. Navi3X/Navi2X/MI100) we have added checks to confirm these devices return AMDSMI_STATUS_NOT_SUPPORTED. Otherwise, tests now display a return string. -- **Fix for devices which have an older pyyaml installed**. +- **Fix for devices which have an older pyyaml installed**. Platforms which are identified as having an older pyyaml version or pip, we no manually update both pip and pyyaml as needed. This corrects issues identified below. Fix impacts the following CLI commands: - `amd-smi list` - `amd-smi static` @@ -1422,20 +1425,20 @@ Platforms which are identified as having an older pyyaml version or pip, we no m TypeError: dump_all() got an unexpected keyword argument 'sort_keys' ``` -- **Fix for crash when user is not a member of video/render groups**. +- **Fix for crash when user is not a member of video/render groups**. AMD SMI now uses same mutex handler for devices as rocm-smi. This helps avoid crashes when DRM/device data is inaccessable to the logged in user. ## amd_smi_lib for ROCm 6.0.0 ### Additions -- **Integrated the E-SMI (EPYC-SMI) library**. +- **Integrated the E-SMI (EPYC-SMI) library**. You can now query CPU-related information directly through AMD SMI. Metrics include power, energy, performance, and other system details. -- **Added support for gfx942 metrics**. +- **Added support for gfx942 metrics**. You can now query MI300 device metrics to get real-time information. Metrics include power, temperature, energy, and performance. -- **Compute and memory partition support**. +- **Compute and memory partition support**. Users can now view, set, and reset partitions. The topology display can provide a more in-depth look at the device's current configuration. 
### Optimizations @@ -1444,13 +1447,13 @@ Users can now view, set, and reset partitions. The topology display can provide ### Changes -- **GPU index sorting made consistent with other tools**. +- **GPU index sorting made consistent with other tools**. To ensure alignment with other ROCm software tools, GPU index sorting is optimized to use Bus:Device.Function (BDF) rather than the card number. -- **Topology output is now aligned with GPU BDF table**. +- **Topology output is now aligned with GPU BDF table**. Earlier versions of the topology output were difficult to read since each GPU was displayed linearly. Now the information is displayed as a table by each GPU's BDF, which more closely resembles rocm-smi output. ### Fixes -- **Fix for driver not initialized**. +- **Fix for driver not initialized**. If the driver module is not loaded, the user receives an error response indicating that the amdgpu module is not loaded. diff --git a/docs/how-to/using-amdsmi-for-python.md b/docs/how-to/using-amdsmi-for-python.md index 40edc84f8c..26deb2109f 100644 --- a/docs/how-to/using-amdsmi-for-python.md +++ b/docs/how-to/using-amdsmi-for-python.md @@ -3867,6 +3867,55 @@ except AmdSmiException as e: print(e) ``` +### amdsmi_get_link_topology_nearest + +Description: Retrieve the set of GPUs that are nearest to a given device + at a specific interconnectivity level. + +Input parameters: +* `processor_handle` The identifier of the given device. +* `link_type` The AmdSmiLinkType level to search for nearest devices + +Output: Dictionary holding the following fields.
+* `count` number of nearest devices found based on given topology level +* `processor_list` list of all nearest device handlers found + + +Exceptions that can be thrown by `amdsmi_get_link_topology_nearest` function: + +* `AmdSmiLibraryException` + +Example: + +```python +try: + amdsmi_init() + + devices = amdsmi_get_processor_handles() + if len(devices) == 0: + print("No GPUs found on machine") + exit() + else: + print(amdsmi_get_gpu_device_uuid(devices[0])) + + nearest_gpus = amdsmi_topology_nearest_t() + nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType.AMDSMI_LINK_TYPE_PCIE) + if (nearest_gpus['count']) == 0: + print("No nearest GPUs found on machine") + else: + print("Nearest GPUs") + for gpu in nearest_gpus['processor_list']: + print(amdsmi_get_gpu_device_uuid(gpu)) + +except AmdSmiException as e: + print(e) +finally: + try: + amdsmi_shut_down() + except AmdSmiException as e: + print(e) +``` + ## CPU APIs ### amdsmi_get_processor_info diff --git a/example/amd_smi_drm_example.cc b/example/amd_smi_drm_example.cc index 8be267e6f6..0887b1f0e6 100644 --- a/example/amd_smi_drm_example.cc +++ b/example/amd_smi_drm_example.cc @@ -852,6 +852,36 @@ int main() { std::cout << "\n"; std::cout << "+=======+==================+============+==============" << "+=============+=============+=============+============+\n"; + + // Get nearest GPUs + char *topology_link_type_str[] = { + "AMDSMI_LINK_TYPE_INTERNAL", + "AMDSMI_LINK_TYPE_XGMI", + "AMDSMI_LINK_TYPE_PCIE", + "AMDSMI_LINK_TYPE_NOT_APPLICABLE", + "AMDSMI_LINK_TYPE_UNKNOWN", + }; + printf("\tOutput of amdsmi_get_link_topology_nearest:\n"); + for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) { + auto topology_nearest_info = amdsmi_topology_nearest_t(); + ret = amdsmi_get_link_topology_nearest(processor_handles[j], + static_cast<amdsmi_link_type_t>(topo_link_type), + nullptr); + CHK_AMDSMI_RET(ret); + + ret = amdsmi_get_link_topology_nearest(processor_handles[j], +
static_cast<amdsmi_link_type_t>(topo_link_type), + &topology_nearest_info); + CHK_AMDSMI_RET(ret); + printf("\tNearest GPUs found at %s\n", topology_link_type_str[topo_link_type]); + for (uint32_t k = 0; k < topology_nearest_info.count; k++) { + amdsmi_bdf_t bdf = {}; + ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf); + CHK_AMDSMI_RET(ret) + printf("\t\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number, + bdf.bus_number, bdf.device_number, bdf.function_number); + } + } } } diff --git a/example/amd_smi_nodrm_example.cc b/example/amd_smi_nodrm_example.cc index bcfca83681..cd35ce1990 100644 --- a/example/amd_smi_nodrm_example.cc +++ b/example/amd_smi_nodrm_example.cc @@ -344,6 +344,35 @@ int main() { <<"," << policy.policies[x].policy_description << ")\n"; } } + + // Get nearest GPUs + char *topology_link_type_str[] = { + "AMDSMI_LINK_TYPE_INTERNAL", + "AMDSMI_LINK_TYPE_XGMI", + "AMDSMI_LINK_TYPE_PCIE", + "AMDSMI_LINK_TYPE_NOT_APPLICABLE", + "AMDSMI_LINK_TYPE_UNKNOWN", + }; + printf("\tOutput of amdsmi_get_link_topology_nearest:\n"); + for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) { + auto topology_nearest_info = amdsmi_topology_nearest_t(); + ret = amdsmi_get_link_topology_nearest(processor_handles[j], + static_cast<amdsmi_link_type_t>(topo_link_type), + nullptr); + CHK_AMDSMI_RET(ret); + ret = amdsmi_get_link_topology_nearest(processor_handles[j], + static_cast<amdsmi_link_type_t>(topo_link_type), + &topology_nearest_info); + CHK_AMDSMI_RET(ret); + printf("\tNearest GPUs found at %s\n", topology_link_type_str[topo_link_type]); + for (uint32_t k = 0; k < topology_nearest_info.count; k++) { + amdsmi_bdf_t bdf = {}; + ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf); + CHK_AMDSMI_RET(ret) + printf("\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number, + bdf.bus_number, bdf.device_number, bdf.function_number); + } + } } } diff --git a/include/amd_smi/amdsmi.h b/include/amd_smi/amdsmi.h index
ab91abf82b..7b76235252 100644 --- a/include/amd_smi/amdsmi.h +++ b/include/amd_smi/amdsmi.h @@ -651,8 +651,9 @@ typedef struct { } amdsmi_accelerator_partition_profile_t; typedef enum { - AMDSMI_LINK_TYPE_PCIE, + AMDSMI_LINK_TYPE_INTERNAL, AMDSMI_LINK_TYPE_XGMI, + AMDSMI_LINK_TYPE_PCIE, AMDSMI_LINK_TYPE_NOT_APPLICABLE, AMDSMI_LINK_TYPE_UNKNOWN } amdsmi_link_type_t; @@ -1585,6 +1586,14 @@ typedef struct { uint32_t cu_occupancy; //!< Compute Unit usage in percent } amdsmi_process_info_t; + +typedef struct { + uint32_t count; + amdsmi_processor_handle processor_list[AMDSMI_MAX_DEVICES]; + uint64_t reserved[15]; +} amdsmi_topology_nearest_t; + + //! Place-holder "variant" for functions that have don't have any variants, //! but do have monitors or sensors. #define AMDSMI_DEFAULT_VARIANT 0xFFFFFFFFFFFFFFFF @@ -5097,6 +5106,35 @@ amdsmi_get_gpu_total_ecc_count(amdsmi_processor_handle processor_handle, amdsmi_ /** @} End eccinfo */ +/** + * @brief Retrieve the set of GPUs that are nearest to a given device + * at a specific interconnectivity level. + * + * @platform{gpu_bm_linux} @platform{host} + * + * @details Once called topology_nearest_info will get populated with a list of + * all nearest devices for a given link_type. The list has a count of + * the number of devices found and their respective handles/identifiers. + * + * @param[in] processor_handle The identifier of the given device. + * + * @param[in] link_type The amdsmi_link_type_t level to search for nearest GPUs. + * + * @param[in,out] topology_nearest_info + * .count; + * - When zero, is set to the number of matching GPUs such that .device_list can + * be malloc'd. + * - When non-zero, .device_list will be filled with count number of processor_handle. + * + * @param[out] .device_list An array of processor_handle for GPUs found at level. + * + * @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail. 
+ */ +amdsmi_status_t +amdsmi_get_link_topology_nearest(amdsmi_processor_handle processor_handle, + amdsmi_link_type_t link_type, + amdsmi_topology_nearest_t* topology_nearest_info); + #ifdef ENABLE_ESMI_LIB /*****************************************************************************/ diff --git a/py-interface/README.md b/py-interface/README.md index dc4001403e..922462ab4f 100644 --- a/py-interface/README.md +++ b/py-interface/README.md @@ -3906,6 +3906,54 @@ except AmdSmiException as e: print(e) ``` +### amdsmi_get_link_topology_nearest + +Description: Retrieve the set of GPUs that are nearest to a given device + at a specific interconnectivity level. + +Input parameters: +* `processor_handle` The identifier of the given device. +* `link_type` The AmdSmiLinkType level to search for nearest devices + +Output: Dictionary holding the following fields. +* `count` number of nearest devices found based on given topology level +* `processor_list` list of all nearest device handlers found + + +Exceptions that can be thrown by `amdsmi_get_link_topology_nearest` function: + +* `AmdSmiLibraryException` + +Example: + +```python +try: + amdsmi_init() + + devices = amdsmi_get_processor_handles() + if len(devices) == 0: + print("No GPUs found on machine") + exit() + else: + print(amdsmi_get_gpu_device_uuid(devices[0])) + + nearest_gpus = amdsmi_get_link_topology_nearest(devices[0], AmdSmiLinkType.AMDSMI_LINK_TYPE_PCIE) + if (nearest_gpus['count']) == 0: + print("No nearest GPUs found on machine") + else: + print("Nearest GPUs") + for gpu in nearest_gpus['processor_list']: + print(amdsmi_get_gpu_device_uuid(gpu)) + +except AmdSmiException as e: + print(e) +finally: + try: + amdsmi_shut_down() + except AmdSmiException as e: + print(e) +``` + ## CPU APIs ### amdsmi_get_processor_info diff --git a/py-interface/__init__.py b/py-interface/__init__.py index e0ffcd2c28..e731120eda 100644 --- a/py-interface/__init__.py +++ b/py-interface/__init__.py @@ -216,6 +216,7 @@ from 
.amdsmi_interface import amdsmi_topo_get_link_type from .amdsmi_interface import amdsmi_topo_get_p2p_status from .amdsmi_interface import amdsmi_is_P2P_accessible from .amdsmi_interface import amdsmi_get_xgmi_info +from .amdsmi_interface import amdsmi_get_link_topology_nearest # # Partition Functions from .amdsmi_interface import amdsmi_get_gpu_compute_partition @@ -255,6 +256,7 @@ from .amdsmi_interface import AmdSmiFreqInd from .amdsmi_interface import AmdSmiXgmiStatus from .amdsmi_interface import AmdSmiMemoryPageStatus from .amdsmi_interface import AmdSmiIoLinkType +from .amdsmi_interface import AmdSmiLinkType from .amdsmi_interface import AmdSmiUtilizationCounterType from .amdsmi_interface import AmdSmiProcessorType diff --git a/py-interface/amdsmi_interface.py b/py-interface/amdsmi_interface.py index cae4a2d1b7..000c22b853 100644 --- a/py-interface/amdsmi_interface.py +++ b/py-interface/amdsmi_interface.py @@ -383,6 +383,14 @@ class AmdSmiIoLinkType(IntEnum): SIZE = amdsmi_wrapper.AMDSMI_IOLINK_TYPE_SIZE +class AmdSmiLinkType(IntEnum): + AMDSMI_LINK_TYPE_INTERNAL = amdsmi_wrapper.AMDSMI_LINK_TYPE_INTERNAL + AMDSMI_LINK_TYPE_XGMI = amdsmi_wrapper.AMDSMI_LINK_TYPE_XGMI + AMDSMI_LINK_TYPE_PCIE = amdsmi_wrapper.AMDSMI_LINK_TYPE_PCIE + AMDSMI_LINK_TYPE_NOT_APPLICABLE = amdsmi_wrapper.AMDSMI_LINK_TYPE_NOT_APPLICABLE + AMDSMI_LINK_TYPE_UNKNOWN = amdsmi_wrapper.AMDSMI_LINK_TYPE_UNKNOWN + + class AmdSmiUtilizationCounterType(IntEnum): COARSE_GRAIN_GFX_ACTIVITY = amdsmi_wrapper.AMDSMI_COARSE_GRAIN_GFX_ACTIVITY COARSE_GRAIN_MEM_ACTIVITY = amdsmi_wrapper.AMDSMI_COARSE_GRAIN_MEM_ACTIVITY @@ -4174,3 +4182,26 @@ def amdsmi_get_gpu_metrics_header_info( "format_revision": header_info.format_revision, "content_revision": header_info.content_revision } + +def amdsmi_get_link_topology_nearest( + processor_handle: amdsmi_wrapper.amdsmi_processor_handle, + link_type: AmdSmiLinkType, + )-> Dict[str, Any]: + + topology_nearest_list = amdsmi_wrapper.amdsmi_topology_nearest_t() + 
_check_res( + amdsmi_wrapper.amdsmi_get_link_topology_nearest( + processor_handle, + link_type, + ctypes.byref(topology_nearest_list) + ) + ) + + device_list = [] + for index in range(topology_nearest_list.count): + device_list.append(topology_nearest_list.processor_list[index]) + + return { + 'count': topology_nearest_list.count, + 'processor_list': device_list + } diff --git a/py-interface/amdsmi_wrapper.py b/py-interface/amdsmi_wrapper.py index f7acf2026e..d8a169fca1 100644 --- a/py-interface/amdsmi_wrapper.py +++ b/py-interface/amdsmi_wrapper.py @@ -978,15 +978,17 @@ amdsmi_accelerator_partition_profile_t = struct_amdsmi_accelerator_partition_pro # values for enumeration 'amdsmi_link_type_t' amdsmi_link_type_t__enumvalues = { - 0: 'AMDSMI_LINK_TYPE_PCIE', + 0: 'AMDSMI_LINK_TYPE_INTERNAL', 1: 'AMDSMI_LINK_TYPE_XGMI', - 2: 'AMDSMI_LINK_TYPE_NOT_APPLICABLE', - 3: 'AMDSMI_LINK_TYPE_UNKNOWN', + 2: 'AMDSMI_LINK_TYPE_PCIE', + 3: 'AMDSMI_LINK_TYPE_NOT_APPLICABLE', + 4: 'AMDSMI_LINK_TYPE_UNKNOWN', } -AMDSMI_LINK_TYPE_PCIE = 0 +AMDSMI_LINK_TYPE_INTERNAL = 0 AMDSMI_LINK_TYPE_XGMI = 1 -AMDSMI_LINK_TYPE_NOT_APPLICABLE = 2 -AMDSMI_LINK_TYPE_UNKNOWN = 3 +AMDSMI_LINK_TYPE_PCIE = 2 +AMDSMI_LINK_TYPE_NOT_APPLICABLE = 3 +AMDSMI_LINK_TYPE_UNKNOWN = 4 amdsmi_link_type_t = ctypes.c_uint32 # enum class struct_amdsmi_link_metrics_t(Structure): pass @@ -1842,6 +1844,19 @@ struct_amdsmi_process_info_t._fields_ = [ ] amdsmi_process_info_t = struct_amdsmi_process_info_t +class struct_amdsmi_topology_nearest_t(Structure): + pass + +struct_amdsmi_topology_nearest_t._pack_ = 1 # source:False +struct_amdsmi_topology_nearest_t._fields_ = [ + ('count', ctypes.c_uint32), + ('PADDING_0', ctypes.c_ubyte * 4), + ('processor_list', ctypes.POINTER(None) * 32), + ('reserved', ctypes.c_uint32 * 15), + ('PADDING_1', ctypes.c_ubyte * 4), +] + +amdsmi_topology_nearest_t = struct_amdsmi_topology_nearest_t class struct_amdsmi_smu_fw_version_t(Structure): pass @@ -2376,6 +2391,9 @@ 
amdsmi_get_gpu_process_list.argtypes = [amdsmi_processor_handle, ctypes.POINTER( amdsmi_get_gpu_total_ecc_count = _libraries['libamd_smi.so'].amdsmi_get_gpu_total_ecc_count amdsmi_get_gpu_total_ecc_count.restype = amdsmi_status_t amdsmi_get_gpu_total_ecc_count.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_error_count_t)] +amdsmi_get_link_topology_nearest = _libraries['libamd_smi.so'].amdsmi_get_link_topology_nearest +amdsmi_get_link_topology_nearest.restype = amdsmi_status_t +amdsmi_get_link_topology_nearest.argtypes = [amdsmi_processor_handle, amdsmi_link_type_t, ctypes.POINTER(struct_amdsmi_topology_nearest_t)] amdsmi_get_cpu_core_energy = _libraries['libamd_smi.so'].amdsmi_get_cpu_core_energy amdsmi_get_cpu_core_energy.restype = amdsmi_status_t amdsmi_get_cpu_core_energy.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint64)] @@ -2616,11 +2634,11 @@ __all__ = \ 'AMDSMI_IOLINK_TYPE_NUMIOLINKTYPES', 'AMDSMI_IOLINK_TYPE_PCIEXPRESS', 'AMDSMI_IOLINK_TYPE_SIZE', 'AMDSMI_IOLINK_TYPE_UNDEFINED', 'AMDSMI_IOLINK_TYPE_XGMI', - 'AMDSMI_LINK_TYPE_NOT_APPLICABLE', 'AMDSMI_LINK_TYPE_PCIE', - 'AMDSMI_LINK_TYPE_UNKNOWN', 'AMDSMI_LINK_TYPE_XGMI', - 'AMDSMI_MEMORY_PARTITION_NPS1', 'AMDSMI_MEMORY_PARTITION_NPS2', - 'AMDSMI_MEMORY_PARTITION_NPS4', 'AMDSMI_MEMORY_PARTITION_NPS8', - 'AMDSMI_MEMORY_PARTITION_UNKNOWN', + 'AMDSMI_LINK_TYPE_INTERNAL', 'AMDSMI_LINK_TYPE_NOT_APPLICABLE', + 'AMDSMI_LINK_TYPE_PCIE', 'AMDSMI_LINK_TYPE_UNKNOWN', + 'AMDSMI_LINK_TYPE_XGMI', 'AMDSMI_MEMORY_PARTITION_NPS1', + 'AMDSMI_MEMORY_PARTITION_NPS2', 'AMDSMI_MEMORY_PARTITION_NPS4', + 'AMDSMI_MEMORY_PARTITION_NPS8', 'AMDSMI_MEMORY_PARTITION_UNKNOWN', 'AMDSMI_MEM_PAGE_STATUS_PENDING', 'AMDSMI_MEM_PAGE_STATUS_RESERVED', 'AMDSMI_MEM_PAGE_STATUS_UNRESERVABLE', 'AMDSMI_MEM_TYPE_FIRST', @@ -2798,7 +2816,7 @@ __all__ = \ 'amdsmi_get_gpu_vram_info', 'amdsmi_get_gpu_vram_usage', 'amdsmi_get_gpu_vram_vendor', 'amdsmi_get_hsmp_metrics_table', 'amdsmi_get_hsmp_metrics_table_version', 
'amdsmi_get_lib_version', - 'amdsmi_get_link_metrics', + 'amdsmi_get_link_metrics', 'amdsmi_get_link_topology_nearest', 'amdsmi_get_minmax_bandwidth_between_processors', 'amdsmi_get_pcie_info', 'amdsmi_get_power_cap_info', 'amdsmi_get_power_info', @@ -2862,7 +2880,8 @@ __all__ = \ 'amdsmi_temp_range_refresh_rate_t', 'amdsmi_temperature_metric_t', 'amdsmi_temperature_type_t', 'amdsmi_topo_get_link_type', 'amdsmi_topo_get_link_weight', 'amdsmi_topo_get_numa_node_number', - 'amdsmi_topo_get_p2p_status', 'amdsmi_utilization_counter_t', + 'amdsmi_topo_get_p2p_status', 'amdsmi_topology_nearest_t', + 'amdsmi_utilization_counter_t', 'amdsmi_utilization_counter_type_t', 'amdsmi_vbios_info_t', 'amdsmi_version_t', 'amdsmi_voltage_metric_t', 'amdsmi_voltage_type_t', 'amdsmi_vram_info_t', @@ -2896,6 +2915,7 @@ __all__ = \ 'struct_amdsmi_retired_page_record_t', 'struct_amdsmi_smu_fw_version_t', 'struct_amdsmi_temp_range_refresh_rate_t', + 'struct_amdsmi_topology_nearest_t', 'struct_amdsmi_utilization_counter_t', 'struct_amdsmi_vbios_info_t', 'struct_amdsmi_version_t', 'struct_amdsmi_vram_info_t', 'struct_amdsmi_vram_usage_t', diff --git a/src/amd_smi/amd_smi.cc b/src/amd_smi/amd_smi.cc index 7b7eda3a2d..7b8a4fb032 100644 --- a/src/amd_smi/amd_smi.cc +++ b/src/amd_smi/amd_smi.cc @@ -51,11 +51,13 @@ #include #include #include +#include #include #include #include #include #include +#include #include #include "amd_smi/amdsmi.h" #include "amd_smi/impl/fdinfo.h" @@ -2321,6 +2323,166 @@ amdsmi_status_t amdsmi_get_processor_handle_from_bdf(amdsmi_bdf_t bdf, return AMDSMI_STATUS_API_FAILED; } +amdsmi_status_t +amdsmi_get_link_topology_nearest(amdsmi_processor_handle processor_handle, + amdsmi_link_type_t link_type, + amdsmi_topology_nearest_t* topology_nearest_info) +{ + if (topology_nearest_info == nullptr) { + return amdsmi_status_t::AMDSMI_STATUS_INVAL; + } + + if (link_type < amdsmi_link_type_t::AMDSMI_LINK_TYPE_INTERNAL || + link_type > amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN) 
{ + return amdsmi_status_t::AMDSMI_STATUS_INVAL; + } + + + auto status(amdsmi_status_t::AMDSMI_STATUS_SUCCESS); + constexpr auto kKFD_CRAT_INTRA_SOCKET_WEIGHT = uint32_t(13); + constexpr auto kKFD_CRAT_XGMI_WEIGHT = uint32_t(15); + + /* + * Note: This will need to be eventually consolidated within a unique link type. + */ + static const std::map<amdsmi_link_type_t, amdsmi_io_link_type_t> kLinkToIoLinkTypeTranslationTable = + { + {amdsmi_link_type_t::AMDSMI_LINK_TYPE_INTERNAL, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED}, + {amdsmi_link_type_t::AMDSMI_LINK_TYPE_XGMI, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_XGMI}, + {amdsmi_link_type_t::AMDSMI_LINK_TYPE_PCIE, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_PCIEXPRESS}, + {amdsmi_link_type_t::AMDSMI_LINK_TYPE_NOT_APPLICABLE, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED}, + {amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN, amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED} + }; + + auto translated_link_type = [&](amdsmi_link_type_t link_type) { + auto io_link_type(amdsmi_io_link_type_t::AMDSMI_IOLINK_TYPE_UNDEFINED); + if (kLinkToIoLinkTypeTranslationTable.find(link_type) != kLinkToIoLinkTypeTranslationTable.end()) { + io_link_type = kLinkToIoLinkTypeTranslationTable.at(link_type); + } + return io_link_type; + }; + + auto translated_io_link_type = [&](amdsmi_io_link_type_t io_link_type) { + auto link_type(amdsmi_link_type_t::AMDSMI_LINK_TYPE_UNKNOWN); + for (const auto& [key, value] : kLinkToIoLinkTypeTranslationTable) { + if (value == io_link_type) { + link_type = key; + break; + } + } + return link_type; + }; + // + + struct LinkTopolyInfo_t + { + amdsmi_processor_handle target_processor_handle; + amdsmi_link_type_t link_type; + bool is_accessible; + uint64_t num_hops; + uint64_t link_weight; + }; + + using LinkTopogyOrderPair_t = std::pair; + /* + * Note: The link topology table is sorted by the number of hops and link weight.
+ */ + struct LinkTopogyOrderCmp_t { + constexpr bool operator()(const LinkTopolyInfo_t& left, + const LinkTopolyInfo_t& right) const noexcept + { + if (left.num_hops == right.num_hops) { + return (left.link_weight > right.link_weight); + } + else { + return (left.num_hops > right.num_hops); + } + } + }; + std::priority_queue<LinkTopolyInfo_t, + std::vector<LinkTopolyInfo_t>, + LinkTopogyOrderCmp_t> link_topology_order{}; + // + + + AMDSMI_CHECK_INIT(); + auto socket_counter = uint32_t(0); + if (auto api_status = amdsmi_get_socket_handles(&socket_counter, nullptr); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) { + return api_status; + } + + amdsmi_socket_handle socket_list[socket_counter]; + if (auto api_status = amdsmi_get_socket_handles(&socket_counter, &socket_list[0]); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) { + return api_status; + } + + + uint32_t device_counter(AMDSMI_MAX_DEVICES); + amdsmi_processor_handle device_list[AMDSMI_MAX_DEVICES]; + for (auto socket_idx = uint32_t(0); socket_idx < socket_counter; ++socket_idx) { + if (auto api_status = amdsmi_get_processor_handles(socket_list[socket_idx], &device_counter, device_list); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) { + return api_status; + } + + for (auto device_idx = uint32_t(0); device_idx < device_counter; ++device_idx) { + /* Note: Skip the processor handle that is being queried. */ + if (processor_handle != device_list[device_idx]) { + // Accessibility? + auto is_accessible(false); + if (auto api_status = amdsmi_is_P2P_accessible(processor_handle, device_list[device_idx], &is_accessible); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) || !is_accessible) { + continue; + } + + // Link type matches what we are searching for?
+ auto io_link_type = translated_link_type(link_type); + auto io_link_type_bck(io_link_type); + auto num_hops = uint64_t(0); + if (auto api_status = amdsmi_topo_get_link_type(processor_handle, device_list[device_idx], &num_hops, &io_link_type); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS) || (translated_io_link_type(io_link_type) != link_type)) { + continue; + } + + // Link weights + auto link_weight = uint64_t(0); + if (auto api_status = amdsmi_topo_get_link_weight(processor_handle, device_list[device_idx], &link_weight); + (api_status != amdsmi_status_t::AMDSMI_STATUS_SUCCESS)) { + continue; + } + + // Topology nearest info + LinkTopolyInfo_t link_info = { + .target_processor_handle = device_list[device_idx], + .link_type = translated_io_link_type(io_link_type), + .is_accessible = is_accessible, + .num_hops = num_hops, + .link_weight = link_weight + }; + link_topology_order.push(link_info); + } + } + } + + /* + * Note: The link topology table is sorted by the number of hops and link weight. 
+ */ + for (auto& nearest_handle : topology_nearest_info->processor_list) { nearest_handle = nullptr; } + topology_nearest_info->count = link_topology_order.size(); + auto topology_nearest_counter = uint32_t(0); + while (!link_topology_order.empty()) { + auto link_info = link_topology_order.top(); + link_topology_order.pop(); + + if (topology_nearest_counter < AMDSMI_MAX_DEVICES) { + topology_nearest_info->processor_list[topology_nearest_counter++] = link_info.target_processor_handle; + } + } + + return status; +} #ifdef ENABLE_ESMI_LIB static amdsmi_status_t amdsmi_errno_to_esmi_status(amdsmi_status_t status) diff --git a/tests/amd_smi_test/functional/hw_topology_read.cc b/tests/amd_smi_test/functional/hw_topology_read.cc index 56860aa7c0..282d6e07df 100644 --- a/tests/amd_smi_test/functional/hw_topology_read.cc +++ b/tests/amd_smi_test/functional/hw_topology_read.cc @@ -456,4 +456,58 @@ void TestHWTopologyRead::Run(void) { std::cout << std::endl; } std::cout << std::endl; + + char *topology_link_type_str[] = { + "AMDSMI_LINK_TYPE_INTERNAL", + "AMDSMI_LINK_TYPE_XGMI", + "AMDSMI_LINK_TYPE_PCIE", + "AMDSMI_LINK_TYPE_NOT_APPLICABLE", + "AMDSMI_LINK_TYPE_UNKNOWN", + }; + + auto ret(amdsmi_status_t::AMDSMI_STATUS_SUCCESS); + for (uint32_t dv_ind_src = 0; dv_ind_src < num_devices; dv_ind_src++) { + std::cout <<"** Nearest GPUs for GPU" << dv_ind_src << " **" << "\n"; + for (uint32_t topo_link_type = AMDSMI_LINK_TYPE_INTERNAL; topo_link_type <= AMDSMI_LINK_TYPE_UNKNOWN; topo_link_type++) { + + + /* + * Note: We should get AMDSMI_STATUS_INVAL for the first call with amdsmi_topology_nearest_t = nullptr + */ + ret = amdsmi_get_link_topology_nearest(processor_handles_[dv_ind_src], + static_cast<amdsmi_link_type_t>(topo_link_type), + nullptr); + ASSERT_EQ(ret, amdsmi_status_t::AMDSMI_STATUS_INVAL); + + + /* + * + */ + auto topology_nearest_info = amdsmi_topology_nearest_t(); + ret = amdsmi_get_link_topology_nearest(processor_handles_[dv_ind_src], + static_cast<amdsmi_link_type_t>(topo_link_type), + &topology_nearest_info); + if (ret !=
amdsmi_status_t::AMDSMI_STATUS_SUCCESS) { + continue; + } + + std::cout <<"Nearest GPUs found for Link Type: " << topology_link_type_str[topo_link_type] << "\n"; + if (topology_nearest_info.count > 0) { + for (uint32_t k = 0; k < topology_nearest_info.count; k++) { + amdsmi_bdf_t bdf = {}; + ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf); + if (ret != AMDSMI_STATUS_SUCCESS) { + continue; + } + + printf("\tGPU BDF %04lx:%02x:%02x.%d\n", bdf.domain_number, + bdf.bus_number, bdf.device_number, bdf.function_number); + } + } + else { + std::cout << "\tNot found" << "\n"; + } + } + std::cout << "\n"; + } } From 3a4abbd8c0e16a05d0b39ecea93f70b0954f3ce4 Mon Sep 17 00:00:00 2001 From: Charis Poag Date: Tue, 21 May 2024 20:30:16 -0500 Subject: [PATCH 4/8] [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 Changes: - Added new GPU metrics: 1) Violation status' (ex. PVIOL/TVIOL) accumulators 2) XCP (Graphics Compute Partitions) statistics 3) pcie other end recovery counter - CLI/API/tests changes were made accordingly Change-Id: I589b9b1f570f25dda12d95bb501feca85da8b3bb Signed-off-by: Charis Poag --- .gitignore | 1 + CHANGELOG.md | 303 +++- amdsmi_cli/amdsmi_commands.py | 267 +++- amdsmi_cli/amdsmi_parser.py | 4 + example/amd_smi_drm_example.cc | 485 +++++-- example/amd_smi_nodrm_example.cc | 6 +- include/amd_smi/amdsmi.h | 147 +- py-interface/__init__.py | 1 + py-interface/amdsmi_interface.py | 279 ++-- py-interface/amdsmi_wrapper.py | 96 +- rocm_smi/example/rocm_smi_example.cc | 60 + rocm_smi/include/rocm_smi/rocm_smi.h | 87 ++ rocm_smi/include/rocm_smi/rocm_smi_device.h | 5 +- .../include/rocm_smi/rocm_smi_gpu_metrics.h | 365 +++-- rocm_smi/include/rocm_smi/rocm_smi_utils.h | 69 + rocm_smi/src/rocm_smi_device.cc | 17 +- rocm_smi/src/rocm_smi_gpu_metrics.cc | 1230 +++++++++++++---- rocm_smi/src/rocm_smi_monitor.cc | 1 - src/amd_smi/amd_smi.cc | 317 ++++- src/amd_smi/amd_smi_system.cc | 17 +- .../functional/gpu_metrics_read.cc | 376 ++--- 
.../amd_smi_test/functional/sys_info_read.cc | 6 +- tests/python_unittest/integration_test.py | 63 + 23 files changed, 3260 insertions(+), 942 deletions(-) diff --git a/.gitignore b/.gitignore index b01b2b7fdd..5b1ace20ed 100644 --- a/.gitignore +++ b/.gitignore @@ -18,6 +18,7 @@ include/amd_smi/amd_smiConfig.h rocm_smi/include/rocm_smi/rocm_smi64Config.h docs/*.pdf goamdsmi_shim/include/goamdsmi_shimConfig.h +goamdsmi_shim/include/goamdsmi_shim64Config.h # Byte-compiled / optimized / DLL files __pycache__/ diff --git a/CHANGELOG.md b/CHANGELOG.md index c6a693663c..e0c1c745ac 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,6 +7,291 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr ## amd_smi_lib for ROCm 6.3.0 ### Changes +- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`** +Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery: + - `uint64_t accumulation_counter` - used for all throttled calculations + - `uint64_t prochot_residency_acc` - Processor hot accumulator + - `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations) + - `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations) + - `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator + - `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator + - `uint16_t num_partition` - corresponds to the current total number of partitions + - `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with current GPU, provides gfx busy & accumulators, jpeg, and decoder (VCN) engine utilizations + - `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphic engine utilization (%) + - `uint16_t jpeg_busy[MAX_NUM_JPEG_ENGS]` - jpeg engine utilization (%) + - `uint16_t 
vcn_busy[MAX_NUM_VCNS]` - decoder (VCN) engine utilization (%) + - `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - graphic engine utilization accumulated (%) + - `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter + +- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`** + ***Only available for MI300+ ASICs.*** + Users can now retrieve violation statuses through either our Python or C++ APIs. Additionally, we have + added the capability to view these outputs conveniently through `amd-smi metric --throttle` and `amd-smi monitor --violation`. + Example outputs are listed below (below is for reference, output is subject to change): +```shell +$ amd-smi metric --throttle +GPU: 0 + THROTTLE: + ACCUMULATION_COUNTER: 1226415116 + PROCHOT_ACCUMULATED: 0 + PPT_ACCUMULATED: 12 + SOCKET_THERMAL_ACCUMULATED: 0 + VR_THERMAL_ACCUMULATED: 0 + HBM_THERMAL_ACCUMULATED: 0 + PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE + PPT_VIOLATION_ACTIVE: NOT ACTIVE + SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + PROCHOT_VIOLATION_PERCENT: 0 % + PPT_VIOLATION_PERCENT: 0 % + SOCKET_THERMAL_VIOLATION_PERCENT: 0 % + VR_THERMAL_VIOLATION_PERCENT: 0 % + HBM_THERMAL_VIOLATION_PERCENT: 0 % + +GPU: 1 + THROTTLE: + ACCUMULATION_COUNTER: 1226415121 + PROCHOT_ACCUMULATED: 0 + PPT_ACCUMULATED: 12 + SOCKET_THERMAL_ACCUMULATED: 0 + VR_THERMAL_ACCUMULATED: 0 + HBM_THERMAL_ACCUMULATED: 0 + PROCHOT_VIOLATION_ACTIVE: NOT ACTIVE + PPT_VIOLATION_ACTIVE: NOT ACTIVE + SOCKET_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + VR_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + HBM_THERMAL_VIOLATION_ACTIVE: NOT ACTIVE + PROCHOT_VIOLATION_PERCENT: 0 % + PPT_VIOLATION_PERCENT: 0 % + SOCKET_THERMAL_VIOLATION_PERCENT: 0 % + VR_THERMAL_VIOLATION_PERCENT: 0 % + HBM_THERMAL_VIOLATION_PERCENT: 0 % +...
+``` +```shell +$ amd-smi monitor --violation +GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL + 0 0 % 0 % 0 % 0 % 0 % + 1 0 % 0 % 0 % 0 % 0 % + 2 0 % 0 % 0 % 0 % 0 % + 3 0 % 0 % 0 % 0 % 0 % + 4 0 % 0 % 0 % 0 % 0 % + 5 0 % 0 % 0 % 0 % 0 % + 6 0 % 0 % 0 % 0 % 0 % + 7 0 % 0 % 0 % 0 % 0 % + 8 0 % 0 % 0 % 0 % 0 % + 9 0 % 0 % 0 % 0 % 0 % + 10 0 % 0 % 0 % 0 % 0 % + 11 0 % 0 % 0 % 0 % 0 % + 12 0 % 0 % 0 % 0 % 0 % + 13 0 % 0 % 0 % 0 % 0 % + 14 0 % 0 % 0 % 0 % 0 % + 15 0 % 0 % 0 % 0 % 0 % +... +``` + +- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`** + ***Partition specific features are only available on MI300+ ASICs*** + Users can now retrieve graphic utilization statistic on a per-XCP (per-partition) basis. Here all XCP activities will be listed, + but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`. + + Example outputs are listed below (below is for reference, output is subject to change): +```shell +$ amd-smi metric --usage +GPU: 0 + USAGE: + GFX_ACTIVITY: 0 % + UMC_ACTIVITY: 0 % + MM_ACTIVITY: N/A + VCN_ACTIVITY: [0 %, N/A, N/A, N/A] + JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + GFX_BUSY_INST: + XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + JPEG_BUSY: + XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, 
N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + VCN_BUSY: + XCP_0: [0 %, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A] + GFX_BUSY_ACC: + XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + + +GPU: 1 + USAGE: + GFX_ACTIVITY: 0 % + UMC_ACTIVITY: 0 % + MM_ACTIVITY: N/A + VCN_ACTIVITY: [0 %, N/A, N/A, N/A] + JPEG_ACTIVITY: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 
%, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + GFX_BUSY_INST: + XCP_0: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + JPEG_BUSY: + XCP_0: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_1: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_2: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_3: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_4: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_5: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_6: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + XCP_7: [0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, 0 %, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, + N/A, N/A, N/A] + VCN_BUSY: + XCP_0: [0 %, N/A, N/A, N/A] + XCP_1: 
[0 %, N/A, N/A, N/A] + XCP_2: [0 %, N/A, N/A, N/A] + XCP_3: [0 %, N/A, N/A, N/A] + XCP_4: [0 %, N/A, N/A, N/A] + XCP_5: [0 %, N/A, N/A, N/A] + XCP_6: [0 %, N/A, N/A, N/A] + XCP_7: [0 %, N/A, N/A, N/A] + GFX_BUSY_ACC: + XCP_0: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_1: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_2: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_3: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_4: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_5: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] + +... +``` + +- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value** + ***Feature is only available on MI300+ ASICs*** + Users can now retrieve both through `amdsmi_get_pcie_info()` which has an updated structure: +```C +typedef struct { + ... + struct pcie_metric_ { + uint16_t pcie_width; //!< current PCIe width + uint32_t pcie_speed; //!< current PCIe speed in MT/s + uint32_t pcie_bandwidth; //!< current instantaneous PCIe bandwidth in Mb/s + uint64_t pcie_replay_count; //!< total number of the replays issued on the PCIe link + uint64_t pcie_l0_to_recovery_count; //!< total number of times the PCIe link transitioned from L0 to the recovery state + uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link + uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device + uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver + uint32_t pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter + uint64_t reserved[12]; + } pcie_metric; + uint64_t reserved[32]; +} amdsmi_pcie_info_t; +``` + + Example outputs are listed below (below is for reference, output is subject to change): +```shell +$ amd-smi metric --pcie +GPU: 0 + PCIE: + WIDTH: 16 + 
SPEED: 32 GT/s + BANDWIDTH: 18 Mb/s + REPLAY_COUNT: 0 + L0_TO_RECOVERY_COUNT: 0 + REPLAY_ROLL_OVER_COUNT: 0 + NAK_SENT_COUNT: 0 + NAK_RECEIVED_COUNT: 0 + CURRENT_BANDWIDTH_SENT: N/A + CURRENT_BANDWIDTH_RECEIVED: N/A + MAX_PACKET_SIZE: N/A + LC_PERF_OTHER_END_RECOVERY: 0 + +GPU: 1 + PCIE: + WIDTH: 16 + SPEED: 32 GT/s + BANDWIDTH: 18 Mb/s + REPLAY_COUNT: 0 + L0_TO_RECOVERY_COUNT: 0 + REPLAY_ROLL_OVER_COUNT: 0 + NAK_SENT_COUNT: 0 + NAK_RECEIVED_COUNT: 0 + CURRENT_BANDWIDTH_SENT: N/A + CURRENT_BANDWIDTH_RECEIVED: N/A + MAX_PACKET_SIZE: N/A + LC_PERF_OTHER_END_RECOVERY: 0 +... +``` + +- **Updated BDF commands to use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`** +This aligns BDF output with ROCm SMI. +See below for an overview: the BDF, as seen from `rsmi_dev_pci_id_get()`, now provides the partition ID. See the API documentation for more detail. Previously, the partition ID bits were reserved (right before domain) and the partition ID was stored within function. + - bits [63:32] = domain + - bits [31:28] = partition id + - bits [27:16] = reserved + - bits [15: 0] = pci bus/device/function + - **Moved python tests directory path install location**. - `/opt//share/amd_smi/pytest/..` to `/opt//share/amd_smi/tests/python_unittest/..` @@ -19,7 +304,9 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr - **Added more supported utilization count types to `amdsmi_get_utilization_count()`**. - **Added `amd-smi set -L/--clk-limit ...` command**. - - Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency. + Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency. + + - **Added Pytest functionality to test amdsmi API calls in Python**. 
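Reviewer note: the BDF bit layout in the changelog entry above (domain, partition id, reserved, bus/device/function) can be unpacked with plain shifts and masks. The sketch below is illustrative only and is not part of the patch. `decode_bdf_id` is a hypothetical helper, not an AMD SMI API, and the split of bits [15:0] into bus [15:8], device [7:3] and function [2:0] assumes the conventional PCI encoding, which the changelog does not spell out.

```python
def decode_bdf_id(bdf_id: int) -> dict:
    """Split a 64-bit BDF ID into the fields listed in the changelog entry.

    Hypothetical helper for illustration; not part of the amdsmi library.
    """
    bdf16 = bdf_id & 0xFFFF  # bits [15:0] = pci bus/device/function
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,  # bits [63:32]
        "partition_id": (bdf_id >> 28) & 0xF,   # bits [31:28]
        # bits [27:16] are reserved and ignored here.
        # Assumption: conventional PCI split of the low 16 bits.
        "bus": (bdf16 >> 8) & 0xFF,
        "device": (bdf16 >> 3) & 0x1F,
        "function": bdf16 & 0x7,
    }


# Made-up value for demonstration: partition 3 on bus 0x0c, device 0, function 0.
example = decode_bdf_id((3 << 28) | (0x0C << 8))
```

With a layout like this, a caller would read the partition from `partition_id` rather than from the function field, which matches the changelog's description of the new behavior.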
@@ -140,7 +427,8 @@ Legend: typedef struct { uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported uint32_t node_id; //< 0xFFFFFFFF if not supported - uint32_t reserved[13]; + uint32_t current_partition_id; //< 0xFFFFFFFF if not supported + uint32_t reserved[12]; } amdsmi_kfd_info_t; ``` @@ -362,6 +650,7 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL ``` - **Fixed incorrect implementation of the Python API `amdsmi_get_gpu_metrics_header_info()`**. +- **`amdsmitst` TestGpuMetricsRead now prints metric in correct units** - **`amd-smi static --partition` will have updates with additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()`**. @@ -377,10 +666,10 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL ### Additions -- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**. +- **Removed `amd-smi metric --ecc` & `amd-smi metric --ecc-blocks` on Guest VMs**. Guest VMs do not support getting current ECC counts from the Host cards. -- **Added `amd-smi static --ras`on Guest VMs**. +- **Added `amd-smi static --ras`on Guest VMs**. Guest VMs can view enabled/disabled ras features that are on Host cards. ### Optimizations @@ -393,9 +682,9 @@ Guest VMs can view enabled/disabled ras features that are on Host cards. - **Updated CLI error strings to handle empty and invalid GPU/CPU inputs**. -- **Fixed Guest VM showing passthrough options**. +- **Fixed Guest VM showing passthrough options**. -- **Fixed firmware formatting where leading 0s were missing**. +- **Fixed firmware formatting where leading 0s were missing**. ### Known Issues @@ -1006,7 +1295,7 @@ $ /opt/rocm/bin/amd-smi topology -a -t --json Previously our reset could attempting to reset non-amd GPUS- resuting in "Unable to reset non-amd GPU" error. Fix updates CLI to target only AMD ASICs. -- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()`Navi32/31 cards**. +- **Fix for `amd-smi static --pcie` and `amdsmi_get_pcie_info()` Navi32/31 cards**. 
Updated API to include `amdsmi_card_form_factor_t.AMDSMI_CARD_FORM_FACTOR_CEM`. Prevously, this would report "UNKNOWN". This fix provides the correct board `SLOT_TYPE` associated with these ASICs (and other Navi cards). diff --git a/amdsmi_cli/amdsmi_commands.py b/amdsmi_cli/amdsmi_commands.py index 880fd66788..159ee69ceb 100644 --- a/amdsmi_cli/amdsmi_commands.py +++ b/amdsmi_cli/amdsmi_commands.py @@ -174,17 +174,11 @@ class AMDSMICommands(): kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu) kfd_id = kfd_info['kfd_id'] node_id = kfd_info['node_id'] + partition_id = kfd_info['current_partition_id'] except amdsmi_exception.AmdSmiLibraryException as e: kfd_id = node_id = "N/A" logging.debug("Failed to get kfd info for gpu %s | %s", gpu_id, e.get_error_info()) - try: - partition_info = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(args.gpu) - partition_id = partition_info['partition_id'] - except amdsmi_exception.AmdSmiLibraryException as e: - partition_id = "N/A" - logging.debug("Failed to get partition ID for gpu %s | %s", gpu_id, e.get_error_info()) - # CSV format is intentionally aligned with Host if self.logger.is_csv_format(): self.logger.store_output(args.gpu, 'gpu_bdf', bdf) @@ -688,8 +682,8 @@ class AMDSMICommands(): logging.debug("Failed to get memory partition info for gpu %s | %s", gpu_id, e.get_error_info()) try: - partition_info = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(args.gpu) - partition_id = partition_info['partition_id'] + kfd_info = amdsmi_interface.amdsmi_get_gpu_kfd_info(args.gpu) + partition_id = kfd_info['current_partition_id'] except amdsmi_exception.AmdSmiLibraryException as e: partition_id = "N/A" logging.debug("Failed to get partition ID for gpu %s | %s", gpu_id, e.get_error_info()) @@ -801,6 +795,8 @@ class AMDSMICommands(): new_cache_info.update(cache_info) cache_info_list[index] = new_cache_info + logging.debug(f"[after update] cache_info_list = {cache_info_list}") + cache_size_unit = 
"KB" if self.logger.is_human_readable_format(): cache_info_dict_format = {} @@ -819,6 +815,7 @@ class AMDSMICommands(): cache_info_dict_format[cache_index]["cache_properties"] = ", ".join(cache_info_dict_format[cache_index]["cache_properties"]) cache_info_list = cache_info_dict_format + logging.debug(f"[human readable] cache_info_list = {cache_info_list}") # Add cache_size_unit to json output if self.logger.is_json_format(): @@ -1183,7 +1180,7 @@ class AMDSMICommands(): clock=None, temperature=None, ecc=None, ecc_blocks=None, pcie=None, fan=None, voltage_curve=None, overdrive=None, perf_level=None, xgmi_err=None, energy=None, mem_usage=None, schedule=None, - guard=None, guest_data=None, fb_usage=None, xgmi=None,): + guard=None, guest_data=None, fb_usage=None, xgmi=None, throttle=None): """Get Metric information for target gpu Args: @@ -1213,6 +1210,7 @@ class AMDSMICommands(): guest_data (bool, optional): Value override for args.guest_data. Defaults to None. fb_usage (bool, optional): Value override for args.fb_usage. Defaults to None. xgmi (bool, optional): Value override for args.xgmi. Defaults to None. + throttle (bool, optional): Value override for args.throttle. Defaults to None. 
Raises: IndexError: Index error if gpu list is empty @@ -1251,8 +1249,10 @@ class AMDSMICommands(): args.temperature = temperature if pcie: args.pcie = pcie - current_platform_args += ["usage", "power", "clock", "temperature", "pcie"] - current_platform_values += [args.usage, args.power, args.clock, args.temperature, args.pcie] + if throttle: + args.throttle = throttle + current_platform_args += ["usage", "power", "clock", "temperature", "pcie", "throttle"] + current_platform_values += [args.usage, args.power, args.clock, args.temperature, args.pcie, args.throttle] # Only args that are applicable to Hypervisors and BM Linux if self.helpers.is_hypervisor() or (self.helpers.is_baremetal() and self.helpers.is_linux()): @@ -1342,13 +1342,16 @@ class AMDSMICommands(): gpu_metric_version_info = amdsmi_interface.amdsmi_get_gpu_metrics_header_info(args.gpu) gpu_metric_version_str = json.dumps(gpu_metric_version_info, indent=4) logging.debug("GPU Metrics table Version for GPU %s | %s", gpu_id, gpu_metric_version_str) + except amdsmi_exception.AmdSmiLibraryException as e: + logging.debug("Unable to load GPU Metrics table version for %s | %s", gpu_id, e.err_info) + try: # Get GPU Metrics table gpu_metric_debug_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu) gpu_metric_str = json.dumps(gpu_metric_debug_info, indent=4) - logging.debug("GPU Metrics table for GPU %s | %s", gpu_id, gpu_metric_str) + logging.debug("GPU Metrics table for GPU %s | %s", gpu_id, str(gpu_metric_str)) except amdsmi_exception.AmdSmiLibraryException as e: - logging.debug("Unabled to load GPU Metrics table for %s | %s", gpu_id, e.err_info) + logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.err_info) logging.debug(f"Metric Arg information for GPU {gpu_id} on {self.helpers.os_info()}") logging.debug(f"Args: {current_platform_args}") @@ -1362,6 +1365,13 @@ class AMDSMICommands(): # Add timestamp and store values for specified arguments values_dict = {} + #get metric info 
only once per gpu, this will speed up data output + try: + # Get GPU Metrics table + gpu_metric = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu) + except amdsmi_exception.AmdSmiLibraryException as e: + logging.debug("Unable to load GPU Metrics table for %s | %s", gpu_id, e.err_info) + # Populate the pcie_dict first due to multiple gpu metrics calls incorrectly increasing bandwidth if "pcie" in current_platform_args: if args.pcie: @@ -1375,7 +1385,8 @@ class AMDSMICommands(): "nak_received_count" : "N/A", "current_bandwidth_sent": "N/A", "current_bandwidth_received": "N/A", - "max_packet_size": "N/A"} + "max_packet_size": "N/A", + "lc_perf_other_end_recovery": "N/A"} try: pcie_metric = amdsmi_interface.amdsmi_get_pcie_info(args.gpu)['pcie_metric'] @@ -1396,6 +1407,7 @@ class AMDSMICommands(): pcie_dict['replay_roll_over_count'] = pcie_metric['pcie_replay_roll_over_count'] pcie_dict['nak_received_count'] = pcie_metric['pcie_nak_received_count'] pcie_dict['nak_sent_count'] = pcie_metric['pcie_nak_sent_count'] + pcie_dict['lc_perf_other_end_recovery'] = pcie_metric['pcie_lc_perf_other_end_recovery_count'] pcie_speed_unit = 'GT/s' pcie_bw_unit = 'Mb/s' @@ -1448,11 +1460,40 @@ class AMDSMICommands(): if args.usage: try: engine_usage = amdsmi_interface.amdsmi_get_gpu_activity(args.gpu) + logging.debug(f"engine_usage dictionary = {engine_usage}") # TODO: move vcn_activity and jpeg_activity into amdsmi_get_gpu_activity - gpu_metric_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu) - engine_usage['vcn_activity'] = gpu_metric_info.pop('vcn_activity') - engine_usage['jpeg_activity'] = gpu_metric_info.pop('jpeg_activity') + engine_usage['vcn_activity'] = gpu_metric['vcn_activity'] + engine_usage['jpeg_activity'] = gpu_metric['jpeg_activity'] + num_partition = gpu_metric['num_partition'] + engine_usage['gfx_busy_inst'] = "N/A" + engine_usage['jpeg_busy'] = "N/A" + engine_usage['vcn_busy'] = "N/A" + engine_usage['gfx_busy_acc'] = "N/A" + + if num_partition != 
"N/A": + # these are one after another, in order to display each in sub-sections + new_xcp_dict = {} + for current_xcp in range(num_partition): + new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.gfx_busy_inst'][current_xcp] + engine_usage['gfx_busy_inst'] = new_xcp_dict + + new_xcp_dict = {} + for current_xcp in range(num_partition): + new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.jpeg_busy'][current_xcp] + engine_usage['jpeg_busy'] = new_xcp_dict + + new_xcp_dict = {} + for current_xcp in range(num_partition): + new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.vcn_busy'][current_xcp] + engine_usage['vcn_busy'] = new_xcp_dict + + new_xcp_dict = {} + for current_xcp in range(num_partition): + new_xcp_dict[f"xcp_{current_xcp}"] = gpu_metric['xcp_stats.gfx_busy_acc'][current_xcp] + engine_usage['gfx_busy_acc'] = new_xcp_dict + + logging.debug(f"After updates to engine_usage dictionary = {engine_usage}") for key, value in engine_usage.items(): activity_unit = '%' @@ -1463,6 +1504,13 @@ class AMDSMICommands(): engine_usage[key][index] = f"{activity} {activity_unit}" # Convert list to a string for human readable format engine_usage[key] = '[' + ", ".join(engine_usage[key]) + ']' + elif isinstance(value, dict): + for k, v in value.items(): + for index, activity in enumerate(v): + if activity != "N/A": + value[k][index] = f"{activity} {activity_unit}" + # Convert list to a string for human readable format + value[k] = '[' + ", ".join(value[k]) + ']' elif value != "N/A": engine_usage[key] = f"{value} {activity_unit}" if self.logger.is_json_format(): @@ -1471,14 +1519,20 @@ class AMDSMICommands(): if activity != "N/A": engine_usage[key][index] = {"value" : activity, "unit" : activity_unit} + elif isinstance(value, dict): + for k, v in value.items(): + for index, activity in enumerate(v): + if activity != "N/A": + value[k][index] = {"value" : activity, + "unit" : activity_unit} elif value != "N/A": engine_usage[key] = {"value" : value, "unit" 
: activity_unit} values_dict['usage'] = engine_usage - except amdsmi_exception.AmdSmiLibraryException as e: + except Exception as e: values_dict['usage'] = "N/A" - logging.debug("Failed to get gpu activity for gpu %s | %s", gpu_id, e.get_error_info()) + logging.debug("Failed to get gpu activity for gpu %s | %s", gpu_id, e) if "power" in current_platform_args: if args.power: power_dict = {'socket_power': "N/A", @@ -1527,14 +1581,14 @@ class AMDSMICommands(): try: power_dict['throttle_status'] = "N/A" - throttle_status = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu)['throttle_status'] + throttle_status = gpu_metric['throttle_status'] if throttle_status != "N/A": if throttle_status: power_dict['throttle_status'] = "THROTTLED" else: power_dict['throttle_status'] = "UNTHROTTLED" - except amdsmi_exception.AmdSmiLibraryException as e: - logging.debug("Failed to get throttle status for gpu %s | %s", gpu_id, e.get_error_info()) + except Exception as e: + logging.debug("Failed to get throttle status for gpu %s | %s", gpu_id, e) values_dict['power'] = power_dict if "clock" in current_platform_args: @@ -1578,10 +1632,8 @@ class AMDSMICommands(): # Populate clock values from gpu_metrics_info try: - gpu_metrics_info = amdsmi_interface.amdsmi_get_gpu_metrics_info(args.gpu) - # Populate GFX clock values - current_gfx_clocks = gpu_metrics_info["current_gfxclks"] + current_gfx_clocks = gpu_metric["current_gfxclks"] for clock_index, current_gfx_clock in enumerate(current_gfx_clocks): # If the current clock is N/A then nothing else applies if current_gfx_clock == "N/A": @@ -1593,9 +1645,9 @@ class AMDSMICommands(): clock_unit) # Populate clock locked status - if gpu_metrics_info["gfxclk_lock_status"] != "N/A": + if gpu_metric["gfxclk_lock_status"] != "N/A": gfx_clock_lock_flag = 1 << clock_index # This is the position of the clock lock flag - if gpu_metrics_info["gfxclk_lock_status"] & gfx_clock_lock_flag: + if gpu_metric["gfxclk_lock_status"] & gfx_clock_lock_flag: 
clocks[gfx_index]["clk_locked"] = "ENABLED" else: clocks[gfx_index]["clk_locked"] = "DISABLED" @@ -1607,7 +1659,7 @@ class AMDSMICommands(): clocks[gfx_index]["deep_sleep"] = "DISABLED" # Populate MEM clock value - current_mem_clock = gpu_metrics_info["current_uclk"] # single value + current_mem_clock = gpu_metric["current_uclk"] # single value if current_mem_clock != "N/A": clocks["mem_0"]["clk"] = self.helpers.unit_format(self.logger, current_mem_clock, @@ -1619,7 +1671,7 @@ class AMDSMICommands(): clocks["mem_0"]["deep_sleep"] = "DISABLED" # Populate VCLK clock values - current_vclk_clocks = gpu_metrics_info["current_vclk0s"] + current_vclk_clocks = gpu_metric["current_vclk0s"] for clock_index, current_vclk_clock in enumerate(current_vclk_clocks): # If the current clock is N/A then nothing else applies if current_vclk_clock == "N/A": @@ -1636,7 +1688,7 @@ class AMDSMICommands(): clocks[vclk_index]["deep_sleep"] = "DISABLED" # Populate DCLK clock values - current_dclk_clocks = gpu_metrics_info["current_dclk0s"] + current_dclk_clocks = gpu_metric["current_dclk0s"] for clock_index, current_dclk_clock in enumerate(current_dclk_clocks): # If the current clock is N/A then nothing else applies if current_dclk_clock == "N/A": @@ -1651,8 +1703,8 @@ class AMDSMICommands(): clocks[dclk_index]["deep_sleep"] = "ENABLED" else: clocks[dclk_index]["deep_sleep"] = "DISABLED" - except amdsmi_exception.AmdSmiLibraryException as e: - logging.debug("Failed to get gpu_metrics_info for gpu %s | %s", gpu_id, e.get_error_info()) + except Exception as e: + logging.debug("Failed to get gpu_metrics_info for gpu %s | %s", gpu_id, e) # Populate the max and min clock values from sysfs # Min and Max values are per clock type, not per clock engine @@ -2036,6 +2088,94 @@ class AMDSMICommands(): "unit" : memory_unit} values_dict['mem_usage'] = memory_usage + if "throttle" in current_platform_args: + if args.throttle: + throttle_status = { + # gpu metric values + 'accumulation_counter': "N/A", + 
'prochot_accumulated': "N/A", + 'ppt_accumulated': "N/A", + 'socket_thermal_accumulated': "N/A", + 'vr_thermal_accumulated': "N/A", + 'hbm_thermal_accumulated': "N/A", + + # violation status values - active + 'prochot_violation_active': "N/A", + 'ppt_violation_active': "N/A", + 'socket_thermal_violation_active': "N/A", + 'vr_thermal_violation_active': "N/A", + 'hbm_thermal_violation_active': "N/A", + + # violation status values - percent + 'prochot_violation_percent': "N/A", + 'ppt_violation_percent': "N/A", + 'socket_thermal_violation_percent': "N/A", + 'vr_thermal_violation_percent': "N/A", + 'hbm_thermal_violation_percent': "N/A" + } + + try: + throttle_status['accumulation_counter'] = gpu_metric['accumulation_counter'] + throttle_status['prochot_accumulated'] = gpu_metric['prochot_residency_acc'] + throttle_status['ppt_accumulated'] = gpu_metric['ppt_residency_acc'] + throttle_status['socket_thermal_accumulated'] = gpu_metric['socket_thm_residency_acc'] + throttle_status['vr_thermal_accumulated'] = gpu_metric['vr_thm_residency_acc'] + throttle_status['hbm_thermal_accumulated'] = gpu_metric['hbm_thm_residency_acc'] + + except Exception as e: + values_dict['throttle'] = throttle_status + logging.debug("Failed to get gpu metric information for throttle status' for gpu %s | %s", gpu_id, e) + + try: + violation_status = amdsmi_interface.amdsmi_get_violation_status(args.gpu) + throttle_status['prochot_violation_active'] = violation_status['active_prochot_thrm'] + throttle_status['ppt_violation_active'] = violation_status['active_ppt_pwr'] + throttle_status['socket_thermal_violation_active'] = violation_status['active_socket_thrm'] + throttle_status['vr_thermal_violation_active'] = violation_status['active_vr_thrm'] + throttle_status['hbm_thermal_violation_active'] = violation_status['active_hbm_thrm'] + + throttle_status['prochot_violation_percent'] = violation_status['per_prochot_thrm'] + throttle_status['ppt_violation_percent'] = violation_status['per_ppt_pwr'] + 
throttle_status['socket_thermal_violation_percent'] = violation_status['per_socket_thrm'] + throttle_status['vr_thermal_violation_percent'] = violation_status['per_vr_thrm'] + throttle_status['hbm_thermal_violation_percent'] = violation_status['per_hbm_thrm'] + + except amdsmi_exception.AmdSmiLibraryException as e: + values_dict['throttle'] = throttle_status + logging.debug("Failed to get violation status' for gpu %s | %s", gpu_id, e.get_error_info()) + + for key, value in throttle_status.items(): + if ("active" in key) and (value is True): + throttle_status[key] = "ACTIVE" + continue + elif ("active" in key) and (value is False): + throttle_status[key] = "NOT ACTIVE" + continue + if "percent" in key: + True # continue with rest of logic + else: + continue + + activity_unit = '%' + if self.logger.is_human_readable_format(): + if isinstance(value, list): + for index, activity in enumerate(value): + if activity != "N/A": + throttle_status[key][index] = f"{activity} {activity_unit}" + # Convert list to a string for human readable format + throttle_status[key] = '[' + ", ".join(throttle_status[key]) + ']' + elif value != "N/A": + throttle_status[key] = f"{value} {activity_unit}" + if self.logger.is_json_format(): + if isinstance(value, list): + for index, activity in enumerate(value): + if activity != "N/A": + throttle_status[key][index] = {"value" : activity, + "unit" : activity_unit} + elif value != "N/A": + throttle_status[key] = {"value" : value, + "unit" : activity_unit} + values_dict['throttle'] = throttle_status # Store timestamp first if watching_output is enabled if watching_output: @@ -2438,7 +2578,7 @@ class AMDSMICommands(): cpu_temp=None, cpu_dimm_temp_range_rate=None, cpu_dimm_pow_consumption=None, cpu_dimm_thermal_sensor=None, core=None, core_boost_limit=None, core_curr_active_freq_core_limit=None, - core_energy=None): + core_energy=None, throttle=None): """Get Metric information for target gpu Args: @@ -2513,7 +2653,7 @@ class AMDSMICommands(): 
gpu_attributes = ["usage", "watch", "watch_time", "iterations", "power", "clock", "temperature", "ecc", "ecc_blocks", "pcie", "fan", "voltage_curve", "overdrive", "perf_level", "xgmi_err", "energy", "mem_usage", "schedule", - "guard", "guest_data", "fb_usage", "xgmi"] + "guard", "guest_data", "fb_usage", "xgmi", "throttle"] for attr in gpu_attributes: if hasattr(args, attr): if getattr(args, attr): @@ -2586,7 +2726,7 @@ class AMDSMICommands(): clock, temperature, ecc, ecc_blocks, pcie, fan, voltage_curve, overdrive, perf_level, xgmi_err, energy, mem_usage, schedule, - guard, guest_data, fb_usage, xgmi) + guard, guest_data, fb_usage, xgmi, throttle) elif self.helpers.is_amd_hsmp_initialized(): # Only CPU is initialized if args.cpu == None and args.core == None: # If no args are set, print out all CPU and Core metrics info @@ -2620,7 +2760,7 @@ class AMDSMICommands(): usage, watch, watch_time, iterations, power, clock, temperature, ecc, ecc_blocks, pcie, fan, voltage_curve, overdrive, perf_level, - xgmi_err, energy, mem_usage, schedule) + xgmi_err, energy, mem_usage, schedule, throttle) def process(self, args, multiple_devices=False, watching_output=False, @@ -4301,7 +4441,7 @@ class AMDSMICommands(): def monitor(self, args, multiple_devices=False, watching_output=False, gpu=None, watch=None, watch_time=None, iterations=None, power_usage=None, temperature=None, gfx_util=None, mem_util=None, encoder=None, decoder=None, - ecc=None, vram_usage=None, pcie=None, process=None): + ecc=None, vram_usage=None, pcie=None, process=None, violation=None): """ Populate a table with each GPU as an index to rows of targeted data Args: @@ -4321,6 +4461,7 @@ class AMDSMICommands(): vram_usage (bool, optional): Value override for args.vram_usage. Defaults to None. pcie (bool, optional): Value override for args.pcie. Defaults to None. process (bool, optional): Value override for args.process. Defaults to None. + violation (bool, optional): Value override for args.violation. 
Defaults to None. Raises: ValueError: Value error if no gpu value is provided @@ -4360,6 +4501,8 @@ class AMDSMICommands(): args.pcie = pcie if process: args.process = process + if violation: + args.violation = violation # Handle No GPU passed if args.gpu == None: @@ -4369,10 +4512,10 @@ class AMDSMICommands(): # Don't include process in this logic as it's an optional edge case if not any([args.power_usage, args.temperature, args.gfx, args.mem, args.encoder, args.decoder, args.ecc, - args.vram_usage, args.pcie]): + args.vram_usage, args.pcie, args.violation]): args.power_usage = args.temperature = args.gfx = args.mem = \ args.encoder = args.decoder = args.ecc = \ - args.vram_usage = args.pcie = True + args.vram_usage = args.pcie = args.violation = True # Handle watch logic, will only enter this block once if args.watch: @@ -4684,6 +4827,50 @@ class AMDSMICommands(): self.logger.table_header += 'PCIE_BW'.rjust(12) + if args.violation: + violation_status = { + "pviol": "N/A", + "tviol": "N/A", + "phot_tviol": "N/A", + "vr_tviol": "N/A", + "hbm_tviol": "N/A", + } + try: + violations = amdsmi_interface.amdsmi_get_violation_status(args.gpu) + violation_status['pviol'] = violations['per_ppt_pwr'] + violation_status['tviol'] = violations['per_socket_thrm'] + violation_status['phot_tviol'] = violations['per_prochot_thrm'] + violation_status['vr_tviol'] = violations['per_vr_thrm'] + violation_status['hbm_tviol'] = violations['per_hbm_thrm'] + except amdsmi_exception.AmdSmiLibraryException as e: + monitor_values['pviol'] = violation_status['pviol'] + monitor_values['tviol'] = violation_status['tviol'] + monitor_values['phot_tviol'] = violation_status['phot_tviol'] + monitor_values['vr_tviol'] = violation_status['vr_tviol'] + monitor_values['hbm_tviol'] = violation_status['hbm_tviol'] + logging.debug("Failed to get violation status on gpu %s | %s", gpu_id, e.get_error_info()) + violation_status_unit = "%" + kTVIOL_MAX_WIDTH = 10 + kPVIOL_MAX_WIDTH = 10 + kPHOT_MAX_WIDTH = 12 
+ kVR_MAX_WIDTH = 10 + kHBM_MAX_WIDTH = 11 + + for key, value in violation_status.items(): + monitor_values[key] = self.helpers.unit_format(self.logger, violation_status[key], violation_status_unit) + + if self.logger.is_human_readable_format(): + monitor_values['pviol'] = monitor_values['pviol'].rjust(kPVIOL_MAX_WIDTH, ' ') + monitor_values['tviol'] = monitor_values['tviol'].rjust(kTVIOL_MAX_WIDTH, ' ') + monitor_values['phot_tviol'] = monitor_values['phot_tviol'].rjust(kPHOT_MAX_WIDTH, ' ') + monitor_values['vr_tviol'] = monitor_values['vr_tviol'].rjust(kVR_MAX_WIDTH, ' ') + monitor_values['hbm_tviol'] = monitor_values['hbm_tviol'].rjust(kHBM_MAX_WIDTH, ' ') + self.logger.table_header += 'PVIOL'.rjust(kPVIOL_MAX_WIDTH, ' ') + self.logger.table_header += 'TVIOL'.rjust(kTVIOL_MAX_WIDTH, ' ') + self.logger.table_header += 'PHOT_TVIOL'.rjust(kPHOT_MAX_WIDTH, ' ') + self.logger.table_header += 'VR_TVIOL'.rjust(kVR_MAX_WIDTH, ' ') + self.logger.table_header += 'HBM_TVIOL'.rjust(kHBM_MAX_WIDTH, ' ') + self.logger.store_output(args.gpu, 'values', monitor_values) # intialize dual_csv_format; applicable to process only diff --git a/amdsmi_cli/amdsmi_parser.py b/amdsmi_cli/amdsmi_parser.py index be58f7b0fe..bc9c85149b 100644 --- a/amdsmi_cli/amdsmi_parser.py +++ b/amdsmi_cli/amdsmi_parser.py @@ -758,6 +758,7 @@ class AMDSMIParser(argparse.ArgumentParser): perf_level_help = "Current DPM performance level" xgmi_err_help = "XGMI error information since last read" energy_help = "Amount of energy consumed" + throttle_help = "Displays throttle accumulators; Only available for MI300 or newer ASICs" # Help text for Arguments only on Hypervisors schedule_help = "All scheduling information" @@ -832,6 +833,7 @@ class AMDSMIParser(argparse.ArgumentParser): metric_parser.add_argument('-l', '--perf-level', action='store_true', required=False, help=perf_level_help) metric_parser.add_argument('-x', '--xgmi-err', action='store_true', required=False, help=xgmi_err_help) 
metric_parser.add_argument('-E', '--energy', action='store_true', required=False, help=energy_help) + metric_parser.add_argument('-T', '--throttle', action='store_true', required=False, help=throttle_help) # Options to only display to Hypervisors if self.helpers.is_hypervisor(): @@ -1184,6 +1186,7 @@ class AMDSMIParser(argparse.ArgumentParser): mem_usage_help = "Monitor memory usage in MB" pcie_bandwidth_help = "Monitor PCIe bandwidth in Mb/s" process_help = "Enable Process information table below monitor output" + violation_help = "Monitor power and thermal violation status (%%); Only available for MI300 or newer ASICs" # Create monitor subparser monitor_parser = subparsers.add_parser('monitor', help=monitor_help, description=monitor_subcommand_help, aliases=["dmon"]) @@ -1207,6 +1210,7 @@ class AMDSMIParser(argparse.ArgumentParser): monitor_parser.add_argument('-v', '--vram-usage', action='store_true', required=False, help=mem_usage_help) monitor_parser.add_argument('-r', '--pcie', action='store_true', required=False, help=pcie_bandwidth_help) monitor_parser.add_argument('-q', '--process', action='store_true', required=False, help=process_help) + monitor_parser.add_argument('-V', '--violation', action='store_true', required=False, help=violation_help) def _add_rocm_smi_parser(self, subparsers, func): diff --git a/example/amd_smi_drm_example.cc b/example/amd_smi_drm_example.cc index 0887b1f0e6..3864262895 100644 --- a/example/amd_smi_drm_example.cc +++ b/example/amd_smi_drm_example.cc @@ -62,7 +62,7 @@ const char *err_str; \ std::cout << "AMDSMI call returned " << RET << " at line " \ << __LINE__ << std::endl; \ - amdsmi_status_code_to_string(RET, &err_str); \ + amdsmi_status_code_to_string(RET, &err_str); \ std::cout << err_str << std::endl; \ return RET; \ } \ @@ -264,6 +264,8 @@ int main() { &device_count, &processor_handles[0]); CHK_AMDSMI_RET(ret) + std::cout << "Processor Count: " << device_count << std::endl; + // For each device of the socket, get name and 
temperature. for (uint32_t j = 0; j < device_count; j++) { // Get device type. Since the amdsmi is initialized with @@ -494,7 +496,10 @@ int main() { block = (amdsmi_gpu_block_t)(block * 2)) { ret = amdsmi_get_gpu_ras_block_features_enabled(processor_handles[j], block, &state); - CHK_AMDSMI_RET(ret) + if (ret != AMDSMI_STATUS_API_FAILED) { + CHK_AMDSMI_RET(ret) + } + printf("\tBlock: %s\n", block_names[index]); printf("\tStatus: %s\n", status_names[state]); index++; @@ -507,7 +512,9 @@ int main() { uint32_t num_pages = 0; ret = amdsmi_get_gpu_bad_page_info(processor_handles[j], &num_pages, nullptr); - CHK_AMDSMI_RET(ret) + if (ret != AMDSMI_STATUS_NOT_SUPPORTED) { + CHK_AMDSMI_RET(ret) + } printf(" Output of amdsmi_get_gpu_bad_page_info:\n"); if (!num_pages) { printf("\tNo bad pages found.\n"); @@ -684,8 +691,8 @@ int main() { /// Get GPU Metrics info std::cout << "\n\n"; - amdsmi_gpu_metrics_t gpu_metrics; - ret = amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics); + amdsmi_gpu_metrics_t smu; + ret = amdsmi_get_gpu_metrics_info(processor_handles[j], &smu); CHK_AMDSMI_RET(ret) printf(" Output of amdsmi_get_gpu_metrics_info:\n"); printf("\tDevice[%d] BDF %04lx:%02x:%02x.%d\n\n", i, @@ -694,164 +701,334 @@ int main() { bdf.device_number, bdf.function_number); - std::cout << "\t**.common_header.format_revision : " - << print_unsigned_int(gpu_metrics.common_header.format_revision) << "\n"; - std::cout << "\t**.common_header.content_revision : " - << print_unsigned_int(gpu_metrics.common_header.content_revision) << "\n"; - std::cout << "\t**.temperature_edge : " << std::dec - << gpu_metrics.temperature_edge << "\n"; - std::cout << "\t**.temperature_hotspot : " << std::dec - << gpu_metrics.temperature_hotspot << "\n"; - std::cout << "\t**.temperature_mem : " << std::dec - << gpu_metrics.temperature_mem << "\n"; - std::cout << "\t**.temperature_vrgfx : " << std::dec - << gpu_metrics.temperature_vrgfx << "\n"; - std::cout << "\t**.temperature_vrsoc : " << 
std::dec - << gpu_metrics.temperature_vrsoc << "\n"; - std::cout << "\t**.temperature_vrmem : " << std::dec - << gpu_metrics.temperature_vrmem << "\n"; - std::cout << "\t**.average_gfx_activity : " << std::dec - << gpu_metrics.average_gfx_activity << "\n"; - std::cout << "\t**.average_umc_activity : " << std::dec - << gpu_metrics.average_umc_activity << "\n"; - std::cout << "\t**.average_mm_activity : " << std::dec - << gpu_metrics.average_mm_activity << "\n"; - std::cout << "\t**.average_socket_power : " << std::dec - << gpu_metrics.average_socket_power << "\n"; - std::cout << "\t**.energy_accumulator : " << std::dec - << gpu_metrics.energy_accumulator << "\n"; - std::cout << "\t**.system_clock_counter : " << std::dec - << gpu_metrics.system_clock_counter << "\n"; - std::cout << "\t**.average_gfxclk_frequency : " << std::dec - << gpu_metrics.average_gfxclk_frequency << "\n"; - std::cout << "\t**.average_socclk_frequency : " << std::dec - << gpu_metrics.average_socclk_frequency << "\n"; - std::cout << "\t**.average_uclk_frequency : " << std::dec - << gpu_metrics.average_uclk_frequency << "\n"; - std::cout << "\t**.average_vclk0_frequency : " << std::dec - << gpu_metrics.average_vclk0_frequency<< "\n"; - std::cout << "\t**.average_dclk0_frequency : " << std::dec - << gpu_metrics.average_dclk0_frequency << "\n"; - std::cout << "\t**.average_vclk1_frequency : " << std::dec - << gpu_metrics.average_vclk1_frequency << "\n"; - std::cout << "\t**.average_dclk1_frequency : " << std::dec - << gpu_metrics.average_dclk1_frequency << "\n"; - std::cout << "\t**.current_gfxclk : " << std::dec - << gpu_metrics.current_gfxclk << "\n"; - std::cout << "\t**.current_socclk : " << std::dec - << gpu_metrics.current_socclk << "\n"; - std::cout << "\t**.current_uclk : " << std::dec - << gpu_metrics.current_uclk << "\n"; - std::cout << "\t**.current_vclk0 : " << std::dec - << gpu_metrics.current_vclk0 << "\n"; - std::cout << "\t**.current_dclk0 : " << std::dec - << 
gpu_metrics.current_dclk0 << "\n"; - std::cout << "\t**.current_vclk1 : " << std::dec - << gpu_metrics.current_vclk1 << "\n"; - std::cout << "\t**.current_dclk1 : " << std::dec - << gpu_metrics.current_dclk1 << "\n"; - std::cout << "\t**.throttle_status : " << std::dec - << gpu_metrics.throttle_status << "\n"; - std::cout << "\t**.current_fan_speed : " << std::dec - << gpu_metrics.current_fan_speed << "\n"; - std::cout << "\t**.pcie_link_width : " << std::dec - << gpu_metrics.pcie_link_width << "\n"; - std::cout << "\t**.pcie_link_speed : " << std::dec - << gpu_metrics.pcie_link_speed << "\n"; - std::cout << "\t**.gfx_activity_acc : " << std::dec - << gpu_metrics.gfx_activity_acc << "\n"; - std::cout << "\t**.mem_activity_acc : " << std::dec - << gpu_metrics.mem_activity_acc << "\n"; - std::cout << "\t**.firmware_timestamp : " << std::dec - << gpu_metrics.firmware_timestamp << "\n"; - std::cout << "\t**.voltage_soc : " << std::dec - << gpu_metrics.voltage_soc << "\n"; - std::cout << "\t**.voltage_gfx : " << std::dec - << gpu_metrics.voltage_gfx << "\n"; - std::cout << "\t**.voltage_mem : " << std::dec - << gpu_metrics.voltage_mem << "\n"; - std::cout << "\t**.indep_throttle_status : " << std::dec - << gpu_metrics.indep_throttle_status << "\n"; - std::cout << "\t**.current_socket_power : " << std::dec - << gpu_metrics.current_socket_power << "\n"; - std::cout << "\t**.gfxclk_lock_status : " << std::dec - << gpu_metrics.gfxclk_lock_status << "\n"; - std::cout << "\t**.xgmi_link_width : " << std::dec - << gpu_metrics.xgmi_link_width << "\n"; - std::cout << "\t**.xgmi_link_speed : " << std::dec - << gpu_metrics.xgmi_link_speed << "\n"; - std::cout << "\t**.pcie_bandwidth_acc : " << std::dec - << gpu_metrics.pcie_bandwidth_acc << "\n"; - std::cout << "\t**.pcie_bandwidth_inst : " << std::dec - << gpu_metrics.pcie_bandwidth_inst << "\n"; - std::cout << "\t**.pcie_l0_to_recov_count_acc : " << std::dec - << gpu_metrics.pcie_l0_to_recov_count_acc << "\n"; - std::cout << 
"\t**.pcie_replay_count_acc : " << std::dec - << gpu_metrics.pcie_replay_count_acc << "\n"; - std::cout << "\t**.pcie_replay_rover_count_acc : " << std::dec - << gpu_metrics.pcie_replay_rover_count_acc << "\n"; + std::cout << "METRIC TABLE HEADER:\n"; + std::cout << "structure_size=" << std::dec + << static_cast(smu.common_header.structure_size) << "\n"; + std::cout << "\tformat_revision=" << std::dec + << static_cast(smu.common_header.format_revision) << "\n"; + std::cout << "\tcontent_revision=" << std::dec + << static_cast(smu.common_header.content_revision) << "\n"; - std::cout << "\t**.temperature_hbm[] : " << std::dec << "\n"; - for (const auto& temp : gpu_metrics.temperature_hbm) { - std::cout << "\t -> " << std::dec << temp << "\n"; - } + std::cout << "\n"; + std::cout << "TIME STAMPS (ns):\n"; + std::cout << std::dec << "\tsystem_clock_counter=" << smu.system_clock_counter << "\n"; + std::cout << "\tfirmware_timestamp (10ns resolution)=" << std::dec << smu.firmware_timestamp + << "\n"; - std::cout << "\t**.vcn_activity[] : " << std::dec << "\n"; - for (const auto& vcn : gpu_metrics.vcn_activity) { - std::cout << "\t -> " << std::dec << vcn << "\n"; - } - - std::cout << "\t**.xgmi_read_data_acc[] : " << std::dec << "\n"; - for (const auto& read_data : gpu_metrics.xgmi_read_data_acc) { - std::cout << "\t -> " << std::dec << read_data << "\n"; - } - - std::cout << "\t**.xgmi_write_data_acc[] : " << std::dec << "\n"; - for (const auto& write_data : gpu_metrics.xgmi_write_data_acc) { - std::cout << "\t -> " << std::dec << write_data << "\n"; - } - - std::cout << "\t**.current_gfxclks[] : " << std::dec << "\n"; - for (const auto& gfxclk : gpu_metrics.current_gfxclks) { - std::cout << "\t -> " << std::dec << gfxclk << "\n"; - } - - std::cout << "\t**.current_socclks[] : " << std::dec << "\n"; - for (const auto& socclk : gpu_metrics.current_socclks) { - std::cout << "\t -> " << std::dec << socclk << "\n"; - } - - std::cout << "\t**.current_vclk0s[] : " << std::dec 
<< "\n"; - for (const auto& vclk : gpu_metrics.current_vclk0s) { - std::cout << "\t -> " << std::dec << vclk << "\n"; - } - - std::cout << "\t**.current_dclk0s[] : " << std::dec << "\n"; - for (const auto& dclk : gpu_metrics.current_dclk0s) { - std::cout << "\t -> " << std::dec << dclk << "\n"; + std::cout << "\n"; + std::cout << "TEMPERATURES (C):\n"; + std::cout << std::dec << "\ttemperature_edge= " << smu.temperature_edge << "\n"; + std::cout << std::dec << "\ttemperature_hotspot= " << smu.temperature_hotspot << "\n"; + std::cout << std::dec << "\ttemperature_mem= " << smu.temperature_mem << "\n"; + std::cout << std::dec << "\ttemperature_vrgfx= " << smu.temperature_vrgfx << "\n"; + std::cout << std::dec << "\ttemperature_vrsoc= " << smu.temperature_vrsoc << "\n"; + std::cout << std::dec << "\ttemperature_vrmem= " << smu.temperature_vrmem << "\n"; + std::cout << "\ttemperature_hbm = ["; + auto idx = 0; + for (const auto& temp : smu.temperature_hbm) { + std::cout << temp; + if ((idx + 1) != std::size(smu.temperature_hbm)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; } std::cout << "\n"; + std::cout << "UTILIZATION (%):\n"; + std::cout << std::dec << "\taverage_gfx_activity=" << smu.average_gfx_activity << "\n"; + std::cout << std::dec << "\taverage_umc_activity=" << smu.average_umc_activity << "\n"; + std::cout << std::dec << "\taverage_mm_activity=" << smu.average_mm_activity << "\n"; + std::cout << std::dec << "\tvcn_activity= ["; + idx = 0; + for (const auto& temp : smu.vcn_activity) { + std::cout << temp; + if ((idx + 1) != std::size(smu.vcn_activity)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << "\n"; + std::cout << std::dec << "\tjpeg_activity= ["; + idx = 0; + for (const auto& temp : smu.jpeg_activity) { + std::cout << temp; + if ((idx + 1) != std::size(smu.jpeg_activity)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << "\n"; + std::cout << "POWER 
(W)/ENERGY (15.259uJ per 1ns):\n"; + std::cout << std::dec << "\taverage_socket_power=" << smu.average_socket_power << "\n"; + std::cout << std::dec << "\tcurrent_socket_power=" << smu.current_socket_power << "\n"; + std::cout << std::dec << "\tenergy_accumulator=" << smu.energy_accumulator << "\n"; + + std::cout << "\n"; + std::cout << "AVG CLOCKS (MHz):\n"; + std::cout << std::dec << "\taverage_gfxclk_frequency=" << smu.average_gfxclk_frequency + << "\n"; + std::cout << std::dec << "\taverage_socclk_frequency=" << smu.average_socclk_frequency + << "\n"; + std::cout << std::dec << "\taverage_uclk_frequency=" << smu.average_uclk_frequency << "\n"; + std::cout << std::dec << "\taverage_vclk0_frequency=" << smu.average_vclk0_frequency + << "\n"; + std::cout << std::dec << "\taverage_dclk0_frequency=" << smu.average_dclk0_frequency + << "\n"; + std::cout << std::dec << "\taverage_vclk1_frequency=" << smu.average_vclk1_frequency + << "\n"; + std::cout << std::dec << "\taverage_dclk1_frequency=" << smu.average_dclk1_frequency + << "\n"; + + std::cout << "\n"; + std::cout << "CURRENT CLOCKS (MHz):\n"; + std::cout << std::dec << "\tcurrent_gfxclk=" << smu.current_gfxclk << "\n"; + std::cout << std::dec << "\tcurrent_gfxclks= ["; + idx = 0; + for (const auto& temp : smu.current_gfxclks) { + std::cout << temp; + if ((idx + 1) != std::size(smu.current_gfxclks)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << std::dec << "\tcurrent_socclk=" << smu.current_socclk << "\n"; + std::cout << std::dec << "\tcurrent_socclks= ["; + idx = 0; + for (const auto& temp : smu.current_socclks) { + std::cout << temp; + if ((idx + 1) != std::size(smu.current_socclks)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << std::dec << "\tcurrent_uclk=" << smu.current_uclk << "\n"; + std::cout << std::dec << "\tcurrent_vclk0=" << smu.current_vclk0 << "\n"; + std::cout << std::dec << "\tcurrent_vclk0s= ["; + idx = 0; +
for (const auto& temp : smu.current_vclk0s) { + std::cout << temp; + if ((idx + 1) != std::size(smu.current_vclk0s)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << std::dec << "\tcurrent_dclk0=" << smu.current_dclk0 << "\n"; + std::cout << std::dec << "\tcurrent_dclk0s= ["; + idx = 0; + for (const auto& temp : smu.current_dclk0s) { + std::cout << temp; + if ((idx + 1) != std::size(smu.current_dclk0s)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << std::dec << "\tcurrent_vclk1=" << smu.current_vclk1 << "\n"; + std::cout << std::dec << "\tcurrent_dclk1=" << smu.current_dclk1 << "\n"; + + std::cout << "\n"; + std::cout << "THROTTLE STATUS:\n"; + std::cout << std::dec << "\tthrottle_status=" << smu.throttle_status << "\n"; + + std::cout << "\n"; + std::cout << "FAN SPEED:\n"; + std::cout << std::dec << "\tcurrent_fan_speed=" << smu.current_fan_speed << "\n"; + + std::cout << "\n"; + std::cout << "LINK WIDTH (number of lanes) /SPEED (0.1 GT/s):\n"; + std::cout << "\tpcie_link_width=" << smu.pcie_link_width << "\n"; + std::cout << "\tpcie_link_speed=" << smu.pcie_link_speed << "\n"; + std::cout << "\txgmi_link_width=" << smu.xgmi_link_width << "\n"; + std::cout << "\txgmi_link_speed=" << smu.xgmi_link_speed << "\n"; + + std::cout << "\n"; + std::cout << "Utilization Accumulated(%):\n"; + std::cout << "\tgfx_activity_acc=" << std::dec << smu.gfx_activity_acc << "\n"; + std::cout << "\tmem_activity_acc=" << std::dec << smu.mem_activity_acc << "\n"; + + std::cout << "\n"; + std::cout << "XGMI ACCUMULATED DATA TRANSFER SIZE (KB):\n"; + std::cout << std::dec << "\txgmi_read_data_acc= ["; + idx = 0; + for (const auto& temp : smu.xgmi_read_data_acc) { + std::cout << temp; + if ((idx + 1) != std::size(smu.xgmi_read_data_acc)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + std::cout << std::dec << "\txgmi_write_data_acc= ["; + idx = 0; + for (const auto& temp :
smu.xgmi_write_data_acc) { + std::cout << temp; + if ((idx + 1) != std::size(smu.xgmi_write_data_acc)) { + std::cout << ", "; + } else { + std::cout << "]\n"; + } + ++idx; + } + + // Voltage (mV) + std::cout << "\tvoltage_soc = " << std::dec << smu.voltage_soc << "\n"; + std::cout << "\tvoltage_gfx = " << std::dec << smu.voltage_gfx << "\n"; + std::cout << "\tvoltage_mem = " << std::dec << smu.voltage_mem << "\n"; + + std::cout << "\tindep_throttle_status = " << std::dec << smu.indep_throttle_status << "\n"; + + // Clock Lock Status. Each bit corresponds to clock instance + std::cout << "\tgfxclk_lock_status (in hex) = " << std::hex + << smu.gfxclk_lock_status << std::dec <<"\n"; + + // Bandwidth (GB/sec) + std::cout << "\tpcie_bandwidth_acc=" << std::dec << smu.pcie_bandwidth_acc << "\n"; + std::cout << "\tpcie_bandwidth_inst=" << std::dec << smu.pcie_bandwidth_inst << "\n"; + + // Counts + std::cout << "\tpcie_l0_to_recov_count_acc= " << std::dec << smu.pcie_l0_to_recov_count_acc + << "\n"; + std::cout << "\tpcie_replay_count_acc= " << std::dec << smu.pcie_replay_count_acc << "\n"; + std::cout << "\tpcie_replay_rover_count_acc= " << std::dec + << smu.pcie_replay_rover_count_acc << "\n"; + std::cout << "\tpcie_nak_sent_count_acc= " << std::dec << smu.pcie_nak_sent_count_acc + << "\n"; + std::cout << "\tpcie_nak_rcvd_count_acc= " << std::dec << smu.pcie_nak_rcvd_count_acc + << "\n"; + + // Accumulation cycle counter + // Accumulated throttler residencies + std::cout << "\n"; + std::cout << "RESIDENCY ACCUMULATION / COUNTER:\n"; + std::cout << "\taccumulation_counter = " << std::dec << smu.accumulation_counter << "\n"; + std::cout << "\tprochot_residency_acc = " << std::dec << smu.prochot_residency_acc << "\n"; + std::cout << "\tppt_residency_acc = " << std::dec << smu.ppt_residency_acc << "\n"; + std::cout << "\tsocket_thm_residency_acc = " << std::dec << smu.socket_thm_residency_acc + << "\n"; + std::cout << "\tvr_thm_residency_acc = " << std::dec << 
smu.vr_thm_residency_acc + << "\n"; + std::cout << "\thbm_thm_residency_acc = " << std::dec << smu.hbm_thm_residency_acc << "\n"; + + // Number of current partitions + std::cout << "\tnum_partition = " << std::dec << smu.num_partition << "\n"; + + // PCIE other end recovery counter + std::cout << "\tpcie_lc_perf_other_end_recovery = " + << std::dec << smu.pcie_lc_perf_other_end_recovery << "\n"; + + idx = 0; + auto idy = 0; + std::cout << "\txcp_stats.gfx_busy_inst: " << "\n"; + for (auto& row : smu.xcp_stats) { + std::cout << "\t XCP [" << idx << "] : ["; + for (auto& col : row.gfx_busy_inst) { + if ((idy + 1) != std::size(row.gfx_busy_inst)) { + std::cout << col << ", "; + } else { + std::cout << col; + } + idy++; + } + std::cout << "]\n"; + idy = 0; + idx++; + } + + idx = 0; + idy = 0; + std::cout << "\txcp_stats.vcn_busy: " << "\n"; + for (auto& row : smu.xcp_stats) { + std::cout << "\t XCP [" << idx << "] : ["; + for (auto& col : row.vcn_busy) { + if ((idy + 1) != std::size(row.vcn_busy)) { + std::cout << col << ", "; + } else { + std::cout << col; + } + idy++; + } + std::cout << "]\n"; + idy = 0; + idx++; + } + + idx = 0; + idy = 0; + std::cout << "\txcp_stats.jpeg_busy: " << "\n"; + for (auto& row : smu.xcp_stats) { + std::cout << "\t XCP [" << idx << "] : ["; + for (auto& col : row.jpeg_busy) { + if ((idy + 1) != std::size(row.jpeg_busy)) { + std::cout << col << ", "; + } else { + std::cout << col; + } + idy++; + } + std::cout << "]\n"; + idy = 0; + idx++; + } + + idx = 0; + idy = 0; + std::cout << "\txcp_stats.gfx_busy_acc: " << "\n"; + for (auto& row : smu.xcp_stats) { + std::cout << "\t XCP [" << idx << "] : ["; + for (auto& col : row.gfx_busy_acc) { + if ((idy + 1) != std::size(row.gfx_busy_acc)) { + std::cout << col << ", "; + } else { + std::cout << col; + } + idy++; + } + std::cout << "]\n"; + idy = 0; + idx++; + } + + std::cout << "\n\n"; std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n"; constexpr uint16_t kMAX_ITER_TEST = 
10; - amdsmi_gpu_metrics_t gpu_metrics_check; + amdsmi_gpu_metrics_t gpu_metrics_check = {}; for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) { - amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check); - std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n"; + amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check); + std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " + << gpu_metrics_check.firmware_timestamp << "\n"; } std::cout << "\n"; for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) { - amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check); - std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n"; + amdsmi_get_gpu_metrics_info(processor_handles[j], &gpu_metrics_check); + std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " + << gpu_metrics_check.system_clock_counter << "\n"; } - std::cout << "\n"; std::cout << "\n"; - std::cout << "\t ** Note: Values MAX'ed out (UINTX MAX are unsupported for the version in question) ** " << "\n"; - std::cout << "\n"; - std::cout << "+=======+==================+============+==============" - << "+=============+=============+=============+============+\n"; + std::cout << " ** Note: Values MAX'ed out " + << "(UINTX MAX are unsupported for the version in question) ** " << "\n\n"; // Get nearest GPUs char *topology_link_type_str[] = { @@ -867,13 +1044,19 @@ int main() { ret = amdsmi_get_link_topology_nearest(processor_handles[j], static_cast(topo_link_type), nullptr); - CHK_AMDSMI_RET(ret); + if (ret != AMDSMI_STATUS_INVAL) { + CHK_AMDSMI_RET(ret); + } ret = amdsmi_get_link_topology_nearest(processor_handles[j], static_cast(topo_link_type), &topology_nearest_info); - CHK_AMDSMI_RET(ret); + if (ret != AMDSMI_STATUS_INVAL) { + 
CHK_AMDSMI_RET(ret); + } + printf("\tNearest GPUs found at %s\n", topology_link_type_str[topo_link_type]); + printf("\tNearest Count: %d\n", topology_nearest_info.count); for (uint32_t k = 0; k < topology_nearest_info.count; k++) { amdsmi_bdf_t bdf = {}; ret = amdsmi_get_gpu_device_bdf(topology_nearest_info.processor_list[k], &bdf); @@ -882,7 +1065,7 @@ int main() { bdf.bus_number, bdf.device_number, bdf.function_number); } } - } + } } // Clean up resources allocated at amdsmi_init. It will invalidate sockets diff --git a/example/amd_smi_nodrm_example.cc b/example/amd_smi_nodrm_example.cc index cd35ce1990..195b97e699 100644 --- a/example/amd_smi_nodrm_example.cc +++ b/example/amd_smi_nodrm_example.cc @@ -61,7 +61,7 @@ const char *err_str; \ std::cout << "AMDSMI call returned " << RET << " at line " \ << __LINE__ << std::endl; \ - amdsmi_status_code_to_string(RET, &err_str); \ + amdsmi_status_code_to_string(RET, &err_str); \ std::cout << err_str << std::endl; \ return RET; \ } \ @@ -262,8 +262,10 @@ int main() { char bad_page_status_names[3][15] = {"RESERVED", "PENDING", "UNRESERVABLE"}; uint32_t num_pages = 0; + std::vector bad_page_info(num_pages); ret = amdsmi_get_gpu_bad_page_info(processor_handles[j], &num_pages, - nullptr); + bad_page_info.data()); + std::cout << "num_pages = " << num_pages << "\n"; CHK_AMDSMI_RET(ret) printf(" Output of amdsmi_get_gpu_bad_page_info:\n"); if (!num_pages) { diff --git a/include/amd_smi/amdsmi.h b/include/amd_smi/amdsmi.h index 7b76235252..a1713b66e1 100644 --- a/include/amd_smi/amdsmi.h +++ b/include/amd_smi/amdsmi.h @@ -142,6 +142,29 @@ typedef enum { */ #define AMDSMI_MAX_NUM_JPEG 32 +/** + * @brief This should match AMDSMI_MAX_NUM_XCC; + * XCC - Accelerated Compute Core, the collection of compute units, + * ACE (Asynchronous Compute Engines), caches, + * and global resources organized as one unit. 
+ * + * Refer to amd.com documentation for more detail: + * https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf + */ +#define AMDSMI_MAX_NUM_XCC 8 + +/** + * @brief This should match AMDSMI_MAX_NUM_XCP; + * XCP - Accelerated Compute Processor, + * also referred to as the Graphics Compute Partitions. + * Each physical GPU could have a maximum of 8 separate partitions + * associated with each (depending on ASIC support). + * + * Refer to amd.com documentation for more detail: + * https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf + */ +#define AMDSMI_MAX_NUM_XCP 8 + /* string format */ #define AMDSMI_TIME_FORMAT "%02d:%02d:%02d.%03d" #define AMDSMI_DATE_FORMAT "%04d-%02d-%02d:%02d:%02d:%02d.%03d" @@ -503,7 +526,25 @@ typedef struct { uint32_t vram_used; uint32_t reserved[2]; } amdsmi_vram_usage_t; - +/** + * @brief This structure holds violation status information. + * Note: for MI3x ASICs and higher; older ASICs will show unsupported.
+ */ +typedef struct { + uint64_t reference_timestamp; //!< Represents CPU timestamp in microseconds (uS) + uint64_t violation_timestamp; //!< Violation time in milliseconds (ms) + uint64_t per_prochot_thrm; //!< Processor hot violation % (greater than 0% is a violation); Max uint64 means unsupported + uint64_t per_ppt_pwr; //!< PVIOL; Package Power Tracking (PPT) violation % (greater than 0% is a violation); Max uint64 means unsupported + uint64_t per_socket_thrm; //!< TVIOL; Socket thermal violation % (greater than 0% is a violation); Max uint64 means unsupported + uint64_t per_vr_thrm; //!< Voltage regulator violation % (greater than 0% is a violation); Max uint64 means unsupported + uint64_t per_hbm_thrm; //!< High Bandwidth Memory (HBM) thermal violation % (greater than 0% is a violation); Max uint64 means unsupported + uint8_t active_prochot_thrm; //!< Processor hot violation; 1 = active 0 = not active; Max uint8 means unsupported + uint8_t active_ppt_pwr; //!< Package Power Tracking (PPT) violation; 1 = active 0 = not active; Max uint8 means unsupported + uint8_t active_socket_thrm; //!< Socket thermal violation; 1 = active 0 = not active; Max uint8 means unsupported + uint8_t active_vr_thrm; //!< Voltage regulator violation; 1 = active 0 = not active; Max uint8 means unsupported + uint8_t active_hbm_thrm; //!< High Bandwidth Memory (HBM) thermal violation; 1 = active 0 = not active; Max uint8 means unsupported + uint64_t reserved[24]; // Reserved for new violation info +} amdsmi_violation_status_t; typedef struct { amdsmi_range_t supported_freq_range; amdsmi_range_t current_freq_range; @@ -544,7 +585,8 @@ typedef struct { uint64_t pcie_replay_roll_over_count; //!< total number of replay rollovers issued on the PCIe link uint64_t pcie_nak_sent_count; //!< total number of NAKs issued on the PCIe link by the device uint64_t pcie_nak_received_count; //!< total number of NAKs issued on the PCIe link by the receiver - uint64_t reserved[13]; + uint32_t 
pcie_lc_perf_other_end_recovery_count; //!< PCIe other end recovery counter + uint64_t reserved[12]; } pcie_metric; uint64_t reserved[32]; } amdsmi_pcie_info_t; @@ -617,7 +659,8 @@ typedef struct { typedef struct { uint64_t kfd_id; //< 0xFFFFFFFFFFFFFFFF if not supported uint32_t node_id; //< 0xFFFFFFFF if not supported - uint32_t reserved[13]; + uint32_t current_partition_id; //< 0xFFFFFFFF if not supported + uint32_t reserved[12]; } amdsmi_kfd_info_t; /** @@ -1383,6 +1426,21 @@ typedef struct { /// \endcond } amd_metrics_table_header_t; + +/** + * @brief The following structures hold the gpu statistics for a device. + */ +struct amdsmi_gpu_xcp_metrics_t { + /* Utilization Instantaneous (%) */ + uint32_t gfx_busy_inst[AMDSMI_MAX_NUM_XCC]; + uint16_t jpeg_busy[AMDSMI_MAX_NUM_JPEG]; + uint16_t vcn_busy[AMDSMI_MAX_NUM_VCN]; + + /* Utilization Accumulated (%) */ + uint64_t gfx_busy_acc[AMDSMI_MAX_NUM_XCC]; +}; + + typedef struct { // TODO(amd) Doxygen documents // Note: This structure is extended to fit the needs of different GPU metric @@ -1402,6 +1460,7 @@ typedef struct { /* * v1.0 Base */ + // Temperature (C) uint16_t temperature_edge; uint16_t temperature_hotspot; @@ -1494,10 +1553,10 @@ typedef struct { uint16_t xgmi_link_width; uint16_t xgmi_link_speed; - // PCIe accumulated bandwidth (GB/sec) + // PCIE accumulated bandwidth (GB/sec) uint64_t pcie_bandwidth_acc; - // PCIe instantaneous bandwidth (GB/sec) + // PCIE instantaneous bandwidth (GB/sec) uint64_t pcie_bandwidth_inst; // PCIE L0 to recovery state transition accumulated count @@ -1509,20 +1568,20 @@ typedef struct { // PCIE replay rollover accumulated count uint64_t pcie_replay_rover_count_acc; - // XGMI accumulated data transfer size (KB) + // XGMI accumulated data transfer size(KiloBytes) uint64_t xgmi_read_data_acc[AMDSMI_MAX_NUM_XGMI_LINKS]; uint64_t xgmi_write_data_acc[AMDSMI_MAX_NUM_XGMI_LINKS]; - // Current clock frequencies (MHz) + // XGMI accumulated data transfer size(KiloBytes) uint16_t 
current_gfxclks[AMDSMI_MAX_NUM_GFX_CLKS]; uint16_t current_socclks[AMDSMI_MAX_NUM_CLKS]; uint16_t current_vclk0s[AMDSMI_MAX_NUM_CLKS]; uint16_t current_dclk0s[AMDSMI_MAX_NUM_CLKS]; - /* + /* * v1.5 additions */ - // JPEG activity % per AID + // JPEG activity percent (encode/decode) uint16_t jpeg_activity[AMDSMI_MAX_NUM_JPEG]; // PCIE NAK sent accumulated count @@ -1530,6 +1589,59 @@ typedef struct { // PCIE NAK received accumulated count uint32_t pcie_nak_rcvd_count_acc; + + /* + * v1.6 additions + */ + /* Accumulation cycle counter */ + uint64_t accumulation_counter; + + /** + * Accumulated throttler residencies + */ + uint64_t prochot_residency_acc; + /** + * Accumulated throttler residencies + * + * Prochot (thermal) - PPT (power) + * Package Power Tracking (PPT) violation % (greater than 0% is a violation); + * aka PVIOL + * + * Ex. PVIOL/TVIOL calculations + * Where A and B are measurements recorded at prior points in time. + * Typically A is the earlier measured value and B is the latest measured value. + * + * PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A)) + * TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A)) + */ + uint64_t ppt_residency_acc; + /** + * Accumulated throttler residencies + * + * Socket (thermal) - + * Socket thermal violation % (greater than 0% is a violation); + * aka TVIOL + * + * Ex. PVIOL/TVIOL calculations + * Where A and B are measurements recorded at prior points in time. + * Typically A is the earlier measured value and B is the latest measured value.
+ * + * PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A)) + * TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A)) + */ + uint64_t socket_thm_residency_acc; + uint64_t vr_thm_residency_acc; + uint64_t hbm_thm_residency_acc; + + /* Number of current partition */ + uint16_t num_partition; + + /* XCP (Graphic Cluster Partitions) metrics stats */ + struct amdsmi_gpu_xcp_metrics_t xcp_stats[AMDSMI_MAX_NUM_XCP]; + + /* PCIE other end recovery counter */ + uint32_t pcie_lc_perf_other_end_recovery; + /// \endcond } amdsmi_gpu_metrics_t; @@ -5022,6 +5134,23 @@ amdsmi_get_clock_info(amdsmi_processor_handle processor_handle, amdsmi_clk_type_ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_handle, amdsmi_vram_usage_t *info); +/** + * @brief Returns the violations for a processor + * + * @platform{gpu_bm_linux} @platform{host} @platform{guest_1vf} @platform{guest_mvf} + * + * @param[in] processor_handle Device which to query + * + * + * @param[in,out] info Reference to all violation status details available. + * Must be allocated by user. 
+ * + * @return ::amdsmi_status_t | ::AMDSMI_STATUS_SUCCESS on success, non-zero on fail + */ +amdsmi_status_t +amdsmi_get_violation_status(amdsmi_processor_handle processor_handle, + amdsmi_violation_status_t *info); + /** @} End gpumon */ diff --git a/py-interface/__init__.py b/py-interface/__init__.py index e731120eda..ca2b4754fe 100644 --- a/py-interface/__init__.py +++ b/py-interface/__init__.py @@ -106,6 +106,7 @@ from .amdsmi_interface import amdsmi_get_clock_info from .amdsmi_interface import amdsmi_get_pcie_info from .amdsmi_interface import amdsmi_get_gpu_bad_page_info +from .amdsmi_interface import amdsmi_get_violation_status # # Process Information from .amdsmi_interface import amdsmi_get_gpu_process_list diff --git a/py-interface/amdsmi_interface.py b/py-interface/amdsmi_interface.py index 000c22b853..b35da33e1c 100644 --- a/py-interface/amdsmi_interface.py +++ b/py-interface/amdsmi_interface.py @@ -42,6 +42,8 @@ AMDSMI_MAX_NUM_GFX_CLKS = 8 AMDSMI_MAX_AID = 4 AMDSMI_MAX_ENGINES = 8 AMDSMI_MAX_NUM_JPEG = 32 +AMDSMI_MAX_NUM_XCC = 8 +AMDSMI_MAX_NUM_XCP = 8 # Max number of DPM policies AMDSMI_MAX_NUM_PM_POLICIES = 32 @@ -604,19 +606,27 @@ class MaxUIntegerTypes(IntEnum): UINT32_T = 0xFFFFFFFF UINT64_T = 0xFFFFFFFFFFFFFFFF -def _validate_if_max_uint(value, uint_type: MaxUIntegerTypes): +def _validate_if_max_uint(value, uint_type: MaxUIntegerTypes, isActivity=False, isBool=False): return_val = "N/A" if not isinstance(value, list): - if value == uint_type: + if (value == uint_type) or (isActivity and value > 100): return return_val else: - return value + if isBool: + return bool(value) + else: + return value else: - return_val = value - for idx, v in enumerate(value): - if v == uint_type: - return_val[idx] = "N/A" - return return_val + return_val = [] + for _, v in enumerate(value): + if (v == uint_type) or (isActivity and v > 100): + return_val.append("N/A") + else: + return_val.append(v) + if isBool: + return bool(return_val) + else: + return return_val def 
amdsmi_get_socket_handles() -> List[amdsmi_wrapper.amdsmi_socket_handle]: @@ -1725,8 +1735,9 @@ def amdsmi_get_gpu_kfd_info( ) kfd_info = { - "kfd_id": _validate_if_max_uint(kfd_info_struct.kfd_id, MaxUIntegerTypes.UINT32_T), - "node_id": _validate_if_max_uint(kfd_info_struct.node_id, MaxUIntegerTypes.UINT64_T) + "kfd_id": _validate_if_max_uint(kfd_info_struct.kfd_id, MaxUIntegerTypes.UINT64_T), + "node_id": _validate_if_max_uint(kfd_info_struct.node_id, MaxUIntegerTypes.UINT32_T), + "current_partition_id": _validate_if_max_uint(kfd_info_struct.current_partition_id, MaxUIntegerTypes.UINT32_T) } return kfd_info @@ -1992,6 +2003,35 @@ def amdsmi_get_gpu_bad_page_info( return _format_bad_page_info(bad_pages, num_pages) +def amdsmi_get_violation_status( + processor_handle: amdsmi_wrapper.amdsmi_processor_handle, +) -> Dict[str, Any]: + if not isinstance(processor_handle, amdsmi_wrapper.amdsmi_processor_handle): + raise AmdSmiParameterException( + processor_handle, amdsmi_wrapper.amdsmi_processor_handle + ) + + violation_status = amdsmi_wrapper.amdsmi_violation_status_t() + _check_res( + amdsmi_wrapper.amdsmi_get_violation_status( + processor_handle, ctypes.byref(violation_status)) + ) + + return { + "reference_timestamp": _validate_if_max_uint(violation_status.reference_timestamp, MaxUIntegerTypes.UINT64_T), + "violation_timestamp": _validate_if_max_uint(violation_status.violation_timestamp, MaxUIntegerTypes.UINT64_T), + "per_prochot_thrm": _validate_if_max_uint(violation_status.per_prochot_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True), + "per_ppt_pwr": _validate_if_max_uint(violation_status.per_ppt_pwr, MaxUIntegerTypes.UINT64_T, isActivity=True), #PVIOL + "per_socket_thrm": _validate_if_max_uint(violation_status.per_socket_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True), #TVIOL + "per_vr_thrm": _validate_if_max_uint(violation_status.per_vr_thrm, MaxUIntegerTypes.UINT64_T, isActivity=True), + "per_hbm_thrm": _validate_if_max_uint(violation_status.per_hbm_thrm, 
MaxUIntegerTypes.UINT64_T, isActivity=True), + "active_prochot_thrm": _validate_if_max_uint(violation_status.active_prochot_thrm, MaxUIntegerTypes.UINT8_T, isBool=True), + "active_ppt_pwr": _validate_if_max_uint(violation_status.active_ppt_pwr, MaxUIntegerTypes.UINT8_T, isBool=True), #PVIOL + "active_socket_thrm": _validate_if_max_uint(violation_status.active_socket_thrm, MaxUIntegerTypes.UINT8_T, isBool=True), #TVIOL + "active_vr_thrm": _validate_if_max_uint(violation_status.active_vr_thrm, MaxUIntegerTypes.UINT8_T, isBool=True), + "active_hbm_thrm": _validate_if_max_uint(violation_status.active_hbm_thrm, MaxUIntegerTypes.UINT8_T, isBool=True) + } + def amdsmi_get_gpu_total_ecc_count( processor_handle: amdsmi_wrapper.amdsmi_processor_handle, ) -> Dict[str, Any]: @@ -2345,6 +2385,7 @@ def amdsmi_get_pcie_info( "pcie_replay_roll_over_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_replay_roll_over_count, MaxUIntegerTypes.UINT64_T), "pcie_nak_sent_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_nak_sent_count, MaxUIntegerTypes.UINT64_T), "pcie_nak_received_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_nak_received_count, MaxUIntegerTypes.UINT64_T), + "pcie_lc_perf_other_end_recovery_count": _validate_if_max_uint(pcie_info.pcie_metric.pcie_lc_perf_other_end_recovery_count, MaxUIntegerTypes.UINT32_T) } } @@ -3773,130 +3814,104 @@ def amdsmi_get_gpu_metrics_info( ) gpu_metrics_output = { - "temperature_edge": gpu_metrics.temperature_edge, - "temperature_hotspot": gpu_metrics.temperature_hotspot, - "temperature_mem": gpu_metrics.temperature_mem, - "temperature_vrgfx": gpu_metrics.temperature_vrgfx, - "temperature_vrsoc": gpu_metrics.temperature_vrsoc, - "temperature_vrmem": gpu_metrics.temperature_vrmem, - "average_gfx_activity": gpu_metrics.average_gfx_activity, - "average_umc_activity": gpu_metrics.average_umc_activity, - "average_mm_activity": gpu_metrics.average_mm_activity, - "average_socket_power": gpu_metrics.average_socket_power, - 
"energy_accumulator": gpu_metrics.energy_accumulator, - "system_clock_counter": gpu_metrics.system_clock_counter, - "average_gfxclk_frequency": gpu_metrics.average_gfxclk_frequency, - "average_socclk_frequency": gpu_metrics.average_socclk_frequency, - "average_uclk_frequency": gpu_metrics.average_uclk_frequency, - "average_vclk0_frequency": gpu_metrics.average_vclk0_frequency, - "average_dclk0_frequency": gpu_metrics.average_dclk0_frequency, - "average_vclk1_frequency": gpu_metrics.average_vclk1_frequency, - "average_dclk1_frequency": gpu_metrics.average_dclk1_frequency, - "current_gfxclk": gpu_metrics.current_gfxclk, - "current_socclk": gpu_metrics.current_socclk, - "current_uclk": gpu_metrics.current_uclk, - "current_vclk0": gpu_metrics.current_vclk0, - "current_dclk0": gpu_metrics.current_dclk0, - "current_vclk1": gpu_metrics.current_vclk1, - "current_dclk1": gpu_metrics.current_dclk1, - "throttle_status": gpu_metrics.throttle_status, - "current_fan_speed": gpu_metrics.current_fan_speed, - "pcie_link_width": gpu_metrics.pcie_link_width, - "pcie_link_speed": gpu_metrics.pcie_link_speed, - "gfx_activity_acc": gpu_metrics.gfx_activity_acc, - "mem_activity_acc": gpu_metrics.mem_activity_acc, - "temperature_hbm": list(gpu_metrics.temperature_hbm), - "firmware_timestamp": gpu_metrics.firmware_timestamp, - "voltage_soc": gpu_metrics.voltage_soc, - "voltage_gfx": gpu_metrics.voltage_gfx, - "voltage_mem": gpu_metrics.voltage_mem, - "indep_throttle_status": gpu_metrics.indep_throttle_status, - "current_socket_power": gpu_metrics.current_socket_power, - "vcn_activity": list(gpu_metrics.vcn_activity), - "gfxclk_lock_status": gpu_metrics.gfxclk_lock_status, - "xgmi_link_width": gpu_metrics.xgmi_link_width, - "xgmi_link_speed": gpu_metrics.xgmi_link_speed, - "pcie_bandwidth_acc": gpu_metrics.pcie_bandwidth_acc, - "pcie_bandwidth_inst": gpu_metrics.pcie_bandwidth_inst, - "pcie_l0_to_recov_count_acc": gpu_metrics.pcie_l0_to_recov_count_acc, - "pcie_replay_count_acc": 
gpu_metrics.pcie_replay_count_acc, - "pcie_replay_rover_count_acc": gpu_metrics.pcie_replay_rover_count_acc, - "xgmi_read_data_acc": list(gpu_metrics.xgmi_read_data_acc), - "xgmi_write_data_acc": list(gpu_metrics.xgmi_write_data_acc), - "current_gfxclks": list(gpu_metrics.current_gfxclks), - "current_socclks": list(gpu_metrics.current_socclks), - "current_vclk0s": list(gpu_metrics.current_vclk0s), - "current_dclk0s": list(gpu_metrics.current_dclk0s), - "pcie_nak_sent_count_acc": gpu_metrics.pcie_nak_sent_count_acc, - "pcie_nak_rcvd_count_acc": gpu_metrics.pcie_nak_rcvd_count_acc, - "jpeg_activity": list(gpu_metrics.jpeg_activity), + "temperature_edge": _validate_if_max_uint(gpu_metrics.temperature_edge, MaxUIntegerTypes.UINT16_T), + "temperature_hotspot": _validate_if_max_uint(gpu_metrics.temperature_hotspot, MaxUIntegerTypes.UINT16_T), + "temperature_mem": _validate_if_max_uint(gpu_metrics.temperature_mem, MaxUIntegerTypes.UINT16_T), + "temperature_vrgfx": _validate_if_max_uint(gpu_metrics.temperature_vrgfx, MaxUIntegerTypes.UINT16_T), + "temperature_vrsoc": _validate_if_max_uint(gpu_metrics.temperature_vrsoc, MaxUIntegerTypes.UINT16_T), + "temperature_vrmem": _validate_if_max_uint(gpu_metrics.temperature_vrmem, MaxUIntegerTypes.UINT16_T), + "average_gfx_activity": _validate_if_max_uint(gpu_metrics.average_gfx_activity, MaxUIntegerTypes.UINT16_T, isActivity=True), + "average_umc_activity": _validate_if_max_uint(gpu_metrics.average_umc_activity, MaxUIntegerTypes.UINT16_T, isActivity=True), + "average_mm_activity": _validate_if_max_uint(gpu_metrics.average_mm_activity, MaxUIntegerTypes.UINT16_T, isActivity=True), + "average_socket_power": _validate_if_max_uint(gpu_metrics.average_socket_power, MaxUIntegerTypes.UINT16_T), + "energy_accumulator": _validate_if_max_uint(gpu_metrics.energy_accumulator, MaxUIntegerTypes.UINT64_T), + "system_clock_counter": _validate_if_max_uint(gpu_metrics.system_clock_counter, MaxUIntegerTypes.UINT64_T), + "average_gfxclk_frequency": 
_validate_if_max_uint(gpu_metrics.average_gfxclk_frequency, MaxUIntegerTypes.UINT16_T), + "average_socclk_frequency": _validate_if_max_uint(gpu_metrics.average_socclk_frequency, MaxUIntegerTypes.UINT16_T), + "average_uclk_frequency": _validate_if_max_uint(gpu_metrics.average_uclk_frequency, MaxUIntegerTypes.UINT16_T), + "average_vclk0_frequency": _validate_if_max_uint(gpu_metrics.average_vclk0_frequency, MaxUIntegerTypes.UINT16_T), + "average_dclk0_frequency": _validate_if_max_uint(gpu_metrics.average_dclk0_frequency, MaxUIntegerTypes.UINT16_T), + "average_vclk1_frequency": _validate_if_max_uint(gpu_metrics.average_vclk1_frequency, MaxUIntegerTypes.UINT16_T), + "average_dclk1_frequency": _validate_if_max_uint(gpu_metrics.average_dclk1_frequency, MaxUIntegerTypes.UINT16_T), + "current_gfxclk": _validate_if_max_uint(gpu_metrics.current_gfxclk, MaxUIntegerTypes.UINT16_T), + "current_socclk": _validate_if_max_uint(gpu_metrics.current_socclk, MaxUIntegerTypes.UINT16_T), + "current_uclk": _validate_if_max_uint(gpu_metrics.current_uclk, MaxUIntegerTypes.UINT16_T), + "current_vclk0": _validate_if_max_uint(gpu_metrics.current_vclk0, MaxUIntegerTypes.UINT16_T), + "current_dclk0": _validate_if_max_uint(gpu_metrics.current_dclk0, MaxUIntegerTypes.UINT16_T), + "current_vclk1": _validate_if_max_uint(gpu_metrics.current_vclk1, MaxUIntegerTypes.UINT16_T), + "current_dclk1": _validate_if_max_uint(gpu_metrics.current_dclk1, MaxUIntegerTypes.UINT16_T), + "throttle_status": _validate_if_max_uint(gpu_metrics.throttle_status, MaxUIntegerTypes.UINT32_T, isBool=True), + "current_fan_speed": _validate_if_max_uint(gpu_metrics.current_fan_speed, MaxUIntegerTypes.UINT16_T), + "pcie_link_width": _validate_if_max_uint(gpu_metrics.pcie_link_width, MaxUIntegerTypes.UINT16_T), + "pcie_link_speed": _validate_if_max_uint(gpu_metrics.pcie_link_speed, MaxUIntegerTypes.UINT16_T), + "gfx_activity_acc": _validate_if_max_uint(gpu_metrics.gfx_activity_acc, MaxUIntegerTypes.UINT32_T), + "mem_activity_acc": 
_validate_if_max_uint(gpu_metrics.mem_activity_acc, MaxUIntegerTypes.UINT32_T), + "temperature_hbm": _validate_if_max_uint(list(gpu_metrics.temperature_hbm), MaxUIntegerTypes.UINT16_T), + "firmware_timestamp": _validate_if_max_uint(gpu_metrics.firmware_timestamp, MaxUIntegerTypes.UINT64_T), + "voltage_soc": _validate_if_max_uint(gpu_metrics.voltage_soc, MaxUIntegerTypes.UINT16_T), + "voltage_gfx": _validate_if_max_uint(gpu_metrics.voltage_gfx, MaxUIntegerTypes.UINT16_T), + "voltage_mem": _validate_if_max_uint(gpu_metrics.voltage_mem, MaxUIntegerTypes.UINT16_T), + "indep_throttle_status": _validate_if_max_uint(gpu_metrics.indep_throttle_status, MaxUIntegerTypes.UINT64_T, isBool=True), + "current_socket_power": _validate_if_max_uint(gpu_metrics.current_socket_power, MaxUIntegerTypes.UINT16_T), + "vcn_activity": _validate_if_max_uint(list(gpu_metrics.vcn_activity), MaxUIntegerTypes.UINT16_T, isActivity=True), + "gfxclk_lock_status": _validate_if_max_uint(gpu_metrics.gfxclk_lock_status, MaxUIntegerTypes.UINT32_T), + "xgmi_link_width": _validate_if_max_uint(gpu_metrics.xgmi_link_width, MaxUIntegerTypes.UINT16_T), + "xgmi_link_speed": _validate_if_max_uint(gpu_metrics.xgmi_link_speed, MaxUIntegerTypes.UINT16_T), + "pcie_bandwidth_acc": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_acc, MaxUIntegerTypes.UINT64_T), + "pcie_bandwidth_inst": _validate_if_max_uint(gpu_metrics.pcie_bandwidth_inst, MaxUIntegerTypes.UINT64_T), + "pcie_l0_to_recov_count_acc": _validate_if_max_uint(gpu_metrics.pcie_l0_to_recov_count_acc, MaxUIntegerTypes.UINT64_T), + "pcie_replay_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_count_acc, MaxUIntegerTypes.UINT64_T), + "pcie_replay_rover_count_acc": _validate_if_max_uint(gpu_metrics.pcie_replay_rover_count_acc, MaxUIntegerTypes.UINT64_T), + "xgmi_read_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_read_data_acc), MaxUIntegerTypes.UINT64_T), + "xgmi_write_data_acc": _validate_if_max_uint(list(gpu_metrics.xgmi_write_data_acc), 
MaxUIntegerTypes.UINT64_T), + "current_gfxclks": _validate_if_max_uint(list(gpu_metrics.current_gfxclks), MaxUIntegerTypes.UINT16_T), + "current_socclks": _validate_if_max_uint(list(gpu_metrics.current_socclks), MaxUIntegerTypes.UINT16_T), + "current_vclk0s": _validate_if_max_uint(list(gpu_metrics.current_vclk0s), MaxUIntegerTypes.UINT16_T), + "current_dclk0s": _validate_if_max_uint(list(gpu_metrics.current_dclk0s), MaxUIntegerTypes.UINT16_T), + "jpeg_activity": _validate_if_max_uint(list(gpu_metrics.jpeg_activity), MaxUIntegerTypes.UINT16_T, isActivity=True), + "pcie_nak_sent_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_sent_count_acc, MaxUIntegerTypes.UINT32_T), + "pcie_nak_rcvd_count_acc": _validate_if_max_uint(gpu_metrics.pcie_nak_rcvd_count_acc, MaxUIntegerTypes.UINT32_T), + "accumulation_counter": _validate_if_max_uint(gpu_metrics.accumulation_counter, MaxUIntegerTypes.UINT64_T), + "prochot_residency_acc": _validate_if_max_uint(gpu_metrics.prochot_residency_acc, MaxUIntegerTypes.UINT64_T), + "ppt_residency_acc": _validate_if_max_uint(gpu_metrics.ppt_residency_acc, MaxUIntegerTypes.UINT64_T), + "socket_thm_residency_acc": _validate_if_max_uint(gpu_metrics.socket_thm_residency_acc, MaxUIntegerTypes.UINT64_T), + "vr_thm_residency_acc": _validate_if_max_uint(gpu_metrics.vr_thm_residency_acc, MaxUIntegerTypes.UINT64_T), + "hbm_thm_residency_acc": _validate_if_max_uint(gpu_metrics.hbm_thm_residency_acc, MaxUIntegerTypes.UINT64_T), + "num_partition": _validate_if_max_uint(gpu_metrics.num_partition, MaxUIntegerTypes.UINT16_T), + "xcp_stats.gfx_busy_inst": list(gpu_metrics.xcp_stats), + "xcp_stats.jpeg_busy": list(gpu_metrics.xcp_stats), + "xcp_stats.vcn_busy": list(gpu_metrics.xcp_stats), + "xcp_stats.gfx_busy_acc": list(gpu_metrics.xcp_stats), + "pcie_lc_perf_other_end_recovery": _validate_if_max_uint(gpu_metrics.pcie_lc_perf_other_end_recovery, MaxUIntegerTypes.UINT32_T), } - # Validate support for each gpu_metric - uint_16_metrics = ['temperature_edge', 
'temperature_hotspot', 'temperature_mem', - 'temperature_vrgfx', 'temperature_vrsoc', 'temperature_vrmem', - 'average_gfx_activity', 'average_umc_activity', 'average_mm_activity', - 'average_socket_power', 'average_gfxclk_frequency', 'average_socclk_frequency', - 'average_uclk_frequency', 'average_vclk0_frequency', 'average_dclk0_frequency', - 'average_vclk1_frequency', 'average_dclk1_frequency', 'current_gfxclk', - 'current_socclk', 'current_uclk', 'current_vclk0', 'current_dclk0', - 'current_vclk1', 'current_dclk1', 'current_fan_speed', 'pcie_link_width', - 'pcie_link_speed', 'voltage_soc', 'voltage_gfx', 'voltage_mem', - 'current_socket_power', 'xgmi_link_width', 'xgmi_link_speed'] - for metric in uint_16_metrics: - if gpu_metrics_output[metric] == 0xFFFF: - gpu_metrics_output[metric] = "N/A" - - uint_32_metrics = ['gfx_activity_acc','mem_activity_acc', 'pcie_nak_sent_count_acc', 'pcie_nak_rcvd_count_acc', 'gfxclk_lock_status'] - for metric in uint_32_metrics: - if gpu_metrics_output[metric] == 0xFFFFFFFF: - gpu_metrics_output[metric] = "N/A" - - uint_64_metrics = ['energy_accumulator', 'system_clock_counter', 'firmware_timestamp', - 'pcie_bandwidth_acc', 'pcie_bandwidth_inst', - 'pcie_l0_to_recov_count_acc', 'pcie_replay_count_acc', - 'pcie_replay_rover_count_acc'] - for metric in uint_64_metrics: - if gpu_metrics_output[metric] == 0xFFFFFFFFFFFFFFFF: - gpu_metrics_output[metric] = "N/A" - - # Custom validation for metrics in a bool format - uint_32_bool_metrics = ['throttle_status'] - for metric in uint_32_bool_metrics: - if gpu_metrics_output[metric] == 0xFFFFFFFF: - gpu_metrics_output[metric] = "N/A" - else: - gpu_metrics_output[metric] = bool(gpu_metrics_output[metric]) - - # Custom validation for metrics in a list format - uint_16_clock_list_metrics = ['current_gfxclks', 'current_socclks', 'current_vclk0s', 'current_dclk0s'] - for clock in uint_16_clock_list_metrics: - for index, clk in enumerate(gpu_metrics_output[clock]): - if clk == 0xFFFF: - 
gpu_metrics_output[clock][index] = "N/A" - - uint_16_activity_list_metrics = ['vcn_activity', 'jpeg_activity'] - for activity_metric in uint_16_activity_list_metrics: - for index, activity in enumerate(gpu_metrics_output[activity_metric]): - if activity == 0xFFFF or activity > 110: - gpu_metrics_output[activity_metric][index] = "N/A" - - uint_64_xgmi_metrics = ['xgmi_read_data_acc', 'xgmi_write_data_acc'] - for metric in uint_64_xgmi_metrics: - for index, data in enumerate(gpu_metrics_output[metric]): - if data == 0xFFFFFFFFFFFFFFFF: - gpu_metrics_output[metric][index] = "N/A" - - # Custom validation for specific gpu_metrics - for index, temp in enumerate(gpu_metrics_output['temperature_hbm']): - if temp == 0xFFFF: - gpu_metrics_output['temperature_hbm'][index] = "N/A" - - if gpu_metrics_output['indep_throttle_status'] == 0xFFFFFFFFFFFFFFFF: - gpu_metrics_output['indep_throttle_status'] = "N/A" - else: - gpu_metrics_output['indep_throttle_status'] = bool(gpu_metrics_output['indep_throttle_status']) - + # Create 2d array with each XCD's stats + for k,v in gpu_metrics_output.items(): + if 'xcp_stats' in k: + if 'xcp_stats.gfx_busy_inst' in k: + for curr_xcp, item in enumerate(v): + print_xcp_detail = [] + for val in item.gfx_busy_inst: + print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT32_T, isActivity=True)) + gpu_metrics_output[k][curr_xcp] = print_xcp_detail + if 'xcp_stats.jpeg_busy' in k: + for curr_xcp, item in enumerate(v): + print_xcp_detail = [] + for val in item.jpeg_busy: + print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True)) + gpu_metrics_output[k][curr_xcp] = print_xcp_detail + if 'xcp_stats.vcn_busy' in k: + for curr_xcp, item in enumerate(v): + print_xcp_detail = [] + for val in item.vcn_busy: + print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT16_T, isActivity=True)) + gpu_metrics_output[k][curr_xcp] = print_xcp_detail + if 'xcp_stats.gfx_busy_acc' in k: + for 
curr_xcp, item in enumerate(v): + print_xcp_detail = [] + for val in item.gfx_busy_acc: + print_xcp_detail.append(_validate_if_max_uint(val, MaxUIntegerTypes.UINT64_T, isActivity=True)) + gpu_metrics_output[k][curr_xcp] = print_xcp_detail return gpu_metrics_output diff --git a/py-interface/amdsmi_wrapper.py b/py-interface/amdsmi_wrapper.py index d8a169fca1..e68b0bc2aa 100644 --- a/py-interface/amdsmi_wrapper.py +++ b/py-interface/amdsmi_wrapper.py @@ -727,6 +727,28 @@ struct_amdsmi_vram_usage_t._fields_ = [ ] amdsmi_vram_usage_t = struct_amdsmi_vram_usage_t +class struct_amdsmi_violation_status_t(Structure): + pass + +struct_amdsmi_violation_status_t._pack_ = 1 # source:False +struct_amdsmi_violation_status_t._fields_ = [ + ('reference_timestamp', ctypes.c_uint64), + ('violation_timestamp', ctypes.c_uint64), + ('per_prochot_thrm', ctypes.c_uint64), + ('per_ppt_pwr', ctypes.c_uint64), + ('per_socket_thrm', ctypes.c_uint64), + ('per_vr_thrm', ctypes.c_uint64), + ('per_hbm_thrm', ctypes.c_uint64), + ('active_prochot_thrm', ctypes.c_ubyte), + ('active_ppt_pwr', ctypes.c_ubyte), + ('active_socket_thrm', ctypes.c_ubyte), + ('active_vr_thrm', ctypes.c_ubyte), + ('active_hbm_thrm', ctypes.c_ubyte), + ('PADDING_0', ctypes.c_ubyte * 3), + ('reserved', ctypes.c_uint64 * 24), +] + +amdsmi_violation_status_t = struct_amdsmi_violation_status_t class struct_amdsmi_frequency_range_t(Structure): pass @@ -804,7 +826,9 @@ struct_pcie_metric_._fields_ = [ ('pcie_replay_roll_over_count', ctypes.c_uint64), ('pcie_nak_sent_count', ctypes.c_uint64), ('pcie_nak_received_count', ctypes.c_uint64), - ('reserved', ctypes.c_uint64 * 13), + ('pcie_lc_perf_other_end_recovery_count', ctypes.c_uint32), + ('PADDING_2', ctypes.c_ubyte * 4), + ('reserved', ctypes.c_uint64 * 12), ] struct_amdsmi_pcie_info_t._pack_ = 1 # source:False @@ -933,7 +957,8 @@ struct_amdsmi_kfd_info_t._pack_ = 1 # source:False struct_amdsmi_kfd_info_t._fields_ = [ ('kfd_id', ctypes.c_uint64), ('node_id', ctypes.c_uint32), - 
('reserved', ctypes.c_uint32 * 13), + ('current_partition_id', ctypes.c_uint32), + ('reserved', ctypes.c_uint32 * 12), ] amdsmi_kfd_info_t = struct_amdsmi_kfd_info_t @@ -1102,16 +1127,6 @@ amdsmi_process_handle_t = ctypes.c_uint32 class struct_amdsmi_proc_info_t(Structure): pass -class struct_engine_usage_(Structure): - pass - -struct_engine_usage_._pack_ = 1 # source:False -struct_engine_usage_._fields_ = [ - ('gfx', ctypes.c_uint64), - ('enc', ctypes.c_uint64), - ('reserved', ctypes.c_uint32 * 12), -] - class struct_memory_usage_(Structure): pass @@ -1123,6 +1138,16 @@ struct_memory_usage_._fields_ = [ ('reserved', ctypes.c_uint32 * 10), ] +class struct_engine_usage_(Structure): + pass + +struct_engine_usage_._pack_ = 1 # source:False +struct_engine_usage_._fields_ = [ + ('gfx', ctypes.c_uint64), + ('enc', ctypes.c_uint64), + ('reserved', ctypes.c_uint32 * 12), +] + struct_amdsmi_proc_info_t._pack_ = 1 # source:False struct_amdsmi_proc_info_t._fields_ = [ ('name', ctypes.c_char * 32), @@ -1713,6 +1738,17 @@ struct_amd_metrics_table_header_t._fields_ = [ ] amd_metrics_table_header_t = struct_amd_metrics_table_header_t +class struct_amdsmi_gpu_xcp_metrics_t(Structure): + pass + +struct_amdsmi_gpu_xcp_metrics_t._pack_ = 1 # source:False +struct_amdsmi_gpu_xcp_metrics_t._fields_ = [ + ('gfx_busy_inst', ctypes.c_uint32 * 8), + ('jpeg_busy', ctypes.c_uint16 * 32), + ('vcn_busy', ctypes.c_uint16 * 4), + ('gfx_busy_acc', ctypes.c_uint64 * 8), +] + class struct_amdsmi_gpu_metrics_t(Structure): pass @@ -1780,6 +1816,17 @@ struct_amdsmi_gpu_metrics_t._fields_ = [ ('jpeg_activity', ctypes.c_uint16 * 32), ('pcie_nak_sent_count_acc', ctypes.c_uint32), ('pcie_nak_rcvd_count_acc', ctypes.c_uint32), + ('accumulation_counter', ctypes.c_uint64), + ('prochot_residency_acc', ctypes.c_uint64), + ('ppt_residency_acc', ctypes.c_uint64), + ('socket_thm_residency_acc', ctypes.c_uint64), + ('vr_thm_residency_acc', ctypes.c_uint64), + ('hbm_thm_residency_acc', ctypes.c_uint64), + 
('num_partition', ctypes.c_uint16), + ('PADDING_4', ctypes.c_ubyte * 6), + ('xcp_stats', struct_amdsmi_gpu_xcp_metrics_t * 8), + ('pcie_lc_perf_other_end_recovery', ctypes.c_uint32), + ('PADDING_5', ctypes.c_ubyte * 4), ] amdsmi_gpu_metrics_t = struct_amdsmi_gpu_metrics_t @@ -1852,8 +1899,7 @@ struct_amdsmi_topology_nearest_t._fields_ = [ ('count', ctypes.c_uint32), ('PADDING_0', ctypes.c_ubyte * 4), ('processor_list', ctypes.POINTER(None) * 32), - ('reserved', ctypes.c_uint32 * 15), - ('PADDING_1', ctypes.c_ubyte * 4), + ('reserved', ctypes.c_uint64 * 15), ] amdsmi_topology_nearest_t = struct_amdsmi_topology_nearest_t @@ -2385,6 +2431,9 @@ amdsmi_get_clock_info.argtypes = [amdsmi_processor_handle, amdsmi_clk_type_t, ct amdsmi_get_gpu_vram_usage = _libraries['libamd_smi.so'].amdsmi_get_gpu_vram_usage amdsmi_get_gpu_vram_usage.restype = amdsmi_status_t amdsmi_get_gpu_vram_usage.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_vram_usage_t)] +amdsmi_get_violation_status = _libraries['libamd_smi.so'].amdsmi_get_violation_status +amdsmi_get_violation_status.restype = amdsmi_status_t +amdsmi_get_violation_status.argtypes = [amdsmi_processor_handle, ctypes.POINTER(struct_amdsmi_violation_status_t)] amdsmi_get_gpu_process_list = _libraries['libamd_smi.so'].amdsmi_get_gpu_process_list amdsmi_get_gpu_process_list.restype = amdsmi_status_t amdsmi_get_gpu_process_list.argtypes = [amdsmi_processor_handle, ctypes.POINTER(ctypes.c_uint32), ctypes.POINTER(struct_amdsmi_proc_info_t)] @@ -2828,9 +2877,9 @@ __all__ = \ 'amdsmi_get_soc_pstate', 'amdsmi_get_socket_handles', 'amdsmi_get_socket_info', 'amdsmi_get_temp_metric', 'amdsmi_get_threads_per_core', 'amdsmi_get_utilization_count', - 'amdsmi_get_xgmi_info', 'amdsmi_get_xgmi_plpd', - 'amdsmi_gpu_block_t', 'amdsmi_gpu_cache_info_t', - 'amdsmi_gpu_control_counter', + 'amdsmi_get_violation_status', 'amdsmi_get_xgmi_info', + 'amdsmi_get_xgmi_plpd', 'amdsmi_gpu_block_t', + 'amdsmi_gpu_cache_info_t', 
'amdsmi_gpu_control_counter', 'amdsmi_gpu_counter_group_supported', 'amdsmi_gpu_create_counter', 'amdsmi_gpu_destroy_counter', 'amdsmi_gpu_metrics_t', 'amdsmi_gpu_read_counter', 'amdsmi_gpu_xgmi_error_status', @@ -2883,9 +2932,9 @@ __all__ = \ 'amdsmi_topo_get_p2p_status', 'amdsmi_topology_nearest_t', 'amdsmi_utilization_counter_t', 'amdsmi_utilization_counter_type_t', 'amdsmi_vbios_info_t', - 'amdsmi_version_t', 'amdsmi_voltage_metric_t', - 'amdsmi_voltage_type_t', 'amdsmi_vram_info_t', - 'amdsmi_vram_type_t', 'amdsmi_vram_usage_t', + 'amdsmi_version_t', 'amdsmi_violation_status_t', + 'amdsmi_voltage_metric_t', 'amdsmi_voltage_type_t', + 'amdsmi_vram_info_t', 'amdsmi_vram_type_t', 'amdsmi_vram_usage_t', 'amdsmi_vram_vendor_type_t', 'amdsmi_xgmi_info_t', 'amdsmi_xgmi_status_t', 'processor_type_t', 'size_t', 'struct__links', 'struct_amd_metrics_table_header_t', @@ -2901,6 +2950,7 @@ __all__ = \ 'struct_amdsmi_freq_volt_region_t', 'struct_amdsmi_frequencies_t', 'struct_amdsmi_frequency_range_t', 'struct_amdsmi_fw_info_t', 'struct_amdsmi_gpu_cache_info_t', 'struct_amdsmi_gpu_metrics_t', + 'struct_amdsmi_gpu_xcp_metrics_t', 'struct_amdsmi_hsmp_metrics_table_t', 'struct_amdsmi_kfd_info_t', 'struct_amdsmi_link_id_bw_type_t', 'struct_amdsmi_link_metrics_t', 'struct_amdsmi_name_value_t', 'struct_amdsmi_od_vddc_point_t', @@ -2918,9 +2968,9 @@ __all__ = \ 'struct_amdsmi_topology_nearest_t', 'struct_amdsmi_utilization_counter_t', 'struct_amdsmi_vbios_info_t', 'struct_amdsmi_version_t', - 'struct_amdsmi_vram_info_t', 'struct_amdsmi_vram_usage_t', - 'struct_amdsmi_xgmi_info_t', 'struct_cache_', - 'struct_engine_usage_', 'struct_fw_info_list_', + 'struct_amdsmi_violation_status_t', 'struct_amdsmi_vram_info_t', + 'struct_amdsmi_vram_usage_t', 'struct_amdsmi_xgmi_info_t', + 'struct_cache_', 'struct_engine_usage_', 'struct_fw_info_list_', 'struct_memory_usage_', 'struct_nps_flags_', 'struct_pcie_metric_', 'struct_pcie_static_', 'struct_amdsmi_bdf_t','uint32_t', 'uint64_t', 
'uint8_t', diff --git a/rocm_smi/example/rocm_smi_example.cc b/rocm_smi/example/rocm_smi_example.cc index 0aed74aec6..4997a4bc28 100644 --- a/rocm_smi/example/rocm_smi_example.cc +++ b/rocm_smi/example/rocm_smi_example.cc @@ -937,6 +937,22 @@ int main() { << gpu_metrics.pcie_replay_count_acc << "\n"; std::cout << "\t**.pcie_replay_rover_count_acc : " << std::dec << gpu_metrics.pcie_replay_rover_count_acc << "\n"; + std::cout << "\t**.accumulation_counter : " << std::dec + << gpu_metrics.accumulation_counter << "\n"; + std::cout << "\t**.prochot_residency_acc : " << std::dec + << gpu_metrics.prochot_residency_acc << "\n"; + std::cout << "\t**.ppt_residency_acc : " << std::dec + << gpu_metrics.ppt_residency_acc << "\n"; + std::cout << "\t**.socket_thm_residency_acc : " << std::dec + << gpu_metrics.socket_thm_residency_acc << "\n"; + std::cout << "\t**.vr_thm_residency_acc : " << std::dec + << gpu_metrics.vr_thm_residency_acc << "\n"; + std::cout << "\t**.hbm_thm_residency_acc : " << std::dec + << gpu_metrics.hbm_thm_residency_acc << "\n"; + std::cout << "\t**.num_partition: " << std::dec + << gpu_metrics.num_partition << "\n"; + std::cout << "\t**.pcie_lc_perf_other_end_recovery: " + << gpu_metrics.pcie_lc_perf_other_end_recovery << "\n"; std::cout << "\t**.temperature_hbm[] : " << std::dec << "\n"; for (const auto& temp : gpu_metrics.temperature_hbm) { @@ -978,6 +994,50 @@ int main() { std::cout << "\t -> " << std::dec << dclk << "\n"; } + std::cout << std::dec << "xcp_stats.gfx_busy_inst = \n"; + auto xcp = 0; + for (auto& row : gpu_metrics.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.gfx_busy_inst), + std::end(row.gfx_busy_inst), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.jpeg_busy = \n"; + for (auto& row : gpu_metrics.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.jpeg_busy), + 
std::end(row.jpeg_busy), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.vcn_busy = \n"; + for (auto& row : gpu_metrics.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.vcn_busy), + std::end(row.vcn_busy), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.gfx_busy_acc = \n"; + for (auto& row : gpu_metrics.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.gfx_busy_acc), + std::end(row.gfx_busy_acc), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + std::cout << "\n"; std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n"; constexpr uint16_t kMAX_ITER_TEST = 10; diff --git a/rocm_smi/include/rocm_smi/rocm_smi.h b/rocm_smi/include/rocm_smi/rocm_smi.h index 3eef23ce24..2fefdd0ed8 100644 --- a/rocm_smi/include/rocm_smi/rocm_smi.h +++ b/rocm_smi/include/rocm_smi/rocm_smi.h @@ -1071,6 +1071,41 @@ typedef struct metrics_table_header_t metrics_table_header_t; */ #define RSMI_MAX_NUM_GFX_CLKS 8 +/** + * @brief This should match kRSMI_MAX_NUM_XCC; + * XCC - Accelerated Compute Core, the collection of compute units, + * ACE (Asynchronous Compute Engines), caches, + * and global resources organized as one unit. + * + * Refer to amd.com documentation for more detail: + * https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf + */ +#define RSMI_MAX_NUM_XCC 8 + +/** + * @brief This should match kRSMI_MAX_NUM_XCP; + * XCP - Accelerated Compute Processor, + * also referred to as the Graphics Compute Partitions. + * Each physical gpu could have a maximum of 8 separate partitions + * associated with each (depending on ASIC support). 
+ * + * Refer to amd.com documentation for more detail: + * https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/white-papers/amd-cdna-3-white-paper.pdf + */ +#define RSMI_MAX_NUM_XCP 8 + +/** + * @brief The following structures hold the gpu statistics for a device. + */ +struct amdgpu_xcp_metrics_t { + /* Utilization Instantaneous (%) */ + uint32_t gfx_busy_inst[RSMI_MAX_NUM_XCC]; + uint16_t jpeg_busy[RSMI_MAX_NUM_JPEG_ENGS]; + uint16_t vcn_busy[RSMI_MAX_NUM_VCNS]; + + /* Utilization Accumulated (%) */ + uint64_t gfx_busy_acc[RSMI_MAX_NUM_XCC]; +}; typedef struct { // TODO(amd) Doxygen documents @@ -1221,6 +1256,57 @@ typedef struct { // PCIE NAK received accumulated count uint32_t pcie_nak_rcvd_count_acc; + /* + * v1.6 additions + */ + /* Accumulation cycle counter */ + uint64_t accumulation_counter; + + /** + * Accumulated throttler residencies + */ + uint64_t prochot_residency_acc; + /** + * Accumulated throttler residencies + * + * Prochot (thermal) - PPT (power) + * Package Power Tracking (PPT) violation % (greater than 0% is a violation); + * aka PVIOL + * + * Ex. PVIOL/TVIOL calculations + * Where A and B are measurements recorded at prior points in time. + * Typically A is the earlier measured value and B is the latest measured value. + * + * PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A)) + * TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A)) + */ + uint64_t ppt_residency_acc; + /** + * Accumulated throttler residencies + * + * Socket (thermal) - + * Socket thermal violation % (greater than 0% is a violation); + * aka TVIOL + * + * Ex. PVIOL/TVIOL calculations + * Where A and B are measurements recorded at prior points in time. + * Typically A is the earlier measured value and B is the latest measured value. 
+ * + * PVIOL % = (PptResidencyAcc (B) - PptResidencyAcc (A)) * 100/ (AccumulationCounter (B) - AccumulationCounter (A)) + * TVIOL % = (SocketThmResidencyAcc (B) - SocketThmResidencyAcc (A)) * 100 / (AccumulationCounter (B) - AccumulationCounter (A)) + */ + uint64_t socket_thm_residency_acc; + uint64_t vr_thm_residency_acc; + uint64_t hbm_thm_residency_acc; + + /* Number of current partition */ + uint16_t num_partition; + + /* XCP (Graphic Cluster Partitions) metrics stats */ + struct amdgpu_xcp_metrics_t xcp_stats[RSMI_MAX_NUM_XCP]; + + /* PCIE other end recovery counter */ + uint32_t pcie_lc_perf_other_end_recovery; /// \endcond } rsmi_gpu_metrics_t; @@ -3081,6 +3167,7 @@ rsmi_status_t rsmi_dev_reg_table_info_get(uint32_t dv_ind, rsmi_name_value_t** reg_metrics, uint32_t *num_of_metrics); + /** * @brief This function sets the clock range information * diff --git a/rocm_smi/include/rocm_smi/rocm_smi_device.h b/rocm_smi/include/rocm_smi/rocm_smi_device.h index 72fbdd8a96..d1b11f5ebd 100644 --- a/rocm_smi/include/rocm_smi/rocm_smi_device.h +++ b/rocm_smi/include/rocm_smi/rocm_smi_device.h @@ -225,7 +225,6 @@ class Device { void set_drm_render_minor(uint32_t minor) {drm_render_minor_ = minor;} static rsmi_dev_perf_level perfLvlStrToEnum(std::string s); uint64_t bdfid(void) const {return bdfid_;} - int get_partition_id() const {return (bdfid_ >> 28) & 0xf; } // location_id[31:28] void set_bdfid(uint64_t val) {bdfid_ = val;} pthread_mutex_t *mutex(void) {return mutex_.ptr;} evt::dev_evt_grp_set_t* supported_event_groups(void) { @@ -261,6 +260,8 @@ class Device { AMGpuMetricsPublicLatestTupl_t dev_copy_internal_to_external_metrics(); static const std::map devInfoTypesStrings; + void set_smi_device_id(uint32_t device_id) { m_device_id = device_id; } + void set_smi_partition_id(uint32_t partition_id) { m_partition_id = partition_id; } static const char* get_type_string(DevInfoTypes type); private: @@ -298,6 +299,8 @@ class Device { GpuMetricsBasePtr m_gpu_metrics_ptr; 
AMDGpuMetricsHeader_v1_t m_gpu_metrics_header; uint64_t m_gpu_metrics_updated_timestamp; + uint32_t m_device_id; + uint32_t m_partition_id; }; diff --git a/rocm_smi/include/rocm_smi/rocm_smi_gpu_metrics.h b/rocm_smi/include/rocm_smi/rocm_smi_gpu_metrics.h index 70067b10ae..00c0f8e70b 100644 --- a/rocm_smi/include/rocm_smi/rocm_smi_gpu_metrics.h +++ b/rocm_smi/include/rocm_smi/rocm_smi_gpu_metrics.h @@ -52,6 +52,7 @@ #include #include #include +#include #include #include #include @@ -64,21 +65,19 @@ * All 1.4 and newer GPU metrics are now defined in this header. * */ -namespace amd::smi -{ +namespace amd::smi { constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1 = 1; constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_1 = 1; constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_2 = 2; constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_3 = 3; constexpr uint32_t kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4 = 4; -constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MAJOR_VER = kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1; -constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MINON_VER = kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4; +constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MAJOR_VER + = kRSMI_GPU_METRICS_API_CONTENT_MAJOR_VER_1; +constexpr uint32_t kRSMI_LATEST_GPU_METRICS_API_CONTENT_MINON_VER + = kRSMI_GPU_METRICS_API_CONTENT_MINOR_VER_4; -// Note: As gpu metrics are updating -constexpr uint32_t kRSMI_GPU_METRICS_EXPIRATION_SECS = 5; - // Note: This *must* match NUM_HBM_INSTANCES constexpr uint32_t kRSMI_MAX_NUM_HBM_INSTANCES = 4; @@ -97,23 +96,36 @@ constexpr uint32_t kRSMI_MAX_NUM_VCNS = 4; // Note: This *must* match NUM_JPEG_ENG constexpr uint32_t kRSMI_MAX_JPEG_ENGINES = 32; +// Note: This *must* match MAX_XCC +constexpr uint32_t kRSMI_MAX_NUM_XCC = 8; -struct AMDGpuMetricsHeader_v1_t -{ +// Note: This *must* match MAX_XCP +constexpr uint32_t kRSMI_MAX_NUM_XCP = 8; + + +struct AMDGpuMetricsHeader_v1_t { uint16_t 
m_structure_size; uint8_t m_format_revision; uint8_t m_content_revision; }; -struct AMDGpuMetricsBase_t -{ +struct amdgpu_xcp_metrics { + /* Utilization Instantaneous (%) */ + uint32_t gfx_busy_inst[kRSMI_MAX_NUM_XCC]; + uint16_t jpeg_busy[kRSMI_MAX_JPEG_ENGINES]; + uint16_t vcn_busy[kRSMI_MAX_NUM_VCNS]; + + /* Utilization Accumulated (%) */ + uint64_t gfx_busy_acc[kRSMI_MAX_NUM_XCC]; +}; + +struct AMDGpuMetricsBase_t { virtual ~AMDGpuMetricsBase_t() = default; }; using AMDGpuMetricsBaseRef = AMDGpuMetricsBase_t&; -struct AMDGpuMetrics_v11_t -{ +struct AMDGpuMetrics_v11_t { ~AMDGpuMetrics_v11_t() = default; struct AMDGpuMetricsHeader_v1_t m_common_header; @@ -174,8 +186,7 @@ struct AMDGpuMetrics_v11_t uint16_t m_temperature_hbm[kRSMI_MAX_NUM_HBM_INSTANCES]; }; -struct AMDGpuMetrics_v12_t -{ +struct AMDGpuMetrics_v12_t { ~AMDGpuMetrics_v12_t() = default; struct AMDGpuMetricsHeader_v1_t m_common_header; @@ -238,8 +249,7 @@ struct AMDGpuMetrics_v12_t uint64_t m_firmware_timestamp; }; -struct AMDGpuMetrics_v13_t -{ +struct AMDGpuMetrics_v13_t { ~AMDGpuMetrics_v13_t() = default; struct AMDGpuMetricsHeader_v1_t m_common_header; @@ -298,7 +308,7 @@ struct AMDGpuMetrics_v13_t uint32_t m_mem_activity_acc; // new in v1 uint16_t m_temperature_hbm[kRSMI_MAX_NUM_HBM_INSTANCES]; // new in v1 - // PMFW attached timestamp (10ns resolution) + // PMFW attached timestamp (10ns resolution) uint64_t m_firmware_timestamp; // Voltage (mV) @@ -312,8 +322,7 @@ struct AMDGpuMetrics_v13_t uint64_t m_indep_throttle_status; }; -struct AMDGpuMetrics_v14_t -{ +struct AMDGpuMetrics_v14_t { ~AMDGpuMetrics_v14_t() = default; struct AMDGpuMetricsHeader_v1_t m_common_header; @@ -329,7 +338,7 @@ struct AMDGpuMetrics_v14_t // Utilization (%) uint16_t m_average_gfx_activity; uint16_t m_average_umc_activity; // memory controller - uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode) + uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent 
(encode/decode) // Energy (15.259uJ (2^-16) units) uint64_t m_energy_accumulator; @@ -345,9 +354,9 @@ struct AMDGpuMetrics_v14_t // Link width (number of lanes) and speed (in 0.1 GT/s) uint16_t m_pcie_link_width; - uint16_t m_pcie_link_speed; // in 0.1 GT/s + uint16_t m_pcie_link_speed; // in 0.1 GT/s - // XGMI bus width and bitrate (in Gbps) + // XGMI bus width and bitrate (in Gbps) uint16_t m_xgmi_link_width; uint16_t m_xgmi_link_speed; @@ -358,7 +367,7 @@ struct AMDGpuMetrics_v14_t // PCIE accumulated bandwidth (GB/sec) uint64_t m_pcie_bandwidth_acc; - // PCIE instantaneous bandwidth (GB/sec) + // PCIE instantaneous bandwidth (GB/sec) uint64_t m_pcie_bandwidth_inst; // PCIE L0 to recovery state transition accumulated count @@ -387,8 +396,7 @@ struct AMDGpuMetrics_v14_t uint16_t m_padding; }; -struct AMDGpuMetrics_v15_t -{ +struct AMDGpuMetrics_v15_t { ~AMDGpuMetrics_v15_t() = default; struct AMDGpuMetricsHeader_v1_t m_common_header; @@ -404,7 +412,7 @@ struct AMDGpuMetrics_v15_t // Utilization (%) uint16_t m_average_gfx_activity; uint16_t m_average_umc_activity; // memory controller - uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode) + uint16_t m_vcn_activity[kRSMI_MAX_NUM_VCNS]; // VCN instances activity percent (encode/decode) uint16_t m_jpeg_activity[kRSMI_MAX_JPEG_ENGINES]; // JPEG activity percent (encode/decode) // Energy (15.259uJ (2^-16) units) @@ -421,7 +429,7 @@ struct AMDGpuMetrics_v15_t // Link width (number of lanes) and speed (in 0.1 GT/s) uint16_t m_pcie_link_width; - uint16_t m_pcie_link_speed; // in 0.1 GT/s + uint16_t m_pcie_link_speed; // in 0.1 GT/s // XGMI bus width and bitrate (in Gbps) uint16_t m_xgmi_link_width; @@ -468,7 +476,103 @@ struct AMDGpuMetrics_v15_t uint16_t m_padding; }; -using AMGpuMetricsLatest_t = AMDGpuMetrics_v15_t; +struct AMDGpuMetrics_v16_t { + ~AMDGpuMetrics_v16_t() = default; + + struct AMDGpuMetricsHeader_v1_t m_common_header; + + // Temperature (Celsius). 
It will be zero (0) if unsupported. + uint16_t m_temperature_hotspot; + uint16_t m_temperature_mem; + uint16_t m_temperature_vrsoc; + + // Power (Watts) + uint16_t m_current_socket_power; + + // Utilization (%) + uint16_t m_average_gfx_activity; + uint16_t m_average_umc_activity; // memory controller + + // Energy (15.259uJ (2^-16) units) + uint64_t m_energy_accumulator; + + // Driver attached timestamp (in ns) + uint64_t m_system_clock_counter; + + /* + * Important: bumped up public to uint64_t due to planned size increase + * for newer ASICs + */ + /* Accumulation cycle counter */ + uint32_t m_accumulation_counter; + + /* Accumulated throttler residencies */ + uint32_t m_prochot_residency_acc; + uint32_t m_ppt_residency_acc; + uint32_t m_socket_thm_residency_acc; + uint32_t m_vr_thm_residency_acc; + uint32_t m_hbm_thm_residency_acc; + + // Clock Lock Status. Each bit corresponds to clock instance + uint32_t m_gfxclk_lock_status; + + // Link width (number of lanes) and speed (in 0.1 GT/s) + uint16_t m_pcie_link_width; + uint16_t m_pcie_link_speed; // in 0.1 GT/s + + // XGMI bus width and bitrate (in Gbps) + uint16_t m_xgmi_link_width; + uint16_t m_xgmi_link_speed; + + // Utilization Accumulated (%) + uint32_t m_gfx_activity_acc; + uint32_t m_mem_activity_acc; + + // PCIE accumulated bandwidth (GB/sec) + uint64_t m_pcie_bandwidth_acc; + + // PCIE instantaneous bandwidth (GB/sec) + uint64_t m_pcie_bandwidth_inst; + + // PCIE L0 to recovery state transition accumulated count + uint64_t m_pcie_l0_to_recov_count_acc; + + // PCIE replay accumulated count + uint64_t m_pcie_replay_count_acc; + + // PCIE replay rollover accumulated count + uint64_t m_pcie_replay_rover_count_acc; + + // PCIE NAK sent accumulated count + uint32_t m_pcie_nak_sent_count_acc; + + // PCIE NAK received accumulated count + uint32_t m_pcie_nak_rcvd_count_acc; + + // XGMI accumulated data transfer size(KiloBytes) + uint64_t m_xgmi_read_data_acc[kRSMI_MAX_NUM_XGMI_LINKS]; + uint64_t 
m_xgmi_write_data_acc[kRSMI_MAX_NUM_XGMI_LINKS]; + + // PMFW attached timestamp (10ns resolution) + uint64_t m_firmware_timestamp; + + // Current clocks (Mhz) + uint16_t m_current_gfxclk[kRSMI_MAX_NUM_GFX_CLKS]; + uint16_t m_current_socclk[kRSMI_MAX_NUM_CLKS]; + uint16_t m_current_vclk0[kRSMI_MAX_NUM_CLKS]; + uint16_t m_current_dclk0[kRSMI_MAX_NUM_CLKS]; + uint16_t m_current_uclk; + + /* Number of current partition */ + uint16_t m_num_partition; + + /* XCP (Graphic Cluster Partitions) metrics stats */ + struct amdgpu_xcp_metrics m_xcp_stats[kRSMI_MAX_NUM_XCP]; + + /* PCIE other end recovery counter */ + uint32_t m_pcie_lc_perf_other_end_recovery; +}; +using AMGpuMetricsLatest_t = AMDGpuMetrics_v16_t; /** * This is GPU Metrics version that gets to public access. @@ -555,8 +659,7 @@ using AMDGpuMetricVersionFlagId_t = uint32_t; * Each Metric Unit (or a set of them) is related to a Metric class. * */ -enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t -{ +enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t { kGpuMetricHeader, kGpuMetricTemperature, kGpuMetricUtilization, @@ -569,6 +672,9 @@ enum class AMDGpuMetricsClassId_t : AMDGpuMetricTypeId_t kGpuMetricLinkWidthSpeed, kGpuMetricVoltage, kGpuMetricTimestamp, + kGpuMetricThrottleResidency, + kGpuMetricPartition, + kGpuMetricXcpStats, }; using AMDGpuMetricsClassIdTranslationTbl_t = std::map; @@ -605,8 +711,8 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t kMetricAvgMmActivity, kMetricGfxActivityAccumulator, kMetricMemActivityAccumulator, - kMetricVcnActivity, //v1.4 - kMetricJpegActivity, //v1.5 + kMetricVcnActivity, // v1.4 + kMetricJpegActivity, // v1.5 // kGpuMetricAverageClock counters kMetricAvgGfxClockFrequency, @@ -618,11 +724,11 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t kMetricAvgDClock1Frequency, // kGpuMetricCurrentClock counters - kMetricCurrGfxClock, //v1.4: Changed to multi-valued - kMetricCurrSocClock, //v1.4: Changed to multi-valued + kMetricCurrGfxClock, // 
v1.4: Changed to multi-valued + kMetricCurrSocClock, // v1.4: Changed to multi-valued kMetricCurrUClock, - kMetricCurrVClock0, //v1.4: Changed to multi-valued - kMetricCurrDClock0, //v1.4: Changed to multi-valued + kMetricCurrVClock0, // v1.4: Changed to multi-valued + kMetricCurrDClock0, // v1.4: Changed to multi-valued kMetricCurrVClock1, kMetricCurrDClock1, @@ -631,7 +737,7 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t kMetricIndepThrottleStatus, // kGpuMetricGfxClkLockStatus counters - kMetricGfxClkLockStatus, //v1.4 + kMetricGfxClkLockStatus, // v1.4 // kGpuMetricCurrentFanSpeed counters kMetricCurrFanSpeed, @@ -639,31 +745,50 @@ enum class AMDGpuMetricsUnitType_t : AMDGpuMetricTypeId_t // kGpuMetricLinkWidthSpeed counters kMetricPcieLinkWidth, kMetricPcieLinkSpeed, - kMetricPcieBandwidthAccumulator, //v1.4 - kMetricPcieBandwidthInst, //v1.4 - kMetricXgmiLinkWidth, //v1.4 - kMetricXgmiLinkSpeed, //v1.4 - kMetricXgmiReadDataAccumulator, //v1.4 - kMetricXgmiWriteDataAccumulator, //v1.4 - kMetricPcieL0RecovCountAccumulator, //v1.4 - kMetricPcieReplayCountAccumulator, //v1.4 - kMetricPcieReplayRollOverCountAccumulator, //v1.4 - kMetricPcieNakSentCountAccumulator, //v1.5 - kMetricPcieNakReceivedCountAccumulator, //v1.5 + kMetricPcieBandwidthAccumulator, // v1.4 + kMetricPcieBandwidthInst, // v1.4 + kMetricXgmiLinkWidth, // v1.4 + kMetricXgmiLinkSpeed, // v1.4 + kMetricXgmiReadDataAccumulator, // v1.4 + kMetricXgmiWriteDataAccumulator, // v1.4 + kMetricPcieL0RecovCountAccumulator, // v1.4 + kMetricPcieReplayCountAccumulator, // v1.4 + kMetricPcieReplayRollOverCountAccumulator, // v1.4 + kMetricPcieNakSentCountAccumulator, // v1.5 + kMetricPcieNakReceivedCountAccumulator, // v1.5 // kGpuMetricPowerEnergy counters kMetricAvgSocketPower, - kMetricCurrSocketPower, //v1.4 - kMetricEnergyAccumulator, //v1.4 + kMetricCurrSocketPower, // v1.4 + kMetricEnergyAccumulator, // v1.4 // kGpuMetricVoltage counters - kMetricVoltageSoc, //v1.3 - kMetricVoltageGfx, 
//v1.3 - kMetricVoltageMem, //v1.3 + kMetricVoltageSoc, // v1.3 + kMetricVoltageGfx, // v1.3 + kMetricVoltageMem, // v1.3 // kGpuMetricTimestamp counters kMetricTSClockCounter, kMetricTSFirmware, + + // kMetricAccumulationCounter counters + kMetricAccumulationCounter, // v1.6 + kMetricProchotResidencyAccumulator, // v1.6 + kMetricPPTResidencyAccumulator, // v1.6 + kMetricSocketThmResidencyAccumulator, // v1.6 + kMetricVRThmResidencyAccumulator, // v1.6 + kMetricHBMThmResidencyAccumulator, // v1.6 + + // kGpuMetricPartition + kGpuMetricNumPartition, // v1.6 + + // kGpuMetricXcpStats + kMetricGfxBusyInst, // v1.6 + kMetricJpegBusy, // v1.6 + kMetricVcnBusy, // v1.6 + kMetricGfxBusyAcc, // v1.6 + + kMetricPcieLCPerfOtherEndRecov, // v1.6 }; using AMDGpuMetricsUnitTypeTranslationTbl_t = std::map; @@ -676,14 +801,14 @@ enum class AMDGpuMetricsDataType_t : AMDGpuMetricsDataTypeId_t kUInt64, }; -struct AMDGpuDynamicMetricsValue_t -{ +struct AMDGpuDynamicMetricsValue_t { uint64_t m_value; std::string m_info; AMDGpuMetricsDataType_t m_original_type; }; using AMDGpuDynamicMetricTblValues_t = std::vector; -using AMDGpuDynamicMetricsTbl_t = std::map>; +using AMDGpuDynamicMetricsTbl_t = std::map>; /* @@ -700,13 +825,13 @@ enum class AMDGpuMetricVersionFlags_t : AMDGpuMetricVersionFlagId_t kGpuMetricV13 = (0x1 << 3), kGpuMetricV14 = (0x1 << 4), kGpuMetricV15 = (0x1 << 5), + kGpuMetricV16 = (0x1 << 6), }; using AMDGpuMetricVersionTranslationTbl_t = std::map; using GpuMetricTypePtr_t = std::shared_ptr; -class GpuMetricsBase_t -{ - public: +class GpuMetricsBase_t { + public: virtual ~GpuMetricsBase_t() = default; virtual size_t sizeof_metric_table() = 0; virtual GpuMetricTypePtr_t get_metrics_table() = 0; @@ -714,30 +839,32 @@ class GpuMetricsBase_t virtual AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() = 0; virtual rsmi_status_t populate_metrics_dynamic_tbl() = 0; virtual AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() = 0; + virtual void 
set_device_id(uint32_t device_id) { m_device_id = device_id; } + virtual void set_partition_id(uint32_t partition_id) { m_partition_id = partition_id; } virtual AMDGpuDynamicMetricsTbl_t get_metrics_dynamic_tbl() { return m_metrics_dynamic_tbl; } - protected: + protected: AMDGpuDynamicMetricsTbl_t m_metrics_dynamic_tbl; uint64_t m_metrics_timestamp; + uint32_t m_device_id; + uint32_t m_partition_id; }; using GpuMetricsBasePtr = std::shared_ptr; using AMDGpuMetricFactories_t = const std::map; -class GpuMetricsBase_v11_t final : public GpuMetricsBase_t -{ - public: +class GpuMetricsBase_v11_t final : public GpuMetricsBase_t { + public: virtual ~GpuMetricsBase_v11_t() = default; size_t sizeof_metric_table() override { return sizeof(AMDGpuMetrics_v11_t); } - GpuMetricTypePtr_t get_metrics_table() override - { + GpuMetricTypePtr_t get_metrics_table() override { if (!m_gpu_metric_ptr) { m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v11_t*){}); } @@ -745,13 +872,11 @@ class GpuMetricsBase_v11_t final : public GpuMetricsBase_t return m_gpu_metric_ptr; } - void dump_internal_metrics_table() override - { + void dump_internal_metrics_table() override { return; } - AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override - { + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { return AMDGpuMetricVersionFlags_t::kGpuMetricV11; } @@ -759,23 +884,20 @@ class GpuMetricsBase_v11_t final : public GpuMetricsBase_t AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; - private: + private: AMDGpuMetrics_v11_t m_gpu_metrics_tbl; std::shared_ptr m_gpu_metric_ptr; - }; -class GpuMetricsBase_v12_t final : public GpuMetricsBase_t -{ - public: +class GpuMetricsBase_v12_t final : public GpuMetricsBase_t { + public: ~GpuMetricsBase_v12_t() = default; size_t sizeof_metric_table() override { return sizeof(AMDGpuMetrics_v12_t); } - GpuMetricTypePtr_t get_metrics_table() override - { + GpuMetricTypePtr_t get_metrics_table() override 
{ if (!m_gpu_metric_ptr) { m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v12_t*){}); } @@ -783,36 +905,31 @@ class GpuMetricsBase_v12_t final : public GpuMetricsBase_t return m_gpu_metric_ptr; } - void dump_internal_metrics_table() override - { + void dump_internal_metrics_table() override { return; } - AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override - { + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { return AMDGpuMetricVersionFlags_t::kGpuMetricV12; } rsmi_status_t populate_metrics_dynamic_tbl() override; AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; - private: + private: AMDGpuMetrics_v12_t m_gpu_metrics_tbl; std::shared_ptr m_gpu_metric_ptr; - }; -class GpuMetricsBase_v13_t final : public GpuMetricsBase_t -{ - public: +class GpuMetricsBase_v13_t final : public GpuMetricsBase_t { + public: ~GpuMetricsBase_v13_t() = default; size_t sizeof_metric_table() override { return sizeof(AMDGpuMetrics_v13_t); } - GpuMetricTypePtr_t get_metrics_table() override - { + GpuMetricTypePtr_t get_metrics_table() override { if (!m_gpu_metric_ptr) { m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v13_t*){}); } @@ -822,8 +939,7 @@ class GpuMetricsBase_v13_t final : public GpuMetricsBase_t void dump_internal_metrics_table() override; - AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override - { + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { return AMDGpuMetricVersionFlags_t::kGpuMetricV13; } @@ -831,23 +947,20 @@ class GpuMetricsBase_v13_t final : public GpuMetricsBase_t AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; - private: + private: AMDGpuMetrics_v13_t m_gpu_metrics_tbl; std::shared_ptr m_gpu_metric_ptr; - }; -class GpuMetricsBase_v14_t final : public GpuMetricsBase_t -{ - public: +class GpuMetricsBase_v14_t final : public GpuMetricsBase_t { + public: ~GpuMetricsBase_v14_t() = default; size_t sizeof_metric_table() 
override { return sizeof(AMDGpuMetrics_v14_t); } - GpuMetricTypePtr_t get_metrics_table() override - { + GpuMetricTypePtr_t get_metrics_table() override { if (!m_gpu_metric_ptr) { m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v14_t*){}); } @@ -857,8 +970,7 @@ class GpuMetricsBase_v14_t final : public GpuMetricsBase_t void dump_internal_metrics_table() override; - AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override - { + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { return AMDGpuMetricVersionFlags_t::kGpuMetricV14; } @@ -866,23 +978,20 @@ class GpuMetricsBase_v14_t final : public GpuMetricsBase_t AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; - private: + private: AMDGpuMetrics_v14_t m_gpu_metrics_tbl; std::shared_ptr m_gpu_metric_ptr; - }; -class GpuMetricsBase_v15_t final : public GpuMetricsBase_t -{ - public: +class GpuMetricsBase_v15_t final : public GpuMetricsBase_t { + public: ~GpuMetricsBase_v15_t() = default; size_t sizeof_metric_table() override { return sizeof(AMDGpuMetrics_v15_t); } - GpuMetricTypePtr_t get_metrics_table() override - { + GpuMetricTypePtr_t get_metrics_table() override { if (!m_gpu_metric_ptr) { m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v15_t*){}); } @@ -892,8 +1001,7 @@ class GpuMetricsBase_v15_t final : public GpuMetricsBase_t void dump_internal_metrics_table() override; - AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override - { + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { return AMDGpuMetricVersionFlags_t::kGpuMetricV15; } @@ -901,20 +1009,51 @@ class GpuMetricsBase_v15_t final : public GpuMetricsBase_t AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; - private: + private: AMDGpuMetrics_v15_t m_gpu_metrics_tbl; std::shared_ptr m_gpu_metric_ptr; +}; +class GpuMetricsBase_v16_t final : public GpuMetricsBase_t { + public: + ~GpuMetricsBase_v16_t() = default; + + size_t 
sizeof_metric_table() override { + return sizeof(AMDGpuMetrics_v16_t); + } + + GpuMetricTypePtr_t get_metrics_table() override { + if (!m_gpu_metric_ptr) { + m_gpu_metric_ptr.reset(&m_gpu_metrics_tbl, [](AMDGpuMetrics_v16_t*){}); + } + assert(m_gpu_metric_ptr != nullptr); + return m_gpu_metric_ptr; + } + + void dump_internal_metrics_table() override; + + AMDGpuMetricVersionFlags_t get_gpu_metrics_version_used() override { + return AMDGpuMetricVersionFlags_t::kGpuMetricV16; + } + + rsmi_status_t populate_metrics_dynamic_tbl() override; + AMGpuMetricsPublicLatestTupl_t copy_internal_to_external_metrics() override; + + private: + AMDGpuMetrics_v16_t m_gpu_metrics_tbl; + std::shared_ptr m_gpu_metric_ptr; }; template -rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnitType_t metric_counter, T& metric_value); +rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, + AMDGpuMetricsUnitType_t metric_counter, T& metric_value); -} // namespace amd::smi +} // namespace amd::smi rsmi_status_t -rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind, metrics_table_header_t& header_value); +rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind, + metrics_table_header_t& header_value); -#endif // ROCM_SMI_ROCM_SMI_GPU_METRICS_H_ +#endif // ROCM_SMI_ROCM_SMI_GPU_METRICS_H_ diff --git a/rocm_smi/include/rocm_smi/rocm_smi_utils.h b/rocm_smi/include/rocm_smi/rocm_smi_utils.h index ed560ce8ed..6dc4588cd7 100644 --- a/rocm_smi/include/rocm_smi/rocm_smi_utils.h +++ b/rocm_smi/include/rocm_smi/rocm_smi_utils.h @@ -59,6 +59,7 @@ #include #include #include +#include #include "rocm_smi/rocm_smi_device.h" @@ -604,6 +605,74 @@ using TextFileTagContents_t = TagTextContents_t; +// +// Note: Output iterator that inserts a delimiter between elements. 
+// +template> +class ostream_joiner { + public: + using Char_t = CharType; + using Traits_t = TraitsType; + using Ostream_t = std::basic_ostream; + using iterator_category = std::output_iterator_tag; + using value_type = void; + using difference_type = void; + using pointer = void; + using reference = void; + + + ostream_joiner(Ostream_t* outstream, + const DelimiterType& delimiter) noexcept + (std::is_nothrow_copy_constructible_v) + : m_outstream(outstream), m_delimiter(delimiter) {} + + ostream_joiner(Ostream_t* outstream, DelimiterType&& delimiter) noexcept + (std::is_nothrow_move_constructible_v) + : m_outstream(outstream), m_delimiter(std::move(delimiter)) {} + + template ostream_joiner& operator=(const ValueType& value) { + if (!m_is_first) { + *m_outstream << m_delimiter; + } + this->m_is_first = false; + this->m_value_count++; + + if ((m_value_count % kMAX_VALUES_PER_LINE) == 0) { + *m_outstream << "\n" << value; + this->m_value_count = 0; + } else { + *m_outstream << value; + } + + return *this; + } + + ostream_joiner& operator*() noexcept { return *this; } + ostream_joiner& operator++() noexcept { return *this; } + ostream_joiner& operator++(int) noexcept { return *this; } + + + private: + Ostream_t* m_outstream; + DelimiterType m_delimiter; + bool m_is_first = true; + uint32_t m_value_count = 0; + const uint32_t kMAX_VALUES_PER_LINE = 9; +}; + +/// Object generator for ostream_joiner. 
+template +inline ostream_joiner, CharType, TraitsType> + make_ostream_joiner(std::basic_ostream* outstream, + DelimiterType&& delimiter) { + return { + outstream, + std::forward(delimiter) + }; +} + + } // namespace smi } // namespace amd diff --git a/rocm_smi/src/rocm_smi_device.cc b/rocm_smi/src/rocm_smi_device.cc index c32c81f156..1430d29599 100644 --- a/rocm_smi/src/rocm_smi_device.cc +++ b/rocm_smi/src/rocm_smi_device.cc @@ -1006,6 +1006,7 @@ const char* Device::get_type_string(DevInfoTypes type) { return "Unknown"; } + int Device::readDevInfoBinary(DevInfoTypes type, std::size_t b_size, void *p_binary_data) { auto sysfs_path = path_; @@ -1043,15 +1044,17 @@ int Device::readDevInfoBinary(DevInfoTypes type, std::size_t b_size, LOG_ERROR(ss); return ENOENT; } - ss << "Successfully read DevInfoBinary for DevInfoType (" - << get_type_string(type) << ") - SYSFS (" - << sysfs_path << "), returning binaryData = " << p_binary_data - << "; byte_size = " << std::dec << static_cast(b_size); + if (ROCmLogging::Logger::getInstance()->isLoggerEnabled()) { + ss << "Successfully read DevInfoBinary for DevInfoType (" + << get_type_string(type) << ") - SYSFS (" + << sysfs_path << "), returning binaryData = " << p_binary_data + << "; byte_size = " << std::dec << static_cast(b_size); - std::string metricDescription = "AMD SMI GPU METRICS (16-byte width), " + std::string metricDescription = "AMD SMI GPU METRICS (16-byte width), " + sysfs_path; - logHexDump(metricDescription.c_str(), p_binary_data, b_size, 16); - LOG_INFO(ss); + logHexDump(metricDescription.c_str(), p_binary_data, b_size, 16); + LOG_INFO(ss); + } return 0; } diff --git a/rocm_smi/src/rocm_smi_gpu_metrics.cc b/rocm_smi/src/rocm_smi_gpu_metrics.cc index 3bc078216d..d4227d1051 100644 --- a/rocm_smi/src/rocm_smi_gpu_metrics.cc +++ b/rocm_smi/src/rocm_smi_gpu_metrics.cc @@ -156,6 +156,7 @@ std::string stringfy_metric_header_version(const AMDGpuMetricsHeader_v1_t& metri // version 1.3: 259 // version 1.4: 260 // version 
1.5: 261 +// version 1.6: 262 // const AMDGpuMetricVersionTranslationTbl_t amdgpu_metric_version_translation_table { @@ -164,6 +165,7 @@ const AMDGpuMetricVersionTranslationTbl_t amdgpu_metric_version_translation_tabl {join_metrics_version(1, 3), AMDGpuMetricVersionFlags_t::kGpuMetricV13}, {join_metrics_version(1, 4), AMDGpuMetricVersionFlags_t::kGpuMetricV14}, {join_metrics_version(1, 5), AMDGpuMetricVersionFlags_t::kGpuMetricV15}, + {join_metrics_version(1, 6), AMDGpuMetricVersionFlags_t::kGpuMetricV16}, }; /** @@ -183,10 +185,12 @@ const AMDGpuMetricsClassIdTranslationTbl_t amdgpu_metrics_class_id_translation_t {AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed, "Link/Bandwidth/Speed"}, {AMDGpuMetricsClassId_t::kGpuMetricVoltage, "Voltage"}, {AMDGpuMetricsClassId_t::kGpuMetricTimestamp, "Timestamp"}, + {AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency, "Throttler Residency"}, + {AMDGpuMetricsClassId_t::kGpuMetricPartition, "Partition Number"}, + {AMDGpuMetricsClassId_t::kGpuMetricXcpStats, "XCP Stats"}, }; -const AMDGpuMetricsUnitTypeTranslationTbl_t amdgpu_metrics_unit_type_translation_table -{ +const AMDGpuMetricsUnitTypeTranslationTbl_t amdgpu_metrics_unit_type_translation_table { // kGpuMetricTemperature counters {AMDGpuMetricsUnitType_t::kMetricTempEdge, "TempEdge"}, {AMDGpuMetricsUnitType_t::kMetricTempHotspot, "TempHotspot"}, @@ -261,20 +265,40 @@ const AMDGpuMetricsUnitTypeTranslationTbl_t amdgpu_metrics_unit_type_translation // kGpuMetricTimestamp counters {AMDGpuMetricsUnitType_t::kMetricTSClockCounter, "TSClockCounter"}, {AMDGpuMetricsUnitType_t::kMetricTSFirmware, "TSFirmware"}, + + // kGpuMetricThrottleResidency counters + {AMDGpuMetricsUnitType_t::kMetricAccumulationCounter, "AccumulationCounter"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricProchotResidencyAccumulator, "ProchotResidencyAccumulator"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricPPTResidencyAccumulator, "PPTResidencyAccumulator"}, /* v1.6 */ + 
{AMDGpuMetricsUnitType_t::kMetricSocketThmResidencyAccumulator, "SocketThmResidencyAccumulator"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricVRThmResidencyAccumulator, "VRThmResidencyAccumulator"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricHBMThmResidencyAccumulator, "HBMThmResidencyAccumulator"}, /* v1.6 */ + + // kGpuMetricPartition + {AMDGpuMetricsUnitType_t::kGpuMetricNumPartition, "numPartition"}, /* v1.6 */ + + // kGpuMetricXcpStats + {AMDGpuMetricsUnitType_t::kMetricGfxBusyInst, "GfxBusyInst"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricJpegBusy, "JpegBusy"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricVcnBusy, "VcnBusy"}, /* v1.6 */ + {AMDGpuMetricsUnitType_t::kMetricGfxBusyAcc, "GfxBusyAcc"}, /* v1.6 */ + + // kGpuMetricLinkWidthSpeed + {AMDGpuMetricsUnitType_t::kMetricPcieLCPerfOtherEndRecov, "PcieLCPerfOtherEndRecov"}, /* v1.6 */ }; AMDGpuMetricVersionFlags_t translate_header_to_flag_version(const AMDGpuMetricsHeader_v1_t& metrics_header) { - std::ostringstream ostrstream; + std::ostringstream ss; auto version_id(AMDGpuMetricVersionFlags_t::kGpuMetricNone); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); const auto flag_version = join_metrics_version(metrics_header); if (amdgpu_metric_version_translation_table.find(flag_version) != amdgpu_metric_version_translation_table.end()) { version_id = amdgpu_metric_version_translation_table.at(flag_version); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Translation Tbl: " << flag_version @@ -282,11 +306,11 @@ AMDGpuMetricVersionFlags_t translate_header_to_flag_version(const AMDGpuMetricsH << " | Returning = " << static_cast(version_id) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return version_id; } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << 
" | Translation Tbl: " << flag_version @@ -294,40 +318,40 @@ AMDGpuMetricVersionFlags_t translate_header_to_flag_version(const AMDGpuMetricsH << " | Returning = " << static_cast(version_id) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return version_id; } uint16_t translate_flag_to_metric_version(AMDGpuMetricVersionFlags_t version_flag) { - std::ostringstream ostrstream; + std::ostringstream ss; auto version_id = uint16_t(0); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); for (const auto& [key, value] : amdgpu_metric_version_translation_table) { if (value == version_flag) { version_id = key; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Version Flag: " << static_cast(version_flag) << " | Unified Version: " << version_id << " | Str. Version: " << stringfy_metric_header_version(disjoin_metrics_version(version_id)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return version_id; } } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Version Flag: " << static_cast(version_flag) << " | Unified Version: " << version_id << " | Str. 
Version: " << stringfy_metric_header_version(disjoin_metrics_version(version_id)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return version_id; } @@ -348,37 +372,38 @@ AMDGpuMetricFactories_t amd_gpu_metrics_factory_table {AMDGpuMetricVersionFlags_t::kGpuMetricV13, std::make_shared(GpuMetricsBase_v13_t{})}, {AMDGpuMetricVersionFlags_t::kGpuMetricV14, std::make_shared(GpuMetricsBase_v14_t{})}, {AMDGpuMetricVersionFlags_t::kGpuMetricV15, std::make_shared(GpuMetricsBase_v15_t{})}, + {AMDGpuMetricVersionFlags_t::kGpuMetricV16, std::make_shared(GpuMetricsBase_v16_t{})}, }; GpuMetricsBasePtr amdgpu_metrics_factory(AMDGpuMetricVersionFlags_t gpu_metric_version) { - std::ostringstream ostrstream; - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + std::ostringstream ss; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto contains = [](const AMDGpuMetricVersionFlags_t metric_version) { return (amd_gpu_metrics_factory_table.find(metric_version) != amd_gpu_metrics_factory_table.end()); }; if (contains(gpu_metric_version)) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Factory Version: " << static_cast(gpu_metric_version) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return (amd_gpu_metrics_factory_table.at(gpu_metric_version)); } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Factory Version: " << static_cast(gpu_metric_version) << " | Returning = " << "No object from factory." 
<< " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return nullptr; } @@ -448,12 +473,11 @@ AMDGpuDynamicMetricTblValues_t format_metric_row(const T& metric, const std::str auto value = uint64_t(0); if constexpr (std::is_array_v) { value = (metric[idx]); - } - else { + } else { value = (metric); } - auto amdgpu_dynamic_metric_value = [&, data_type=data_type]() { + auto amdgpu_dynamic_metric_value = [&]() { AMDGpuDynamicMetricsValue_t amdgpu_dynamic_metric_value_init{}; amdgpu_dynamic_metric_value_init.m_value = value; amdgpu_dynamic_metric_value_init.m_info = (value_title + " : " + std::to_string(idx)); @@ -467,17 +491,444 @@ AMDGpuDynamicMetricTblValues_t format_metric_row(const T& metric, const std::str return multi_values; } +void GpuMetricsBase_v16_t::dump_internal_metrics_table() +{ + std::ostringstream ss; + auto idx = uint64_t(0); + auto idy = uint64_t(0); + std::cout << __PRETTY_FUNCTION__ << " | ======= start ======= \n"; + ss << __PRETTY_FUNCTION__ + << " | ======= DEBUG ======= " + << " | Metric Version: " + << stringfy_metric_header_version(m_gpu_metrics_tbl.m_common_header) + << " | Size: " + << print_unsigned_int(m_gpu_metrics_tbl.m_common_header.m_structure_size) + << " |" + << "\n"; + ss << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" + << " temperature_mem: " << m_gpu_metrics_tbl.m_temperature_mem << "\n" + << " temperature_vrsoc: " << m_gpu_metrics_tbl.m_temperature_vrsoc << "\n" + << " current_socket_power: " << m_gpu_metrics_tbl.m_current_socket_power << "\n" + << " average_gfx_activity: " << m_gpu_metrics_tbl.m_average_gfx_activity << "\n" + << " average_umc_activity: " << m_gpu_metrics_tbl.m_average_umc_activity << "\n"; + + ss << " energy_accumulator: " << m_gpu_metrics_tbl.m_energy_accumulator << "\n" + << " system_clock_counter: " << m_gpu_metrics_tbl.m_system_clock_counter << "\n" + << " accumulation_counter: " << m_gpu_metrics_tbl.m_accumulation_counter << "\n" + << " prochot_residency_acc: " << 
m_gpu_metrics_tbl.m_prochot_residency_acc << "\n" + << " ppt_residency_acc: " << m_gpu_metrics_tbl.m_ppt_residency_acc << "\n" + << " socket_thm_residency_acc: " << m_gpu_metrics_tbl.m_socket_thm_residency_acc << "\n" + << " vr_thm_residency_acc: " << m_gpu_metrics_tbl.m_vr_thm_residency_acc << "\n" + << " hbm_thm_residency_acc: " << m_gpu_metrics_tbl.m_hbm_thm_residency_acc << "\n" + << " average_gfx_activity: " << m_gpu_metrics_tbl.m_average_gfx_activity << "\n" + << " average_umc_activity: " << m_gpu_metrics_tbl.m_average_umc_activity << "\n" + << " gfxclk_lock_status: " << m_gpu_metrics_tbl.m_gfxclk_lock_status << "\n" + << " pcie_link_width: " << m_gpu_metrics_tbl.m_pcie_link_width << "\n" + << " pcie_link_speed: " << m_gpu_metrics_tbl.m_pcie_link_speed << "\n" + << " xgmi_link_width: " << m_gpu_metrics_tbl.m_xgmi_link_width << "\n" + << " xgmi_link_speed: " << m_gpu_metrics_tbl.m_xgmi_link_speed << "\n" + << " gfx_activity_acc: " << m_gpu_metrics_tbl.m_gfx_activity_acc << "\n" + << " mem_activity_acc: " << m_gpu_metrics_tbl.m_mem_activity_acc << "\n" + << " pcie_bandwidth_acc: " << m_gpu_metrics_tbl.m_pcie_bandwidth_acc << "\n" + << " pcie_bandwidth_inst: " << m_gpu_metrics_tbl.m_pcie_bandwidth_inst << "\n" + << " pcie_l0_to_recov_count_acc: " << m_gpu_metrics_tbl.m_pcie_l0_to_recov_count_acc << "\n" + << " pcie_replay_count_acc: " << m_gpu_metrics_tbl.m_pcie_replay_count_acc << "\n" + << " pcie_replay_rover_count_acc: " << m_gpu_metrics_tbl.m_pcie_replay_rover_count_acc << "\n" + << " pcie_nak_sent_count_acc: " << m_gpu_metrics_tbl.m_pcie_nak_sent_count_acc << "\n" + << " pcie_nak_rcvd_count_acc: " << m_gpu_metrics_tbl.m_pcie_nak_rcvd_count_acc << "\n" + << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n" + << " current_uclk: " << m_gpu_metrics_tbl.m_current_uclk << "\n" + << " num_partition: " << m_gpu_metrics_tbl.m_num_partition << "\n" + << " pcie_lc_perf_other_end_recovery: " + << 
m_gpu_metrics_tbl.m_pcie_lc_perf_other_end_recovery << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_xgmi_read_data_acc) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + ss << " xgmi_write_data_acc: " << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_xgmi_write_data_acc) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + ss << " current_gfxclk: " << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_current_gfxclk) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + ss << " current_socclk: " << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_current_socclk) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + ss << " current_vclk0: " << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_current_vclk0) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + ss << " current_dclk0: " << "\n"; + idx = 0; + for (const auto& temp : m_gpu_metrics_tbl.m_current_dclk0) { + ss << "\t [" << idx << "]: " << temp << "\n"; + ++idx; + } + + idx = 0; + idy = 0; + ss << " xcp_stats.gfx_busy_inst: " << "\n"; + for (auto& row : m_gpu_metrics_tbl.m_xcp_stats) { + if (idx == 0) { + ss << "\t [ "; + } + for (auto& col : row.gfx_busy_inst) { + ss << "\t [" << idx << "] [" << idy << "]: " << col; + if (idy + 1 != (std::end(row.gfx_busy_inst) - std::begin(row.gfx_busy_inst))) { + ss << ", "; + } + if (idx + 1 != + (std::end(m_gpu_metrics_tbl.m_xcp_stats) - std::begin(m_gpu_metrics_tbl.m_xcp_stats))) { + ss << "\n"; + } else { + ss << "]\n"; + } + idy++; + } + idx++; idy = 0; + } + + idx = 0; + idy = 0; + ss << " xcp_stats.vcn_busy: " << "\n"; + for (auto& row : m_gpu_metrics_tbl.m_xcp_stats) { + if (idx == 0) { + ss << "\t [ "; + } + for (auto& col : row.vcn_busy) { + ss << "\t [" << idx << "] [" << idy << "]: " << col; + if (idy + 1 != (std::end(row.vcn_busy) - std::begin(row.vcn_busy))) { + ss << ", "; + } + if (idx + 1 != +
(std::end(m_gpu_metrics_tbl.m_xcp_stats) - std::begin(m_gpu_metrics_tbl.m_xcp_stats))) { + ss << "\n"; + } else { + ss << "]\n"; + } + idy++; + } + idx++; idy = 0; + } + + idx = 0; + idy = 0; + ss << " xcp_stats.jpeg_busy: " << "\n"; + for (auto& row : m_gpu_metrics_tbl.m_xcp_stats) { + if (idx == 0) { + ss << "\t [ "; + } + for (auto& col : row.jpeg_busy) { + ss << "\t [" << idx << "] [" << idy << "]: " << col; + if (idy + 1 != (std::end(row.jpeg_busy) - std::begin(row.jpeg_busy))) { + ss << ", "; + } + if (idx + 1 != + (std::end(m_gpu_metrics_tbl.m_xcp_stats) - std::begin(m_gpu_metrics_tbl.m_xcp_stats))) { + ss << "\n"; + } else { + ss << "]\n"; + } + idy++; + } + idx++; idy = 0; + } + + idx = 0; + idy = 0; + ss << " xcp_stats.gfx_busy_acc: " << "\n"; + for (auto& row : m_gpu_metrics_tbl.m_xcp_stats) { + if (idx == 0) { + ss << "\t [ "; + } + for (auto& col : row.gfx_busy_acc) { + ss << "\t [" << idx << "] [" << idy << "]: " << col; + if (idy + 1 != (std::end(row.gfx_busy_acc) - std::begin(row.gfx_busy_acc))) { + ss << ", "; + } + if (idx + 1 != + (std::end(m_gpu_metrics_tbl.m_xcp_stats) - std::begin(m_gpu_metrics_tbl.m_xcp_stats))) { + ss << "\n"; + } else { + ss << "]\n"; + } + idy++; + } + idx++; idy = 0; + } + + LOG_DEBUG(ss); +} + +rsmi_status_t GpuMetricsBase_v16_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; + auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } + + // + // Note: Any metric treatment/changes (if any) should happen before they + // get written to internal/external tables.
+ // + auto run_metric_adjustments_v16 = [&]() { + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + const auto gpu_metrics_version = + translate_flag_to_metric_version(get_gpu_metrics_version_used()); + ss << __PRETTY_FUNCTION__ + << " | ======= info ======= " + << " | Applying adjustments " + << " | Metric Version: " << stringfy_metric_header_version( + disjoin_metrics_version(gpu_metrics_version)) + << " |"; + LOG_TRACE(ss); + + // firmware_timestamp is at 10ns resolution + ss << __PRETTY_FUNCTION__ + << " | ======= Changes ======= " + << " | {m_firmware_timestamp} from: " << m_gpu_metrics_tbl.m_firmware_timestamp + << " to: " << (m_gpu_metrics_tbl.m_firmware_timestamp * 10); + m_gpu_metrics_tbl.m_firmware_timestamp = (m_gpu_metrics_tbl.m_firmware_timestamp * 10); + LOG_DEBUG(ss); + }; + + // Adjustments/Changes specific to this version + run_metric_adjustments_v16(); + + // Temperature Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricTemperature] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricTempHotspot, + format_metric_row(m_gpu_metrics_tbl.m_temperature_hotspot, + "temperature_hotspot"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricTemperature] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricTempMem, + format_metric_row(m_gpu_metrics_tbl.m_temperature_mem, + "temperature_mem"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricTemperature] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricTempVrSoc, + format_metric_row(m_gpu_metrics_tbl.m_temperature_vrsoc, + "temperature_vrsoc"))); + + // Power/Energy Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricPowerEnergy] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrSocketPower, + format_metric_row(m_gpu_metrics_tbl.m_current_socket_power, + "curr_socket_power"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricPowerEnergy] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricEnergyAccumulator, + 
format_metric_row(m_gpu_metrics_tbl.m_energy_accumulator, + "energy_acc"))); + + // Utilization Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricUtilization] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricAvgGfxActivity, + format_metric_row(m_gpu_metrics_tbl.m_average_gfx_activity, + "average_gfx_activity"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricUtilization] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricAvgUmcActivity, + format_metric_row(m_gpu_metrics_tbl.m_average_umc_activity, + "average_umc_activity"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricUtilization] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricGfxActivityAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_gfx_activity_acc, + "gfx_activity_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricUtilization] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricMemActivityAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_mem_activity_acc, + "mem_activity_acc"))); + + // Timestamp Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricTimestamp] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricTSFirmware, + format_metric_row(m_gpu_metrics_tbl.m_firmware_timestamp, + "firmware_timestamp"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricTimestamp] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricTSClockCounter, + format_metric_row(m_gpu_metrics_tbl.m_system_clock_counter, + "system_clock_counter"))); + + + // GfxLock Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricGfxClkLockStatus] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricGfxClkLockStatus, + format_metric_row(m_gpu_metrics_tbl.m_gfxclk_lock_status, + "gfxclk_lock_status"))); + + // Link/Width/Speed Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieLinkWidth, + 
format_metric_row(m_gpu_metrics_tbl.m_pcie_link_width, + "pcie_link_width"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieLinkSpeed, + format_metric_row(m_gpu_metrics_tbl.m_pcie_link_speed, + "pcie_link_speed"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricXgmiLinkWidth, + format_metric_row(m_gpu_metrics_tbl.m_xgmi_link_width, + "xgmi_link_width"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricXgmiLinkSpeed, + format_metric_row(m_gpu_metrics_tbl.m_xgmi_link_speed, + "xgmi_link_speed"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieBandwidthAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_bandwidth_acc, + "pcie_bandwidth_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieBandwidthInst, + format_metric_row(m_gpu_metrics_tbl.m_pcie_bandwidth_inst, + "pcie_bandwidth_inst"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieL0RecovCountAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_l0_to_recov_count_acc, + "pcie_l0_recov_count_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieReplayCountAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_replay_count_acc, + "pcie_replay_count_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieReplayRollOverCountAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_replay_rover_count_acc, + 
"pcie_replay_rollover_count_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieNakSentCountAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_nak_sent_count_acc, + "pcie_nak_sent_count_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieNakReceivedCountAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_pcie_nak_rcvd_count_acc, + "pcie_nak_rcvd_count_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricXgmiReadDataAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_xgmi_read_data_acc, + "[xgmi_read_data_acc]"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricXgmiWriteDataAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_xgmi_write_data_acc, + "[xgmi_write_data_acc]"))); + + // CurrentClock Info + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricCurrentClock] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrGfxClock, + format_metric_row(m_gpu_metrics_tbl.m_current_gfxclk, + "[current_gfxclk]"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricCurrentClock] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrSocClock, + format_metric_row(m_gpu_metrics_tbl.m_current_socclk, + "[current_socclk]"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricCurrentClock] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrVClock0, + format_metric_row(m_gpu_metrics_tbl.m_current_vclk0, + "[current_vclk0]"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricCurrentClock] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrDClock0, + format_metric_row(m_gpu_metrics_tbl.m_current_dclk0, + "[current_dclk0]"))); + 
m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricCurrentClock] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricCurrUClock, + format_metric_row(m_gpu_metrics_tbl.m_current_uclk, + "current_uclk"))); + + /* Accumulation cycle counter */ + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricAccumulationCounter, + format_metric_row(m_gpu_metrics_tbl.m_accumulation_counter, + "accumulation_counter"))); + + /* Accumulated throttler residencies */ + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricProchotResidencyAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_prochot_residency_acc, + "prochot_residency_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPPTResidencyAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_ppt_residency_acc, + "ppt_residency_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricSocketThmResidencyAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_socket_thm_residency_acc, + "socket_thm_residency_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricVRThmResidencyAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_vr_thm_residency_acc, + "vr_thm_residency_acc"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricThrottleResidency] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricHBMThmResidencyAccumulator, + format_metric_row(m_gpu_metrics_tbl.m_hbm_thm_residency_acc, + "hbm_thm_residency_acc"))); + + /* Partition info */ + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricPartition] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kGpuMetricNumPartition, + 
format_metric_row(m_gpu_metrics_tbl.m_num_partition, + "num_partition"))); + + /* xcp_stats info */ + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricXcpStats] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricGfxBusyInst, + format_metric_row(m_gpu_metrics_tbl.m_xcp_stats->gfx_busy_inst, + "xcp_stats->gfx_busy_inst"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricXcpStats] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricVcnBusy, + format_metric_row(m_gpu_metrics_tbl.m_xcp_stats->vcn_busy, + "xcp_stats->vcn_busy"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricXcpStats] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricJpegBusy, + format_metric_row(m_gpu_metrics_tbl.m_xcp_stats->jpeg_busy, + "xcp_stats->jpeg_busy"))); + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricXcpStats] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricGfxBusyAcc, + format_metric_row(m_gpu_metrics_tbl.m_xcp_stats->gfx_busy_acc, + "xcp_stats->gfx_busy_acc"))); + + /* PCIE other end recovery counter info */ + m_metrics_dynamic_tbl[AMDGpuMetricsClassId_t::kGpuMetricLinkWidthSpeed] + .insert(std::make_pair(AMDGpuMetricsUnitType_t::kMetricPcieLCPerfOtherEndRecov, + format_metric_row(m_gpu_metrics_tbl.m_pcie_lc_perf_other_end_recovery, + "pcie_lc_perf_other_end_recovery"))); + + ss << __PRETTY_FUNCTION__ + << " | ======= end ======= " + << " | Success " + << " | Returning = " << getRSMIStatusString(status_code) + << " |"; + LOG_TRACE(ss); + + return status_code; +} + void GpuMetricsBase_v15_t::dump_internal_metrics_table() { - std::ostringstream ostrstream; + std::ostringstream ss; std::cout << __PRETTY_FUNCTION__ << " | ======= start ======= \n"; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= DEBUG ======= " << " | Metric Version: " << stringfy_metric_header_version(m_gpu_metrics_tbl.m_common_header) << " | Size: " << print_unsigned_int(m_gpu_metrics_tbl.m_common_header.m_structure_size) << " 
|" << "\n"; - ostrstream << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" + ss << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" << " temperature_mem: " << m_gpu_metrics_tbl.m_temperature_mem << "\n" << " temperature_vrsoc: " << m_gpu_metrics_tbl.m_temperature_vrsoc << "\n" @@ -486,21 +937,21 @@ void GpuMetricsBase_v15_t::dump_internal_metrics_table() << " average_gfx_activity: " << m_gpu_metrics_tbl.m_average_gfx_activity << "\n" << " average_umc_activity: " << m_gpu_metrics_tbl.m_average_umc_activity << "\n"; - ostrstream << " vcn_activity: " << "\n"; + ss << " vcn_activity: " << "\n"; auto idx = uint64_t(0); for (const auto& temp : m_gpu_metrics_tbl.m_vcn_activity) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " jpeg_activity: " << "\n"; + ss << " jpeg_activity: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_jpeg_activity) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " energy_accumulator: " << m_gpu_metrics_tbl.m_energy_accumulator << "\n" + ss << " energy_accumulator: " << m_gpu_metrics_tbl.m_energy_accumulator << "\n" << " system_clock_counter: " << m_gpu_metrics_tbl.m_system_clock_counter << "\n" << " throttle_status: " << m_gpu_metrics_tbl.m_throttle_status << "\n" @@ -527,83 +978,86 @@ void GpuMetricsBase_v15_t::dump_internal_metrics_table() << " pcie_nak_sent_count_acc: " << m_gpu_metrics_tbl.m_pcie_nak_sent_count_acc << "\n" << " pcie_nak_rcvd_count_acc: " << m_gpu_metrics_tbl.m_pcie_nak_rcvd_count_acc << "\n"; - ostrstream << " xgmi_read_data_acc: " << "\n"; + ss << " xgmi_read_data_acc: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_xgmi_read_data_acc) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " 
xgmi_write_data_acc: " << "\n"; + ss << " xgmi_write_data_acc: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_xgmi_write_data_acc) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n"; + ss << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n"; - ostrstream << " current_gfxclk: " << "\n"; + ss << " current_gfxclk: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_gfxclk) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_socclk: " << "\n"; + ss << " current_socclk: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_socclk) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_vclk0: " << "\n"; + ss << " current_vclk0: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_vclk0) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_dclk0: " << "\n"; + ss << " current_dclk0: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_dclk0) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " padding: " << m_gpu_metrics_tbl.m_padding << "\n"; - LOG_DEBUG(ostrstream); + ss << " padding: " << m_gpu_metrics_tbl.m_padding << "\n"; + LOG_DEBUG(ss); } -rsmi_status_t GpuMetricsBase_v15_t::populate_metrics_dynamic_tbl() -{ - std::ostringstream ostrstream; +rsmi_status_t GpuMetricsBase_v15_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - 
LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } // // Note: Any metric treatment/changes (if any) should happen before they // get written to internal/external tables. // auto run_metric_adjustments_v15 = [&]() { - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; const auto gpu_metrics_version = translate_flag_to_metric_version(get_gpu_metrics_version_used()); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Applying adjustments " << " | Metric Version: " << stringfy_metric_header_version( disjoin_metrics_version(gpu_metrics_version)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); // firmware_timestamp is at 10ns resolution - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= Changes ======= " << " | {m_firmware_timestamp} from: " << m_gpu_metrics_tbl.m_firmware_timestamp << " to: " << (m_gpu_metrics_tbl.m_firmware_timestamp * 10); m_gpu_metrics_tbl.m_firmware_timestamp = (m_gpu_metrics_tbl.m_firmware_timestamp * 10); - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); }; @@ -791,12 +1245,12 @@ rsmi_status_t GpuMetricsBase_v15_t::populate_metrics_dynamic_tbl() "current_uclk")) ); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } @@ -804,15 +1258,15 @@ rsmi_status_t GpuMetricsBase_v15_t::populate_metrics_dynamic_tbl() void GpuMetricsBase_v14_t::dump_internal_metrics_table() { - std::ostringstream ostrstream; + std::ostringstream ss; std::cout << __PRETTY_FUNCTION__ << " | ======= start ======= \n"; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= DEBUG ======= " << " | Metric 
Version: " << stringfy_metric_header_version(m_gpu_metrics_tbl.m_common_header) << " | Size: " << print_unsigned_int(m_gpu_metrics_tbl.m_common_header.m_structure_size) << " |" << "\n"; - ostrstream << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" + ss << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" << " temperature_mem: " << m_gpu_metrics_tbl.m_temperature_mem << "\n" << " temperature_vrsoc: " << m_gpu_metrics_tbl.m_temperature_vrsoc << "\n" @@ -821,14 +1275,14 @@ void GpuMetricsBase_v14_t::dump_internal_metrics_table() << " average_gfx_activity: " << m_gpu_metrics_tbl.m_average_gfx_activity << "\n" << " average_umc_activity: " << m_gpu_metrics_tbl.m_average_umc_activity << "\n"; - ostrstream << " vcn_activity: " << "\n"; + ss << " vcn_activity: " << "\n"; auto idx = uint64_t(0); for (const auto& temp : m_gpu_metrics_tbl.m_vcn_activity) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " energy_accumulator: " << m_gpu_metrics_tbl.m_energy_accumulator << "\n" + ss << " energy_accumulator: " << m_gpu_metrics_tbl.m_energy_accumulator << "\n" << " system_clock_counter: " << m_gpu_metrics_tbl.m_system_clock_counter << "\n" << " throttle_status: " << m_gpu_metrics_tbl.m_throttle_status << "\n" @@ -853,83 +1307,86 @@ void GpuMetricsBase_v14_t::dump_internal_metrics_table() << " pcie_replay_count_acc: " << m_gpu_metrics_tbl.m_pcie_replay_count_acc << "\n" << " pcie_replay_rover_count_acc: " << m_gpu_metrics_tbl.m_pcie_replay_rover_count_acc << "\n"; - ostrstream << " xgmi_read_data_acc: " << "\n"; + ss << " xgmi_read_data_acc: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_xgmi_read_data_acc) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " xgmi_write_data_acc: " << "\n"; + ss << " xgmi_write_data_acc: " << "\n"; idx = 0; for (const 
auto& temp : m_gpu_metrics_tbl.m_xgmi_write_data_acc) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n"; + ss << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n"; - ostrstream << " current_gfxclk: " << "\n"; + ss << " current_gfxclk: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_gfxclk) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_socclk: " << "\n"; + ss << " current_socclk: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_socclk) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_vclk0: " << "\n"; + ss << " current_vclk0: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_vclk0) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " current_dclk0: " << "\n"; + ss << " current_dclk0: " << "\n"; idx = 0; for (const auto& temp : m_gpu_metrics_tbl.m_current_dclk0) { - ostrstream << "\t [" << idx << "]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " padding: " << m_gpu_metrics_tbl.m_padding << "\n"; - LOG_DEBUG(ostrstream); + ss << " padding: " << m_gpu_metrics_tbl.m_padding << "\n"; + LOG_DEBUG(ss); } -rsmi_status_t GpuMetricsBase_v14_t::populate_metrics_dynamic_tbl() -{ - std::ostringstream ostrstream; +rsmi_status_t GpuMetricsBase_v14_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + 
if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } // // Note: Any metric treatment/changes (if any) should happen before they // get written to internal/external tables. // auto run_metric_adjustments_v14 = [&]() { - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; const auto gpu_metrics_version = translate_flag_to_metric_version(get_gpu_metrics_version_used()); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Applying adjustments " << " | Metric Version: " << stringfy_metric_header_version( disjoin_metrics_version(gpu_metrics_version)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); // firmware_timestamp is at 10ns resolution - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= Changes ======= " << " | {m_firmware_timestamp} from: " << m_gpu_metrics_tbl.m_firmware_timestamp << " to: " << (m_gpu_metrics_tbl.m_firmware_timestamp * 10); m_gpu_metrics_tbl.m_firmware_timestamp = (m_gpu_metrics_tbl.m_firmware_timestamp * 10); - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); }; @@ -1102,22 +1559,22 @@ rsmi_status_t GpuMetricsBase_v14_t::populate_metrics_dynamic_tbl() "current_uclk")) ); - ostrstream << __PRETTY_FUNCTION__ - << " | ======= end ======= " - << " | Success " - << " | Returning = " << getRSMIStatusString(status_code) - << " |"; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ + << " | ======= end ======= " + << " | Success " + << " | Returning = " << getRSMIStatusString(status_code) + << " |"; + LOG_TRACE(ss); return status_code; } rsmi_status_t init_max_public_gpu_matrics(AMGpuMetricsPublicLatest_t& rsmi_gpu_metrics) { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + 
LOG_TRACE(ss); rsmi_gpu_metrics.temperature_edge = init_max_uint_types(); rsmi_gpu_metrics.temperature_hotspot = init_max_uint_types(); @@ -1207,23 +1664,250 @@ rsmi_status_t init_max_public_gpu_matrics(AMGpuMetricsPublicLatest_t& rsmi_gpu_m rsmi_gpu_metrics.pcie_nak_sent_count_acc = init_max_uint_types(); rsmi_gpu_metrics.pcie_nak_rcvd_count_acc = init_max_uint_types(); + rsmi_gpu_metrics.accumulation_counter = init_max_uint_types(); + rsmi_gpu_metrics.prochot_residency_acc = init_max_uint_types(); + rsmi_gpu_metrics.ppt_residency_acc = init_max_uint_types(); + rsmi_gpu_metrics.socket_thm_residency_acc = init_max_uint_types(); + rsmi_gpu_metrics.vr_thm_residency_acc = init_max_uint_types(); + rsmi_gpu_metrics.hbm_thm_residency_acc = init_max_uint_types(); - ostrstream << __PRETTY_FUNCTION__ - << " | ======= end ======= " - << " | Success " - << " | Returning = " << getRSMIStatusString(status_code) - << " |"; - LOG_TRACE(ostrstream); + rsmi_gpu_metrics.num_partition = init_max_uint_types(); + + rsmi_gpu_metrics.pcie_lc_perf_other_end_recovery = + init_max_uint_types(); + + for (auto& row : rsmi_gpu_metrics.xcp_stats) { + std::fill(std::begin(row.gfx_busy_inst), std::end(row.gfx_busy_inst), + init_max_uint_types()); + std::fill(std::begin(row.jpeg_busy), std::end(row.jpeg_busy), + init_max_uint_types()); + std::fill(std::begin(row.vcn_busy), std::end(row.vcn_busy), + init_max_uint_types()); + std::fill(std::begin(row.gfx_busy_acc), std::end(row.gfx_busy_acc), + init_max_uint_types()); + } + + ss << __PRETTY_FUNCTION__ + << " | ======= end ======= " + << " | Success " + << " | Returning = " << getRSMIStatusString(status_code) + << " |"; + LOG_TRACE(ss); return status_code; } +AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v16_t::copy_internal_to_external_metrics() +{ + std::ostringstream ss; + auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + auto copy_data_from_internal_metrics_tbl = 
[&]() { + AMGpuMetricsPublicLatest_t metrics_public_init{}; + + // + // Note: Initializing data members with their max. If field is max, + // no data was assigned to it. + init_max_public_gpu_matrics(metrics_public_init); + + // Header + metrics_public_init.common_header.structure_size = m_gpu_metrics_tbl.m_common_header.m_structure_size; + metrics_public_init.common_header.format_revision = m_gpu_metrics_tbl.m_common_header.m_format_revision; + metrics_public_init.common_header.content_revision = m_gpu_metrics_tbl.m_common_header.m_content_revision; + + + // Temperature + metrics_public_init.temperature_hotspot = m_gpu_metrics_tbl.m_temperature_hotspot; + metrics_public_init.temperature_mem = m_gpu_metrics_tbl.m_temperature_mem; + metrics_public_init.temperature_vrsoc = m_gpu_metrics_tbl.m_temperature_vrsoc; + + // Power + metrics_public_init.current_socket_power = m_gpu_metrics_tbl.m_current_socket_power; + + // Utilization + metrics_public_init.average_gfx_activity = m_gpu_metrics_tbl.m_average_gfx_activity; + metrics_public_init.average_umc_activity = m_gpu_metrics_tbl.m_average_umc_activity; + + // Power/Energy + metrics_public_init.energy_accumulator = m_gpu_metrics_tbl.m_energy_accumulator; + + // Driver attached timestamp (in ns) + metrics_public_init.system_clock_counter = m_gpu_metrics_tbl.m_system_clock_counter; + + // Clock Lock Status. 
Each bit corresponds to clock instance + metrics_public_init.gfxclk_lock_status = m_gpu_metrics_tbl.m_gfxclk_lock_status; + + // Link width (number of lanes) and speed + metrics_public_init.pcie_link_width = m_gpu_metrics_tbl.m_pcie_link_width; + metrics_public_init.pcie_link_speed = m_gpu_metrics_tbl.m_pcie_link_speed; + + // XGMI bus width and bitrate + metrics_public_init.xgmi_link_width = m_gpu_metrics_tbl.m_xgmi_link_width; + metrics_public_init.xgmi_link_speed = m_gpu_metrics_tbl.m_xgmi_link_speed; + + // Utilization Accumulated + metrics_public_init.gfx_activity_acc = m_gpu_metrics_tbl.m_gfx_activity_acc; + metrics_public_init.mem_activity_acc = m_gpu_metrics_tbl.m_mem_activity_acc; + + // PCIE accumulated bandwidth + metrics_public_init.pcie_bandwidth_acc = m_gpu_metrics_tbl.m_pcie_bandwidth_acc; + + // PCIE instantaneous bandwidth + metrics_public_init.pcie_bandwidth_inst = m_gpu_metrics_tbl.m_pcie_bandwidth_inst; + + // PCIE L0 to recovery state transition accumulated count + metrics_public_init.pcie_l0_to_recov_count_acc = m_gpu_metrics_tbl.m_pcie_l0_to_recov_count_acc; + + // PCIE replay accumulated count + metrics_public_init.pcie_replay_count_acc = m_gpu_metrics_tbl.m_pcie_replay_count_acc; + + // PCIE replay rollover accumulated count + metrics_public_init.pcie_replay_rover_count_acc = m_gpu_metrics_tbl.m_pcie_replay_rover_count_acc; + + // PCIE NAK sent accumulated count + metrics_public_init.pcie_nak_sent_count_acc = m_gpu_metrics_tbl.m_pcie_nak_sent_count_acc; + + // PCIE NAK received accumulated count + metrics_public_init.pcie_nak_rcvd_count_acc = m_gpu_metrics_tbl.m_pcie_nak_rcvd_count_acc; + + // Accumulated throttler residencies + // bumped up public to uint64_t due to planned size increase for newer ASICs + metrics_public_init.accumulation_counter = m_gpu_metrics_tbl.m_accumulation_counter; + metrics_public_init.prochot_residency_acc = m_gpu_metrics_tbl.m_prochot_residency_acc; + metrics_public_init.ppt_residency_acc = 
m_gpu_metrics_tbl.m_ppt_residency_acc; + metrics_public_init.socket_thm_residency_acc = m_gpu_metrics_tbl.m_socket_thm_residency_acc; + metrics_public_init.vr_thm_residency_acc = m_gpu_metrics_tbl.m_vr_thm_residency_acc; + metrics_public_init.hbm_thm_residency_acc = m_gpu_metrics_tbl.m_hbm_thm_residency_acc; + + // XGMI accumulated data transfer size + // xgmi_read_data + const auto xgmi_read_data_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_xgmi_read_data_acc) - + std::begin(m_gpu_metrics_tbl.m_xgmi_read_data_acc)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_xgmi_read_data_acc), + xgmi_read_data_num_elems, + metrics_public_init.xgmi_read_data_acc); + // xgmi_write_data + const auto xgmi_write_data_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_xgmi_write_data_acc) - + std::begin(m_gpu_metrics_tbl.m_xgmi_write_data_acc)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_xgmi_write_data_acc), + xgmi_write_data_num_elems, + metrics_public_init.xgmi_write_data_acc); + + // PMFW attached timestamp (10ns resolution) + metrics_public_init.firmware_timestamp = m_gpu_metrics_tbl.m_firmware_timestamp; + + // Current clocks + // current_gfxclk + const auto curr_gfxclk_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_current_gfxclk) - + std::begin(m_gpu_metrics_tbl.m_current_gfxclk)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_current_gfxclk), + curr_gfxclk_num_elems, + metrics_public_init.current_gfxclks); + + // current_socclk + const auto curr_socclk_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_current_socclk) - + std::begin(m_gpu_metrics_tbl.m_current_socclk)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_current_socclk), + curr_socclk_num_elems, + metrics_public_init.current_socclks); + + // current_vclk0 + const auto curr_vclk0_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_current_vclk0) - + std::begin(m_gpu_metrics_tbl.m_current_vclk0)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_current_vclk0), + 
curr_vclk0_num_elems, + metrics_public_init.current_vclk0s); + + // current_dclk0 + const auto curr_dclk0_num_elems = + static_cast( + std::end(m_gpu_metrics_tbl.m_current_dclk0) - + std::begin(m_gpu_metrics_tbl.m_current_dclk0)); + std::copy_n(std::begin(m_gpu_metrics_tbl.m_current_dclk0), + curr_dclk0_num_elems, + metrics_public_init.current_dclk0s); + + metrics_public_init.current_uclk = m_gpu_metrics_tbl.m_current_uclk; + + metrics_public_init.num_partition = m_gpu_metrics_tbl.m_num_partition; + + metrics_public_init.pcie_lc_perf_other_end_recovery = + m_gpu_metrics_tbl.m_pcie_lc_perf_other_end_recovery; + + auto priv_it = std::begin(m_gpu_metrics_tbl.m_xcp_stats); + for (auto pub_it = std::begin(metrics_public_init.xcp_stats); + pub_it != std::end(metrics_public_init.xcp_stats); + ++pub_it, ++priv_it) { + std::copy_n(std::begin(priv_it->gfx_busy_inst), RSMI_MAX_NUM_XCC, + pub_it->gfx_busy_inst); + std::copy_n(std::begin(priv_it->jpeg_busy), RSMI_MAX_NUM_JPEG_ENGS, + pub_it->jpeg_busy); + std::copy_n(std::begin(priv_it->vcn_busy), RSMI_MAX_NUM_VCNS, + pub_it->vcn_busy); + std::copy_n(std::begin(priv_it->gfx_busy_acc), RSMI_MAX_NUM_XCC, + pub_it->gfx_busy_acc); + } + + // + // Note: Backwards compatibility -> Handling extra/exception cases + // related to earlier versions (1.3/1.4/1.5) + metrics_public_init.current_gfxclk = metrics_public_init.current_gfxclks[0]; + + metrics_public_init.current_socclk = metrics_public_init.current_socclks[0]; + + metrics_public_init.current_vclk0 = metrics_public_init.current_vclk0s[0]; + + metrics_public_init.current_vclk1 = metrics_public_init.current_vclk0s[1]; + + metrics_public_init.current_dclk0 = metrics_public_init.current_dclk0s[0]; + + metrics_public_init.current_dclk1 = metrics_public_init.current_dclk0s[1]; + + // separate by XCP + if (this->m_partition_id < kRSMI_MAX_NUM_XCP + && m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].vcn_busy[0] != UINT16_MAX) { + 
std::copy(std::begin(m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].vcn_busy), + std::end(m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].vcn_busy), + std::begin(metrics_public_init.vcn_activity)); + } + if (this->m_partition_id < kRSMI_MAX_NUM_XCP + && m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].jpeg_busy[0] != UINT16_MAX) { + std::copy(std::begin(m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].jpeg_busy), + std::end(m_gpu_metrics_tbl.m_xcp_stats[this->m_partition_id].jpeg_busy), + std::begin(metrics_public_init.jpeg_activity)); + } + + return metrics_public_init; + }(); + + ss << __PRETTY_FUNCTION__ + << " | ======= end ======= " + << " | Success " + << " | Returning = " << getRSMIStatusString(status_code) + << " |"; + LOG_TRACE(ss); + + return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); +} + AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v15_t::copy_internal_to_external_metrics() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto copy_data_from_internal_metrics_tbl = [&]() { AMGpuMetricsPublicLatest_t metrics_public_init{}; @@ -1398,22 +2082,22 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v15_t::copy_internal_to_external_m return metrics_public_init; }(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); } AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v14_t::copy_internal_to_external_metrics() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | 
======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto copy_data_from_internal_metrics_tbl = [&]() { AMGpuMetricsPublicLatest_t metrics_public_init{}; @@ -1573,27 +2257,27 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v14_t::copy_internal_to_external_m return metrics_public_init; }(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); } void GpuMetricsBase_v13_t::dump_internal_metrics_table() { - std::ostringstream ostrstream; + std::ostringstream ss; std::cout << __PRETTY_FUNCTION__ << " | ======= start ======= \n"; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= DEBUG ======= " << " | Metric Version: " << stringfy_metric_header_version(m_gpu_metrics_tbl.m_common_header) << " | Size: " << print_unsigned_int(m_gpu_metrics_tbl.m_common_header.m_structure_size) << " |" << "\n"; - ostrstream << " temperature_edge: " << m_gpu_metrics_tbl.m_temperature_edge << "\n" + ss << " temperature_edge: " << m_gpu_metrics_tbl.m_temperature_edge << "\n" << " temperature_hotspot: " << m_gpu_metrics_tbl.m_temperature_hotspot << "\n" << " temperature_mem: " << m_gpu_metrics_tbl.m_temperature_mem << "\n" << " temperature_vrgfx: " << m_gpu_metrics_tbl.m_temperature_vrgfx << "\n" @@ -1635,16 +2319,16 @@ void GpuMetricsBase_v13_t::dump_internal_metrics_table() << " gfx_activity_acc: " << m_gpu_metrics_tbl.m_gfx_activity_acc << "\n" << " mem_activity_acc: " << m_gpu_metrics_tbl.m_mem_activity_acc << "\n"; - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); - ostrstream << " temperature_hbm: " << "\n"; + ss << " temperature_hbm: " << "\n"; auto idx = uint64_t(0); for (const auto& temp : m_gpu_metrics_tbl.m_temperature_hbm) { - ostrstream << "\t [" << idx << 
"]: " << temp << "\n"; + ss << "\t [" << idx << "]: " << temp << "\n"; ++idx; } - ostrstream << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n" + ss << " firmware_timestamp: " << m_gpu_metrics_tbl.m_firmware_timestamp << "\n" << " voltage_soc: " << m_gpu_metrics_tbl.m_voltage_soc << "\n" << " voltage_gfx: " << m_gpu_metrics_tbl.m_voltage_gfx << "\n" @@ -1652,38 +2336,41 @@ void GpuMetricsBase_v13_t::dump_internal_metrics_table() << " padding1: " << m_gpu_metrics_tbl.m_padding1 << "\n" << " m_indep_throttle_status: " << m_gpu_metrics_tbl.m_indep_throttle_status << "\n"; - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); } -rsmi_status_t GpuMetricsBase_v13_t::populate_metrics_dynamic_tbl() -{ - std::ostringstream ostrstream; +rsmi_status_t GpuMetricsBase_v13_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } // // Note: Any metric treatment/changes (if any) should happen before they // get written to internal/external tables. 
// auto run_metric_adjustments_v13 = [&]() { - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; const auto gpu_metrics_version = translate_flag_to_metric_version(get_gpu_metrics_version_used()); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Applying adjustments " << " | Metric Version: " << stringfy_metric_header_version( disjoin_metrics_version(gpu_metrics_version)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); // firmware_timestamp is at 10ns resolution - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= Changes ======= " << " | {m_firmware_timestamp} from: " << m_gpu_metrics_tbl.m_firmware_timestamp << " to: " << (m_gpu_metrics_tbl.m_firmware_timestamp * 10); m_gpu_metrics_tbl.m_firmware_timestamp = (m_gpu_metrics_tbl.m_firmware_timestamp * 10); - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); }; @@ -1900,22 +2587,22 @@ rsmi_status_t GpuMetricsBase_v13_t::populate_metrics_dynamic_tbl() "voltage_mem")) ); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v13_t::copy_internal_to_external_metrics() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto copy_data_from_internal_metrics_tbl = [&]() { AMGpuMetricsPublicLatest_t metrics_public_init{}; @@ -2009,49 +2696,64 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v13_t::copy_internal_to_external_m // Note: Backwards compatibility -> Handling extra/exception cases // related to earlier versions (1.2) // 
metrics_public_init.current_socket_power = metrics_public_init.average_socket_power; + // average_mm_activity needs to not be UIN16_MAX and + // metrics_public_init.vcn_activity[0] should also be UIN16_MAX + if (metrics_public_init.average_mm_activity != UINT16_MAX + && metrics_public_init.vcn_activity[0] == UINT16_MAX) { + metrics_public_init.vcn_activity[0] = metrics_public_init.average_mm_activity; + } + // average_mm_activity needs to not be UIN16_MAX and + // metrics_public_init.xcp_stats->vcn_busy[0] should also be UIN16_MAX + if (metrics_public_init.average_mm_activity != UINT16_MAX + && metrics_public_init.xcp_stats->vcn_busy[0] == UINT16_MAX) { + metrics_public_init.xcp_stats->vcn_busy[0] = metrics_public_init.average_mm_activity; + } return metrics_public_init; }(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); } -rsmi_status_t GpuMetricsBase_v12_t::populate_metrics_dynamic_tbl() -{ - std::ostringstream ostrstream; +rsmi_status_t GpuMetricsBase_v12_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } // // Note: Any metric treatment/changes (if any) should happen before they // get written to internal/external tables. 
// auto run_metric_adjustments_v12 = [&]() { - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; const auto gpu_metrics_version = translate_flag_to_metric_version(get_gpu_metrics_version_used()); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Applying adjustments " << " | Metric Version: " << stringfy_metric_header_version( disjoin_metrics_version(gpu_metrics_version)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); // firmware_timestamp is at 10ns resolution - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= Changes ======= " << " | {m_firmware_timestamp} from: " << m_gpu_metrics_tbl.m_firmware_timestamp << " to: " << (m_gpu_metrics_tbl.m_firmware_timestamp * 10); m_gpu_metrics_tbl.m_firmware_timestamp = (m_gpu_metrics_tbl.m_firmware_timestamp * 10); - LOG_DEBUG(ostrstream); + LOG_DEBUG(ss); }; @@ -2246,22 +2948,22 @@ rsmi_status_t GpuMetricsBase_v12_t::populate_metrics_dynamic_tbl() "pcie_link_speed")) ); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v12_t::copy_internal_to_external_metrics() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto copy_data_from_internal_metrics_tbl = [&]() { AMGpuMetricsPublicLatest_t metrics_public_init{}; @@ -2347,37 +3049,40 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v12_t::copy_internal_to_external_m return metrics_public_init; }(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " 
| ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); } -rsmi_status_t GpuMetricsBase_v11_t::populate_metrics_dynamic_tbl() -{ - std::ostringstream ostrstream; +rsmi_status_t GpuMetricsBase_v11_t::populate_metrics_dynamic_tbl() { + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); + + if (!m_metrics_dynamic_tbl.empty()) { + m_metrics_dynamic_tbl.clear(); + } // // Note: Any metric treatment/changes (if any) should happen before they // get written to internal/external tables. // auto run_metric_adjustments_v11 = [&]() { - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; const auto gpu_metrics_version = translate_flag_to_metric_version(get_gpu_metrics_version_used()); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Applying adjustments " << " | Metric Version: " << stringfy_metric_header_version( disjoin_metrics_version(gpu_metrics_version)) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); }; @@ -2567,22 +3272,22 @@ rsmi_status_t GpuMetricsBase_v11_t::populate_metrics_dynamic_tbl() "pcie_link_speed")) ); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v11_t::copy_internal_to_external_metrics() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | 
======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); auto copy_data_from_internal_metrics_tbl = [&]() { AMGpuMetricsPublicLatest_t metrics_public_init{}; @@ -2665,12 +3370,12 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v11_t::copy_internal_to_external_m return metrics_public_init; }(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return std::make_tuple(status_code, copy_data_from_internal_metrics_tbl); } @@ -2678,19 +3383,18 @@ AMGpuMetricsPublicLatestTupl_t GpuMetricsBase_v11_t::copy_internal_to_external_m rsmi_status_t Device::dev_read_gpu_metrics_header_data() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); // Check if/when metrics table needs to be refreshed. 
- auto now_ts = actual_timestamp_in_secs(); auto op_result = readDevInfo(DevInfoTypes::kDevGpuMetrics, sizeof(AMDGpuMetricsHeader_v1_t), &m_gpu_metrics_header); if ((status_code = ErrnoToRsmiStatus(op_result)) != rsmi_status_t::RSMI_STATUS_SUCCESS) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2701,12 +3405,12 @@ rsmi_status_t Device::dev_read_gpu_metrics_header_data() << " Could not read Metrics Header: " << print_unsigned_int(m_gpu_metrics_header.m_structure_size) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } if ((status_code = is_gpu_metrics_version_supported(m_gpu_metrics_header)) == rsmi_status_t::RSMI_STATUS_NOT_SUPPORTED) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2717,12 +3421,12 @@ rsmi_status_t Device::dev_read_gpu_metrics_header_data() << " Could not read Metrics Header: " << print_unsigned_int(m_gpu_metrics_header.m_structure_size) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } m_gpu_metrics_updated_timestamp = actual_timestamp_in_secs(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -2731,16 +3435,16 @@ rsmi_status_t Device::dev_read_gpu_metrics_header_data() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } rsmi_status_t Device::dev_read_gpu_metrics_all_data() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); // At this point we should have a valid gpu_metrics pointer, and // we already read the header; 
setup_gpu_metrics_reading() @@ -2749,7 +3453,7 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() (!m_gpu_metrics_header.m_format_revision) || (!m_gpu_metrics_header.m_content_revision))) { status_code = rsmi_status_t::RSMI_STATUS_SETTING_UNAVAILABLE; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2758,7 +3462,7 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } @@ -2767,7 +3471,7 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() m_gpu_metrics_ptr->get_metrics_table().get()); if ((status_code = ErrnoToRsmiStatus(op_result)) != rsmi_status_t::RSMI_STATUS_SUCCESS) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2778,14 +3482,14 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() << " Could not read Metrics Header: " << print_unsigned_int(m_gpu_metrics_header.m_structure_size) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } // All metric units are pushed in. 
status_code = m_gpu_metrics_ptr->populate_metrics_dynamic_tbl(); if (status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2794,11 +3498,11 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); } m_gpu_metrics_updated_timestamp = actual_timestamp_in_secs(); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -2807,16 +3511,16 @@ rsmi_status_t Device::dev_read_gpu_metrics_all_data() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } rsmi_status_t Device::setup_gpu_metrics_reading() { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); status_code = dev_read_gpu_metrics_header_data(); if (status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) { @@ -2826,7 +3530,7 @@ rsmi_status_t Device::setup_gpu_metrics_reading() const auto gpu_metrics_flag_version = translate_header_to_flag_version(dev_get_metrics_header()); if (gpu_metrics_flag_version == AMDGpuMetricVersionFlags_t::kGpuMetricNone) { status_code = rsmi_status_t::RSMI_STATUS_NOT_SUPPORTED; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2837,7 +3541,7 @@ rsmi_status_t Device::setup_gpu_metrics_reading() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } @@ -2846,7 +3550,7 @@ rsmi_status_t 
Device::setup_gpu_metrics_reading() m_gpu_metrics_ptr = amdgpu_metrics_factory(gpu_metrics_flag_version); if (!m_gpu_metrics_ptr) { status_code = rsmi_status_t::RSMI_STATUS_UNEXPECTED_DATA; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2855,15 +3559,17 @@ rsmi_status_t Device::setup_gpu_metrics_reading() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } + m_gpu_metrics_ptr->set_device_id(m_device_id); + m_gpu_metrics_ptr->set_partition_id(m_partition_id); // // m_gpu_metrics_ptr has the pointer to the proper object type/version. status_code = dev_read_gpu_metrics_all_data(); if (status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2872,11 +3578,11 @@ rsmi_status_t Device::setup_gpu_metrics_reading() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -2886,7 +3592,7 @@ rsmi_status_t Device::setup_gpu_metrics_reading() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } @@ -2924,13 +3630,12 @@ auto get_casted_value(const AMDGpuDynamicMetricsValue_t& metrics_value) } -rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) -{ - std::ostringstream ostrstream; +rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) { + std::ostringstream ss; std::ostringstream tmp_outstream_metrics; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - 
LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); // If we still don't have a valid gpu_metrics pointer; // meaning, we didn't run any queries, and just want to @@ -2940,7 +3645,7 @@ rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) if ((status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) || (!m_gpu_metrics_ptr)) { // At this point we should have a valid gpu_metrics pointer. status_code = rsmi_status_t::RSMI_STATUS_UNEXPECTED_DATA; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -2949,7 +3654,7 @@ rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } @@ -3035,7 +3740,7 @@ rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) outstream_metrics << tmp_outstream_metrics.rdbuf(); LOG_DEBUG(tmp_outstream_metrics); - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -3045,22 +3750,21 @@ rsmi_status_t Device::dev_log_gpu_metrics(std::ostringstream& outstream_metrics) << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } AMGpuMetricsPublicLatestTupl_t Device::dev_copy_internal_to_external_metrics() { - std::ostringstream ostrstream; - std::ostringstream tmp_outstream_metrics; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); if (!m_gpu_metrics_ptr) { // At this point we should have a valid gpu_metrics pointer. 
status_code = rsmi_status_t::RSMI_STATUS_UNEXPECTED_DATA; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -3069,11 +3773,11 @@ AMGpuMetricsPublicLatestTupl_t Device::dev_copy_internal_to_external_metrics() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return std::make_tuple(status_code, AMGpuMetricsPublicLatest_t()); } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -3083,7 +3787,7 @@ AMGpuMetricsPublicLatestTupl_t Device::dev_copy_internal_to_external_metrics() << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return m_gpu_metrics_ptr->copy_internal_to_external_metrics(); } @@ -3091,15 +3795,15 @@ AMGpuMetricsPublicLatestTupl_t Device::dev_copy_internal_to_external_metrics() rsmi_status_t Device::run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t metric_counter, AMDGpuDynamicMetricTblValues_t& values) { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_NOT_SUPPORTED); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); status_code = setup_gpu_metrics_reading(); if ((status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) || (!m_gpu_metrics_ptr)) { status_code = rsmi_status_t::RSMI_STATUS_UNEXPECTED_DATA; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -3108,25 +3812,25 @@ rsmi_status_t Device::run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t met << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } // Lookup the dynamic table - 
ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= info ======= " << " | Device #: " << index() << " | Metric Version: " << stringfy_metrics_header(dev_get_metrics_header()) << " | Metric Unit: " << static_cast(metric_counter) << " |"; - LOG_INFO(ostrstream); + LOG_INFO(ss); const auto gpu_metrics_tbl = m_gpu_metrics_ptr->get_metrics_dynamic_tbl(); for (const auto& [metric_class, metric_data] : gpu_metrics_tbl) { for (const auto& [metric_unit, metric_values] : metric_data) { if (metric_unit == metric_counter) { values = metric_values; status_code = rsmi_status_t::RSMI_STATUS_SUCCESS; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << index() @@ -3135,13 +3839,13 @@ rsmi_status_t Device::run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t met << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } } } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << index() @@ -3149,7 +3853,7 @@ rsmi_status_t Device::run_internal_gpu_metrics_query(AMDGpuMetricsUnitType_t met << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } @@ -3182,10 +3886,10 @@ constexpr bool is_std_vector_type_supported_v() template rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnitType_t metric_counter, T& metric_value) { - std::ostringstream ostrstream; + std::ostringstream ss; auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - ostrstream << __PRETTY_FUNCTION__ << " | ======= start ======="; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ << " | ======= start ======="; + LOG_TRACE(ss); static constexpr bool is_supported_vector_type = [&]() { if constexpr (is_std_vector_v) { @@ -3203,7 +3907,7 @@ rsmi_status_t 
rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnit GET_DEV_FROM_INDX status_code = dev->run_internal_gpu_metrics_query(metric_counter, tmp_values); if ((status_code != rsmi_status_t::RSMI_STATUS_SUCCESS) || tmp_values.empty()) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << dv_ind @@ -3215,7 +3919,7 @@ rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnit << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } @@ -3238,14 +3942,14 @@ rsmi_status_t rsmi_dev_gpu_metrics_info_query(uint32_t dv_ind, AMDGpuMetricsUnit static_assert(is_dependent_false_v, "Error: Data Type not supported..."); } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Device #: " << dv_ind << " | Metric Type: " << static_cast(metric_counter) << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return status_code; } @@ -3281,9 +3985,9 @@ rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind, metrics_table_header_t& he { TRY auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); - std::ostringstream ostrstream; - ostrstream << __PRETTY_FUNCTION__ << "| ======= start ======="; - LOG_TRACE(ostrstream); + std::ostringstream ss; + ss << __PRETTY_FUNCTION__ << "| ======= start ======="; + LOG_TRACE(ss); GET_DEV_FROM_INDX status_code = dev->dev_read_gpu_metrics_header_data(); @@ -3292,14 +3996,14 @@ rsmi_dev_gpu_metrics_header_info_get(uint32_t dv_ind, metrics_table_header_t& he std::memcpy(&header_value, &tmp_header_info, sizeof(metrics_table_header_t)); } - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Success " << " | Device #: " << dv_ind << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_TRACE(ostrstream); + LOG_TRACE(ss); return 
status_code; CATCH @@ -3320,46 +4024,52 @@ rsmi_dev_gpu_metrics_info_get(uint32_t dv_ind, rsmi_gpu_metrics_t* smu) { auto status_code(rsmi_status_t::RSMI_STATUS_SUCCESS); std::ostringstream ostrstream; - ostrstream << __PRETTY_FUNCTION__ << "| ======= start ======="; - LOG_TRACE(ostrstream); + std::ostringstream ss; + + ss << __PRETTY_FUNCTION__ << "| ======= start ======="; + LOG_TRACE(ss); assert(smu != nullptr); if (smu == nullptr) { status_code = rsmi_status_t::RSMI_STATUS_INVALID_ARGS; - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << dv_ind << " | Returning = " << getRSMIStatusString(status_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return status_code; } + dev->set_smi_device_id(dv_ind); + uint32_t partition_id = 0; + auto ret = rsmi_dev_partition_id_get(dv_ind, &partition_id); + dev->set_smi_partition_id(partition_id); dev->dev_log_gpu_metrics(ostrstream); const auto [error_code, external_metrics] = dev->dev_copy_internal_to_external_metrics(); if (error_code != rsmi_status_t::RSMI_STATUS_SUCCESS) { - ostrstream << __PRETTY_FUNCTION__ + ss << __PRETTY_FUNCTION__ << " | ======= end ======= " << " | Fail " << " | Device #: " << dv_ind << " | Returning = " << getRSMIStatusString(error_code) << " |"; - LOG_ERROR(ostrstream); + LOG_ERROR(ss); return error_code; } *smu = external_metrics; - ostrstream << __PRETTY_FUNCTION__ - << " | ======= end ======= " - << " | Success " - << " | Device #: " << dv_ind - << " | Returning = " - << getRSMIStatusString(status_code) - << " |"; - LOG_TRACE(ostrstream); + ss << __PRETTY_FUNCTION__ + << " | ======= end ======= " + << " | Success " + << " | Device #: " << dv_ind + << " | Returning = " + << getRSMIStatusString(status_code) + << " |"; + LOG_TRACE(ss); return status_code; CATCH diff --git a/rocm_smi/src/rocm_smi_monitor.cc b/rocm_smi/src/rocm_smi_monitor.cc index 40d7e8e4ac..21adfae079 100644 --- a/rocm_smi/src/rocm_smi_monitor.cc 
+++ b/rocm_smi/src/rocm_smi_monitor.cc @@ -395,7 +395,6 @@ Monitor::setVoltSensorLabelMap(void) { volt_type_index_map_[t_type] = file_index; index_volt_type_map_.insert({file_index, t_type}); } - return 0; }; diff --git a/src/amd_smi/amd_smi.cc b/src/amd_smi/amd_smi.cc index 7b8a4fb032..4c25e43ad2 100644 --- a/src/amd_smi/amd_smi.cc +++ b/src/amd_smi/amd_smi.cc @@ -569,7 +569,9 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand amd::smi::AMDSmiProcessor* device = nullptr; amdsmi_status_t ret = amd::smi::AMDSmiSystem::getInstance() .handle_to_processor(processor_handle, &device); - if (ret != AMDSMI_STATUS_SUCCESS) return ret; + if (ret != AMDSMI_STATUS_SUCCESS) { + return ret; + } if (device->get_processor_type() != AMDSMI_PROCESSOR_TYPE_AMD_GPU) { return AMDSMI_STATUS_NOT_SUPPORTED; @@ -577,8 +579,9 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand amd::smi::AMDSmiGPUDevice* gpu_device = nullptr; amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device); - if (r != AMDSMI_STATUS_SUCCESS) + if (r != AMDSMI_STATUS_SUCCESS) { return r; + } struct drm_amdgpu_info_vram_gtt gtt; uint64_t vram_used = 0; @@ -592,13 +595,282 @@ amdsmi_status_t amdsmi_get_gpu_vram_usage(amdsmi_processor_handle processor_hand r = gpu_device->amdgpu_query_info(AMDGPU_INFO_VRAM_USAGE, sizeof(vram_used), &vram_used); - if (r != AMDSMI_STATUS_SUCCESS) return r; + if (r != AMDSMI_STATUS_SUCCESS) { + return r; + } vram_info->vram_used = static_cast(vram_used / (1024 * 1024)); return AMDSMI_STATUS_SUCCESS; } +static void system_wait(int milli_seconds) { + std::ostringstream ss; + auto start = std::chrono::high_resolution_clock::now(); + // 1 ms = 1000 us + int waitTime = milli_seconds * 1000; + + ss << __PRETTY_FUNCTION__ << " | " + << "** Waiting for " << std::dec << waitTime + << " us (" << waitTime/1000 << " ms) **"; + LOG_DEBUG(ss); + usleep(waitTime); + auto stop =
std::chrono::high_resolution_clock::now(); + auto duration = + std::chrono::duration_cast(stop - start); + ss << __PRETTY_FUNCTION__ << " | " + << "** Waiting took " << duration.count() / 1000 + << " milli-seconds **"; + LOG_DEBUG(ss); +} + +amdsmi_status_t amdsmi_get_violation_status(amdsmi_processor_handle processor_handle, + amdsmi_violation_status_t *violation_status) { + AMDSMI_CHECK_INIT(); + + std::ostringstream ss; + if (violation_status == nullptr) { + return AMDSMI_STATUS_INVAL; + } + // 1 sec = 1000 ms = 1000000 us + constexpr uint64_t kFASTEST_POLL_TIME_MS = 1; // fastest SMU FW sample time is 1ms + + violation_status->reference_timestamp = std::numeric_limits::max(); + violation_status->violation_timestamp = std::numeric_limits::max(); + violation_status->per_prochot_thrm = std::numeric_limits::max(); + violation_status->per_ppt_pwr = std::numeric_limits::max(); + violation_status->per_socket_thrm = std::numeric_limits::max(); + violation_status->per_vr_thrm = std::numeric_limits::max(); + violation_status->per_hbm_thrm = std::numeric_limits::max(); + + violation_status->active_prochot_thrm = std::numeric_limits::max(); + violation_status->active_ppt_pwr = std::numeric_limits::max(); + violation_status->active_socket_thrm = std::numeric_limits::max(); + violation_status->active_vr_thrm = std::numeric_limits::max(); + violation_status->active_hbm_thrm = std::numeric_limits::max(); + + const auto p1 = std::chrono::system_clock::now(); + auto current_time = std::chrono::duration_cast( + p1.time_since_epoch()).count(); + violation_status->reference_timestamp = current_time; + + amd::smi::AMDSmiProcessor* device = nullptr; + amdsmi_status_t ret = amd::smi::AMDSmiSystem::getInstance() + .handle_to_processor(processor_handle, &device); + if (ret != AMDSMI_STATUS_SUCCESS) { + return ret; + } + + if (device->get_processor_type() != AMDSMI_PROCESSOR_TYPE_AMD_GPU) { + return AMDSMI_STATUS_NOT_SUPPORTED; + } + + amd::smi::AMDSmiGPUDevice* gpu_device = nullptr; + 
amdsmi_status_t r = get_gpu_device_from_handle(processor_handle, &gpu_device); + if (r != AMDSMI_STATUS_SUCCESS) { + return r; + } + + amdsmi_gpu_metrics_t metric_info_a = {}; + amdsmi_status_t status = amdsmi_get_gpu_metrics_info( + processor_handle, &metric_info_a); + if (status != AMDSMI_STATUS_SUCCESS) { + return status; + } + + // if all of these values are "undefined" then the feature is not supported on the ASIC + if (metric_info_a.accumulation_counter == std::numeric_limits::max() + && metric_info_a.prochot_residency_acc == std::numeric_limits::max() + && metric_info_a.ppt_residency_acc == std::numeric_limits::max() + && metric_info_a.socket_thm_residency_acc == std::numeric_limits::max() + && metric_info_a.vr_thm_residency_acc == std::numeric_limits::max() + && metric_info_a.hbm_thm_residency_acc == std::numeric_limits::max()) { + ss << __PRETTY_FUNCTION__ + << " | ASIC does not support throttle violations!, " + << "returning AMDSMI_STATUS_NOT_SUPPORTED"; + LOG_INFO(ss); + return AMDSMI_STATUS_NOT_SUPPORTED; + } + + // wait 1ms before reading again + system_wait(static_cast(kFASTEST_POLL_TIME_MS)); + + amdsmi_gpu_metrics_t metric_info_b = {}; + status = amdsmi_get_gpu_metrics_info( + processor_handle, &metric_info_b); + if (status != AMDSMI_STATUS_SUCCESS) { + return status; + } + + ss << __PRETTY_FUNCTION__ << " | " + << "[gpu_metrics A] metric_info_a.accumulation_counter: " << std::dec + << metric_info_a.accumulation_counter + << "; metric_info_a.prochot_residency_acc: " << std::dec + << metric_info_a.prochot_residency_acc + << "; metric_info_a.ppt_residency_acc (pviol): " << std::dec + << metric_info_a.ppt_residency_acc + << "; metric_info_a.socket_thm_residency_acc (tviol): " << std::dec + << metric_info_a.socket_thm_residency_acc + << "; metric_info_a.vr_thm_residency_acc: " << std::dec + << metric_info_a.vr_thm_residency_acc + << "; metric_info_a.hbm_thm_residency_acc: " << std::dec + << metric_info_a.hbm_thm_residency_acc << "\n" + << " [gpu_metrics 
B] metric_info_b.accumulation_counter: " << std::dec + << metric_info_b.accumulation_counter + << "; metric_info_b.prochot_residency_acc: " << std::dec + << metric_info_b.prochot_residency_acc + << "; metric_info_b.ppt_residency_acc (pviol): " << std::dec + << metric_info_b.ppt_residency_acc + << "; metric_info_b.socket_thm_residency_acc (tviol): " << std::dec + << metric_info_b.socket_thm_residency_acc + << "; metric_info_b.vr_thm_residency_acc: " << std::dec + << metric_info_b.vr_thm_residency_acc + << "; metric_info_b.hbm_thm_residency_acc: " << std::dec + << metric_info_b.hbm_thm_residency_acc + << "\n"; + LOG_DEBUG(ss); + + if ( (metric_info_b.prochot_residency_acc != std::numeric_limits::max() + || metric_info_a.prochot_residency_acc != std::numeric_limits::max()) + && (metric_info_b.prochot_residency_acc >= metric_info_a.prochot_residency_acc) + && ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) { + violation_status->per_prochot_thrm = + (((metric_info_b.prochot_residency_acc - metric_info_a.prochot_residency_acc) * 100) / + (metric_info_b.accumulation_counter - metric_info_a.accumulation_counter)); + + if (violation_status->per_prochot_thrm > 0) { + violation_status->active_prochot_thrm = 1; + violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS; + } else { + violation_status->active_prochot_thrm = 0; + } + ss << __PRETTY_FUNCTION__ << " | " + << "ENTERED prochot_residency_acc | per_prochot_thrm: " << std::dec + << violation_status->per_prochot_thrm + << "%; active_prochot_thrm = " << std::dec + << violation_status->active_prochot_thrm << "\n"; + LOG_DEBUG(ss); + } + if ( (metric_info_b.ppt_residency_acc != std::numeric_limits::max() + || metric_info_a.ppt_residency_acc != std::numeric_limits::max()) + && (metric_info_b.ppt_residency_acc >= metric_info_a.ppt_residency_acc) + && ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) { + violation_status->per_ppt_pwr = + 
(((metric_info_b.ppt_residency_acc - metric_info_a.ppt_residency_acc) * 100) / + (metric_info_b.accumulation_counter - metric_info_a.accumulation_counter)); + + if (violation_status->per_ppt_pwr > 0) { + violation_status->active_ppt_pwr = 1; + violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS; + } else { + violation_status->active_ppt_pwr = 0; + } + ss << __PRETTY_FUNCTION__ << " | " + << "ENTERED ppt_residency_acc | per_ppt_pwr: " << std::dec + << violation_status->per_ppt_pwr + << "%; active_ppt_pwr = " << std::dec + << violation_status->active_ppt_pwr << "\n"; + LOG_DEBUG(ss); + } + if ( (metric_info_b.socket_thm_residency_acc != std::numeric_limits::max() + || metric_info_a.socket_thm_residency_acc != std::numeric_limits::max()) + && (metric_info_b.socket_thm_residency_acc >= metric_info_a.socket_thm_residency_acc) + && ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) { + violation_status->per_socket_thrm = + (((metric_info_b.socket_thm_residency_acc - + metric_info_a.socket_thm_residency_acc) * 100) / + (metric_info_b.accumulation_counter - metric_info_a.accumulation_counter)); + + if (violation_status->per_socket_thrm > 0) { + violation_status->active_socket_thrm = 1; + violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS; + } else { + violation_status->active_socket_thrm = 0; + } + ss << __PRETTY_FUNCTION__ << " | " + << "ENTERED socket_thm_residency_acc | per_socket_thrm: " << std::dec + << violation_status->per_socket_thrm + << "%; active_socket_thrm = " << std::dec + << violation_status->active_socket_thrm << "\n"; + LOG_DEBUG(ss); + } + if ( (metric_info_b.vr_thm_residency_acc != std::numeric_limits::max() + || metric_info_a.vr_thm_residency_acc != std::numeric_limits::max()) + && (metric_info_b.vr_thm_residency_acc >= metric_info_a.vr_thm_residency_acc) + && ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0)) { + violation_status->per_vr_thrm = +
(((metric_info_b.vr_thm_residency_acc - + metric_info_a.vr_thm_residency_acc) * 100) / + (metric_info_b.accumulation_counter - metric_info_a.accumulation_counter)); + + if (violation_status->per_vr_thrm > 0) { + violation_status->active_vr_thrm = 1; + violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS; + } else { + violation_status->active_vr_thrm = 0; + } + ss << __PRETTY_FUNCTION__ << " | " + << "ENTERED vr_thm_residency_acc | per_vr_thrm: " << std::dec + << violation_status->per_vr_thrm + << "%; active_vr_thrm = " << std::dec + << violation_status->active_vr_thrm << "\n"; + LOG_DEBUG(ss); + } + if ( (metric_info_b.hbm_thm_residency_acc != std::numeric_limits::max() + || metric_info_a.hbm_thm_residency_acc != std::numeric_limits::max()) + && (metric_info_b.hbm_thm_residency_acc >= metric_info_a.hbm_thm_residency_acc) + && ((metric_info_b.accumulation_counter - metric_info_a.accumulation_counter) > 0) ) { + violation_status->per_hbm_thrm = + (((metric_info_b.hbm_thm_residency_acc - + metric_info_a.hbm_thm_residency_acc) * 100) / + (metric_info_b.accumulation_counter - metric_info_a.accumulation_counter)); + + if (violation_status->per_hbm_thrm > 0) { + violation_status->active_hbm_thrm = 1; + violation_status->violation_timestamp = kFASTEST_POLL_TIME_MS; + } else { + violation_status->active_hbm_thrm = 0; + } + ss << __PRETTY_FUNCTION__ << " | " + << "ENTERED hbm_thm_residency_acc | per_hbm_thrm: " << std::dec + << violation_status->per_hbm_thrm + << "%; active_hbm_thrm = " << std::dec + << violation_status->active_hbm_thrm << "\n"; + LOG_DEBUG(ss); + } + + ss << __PRETTY_FUNCTION__ << " | " + << "RETURNING AMDSMI_STATUS_SUCCESS | " + << "violation_status->reference_timestamp (time since epoch): " << std::dec + << violation_status->reference_timestamp + << "; violation_status->violation_timestamp (ms): " << std::dec + << violation_status->violation_timestamp + << "; violation_status->per_prochot_thrm (%): " << std::dec + <<
violation_status->per_prochot_thrm + << "; violation_status->per_ppt_pwr (%): " << std::dec + << violation_status->per_ppt_pwr + << "; violation_status->per_socket_thrm (%): " << std::dec + << violation_status->per_socket_thrm + << "; violation_status->per_vr_thrm (%): " << std::dec + << violation_status->per_vr_thrm + << "; violation_status->per_hbm_thrm (%): " << std::dec + << violation_status->per_hbm_thrm + << "; violation_status->active_prochot_thrm (bool): " << std::dec + << static_cast(violation_status->active_prochot_thrm) + << "; violation_status->active_ppt_pwr (bool): " << std::dec + << static_cast(violation_status->active_ppt_pwr) + << "; violation_status->active_socket_thrm (bool): " << std::dec + << static_cast(violation_status->active_socket_thrm) + << "; violation_status->active_vr_thrm (bool): " << std::dec + << static_cast(violation_status->active_vr_thrm) + << "; violation_status->active_hbm_thrm (bool): " << std::dec + << static_cast(violation_status->active_hbm_thrm) + << "\n"; + LOG_INFO(ss); + + return AMDSMI_STATUS_SUCCESS; +} + amdsmi_status_t amdsmi_get_gpu_fan_rpms(amdsmi_processor_handle processor_handle, uint32_t sensor_ind, int64_t *speed) { return rsmi_wrapper(rsmi_dev_fan_rpms_get, processor_handle, sensor_ind, @@ -755,7 +1027,8 @@ amdsmi_get_gpu_asic_info(amdsmi_processor_handle processor_handle, amdsmi_asic_i // default to 0xffff as not supported info->oam_id = std::numeric_limits::max(); uint16_t tmp_oam_id = 0; - status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle, &(tmp_oam_id)); + status = rsmi_wrapper(rsmi_dev_xgmi_physical_id_get, processor_handle, + &(tmp_oam_id)); info->oam_id = tmp_oam_id; // default to 0xffffffff as not supported @@ -792,9 +1065,9 @@ amdsmi_status_t amdsmi_get_gpu_kfd_info(amdsmi_processor_handle processor_handle info->kfd_id = std::numeric_limits::max(); auto tmp_kfd_id = uint64_t(0); status = rsmi_wrapper(rsmi_dev_guid_get, processor_handle, &(tmp_kfd_id)); - if (status != 
AMDSMI_STATUS_SUCCESS) { - return status; - } else { + // Do not return early if this value fails + // continue to try getting all info + if (status == AMDSMI_STATUS_SUCCESS) { info->kfd_id = tmp_kfd_id; } @@ -802,12 +1075,22 @@ amdsmi_status_t amdsmi_get_gpu_kfd_info(amdsmi_processor_handle processor_handle info->node_id = std::numeric_limits::max(); auto tmp_node_id = uint32_t(0); status = rsmi_wrapper(rsmi_dev_node_id_get, processor_handle, &(tmp_node_id)); - if (status != AMDSMI_STATUS_SUCCESS) { - return status; - } else { + // Do not return early if this value fails + // continue to try getting all info + if (status == AMDSMI_STATUS_SUCCESS) { info->node_id = tmp_node_id; } + // default to 0xffffffff as not supported + info->current_partition_id = std::numeric_limits::max(); + auto tmp_current_partition_id = uint32_t(0); + status = rsmi_wrapper(rsmi_dev_partition_id_get, processor_handle, &(tmp_current_partition_id)); + // Do not return early if this value fails + // continue to try getting all info + if (status == AMDSMI_STATUS_SUCCESS) { + info->current_partition_id = tmp_current_partition_id; + } + return AMDSMI_STATUS_SUCCESS; } @@ -1279,8 +1562,11 @@ amdsmi_status_t amdsmi_get_gpu_metrics_info( amdsmi_gpu_metrics_t *pgpu_metrics) { AMDSMI_CHECK_INIT(); // nullptr api supported + if (pgpu_metrics != nullptr) { + *pgpu_metrics = {}; + } return rsmi_wrapper(rsmi_dev_gpu_metrics_info_get, processor_handle, - reinterpret_cast(pgpu_metrics)); + reinterpret_cast(pgpu_metrics)); } @@ -1449,7 +1735,6 @@ amdsmi_status_t amdsmi_get_clk_freq(amdsmi_processor_handle processor_handle, clk_type == AMDSMI_CLK_TYPE_VCLK1 || clk_type == AMDSMI_CLK_TYPE_DCLK0 || clk_type == AMDSMI_CLK_TYPE_DCLK1 ) { - // when f == nullptr -> check if metrics are supported amdsmi_gpu_metrics_t metric_info; amdsmi_gpu_metrics_t * metric_info_p = nullptr; @@ -2266,6 +2551,14 @@ amdsmi_status_t amdsmi_get_pcie_info(amdsmi_processor_handle processor_handle, a */ 
info->pcie_metric.pcie_nak_sent_count = translate_umax_or_assign_valuepcie_metric.pcie_nak_sent_count)> (metric_info.pcie_nak_sent_count_acc, (metric_info.pcie_nak_sent_count_acc)); + /** + * pcie_metric.pcie_lc_perf_other_end_recovery: (uint32_t) + */ + info->pcie_metric.pcie_lc_perf_other_end_recovery_count = + translate_umax_or_assign_valuepcie_metric.pcie_lc_perf_other_end_recovery_count)> ( + metric_info.pcie_lc_perf_other_end_recovery, + (metric_info.pcie_lc_perf_other_end_recovery)); return AMDSMI_STATUS_SUCCESS; } diff --git a/src/amd_smi/amd_smi_system.cc b/src/amd_smi/amd_smi_system.cc index 6ba65fc48f..87ae5ccb35 100644 --- a/src/amd_smi/amd_smi_system.cc +++ b/src/amd_smi/amd_smi_system.cc @@ -237,7 +237,20 @@ amdsmi_status_t AMDSmiSystem::get_gpu_socket_id(uint32_t index, return amd::smi::rsmi_to_amdsmi_status(ret); } +/** +* | Name | Field | KFD property KFD -> PCIe ID (uint64_t) +* -------------- | ------- | ---------------- | ---------------------------- | +* | Domain | [63:32] | "domain" | (DOMAIN & 0xFFFFFFFF) << 32 | +* | Partition id | [31:28] | "location id" | (LOCATION & 0xF0000000) | +* | Reserved | [27:16] | "location id" | N/A | +* | Bus | [15: 8] | "location id" | (LOCATION & 0xFF00) | +* | Device | [ 7: 3] | "location id" | (LOCATION & 0xF8) | +* | Function | [ 2: 0] | "location id" | (LOCATION & 0x7) | +*/ + uint64_t domain = (bdfid >> 32) & 0xffffffff; + // may need to identify with partition_id in the future as well... TBD + uint64_t partition_id = (bdfid >> 28) & 0xf; uint64_t bus = (bdfid >> 8) & 0xff; uint64_t device_id = (bdfid >> 3) & 0x1f; uint64_t function = bdfid & 0x7; @@ -246,8 +259,8 @@ amdsmi_status_t AMDSmiSystem::get_gpu_socket_id(uint32_t index, // represents a physical device. 
std::stringstream ss; ss << std::setfill('0') << std::uppercase << std::hex - << std::setw(4) << domain << ":" << std::setw(2) << bus << ":" - << std::setw(2) << device_id; + << std::setw(4) << domain << ":" << std::setw(2) << bus << ":" + << std::setw(2) << device_id; socket_id = ss.str(); return AMDSMI_STATUS_SUCCESS; } diff --git a/tests/amd_smi_test/functional/gpu_metrics_read.cc b/tests/amd_smi_test/functional/gpu_metrics_read.cc index f19d6a2768..10655233d4 100644 --- a/tests/amd_smi_test/functional/gpu_metrics_read.cc +++ b/tests/amd_smi_test/functional/gpu_metrics_read.cc @@ -46,7 +46,10 @@ #include #include +#include #include +#include +#include #include #include @@ -54,6 +57,7 @@ #include "amd_smi/amdsmi.h" #include "gpu_metrics_read.h" #include "../test_common.h" +#include "rocm_smi/rocm_smi_utils.h" TestGpuMetricsRead::TestGpuMetricsRead() : TestBase() { @@ -87,6 +91,7 @@ void TestGpuMetricsRead::Close() { } + void TestGpuMetricsRead::Run(void) { amdsmi_status_t err; @@ -101,9 +106,10 @@ void TestGpuMetricsRead::Run(void) { std::cout << "Device #" << std::to_string(i) << "\n"; IF_VERB(STANDARD) { + std::cout << "\n\n"; std::cout << "\t**GPU METRICS: Using static struct (Backwards Compatibility):\n"; } - amdsmi_gpu_metrics_t smu; + amdsmi_gpu_metrics_t smu = {}; err = amdsmi_get_gpu_metrics_info(processor_handles_[i], &smu); const char *status_string; amdsmi_status_code_to_string(err, &status_string); @@ -122,250 +128,250 @@ void TestGpuMetricsRead::Run(void) { IF_VERB(STANDARD) { std::cout << "METRIC TABLE HEADER:\n"; std::cout << "structure_size=" << std::dec - << static_cast(smu.common_header.structure_size) << '\n'; + << static_cast(smu.common_header.structure_size) << "\n"; std::cout << "format_revision=" << std::dec - << static_cast(smu.common_header.format_revision) << '\n'; + << static_cast(smu.common_header.format_revision) << "\n"; std::cout << "content_revision=" << std::dec - << static_cast(smu.common_header.content_revision) << '\n'; + << 
static_cast(smu.common_header.content_revision) << "\n"; + std::cout << "\n"; std::cout << "TIME STAMPS (ns):\n"; - std::cout << std::dec << "system_clock_counter=" - << smu.system_clock_counter << '\n'; - std::cout << "firmware_timestamp (10ns resolution)=" << std::dec - << smu.firmware_timestamp << '\n'; + std::cout << std::dec << "system_clock_counter=" << smu.system_clock_counter << "\n"; + std::cout << "firmware_timestamp (10ns resolution)=" << std::dec << smu.firmware_timestamp + << "\n"; + std::cout << "\n"; std::cout << "TEMPERATURES (C):\n"; - std::cout << std::dec << "temperature_edge= " - << static_cast(smu.temperature_edge) << '\n'; - std::cout << std::dec << "temperature_hotspot= " - << static_cast(smu.temperature_hotspot) << '\n'; - std::cout << std::dec << "temperature_mem= " - << static_cast(smu.temperature_mem) << '\n'; - std::cout << std::dec << "temperature_vrgfx= " - << static_cast(smu.temperature_vrgfx) << '\n'; - std::cout << std::dec << "temperature_vrsoc= " - << static_cast(smu.temperature_vrsoc) << '\n'; - std::cout << std::dec << "temperature_vrmem= " - << static_cast(smu.temperature_vrmem) << '\n'; - for (int i = 0; i < AMDSMI_NUM_HBM_INSTANCES; ++i) { - std::cout << "temperature_hbm[" << i << "]= " << std::dec - << static_cast(smu.temperature_hbm[i]) << '\n'; - } + std::cout << std::dec << "temperature_edge= " << smu.temperature_edge << "\n"; + std::cout << std::dec << "temperature_hotspot= " << smu.temperature_hotspot << "\n"; + std::cout << std::dec << "temperature_mem= " << smu.temperature_mem << "\n"; + std::cout << std::dec << "temperature_vrgfx= " << smu.temperature_vrgfx << "\n"; + std::cout << std::dec << "temperature_vrsoc= " << smu.temperature_vrsoc << "\n"; + std::cout << std::dec << "temperature_vrmem= " << smu.temperature_vrmem << "\n"; + std::cout << "temperature_hbm = ["; + std::copy(std::begin(smu.temperature_hbm), + std::end(smu.temperature_hbm), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << std::dec 
<< "]\n"; + std::cout << "\n"; std::cout << "UTILIZATION (%):\n"; - std::cout << std::dec << "average_gfx_activity=" - << static_cast(smu.average_gfx_activity) << '\n'; - std::cout << std::dec << "average_umc_activity=" - << static_cast(smu.average_umc_activity) << '\n'; - std::cout << std::dec << "average_mm_activity=" - << static_cast(smu.average_mm_activity) << '\n'; + std::cout << std::dec << "average_gfx_activity=" << smu.average_gfx_activity << "\n"; + std::cout << std::dec << "average_umc_activity=" << smu.average_umc_activity << "\n"; + std::cout << std::dec << "average_mm_activity=" << smu.average_mm_activity << "\n"; std::cout << std::dec << "vcn_activity= ["; - uint16_t size = static_cast( - sizeof(smu.vcn_activity)/sizeof(smu.vcn_activity[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.vcn_activity[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.vcn_activity[i]); - } - } + std::copy(std::begin(smu.vcn_activity), + std::end(smu.vcn_activity), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; + std::cout << "\n"; std::cout << std::dec << "jpeg_activity= ["; - size = static_cast( - sizeof(smu.jpeg_activity)/sizeof(smu.jpeg_activity[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.jpeg_activity[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.jpeg_activity[i]); - } - } + std::copy(std::begin(smu.jpeg_activity), + std::end(smu.jpeg_activity), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; + std::cout << "\n"; std::cout << "POWER (W)/ENERGY (15.259uJ per 1ns):\n"; - std::cout << std::dec << "average_socket_power=" - << static_cast(smu.average_socket_power) << '\n'; - std::cout << std::dec << "current_socket_power=" - << static_cast(smu.current_socket_power) << '\n'; - std::cout << std::dec << "energy_accumulator=" - << 
static_cast(smu.energy_accumulator) << '\n'; + std::cout << std::dec << "average_socket_power=" << smu.average_socket_power << "\n"; + std::cout << std::dec << "current_socket_power=" << smu.current_socket_power << "\n"; + std::cout << std::dec << "energy_accumulator=" << smu.energy_accumulator << "\n"; + std::cout << "\n"; std::cout << "AVG CLOCKS (MHz):\n"; - std::cout << std::dec << "average_gfxclk_frequency=" - << static_cast(smu.average_gfxclk_frequency) << '\n'; - std::cout << std::dec << "average_gfxclk_frequency=" - << static_cast(smu.average_gfxclk_frequency) << '\n'; - std::cout << std::dec << "average_uclk_frequency=" - << static_cast(smu.average_uclk_frequency) << '\n'; - std::cout << std::dec << "average_vclk0_frequency=" - << static_cast(smu.average_vclk0_frequency) << '\n'; - std::cout << std::dec << "average_dclk0_frequency=" - << static_cast(smu.average_dclk0_frequency) << '\n'; - std::cout << std::dec << "average_vclk1_frequency=" - << static_cast(smu.average_vclk1_frequency) << '\n'; - std::cout << std::dec << "average_dclk1_frequency=" - << static_cast(smu.average_dclk1_frequency) << '\n'; + std::cout << std::dec << "average_gfxclk_frequency=" << smu.average_gfxclk_frequency + << "\n"; + std::cout << std::dec << "average_gfxclk_frequency=" << smu.average_gfxclk_frequency + << "\n"; + std::cout << std::dec << "average_uclk_frequency=" << smu.average_uclk_frequency << "\n"; + std::cout << std::dec << "average_vclk0_frequency=" << smu.average_vclk0_frequency + << "\n"; + std::cout << std::dec << "average_dclk0_frequency=" << smu.average_dclk0_frequency + << "\n"; + std::cout << std::dec << "average_vclk1_frequency=" << smu.average_vclk1_frequency + << "\n"; + std::cout << std::dec << "average_dclk1_frequency=" << smu.average_dclk1_frequency + << "\n"; + std::cout << "\n"; std::cout << "CURRENT CLOCKS (MHz):\n"; - std::cout << std::dec << "current_gfxclk=" - << smu.current_gfxclk << '\n'; + std::cout << std::dec << "current_gfxclk=" << 
smu.current_gfxclk << "\n"; std::cout << std::dec << "current_gfxclks= ["; - size = static_cast( - sizeof(smu.current_gfxclks)/sizeof(smu.current_gfxclks[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.current_gfxclks[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.current_gfxclks[i]); - } - } + std::copy(std::begin(smu.current_gfxclks), + std::end(smu.current_gfxclks), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; - std::cout << std::dec << "current_socclk=" - << smu.current_socclk << '\n'; + + std::cout << std::dec << "current_socclk=" << smu.current_socclk << "\n"; std::cout << std::dec << "current_socclks= ["; - size = static_cast( - sizeof(smu.current_socclks)/sizeof(smu.current_socclks[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.current_socclks[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.current_socclks[i]); - } - } + std::copy(std::begin(smu.current_socclks), + std::end(smu.current_socclks), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; - std::cout << std::dec << "current_uclk=" - << static_cast(smu.current_uclk) << '\n'; - std::cout << std::dec << "current_vclk0=" - << static_cast(smu.current_vclk0) << '\n'; + + std::cout << std::dec << "current_uclk=" << smu.current_uclk << "\n"; + std::cout << std::dec << "current_vclk0=" << smu.current_vclk0 << "\n"; std::cout << std::dec << "current_vclk0s= ["; - size = static_cast( - sizeof(smu.current_vclk0s)/sizeof(smu.current_vclk0s[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.current_vclk0s[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.current_vclk0s[i]); - } - } + std::copy(std::begin(smu.current_vclk0s), + std::end(smu.current_vclk0s), + amd::smi::make_ostream_joiner(&std::cout, ", ")); 
std::cout << std::dec << "]\n"; - std::cout << std::dec << "current_dclk0=" - << smu.current_dclk0 << '\n'; + + std::cout << std::dec << "current_dclk0=" << smu.current_dclk0 << "\n"; std::cout << std::dec << "current_dclk0s= ["; - size = static_cast( - sizeof(smu.current_dclk0s)/sizeof(smu.current_dclk0s[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.current_dclk0s[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.current_dclk0s[i]); - } - } + std::copy(std::begin(smu.current_dclk0s), + std::end(smu.current_dclk0s), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; - std::cout << std::dec << "current_vclk1=" - << static_cast(smu.current_vclk1) << '\n'; - std::cout << std::dec << "current_dclk1=" - << static_cast(smu.current_dclk1) << '\n'; + + std::cout << std::dec << "current_vclk1=" << smu.current_vclk1 << "\n"; + std::cout << std::dec << "current_dclk1=" << smu.current_dclk1 << "\n"; + std::cout << "\n"; std::cout << "TROTTLE STATUS:\n"; - std::cout << std::dec << "throttle_status=" - << static_cast(smu.throttle_status) << '\n'; + std::cout << std::dec << "throttle_status=" << smu.throttle_status << "\n"; + std::cout << "\n"; std::cout << "FAN SPEED:\n"; - std::cout << std::dec << "current_fan_speed=" - << static_cast(smu.current_fan_speed) << '\n'; + std::cout << std::dec << "current_fan_speed=" << smu.current_fan_speed << "\n"; + std::cout << "\n"; std::cout << "LINK WIDTH (number of lanes) /SPEED (0.1 GT/s):\n"; - std::cout << "pcie_link_width=" - << std::to_string(smu.pcie_link_width) << '\n'; - std::cout << "pcie_link_speed=" - << std::to_string(smu.pcie_link_speed) << '\n'; - std::cout << "xgmi_link_width=" - << std::to_string(smu.xgmi_link_width) << '\n'; - std::cout << "xgmi_link_speed=" - << std::to_string(smu.xgmi_link_speed) << '\n'; + std::cout << "pcie_link_width=" << smu.pcie_link_width << "\n"; + std::cout << "pcie_link_speed=" << 
smu.pcie_link_speed << "\n"; + std::cout << "xgmi_link_width=" << smu.xgmi_link_width << "\n"; + std::cout << "xgmi_link_speed=" << smu.xgmi_link_speed << "\n"; std::cout << "\n"; std::cout << "Utilization Accumulated(%):\n"; - std::cout << "gfx_activity_acc=" - << std::dec << static_cast(smu.gfx_activity_acc) << '\n'; - std::cout << "mem_activity_acc=" - << std::dec << static_cast(smu.mem_activity_acc) << '\n'; + std::cout << "gfx_activity_acc=" << std::dec << smu.gfx_activity_acc << "\n"; + std::cout << "mem_activity_acc=" << std::dec << smu.mem_activity_acc << "\n"; std::cout << "\n"; std::cout << "XGMI ACCUMULATED DATA TRANSFER SIZE (KB):\n"; std::cout << std::dec << "xgmi_read_data_acc= ["; - size = static_cast( - sizeof(smu.xgmi_read_data_acc)/sizeof(smu.xgmi_read_data_acc[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.xgmi_read_data_acc[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.xgmi_read_data_acc[i]); - } - } + std::copy(std::begin(smu.xgmi_read_data_acc), + std::end(smu.xgmi_read_data_acc), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; + std::cout << std::dec << "xgmi_write_data_acc= ["; - size = static_cast( - sizeof(smu.xgmi_write_data_acc)/sizeof(smu.xgmi_write_data_acc[0])); - for (uint16_t i= 0; i < size; i++) { - if (i+1 < size) { - std::cout << std::dec << static_cast(smu.xgmi_write_data_acc[i]) << ", "; - } else { - std::cout << std::dec << static_cast(smu.xgmi_write_data_acc[i]); - } - } + std::copy(std::begin(smu.xgmi_write_data_acc), + std::end(smu.xgmi_write_data_acc), + amd::smi::make_ostream_joiner(&std::cout, ", ")); std::cout << std::dec << "]\n"; // Voltage (mV) - std::cout << "voltage_soc = " - << std::dec << static_cast(smu.voltage_soc) << "\n"; - std::cout << "voltage_soc = " - << std::dec << static_cast(smu.voltage_gfx) << "\n"; - std::cout << "voltage_mem = " - << std::dec << static_cast(smu.voltage_mem) << "\n"; 
+ std::cout << "voltage_soc = " << std::dec << smu.voltage_soc << "\n"; + std::cout << "voltage_gfx = " << std::dec << smu.voltage_gfx << "\n"; + std::cout << "voltage_mem = " << std::dec << smu.voltage_mem << "\n"; - std::cout << "indep_throttle_status = " - << std::dec << static_cast(smu.indep_throttle_status) << "\n"; + std::cout << "indep_throttle_status = " << std::dec << smu.indep_throttle_status << "\n"; // Clock Lock Status. Each bit corresponds to clock instance - std::cout << "gfxclk_lock_status (in hex) = " - << std::hex << static_cast(smu.gfxclk_lock_status) << std::dec <<"\n"; + std::cout << "gfxclk_lock_status (in hex) = " << std::hex + << smu.gfxclk_lock_status << std::dec <<"\n"; // Bandwidth (GB/sec) - std::cout << "pcie_bandwidth_acc=" << std::dec - << static_cast(smu.pcie_bandwidth_acc) << "\n"; - std::cout << "pcie_bandwidth_inst=" << std::dec - << static_cast(smu.pcie_bandwidth_inst) << "\n"; + std::cout << "pcie_bandwidth_acc=" << std::dec << smu.pcie_bandwidth_acc << "\n"; + std::cout << "pcie_bandwidth_inst=" << std::dec << smu.pcie_bandwidth_inst << "\n"; // Counts - std::cout << "pcie_l0_to_recov_count_acc= " << std::dec - << static_cast(smu.pcie_l0_to_recov_count_acc) << "\n"; - std::cout << "pcie_replay_count_acc= " << std::dec - << static_cast(smu.pcie_replay_count_acc) << "\n"; + std::cout << "pcie_l0_to_recov_count_acc= " << std::dec << smu.pcie_l0_to_recov_count_acc + << "\n"; + std::cout << "pcie_replay_count_acc= " << std::dec << smu.pcie_replay_count_acc << "\n"; std::cout << "pcie_replay_rover_count_acc= " << std::dec - << static_cast(smu.pcie_replay_rover_count_acc) << "\n"; - std::cout << "pcie_nak_rcvd_count_acc= " << std::dec - << static_cast(smu.pcie_nak_rcvd_count_acc) << "\n"; - std::cout << "pcie_replay_rover_count_acc= " << std::dec - << static_cast(smu.pcie_replay_rover_count_acc) << "\n"; + << smu.pcie_replay_rover_count_acc << "\n"; + std::cout << "pcie_nak_sent_count_acc= " << std::dec << smu.pcie_nak_sent_count_acc 
+ << "\n"; + std::cout << "pcie_nak_rcvd_count_acc= " << std::dec << smu.pcie_nak_rcvd_count_acc + << "\n"; - // Check for constant changes/refresh metrics + // Accumulation cycle counter + // Accumulated throttler residencies std::cout << "\n"; + std::cout << "RESIDENCY ACCUMULATION / COUNTER:\n"; + std::cout << "accumulation_counter = " << std::dec << smu.accumulation_counter << "\n"; + std::cout << "prochot_residency_acc = " << std::dec << smu.prochot_residency_acc << "\n"; + std::cout << "ppt_residency_acc = " << std::dec << smu.ppt_residency_acc << "\n"; + std::cout << "socket_thm_residency_acc = " << std::dec << smu.socket_thm_residency_acc + << "\n"; + std::cout << "vr_thm_residency_acc = " << std::dec << smu.vr_thm_residency_acc + << "\n"; + std::cout << "hbm_thm_residency_acc = " << std::dec << smu.hbm_thm_residency_acc << "\n"; + + // Number of current partitions + std::cout << "num_partition = " << std::dec << smu.num_partition << "\n"; + + // PCIE other end recovery counter + std::cout << "pcie_lc_perf_other_end_recovery = " + << std::dec << smu.pcie_lc_perf_other_end_recovery << "\n"; + + std::cout << std::dec << "xcp_stats.gfx_busy_inst = \n"; + auto xcp = 0; + for (auto& row : smu.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.gfx_busy_inst), + std::end(row.gfx_busy_inst), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.jpeg_busy = \n"; + for (auto& row : smu.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.jpeg_busy), + std::end(row.jpeg_busy), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.vcn_busy = \n"; + for (auto& row : smu.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.vcn_busy), + std::end(row.vcn_busy), + 
amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + xcp = 0; + std::cout << std::dec << "xcp_stats.gfx_busy_acc = \n"; + for (auto& row : smu.xcp_stats) { + std::cout << "XCP[" << xcp << "] = " << "[ "; + std::copy(std::begin(row.gfx_busy_acc), + std::end(row.gfx_busy_acc), + amd::smi::make_ostream_joiner(&std::cout, ", ")); + std::cout << " ]\n"; + xcp++; + } + + std::cout << "\n\n"; std::cout << "\t ** -> Checking metrics with constant changes ** " << "\n"; constexpr uint16_t kMAX_ITER_TEST = 10; - amdsmi_gpu_metrics_t gpu_metrics_check; + amdsmi_gpu_metrics_t gpu_metrics_check = {}; for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) { - amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check); - std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.firmware_timestamp << "\n"; + amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check); + std::cout << "\t\t -> firmware_timestamp [" << idx << "/" << kMAX_ITER_TEST << "]: " + << gpu_metrics_check.firmware_timestamp << "\n"; } std::cout << "\n"; for (auto idx = uint16_t(1); idx <= kMAX_ITER_TEST; ++idx) { - amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check); - std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " << gpu_metrics_check.system_clock_counter << "\n"; + amdsmi_get_gpu_metrics_info(processor_handles_[i], &gpu_metrics_check); + std::cout << "\t\t -> system_clock_counter [" << idx << "/" << kMAX_ITER_TEST << "]: " + << gpu_metrics_check.system_clock_counter << "\n"; } + std::cout << "\n"; + std::cout << " ** Note: Values MAX'ed out " + << "(UINTX MAX are unsupported for the version in question) ** " << "\n\n"; } } @@ -377,5 +383,13 @@ void TestGpuMetricsRead::Run(void) { amdsmi_status_code_to_string(err, &status_string); std::cout << "\t\t** amdsmi_get_gpu_metrics_info(nullptr check): " << status_string << "\n"; ASSERT_EQ(err, 
AMDSMI_STATUS_INVAL); + + + // TODO(AMD_SMI_team): add xcd_counter_get for amd smi + // auto temp_xcd_counter_value = uint16_t(0); + // err = rsmi_dev_metrics_xcd_counter_get(i, &temp_xcd_counter_value); + // if (err != RSMI_STATUS_NOT_SUPPORTED) { + // CHK_ERR_ASRT(err); + // } } } diff --git a/tests/amd_smi_test/functional/sys_info_read.cc b/tests/amd_smi_test/functional/sys_info_read.cc index 35d40e0818..c11ef16d61 100644 --- a/tests/amd_smi_test/functional/sys_info_read.cc +++ b/tests/amd_smi_test/functional/sys_info_read.cc @@ -183,22 +183,26 @@ void TestSysInfoRead::Run(void) { } } - // kfd_id, node_id + // kfd_id, node_id, current_partition_id amdsmi_kfd_info_t kfd_info = {}; err = amdsmi_get_gpu_kfd_info(processor_handles_[i], &kfd_info); if (err != AMDSMI_STATUS_SUCCESS) { EXPECT_EQ(kfd_info.kfd_id, std::numeric_limits::max()); EXPECT_EQ(kfd_info.node_id, std::numeric_limits::max()); + EXPECT_EQ(kfd_info.current_partition_id, std::numeric_limits::max()); } else { IF_VERB(STANDARD) { std::cout << "\t**KFD ID: " << std::dec << kfd_info.kfd_id << "\n"; std::cout << "\t**Node ID: " << std::dec << kfd_info.node_id << "\n"; + std::cout << "\t**Current Parition ID: " << std::dec + << kfd_info.current_partition_id << "\n"; } EXPECT_EQ(err, AMDSMI_STATUS_SUCCESS); EXPECT_NE(kfd_info.kfd_id, std::numeric_limits::max()); EXPECT_NE(kfd_info.node_id, std::numeric_limits::max()); + EXPECT_NE(kfd_info.current_partition_id, std::numeric_limits::max()); } // Verify api support checking functionality is working err = amdsmi_get_gpu_kfd_info(processor_handles_[i], nullptr); diff --git a/tests/python_unittest/integration_test.py b/tests/python_unittest/integration_test.py index 51cd8e08a0..bfae5b65f6 100755 --- a/tests/python_unittest/integration_test.py +++ b/tests/python_unittest/integration_test.py @@ -114,6 +114,8 @@ class TestAmdSmiPythonInterface(unittest.TestCase): kfd_info['kfd_id'])) print(" kfd_info['node_id'] is: {}".format( kfd_info['node_id'])) + print(" 
kfd_info['current_partition_id'] is: {}\n".format( + kfd_info['current_partition_id'])) print() self.tearDown() @@ -527,6 +529,8 @@ class TestAmdSmiPythonInterface(unittest.TestCase): pcie_info['pcie_metric']['pcie_nak_sent_count'])) print(" pcie_info['pcie_metric']['pcie_nak_received_count'] is: {}".format( pcie_info['pcie_metric']['pcie_nak_received_count'])) + print(" pcie_info['pcie_metric']['pcie_lc_perf_other_end_recovery_count'] is: {}".format( + pcie_info['pcie_metric']['pcie_lc_perf_other_end_recovery_count'])) print() self.tearDown() @@ -844,6 +848,65 @@ class TestAmdSmiPythonInterface(unittest.TestCase): self.tearDown() # @unittest.SkipTest + @handle_exceptions + def test_accelerator_partition_profile(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_gpu_accelerator_partition_profile \n") + accelerator_partition = amdsmi.amdsmi_get_gpu_accelerator_partition_profile(processors[i]) + print(" Current partition id: {}".format( + accelerator_partition['partition_id'])) + print() + self.tearDown() + + # Only supported on MI300+ ASICs + @handle_exceptions + def test_get_violation_status(self): + self.setUp() + processors = amdsmi.amdsmi_get_processor_handles() + self.assertGreaterEqual(len(processors), 1) + self.assertLessEqual(len(processors), 32) + for i in range(0, len(processors)): + bdf = amdsmi.amdsmi_get_gpu_device_bdf(processors[i]) + print("\n\n###Test Processor {}, bdf: {}".format(i, bdf)) + print("\n###Test amdsmi_get_violation_status \n") + + violation_status = amdsmi.amdsmi_get_violation_status(processors[i]) + print(" Reference Timestamp: {}".format( + violation_status['reference_timestamp'])) + print(" Violation Timestamp: {}".format( + 
violation_status['violation_timestamp'])) + + print(" Prochot Thrm Violation (%): {}".format( + violation_status['per_prochot_thrm'])) + print(" PVIOL (per_ppt_pwr) (%): {}".format( + violation_status['per_ppt_pwr'])) + print(" TVIOL (per_socket_thrm) (%): {}".format( + violation_status['per_socket_thrm'])) + print(" VR_THRM Violation (%): {}".format( + violation_status['per_vr_thrm'])) + print(" HBM Thrm Violation (%): {}".format( + violation_status['per_hbm_thrm'])) + + print(" Prochot Thrm Violation (bool): {}".format( + violation_status['active_prochot_thrm'])) + print(" PVIOL (active_ppt_pwr) (bool): {}".format( + violation_status['active_ppt_pwr'])) + print(" TVIOL (active_socket_thrm) (bool): {}".format( + violation_status['active_socket_thrm'])) + print(" VR_THRM Violation (bool): {}".format( + violation_status['active_vr_thrm'])) + print(" HBM Thrm Violation (bool): {}".format( + violation_status['active_hbm_thrm'])) + print() + self.tearDown() + + def test_walkthrough(self): print("\n\n#######################################################################") print("========> test_walkthrough start <========\n") From 2c8e2060cbcf39cb86f98d43bd01a4ab87abc0be Mon Sep 17 00:00:00 2001 From: Maisam Arif Date: Fri, 27 Sep 2024 12:55:39 -0500 Subject: [PATCH 5/8] Adjusted throttle unit logic in amdsmi_commands.py Signed-off-by: Maisam Arif Change-Id: Icce949ff93f45c9751f43df0a80614fd377318fa --- amdsmi_cli/amdsmi_commands.py | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/amdsmi_cli/amdsmi_commands.py b/amdsmi_cli/amdsmi_commands.py index 159ee69ceb..f7c0991a5f 100644 --- a/amdsmi_cli/amdsmi_commands.py +++ b/amdsmi_cli/amdsmi_commands.py @@ -2145,15 +2145,13 @@ class AMDSMICommands(): logging.debug("Failed to get violation status' for gpu %s | %s", gpu_id, e.get_error_info()) for key, value in throttle_status.items(): - if ("active" in key) and (value is True): - throttle_status[key] = "ACTIVE" - continue - elif ("active" in key) 
and (value is False): + if "active" in key: throttle_status[key] = "NOT ACTIVE" + if value: + throttle_status[key] = "ACTIVE" continue - if "percent" in key: - True # continue with rest of logic - else: + + if "percent" not in key: continue activity_unit = '%' From 4e2fc2d6049b211952cec4259e261cd7d326731a Mon Sep 17 00:00:00 2001 From: gabrpham Date: Wed, 25 Sep 2024 22:44:38 -0500 Subject: [PATCH 6/8] Added `amd-smi partition` as preliminary command. new command includes following arguments: - current - display the current partition information for the selected gpu(s) - memory - display memory partition information for the selected gpu(s) - accelerator - display accelerator partition information for the selected gpu(s) additional functionality will be added as more partition APIs are added. Signed-off-by: gabrpham Change-Id: Ica86160139002ef5213d6d4b0e390670aeef01c8 --- CHANGELOG.md | 63 ++++------ amdsmi_cli/amdsmi_cli.py | 4 +- amdsmi_cli/amdsmi_commands.py | 205 +++++++++++++++++++++++++++++++ amdsmi_cli/amdsmi_logger.py | 26 +++- amdsmi_cli/amdsmi_parser.py | 39 +++++- py-interface/amdsmi_interface.py | 2 +- 6 files changed, 294 insertions(+), 45 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index e0c1c745ac..3caaedcd35 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -7,14 +7,15 @@ Full documentation for amd_smi_lib is available at [https://rocm.docs.amd.com/pr ## amd_smi_lib for ROCm 6.3.0 ### Changes -- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`** + +- **Added support for GPU metrics 1.6 to `amdsmi_get_gpu_metrics_info()`**. 
Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to include new fields for PVIOL / TVIOL, XCP (Graphics Compute Partitions) stats, and pcie_lc_perf_other_end_recovery: - `uint64_t accumulation_counter` - used for all throttled calculations - `uint64_t prochot_residency_acc` - Processor hot accumulator - `uint64_t ppt_residency_acc` - Package Power Tracking (PPT) accumulator (used in PVIOL calculations) - `uint64_t socket_thm_residency_acc` - Socket thermal accumulator - (used in TVIOL calculations) - `uint64_t vr_thm_residency_acc` - Voltage Rail (VR) thermal accumulator - - `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator + - `uint64_t hbm_thm_residency_acc` - High Bandwidth Memory (HBM) thermal accumulator - `uint16_t num_partition` - corresponds to the current total number of partitions - `struct amdgpu_xcp_metrics_t xcp_stats[MAX_NUM_XCP]` - for each partition associated with current GPU, provides gfx busy & accumulators, jpeg, and decoder (VCN) engine utilizations - `uint32_t gfx_busy_inst[MAX_NUM_XCC]` - graphic engine utilization (%) @@ -23,11 +24,12 @@ Updated `amdsmi_get_gpu_metrics_info()` and structure `amdsmi_gpu_metrics_t` to - `uint64_t gfx_busy_acc[MAX_NUM_XCC]` - graphic engine utilization accumulated (%) - `uint32_t pcie_lc_perf_other_end_recovery` - corresponds to the pcie other end recovery counter -- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`** +- **Added new violation status outputs and APIs: `amdsmi_status_t amdsmi_get_violation_status()`, `amd-smi metric --throttle`, and `amd-smi monitor --violation`**. ***Only available for MI300+ ASICs.*** Users can now retrieve violation status' through either our Python or C++ APIs. Additionally, we have added capability to view these outputs conviently through `amd-smi metric --throttle` and `amd-smi monitor --violation`. 
Example outputs are listed below (below is for reference, output is subject to change): + ```shell $ amd-smi metric --throttle GPU: 0 @@ -69,6 +71,7 @@ GPU: 1 HBM_THERMAL_VIOLATION_PERCENT: 0 % ... ``` + ```shell $ amd-smi monitor --violation GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL @@ -91,12 +94,12 @@ GPU PVIOL TVIOL PHOT_TVIOL VR_TVIOL HBM_TVIOL ... ``` -- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`** +- **Added ability to view XCP (Graphics Compute Partition) activity within `amd-smi metric --usage`**. ***Partition specific features are only available on MI300+ ASICs*** Users can now retrieve graphic utilization statistic on a per-XCP (per-partition) basis. Here all XCP activities will be listed, - but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`. - + but the current XCP is the partition id listed under both `amd-smi list` and `amd-smi static --partition`. Example outputs are listed below (below is for reference, output is subject to change): + ```shell $ amd-smi metric --usage GPU: 0 @@ -161,7 +164,6 @@ GPU: 0 XCP_6: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] XCP_7: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] - GPU: 1 USAGE: GFX_ACTIVITY: 0 % @@ -227,9 +229,10 @@ GPU: 1 ... ``` -- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value** +- **Added `LC_PERF_OTHER_END_RECOVERY` CLI output to `amd-smi metric --pcie` and updated `amdsmi_get_pcie_info()` to include this value**. ***Feature is only available on MI300+ ASICs*** Users can now retrieve both through `amdsmi_get_pcie_info()` which has an updated structure: + ```C typedef struct { ... 
@@ -247,9 +250,10 @@ typedef struct { } pcie_metric; uint64_t reserved[32]; } amdsmi_pcie_info_t; -``` +``` + + - Example outputs are listed below (below is for reference, output is subject to change): - Example outputs are listed below (below is for reference, output is subject to change): ```shell $ amd-smi metric --pcie GPU: 0 @@ -284,7 +288,7 @@ GPU: 1 ... ``` -- **Updated BDF commands to look use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`** +- **Updated BDF commands to look use KFD SYSFS for BDF: `amdsmi_get_gpu_device_bdf()`**. This aligns BDF output with ROCm SMI. See below for overview as seen from `rsmi_dev_pci_id_get()` now provides partition ID. See API for better detail. Previously these bits were reserved bits (right before domain) and partition id was within function. - bits [63:32] = domain @@ -292,7 +296,6 @@ See below for overview as seen from `rsmi_dev_pci_id_get()` now provides partiti - bits [27:16] = reserved - bits [15: 0] = pci bus/device/function - - **Moved python tests directory path install location**. - `/opt//share/amd_smi/pytest/..` to `/opt//share/amd_smi/tests/python_unittest/..` - On amd-smi-lib-tests uninstall, the amd_smi tests folder is removed. @@ -306,9 +309,7 @@ See below for overview as seen from `rsmi_dev_pci_id_get()` now provides partiti - **Added `amd-smi set -L/--clk-limit ...` command**. Equivalent to rocm-smi's '--extremum' command which sets sclk's or mclk's soft minimum or soft maximum clock frequency. - - -- **Added Pytest functionality to test amdsmi API calls in Python**. +- **Added unittest functionality to test amdsmi API calls in Python**. - **Changed the `power` parameter in `amdsmi_get_energy_count()` to `energy_accumulator`**. - Changes propagate forwards into the python interface as well, however we are maintaing backwards compatibility and keeping the `power` field in the python API until ROCm 6.4. 
@@ -341,7 +342,6 @@ Topology arguments: ID: 7 | BDF: 0000:df:00.0 | UUID: all | Selects all devices - -a, --access Displays link accessibility between GPUs -w, --weight Displays relative weight between GPUs -o, --hops Displays the number of hops between GPUs @@ -352,7 +352,6 @@ Topology arguments: -d, --dma Display P2P direct memory access (DMA) link capability between nodes -z, --bi-dir Display P2P bi-directional link capability between nodes - Command Modifiers: --json Displays output in JSON format (human readable by default). --csv Displays output in CSV format (human readable by default). @@ -407,7 +406,6 @@ BI-DIRECTIONAL TABLE: 0000:bf:00.0 F T T T F F SELF F 0000:df:00.0 T T T F F T F SELF - Legend: SELF = Current GPU ENABLED / DISABLED = Link is enabled or disabled @@ -504,10 +502,10 @@ GPU: 0 TARGET_GRAPHICS_VERSION: gfx942 ``` -- **Udpated Partition APIs and struct information and added and partition_id to `amd-smi static --partition` & `amd-smi list`**. +- **Udpated Partition APIs and struct information and added and partition_id to `amd-smi static --partition`**. - As part of an overhaul to partition information, some partition information will be made available in the `amdsmi_accelerator_partition_profile_t`. - This struct will be filled out by a new API, `amdsmi_get_gpu_accelerator_partition_profile()`. - - Future data from these APIs wil will eventually get added to `static --partition`. + - Future data from these APIs wil will eventually get added to `amd-smi partition`. 
```C #define AMDSMI_MAX_ACCELERATOR_PROFILE 32 @@ -548,7 +546,6 @@ typedef union { uint32_t nps_cap_mask; } amdsmi_nps_caps_t; - typedef struct { amdsmi_accelerator_partition_type_t profile_type; // SPX, DPX, QPX, CPX and so on uint32_t num_partitions; // On MI300X, SPX: 1, DPX: 2, QPX: 4, CPX: 8, length of resources array @@ -567,21 +564,6 @@ GPU: 0 COMPUTE_PARTITION: CPX MEMORY_PARTITION: NPS4 PARTITION_ID: 0 - -$ amd-smi list -GPU: 0 - BDF: 0000:23:00.0 - UUID: - KFD_ID: 45412 - NODE_ID: 1 - PARTITION_ID: 0 - -GPU: 1 - BDF: 0000:26:00.0 - UUID: - KFD_ID: 59881 - NODE_ID: 2 - PARTITION_ID: 0 ``` ### Removals @@ -610,7 +592,7 @@ plan to eventually remove partition ID from the function portion of the BDF (Bus - bits [7:3] = Device - bits [2:0] = Function (partition id maybe in bits [2:0]) <-- Fallback for non SPX modes -Previously in non-SPX modes (ex. CPX/TPX/DPX/etc) some MI3x ASICs would not report all logical GPU devices within AMD SMI. + - Previously in non-SPX modes (ex. CPX/TPX/DPX/etc) some MI3x ASICs would not report all logical GPU devices within AMD SMI. ```shell $ amd-smi monitor -p -t -v @@ -650,9 +632,8 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL ``` - **Fixed incorrect implementation of the Python API `amdsmi_get_gpu_metrics_header_info()`**. -- **`amdsmitst` TestGpuMetricsRead now prints metric in correct units** -- **`amd-smi static --partition` will have updates with additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()`**. +- **`amdsmitst` TestGpuMetricsRead now prints metric in correct units**. ### Known issues @@ -662,6 +643,10 @@ GPU POWER GPU_TEMP MEM_TEMP VRAM_USED VRAM_TOTAL - **Python API for `amdsmi_get_energy_count()` will deprecate the `power` field in ROCm 6.4 and use `energy_accumulator` field instead**. +- **Added preliminary `amd-smi partition` command**. + - The new partition command can be used to display GPU information, including memory and accelerator partition information. 
+ - The command will be at full functionality once additional partition information from `amdsmi_get_gpu_accelerator_partition_profile()` has been implemented. + ## amd_smi_lib for ROCm 6.2.1 ### Additions diff --git a/amdsmi_cli/amdsmi_cli.py b/amdsmi_cli/amdsmi_cli.py index 1e61fa44f1..463632c1b2 100755 --- a/amdsmi_cli/amdsmi_cli.py +++ b/amdsmi_cli/amdsmi_cli.py @@ -94,7 +94,8 @@ if __name__ == "__main__": amd_smi_commands.reset, amd_smi_commands.monitor, amd_smi_commands.rocm_smi, - amd_smi_commands.xgmi) + amd_smi_commands.xgmi, + amd_smi_commands.partition) try: try: argcomplete.autocomplete(amd_smi_parser) @@ -128,7 +129,6 @@ if __name__ == "__main__": sys.tracebacklimit = 10 else: sys.tracebacklimit = -1 - # Execute subcommands args.func(args) except amdsmi_cli_exceptions.AmdSmiException as e: diff --git a/amdsmi_cli/amdsmi_commands.py b/amdsmi_cli/amdsmi_commands.py index f7c0991a5f..b232c29c21 100644 --- a/amdsmi_cli/amdsmi_commands.py +++ b/amdsmi_cli/amdsmi_commands.py @@ -5043,6 +5043,8 @@ class AMDSMICommands(): bitrate = pcie_speed_GTs_value max_bandwidth = bitrate * pcie_static['max_pcie_width'] except amdsmi_exception.AmdSmiLibraryException as e: + bitrate = "N/A" + max_bandwidth = "N/A" logging.debug("Failed to get bitrate and bandwidth for GPU %s | %s", src_gpu_id, e.get_error_info()) @@ -5084,6 +5086,8 @@ class AMDSMICommands(): read = metrics_info['xgmi_read_data_acc'][dest_gpu_id] write = metrics_info['xgmi_write_data_acc'][dest_gpu_id] except amdsmi_exception.AmdSmiLibraryException as e: + read = "N/A" + write = "N/A" logging.debug("Failed to get read data for %s to %s | %s", self.helpers.get_gpu_id_from_device_handle(src_gpu), self.helpers.get_gpu_id_from_device_handle(dest_gpu), @@ -5172,6 +5176,207 @@ class AMDSMICommands(): self.logger.print_output(multiple_device_enabled=True) + def partition(self, args, multiple_devices=False, gpu=None, current=None, memory=None, accelerator=None): + """ Display parition information for the target GPU 
+ param: + args - argparser args to pass to subcommand + multiple_devices (bool) - True if checking for multiple devices + gpu (device_handle) - device_handle for target device + current - boolean which dictates whether the current partition information is shown + memory - boolean which dictates whether the memory partition information is shown + accelerator - boolean which dictates whether the accelerator partition information is shown + returns: + nothing + """ + + if gpu: + args.gpu = gpu + if args.gpu == None: + args.gpu = self.device_handles + if not isinstance(args.gpu, list): + args.gpu = [args.gpu] + if current: + args.current = current + if memory: + args.memory = memory + if accelerator: + args.accelerator = accelerator + + # if no args are present, then everything should be displayed + if not args.current and not args.memory and not args.accelerator: + args.current = True + args.memory = True + args.accelerator = True + + if args.current: + self.logger.table_header = ''.rjust(7) + current_header = "GPU_ID".ljust(13) + \ + "MEMORY".ljust(8) + \ + "ACCELERATOR_TYPE".ljust(18) + \ + "ACCELERATOR_PROFILE_INDEX".ljust(27) + \ + "PARTITION_ID".ljust(14) + self.logger.table_header = current_header + self.logger.table_header.strip() + + tabular_output = [] + for gpu in args.gpu: + gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu) + try: + partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu) + profile_type = partition_dict['partition_profile']['profile_type'] + profile_index = partition_dict['partition_profile']['profile_index'] + partition_id = partition_dict['partition_id'] + except amdsmi_exception.AmdSmiLibraryException as e: + profile_type = "N/A" + profile_index = "N/A" + partition_id = "N/A" + logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info()) + try: + current_mem_cap = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu) + except 
amdsmi_exception.AmdSmiLibraryException as e: + current_mem_cap = "N/A" + logging.debug("Failed to get current memory partition capabilties for GPU %s | %s", gpu_id, e.get_error_info()) + + tabular_output_dict = {"gpu_id": gpu_id, + "memory": current_mem_cap, + "accelerator_type": profile_type, + "accelerator_profile_index": profile_index, + "partition_id": partition_id} + tabular_output.append(tabular_output_dict) + + self.logger.multiple_device_output = tabular_output + self.logger.table_title = "CURRENT_PARTITION" + self.logger.print_output(multiple_device_enabled=True, tabular=True) + self.logger.clear_multiple_devices_ouput() + + if args.memory: + for gpu in args.gpu: + gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu) + try: + memory_partition = amdsmi_interface.amdsmi_get_gpu_memory_partition(gpu) # this info likely actually comes from different apis than used here + except amdsmi_exception.AmdSmiLibraryException as e: + memory_partition = "N/A" + logging.debug("Failed to get current memory partition for GPU %s | %s", gpu_id, e.get_error_info()) + try: + partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu) + temp_mem_caps = partition_dict['partition_profile']['memory_caps'] + + if temp_mem_caps.amdsmi_nps_flags_t == None: + mem_caps = temp_mem_caps.nps_cap_mask + mem_caps_list = [] + if mem_caps & 1 == 1: + mem_caps_list.append("NPS1") + if mem_caps & 2 == 2: + mem_caps_list.append("NPS2") + if mem_caps & 4 == 4: + mem_caps_list.append("NPS4") + if mem_caps & 8 == 8: + mem_caps_list.append("NPS8") + mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "") + else: + mem_caps = temp_mem_caps.amdsmi_nps_flags_t + mem_caps_list = [] + if mem_caps.nps1_cap == 1: + mem_caps_list.append("NPS1") + if mem_caps.nps2_cap == 1: + mem_caps_list.append("NPS2") + if mem_caps.nps4_cap == 1: + mem_caps_list.append("NPS4") + if mem_caps.nps8_cap == 1: + mem_caps_list.append("NPS8") + mem_caps_str = 
str(mem_caps_list).replace("]", "").replace("[", "") + if mem_caps_str == "": + mem_caps_str = "N/A" + except amdsmi_exception.AmdSmiLibraryException as e: + mem_caps_str = "N/A" + logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info()) + + memory_dict = {'caps': mem_caps_str, 'current': memory_partition} + self.logger.store_output(gpu, 'memory_partition', memory_dict) + self.logger.store_multiple_device_output() + self.logger.print_output(multiple_device_enabled=True) + self.logger.clear_multiple_devices_ouput() + if args.accelerator: + self.logger.table_header = ''.rjust(7) + current_header = "GPU_ID".ljust(13) + \ + "PROFILE_INDEX".ljust(15) + \ + "MEMORY_PARTITION_CAPS".ljust(23) + \ + "ACCELERATOR_TYPE".ljust(18) + \ + "PARTITION_ID".ljust(14) + \ + "NUM_PARTITIONS".ljust(16) + \ + "NUM_RESOURCES".ljust(15) + \ + "RESOURCE_INDEX".ljust(16) + \ + "RESOURCE_TYPE".ljust(15) + \ + "RESOURCE_INSTANCES".ljust(20) + \ + "RESOURCES_SHARED".ljust(18) + self.logger.table_header = current_header + self.logger.table_header.strip() + + tabular_output = [] + for gpu in args.gpu: + gpu_id = self.helpers.get_gpu_id_from_device_handle(gpu) + try: + partition_dict = amdsmi_interface.amdsmi_get_gpu_accelerator_partition_profile(gpu) + profile_type = partition_dict['partition_profile']['profile_type'] + profile_index = partition_dict['partition_profile']['profile_index'] + temp_mem_caps = partition_dict['partition_profile']['memory_caps'] + parition_id = partition_dict['partition_id'] + num_resources = partition_dict['partition_profile']['num_resources'] + resources = partition_dict['partition_profile']['resources'] + + if temp_mem_caps.amdsmi_nps_flags_t == None: + mem_caps = temp_mem_caps.nps_cap_mask + mem_caps_list = [] + if mem_caps & 1 == 1: + mem_caps_list.append("NPS1") + if mem_caps & 2 == 2: + mem_caps_list.append("NPS2") + if mem_caps & 4 == 4: + mem_caps_list.append("NPS4") + if mem_caps & 8 == 8: + 
mem_caps_list.append("NPS8") + mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "") + else: + mem_caps = temp_mem_caps.amdsmi_nps_flags_t + mem_caps_list = [] + if mem_caps.nps1_cap == 1: + mem_caps_list.append("NPS1") + if mem_caps.nps2_cap == 1: + mem_caps_list.append("NPS2") + if mem_caps.nps4_cap == 1: + mem_caps_list.append("NPS4") + if mem_caps.nps8_cap == 1: + mem_caps_list.append("NPS8") + mem_caps_str = str(mem_caps_list).replace("]", "").replace("[", "") + if mem_caps_str == "": + mem_caps_str = "N/A" + except amdsmi_exception.AmdSmiLibraryException as e: + profile_type = "N/A" + profile_index = "N/A" + temp_mem_caps = "N/A" + parition_id = "N/A" + num_resources = "N/A" + resources = "N/A" + mem_caps_str = "N/A" + logging.debug("Failed to get accelerator partition profile for GPU %s | %s", gpu_id, e.get_error_info()) + + tabular_output_dict = {"gpu_id": gpu_id, + "profile_index": profile_index, + "memory_partition_caps": mem_caps_str, + "accelerator_type": profile_type, + "partition_id": parition_id, + "num_partitions": 0, + "num_resources": num_resources, + "resource_index": resources, + "resource_type": resources, + "resource_instances": resources, + "resources_shared": resources} + tabular_output.append(tabular_output_dict) + + self.logger.multiple_device_output = tabular_output + self.logger.table_title = "ACCELERATOR_PARTITION_PROFILES" + self.logger.print_output(multiple_device_enabled=True, tabular=True) + self.logger.clear_multiple_devices_ouput() + + def _event_thread(self, commands, i): devices = commands.device_handles if len(devices) == 0: diff --git a/amdsmi_cli/amdsmi_logger.py b/amdsmi_cli/amdsmi_logger.py index 8234f99eac..86b463938f 100644 --- a/amdsmi_cli/amdsmi_logger.py +++ b/amdsmi_cli/amdsmi_logger.py @@ -150,8 +150,32 @@ class AMDSMILogger(): table_values += string_value.ljust(14) elif key == "link_type": table_values += string_value.ljust(10) + elif key == "memory": + table_values += string_value.ljust(8) + elif key 
== "accelerator_type": + table_values += string_value.ljust(18) + elif key == "partition_id": + table_values += string_value.ljust(14) + elif key == "accelerator_profile_index": + table_values += string_value.ljust(27) + elif key == "profile_index": + table_values += string_value.ljust(15) + elif key == "memory_partition_caps": + table_values += string_value.ljust(23) + elif key == "num_partitions": + table_values += string_value.ljust(16) + elif key == "num_resources": + table_values += string_value.ljust(15) + elif key == "resource_index": + table_values += string_value.ljust(16) + elif key == "resource_type": + table_values += string_value.ljust(15) + elif key == "resource_instances": + table_values += string_value.ljust(20) + elif key == "resources_shared": + table_values += string_value.ljust(18) elif key == "RW": - table_values += " " + string_value.ljust(52) + table_values += string_value.ljust(52) elif key == "process_list": #Add an additional padding between the first instance of GPU and NAME table_values += ' ' diff --git a/amdsmi_cli/amdsmi_parser.py b/amdsmi_cli/amdsmi_parser.py index bc9c85149b..9965c486b5 100644 --- a/amdsmi_cli/amdsmi_parser.py +++ b/amdsmi_cli/amdsmi_parser.py @@ -71,7 +71,7 @@ class AMDSMIParser(argparse.ArgumentParser): """ def __init__(self, version, list, static, firmware, bad_pages, metric, process, profile, event, topology, set_value, reset, monitor, - rocmsmi, xgmi): + rocmsmi, xgmi, partition): # Helper variables self.helpers = AMDSMIHelpers() @@ -117,7 +117,7 @@ class AMDSMIParser(argparse.ArgumentParser): # Store possible subcommands & aliases for later errors self.possible_commands = ['version', 'list', 'static', 'firmware', 'ucode', 'bad-pages', 'metric', 'process', 'profile', 'event', 'topology', 'set', - 'reset', 'monitor', 'dmon', 'xgmi'] + 'reset', 'monitor', 'dmon', 'xgmi', 'partition'] # Add all subparsers self._add_version_parser(self.subparsers, version) @@ -135,6 +135,7 @@ class 
AMDSMIParser(argparse.ArgumentParser): self._add_monitor_parser(self.subparsers, monitor) self._add_rocm_smi_parser(self.subparsers, rocmsmi) self._add_xgmi_parser(self.subparsers, xgmi) + self._add_partition_parser(self.subparsers, partition) def _not_negative_int(self, int_value): @@ -1286,6 +1287,40 @@ class AMDSMIParser(argparse.ArgumentParser): xgmi_parser.add_argument('-m', '--metric', action='store_true', required=False, help=metrics_help) + def _add_partition_parser(self, subparsers, func): + if not self.helpers.is_amdgpu_initialized(): + # The partition subcommand is only applicable to systems with amdgpu initialized + return + + # Subparser help text + partition_help = "Displays partition information of the devices" + partition_subcommand_help = "If no GPU is specified, returns information for all GPUs on the system.\ + \nIf no partition argument is provided all partition information will be displayed." + partition_optionals_title = "partition arguments" + + # Options help text + current_help = "display the current partition information" + memory_help = "display the current memory partition mode and capabilities" + accelerator_help = "display accelerator partition information" + + # Create partition subparser + partition_parser = subparsers.add_parser('partition', help=partition_help, description=partition_subcommand_help) + partition_parser._optionals.title = partition_optionals_title + partition_parser.formatter_class=lambda prog: AMDSMISubparserHelpFormatter(prog) + partition_parser.set_defaults(func=func) + + # Add Universal Arguments + self._add_device_arguments(partition_parser, required=False) + + # Handle GPU Options + partition_parser.add_argument('-c', '--current', action='store_true', required=False, help=current_help) + partition_parser.add_argument('-m', '--memory', action='store_true', required=False, help=memory_help) + partition_parser.add_argument('-a', '--accelerator', action='store_true', required=False, help=accelerator_help) + + # Add 
command modifiers to the bottom + self._add_command_modifiers(partition_parser) + + def error(self, message): outputformat = self.helpers.get_output_format() diff --git a/py-interface/amdsmi_interface.py b/py-interface/amdsmi_interface.py index b35da33e1c..bed84e26a8 100644 --- a/py-interface/amdsmi_interface.py +++ b/py-interface/amdsmi_interface.py @@ -2788,7 +2788,7 @@ def amdsmi_get_gpu_accelerator_partition_profile( "profile_type" : profile.profile_type, "num_partitions" : profile.num_partitions, "profile_index" : profile.profile_index, - "memory_caps" : "N/A", + "memory_caps" : profile.memory_caps, "num_resources" : profile.num_resources, "resources" : "N/A" } From 88ed9e2f0954847b83d9ad4983de651f0ecfc69b Mon Sep 17 00:00:00 2001 From: "Galantsev, Dmitrii" Date: Fri, 27 Sep 2024 18:33:56 -0500 Subject: [PATCH 7/8] CMAKE - Fix version Change-Id: Ieefdd4c64ae657a53f1f5fd9a7fc94b3d2c899c2 Signed-off-by: Galantsev, Dmitrii --- CMakeLists.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 97793e0290..4fc950cb16 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -206,7 +206,7 @@ configure_package_config_file( write_basic_package_version_file( ${CMAKE_CURRENT_BINARY_DIR}/amd_smi-config-version.cmake VERSION - "${AMD_SMI_LIBS_TARGET_VERSION_MAJOR}.${AMD_SMI_LIBS_TARGET_VERSION_MINOR}.${AMD_SMI_LIBS_TARGET_VERSION_PATCH}" + "${CPACK_PACKAGE_VERSION}" COMPATIBILITY SameMajorVersion) install( From a266d602c5f2f048812e489cf6962b7fb755baa3 Mon Sep 17 00:00:00 2001 From: Maisam Arif Date: Fri, 27 Sep 2024 18:55:19 -0500 Subject: [PATCH 8/8] Bump Version to 24.7.0.0 Signed-off-by: Maisam Arif Change-Id: Ife9277f6abf64ed862e11e12a6472c6e6ea4d68f --- CMakeLists.txt | 2 +- amdsmi_cli/README.md | 2 +- docs/doxygen/Doxyfile | 2 +- docs/how-to/using-AMD-SMI-CLI-tool.md | 2 +- include/amd_smi/amdsmi.h | 4 ++-- 5 files changed, 6 insertions(+), 6 deletions(-) diff --git a/CMakeLists.txt b/CMakeLists.txt index 
4fc950cb16..d2d5078048 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -28,7 +28,7 @@ find_program(GIT NAMES git) ## Setup the package version based on git tags. set(PKG_VERSION_GIT_TAG_PREFIX "amdsmi_pkg_ver") -get_package_version_number("24.6.5" ${PKG_VERSION_GIT_TAG_PREFIX} GIT) +get_package_version_number("24.7.0" ${PKG_VERSION_GIT_TAG_PREFIX} GIT) message("Package version: ${PKG_VERSION_STR}") set(${AMD_SMI_LIBS_TARGET}_VERSION_MAJOR "${CPACK_PACKAGE_VERSION_MAJOR}") set(${AMD_SMI_LIBS_TARGET}_VERSION_MINOR "${CPACK_PACKAGE_VERSION_MINOR}") diff --git a/amdsmi_cli/README.md b/amdsmi_cli/README.md index ab95062473..2c782ae66c 100644 --- a/amdsmi_cli/README.md +++ b/amdsmi_cli/README.md @@ -81,7 +81,7 @@ AMD-SMI reports the version and current platform detected when running the comma ~$ amd-smi usage: amd-smi [-h] ... -AMD System Management Interface | Version: 24.6.5.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal +AMD System Management Interface | Version: 24.7.0.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal options: -h, --help show this help message and exit diff --git a/docs/doxygen/Doxyfile b/docs/doxygen/Doxyfile index a322c17a14..2f8227fed1 100644 --- a/docs/doxygen/Doxyfile +++ b/docs/doxygen/Doxyfile @@ -48,7 +48,7 @@ PROJECT_NAME = AMD SMI # could be handy for archiving the generated documentation or if some version # control system is used. -PROJECT_NUMBER = "24.6.5.0" +PROJECT_NUMBER = "24.7.0.0" # Using the PROJECT_BRIEF tag one can provide an optional one line description # for a project that appears at the top of each page and should give viewer a diff --git a/docs/how-to/using-AMD-SMI-CLI-tool.md b/docs/how-to/using-AMD-SMI-CLI-tool.md index 0ad610a81b..6371281088 100644 --- a/docs/how-to/using-AMD-SMI-CLI-tool.md +++ b/docs/how-to/using-AMD-SMI-CLI-tool.md @@ -8,7 +8,7 @@ AMD-SMI reports the version and current platform detected when running the comma ~$ amd-smi usage: amd-smi [-h] ... 
-AMD System Management Interface | Version: 24.6.5.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal +AMD System Management Interface | Version: 24.7.0.0 | ROCm version: 6.2.2 | Platform: Linux Baremetal options: -h, --help show this help message and exit diff --git a/include/amd_smi/amdsmi.h b/include/amd_smi/amdsmi.h index a1713b66e1..3c05386be3 100644 --- a/include/amd_smi/amdsmi.h +++ b/include/amd_smi/amdsmi.h @@ -177,10 +177,10 @@ typedef enum { #define AMDSMI_LIB_VERSION_YEAR 24 //! Major version should be changed for every header change (adding/deleting APIs, changing names, fields of structures, etc.) -#define AMDSMI_LIB_VERSION_MAJOR 6 +#define AMDSMI_LIB_VERSION_MAJOR 7 //! Minor version should be updated for each API change, but without changing headers -#define AMDSMI_LIB_VERSION_MINOR 5 +#define AMDSMI_LIB_VERSION_MINOR 0 //! Release version should be set to 0 as default and can be updated by the PMs for each CSP point release #define AMDSMI_LIB_VERSION_RELEASE 0