Changes:
- This aligns back to original struct naming for ROCm 7.0. This removes
any Major ABI breakages for updates for 7.0 release.
- Minor ABI breakage is required since there were additions to the
header. Refer to changelog for these updates.
Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Changes:
- Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency)
- Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs
(Violation Status is the first example of this in monitor)
- Improve CLI monitor output:
support multiple GPU lines per GPU, add new columns, and better formatting
- Refactor helpers and logger for flexible unit formatting and table rendering
- Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info()
new metrics APIs in C++ example
- Sync Python/C++ interface and structures for new metrics fields and naming
- Remove deprecated/unused RSMI activity APIs, documentation not needed since
the APIs no longer exist in ROCm SMI either.
- Cleanup metric violations + fix handle watch arguments
- Provide better handling/doc for average_flattened_ints()
- Group xcp metrics with brackets in human readable + adjust output size
Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
Description:
- Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently.
- Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality.
- Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`.
- Enhanced CLI and test cases to allow users to control when the driver reload occurs.
- Updated documentation and changelog to reflect the new driver reload process.
- Improved error handling and logging for driver reload operations.
- Added progress bar and user confirmation prompts for driver reload commands.
* Update build/test strategy to only allow one test execution at a time
* Modify API verbage + modify systemctl error output
- Systemctl is typically not enabled on docker.
- And is an edge case for gpu being active process/etc for display devices.
* Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values
* Move driver reload to after we save original compute partitions
---------
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Add support to Query UBB/OAM temperature.
* Updated Python API with new temperature metrics enum
---------
Co-authored-by: Bill Liu <shuzhliu@amd.com>
Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and python API's, wrapper, and
CLI accordingly.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
The bug was reproduced like this.
In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done
The terminal #1 starts dumping cper entry information that it captures. After 20 entries have been captured, open terminal #3 and run same command as terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow
From terminal #3, there will be no output, even when terminal #1 continues capturing and printing information.
The fix:
Since we already have more than 20 CPER entries available in the GPU buffer, when we run the command from terminal #3 to start capturing from the beginning and pass 20 buffers to copy entries to, the C++ API returns a code saying there is more data available.
The Python CLI should not treat this as an error, but should continue to print what the API returned.
---------
Signed-off-by: Oosman Saeed <oossaeed@amd.com>
* Added copyrights
* Fixed type hinting for processor_handle in python_interface
* Fixed Incorrect type hinting to actual return types
---------
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Change-Id: Ie2a09acf628ed0c43eacc8ec78c159d125acbcdb
* Adjusted help text
* Adjusted --afid to run only with --cper-file
* Fixed interface return error
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I2b96f4515c85f3b9dd84ba5c2d819729a997141b
The xgmi read and write accumulated data from gpu metric index
is based on sysfs xgmi_port_num file. Mapped these two to display
read and write wrt src_gpu Vs dst_gpu.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
The xgmi command was showing pcie bit rate and bandwidth instead of xgmi. Corrected the API to get xgmi data from gpu metric.
Added python API for amdsmi_get_link_metrics. Modified the amdsmi_link_metrics struct.
Added check to confirm non zero partition got xgmi command.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
* Add the API and CLI to show the board voltage.
---------
Change-Id: Icb25bd653bb1d004704b5a21b378ca31b2b242c7
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
* Update to remove vram enum and instead use the string directly from the driver.
Signed-off-by: Narlo, Joseph <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
feat: Report PLDM Bundle from SMC to IB
Code changes related to the following:
* APIs
* CLI
* Unit tests
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Change-Id: I35ccf01eb612ca80e3ae6b72039085c18c989222
- Added more events to `amdsmi_evt_notification_type_t`
Change-Id: I6a256fe828e4bec3197c7fecbed374ab17c6f850
Signed-off-by: Adam Pryor <Adam.Pryor@amd.com>
- Updated VCLK and DCLK min/max clock logic to populate N/A values.
- Updated VCLK and DCLK to show all available clocks.
- Updated deep_sleep logic using sys/fs clk_deep_sleep true/false.
- Added clarifying comments.
- Updated error output using e.get_error_info() instead of just error.
- Updated changelog
---------
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Changes:
- Removed libdrm/libdrm_amdgpu dependencies
- Added/updated new internal libdrm/libdrm_amdgpu/xf86drm APIs
to allow our APIs to reference before dynamic loading
the libdrm/libdrm_amdgpu libraries:
1. amdgpu_drm.h to what's seen in mainline
2. Added xf86drm.h to whats seen in mainline
- Modified internal DRM capabilities:
1. Require each API to independently connect to libdrm/libdrm_amdgpu
+ validate API handles reponses accordingly
2. Initialization of AMD SMI no longer has as strong of a tie to
libdrm
- Updated internal implementations of several APIs which have
connections to libdrm/libdrm_amdgpu or APIs which have conflicts
with open libdrm/libdrm_amdgpu connections:
1. amdsmi_init()
2. amdsmi_get_gpu_vram_usage()
3. amdsmi_get_gpu_asic_info()
4. amdsmi_get_gpu_vram_info()
5. amdsmi_get_gpu_vbios_info()
6. amdsmi_get_gpu_driver_info()
7. amdsmi_get_gpu_virtualization_mode()
8. amdsmi_set_gpu_memory_partition()
9. amdsmi_set_gpu_memory_partition_mode()
- Cleaned up effected tests/APIs
Change-Id: I96e2cf1b06b0cfee1b01a5e991ccc6116c4245a8
Earlier, the amd-smi metric and static json output
was not in valid json format. Changes are done to
get the output in valid json format.
---------
Change-Id: I5576333269509f63b3c800f225c3d73127ce80cf
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
* Small Fixes
* CLI Help text and parser formatting updates
* Changed metavar for set partition
---------
Change-Id: Ia8809665f6fac670452cd4db4e5e8f9c7270faba
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Co-authored-by: Pham, Gabriel <Gabriel.Pham@amd.com>
(#280)
- Fixed potential issue with min/max values when only one frequency is available
- Improve error handling in GPU frequency range detection
- Refactor clock frequency range detection for better readability
- Added special handling for current frequency indicator (*) in DPM output
- Added comments explaining special case handling for current frequency
- Cleaned up incorrect definitions in hsmp metric table definition
---------
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
* Changes:
- Updates to DRM renderD* / card* pathing for partition
- Now use KFD to discover AMD devices and populate accordingly
Device MUST have an accessible KFD node (via cgroups)
- Updated serveral AMD SMI CLI outputs to handle SYSFS files
which are not accessible on partition nodes
- Tests are updated to handle not supported features
- Added new method to help get card/drm info
(rsmi_dev_device_identifiers_get) from ROCm SMI
- Renamed device->get_card_id() & device->get_drm_render_minor()
These can now be used on internal AMD SMI calls.
- Removed warnings shown in build
Change-Id: Ice882fd9b97fb625a5bd4ef327f3ceaf247dc570
Signed-off-by: Charis Poag <Charis.Poag@amd.com>