Implements automatic device wake using the getDRMDeviceId() DRM call when GPUs
are detected in a low-power state. This ensures rocm-smi can access device
information on suspended GPUs.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Run pre-commit's whitespace related hooks on projects/rocm-smi-lib
In order for pre-commit to be useful, everything needs to meet a common
baseline.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Added Changelog Spaces for formatting
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Changes:
- Fix `rocm-smi --setsclk [0 .. n]` for multiple devices so that it continues on
failure in a partitioned configuration (e.g. DPX/QPX/CPX).
- Partitioned configurations or devices which do not support changing
sclk/mclk/pcie clocks now continue on failure and report a "not
supported" or other (rocm-smi) error code for these devices.
- The change also affects other clock settings such as `--setmclk` and
`--setpcie`.
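
The continue-on-failure behavior described above can be pictured with a minimal sketch. The `set_clock` callable and the exception types are hypothetical stand-ins chosen for illustration; they are not the rocm_smi.py implementation.

```python
def set_clock_on_devices(device_ids, set_clock):
    """Apply set_clock to every device, recording failures instead of stopping.

    set_clock is a hypothetical callable that raises NotImplementedError when a
    device (for example a partition node) cannot change the clock.
    """
    results = {}
    for device_id in device_ids:
        try:
            set_clock(device_id)
            results[device_id] = "ok"
        except NotImplementedError:
            # Report and move on so the remaining devices are still attempted.
            results[device_id] = "not supported"
        except OSError as err:
            results[device_id] = f"error: {err}"
    return results
```

The point of the design is that one unsupported device yields a per-device status rather than aborting the whole loop.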
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updated: rocm_smi.py
- Remove all else: clauses from functions where rsmi_ret_ok is part of the if clause, as requested.
- The rsmi_ret_ok() function already handles unsuccessful return codes gracefully.
- Updated check_runtime_status() function to sweep through /sys/class/drm to find active runtime_status.
- Updated the message to 'AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status'
- This clarifies the status of the GPU and tells users where to check for more info.
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: gabrpham <Gabriel.Pham@amd.com>
The sysfs PCIe bandwidth file pcie_bw is deprecated
on newer ASICs. This change gets the PCIe bandwidth from
GPU metrics for version 1.5 or later.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Unstable threading was causing segmentation faults. Switching from the
_thread module to the more recent threading module resolved the
segmentation faults.
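
As a rough illustration of the higher-level API, the sketch below starts one `threading.Thread` per device and joins them all before returning. The `run_workers` helper and its `worker` argument are hypothetical names; the actual thread usage in rocm_smi.py is not shown here.

```python
import threading


def run_workers(device_ids, worker):
    """Run worker(device_id) on one thread per device and collect the results."""
    results = {}
    lock = threading.Lock()

    def _task(device_id):
        value = worker(device_id)
        # Guard the shared dict so concurrent writers do not interleave.
        with lock:
            results[device_id] = value

    threads = [threading.Thread(target=_task, args=(d,)) for d in device_ids]
    for thread in threads:
        thread.start()
    # Wait for every thread to finish before the results are read.
    for thread in threads:
        thread.join()
    return results
```

Joining each thread before returning avoids reading results while workers are still running, something a bare `_thread.start_new_thread` call gives no handle for.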
Multiple issues are resolved by this commit:
[SWDEV-537518]
[SWDEV-540377]
[SWDEV-540223]
Signed-off-by: GabrPham <gabrpham_amdeng@amd.com>
[ROCm/rocm_smi_lib commit: 7dba992ebd]
* [SWDEV-531834] Fix AMD GPUs visible, but data is inaccessible:
- Scans directories under /sys/bus/pci/drivers/amdgpu
- Verifies each device's runtime_status to determine if it's active
- Returns False if any device is not in active state
- Handles permission errors gracefully with proper debug logging
- Includes comments explaining behavior differences between Instinct / NAVI hardware
The default status is True: devices are assumed active unless
proven otherwise, which accommodates hardware such as some Instinct ASICs
that do not support runtime power management.
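
The steps above can be sketched roughly as follows. This is an illustration under assumptions, not the code in rocm_smi.py: the function name `all_devices_active` is hypothetical, and the `power/runtime_status` location under each device entry is assumed from the standard PCI runtime PM sysfs layout.

```python
import logging
import os


def all_devices_active(driver_dir="/sys/bus/pci/drivers/amdgpu"):
    """Return False if any device under driver_dir reports a non-active status."""
    status = True  # assume active unless proven otherwise
    try:
        entries = os.listdir(driver_dir)
    except OSError as err:
        logging.debug("Cannot list %s: %s", driver_dir, err)
        return status
    for entry in entries:
        # Assumed layout: <driver_dir>/<device>/power/runtime_status
        status_path = os.path.join(driver_dir, entry, "power", "runtime_status")
        if not os.path.isfile(status_path):
            continue  # not a device entry, or no runtime PM file exposed
        try:
            with open(status_path) as status_file:
                value = status_file.read().strip()
        except OSError as err:  # PermissionError is a subclass of OSError
            logging.debug("Cannot read %s: %s", status_path, err)
            continue
        if value != "active":
            status = False
    return status
```

Entries without a readable `runtime_status` file are skipped, so they leave the default of True untouched.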
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
[ROCm/rocm_smi_lib commit: 47f80145cb]
* Changes:
- Updates to DRM renderD* / card* pathing for partition devices
- Now use KFD to discover AMD devices and populate accordingly
Device MUST have an accessible KFD node (via cgroups)
- Updated several ROCm SMI CLI outputs to handle SYSFS files
which are not accessible on partition nodes
- Added a new method to help get card/drm info
(rsmi_dev_device_identifiers_get) from ROCm SMI
Change-Id: If844f27ffc595942272abe9c8167ed90a0b0e225
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: a0df877fdf]
rocm-smi is not working in mGPU, Blocking DLM tests
Updates include:
- Created the check_runtime_status function to check whether the device status is active.
- Added a warning to users that no AMD GPUs are available and to check power status/control.
- Added check for empty string coming from HWMON; if empty, returns unexpected data.
---------
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
[ROCm/rocm_smi_lib commit: 2630bf0a8c]
* SWDEV-518214: GPU Metrics 1.8 (#31)
- Updates:
- Adding the following metrics to allow new calculations for violation status:
- Per XCP metrics gfx_below_host_limit_ppt_acc
- Per XCP metrics gfx_below_host_limit_thm_acc
- Per XCP metrics gfx_low_utilization_acc
- Per XCP metrics gfx_below_host_limit_total_acc
- Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: f69e65f7bd]
Changes:
- Fixed Device Name (market name)
- Added new API rsmi_dev_market_name_get()
- Updated tests
- Updated amdgpu_drm.h to match latest mainline kernel
- Fixed subsystem ID to only show hex value (not subsystem name)
- rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997
[ROCm/rocm_smi_lib commit: b951a65cf2]
The target_graphics_version was not formatted properly and was
showing an incorrect Target Name. Corrected this by formatting the
major, minor and revision numbers.
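
One way to picture this kind of formatting is the sketch below. It assumes the raw value is a decimal encoding of major * 10000 + minor * 100 + revision, and that minor and revision are printed as hexadecimal digits; both the encoding and the helper name `format_target_name` are assumptions for illustration, not taken from this commit.

```python
def format_target_name(raw_version):
    """Format an assumed major/minor/revision decimal encoding as a gfx name."""
    major = raw_version // 10000
    minor = (raw_version // 100) % 100
    revision = raw_version % 100
    # Minor and revision are rendered in hex, so revision 10 becomes "a".
    return f"gfx{major}{minor:x}{revision:x}"
```

Under these assumptions, `format_target_name(90010)` gives `gfx90a` and `format_target_name(110000)` gives `gfx1100`.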
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
[ROCm/rocm_smi_lib commit: 6337f7b05b]
Changes:
- Added new GPU metrics:
1) XGMI link status - Up/Down; 1 = up; 0 = down
2) Graphics clocks below host limit (per XCP)
accumulators -> used to help calculate a violation status
3) VRAM max bandwidth at max memory clock
- Updated rocm-smi --showmetrics to include new metrics.
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields mean the device does not support providing this
data.
Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 88a7e4b8ad]
Changes:
* [API] Removed checking board name, fixes for other MI ASICs
* [CLI] Increased progress bar to change memory partition modes
to 140 seconds, since driver reload is variable per system
Change-Id: Ifcaf40d28b4adf5eaa800c9e3748d33749dc414a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: d04cec7f1d]
Changes:
- Added warning screen to ROCm SMI users
setting memory partition
- Added new API (rsmi_dev_memory_partition_capabilities_get)
to retrieve memory partition capabilities
(What users can set memory partition modes to)
- Increased time-bar for CLI sets display to 40 seconds
- API now waits until the driver reloads with SYSFS files active
- [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
not properly displaying for MI2x or Navi 3x ASICs.
- Updated tests
Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 46902274b6]
Reset support for both compute and memory GPU partitions was removed.
Code changes related to the following:
* rsmi_dev_compute_partition_reset()
* rsmi_dev_memory_partition_reset()
* CLI
* Unit tests
* Documentation
Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: a1295714f2]
Changes:
- Added new GPU metrics:
1) Violation status (e.g. PVIOL/TVIOL) accumulators
2) XCP (Graphics Compute Partitions) statistics
3) PCIe other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields mean the device does not support providing this
data.
Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 0609cbf1d0]
- logging.warn() is deprecated in favour of logging.warning()
- for some reason, this is the only place in all of rocm_smi.py
that uses logging.warn() as pointed out on github
https://github.com/ROCm/rocm_smi_lib/issues/187
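
For reference, the replacement call looks like the sketch below. The logger name and the `warn_low_power` helper are hypothetical and only illustrate the spelling; they are not the call site changed in rocm_smi.py.

```python
import logging

logger = logging.getLogger("rocm_smi_example")  # hypothetical logger name


def warn_low_power(device_id):
    """Emit a warning using logging's non-deprecated warning() method."""
    # warning() is the supported spelling; warn() is the deprecated alias.
    logger.warning("Device %s is in a low-power state", device_id)
```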
Change-Id: Ie1e4a0ea16b996fbed2e902c8edfe68087a5a5fa
[ROCm/rocm_smi_lib commit: fe6a49d186]
Options '--showvoltagerange' and '--showvc' show 'warning' instead of 'error' for unsupported voltage curves
Code changes related to the following:
* CLI
Change-Id: Ide662c98202c32ad01ccaf3c47a61f2543f82ebb
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 72b112f8f3]
Updates:
- [CLI] Previously --showfw displayed fw that
does not exist on systems. This change removes
that extra output.
Change-Id: If8b063001b80b03579ea1378dfd890c60f62ccd7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 6b8db74578]
rocm-smi is installed in /opt/rocm-ver/bin, but not as a soft link in the wheel package.
For rocm-smi to work from the bin directory, it needs the extra path to find rsmiBindings.py.
Change-Id: I41388f680cb2ab9f11dc135639b0d30b66082392
[ROCm/rocm_smi_lib commit: c9201f7736]
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.
Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8] = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non-SPX modes
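
The bit layout above can be decoded with a few shifts and masks. The sketch below is illustrative only: `decode_bdfid` is a hypothetical helper, and the rule of using the function bits whenever bits [31:28] are zero is an assumed reading of the fallback, not the logic inside rsmi_dev_partition_id_get().

```python
def decode_bdfid(bdfid):
    """Split a 64-bit BDF id into the fields listed in the layout above."""
    fields = {
        "domain": (bdfid >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (bdfid >> 28) & 0xF,    # bits [31:28]
        "bus": (bdfid >> 8) & 0xFF,             # bits [15:8]
        "device": (bdfid >> 3) & 0x1F,          # bits [7:3]
        "function": bdfid & 0x7,                # bits [2:0]
    }
    # Assumed fallback: if bits [31:28] carry nothing, use the function bits.
    if fields["partition_id"] == 0:
        fields["partition_id"] = fields["function"]
    return fields
```

The reserved bits [27:16] are ignored by this sketch.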
Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 323ab1105d]