Increased threshold from 2100 μs to 3100 µs to accommodate
gpu_metric read time variation across Navi systems.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Run pre-commit's whitespace related hooks on projects/rocm-smi-lib
In order for pre-commit to be useful, everything needs to meet a common
baseline.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Added Changelog Spaces for formatting
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Added runtime PM detection and DRM ioctl-based device wake
to handle GPUs in BACO state. Modified tests to wake
suspended devices before reading sysfs files.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Changes: - Updates to APIs to handle null pointers or RSMI_STATUS_NOT_SUPPORTED
- Fixes to tests to handle partitioned configurations correctly
- Synced with latest AMD SMI API changes
Change-Id: I7a932f9336ef29ccb01d3b15e2101f6136b45720
[ROCm/rocm_smi_lib commit: 12b78439d2]
[SWDEV-523359] fan_read_write: Add set fan speed validation check.
- Handled NOT_SUPPORTED status which previously caused rsmitst to false fail
- Added continute statement to proceed with rest of FanReadWrite test.
- fixed spacing line 140
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
[ROCm/rocm_smi_lib commit: ac31c6e576]
* SWDEV-518214: GPU Metrics 1.8 (#31)
- Updates:
- Adding the following metrics to allow new calculations for violation status:
- Per XCP metrics gfx_below_host_limit_ppt_acc
- Per XCP metrics gfx_below_host_limit_thm_acc
- Per XCP metrics gfx_low_utilization_acc
- Per XCP metrics gfx_below_host_limit_total_acc
- Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: f69e65f7bd]
Guest: Tests needed to account for not supporting changing compute
partitions.
BM: Tests need to account for invalid responses from Driver (due to
static CPX config).
Change-Id: I09ccee981c6b73684b64e5053068920a6c1b6439
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 23e945c6b3]
Memory partition test was changing original compute partiton based
on default compute mode. Corrected this to set back to original
compute partition.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
[ROCm/rocm_smi_lib commit: 1d6b8d9422]
Guest: Tests needed to account for not supporting changing compute
partitions.
BM: Tests need to account for invalid responses from Driver (due to
static CPX config).
[ROCm/rocm_smi_lib commit: 1dd9ca9df4]
Changes: - Fixed Device Name (market name)
- Added new API rsmi_dev_market_name_get()
- Updated tests
- Updated amdgpu_drm.h to match latest mainline kernel
- Fixed subsystem ID to only show hex value (not subsystem name)
- rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997
[ROCm/rocm_smi_lib commit: 6a5e94c451]
Changes:
- Added new GPU metrics:
1) XGMI link status - Up/Down; 1 = up; 0 = down
2) Graphics clocks below host limit (per XCP)
accumulators -> used to help calculate a violation status
3) VRAM max bandwidth at max memory clock
- Updated rocm-smi --showmetrics to include new metrics.
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields means the device does not support providing this
data.
Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 4de2168866]
Changes:
- Added warning screen to ROCm SMI users
setting memory partition
- Added new API (rsmi_dev_memory_partition_capabilities_get)
to retrieve memory partition capabilities
(What users can set memory partition modes to)
- Increased time-bar for CLI sets display to 40 seconds
- API now waits until the driver reloads with SYSFS files active
- [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
not properly displaying for MI2x or Navi 3x ASICs.
- Updated tests
Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 46902274b6]
The reset gpu partition support for both compute and memory were removed
Code changes related to the following:
* rsmi_dev_compute_partition_reset()
* rsmi_dev_memory_partition_reset()
* CLI
* Unit tests
* Documentation
Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: a1295714f2]
Changes:
- Added new GPU metrics:
1) Violation status' (ex. PVIOL/TVIOL) accumulators
2) XCP (Graphics Compute Partitions) statistics
3) pcie other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields means the device does not support providing this
data.
Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 0609cbf1d0]
In order to check partition id's we must continue to check # of devices.
Since this fluctuates with partition updates
and there are drm minor limitations.
For the drm minor limitation of 64, user must remove other drivers
using PCIe space. You can see these by:
ls /sys/class/drm
Recommend: rmmod unneeded driver and reload amdgpu. In order to
ensure CPX can enumerate with all XCP (Graphic Cluster Partitions).
Change-Id: Ib663503f0b7264dce163f6ac2d50795fc8dc5eba
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: c11209f618]
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.
Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8] = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id maybe in bits [2:0]) <-- Fallback for non SPX modes
Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 323ab1105d]
Checks returned error by rsmi_dev_od_volt_info_get() before assert
Code changes related to the following:
* Unit tests
Change-Id: Icc0f329e35992aae19f07243024521181467bcd3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 497ef4a7ef]
Fixes reading pp_od_clk_voltage new variable format and size.
Code changes related to the following:
* get_od_clk_volt_info()
* get_od_clk_volt_curve_regions()
* Unit tests
* CLI options removed: --showclkvolt, --showvc, --showvoltagerange, --setvc
Change-Id: Ieedb845eeadcea2f2e447ec576c253ad2a814176
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 48ddd9abd7]
This patch adds 'ring hang' enums to ROCM SMI LIB.
This event type name is KFD_SMI_EVENT_RING_HANG.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9b886eb1fc027f03bcca1e5d1a89a2a186b64bf5
[ROCm/rocm_smi_lib commit: 3282aaa8de]
Fixes TestMeasureApiExecutionTime test fails
Code changes related to the following:
* Unit tests
Change-Id: I6223078f219448deb6bfbd78edae371a5a4cf03c
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: adf5c1da67]
Checks returned error by get_gpu_pci_bandwith() before assert
Code changes related to the following:
* Unit tests
Change-Id: Ia0fe64f168711147c5e66c7917cf633be40dee9f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 35b561fd69]
Checks and forces rereading gpu metrics unconditionally
Code changes related to the following:
* Device::dev_log_gpu_metrics()
* Examples
* Unit tests
Change-Id: Ic1c4f34a39f2bf197263f80ddbb84da26345807d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: b4d37caa70]
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards
Code changes related to the following:
* 'rsmi_dev_metrics_' APIs
* Functional tests
* Examples
Change-Id: I7d562a95889361ee6f8f7588f8a790f42c8eb262
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: ce36198cb1]
Updated:
* [CLI] Fixed vram % - printf style formatting causes many data errors
This fix updates to the recommended way of outputting formatted data.
https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
* [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
-> CLI name: "GUID"
-> ROCm SMI calls: no arg, -i, --showhw, --showproduct
* [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
-> CLI name: "Node"
-> ROCm SMI calls: no arg, --showhw, --showproduct
* [CLI] Added target gfx version from kfd
-> CLI name: "GFX Version" or "GFX VER"
-> ROCm SMI calls: --showhw, --showproduct
* [CLI] Base ROCm CLI
-> Removed - stacked id formatting:
This is to simplify identifiers helpful to users.
More identifiers can be found on -i --showhw, --showproduct
* [CLI] Update -i, --showhw, --showproduct, w/out arg
-> Card ID/DID/Model/SKU/VBIOS:
All unsupported values now display "N/A" instead
of "unknown" or "unsupported"
* [CLI] Showhw now expands data based on content
Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 4b5ccb57f0]
Updates:
- [API] rsmi_dev_target_graphics_version_get, takes
reported value from KFD -> parses into human-readable
values. If device does not support, returns MAX UINT64
value and RSMI_STATUS_NOT_SUPPORTED.
Otherwise, puts into base10 format removing
extra 0's + putting in correct format. If user
provides nullptr, returning RSMI_STATUS_INVALID_ARGS.
- [Test/Example] sys_info_read updated to include
new rsmi_dev_target_graphics_version_get tests
Change-Id: I50f94e06b8733a5dec2eb08f284b44927f36abcd
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 5d2cd0c271]
Adds support and implement APIs for 'gpu_metrics_v1_5'
Code changes related to the following:
* gpu metrics 1.5 support
* Unit tests
* Examples
Build changes related to the following: None
Change-Id: Ie8917dd63c1dd1a94467b100fa44b634cebe62b6
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 373621aed3]
Received EACCES return for file that does not have
write access (read only). Permissions would be an
issue, but we check for sudo/root permissions early on.
Change-Id: I98615b02e4acccc59facb42225887a6b7273716b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: c6b0c93e6f]
Code changes related to the following:
* Check smallest copy size for multi-valued metrics
* Unit tests: gpu_metric_read
* ROCMSMI examples
Build changes related to the following:
* CMakeLists.txt
Change-Id: Ieb2363020fa21c93fbacd0edcc1d394eed183051
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 8e0d3d5a39]
Uses new support for 'gpu_metrics_v1_4'
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* new data structure fields added in 1.4
* added APIs for all other existing metrics before 1.4
* added support to older metrics; 1.1, and 1.2
* public APIs renamed to start with prefix 'rsmi_dev_metrics_'
* Unit tests updated
* Examples updated
Build changes related to the following: None
Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 2c8ba4cae9]
* Updates:
- [API/CLI] rsmi_dev_*_partition_set &
rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
EBUSY writes + cleaned up accidental map insertions
(maplookup[] can insert values that are not in the map,
map.at(key) fixes this potential issue)
- [API] rsmi_dev_gpu_metrics_info_get() - returns
RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
outside of 1v1/1v2/1v3
- [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
EBUSY write errors; kept backward compatibility
for other writes which do not care about these states
- [API] rsmi_dev_od_volt_info_get()
& rsmi_dev_od_volt_curve_regions_get() have better logging
+ Expose more details on why they are erroring
- [Utils/logs/example] Expose AMD GPU gfx target version to aid in
system troubleshooting
- [Utils] Added test methods that look at od volt
freq & regions into here - for easier access across
several tests
- [Utils] Updated getRSMIStatusString(new argument - fullstatus;
default to true for backwards compatibility)
-> true shows shortened RSMI STATUS response
- [Utils] Added splitString to cut out noisy return responses
(used in getRSMIStatusString(), when fullstatus = true)
- [Utils] Added getFileCreationDate() to expose build date
of the library - helpful for local builds or experimental builds
- [Utils] Macro cleanup
- [Example] Added a few gpu_metric checks - helpful for upcoming
updates
- [Device] SYSFS/DebugFS - now have better r/w displayed in logs
- [LOGS] Expose library build date - see above for details
- [Tests] Add more warnings/errors to test builds
- [Tests] Moved up Partition tests for ordered test runs - helped
identify issues with GPU BUSY writes
- [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
with waits for busy status found & cleaned up how we checked
for partition changes - with RSMI responses exposed more clearly
- [Tests] perf_determinism - multi gpu now properly runs through
with full resets as needed
- [Tests] volt_freq_curv_read - better error handling with more
verbose output
Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 57b6135e54]
* Updates:
- [API] After discovering all amd gpus, we now properly
map correct bdf (xgmi nodes). Especially important for
partition changes - aka secondary nodes.
- [API] While adding new secondary nodes we now have
better grouping -> due to resorting based on
kfd properties list & matching to primary uniqueid
- [API] All secondary nodes are now AddToDeviceList
with correct bdf (location id), provided by kfd
- [API] Modified AddToDeviceList(..., uint64_t bdfid):
providing an optional field - bdfid. This allows working
around primary pcie cards with xgmi nodes
- [API] Utils - cpplint minor fixes
- [Example] Removed all endl references w/ newline, fixed
spacing, and some incorrect values displaying as hex
(needed dec representation)
- [API] kfd node functions - now print full path of file
for trace logs
- [Tests] power_read.cc: Added in generic power test to
confirm guaranteeing specific return values
Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 6f1afd2678]