* Changes:
- Updates to DRM renderD* / card* pathing for partition devices
- Now use KFD to discover AMD devices and populate accordingly
Device MUST have an accessible KFD node (via cgroups)
- Updated several ROCm SMI CLI outputs to handle SYSFS files
which are not accessible on partition nodes
- Added a new method to help get card/drm info
(rsmi_dev_device_identifiers_get) from ROCm SMI
Change-Id: If844f27ffc595942272abe9c8167ed90a0b0e225
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
* SWDEV-518214: GPU Metrics 1.8 (#31)
- Updates:
- Adding the following metrics to allow new calculations for violation status:
- Per XCP metrics gfx_below_host_limit_ppt_acc
- Per XCP metrics gfx_below_host_limit_thm_acc
- Per XCP metrics gfx_low_utilization_acc
- Per XCP metrics gfx_below_host_limit_total_acc
- Increasing available JPEG engines to 40. Current ASICs may not support all 40. These will be indicated as UINT16_MAX or N/A in CLI.
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>
Changes: - Fixed Device Name (market name)
- Added new API rsmi_dev_market_name_get()
- Updated tests
- Updated amdgpu_drm.h to match latest mainline kernel
- Fixed subsystem ID to only show hex value (not subsystem name)
- rocm_smi_lib now has a recommended requirement for libdrm
Change-Id: Ic438529e16c8c3dbbdd620da664918148c40c997
- Fixes the broken links in rocm_smi.h
- Uses automodule instead of autofunction in docs/reference/python_api.rst
- Fixes some warnings during docs build
- Update some of the versions in requirements.txt
- To address https://github.com/ROCm/rocm_smi_lib/issues/208
where use of fake BDFs for partitions can cause confusion. This note
is already in the comments of the function definition, but was not
updated in the function declaration.
- Fix broken formatting for the location table for PCIE coordinate fields
- Tracked in SWDEV-501108
Change-Id: Ic85439866cb836bb43acc52314a7f1d026c3215d
Changes:
- Added new GPU metrics:
1) XGMI link status - Up/Down; 1 = up; 0 = down
2) Graphics clocks below host limit (per XCP)
accumulators -> used to help calculate a violation status
3) VRAM max bandwidth at max memory clock
- Updated rocm-smi --showmetrics to include new metrics.
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields means the device does not support providing this
data.
Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Changes:
- Added warning screen to ROCm SMI users
setting memory partition
- Added new API (rsmi_dev_memory_partition_capabilities_get)
to retrieve memory partition capabilities
(What users can set memory partition modes to)
- Increased time-bar for CLI sets display to 40 seconds
- API now waits until the driver reloads with SYSFS files active
- [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
not properly displaying for MI2x or Navi 3x ASICs.
- Updated tests
Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
The reset gpu partition support for both compute and memory were removed
Code changes related to the following:
* rsmi_dev_compute_partition_reset()
* rsmi_dev_memory_partition_reset()
* CLI
* Unit tests
* Documentation
Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Changes:
- Added new GPU metrics:
1) Violation status' (ex. PVIOL/TVIOL) accumulators
2) XCP (Graphics Compute Partitions) statistics
3) pcie other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields means the device does not support providing this
data.
Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.
Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8] = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id maybe in bits [2:0]) <-- Fallback for non SPX modes
Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
This patch fixes error:
error: assignment of member 'dismiss_' in read-only object
Reported by kind Gentoo folks:
<https://bugs.gentoo.org/918709>
Change-Id: I7cc593043e97402afd85593c528ace86952b1350
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Fixes reading pp_od_clk_voltage new variable format and size.
Code changes related to the following:
* get_od_clk_volt_info()
* get_od_clk_volt_curve_regions()
* Unit tests
* CLI options restored: --showclkvolt, --showvc, --showvoltagerange, --setvc
* Rework: 48ddd9ab
* Bump CLI version
* CHANGELOG.md
Change-Id: I817ca224de923fdaa992df84592d63b4d5a12b22
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
This commit aligns the rsmiBindings.py.in file's
"notification_type_names" & "rsmi_evt_notification_type_t" with
those found in the rsmiBindings.py file.
Change-Id: I67f36606c505992fb98495651310bd70a1755033
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Fixes reading pp_od_clk_voltage new variable format and size.
Code changes related to the following:
* get_od_clk_volt_info()
* get_od_clk_volt_curve_regions()
* Unit tests
* CLI options removed: --showclkvolt, --showvc, --showvoltagerange, --setvc
Change-Id: Ieedb845eeadcea2f2e447ec576c253ad2a814176
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
This patch adds 'ring hang' enums to ROCM SMI LIB.
This event type name is KFD_SMI_EVENT_RING_HANG.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9b886eb1fc027f03bcca1e5d1a89a2a186b64bf5
The environment variable RSMI_MUTEX_THREAD_ONLY=1 to enable thread only mutex.
The RSMI_INIT_FLAG_THRAD_ONLY_MUTEX can also be pass to rsmi_init()
to enable thread only mutex.
Change-Id: I2d9844039b774e386f03bb9bb130d8c342504ea6
* Updates:
- [CLI] rocm-smi (no arg) and --showhw:
Now displays 'ID'/'PARTITION ID' from the pcie_id identifier
Helps users identify which partition # the device is
Information provided by KFD
Note: partition_id of 0, means a primary node (AKA root node),
ex. ASICs which do not have partitioning support will show 0
- [API] Fix partitions nodes which do not enumerate with domain:
Adding kfd's domain, allows ASICs which have domains
to enumerate in proper order.
Full pcie_id / bdf propagates to all partition nodes.
- [API] Update rsmi_dev_pci_id_get() to allow users to extract
partition_id from device
- [CLI] Added fix for devices which have modprobe failure,
but DRM does not come up properly. Even though driver shows
initialization was successful.
- [API/Utils] Overloaded print_int_as_hex() template:
Now accepts bitsize, and prints in smallest byte size
possible. Note: bitsize of < 8, please just print as decimial.
Change-Id: Ib0c6f73b2b9c9fea29442a39a669c432874382d8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards
Code changes related to the following:
* 'rsmi_dev_metrics_' APIs
* Functional tests
* Examples
Change-Id: I7d562a95889361ee6f8f7588f8a790f42c8eb262
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Updated:
* [CLI] Fixed vram % - printf style formatting causes many data errors
This fix updates to the recommended way of outputting formatted data.
https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
* [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
-> CLI name: "GUID"
-> ROCm SMI calls: no arg, -i, --showhw, --showproduct
* [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
-> CLI name: "Node"
-> ROCm SMI calls: no arg, --showhw, --showproduct
* [CLI] Added target gfx version from kfd
-> CLI name: "GFX Version" or "GFX VER"
-> ROCm SMI calls: --showhw, --showproduct
* [CLI] Base ROCm CLI
-> Removed - stacked id formatting:
This is to simplify identifiers helpful to users.
More identifiers can be found on -i --showhw, --showproduct
* [CLI] Update -i, --showhw, --showproduct, w/out arg
-> Card ID/DID/Model/SKU/VBIOS:
All unsupported values now display "N/A" instead
of "unknown" or "unsupported"
* [CLI] Showhw now expands data based on content
Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
On some systems [rocm-smi --showpids] reports
get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN
get_compute_process_info_by_pid fails because cu_occupancy debugfs method
is not provided on some graphics cards and GFX revisions by design
Proposing a change to return success status when only cu_occupancy debugfs method
is not found and provide cu_occupancy invalidation value to mark only
this parameter as UNKNOWN
Change-Id: Iae37070d9bd19483b4e6c8ee24c7d9a4c92f00d7
Signed-off-by: Vladimir Stempen <Vladimir.Stempen@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
In addition to be able to set clock range, new setextremum option
is added to set only min/max clock as sometimes one of them may
not be supported.
Change-Id: I7c91ba308f3fc6c78efc88117509c515d403a6cb
Updates:
- [CLI] Switching to use generic rsmi_dev_power_get()
this is a backwards compatible function to
retrieve power values. More consistent than
previous fixes.
- [API] Update API for rsmi_dev_power_get()
Now provides @depricated for this function.
Providing notes on newer ASICS only support
current socket power, where as previous
ASICS only provided average power.
Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updates:
- [API] rsmi_dev_target_graphics_version_get, takes
reported value from KFD -> parses into human-readable
values. If device does not support, returns MAX UINT64
value and RSMI_STATUS_NOT_SUPPORTED.
Otherwise, puts into base10 format removing
extra 0's + putting in correct format. If user
provides nullptr, returning RSMI_STATUS_INVALID_ARGS.
- [Test/Example] sys_info_read updated to include
new rsmi_dev_target_graphics_version_get tests
Change-Id: I50f94e06b8733a5dec2eb08f284b44927f36abcd
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Adds support and implement APIs for 'gpu_metrics_v1_5'
Code changes related to the following:
* gpu metrics 1.5 support
* Unit tests
* Examples
Build changes related to the following: None
Change-Id: Ie8917dd63c1dd1a94467b100fa44b634cebe62b6
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Uses new support for 'gpu_metrics_v1_4'
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* new data structure fields added in 1.4
* added APIs for all other existing metrics before 1.4
* added support to older metrics; 1.1, and 1.2
* added support to dump_internal_metrics_table()
* public APIs renamed to start with prefix 'rsmi_dev_metrics_'
* Unit tests updated
* Examples updated
Build changes related to the following: None
Change-Id: I23e59f99d3ed43318cd6bd43bd2f0c5387e9ccb9
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Uses new support for 'gpu_metrics_v1_4'
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* new data structure fields added in 1.4
* added APIs for all other existing metrics before 1.4
* added support to older metrics; 1.1, and 1.2
* public APIs renamed to start with prefix 'rsmi_dev_metrics_'
* Unit tests updated
* Examples updated
Build changes related to the following: None
Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
- std=c++.. is not required because CMAKE_CXX_STANDARD is set
- nullptr check breaks the test because we rely on nullptr as an api for
checking feature availability.
- enum number setting is unnecessary
Change-Id: I393e6dd3f292b7fa4198302f140c0443ba5e50f5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
The rsmi_topo_get_link_type() is extended to support query the CPU
and GPU link type by passing dv_ind_dst as 0xFFFFFFFF.
Change-Id: I1f212a01e8120adb70a08ab772fa9faaaecefa29
* Updates:
- [API/CLI] rsmi_dev_*_partition_set &
rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
EBUSY writes + cleaned up accidental map insertions
(maplookup[] can insert values that are not in the map,
map.at(key) fixes this potential issue)
- [API] rsmi_dev_gpu_metrics_info_get() - returns
RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
outside of 1v1/1v2/1v3
- [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
EBUSY write errors; kept backward compatibility
for other writes which do not care about these states
- [API] rsmi_dev_od_volt_info_get()
& rsmi_dev_od_volt_curve_regions_get() have better logging
+ Expose more details on why they are erroring
- [Utils/logs/example] Expose AMD GPU gfx target version to aid in
system troubleshooting
- [Utils] Added test methods that look at od volt
freq & regions into here - for easier access across
several tests
- [Utils] Updated getRSMIStatusString(new argument - fullstatus;
default to true for backwards compatibility)
-> true shows shortened RSMI STATUS response
- [Utils] Added splitString to cut out noisy return responses
(used in getRSMIStatusString(), when fullstatus = true)
- [Utils] Added getFileCreationDate() to expose build date
of the library - helpful for local builds or experimental builds
- [Utils] Macro cleanup
- [Example] Added a few gpu_metric checks - helpful for upcoming
updates
- [Device] SYSFS/DebugFS - now have better r/w displayed in logs
- [LOGS] Expose library build date - see above for details
- [Tests] Add more warnings/errors to test builds
- [Tests] Moved up Partition tests for ordered test runs - helped
identify issues with GPU BUSY writes
- [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
with waits for busy status found & cleaned up how we checked
for partition changes - with RSMI responses exposed more clearly
- [Tests] perf_determinism - multi gpu now properly runs through
with full resets as needed
- [Tests] volt_freq_curv_read - better error handling with more
verbose output
Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
* Updates:
- [API] After discovering all amd gpus, we now properly
map correct bdf (xgmi nodes). Especially important for
partition changes - aka secondary nodes.
- [API] While adding new secondary nodes we now have
better grouping -> due to resorting based on
kfd properties list & matching to primary uniqueid
- [API] All secondary nodes are now AddToDeviceList
with correct bdf (location id), provided by kfd
- [API] Modified AddToDeviceList(..., uint64_t bdfid):
providing an optional field - bdfid. This allows working
around primary pcie cards with xgmi nodes
- [API] Utils - cpplint minor fixes
- [Example] Removed all endl references w/ newline, fixed
spacing, and some incorrect values displaying as hex
(needed dec representation)
- [API] kfd node functions - now print full path of file
for trace logs
- [Tests] power_read.cc: Added in generic power test to
confirm guaranteeing specific return values
Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
* Updates:
- [API] Added rsmi_dev_power_get(uint32_t dv_ind,
uint64_t *power,
RSMI_POWER_TYPE
*type)
provides generic get to average or
current power & provides backwards
compatibility
- Added a utility function to get MonitorTypes
(monitor_type_string(type)) &
RSMI_POWER_TYPE (power_type_string(type))
strings
- [Tests] Added rsmi_dev_power_get tests and
provided better verification of return values for
all power APIs
- [Tests] Updated power outputs to show correct
units
- [example] Now uses avg, current, and generic
power functions with type output response
Change-Id: I5ca06ca37fd5f61e100f2835b664d6cdd1ca42e6
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Adds support for 'gpu_metrics_v1_4' and new counters
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* The new gpu_metrics are now part of the Device
Build changes related to the following: None
Change-Id: Ie748e977cd0a01c6a2fb82260014c0699605dbb3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
* Updates:
- rocm_smi_lib + CLI:
Rename all "NPS mode" -> "memory partition"
related files/functions/API/CLI to align with correct
technical naming
- rocm_smi_main: fixed identifying primary card's unique id
utilize rsmi_dev_unique_id_get to map which
KFD nodes belong to it
- rsmi_dev_*_partition*: now have better logging output
- compute partition tests:
Added 20 sec delay for workaround until GPU
busy is confirmed as the issue
- CPPLint fixes/formatting
- [Example] Moved all endl to "\n" for efficiency
- [Example] Added Edge & Junction temperature examples
- [Example] Added rsmi_minmax_bandwidth_get() example - WIP
Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
* Updates:
- rocm_smi_logger:
General cleanup &
Aligned to cpplint rules for usage
- rocm_smi_monitor:
Fixed MonitorTypes
from not displaying properly in logs
& Added socket power label + current
socket power MonitorTypes
- rocm_smi API:
Added rsmi_dev_current_socket_power_get API
- rocm_smi CLI:
General cleanup,
Concise info now displays device data
in variable width (see printLogSpacer's
new field),
printLogSpacer now as an adjustable
variable that overrides appWidth,
Added Socket Power to base rocm-smi +
--showpower CLI calls,
--showpower & base rocm-smi CLI defaults
to printing socket power (if not available,
displays average power)
- Cleaned up temp label references
- power_read gtests:
Added current socket power to testing
Change-Id: Ica57e6f98ad96e2584e7c7955e188f68d2dab89d
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
The purpose of this patch is to add the following missing firmware
blocks to the SMI LIB:
-RSMI_FW_BLOCK_MES
-RSMI_FW_BLOCK_MES_KIQ
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5d4d37d883878dd02ef8533d4eb8891d54d70630
Implements APIs for 'gpu_metrics_v1_3' utilization averages
Code changes related to the following:
* rsmi_dev_activity_metric_get()
* rsmi_dev_activity_avg_mm_get()
* CLI shows "Avg.Memory Bandwidth" under "--showmemuse"
Change-Id: I8e4600f350a7c18499abf022534db2b875f09d5f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
* Updates:
- Fixed infinit loop on systems
which did not have VRAM files
- Fixed concise info from throwing exception
with no amdgpu driver loaded
- Fix for ability to see all nodes when
after switching partitions (mirrors
original card display/settings)
- Added to logs build type, lib path,
and set env. variables
Change-Id: Ic0333df355144ce2242cecea93fe4ce51caf311c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>