- [CLI] Rounded VRAM output on CLI, no difference in output
- [python API] Fixed initialization for callers that reuse
initializeRsmi() - now we set a global reference to rocmsmi to use
throughout API calls (see error below)
Traceback (most recent call last):
File "/home/charpoag/rocmsmi_pythonapi.py", line 9, in <module>
rocm_smi.initializeRsmi()
File "/opt/rocm/libexec/rocm_smi/rocm_smi.py", line 3531, in initializeRsmi
ret_init = rocmsmi.rsmi_init(0)
NameError: name 'rocmsmi' is not defined
Change-Id: I0eff3b8a432abf6d4344a02b9f638e1191c51a19
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 90160a7c9c]
Checks the error returned by get_gpu_pci_bandwith() before asserting
Code changes related to the following:
* Unit tests
Change-Id: Ia0fe64f168711147c5e66c7917cf633be40dee9f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 35b561fd69]
Checks and forces rereading gpu metrics unconditionally
Code changes related to the following:
* Device::dev_log_gpu_metrics()
* Examples
* Unit tests
Change-Id: Ic1c4f34a39f2bf197263f80ddbb84da26345807d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: b4d37caa70]
Cleans up individual gpu metric APIs which will be implemented according to 'unified-headers' standards
Code changes related to the following:
* 'rsmi_dev_metrics_' APIs
* Functional tests
* Examples
Change-Id: I7d562a95889361ee6f8f7588f8a790f42c8eb262
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: ce36198cb1]
Updated:
* [CLI] Fixed vram % - printf style formatting causes many data errors
This fix updates to the recommended way of outputting formatted data.
https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
* [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
-> CLI name: "GUID"
-> ROCm SMI calls: no arg, -i, --showhw, --showproduct
* [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
-> CLI name: "Node"
-> ROCm SMI calls: no arg, --showhw, --showproduct
* [CLI] Added target gfx version from kfd
-> CLI name: "GFX Version" or "GFX VER"
-> ROCm SMI calls: --showhw, --showproduct
* [CLI] Base ROCm CLI
-> Removed - stacked id formatting:
This is to simplify identifiers helpful to users.
More identifiers can be found on -i --showhw, --showproduct
* [CLI] Update -i, --showhw, --showproduct, w/out arg
-> Card ID/DID/Model/SKU/VBIOS:
All unsupported values now display "N/A" instead
of "unknown" or "unsupported"
* [CLI] Showhw now expands data based on content
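The vram % item in the list above moves away from printf-style
formatting. One well-known pitfall of the % operator, shown here as a
rough sketch, is that a tuple value is unpacked as multiple arguments;
str.format does not have that problem. The function names are made up
for this example and are not taken from the CLI source.

```python
def printf_style(value):
    """Format with the % operator; fails when value is a tuple."""
    return "value: %s" % value


def format_style(value):
    """Format with str.format; any value is rendered as one field."""
    return "value: {}".format(value)


def vram_percent_line(used_bytes, total_bytes):
    """Build a VRAM usage line, guarding against a zero total."""
    percent = 0.0 if total_bytes == 0 else 100.0 * used_bytes / total_bytes
    return "GPU memory use: {:.0f}%".format(percent)
```

With printf_style((1, 2)) Python raises TypeError, while
format_style((1, 2)) returns the tuple rendered as a single field.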
Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 4b5ccb57f0]
On some systems [rocm-smi --showpids] reports
get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN
get_compute_process_info_by_pid fails because cu_occupancy debugfs method
is not provided on some graphics cards and GFX revisions by design
Proposing a change to return success status when only cu_occupancy debugfs method
is not found and provide cu_occupancy invalidation value to mark only
this parameter as UNKNOWN
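The proposal above can be sketched roughly as follows: a missing
cu_occupancy source no longer fails the whole query, and only that one
field is marked invalid. The sentinel value, the reader callables, and
the dictionary layout are assumptions made for this illustration; they
are not the library's actual types.

```python
CU_OCCUPANCY_INVALID = 0xFFFFFFFF  # hypothetical invalidation value


def read_process_info(read_vram, read_sdma, read_cu_occupancy):
    """Collect per-process data; tolerate a missing cu_occupancy source."""
    info = {"vram_usage": read_vram(), "sdma_usage": read_sdma()}
    try:
        info["cu_occupancy"] = read_cu_occupancy()
    except FileNotFoundError:
        # Only this field is unavailable, so the call still succeeds.
        info["cu_occupancy"] = CU_OCCUPANCY_INVALID
    return info


def display(value):
    """Render the sentinel as UNKNOWN and anything else as a number."""
    return "UNKNOWN" if value == CU_OCCUPANCY_INVALID else str(value)
```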
Change-Id: Iae37070d9bd19483b4e6c8ee24c7d9a4c92f00d7
Signed-off-by: Vladimir Stempen <Vladimir.Stempen@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rocm_smi_lib commit: 677433b367]
In addition to being able to set a clock range, a new setextremum
option is added to set only the min or max clock, as sometimes one of
them may not be supported.
Change-Id: I7c91ba308f3fc6c78efc88117509c515d403a6cb
[ROCm/rocm_smi_lib commit: 4e0a7f2f67]
Updates:
- [CLI] Switching to use generic rsmi_dev_power_get()
this is a backwards compatible function to
retrieve power values. More consistent than
previous fixes.
- [API] Update API for rsmi_dev_power_get()
Now provides @deprecated for this function.
Providing notes that newer ASICs only support
current socket power, whereas previous
ASICs only provided average power.
Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 51aec98edd]
Updates:
- [API] rsmi_dev_target_graphics_version_get, takes
reported value from KFD -> parses into human-readable
values. If device does not support, returns MAX UINT64
value and RSMI_STATUS_NOT_SUPPORTED.
Otherwise, puts into base10 format removing
extra 0's + putting in correct format. If user
provides nullptr, returning RSMI_STATUS_INVALID_ARGS.
- [Test/Example] sys_info_read updated to include
new rsmi_dev_target_graphics_version_get tests
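As a rough illustration of the parsing step in the first item above,
the sketch below splits a KFD-style target version into its parts. It
assumes the raw value is encoded as major * 10000 + minor * 100 +
stepping, and the "gfx" string it builds is only an example of a
human-readable form; the exact value the API returns is not shown in
this message. Function names are hypothetical.

```python
MAX_UINT64 = 0xFFFFFFFFFFFFFFFF  # value reported when unsupported


def decode_gfx_target_version(raw):
    """Split a raw version into (major, minor, stepping).

    Assumes raw = major * 10000 + minor * 100 + stepping.
    Returns None for the unsupported marker.
    """
    if raw == MAX_UINT64:
        return None
    return raw // 10000, (raw // 100) % 100, raw % 100


def gfx_name(raw):
    """Build an illustrative gfx target string, or 'N/A' if unsupported."""
    parts = decode_gfx_target_version(raw)
    if parts is None:
        return "N/A"
    major, minor, stepping = parts
    return "gfx{}{:x}{:x}".format(major, minor, stepping)
```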
Change-Id: I50f94e06b8733a5dec2eb08f284b44927f36abcd
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 5d2cd0c271]
The current code assumes the err_count sysfs file only has 2 lines,
which changed for umc_err_count when an extra line was added for defer
errors. The code is changed to relax this check.
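A relaxed parser in the spirit of this change could look like the
sketch below: it requires the two original counters and accepts any
extra lines. The "ue", "ce" and "de" labels and the "key: value" layout
are assumptions for illustration, and parse_err_count is a hypothetical
name rather than a library function.

```python
def parse_err_count(text):
    """Parse 'key: value' lines without assuming a fixed line count."""
    counts = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not 'key: value'
        counts[key.strip()] = int(value.strip())
    # Still require the two counters the old code depended on.
    if "ue" not in counts or "ce" not in counts:
        raise ValueError("missing ue/ce counters")
    return counts
```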
Change-Id: I1c469555a5d460d7bc4f4926245646c09c6a2056
[ROCm/rocm_smi_lib commit: 73c65b6bfe]
Change the python tool not to display above information if it is
not supported.
Change-Id: I48ffd95f07168219a629dfb391c1b4587308286d
[ROCm/rocm_smi_lib commit: 905c25e59b]
Apply the following changes to project documentation for ReadtheDocs:
add version number to documentation left navigation bar and page title
add an "About" section with a license page
enable htmlzip, pdf, epub formats when publishing on Read the Docs
set pdf title, author, copyright, and version
rename .sphinx/.doxygen to sphinx/doxygen
remove docBin from URL
update rocm-docs-core dependency
update dependabot config
Change-Id: Ife8c89a2e9323f436b3e54ef2a9e013c19b3b228
[ROCm/rocm_smi_lib commit: 67dc4b0f2a]
Adds support and implements APIs for 'gpu_metrics_v1_5'
Code changes related to the following:
* gpu metrics 1.5 support
* Unit tests
* Examples
Build changes related to the following: None
Change-Id: Ie8917dd63c1dd1a94467b100fa44b634cebe62b6
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 373621aed3]
Received EACCES return for file that does not have
write access (read only). Permissions would be an
issue, but we check for sudo/root permissions early on.
Change-Id: I98615b02e4acccc59facb42225887a6b7273716b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: c6b0c93e6f]
Code changes related to the following:
* Check smallest copy size for multi-valued metrics
* Unit tests: gpu_metric_read
* ROCMSMI examples
Build changes related to the following:
* CMakeLists.txt
Change-Id: Ieb2363020fa21c93fbacd0edcc1d394eed183051
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 8e0d3d5a39]
MCM die check was inconsistent (using avg power).
By using only the energy counter, this provides
a consistent way of checking which die is the MCM node.
Change-Id: I532fa2047706d0f1e92e643ce1e6759e45b65ec0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 553d26ef3a]
Uses new support for 'gpu_metrics_v1_4'
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* new data structure fields added in 1.4
* added APIs for all other existing metrics before 1.4
* added support for older metrics: 1.1 and 1.2
* added support to dump_internal_metrics_table()
* public APIs renamed to start with prefix 'rsmi_dev_metrics_'
* Unit tests updated
* Examples updated
Build changes related to the following: None
Change-Id: I23e59f99d3ed43318cd6bd43bd2f0c5387e9ccb9
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 713d259f88]
Uses new support for 'gpu_metrics_v1_4'
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* new data structure fields added in 1.4
* added APIs for all other existing metrics before 1.4
* added support for older metrics: 1.1 and 1.2
* public APIs renamed to start with prefix 'rsmi_dev_metrics_'
* Unit tests updated
* Examples updated
Build changes related to the following: None
Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 2c8ba4cae9]
Sort GPU index based on BDF. Also add an API to get the XGMI
physical id.
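Sorting by BDF can be sketched as follows, assuming the usual
"domain:bus:device.function" hexadecimal string form. The helper names
are hypothetical, and the library itself may sort on a packed numeric
BDF instead of strings.

```python
def bdf_key(bdf):
    """Turn 'dddd:bb:dd.f' into a numeric tuple suitable for sorting."""
    domain, bus, rest = bdf.split(":")
    device, function = rest.split(".")
    return (int(domain, 16), int(bus, 16), int(device, 16), int(function, 16))


def sort_by_bdf(bdfs):
    """Return the BDF strings ordered by domain, bus, device, function."""
    return sorted(bdfs, key=bdf_key)
```

GPU index 0 would then correspond to the first entry of the sorted
list, index 1 to the second, and so on.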
Change-Id: I998876e435165c59d450ecd0b979315278b488a5
[ROCm/rocm_smi_lib commit: e5627d2bf1]
- std=c++.. is not required because CMAKE_CXX_STANDARD is set
- nullptr check breaks the test because we rely on nullptr as an api for
checking feature availability.
- enum number setting is unnecessary
Change-Id: I393e6dd3f292b7fa4198302f140c0443ba5e50f5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rocm_smi_lib commit: a099f0682a]
The rsmi_topo_get_link_type() API is extended to support querying the
CPU and GPU link type by passing dv_ind_dst as 0xFFFFFFFF.
Change-Id: I1f212a01e8120adb70a08ab772fa9faaaecefa29
[ROCm/rocm_smi_lib commit: de5bc164de]
* Updates:
- [API/CLI] rsmi_dev_*_partition_set &
rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
EBUSY writes + cleaned up accidental map insertions
(maplookup[] can insert values that are not in the map,
map.at(key) fixes this potential issue)
- [API] rsmi_dev_gpu_metrics_info_get() - returns
RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
outside of 1v1/1v2/1v3
- [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
EBUSY write errors; kept backward compatibility
for other writes which do not care about these states
- [API] rsmi_dev_od_volt_info_get()
& rsmi_dev_od_volt_curve_regions_get() have better logging
+ Expose more details on why they are erroring
- [Utils/logs/example] Expose AMD GPU gfx target version to aid in
system troubleshooting
- [Utils] Added test methods that look at od volt
freq & regions into here - for easier access across
several tests
- [Utils] Updated getRSMIStatusString(new argument - fullstatus;
default to true for backwards compatibility)
-> true shows shortened RSMI STATUS response
- [Utils] Added splitString to cut out noisy return responses
(used in getRSMIStatusString(), when fullstatus = true)
- [Utils] Added getFileCreationDate() to expose build date
of the library - helpful for local builds or experimental builds
- [Utils] Macro cleanup
- [Example] Added a few gpu_metric checks - helpful for upcoming
updates
- [Device] SYSFS/DebugFS - now have better r/w displayed in logs
- [LOGS] Expose library build date - see above for details
- [Tests] Add more warnings/errors to test builds
- [Tests] Moved up Partition tests for ordered test runs - helped
identify issues with GPU BUSY writes
- [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
with waits for busy status found & cleaned up how we checked
for partition changes - with RSMI responses exposed more clearly
- [Tests] perf_determinism - multi gpu now properly runs through
with full resets as needed
- [Tests] volt_freq_curv_read - better error handling with more
verbose output
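The busy handling mentioned for compute_partition_read_write in the
list above can be pictured as a bounded retry loop. This is only a
sketch: the status strings stand in for the library's status codes, and
retry_while_busy, its parameters, and the retry count are assumptions
made for the example.

```python
import time

SUCCESS = "RSMI_STATUS_SUCCESS"  # stand-in label, not the enum value
BUSY = "RSMI_STATUS_BUSY"        # stand-in label, not the enum value


def retry_while_busy(operation, attempts=5, delay_s=0.0, sleep=time.sleep):
    """Call operation() until it stops reporting BUSY or attempts run out."""
    status = operation()
    for _ in range(attempts - 1):
        if status != BUSY:
            break
        sleep(delay_s)  # give the device time to become idle
        status = operation()
    return status
```

A caller would pass a function that performs the partition write and
returns its status, then inspect the final status as usual.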
Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 57b6135e54]
Upstream soversion has been at 5 for a while, but Debian's soversion was set to
1 at the beginning of the rocm-smi-lib package. This is probably erroneous,
and the library would probably be better off synchronized with upstream
so there is some kind of ABI compatibility between the two distributions.
.
FIXME: please use upstream soversion next time an ABI breakage justifies an
SOVERSION bump, instead of just incrementing the present version by one.
Author: Étienne Mollier <emollier@debian.org>
Forwarded: not-needed
Last-Update: 2023-09-17
Change-Id: I6c4d28bd26889359c0b83c474d5ae58a81741cf4
Co-authored-by: Étienne Mollier <emollier@debian.org>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rocm_smi_lib commit: 1775ae4b8d]
When built with LTO enabled, the linking of liboam.so chokes on the
following error, which is somewhat similar to the Debian bug #1030876
affecting PA-RISC, although the symptoms differ subtly in that it
suggests building with -fPIC:
/usr/bin/ld: /tmp/cc0wF8Kx.ltrans0.ltrans.o: relocation R_X86_64_PC32 against symbol `_ZTVSt9exception@@GLIBCXX_3.4' can not be used when making a shared object; recompile with -fPIC
The -fPIC argument is passed appropriately down to the build command,
however it looks to be erased by the late introduction of -fPIE flag
by upstream build system. Erasing this flag allows the build to go
through, both with LTO and on PA-RISC.
Bug: https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/111
Bug-Debian: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1015653
Change-Id: I8b35fd4b62cfa1a9ddb145362464df5dd276e2f5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rocm_smi_lib commit: c4c19e7917]
* Updates:
- [API] After discovering all AMD GPUs, we now properly
map the correct bdf (xgmi nodes). Especially important for
partition changes - aka secondary nodes.
- [API] While adding new secondary nodes we now have
better grouping -> due to resorting based on
kfd properties list & matching to primary uniqueid
- [API] All secondary nodes are now AddToDeviceList
with correct bdf (location id), provided by kfd
- [API] Modified AddToDeviceList(..., uint64_t bdfid):
providing an optional field - bdfid. This allows working
around primary pcie cards with xgmi nodes
- [API] Utils - cpplint minor fixes
- [Example] Replaced all endl references with newline, fixed
spacing, and some incorrect values displaying as hex
(needed dec representation)
- [API] kfd node functions - now print full path of file
for trace logs
- [Tests] power_read.cc: Added a generic power test to
confirm that specific return values are guaranteed
Change-Id: I143474e8d64c4915a966e789be6bcea4fa7f4472
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 6f1afd2678]
* Updates:
- [API] Added rsmi_dev_power_get(uint32_t dv_ind,
uint64_t *power,
RSMI_POWER_TYPE
*type)
provides generic get to average or
current power & provides backwards
compatibility
- Added a utility function to get MonitorTypes
(monitor_type_string(type)) &
RSMI_POWER_TYPE (power_type_string(type))
strings
- [Tests] Added rsmi_dev_power_get tests and
provided better verification of return values for
all power APIs
- [Tests] Updated power outputs to show correct
units
- [example] Now uses avg, current, and generic
power functions with type output response
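The idea behind a generic power getter can be sketched as below: try
one kind of reading, fall back to the other, and report which one was
used. The order of the attempts, the reader callables, and the string
labels are all assumptions for illustration; they do not reproduce the
library's implementation or its RSMI_POWER_TYPE values.

```python
# Illustrative labels only; not the library's enum values.
AVERAGE, CURRENT, INVALID = "average", "current", "invalid"


def generic_power_get(read_current, read_average):
    """Return (power, type) from whichever reading the device supports.

    Each reader is a hypothetical callable that returns a number, or
    None when the device does not support that reading.
    """
    # Assumption: current socket power is tried first, then average.
    for reader, power_type in ((read_current, CURRENT), (read_average, AVERAGE)):
        value = reader()
        if value is not None:
            return value, power_type
    return None, INVALID
```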
Change-Id: I5ca06ca37fd5f61e100f2835b664d6cdd1ca42e6
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 31a1fcce7d]
Adds support for 'gpu_metrics_v1_4' and new counters
Code changes related to the following:
* rsmi gpu_metrics APIs
* rsmi gpu_metrics Logs
* The new gpu_metrics are now part of the Device
Build changes related to the following: None
Change-Id: Ie748e977cd0a01c6a2fb82260014c0699605dbb3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
[ROCm/rocm_smi_lib commit: 4e4ebde640]
* Updates:
- rocm_smi_lib + CLI:
Rename all "NPS mode" -> "memory partition"
related files/functions/API/CLI to align with correct
technical naming
- rocm_smi_main: fixed identifying primary card's unique id
utilize rsmi_dev_unique_id_get to map which
KFD nodes belong to it
- rsmi_dev_*_partition*: now have better logging output
- compute partition tests:
Added 20 sec delay for workaround until GPU
busy is confirmed as the issue
- CPPLint fixes/formatting
- [Example] Moved all endl to "\n" for efficiency
- [Example] Added Edge & Junction temperature examples
- [Example] Added rsmi_minmax_bandwidth_get() example - WIP
Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: b251bb0c9f]
- Return from freq_output function early if clock is unsupported
- Right-align frequencies
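Both points can be illustrated with a short sketch: return early when
there is nothing to print, and pad each frequency to a common width so
the column lines up on the right. The function signature and the exact
line layout are invented for this example and are not the CLI's actual
output format.

```python
def freq_output(freqs_mhz, current_index):
    """Return right-aligned frequency lines, or [] if unsupported."""
    if not freqs_mhz:
        return []  # early return: clock not supported or no levels
    width = max(len(str(freq)) for freq in freqs_mhz)
    lines = []
    for index, freq in enumerate(freqs_mhz):
        marker = " *" if index == current_index else ""
        lines.append("{}: {:>{w}}Mhz{}".format(index, freq, marker, w=width))
    return lines
```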
Change-Id: I799c9351dac8a5be161bc9243cd3816539728357
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rocm_smi_lib commit: e962d3b281]