Library path was printed at all times even with --json flag.
This commit adds a mandatory initRsmiBindings function which is a core
component of the rsmiBindings.py library.
It **MUST** be called on import.
Change-Id: Ic6ae1ec5d1fabba288910e6aed6c4706e53e5cd7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
For operations related to:
--resetfans
--setfan
We report 'Not supported' for these cases instead of 'Permission denied'
Code changes related to the following:
* rocm_smi_properties
* rocm_smi related APIs
Change-Id: I144646efc3804fabd45cc5a46351803950b4feb7
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
* Updates:
- Fixed infinit loop on systems
which did not have VRAM files
- Fixed concise info from throwing exception
with no amdgpu driver loaded
- Fix for ability to see all nodes when
after switching partitions (mirrors
original card display/settings)
- Added to logs build type, lib path,
and set env. variables
Change-Id: Ic0333df355144ce2242cecea93fe4ce51caf311c
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Code changes related to the following:
* Reverts earlier fix for the same issue
* Check for existence of files before reading
Change-Id: I175b20c3343c414b12b79dc3fc404f53fbaabf3a
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Code changes related to the following:
* rocm_smi.py
Change-Id: I600e776bf479f972b8d639ce5a658a24916aed3c
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
Properly handles 'Unable to detect' vs 'Not supported' fan cases where:
* sysfs file (pwm#) exists, and readings report zero (0), "Unable to detect fan speed"
* sysfs file (pwm#) does not exist, then "Not supported"
Change-Id: If4b0312c872b76647a3e54427ba2a3f3e8e6dab1
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
* Updates:
- Fix for devices which do not have edge sensors, but junction
- Added partitioning (memory and dynamic) displays for
base rocm-smi CLI calls
- Added subheading for base rocm-smi call output
- Added better hwmon and device detection logging
Change-Id: I8219884b2e532d6ed379527cacdc1f2b232a5451
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Since the reset will continue if the reset power and current power
is the same, error may confuse the user.
Change-Id: I35b9ef17afd47b5af5bd2b8882a44f63991fe509
Code changes related to the following:
* Added 'rsmi_dev_revision_get()' related code
* Test code
* Functional tests
Change-Id: I8c2097c65384a028c8c8437b717d05d52fe45250
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
The librocm_smi64.so is used for development, while
librocm_smi64.so.MAJOR is used for runtime, thus the python front end
should not be loading the .so binary, but rather the .so.MAJOR binary.
As well, it's good not to hardcode "lib" as some distros will change
this.
rsmiBindings.py is now generated with CMake
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I7cb745f8936fdf10d3ebd6c1e606031f713184ca
If temp in hwmon was missing - rocm-smi crashed.
e.g. /sys/class/drm/card1/device/hwmon/hwmon5/temp1_input
This change displays "N/A" for temp instead of crashing.
Change-Id: I02f84a466bd3acfbd9b65e7e4ca0f18e76606c3b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Used pyright to show errors and warnings and resolved most
Change-Id: I0fdf7dcdf08db5c35dec80f6645e0a395fbe4197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Updates:
* [rocm-smi] Added larger app width size, which helps
display missing device info
* [rocm-smi] Added better context when rsmi_ret_ok
does not return with RSMI_STATUS_SUCCESS
* [rocm-smi] Removed all references to an
undefined function (printLogNoDev())
* [rocm-smi] Fixed not detecting non-int
values when setting the voltage curve
* [rocm-smi] Added better context on missing
sysfs file when setting clock overdrive
values
* [rocm-smi] Fixed getMemInfo() calls not
referencing tuple values (making it easier
to read)
* [rocm-smi] Silenced concise info spitting
out errors for missing VRAM files, instead
display which metric is "unsupported" if
the files are missing
* [rocm-smi] Updated function descriptions for
rsmi_ret_ok & getMemInfo
* [rocm-smi] Updated getMemInfo to provide a
quiet call, to silence for concise info calls.
This provides a way to keep the output clean.
* [rocm-smi-lib] Added when using debug sysfs
files, to state, which enums are enabled
for debug
Change-Id: I0e9e0c97ccf71467ced0e1a1f71803327a8be2b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updates:
* Added RSMI_STATUS_SETTING_UNAVAILABLE for
rsmi_dev_compute_partition_set - gives users
better error output when attempting to set
compute partition to values not listed in
available_compute_partition SYSFS
* Updated python --setcomputepartition to
provide better output when receiving
RSMI_STATUS_SETTING_UNAVAILABLE
* Updated all test & example files to check for
RSMI_STATUS_SETTING_UNAVAILABLE when doing
rsmi_dev_compute_partition_set
Change-Id: Ida5d54880d9b9b6e4a0468cdb962fdc0c18d6257
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updates:
* Added rsmi_dev_compute_partition_reset & rsmi_dev_nps_mode_reset
* Added --resetcomputepartition and --resetnpsmode python smi calls
* Added temp data files rocmsmi_boot_compute_partition_<device num>
& rocmsmi_boot_nps_mode_partition_<device num>, writes UNKNOWN
if data cannot be read or device does not support
* Cleaned up NPS & compute API documentation
* Added creation and reading of API temp files (used in reset
functionality)
* Cleaned up output of rocm_smi_example
* Updated rocm_smi_example to check if running with sudo permission
before executing write API calls (cleans up erroneous output)
* Added template specialization for storing temp data, requires
specific rsmi_type_t enums (restrics what data can be stored)
* Added storage of temp data, if temp files do not exist
* Updated google tests for NPS & compute to include reset API calls
Change-Id: I69895a466b97107617e6dbb355737b84499a76c9
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updates:
* Added rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
* Added ability to set multiple SYSFS files in debug build
* Added ability to see user's env variables set for debug build
* Added tests for rsmi_dev_nps_mode_set and rsmi_dev_nps_mode_get
* Added ability to restart AMD GPU driver, used in nps_mode_set
* Updated ROCm_SMI_Manual.pdf to include new APIs
* Added progress bar for long running python_smi_tools, used
in setting nps_mode if runs longer than .1 seconds
Change-Id: I6d61bedd28d7cba6aff432ad2d127ba741b7d15a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
This patch fixes a --showproductname bug, which is related to the
device's SKU. If a device with a VBIOS value that cannot be decoded
is used, that device's SKU cannot be parsed out of the VBIOS string.
Now, when the VBIOS value cannot be decoded, an error will be
printed instead of crashing with an 'UnboundLocalError' message.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I680a182e94107e782235b8a2477ab165988f7703
Original updates:
* Added .gitignore to help with future commits
* Updated/added copyrights on modified or added files
* Updated rocm_smi.h/.cc
- Added 3 new SMI API functions:
rsmi_dev_compute_partition_set &
rsmi_dev_compute_partition_get
- Added helpful maps/enums used in
new get/set compute_partition API calls
* Updated rocm_smi.py
- Added --showcomputepartition
- Added --setcomputepartition
- Fixed a few mistypes
* Updated rsmiBindings.py - added helpful class/dict/list
* Updated rocm_smi_example.cc
- Added helpful MACRO to detect if api is not supported.
- Added current_compute_partition set/get rocm lib calls
- Added helpful macro to discover future RSMI errors
- Commented out test_set_freq, was having permission issues
on a Navi21
* Updated rocm_smi_main.cc
- Added helpful map to debug API calls, left in for future use
- Added comment to better understand a non-class function returns
* Added computepartition_read_write.cc/.h
- Added get/set compute partition API test calls
- Confirmed on devices that do not support the API calls, tests pass
* Updated rocm_smi_test/main.cc
- Calls new compute partition gtests
Added following updates from review feedback:
* Updated rocm_smi.h/cc
- Removed C++ API calls, adding support for both C/C++
API calls could cause confusion and adds extra work for us
- rsmi_dev_compute_partition_get -> Fixed an edge case where
user gives a small buffer length size (smaller than data
received), but does not receive the partial buffer back.
google Tests are updated to reflect this find.
* Updated rocm_smi_example.cc
- Fixed test_set_freq, issue was that file was not writable.
We now indicate this warning, so prior errors make sense.
- General test code cleanup. Removed extra code,
by creating loops for tests.
* Updated rocm_smi_main.cc
- Moved and got rid of an external reference to a map used
for debugging RSMI enums, now is a const public reference.
* Updated rocm_smi.py
- Updated python code to identify NOT_SUPPORTED due to
(currently) only a few GPU support the feature
Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
This patch fixes a couple of --showproductname bugs, both of which
are related to the device's SKU.
Previously if a device with a non-standard VBIOS name was used,
fetching that device's SKU wasn't working correctly.
A standard VBIOS name should follow the following pattern:
AAA-BBBBBB-CCC
Where the middle section "BBBBBB" between the hypens is the SKU.
Now, SKU can be correctly fetched even with a non-standard VBIOS
name, and return 'unkown' if SKU does not exist.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5899a859c6131c6048bb31a4305ddacbac3075a9
The purpose of this patch is to add a new feature to the smi cli.
Use ./rocm-smi --showtempgraph to print a persistant bar graph for
each GPU's temperature.
The bar graphs refresh continuously to show current temps, and the
graphs change in a color gradient depending on the temperature.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I98902b76c42cc7281420759f5ebe8c78f7785e66
We append CE then UE, but in the table right after, it goes UE then CE.
Fix the order of the table, and add capitals for consistency
Change-Id: I208f37685508ab1e2ff83d3456620bbbf3a16268
Invalid peer links are labeled as N/A during topology report creation.
This invalid link type could be triggered by having a configuration
with CPU XGMI iolinks and disable XGMI peer to peer access. This can
be done by setting the driver parameter 'use_xgmi_p2p = 0'.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: Ifb09a8f3266a3f07686615dfb45781d6cfe55e83
The purpose of this patch is to modify the column header of the default
'./rocm-smi' command from 'Temp' to 'Temp (DieEdge)' for clarity.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I127a9214be97a1185c3db010f1c9176d1f412ec9
--showdeviceid
Fix for false-positive "FRU is corrupted" messages,
since str(sn).isalphanum() triggers on empty struct.
--showproductname
fix script termination on non-alphanum product name
Change-Id: I78d4998e156f9b0d9f45338bed2a0d30b789e220
This fixes the 'unknown' value being displayed
for Perf Level because of a missing mapping of
RSMI_DEV_PERF_LEVEL_DETERMINISM to its string
value.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I479c2baea450f0ff61640ad81cbd4d08ad56ff8e
The purpose of this patch is to set RETCODE equal to 0 by default
unless an appropriate '--loglevel LEVEL' has been set.
To allow a non-zero RETCODE value, you must use any loglevel that
is not 'warning' or 'None' (default).
You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9484a750206a3f464c59952304e72c59c3d12465
If the FRU has been corrupted, then the serial number will come in with
any manner of random bytes, which will cause decode() to fail
spectacularily. Check that the serial returned by the kernel is
alphanumeric, and print to the error log if not (then continue to the
next device).
Change-Id: If4f35b140b6089e02729b1490ed6b48d614a122a
This patch changes the error handling for setClockRange.
When a device does not support modifying a clock type (sclk/mclk),
an error message is printed through the python CLI.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I37d9ea4189b1ca81e5deaab5efa6cfa4901b89b3
showpidgpus prints 'none' when no GPU devices are
being used by the running process. Adding a fix
to print a relevant message.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I165a6644a76c8e1c3c3cad676dcfd41eb1c4724f
This patch fixes a --showvoltagerange bug, which attempts to check
the voltage curve on a device that does not have any voltage
regions in its OverDrive voltage frequency data (odvf).
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I647c30c978ffb13f6819ac3d069ee340710a7f99
Fixes bug in the 'setPowerOverdrive' function which mishandles
GPUs with secondary dies. Secondary dies have a default power cap
of 0W and cannot be changed, so they are now skipped.
Fixes bug in the 'resetPowerOverdrive' function which incorrectly
resets the wattage to the current value.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I483fa3f58b1fa44a3bf7bae3b52c59ce523ae152
Show an optional debug log (RSMI_DEBUG_BITFIELD=2) to
the user in the following scenarios:
1. If more than one current frequency is found
2. If frequencies are not read in increasing order of
their value
If current frequency is not available, index for it is
set to -1, values will not have * next to it in the
output. This will also be handled in rocm_smi.py.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I477ec065f7513c8045d6392f12ef6cb835a6b8f6