Original updates:
* Added .gitignore to help with future commits
* Updated/added copyrights on modified or added files
* Updated rocm_smi.h/.cc
- Added 3 new SMI API functions:
rsmi_dev_compute_partition_set &
rsmi_dev_compute_partition_get
- Added helpful maps/enums used in
new get/set compute_partition API calls
* Updated rocm_smi.py
- Added --showcomputepartition
- Added --setcomputepartition
- Fixed a few mistypes
* Updated rsmiBindings.py - added helpful class/dict/list
* Updated rocm_smi_example.cc
- Added helpful MACRO to detect if api is not supported.
- Added current_compute_partition set/get rocm lib calls
- Added helpful macro to discover future RSMI errors
- Commented out test_set_freq, was having permission issues
on a Navi21
* Updated rocm_smi_main.cc
- Added helpful map to debug API calls, left in for future use
- Added comment to better understand a non-class function returns
* Added computepartition_read_write.cc/.h
- Added get/set compute partition API test calls
- Confirmed on devices that do not support the API calls, tests pass
* Updated rocm_smi_test/main.cc
- Calls new compute partition gtests
Added following updates from review feedback:
* Updated rocm_smi.h/cc
- Removed C++ API calls, adding support for both C/C++
API calls could cause confusion and adds extra work for us
- rsmi_dev_compute_partition_get -> Fixed an edge case where
user gives a small buffer length size (smaller than data
received), but does not receive the partial buffer back.
google Tests are updated to reflect this find.
* Updated rocm_smi_example.cc
- Fixed test_set_freq, issue was that file was not writable.
We now indicate this warning, so prior errors make sense.
- General test code cleanup. Removed extra code,
by creating loops for tests.
* Updated rocm_smi_main.cc
- Moved and got rid of an external reference to a map used
for debugging RSMI enums, now is a const public reference.
* Updated rocm_smi.py
- Updated python code to identify NOT_SUPPORTED due to
(currently) only a few GPU support the feature
Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
This patch fixes a couple of --showproductname bugs, both of which
are related to the device's SKU.
Previously if a device with a non-standard VBIOS name was used,
fetching that device's SKU wasn't working correctly.
A standard VBIOS name should follow the following pattern:
AAA-BBBBBB-CCC
Where the middle section "BBBBBB" between the hypens is the SKU.
Now, SKU can be correctly fetched even with a non-standard VBIOS
name, and return 'unkown' if SKU does not exist.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I5899a859c6131c6048bb31a4305ddacbac3075a9
The purpose of this patch is to add a new feature to the smi cli.
Use ./rocm-smi --showtempgraph to print a persistant bar graph for
each GPU's temperature.
The bar graphs refresh continuously to show current temps, and the
graphs change in a color gradient depending on the temperature.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I98902b76c42cc7281420759f5ebe8c78f7785e66
We append CE then UE, but in the table right after, it goes UE then CE.
Fix the order of the table, and add capitals for consistency
Change-Id: I208f37685508ab1e2ff83d3456620bbbf3a16268
Invalid peer links are labeled as N/A during topology report creation.
This invalid link type could be triggered by having a configuration
with CPU XGMI iolinks and disable XGMI peer to peer access. This can
be done by setting the driver parameter 'use_xgmi_p2p = 0'.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: Ifb09a8f3266a3f07686615dfb45781d6cfe55e83
The purpose of this patch is to modify the column header of the default
'./rocm-smi' command from 'Temp' to 'Temp (DieEdge)' for clarity.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I127a9214be97a1185c3db010f1c9176d1f412ec9
--showdeviceid
Fix for false-positive "FRU is corrupted" messages,
since str(sn).isalphanum() triggers on empty struct.
--showproductname
fix script termination on non-alphanum product name
Change-Id: I78d4998e156f9b0d9f45338bed2a0d30b789e220
This fixes the 'unknown' value being displayed
for Perf Level because of a missing mapping of
RSMI_DEV_PERF_LEVEL_DETERMINISM to its string
value.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I479c2baea450f0ff61640ad81cbd4d08ad56ff8e
The purpose of this patch is to set RETCODE equal to 0 by default
unless an appropriate '--loglevel LEVEL' has been set.
To allow a non-zero RETCODE value, you must use any loglevel that
is not 'warning' or 'None' (default).
You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I9484a750206a3f464c59952304e72c59c3d12465
If the FRU has been corrupted, then the serial number will come in with
any manner of random bytes, which will cause decode() to fail
spectacularily. Check that the serial returned by the kernel is
alphanumeric, and print to the error log if not (then continue to the
next device).
Change-Id: If4f35b140b6089e02729b1490ed6b48d614a122a
This patch changes the error handling for setClockRange.
When a device does not support modifying a clock type (sclk/mclk),
an error message is printed through the python CLI.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I37d9ea4189b1ca81e5deaab5efa6cfa4901b89b3
showpidgpus prints 'none' when no GPU devices are
being used by the running process. Adding a fix
to print a relevant message.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I165a6644a76c8e1c3c3cad676dcfd41eb1c4724f
This patch fixes a --showvoltagerange bug, which attempts to check
the voltage curve on a device that does not have any voltage
regions in its OverDrive voltage frequency data (odvf).
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I647c30c978ffb13f6819ac3d069ee340710a7f99
Fixes bug in the 'setPowerOverdrive' function which mishandles
GPUs with secondary dies. Secondary dies have a default power cap
of 0W and cannot be changed, so they are now skipped.
Fixes bug in the 'resetPowerOverdrive' function which incorrectly
resets the wattage to the current value.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I483fa3f58b1fa44a3bf7bae3b52c59ce523ae152
Show an optional debug log (RSMI_DEBUG_BITFIELD=2) to
the user in the following scenarios:
1. If more than one current frequency is found
2. If frequencies are not read in increasing order of
their value
If current frequency is not available, index for it is
set to -1, values will not have * next to it in the
output. This will also be handled in rocm_smi.py.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I477ec065f7513c8045d6392f12ef6cb835a6b8f6
This reverts commit b931380f02.
DRM device id does not always match GPU ID in the rocm_smi.py. This leads to cases where wrong device is checked by os.path.isfile().
Change-Id: Ib6f2b9be123b7eb64334d3feec57f63d7eb37d6f
Instead of check /proc/modules for amdgpu, the code will check
/sys/module/amdgpu/initstate which covers the case when the driver
is compiled into the kernel.
Change-Id: Id39ec5b0eb9b68204bc9f5f779057ba8cc090bdc
Fixes a bug in the 'formatCsv' function which mishandles json
data conversion for 'system' data types.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I705060409bf5ae75b994ffda270843065ca12321
Wrapper header files
Soft link to libraries and binaries
rocm_smi.py and rsmiBindings.py installed in libexec/rocm_smi
Binaries, libraries and header files installed as per File Reorg folder structure
Change-Id: I3166ab67f89c2ae4aafbc87bb00c9a5233221ade
The purpose of this patch is to hide 'One or more commands failed.'
from showing up, unless an appropriate log level has been set.
You can set the loglevel in the CLI with:
--loglevel <debug/info/warning/error/critical>
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ifa309cd62596491a6ea5892e0752251f037fc0e9