rocm-smi --showproductname does not show "Card series" in its output if
the product_name exported by the kernel is an empty string. This has been
raised as a regression by a customer.
BUG: SWDEV-297228
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I9aae24778e2d3a30aa661d8f338278c1666590fb
[ROCm/amdsmi commit: 7a8c3f3629]
Fix error message in -P for secondary die
Signed-off-by: Elena Sakhnovitch
Change-Id: Ica3c0a83b565d2231fad23389b9378056a0f56b3
[ROCm/amdsmi commit: 2db7e2a312]
When a process is in the final stage of terminating, the subprocess
module fails to find it. This results in an extraneous line containing
the char 'b' being printed. Fix this.
BUG: SWDEV-296409
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I39aacf8ae948a5acec0aa93296cc0e0aec88b3ef
[ROCm/amdsmi commit: a03acf2c07]
Python's default 'print' implementation is not thread safe, causing
empty lines to be printed during multithreaded code execution.
This fixes the --showevents output for multi-GPU systems.
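A minimal sketch of one way to serialize output from several threads,
assuming a module-level lock and a hypothetical helper named print_line
(neither name is taken from rocm_smi.py). Each line is built in full and
written with a single call while the lock is held:

```python
import sys
import threading

# Hypothetical module-level lock shared by all printing threads.
_print_lock = threading.Lock()


def print_line(*args, out=None):
    """Write one complete line to the stream while holding the lock."""
    stream = sys.stdout if out is None else out
    # Build the text and its newline as one string so they cannot be
    # separated by output from another thread.
    line = " ".join(str(a) for a in args) + "\n"
    with _print_lock:
        stream.write(line)
        stream.flush()
```

Each per-GPU event thread would call print_line() in place of print().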
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I72f7341cdf4401f1fed4cd8f7d7a4a90bf9a3a4c
[ROCm/amdsmi commit: 95348f37cc]
Use zero padding for the hexadecimal value 'device_model' inside
showProductName with a padding length of 4.
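As an illustration of the padding described above, a minimal sketch
using Python's format specification. The helper name
format_device_model and the "0x" prefix are assumptions, not taken from
showProductName itself:

```python
def format_device_model(device_model: int) -> str:
    """Format an integer ID as hexadecimal, zero-padded to 4 digits."""
    # "04x": pad with zeros to a minimum width of 4, lowercase hex.
    return "0x{:04x}".format(device_model)


print(format_device_model(0x66))    # 0x0066
print(format_device_model(0x738c))  # 0x738c
```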
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I962b94d414c6ba050d951486ad9e7559123f8850
[ROCm/amdsmi commit: 03ae187a35]
Fix the stack-use-after-scope error reported by the AddressSanitizer.
Bug: SWDEV-291913
Change-Id: I0ffd71af8679b8bff6c363096fafe75dffcf329e
[ROCm/amdsmi commit: 8c60dbebaa]
Specify in the header file that the timestamp resolution is in ns.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4db00a07c0b5c43ae23c98213f2fbbcf93110234
[ROCm/amdsmi commit: 14201290a2]
Implement default GPU power cap functionality in rsmitst.
It is available in the "rsmitstReadOnly.TestPowerRead" test, and
is displayed as: "Default Power Cap: #uW" (where uW is microwatts).
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I564ea3785f1a93dfd30587634057516549fa762c
[ROCm/amdsmi commit: 5b42cdf780]
Since device is a list, we need to pass a single item to the isAmdGpu
function.
Fixes: ffbe481241 "rocm_smi.py: Don't try to reset non-AMD GPUs"
Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: I19a74377636ff4589f11d092f41e1d35c1acb307
[ROCm/amdsmi commit: 242d94a668]
Instead of throwing "Unsupported clock" errors for ASICs that don't
support a certain clock type (e.g. dcefclk on MI-series), just dump the
warning to logging.debug and don't try to read the clock.
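The behaviour described above could look roughly like the following
sketch. The function show_clocks, the supported set, and the read_clock
callable are hypothetical stand-ins, not the actual rocm_smi.py code:

```python
import logging


def show_clocks(clock_types, supported, read_clock):
    """Read only the supported clocks; log the rest at debug level."""
    values = {}
    for clk in clock_types:
        if clk not in supported:
            # Not an error: this ASIC simply lacks the clock type.
            logging.debug("Unsupported clock type %s, skipping", clk)
            continue
        values[clk] = read_clock(clk)
    return values
```

The debug message is hidden at the default logging level, so users on
ASICs without a given clock see no error, and the clock is never read.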
Signed-off-by: Kent Russell <kent.russell@amd.com>
Change-Id: If3cb9a472b03aa535a76fc24bcd9f77122090634
[ROCm/amdsmi commit: b931380f02]
Use the default power cap exposed via sysfs to determine when to
show the "Out of Spec" warning.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I0fa3612b50e230856b0d5a390f876b35268d9587
[ROCm/amdsmi commit: b71e07b3fb]
Implement default GPU power cap functionality in the LIB.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53
[ROCm/amdsmi commit: 83cd2fe4f1]
Implement showevent functionality in the ROCm SMI Python CLI.
It can be called using --showevents with any combination of:
VM_FAULT, THERMAL_THROTTLE, and/or GPU_RESET
For example:
./rocm-smi --showevents VM_FAULT, THERMAL_THROTTLE, GPU_RESET
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I905fd9c949e91423b79833a04ab89d6ba3760e62
[ROCm/amdsmi commit: a9e7e5a475]
Many data center cards are fanless. Don't show a warning if the fan
speed cannot be read. The fan speed will be reported as 0.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I53efe67ac88fb0824cf4820430b46c18bc7692df
[ROCm/amdsmi commit: 1c9e384c8f]
The coarse grain utilization counter includes GFX and Memory activity.
Change-Id: I5d09976792d3f4a1c1081651fa24ff857016d4c0
[ROCm/amdsmi commit: 9bfb9ac297]
This won't work for obvious reasons, so exit with an error instead of
trying to access a file that doesn't exist and segfaulting
Change-Id: Id1230922fa6e9a19e9394280faad88a43c7d2e34
[ROCm/amdsmi commit: c7c2ac5559]
rsmi_dev_temp_metric_get() can also support the HBM
temperatures, which are retrieved from gpu_metrics.
Change-Id: I96b979296e90cf881523627b41b1a02849676416
[ROCm/amdsmi commit: da480b4589]
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.
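The bookkeeping described above can be sketched as follows. This is a
Python illustration only; the class EventCounter and the read_total
callable are hypothetical, and the starting value of 0 is an
assumption:

```python
class EventCounter:
    """Turn a running-total counter into per-read event counts."""

    def __init__(self, read_total):
        # read_total: callable returning the counter's running total.
        self._read_total = read_total
        self._previous = 0  # assumed starting point

    def read_new_events(self):
        """Return the number of events since the previous read."""
        current = self._read_total()
        delta = current - self._previous
        self._previous = current  # record for the next read
        return delta
```

With running totals of 5, 12, 12 and 20, successive reads return
5, 7, 0 and 8.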
Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f
[ROCm/amdsmi commit: ce475b009c]
rocm_smi.py --set<m|s>clk was treating the freq as a string.
This caused parsing problems when the index had more than 1
digit. Now, treat the indexes as integers.
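The exact failure is not spelled out here, but string handling of
multi-digit indexes goes wrong in several ways, for example in
ordering. A minimal sketch of converting the arguments to integers
first, with a hypothetical helper parse_clock_levels and a hypothetical
num_levels bound:

```python
def parse_clock_levels(args, num_levels):
    """Convert command-line level indexes to validated integers."""
    levels = []
    for arg in args:
        level = int(arg)  # raises ValueError for non-numeric input
        if not 0 <= level < num_levels:
            raise ValueError("level %d out of range 0..%d"
                             % (level, num_levels - 1))
        levels.append(level)
    return levels


print(sorted(["2", "10"]))                          # ['10', '2']
print(sorted(parse_clock_levels(["2", "10"], 16)))  # [2, 10]
```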
Change-Id: Ia0d859d33b685fe90689a86ff1c83980808b1514
[ROCm/amdsmi commit: 11440536cf]
rocm_smi_lib is not currently known to compile only
on specific architectures.
Change-Id: I209e8baa063e99ebe5ff09eaf0dc6541770aa829
[ROCm/amdsmi commit: 7effb405f0]
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between GPU nodes
and non-GPU nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is now used as an
indicator that the node is indeed a GPU.
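The check can be sketched in Python as below. The library itself is
not written in Python, and the card*/device/vbios_version layout under
a DRM root directory, as well as the helper name find_gpu_nodes, are
assumptions made for illustration:

```python
from pathlib import Path


def find_gpu_nodes(drm_root):
    """Return names of card nodes that expose a vbios_version file."""
    gpu_nodes = []
    for card in sorted(Path(drm_root).glob("card*")):
        # Assumed layout: <drm_root>/cardN/device/vbios_version
        if (card / "device" / "vbios_version").is_file():
            gpu_nodes.append(card.name)
    return gpu_nodes
```

On a typical Linux system drm_root would be /sys/class/drm; taking it
as a parameter lets the check run against a test directory.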
Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85
[ROCm/amdsmi commit: ff9546aa62]
This patch fixes a power cap bug in --setpoweroverdrive. When the user
sets a wattage lower than the current or default wattage, an unnecessary
warning message is displayed.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: I730d2c6031b7d7c4af5acf32ecd28da5ca21ab12
[ROCm/amdsmi commit: 20e2d260fb]