On APUs, the vbios_version string might not be exposed. Relying on the
vendor ID to detect AMDGPU is sufficient.
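
As a rough illustration of vendor-ID detection (a sketch, not the
library's code): the helper name is_amd_gpu and the layout of the device
directory are assumptions; 0x1002 is the PCI vendor ID assigned to AMD.

```python
from pathlib import Path

AMD_VENDOR_ID = 0x1002  # PCI vendor ID assigned to AMD


def is_amd_gpu(device_dir):
    """Hypothetical helper: True if the directory's 'vendor' file holds AMD's ID."""
    try:
        text = (Path(device_dir) / "vendor").read_text().strip()
        # sysfs reports the ID as hex text such as "0x1002"
        return int(text, 16) == AMD_VENDOR_ID
    except (OSError, ValueError):
        # A missing or unparsable vendor file is treated as "not an AMD GPU"
        return False
```

Unlike a vbios_version check, this does not depend on a file that an APU
may not expose.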
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I08ef4b4bc7491a40f318791803aeaf261f7fac25
Original updates:
* Added .gitignore to help with future commits
* Updated/added copyrights on modified or added files
* Updated rocm_smi.h/.cc
- Added new SMI API functions:
rsmi_dev_compute_partition_set and
rsmi_dev_compute_partition_get
- Added helpful maps/enums used in
new get/set compute_partition API calls
* Updated rocm_smi.py
- Added --showcomputepartition
- Added --setcomputepartition
- Fixed a few typos
* Updated rsmiBindings.py - added helpful class/dict/list
* Updated rocm_smi_example.cc
- Added helpful macro to detect if an API is not supported.
- Added current_compute_partition set/get rocm lib calls
- Added helpful macro to discover future RSMI errors
- Commented out test_set_freq, which was hitting permission issues
on a Navi21
* Updated rocm_smi_main.cc
- Added helpful map to debug API calls, left in for future use
- Added a comment explaining what a non-class function returns
* Added computepartition_read_write.cc/.h
- Added get/set compute partition API test calls
- Confirmed on devices that do not support the API calls, tests pass
* Updated rocm_smi_test/main.cc
- Calls new compute partition gtests
Added following updates from review feedback:
* Updated rocm_smi.h/cc
- Removed C++ API calls; supporting both C and C++
API calls could cause confusion and adds extra work for us
- rsmi_dev_compute_partition_get -> Fixed an edge case where
the user supplies a buffer length smaller than the data
received but does not get the partial buffer back.
Google Tests are updated to reflect this finding.
* Updated rocm_smi_example.cc
- Fixed test_set_freq; the issue was that the file was not writable.
We now report this as a warning, so prior errors make sense.
- General test code cleanup. Removed extra code
by creating loops for tests.
* Updated rocm_smi_main.cc
- Moved the map used for debugging RSMI enums and removed the
external reference to it; it is now a const public reference.
* Updated rocm_smi.py
- Updated Python code to identify NOT_SUPPORTED, since
(currently) only a few GPUs support the feature
Change-Id: I4a567acbb59d6771fb64df08d19175fe3604fd1b
If the code crashes, the mutex may be left in a bad state, and the user
has to manually remove it. This fix removes the shared mutex
if no process is using it.
Change-Id: I18bf562f2e0a7de8b3f0cccf72d60950b0d9bb2d
The return value of the ReadSysfsStr function that reads the cu_occupancy
file was not handled correctly. Modified the script to handle any failure conditions.
Change-Id: I3c71e0f6f288f196ed1f833e8709255c2b6e78ee
Devices with CPU XGMI iolink do not support PCIe peer access. Therefore,
they should not be reported as accessible links in the topology.
Signed-off-by: Alex Sierra <Alex.Sierra@amd.com>
Change-Id: I3ee51796945dc0966200dee03886510e8f1846b7
1. Memory allocated for handle was not deleted
when no variant, subvariant or supported function
was found
2. handle->func_id_iter address was set to 0
before delete[]
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Iab50fdfbe03eec8e6fd0e84e03bd2c47e645b3d8
Show an optional debug log (RSMI_DEBUG_BITFIELD=2) to
the user in the following scenarios:
1. If more than one current frequency is found
2. If frequencies are not read in increasing order of
their value
If the current frequency is not available, its index is
set to -1 and no value will have a * next to it in the
output. This will also be handled in rocm_smi.py.
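
The two checks can be sketched as follows. This is only an illustration:
the "index: valueMhz *" line format, the function names, and the choice to
keep the first marked entry when several are marked are all assumptions,
not the library's implementation.

```python
import os
import re
import sys

# Matches sysfs-style lines such as "1: 800Mhz *" (format assumed).
_LINE = re.compile(r"^\s*\d+:\s*(\d+)\s*Mhz\s*(\*)?", re.IGNORECASE)


def parse_frequencies(text):
    """Return (frequencies, current_index, warnings); index is -1 if none is marked."""
    freqs, current, warnings = [], -1, []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match is None:
            continue
        if match.group(2):
            if current == -1:
                current = len(freqs)  # assumption: keep the first marked entry
            else:
                warnings.append("more than one current frequency found")
        freqs.append(int(match.group(1)))
    # Flag any value that is lower than the one read before it.
    if any(a > b for a, b in zip(freqs, freqs[1:])):
        warnings.append("frequencies are not in increasing order")
    return freqs, current, warnings


def debug_log(warnings):
    """Print warnings only when RSMI_DEBUG_BITFIELD is set to 2."""
    if os.environ.get("RSMI_DEBUG_BITFIELD") == "2":
        for message in warnings:
            print(message, file=sys.stderr)
```

A caller that gets current_index == -1 prints the list without a * marker.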
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I477ec065f7513c8045d6392f12ef6cb835a6b8f6
Add DEBUG_LOG, which optionally prints an error
message when RSMI_DEBUG_BITFIELD is set to 2.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I6017e92d8a9e5f9861ae29ece0488d4bc198f996
showclocks/showclkfrq does not display pp_dpm_pcie values
in SR-IOV. This fix adds PCIe clocks to rsmi_clk_type_t,
where the rest of the clocks are defined.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I6d129ae412623b369c14456ae9781b2dbceb2139
This patch adds the following 4 missing GPU blocks to the SMI LIB:
-RSMI_GPU_BLOCK_MMHUB
-RSMI_GPU_BLOCK_PCIE_BIF
-RSMI_GPU_BLOCK_HDP
-RSMI_GPU_BLOCK_XGMI_WAFL
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia1ec6f53e195f4bf7b8f073d6bed4fdb6572e546
When an application calls the library on a system without amdgpu,
it may print "rsmi_init() failed" every time. Suppress the error
message in the library.
Change-Id: Ice63dd3a764b221a6935536bff1bfa6aa3e51a46
readlink() does not append a null byte to the buffer. Initialize
tpath to prevent a stack buffer overflow.
Change-Id: I17895dc3576b080a0c35bd0528a5b83223ec1c1b
CMakeLists.txt does not set up the DEBUG macro correctly to mean
!NDEBUG, so, as a workaround, replace all uses of ifdef NDEBUG with
ifndef DEBUG in the library sources.
Change-Id: I408adb36d1a2310fb894a486574469662ebb27cd
(cherry picked from commit 9f87197d8d)
pop_back() caused a segfault when the pp_dpm_pcie file is empty and returns only whitespace.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I888f1f79751cd456e43751a5b96d08560a039677
The (temperature == nullptr) check happens only when HBM temperature is retrieved.
This check needs to apply in other cases as well, hence moving this outside the HBM condition.
This should return RSMI_STATUS_INVALID_ARGS consistently in all cases when nullptr is passed through rsmitst.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Iea3cec75312a0a669c7da27e15e9782e6a885c5f
The driver fills in 0xFF for all the metrics not supported on that ASIC.
So if 0xFF is detected, return RSMI_STATUS_NOT_SUPPORTED.
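
A minimal sketch of the check, assuming a metric arrives as raw
little-endian bytes and counts as unsupported when every byte is 0xFF.
The function name read_metric is hypothetical, and the status strings are
stand-ins for the RSMI status codes.

```python
def read_metric(raw):
    """Hypothetical helper: decode one metric field, flagging the 0xFF fill pattern."""
    # An all-0xFF field is the pattern used for "not supported on this ASIC".
    if raw and all(byte == 0xFF for byte in raw):
        return "RSMI_STATUS_NOT_SUPPORTED", None
    return "RSMI_STATUS_SUCCESS", int.from_bytes(raw, "little")
```

For example, a 16-bit field holding 0xFFFF is reported as not supported,
while a field holding 45 decodes normally.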
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: I86a38148c7a288ea0db94893f685560eaac098ab
The driver fills in 0xFF for all the metrics not supported on that ASIC.
So if 0xFF is detected, return RSMI_STATUS_NOT_SUPPORTED.
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Iacb6474486e3732f2aa824ff447c17f8243b65cd
Implements rsmi_is_p2p_accessible API.
The function returns True if P2P is possible between two nodes.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ic7316eebcec4480175c7ad04c21a42b2e1a4c454
rocm-smi --showproductname will not show "Card series" in its output if
the product_name exported by the kernel is an empty string. This has been
raised as a regression by a customer.
BUG: SWDEV-297228
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I9aae24778e2d3a30aa661d8f338278c1666590fb
gpu_metrics version 1.2 provides an atomic timestamp. Use this timestamp.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I7a1a675f53b93718f34b1f2979173e9064e0ef93
Change to RSMI_GPU_METRICS_API_CONTENT_VER_1 in preparation for
supporting additional formats.
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I4367a2622a0fa41e6b05bc4436ecd24b8c4e30e2
Implement default GPU power cap functionality in the LIB.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Ia6b3420beb0e4df5559c3e6d11d0667972590b53
rsmi_dev_temp_metric_get() can also support the HBM
temperatures, which are retrieved from gpu_metrics.
Change-Id: I96b979296e90cf881523627b41b1a02849676416
Previously, RSMI assumed that the event counter values returned
from perf were only new events. But in fact, when we read the
counter values, they are running totals. To account for this, we
now record the value we read and take the difference between the
current value and the previously recorded value.
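
The bookkeeping amounts to the following sketch. The class and method
names are hypothetical, and treating the first read as a delta from zero
is an assumption.

```python
class CounterTracker:
    """Hypothetical helper: turns running totals into per-read event counts."""

    def __init__(self):
        self._last = {}  # counter id -> last running total read

    def new_events(self, counter_id, running_total):
        """Record the total just read and return the events since the prior read."""
        previous = self._last.get(counter_id, 0)  # assumption: baseline of 0
        self._last[counter_id] = running_total
        return running_total - previous
```

Reading totals of 100 and then 250 from the same counter yields 100 and
150 new events respectively.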
Change-Id: I1e04b514e89c7c4d4719889f2dae3a1283864e7f
Previously, during the rsmi_init discovery process, the existence
of an hwmon# directory was used to distinguish between GPU nodes
and non-GPU nodes. This isn't reliable in some scenarios. Instead,
the existence of the vbios_version file is used as an
indicator that the node is indeed a GPU.
Change-Id: Icfbe5c42ed0970077b05f25c3d209308a31bec85
The purpose of this patch is to implement GPU reset functionality
in the LIB, and to call it from the rocm_smi python CLI.
Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: Iaf525f7016f8354a7fd93af0209ca2e97ef4fd56
If we fail to find an expected temperature or voltage label
file, previously we were attempting to re-add a mapping of file
index to sensor types. Attempting to insert a map item that is already
present has no effect, so there should be no functional change.
This was a remnant of old code that should have been deleted.
Change-Id: Ie6f8a62f619a1ae58756e0fd891532434518cf78
The environment variable RSMI_DEBUG_INFINITE_LOOP is introduced
to facilitate debugging RSMI in user applications. When this
env. variable is non-zero, an infinite loop will be entered in
rsmi_init(). At this point, a debugger can be attached and RSMI
can be debugged. This only applies to debug builds.
Change-Id: I23f6dd730fc965764295070de053314a1cc5b6aa
Some rsmi apps fail without much explanation when
rsmi_init() fails. This patch hopes to provide some clues to
the reason for the failure.
Change-Id: Id51308dc327b9871d537dd3e709b677db4ef10bc
Previously, when a process holding a shared mutex was killed,
the next time an RSMI application was started, it would not be
able to obtain the mutex--the application would have to exit.
This fix uses pthread_mutexattr_setrobust() to detect this
situation and act accordingly.
Also, add some missing, needed mutexes and move mutexes
closer to where the protected resource is used.
Change-Id: Icfdc3a246f4cfa3fd008e3f13472199abd76fd35
Performance determinism is a special mode of operation that achieves minimal
performance variation by letting the user provide the desired frequency to be
set as the soft limit. The user controls entry to and exit from the mode via
rocm-smi, which enters and exits performance determinism mode as below.
Enter performance determinism mode:
- hold a lock
- write performance_determinism to power_dpm_force_performance_level
- write input clk_freq to pp_dpm_sclk
- release lock
Exit performance determinism mode:
- hold a lock
- write auto to power_dpm_force_performance_level
- release lock
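
The two sequences above can be sketched as below. This is illustrative
only: the function names and the device-directory argument are
assumptions, and a process-local threading.Lock stands in for whatever
lock the library actually holds.

```python
import threading
from pathlib import Path

_lock = threading.Lock()  # stand-in for the library's lock (assumption)


def enter_determinism(dev_dir, clk_freq):
    """Hypothetical helper: enter the mode with clk_freq as the soft limit."""
    dev = Path(dev_dir)
    with _lock:  # hold the lock across both writes, release on exit
        (dev / "power_dpm_force_performance_level").write_text(
            "performance_determinism")
        (dev / "pp_dpm_sclk").write_text(str(clk_freq))


def exit_determinism(dev_dir):
    """Hypothetical helper: leave the mode by restoring the 'auto' level."""
    with _lock:
        (Path(dev_dir) / "power_dpm_force_performance_level").write_text("auto")
```

Holding the lock across both writes keeps another caller from changing the
performance level between them.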
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>
Change-Id: Ia24e27954cdf1c4337ffc83d8948fbdfaf4552d2