Increased threshold from 2100 µs to 3100 µs to accommodate
gpu_metric read time variation across Navi systems.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Problem:
When a TheRock-based PyTorch package is installed along with amdsmi, importing
torch causes a double-free crash on exit (GitHub issue ROCm/TheRock#2269).
Root cause:
Both librocm_smi64.so and libamd_smi.so export the C++ static member
'amd::smi::Device::devInfoTypesStrings'. When libraries are loaded with
RTLD_GLOBAL, the dynamic linker resolves libamd_smi.so's reference to this
symbol to the one in librocm_smi64.so. This causes:
1. librocm_smi64.so registers its destructor for devInfoTypesStrings
2. libamd_smi.so also registers a destructor, but for the SAME address
3. On exit, both destructors run on the same object -> double-free
Fix:
Change devInfoTypesStrings from a class static member to a file-local static
variable. This ensures the symbol has internal linkage and is not exported,
preventing the symbol collision.
Changes:
- rocm_smi_device.h: Remove static member declaration
- rocm_smi_device.cc: Change from 'Device::devInfoTypesStrings' to file-local
'static const std::map<...> devInfoTypesStrings'
- rocm_smi.cc: Remove the global alias to the (now removed) class member
Tested on gfx1151. `import torch` crashed on exit before the fix, and doesn't crash after the fix.
Implements automatic device wake using getDRMDeviceId() DRM call when GPUs
are detected in low-power state. This ensures rocm-smi can access device
information on suspended GPUs.
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
* Enable Lintian Support for ROCM-SMI
* Enable Lintian Support for ROCMINFO
* Updated Lintian Override File Processing
* Update UT Fix for Lintian rocmsmi,rocminfo
* Update UT Fixes, Review Comments
* Update Review Comments - removed extra white spaces, added error check for gzip, date commands
* Update Review Comments - Correcting License Type
* Sync Lintian ChangeLog
* Changelog data sync enhanced
* Update Review Comments, UT fix
* white space cleanup - precommit check
* Run pre-commit's whitespace related hooks on projects/rocm-smi-lib
In order for pre-commit to be useful, everything needs to meet a common
baseline.
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
* Added Changelog Spaces for formatting
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
---------
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Added runtime PM detection and DRM ioctl-based device wake
to handle GPUs in BACO state. Modified tests to wake
suspended devices before reading sysfs files.
---------
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Changes:
- Fix `rocm-smi --setsclk [0 .. n]` for multiple devices to continue on fail when
in a partitioned configuration (ex. in DPX/QPX/CPX/etc).
- Partitioned configurations or devices which do not support changing
sclk/mclk/pcie clks will now continue on failure and report a "not
supported" or other (rocm-smi) error code for these devices.
- Updates impact other clock settings such as `--setmclk` and
`--setpcie`.
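The continue-on-failure behaviour described above can be sketched roughly as
follows. This is a minimal illustration, not the rocm_smi.py implementation:
`Status`, `set_clock_levels` and `fake_set_level` are hypothetical names, and
the assumption that odd-numbered devices reject the request exists only for
the demo.

```python
from enum import Enum


class Status(Enum):
    """Hypothetical status codes standing in for rocm-smi return values."""
    SUCCESS = 0
    NOT_SUPPORTED = 2


def set_clock_levels(devices, levels, set_level):
    """Apply clock levels to every device, continuing past failures.

    `set_level(device, levels)` is a hypothetical callable returning a Status.
    Returns a dict mapping each device to the status it reported.
    """
    results = {}
    for device in devices:
        status = set_level(device, levels)
        if status is not Status.SUCCESS:
            # Report the failure, then move on to the next device
            # instead of aborting the whole request.
            print(f"GPU[{device}]: unable to set clock levels: {status.name}")
        results[device] = status
    return results


def fake_set_level(device, levels):
    # Demo assumption: odd-numbered devices are partitions that do not
    # support clock changes.
    return Status.NOT_SUPPORTED if device % 2 else Status.SUCCESS


results = set_clock_levels([0, 1, 2, 3], [0, 1], fake_set_level)
```

Every device ends up with an entry in `results`, so a caller can still
summarise which devices accepted the setting and which reported an error.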
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Updated: rocm_smi.py
- Remove all else: clauses from functions where rsmi_ret_ok is part of the if clause, as requested.
- rsmi_ret_ok() function already handles unsuccessful return codes gracefully.
- Updated check_runtime_status() function to sweep through /sys/class/drm to find active runtime_status.
- Updated the message to 'AMD GPU device(s) is/are in a low-power state. Check power control/runtime_status'
- This clarifies the status of the GPU and tells them where to check for more info.
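A sweep of /sys/class/drm for runtime_status could look roughly like the
sketch below. It is an illustration under stated assumptions, not the actual
check_runtime_status() code: the helper names `read_runtime_statuses` and
`low_power_cards` are hypothetical, the path layout
cardN/device/power/runtime_status is assumed, and the demo runs against a
temporary fake tree so it needs no GPU.

```python
import os
import re
import tempfile


def read_runtime_statuses(drm_root="/sys/class/drm"):
    """Map each cardN entry under drm_root to its runtime_status string."""
    statuses = {}
    for entry in sorted(os.listdir(drm_root)):
        # Skip connector entries such as card0-DP-1; keep only cardN.
        if not re.fullmatch(r"card\d+", entry):
            continue
        path = os.path.join(drm_root, entry, "device", "power", "runtime_status")
        try:
            with open(path) as status_file:
                statuses[entry] = status_file.read().strip()
        except OSError:
            # File missing or unreadable: nothing to report for this card.
            continue
    return statuses


def low_power_cards(statuses):
    """Return the cards whose runtime PM state is 'suspended'."""
    return [card for card, state in statuses.items() if state == "suspended"]


# Demo against a fake sysfs tree so the sketch runs without a GPU.
with tempfile.TemporaryDirectory() as fake_root:
    for card, state in (("card0", "suspended"), ("card1", "active")):
        power_dir = os.path.join(fake_root, card, "device", "power")
        os.makedirs(power_dir)
        with open(os.path.join(power_dir, "runtime_status"), "w") as status_file:
            status_file.write(state + "\n")
    demo_statuses = read_runtime_statuses(fake_root)

if low_power_cards(demo_statuses):
    print("AMD GPU device(s) is/are in a low-power state. "
          "Check power control/runtime_status")
```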
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: gabrpham <Gabriel.Pham@amd.com>
The sysfs PCIe bandwidth file pcie_bw is deprecated
in newer ASICs. This change gets the PCIe bandwidth from
the GPU metrics for version 1.5 or later.
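The version gate can be sketched as a tuple comparison. This is only an
illustration: `use_gpu_metrics_for_pcie_bw` is a hypothetical helper, and
splitting the GPU metrics version into major and minor integers is an
assumption about how the version is represented.

```python
def use_gpu_metrics_for_pcie_bw(major, minor):
    """Return True when the GPU metrics version is 1.5 or later.

    Comparing (major, minor) as a tuple keeps a later major version such
    as 2.0 on the GPU metrics path, which a bare `minor >= 5` check
    would miss.
    """
    return (major, minor) >= (1, 5)


# Hypothetical versions: older ones would fall back to the sysfs pcie_bw file.
sources = {
    version: "gpu_metrics" if use_gpu_metrics_for_pcie_bw(*version) else "sysfs pcie_bw"
    for version in [(1, 3), (1, 4), (1, 5), (2, 0)]
}
```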
Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
- Clean up and standardization of MIT licenses after discussion with legal team.
- Update README.md with blurb for top-level files.
- MIT License explicitly mentioned for relevant projects.
- Removal of years.
- Copyright attribution should be to `Advanced Micro Devices, Inc.` and not `AMD ROCm(TM) Software`
- Removal of `All rights reserved.`
- Reduce line width of the text for readability.
- Add clear visual separators for additional licenses.
- Convert text files to markdown format for aforementioned separators.
- Update build scripts to point to renamed files.
- Fixed SMI doc references
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
Unstable threading was causing segmentation faults. Switching from the
_thread module to the more recent threading module resolved the
segmentation fault issue.
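For illustration only, the pattern below shows what the threading module
offers over _thread: Thread objects that can be joined before shared state is
read. How rocm_smi.py actually uses its threads is not described in this
message, so `poll_device` and the per-device work are hypothetical.

```python
import threading


def poll_device(device, results, lock):
    # Placeholder for per-device work; the lock guards the shared dict.
    with lock:
        results[device] = f"GPU[{device}] polled"


results = {}
lock = threading.Lock()
threads = [
    threading.Thread(target=poll_device, args=(device, results, lock))
    for device in range(4)
]
for thread in threads:
    thread.start()
# join() blocks until each worker has finished, so `results` is complete
# before anything reads it.
for thread in threads:
    thread.join()
```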
Multiple issues solved by this commit:
[SWDEV-537518]
[SWDEV-540377]
[SWDEV-540223]
Signed-off-by: GabrPham <gabrpham_amdeng@amd.com>
[ROCm/rocm_smi_lib commit: 7dba992ebd]
When librocm-smi is pulled in through a dependency, we may end up on a system
without any hardware supported by ROCm. There, rsmi_init() failing is
actually expected, and we do not want to frighten the user in such a case.
[ROCm/rocm_smi_lib commit: 8ca4207d5c]
liboam.a was missing from the static rocm-smi package, resulting in compilation errors in applications that use rocm-smi.
[ROCm/rocm_smi_lib commit: 59468e3f78]
Changes:
- Updates to APIs to handle null pointers or RSMI_STATUS_NOT_SUPPORTED
- Fixes to tests to handle partitioned configurations correctly
- Synced with latest AMD SMI API changes
Change-Id: I7a932f9336ef29ccb01d3b15e2101f6136b45720
[ROCm/rocm_smi_lib commit: 12b78439d2]
Updated:
- Removed backwards compatibility for jpeg_activity/vcn_activity
- On supported ASICs users can use XCP (partition) stat values:
jpeg_busy and vcn_busy
Change-Id: I78c403f8462668738ec57cac12b107f6a3989b18
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
[ROCm/rocm_smi_lib commit: 1c6b2adae7]
[SWDEV-523359] fan_read_write: Add set fan speed validation check.
- Handled NOT_SUPPORTED status which previously caused rsmitst to false fail
- Added continue statement to proceed with the rest of the FanReadWrite test.
- Fixed spacing on line 140
Signed-off-by: Juan Castillo <juan.castillo@amd.com>
[ROCm/rocm_smi_lib commit: ac31c6e576]