Commit Graph

1963 Commits

Author SHA1 Message Date
Saeed, Oosman 10bfc7c056 [SWDEV-554697] CPER not properly displaying warnings for non-zero partition id's (#687)
* Get primary gpu_id for non-primary partitions.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

* corrected partitions warning print logic

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I08be6c78ddd46e5316dc9d538de4908b65b21d43

* Updated patch with latest changes and modified
xgmi partition_id check.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

* Typo correction

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>

* adjusted logging

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I6d425102d8583aabbcd4d7f55c9c733428524d59

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Oosman Saeed <oossaeed@amd.com>
Co-authored-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: 5398eaa6b3]
2025-09-12 16:39:56 -05:00
3049ac537468bd90fe07f2cbb3d7a83e_amdeng 85bcf06edd [SWDEV-531904] Unit and Integ Test Updates (#563)
* [SWDEV-531904] Unit and Integ Test Updates
Updated: unit_tests.py
- Removed redundant self.setUp() and self.tearDown() calls.
- Removed test_free_name_value_pairs() since it is internal only.
Updated: integration_test.py
- Added logic to set AMDSMI_CLI_PATH from environment or default.
- Raise FileNotFoundError if path does not exist.
- Append CLI path to sys.path and handle ImportError with a clear message.
- Removed redundant @handle_exceptions function decorator.
- Removed redundant self.setUp() and self.tearDown() calls.
Updated: amdsmi_interface.py
- Removed POINTER conversion in amdsmi_get_gpu_pm_metrics_info() and amdsmi_get_gpu_reg_table_info()

All tests pass/skip

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

* Update tests/python_unittest/integration_test.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>

* Review Update 1
Modified: integration_test.py
- Added logic to properly loop through firmware list and display each name and version

Signed-off-by: Juan Castillo <juan.castillo@amd.com>

* Skip xgmi_err tests + improve running output

Changes:
1. Now check for elevated permissions
2. Skip xgmi_error related SYSFS tests, refer to xgmi_read_write.cc
   (both are skipped)
3. Added list of tests and a summary of the additional output
   provided

Change-Id: Iefc85c270faad89c625e2bd7af397d24faed2437
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

---------

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
Signed-off-by: Castillo, Juan <Juan.Castillo@amd.com>
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
Co-authored-by: Juan Castillo <juan.castillo@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Charis Poag <Charis.Poag@amd.com>

[ROCm/amdsmi commit: 67eb541c15]
2025-09-11 16:39:31 -05:00
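
As a rough illustration of the CLI path handling listed for integration_test.py in commit 85bcf06edd above, the following is a minimal sketch and not the code from the commit. The function name `load_cli_module`, the default directory, and the error text are assumptions made for this example; only the `AMDSMI_CLI_PATH` variable, the `FileNotFoundError`, the `sys.path` append, and the `ImportError` handling come from the commit message.

```python
import importlib
import os
import sys

# Assumed default location; the real default used by the test may differ.
DEFAULT_CLI_PATH = "/opt/rocm/libexec/amdsmi_cli"


def load_cli_module(module_name, env_var="AMDSMI_CLI_PATH", default=DEFAULT_CLI_PATH):
    """Hypothetical helper: resolve the CLI directory, then import a module from it."""
    # Take the path from the environment, falling back to the default.
    cli_path = os.environ.get(env_var, default)

    # Fail early if the directory is missing.
    if not os.path.exists(cli_path):
        raise FileNotFoundError(f"CLI path does not exist: {cli_path}")

    # Make the CLI modules importable.
    if cli_path not in sys.path:
        sys.path.append(cli_path)

    # Re-raise import failures with a message that says how to fix them.
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"Could not import {module_name!r} from {cli_path}; "
            f"set {env_var} to the directory containing the CLI"
        ) from exc
```

Checking the path before touching `sys.path` keeps a bad environment setting from surfacing later as a less obvious import error.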
Pryor, Adam 0a2231deb7 Fix groups failing inside container (#684)
* Fix groups failing inside container

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: 5ebd7b8022]
2025-09-10 15:36:26 -05:00
Pham, Gabriel e9ee0bccf2 [SWDEV-551309] Adjusted amdsmitst and reset command (#654)
* Adjusted amdsmitst and reset command to account for separation of power profile and perf level behavior
* Updated test to reset power profile to previous user setting
* Removed performance level from reset_profile_results in reset --profile command
* Updated Changelog with change to reset profile behavior

---------

Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: 954d4860c1]
2025-09-09 16:11:07 -05:00
Arif, Maisam 1a36f2ad0b [SWDEV-550075] Updated README to link to amd-smi virtualization repo (#664)
Co-authored-by: Peter Park <peter.park@amd.com>

[ROCm/amdsmi commit: fd5eb4e963]
2025-09-09 16:05:01 -05:00
Bindhiya Kanangot Balakrishnan 9d0ce8ba42 [SWDEV-414304] Reduce excessive hwmon operations
Previously, the function iterated through all enum
values (0-250). This fix reduces the number of hwmon operations
by calling add_temp_sensor_entry only for temperature types
that fall within the defined enum ranges.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>


[ROCm/amdsmi commit: 17ffe5a1bd]
2025-09-09 10:30:51 -05:00
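
The idea behind commit 9d0ce8ba42 above can be sketched as follows. This is an illustration in Python rather than the library's own code; the range boundaries in `DEFINED_RANGES` are placeholders, and `populate` is a hypothetical name. Only `add_temp_sensor_entry` and the 0-250 span come from the commit message.

```python
# Placeholder (first, last) inclusive ranges standing in for the defined
# temperature-type enum ranges; the real boundaries are not given here.
DEFINED_RANGES = [(0, 7), (100, 107), (200, 203)]


def sensor_types_to_probe(ranges):
    """Return only the type values that fall inside a defined range."""
    types = []
    for first, last in ranges:
        types.extend(range(first, last + 1))
    return types


def populate(add_temp_sensor_entry, ranges=DEFINED_RANGES):
    """Call the callback once per defined type and return the call count."""
    count = 0
    for temp_type in sensor_types_to_probe(ranges):
        add_temp_sensor_entry(temp_type)
        count += 1
    return count


if __name__ == "__main__":
    probed = []
    calls = populate(probed.append)
    # With these placeholder ranges, far fewer than 251 values are probed.
    print(calls, "sensor lookups instead of", len(range(0, 251)))
```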
Park, Peter 0f75c19e4d [SWDEV-551318] Add doc about RAS / CPER (#636)
* add doc about ras/cper
* add sample code examples for CPER and AFID
---------

Signed-off-by: Park, Peter <Peter.Park@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Co-authored-by: Oosman Saeed <oossaeed@amd.com>

[ROCm/amdsmi commit: 5e92adc5b3]
2025-09-09 10:27:15 -05:00
Kanangot Balakrishnan, Bindhiya e5ba10d4c2 [SWDEV-553557] Add bad_page_threshold_exceeded to RAS (#677)
Added bad_page_threshold_exceeded field to ras, which
compares retired pages count against bad page threshold.
This field displays True if retired pages exceed the
threshold, False if within threshold, or N/A if
threshold data is unavailable.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Co-authored-by: Arif, Maisam <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: edaae978a2]
2025-09-09 09:15:37 -05:00
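
The three outcomes described for `bad_page_threshold_exceeded` in commit e5ba10d4c2 above could be expressed roughly as below. The function signature and the use of `None` for unavailable data are assumptions for this sketch, not the library's implementation.

```python
from typing import Optional, Union


def bad_page_threshold_exceeded(
    retired_pages: Optional[int], threshold: Optional[int]
) -> Union[bool, str]:
    """Hypothetical helper mirroring the field's three possible values."""
    # Unavailable data (modelled here as None) yields "N/A".
    if retired_pages is None or threshold is None:
        return "N/A"
    # True only when the retired page count is strictly above the threshold.
    return retired_pages > threshold
```

For example, `bad_page_threshold_exceeded(5, 3)` returns `True`, `bad_page_threshold_exceeded(3, 3)` returns `False`, and `bad_page_threshold_exceeded(None, 3)` returns `"N/A"`.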
AL Musaffar, Yazen 851354429f [SWDEV-545894] Folder name defaulting to lower case fix (#611)
* Folder name defaulting to lower case

* Update amdsmi_cli/amdsmi_cli.py

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>

* Fixed Based On Comments

* Remove unused variable 'skip_next'

Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>

---------

Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
Co-authored-by: yalmusaf_amdeng <yalmusaf@amd.com>
Co-authored-by: Pham, Gabriel <Gabriel.Pham@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/amdsmi commit: 4a8ee27225]
2025-09-07 20:38:29 -05:00
Galantsev, Dmitrii d0b5e20440 Create run-clang-tidy.sh
Change-Id: I4faa950a59434ba4706da581af51dd8a7e071dcb
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/amdsmi commit: 0cd05bf307]
2025-09-05 17:44:17 -05:00
Galantsev, Dmitrii 7bbfc98588 Add extra element to array for bounds checking
Decrement padding to keep struct size the same

Change-Id: I4bea5d4b4d5c908423c7cc55a7e8c404b4a6b5e8
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/amdsmi commit: 85e37bb6ce]
2025-09-05 17:44:17 -05:00
Galantsev, Dmitrii 6797de3ed5 Ignore more warnings in clang and clang-tidy
Change-Id: I6f7c7e478f0f176da550d5bccf833dae1a4f1878
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/amdsmi commit: 20bc3aeeef]
2025-09-05 17:44:17 -05:00
Galantsev, Dmitrii 74efdc57a7 Clean up clang-tidy warnings and unused variables
Change-Id: I1365edf8926908b3a49652fb87f079f8fbf1f56b


[ROCm/amdsmi commit: aba1c792b4]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 3a7b4a283a Remove an impossible check
amdsmi/tests/amd_smi_test/functional/memorypartition_read_write.cc:453:32: warning: the address of ‘orig_memory_partition’ will never be NULL [-Waddress]
  453 |     if ((orig_memory_partition == nullptr) ||
      |          ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: 66eb189396]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 0c7c849c42 Use nested namespace for amd::smi
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: eacec681dd]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 47b2e80ab8 Drop an unnecessary NULL comparison
warning: the address of ‘amdsmi_asic_info_t::vendor_name’ will never be NULL [-Waddress]

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: 4a863b27ab]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 02b357526b Fix a comparison between signed and unsigned integer
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: a15bad1c9e]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 08eec3c675 Drop unused variables
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: a99e827d97]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) a8a89db945 Remove unnecessary typedef declarations
amd_smi_cper.h:32:1: warning: ‘typedef’ was ignored in this declaration

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: 3d0ea25af3]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) c9eddf75e7 Remove unnecessary includes
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: 924a06d1e1]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 5fe413710b Fix a typo
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: 05f79879c3]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 15e335ac3f Use nested namespace for amd::smi
Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: faca0222f0]
2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD) 3b411b6759 Fix a crash when running amd-smi version --cpu
When running on a system that doesn't support HSMP (such as an APU)
then the following is observed:
```
/usr/include/c++/15.1.1/bits/stl_vector.h:1263: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = void*; _Alloc = std::allocator<void*>; reference = void*&; size_type = long unsigned int]: Assertion '__n < this->size()' failed.
```

This is because no "CPUs" are detected on the SOC, which really means
no CPUs that support HSMP. Catch this case so that a clean return
can be passed up.

Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>


[ROCm/amdsmi commit: e5d9e1361e]
2025-09-03 00:49:48 -05:00
Maisam Arif d8c125f2b0 [SWDEV-553016] Added Copyright to scoped_fd.cc
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I2ea872e7c5c61a6e4b5c7e7114d016b8a1069b28


[ROCm/amdsmi commit: c876180875]
2025-09-02 15:02:47 -05:00
Maisam Arif db443c025c [SWDEV-540665] Change parser to not accept 0 as a power set input
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I510fa5523b8dd7ea33f49e21cc199d4a2cfcf9bb


[ROCm/amdsmi commit: 2c9f3af026]
2025-08-29 04:18:36 -05:00
gabrpham_amdeng 51c2ea4731 reverted help formatting column width to 80
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>


[ROCm/amdsmi commit: 39b26104d4]
2025-08-28 11:30:24 -05:00
Tim Huang c3f5771541 Regenerate Rust bindings against latest amdsmi.h header
- Regenerate Rust wrapper against latest amdsmi.h header
- Add libc dependency for proper C memory management
- Fix compilation errors caused by types removed from amdsmi.h
- Add FFI bindings regeneration documentation in README

This update ensures the Rust bindings are synchronized with the latest
C API and provides guidance for developers on regenerating
bindings.

Signed-off-by: Tim Huang <tim.huang@amd.com>


[ROCm/amdsmi commit: 51a44bc0c4]
2025-08-28 09:34:57 -05:00
Maisam Arif ed3e242202 [SWDEV-540665] Remove amdsmi_set_power_cap API Guest Restriction
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I682506b48c10eefbd04f9b494ad57fb8ae8842b0


[ROCm/amdsmi commit: 4ffa468613]
2025-08-27 20:18:43 -05:00
Arif, Maisam 433893c770 Revert "[SWDEV-536176] libdrm_amdgpu depdency change (#448)"
This reverts commit 4d33e79baa.


[ROCm/amdsmi commit: ed2300516f]
2025-08-27 20:11:17 -05:00
Oosman Saeed 190ed3953d [SWDEV-546239] Match amdsmi output with host output
[ROCm/amdsmi commit: 594d5ce8ee]
2025-08-27 18:41:59 -05:00
Maisam Arif 8d5335a8de [SWDEV-544299] Fix CLI prefix for amd-smi metric -G
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic184ec824213421388356417e713d9ed5adeddeb


[ROCm/amdsmi commit: 978fad01d2]
2025-08-27 18:08:06 -05:00
Arif, Maisam 46a2ef944f [SWDEV-552378] Removed First enums in amdsmi_interface.py (#659)
- **Fixed gpuboard and baseboard temperatures enums in amdsmi Python Library**.  
  - AmdSmiTemperatureType had issues with referencing the right attribute, so we removed the following duplicate enums:
    - `AmdSmiTemperatureType.GPUBOARD_NODE_FIRST`
    - `AmdSmiTemperatureType.GPUBOARD_VR_FIRST`
    - `AmdSmiTemperatureType.BASEBOARD_FIRST`

Change-Id: Ia61446b593bd9182d597c4b4c2ac3c5ffdae7493
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: 286c421a49]
2025-08-27 18:07:17 -05:00
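
One plausible reason duplicate `*_FIRST` enums cause attribute-reference problems, as mentioned in commit 46a2ef944f above, is how Python's `enum` module treats repeated values: a later member with the same value becomes an alias of the earlier one. Whether this is exactly what affected `AmdSmiTemperatureType` is an assumption; the class, the `GPUBOARD_NODE_SENSOR_*` names, and the numeric values below are hypothetical.

```python
from enum import IntEnum


class TemperatureType(IntEnum):
    # Hypothetical values, not the real AmdSmiTemperatureType numbers.
    GPUBOARD_NODE_FIRST = 100
    GPUBOARD_NODE_SENSOR_A = 100  # same value: becomes an alias of the member above
    GPUBOARD_NODE_SENSOR_B = 101


member = TemperatureType.GPUBOARD_NODE_SENSOR_A
print(member.name)                         # prints "GPUBOARD_NODE_FIRST"
print([m.name for m in TemperatureType])   # aliases are skipped during iteration
```

Looking up the alias returns the first-defined member, so any code that reports `member.name` shows the `_FIRST` marker instead of the specific sensor. Removing the marker members avoids that ambiguity.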
Arif, Maisam 4d33e79baa [SWDEV-536176] libdrm_amdgpu depdency change (#448)
* Cmake fix updates
* Next fix will be addressing libdrm further

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Justin Williams <juwillia@amd.com>

[ROCm/amdsmi commit: 652761de54]
2025-08-27 09:32:51 -05:00
Pham, Gabriel 3ef5bfef94 Added gpuboard and baseboard temperatures to amd-smi metric (#617)
* Added gpu-board and base-board temperatures to amd-smi metric
* Updated Changelog and adjusted the metric base-board/gpu-board output
* Adjusted output of metric to hide base/gpu-board when not relevant

---------

Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: b13fc16d60]
2025-08-26 12:49:56 -05:00
adapryor 671612471d [SWDEV-546543] Fix segfault in gpu_metrics
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/amdsmi commit: e8fa06d223]
2025-08-22 15:23:57 -05:00
adapryor 17f9feb94e [SWDEV-546543] Fix segfault in gpu_metrics
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/amdsmi commit: d25c01e802]
2025-08-22 15:23:57 -05:00
Maisam Arif a68cd9612a [SWDEV-540665] Power cap on 1VF cli parsing fix
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I5aac8f820fd8ae1c6c1dbae3b5b9e69018c69452


[ROCm/amdsmi commit: e030f71229]
2025-08-22 15:22:44 -05:00
Oosman Saeed 588cf7d0c2 continue to process all entries
[ROCm/amdsmi commit: dee18e9fb4]
2025-08-21 23:37:24 -05:00
gabrpham_amdeng f55c41202e [SWDEV-549373] Added vbios and pldm information to version header and adjusted platform info display
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>


[ROCm/amdsmi commit: 71c8b92076]
2025-08-21 18:16:47 -05:00
Pryor, Adam 8486ac80ba [SWDEV-540665] Move Virtualization checks in APIs into amd-smi APIs (#643)
* Remove vm checks in rocm-smi
* Move virtualization checks up the stack into amd-smi

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: f8afba0a5f]
2025-08-21 18:11:50 -05:00
gabrpham_amdeng d12d268029 Added Version Header to all Help Sections
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>


[ROCm/amdsmi commit: 5aae1a31fa]
2025-08-21 17:17:16 -05:00
Pryor, Adam 7ede8b9f4a [SWDEV-540665] Fix power_caps in help text (#642)
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: 4ac1c7e453]
2025-08-21 16:45:37 -05:00
Maisam Arif f732ee4e98 Fix spelling and incorrect error references
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I23e947a0cfd4f68067f9fca703574f44680163d4


[ROCm/amdsmi commit: 074c4b7a3f]
2025-08-21 12:36:43 -05:00
Pryor, Adam 5e4a23dd01 [SWDEV-525336] Filter out amd-smi process itself from detection (#638)
* Filter out amd-smi from process detection
* Fixed N/A stripping N/ incorrectly from running elevated processes

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: ad29de4238]
2025-08-21 11:41:03 -05:00
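
Filtering the tool's own process out of a process listing, as described in commit 5e4a23dd01 above, might look like the sketch below. The commit message does not say whether the filter matches on PID or on name; this example assumes PID matching, and the dictionary layout and the function name `drop_own_process` are hypothetical.

```python
import os


def drop_own_process(processes, own_pid=None):
    """Return the process list without the entry for the calling process."""
    # Default to the PID of the current interpreter.
    pid = os.getpid() if own_pid is None else own_pid
    return [proc for proc in processes if proc.get("pid") != pid]


if __name__ == "__main__":
    listing = [
        {"pid": os.getpid(), "name": "amd-smi"},
        {"pid": -1, "name": "example_workload"},  # placeholder entry
    ]
    print(drop_own_process(listing))
```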
Oosman Saeed 7c83dac63d [SWDEV-547223] RAS HBM CRC Read CE failed due to AFID missing 24
cherry-pick aca-decode repo changeset: aca-decode repo: f9e5ad5 (HEAD -> main, origin/main, origin/HEAD) Fix bug in Corrected HBM Error being decoded as AFID 34 (#5)


[ROCm/amdsmi commit: ffca095246]
2025-08-21 11:00:30 -05:00
Saeed, Oosman 3779562abb [SWDEV-546239] amd-smi ras cper - no data created (#614)
* Update amd-smi doc with examples of CPER and AFID API usage.

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>

[ROCm/amdsmi commit: fd5e37a07e]
2025-08-20 11:27:41 -05:00
Pham, Gabriel d32bae0e8f Updated Changelog for updated temperature metrics API (#616)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: e6af86f44a]
2025-08-19 19:02:50 -05:00
AL Musaffar, Yazen 678972b8ec [SWDEV-549789] Removed incorrect CPER AFID references (#619)
* Fix for afid help
* Update amdsmi_parser.py

Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>

[ROCm/amdsmi commit: e84e364b35]
2025-08-19 18:55:33 -05:00
Pryor, Adam 96a28009fc [SWDEV-544620] Add kfd fallback for GPU Processes (#631)
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/amdsmi commit: b62900c372]
2025-08-19 18:53:16 -05:00
Pham, Gabriel 729b7beddf [SWDEV-446394] Updated error message for setting clock limit (#633)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>

[ROCm/amdsmi commit: c0ea186d47]
2025-08-19 18:51:49 -05:00