rocm-systems

Author	SHA1	Message	Date
Pryor, Adam	2144cfbba4	[SWDEV-357472] Add evicted_ms metric (#620) - Added evicted_time metric for kfd processes. - Time that queues are evicted on a GPU in milliseconds - Added to CLI in `amd-smi monitor -q` and `amd-smi process` - Added to C API and Python API: - amdsmi_get_gpu_process_list() - amdsmi_get_gpu_compute_process_info() - amdsmi_get_gpu_compute_process_info_by_pid() --------- Signed-off-by: Pryor, Adam <Adam.Pryor@amd.com>	2025-10-28 14:49:03 -05:00
Charis Poag	00a04f5810	[SWDEV-562726] Fix clang + ASAN errors * Updates: - [ASAN] GCC does not support `-shared-libsan flags`, so removed this one - [Clang] Fixed refernces to local binding errors (name collision) & other strict scope/structure/lamda binding errors - [Clang] Fix rsmi_wrapper error: \"error: missing default argument on parameter \'args\'\" - [ASAN] Fixed stack-buffer-overflow found in `amdsmi_get_gpu_accelerator_partition_profile()` Change-Id: I854007efb75d828dbb8088c0d56dbc125081f0f2 Signed-off-by: Charis Poag <Charis.Poag@amd.com>	2025-10-28 09:54:23 -05:00
Saeed, Oosman	90f4b8c43d	Sync with latest ras-decode @bc6b43c (#770) Signed-off-by: Oosman Saeed <oossaeed@amd.com>	2025-10-27 14:10:00 -05:00
Kanangot Balakrishnan, Bindhiya	09a97f02ed	[SWDEV-542718] Correct socket_affinity (#760 ) * [SWDEV-542718] Correct socket_affinity Updated Socket affinity to show bitmask and expanded cpu list. Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> * Update per-device local_cpulist for socket_affinity Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> * Added amdsmi_get_cpu_affinity_from_local_cpulist API. Updated the wrapper. Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> * Revert "Added amdsmi_get_cpu_affinity_from_local_cpulist API." This reverts commit 9a2ef934b1787f8aa09d3e4efe02f897b4295215. * Moved the changes to C API. In case of SOCKET_SCOPE, use local_cpulist first. If it is unavailable or not readable, fallback to numa. Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> * Addressed review comments Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> --------- Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>	2025-10-22 16:20:41 -05:00
Poag, Charis	01b4fe6614	[SWDEV-535159] Add support for GPU partition metrics (#490 ) [SWDEV-535159] Add support for GPU partition metrics Changes include: - Internal logic to smart-switch between gpu_metrics/xcp_metrics files - [WIP] Initial plumbing for new partition metric API Change-Id: I4340fb1b48bac0117d80d5d486b9e871430d5cd8 Signed-off-by: Charis Poag <Charis.Poag@amd.com> Add amdsmi_get_gpu_partition_metrics_info() + minor cleanup Change-Id: I5d60604f18baddbd03852dc90e88aa0b8107d50e Signed-off-by: Charis Poag <Charis.Poag@amd.com> Fix partition metric logic + update logging/tests Change-Id: I9e89b19ead17694c54e224f8e13ff8ee3eb2e22a Signed-off-by: Charis Poag <Charis.Poag@amd.com> Adjust amd-smi metric/monitor/default to show (some) partition information Change-Id: I2e8d2745876a19bdaec3c039daa97345c9f701b5 Signed-off-by: Charis Poag <Charis.Poag@amd.com> Add C++ tests Change-Id: Ib9eb0b57a6d7a280992e05a4c6eba632826952ef Signed-off-by: Charis Poag <Charis.Poag@amd.com> Remove modification of energy counter, not needed Change-Id: I5c48eaaae248ee6dc79abba609d837ec35d78022 Signed-off-by: Charis Poag <Charis.Poag@amd.com> [CLI] amd-smi metric: cleaned up N/A'd multi-valued to show just N/A Changes: 1. amd-smi metric: cleaned up N/A'd multi-valued to show just N/A ex. JPEG_ACTIVITY: [N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A, N/A] Now just shows: N/A 2. [Python Unit Test] Changed testname TestAmdSmiPythonBDF(unittest.TestCase) -> AmdSmiPythonUnitTest Test name was confusing. 
Change-Id: Ieb3b036f30002fd22362508eb9fc5d443df395ae Signed-off-by: Charis Poag <Charis.Poag@amd.com> Log cleanup Change-Id: I1b1a95f1844d35bec7a7bd8cb996f87e4914c069 Signed-off-by: Charis Poag <Charis.Poag@amd.com> Add amd-smi partition-metrics CLI + general cleanup Change-Id: Ia91488e6cb3a4d62b4087afbddfe0b3bb9378fdc Signed-off-by: Charis Poag <Charis.Poag@amd.com> [1.3 metrics] Remove forwards compatibility for partition metrics Change-Id: Iab928983e6f6f1587bc9307f6f3fa2b2696ca6f7 Signed-off-by: Charis Poag <Charis.Poag@amd.com> Fixed violation output not showing % + general cleanup Change-Id: Icac1b0a55b18c7628b07109ae0c377d17e0825f1 Signed-off-by: Charis Poag <Charis.Poag@amd.com> Clean up amdsmi_get_gpu_partition_metrics_info & amd-smi partition-metric outputs Change-Id: I6427028b980874641e9ffb3b5d88ad493dbf9cf4 Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Fix metrics not found + extra logging/formatting Change-Id: I841a27bb2c305e97ec7579a13ac915e5be497c3a Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Update license to current default Change-Id: I0de9b8a2d5dbbeab4491097f0354ba17b0d30866 Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Cleanup for review Change-Id: I96ed25c3f2b8968eea1af24c5e5860c2b4e74e6e Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Moderize updated/new interal APIs. Change-Id: I3c48a250eeb703709b14cb5ffa68268d8321626c Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Remove extra logging in dynamic metrics Change-Id: Idb97547bcbe143d6fa1cb5cb278ffe4da615ce14 Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Remove amd-smi partition-metric command Change-Id: Ib83c17e5cd7e0da3798198943bddd46c296b411c Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Move new CLI updates to another PR + minor fixes Change-Id: I3b1163eec12f9b5f7d95ee33de08e168cec1b1fe Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Allow dynamic metrics to work for gpu/xcp metrics 1.9+/1.1+ Updated some logging as well. 
Change-Id: I2ed9f5a5ef8afb1520508820ca6153525f0644b4 Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Allow dyn gpu/xcp metric v1.9+/v1.1+ Added tests for quick check Change-Id: I576d6f6582a55afb08e5ac57791ce95e2fa184a2 Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Update tests for larger subset of version checks Change-Id: I3cdf4f8bb4fc6161f4c76566939f90545d0f362a Signed-off-by: Charis Poag <Charis.Poag@amd.com> * Fix XCP metrics in gpu/partition metric pre-v1.9/v1.1 (dynamic) Change-Id: I4dabc1ed6bef6b86c8e7f92bf9cb5992f3966fe2 Signed-off-by: Charis Poag <Charis.Poag@amd.com> --------- Signed-off-by: Charis Poag <Charis.Poag@amd.com>	2025-10-20 14:43:40 -05:00
Narlo, Joseph	460cfcba1f	[SWDEV-555807] TestCudaMallocAsync test power draw failing (#755) * Clarified comments regarding power limit retrieval and its support on virtualized systems. * Change unsupported comment to UINT32_MAX --------- Signed-off-by: josnarlo <Joseph.Narlo@amd.com> Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>	2025-10-17 08:57:57 -05:00
Pryor, Adam	cba4c871d3	[SWDEV-559082] Add asic info cache (#756) Signed-off-by: adapryor <Adam.pryor@amd.com>	2025-10-08 21:48:08 -05:00
Maisam Arif	4e8ed1f3e3	Clean up and add comments Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: Id30c0ccb68918e109533593df7c360837bdfa002	2025-10-08 12:00:21 -05:00
Oosman Saeed	c6698c9100	[SWDEV-553168] Add support for decoding out of band boot time CPER files. Change-Id: Ic4278698f9c5b5ae56bd56fd43150c0653c1ef05	2025-10-07 22:23:33 -05:00
Maisam Arif	a0d59397b4	[SWDEV-558993] Fix bdf sourcing Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I0c50f490334f6de12a4c01abf1c2ed9e50d87295	2025-10-07 01:32:26 -05:00
Narlo, Joseph	7decbc67a1	[SWDEV-539078] Add missing API definitions to python interface (#525 ) Added the following API's to amdsmi_interface.py. amdsmi_get_cpu_handle() amdsmi_get_esmi_err_msg() amdsmi_get_gpu_event_notification() amdsmi_get_processor_count_from_handles() amdsmi_get_processor_handles_by_type() amdsmi_gpu_validate_ras_eeprom() amdsmi_init_gpu_event_notification() amdsmi_set_gpu_event_notification_mask() amdsmi_stop_gpu_event_notification() amdsmi_get_gpu_busy_percent() Added additional return value to API amdsmi_get_xgmi_plpd(). The entry policies is added to the end of the dictionary to match API definition. The entry plpds is marked for deprecation as it has the same information as policies. --------- Signed-off-by: josnarlo <Joseph.Narlo@amd.com> Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>	2025-10-06 14:50:00 -05:00
Pryor, Adam	c967aead58	[SWDEV-525336] Use KFD to determine process start/stop (#723) * Used KFD to determine linking between GPUs and PIDs rather than depend on fdinfo's per pid single gpu bdf info that we were getting. Signed-off-by: adapryor <Adam.pryor@amd.com> --------- Signed-off-by: adapryor <Adam.pryor@amd.com> Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>	2025-10-02 10:57:08 -05:00
Maisam Arif	843dfaeed2	Removed unused version config files Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I3b00a8c302615026422f6d5d602959989ee0418e	2025-09-25 18:19:14 -05:00
Mario Limonciello	ccfdb65b6f	Set the SOVERSION in CMake from MAJOR/MINOR/RELEASE variables Having the SOVERSION derived from the git tags doesn't scale well for distributions that don't have the git history while building (such as a tarball). As part of `e7d6590` the strings are parsed from a header. Re-use those. Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>	2025-09-25 18:19:14 -05:00
Maisam Arif	df87246b40	Changed amd_smi_drm.cc to depend on dynamic libdrm Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: Ie8a794578da1a3ad8893d436e54bbfb67857a7ae	2025-09-25 17:40:05 -05:00
Maisam Arif	9f22e59c52	Change libdrm.so.2 references to dynamic libdrm naming Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: Ie02c91a3a210ab7612fec670b2aad66d476d2cf3	2025-09-24 20:44:03 -05:00
Stella Laurenzo	4d5d24d1c6	Fix delay loading of drm by soname.	2025-09-24 20:44:03 -05:00
Stella Laurenzo	62e4329559	Add rt dep back	2025-09-24 20:44:03 -05:00
Stella Laurenzo	4e6731a817	[cmake] Fix dependencies. * Use CMAKE_DL_LIBS instead of hard-coded `dl`. * Use Threads::Threads instead of `pthread`. * Drop `rt` dep. * Find libdrm via pkgconfig (consistent to how other ROCm projects do it as documented here: https://github.com/ROCm/TheRock/blob/main/docs/development/dependencies.md#libdrm)	2025-09-24 20:44:03 -05:00
Maisam Arif	cd21b5edcc	[SWDEV-554587] Added IFWI Version and boot_firmware API - Changed amd-smi static --vbios to accept ifwi - Change population logic for vbios version API - Added IFWI boot_firmware to the CLI, C++, Rust, and Python API Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I4ea504d40a43cfb011ab38fc9a664ecf12d39c8a	2025-09-23 16:05:10 -05:00
Kanangot Balakrishnan, Bindhiya	6715c5aa92	[SWDEV-534605] Increase max devices supported and drm test link type (#625) Increased the AMDSMI_MAX_DEVICES to 64 to accomodate all devices in CPX mode. The link type has been modified in amd-smi to match with rocm-smi types, updated the same for drm tests. --------- Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>	2025-09-17 16:30:04 -05:00
Mario Limonciello (AMD)	eacec681dd	Use nested namespace for amd::smi Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	4a863b27ab	Drop an unnecessary NULL comparison warning: the address of ‘amdsmi_asic_info_t::vendor_name’ will never be NULL [-Waddress] Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	a15bad1c9e	Fix a comparison between signed and unsigned integer Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	a99e827d97	Drop unused variables Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	924a06d1e1	Remove unnecessary includes Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	faca0222f0	Use nested namespace for amd::smi Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-05 17:44:17 -05:00
Mario Limonciello (AMD)	e5d9e1361e	Fix a crash when running `amd-smi version --cpu` When running on a system that doesn't support HSMP (such as an APU) then the following is observed: ``` /usr/include/c++/15.1.1/bits/stl_vector.h:1263: std::vector<_Tp, _Alloc>::reference std::vector<_Tp, _Alloc>::operator[](size_type) [with _Tp = void; _Alloc = std::allocator<void>; reference = void*&; size_type = long unsigned int]: Assertion '__n < this->size()' failed. ``` This is because no "CPU" are detected on the SOC, which really means no CPUs that support HSMP. Catch this case so that a clean return can be passed up. Signed-off-by: Mario Limonciello (AMD) <superm1@kernel.org>	2025-09-03 00:49:48 -05:00
Maisam Arif	c876180875	[SWDEV-553016] Added Copyright to scoped_fd.cc Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I2ea872e7c5c61a6e4b5c7e7114d016b8a1069b28	2025-09-02 15:02:47 -05:00
Maisam Arif	4ffa468613	[SWDEV-540665] Remove amdsmi_set_power_cap API Guest Restriction Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I682506b48c10eefbd04f9b494ad57fb8ae8842b0	2025-08-27 20:18:43 -05:00
Arif, Maisam	ed2300516f	Revert "[SWDEV-536176] libdrm_amdgpu depdency change (#448)" This reverts commit `652761de54`.	2025-08-27 20:11:17 -05:00
Arif, Maisam	652761de54	[SWDEV-536176] libdrm_amdgpu depdency change (#448) * Cmake fix updates * Next fix will be addressing libdrm further --------- Signed-off-by: adapryor <Adam.pryor@amd.com> Signed-off-by: Justin Williams <juwillia@amd.com>	2025-08-27 09:32:51 -05:00
Pryor, Adam	f8afba0a5f	[SWDEV-540665] Move Virtualization checks in APIs into amd-smi APIs (#643) * Remove vm checks in rocm-smi * Move virtualization checks up the stack into amd-smi --------- Signed-off-by: adapryor <Adam.pryor@amd.com> Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>	2025-08-21 18:11:50 -05:00
Oosman Saeed	ffca095246	[SWDEV-547223] RAS HBM CRC Read CE failed due to AFID missing 24 cherry-pick aca-decode repo changeset: aca-decode repo: f9e5ad5 (HEAD -> main, origin/main, origin/HEAD) Fix bug in Corrected HBM Error being decoded as AFID 34 (#5)	2025-08-21 11:00:30 -05:00
Maisam Arif	6de6290dc1	Removed kfd_ioctl.h from rocm include install Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I7948eb050f79a8a0f71e0b8a8e4e08187ac0bb84	2025-08-19 17:18:14 -05:00
Charis Poag	d3b73fac82	Revert Major ABI break for amdsmi_get_violation_status() Changes: - This aligns back to original struct naming for ROCm 7.0. This removes any Major ABI breakages for updates for 7.0 release. - Minor ABI breakage is required since there were additions to the header. Refer to changelog for these updates. Change-Id: If35af74eac6beac8c267d05ce789b7761ed24bff Signed-off-by: Charis Poag <Charis.Poag@amd.com>	2025-08-18 11:36:57 -05:00
Bill Liu	c45a53d751	[SWDEV-548260] Enable Support for Multiple init() and shutdown() Implemented reference counting to manage init and shutdown processes, allowing for multiple initializations and shutdowns.	2025-08-15 11:44:50 -05:00
josnarlo	925014ddaf	Fix getting version information Change-Id: I2695733307888f5ab41a1265ae4369a2ea011e09	2025-08-08 08:12:10 -05:00
Poag, Charis	e2e4fc65c1	[SWDEV-542223] Update Violation Status Changes to Design + Minor cleanup (#558 ) Changes: - Update violation status logic and metric naming for XCP/XCC metrics (thrm/thm consistency) - Added XCP identifier in monitor to allow partition metrics to be shown with applicable APIs (Violation Status is the first example of this in monitor) - Improve CLI monitor output: support multiple GPU lines per GPU, add new columns, and better formatting - Refactor helpers and logger for flexible unit formatting and table rendering - Add examples for amdsmi_get_gpu_pm_metrics_info()/amdsmi_get_gpu_reg_table_info() new metrics APIs in C++ example - Sync Python/C++ interface and structures for new metrics fields and naming - Remove deprecated/unused RSMI activity APIs, documentation not needed since the APIs no longer exist in ROCm SMI either. - Cleanup metric violations + fix handle watch arguments - Provide better handling/doc for average_flattened_ints() - Group xcp metrics with brackets in human readable + adjust output size Signed-off-by: Poag, Charis <Charis.Poag@amd.com>	2025-08-06 16:03:06 -05:00
Poag, Charis	d24dc7ef89	[SWDEV-518561] Separate Driver Reload from Memory Partition Sets (#582 ) Description: - Added a new API `amdsmi_gpu_driver_reload()` to reload the AMD GPU driver independently. - Updated CLI (`sudo amd-smi reset -r`) and Python bindings to support driver reload functionality. - Removed automatic driver reload from `amdsmi_set_gpu_memory_partition()` and `amdsmi_set_gpu_memory_partition_mode()`. - Enhanced CLI and test cases to allow users to control when the driver reload occurs. - Updated documentation and changelog to reflect the new driver reload process. - Improved error handling and logging for driver reload operations. - Added progress bar and user confirmation prompts for driver reload commands. * Update build/test strategy to only allow one test execution at a time * Modify API verbage + modify systemctl error output - Systemctl is typically not enabled on docker. - And is an edge case for gpu being active process/etc for display devices. * Remove AMDSMI_STATUS_AMDGPU_RESTART_ERR from the return values * Move driver reload to after we save original compute partitions --------- Signed-off-by: Charis Poag <Charis.Poag@amd.com>	2025-08-05 20:44:28 -05:00
Liu, Shuzhou (Bill)	abd3c02a3c	Query UBB/OAM temperature API (#581) Add support to Query UBB/OAM temperature. * Updated Python API with new temperature metrics enum --------- Co-authored-by: Bill Liu <shuzhliu@amd.com> Co-authored-by: gabrpham_amdeng <Gabriel.Pham@amd.com>	2025-08-05 20:37:45 -05:00
Saeed, Oosman	753a5ea326	[SWDEV-533349] codeQL erors in amdsmi source code (#588) Signed-off-by: Saeed, Oosman <Oosman.Saeed@amd.com>	2025-08-05 20:17:21 -05:00
gabrpham_amdeng	4f0d1c8c29	[SWDEV-543627] Fixed incorrect metric min clock values Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>	2025-07-26 04:55:25 -05:00
Kanangot Balakrishnan, Bindhiya	c9f0d1b953	[SWDEV-545342] Remove link type translation (#575)	2025-07-25 13:16:06 -05:00
Saeed, Oosman	03414e20ee	SWDEV-539482: Different sizes of mem leaks observed in amdsmitst (#538) Signed-off-by:Oosman Saeed <oossaeed@amd.com>	2025-07-15 14:33:27 -05:00
Bindhiya Kanangot Balakrishnan	645c313f00	[SWDEV-543308] Revert amdsmi_link_metrics structure change Moved the bit_rate and max_bandwidth back into links in the amdsmi_link_metrics_t struct as this change was impacting other teams. Modified the C and python API's, wrapper, and CLI accordingly. Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>	2025-07-14 13:56:26 -05:00
Kanangot Balakrishnan, Bindhiya	514517e536	[SWDEV-539721] Show complete process name (#536 ) Modified the file used to fetch process name so that complete name with path can be displayed. Changes: amd-smi monitor -q - human readable format will output only the process name - csv and json formats will print the full path amd-smi process - name will always be the full path to the process amd-smi (default output) - name will always be truncated. --------- Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com> Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>	2025-07-09 16:34:39 -05:00
Narlo, Joseph	2cf6272b53	[SWDEV-541675] Remove Unnecessary API from amdsmi.h (#530) Signed-off-by: josnarlo <Joseph.Narlo@amd.com>	2025-07-07 11:14:27 -05:00
Saeed, Oosman	5b95d227bc	[SWDEV-538308] CPER CLI 20 limit bug (#499 ) The bug was reproduced like this. In terminal #1, run command: sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow In terminal #2, inject errors: while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done The terminal #1 starts dumping cper entry information that it captures. After 20 entries have been captured, open terminal #3 and run same command as terminal #1: sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow From terminal #3, there will be no output, even when terminal #1 continues capturing and printing information. The fix: Since we already have more than 20 CPER entries available in the GPU buffer, when we run the command from terminal #3 to start capturing from the beginning and pass 20 buffers to copy entries to, the C++ API returns a code saying there is more data available. The Python CLI should not treat this as an error, but should continue to print what the API returned. --------- Signed-off-by: Oosman Saeed <oossaeed@amd.com>	2025-07-07 11:11:13 -05:00
Maisam Arif	2d2e5fe692	[SWDEV-533390] Removed kfd_ioctl.h from being copied on install Signed-off-by: Maisam Arif <Maisam.Arif@amd.com> Change-Id: I03cb03b5f034e822c8f3c2d1e11e8b4e57251905	2025-06-20 14:32:16 -05:00
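
Commit c45a53d751 above says only that reference counting was implemented so that init() and shutdown() can be called multiple times. The following is a minimal sketch of that general technique, not the library's actual code: the class name `RefCountedLibrary`, the `active` flag, and the choice to raise an error on an unmatched shutdown() are all assumptions made for illustration.

```python
import threading


class RefCountedLibrary:
    """Hypothetical stand-in for a library handle; not the amdsmi API."""

    def __init__(self):
        self._lock = threading.Lock()
        self._refs = 0       # number of init() calls not yet matched by shutdown()
        self.active = False  # True while the (imaginary) resources are set up

    def init(self):
        with self._lock:
            if self._refs == 0:
                # Real setup work would happen only on the first init().
                self.active = True
            self._refs += 1

    def shutdown(self):
        with self._lock:
            if self._refs == 0:
                # Assumed policy: an unmatched shutdown() is a caller error.
                raise RuntimeError("shutdown() called without matching init()")
            self._refs -= 1
            if self._refs == 0:
                # Real teardown would happen only on the last shutdown().
                self.active = False
```

With this pattern, two independent callers can each pair their own init() and shutdown(); the resources stay alive until the last caller shuts down.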


602 Commits