443 Commits

Author SHA1 Message Date
Pham, Gabriel fc5ea762b3 Added Platform Information to Default Command (#553)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-08-05 20:11:42 -05:00
Pryor, Adam 2dc2e12a97 Documentation updates for AMDSMI_GPU_METRICS_CACHE_MS (#564)
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-05 19:58:37 -05:00
AL Musaffar, Yazen 27cae85910 [SWDEV-544092] Fix Navi process float conversion (#579)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-08-04 14:40:18 -05:00
Bindhiya Kanangot Balakrishnan b16a66b2c5 [SWDEV-525336] Fix N/A process name display
The amd-smi command shows only the executable
name of a process by stripping the absolute path. This
caused "N/A" process names to be displayed incorrectly as
"A" in the output. Corrected this.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-08-04 13:51:42 -05:00
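The failure mode fixed in b16a66b2c5 is easy to reproduce: taking the basename of the placeholder string "N/A" treats its slash as a path separator and leaves "A". A minimal Python sketch of a guard, assuming a hypothetical `display_name` helper (this is not the actual amd-smi code):

```python
import os

NOT_AVAILABLE = "N/A"  # placeholder shown when the name cannot be read


def display_name(process_name: str) -> str:
    """Return the executable name without its directory.

    Hypothetical helper for illustration; not the amd-smi implementation.
    """
    # Guard first: "N/A" contains a slash, so basename() would return "A".
    if process_name == NOT_AVAILABLE:
        return NOT_AVAILABLE
    return os.path.basename(process_name)
```

Checking for the placeholder before stripping the path keeps the "N/A" marker intact while real paths are still shortened to the executable name.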
Kanangot Balakrishnan, Bindhiya 27a1705d96 [SWDEV-537852] Update compute-partition set error messages (#505)
[SWDEV-537852] Update compute-partition set error messages

Setting compute partition needs sudo privileges. Added
AmdSmiPermissionDeniedException to display CLI elevated
permission errors.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-08-01 08:22:22 -05:00
Bindhiya Kanangot Balakrishnan 449839a32e [SWDEV-537852] Update help text for InvalidParameterValueException
Updated the help text to display command name.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 10:47:13 -05:00
Kanangot Balakrishnan, Bindhiya 6f7b397998 [SWDEV-537852] Update help and error text (#518)
Improved amd-smi help and error messages.
Updated to show subcommand name in help text.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-24 09:06:22 -05:00
Poag, Charis ec055f2c2d [SWDEV-536953] Fix sets/resets + Align Power Cap Behavior with ROCM_SMI (#456)
Changes:
  - Modified outputs for amd-smi set/reset when in partitions
    to display error codes
  - Provided some general cleanup for the above ^
----------------------------------------------------
  - Updated `amd-smi set -o <value>` / `amd-smi set --power-cap <value>` command to
    allow setting power cap to values other than 0, provided the current power cap is not 0.
  - Modified power_cap_read_write.cc:
    - Added a check to ensure that the power cap can only be set to non-zero values if the current
      power cap is not 0.
    - Reset the power cap to the original value after the test to maintain state consistency.
Change-Id: If489bb35812ba4fc4cc34723b0dc39c99926e5d7

---------

Signed-off-by: Poag, Charis <Charis.Poag@amd.com>
2025-07-22 17:21:15 -05:00
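The power cap rule described in ec055f2c2d (a non-zero cap may be set only when the current cap is itself non-zero, and the test restores the original value afterwards) can be sketched in Python. `FakeGpu`, `power_cap_request_allowed`, and `restored_power_cap` are hypothetical names used for illustration; the real check lives in the library and in power_cap_read_write.cc:

```python
from contextlib import contextmanager


class FakeGpu:
    """Stand-in for a device handle; hypothetical, for illustration only."""

    def __init__(self, power_cap: int):
        self.power_cap = power_cap


def power_cap_request_allowed(current_cap: int, requested_cap: int) -> bool:
    """Apply the rule from the commit message.

    A non-zero cap is accepted only when the current cap is non-zero.
    """
    if requested_cap != 0 and current_cap == 0:
        return False
    return True


@contextmanager
def restored_power_cap(gpu: FakeGpu):
    """Restore the original power cap on exit, even if the body raises."""
    original = gpu.power_cap
    try:
        yield gpu
    finally:
        gpu.power_cap = original
```

A test body would run inside `with restored_power_cap(gpu):` so that the device leaves the test with the same cap it started with.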
Bindhiya Kanangot Balakrishnan 645c313f00 [SWDEV-543308] Revert amdsmi_link_metrics structure change
Moved the bit_rate and max_bandwidth back into links in the
amdsmi_link_metrics_t struct as this change was impacting
other teams. Modified the C and Python APIs, wrapper, and
CLI accordingly.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-14 13:56:26 -05:00
Maisam Arif 10f9aae0b3 Reduced calls to drm devinfo for getting virtualization_mode
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I22a6a9ca15131b37a775e8d4f595fb13c0b043c7
2025-07-11 12:26:42 -05:00
Kanangot Balakrishnan, Bindhiya f6b854b4ed [SWDEV-541289] Update violation argument in amd-smi (#526)
* Disabled the violation argument for monitor on guests, as it is supported on bare metal (BM) only.
* Added `-v` and `--violation` args to metric along with `throttle` due to legacy behavior.
  * Suppressed the metric throttle arg so that it does not show in help text

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-07-09 16:38:09 -05:00
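Keeping a legacy flag working while hiding it from help text, as f6b854b4ed describes for `throttle`, is something Python's `argparse` supports through `argparse.SUPPRESS`. The sketch below is an assumption about how such an alias could be wired up; `build_metric_parser`, the shared `dest`, and the `--throttle` spelling are illustrative and not taken from the amd-smi sources:

```python
import argparse


def build_metric_parser() -> argparse.ArgumentParser:
    """Build a toy parser with a visible flag and a hidden legacy alias."""
    parser = argparse.ArgumentParser(prog="amd-smi metric")
    parser.add_argument(
        "-v", "--violation",
        action="store_true",
        dest="violation",
        help="Show violation status",
    )
    # Legacy spelling: still accepted, but omitted from usage and help.
    parser.add_argument(
        "--throttle",
        action="store_true",
        dest="violation",
        help=argparse.SUPPRESS,
    )
    return parser
```

Both spellings set the same attribute, so downstream code only has to look at `args.violation`.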
Kanangot Balakrishnan, Bindhiya 514517e536 [SWDEV-539721] Show complete process name (#536)
Modified the file used to fetch process name so that complete name with path can be displayed.

Changes:
amd-smi monitor -q
- human readable format will output only the process name
- csv and json formats will print the full path

amd-smi process
- name will always be the full path to the process

amd-smi (default output)
- name will always be truncated.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-09 16:34:39 -05:00
AL Musaffar, Yazen 01a6158c85 [SWDEV-532904] CLI lists unusable UUID without sudo (#510)
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-07-09 15:45:03 -05:00
josnarlo 0257140504 [SWDEV-536953] Align Power Cap Behavior with ROCM_SMI
Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
2025-07-09 15:37:40 -05:00
Kanangot Balakrishnan, Bindhiya ce230efaaa [SWDEV-537852] Update process name help text (#517)
* [SWDEV-537852] Update process name help text

Currently the process name displays N/A if reading it needs elevated
permissions. Updated the help text of the default amd-smi, process, and monitor
commands to state the elevated permission requirement.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-07-07 11:26:10 -05:00
Poag, Charis 88473b7fd0 [SWDEV-533305] Remove partition info from amd-smi static (-p/--partition still available) + CLI API call cleanup (#529)
Updates:
- Separate extra API calls from amd-smi CLI to target specific CLI commands that need them.
- Remove extra current_compute_partition SYSFS calls from amd-smi static.
- Remove the partition information from the default `amd-smi static` CLI command.
- Users must now use the `-p` argument to view partition information with `amd-smi static`.
- The help text for the `partition` argument has been updated to reflect this change.
- The partition information can still be accessed using the `amd-smi partition -c -m` or `sudo amd-smi partition -a` commands.

---------

Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-07-07 11:21:46 -05:00
Park, Peter 8039ab9449 Fix links in docs (#532)
* fix links in amdsmi_cli/README.md
* fix xrefs to install docs
* rm rocm-smi examples and add cli tutorial
* rm disclaimer and add amd smi contributing guidelines to index

Signed-off-by: Peter Park <Peter.Park@amd.com>
2025-07-07 11:18:40 -05:00
Saeed, Oosman 5b95d227bc [SWDEV-538308] CPER CLI 20 limit bug (#499)
The bug was reproduced like this.

In terminal #1, run command:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

In terminal #2, inject errors:
while true; do sudo amdgpuras -b 7 -s 1 -m 6 -t 2; sleep 2; done

Terminal #1 starts dumping the cper entry information that it captures. After 20 entries have been captured, open terminal #3 and run the same command as in terminal #1:
sudo amd-smi ras --cper --gpu 6 --severity all --folder /tmp/cper_dump --follow 

From terminal #3, there will be no output, even when terminal #1 continues capturing and printing information.

The fix:

Since we already have more than 20 CPER entries available in the GPU buffer, when we run the command from terminal #3 to start capturing from the beginning and pass 20 buffers to copy entries to, the C++ API returns a code saying there is more data available.

The Python CLI should not treat this as an error, but should continue to print what the API returned.

---------

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-07-07 11:11:13 -05:00
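The fix in 5b95d227bc treats "more data available" as a normal result rather than an error. The Python sketch below illustrates that control flow with stand-ins: the `Status` enum, `fetch_entries`, and `drain` are hypothetical names, and the real status codes and API signatures may differ:

```python
from enum import Enum, auto


class Status(Enum):
    SUCCESS = auto()
    MORE_DATA = auto()  # the call worked, but entries remain to be read
    ERROR = auto()


def fetch_entries(buffer, cursor, max_entries=20):
    """Hypothetical stand-in for the C++ API.

    Copies up to max_entries entries starting at cursor and reports
    whether more entries remain.
    """
    chunk = buffer[cursor:cursor + max_entries]
    new_cursor = cursor + len(chunk)
    status = Status.MORE_DATA if new_cursor < len(buffer) else Status.SUCCESS
    return status, chunk, new_cursor


def drain(buffer, max_entries=20):
    """Collect every entry, treating MORE_DATA as a signal to keep reading."""
    printed = []
    cursor = 0
    while True:
        status, chunk, cursor = fetch_entries(buffer, cursor, max_entries)
        if status is Status.ERROR:
            raise RuntimeError("CPER read failed")
        printed.extend(chunk)  # output what was returned in either case
        if status is Status.SUCCESS:
            break
    return printed
```

With 45 buffered entries and room for 20 per call, the loop reads three times and returns all 45, instead of stopping silently after the first call.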
gabrpham_amdeng a2885d6e70 [SWDEV-539451] Adjusted reset command to prevent reset on partitions
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-07-03 01:11:46 -05:00
Bindhiya Kanangot Balakrishnan fa9ca21520 [SWDEV-540014] Correct topology link_type check
Topology numa_bw checks for non-xgmi links to set as N/A.
The recent change in link_type enum mapping caused this
condition to check for PCIE instead of XGMI. Corrected
the same.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-30 14:01:19 -05:00
Bindhiya Kanangot Balakrishnan c3453f7c97 [SWDEV-530646] Reduce amdsmi_topo_get_p2p_status calls in topology
The topology method calls amdsmi_topo_get_p2p_status repeatedly
for the same GPU pairs across different table sections,
significantly impacting performance with 60+ GPUs. Reduced this
by implementing result caching.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-24 11:27:28 -05:00
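The result caching described in c3453f7c97 can be sketched as a small memoizing wrapper. `make_cached_p2p_lookup` and `fake_query` are hypothetical names, and the returned dictionary shape is an assumption; in the CLI, the wrapped call would be the p2p status query rather than this stand-in:

```python
def make_cached_p2p_lookup(query_p2p_status):
    """Wrap a pairwise query so each (src, dst) pair is queried only once."""
    cache = {}

    def lookup(src_gpu, dst_gpu):
        key = (src_gpu, dst_gpu)  # ordered pair: direction may matter
        if key not in cache:
            cache[key] = query_p2p_status(src_gpu, dst_gpu)
        return cache[key]

    return lookup


calls = []


def fake_query(src_gpu, dst_gpu):
    """Stand-in for the expensive library call; records each invocation."""
    calls.append((src_gpu, dst_gpu))
    return {"accessible": src_gpu != dst_gpu}  # hypothetical result shape


lookup = make_cached_p2p_lookup(fake_query)

# Three table sections each walk the same 4 x 4 grid of GPU pairs.
for _section in ("section_a", "section_b", "section_c"):
    for src in range(4):
        for dst in range(4):
            lookup(src, dst)
```

The saving grows with the number of table sections: each pair is queried once no matter how many sections reuse it, which matters most on systems with 60 or more GPUs.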
gabrpham_amdeng 9729aba695 Adjusted CU % logic to be more robust
2025-06-19 10:57:19 -05:00
gabrpham_amdeng fd751ba918 Changed NUM_CU to CU %
2025-06-19 10:57:19 -05:00
gabrpham 9e221a3f09 Added GTT Memory to process table of default command
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
2025-06-19 10:57:19 -05:00
gabrpham 8a0e65d911 Added GTT Memory to default command and adjusted table format
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
2025-06-19 10:57:19 -05:00
Galantsev, Dmitrii 4262802588 CLI - Fix partition json output
Change-Id: I2b9e575cb960db7c136776bfe5c040b27feba727
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-06-19 10:34:57 -05:00
Deepak Mewar 7571eb014f Updated display format of cpu & socket affinities
Signed-off-by: Deepak Mewar <deepak.mewar@amd.com>
2025-06-13 17:37:00 -05:00
Bindhiya Kanangot Balakrishnan 6fbda16098 [SWDEV-512393] Print keys of lists in custom_dump
The custom_dump function was not printing the keys of lists,
so static numa was not displaying the list keys for
CPU affinity and Socket affinity. Updated custom_dump
to print the keys.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-13 17:37:00 -05:00
Pham, Gabriel 940ece6813 Added GTT Memory to default output process table (#480)
* Added GTT Memory to default command and adjusted table format

---------

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
2025-06-13 16:43:56 -05:00
Maisam Arif b579d89ae2 [SWDEV-537062] Fixed CU Occupancy reporting UINT MAX
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I975579997a9e455eb930f6c0b8fc5f3dc3cbfae4
2025-06-11 10:42:00 -05:00
Maisam Arif ac63f410c2 Fixed Parser Folder Checking
* Adjusted help text
* Adjusted --afid to run only with --cper-file
* Fixed interface return error

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I2b96f4515c85f3b9dd84ba5c2d819729a997141b
2025-06-10 15:58:06 -05:00
Maisam Arif fb592e003a [SWDEV-536417] CPER Display fixes
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic2f3901d0f4c95bd9ed4beda8aa5fd3d596df8d2
2025-06-10 15:58:06 -05:00
Saeed, Oosman 815e0252b1 [SWDEV-536417] AFID & addc decode fixes (#449)
* fix endian problem
* use hw_revision and flags_mask from cper section instead of hardcoded values

---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-06 13:41:16 -05:00
Maisam Arif 8bc37a19d2 [SWDEV-536417] CPER & AFID CLI Fixes
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I20aafb1cd2bf8386c30e6d0a0fff8df9c8587554
2025-06-06 12:26:13 -05:00
Charis Poag 391451752b [SWDEV-529030/SWDEV-531217] Fix tests & output for partitioned configurations (CPX, DPX, QPX, etc.)
Changes:
  - Updated AMD SMI firmware to display "N/A" for unavailable firmware in partitioned environments, improving clarity.
    Example (in DPX):
    $ amd-smi firmware
    GPU: 0
        FW_LIST:
            ...
            FW 12:
                FW_ID: PM
                FW_VERSION: 00.86.39.00
    GPU: 1
        FW_LIST: N/A
  - Fixed amd-smi partition not showing current partition information on
    ASICs that cannot set memory or accelerator partitions.
    $ amd-smi partition -c -m
    CURRENT_PARTITION:
    GPU_ID  MEMORY  ACCELERATOR_TYPE  ACCELERATOR_PROFILE_INDEX  PARTITION_ID
    0       NPS1    CPX               2                          0
    1       N/A     N/A               N/A                        1
    2       N/A     N/A               N/A                        2
    3       N/A     N/A               N/A                        3
    4       N/A     N/A               N/A                        4
    5       N/A     N/A               N/A                        5
    6       NPS1    SPX               0                          0
    7       NPS1    SPX               0                          0
    8       NPS1    SPX               0                          0

    MEMORY_PARTITION:
    GPU_ID  MEMORY_PARTITION_CAPS  CURRENT_MEMORY_PARTITION
    0       N/A                    NPS1
    1       N/A                    N/A
    2       N/A                    N/A
    3       N/A                    N/A
    4       N/A                    N/A
    5       N/A                    N/A
    6       N/A                    NPS1
    7       N/A                    NPS1
    8       N/A                    NPS1

  - Refactored amd_smi_drm_example.cc:
    - Grouped partition changes and restores original partition settings.
    - Now handles partitioned environments allowing example to continue even if some APIs are not supported in partitioned configurations.
  - Modified amdsmi_asic_info_t (see amdsmi_get_gpu_asic_info()) to report OAM ID as N/A if 0xFFFFFFFF (was 0xFFFF).
    Allows for better handling of OAM IDs in partitioned environments (does not exist for non-primary nodes,
    since it's a physical identifier). Easier to handle in tests and example code (i.e., now consistent with the max size of the structure's value).
  - Introduced amdsmi_RAII_open_FD() (internal API) to manage file descriptors using RAII, ensuring proper closure and preventing resource leaks.
    Updated the following APIs to use this function:
      - amdsmi_get_gpu_asic_info(), amdsmi_get_gpu_vram_usage(),
        amdsmi_get_gpu_vram_info(), amdsmi_get_gpu_vbios_info(),
        amdsmi_get_gpu_driver_info(), amdsmi_get_gpu_virtualization_mode()
  - Updated AMD SMI test_base.cc/.h:
    - Improved output and handling for partitioned environments.
    - Added detailed ASIC information logging to align with structure changes.
    - Enhanced error messages for better context before ASSERT checks.
  - Resolved test failures in partitioned environments by updating
    logic and handling for partition-specific configurations.
    Fixed tests include:
      - computepartition_read_write.cc, frequencies_read_write.cc,
        gpu_metrics_read.cc, mem_util_read.cc, memorypartition_read_write.cc,
        perf_level_read.cc, perf_level_read_write.cc, power_cap_read_write.cc,
        power_read.cc, sys_info_read.cc, gpu_busy_read.cc

Change-Id: I36e903f8fddd714c74c719459c71aba8bbb77e6f
Signed-off-by: Charis Poag <Charis.Poag@amd.com>

Resetting head + adding fixes for tests run in partitions

Change-Id: I0c1e9ac07488b50c95f3bc6d8a724e67d2c715dc
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2025-06-05 19:24:49 -05:00
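The OAM ID change in 391451752b comes down to which sentinel value marks a missing ID. A small Python sketch of the display rule, assuming a hypothetical `format_oam_id` helper (the name is not taken from the amd-smi sources):

```python
UINT32_MAX = 0xFFFFFFFF  # sentinel named in the commit message


def format_oam_id(raw_oam_id: int) -> str:
    """Render an OAM ID for display, mapping the sentinel to "N/A".

    Hypothetical helper for illustration only.
    """
    if raw_oam_id == UINT32_MAX:
        return "N/A"
    return str(raw_oam_id)
```

Under this rule the previous sentinel, 0xFFFF, is no longer special and would be shown as the plain number 65535.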
Bindhiya Kanangot Balakrishnan 872c58b7a3 [SWDEV-534746] Generate valid json output for partition command
The amd-smi partition --json output was not in valid json
format. Changes are done to get the output in valid
json format.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-05 01:40:52 -05:00
Saeed, Oosman 2c3fa591b5 [SWDEV-530385] Update aca-decode with parsing fixes (#435)
* Update aca-decode to #4cd539d, which fixes some errors in parsing cper files for afid extraction
* Without this fix, we get garbage values for some cper input files relating to GFX_poison_cpers

Signed-off-by: Oosman Saeed <oossaeed@amd.com>
2025-06-04 18:49:05 -05:00
Narlo, Joseph ce7d6dfe61 [SWDEV-532769] amd-smi APIs mismatch with documentation (#428)
* Populated socket_power to get power info
---------

Signed-off-by: josnarlo <Joseph.Narlo@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-03 17:12:13 -05:00
Bindhiya Kanangot Balakrishnan 8f943b03e1 [SWDEV-534745] Generate valid json output for xgmi command
The amd-smi xgmi --json output was not in valid json
format. Changes are done to get the output in valid
json format.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-03 12:48:02 -05:00
Saeed, Oosman fab13c5b60 [SWDEV-530385] show afids on each line of printout (#422)
* show afids on each line of printout
* clean up afids and cper code
---------

Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-02 17:22:10 -05:00
Pham, Gabriel 91021da055 [SWDEV-446039] Added Flat Process table to default output (#425)
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Co-authored-by: Maisam Arif <Maisam.Arif@amd.com>
2025-06-02 17:15:15 -05:00
Kanangot Balakrishnan, Bindhiya 8ed52616ad [SWDEV-519061] xgmi command output shows zero for all xgmi acc read/write data in the first column (#392)
The xgmi read and write accumulated data from gpu metric index
is based on sysfs xgmi_port_num file. Mapped these two to display
read and write wrt src_gpu Vs dst_gpu.
---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-06-02 14:01:06 -05:00
Maisam Arif cebb0799cb [SWDEV-488303] Fixed process list information source
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Iec3416cb5ca1bdd806c3225b514bbf3dbf8c0d2e
2025-05-30 20:48:29 -05:00
gabrpham_amdeng 1fa4cdacf3 Suppressed help text of default command
Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-05-30 19:53:14 -05:00
Pham, Gabriel daf74d1cd6 [SWDEV-511822] Added group check to default command (#415)
Signed-off-by: Pham, Gabriel <Gabriel.Pham@amd.com>
2025-05-30 18:40:18 -05:00
Kanangot Balakrishnan, Bindhiya 2eff0b3764 [SWDEV-530633] Use gpu_metric speed and BW for xgmi (#366)
The xgmi command was showing pcie bit rate and bandwidth instead of xgmi. Corrected the API to get xgmi data from gpu metric.
Added python API for amdsmi_get_link_metrics. Modified the amdsmi_link_metrics struct.
Added a check for non-zero partitions in the xgmi command.

---------

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-05-30 16:51:11 -05:00
Castillo, Juan 2e8aaf02c9 [SWDEV-534728] Fixed deep_sleep status does not work with --json flag (#413)
- When in json output mode the .rstrip function does not work due to dict obj type.
  - The clk_value is now checked for dict instance before extracting the value.
  - If clk_value is a dict then the .get() function is used to extract the value.
  - Else it is a string obj which uses .split() to extract the value.
  - If clk_value is < min_clk_value then deep_sleep is set to ENABLED
  - Initialize clk_value and min_clk_value to 0 for each loop.
  - Fix if/else for better readability

---------

Signed-off-by: Juan Castillo <juan.castillo@amd.com>
2025-05-30 16:45:32 -05:00
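The branching described in 2e8aaf02c9 can be sketched in Python as follows. The dictionary shape `{"value": ..., "unit": ...}`, the "132 MHz" string format, the "DISABLED" label, and the helper names `clock_mhz` and `deep_sleep_status` are assumptions for illustration; the sketch also assumes the clock values are numeric:

```python
def clock_mhz(clk_value) -> int:
    """Extract a numeric clock from either output representation.

    Assumed shapes: {"value": 132, "unit": "MHz"} in json mode,
    "132 MHz" in human-readable mode.
    """
    if isinstance(clk_value, dict):
        # json mode: string methods such as .rstrip() are not available
        return int(clk_value.get("value", 0))
    # human-readable mode: take the number before the unit
    return int(clk_value.split()[0])


def deep_sleep_status(clk_value, min_clk_value) -> str:
    """Report ENABLED when the current clock is below the minimum clock."""
    if clock_mhz(clk_value) < clock_mhz(min_clk_value):
        return "ENABLED"
    return "DISABLED"
```

Normalizing both representations to an integer first means the comparison itself is written once and behaves the same in every output mode.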
Arif, Maisam 0fdaebdbaa [SWDEV-488303] Updated CU occupancy for per-process retrieval (#243)
Change-Id: I2990597c6dd4b2e8cf3e11ce60f72049ebdd9a8c
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2025-05-29 20:35:27 -05:00
Liu, Shuzhou (Bill) 970560fc7c [SWDEV-520665] Add support for board voltage (#303)
* Add the API and CLI to show the board voltage. 

---------

Change-Id: Icb25bd653bb1d004704b5a21b378ca31b2b242c7
Signed-off-by: Arif, Maisam <Maisam.Arif@amd.com>
Signed-off-by: AL Musaffar, Yazen <Yazen.ALMusaffar@amd.com>
2025-05-29 18:55:08 -05:00
Pham, Gabriel bc158d2b51 [SWDEV-511822] Created default command for amdsmi (#348)
* Added degree symbol and fixed power usage
* fixed default command

---------

Signed-off-by: gabrpham_amdeng <Gabriel.Pham@amd.com>
2025-05-29 17:14:58 -05:00