Commit graph

219 commits

Author SHA1 Message Date
Charis Poag 6fada8c4a6 Merge amd-staging into amd-master 20240401
Change-Id: I52c8665735e86deed53645197c11889fc7ece8c5
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-04-01 17:48:06 -05:00
Charis Poag f5c32b5415 Add ROCm 6.1.1 changelog, ROCm SMI deprecation, vbios fix
* Updates:
    - Add ROCm 6.1.1 Changelog updates
    - Add planned ROCm SMI deprecation notice
    - Fix rocm-smi --showvbios showing extra errors
      for GPUs which do not have a VBIOS (MI300a ASICs)

Change-Id: I0e5ccfe2677f9c7909ca13863a920e323e82b439
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-30 00:11:09 -05:00
guanyu12 8d4261c5c5 Merge amd-staging into amd-master 20240321
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I006fc6c187f134a4851e262fa53ab6bf8d58759d
2024-03-21 14:03:51 +08:00
Charis Poag c5acd4ee88 Update ROCm 6.0/6.1 CHANGELOG.md & README.md
* Updates:
    - [CHANGELOG.md] Provide 6.1 and 6.0 changes
    - [README.md] Update readme with relevant changes
    - [CLI] Updated --showpower to expand on types of power provided to users

Change-Id: Ic653cc81f80b7973654e2c23e1ab70567b930aa7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-20 00:17:33 -05:00
guanyu12 ab8ebd4dea Merge amd-staging into amd-master 20240314
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I1d79ce09196cf101c2a885fd6be8f1094e8d5f9f
2024-03-14 11:15:44 +08:00
Charis Poag c2035fa1b9 [SWDEV-436308] Add Partition_ID from KFD
* Updates:
    - [CLI] rocm-smi (no arg) and --showhw:
      Now displays 'ID'/'PARTITION ID' from the pcie_id identifier
      Helps users identify which partition # the device is
      Information provided by KFD
      Note: a partition_id of 0 means a primary node (AKA root node),
      e.g. ASICs which do not have partitioning support will show 0
    - [API] Fix partition nodes which do not enumerate with domain:
            Adding KFD's domain allows ASICs which have domains
            to enumerate in proper order.
            Full pcie_id / bdf propagates to all partition nodes.
    - [API] Update rsmi_dev_pci_id_get() to allow users to extract
      partition_id from device
    - [CLI] Added fix for devices which have a modprobe failure,
      where DRM does not come up properly even though the driver shows
      initialization was successful.
    - [API/Utils] Overloaded print_int_as_hex() template:
      Now accepts bitsize, and prints in smallest byte size
      possible. Note: for a bitsize of < 8, please just print as decimal.

Change-Id: Ib0c6f73b2b9c9fea29442a39a669c432874382d8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-08 10:51:15 -05:00
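
The print_int_as_hex() change in the commit above can be illustrated with a minimal Python sketch. This is not the library's C++ template: the name `int_as_hex`, its default bit size, and the reading of "smallest byte size possible" as "padded to the fewest whole bytes that cover `bitsize`" are all assumptions made for illustration.

```python
def int_as_hex(value, bitsize=64):
    """Format value as 0x-prefixed hex, padded to the fewest whole bytes
    that cover bitsize (hypothetical helper, not the library's code)."""
    if bitsize < 8:
        # Mirrors the commit's note: sub-byte widths are shown as decimal.
        return str(value)
    num_bytes = (bitsize + 7) // 8          # round the bit size up to bytes
    mask = (1 << (num_bytes * 8)) - 1       # keep only the bits that fit
    return "0x{:0{width}x}".format(value & mask, width=num_bytes * 2)
```

For example, `int_as_hex(0x1f, 16)` pads to two bytes, while `int_as_hex(3, 4)` falls back to decimal.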
guanyu12 4d1ea826e1 Merge amd-staging into amd-master 20240308
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I2edf51a9b8f93589bf6eadee7b2691629c433977
2024-03-08 16:22:17 +08:00
Istvan Kiss 50a079af0f Update documentation and add python API documentation
Change-Id: Ibccf5b6a5fba81cea42e04a022deac8a3207b9b8
2024-03-06 22:01:30 -05:00
Charis Poag 90160a7c9c Fix rocm_smi library calls
- [CLI] Rounded VRAM output on CLI, no difference in output
    - [python API] Fixed initializing calls which reuse initializeRsmi()
      calls - now we set a global reference to rocmsmi to use
      throughout API calls (see error below)

Traceback (most recent call last):
  File "/home/charpoag/rocmsmi_pythonapi.py", line 9, in <module>
    rocm_smi.initializeRsmi()
  File "/opt/rocm/libexec/rocm_smi/rocm_smi.py", line 3531, in initializeRsmi
    ret_init = rocmsmi.rsmi_init(0)
NameError: name 'rocmsmi' is not defined

Change-Id: I0eff3b8a432abf6d4344a02b9f638e1191c51a19
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-03-04 21:08:08 -06:00
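
The NameError in the traceback above is what happens when a function uses a module-level name that was never bound. A minimal sketch of the "global reference" pattern the commit describes follows. `FakeLib`, `load_library()` and `initialize_rsmi()` are hypothetical stand-ins; the real rocm_smi.py loads a shared library and may be structured differently.

```python
class FakeLib:
    """Stand-in for the loaded shared library (hypothetical)."""

    def rsmi_init(self, flags):
        return 0  # 0 stands for success in this sketch


# Module-level handle: bound at import time, so later lookups never
# raise NameError, and filled in on first use.
rocmsmi = None


def load_library():
    """Create the library handle once and reuse it on later calls."""
    global rocmsmi
    if rocmsmi is None:
        rocmsmi = FakeLib()
    return rocmsmi


def initialize_rsmi():
    """Initialize through the shared handle instead of an unbound name."""
    lib = load_library()
    return lib.rsmi_init(0)
```

Calling `initialize_rsmi()` repeatedly reuses the same handle, which is the behavior callers that re-run initialization would rely on.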
Charis Poag 93ed5205f9 Merge amd-staging into amd-master 20240216
Change-Id: Id3e41507ab6143d08cb052710aa19c6f2e402fed
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-16 20:03:19 -06:00
Charis Poag 4b5ccb57f0 [SWDEV-423481/SWDEV-423393] Align all device identifier details
Updated:
 * [CLI] Fixed vram % - printf style formatting causes many data errors
   This fix updates to the recommended way of outputting formatted data.
   https://docs.python.org/3/library/stdtypes.html#printf-style-string-formatting
 * [API/CLI] Added gpu_id / GUID from kfd (rsmi_dev_guid_get)
       -> CLI name: "GUID"
       -> ROCm SMI calls: no arg, -i, --showhw, --showproduct
 * [API/CLI] Added node_id from kfd (rsmi_dev_node_get)
       -> CLI name: "Node"
       -> ROCm SMI calls: no arg, --showhw, --showproduct
 * [CLI] Added target gfx version from kfd
       -> CLI name: "GFX Version" or "GFX VER"
       -> ROCm SMI calls: --showhw, --showproduct
 * [CLI] Base ROCm CLI
       -> Removed - stacked id formatting:
	   This is to simplify identifiers helpful to users.
	   More identifiers can be found on -i --showhw, --showproduct
 * [CLI] Update -i, --showhw, --showproduct, w/out arg
      -> Card ID/DID/Model/SKU/VBIOS:
            All unsupported values now display "N/A" instead
            of "unknown" or "unsupported"
 * [CLI] Showhw now expands data based on content

Change-Id: Ifb8586f9f545892b8a5aa7903608273cdd77e075
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-13 19:52:29 -05:00
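
One common pitfall with printf-style formatting, described in the Python documentation linked in the commit above, is that a tuple on the right-hand side is unpacked as multiple arguments. The sketch below shows the `str.format()` alternative; `vram_percent()` and `describe()` are hypothetical helpers, and returning "N/A" for an unusable total is an assumption borrowed from the commit's wording for unsupported values.

```python
def vram_percent(used_bytes, total_bytes):
    """Return VRAM use as a whole-number percentage string."""
    if total_bytes <= 0:
        return "N/A"  # assumed placeholder for an unusable total
    return "{:.0f}%".format(100.0 * used_bytes / total_bytes)


def describe(value):
    """Format any value, including tuples, without printf-style unpacking."""
    return "value: {}".format(value)
```

With printf-style formatting, `"value: %s" % (1, 2)` raises TypeError because the tuple is treated as two arguments, whereas `describe((1, 2))` formats the tuple as a single value.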
Vladimir Stempen 677433b367 Fix [Not supported] status for get_compute_process_info_by_pid
On some systems [rocm-smi --showpids] reports
get_compute_process_info_by_pid, Not supported on the given system
[PID] [PROCESS NAME] 1 UNKNOWN UNKNOWN UNKNOWN

get_compute_process_info_by_pid fails because cu_occupancy debugfs method
is not provided on some graphics cards and GFX revisions by design

Proposing a change to return success status when only cu_occupancy debugfs method
is not found and provide cu_occupancy invalidation value to mark only
this parameter as UNKNOWN

Change-Id: Iae37070d9bd19483b4e6c8ee24c7d9a4c92f00d7
Signed-off-by: Vladimir Stempen <Vladimir.Stempen@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-13 18:17:47 -05:00
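
The proposal in the commit above (succeed when only cu_occupancy is unavailable, and mark just that field as UNKNOWN) can be sketched in Python as follows. This is an illustration only: the sentinel value, the field names other than cu_occupancy, and the `read_field` callback are assumptions, not the library's actual code.

```python
CU_OCCUPANCY_INVALID = 0xFFFFFFFF  # assumed sentinel marking "no data"


def process_info(pid, read_field):
    """Collect per-process fields.

    read_field(pid, name) is a hypothetical callback that returns an int
    or raises FileNotFoundError when the backing file does not exist.
    """
    info = {"pid": pid}
    for name in ("vram_usage", "sdma_usage"):  # assumed required fields
        info[name] = read_field(pid, name)     # errors here still propagate
    try:
        info["cu_occupancy"] = read_field(pid, "cu_occupancy")
    except FileNotFoundError:
        # Only this optional field is invalidated; the call still succeeds.
        info["cu_occupancy"] = CU_OCCUPANCY_INVALID
    return info


def display(value):
    """Render the sentinel as UNKNOWN and everything else as a number."""
    return "UNKNOWN" if value == CU_OCCUPANCY_INVALID else str(value)
```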
Galantsev, Dmitrii d03061823a Merge amd-staging into amd-master 20240212
Change-Id: I662f2a470446550ba8c612aa1e5be911d7f7489f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-02-12 11:30:04 -06:00
Bill(Shuzhou) Liu 4e0a7f2f67 Support set min or max clock
In addition to being able to set a clock range, a new setextremum option
is added to set only the min or max clock, as sometimes one of them may
not be supported.

Change-Id: I7c91ba308f3fc6c78efc88117509c515d403a6cb
2024-02-09 09:24:26 -06:00
Charis Poag c18ec624af [SWDEV-437365] Fix --showpower
Updates:
  - [CLI] Switching to use generic rsmi_dev_power_get()
  this is a backwards compatible function to
  retrieve power values. More consistent than
  previous fixes.
  - [API] Update API for rsmi_dev_power_get()
  Now provides @deprecated for this function.
  Providing notes that newer ASICs only support
  current socket power, whereas previous
  ASICs only provided average power.

Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
(cherry picked from commit 51aec98edd)
2024-02-02 19:30:46 -05:00
Charis Poag 51aec98edd [SWDEV-437365] Fix --showpower
Updates:
  - [CLI] Switching to use generic rsmi_dev_power_get()
  this is a backwards compatible function to
  retrieve power values. More consistent than
  previous fixes.
  - [API] Update API for rsmi_dev_power_get()
  Now provides @deprecated for this function.
  Providing notes that newer ASICs only support
  current socket power, whereas previous
  ASICs only provided average power.

Change-Id: I34da0e925cf0b6c669bdd801b017f33f3b3ee86a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-02-02 00:00:38 -06:00
Galantsev, Dmitrii 9386d60522 Merge amd-staging into amd-master 20240124
Change-Id: I358fde8bed15c8b2a240a0be8cf5411e21238b08
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-24 16:26:34 -06:00
Bill(Shuzhou) Liu 905c25e59b Voltage clock display as 0 when overdrive and voltage not supported
Change the Python tool to not display the above information if it is
not supported.

Change-Id: I48ffd95f07168219a629dfb391c1b4587308286d
2024-01-19 17:11:08 -05:00
Galantsev, Dmitrii 0c5c46db6f Merge amd-staging into amd-master 20231121
Change-Id: I400dfcdbf7fd1afcb020805342a4389038ce3917
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-21 17:28:29 -06:00
Galantsev, Dmitrii e1c972a193 Bump version lib:7.0.0 tool:2.0.0+hash
Change-Id: I7f2fd5605a93d07f61b997a25e1fbcf2780ea5cb
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-21 17:19:41 -06:00
Galantsev, Dmitrii d61aaf44e1 Add version hash
Change-Id: I6cf18b00a45ebd106f981e92681cab2ef25924e2
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-21 17:14:38 -06:00
Charis Poag 553d26ef3a Fix CLI checks for secondary die
MCM die check was inconsistent (using avg power).
By using only the energy counter, this provides
a consistent way of checking which die is the MCM node.

Change-Id: I532fa2047706d0f1e92e643ce1e6759e45b65ec0
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-11-21 11:58:52 -05:00
Oliveira, Daniel 2c8ba4cae9 rocm_smi_lib: Fix Refactoring gpu_metrics code
Uses new support for 'gpu_metrics_v1_4'

Code changes related to the following:
  * rsmi gpu_metrics APIs
  * rsmi gpu_metrics Logs
  * new data structure fields added in 1.4
  * added APIs for all other existing metrics before 1.4
  * added support for older metrics: 1.1 and 1.2
  * public APIs renamed to start with prefix 'rsmi_dev_metrics_'
  * Unit tests updated
  * Examples updated

Build changes related to the following: None

Change-Id: Ibdaf031be9d916020b4049544dbd725858c7711d
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-11-10 19:05:09 -06:00
Galantsev, Dmitrii 8aa036ae08 Merge amd-staging into amd-master 20231102
Change-Id: I7d1901564af875f2f9aa8879f24bff098ea30600
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-11-02 18:24:10 -05:00
Charis Poag 57b6135e54 Partition EBUSY with RSMI_STATUS_BUSY & invalid GPU Metrics check
* Updates:
   - [API/CLI] rsmi_dev_*_partition_set &
     rsmi_dev_*_partition_reset - exposed RSMI_STATUS_BUSY for
     EBUSY writes + cleaned up accidental map insertions
     (maplookup[] can insert values that are not in the map,
     map.at(key) fixes this potential issue)
   - [API] rsmi_dev_gpu_metrics_info_get() - returns
     RSMI_STATUS_NOT_SUPPORTED for unsupported metric tables
     outside of 1v1/1v2/1v3
   - [API] writeDevInfoStr() - exposes RSMI_STATUS_BUSY for
     EBUSY write errors; kept backward compatibility
     for other writes which do not care about these states
   - [API] rsmi_dev_od_volt_info_get()
      & rsmi_dev_od_volt_curve_regions_get() have better logging
     + Expose more details on why they are erroring
   - [Utils/logs/example] Expose AMD GPU gfx target version to aid in
     system troubleshooting
   - [Utils] Added test methods that look at od volt
     freq & regions into here - for easier access across
     several tests
   - [Utils] Updated getRSMIStatusString(new argument - fullstatus;
     default to true for backwards compatibility)
     -> true shows shortened RSMI STATUS response
   - [Utils] Added splitString to cut out noisy return responses
     (used in getRSMIStatusString(), when fullstatus = true)
   - [Utils] Added getFileCreationDate() to expose build date
     of the library - helpful for local builds or experimental builds
   - [Utils] Macro cleanup
   - [Example] Added a few gpu_metric checks - helpful for upcoming
     updates
   - [Device] SYSFS/DebugFS - now have better r/w displayed in logs
   - [LOGS] Expose library build date - see above for details
   - [Tests] Add more warnings/errors to test builds
   - [Tests] Moved up Partition tests for ordered test runs - helped
     identify issues with GPU BUSY writes
   - [Tests] compute_partition_read_write - handles RSMI_STATUS_BUSY
     with waits for busy status found & cleaned up how we checked
     for partition changes - with RSMI responses exposed more clearly
   - [Tests] perf_determinism - multi gpu now properly runs through
     with full resets as needed
   - [Tests] volt_freq_curv_read - better error handling with more
     verbose output

Change-Id: Ie94c6abb6a9aab95c345996d3ad3843cf6734977
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-27 14:52:02 -04:00
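
The "accidental map insertions" item in the commit above refers to C++, where `map[key]` inserts a default entry for a missing key and `map.at(key)` does not. A rough Python analogue, offered only as an illustration and not as the library's code, is `collections.defaultdict`, whose subscript lookup has the same insert-on-miss side effect. Both helper names are hypothetical.

```python
from collections import defaultdict


def lookup_inserting(table, key):
    """Subscript lookup: on a defaultdict, a missing key is silently added."""
    return table[key]


def lookup_strict(table, key):
    """Membership check first, so a miss never modifies the table."""
    if key not in table:
        raise KeyError(key)
    return table[key]
```

After `lookup_inserting(table, "missing")` the table contains a new "missing" entry, while `lookup_strict()` raises KeyError and leaves the table unchanged.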
guanyu12 fe6bf6444a Merge amd-staging into amd-master 20231026
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: I57e13f063345d6932d285257815638096f23d1d6
2023-10-26 14:51:27 +08:00
Galantsev, Dmitrii 275108f5b9 README - Clean-up cli readme
Change-Id: I665cc5a48a240f0d2289439a4877c9f667b19851
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-23 13:17:04 -05:00
Maxime Chambonnet 8cfcb51550 Updated README.md with standard Markdown tables, cleaned up header levels a bit.
Change-Id: Ibd6e382413d7667a5a823ac69620a2cfb7046bc5
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-23 13:11:18 -05:00
Galantsev, Dmitrii 2254d3e376 rocm_smi.py - Fix checkIfSecondaryDie duplicate
fixes:
https://github.com/RadeonOpenCompute/rocm_smi_lib/issues/128

Change-Id: I9b01c3fc4d255e01423cca25b9c924630b8cf326
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-19 16:57:08 -04:00
guanyu12 5fddc58624 Merge amd-staging into amd-master 20231010
Signed-off-by: guanyu12 <guanyu12@amd.com>
Change-Id: Id320e9e80c796a4edb155cdc002de1efdb6b2c3a
2023-10-10 10:48:42 +08:00
Charis Poag b251bb0c9f Rename NPS -> memory partition + compute partition node fix
* Updates:
        - rocm_smi_lib + CLI:
          Rename all "NPS mode" -> "memory partition"
          related files/functions/API/CLI to align with correct
          technical naming
        - rocm_smi_main: fixed identifying primary card's unique id
          utilize rsmi_dev_unique_id_get to map which
          KFD nodes belong to it
        - rsmi_dev_*_partition*: now have better logging output
        - compute partition tests:
          Added 20 sec delay for workaround until GPU
          busy is confirmed as the issue
        - CPPLint fixes/formatting
        - [Example] Moved all endl to "\n" for efficiency
        - [Example] Added Edge & Junction temperature examples
        - [Example] Added rsmi_minmax_bandwidth_get() example - WIP

Change-Id: Ida6db6fda7e0ac9d696a34cb15b4746e69d58d51
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-10-06 11:51:09 -04:00
Galantsev, Dmitrii 2831a5addc Merge amd-staging into amd-master 20231005
Change-Id: Ie217f139f63aa10ec5e9ce48797b7cb94864736d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-05 16:22:45 -05:00
Galantsev, Dmitrii d862bee754 Add --version to CLI
Change-Id: Id2a8f10f544ed04e874db773820534eddd73f55d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-10-02 17:57:02 -05:00
Ori Messinger aa89f2e125 ROCm SMI CLI: Add Missing Firmware Blocks
The purpose of this patch is to add the following missing firmware
blocks to the SMI CLI:
-RSMI_FW_BLOCK_MES
-RSMI_FW_BLOCK_MES_KIQ

Signed-off-by: Ori Messinger <Ori.Messinger@amd.com>
Change-Id: If9cabdc60ffcf08f27c9e6bdc20e8a26b192a738
2023-09-29 18:13:16 -04:00
Bill(Shuzhou) Liu 016dbf8aa3 Do not print the library name if in default folder
The rocm-smi Python tool will not print the library name if it is in the
default folder.

Change-Id: I203a872ebe2fc994766a2628049ca50c8bfa7120
2023-09-27 12:14:33 -04:00
Hao Zhou 4ce4535450 Merge amd-staging into amd-master 20230926
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I6d152514b258cf7b5f0ab0e54e2539ab5f033f14
2023-09-26 09:40:23 +08:00
Charis Poag f078375350 Add Current (Instant) Socket Power
* Updates:
    - rocm_smi_logger:
      General cleanup &
      Aligned to cpplint rules for usage
    - rocm_smi_monitor:
      Fixed MonitorTypes
      from not displaying properly in logs
      & Added socket power label + current
      socket power MonitorTypes
    - rocm_smi API:
      Added rsmi_dev_current_socket_power_get API
    - rocm_smi CLI:
      General cleanup,
      Concise info now displays device data
      in variable width (see printLogSpacer's
      new field),
      printLogSpacer now has an adjustable
      variable that overrides appWidth,
      Added Socket Power to base rocm-smi +
      --showpower CLI calls,
      --showpower & base rocm-smi CLI defaults
      to printing socket power (if not available,
      displays average power)
    - Cleaned up temp label references
    - power_read gtests:
      Added current socket power to testing

Change-Id: Ica57e6f98ad96e2584e7c7955e188f68d2dab89d
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-09-25 01:38:54 -04:00
Galantsev, Dmitrii 3d40c4bb2c SWDEV-422836 - Add sleep frequency support
Change-Id: I0bde403b010bf036ce44ed0600cc7eb03742c6b6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-25 01:38:27 -04:00
Bill(Shuzhou) Liu 2247c4b46c Change the python tool id output label
Change the label from GPU to Device as we call rsmi_dev_id_get().

Change-Id: I8ffe3673d434e5291ebd5cc909afb7d18154ecb6
2023-09-25 01:31:04 -04:00
Hao Zhou d417ea52f6 Merge amd-staging into amd-master 20230925
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: Id510a95e3bea2ddddae7c417071bde599569930a
2023-09-25 09:38:13 +08:00
Galantsev, Dmitrii 3a4e428fd5 PY: Remove f-strings from rocm_smi.py
Change-Id: I0a422e8f66473af837460ecb2450e5be329163b0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-23 00:12:34 -05:00
Galantsev, Dmitrii 1683245ecf PY: Remove f-strings from rocm_smi.py
Change-Id: I0a422e8f66473af837460ecb2450e5be329163b0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-22 19:15:59 -05:00
Oliveira, Daniel e0483f2ee2 rocm_smi_lib: Fix [linux BM] [AMDSMI] Memory Bandwidth
Implements APIs for 'gpu_metrics_v1_3' utilization averages

Code changes related to the following:
  * rsmi_dev_activity_metric_get()
  * rsmi_dev_activity_avg_mm_get()
  * CLI shows "Avg.Memory Bandwidth" under "--showmemuse"

Change-Id: I8e4600f350a7c18499abf022534db2b875f09d5f
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-09-21 11:00:29 -04:00
Hao Zhou 6d081cd1b1 Merge amd-staging into amd-master 20230921
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I08c5ae1cca4b03dbb3cfcbcbf61d4b1b633908c1
2023-09-21 13:40:25 +08:00
Galantsev, Dmitrii 094c98a74f rocm_smi.py: Fix pipe into head error
When piping rocm_smi into 'head', it failed with a "Broken pipe" error. The
error can be safely ignored: head closes the pipe early, which causes
a SIGPIPE signal to be raised.

https://docs.python.org/3/library/signal.html#note-on-sigpipe

Change-Id: I4a589c6ed9a8c5b50de84b33e28115c6b510045f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-15 16:37:10 -04:00
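
A minimal sketch of handling the early-closed pipe described in the commit above, following the pattern in the Python signal documentation it links to. The `emit()` and `main()` names are hypothetical, and the actual fix in rocm_smi.py may differ.

```python
import os
import sys


def emit(lines, stream=None):
    """Write lines to stream; return False if the reader closed the pipe."""
    stream = sys.stdout if stream is None else stream
    try:
        for line in lines:
            stream.write(line + "\n")
        stream.flush()
        return True
    except BrokenPipeError:
        return False


def main():
    if not emit("row {}".format(i) for i in range(100000)):
        # Python flushes stdout at exit; point it at devnull so that
        # flush does not raise a second BrokenPipeError.
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)
```

Saved as a script that calls `main()`, this can be piped into `head -n 5` without a traceback once head exits.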
Galantsev, Dmitrii 3b95214fff rsmiBindings.py - Add initRsmiBindings()
Library path was printed at all times even with --json flag.
This commit adds a mandatory initRsmiBindings function which is a core
component of the rsmiBindings.py library.

It **MUST** be called on import.

Change-Id: Ic6ae1ec5d1fabba288910e6aed6c4706e53e5cd7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-15 16:37:10 -04:00
Galantsev, Dmitrii a4b470fe71 Add errors for existing but empty dev files
Change-Id: Iad9febc50f9b8e6085f8b605249ee884d2f134d6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-14 17:30:03 -04:00
Oliveira, Daniel 12f395e592 rocm_smi_lib: Fix rocm-smi --resetfans results in Permission Denied
For operations related to:
  --resetfans
  --setfan

We report 'Not supported' for these cases instead of 'Permission denied'

Code changes related to the following:
  * rocm_smi_properties
  * rocm_smi related APIs

Change-Id: I144646efc3804fabd45cc5a46351803950b4feb7
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-09-14 16:54:29 -04:00
Hao Zhou 265341dd39 Merge amd-staging into amd-master 20230914
Signed-off-by: Hao Zhou <Hao.Zhou@amd.com>
Change-Id: I408a62826faff38d319b0d7ef08767223b3b327f
2023-09-14 10:23:32 +08:00
Galantsev, Dmitrii 4acfb00ad5 PY: Silence error output when printing concise info
Change-Id: I9ce4ad523b3fe2ec8afc5bea791810ec67558f11
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-09-12 19:16:16 -04:00