Commit Graph

654 Commits

Author SHA1 Message Date
Kanangot Balakrishnan, Bindhiya 94dca7073b [SWDEV-481004] Fix for incorrect gfx_version number (#8)
The target_graphics_version was not formatted properly and was
showing an incorrect Target Name. Corrected this by formatting the
major, minor and revision numbers.

Signed-off-by: Bindhiya Kanangot Balakrishnan <Bindhiya.KanangotBalakrishnan@amd.com>
2025-01-21 15:46:56 -06:00
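The formatting idea in the commit above (94dca7073b) can be illustrated with a short sketch. It assumes the raw version is a single decimal integer encoded as major * 10000 + minor * 100 + revision, and that the target name renders minor and revision as hexadecimal digits. Both the encoding and the helper name `format_gfx_target` are assumptions for illustration and are not taken from the commit itself.

```python
def format_gfx_target(raw_version: int) -> str:
    """Build a target name from an encoded graphics version.

    Assumption (not from the commit): raw_version is
    major * 10000 + minor * 100 + revision.
    """
    major = raw_version // 10000
    minor = (raw_version // 100) % 100
    revision = raw_version % 100
    # Minor and revision are rendered as hex digits, so revision 10 becomes "a".
    return f"gfx{major}{minor:x}{revision:x}"


if __name__ == "__main__":
    for raw in (90010, 90402, 110000):
        print(raw, "->", format_gfx_target(raw))
```

Splitting the number into its three components before formatting avoids printing the raw integer, which is the kind of mistake the commit message describes.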
Mallya, Ameya Keshava aefc865bf4 Fixed Workflow for updated KWS structure 2025-01-17 08:26:52 -08:00
Mallya, Ameya Keshava dcdaf7389e Add KWS Check (#9) 2025-01-15 12:05:09 -08:00
Mallya, Ameya Keshava c6ce9a5aa0 Create kws-caller.yml 2025-01-15 11:12:25 -08:00
Galantsev, Dmitrii 55ee3cc442 [SWDEV-495169] Update ROCm SMI CLI and Error handling (#3)
Issues include:

Update ROCm SMI displaying None or Not Supported to N/A
Update ROCm SMI displaying err msg to instead log err

Signed-off-by: Juan Castillo juan.castillo@amd.com
Change-Id: I1a2ce6e4f329666b5666664a7d7b4475d6c1cbc7
2025-01-14 17:15:18 -06:00
James Xu 67a0de4279 [SWDEV-501108] Update Doxygen note on rsmi_dev_pci_id_get
- To address https://github.com/ROCm/rocm_smi_lib/issues/208
where use of fake BDFs for partitions can cause confusion. This note
is already in the comments of the function definition, but was not
updated in the function declaration.
- Fix broken formatting for the location table for PCIE coordinate fields
- Tracked in SWDEV-501108

Change-Id: Ic85439866cb836bb43acc52314a7f1d026c3215d
2025-01-14 15:49:55 -06:00
Choudhary, Rahul bb122efeae Merge pull request #1 from AMD-ROCm-Internal/rahchoud_amdeng-patch-1
Create rocm_ci_caller.yml init file to call shared workflow
2025-01-07 12:11:25 -08:00
Choudhary, Rahul 3c01c13dfd Create rocm_ci_caller.yml init file to call shared workflow 2025-01-07 11:53:58 -08:00
gabrpham 6f51cd651e Fixed reset event issues
Issues include:
	SWDEV-480250
	SWDEV-480255
	SWDEV-480248

Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: Icf12211e4b136f26fce18f09a7bf8b7e9cd20691
2024-12-30 13:12:46 -05:00
Charis Poag 4de2168866 [SWDEV-496693] GPU metrics 1.7
Changes:
    - Added new GPU metrics:
      1) XGMI link status - Up/Down; 1 = up; 0 = down
      2) Graphics clocks below host limit (per XCP)
         accumulators -> used to help calculate a violation status
      3) VRAM max bandwidth at max memory clock
    - Updated rocm-smi --showmetrics to include new metrics.
    Units/values reflect as indicated by driver, may differ
    from AMD SMI or other ROCm SMI interfaces which
    use these fields.
    - N/A fields mean the device does not support providing this
    data.

Change-Id: I17b313345f15070a76b3a30dd8d5645d212d601b
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-12-15 16:48:13 -05:00
Charis Poag 8488518b1c [SWDEV-475712] Fix MI2x target_graphics_version
Removed the correction of target_graphics_version by
product name. Instead, detect a target_graphics_version which
needs to be corrected and populate it accordingly.

Change-Id: I90765c87e0629daea5c732dace8acfd17e8c62c7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-12-08 22:01:43 -06:00
Charis Poag d04cec7f1d [SWDEV-499029] Fix unable to change memory partition modes
Changes:
  * [API] Removed checking board name, fixes for other MI ASICs
  * [CLI] Increased progress bar to change memory partition modes
    to 140 seconds, since driver reload is variable per system

Change-Id: Ifcaf40d28b4adf5eaa800c9e3748d33749dc414a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-22 20:19:29 -05:00
gabrpham 5428d29b19 [SWDEV-478748] Changing PCIE Read/Write message TEST FAILURE to WARNING
Signed-off-by: gabrpham <Gabriel.Pham@amd.com>
Change-Id: I534a94b358f7fddbe3c11d249c6e090cf3fa121e
2024-11-13 15:05:26 -06:00
Peter Park c3f1d2baf1 [SWDEV-479054] update doc for rsmi_compute_process_info_get to note 2-step usage
Change-Id: I81608f807ab679a27be12be591193712d81232bd
Signed-off-by: Peter Park <peter.park@amd.com>
2024-11-13 12:52:18 -05:00
Charis Poag 46902274b6 [SWDEV-488276/SWDEV-497613] Update memory partition set functionality
Changes:
  - Added warning screen to ROCm SMI users
    setting memory partition
  - Added new API (rsmi_dev_memory_partition_capabilities_get)
    to retrieve memory partition capabilities
    (What users can set memory partition modes to)
  - Increased time-bar for CLI sets display to 40 seconds
  - API now waits until the driver reloads with SYSFS files active
  - [SWDEV-475712] [CLI/API] Fixed target_graphics_version field
    not properly displaying for MI2x or Navi 3x ASICs.
  - Updated tests

Change-Id: Iaf89d1b7ad9ceb449b289bc82ea198fe3b23992e
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-11-12 12:18:44 -04:00
Peter Park 568cc6e7c7 Update changelog fmt to internal standard
Change-Id: Icdb7eb59c6770f46ddae401f23b84cd06e6d3b09
2024-11-08 16:20:49 -05:00
Jorge López 35c1d00f5a Updates driverInitialized() to support amdgpu built as module as well as kernel built-in. Fixes ROCm/rocm_smi_lib#102 and is an updated version of ROCm/rocm_smi_lib#104
Change-Id: Icb3abe820bc67035b822358a1c04bd09a7c22b6b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Reviewed-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-11-05 16:30:37 -05:00
adapryor 61ed9e13f4 [SWDEV-412505] Handle mclk permission errors as not supported
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I25c9af42ed62697f87c70ecaeb153abe53401091
2024-10-31 15:18:03 -04:00
Oliveira, Daniel a1295714f2 [SWDEV-490187 / SWDEV-491215] Remove reset gpu partition + NPS test disabled
The reset gpu partition support for both compute and memory were removed

Code changes related to the following:
  * rsmi_dev_compute_partition_reset()
  * rsmi_dev_memory_partition_reset()
  * CLI
  * Unit tests
  * Documentation

Change-Id: I3fb8570dbf9e755ae70369587ef44bbf64e17fe8
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-10-21 14:22:57 -05:00
Charis Poag 0609cbf1d0 [SWDEV-422195/SWDEV-440985] GPU metrics 1.6 + --showmetrics
Changes:
- Added new GPU metrics:
  1) Violation status' (ex. PVIOL/TVIOL) accumulators
  2) XCP (Graphics Compute Partitions) statistics
  3) pcie other end recovery counter
- Added rocm-smi --showmetrics
Units/values reflect as indicated by driver, may differ
from AMD SMI or other ROCm SMI interfaces which
use these fields.
- N/A fields mean the device does not support providing this
data.

Change-Id: Ia2cd3bb65c4f474ebdb39db8062ea716f2b4d8ee
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-09-27 13:18:05 -04:00
Peter Park b7221c64b0 Add LD_LIBRARY_PATH note to rocm.docs pages
https://github.com/ROCm/rocm_smi_lib/pull/197
https://advanced-micro-devices-demo--197.com.readthedocs.build/projects/rocm_smi_lib/en/197/

Change-Id: I64386a4f364e40ce61ad9963376d2686db2aa36d
2024-09-26 11:18:44 -04:00
Harkirat Gill c61ab4fa28 Add cstdint header for gcc-15 compatibility
Common C++ headers (like <memory>) in GCC 15.0.0 (combined with libstdc++) don't transitively include uint64_t anymore.

Minimal reproducer: https://godbolt.org/z/dqGbnG8bY

Porting: https://github.com/ROCm/rocm_smi_lib/pull/198
Closes: https://github.com/ROCm/rocm_smi_lib/issues/191

Change-Id: I2786e968c107a78104c43c4c474b7f65eaf88c0a
2024-09-23 15:05:07 -04:00
James Xu 35496cabc4 Skip missing vram_str_path and sdma_str_path if sysfs files not created when passing some, but not all, GPUs to a docker image.
- This fix addresses SWDEV-456049 and probably SWDEV-442181 which
	have the same apparent root cause of an early exiting
	loop while enumerating GPU stats

Change-Id: I517329e06fa2c53205d8b6e002895e648ebf521c
2024-09-19 16:53:37 -04:00
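The skip-instead-of-exit behaviour described in the commit above (35496cabc4) can be sketched as follows. This is a minimal illustration, not the library's code: the function name `collect_stats`, the directory layout, and the stat file names passed in are all hypothetical.

```python
from pathlib import Path


def collect_stats(device_dirs, stat_names):
    """Read every stat file that exists and skip the ones that are missing.

    A missing file only skips that one entry; enumeration of the remaining
    files and devices continues instead of ending the loop early.
    """
    stats = {}
    for dev in map(Path, device_dirs):
        for name in stat_names:
            path = dev / name
            if not path.is_file():
                # Hypothetical case: this GPU was not passed to the container,
                # so its sysfs file was never created. Skip it and move on.
                continue
            stats[(dev.name, name)] = path.read_text().strip()
    return stats
```

Using `continue` rather than `break` or an early `return` is the point of the sketch: one absent file should not hide the statistics of GPUs enumerated later.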
James Xu fe6a49d186 SWDEV-478077 - logging.warn used instead of logging.warning
- logging.warn() is deprecated in favour of logging.warning()
- for some reason, this is the only place in all of rocm_smi.py
	that uses logging.warn() as pointed out on github
	https://github.com/ROCm/rocm_smi_lib/issues/187

Change-Id: Ie1e4a0ea16b996fbed2e902c8edfe68087a5a5fa
2024-09-16 13:50:26 -04:00
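The change in the commit above (fe6a49d186) amounts to using the non-deprecated spelling of the call. A minimal sketch follows; the logger name and the helper `warn_unsupported` are hypothetical and only serve to show the call.

```python
import logging

# Hypothetical logger name, used only for this example.
logger = logging.getLogger("rocm_smi_example")


def warn_unsupported(device: int, field: str) -> None:
    """Log a warning using the supported warning() method.

    The warn() alias is deprecated, so warning() is used instead.
    """
    logger.warning("GPU[%d]: %s is not supported", device, field)


if __name__ == "__main__":
    logging.basicConfig(format="%(levelname)s: %(message)s")
    warn_unsupported(0, "voltage curve")
```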
Oliveira, Daniel 72b112f8f3 [SWDEV-483822] rocm-smi shows 'warning' for unsupported curves
Options '--showvoltagerange' and '--showvc' show 'warning' instead of 'error' for unsupported voltage curves

Code changes related to the following:
  * CLI

Change-Id: Ide662c98202c32ad01ccaf3c47a61f2543f82ebb
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-09-10 11:36:36 -05:00
Charis Poag 6b8db74578 Fix rocm-smi --showfw displaying error fw prints
Updates:
  - [CLI] Previously --showfw displayed fw that
    does not exist on systems. This change removes
    that extra output.

Change-Id: If8b063001b80b03579ea1378dfd890c60f62ccd7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-08-27 15:43:16 -04:00
Ranjith Ramakrishnan 4ceffdca68 SWDEV-480347 - Don't terminate build for cpack bytecompile errors
In centos-7, python2 is used for cpack bytecompile. Using f-strings in code will result in a syntax error.
Setting _python_bytecompile_errors_terminate_build to 0 will ignore the errors

Change-Id: I43ecc99ae16627f4f5f91d0cca0398f6a003fa3c
2024-08-23 13:43:32 -07:00
Maisam Arif 055b023d2e Bump version tool:2.3.1+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ic67456d7484c2f5a0ce0e086e56b29e20d9d9745
2024-08-08 01:40:55 -05:00
Ranjith Ramakrishnan 1b828b735b SWDEV-476075 - Prevent the modification of interpreter directives
CPACK is converting /usr/bin/env python3 to /usr/libexec/platform-python in RHEL8.
Undefining __brp_mangle_shebangs will prevent the same

Change-Id: Id285e2cea1de583853cec17eccf0a3a794cca643
2024-08-05 09:50:04 -07:00
Ranjith Ramakrishnan c9201f7736 SWDEV-469004 - Append additional path to system path
rocm-smi is installed in /opt/rocm-ver/bin, but not as a soft link in the wheel package
For rocm-smi to work from the bin directory, it needs the extra path to find rsmiBindings.py

Change-Id: I41388f680cb2ab9f11dc135639b0d30b66082392
2024-07-31 19:52:46 -04:00
Maisam Arif c2235eea35 [SWDEV-464799] Handle UnicodeEncodeError with non UTF-8 locales
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: Ifb8e6e3c7891c4f70faba5441fb87cc4ba2302f3
2024-07-31 17:01:01 -04:00
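One way to handle the situation named in the commit above (c2235eea35) is to fall back to a lossy encoding when the output stream cannot represent a character. The sketch below is an assumption about the general technique, not the change actually made in the commit; `safe_print` is a hypothetical helper.

```python
import sys


def safe_print(text: str, stream=None) -> None:
    """Write text, replacing characters the stream's encoding cannot represent."""
    if stream is None:
        stream = sys.stdout
    try:
        stream.write(text + "\n")
    except UnicodeEncodeError:
        # Re-encode with errors="replace" so unsupported characters become "?"
        # instead of aborting the whole command.
        encoding = getattr(stream, "encoding", None) or "ascii"
        fallback = text.encode(encoding, errors="replace").decode(encoding)
        stream.write(fallback + "\n")


if __name__ == "__main__":
    safe_print("Temperature: 40°C")
```

With a non UTF-8 locale such as plain ASCII, the degree sign would be printed as a question mark rather than raising an exception.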
Charis Poag c11209f618 [SWDEV-475552/SWDEV-475351] Fix segfault TestComputePartitionReadWrite
In order to check partition IDs, we must continue to check the number of devices,
since this fluctuates with partition updates
and there are drm minor limitations.

For the drm minor limitation of 64, user must remove other drivers
using PCIe space. You can see these by:
ls /sys/class/drm

Recommended: rmmod any unneeded driver and reload amdgpu, in order to
ensure CPX can enumerate with all XCP (Graphic Cluster Partitions).

Change-Id: Ib663503f0b7264dce163f6ac2d50795fc8dc5eba
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-07-27 17:47:54 -05:00
Galantsev, Dmitrii dd954be887 Azure - Switch to amd-staging branch
Change-Id: I9e69d0d4f8ece2dfc7b3327f8486f0d3d8bbeba0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-07-23 17:07:05 -05:00
Maisam Arif db4d81b944 Bump version lib:7.3.0 tool:2.3.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I637b34e03580d5b5efb1e12805a9cdeb7778de74
2024-07-10 19:55:15 -05:00
Galantsev, Dmitrii e2e65cc7ad Fix return 0 on failed do_configureLogrotate
Fixes https://github.com/ROCm/rocm_smi_lib/issues/184

Change-Id: I206927835de8811df6813c7a9b0b92258d776894
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-07-08 11:25:10 -05:00
Charis Poag 323ab1105d [SWDEV-463213] Add partition ID fallback + new API
Changes:
- Added rsmi_dev_partition_id_get() -> uses fallback described
  below for devices which support partition updates.
- Updated/added to tests for partitions to reflect these changes.

Due to driver changes in KFD, some devices may report bits [31:28] or [2:0].
bits [63:32] = domain
bits [31:28] = partition id
bits [27:16] = reserved
bits [15:8]  = Bus
bits [7:3] = Device
bits [2:0] = Function (partition id may be in bits [2:0]) <-- Fallback for non SPX modes

Change-Id: Ia5641cfb8dbe2d1bff52f8eb81d5a159954528d3
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-06-27 17:27:01 -05:00
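The bit layout listed in the commit above (323ab1105d) can be decoded with plain shifts and masks. In the sketch below the field positions come from the commit message, while the helper names and the exact fallback rule (use bits [2:0] when bits [31:28] are zero) are assumptions for illustration; the real rsmi_dev_partition_id_get() may decide differently.

```python
def decode_bdf_id(bdf_id: int) -> dict:
    """Split a 64-bit BDF ID into the fields listed in the commit message."""
    return {
        "domain": (bdf_id >> 32) & 0xFFFFFFFF,   # bits [63:32]
        "partition_id": (bdf_id >> 28) & 0xF,    # bits [31:28]
        "bus": (bdf_id >> 8) & 0xFF,             # bits [15:8]
        "device": (bdf_id >> 3) & 0x1F,          # bits [7:3]
        "function": bdf_id & 0x7,                # bits [2:0]
    }


def partition_id_with_fallback(bdf_id: int) -> int:
    """Return the partition id, trying bits [31:28] first.

    Assumed rule (hypothetical): if bits [31:28] are zero, fall back to the
    function field in bits [2:0].
    """
    fields = decode_bdf_id(bdf_id)
    if fields["partition_id"] != 0:
        return fields["partition_id"]
    return fields["function"]


if __name__ == "__main__":
    primary = (3 << 28) | (0x41 << 8)   # partition id carried in bits [31:28]
    fallback = (0x41 << 8) | 2          # partition id carried in bits [2:0]
    print(partition_id_with_fallback(primary), partition_id_with_fallback(fallback))
```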
Bill(Shuzhou) Liu 57e8e72b79 Change error message for concise json/csv
The error message is changed to not supported instead of errors.

Change-Id: I28bd1e009770674389534be12519cc34673ba846
2024-06-21 16:16:36 -04:00
Ranjith Ramakrishnan d54bade574 SWDEV-468081 - Remove package provides field from RPM and DEB package
The provides tag is required when the package provides a virtual package.
Package name along with version will be provided by default and the provides tag is not required for this.
Using the tag for providing the name, but without version was resulting in package upgrade issues.

Change-Id: I74506d8c3bbd75d028bcdc03525c29541dce2b4c
2024-06-18 18:27:53 -04:00
Galantsev, Dmitrii 12c8237705 Fix assignment of member dismiss_
This patch fixes error:
    error: assignment of member 'dismiss_' in read-only object

Reported by kind Gentoo folks:
<https://bugs.gentoo.org/918709>

Change-Id: I7cc593043e97402afd85593c528ace86952b1350
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-06-17 16:56:46 -04:00
Sam Wu 2d7d7c449a [ROCDOC-593] Update Read the Docs configuration to Python 3.10 and latest rocm-docs-core
Change-Id: Ia086cd708f5bfcff71780cc104afe1e0908923c9
2024-06-12 15:06:24 -06:00
Roopa Malavally 2fd36e33ad ROCm SMI Documentation Reorg
Change-Id: I3e4db2c50a43a51eeea4d3e06ba4811ad1859368
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2024-05-31 16:25:35 -05:00
Galantsev, Dmitrii 10f3c2325c SWDEV-464886 - Fix ASAN REGEX error in cmake
Change-Id: Iaa5deed3ac833ebf1a010b98cfd4493359653ffe
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-30 16:42:00 -05:00
Galantsev, Dmitrii 2096c8225c SWDEV-464886 - Fix REGEX error in cmake
Simplify rocm-core dependency handling

Change-Id: I07de1d40e4a3c90481c2de3abe9aac3dbfdd6d93
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-30 14:54:44 -05:00
Joseph Macaranas 69c74d696b Fix path typo
Change-Id: If6d539447f29bc5ac0449b7c8a717ae31c9f4bf0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-27 11:17:02 -05:00
Ranjith Ramakrishnan 9f7e69bd5e SWDEV-442738 - Static package generation for rocm_smi_lib
Package name will have suffix static-dev/devel

Change-Id: Ia273a66c663c56b023f6d765d024b30f1c35639d
2024-05-21 13:31:00 -04:00
Galantsev, Dmitrii 0c3c3442e0 Azure - Add rocm-ci.yml
Change-Id: If5db660729c732d96eb66897f0339850db98bb6b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-05-15 12:54:10 -05:00
Oliveira, Daniel e7d54946fb fix: [MIT-License] [rocm/rocm_smi_lib]
Updates the license to MIT

Code changes related to the following: None

Change-Id: I62d0a5f02a2d5e58c1952337dff54892793c16cf
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-15 01:38:36 -04:00
Oliveira, Daniel 497ef4a7ef fix: [SWDEV-461904] [rocm/rocm_smi_lib]
Checks returned error by rsmi_dev_od_volt_info_get() before assert

Code changes related to the following:
  * Unit tests

Change-Id: Icc0f329e35992aae19f07243024521181467bcd3
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2024-05-14 18:25:00 -05:00
Bill(Shuzhou) Liu 8c44416410 Discover the amdgpu when card numbers are not consecutive.
When discovering the amdgpu devices, if the assigned card numbers are not
consecutive, not all GPUs can be discovered. The code is changed to discover
the GPUs based on the max card number.

Change-Id: I8b6a8b49594d6a54c7feb2645bedb83dc5c1b4cc
2024-05-08 13:59:16 -05:00
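The discovery approach described in the commit above (8c44416410) can be sketched as follows. The function name `discover_cards` and the sample directory entries are hypothetical; the sketch only shows the idea of probing every index up to the highest card number instead of stopping at the first gap.

```python
import re

# Matches entries such as "card0" but not "card0-DP-1" or "renderD128".
CARD_RE = re.compile(r"^card(\d+)$")


def discover_cards(entries):
    """Return the card entries present, even when their numbers have gaps."""
    indices = {int(m.group(1)) for e in entries if (m := CARD_RE.match(e))}
    if not indices:
        return []
    max_card = max(indices)
    # Walk the full range up to the max card number; a missing index is
    # skipped rather than treated as the end of the list.
    return [f"card{i}" for i in range(max_card + 1) if i in indices]


if __name__ == "__main__":
    print(discover_cards(["card0", "card2", "renderD128", "card5", "card2-DP-1"]))
```

A loop that counts upward and stops at the first missing index would report only card0 for this input, which is the failure mode the commit message describes.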
Maisam Arif 9c16cc8baf Bump version lib:7.2.0 tool:2.2.0+hash
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I07138dad67d796fb8c2dd418a384f663dd8532c0
2024-05-07 21:04:29 -05:00