Commit graph

493 commits

Author SHA1 Message Date
Galantsev, Dmitrii 548b68cb67 .editorconfig - Remove broken whitespace rule
Change-Id: I67260f1f1952609dc89834d0763acd732bf39860
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-08-22 16:51:20 -04:00
Galantsev, Dmitrii 62f01cb150 TESTS - Use gpu version as a workaround for a missing name
Depends-On: Ifbd38f11fbde7ba28af4be1d611310dea1b5112a
Change-Id: Ia7b7975f03424854df0a470b2719cf2ff2cf8e40
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-08-21 19:18:22 -04:00
Bill(Shuzhou) Liu a10f00bf57 Fallback to kfd node when VRAM sysfs not available
The driver may not expose VRAM sysfs on certain systems. Add a
fallback to the kfd node for that case.

Change-Id: Ib3be71b4f4d2c79318d5026b0a97f3657d8a97b6
2023-08-17 14:36:03 -05:00
Charis Poag 755e14dbad [SWDEV-399953] Smart Temperature detection + partitioning display
* Updates:
    - Fix for devices that have junction sensors but no edge sensors
    - Added partitioning (memory and dynamic) displays for
      base rocm-smi CLI calls
    - Added subheading for base rocm-smi call output
    - Added better hwmon and device detection logging

Change-Id: I8219884b2e532d6ed379527cacdc1f2b232a5451
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-08-10 19:53:38 -04:00
Oliveira, Daniel cc5ab079df Fix rsmitstReadWrite.TestPowerReadWrite test failure
Code changes related to the following:
  * All reinforcement work moved to their own files
  * Self contained changes only to support them
  * New files added to CMakeLists.txt

Change-Id: I761e91f54392824df9145eaed8b9805986861285
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-08-09 21:51:05 -05:00
Ranjith Ramakrishnan 9406cdd832 SWDEV-366827 - Disable file reorg backward compatibility support by default
Change-Id: I1de06d0d6a30c8c862d768b58460ef1b49d15e29
2023-08-07 09:21:19 -07:00
Charis Poag 9c7eed7edc [lib] Enhance Logger: gpu_metrics + enable console out
* Updates:
    - Env variable RSMI_LOGGING=0 or any other value
        -> all logging off
    - Env variable RSMI_LOGGING=1 -> logs only
    - Env variable RSMI_LOGGING=2 -> console only
    - Env variable RSMI_LOGGING=3 -> both logs + console
    - Metrics output includes hexdump of current file
      and decoded metrics (functions: logHexDump
      and log_gpu_metrics)
    - System info gathered now includes the system's
      perceived endianness (little or big endian),
      helpful for viewing decoded hexdump or any
      binary translation
    - Added templates for printing unsigned hex
      (print_unsigned_hex_and_int), unsigned integers
      (print_unsigned_int), and printing both unsigned
      hex and int with an optional header
      (print_unsigned_hex_and_int)
    - Fixed some build warnings/errors, e.g. strncpy
      calls for SKU or board names (this operation is
      expected and needed); failed temp file writes
      now properly return RSMI_STATUS_FILE_ERROR
    - Fixed logrotate not initializing properly on
      RHEL 8.8/9.x

Change-Id: Ifa0f0218c9cafd0a8cd6aa8e7f94d61e9107200f
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-08-01 21:46:19 -05:00
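
The RSMI_LOGGING levels listed in commit 9c7eed7edc can be read as a small lookup table. The following is a minimal sketch of that mapping only, not the library's implementation; the function name `logging_targets` is hypothetical.

```python
import os


def logging_targets(value):
    """Map an RSMI_LOGGING value to (log_file_enabled, console_enabled).

    Hypothetical helper: it only mirrors the levels described in the
    commit message, it is not code from rocm_smi_lib.
    """
    table = {
        "1": (True, False),   # logs only
        "2": (False, True),   # console only
        "3": (True, True),    # both logs + console
    }
    # "0", an unset variable, or any other value -> all logging off
    return table.get(value, (False, False))


to_file, to_console = logging_targets(os.environ.get("RSMI_LOGGING"))
```

A plain dictionary keeps the "any other value means off" rule in one place: every unknown value falls through to the default.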
Bill(Shuzhou) Liu 0522439ac2 Crash when ecc count sysfile cannot be read
Replace assert with error handling code.

Change-Id: I6500ae4d38a8caea87828aa7d76373d20c8354c7
2023-07-31 08:36:53 -05:00
Bill(Shuzhou) Liu aeb6c61f54 Change reset power error message to logging
Since the reset will continue if the reset power and current power
are the same, an error message may confuse the user.

Change-Id: I35b9ef17afd47b5af5bd2b8882a44f63991fe509
2023-07-27 15:18:28 -05:00
Bill(Shuzhou) Liu 80d650b95a Handle csv output when the command is not based on the device
Fix the error where only one csv line is printed when the output
is not based on a device.

Change-Id: Idacc5d98acc223e932fb3d46c888bfa04778b73c
2023-07-26 15:28:18 -05:00
Maisam Arif c78ec46671 SWDEV-394316 - Handle not applicable vbios
Change-Id: I3390078a63c9a5eff67024b84a3be1369c4b1460
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
2023-07-25 16:33:22 -05:00
Charis Poag 4613e8dec3 Update logging and README for other project usage
Updates:
    * [rocm-smi] Logging now can update files on
      per-project-basis for install/remove
    * [rocm-smi] README now has latest build
      instructions, including test builds
    * [rocm-smi] Updated README to include
      revision dates

Change-Id: Ifb19a6f32ccf6938f47225db53fef88021909264
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-07-20 19:09:11 -05:00
Oliveira, Daniel 573620f586 Add revision to --showhw
Code changes related to the following:
  * Added 'rsmi_dev_revision_get()' related code
  * Test code
  * Functional tests

Change-Id: I8c2097c65384a028c8c8437b717d05d52fe45250
Signed-off-by: Oliveira, Daniel <daniel.oliveira@amd.com>
2023-07-18 16:17:33 -05:00
Galantsev, Dmitrii 8fe848d10e Fix sys and id tests
The following read tests were failing:
*.TestIdInfoRead
*.TestSysInfoRead

1. *.TestIdInfoRead failed because rsmi_dev_brand_get did not specify
   dependency on vbios_version.

2. *.TestSysInfoRead failed because the test didn't expect vbios_version to
   be missing, which is new behavior in Aqua Vanjaram.

Change-Id: I9ee88a12fcf6cff2032049e2ecdfb2957efb03ab
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-07-17 15:52:23 -04:00
Galantsev, Dmitrii b0fe2fbd07 Add .cache to gitignore
Change-Id: Ida03bf1f50704bea44827d7578cd74c1896d4368
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-07-17 15:52:23 -04:00
Bill(Shuzhou) Liu 0aeb6025bd rocm-smi --showevents shows wrong gpuID
Use the gpuid returned from the event data instead.

Change-Id: I7f286cc105f7ea12985223e603504f0ef3d9724e
2023-07-13 08:28:53 -05:00
Galantsev, Dmitrii e6c42c6626 Simplify gitignore
Remove generic gitignore to simplify tracking of generated files

Change-Id: Idf1f9719b2cfd16b31332a3ed87be5943c2c1ce7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-07-07 11:48:09 -04:00
Jeremy Newton 2d2c73a5e6 Fix python loading of librocm_smi64
The librocm_smi64.so is used for development, while
librocm_smi64.so.MAJOR is used for runtime, thus the python front end
should not be loading the .so binary, but rather the .so.MAJOR binary.

As well, it's good not to hardcode "lib" as some distros will change
this.

rsmiBindings.py is now generated with CMake

Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
Change-Id: I7cb745f8936fdf10d3ebd6c1e606031f713184ca
2023-07-06 09:52:56 -04:00
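
The idea in commit 2d2c73a5e6 (load the versioned runtime library and avoid a hardcoded "lib" directory) can be sketched with `ctypes`. This is an illustration under assumptions, not the generated rsmiBindings.py: the helper names, the `/opt/rocm` prefix, the `lib64` directory, and the major version `"5"` are all placeholders that a build system would normally fill in.

```python
import ctypes
import os


def runtime_library_name(major):
    """Name of the runtime binary (.so.MAJOR), not the development .so."""
    return "librocm_smi64.so.{}".format(major)


def candidate_paths(prefix, lib_dir, major):
    """Paths to try, using a configurable library directory name.

    lib_dir is a parameter because some distros use a name other
    than "lib"; prefix and major are placeholders for this sketch.
    """
    name = runtime_library_name(major)
    # Try the install location first, then let the dynamic loader search.
    return [os.path.join(prefix, lib_dir, name), name]


def load_library(paths):
    """Return the first library that loads, or None if none can be loaded."""
    for path in paths:
        try:
            return ctypes.CDLL(path)
        except OSError:
            continue
    return None


# Example call with placeholder values (hypothetical prefix and version).
rocm_smi = load_library(candidate_paths("/opt/rocm", "lib64", "5"))
```

On a machine without the library installed, `rocm_smi` is simply `None`, so a caller can report the problem instead of crashing at import time.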
Jeremy Newton 828f46b445 Only install asan license if enabled
Change-Id: I79c6fce84c23ed12e65db8e234a29dbfedd11f68
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-30 23:34:43 -04:00
Jeremy Newton 4f481dd7f3 Actually fix version string
There seems to be a scope issue with the existing variables, but just
putting in the pkg version string seems sufficient.

Change-Id: I4ccef872ff848a70cb2abc07bf605c5f29a608e8
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-30 23:34:14 -04:00
Tom Rix 19c3e2aff9 Improve handling of ConstructBDFID errors
Building this package on Fedora reports this warning:
In file included from rpmbuild/BUILD/rocm_smi_lib-rocm-5.5.1/src/rocm_smi_main.cc:62:
In member function 'amd::smi::Device::set_bdfid(unsigned long)',
    inlined from 'amd::smi::RocmSMI::Initialize(unsigned long)' at rpmbuild/BUILD/rocm_smi_lib-rocm-5.5.1/src/rocm_smi_main.cc:330:27:
rpmbuild/BUILD/rocm_smi_lib-rocm-5.5.1/include/rocm_smi/rocm_smi_device.h:199:42: warning: 'bdfid' may be used uninitialized [-Wmaybe-uninitialized]
  199 |     void set_bdfid(uint64_t val) {bdfid_ = val;}
      |                                   ~~~~~~~^~~~~
rpmbuild/BUILD/rocm_smi_lib-rocm-5.5.1/src/rocm_smi_main.cc: In member function 'amd::smi::RocmSMI::Initialize(unsigned long)':
rpmbuild/BUILD/rocm_smi_lib-rocm-5.5.1/src/rocm_smi_main.cc:324:12: note: 'bdfid' was declared here
  324 |   uint64_t bdfid;
      |            ^~~~~

Only set the bdfid when it is known to be valid.

Signed-off-by: Tom Rix <trix@redhat.com>
Change-Id: I839b4d2d2d4e3b25469cf5972245b9630da00c87
2023-06-30 00:16:44 -04:00
Jeremy Newton 74dc98114f Update default version to match tags
When building from github, these tags don't exist, so the defaults
should try to match the internal tags

Change-Id: Id570341f27e21916b1a7f3605ee2b5b9716cad9b
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-30 00:16:22 -04:00
Jeremy Newton 1a86dd75bb Fix version file generation
This looks like a typo, as the following variables are not defined:
- AMD_SMI_LIBS_TARGET_VERSION_MAJOR
- AMD_SMI_LIBS_TARGET_VERSION_MINOR
- AMD_SMI_LIBS_TARGET_VERSION_PATCH

Change-Id: I43449e7bd2a2de643d33e79fad063a7859679c8d
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-29 14:42:30 -04:00
Jeremy Newton d00d885394 Fix python script install permissions
The keyword "PROGRAMS" should be used in place of "FILES" in order to
make sure executable scripts have the correct permissions.

Change-Id: I6c287dc1291774ad6d97a04d621957dea0a1b697
Signed-off-by: Jeremy Newton <Jeremy.Newton@amd.com>
2023-06-27 14:57:59 -04:00
Bill(Shuzhou) Liu 910bf677a9 Crash if no hwmon sysfs
Return NOT_SUPPORTED if no hwmon sysfs.

Change-Id: I01356a21f004ab552ca6ef7ffb49934bfdfd5e31
2023-06-26 08:00:32 -05:00
Galantsev, Dmitrii 82078565e9 SWDEV-406542 - Add gtest to install targets
Change-Id: I116505aaa33109fce66ab8daf9921e2de11a27d4
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-20 11:14:56 -05:00
Galantsev, Dmitrii 9519d5b8cf SWDEV-391041 - Disable TestPowerReadWrite
Change-Id: I56b5bea3e5206a6f0d5ecdb482103881f80f0b8b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-16 15:18:27 -04:00
Galantsev, Dmitrii e7585cc045 Assign tests to aqua_vanjaram
Change-Id: Iee78b1e810356327261006087b081e39dab0b9e8
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-16 15:18:27 -04:00
Bill(Shuzhou) Liu d9b6af7a09 Expand showpids to provide more details
Provide details of GPU usage by an application.

Change-Id: I0f36df7d358754c2c8a60432b736d98f667ee99c
2023-06-16 08:52:18 -04:00
Galantsev, Dmitrii 0478d53e23 SWDEV-340919 - Package rsmitst
Similar to I879b21428e6642f19fda67092b365d8b78b7ba7b.

Main CMake improvements:

* Add rsmitst with -DBUILD_TESTS=ON
* Package tests into rocm-smi-lib-tests.deb and .rpm
* Note - this breaks build_rsmitst.sh

Misc improvements:

* Add .editorconfig to normalize code formatting
* Export compile_commands.json
* Remove gtest source and pull from github instead

Change-Id: Ib87ed4a5acd9f78badae6d028e5ff3d4f56dafc2
Depends-On: I8b26795471ad1432c805e45d8b58d7bb34abfcfc
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-13 22:52:10 -05:00
Galantsev, Dmitrii ac94bf5ed5 Temporarily ignore TestFrequencies
See SWDEV-391039 and SWDEV-391040 for details

Change-Id: I662ba43363d949465454ea4af4d4586b3d47a811
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-12 19:26:21 -05:00
Galantsev, Dmitrii 713f85721b --showtempgraph - Show N/A when no temp found
If temp in hwmon was missing - rocm-smi crashed.
e.g. /sys/class/drm/card1/device/hwmon/hwmon5/temp1_input

This change displays "N/A" for temp instead of crashing.

Change-Id: I02f84a466bd3acfbd9b65e7e4ca0f18e76606c3b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-12 19:16:39 -05:00
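
The behavior described in commit 713f85721b (show "N/A" instead of crashing when a hwmon temperature file is missing) can be sketched as follows. This is a hedged illustration, not the rocm-smi code: the function name `read_temp_celsius` is hypothetical, and the sketch relies on hwmon `temp*_input` files reporting millidegrees Celsius.

```python
def read_temp_celsius(path):
    """Return the temperature as a string in degrees Celsius, or "N/A".

    Hypothetical helper. A missing, unreadable, or non-numeric file
    yields "N/A" rather than raising an exception.
    """
    try:
        with open(path) as f:
            millidegrees = int(f.read().strip())
    except (OSError, ValueError):
        return "N/A"
    return "{:.1f}".format(millidegrees / 1000.0)


# Path taken from the commit message; it may not exist on a given system.
print(read_temp_celsius("/sys/class/drm/card1/device/hwmon/hwmon5/temp1_input"))
```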
Maisam Arif 00e170c2f5 SWDEV-404157 - Fixed printLog delimiter parsing
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I3d8e22d185790f4325aeacc18e4bfcfe8777d356
2023-06-08 20:02:51 -05:00
Galantsev, Dmitrii f78f9a4082 Fix test temp blacklist, ignore TestVoltCurvRead
Change-Id: I86fa14fdc06e1b170a0bc0c0727fc08e4f4e2074
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-06 17:02:14 -04:00
Charis Poag e2dec17284 [SWDEV-402336 + SWDEV-398070] Fix RPM install part2
Updates:
    [rocm-smi] RPM installation comment included a macro,
    now removed

Change-Id: Ifa7a8d2d1a713940c39e20df9d02635e0e623dd8
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-06-05 13:50:57 -05:00
Galantsev, Dmitrii e8391c9d7c Clean-up python errors and warnings
Used pyright to show errors and warnings and resolved most

Change-Id: I0fdf7dcdf08db5c35dec80f6645e0a395fbe4197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-06-01 17:37:57 -04:00
Charis Poag b0f2a9d2ef [SWDEV-402336 + SWDEV-398070] Fix RPM install - override macros
Updates:
    * [rocm-smi] RPM installation now overrides macro usage

Change-Id: I2a5ba14670becc178f672182eabe71965a526178
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-06-01 11:58:42 -04:00
Galantsev, Dmitrii 2048f8978f Fix memset compile warning
Change-Id: If31210f3c6038e56f43ae8631ed1657d1509488e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-05-31 21:54:32 -04:00
Bill(Shuzhou) Liu a6467c4083 Fallback to gpu_metrics if the sysfs is not available
The gpu_metrics may provide the required PCI link width and speed.

Change-Id: I939d733f5f6a71088545ba042345eb1b6ad20ee5
2023-05-24 14:51:43 -05:00
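
The fallback pattern in commit a6467c4083 (try the sysfs file first, then gpu_metrics) can be sketched as a chain of readers. This is only an illustration of the pattern: `first_available` and the two reader functions are hypothetical stand-ins, and the returned width value is made up.

```python
def first_available(*readers):
    """Call each reader in order and return the first non-None result."""
    for reader in readers:
        value = reader()
        if value is not None:
            return value
    return None


def read_link_width_from_sysfs():
    # Stand-in for a sysfs read; None models "file not available".
    return None


def read_link_width_from_gpu_metrics():
    # Stand-in for decoding gpu_metrics; 16 is a made-up value.
    return 16


width = first_available(read_link_width_from_sysfs,
                        read_link_width_from_gpu_metrics)
```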
Bill(Shuzhou) Liu 160c99d12d SWDEV-400644: Reset the mutex only if errors
To prevent resetting the mutex while it is in use, only reset the mutex
if it cannot be acquired.

Change-Id: I95e0ed1bf543f285ce81b4df9c51e16a88081d38
2023-05-22 11:20:44 -04:00
Charis Poag c3a095a180 [SWDEV-398070] Adding logging to ROCm SMI (by default off)
Updates:
    * [rocm-smi] Provide a thread-safe logging feature
    * [rocm-smi] Adding logrotation into install/upgrade/remove
      scripts
    * [rocm-smi] Updated cmake lists to include rocm_smi_logger
    * [rocm-smi] Updated DEB/RPM install/remove logging file &
      folder with all users having r/w privileges for
      /var/log/rocm_smi_lib/ROCm-SMI-lib.log
    * [rocm-smi] Added ability to do a glob search for multiple files
      (globFileExists), assists doing file searches with * strings
    * [rocm-smi] Added ability to log system details when RSMI_LOGGING
      is turned on (getSystemDetails())
    * [rocm-smi] Added logging to provide which ROCm API is being called
      when RSMI_LOGGING is on
    * [rocm-smi] Added logging to provide SYSFS path and read value,
      when RSMI_LOGGING is on. Provides error response on failure.
    * [rocm-smi] Added environment variable RSMI_LOGGING to control
      when logging is enabled or disabled. By default, by not
      setting this env. variable, logging is turned off. When
      setting RSMI_LOGGING=<any value>, logging is enabled
      which is placed in /var/log/rocm_smi_lib/ROCm-SMI-lib.log file.
      Setting RSMI_LOGGING is allowed in both debug and release builds.
    * [rocm-smi] Removed an initialize procedure which keeps
      debug_inf_loop. Seems this feature is not being used.

Change-Id: I79b48387609c6233c6f05b04fb8bba66b68c2399
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-05-17 21:18:52 -05:00
Sam Wu ed74bc6eca sphinx documentation
ref: https://github.com/RadeonOpenCompute/rocm_smi_lib/pull/119

fix formatting in docs/index.md

Change-Id: I940ef8147a40bd3b702aa591bd56557a870621fb
2023-05-11 10:41:45 -04:00
Ranjith Ramakrishnan daffcdb930 SWDEV-383221 - Set the default value of ROCM_HEADER_WRAPPER_WERROR to OFF
Using wrapper header files will result in #warning message by default

Change-Id: I8941a96bdc1b921a7646ccb353130cb283957ff8
2023-05-08 16:56:52 -07:00
Charis Poag 6be92b9e26 [SWDEV-392571] Fix concise info when missing VRAM info
Updates:
    * [rocm-smi] Added larger app width size, which helps
      display missing device info
    * [rocm-smi] Added better context when rsmi_ret_ok
      does not return with RSMI_STATUS_SUCCESS
    * [rocm-smi] Removed all references to an
      undefined function (printLogNoDev())
    * [rocm-smi] Fixed not detecting non-int
      values when setting the voltage curve
    * [rocm-smi] Added better context on missing
      sysfs file when setting clock overdrive
      values
    * [rocm-smi] Fixed getMemInfo() calls not
      referencing tuple values (making it easier
      to read)
    * [rocm-smi] Silenced concise info spitting
      out errors for missing VRAM files, instead
      display which metric is "unsupported" if
      the files are missing
    * [rocm-smi] Updated function descriptions for
      rsmi_ret_ok & getMemInfo
    * [rocm-smi] Updated getMemInfo to provide a
      quiet call, to silence for concise info calls.
      This provides a way to keep the output clean.
    * [rocm-smi-lib] When using debug sysfs
      files, now states which enums are enabled
      for debug

Change-Id: I0e9e0c97ccf71467ced0e1a1f71803327a8be2b7
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-04-13 15:11:35 -04:00
Bill(Shuzhou) Liu b6789891b0 Validate the clock frequency when setting it
Add a check of the clock frequency when setting it.

Change-Id: I707291bfb5007bb69100c780af50a4b0f697bb37
2023-04-06 11:54:38 -04:00
Charis Poag 78a0812f7f [SWDEV-391036 + SWDEV-392933] Fixes for VoltRead and ComputePart.
Updates:
    * VoltRead - needed to properly send out RSMI_STATUS_NOT_SUPPORTED
      when device does not have voltage hwmon files
    * ComputePart. - test failure was likely caused by conflicts
      with EvtNotif (exact cause unknown). Test passes when
      moving it ahead of the event notifier. Both API calls may have
      a system resource issue, TBD.
    * rocm_smi_example - now indicates when an API call
      returns RSMI_STATUS_NOT_SUPPORTED or
      RSMI_STATUS_NOT_YET_IMPLEMENTED. Allows example to fully complete
      on systems which may not provide support for all API calls.

Change-Id: I520b8584e078d412414e8e5797c664220a7e823a
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-04-05 12:44:29 -05:00
Bill(Shuzhou) Liu 58c83eb379 Increase the max BDF ID length
Increase the max length from 256 to 512.

Change-Id: I3114f7ce6852aafa9dfec0186f27c1121c939c69
2023-03-29 10:04:28 -04:00
Bill(Shuzhou) Liu 0c82a9d577 Correct subsystem name by matching device id.
The rsmi_dev_subsystem_name_get() only matches subvendor id and
subdevice id for a vendor. The change will also match device id.

Change-Id: Ife3aedaf6fc7390ed7fa62edbde40c2340689b23
2023-03-28 15:48:31 -05:00
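
Commit 0c82a9d577 adds the device id to the match key for subsystem names. The sketch below shows why that matters. It is not the library's lookup code: the table, the device and subdevice ids, and the board names are made up, and only 0x1002 (AMD's PCI vendor id) is a real value.

```python
# Hypothetical ID table: (vendor, device, subvendor, subdevice, name).
SUBSYSTEMS = [
    (0x1002, 0x1111, 0x1002, 0x0001, "Board A"),
    (0x1002, 0x2222, 0x1002, 0x0001, "Board B"),
]


def lookup_without_device(vendor, subvendor, subdevice):
    """Old behavior as described: match subvendor and subdevice for a vendor."""
    for v, _d, sv, sd, name in SUBSYSTEMS:
        if (v, sv, sd) == (vendor, subvendor, subdevice):
            return name
    return None


def lookup_with_device(vendor, device, subvendor, subdevice):
    """New behavior as described: the device id must match as well."""
    for v, d, sv, sd, name in SUBSYSTEMS:
        if (v, d, sv, sd) == (vendor, device, subvendor, subdevice):
            return name
    return None


# Two devices share subvendor/subdevice ids; only the 4-part key tells them apart.
old_name = lookup_without_device(0x1002, 0x1002, 0x0001)
new_name = lookup_with_device(0x1002, 0x2222, 0x1002, 0x0001)
```

With the made-up table above, the three-part key returns the first entry for both devices, while the four-part key returns the entry that belongs to device 0x2222.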
AravindanC 778f3b7fdc SWDEV-351540 - ASAN packaging for rocm_smi_lib
Change-Id: Iab354d02d261a0270a3d118b825835fc6f021c15
2023-03-20 13:14:53 -07:00
Charis Poag f44d1ea8bc [SWDEV-387906] Fix rocm-smi initialize crash
Fix was needed due to hwmon updates.
Several voltage sensors (ex. vddgfx/vddnb)
are unsupported or not applicable
to upcoming hardware. This was not the case
for previous hardware sensors, resulting in
the rocm-smi crash observed.

Change-Id: Ib8593e10811638def26fc7a1eda29309e328db09
Signed-off-by: Charis Poag <Charis.Poag@amd.com>
2023-03-17 15:04:34 -05:00