Commit Graph

175 commits

Author SHA1 Message Commit Date
Maisam Arif d9adf280cd Updated RDC to use AMD-SMI 24.6.0 structs
Signed-off-by: Maisam Arif <Maisam.Arif@amd.com>
Change-Id: I9ef0f3cb786c1238e53cf21df5c6afafac829175


[ROCm/rdc commit: 7c6bd4dc1c]
2024-05-31 10:37:39 -05:00
Galantsev, Dmitrii a80dfd4f00 Add memory bandwidth metrics
Change-Id: I310ca8af0536497be619d2bda1e540d1f11c2565
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 53033a5b77]
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii cff2ac8490 Add rocprofiler_example.cc and fix logging
Change-Id: Ib3ed8754f314edc76ea56bfec9a645d720f8926d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c7fcb1ad25]
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii 83cf97e280 Profiler - Add all required metrics
Change-Id: Iea3938df9407789c061c3a6ead9167a69069d6e6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c3a4c899d5]
2024-05-09 23:24:02 -05:00
Galantsev, Dmitrii 8b317a6490 Add rocprofiler plugin
Rename ROCR -> Runtime and ROCP -> Profiler

Change-Id: If90953da8fa5d695b681813dad4a3e7ec26a9c7e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 234b2d835b]
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 24f30a6ee3 Error if power metric inaccessible
Change-Id: I359c24f24d0200181646d5a7c13a6e0e4d4958b6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 1f5fa94132]
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 93b990ffa0 AMDSMI - Add ring hang event
Change-Id: I84696e3cc1a4eba8de48e464f1a208ed9c6e489d
Depends-On: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5525bf8c86]
2024-05-03 16:45:42 -05:00
Bill(Shuzhou) Liu 79897be094 Add new XGMI and PCIE bandwidth fields from gpu_metrics
On new ASICs, RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. The new fields RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used instead.

Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283


[ROCm/rdc commit: 61a75d346b]
2024-05-03 16:18:17 -04:00
Galantsev, Dmitrii 028355dff0 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9702d0f2d7]
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii c314326da0 Revert "Sort the ROCr gpu index based on BDF"
Fix 'rdcd diag' compute and system tests.
This reverts commit 4acaddc32d.

Change-Id: Ia092c46649c1d6338fb96ffe7e6feba4b045f027


[ROCm/rdc commit: 662cc0f8b2]
2024-04-09 10:27:19 -05:00
Galantsev, Dmitrii 53ecc0fc81 Remove -X from .hsaco files
Change-Id: I1f1b4f07eb854ce2e254564b83719be52b553b02
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9d55c26247]
2024-03-27 20:35:08 -05:00
Galantsev, Dmitrii dd257bfcac CMAKE - Find hsa-runtime64
Change-Id: Id877eb9cfcc61d81993a6a43703ef2e5f72e1e8f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 6d5d9971c2]
2024-02-19 23:49:38 -05:00
Galantsev, Dmitrii 3c18db8861 SWDEV-444700 - CMAKE - Fix RUNPATH
These RUNPATH changes make it so libraries can be found without setting
LD_LIBRARY_PATH.

Mostly tested on installed RDC binaries and libraries. The
build binaries should also work.

Change-Id: Ifd908a5b61d24dfcbb1d08d21b4ee830156d8643
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 32806681ca]
2024-02-13 16:56:28 -06:00
Galantsev, Dmitrii 185245cafa CMAKE: Reduce install messages size
Change-Id: I6fa7cfe986b1de702492a96bddbfd406501bba50
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: aa5448fc16]
2024-02-06 00:31:32 -06:00
Bill(Shuzhou) Liu d1efa59fe8 Fallback to junction temperature and socket power
If the card does not have an edge temperature, fall back to the junction
temperature. If the card only has socket power, then use socket
power instead.

Change-Id: I053a67a89cf3b29a34e82123f522c08d7dd68916


[ROCm/rdc commit: 5cfe2b4169]
2024-02-05 10:10:26 -06:00
Galantsev, Dmitrii 703d6c0d44 Use templates for module population
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.

Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: f9e80cc37a]
2024-01-10 00:27:09 -06:00
Galantsev, Dmitrii 38c60ff90b RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eaa1862a80]
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii 61cf14d7cc Simplify ModuleMgr
Change-Id: I3a57876c73e50771fcedb7ca4c67d55ac406b34d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 95e057c88d]
2024-01-09 11:37:11 -06:00
Bill(Shuzhou) Liu 4acaddc32d Sort the ROCr gpu index based on BDF
The rocm-smi index is changed to sort based on BDF. The rocr plugin
is also changed based on that.

Change-Id: I5851431db336d50266b253dec1894a7bd9f3554b


[ROCm/rdc commit: 61a2773875]
2023-11-16 09:07:22 -05:00
Galantsev, Dmitrii d4440d392e Upgrade to CXX-17 gtest-1.14
Change-Id: I1c7316f151128cbc9318b226dac14950e399d2c7
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 8f9a6796f1]
2023-09-28 12:54:49 -05:00
Galantsev, Dmitrii a337dc062b SWDEV-392942 - Disable rocmtools
Temporarily disable rocmtools because of hsa_shut_down issues

Change-Id: I5e8b6729b8200ccdd5c399862bfc632ba69f884c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 90e824c63b]
2023-04-05 13:20:19 -05:00
Bill(Shuzhou) Liu df95a71a09 Rebuild rdc_ras library on Ubuntu 20.04
Rebuild the rdc_ras library on Ubuntu 20.04 for backward compatibility.
Fall back to rocm_smi for ECC errors if the rdc_ras library is not available.

Change-Id: I8db9687e3eb54a6f62fce2c8d57a796c6da6b5c4


[ROCm/rdc commit: 29551b1fd0]
2023-03-16 10:02:15 -04:00
Galantsev, Dmitrii c1a76d532a SWDEV-380364 - Resolve dmon + rocmtools halt
* Move hsa_init out of rocmtools and into RDC
* Remove secondary hsa_shut_down from ROCR module

Change-Id: I57d84d41ddc51595b98e734265f10bc5129a7352
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Depends-On: I2b389ee1a9ba3507b2df1fc2fe83598f67731aac


[ROCm/rdc commit: 24b3f138e9]
2023-02-02 18:33:14 -06:00
Galantsev, Dmitrii c59365f813 Remove rocmtools environment variable
- Set ROCMTOOLS_METRICS_PATH inside rdcd
- Add nullptr checks for rocmtools library functions

Change-Id: Ibbe4fed90df20e68b1a7971533765d831860c16f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 35edaa2322]
2023-01-16 19:16:26 -06:00
Galantsev, Dmitrii 5c803f6b03 SWDEV-352414 - Fix gRPC linker issues
- Replace gRPC library with gRPC package
- Relax RUNPATH
- Make LINKER_FLAGS global

The gRPC package includes its dependencies:
SSL, UPB, ABSL, etc.

Change-Id: Ieb198ad96e26e89b09cb85986214a5b1451b17a6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 3e4c55ec6c]
2023-01-04 18:50:07 -06:00
Galantsev, Dmitrii eccb4e202c Add rocmtools support
This commit adds integration with ROCmTools

Additional changes:
- Fix DEB and RPM installation issue when systemd is not present
- Fix typos in rdc.h
- Wrap negative values in parentheses in rdc.h
- CMAKE: Improve rocm_smi searching
- README: Improve formatting, add info about ROCmTools

Metrics added: 700-714
Metrics can be listed with `rdci dmon --list-all`
The majority of the metrics are only supported by Instinct (MI) series GPUs
700 RDC_FI_PROF_ELAPSED_CYCLES should be available on most devices
See README for more information

Change-Id: I907d3eacdc92fc5588ca6c76c2fa1ce0ad900770
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 861a843ed7]
2022-12-16 12:19:59 -06:00
Galantsev, Dmitrii 2b89ab397c Improve CMake and relocate tests
- Respect CMAKE_INSTALL_PREFIX and ignore RDC_CLIENT_INSTALL_PREFIX
- Move example and rdctst from rocm/bin to rocm/share/rdc
- Add README for examples

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Change-Id: I0b1d996d206327fd1b51ac6e82d548829bdb1570


[ROCm/rdc commit: f6efd7fbf6]
2022-10-27 13:49:54 -05:00
Galantsev, Dmitrii 9ff80828e5 Compile rdctst and improve CMakeLists
Main CMake improvements:

* Add rdctst with -DBUILD_TESTS=ON
* Set default ROCM_DIR to /opt/rocm/
* Split rdc_libs/CMakeLists.txt into subdirectories
* Package tests into rdc-tests.deb and .rpm

Misc improvements:

* Add .editorconfig to normalize code formatting
* Add .gitignore
* Expand RPATH for gRPC to reduce LD_LIBRARY_PATH usage
* Export compile_commands.json
* Show warning and do not install gRPC if GRPC_ROOT is left as default
* Move .in files into relevant subdirectories
* Move most variables into project CMakeLists.txt to avoid redefinitions
* Normalize CMakeLists.txt formatting (4 spaces indentation)
* Rename DIAGNOSTIC_LIB to RDC_ROCR_LIB
* Update gRPC version in README to 1.44.0
* Remove gtest source
* Pull gtest from github if not installed

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Depends-On: I1039ef61247e3f0ff822925cc869fb0c2bf3af85
Change-Id: I879b21428e6642f19fda67092b365d8b78b7ba7b


[ROCm/rdc commit: 2c171767b3]
2022-10-07 13:58:50 -05:00
Ranjith Ramakrishnan 3df8b88ca6 File reorganization with backward compatibility
SWDEV-291455 - Binaries, header files and libraries installed in the bin, include and lib folders under /opt/rocm-ver
Prebuilt ras library with updated search path
cmake config files in lib/cmake/rdc
grpc, sp3, hsaco and private libraries installed in lib/rdc
config installed in share/rdc
authentication and python_binding installed in libexec/rdc
Backward compatibility added for header files and libraries

Depends-On: I3f3d192935923f71737b3fe55ded536654a73dd7
Change-Id: Ia1a6cadc59034b155631a1ee5fdbe692d2a8a71b


[ROCm/rdc commit: 52a3463147]
2022-08-04 23:42:42 -07:00
Bill(Shuzhou) Liu fa3a258bb6 Add MI200 kernel files for RDC diagnostic
Add the kernel files compiled for MI200.

Change-Id: Ib61795809c14457e332a77d7182992f245ff5b31


[ROCm/rdc commit: 179bd293ef]
2022-01-11 09:28:30 -05:00
Bill(Shuzhou) Liu 8c772e1b90 Fix the compile error for gcc-11
Fix the error: 'sleep_for' is not a member of 'std::this_thread'

Change-Id: If25ef03023df17081878f9b44c3a68195f07c653


[ROCm/rdc commit: adfa89631d]
2021-10-26 15:36:52 -04:00
Bill(Shuzhou) Liu 6b700f8005 Support GPU memory test and compute queue test using Rocr
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.

Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9


[ROCm/rdc commit: 78e2f2486b]
2021-10-21 11:01:12 -04:00
Bill(Shuzhou) Liu 57f1f72eb6 Add cmake target for RDC
RDC will provide cmake files exporting the INCLUDE/LIBRARY targets.

Change-Id: I8e8aeff426c45eae823d988f6473424ccf29687c


[ROCm/rdc commit: a640e5c821]
2021-09-28 13:53:44 -04:00
Icarus Sparry 506a3072e9 Add dependency on rocm-core
Signed-off-by: Icarus Sparry <icarus.sparry@amd.com>
Change-Id: I5783b116b098bc8ebad62a4fad407a29c80f19af


[ROCm/rdc commit: 13c550d861]
2021-07-27 08:43:48 -04:00
Bill(Shuzhou) Liu fa9c6ad6f8 Add the RdcSmiDiagnostic module
Provides an RdcSmiDiagnostic module, which calls rocm_smi_lib.

It supports the following diagnostics: get GPU topology, check GPU
parameters, and check processes running on the GPUs.

The gRPC client-side and server-side diagnostic functions are added.

The diag module is added to rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd


[ROCm/rdc commit: 76ccf58008]
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu ac15d50b0c rdcd process uses 100% CPU
rdcd uses another thread to listen for GPU events. That thread
runs in a tight loop which consumes 100% CPU.
The fix adds a sleep to yield the CPU.

Bug: SWDEV-291576
Change-Id: I7996720aab4a80346d79b1c73ee532d2abcd93cc


[ROCm/rdc commit: 5a4bf97327]
2021-06-18 13:49:45 -04:00
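The fix described in commit ac15d50b0c (sleeping inside the listener loop so it stops spinning) could look roughly like the sketch below. This is not the rdcd code: `listen_for_events`, `max_polls` and `idle_sleep` are hypothetical names, and the event check itself is left as a placeholder.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical listener loop: polls until `stop` is set or `max_polls`
// iterations have run, sleeping between polls to yield the CPU.
// Returns the number of polls performed.
int listen_for_events(const std::atomic<bool>& stop, int max_polls,
                      std::chrono::milliseconds idle_sleep) {
  int polls = 0;
  while (!stop.load() && polls < max_polls) {
    ++polls;  // placeholder: a real listener would check for GPU events here
    // Without this sleep the loop would spin and occupy a full core.
    std::this_thread::sleep_for(idle_sleep);
  }
  return polls;
}
```

The sleep interval trades event latency against CPU use; the value rdcd actually uses is not stated in the commit message.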
Bill(Shuzhou) Liu 7d7a5bfd1c Disable bulk fetch. Add environment variable to enable it
RDC can optimize by bulk fetching multiple metrics using a single
rocm_smi call. However, currently this is not completely supported in
all ASIC generations. By default disable this for now.

Set environment variable RDC_BULK_FETCH_ENABLED=TRUE to enable
RDC bulk fetch.

BUG: SWDEV-289316

Change-Id: Ibb55514f198356dccf5f47bb0fd2d53c17acb251


[ROCm/rdc commit: 673f5a4ee1]
2021-06-09 15:53:17 -04:00
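As an illustration of the opt-in switch from commit 7d7a5bfd1c, a check of the RDC_BULK_FETCH_ENABLED variable might be sketched as follows. The function name `bulk_fetch_enabled` is hypothetical, and the exact, case-sensitive comparison with "TRUE" is an assumption; the commit message does not say how RDC parses the value.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper: bulk fetch stays disabled unless the environment
// variable is set. Assumption: only the exact string "TRUE" enables it.
bool bulk_fetch_enabled() {
  const char* value = std::getenv("RDC_BULK_FETCH_ENABLED");
  return value != nullptr && std::string(value) == "TRUE";
}
```

Defaulting to "disabled" when the variable is unset matches the behaviour the commit describes.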
Bill(Shuzhou) Liu f504f697e3 The Diagnostic API interface
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module, which can load the
library dynamically and then dispatch diagnostic tests to run.

Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb


[ROCm/rdc commit: eab3625d65]
2021-05-27 10:59:11 -04:00
Bill(Shuzhou) Liu 307b8ee085 The RDC returns power_usage 0
RDC tries to bulk fetch power usage from gpu_metrics. If the
gpu_metrics value is 0, it falls back to rsmi_dev_power_ave_get().

Change-Id: I57d165d6af0c91b39798c89eef317d4e5df2d0f6


[ROCm/rdc commit: eafb948115]
2021-05-12 09:59:36 -04:00
Chris Freehill b9ac3ffbd9 Add grpc to rdc_libs RUNPATH
librdc_client was failing to find libgrpc when rdci is
started.

Change-Id: Idba5d237e7c45bbee92759aed2521c32babe7a5e


[ROCm/rdc commit: 26b32f5a08]
2021-02-12 09:43:59 -06:00
Chris Freehill 8b1c887834 Turn on/off DAC capabilities as needed
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.

To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.

Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f


[ROCm/rdc commit: 6b5aeaaa23]
2021-02-04 12:25:26 -06:00
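The scope-limited access idea behind ScopedCapability in commit 8b1c887834 is an instance of RAII. The sketch below shows the general pattern only: `ScopedAccess` and `configure_with_write_access` are hypothetical names, and the enable/disable callbacks stand in for whatever the real struct does to raise and drop the capability.

```cpp
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical RAII guard: runs `enable` on construction and `disable`
// on destruction, including when the scope is left by an exception.
class ScopedAccess {
 public:
  ScopedAccess(std::function<void()> enable, std::function<void()> disable)
      : disable_(std::move(disable)) {
    enable();
  }
  ~ScopedAccess() { disable_(); }
  ScopedAccess(const ScopedAccess&) = delete;
  ScopedAccess& operator=(const ScopedAccess&) = delete;

 private:
  std::function<void()> disable_;
};

// Example use: `write_enabled` models the temporary write permission.
// Returns true if configuration succeeded, false if it threw.
bool configure_with_write_access(bool& write_enabled, bool fail) {
  try {
    ScopedAccess guard([&] { write_enabled = true; },
                       [&] { write_enabled = false; });
    if (fail) throw std::runtime_error("configuration failed");
    return true;
  } catch (const std::runtime_error&) {
    return false;  // the guard's destructor has already revoked access
  }
}
```

Because the revocation lives in the destructor, the process is not left with the extra permission on either exit path.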
Chris Freehill 7cf47fb5c9 Add to/correct handling of RDC_EVNT_XGMI_*_THRPUT events
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.

Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16


[ROCm/rdc commit: a9d0e037b5]
2021-01-29 14:56:46 -05:00
Bill(Shuzhou) Liu cd50afa74c Install grpc lib to rdc folder
Install the grpc lib to rdc/grpc/lib and add missing libraries.

Add "--no-as-needed" and all extra grpc libraries in rdci/rdcd, as
RUNPATH will only search direct dependencies.

Change-Id: I596acb2eb3a7228d703e79db64699bc20d0e7c09


[ROCm/rdc commit: 07d4d5376e]
2021-01-25 14:55:45 -05:00
Bill(Shuzhou) Liu f41c146bc4 Bulk fetch metrics from rocm_smi_lib
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.

If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it will fall back to fetching them one by one.

Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1


[ROCm/rdc commit: 51efe26442]
2021-01-07 16:40:37 -05:00
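The bulk-then-individual strategy from commit f41c146bc4 can be sketched as below. This is an illustration under assumptions, not the RDC wrapper: `fetch_metrics`, `BulkFetch` and `SingleFetch` are hypothetical, and the two callbacks stand in for the rocm_smi_lib calls.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <optional>
#include <vector>

using MetricId = int;
using MetricMap = std::map<MetricId, std::int64_t>;
// Hypothetical callbacks: a bulk fetch may fail entirely (nullopt) or
// return only some of the requested metrics.
using BulkFetch =
    std::function<std::optional<MetricMap>(const std::vector<MetricId>&)>;
using SingleFetch = std::function<std::optional<std::int64_t>(MetricId)>;

// Try one bulk call first, then fetch anything still missing one by one.
MetricMap fetch_metrics(const std::vector<MetricId>& ids,
                        const BulkFetch& bulk, const SingleFetch& single) {
  MetricMap result;
  if (auto bulk_values = bulk(ids)) {
    result = *bulk_values;
  }
  for (MetricId id : ids) {
    if (result.count(id) != 0) continue;  // already obtained in bulk
    if (auto value = single(id)) {
      result[id] = *value;
    }
  }
  return result;
}
```

Metrics that neither path can provide are simply absent from the result, so callers can tell "unsupported" apart from a zero reading.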
Bill(Shuzhou) Liu 17d5758923 Add raslib fields to RDC
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
  for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL fields are
  removed from RdcSmiLib.cc and are now obtained from raslib.

Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6


[ROCm/rdc commit: 81ad23343c]
2020-12-01 10:56:36 -05:00
Chris Freehill 79b5e54d3b Add event notification support and rdci timestamps
Also:
* print header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc


[ROCm/rdc commit: b278cd379b]
2020-11-22 07:10:39 -05:00
Bill(Shuzhou) Liu dbacfc2d6a Add a CMake option to build RDC library only
When RDC is used only as a library, the user can choose not to build
rdci and rdcd, which removes the dependencies on gRPC and protoc.
-DBUILD_STANDALONE=off should be passed to cmake.
* Change README.md for the instructions.
* Move the python_binding installation from client/CMakeLists.txt to CMakeLists.txt
  so that the RDC library only build will also install the folder.
* Change CMakeLists.txt and rdc_libs/CMakeLists.txt to build with gRPC only if
  the BUILD_STANDALONE is enabled.

Change-Id: If9cfe9fc298a83636d85fe352a311fe2fe041661


[ROCm/rdc commit: 105675aeeb]
2020-11-11 08:48:40 -05:00
Bill(Shuzhou) Liu 753d5fed6d Support watch() and unwatch() in RDC module framework
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for libraries to
  include.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
  It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.

Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562


[ROCm/rdc commit: 151520b97e]
2020-09-18 16:02:31 -04:00
Bill(Shuzhou) Liu a88fc1829c Add new init and destroy API for RAS library.
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();

When RDC loads librdc_ras.so, it will call rdc_module_init().
When RDC exits, it will call rdc_module_destroy().

Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca


[ROCm/rdc commit: 72691cc024]
2020-09-09 14:01:03 -04:00
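The init/destroy pairing from commit a88fc1829c can be pictured with the sketch below. Only the two function signatures come from the commit message; `ModuleSession` is hypothetical, `rdc_status_t` is replaced by a plain `int` stand-in with 0 assumed to mean success, and the step of looking the symbols up in librdc_ras.so is left out.

```cpp
#include <cstdint>

// Stand-in for the status type from rdc.h (assumption: 0 means success).
using rdc_status_t = int;
constexpr rdc_status_t kStatusSuccess = 0;

// Function-pointer types matching the two signatures named in the commit.
using ModuleInitFn = rdc_status_t (*)(std::uint64_t flags);
using ModuleDestroyFn = rdc_status_t (*)();

// Hypothetical wrapper: calls init once when the module is loaded and
// destroy once on exit, but only if init succeeded.
class ModuleSession {
 public:
  ModuleSession(ModuleInitFn init, ModuleDestroyFn destroy,
                std::uint64_t flags)
      : destroy_(destroy),
        ready_(init != nullptr && init(flags) == kStatusSuccess) {}
  ~ModuleSession() {
    if (ready_ && destroy_ != nullptr) destroy_();
  }
  ModuleSession(const ModuleSession&) = delete;
  ModuleSession& operator=(const ModuleSession&) = delete;

  bool ready() const { return ready_; }

 private:
  ModuleDestroyFn destroy_;
  bool ready_;
};
```

Checking the pointers for null mirrors the case where a library does not export the optional entry points.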