Commit graph

96 commits

Author SHA1 Message Date
Bill(Shuzhou) Liu ffc5db221b Install libprotobuf under lib64 folder
When grpc is compiled on SLES, libprotobuf is created under the lib64
folder. Install it to the lib folder as well.

Change-Id: I9ccf2133c3b1b71e623d9009a86cf580a19e76cf
2022-02-04 09:02:31 -05:00
Bill(Shuzhou) Liu 7eeb7f9388 Add rpm License header
Add rpm License header for cpack

Change-Id: I3e8d05abe69749abe6ce28751e7da9bb229aa08d
2022-01-20 13:33:08 -05:00
Bill(Shuzhou) Liu 0273dd6b9e Add license file to rdc package
Install LICENSE.txt to share/doc/rdc

Change-Id: Ife9872aa745cb6fcf79976bf6453098a6594572a
2022-01-18 10:50:31 -05:00
Bill(Shuzhou) Liu 179bd293ef Add MI200 kernel files for RDC diagnostic
Add the kernel files compiled for MI200.

Change-Id: Ib61795809c14457e332a77d7182992f245ff5b31
2022-01-11 09:28:30 -05:00
Bill(Shuzhou) Liu adfa89631d Fix the compile error for gcc-11
Fix the error: 'sleep_for' is not a member of 'std::this_thread'

Change-Id: If25ef03023df17081878f9b44c3a68195f07c653
2021-10-26 15:36:52 -04:00
Bill(Shuzhou) Liu 78e2f2486b Support GPU memory test and compute queue test using Rocr
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.

Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9
2021-10-21 11:01:12 -04:00
Bill(Shuzhou) Liu 6ab71e1a4a Correct the install path of grpc
Correct the grpc install path, which was missing a $.

Change-Id: I17736a81ee24d2abc680a3646b1536efafcb3d69
2021-10-19 17:11:35 -04:00
Bill(Shuzhou) Liu a640e5c821 Add cmake target for RDC
RDC will provide cmake files exporting the INCLUDE/LIBRARY targets.

Change-Id: I8e8aeff426c45eae823d988f6473424ccf29687c
2021-09-28 13:53:44 -04:00
Icarus Sparry 2a1a002f74 Add owner write to rdci permissions
Needed to work around a debian packaging bug if debug information is
being produced in a separate package.

Signed-off-by: Icarus Sparry <icarus.sparry@amd.com>
Change-Id: Ieab3cc3515eeeb952159acea3dc1effd14613eeb
2021-09-22 18:11:50 +00:00
Bill(Shuzhou) Liu bd034263d4 Make RDC respect the LD_LIBRARY_PATH passed by cmake
RDC overrides LD_LIBRARY_PATH to force use of the current grpc
path. This change also adds the original LD_LIBRARY_PATH to it.

Change-Id: I48da84c3135c6ede129c3cb9148dbb1896b652c3
2021-09-17 15:52:30 -04:00
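The LD_LIBRARY_PATH handling described in bd034263d4 can be illustrated with a minimal Python sketch. The function name `build_library_path`, the sample paths, and the ordering (grpc directory first, original entries after it) are assumptions made for illustration; the actual change lives in RDC's CMake files, which are not shown in this log.

```python
import os


def build_library_path(grpc_lib_dir, original=None):
    """Put grpc_lib_dir first, then keep whatever the caller already had.

    Hypothetical helper: the name and the ordering are assumptions.
    """
    if original is None:
        original = os.environ.get("LD_LIBRARY_PATH", "")
    # Drop empty segments so the result never contains a stray ":".
    kept = [part for part in original.split(":") if part]
    return ":".join([grpc_lib_dir] + kept)
```

Overwriting the variable with only the grpc directory would hide libraries that the caller supplied, which is the problem the commit message describes.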
Bill(Shuzhou) Liu 6f95200387 Add -g compiler option for ADDRESS_SANITIZER
Add -g compiler option for Address Sanitizer

Change-Id: I5c4a72dd06a7242715c537fc0d44770b126862d2
2021-08-03 13:52:21 -04:00
Icarus Sparry 13c550d861 Add dependency on rocm-core
Signed-off-by: Icarus Sparry <icarus.sparry@amd.com>
Change-Id: I5783b116b098bc8ebad62a4fad407a29c80f19af
2021-07-27 08:43:48 -04:00
Bill(Shuzhou) Liu 76ccf58008 Add the RdcSmiDiagnostic module
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.

It supports the following diagnostics: get GPU topology, check GPU
parameters, and check processes running on the GPUs.

The grpc client-side and server-side diagnostic functions are added.

The diag module is added to the rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu 5a4bf97327 rdcd process uses 100% CPU
rdcd uses a separate thread to listen for GPU events. That thread
runs in a tight loop, which consumes 100% CPU.
The fix adds a sleep to yield the CPU.

Bug: SWDEV-291576
Change-Id: I7996720aab4a80346d79b1c73ee532d2abcd93cc
2021-06-18 13:49:45 -04:00
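The fix in 5a4bf97327 follows a common pattern: a polling loop that sleeps between iterations instead of spinning. The sketch below shows that pattern in Python. It is not the rdcd code, and the names `listen_for_events`, `poll`, and `interval_s` are hypothetical.

```python
import threading
import time


def listen_for_events(poll, stop_event, interval_s=0.01):
    """Call poll() repeatedly until stop_event is set; return the poll count."""
    polls = 0
    while not stop_event.is_set():
        poll()
        polls += 1
        # Without this sleep the loop would spin and pin one CPU core.
        time.sleep(interval_s)
    return polls
```

The sleep interval trades event latency against CPU usage; the value used by rdcd is not stated in the commit message.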
Bill(Shuzhou) Liu 673f5a4ee1 Disable bulk fetch. Add environment variable to enable it
RDC can optimize by bulk fetching multiple metrics using a single
rocm_smi call. However, this is not yet fully supported on
all ASIC generations, so it is disabled by default for now.

Set environment variable RDC_BULK_FETCH_ENABLED=TRUE to enable
RDC bulk fetch.

BUG: SWDEV-289316

Change-Id: Ibb55514f198356dccf5f47bb0fd2d53c17acb251
2021-06-09 15:53:17 -04:00
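An opt-in switch like the one in 673f5a4ee1 can be sketched as follows. Only the variable name RDC_BULK_FETCH_ENABLED and the value TRUE come from the commit message; the helper name `bulk_fetch_enabled` and the exact-match comparison are assumptions, since the log does not say how RDC treats other spellings.

```python
import os


def bulk_fetch_enabled(environ=None):
    """Return True only when RDC_BULK_FETCH_ENABLED is exactly "TRUE".

    Exact matching is an assumption of this sketch.
    """
    env = os.environ if environ is None else environ
    # Unset or any other value keeps the default: bulk fetch disabled.
    return env.get("RDC_BULK_FETCH_ENABLED", "") == "TRUE"
```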
Bill(Shuzhou) Liu eab3625d65 The Diagnostic API interface
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module, which can load the
library dynamically and then dispatch diagnostic tests to run.

Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
2021-05-27 10:59:11 -04:00
Bill(Shuzhou) Liu eafb948115 The RDC returns power_usage 0
RDC tries to bulk fetch power usage from gpu_metrics. If the
gpu_metrics value is 0, it falls back to rsmi_dev_power_ave_get().

Change-Id: I57d165d6af0c91b39798c89eef317d4e5df2d0f6
2021-05-12 09:59:36 -04:00
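The fallback described in eafb948115 can be sketched in Python as below. The function name `read_power_usage`, the `"power_usage"` dictionary key, and the callable `read_average_power` are hypothetical stand-ins; the callable represents the rsmi_dev_power_ave_get() call named in the commit message.

```python
def read_power_usage(bulk_metrics, read_average_power):
    """Prefer the bulk-fetched value; fall back when it is reported as 0."""
    power = bulk_metrics.get("power_usage", 0)
    if power == 0:
        # Stand-in for the rsmi_dev_power_ave_get() call named in the commit.
        power = read_average_power()
    return power
```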
Bill(Shuzhou) Liu ee40694cc6 Change the CMake version to 3.15
CMake 3.15 or greater is required for gRPC

Change-Id: I15cda7b2bccb2bc1c46b3bb84eb1116a15ce32a4
2021-04-09 15:43:46 -04:00
Chris Freehill 7a05145542 Fix some lintian errors
Fix lintian errors related to maintainer, postinst script and
permissions.

Change-Id: I6924ff92ff5453fa7e562a6188c2c91cea87df68
2021-03-03 19:35:24 -06:00
Bill(Shuzhou) Liu 3aa95b210a Create prebuilt raslib package for RDC
Create a folder for the prebuilt raslib, which contains the RAS binary
and configuration files. The CMakeLists.txt is changed to include
those files.

Change-Id: I530198cff5686a19e58096c87457ab8b7c52d5f3
2021-03-01 15:49:01 -05:00
Bill(Shuzhou) Liu 5b4fbe08d2 Change CMakeLists.txt to include the libras
The CMakeLists.txt is changed to add instructions to build raslib.

Change-Id: I0779046f28cbc7af292c83f3ae3ed7bcda5c57eb
2021-02-23 14:49:18 -05:00
Bill(Shuzhou) Liu 7ca7a571a7 RDC Prometheus plugin returns errors when --rdc_gpu_indexes is used
When the above option is used, the plugin returns errors:
  result = rdc.rdc_group_gpu_add(rdc_handle, gpu_group_id, gpu)
  ctypes.ArgumentError: argument 3: <type 'exceptions.TypeError'>: wrong type

rdc_prometheus.py is changed to convert the string to an integer.
RdcUtil.py is also changed to raise exceptions properly.

Change-Id: I9535091ff1fc8882cccd32e5f2810da5241768c3
2021-02-23 14:15:04 -05:00
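The ctypes error quoted in 7ca7a571a7 arises because command-line values arrive as strings while the native call expects an integer. A minimal sketch of the conversion step follows. The helper name `to_gpu_indexes` is hypothetical and this is not the code in rdc_prometheus.py.

```python
def to_gpu_indexes(raw_values):
    """Convert command-line strings such as ["0", "2"] to integers."""
    indexes = []
    for raw in raw_values:
        try:
            indexes.append(int(raw))
        except ValueError as err:
            # Fail early with a clear message instead of a ctypes error later.
            raise ValueError(f"invalid GPU index: {raw!r}") from err
    return indexes
```

Converting before the ctypes call means a bad value is reported against the option the user typed rather than as an argument-type error deep in the binding.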
Chris Freehill 26b32f5a08 Add grpc to rdc_libs RUNPATH
librdc_client was failing to find libgrpc when rdci is
started.

Change-Id: Idba5d237e7c45bbee92759aed2521c32babe7a5e
2021-02-12 09:43:59 -06:00
Chris Freehill 6b5aeaaa23 Turn on/off DAC capabilities as needed
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.

To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.

Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f
2021-02-04 12:25:26 -06:00
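ScopedCapability in 6b5aeaaa23 is a C++ struct. The closest Python analogue to that scope-limited pattern is a context manager, sketched below. The name `scoped_capability` and the `enable`/`disable` callables are hypothetical, and no real capability API is called here.

```python
from contextlib import contextmanager


@contextmanager
def scoped_capability(enable, disable):
    """Enable a capability for the duration of a with-block, then disable it."""
    enable()
    try:
        yield
    finally:
        # Runs on normal exit and also when the body raises.
        disable()
```

The `finally` clause plays the role of the C++ destructor: the capability is dropped even if configuration fails with an exception.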
Chris Freehill a9d0e037b5 Add to/correct handling of RDC_EVNT_XGMI_*_THRPUT events
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.

Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16
2021-01-29 14:56:46 -05:00
Bill(Shuzhou) Liu b8746f7fc0 Disable Autoprov when build rpm
RDC may install third-party libraries in its local folder. We
need to disable Autoprov to prevent the rpm from adding provides
for them.

Change-Id: I18d008c2ca2ecb6bba64467d78b1f8c3a6585aea
2021-01-28 09:34:32 -05:00
Bill(Shuzhou) Liu 07d4d5376e Install grpc lib to rdc folder
Install the grpc lib to rdc/grpc/lib and add missing libraries.

Add "--no-as-needed" and all extra grpc libraries in rdci/rdcd, as
RUNPATH will only search direct dependencies.

Change-Id: I596acb2eb3a7228d703e79db64699bc20d0e7c09
2021-01-25 14:55:45 -05:00
Bill(Shuzhou) Liu 51efe26442 Bulk fetch metrics from rocm_smi_lib
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.

If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it will fall back to fetching them one by one.

Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1
2021-01-07 16:40:37 -05:00
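The bulk-fetch-with-fallback behaviour described in 51efe26442 can be sketched as below. The names `fetch_metrics`, `bulk_fetch`, and `fetch_one` are hypothetical, and using NotImplementedError to signal an unsupported device is an assumption of this sketch, not the rocm_smi_lib interface.

```python
def fetch_metrics(fields, bulk_fetch, fetch_one):
    """Fetch all fields in bulk where possible, one by one otherwise."""
    try:
        results = dict(bulk_fetch(fields))
    except NotImplementedError:
        # The device does not support bulk fetch at all.
        results = {}
    for field in fields:
        if field not in results:
            # This field could not be bulk fetched; read it individually.
            results[field] = fetch_one(field)
    return results
```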
Bill(Shuzhou) Liu ceb562d630 Add the Address Sanitizer Support for RDC
Change the CMakeLists.txt to add -fsanitize=address.
Refer to jira ticket SWDEV-259873

Change-Id: Ie37fd661787eaea16f366b925d9a97db233cd136
2021-01-07 12:11:12 -05:00
Ashutosh Mishra 0054349862 Adapting to rocm_smi_lib changes
The name of rocm-smi-lib64 has been changed to rocm-smi-lib
Hence updating the requisite name here for resolving installation dependency

Depends-On:  Ib37d29aedc20b610619f6921f4147b41c0eaf134
Change-Id: I4efd778b72d43ad8f0842410a94ac1e3d3b9192a
2020-12-10 11:05:20 -05:00
Bill(Shuzhou) Liu 9bf6e630d6 rdci dmon Segmentation fault if fields do not contain events
Fix the core dump observed in dev test.

Change-Id: Ib008aeeee2f415174dbb0c4ba301b3f9d6d2d54b
2020-12-07 16:52:14 -05:00
Freddy Paul fe1593dda5 RDC: Move rdc daemon to rocm path.
Installing files to standard path across each version and using
ldconfig has issues with side-by-side install.

Usage of RUNPATH/RPATH for ROCm to ensure all ROCm libraries are
picked without the need for ldconfig.

For RDC server to be picked up by systemctl, service config file
shall be a symlink from /lib/systemctl/system/rdc.service to
corresponding RDC file path in a given version of ROCm

For side-by-side install packages of RDC, post-install scripts
will be removed. Hence the user will have to set the symlink explicitly
for now.

Change-Id: I916da7cf132f0f9c667e2470fac2b0875e3db9d0
2020-12-04 14:43:06 -05:00
Bill(Shuzhou) Liu 81ad23343c Add raslib fields to RDC
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
  for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
  removed from RdcSmiLib.cc and will be obtained from raslib.

Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6
2020-12-01 10:56:36 -05:00
Bill(Shuzhou) Liu 4b3dbc4697 Use relative path to find librdc_bootstrap.so
The python script searches a list of installation folders to
find librdc_bootstrap.so.

Change-Id: I52e444e6d153c318c731c4b2cd0d8e39b0fd31ca
2020-11-30 13:46:15 -05:00
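Searching a list of folders relative to the script, as 4b3dbc4697 describes, might look like the sketch below. The helper name `find_library` and any directory names passed to it are hypothetical; the commit message does not list the folders actually searched.

```python
import os


def find_library(script_dir, relative_dirs, name="librdc_bootstrap.so"):
    """Return the first existing candidate path, or None if none exists."""
    for rel in relative_dirs:
        candidate = os.path.normpath(os.path.join(script_dir, rel, name))
        if os.path.isfile(candidate):
            return candidate
    return None
```

A caller would typically pass `os.path.dirname(os.path.abspath(__file__))` as `script_dir`, so the lookup keeps working wherever the package is installed.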
Chris Freehill b278cd379b Add event notification support and rdci timestamps
Also:
* print header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
2020-11-22 07:10:39 -05:00
Bill(Shuzhou) Liu 105675aeeb Add a CMake option to build RDC library only
When RDC is used only as a library, the user can choose not to build
rdci and rdcd, which removes the dependencies on gRPC and protoc.
-DBUILD_STANDALONE=off should be passed to cmake.
* Change README.md for the instructions.
* Move the python_binding installation from client/CMakeLists.txt to CMakeLists.txt
  so that the RDC library only build will also install the folder.
* Change CMakeLists.txt and rdc_libs/CMakeLists.txt to build with gRPC only if
  the BUILD_STANDALONE is enabled.

Change-Id: If9cfe9fc298a83636d85fe352a311fe2fe041661
2020-11-11 08:48:40 -05:00
Bill(Shuzhou) Liu bb6d98b036 The collectd plugin for RDC
Two files are added to the python_binding folder:
* The rdc_collectd.py is a collectd plugin to store the RDC
  metrics to the collectd round robin database.
* The rdc_collectd.conf is a configuration file which can control
  which fields to collect, how frequently the fields can be collected
  and run the plugin in embedded mode.

Change-Id: Ief44d004376ca8a82ed0d8ad36805243acb47080
2020-11-10 14:26:49 -05:00
Chris Freehill 0030f27ff8 Move grpc to ROCm install dir
Having grpc installed outside of ROCm dir is problematic
for multiple, simultaneous ROCm installations.

Change-Id: I5ad458ad01a76786339607d708b48534f15b137b
2020-10-24 21:46:10 -05:00
Bill(Shuzhou) Liu aeba7b0f91 Integrate RDC with Grafana
A new Grafana dashboard file rdc_grafana_dashboard_example.json
has been added to the folder python_binding. Users can import
this dashboard to monitor multiple compute nodes.

To display the host name only in the dashboard, the
rdc_prometheus_example.yml is also changed to create a new label
short_instance which will not have the port number.

Change-Id: I9ab91838006d59c8dcb5fea01decb8c799484e1d
2020-10-15 14:12:15 -04:00
Bill(Shuzhou) Liu 151520b97e Support watch() and unwatch() in RDC module framework
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for libraries to
  include.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
  It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.

Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562
2020-09-18 16:02:31 -04:00
Cole Nelson a80454b35d CMakeLists.txt: set release/revision fields for pkg names
Change-Id: I575c1a593b8798c93611d77444ff096fc272e3c3
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
2020-09-17 16:29:45 -07:00
Chris Freehill 6fb4c79784 Update README with ldconfig instructions
Change-Id: Id033122d0b2f74b52a95a2ace99889c5d090cab3
(cherry picked from commit 29a3aee72f9546743d25ebae8c356b33933d3657)
2020-09-15 10:11:34 -04:00
Chris Freehill 9051b752c4 Add grpc to build
Also:
* fix typo in rpm post install script
* for RPM, tell CPack to exclude intermediate directories
  in rpm file

Change-Id: I9dbb4901298d3699e092b53b339f5cb1d77b4edb
(cherry picked from commit e894cfa757aae8343afb373ce4ae60a1aa950a91)
2020-09-12 09:52:48 -04:00
Bill(Shuzhou) Liu 72691cc024 Add new init and destroy API for RAS library.
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();

When RDC loads librdc_ras.so, it will call rdc_module_init().
When RDC exits, it will call rdc_module_destroy().

Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca
2020-09-09 14:01:03 -04:00
Harish Kasiviswanathan 5e1111d4cb Update README.md document
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I365acc202442495daf89df1328e58c92457ab10d
2020-09-02 20:07:05 -04:00
Harish Kasiviswanathan 0be419f8b0 Add RDC user guide
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I89403343125bb303f5c502de502b1d554418b365
2020-09-02 20:06:57 -04:00
Bill(Shuzhou) Liu ba35cdcfe2 RDC module framework
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.

* The RdcModuleMgr will be used to manage different modules. RDC
  only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
  It will also call rdc_telemetry_fields_query() defined in the RAS
  library for the list of fields that RAS supports.
* The RdcSmiLib is a wrapper for the rocm_smi_lib to provide the
  interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
  RDCTelemetryModule to bulk fetch multiple fields.
* The RdcTelemetryModule will dispatch those fields to different
  libraries: RdcSmiLib or RdcRasLib.

The watch() and unwatch() in the RDCTelemetryModule will be implemented
in the next task.

Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
2020-09-02 14:46:40 -04:00
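The dispatch step described in ba35cdcfe2, where requested fields are routed to RdcSmiLib or RdcRasLib, can be sketched in Python as below. The function name `dispatch_fields`, the dictionary layout, and the "first library that supports the field wins" rule are assumptions; the real framework is C++ and is not reproduced here.

```python
def dispatch_fields(fields, libraries):
    """Group requested fields by the first library that supports each one.

    `libraries` maps a library name to the set of field names it supports.
    """
    plan = {}
    for field in fields:
        for lib_name, supported in libraries.items():
            if field in supported:
                plan.setdefault(lib_name, []).append(field)
                break
        else:
            # No library claimed the field.
            raise KeyError(f"no library supports field {field!r}")
    return plan
```

Each library then receives one grouped request, which is what makes bulk fetching multiple fields possible.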
Chris Freehill bf412e3f76 Move docs/README.md to root
Also:
* consolidated the info in the previous rdc/README.md into
the README.md that was moved from docs/ directory.
* added missing information to get grpc into the default
library path (needed to add the grpc dir with ldconfig).
* formatting fixes

Change-Id: Id61e761ad7bdee40364bb8837be8705ed5ca53d1
2020-08-18 17:45:33 -04:00
Chris Freehill d7c9625fc6 Add event counter support
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.

Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.

Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
  a field should be hidden or not, by default. This will
  allow us to support many fields, even those that are not
  typically of interest (but sometimes may be), without
  confusing the user or unnecessary clutter.
* added a --list-all option which lists all available fields,
  including the more obscure fields.

Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
2020-08-17 19:45:18 -04:00
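The hidden-by-default listing added in d7c9625fc6 can be sketched as a simple filter. The function name `list_fields`, the `(name, hidden)` tuple layout, and the field names in the usage note are hypothetical; the actual format of rdc_field_data.data is not shown in this log.

```python
def list_fields(fields, list_all=False):
    """Return field names, skipping hidden ones unless list_all is set.

    `fields` is an iterable of (name, hidden) pairs.
    """
    return [name for name, hidden in fields if list_all or not hidden]
```

With `[("COMMON_FIELD", False), ("OBSCURE_FIELD", True)]`, the default call returns only the first name, while `list_all=True` mirrors the --list-all option and returns both.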
Bill(Shuzhou) Liu 9c7a1347ea RDC Prometheus plugin
rdc_prometheus.py is a Prometheus plugin for RDC.
rdc_prometheus_example.yml and prometheus_targets.json are
example Prometheus configurations. If there are multiple compute
nodes, they can be defined in prometheus_targets.json.

Change-Id: I3611b1e8a166f6608351f6e7644808bf72a4d3a0
2020-08-17 14:09:37 -05:00