Commit Graph

60 Commits

Author SHA1 Message Date
Bill(Shuzhou) Liu bb6d98b036 The collectd plugin for RDC
Two files are added to the python_binding folder:
* The rdc_collectd.py is a collectd plugin to store the RDC
  metrics to the collectd round robin database.
* The rdc_collectd.conf is a configure file which can control
  which fields to collect, how frequently the fields can be collected
  and run the plugin in embedded mode.

Change-Id: Ief44d004376ca8a82ed0d8ad36805243acb47080
2020-11-10 14:26:49 -05:00
Chris Freehill 0030f27ff8 Move grpc to ROCm install dir
Having grpc installed outside of ROCm dir is problematic
for multiple, simultaneous ROCm installations.

Change-Id: I5ad458ad01a76786339607d708b48534f15b137b
2020-10-24 21:46:10 -05:00
Bill(Shuzhou) Liu aeba7b0f91 Integrate RDC with Grafana
A new Grafana dashboard file rdc_grafana_dashboard_example.json
has been added to the folder python_binding. User can import
this dashboard to monitor multiple compute nodes.

To display the host name only in the dashboard, the
rdc_prometheus_example.yml is also changed to create a new label
short_instance which will not have the port number.

Change-Id: I9ab91838006d59c8dcb5fea01decb8c799484e1d
2020-10-15 14:12:15 -04:00
Bill(Shuzhou) Liu 151520b97e Support watch() and unwatch() in RDC module framework
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for library to
  include it.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
  It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.

Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562
2020-09-18 16:02:31 -04:00
Cole Nelson a80454b35d CMakeLists.txt: set release/revision fields for pkg names
Change-Id: I575c1a593b8798c93611d77444ff096fc272e3c3
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
2020-09-17 16:29:45 -07:00
Chris Freehill 6fb4c79784 Update README with ldconfig instructions
Change-Id: Id033122d0b2f74b52a95a2ace99889c5d090cab3
(cherry picked from commit 29a3aee72f9546743d25ebae8c356b33933d3657)
2020-09-15 10:11:34 -04:00
Chris Freehill 9051b752c4 Add grpc to build
Also:
* fix typo in rpm post install script
* for RPM, tell CPack to exclude intermediate directories
  in rpm file

Change-Id: I9dbb4901298d3699e092b53b339f5cb1d77b4edb
(cherry picked from commit e894cfa757aae8343afb373ce4ae60a1aa950a91)
2020-09-12 09:52:48 -04:00
Bill(Shuzhou) Liu 72691cc024 Add new init and destroy API for RAS library.
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();

When RDC load the librdc_ras.so, it will call rdc_module_init().
When RDC exit, it will call rdc_module_destroy()

Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca
2020-09-09 14:01:03 -04:00
Harish Kasiviswanathan 5e1111d4cb Update README.md document
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I365acc202442495daf89df1328e58c92457ab10d
2020-09-02 20:07:05 -04:00
Harish Kasiviswanathan 0be419f8b0 Add RDC user guide
Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
Change-Id: I89403343125bb303f5c502de502b1d554418b365
2020-09-02 20:06:57 -04:00
Bill(Shuzhou) Liu ba35cdcfe2 RDC module framework
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.

* The RdcModuleMgr will be used to manage different modules. RDC
  only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
  It will also call rdc_telemetry_fields_query() defined in the RAS
  library for the list of fields RAS supported.
* The RdcSmiLib is a wrapper for the rocm_msi_lib to provide the
  interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
  RDCTelemetryModule to bulk fetch mulitple fields.
* The RdcTelemetryModule will dispatch those fields to different
  library: RdcSmiLib or RdcRasLib.

The watch() and unwatch() in the RDCTelemetryModule will been implemented
at the next task.

Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
2020-09-02 14:46:40 -04:00
Chris Freehill bf412e3f76 Move docs/README.md to root
Also:
* consolidated the info in the previous rdc/README.md into
the README.md that was moved from docs/ directory.
* added missing information to get grpc into the default
library path (needed to add the grpc dir with ldconfig).
* formatting fixes

Change-Id: Id61e761ad7bdee40364bb8837be8705ed5ca53d1
2020-08-18 17:45:33 -04:00
Chris Freehill d7c9625fc6 Add event counter support
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.

Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.

Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
  a field should be hidden or not, by default. This will
  allow us to support many fields, even those that are not
  typically of interest (but sometimes may be), without
  confusing the user or unnecessary clutter.
* added a --list-all option which lists all available field
  including the more obscure fields.

Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
2020-08-17 19:45:18 -04:00
Bill(Shuzhou) Liu 9c7a1347ea RDC Prometheus plugin
The rdc_prometheus.py is a Prometheus plugin for RDC
The rdc_prometheus_example.yml and prometheus_targets.json are
example Prometheus configuration. If there are multiple compute
nodes, they can be defined at prometheus_targets.json.

Change-Id: I3611b1e8a166f6608351f6e7644808bf72a4d3a0
2020-08-17 14:09:37 -05:00
Chris Freehill 15be17539f Fix how test deals with terminating rdcd
Previously we would return -1 if we detected rdcd was
still running. But the rdcd process ID is alive as long
as the test is running. So now we return 0, and the rdcd
process ends, allowing the test to end cleanly.

Change-Id: I98a5aa0a03d14127824b86e1190047c9f9d2edb7
2020-08-17 14:09:37 -05:00
Chris Freehill 63863cbd2e Fix rdcd start/kill issues in rdctst
Change-Id: I7e4b1c19832b09b17720892d2c4f200d304ef2fb
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu 9209c6c516 RDC python binding
A new folder python_binding is created for RDC python binding:
* The rdc_bootstrap.py is a python ctypes wrapper for the librdc_boostrap.so
* The RdcUtil.py defines common utilities for RDC to manage group/fieldgroup
* The RdcReader.py is a class to simplify the usage of the RDC:
  - The user only needs to specify which fields he wants to monitoring.
     RdcReader will create groups and fieldgroups, watch the fields, and fetch the fields.
  - The RdcReader can support embedded and standalone mode.
  - The standalone can be with authentication and without authentication.
  - In standalone mode, the RdcReader can automatically reconnect to the rdcd when the connection is lost.
  - When rdcd is restarted, the previously created group and fieldgroup may lose.
    The RdcReader can re-create them and watch the fields after reconnect.
  - If the client is restarted, RdcReader can detect the groups and fieldgroups
    created before and avoid re-create them.
  - The user can pass the unit converter if he does not want to use RDC default unit.

Change-Id: I109ec86012f37162eb13f7d3e921115b7dd82369
2020-08-17 14:09:37 -05:00
Chris Freehill 5950ebadc4 rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu e6d910f67a Support standard deviation and json output for job stats
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.

A new option --json is added to the rdci to output the results
in json format.

In the job stats, using the GMT time instead of timestamp
for start and end time.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca
2020-08-17 14:09:37 -05:00
Chris Freehill 9efb55b06f Fix rdctst build
Compile and link steps were looking in wrong directories for
include and library files.

Change-Id: I5cbfd67ca2a02cab898f820587a9793f2105f2e6
2020-08-17 14:09:37 -05:00
Divya Shikre 15a591540d Add error message when user tries to delete invalid group ID.
Change-Id: I3d4bda4696158b44e3b72de0a701bbb9f6c962c4
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com
2020-08-17 14:09:37 -05:00
Cole Nelson d9408697d8 rdc: homogenize package references to AMD
Change-Id: I136afeedcbb4df87b37ca52d7faa6f91321b41f9
Signed-off-by: Cole Nelson <cole.nelson@amd.com>
2020-08-17 14:09:37 -05:00
Chris Freehill 4045a59bb4 Fix package names to DEB and RPM convention
Added a CPACK_PROJECT_CONFIG_FILE called package.txt for this.

Change-Id: Ia2b2c6cdb98506510a8fa6881d814804108553db
(cherry picked from commit 8c803f85df0c23c7e30dadc0ab9748749a1d3588)
2020-08-17 14:09:37 -05:00
Chris Freehill 9483b74fe4 Combine client and server packages
Now, only one package is generated. This works with older
versions of cmake.

Also, restore change to postinst scripts for Debian and
RPM, undone in a prevous commit.

Change-Id: Ica005656c5f1df0d01d3071584b97de9f0e61cb3
(cherry picked from commit c14d23843228fa146f38c87cb59514e855725b41)
2020-08-17 14:09:37 -05:00
Chris Freehill 4008dd8eac Add .x86_64 and _amd64 suffixes to .rpm and .debs
Also:
* update README documentation
* correct postinst scripts for deb and rpm
* add lib64/ to link_directories (needed for CentOS and others)
* remove a redundant "rdc" from the package names
* rearrange the package names to conform to convention
For example:
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty_amd64.deb
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty.x86_64.rpm

* fix issues that result from having, in essence, 2 different
  install prefixes, 1 for the client and 1 for the server.

Change-Id: I88f0e1b8b72df2793c35ed71534afd91142da012
2020-08-17 14:09:37 -05:00
Chris Freehill 54dcd421e4 Also look in <grpc root>/lib64 for grpc libs
Some distros install grpc libs into lib64/ instead of
lib/. This takes care of that.

Change-Id: Iaab31c331828844be4a9e5c7abec4da609173356
2020-08-17 14:09:37 -05:00
Chris Freehill c35b0b8ec1 Add rdci to package bin directory
Also:
* add to the CI bin directory
* when making a new manual PDF, don't overwrite the old one;
  instead, make a copy that can be used to manually replace
  the existing one, if desired.

Change-Id: I9384e3627835a9c9983a55c23417a279a7b4d0f4
2020-08-17 14:09:37 -05:00
Chris Freehill 5f947270c1 Prepare rdctst for automated test runs
Mostly this involves creating a "batch mode" which does not
have any interactive prompts. Also, in batch mode, both stand-
alone and embedded modes are run.

Change-Id: I9703e501ab1f853e992b6b401fa0215681ab69f0
2020-08-17 14:09:29 -05:00
Chris Freehill 8e4d1e7f33 Make job id char array const in rdc api
Also make adjustments to packaging.

Change-Id: I73cc18ce67f833ff563cb1488b000b69b315979a
2020-08-17 14:07:25 -05:00
Chris Freehill 3793e8df27 Generate RDC docs on make and put into package
DOxygen docs from rdc.h will be generated on build and a
new pdf file will be created in the docs folder. Note
that the generated pdf should should only be updated in
the git repo when certain files are modified:

* rdc.h Doxygen comments are updated (or any future files that
are processed for Doxygen comments)

* the Doxygen config file is updated

Change-Id: I3084520f5b02aaa8e4973f9055d8679d6788b0ef
2020-08-17 14:07:25 -05:00
Divya Shikre 61579371f8 Implement gtests for RDC
adding gtest placeholder
adding discovery,group,fieldgroup,dmon,stats test

Change-Id: I71428f70345af5c8025fb66c1d411dc348daa2ef
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu d30cb81fdb Add support to use the field name in rdci
In the rdci dmon and fieldgroup, now the fields can be specified
using either number id or the field name.

When the rdc is async fetching metrics, it will not report that fetch
as an error.

Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4
2020-08-17 14:07:25 -05:00
Chris Freehill 76162057a1 Fix some docs issues
Change-Id: I961a34a90ead7e7559f778cae3bef7ec41689a93
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 2772d3f238 Rename description of job stats
Change the job stats description.

Change-Id: I9b56a40d648c05e5327ad1b640277302d0e5e00c
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 96afb24845 Allow the rdcd to be started by user other than rdc or root
Remove the check whether the rdcd is started by rdc user.
Add the read access check for the private key and certificates if
the authentication is enabled.

Change-Id: I0e7a7eafb7985801572f809da0cb3e4012683153
2020-08-17 14:07:25 -05:00
Chris Freehill b6da10f1f4 Separate client/server packages, build_ scripts mods
Change-Id: Ic553be523e7c6ae8ac930fa2126add45f33645b7
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu f4a3fd4dda Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 096dc2dadb Implement the stats subsystem in rdci
Add support for the stats subsystem in rdci
Modify the dmon system to handle the case when no GPUs in a group

Change-Id: I5a18e1201d24b5318b8e324a77551a757b108f25
2020-08-17 14:07:25 -05:00
Chris Freehill 1b58033183 Make GPRC and protobuf external components to RDC
Pass in GRPC root (or use default location) for RDC to use
when building RDC components.

Change-Id: I89db2ac2be27ab6449c817d210a94c11fef965fd
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu fe3e75edfa Implement the gRPC APIs for the job stats
Add the job stats APIs in the rdc_api_service at the server side rdcd
Add the job stats APIs for the RdcStandaloneHandler at the client side
Make the load librdc.so and librdc_client.so thread safe.
Impelement async update all fields in RdcEmbeddedHandler.

Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu a547dc7efd Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 16bce67835 Implement the rdci subsystem: group, fieldgroup and dmon
Add the support for rdci subsystem group create, delete and query

Add the support for rdci subsystem fieldgroup create, delete and query

Add the support for rdci dmon system. The dmon system may show the stats every
a few seconds until press Ctrl-C. To cleanup the resources (for example, unwatch),
a signal handler is added.

Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c
2020-08-17 14:07:25 -05:00
Chris Freehill a6acf24ae7 Handle different levels of rdcd privilege
Depending on how a user starts rdcd, rdcd will either have
full monitor/control capabilities or have just monitoring
capabilties.

The only 2 user ids allowed are "rdc" and root.

Change-Id: Ie296a2f68c9723bef5945b1af1070ef99eeea93b
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 7ee29b6cdd Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611
2020-08-17 14:07:25 -05:00
Chris Freehill 7084690872 Correct CMake install for rdc_libs target
This will allow us to not have to use LD_LIBRARY_PATH when
packages are installed.

Change-Id: I16b4c50d400c3c7e3bbebe446c53d3605cebae53
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 66e4e790c3 Add SSL mutual authentication support for rdci
The RDC API is changed to pass the certificates to the gRPC.

Add the support to add all GPUs in the host to a group. Also before
add a GPU to a group, the RDC API will verify that GPU exists or not.

Add the support to fetch the temperature metrics.

Change-Id: I5857ef03fede233d16e8b2836be120f33172da93
2020-08-17 14:07:25 -05:00
Chris Freehill 023de40df7 Add return value for Make RdcStandaloneHandler::error_handle
Quiets compile warning.

Change-Id: I5e7454f56e824e2304c790bac729cfa0fcf78603
2020-08-17 14:07:25 -05:00
Chris Freehill 47fdfa4c7e Add support for gRPC authenticated communications
Also, make a few namespace corrections and some minor refactoring.

Change-Id: Iedcaf6b43cb7576bc11dfefe980abd190c838831
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 020f6939f7 SWDEV-209060 - Create the Skeleton RDC CLI and daemon
Create the skeleton implementation of rdc_client.so and rdci. Modify current rdcd to
integrate the RDC API service:

rdc.proto is changed to add a new RdcAPI service which defined the interfaces for the RDC API.

RdcStandaloneHandler.cpp is added to send the request using gRPC to the rdcd. It is built into
the rdc_client.so

rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement skeleton rdci.
Currently, the discovery subsystem is supported.

rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so which should also be
removed in the future.

Add the instruction how to run the rdcd and rdci in the build folder in the README.md.

Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 1ff1c7b617 SWDEV-223878 - Add cache manager and watch table to skeleton rdc_lib
Support cache manager and watch table in rdc_lib

RdcCacheManagerImpl.cc is added to implement cache of metrics. Currently, only
integer mertics are supported. The cache manager provids function to retrieve the
latest and history metrics from cache. It also provides interfaces to update and evict the cache.

RdcWatchTableImpl.cc is added to implement watch and unwatch fields. It uses the
field settings to control how frequently a field needs to be updated. We have a preliminarily
performance optimization for this class as it may be called very frequently.

RdcMetricsUpdaterImpl.cc is added to run the update at background thread when
RDC_OPERATION_MODE_AUTO is set.

After this code change, the rdcd/rdci should be able to implement basic discovery, group and dmon
function. The job management function is not implemented in the skeleton rdc_lib yet.

Change-Id: I26cff8c2ec85d1ad8e7df24c66b02f0060838d37
2020-08-17 14:07:25 -05:00