When RDC is used only as a library, the user can choose not to build
rdci and rdcd, which removes the dependencies on gRPC and protoc.
To do so, pass -DBUILD_STANDALONE=off to cmake.
* Update README.md with the instructions.
* Move the python_binding installation from client/CMakeLists.txt to CMakeLists.txt
so that a library-only build of RDC also installs the folder.
* Change CMakeLists.txt and rdc_libs/CMakeLists.txt to build with gRPC only if
BUILD_STANDALONE is enabled.
Change-Id: If9cfe9fc298a83636d85fe352a311fe2fe041661
[ROCm/rdc commit: 105675aeeb]
Two files are added to the python_binding folder:
* rdc_collectd.py is a collectd plugin that stores the RDC
metrics in the collectd round-robin database.
* rdc_collectd.conf is a configuration file that controls
which fields to collect and how frequently to collect them,
and runs the plugin in embedded mode.
Change-Id: Ief44d004376ca8a82ed0d8ad36805243acb47080
[ROCm/rdc commit: bb6d98b036]
Having grpc installed outside of ROCm dir is problematic
for multiple, simultaneous ROCm installations.
Change-Id: I5ad458ad01a76786339607d708b48534f15b137b
[ROCm/rdc commit: 0030f27ff8]
A new Grafana dashboard file, rdc_grafana_dashboard_example.json,
has been added to the python_binding folder. Users can import
this dashboard to monitor multiple compute nodes.
To display only the host name in the dashboard,
rdc_prometheus_example.yml is also changed to create a new label,
short_instance, which does not include the port number.
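The effect of the short_instance label can be illustrated with a small
Python sketch. The regular expression and the helper name below are
assumptions made for illustration only; the actual relabeling rule lives in
rdc_prometheus_example.yml and may be written differently.

```python
import re

# Hypothetical pattern: capture everything before an optional ":port" suffix.
_INSTANCE_RE = re.compile(r"^(?P<host>[^:]+)(?::\d+)?$")


def short_instance(instance: str) -> str:
    """Return the host part of a "host:port" instance label."""
    match = _INSTANCE_RE.match(instance)
    # Fall back to the unmodified label if it does not look like host[:port].
    return match.group("host") if match else instance
```

For example, an instance label such as "node1:5000" would be shown as
"node1" in the dashboard.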
Change-Id: I9ab91838006d59c8dcb5fea01decb8c799484e1d
[ROCm/rdc commit: aeba7b0f91]
The framework now supports watch() and unwatch(), which the telemetry
library can use to initialize events or pre-fetch fields when recording
starts.
* A new header file, RdcTelemetryLibInterface.h, is defined for libraries to
include.
* The RdcWatchTable no longer talks to RdcMetricFetcher directly.
It calls the framework watch/unwatch, which dispatches the request to the libraries.
* Make the python binding consistent with the current code.
Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562
[ROCm/rdc commit: 151520b97e]
The RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();
When RDC loads librdc_ras.so, it will call rdc_module_init().
When RDC exits, it will call rdc_module_destroy().
Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca
[ROCm/rdc commit: 72691cc024]
The framework is required for RAS integration. When RAS fields
need to be retrieved, the framework loads the RAS library at run time
and then calls the RAS function to retrieve the RAS metrics.
* The RdcModuleMgr is used to manage the different modules. RDC
only has the telemetry module for now.
* When the RDCTelemetryModule is loaded, it loads the RAS library.
It also calls rdc_telemetry_fields_query(), defined in the RAS
library, for the list of fields RAS supports.
* The RdcSmiLib is a wrapper for the rocm_smi_lib that provides the
interface required by the RDCTelemetryModule.
* The RdcWatchTable uses the RdcModuleMgr to get the
RDCTelemetryModule to bulk fetch multiple fields.
* The RdcTelemetryModule dispatches those fields to the appropriate
library: RdcSmiLib or RdcRasLib.
The watch() and unwatch() in the RDCTelemetryModule will be implemented
in the next task.
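The dispatch step described above can be sketched in Python as follows.
This is only an illustration of the idea, not the C++ implementation: the
class name TelemetryDispatcher, the two fetch functions and the field ids
are all hypothetical stand-ins.

```python
from collections import defaultdict


# Hypothetical stand-ins for the RdcSmiLib and RdcRasLib bulk-fetch calls.
def fetch_from_smi(field_ids):
    return {fid: "smi" for fid in field_ids}


def fetch_from_ras(field_ids):
    return {fid: "ras" for fid in field_ids}


class TelemetryDispatcher:
    """Routes each field to the library that registered support for it."""

    def __init__(self):
        self._owner = {}  # field id -> fetch function of the owning library

    def register(self, field_ids, fetcher):
        for fid in field_ids:
            self._owner[fid] = fetcher

    def fetch(self, field_ids):
        # Group the fields by owning library so that each library
        # receives a single bulk request. Unknown ids raise KeyError.
        batches = defaultdict(list)
        for fid in field_ids:
            batches[self._owner[fid]].append(fid)
        results = {}
        for fetcher, batch in batches.items():
            results.update(fetcher(batch))
        return results
```

Registering each library's supported fields once at load time keeps the
per-fetch work down to a dictionary lookup per field.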
Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
[ROCm/rdc commit: ba35cdcfe2]
Also:
* consolidated the info in the previous rdc/README.md into
the README.md that was moved from docs/ directory.
* added missing information to get grpc into the default
library path (needed to add the grpc dir with ldconfig).
* formatting fixes
Change-Id: Id61e761ad7bdee40364bb8837be8705ed5ca53d1
[ROCm/rdc commit: bf412e3f76]
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.
Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.
Also includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
a field should be hidden by default. This allows
us to support many fields, even those that are not
typically of interest (but sometimes may be), without
confusing the user or adding unnecessary clutter.
* added a --list-all option, which lists all available fields,
including the more obscure ones.
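The hidden-by-default behaviour can be sketched in Python as follows. The
field names and the table layout are hypothetical; the real data lives in
rdc_field_data.data and rdci is implemented in C++.

```python
import argparse

# Hypothetical field table: (field name, hidden by default?)
FIELDS = [
    ("FIELD_A", False),
    ("FIELD_B", False),
    ("FIELD_OBSCURE", True),
]


def list_fields(argv):
    """Return the field names that a list command would print."""
    parser = argparse.ArgumentParser(prog="dmon-list-sketch")
    parser.add_argument("--list-all", action="store_true",
                        help="include fields that are hidden by default")
    args = parser.parse_args(argv)
    # Hidden fields are shown only when --list-all is given.
    return [name for name, hidden in FIELDS if args.list_all or not hidden]
```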
Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
[ROCm/rdc commit: d7c9625fc6]
rdc_prometheus.py is a Prometheus plugin for RDC.
rdc_prometheus_example.yml and prometheus_targets.json are
example Prometheus configuration files. If there are multiple compute
nodes, they can be defined in prometheus_targets.json.
Change-Id: I3611b1e8a166f6608351f6e7644808bf72a4d3a0
[ROCm/rdc commit: 9c7a1347ea]
Previously we would return -1 if we detected rdcd was
still running. But the rdcd process ID is alive as long
as the test is running. So now we return 0, and the rdcd
process ends, allowing the test to end cleanly.
Change-Id: I98a5aa0a03d14127824b86e1190047c9f9d2edb7
[ROCm/rdc commit: 15be17539f]
A new folder, python_binding, is created for the RDC python binding:
* rdc_bootstrap.py is a python ctypes wrapper for librdc_bootstrap.so.
* RdcUtil.py defines common utilities for RDC to manage groups and fieldgroups.
* RdcReader.py is a class that simplifies the use of RDC:
- The user only needs to specify which fields to monitor.
RdcReader will create groups and fieldgroups, watch the fields, and fetch the fields.
- RdcReader supports embedded and standalone mode.
- Standalone mode works with or without authentication.
- In standalone mode, RdcReader can automatically reconnect to rdcd when the connection is lost.
- When rdcd is restarted, the previously created group and fieldgroup may be lost.
RdcReader can re-create them and watch the fields again after reconnecting.
- If the client is restarted, RdcReader can detect the groups and fieldgroups
created earlier and avoid re-creating them.
- The user can pass a unit converter to override the RDC default units.
Change-Id: I109ec86012f37162eb13f7d3e921115b7dd82369
[ROCm/rdc commit: 9209c6c516]
Make RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
[ROCm/rdc commit: 5950ebadc4]
In the job stats, the standard deviation is now displayed
in addition to the max, min and average.
A new option, --json, is added to rdci to output the results
in JSON format.
In the job stats, the start and end times are now shown in GMT
instead of as raw timestamps.
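The statistics and time formatting described above can be sketched in
Python. This is an illustration under assumptions: the function names, the
JSON key names and the time format are hypothetical, and the commit does not
say whether rdci uses the population or the sample standard deviation (the
population form is used here).

```python
import json
import math
import time


def summarize(samples):
    """Return max, min, average and standard deviation of the samples."""
    n = len(samples)
    avg = sum(samples) / n
    # Population standard deviation (assumption, see the note above).
    variance = sum((x - avg) ** 2 for x in samples) / n
    return {
        "max": max(samples),
        "min": min(samples),
        "average": avg,
        "std_dev": math.sqrt(variance),
    }


def to_gmt(epoch_seconds):
    """Format a Unix timestamp as a human-readable GMT string."""
    return time.strftime("%Y-%m-%d %H:%M:%S GMT", time.gmtime(epoch_seconds))


def job_stats_json(samples, start, end):
    """Build a JSON document resembling what a --json option might emit."""
    stats = summarize(samples)
    stats["start_time"] = to_gmt(start)
    stats["end_time"] = to_gmt(end)
    return json.dumps(stats)
```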
Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca
[ROCm/rdc commit: e6d910f67a]
The compile and link steps were looking in the wrong directories for
include and library files.
Change-Id: I5cbfd67ca2a02cab898f820587a9793f2105f2e6
[ROCm/rdc commit: 9efb55b06f]
Added a CPACK_PROJECT_CONFIG_FILE called package.txt for this.
Change-Id: Ia2b2c6cdb98506510a8fa6881d814804108553db
(cherry picked from commit 8c803f85df0c23c7e30dadc0ab9748749a1d3588)
[ROCm/rdc commit: 4045a59bb4]
Now, only one package is generated. This works with older
versions of cmake.
Also, restore the change to the postinst scripts for Debian and
RPM, undone in a previous commit.
Change-Id: Ica005656c5f1df0d01d3071584b97de9f0e61cb3
(cherry picked from commit c14d23843228fa146f38c87cb59514e855725b41)
[ROCm/rdc commit: 9483b74fe4]
Also:
* update README documentation
* correct postinst scripts for deb and rpm
* add lib64/ to link_directories (needed for CentOS and others)
* remove a redundant "rdc" from the package names
* rearrange the package names to conform to convention
For example:
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty_amd64.deb
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty.x86_64.rpm
* fix issues that result from having, in essence, 2 different
install prefixes, 1 for the client and 1 for the server.
Change-Id: I88f0e1b8b72df2793c35ed71534afd91142da012
[ROCm/rdc commit: 4008dd8eac]
Some distros install grpc libs into lib64/ instead of
lib/. This takes care of that.
Change-Id: Iaab31c331828844be4a9e5c7abec4da609173356
[ROCm/rdc commit: 54dcd421e4]
Also:
* add to the CI bin directory
* when making a new manual PDF, don't overwrite the old one;
instead, make a copy that can be used to manually replace
the existing one, if desired.
Change-Id: I9384e3627835a9c9983a55c23417a279a7b4d0f4
[ROCm/rdc commit: c35b0b8ec1]
Mostly this involves creating a "batch mode" which does not
have any interactive prompts. Also, in batch mode, both
standalone and embedded modes are run.
Change-Id: I9703e501ab1f853e992b6b401fa0215681ab69f0
[ROCm/rdc commit: 5f947270c1]
Doxygen docs from rdc.h will be generated on build and a
new PDF file will be created in the docs folder. Note
that the generated PDF should only be updated in
the git repo when certain files are modified:
* rdc.h Doxygen comments are updated (or any future files that
are processed for Doxygen comments)
* the Doxygen config file is updated
Change-Id: I3084520f5b02aaa8e4973f9055d8679d6788b0ef
[ROCm/rdc commit: 3793e8df27]
In rdci dmon and fieldgroup, fields can now be specified
by either numeric id or field name.
When RDC is still fetching a metric asynchronously, that fetch is not
reported as an error.
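Accepting either a numeric id or a field name can be sketched in Python as
follows. The field names, ids and the comma-separated input format are
hypothetical and chosen only for illustration; rdci itself is written in
C++.

```python
# Hypothetical name-to-id table.
FIELD_IDS = {"FIELD_A": 1, "FIELD_B": 2}
_KNOWN_IDS = set(FIELD_IDS.values())


def resolve_field(token):
    """Map a user-supplied token (numeric id or field name) to a field id."""
    token = token.strip()
    if token.isdigit():
        field_id = int(token)
        if field_id not in _KNOWN_IDS:
            raise ValueError(f"unknown field id: {field_id}")
        return field_id
    if token in FIELD_IDS:
        return FIELD_IDS[token]
    raise ValueError(f"unknown field name: {token!r}")


def resolve_fields(spec):
    """Resolve a comma-separated list that may mix ids and names."""
    return [resolve_field(part) for part in spec.split(",")]
```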
Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4
[ROCm/rdc commit: d30cb81fdb]
Remove the check for whether rdcd was started by the rdc user.
Add a read access check for the private key and certificates if
authentication is enabled.
Change-Id: I0e7a7eafb7985801572f809da0cb3e4012683153
[ROCm/rdc commit: 96afb24845]
Remove the * in the rdci stats.
When a group is created, GPUs can be added in the same command.
Add support for the memory temperature.
Add support for the memory clock.
Add support for reporting ECC errors.
Add support for reporting the PCIe bandwidth throughput.
Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in RdcMetricFetcherImpl.
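The idea behind the async fetch can be sketched in Python: the caller gets
the last completed value immediately while a slow read runs in the
background. This is a simplified illustration, not the RdcMetricFetcherImpl
code; the class name AsyncMetric and its methods are hypothetical.

```python
import threading


class AsyncMetric:
    """Caches the result of a slow read and refreshes it in the background."""

    def __init__(self, slow_read):
        self._slow_read = slow_read
        self._lock = threading.Lock()
        self._value = None   # None means "no completed fetch yet"
        self._worker = None

    def get(self):
        """Return the last completed value without blocking on the read."""
        with self._lock:
            # Start a refresh only if none is currently running.
            if self._worker is None or not self._worker.is_alive():
                self._worker = threading.Thread(target=self._refresh,
                                                daemon=True)
                self._worker.start()
            return self._value

    def _refresh(self):
        value = self._slow_read()  # slow call happens outside the lock
        with self._lock:
            self._value = value

    def wait(self):
        """Block until the refresh that is in flight (if any) finishes."""
        worker = self._worker
        if worker is not None:
            worker.join()
```

A caller that sees None can treat it as "still fetching" rather than as an
error.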
Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
[ROCm/rdc commit: f4a3fd4dda]
Add support for the stats subsystem in rdci.
Modify the dmon system to handle the case where a group contains no GPUs.
Change-Id: I5a18e1201d24b5318b8e324a77551a757b108f25
[ROCm/rdc commit: 096dc2dadb]
Pass in GRPC root (or use default location) for RDC to use
when building RDC components.
Change-Id: I89db2ac2be27ab6449c817d210a94c11fef965fd
[ROCm/rdc commit: 1b58033183]
Add the job stats APIs to the rdc_api_service on the server side (rdcd).
Add the job stats APIs to the RdcStandaloneHandler on the client side.
Make the loading of librdc.so and librdc_client.so thread safe.
Implement async update of all fields in RdcEmbeddedHandler.
Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce
[ROCm/rdc commit: fe3e75edfa]
Add functions to start and stop job recording.
Add a function to get the job stats for each GPU and a summary across multiple GPUs.
Add a function to remove jobs.
Add a class, RdcLogger, which controls the log level through the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY in gRPC. When customers run into issues, they can enable verbose
logging to help us troubleshoot.
Add -u support in rdci group, fieldgroup and dmon for connecting to rdcd without authentication.
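Selecting a log level from the RDC_LOG environment variable can be sketched
in Python as follows. The accepted level names and the default are
assumptions; the commit does not list the values that RdcLogger recognizes.

```python
import logging
import os

# Assumed level names; the real RdcLogger may accept a different set.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "ERROR": logging.ERROR,
}


def level_from_env(environ=os.environ, default=logging.ERROR):
    """Return the log level named by RDC_LOG, or the default if unset."""
    name = environ.get("RDC_LOG", "").strip().upper()
    # Unknown or empty values fall back to the default level.
    return _LEVELS.get(name, default)
```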
Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
[ROCm/rdc commit: a547dc7efd]
Add support for the rdci group subsystem: create, delete and query.
Add support for the rdci fieldgroup subsystem: create, delete and query.
Add support for the rdci dmon system. The dmon system may show the stats every
few seconds until Ctrl-C is pressed. To clean up resources (for example, unwatch),
a signal handler is added.
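The Ctrl-C handling pattern can be sketched in Python as follows. The
handler only sets a flag, and the monitoring loop performs the cleanup on
its way out. The Monitor class is hypothetical and the fetch and unwatch
steps are placeholders; rdci itself is implemented in C++.

```python
import signal
import time


class Monitor:
    """Polls until interrupted, then releases its resources."""

    def __init__(self):
        self.running = True
        self.cleaned_up = False

    def _on_sigint(self, signum, frame):
        # Keep the handler minimal: just ask the loop to stop.
        self.running = False

    def run(self, interval=1.0, max_iterations=None):
        # Must be called from the main thread, where signal handlers
        # can be installed.
        previous = signal.signal(signal.SIGINT, self._on_sigint)
        count = 0
        try:
            while self.running and (max_iterations is None
                                    or count < max_iterations):
                count += 1            # placeholder for "fetch and print"
                time.sleep(interval)
        finally:
            self.cleaned_up = True    # placeholder for "unwatch"
            signal.signal(signal.SIGINT, previous)
        return count
```

Doing the cleanup in the loop rather than in the handler avoids running
non-trivial code in signal context.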
Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c
[ROCm/rdc commit: 16bce67835]
Depending on how a user starts rdcd, rdcd will either have
full monitor/control capabilities or have just monitoring
capabilities.
The only 2 user ids allowed are "rdc" and root.
Change-Id: Ie296a2f68c9723bef5945b1af1070ef99eeea93b
[ROCm/rdc commit: a6acf24ae7]
Implement the APIs defined in the RdcStandaloneHandler to make gRPC calls to the daemon.
Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in the daemon.
Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids().
These two APIs are required by the rdci group and fieldgroup
sub-modules.
Change-Id: I066091423146dea180c16af212688ed43dc44611
[ROCm/rdc commit: 7ee29b6cdd]
This will allow us to not have to use LD_LIBRARY_PATH when
packages are installed.
Change-Id: I16b4c50d400c3c7e3bbebe446c53d3605cebae53
[ROCm/rdc commit: 7084690872]
The RDC API is changed to pass the certificates to gRPC.
Add support for adding all GPUs in the host to a group. Also, before
adding a GPU to a group, the RDC API verifies that the GPU exists.
Add support for fetching the temperature metrics.
Change-Id: I5857ef03fede233d16e8b2836be120f33172da93
[ROCm/rdc commit: 66e4e790c3]
Create the skeleton implementation of rdc_client.so and rdci. Modify the current rdcd to
integrate the RDC API service:
rdc.proto is changed to add a new RdcAPI service, which defines the interfaces for the RDC API.
RdcStandaloneHandler.cpp is added to send requests to rdcd using gRPC. It is built into
rdc_client.so.
rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement a skeleton rdci.
Currently, the discovery subsystem is supported.
rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only the discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so, which should also be
removed in the future.
Add instructions to README.md on how to run rdcd and rdci from the build folder.
Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c
[ROCm/rdc commit: 020f6939f7]