Commit graph

41 Commits

Author SHA1 Message Date
Bill(Shuzhou) Liu 588ea96dd2 Support standard deviation and json output for job stats
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.

A new option --json is added to the rdci to output the results
in json format.

The job stats now use GMT time instead of a timestamp
for the start and end times.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca


[ROCm/rdc commit: e6d910f67a]
2020-08-17 14:09:37 -05:00
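As an illustration of the job stats change in 588ea96dd2, the following Python sketch computes the max, min, average and standard deviation of a list of samples and emits the result as JSON with GMT-formatted start and end times. The function names and JSON keys are assumptions made for this example and are not taken from the RDC sources. The commit does not say whether the population or the sample standard deviation is used, so the population form is assumed here.

```python
import json
import statistics
import time


def summarize(samples, start_ts, end_ts):
    """Return summary statistics for one metric (hypothetical layout)."""
    fmt = "%Y-%m-%d %H:%M:%S GMT"
    return {
        "start_time": time.strftime(fmt, time.gmtime(start_ts)),
        "end_time": time.strftime(fmt, time.gmtime(end_ts)),
        "max": max(samples),
        "min": min(samples),
        "average": statistics.fmean(samples),
        # Population standard deviation; an assumption, see the text above.
        "std_dev": statistics.pstdev(samples),
    }


def to_json(summary):
    """Serialize a summary the way a --json style option might."""
    return json.dumps(summary, indent=2)
```

A call such as `to_json(summarize(samples, start, end))` would produce one JSON object per metric, which is easier for scripts to consume than a formatted table.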
Chris Freehill bf248131cb Fix rdctst build
Compile and link steps were looking in wrong directories for
include and library files.

Change-Id: I5cbfd67ca2a02cab898f820587a9793f2105f2e6


[ROCm/rdc commit: 9efb55b06f]
2020-08-17 14:09:37 -05:00
Divya Shikre b85dc35d5e Add error message when user tries to delete invalid group ID.
Change-Id: I3d4bda4696158b44e3b72de0a701bbb9f6c962c4
Signed-off-by: Divya Shikre <DivyaUday.Shikre@amd.com>


[ROCm/rdc commit: 15a591540d]
2020-08-17 14:09:37 -05:00
Cole Nelson ddd32f0283 rdc: homogenize package references to AMD
Change-Id: I136afeedcbb4df87b37ca52d7faa6f91321b41f9
Signed-off-by: Cole Nelson <cole.nelson@amd.com>


[ROCm/rdc commit: d9408697d8]
2020-08-17 14:09:37 -05:00
Chris Freehill ce49edfcb7 Fix package names to DEB and RPM convention
Added a CPACK_PROJECT_CONFIG_FILE called package.txt for this.

Change-Id: Ia2b2c6cdb98506510a8fa6881d814804108553db
(cherry picked from commit 8c803f85df0c23c7e30dadc0ab9748749a1d3588)


[ROCm/rdc commit: 4045a59bb4]
2020-08-17 14:09:37 -05:00
Chris Freehill caeb82122e Combine client and server packages
Now, only one package is generated. This works with older
versions of cmake.

Also, restore change to postinst scripts for Debian and
RPM, undone in a previous commit.

Change-Id: Ica005656c5f1df0d01d3071584b97de9f0e61cb3
(cherry picked from commit c14d23843228fa146f38c87cb59514e855725b41)


[ROCm/rdc commit: 9483b74fe4]
2020-08-17 14:09:37 -05:00
Chris Freehill 27b7f174b3 Add .x86_64 and _amd64 suffixes to .rpm and .debs
Also:
* update README documentation
* correct postinst scripts for deb and rpm
* add lib64/ to link_directories (needed for CentOS and others)
* remove a redundant "rdc" from the package names
* rearrange the package names to conform to convention
For example:
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty_amd64.deb
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty.x86_64.rpm

* fix issues that result from having, in essence, 2 different
  install prefixes, 1 for the client and 1 for the server.

Change-Id: I88f0e1b8b72df2793c35ed71534afd91142da012


[ROCm/rdc commit: 4008dd8eac]
2020-08-17 14:09:37 -05:00
Chris Freehill aaaa7f0cd1 Also look in <grpc root>/lib64 for grpc libs
Some distros install grpc libs into lib64/ instead of
lib/. This takes care of that.

Change-Id: Iaab31c331828844be4a9e5c7abec4da609173356


[ROCm/rdc commit: 54dcd421e4]
2020-08-17 14:09:37 -05:00
Chris Freehill 80bd911736 Add rdci to package bin directory
Also:
* add to the CI bin directory
* when making a new manual PDF, don't overwrite the old one;
  instead, make a copy that can be used to manually replace
  the existing one, if desired.

Change-Id: I9384e3627835a9c9983a55c23417a279a7b4d0f4


[ROCm/rdc commit: c35b0b8ec1]
2020-08-17 14:09:37 -05:00
Chris Freehill cd5f37a3aa Prepare rdctst for automated test runs
Mostly this involves creating a "batch mode" which does not
have any interactive prompts. Also, in batch mode, both stand-
alone and embedded modes are run.

Change-Id: I9703e501ab1f853e992b6b401fa0215681ab69f0


[ROCm/rdc commit: 5f947270c1]
2020-08-17 14:09:29 -05:00
Chris Freehill b94b3489c7 Make job id char array const in rdc api
Also make adjustments to packaging.

Change-Id: I73cc18ce67f833ff563cb1488b000b69b315979a


[ROCm/rdc commit: 8e4d1e7f33]
2020-08-17 14:07:25 -05:00
Chris Freehill bcfdf23234 Generate RDC docs on make and put into package
Doxygen docs from rdc.h will be generated on build and a
new pdf file will be created in the docs folder. Note
that the generated pdf should only be updated in
the git repo when certain files are modified:

* rdc.h Doxygen comments are updated (or any future files that
are processed for Doxygen comments)

* the Doxygen config file is updated

Change-Id: I3084520f5b02aaa8e4973f9055d8679d6788b0ef


[ROCm/rdc commit: 3793e8df27]
2020-08-17 14:07:25 -05:00
Divya Shikre b156e86589 Implement gtests for RDC
adding gtest placeholder
adding discovery, group, fieldgroup, dmon, stats tests

Change-Id: I71428f70345af5c8025fb66c1d411dc348daa2ef


[ROCm/rdc commit: 61579371f8]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0f8f345992 Add support to use the field name in rdci
In the rdci dmon and fieldgroup, the fields can now be specified
using either the numeric id or the field name.

While RDC is fetching a metric asynchronously, the pending fetch
is not reported as an error.

Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4


[ROCm/rdc commit: d30cb81fdb]
2020-08-17 14:07:25 -05:00
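The id-or-name lookup described in 0f8f345992 can be sketched in Python as follows. The field ids and names in the table are invented for illustration; the real RDC field table may differ, and the comma-separated input format is also an assumption.

```python
# Hypothetical field table; real RDC field ids and names may differ.
FIELDS = {
    100: "GPU_CLOCK",
    150: "GPU_TEMP",
    155: "POWER_USAGE",
}
NAME_TO_ID = {name: fid for fid, name in FIELDS.items()}


def resolve_field(token):
    """Map a command-line token (numeric id or field name) to a field id."""
    token = token.strip()
    if token.isdigit():
        fid = int(token)
        if fid not in FIELDS:
            raise ValueError(f"unknown field id: {fid}")
        return fid
    try:
        return NAME_TO_ID[token.upper()]
    except KeyError:
        raise ValueError(f"unknown field name: {token}") from None


def parse_field_list(arg):
    """Parse a comma-separated list such as '100,GPU_TEMP'."""
    return [resolve_field(t) for t in arg.split(",") if t.strip()]
```

Resolving everything to numeric ids at the command-line boundary keeps the rest of the code working with a single representation.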
Chris Freehill aa0a40f84d Fix some docs issues
Change-Id: I961a34a90ead7e7559f778cae3bef7ec41689a93


[ROCm/rdc commit: 76162057a1]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu b7cf5bc94c Rename description of job stats
Change the job stats description.

Change-Id: I9b56a40d648c05e5327ad1b640277302d0e5e00c


[ROCm/rdc commit: 2772d3f238]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 11ed178796 Allow the rdcd to be started by user other than rdc or root
Remove the check whether the rdcd is started by rdc user.
Add the read access check for the private key and certificates if
the authentication is enabled.

Change-Id: I0e7a7eafb7985801572f809da0cb3e4012683153


[ROCm/rdc commit: 96afb24845]
2020-08-17 14:07:25 -05:00
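A minimal Python sketch of the read access check mentioned in 11ed178796 is shown below. The function names are hypothetical, and the real rdcd check may be structured differently; the sketch only shows the idea of verifying readability up front when authentication is enabled.

```python
import os


def unreadable_files(paths):
    """Return the subset of paths the current user cannot read."""
    return [p for p in paths if not os.access(p, os.R_OK)]


def check_auth_files(auth_enabled, paths):
    """Raise if authentication is on and any credential file is unreadable."""
    if not auth_enabled:
        return
    missing = unreadable_files(paths)
    if missing:
        raise PermissionError("cannot read: " + ", ".join(missing))
```

Failing early with the list of unreadable files gives a clearer message than a later TLS setup failure would.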
Chris Freehill c4dc0f4f56 Separate client/server packages, build_ scripts mods
Change-Id: Ic553be523e7c6ae8ac930fa2126add45f33645b7


[ROCm/rdc commit: b6da10f1f4]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add support for the memory temperature.
Add support for the memory clock.
Add support for reporting the ECC errors.
Add support for reporting the PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
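The async fetch idea from 5c2a56e069 can be illustrated with the Python sketch below: the first request for a slow metric starts a background read and reports "not ready", and a later request picks up the finished value. The class and method names are assumptions for this example and do not mirror RdcMetricFetcherImpl.

```python
from concurrent.futures import ThreadPoolExecutor


class AsyncFetcher:
    """Start a slow read in the background and serve the result later."""

    def __init__(self, slow_read):
        self._slow_read = slow_read  # callable(gpu_index) -> value
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._pending = {}           # gpu_index -> Future

    def fetch(self, gpu_index):
        """Return (ready, value). Not ready is a normal state, not an error."""
        future = self._pending.get(gpu_index)
        if future is None:
            # First request: start the slow read and return immediately.
            self._pending[gpu_index] = self._pool.submit(
                self._slow_read, gpu_index)
            return False, None
        if not future.done():
            return False, None
        del self._pending[gpu_index]
        return True, future.result()

    def close(self):
        """Wait for outstanding reads and release the worker thread."""
        self._pool.shutdown(wait=True)
```

The caller never blocks for the full read time; it simply polls again on its next update cycle.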
Bill(Shuzhou) Liu 39f3d3af8a Implement the stats subsystem in rdci
Add support for the stats subsystem in rdci
Modify the dmon system to handle the case when a group has no GPUs

Change-Id: I5a18e1201d24b5318b8e324a77551a757b108f25


[ROCm/rdc commit: 096dc2dadb]
2020-08-17 14:07:25 -05:00
Chris Freehill 819c4febca Make GRPC and protobuf external components to RDC
Pass in GRPC root (or use default location) for RDC to use
when building RDC components.

Change-Id: I89db2ac2be27ab6449c817d210a94c11fef965fd


[ROCm/rdc commit: 1b58033183]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dc48d8c977 Implement the gRPC APIs for the job stats
Add the job stats APIs in the rdc_api_service at the server side rdcd
Add the job stats APIs for the RdcStandaloneHandler at the client side
Make the loading of librdc.so and librdc_client.so thread safe.
Implement async update of all fields in RdcEmbeddedHandler.

Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce


[ROCm/rdc commit: fe3e75edfa]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0813e7052f Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY in gRPC. When customers run into issues, they can enable verbose
logging to help us troubleshoot.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769


[ROCm/rdc commit: a547dc7efd]
2020-08-17 14:07:25 -05:00
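The environment-controlled log level described in 0813e7052f can be sketched in Python as follows. Only the variable name RDC_LOG comes from the commit message; the accepted values (DEBUG, INFO, ERROR) and the default of ERROR are assumptions for illustration.

```python
import logging
import os

# Accepted values are assumed; the commit does not list them.
_LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "ERROR": logging.ERROR,
}


def level_from_env(environ=os.environ, var="RDC_LOG", default=logging.ERROR):
    """Translate the environment variable into a logging level."""
    value = environ.get(var, "").strip().upper()
    return _LEVELS.get(value, default)


def make_logger(name="rdc", environ=os.environ):
    """Create a logger whose level follows the environment variable."""
    logger = logging.getLogger(name)
    logger.setLevel(level_from_env(environ))
    return logger
```

With this pattern, a user could raise verbosity for a single run, for example by setting `RDC_LOG=DEBUG` in the environment before starting the program, without rebuilding anything.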
Bill(Shuzhou) Liu aef3d29925 Implement the rdci subsystem: group, fieldgroup and dmon
Add the support for rdci subsystem group create, delete and query

Add the support for rdci subsystem fieldgroup create, delete and query

Add the support for rdci dmon system. The dmon system may show the stats every
few seconds until Ctrl-C is pressed. To clean up the resources (for example, unwatch),
a signal handler is added.

Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c


[ROCm/rdc commit: 16bce67835]
2020-08-17 14:07:25 -05:00
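The Ctrl-C handling described for dmon in aef3d29925 can be sketched in Python like this: the signal handler only sets a flag, and the monitoring loop performs the cleanup on its way out. All names are hypothetical, and `max_iterations` exists only so the loop can be exercised without an actual interrupt. Note that Python requires `signal.signal` to be called from the main thread.

```python
import signal
import time


class Monitor:
    """Print samples periodically until interrupted, then clean up."""

    def __init__(self, read_sample, unwatch, interval=1.0):
        self._read_sample = read_sample  # callable() -> str
        self._unwatch = unwatch          # callable() releasing resources
        self._interval = interval
        self._stop = False

    def _on_sigint(self, signum, frame):
        # Only set a flag; the loop does the real cleanup.
        self._stop = True

    def run(self, max_iterations=None):
        """Run the loop and return the number of samples printed."""
        previous = signal.signal(signal.SIGINT, self._on_sigint)
        count = 0
        try:
            while not self._stop:
                if max_iterations is not None and count >= max_iterations:
                    break
                print(self._read_sample())
                count += 1
                time.sleep(self._interval)
        finally:
            # Always release the watch and restore the old handler.
            self._unwatch()
            signal.signal(signal.SIGINT, previous)
        return count
```

Keeping the handler minimal avoids doing non-trivial work in signal context, and the `finally` block ensures the cleanup runs on every exit path.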
Chris Freehill a7fb94589c Handle different levels of rdcd privilege
Depending on how a user starts rdcd, rdcd will either have
full monitor/control capabilities or have just monitoring
capabilities.

The only 2 user ids allowed are "rdc" and root.

Change-Id: Ie296a2f68c9723bef5945b1af1070ef99eeea93b


[ROCm/rdc commit: a6acf24ae7]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu ce4890f88c Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611


[ROCm/rdc commit: 7ee29b6cdd]
2020-08-17 14:07:25 -05:00
Chris Freehill 3d8180e5af Correct CMake install for rdc_libs target
This will allow us to not have to use LD_LIBRARY_PATH when
packages are installed.

Change-Id: I16b4c50d400c3c7e3bbebe446c53d3605cebae53


[ROCm/rdc commit: 7084690872]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0a20efdbf3 Add SSL mutual authentication support for rdci
The RDC API is changed to pass the certificates to the gRPC.

Add support for adding all GPUs in the host to a group. Also, before
adding a GPU to a group, the RDC API will verify that the GPU exists.

Add the support to fetch the temperature metrics.

Change-Id: I5857ef03fede233d16e8b2836be120f33172da93


[ROCm/rdc commit: 66e4e790c3]
2020-08-17 14:07:25 -05:00
Chris Freehill d1acc44ffd Add return value for RdcStandaloneHandler::error_handle
Quiets compile warning.

Change-Id: I5e7454f56e824e2304c790bac729cfa0fcf78603


[ROCm/rdc commit: 023de40df7]
2020-08-17 14:07:25 -05:00
Chris Freehill 2f59e7e1ab Add support for gRPC authenticated communications
Also, make a few namespace corrections and some minor refactoring.

Change-Id: Iedcaf6b43cb7576bc11dfefe980abd190c838831


[ROCm/rdc commit: 47fdfa4c7e]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 199f085ce3 SWDEV-209060 - Create the Skeleton RDC CLI and daemon
Create the skeleton implementation of rdc_client.so and rdci. Modify current rdcd to
integrate the RDC API service:

rdc.proto is changed to add a new RdcAPI service which defines the interfaces for the RDC API.

RdcStandaloneHandler.cpp is added to send the request using gRPC to the rdcd. It is built into
the rdc_client.so

rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement skeleton rdci.
Currently, the discovery subsystem is supported.

rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so which should also be
removed in the future.

Add instructions to the README.md on how to run the rdcd and rdci in the build folder.

Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c


[ROCm/rdc commit: 020f6939f7]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dd0ef78c56 SWDEV-223878 - Add cache manager and watch table to skeleton rdc_lib
Support cache manager and watch table in rdc_lib

RdcCacheManagerImpl.cc is added to implement the metrics cache. Currently, only
integer metrics are supported. The cache manager provides functions to retrieve the
latest and historical metrics from the cache. It also provides interfaces to update and evict the cache.

RdcWatchTableImpl.cc is added to implement watching and unwatching fields. It uses the
field settings to control how frequently a field needs to be updated. This class has a preliminary
performance optimization, as it may be called very frequently.

RdcMetricsUpdaterImpl.cc is added to run the update in a background thread when
RDC_OPERATION_MODE_AUTO is set.

After this code change, the rdcd/rdci should be able to implement basic discovery, group and dmon
functions. The job management function is not implemented in the skeleton rdc_lib yet.

Change-Id: I26cff8c2ec85d1ad8e7df24c66b02f0060838d37


[ROCm/rdc commit: 1ff1c7b617]
2020-08-17 14:07:25 -05:00
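The cache manager and watch table described in dd0ef78c56 can be illustrated with the Python sketch below. It is a simplified model under stated assumptions: the class names, method names and the bounded-history policy are invented for this example and do not reflect the actual RdcCacheManagerImpl or RdcWatchTableImpl code.

```python
from collections import defaultdict, deque


class MetricCache:
    """Keep a bounded history of integer samples per (gpu, field)."""

    def __init__(self, max_samples=100):
        self._data = defaultdict(lambda: deque(maxlen=max_samples))

    def update(self, gpu, field, timestamp, value):
        self._data[(gpu, field)].append((timestamp, int(value)))

    def latest(self, gpu, field):
        samples = self._data.get((gpu, field))
        return samples[-1] if samples else None

    def history(self, gpu, field, since=0):
        return [s for s in self._data.get((gpu, field), ()) if s[0] >= since]

    def evict(self, gpu, field, older_than):
        """Drop samples with a timestamp before older_than."""
        samples = self._data.get((gpu, field))
        while samples and samples[0][0] < older_than:
            samples.popleft()


class WatchTable:
    """Track which fields are watched and when each is next due."""

    def __init__(self):
        self._watches = {}  # (gpu, field) -> [interval, next_due]

    def watch(self, gpu, field, interval, now):
        self._watches[(gpu, field)] = [interval, now]

    def unwatch(self, gpu, field):
        self._watches.pop((gpu, field), None)

    def due(self, now):
        """Return keys due for an update and schedule their next one."""
        ready = []
        for key, entry in self._watches.items():
            if now >= entry[1]:
                ready.append(key)
                entry[1] = now + entry[0]
        return ready
```

A background updater would call `due(now)` on each tick, fetch only the returned fields, and store the results with `update`, so each field is refreshed at its own configured interval.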
Bill(Shuzhou) Liu 7f7cf5c1db Support discovery and group management in rdc_lib
The rdc.h is modified for new discovery and grouping APIs.

The RdcGroupSettingsImpl.cc is added to implement the GPU group and
the field group management.

The RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only power, memory, GPU utilization,
temperature, GPU clock, total device and device name are supported.

A new example field_value_example.cc is added to demo how to record
the fields and retrieve data from cache.

Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8


[ROCm/rdc commit: a5f063f8b3]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5b27d846b2 Create the rdc.h header file and librdc_bootstrap.so
rdc.h is the only header file that will be provided to the user.
The initial version only includes the data structures and functions
required for the job stats example.

The example folder has one example demonstrating how to use the API
to collect the job summary stats.

RdcBootStrap.cc will dynamically load different libraries when the user
selects either the standalone or embedded mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.

To run the example after the build, LD_LIBRARY_PATH must be
specified. Assuming the current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats

The folder is structured as follows:
example
include
    - rdc - rdc.h (the only header file exposed to the user)
    - rdc_libs
          - impl
rdc_libs
    - bootstrap
         - src
    - rdc
         - src
    - rdc_client
         - src
    - rdc_server
         - src

Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959


[ROCm/rdc commit: 85006053ed]
2020-08-17 14:07:25 -05:00
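The mode-dependent library loading described in 5b27d846b2 can be sketched in Python as follows. The library names librdc.so and librdc_client.so appear in the commit messages of this log; the mode strings and function names are assumptions for illustration, and the real bootstrap code is C++ rather than Python.

```python
import ctypes

# Library names follow the commit messages; the mode strings are assumptions.
LIBRARY_FOR_MODE = {
    "embedded": "librdc.so",
    "standalone": "librdc_client.so",
}


def library_for_mode(mode):
    """Return the shared library name that backs the given mode."""
    try:
        return LIBRARY_FOR_MODE[mode]
    except KeyError:
        raise ValueError(f"unsupported mode: {mode}") from None


def load_handler(mode, loader=ctypes.CDLL):
    """Load the shared library for the chosen mode.

    `loader` is injectable so the selection logic can be exercised
    without the real libraries being installed.
    """
    return loader(library_for_mode(mode))
```

With a thin bootstrap layer like this, user code links against one small library and the choice between embedded and standalone operation is made at run time.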
Chris Freehill 5c33103352 Make rdcd run as user "rdc"
The rdc account will be created on installation if it does
not already exist. It will be a system account with no
home directory.

rdcd will be started as a systemd service, but change to
user "rdc". The rdc user will drop all privileges except
CAP_DAC_OVERRIDE, permitted. This means the default mode
will have no special privileges, but have the ability to
gain write access (e.g., to sysfs) when needed.

rdc tests were being inadvertently added to the
installation. This was adversely impacting the new
functionality, so it was corrected in this commit.

Also included are a few small formatting changes.

Change-Id: I9c6bb132fee28119fd3960594dfb97bd2e7b282a


[ROCm/rdc commit: 5cc498c6aa]
2020-08-17 14:07:25 -05:00
Chris Freehill 87aa4ff77c Add read fan values and associated tests
Change-Id: I89322e93d5f3110adace15e5a576f00d4934be79


[ROCm/rdc commit: 4729c47866]
2020-08-17 14:07:25 -05:00
Chris Freehill 7ed9c89ff6 Add use of namespaces
Change-Id: I962eb808b3b874d1c3bf4cb418bf36952f88e3e2


[ROCm/rdc commit: 02c6d3fb4d]
2020-08-17 14:07:25 -05:00
Chris Freehill 77683bf0e8 Add Google test based tests.
Initial testing includes an "id test", which is really just a
template test at this point, and a temperature sensor test.

The google test code is included in this commit. It will
eventually be taken out and replaced with a pull from a google
external repo.

Change-Id: I591818a9c169f4654fc8d8f17cf648f227d72545


[ROCm/rdc commit: ca4344f5fa]
2020-08-17 14:06:56 -05:00
Chris Freehill ba14edbb4d Break srvs. into rsmi & admin srvs. Add VerifyConnection api.
Change-Id: I67567264c37e31f3409062a14e56eba4801cd944


[ROCm/rdc commit: dc6f6f3e9a]
2020-01-09 20:02:33 -06:00
Chris Freehill bc7f01e992 Initial RDC commit
Includes server, client and example targets.

Change-Id: I30596fb0453af71d49b8390a8468a6d073200836


[ROCm/rdc commit: 5898345d17]
2020-01-09 17:57:29 -06:00
Chris Freehill dd48b63bbf Initial commit
Change-Id: I30d87413f6771d1d9d67cd4b2d65ed788d275533


[ROCm/rdc commit: 0de56e087a]
2020-01-09 17:57:19 -06:00