25 commits

Author SHA1 Message Date
Galantsev, Dmitrii 9702d0f2d7 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii f9e80cc37a Use templates for module population
Also add a stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.

Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-10 00:27:09 -06:00
Galantsev, Dmitrii eaa1862a80 RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii 434e40305d LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii ed3cfffd7e Server - Add -a/--address option
Change-Id: Ia9e8d76b9a4ba0aadc567142601a87f0ad0b69e4
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-12-04 15:26:44 -06:00
Bill(Shuzhou) Liu 1ab4110d46 RDC crash on exit
Join the signal handling thread instead of canceling it to prevent
a crash with "terminate called without an active exception".

Change-Id: I2e18eb825728fd3a94f67b1b0049516bb7b6ebbc
2023-11-03 09:10:22 -04:00
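libstdc++ prints "terminate called without an active exception" when std::terminate runs with no exception in flight; one common trigger is destroying a std::thread that is still joinable. A minimal sketch of cooperative shutdown followed by a join, assuming a hypothetical SignalWatcher class that polls an atomic stop flag (RDC's actual signal thread is not shown in this log and likely blocks waiting for signals instead):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a signal-handling thread: it polls a stop flag
// instead of blocking in sigwait(), so it can be shut down cooperatively.
class SignalWatcher {
 public:
  void Start() {
    assert(!worker_.joinable());  // starting twice would leak a thread
    stop_.store(false);
    worker_ = std::thread([this] {
      while (!stop_.load()) {
        // A real handler would wait for and dispatch signals here.
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
      }
    });
  }

  // Ask the loop to finish, then join it, so the std::thread is no longer
  // joinable when it is destroyed (destroying a joinable std::thread calls
  // std::terminate).
  void Stop() {
    stop_.store(true);
    if (worker_.joinable()) worker_.join();
  }

  ~SignalWatcher() { Stop(); }

  bool Running() const { return worker_.joinable(); }

 private:
  std::atomic<bool> stop_{false};
  std::thread worker_;
};
```

Joining instead of canceling lets the thread leave its loop normally, so no destructor or unwinding path is skipped.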
Galantsev, Dmitrii 8f6bf948cc ASAN: Shut down the signaling thread on exit
Change-Id: Ica546db354430f5f4adc33d8d92e09927d40f75b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2023-03-28 11:26:38 -05:00
Bill(Shuzhou) Liu 76ccf58008 Add the RdcSmiDiagnostic module
Provide an RdcSmiDiagnostic module, which calls rocm_smi_lib.

It supports the following diagnostics: get GPU topology, check GPU
parameters, and check processes running on the GPUs.

The gRPC client-side and server-side diagnostics functions are added.

The diag module is added to rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
2021-07-26 14:56:17 -04:00
Chris Freehill 6b5aeaaa23 Turn on/off DAC capabilities as needed
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.

To help with this, the ScopedCapability struct is introduced to
provide scope-limited access, helping to ensure a process is not
left with extra capabilities should an exception occur.

Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f
2021-02-04 12:25:26 -06:00
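Scope-limited access that is released even when an exception occurs is the RAII idiom. A minimal sketch of that idiom, assuming a hypothetical set_cap callback in place of the real capability calls (RDC's actual ScopedCapability struct and its members are not shown in this log, so the name ScopedCapabilitySketch is used to avoid implying otherwise):

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical RAII guard: raises a capability on construction and always
// drops it again in the destructor, including during stack unwinding.
struct ScopedCapabilitySketch {
  explicit ScopedCapabilitySketch(std::function<void(bool)> set_cap)
      : set_cap_(std::move(set_cap)) {
    set_cap_(true);  // e.g. make a permitted capability effective
  }
  ~ScopedCapabilitySketch() { set_cap_(false); }  // always turn it off

  // Non-copyable so the capability is dropped exactly once.
  ScopedCapabilitySketch(const ScopedCapabilitySketch&) = delete;
  ScopedCapabilitySketch& operator=(const ScopedCapabilitySketch&) = delete;

 private:
  std::function<void(bool)> set_cap_;
};

// Hypothetical configuration step that needs write access and may throw.
inline void ConfigureWithWriteAccess(bool& cap_enabled, bool fail) {
  ScopedCapabilitySketch guard([&cap_enabled](bool on) { cap_enabled = on; });
  assert(cap_enabled);  // write access is available inside this scope
  if (fail) throw std::runtime_error("configuration failed");
  // ... perform the write (e.g. to sysfs) here ...
}
```

Whether ConfigureWithWriteAccess returns normally or throws, the guard's destructor runs and the capability is turned off again.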
Chris Freehill b278cd379b Add event notification support and rdci timestamps
Also:
* print a header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
2020-11-22 07:10:39 -05:00
Chris Freehill 5950ebadc4 rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, clean up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
2020-08-17 14:09:37 -05:00
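Taking a dedicated enum type instead of a bare uint32_t lets the compiler reject arbitrary integers, and a single table keyed by that enum keeps per-field data in one place. A sketch of both ideas, using hypothetical field names (EX_FI_*) and a hypothetical FieldName helper rather than RDC's real rdc_field_t values:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical field enum standing in for rdc_field_t.
enum ex_field_t : uint32_t {
  EX_FI_GPU_TEMP = 0,
  EX_FI_GPU_CLOCK,
  EX_FI_POWER_USAGE,
  EX_FI_COUNT  // number of fields; keep last
};

// Centralized per-field metadata: adding a field means adding one row here.
struct FieldInfo {
  ex_field_t id;
  const char* name;
};

constexpr std::array<FieldInfo, static_cast<std::size_t>(EX_FI_COUNT)>
    kFieldTable = {{
        {EX_FI_GPU_TEMP, "GPU_TEMP"},
        {EX_FI_GPU_CLOCK, "GPU_CLOCK"},
        {EX_FI_POWER_USAGE, "POWER_USAGE"},
    }};

// Taking ex_field_t (not uint32_t) means FieldName(42u) does not compile in
// C++: there is no implicit conversion from an integer to an enum.
inline std::string_view FieldName(ex_field_t field) {
  for (const FieldInfo& info : kFieldTable) {
    if (info.id == field) return info.name;
  }
  return {};  // e.g. EX_FI_COUNT, which has no table entry
}
```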
Bill(Shuzhou) Liu e6d910f67a Support standard deviation and json output for job stats
In addition to the max, min and average, the job stats will
also display the standard deviation.

A new option --json is added to rdci to output the results
in JSON format.

In the job stats, GMT time is used instead of a raw timestamp
for the start and end times.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca
2020-08-17 14:09:37 -05:00
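A sketch of the two job-stats computations described above: summary statistics including standard deviation, and formatting a start/end timestamp as GMT. It assumes population standard deviation (the log does not say whether RDC divides by n or n - 1), and the Summarize/FormatGmt names and output format are hypothetical:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <ctime>
#include <string>
#include <vector>

struct SummaryStats {
  double max_value;
  double min_value;
  double average;
  double std_dev;  // population standard deviation (divides by n)
};

// First pass: min, max, and mean. Second pass: variance around the mean.
inline SummaryStats Summarize(const std::vector<double>& samples) {
  assert(!samples.empty());
  SummaryStats s{samples[0], samples[0], 0.0, 0.0};
  double sum = 0.0;
  for (double v : samples) {
    s.max_value = std::max(s.max_value, v);
    s.min_value = std::min(s.min_value, v);
    sum += v;
  }
  s.average = sum / static_cast<double>(samples.size());
  double sq = 0.0;
  for (double v : samples) sq += (v - s.average) * (v - s.average);
  s.std_dev = std::sqrt(sq / static_cast<double>(samples.size()));
  return s;
}

// Format a Unix timestamp (seconds) as a GMT date-time string.
inline std::string FormatGmt(std::time_t ts) {
  // std::gmtime returns a pointer to static storage; copy it immediately.
  const std::tm* p = std::gmtime(&ts);
  if (p == nullptr) return "";
  std::tm tm_utc = *p;
  char buf[32];
  std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S GMT", &tm_utc);
  return buf;
}
```

For the samples {2, 4, 4, 4, 5, 5, 7, 9} the mean is 5 and the squared deviations sum to 32, so the population standard deviation is sqrt(32 / 8) = 2.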
Bill(Shuzhou) Liu 96afb24845 Allow rdcd to be started by a user other than rdc or root
Remove the check for whether rdcd is started by the rdc user.
Add a read-access check for the private key and certificates if
authentication is enabled.

Change-Id: I0e7a7eafb7985801572f809da0cb3e4012683153
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu f4a3fd4dda Support extra metrics in the RDC
Remove the * in the rdci stats.
When a group is created, GPUs can be added in the same command.
Add support for memory temperature.
Add support for memory clock.
Add support for reporting ECC errors.
Add support for reporting PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
2020-08-17 14:07:25 -05:00
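Because a throughput read can block for about a second, an asynchronous fetch keeps callers from waiting on it. A minimal sketch of that idea with std::async and a cached last value; SlowThroughputRead and AsyncThroughputCache are hypothetical names, and RdcMetricFetcherImpl's real design is not shown in this log:

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <future>
#include <mutex>
#include <thread>

// Hypothetical blocking read, standing in for a throughput query that
// samples counters over an interval (shortened here to 50 ms).
inline uint64_t SlowThroughputRead() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return 1234;
}

// Returns the last known value immediately and refreshes it in the
// background, so callers never block on the slow read.
class AsyncThroughputCache {
 public:
  uint64_t Get() {
    std::lock_guard<std::mutex> lock(mu_);
    if (pending_.valid() &&
        pending_.wait_for(std::chrono::seconds(0)) ==
            std::future_status::ready) {
      last_ = pending_.get();  // harvest a finished fetch
    }
    if (!pending_.valid()) {
      // Start the next fetch; it runs on its own thread.
      pending_ = std::async(std::launch::async, SlowThroughputRead);
    }
    return last_;
  }

 private:
  std::mutex mu_;
  std::future<uint64_t> pending_;
  uint64_t last_ = 0;  // 0 until the first fetch completes
};
```

The trade-off is staleness: each Get() returns the value from the previous completed fetch rather than a fresh one.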
Bill(Shuzhou) Liu fe3e75edfa Implement the gRPC APIs for the job stats
Add the job stats APIs to rdc_api_service on the server side (rdcd).
Add the job stats APIs to RdcStandaloneHandler on the client side.
Make loading of librdc.so and librdc_client.so thread-safe.
Implement asynchronous update of all fields in RdcEmbeddedHandler.

Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce
2020-08-17 14:07:25 -05:00
Chris Freehill a6acf24ae7 Handle different levels of rdcd privilege
Depending on how a user starts rdcd, rdcd will either have
full monitor/control capabilities or have only monitoring
capabilities.

The only two user IDs allowed are "rdc" and root.

Change-Id: Ie296a2f68c9723bef5945b1af1070ef99eeea93b
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 7ee29b6cdd Implement the APIs for gRPC calls in client/server
Implement the APIs defined in RdcStandaloneHandler to make gRPC calls to the daemon.

Implement the APIs defined in RdcAPIServiceImpl to handle the gRPC calls in the daemon.

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids().
These two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611
2020-08-17 14:07:25 -05:00
Chris Freehill 47fdfa4c7e Add support for gRPC authenticated communications
Also, make a few namespace corrections and some minor refactoring.

Change-Id: Iedcaf6b43cb7576bc11dfefe980abd190c838831
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 020f6939f7 SWDEV-209060 - Create the Skeleton RDC CLI and daemon
Create the skeleton implementation of rdc_client.so and rdci. Modify the current rdcd to
integrate the RDC API service:

rdc.proto is changed to add a new RdcAPI service, which defines the interfaces for the RDC API.

RdcStandaloneHandler.cpp is added to send requests to rdcd using gRPC. It is built into
rdc_client.so.

rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement a skeleton rdci.
Currently, the discovery subsystem is supported.

rdc_api_service.cc is added to the server as a skeleton implementation of the RdcAPI service. Currently,
only the discovery API is implemented. Note: we disabled rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so, which should also be
removed in the future.

Add instructions to README.md on how to run rdcd and rdci from the build folder.

Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c
2020-08-17 14:07:25 -05:00
Chris Freehill 5cc498c6aa Make rdcd run as user "rdc"
The rdc account will be created on installation if it does
not already exist. It will be a system account with no
home directory.

rdcd will be started as a systemd service, but will switch to
user "rdc". The rdc user will drop all privileges except
CAP_DAC_OVERRIDE, which stays in the permitted set. This means the
default mode will have no special privileges, but will be able to
gain write access (e.g., to sysfs) when needed.

rdc tests were being inadvertently added to the
installation. This was adversely impacting the new
functionality, so it was corrected in this commit.

Also included are a few small formatting changes.

Change-Id: I9c6bb132fee28119fd3960594dfb97bd2e7b282a
2020-08-17 14:07:25 -05:00
Chris Freehill 4729c47866 Add read fan values and associated tests
Change-Id: I89322e93d5f3110adace15e5a576f00d4934be79
2020-08-17 14:07:25 -05:00
Chris Freehill 02c6d3fb4d Add use of namespaces
Change-Id: I962eb808b3b874d1c3bf4cb418bf36952f88e3e2
2020-08-17 14:07:25 -05:00
Chris Freehill ca4344f5fa Add Google test based tests.
Initial tests include an "id test", which is really just a
template test at this point, and a temperature sensor test.

The Google Test code is included in this commit. It will
eventually be taken out and replaced with a pull from an
external Google repo.

Change-Id: I591818a9c169f4654fc8d8f17cf648f227d72545
2020-08-17 14:06:56 -05:00
Chris Freehill dc6f6f3e9a Break srvs. into rsmi & admin srvs. Add VerifyConnection api.
Change-Id: I67567264c37e31f3409062a14e56eba4801cd944
2020-01-09 20:02:33 -06:00
Chris Freehill 5898345d17 Initial RDC commit
Includes server, client and example targets.

Change-Id: I30596fb0453af71d49b8390a8468a6d073200836
2020-01-09 17:57:29 -06:00