Graphe des révisions

10 Révisions

Auteur SHA1 Message Date
Galantsev, Dmitrii eaa1862a80 RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii 434e40305d LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2024-01-09 11:37:11 -06:00
Bill(Shuzhou) Liu 76ccf58008 Add the RdcSmiDiagnostic module
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.

It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.

The grpc client and server side diagnostics function is added.

The diag module is added to the rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu eab3625d65 The Diagnostic API interface
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module which can load the
library dynamically and then dispatch diagnostic test to run.

Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
2021-05-27 10:59:11 -04:00
Chris Freehill 5950ebadc4 rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
2020-08-17 14:09:37 -05:00
Chris Freehill 8e4d1e7f33 Make job id char array const in rdc api
Also make adjustments to packaging.

Change-Id: I73cc18ce67f833ff563cb1488b000b69b315979a
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu a547dc7efd Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 7ee29b6cdd Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu a5f063f8b3 Support discovery and group management in rdc_lib
The rdc.h is modified for new discovery and grouping APIs.

The RdcGroupSettingsImpl.cc is added to implement the GPU group and
the field group management.

The RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only support power, memory, GPU utilization,
temperature, GPU clock, total device and device name.

A new example field_value_example.cc is added to demo how to record
the fields and retrieve data from cache.

Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 85006053ed Create the rdc.h header file and librdc_bootstrap.so
The rdc.h is the only header file will be provided to the user.
The inital version only includes the data structure and function
required for the job stats example.

The example folder has one example demonstrated how to use the API
to collect the job summary stats.

The RdcBootStrap.cc will dynamically load different libraries when user
select either the standalone or embbed mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.

In order to run the example after build, it needs to specify the
LD_LIBRARY_PATH. Assume current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats

The folder is structured in following ways:
example
include
    - rdc - rdc.h (the only header file exposed to the user)
    - rdc_libs
          - impl
rdc_libs
    - boostrap
         - src
    - rdc
         - src
    - rdc_client
         - src
    - rdc_server
         - src

Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959
2020-08-17 14:07:25 -05:00