Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.
It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.
The grpc client and server side diagnostics function is added.
The diag module is added to the rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module which can load the
library dynamically and then dispatch diagnostic test to run.
Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.
* The RdcModuleMgr will be used to manage different modules. RDC
only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
It will also call rdc_telemetry_fields_query() defined in the RAS
library for the list of fields RAS supported.
* The RdcSmiLib is a wrapper for the rocm_msi_lib to provide the
interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
RDCTelemetryModule to bulk fetch mulitple fields.
* The RdcTelemetryModule will dispatch those fields to different
library: RdcSmiLib or RdcRasLib.
The watch() and unwatch() in the RDCTelemetryModule will been implemented
at the next task.
Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.
Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.
Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
Add the job stats APIs in the rdc_api_service at the server side rdcd
Add the job stats APIs for the RdcStandaloneHandler at the client side
Make the load librdc.so and librdc_client.so thread safe.
Impelement async update all fields in RdcEmbeddedHandler.
Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.
Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.
Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.
Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon
Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon
Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.
Change-Id: I066091423146dea180c16af212688ed43dc44611
Support cache manager and watch table in rdc_lib
RdcCacheManagerImpl.cc is added to implement cache of metrics. Currently, only
integer mertics are supported. The cache manager provids function to retrieve the
latest and history metrics from cache. It also provides interfaces to update and evict the cache.
RdcWatchTableImpl.cc is added to implement watch and unwatch fields. It uses the
field settings to control how frequently a field needs to be updated. We have a preliminarily
performance optimization for this class as it may be called very frequently.
RdcMetricsUpdaterImpl.cc is added to run the update at background thread when
RDC_OPERATION_MODE_AUTO is set.
After this code change, the rdcd/rdci should be able to implement basic discovery, group and dmon
function. The job management function is not implemented in the skeleton rdc_lib yet.
Change-Id: I26cff8c2ec85d1ad8e7df24c66b02f0060838d37
The rdc.h is modified for new discovery and grouping APIs.
The RdcGroupSettingsImpl.cc is added to implement the GPU group and
the field group management.
The RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only support power, memory, GPU utilization,
temperature, GPU clock, total device and device name.
A new example field_value_example.cc is added to demo how to record
the fields and retrieve data from cache.
Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8
The rdc.h is the only header file will be provided to the user.
The inital version only includes the data structure and function
required for the job stats example.
The example folder has one example demonstrated how to use the API
to collect the job summary stats.
The RdcBootStrap.cc will dynamically load different libraries when user
select either the standalone or embbed mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.
In order to run the example after build, it needs to specify the
LD_LIBRARY_PATH. Assume current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats
The folder is structured in following ways:
example
include
- rdc - rdc.h (the only header file exposed to the user)
- rdc_libs
- impl
rdc_libs
- boostrap
- src
- rdc
- src
- rdc_client
- src
- rdc_server
- src
Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959