8 Коммитов

Автор SHA1 Сообщение Дата
limeng12 71e2727a8f Backgroud health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Chris Freehill 6b246dcf4b rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec


[ROCm/rdc commit: 5950ebadc4]
2020-08-17 14:09:37 -05:00
Chris Freehill b94b3489c7 Make job id char array const in rdc api
Also make adjustments to packaging.

Change-Id: I73cc18ce67f833ff563cb1488b000b69b315979a


[ROCm/rdc commit: 8e4d1e7f33]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0813e7052f Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769


[ROCm/rdc commit: a547dc7efd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu ce4890f88c Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611


[ROCm/rdc commit: 7ee29b6cdd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dd0ef78c56 SWDEV-223878 - Add cache manager and watch table to skeleton rdc_lib
Support cache manager and watch table in rdc_lib

RdcCacheManagerImpl.cc is added to implement cache of metrics. Currently, only
integer mertics are supported. The cache manager provids function to retrieve the
latest and history metrics from cache. It also provides interfaces to update and evict the cache.

RdcWatchTableImpl.cc is added to implement watch and unwatch fields. It uses the
field settings to control how frequently a field needs to be updated. We have a preliminarily
performance optimization for this class as it may be called very frequently.

RdcMetricsUpdaterImpl.cc is added to run the update at background thread when
RDC_OPERATION_MODE_AUTO is set.

After this code change, the rdcd/rdci should be able to implement basic discovery, group and dmon
function. The job management function is not implemented in the skeleton rdc_lib yet.

Change-Id: I26cff8c2ec85d1ad8e7df24c66b02f0060838d37


[ROCm/rdc commit: 1ff1c7b617]
2020-08-17 14:07:25 -05:00