rocm-systems/python_binding

Quick start

If RDC is not installed on the system, specify the path to the RDC libraries:

$ export LD_LIBRARY_PATH=<rdc_libs_path>

Then run RdcReader from the python_binding folder:

$ python RdcReader.py
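
If RdcReader.py fails at startup with a library loading error, a small check like the one below can tell you whether the dynamic loader sees the RDC libraries under the current LD_LIBRARY_PATH. This is a minimal sketch: the library name `librdc_bootstrap.so` is an assumption about what the bindings load, so substitute the name your RDC build actually ships.

```python
import ctypes
import os

# Assumption: the bindings load a shared library with this name.
# Replace it with the library name from your RDC installation.
RDC_LIB_NAME = "librdc_bootstrap.so"


def can_load(lib_name):
    """Return True if the dynamic loader can open lib_name, else False."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # Raised when the library (or one of its dependencies) is not found.
        return False


if __name__ == "__main__":
    print("LD_LIBRARY_PATH =", os.environ.get("LD_LIBRARY_PATH", "<unset>"))
    print(RDC_LIB_NAME, "loadable:", can_load(RDC_LIB_NAME))
```

Note that LD_LIBRARY_PATH has to be exported before Python starts; changing it from inside a running script generally does not affect the loader.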

Prometheus plugin

Install the prometheus_client:

$ pip install prometheus_client

Start rdcd with authentication enabled, then run the plugin to connect to it:

$ python rdc_prometheus.py

Check the options of the plugin:

$ python rdc_prometheus.py --help

Verify the plugin is running:

$ curl localhost:5000
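
The same check can be scripted. The sketch below fetches the plugin's endpoint and parses the Prometheus text exposition format into (series, value) pairs. It assumes the plugin listens on port 5000 as in the curl command above, and that samples carry no trailing timestamp. The metric names in the usage note are hypothetical, not actual RDC field names.

```python
import urllib.request


def parse_metrics(text):
    """Parse Prometheus text exposition into a list of (series, value) pairs."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no trailing timestamp, so the value is the last token.
        # Splitting from the right keeps label values with spaces intact.
        series, _, value = line.rpartition(" ")
        samples.append((series, float(value)))
    return samples


def fetch_metrics(url="http://localhost:5000", timeout=5):
    """Fetch the endpoint and return the parsed samples."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_metrics(resp.read().decode("utf-8"))
```

For example, `parse_metrics('some_metric{gpu_index="0"} 42.0\n')` yields one pair whose series is `some_metric{gpu_index="0"}` and whose value is `42.0`. An empty list from `fetch_metrics()` would suggest the plugin is up but not yet exporting samples.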

On the management computer, install Prometheus from https://github.com/prometheus/prometheus

Edit prometheus_targets.json to add the compute nodes running the plugin, then start Prometheus:

$ prometheus --config.file=<full path of the rdc_prometheus_example.yml>
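
For clusters with many nodes, the targets file can be generated instead of edited by hand. The sketch below assumes prometheus_targets.json follows the Prometheus file-based service discovery layout (a JSON list of objects with a `targets` list and optional `labels`) and that each node runs the plugin on port 5000. The host names are placeholders. Compare the output with the shipped file before replacing it.

```python
import json


def build_targets(hosts, port=5000, labels=None):
    """Build a file-based service discovery document for the given hosts."""
    entry = {"targets": ["%s:%d" % (host, port) for host in hosts]}
    if labels:
        entry["labels"] = dict(labels)
    return [entry]


if __name__ == "__main__":
    # Hypothetical compute nodes; replace with your own host names.
    nodes = ["node1.example.com", "node2.example.com"]
    print(json.dumps(build_targets(nodes), indent=2))
```

Redirecting the output to prometheus_targets.json gives Prometheus one target group covering all listed nodes.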

Browse to localhost:9090 on the management computer to view the metrics from RDC.
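
The metrics can also be read programmatically through the Prometheus HTTP API. The sketch below builds an instant-query URL and extracts label/value pairs from the JSON response. It assumes Prometheus is reachable at localhost:9090 and handles only vector results. The `up` expression used in the usage note is Prometheus's built-in per-target health metric, chosen so that no RDC metric name has to be guessed.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(expr, base="http://localhost:9090"):
    """Return the instant-query URL for a PromQL expression."""
    return base + "/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def extract_values(payload):
    """Return (labels, value) pairs from an instant-query response.

    Only vector results are handled; each sample is [timestamp, "value"].
    """
    if payload.get("status") != "success":
        raise ValueError("query failed: %r" % payload.get("error"))
    return [(r["metric"], float(r["value"][1]))
            for r in payload["data"]["result"]]


def instant_query(expr, base="http://localhost:9090", timeout=5):
    """Run an instant query against Prometheus and return its samples."""
    url = build_query_url(expr, base)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return extract_values(json.load(resp))
```

Calling `instant_query("up")` should return one entry per scrape target, with value 1.0 for each compute node whose plugin Prometheus can reach.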