Commit Graph

11 Commits

Author SHA1 Message Date
Bill(Shuzhou) Liu e16a8bcaf5 Identify GPUs using PCI device identifier in RDC Prometheus plugin
Add a new option --enable_pci_id to Prometheus plugin, which will map
the GPU index to the PCI Device Identifier.

Change-Id: I38a2a7e4841975da095391002397d4515ffb8e0d


[ROCm/rdc commit: 23ab2c0671]
2022-05-05 09:16:05 -04:00
Bill(Shuzhou) Liu c6a69f8e59 Update RDC document
Update README.md to refer to document portal.

Change-Id: I427122751fec5a27936b345a3ac76c96478be164


[ROCm/rdc commit: 2cd7f66154]
2022-04-27 14:38:48 -04:00
Bill(Shuzhou) Liu d53a9c4e21 RDC Prometheus plugin return errors when use the --rdc_gpu_indexes
When above option is used, the plugin returns errors:
  result = rdc.rdc_group_gpu_add(rdc_handle, gpu_group_id, gpu)
  ctypes.ArgumentError: argument 3: <type 'exceptions.TypeError'>: wrong type

The rdc_prometheus.py is changed to convert string to integer.
The RdcUtil.py is also changed to raise Exception properly.

Change-Id: I9535091ff1fc8882cccd32e5f2810da5241768c3


[ROCm/rdc commit: 7ca7a571a7]
2021-02-23 14:15:04 -05:00
Bill(Shuzhou) Liu 17d5758923 Add raslib fields to RDC
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
  for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
  removed from RdcSmiLib.cc, and will be gotten from raslib.

Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6


[ROCm/rdc commit: 81ad23343c]
2020-12-01 10:56:36 -05:00
Bill(Shuzhou) Liu 4839075a86 Use relative path to find librdc_bootstrap.so
The python script will search list of the installation folders to
find the librdc_bootstrap.so.

Change-Id: I52e444e6d153c318c731c4b2cd0d8e39b0fd31ca


[ROCm/rdc commit: 4b3dbc4697]
2020-11-30 13:46:15 -05:00
Chris Freehill 79b5e54d3b Add event notification support and rdci timestamps
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc


[ROCm/rdc commit: b278cd379b]
2020-11-22 07:10:39 -05:00
Bill(Shuzhou) Liu 121f5617fe The collectd plugin for RDC
Two files are added to the python_binding folder:
* The rdc_collectd.py is a collectd plugin to store the RDC
  metrics to the collectd round robin database.
* The rdc_collectd.conf is a configure file which can control
  which fields to collect, how frequently the fields can be collected
  and run the plugin in embedded mode.

Change-Id: Ief44d004376ca8a82ed0d8ad36805243acb47080


[ROCm/rdc commit: bb6d98b036]
2020-11-10 14:26:49 -05:00
Bill(Shuzhou) Liu 1602a481cf Integrate RDC with Grafana
A new Grafana dashboard file rdc_grafana_dashboard_example.json
has been added to the folder python_binding. User can import
this dashboard to monitor multiple compute nodes.

To display the host name only in the dashboard, the
rdc_prometheus_example.yml is also changed to create a new label
short_instance which will not have the port number.

Change-Id: I9ab91838006d59c8dcb5fea01decb8c799484e1d


[ROCm/rdc commit: aeba7b0f91]
2020-10-15 14:12:15 -04:00
Bill(Shuzhou) Liu 753d5fed6d Support watch() and unwatch() in RDC module framework
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for library to
  include it.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
  It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.

Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562


[ROCm/rdc commit: 151520b97e]
2020-09-18 16:02:31 -04:00
Bill(Shuzhou) Liu b91560f0a8 RDC Prometheus plugin
The rdc_prometheus.py is a Prometheus plugin for RDC
The rdc_prometheus_example.yml and prometheus_targets.json are
example Prometheus configuration. If there are multiple compute
nodes, they can be defined at prometheus_targets.json.

Change-Id: I3611b1e8a166f6608351f6e7644808bf72a4d3a0


[ROCm/rdc commit: 9c7a1347ea]
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu 14c9c17292 RDC python binding
A new folder python_binding is created for RDC python binding:
* The rdc_bootstrap.py is a python ctypes wrapper for the librdc_boostrap.so
* The RdcUtil.py defines common utilities for RDC to manage group/fieldgroup
* The RdcReader.py is a class to simplify the usage of the RDC:
  - The user only needs to specify which fields he wants to monitoring.
     RdcReader will create groups and fieldgroups, watch the fields, and fetch the fields.
  - The RdcReader can support embedded and standalone mode.
  - The standalone can be with authentication and without authentication.
  - In standalone mode, the RdcReader can automatically reconnect to the rdcd when the connection is lost.
  - When rdcd is restarted, the previously created group and fieldgroup may lose.
    The RdcReader can re-create them and watch the fields after reconnect.
  - If the client is restarted, RdcReader can detect the groups and fieldgroups
    created before and avoid re-create them.
  - The user can pass the unit converter if he does not want to use RDC default unit.

Change-Id: I109ec86012f37162eb13f7d3e921115b7dd82369


[ROCm/rdc commit: 9209c6c516]
2020-08-17 14:09:37 -05:00