Background health check

Add the RdcSmiHealth module, which calls rocm_smi_lib.
It will support the following health checks:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The gRPC client-side and server-side health functions are added.
The health module is added to rdci.

At present, the XGMI and PCIe checks and part of the Memory check have been implemented.
The others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
limeng12
2024-10-23 16:42:24 +08:00
Committed by Meng, Li (Jassmine)
Parent f1428a8226
Commit 853d3b0cc5
26 changed files with 2260 additions and 3 deletions
@@ -163,6 +163,14 @@ class rdc_field_t(c_int):
     RDC_EVNT_NOTIF_PRE_RESET = 2002
     RDC_EVNT_NOTIF_POST_RESET = 2003
     RDC_EVNT_NOTIF_RING_HANG = 2004
+    RDC_HEALTH_XGMI_ERROR = 3000
+    RDC_HEALTH_PCIE_REPLAY_COUNT = 3001
+    RDC_HEALTH_RETIRED_PAGE_NUM = 3002
+    RDC_HEALTH_PENDING_PAGE_NUM = 3003
+    RDC_HEALTH_RETIRED_PAGE_LIMIT = 3004
+    RDC_HEALTH_UNCORRECTABLE_PAGE_LIMIT = 3005
+    RDC_HEALTH_POWER_THROTTLE_TIME = 3006
+    RDC_HEALTH_THERMAL_THROTTLE_TIME = 3007
 rdc_handle_t = c_void_p
 rdc_gpu_group_t = c_uint32
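
The new RDC_HEALTH_* field IDs line up with the health checks listed in the commit message. As a rough illustration only, the sketch below mirrors those eight values in a standalone enum and groups them by check. The names HealthField, HEALTH_COMPONENT, is_health_field and component_for are hypothetical and not part of RDC, and the grouping is an assumption inferred from the field names, not something this commit states.

```python
from enum import IntEnum


class HealthField(IntEnum):
    """Hypothetical standalone mirror of the RDC_HEALTH_* values in this hunk."""
    RDC_HEALTH_XGMI_ERROR = 3000
    RDC_HEALTH_PCIE_REPLAY_COUNT = 3001
    RDC_HEALTH_RETIRED_PAGE_NUM = 3002
    RDC_HEALTH_PENDING_PAGE_NUM = 3003
    RDC_HEALTH_RETIRED_PAGE_LIMIT = 3004
    RDC_HEALTH_UNCORRECTABLE_PAGE_LIMIT = 3005
    RDC_HEALTH_POWER_THROTTLE_TIME = 3006
    RDC_HEALTH_THERMAL_THROTTLE_TIME = 3007


# Assumed grouping of field IDs by health check, inferred from the names only.
HEALTH_COMPONENT = {
    HealthField.RDC_HEALTH_XGMI_ERROR: "XGMI",
    HealthField.RDC_HEALTH_PCIE_REPLAY_COUNT: "PCIe",
    HealthField.RDC_HEALTH_RETIRED_PAGE_NUM: "Memory",
    HealthField.RDC_HEALTH_PENDING_PAGE_NUM: "Memory",
    HealthField.RDC_HEALTH_RETIRED_PAGE_LIMIT: "Memory",
    HealthField.RDC_HEALTH_UNCORRECTABLE_PAGE_LIMIT: "Memory",
    HealthField.RDC_HEALTH_POWER_THROTTLE_TIME: "Power",
    HealthField.RDC_HEALTH_THERMAL_THROTTLE_TIME: "Thermal",
}


def is_health_field(field_id: int) -> bool:
    """Return True if field_id is one of the health field IDs above."""
    try:
        HealthField(field_id)
    except ValueError:
        return False
    return True


def component_for(field_id: int) -> str:
    """Return the assumed health check name for a health field ID."""
    return HEALTH_COMPONENT[HealthField(field_id)]
```

With this sketch, component_for(3001) yields the PCIe group, while an event-notification ID such as 2004 is rejected by is_health_field.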