Averaging happens very slowly and only confuses people...
Change-Id: I60754d3b896b6ffeb6104bb1c2fcc54e9869b331
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
SWDEV-475242
For the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
It seems that we do not have an equivalent to these pipe-utilizations on our hardware.
In rocprofiler, I think VALU Utilization is the closest to what we want.
Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.
At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
- Enable set and get for policy settings
- Enable register and clear policy events
Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
Remove occupancy metrics and replace with OccupancyPercent
Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE
Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Detcah the thread which handle shutdown signals instead of joining
thread can avoid the segfault issue on specific ASIC.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I74ac53c027ac370605caaa87115c83fd8027526a
Libasan is in gcc by default, thus building RDC with ASAN
enabled by GCC doesn't need -shared-libasan.
Change-Id: I8078f7ea5d46c6beea29c2823db3357a67f00b60
Signed-off-by: Li Ma <li.ma@amd.com>
Call the previously implemented get_rdcd_version and rdc_get_smiversion
Change-Id: If76037d462fa9328c3af8c85423ee4547882e36e
Signed-off-by: Chen Gong <curry.gong@amd.com>
For rdci, the version information of some components(such as RDCD),
cannot be obtained through the rdc_device_get_component_version API.
Here create a fake rdc API to get them.
Change-Id: I75d8bcd1993873cff209995b58362f75787a4598
Signed-off-by: Chen Gong <curry.gong@amd.com>
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.
Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
Prepare for adding 'detection version information' later
Change-Id: Ib2b5e70b2360b1c5ff87a537f41f34f23c7ed61f
Signed-off-by: Chen Gong <curry.gong@amd.com>
Want to display version information along with the hash value.
Change-Id: I0f9ad576f8f66747ce2e84d4f524ccd16d399927
Signed-off-by: Chen Gong <curry.gong@amd.com>
This fixes issues like:
1.
/run/lock/rdcd.lock: Bad file descriptor
Failed to determine owner of lock file.: Numerical result out of range
2.
rdc.service: Failed to determine group credentials: No such process
rdc.service: Failed at step GROUP spawning rdcd: No such process
Change-Id: I0ef5eb6ab72d036a3ea8dcb81f7a9108d279f7d6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>