Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support the following health checks:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The gRPC client-side and server-side health functions are added.
The health module is added to rdci.
At present, the XGMI and PCIE checks and part of the Memory check have been implemented.
The others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
[ROCm/rdc commit: 853d3b0cc5]
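As a rough illustration of what a check such as "PCIE replay count detected" could look like, the sketch below compares two samples of a replay counter and reports an incident when the counter grew by more than a threshold. The names (HealthIncident, check_pcie_replay) and the threshold are hypothetical and are not taken from RdcSmiHealth or rocm_smi_lib.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HealthIncident:
    """Hypothetical record for one failed health check."""
    gpu_index: int
    component: str
    message: str


def check_pcie_replay(gpu_index: int, previous: int, current: int,
                      max_new_replays: int = 0) -> Optional[HealthIncident]:
    """Return an incident if the replay counter grew more than allowed.

    `previous` and `current` are two samples of a monotonically increasing
    PCIe replay counter (assumed here to come from the SMI library).
    """
    new_replays = current - previous
    if new_replays > max_new_replays:
        return HealthIncident(
            gpu_index=gpu_index,
            component="PCIE",
            message=f"{new_replays} new PCIe replays detected",
        )
    return None
```

A caller would sample the counter periodically and collect any returned incidents per GPU; the other checks in the list could follow the same shape.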
- Enable set and get for policy settings
- Enable register and clear policy events
Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
[ROCm/rdc commit: d8fec06bab]
Implemented memory activity and added a new field ID,
RDC_FI_GPU_MEMORY_ACTIVITY.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
[ROCm/rdc commit: 4bd31b605a]
Implemented only DEC activity for now because ENC activity is unavailable in
amdsmi.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
[ROCm/rdc commit: b17abf93fa]
Detach the thread that handles shutdown signals instead of joining it.
This avoids the segfault issue on specific ASICs.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I74ac53c027ac370605caaa87115c83fd8027526a
[ROCm/rdc commit: ca569346a3]
GCC links libasan by default when ASAN is enabled, so building RDC with
ASAN under GCC doesn't need -shared-libasan.
Change-Id: I8078f7ea5d46c6beea29c2823db3357a67f00b60
Signed-off-by: Li Ma <li.ma@amd.com>
[ROCm/rdc commit: 183c65c8b2]
For rdci, the version information of some components (such as RDCD)
cannot be obtained through the rdc_device_get_component_version API.
Create a fake rdc API here to get it.
Change-Id: I75d8bcd1993873cff209995b58362f75787a4598
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 8db404f84f]
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.
Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 5a3fd9fbc1]
Prepare for adding 'detection version information' later
Change-Id: Ib2b5e70b2360b1c5ff87a537f41f34f23c7ed61f
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 45c6d0b03b]
Display version information along with the hash value.
Change-Id: I0f9ad576f8f66747ce2e84d4f524ccd16d399927
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: ac874d3921]
This fixes issues like:
1. /run/lock/rdcd.lock: Bad file descriptor
   Failed to determine owner of lock file.: Numerical result out of range
2. rdc.service: Failed to determine group credentials: No such process
   rdc.service: Failed at step GROUP spawning rdcd: No such process
Change-Id: I0ef5eb6ab72d036a3ea8dcb81f7a9108d279f7d6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: c015f0fcaa]
Previously we assumed the "render" group was enough and only used "video" as a
fallback. On some systems, such as SLES, this might not be sufficient.
One issue happened when starting rocprofiler as part of RDC
initialization:
what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES
The issue only happened when RDC was started with systemd.
It turned out that the "rdc" user (under which systemd starts RDC) was only in
the render group, not the video group. Adding the video group solved the issue.
Change-Id: Idf6a9521ae72a0b28a428869aa7ab1edde3ae259
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 4ebc34095c]
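To check whether a service account is in both groups, a minimal sketch along the following lines could be used. The helper names (missing_groups, groups_of) are hypothetical and are not part of RDC; only the group names "render" and "video" and the user name "rdc" come from the commit message above.

```python
import grp
import pwd

REQUIRED_GROUPS = ("render", "video")


def missing_groups(user_groups, required=REQUIRED_GROUPS):
    """Return the required groups that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def groups_of(user):
    """List the names of all groups a local user belongs to.

    Includes the user's primary group plus every group that lists the
    user as a supplementary member. Raises KeyError if the user is unknown.
    """
    primary_gid = pwd.getpwnam(user).pw_gid
    names = {g.gr_name for g in grp.getgrall() if user in g.gr_mem}
    try:
        names.add(grp.getgrgid(primary_gid).gr_name)
    except KeyError:
        pass  # the primary GID has no named group entry
    return sorted(names)
```

On a system showing the problem described above, missing_groups(groups_of("rdc")) would be expected to report "video" as missing.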
When RDC is built with this configure option:
-DBUILD_STANDALONE=OFF
the following error occurs:
CMake Error at rdc_libs/CMakeLists.txt:106 (export):
export given target "rdc_client" which is not built by this project.
Resolve this by using a conditional.
Change-Id: I3f6bb2946c609c7db9fc38015b7d9c8ae766f3a0
Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 6762a6dd8b]
The RAS plugin loaded rocm-smi, which conflicts with the amd-smi library.
The main source of grief was the map 'devInfoTypesStrings', which is defined
in both rocm-smi and amd-smi.
We assume that rocm-smi gets lazy-loaded by the RAS library and
overwrites symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains a different number of elements, and the enums are also different.
RDC relies on amd-smi's enums.
One such enum is kDevGpuMetrics:
rocm-smi: kDevGpuMetrics = 68
amd-smi: kDevGpuMetrics = 75
Example of overlapping map definitions:
$ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
00000000003c4980 g DO .data.rel.ro 0000000000000008 Base devInfoTypesStrings
00000000003db830 g DO .bss 0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
$ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so | grep devInfoTypesStrings
00000000003dc590 g DO .bss 0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
00000000003c9c68 g DO .data.rel.ro 0000000000000008 Base devInfoTypesStrings
Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: d85657e5f2]
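The effect of the enum mismatch can be modeled with a small sketch. It is only a simulation of the lookup problem, not the actual rocm-smi or amd-smi code: the two kDevGpuMetrics values come from the commit message above, while the table sizes, the placeholder entry names, and the helper functions are made up for illustration.

```python
# Enum values quoted in the commit message above.
ROCM_SMI_K_DEV_GPU_METRICS = 68
AMD_SMI_K_DEV_GPU_METRICS = 75


def build_table(k_dev_gpu_metrics, size):
    """Build a stand-in for devInfoTypesStrings keyed by enum value.

    `size` is a hypothetical element count; the real maps differ in size,
    but the exact counts are not stated in the commit message.
    """
    table = {i: f"entry_{i}" for i in range(size)}  # placeholder names
    table[k_dev_gpu_metrics] = "gpu_metrics"
    return table


def lookup(table, key):
    """Look up an enum value, returning a marker if the key is absent."""
    return table.get(key, "<missing>")


amd_table = build_table(AMD_SMI_K_DEV_GPU_METRICS, size=80)
rocm_table = build_table(ROCM_SMI_K_DEV_GPU_METRICS, size=70)

# A caller built against amd-smi's enums always asks for key 75.
# If the symbol resolves to the rocm-smi table, the lookup no longer
# lands on the GPU metrics entry.
right = lookup(amd_table, AMD_SMI_K_DEV_GPU_METRICS)
wrong = lookup(rocm_table, AMD_SMI_K_DEV_GPU_METRICS)
```

In this model `right` is the GPU metrics entry and `wrong` is not, which mirrors why RDC must not have rocm-smi's definition override the one from amd-smi.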
- Replace non-working fields with working ones
- Remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
  - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
  - ROCP_METRICS is only relevant for rocprofv1 (which we are using)
Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 20ca2ce574]