* Implement CPU discovery support
SWDEV-482949:
enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:
1 GPUs found.
-----------------------------------------------------------------
GPU Index Device Information
0 AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index Device Information
0 AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------
Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
* CMAKE - Add required version for amdsmi
Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
---------
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 3bdca8b8b6]
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.
At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
[ROCm/rdc commit: 853d3b0cc5]
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.
Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 5a3fd9fbc1]
NOTE: RVS Build is disabled by default due to CI build issues.
Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: eaa1862a80]
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.
It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.
The grpc client and server side diagnostics function is added.
The diag module is added to the rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
[ROCm/rdc commit: 76ccf58008]
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module which can load the
library dynamically and then dispatch diagnostic test to run.
Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
[ROCm/rdc commit: eab3625d65]
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
[ROCm/rdc commit: 5950ebadc4]
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.
Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.
Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.
Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
[ROCm/rdc commit: a547dc7efd]
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon
Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon
Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.
Change-Id: I066091423146dea180c16af212688ed43dc44611
[ROCm/rdc commit: 7ee29b6cdd]
The rdc.h is modified for new discovery and grouping APIs.
The RdcGroupSettingsImpl.cc is added to implement the GPU group and
the field group management.
The RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only support power, memory, GPU utilization,
temperature, GPU clock, total device and device name.
A new example field_value_example.cc is added to demo how to record
the fields and retrieve data from cache.
Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8
[ROCm/rdc commit: a5f063f8b3]
The rdc.h is the only header file will be provided to the user.
The inital version only includes the data structure and function
required for the job stats example.
The example folder has one example demonstrated how to use the API
to collect the job summary stats.
The RdcBootStrap.cc will dynamically load different libraries when user
select either the standalone or embbed mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.
In order to run the example after build, it needs to specify the
LD_LIBRARY_PATH. Assume current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats
The folder is structured in following ways:
example
include
- rdc - rdc.h (the only header file exposed to the user)
- rdc_libs
- impl
rdc_libs
- boostrap
- src
- rdc
- src
- rdc_client
- src
- rdc_server
- src
Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959
[ROCm/rdc commit: 85006053ed]