* Implement CPU discovery support
SWDEV-482949:
enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:
1 GPUs found.
-----------------------------------------------------------------
GPU Index Device Information
0 AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index Device Information
0 AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------
Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
* CMAKE - Add required version for amdsmi
Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
---------
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 3bdca8b8b6]
Memory check:get the threshold of retired page number
EEPROM check:read and verify the checksum
Power/Thermal check: power/thermal throttle status counter
Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e
[ROCm/rdc commit: 016a1d9d39]
1. For temperature the unit in milli Celsius
2. For power the unit in microwatts.
3. Fix second register call to rdcd doesn't functional because start flag
Co-authored-by: Chao Fei <chao.fei@amd.com>
[ROCm/rdc commit: bd7d7c99c1]
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.
At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
[ROCm/rdc commit: 853d3b0cc5]
- Enable set and get for policy settings
- Enable register and clear policy events
Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
[ROCm/rdc commit: d8fec06bab]
Prepare for adding 'detection version information' later
Change-Id: Ib2b5e70b2360b1c5ff87a537f41f34f23c7ed61f
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 45c6d0b03b]
NOTE: RVS Build is disabled by default due to CI build issues.
Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: eaa1862a80]
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.
Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9
[ROCm/rdc commit: 78e2f2486b]
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.
It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.
The grpc client and server side diagnostics function is added.
The diag module is added to the rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
[ROCm/rdc commit: 76ccf58008]
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.
If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it will fallback to fetch them one by one.
Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1
[ROCm/rdc commit: 51efe26442]
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
[ROCm/rdc commit: b278cd379b]
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.
Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.
Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
a field should be hidden or not, by default. This will
allow us to support many fields, even those that are not
typically of interest (but sometimes may be), without
confusing the user or unnecessary clutter.
* added a --list-all option which lists all available field
including the more obscure fields.
Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
[ROCm/rdc commit: d7c9625fc6]
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
[ROCm/rdc commit: 5950ebadc4]
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.
A new option --json is added to the rdci to output the results
in json format.
In the job stats, using the GMT time instead of timestamp
for start and end time.
Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca
[ROCm/rdc commit: e6d910f67a]
In the rdci dmon and fieldgroup, now the fields can be specified
using either number id or the field name.
When the rdc is async fetching metrics, it will not report that fetch
as an error.
Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4
[ROCm/rdc commit: d30cb81fdb]
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.
Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.
Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
[ROCm/rdc commit: f4a3fd4dda]
Add support for the stats subsystem in rdci
Modify the dmon system to handle the case when no GPUs in a group
Change-Id: I5a18e1201d24b5318b8e324a77551a757b108f25
[ROCm/rdc commit: 096dc2dadb]
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.
Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.
Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.
Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
[ROCm/rdc commit: a547dc7efd]
Add the support for rdci subsystem group create, delete and query
Add the support for rdci subsystem fieldgroup create, delete and query
Add the support for rdci dmon system. The dmon system may show the stats every
a few seconds until press Ctrl-C. To cleanup the resources (for example, unwatch),
a signal handler is added.
Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c
[ROCm/rdc commit: 16bce67835]
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon
Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon
Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.
Change-Id: I066091423146dea180c16af212688ed43dc44611
[ROCm/rdc commit: 7ee29b6cdd]