7 Commits

Author SHA1 Message Date
Galantsev, Dmitrii 703d6c0d44 Use templates for module population
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.

Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: f9e80cc37a]
2024-01-10 00:27:09 -06:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Chris Freehill 6b246dcf4b rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
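
As a rough illustration of the two ideas above (a typed enum instead of a raw uint32_t, plus one central table of field data), a minimal sketch could look like the following. All names here (example_field_t, ExampleFieldInfo, kExampleFields and the helper functions) are hypothetical and are not taken from rdc.h.

```cpp
#include <string>

// Hypothetical field identifiers; the real rdc_field_t enumerators are not shown here.
typedef enum {
  EXAMPLE_FI_INVALID = 0,
  EXAMPLE_FI_GPU_CLOCK,
  EXAMPLE_FI_POWER_USAGE,
} example_field_t;

// Everything known about a field lives in one record.
struct ExampleFieldInfo {
  example_field_t id;
  const char* name;
  const char* description;
};

// Single central table: adding a field means adding one enumerator and one row.
static const ExampleFieldInfo kExampleFields[] = {
    {EXAMPLE_FI_GPU_CLOCK, "GPU_CLOCK", "Current GPU clock"},
    {EXAMPLE_FI_POWER_USAGE, "POWER_USAGE", "Current power usage"},
};

// Look up a field record; returns nullptr for identifiers not in the table.
const ExampleFieldInfo* example_find_field(example_field_t id) {
  for (const auto& field : kExampleFields) {
    if (field.id == id) return &field;
  }
  return nullptr;
}

// Taking example_field_t instead of uint32_t lets the compiler reject
// arbitrary integers passed in without an explicit cast.
std::string example_field_name(example_field_t id) {
  const ExampleFieldInfo* info = example_find_field(id);
  return info != nullptr ? info->name : "UNKNOWN";
}
```

With this layout, code that needs a field's name or description reads it from the table instead of keeping its own copy.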

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec


[ROCm/rdc commit: 5950ebadc4]
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, GPUs can be added in the same command.
Add support for memory temperature.
Add support for memory clock.
Add support for reporting ECC errors.
Add support for reporting PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in RdcMetricFetcherImpl.
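
The async fetch described above could be sketched roughly as follows, assuming std::async is an acceptable stand-in for whatever mechanism RdcMetricFetcherImpl actually uses. ExampleAsyncFetcher and example_slow_read_throughput are hypothetical names, and the slow read is simulated with a short sleep rather than a real rocm_smi_lib call.

```cpp
#include <chrono>
#include <cstdint>
#include <future>
#include <thread>

// Hypothetical stand-in for a slow throughput query (not a rocm_smi_lib call).
uint64_t example_slow_read_throughput() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return 1234;
}

class ExampleAsyncFetcher {
 public:
  // Start the slow read on a background thread and return immediately.
  void start() {
    pending_ = std::async(std::launch::async, example_slow_read_throughput);
  }

  // Non-blocking check: returns true and stores the value once it is ready.
  bool poll(uint64_t* out) {
    if (!pending_.valid()) return false;  // nothing in flight
    if (pending_.wait_for(std::chrono::seconds(0)) !=
        std::future_status::ready) {
      return false;  // still running
    }
    *out = pending_.get();  // consumes the result; valid() becomes false
    return true;
  }

 private:
  std::future<uint64_t> pending_;
};
```

The caller can keep sampling fast metrics and call poll() on each pass, so the slow query never stalls the main fetch loop.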

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0813e7052f Implement the rdc_lib API to support the job stats
Add functions to start and stop job recording.
Add a function to get the job stats for each GPU and a summary across multiple GPUs.
Add a function to remove jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to gRPC's GRPC_VERBOSITY. When customers run into issues, they can enable verbose
logging to help us troubleshoot.
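
A minimal sketch of an environment-controlled log level, in the spirit of the RdcLogger described above, might look like this. The level names (DEBUG, INFO, ERROR) mirror GRPC_VERBOSITY and are an assumption; the values RDC_LOG actually accepts are not stated here, and all Example* identifiers are hypothetical.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical levels, ordered from least to most verbose.
enum class ExampleLogLevel { kError = 0, kInfo = 1, kDebug = 2 };

// Map a textual level to an enum value; unknown or missing values
// fall back to the quietest level.
ExampleLogLevel example_parse_level(const char* value) {
  if (value == nullptr) return ExampleLogLevel::kError;
  const std::string text(value);
  if (text == "DEBUG") return ExampleLogLevel::kDebug;
  if (text == "INFO") return ExampleLogLevel::kInfo;
  return ExampleLogLevel::kError;
}

// Read the configured level from the RDC_LOG environment variable.
ExampleLogLevel example_level_from_env() {
  return example_parse_level(std::getenv("RDC_LOG"));
}

// A message is emitted only if it is no more verbose than the configured level.
bool example_should_log(ExampleLogLevel configured, ExampleLogLevel message) {
  return static_cast<int>(message) <= static_cast<int>(configured);
}
```

Under these assumptions, running a program with RDC_LOG=DEBUG in its environment would enable the most verbose output without rebuilding.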

Add -u support to the rdci group, fieldgroup and dmon commands for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769


[ROCm/rdc commit: a547dc7efd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 7f7cf5c1db Support discovery and group management in rdc_lib
rdc.h is modified for the new discovery and grouping APIs.

RdcGroupSettingsImpl.cc is added to implement GPU group and
field group management.

RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only power, memory, GPU utilization,
temperature, GPU clock, total device and device name are supported.

A new example, field_value_example.cc, is added to demonstrate how to record
the fields and retrieve data from the cache.

Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8


[ROCm/rdc commit: a5f063f8b3]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5b27d846b2 Create the rdc.h header file and librdc_bootstrap.so
rdc.h is the only header file that will be provided to the user.
The initial version only includes the data structures and functions
required for the job stats example.

The example folder has one example demonstrating how to use the API
to collect the job summary stats.

RdcBootStrap.cc will dynamically load different libraries when the user
selects either the standalone or the embedded mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.
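
The mode-dependent loading described above can be sketched with POSIX dlopen. This is only an illustration under stated assumptions: librdc.so is the library named in this message for the embedded handler, while librdc_client.so as the standalone-mode library is an assumption, and ExampleMode, example_library_for and example_load are hypothetical names, not the contents of RdcBootStrap.cc.

```cpp
#include <dlfcn.h>

#include <string>

enum class ExampleMode { kStandalone, kEmbedded };

// Pick the shared library for the requested mode.
// "librdc_client.so" is an assumed name used for illustration only.
std::string example_library_for(ExampleMode mode) {
  return mode == ExampleMode::kEmbedded ? "librdc.so" : "librdc_client.so";
}

// Load the library at run time. Returns the dlopen handle, or nullptr
// if the library cannot be found (dlerror() then describes the failure).
void* example_load(ExampleMode mode) {
  return dlopen(example_library_for(mode).c_str(), RTLD_LAZY);
}
```

Because the library is resolved at run time, the loader searches LD_LIBRARY_PATH for it.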

To run the example after the build, LD_LIBRARY_PATH must be
specified. Assuming the current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats

The folder is structured as follows:
example
include
    - rdc - rdc.h (the only header file exposed to the user)
    - rdc_libs
          - impl
rdc_libs
    - bootstrap
         - src
    - rdc
         - src
    - rdc_client
         - src
    - rdc_server
         - src

Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959


[ROCm/rdc commit: 85006053ed]
2020-08-17 14:07:25 -05:00