Wykres commitów

23 Commity

Autor SHA1 Wiadomość Data
Bill(Shuzhou) Liu 6b700f8005 Support GPU memory test and compute queue test using Rocr
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.

Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9


[ROCm/rdc commit: 78e2f2486b]
2021-10-21 11:01:12 -04:00
Bill(Shuzhou) Liu fa9c6ad6f8 Add the RdcSmiDiagnostic module
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.

It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.

The grpc client and server side diagnostics function is added.

The diag module is added to the rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd


[ROCm/rdc commit: 76ccf58008]
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu 7d7a5bfd1c Disable bulk fetch. Add environment variable to enable it
RDC can optimize by bulk fetching multiple metrics using a single
rocm_smi call. However, currently this is not completely supported in
all ASIC generations. By default disable this for now.

Set environment variable RDC_BULK_FETCH_ENABLED=TRUE to enable
RDC bulk fetch.

BUG: SWDEV-289316

Change-Id: Ibb55514f198356dccf5f47bb0fd2d53c17acb251


[ROCm/rdc commit: 673f5a4ee1]
2021-06-09 15:53:17 -04:00
Bill(Shuzhou) Liu f504f697e3 The Diagnostic API interface
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module which can load the
library dynamically and then dispatch diagnostic test to run.

Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb


[ROCm/rdc commit: eab3625d65]
2021-05-27 10:59:11 -04:00
Bill(Shuzhou) Liu f41c146bc4 Bulk fetch metrics from rocm_smi_lib
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.

If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it will fallback to fetch them one by one.

Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1


[ROCm/rdc commit: 51efe26442]
2021-01-07 16:40:37 -05:00
Chris Freehill 79b5e54d3b Add event notification support and rdci timestamps
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc


[ROCm/rdc commit: b278cd379b]
2020-11-22 07:10:39 -05:00
Bill(Shuzhou) Liu 753d5fed6d Support watch() and unwatch() in RDC module framework
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for library to
  include it.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
  It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.

Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562


[ROCm/rdc commit: 151520b97e]
2020-09-18 16:02:31 -04:00
Bill(Shuzhou) Liu a88fc1829c Add new init and destroy API for RAS library.
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();

When RDC load the librdc_ras.so, it will call rdc_module_init().
When RDC exit, it will call rdc_module_destroy()

Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca


[ROCm/rdc commit: 72691cc024]
2020-09-09 14:01:03 -04:00
Bill(Shuzhou) Liu 32e91cf8d2 RDC module framework
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.

* The RdcModuleMgr will be used to manage different modules. RDC
  only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
  It will also call rdc_telemetry_fields_query() defined in the RAS
  library for the list of fields RAS supported.
* The RdcSmiLib is a wrapper for the rocm_msi_lib to provide the
  interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
  RDCTelemetryModule to bulk fetch mulitple fields.
* The RdcTelemetryModule will dispatch those fields to different
  library: RdcSmiLib or RdcRasLib.

The watch() and unwatch() in the RDCTelemetryModule will been implemented
at the next task.

Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e


[ROCm/rdc commit: ba35cdcfe2]
2020-09-02 14:46:40 -04:00
Chris Freehill 17430dde45 Add event counter support
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.

Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.

Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
  a field should be hidden or not, by default. This will
  allow us to support many fields, even those that are not
  typically of interest (but sometimes may be), without
  confusing the user or unnecessary clutter.
* added a --list-all option which lists all available field
  including the more obscure fields.

Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83


[ROCm/rdc commit: d7c9625fc6]
2020-08-17 19:45:18 -04:00
Chris Freehill 6b246dcf4b rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec


[ROCm/rdc commit: 5950ebadc4]
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu 588ea96dd2 Support standard deviation and json output for job stats
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.

A new option --json is added to the rdci to output the results
in json format.

In the job stats, using the GMT time instead of timestamp
for start and end time.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca


[ROCm/rdc commit: e6d910f67a]
2020-08-17 14:09:37 -05:00
Chris Freehill b94b3489c7 Make job id char array const in rdc api
Also make adjustments to packaging.

Change-Id: I73cc18ce67f833ff563cb1488b000b69b315979a


[ROCm/rdc commit: 8e4d1e7f33]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0f8f345992 Add support to use the field name in rdci
In the rdci dmon and fieldgroup, now the fields can be specified
using either number id or the field name.

When the rdc is async fetching metrics, it will not report that fetch
as an error.

Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4


[ROCm/rdc commit: d30cb81fdb]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dc48d8c977 Implement the gRPC APIs for the job stats
Add the job stats APIs in the rdc_api_service at the server side rdcd
Add the job stats APIs for the RdcStandaloneHandler at the client side
Make the load librdc.so and librdc_client.so thread safe.
Impelement async update all fields in RdcEmbeddedHandler.

Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce


[ROCm/rdc commit: fe3e75edfa]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0813e7052f Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769


[ROCm/rdc commit: a547dc7efd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu ce4890f88c Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611


[ROCm/rdc commit: 7ee29b6cdd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0a20efdbf3 Add SSL mutual authentication support for rdci
The RDC API is changed to pass the certificates to the gRPC.

Add the support to add all GPUs in the host to a group. Also before
add a GPU to a group, the RDC API will verify that GPU exists or not.

Add the support to fetch the temperature metrics.

Change-Id: I5857ef03fede233d16e8b2836be120f33172da93


[ROCm/rdc commit: 66e4e790c3]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 199f085ce3 SWDEV-209060 - Create the Skeleton RDC CLI and daemon
Create the skeleton implementation of rdc_client.so and rdci. Modify current rdcd to
integrate the RDC API service:

rdc.proto is changed to add a new RdcAPI service which defined the interfaces for the RDC API.

RdcStandaloneHandler.cpp is added to send the request using gRPC to the rdcd. It is built into
the rdc_client.so

rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement skeleton rdci.
Currently, the discovery subsystem is supported.

rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so which should also be
removed in the future.

Add the instruction how to run the rdcd and rdci in the build folder in the README.md.

Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c


[ROCm/rdc commit: 020f6939f7]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dd0ef78c56 SWDEV-223878 - Add cache manager and watch table to skeleton rdc_lib
Support cache manager and watch table in rdc_lib

RdcCacheManagerImpl.cc is added to implement cache of metrics. Currently, only
integer mertics are supported. The cache manager provids function to retrieve the
latest and history metrics from cache. It also provides interfaces to update and evict the cache.

RdcWatchTableImpl.cc is added to implement watch and unwatch fields. It uses the
field settings to control how frequently a field needs to be updated. We have a preliminarily
performance optimization for this class as it may be called very frequently.

RdcMetricsUpdaterImpl.cc is added to run the update at background thread when
RDC_OPERATION_MODE_AUTO is set.

After this code change, the rdcd/rdci should be able to implement basic discovery, group and dmon
function. The job management function is not implemented in the skeleton rdc_lib yet.

Change-Id: I26cff8c2ec85d1ad8e7df24c66b02f0060838d37


[ROCm/rdc commit: 1ff1c7b617]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 7f7cf5c1db Support discovery and group management in rdc_lib
The rdc.h is modified for new discovery and grouping APIs.

The RdcGroupSettingsImpl.cc is added to implement the GPU group and
the field group management.

The RdcMetricFetcherImpl.cc is added to fetch the metrics from
rocm_smi_lib. Currently, only support power, memory, GPU utilization,
temperature, GPU clock, total device and device name.

A new example field_value_example.cc is added to demo how to record
the fields and retrieve data from cache.

Change-Id: I57acfa048fe9b3d848e2d441e768b3a63ccae3f8


[ROCm/rdc commit: a5f063f8b3]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5b27d846b2 Create the rdc.h header file and librdc_bootstrap.so
The rdc.h is the only header file will be provided to the user.
The inital version only includes the data structure and function
required for the job stats example.

The example folder has one example demonstrated how to use the API
to collect the job summary stats.

The RdcBootStrap.cc will dynamically load different libraries when user
select either the standalone or embbed mode. We also created a
dummy RdcEmbeddedHandler.cc for librdc.so.

In order to run the example after build, it needs to specify the
LD_LIBRARY_PATH. Assume current folder is the build folder:
LD_LIBRARY_PATH=$PWD/rdc_libs $PWD/example/jobstats

The folder is structured in following ways:
example
include
    - rdc - rdc.h (the only header file exposed to the user)
    - rdc_libs
          - impl
rdc_libs
    - boostrap
         - src
    - rdc
         - src
    - rdc_client
         - src
    - rdc_server
         - src

Change-Id: Ia386ddf4cabcb2dc4fe82de6464ca0619cb3d959


[ROCm/rdc commit: 85006053ed]
2020-08-17 14:07:25 -05:00