A new diagnostic module, librdc_rocr.so, is created. The
module uses ROCr to test memory allocation, memory access,
and compute queue readiness.
Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9
[ROCm/rdc commit: 78e2f2486b]
Needed to work around a Debian packaging bug that occurs when debug
information is produced in a separate package.
Signed-off-by: Icarus Sparry <icarus.sparry@amd.com>
Change-Id: Ieab3cc3515eeeb952159acea3dc1effd14613eeb
[ROCm/rdc commit: 2a1a002f74]
RDC overrides LD_LIBRARY_PATH to force use of the current grpc
path. This change also adds the original LD_LIBRARY_PATH value to it.
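The path handling described above can be sketched as follows. This is only an illustration, not the actual RDC code: the helper name build_library_path is hypothetical, and placing the grpc directory first is an assumption based on the intent to force the bundled grpc.

```python
import os


def build_library_path(grpc_lib_dir, original=None):
    """Return a search path with the grpc directory first, followed by
    whatever LD_LIBRARY_PATH held before (hypothetical helper)."""
    if original is None:
        original = os.environ.get("LD_LIBRARY_PATH", "")
    # Drop empty entries so the result never contains a stray ':'.
    kept = [entry for entry in original.split(":") if entry]
    return ":".join([grpc_lib_dir] + kept)
```

For example, build_library_path("/path/to/grpc/lib", "/usr/local/lib") keeps /usr/local/lib searchable after the grpc directory instead of discarding it.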
Change-Id: I48da84c3135c6ede129c3cb9148dbb1896b652c3
[ROCm/rdc commit: bd034263d4]
Provides an RdcSmiDiagnostic module, which calls rocm_smi_lib.
It supports the following diagnostics: get GPU topology, check GPU
parameters, and check processes running on the GPUs.
The grpc client-side and server-side diagnostic functions are added.
The diag module is added to rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
[ROCm/rdc commit: 76ccf58008]
rdcd uses a separate thread to listen for GPU events. That thread
runs in a tight loop, which consumes 100% CPU.
The fix adds a sleep to yield the CPU.
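The idea behind the fix can be sketched in Python. This is an assumption-laden illustration rather than the rdcd implementation: listen_for_events, poll_once, and the 10 ms idle sleep are all hypothetical.

```python
import threading
import time


def listen_for_events(poll_once, stop_event, idle_sleep_s=0.01):
    """Poll until stop_event is set; sleep whenever nothing was found.

    poll_once is a hypothetical callable returning True when an event
    was handled. Without the sleep, an idle loop spins at 100% CPU.
    """
    handled = 0
    while not stop_event.is_set():
        if poll_once():
            handled += 1
        else:
            # Yield the CPU instead of immediately polling again.
            time.sleep(idle_sleep_s)
    return handled
```

A longer sleep lowers CPU usage further at the cost of event latency, so the interval is a trade-off.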
Bug: SWDEV-291576
Change-Id: I7996720aab4a80346d79b1c73ee532d2abcd93cc
[ROCm/rdc commit: 5a4bf97327]
RDC can optimize by bulk fetching multiple metrics using a single
rocm_smi call. However, this is not yet fully supported on
all ASIC generations, so it is disabled by default for now.
Set environment variable RDC_BULK_FETCH_ENABLED=TRUE to enable
RDC bulk fetch.
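A minimal sketch of such an opt-in switch follows. The function name is hypothetical, and the exact-match comparison against "TRUE" is an assumption: the text above only states that this value enables the feature.

```python
import os


def bulk_fetch_enabled(environ=None):
    """Return True only when RDC_BULK_FETCH_ENABLED is set to TRUE.

    Assumption: the comparison is exact; how RDC treats other
    spellings such as "true" or "1" is not stated in the source.
    """
    env = os.environ if environ is None else environ
    return env.get("RDC_BULK_FETCH_ENABLED", "") == "TRUE"
```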
BUG: SWDEV-289316
Change-Id: Ibb55514f198356dccf5f47bb0fd2d53c17acb251
[ROCm/rdc commit: 673f5a4ee1]
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module, which can load the
library dynamically and then dispatch diagnostic tests to run.
Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
[ROCm/rdc commit: eab3625d65]
RDC tries to bulk fetch power usage from gpu_metrics. If the
gpu_metrics value is 0, it falls back to rsmi_dev_power_ave_get().
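The fallback can be illustrated as below. This is a sketch under stated assumptions: read_power and the "power" dictionary key are hypothetical, and fetch_average_power merely stands in for a call to rsmi_dev_power_ave_get().

```python
def read_power(bulk_metrics, fetch_average_power):
    """Prefer the bulk-fetched power value; fall back when it is 0.

    bulk_metrics is a hypothetical dict of values taken from
    gpu_metrics; fetch_average_power is a stand-in callable for
    rsmi_dev_power_ave_get().
    """
    power = bulk_metrics.get("power", 0)
    if power == 0:
        # A zero reading is treated as "not available".
        return fetch_average_power()
    return power
```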
Change-Id: I57d165d6af0c91b39798c89eef317d4e5df2d0f6
[ROCm/rdc commit: eafb948115]
Fix lintian errors related to maintainer, postinst script and
permissions.
Change-Id: I6924ff92ff5453fa7e562a6188c2c91cea87df68
[ROCm/rdc commit: 7a05145542]
Create a folder for the prebuilt raslib, which contains the RAS binary
and configuration files. CMakeLists.txt is changed to include
those files.
Change-Id: I530198cff5686a19e58096c87457ab8b7c52d5f3
[ROCm/rdc commit: 3aa95b210a]
When the above option is used, the plugin returns an error:
result = rdc.rdc_group_gpu_add(rdc_handle, gpu_group_id, gpu)
ctypes.ArgumentError: argument 3: <type 'exceptions.TypeError'>: wrong type
rdc_prometheus.py is changed to convert the string to an integer.
RdcUtil.py is also changed to raise exceptions properly.
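The kind of conversion involved can be sketched as follows. The helper name and the accepted input shapes are assumptions; the actual change in rdc_prometheus.py may differ.

```python
def parse_gpu_indexes(raw):
    """Convert GPU indexes given as text into integers.

    Accepts a list such as ["0", "1"] or a comma-separated string such
    as "0,1" (both input shapes are assumptions for illustration).
    ctypes rejects str arguments where an integer is expected, so the
    conversion has to happen before calling into the library.
    """
    if isinstance(raw, str):
        raw = raw.split(",")
    return [int(item) for item in raw]
```

An invalid value such as "abc" raises ValueError here, which surfaces the problem before the ctypes call instead of inside it.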
Change-Id: I9535091ff1fc8882cccd32e5f2810da5241768c3
[ROCm/rdc commit: 7ca7a571a7]
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.
To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.
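The same scope-limited pattern can be sketched in Python with a context manager. This is an analogy to the ScopedCapability struct, not its implementation: scoped_capability and the enable/disable callables are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def scoped_capability(enable, disable):
    """Grant a capability for the duration of a with-block only.

    enable and disable are hypothetical callables. The finally clause
    guarantees the capability is dropped even if the body raises.
    """
    enable()
    try:
        yield
    finally:
        disable()
```

Any configuration code placed inside the with-block runs with the extra access, and the access is revoked on every exit path.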
Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f
[ROCm/rdc commit: 6b5aeaaa23]
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.
Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16
[ROCm/rdc commit: a9d0e037b5]
RDC may install third-party libraries in its local folder. We
need to disable AutoProv to prevent rpm from adding Provides
entries for them.
Change-Id: I18d008c2ca2ecb6bba64467d78b1f8c3a6585aea
[ROCm/rdc commit: b8746f7fc0]
Install the grpc libraries to rdc/grpc/lib and add the missing libraries.
Add "--no-as-needed" and all extra grpc libraries to rdci/rdcd, as
RUNPATH is only searched for direct dependencies.
Change-Id: I596acb2eb3a7228d703e79db64699bc20d0e7c09
[ROCm/rdc commit: 07d4d5376e]
RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.
If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it falls back to fetching them one by one.
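A minimal sketch of this wrapper logic follows, assuming hypothetical bulk_fetch and fetch_one callables; it is not the RDC wrapper itself.

```python
def fetch_fields(fields, bulk_fetch, fetch_one):
    """Fetch all fields, using bulk fetch where possible.

    bulk_fetch (hypothetical) returns a dict with the fields it could
    serve, or None when the device does not support bulk fetch.
    fetch_one (hypothetical) fetches a single field.
    """
    values = dict(bulk_fetch(fields) or {})
    for field in fields:
        if field not in values:
            # Not covered by the bulk call: fetch it individually.
            values[field] = fetch_one(field)
    return values
```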
Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1
[ROCm/rdc commit: 51efe26442]
Change CMakeLists.txt to add -fsanitize=address.
Refer to Jira ticket SWDEV-259873.
Change-Id: Ie37fd661787eaea16f366b925d9a97db233cd136
[ROCm/rdc commit: ceb562d630]
The name of rocm-smi-lib64 has been changed to rocm-smi-lib.
Update the required package name here to resolve the installation dependency.
Depends-On: Ib37d29aedc20b610619f6921f4147b41c0eaf134
Change-Id: I4efd778b72d43ad8f0842410a94ac1e3d3b9192a
[ROCm/rdc commit: 0054349862]
Installing files to a standard path across versions and using
ldconfig causes issues with side-by-side installs.
Use RUNPATH/RPATH for ROCm to ensure all ROCm libraries are
picked up without the need for ldconfig.
For the RDC server to be picked up by systemctl, the service config file
shall be a symlink from /lib/systemd/system/rdc.service to the
corresponding RDC file path in a given version of ROCm.
For side-by-side install packages of RDC, the post-install scripts
will be removed. Hence the user will have to set the symlink explicitly
for now.
Change-Id: I916da7cf132f0f9c667e2470fac2b0875e3db9d0
[ROCm/rdc commit: fe1593dda5]
The new raslib fields are added to RDC for dmon.
* rdc_field.data, rdc.h and rdc_bootstrap.py are changed
for the new fields.
* RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
removed from RdcSmiLib.cc and will be obtained from raslib.
Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6
[ROCm/rdc commit: 81ad23343c]
The Python script searches a list of installation folders to
find librdc_bootstrap.so.
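The search can be sketched as follows. The helper name find_library is hypothetical, and the list of folders is left to the caller because the source does not state which folders the script checks.

```python
import os


def find_library(candidate_dirs, name="librdc_bootstrap.so"):
    """Return the first existing path to the library, or None.

    candidate_dirs is a caller-supplied list of installation folders,
    checked in order; the real script's folder list is not shown here.
    """
    for directory in candidate_dirs:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None
```

The returned path could then be handed to ctypes.CDLL, with a clear error raised when the result is None.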
Change-Id: I52e444e6d153c318c731c4b2cd0d8e39b0fd31ca
[ROCm/rdc commit: 4b3dbc4697]
Also:
* print header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up
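The periodic header from the list above can be sketched as below. The function name is hypothetical, and counting 50 data rows between headers (rather than 50 printed lines including the header) is an assumption.

```python
def format_rows(rows, header, repeat_every=50):
    """Return output lines with the header repeated periodically.

    The header is emitted before the first row and again after every
    repeat_every data rows, so it stays visible in long output.
    """
    out = []
    for index, row in enumerate(rows):
        if index % repeat_every == 0:
            out.append(header)
        out.append(row)
    return out
```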
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
[ROCm/rdc commit: b278cd379b]
When RDC is used only as a library, the user can choose not to build
rdci and rdcd, which removes the dependencies on gRPC and protoc.
-DBUILD_STANDALONE=off should be passed to cmake.
* Change README.md for the instructions.
* Move the python_binding installation from client/CMakeLists.txt to CMakeLists.txt
so that the RDC library-only build also installs the folder.
* Change CMakeLists.txt and rdc_libs/CMakeLists.txt to build with gRPC only if
BUILD_STANDALONE is enabled.
Change-Id: If9cfe9fc298a83636d85fe352a311fe2fe041661
[ROCm/rdc commit: 105675aeeb]
Two files are added to the python_binding folder:
* rdc_collectd.py is a collectd plugin that stores RDC
metrics in the collectd round-robin database.
* rdc_collectd.conf is a configuration file that controls
which fields to collect and how frequently they are collected,
and runs the plugin in embedded mode.
Change-Id: Ief44d004376ca8a82ed0d8ad36805243acb47080
[ROCm/rdc commit: bb6d98b036]
Having grpc installed outside of ROCm dir is problematic
for multiple, simultaneous ROCm installations.
Change-Id: I5ad458ad01a76786339607d708b48534f15b137b
[ROCm/rdc commit: 0030f27ff8]
A new Grafana dashboard file, rdc_grafana_dashboard_example.json,
has been added to the python_binding folder. Users can import
this dashboard to monitor multiple compute nodes.
To display only the host name in the dashboard,
rdc_prometheus_example.yml is also changed to create a new label,
short_instance, which does not include the port number.
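The effect of the new label can be illustrated in Python. This only mirrors the intended result; the actual change is made in the Prometheus configuration, and the function below is a hypothetical stand-in that does not handle bracketed IPv6 addresses.

```python
def short_instance(instance):
    """Strip a trailing ':port' from a Prometheus instance value.

    Illustrative only: the real label is produced by the Prometheus
    configuration, not by Python code.
    """
    host, sep, port = instance.rpartition(":")
    if sep and port.isdigit():
        return host
    return instance
```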
Change-Id: I9ab91838006d59c8dcb5fea01decb8c799484e1d
[ROCm/rdc commit: aeba7b0f91]
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file, RdcTelemetryLibInterface.h, is defined for libraries to
include.
* RdcWatchTable no longer talks to RdcMetricFetcher directly.
It calls the framework watch/unwatch, which dispatches to the libraries.
* Make the python binding consistent with the current code.
Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562
[ROCm/rdc commit: 151520b97e]
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();
When RDC loads librdc_ras.so, it calls rdc_module_init().
When RDC exits, it calls rdc_module_destroy().
Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca
[ROCm/rdc commit: 72691cc024]
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework loads the RAS library at run time
and then calls the RAS function to retrieve the RAS metrics.
* RdcModuleMgr is used to manage different modules. RDC
only has the telemetry module for now.
* When RdcTelemetryModule is loaded, it loads the RAS library.
It also calls rdc_telemetry_fields_query(), defined in the RAS
library, for the list of fields that RAS supports.
* RdcSmiLib is a wrapper for rocm_smi_lib that provides the
interface required by RdcTelemetryModule.
* RdcWatchTable uses RdcModuleMgr to get the
RdcTelemetryModule and bulk fetch multiple fields.
* RdcTelemetryModule dispatches those fields to the appropriate
library: RdcSmiLib or RdcRasLib.
The watch() and unwatch() functions in RdcTelemetryModule will be implemented
in the next task.
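The dispatch step described in the list above can be sketched as follows. This is a simplified model, not the C++ implementation: TelemetryDispatcher, FakeLibrary, and their methods are hypothetical, with fields_query() loosely standing in for rdc_telemetry_fields_query(). EXAMPLE_SMI_FIELD is a placeholder name.

```python
class FakeLibrary:
    """Hypothetical stand-in for a telemetry library such as RdcSmiLib."""

    def __init__(self, name, fields):
        self.name = name
        self._fields = list(fields)

    def fields_query(self):
        # Report which fields this library can serve.
        return list(self._fields)

    def fetch(self, fields):
        # Return a fake value per field so the routing is visible.
        return {field: f"{self.name}:{field}" for field in fields}


class TelemetryDispatcher:
    """Route each requested field to the library that registered it."""

    def __init__(self):
        self._owner = {}

    def register(self, library):
        for field in library.fields_query():
            self._owner[field] = library

    def fetch(self, fields):
        # Group fields by owning library so each library is called once.
        by_library = {}
        for field in fields:
            by_library.setdefault(self._owner[field], []).append(field)
        values = {}
        for library, wanted in by_library.items():
            values.update(library.fetch(wanted))
        return values
```

Grouping before fetching keeps the bulk-fetch property: each library receives one call with all of its fields rather than one call per field.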
Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
[ROCm/rdc commit: ba35cdcfe2]
Also:
* consolidated the info in the previous rdc/README.md into
the README.md that was moved from docs/ directory.
* added missing information to get grpc into the default
library path (needed to add the grpc dir with ldconfig).
* formatting fixes
Change-Id: Id61e761ad7bdee40364bb8837be8705ed5ca53d1
[ROCm/rdc commit: bf412e3f76]
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.
Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.
Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
a field should be hidden or not, by default. This will
allow us to support many fields, even those that are not
typically of interest (but sometimes may be), without
confusing the user or unnecessary clutter.
* added a --list-all option which lists all available fields,
including the more obscure fields.
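The hidden-by-default behavior can be sketched as below. The Field tuple, the list_fields helper, and the field names are hypothetical; the real field definitions and option parsing live in RDC itself.

```python
from collections import namedtuple

# Hypothetical in-memory form of a field definition with a hidden flag.
Field = namedtuple("Field", ["name", "hidden"])


def list_fields(fields, list_all=False):
    """Return field names, skipping hidden ones unless list_all is set."""
    return [f.name for f in fields if list_all or not f.hidden]
```

With list_all left at False only the commonly used fields appear, while passing list_all=True corresponds to the --list-all option.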
Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
[ROCm/rdc commit: d7c9625fc6]
rdc_prometheus.py is a Prometheus plugin for RDC.
rdc_prometheus_example.yml and prometheus_targets.json are
example Prometheus configurations. If there are multiple compute
nodes, they can be defined in prometheus_targets.json.
Change-Id: I3611b1e8a166f6608351f6e7644808bf72a4d3a0
[ROCm/rdc commit: 9c7a1347ea]
Previously we would return -1 if we detected rdcd was
still running. But the rdcd process ID is alive as long
as the test is running. So now we return 0, and the rdcd
process ends, allowing the test to end cleanly.
Change-Id: I98a5aa0a03d14127824b86e1190047c9f9d2edb7
[ROCm/rdc commit: 15be17539f]
A new folder python_binding is created for the RDC python binding:
* rdc_bootstrap.py is a python ctypes wrapper for librdc_bootstrap.so.
* RdcUtil.py defines common utilities for RDC to manage groups and fieldgroups.
* RdcReader.py is a class that simplifies the usage of RDC:
  - The user only needs to specify which fields to monitor.
    RdcReader will create groups and fieldgroups, watch the fields, and fetch the fields.
  - RdcReader supports embedded and standalone mode.
  - Standalone mode can run with or without authentication.
  - In standalone mode, RdcReader can automatically reconnect to rdcd when the connection is lost.
  - When rdcd is restarted, the previously created groups and fieldgroups may be lost.
    RdcReader can re-create them and watch the fields after reconnecting.
  - If the client is restarted, RdcReader can detect the groups and fieldgroups
    created before and avoid re-creating them.
  - The user can pass a unit converter to override the RDC default units.
Change-Id: I109ec86012f37162eb13f7d3e921115b7dd82369
[ROCm/rdc commit: 9209c6c516]