- Enable set and get for policy settings
- Enable register and clear policy events
Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>
[ROCm/rdc commit: d8fec06bab]
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
[ROCm/rdc commit: 4bd31b605a]
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
[ROCm/rdc commit: b17abf93fa]
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.
Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 5a3fd9fbc1]
RAS plugin loaded rocm-smi which is in conflict with amd-smi library
Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi
We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.
One such enum is kDevGpuMetrics:
rocm-smi: kDevGpuMetrics = 68
amd-smi: kDevGpuMetrics = 75
Example of overlapping map definitions:
$ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
00000000003c4980 g DO .data.rel.ro0000000000000008 Base devInfoTypesStrings
00000000003db830 g DO .bss0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
$ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so | grep devInfoTypesStrings
00000000003dc590 g DO .bss0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
00000000003c9c68 g DO .data.rel.ro0000000000000008 Base devInfoTypesStrings
Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: d85657e5f2]
- Replace non-working fields with working ones
- remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
- ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
- ROCP_METRICS is only relevant for rocprofv1 (which we are using)
Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 20ca2ce574]
For new ASIC, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. New fileds RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used.
Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283
[ROCm/rdc commit: 61a75d346b]
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.
Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: f9e80cc37a]
NOTE: RVS Build is disabled by default due to CI build issues.
Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: eaa1862a80]
This commit adds integration with ROCmTools
Additional changes:
- Fix DEB and RPM installation issue when systemd is not present
- Fix typos in rdc.h
- Wrap negative values in parentheses in rdc.h
- CMAKE: Improve rocm_smi searching
- README: Improve formatting, add info about ROCmTools
Metrics added: 700-714
Metrics can be listed with `rdci dmon --list-all`
Majority of the metrics are only supported by Instict (MI) series GPUs
700 RDC_FI_PROF_ELAPSED_CYCLES should be available on most devices
See README for more information
Change-Id: I907d3eacdc92fc5588ca6c76c2fa1ce0ad900770
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 861a843ed7]
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.
Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9
[ROCm/rdc commit: 78e2f2486b]
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.
It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.
The grpc client and server side diagnostics function is added.
The diag module is added to the rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
[ROCm/rdc commit: 76ccf58008]
RDC can optimize by bulk fetching multiple metrics using a single
rocm_smi call. However, currently this is not completely supported in
all ASIC generations. By default disable this for now.
Set environment variable RDC_BULK_FETCH_ENABLED=TRUE to enable
RDC bulk fetch.
BUG: SWDEV-289316
Change-Id: Ibb55514f198356dccf5f47bb0fd2d53c17acb251
[ROCm/rdc commit: 673f5a4ee1]
The API interface defines how the caller will use the API. An
example also shows how the API can be used.
It also defines the RdcDiagnostic module which can load the
library dynamically and then dispatch diagnostic test to run.
Change-Id: I1e041aab86f7e19338860f5ba65262977f4ea9cb
[ROCm/rdc commit: eab3625d65]
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.
Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16
[ROCm/rdc commit: a9d0e037b5]
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.
If the video card does not support bulk fetch or the metrics cannot be
bulk fetched, it will fallback to fetch them one by one.
Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1
[ROCm/rdc commit: 51efe26442]
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
removed from RdcSmiLib.cc, and will be gotten from raslib.
Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6
[ROCm/rdc commit: 81ad23343c]
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
[ROCm/rdc commit: b278cd379b]
The framework now supports watch() and unwatch(), which can be used
by the telemetry library to init events or pre-fetch fields when recording
starts.
* A new header file RdcTelemetryLibInterface.h is defined for library to
include it.
* The RdcWatchTable will not talk to RdcMetricFetcher directly anymore.
It will call the framework watch/unwatch to dispatch it to the libraries.
* Make the python binding consistent with the current code.
Change-Id: Ie5731d920ed5928f901369d60c23bd450807a562
[ROCm/rdc commit: 151520b97e]
RAS library will provide two new APIs:
rdc_status_t rdc_module_init(uint64_t flags);
rdc_status_t rdc_module_destroy();
When RDC load the librdc_ras.so, it will call rdc_module_init().
When RDC exit, it will call rdc_module_destroy()
Change-Id: I7f5c81fd19a45a906c3c339cd6eabee2277f27ca
[ROCm/rdc commit: 72691cc024]
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.
* The RdcModuleMgr will be used to manage different modules. RDC
only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
It will also call rdc_telemetry_fields_query() defined in the RAS
library for the list of fields RAS supported.
* The RdcSmiLib is a wrapper for the rocm_msi_lib to provide the
interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
RDCTelemetryModule to bulk fetch mulitple fields.
* The RdcTelemetryModule will dispatch those fields to different
library: RdcSmiLib or RdcRasLib.
The watch() and unwatch() in the RDCTelemetryModule will been implemented
at the next task.
Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
[ROCm/rdc commit: ba35cdcfe2]
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.
Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.
Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
a field should be hidden or not, by default. This will
allow us to support many fields, even those that are not
typically of interest (but sometimes may be), without
confusing the user or unnecessary clutter.
* added a --list-all option which lists all available field
including the more obscure fields.
Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
[ROCm/rdc commit: d7c9625fc6]
A new folder python_binding is created for RDC python binding:
* The rdc_bootstrap.py is a python ctypes wrapper for the librdc_boostrap.so
* The RdcUtil.py defines common utilities for RDC to manage group/fieldgroup
* The RdcReader.py is a class to simplify the usage of the RDC:
- The user only needs to specify which fields he wants to monitoring.
RdcReader will create groups and fieldgroups, watch the fields, and fetch the fields.
- The RdcReader can support embedded and standalone mode.
- The standalone can be with authentication and without authentication.
- In standalone mode, the RdcReader can automatically reconnect to the rdcd when the connection is lost.
- When rdcd is restarted, the previously created group and fieldgroup may lose.
The RdcReader can re-create them and watch the fields after reconnect.
- If the client is restarted, RdcReader can detect the groups and fieldgroups
created before and avoid re-create them.
- The user can pass the unit converter if he does not want to use RDC default unit.
Change-Id: I109ec86012f37162eb13f7d3e921115b7dd82369
[ROCm/rdc commit: 9209c6c516]
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
[ROCm/rdc commit: 5950ebadc4]