AMDSMI needs to merge first and bump the version to at least 24.4.2
Change-Id: I30149bb78c79ebc3de0dabdc8e63fcef12b2f406
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
* Implement CPU discovery support
SWDEV-482949:
enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:
1 GPUs found.
-----------------------------------------------------------------
GPU Index Device Information
0 AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index Device Information
0 AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------
Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
* CMAKE - Add required version for amdsmi
Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
---------
Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Memory check:get the threshold of retired page number
EEPROM check:read and verify the checksum
Power/Thermal check: power/thermal throttle status counter
Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e
Implemented max memory bandwith and current memory bandwidth. Added two
new field ids: RDC_FI_GPU_MEMORY_MAX_BANDWIDTH, RDC_FI_GPU_MEMORY_CUR_BANDWIDTH
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I453e49937a84777146575f4f5bdd69fd4fe53bfc
SWDEV-475242
For the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
It seems that we do not have an equivalent to these pipe-utilizations on our hardware.
In rocprofiler, I think VALU Utilization is the closest to what we want.
Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.
At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
Implemented memory activity and added a new fied id
RDC_FI_GPU_MEMORY_ACTIVITY.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39
Implemented DEC activity for now due to ENC activity is unavailable in
amdsmi.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f
Remove occupancy metrics and replace with OccupancyPercent
Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE
Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
RAS plugin loaded rocm-smi which is in conflict with amd-smi library
Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi
We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.
One such enum is kDevGpuMetrics:
rocm-smi: kDevGpuMetrics = 68
amd-smi: kDevGpuMetrics = 75
Example of overlapping map definitions:
$ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
00000000003c4980 g DO .data.rel.ro0000000000000008 Base devInfoTypesStrings
00000000003db830 g DO .bss0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
$ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so | grep devInfoTypesStrings
00000000003dc590 g DO .bss0000000000000030 Base _ZN3amd3smi6Device19devInfoTypesStringsE
00000000003c9c68 g DO .data.rel.ro0000000000000008 Base devInfoTypesStrings
Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
- Replace non-working fields with working ones
- remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
- ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
- ROCP_METRICS is only relevant for rocprofv1 (which we are using)
Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
For new ASIC, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
are not supported. New fileds RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used.
Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283
Provide support for reliable metrics and hide experimental in current
release.
Further ROCMTools integration development is pushed out to ROCm 5.6.
Change-Id: Iae7a0ed3991588c833bd8ef580b02b9c71390d55
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
This commit adds integration with ROCmTools
Additional changes:
- Fix DEB and RPM installation issue when systemd is not present
- Fix typos in rdc.h
- Wrap negative values in parentheses in rdc.h
- CMAKE: Improve rocm_smi searching
- README: Improve formatting, add info about ROCmTools
Metrics added: 700-714
Metrics can be listed with `rdci dmon --list-all`
Majority of the metrics are only supported by Instict (MI) series GPUs
700 RDC_FI_PROF_ELAPSED_CYCLES should be available on most devices
See README for more information
Change-Id: I907d3eacdc92fc5588ca6c76c2fa1ce0ad900770
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.
To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.
Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.
Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
removed from RdcSmiLib.cc, and will be gotten from raslib.
Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6
Also:
* print header line every 50 line on output
* print events that are being listened for with header
* cpplint clean-up
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.
* The RdcModuleMgr will be used to manage different modules. RDC
only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
It will also call rdc_telemetry_fields_query() defined in the RAS
library for the list of fields RAS supported.
* The RdcSmiLib is a wrapper for the rocm_msi_lib to provide the
interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
RDCTelemetryModule to bulk fetch mulitple fields.
* The RdcTelemetryModule will dispatch those fields to different
library: RdcSmiLib or RdcRasLib.
The watch() and unwatch() in the RDCTelemetryModule will been implemented
at the next task.
Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.
Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.
Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
a field should be hidden or not, by default. This will
allow us to support many fields, even those that are not
typically of interest (but sometimes may be), without
confusing the user or unnecessary clutter.
* added a --list-all option which lists all available field
including the more obscure fields.
Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec