54 Commits

Author SHA1 Message Date
Adam Pryor bd6c6852fc [SWDEV-566924] Update KFD_ID metric to use amd-smi instead of rocprof (#2355) 2025-12-18 08:39:19 -06:00
Yazen AL Musaffar 16b9160034 [RDC] [SWDEV-551280] RDC to include Error Counters (#1087)
* rdc error counter

* RDC error counters

* fix

* Updates

* updated field names

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>

---------

Signed-off-by: yalmusaf_amdeng <yalmusaf@amd.com>
Co-authored-by: yalmusaf_amdeng <yalmusaf@amd.com>
2025-12-03 15:22:18 -06:00
Dmitrii 8abe24d3b0 rdc: Add CPU support and CPU metrics infrastructure (#770) 2025-09-12 16:14:38 -05:00
Galantsev, Dmitrii 8fc1d27ecd Profiler - Remove UUID metric
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 059451d48f]
2025-07-22 14:55:28 -05:00
Galantsev, Dmitrii ad14980e9a Profiler - Add partition support
NOTE: GPU ordering used is not the same as in HSA/HIP.

GPUs are ordered via amdsmi and then GPU_ID fields are compared to map
GPU partitions to each other.

Change-Id: If379214f5281d7d5ee98515b3e5ba7affc2e2197
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 85b619b2f0]
2025-06-03 19:34:00 -05:00
Pryor, Adam ec661d5d17 [SWDEV-243250] RDC Process Start/Stop integration (#189)
Change-Id: I3d2be33b5d23cd259b3d06fb572f81d19e6c3798

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 0e9c3b2c4f]
2025-06-02 14:42:21 -05:00
Galantsev, Dmitrii 0d352c515e Profiler - Align SMI and Profiler indices
Change-Id: If2bb850ffd1c1b8b16a8f5963a0f6971f82d4863
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eff955fdf7]
2025-05-21 19:11:17 -05:00
adapryor 0702a6a5a2 Profiler - Fix SIMD Utilization
Change-Id: I6775cce9901a714d20e80c8c17e7a563edeb48a4


[ROCm/rdc commit: 33924ea79e]
2025-05-07 00:56:52 -05:00
Galantsev, Dmitrii b6488d150d Profiler - Add SIMD_UTILIZATION (#171)
Change-Id: I19d5acd80dbed8c4fc4e1c85eec71ca89398d299

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 02c0786a2c]
2025-05-06 13:20:03 -07:00
Galantsev, Dmitrii 375ab5eace Add RDC_FI_GPU_BUSY_PERCENT
AMDSMI needs to merge first and bump the version to at least 24.4.2

Change-Id: I30149bb78c79ebc3de0dabdc8e63fcef12b2f406
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: a5cb334f8b]
2025-04-15 17:00:56 -05:00
Galantsev, Dmitrii 5276903800 Revert "Implement CPU discovery support"
This reverts commit f967f8a17d15e148464393fcd145af01dc0e1525.


[ROCm/rdc commit: 24024f0e4f]
2025-04-07 20:45:19 -05:00
Yuan, Perry f0f44d977f Implement CPU discovery support (#77)
* Implement CPU discovery support

SWDEV-482949:

Enable CPU model name support in RDC so that the rdci command
can detect GPU and CPU modules at the same time.
CPU info is queried through the amdsmi interface, as shown below:

1 GPUs found.
-----------------------------------------------------------------
GPU Index        Device Information
0               AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index        Device Information
0               AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------

Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>

* CMAKE - Add required version for amdsmi

Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 3bdca8b8b6]
2025-03-31 10:58:36 +08:00
Galantsev, Dmitrii bfee4ae9ee Profiler - Add CPC and CPF metrics
Change-Id: I27fd725e9e1868c9afe7624d6e4aafad2a42d47e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 51de344be7]
2025-03-27 19:01:23 -05:00
adapryor fbeacaff0c [SWDEV-517396] Align rdc_field with rdc_bootstrap
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I5e05e25c5980a3141665ae2d13a6ae09207ccb41


[ROCm/rdc commit: 9571dad23d]
2025-03-04 08:49:28 -06:00
Pryor, Adam 0186fc2481 SWDEV-508477 Eval Flops Percent (#85)
SWDEV-508477 - Profiler add FP*_PERCENT

Change-Id: Idb6250fe6b7ba3df6fe7d30861e0fbbda7e9bdce

Signed-off-by: adapryor <Adam.pryor@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 6f358ddc9e]
2025-01-24 10:07:32 -06:00
adapryor 8286a92fc1 Implementation for RDC_FI_PROF_OCCUPANCY_PER_ACTIVE_CU SWDEV-50895
Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I8da7d9846edabe5629c75f50cd2bb4b23e019a17
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: 290b90dc89]
2025-01-21 21:49:19 -06:00
limeng12 4f3b114740 [SWDEV-230863] Improve the functionality of RdcSmiHealth module.
Memory check: get the threshold of retired page number
EEPROM check: read and verify the checksum
Power/Thermal check: power/thermal throttle status counter

Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e


[ROCm/rdc commit: 016a1d9d39]
2025-01-14 08:14:36 +08:00
Ma, Li 0e5cf815d8 SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem (#48)
[ROCm/rdc commit: 772481f952]
2024-12-23 10:22:53 +08:00
Greg Scaffidi 725599b51c Add RDC_FI_PROF_SM_ACTIVE metric.
Signed-off-by: Greg Scaffidi <salvatore.scaffidi@amd.com>
Change-Id: I63aaf5eb05d74ba696ace2b088e17c2cfb1bd74b
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: f4de4b0529]
2024-12-21 15:21:46 -06:00
Li Ma 772c1c0a0d SWDEV-475244 - Memory Usage and Bandwidth: max mem and current mem
Implemented max memory bandwidth and current memory bandwidth. Added two
new field ids: RDC_FI_GPU_MEMORY_MAX_BANDWIDTH, RDC_FI_GPU_MEMORY_CUR_BANDWIDTH

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I453e49937a84777146575f4f5bdd69fd4fe53bfc


[ROCm/rdc commit: 30f9b2ac2f]
2024-12-16 09:43:20 +08:00
Chen Gong a8086b484d rocprofiler: add valu utilization
SWDEV-475242

Regarding the description of "FP32 Engine Activity" and "FP64 Engine Activity" in dcgm,
it seems that we do not have an equivalent to these pipe utilizations on our hardware.

In rocprofiler, I think VALU Utilization is the closest to what we want.

Change-Id: Ibce8835ef4757084cdfd73258de6fc1606ca0158
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 251fcbe49d]
2024-11-21 15:24:01 +08:00
limeng12 71e2727a8f Background health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support the following health checks:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
Bill(Shuzhou) Liu b813ae3426 Correct RDC_FI_PCIE_BANDWIDTH unit
The unit should be mbps instead of GB/second


[ROCm/rdc commit: 5e3ebecf80]
2024-11-13 09:45:46 -05:00
Chao Fei d489245fbe Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>


[ROCm/rdc commit: 345ac64a43]
2024-10-23 20:37:27 -04:00
Li Ma 7e3c4b9a21 SWDEV-475244 - Memory Usage and Bandwidth: memory activity
Implemented memory activity and added a new field id
RDC_FI_GPU_MEMORY_ACTIVITY.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I11abe356ef6b01ce4917fd19dcc128efbc535f39


[ROCm/rdc commit: 4bd31b605a]
2024-10-22 11:11:31 +08:00
Li Ma 09c718954c SWDEV-475255 - MM Engine Decoding Throughput
Implemented DEC activity only for now because ENC activity is unavailable in
amdsmi.

Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I34bb56e6e0d8d2ab91243f8932f0ac10cb2d1e9f


[ROCm/rdc commit: b17abf93fa]
2024-10-18 10:01:41 +08:00
Galantsev, Dmitrii 793b2de0cb Profiler - Modify metrics
Remove occupancy metrics and replace with OccupancyPercent

Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE

Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 28acbf0436]
2024-10-15 19:00:30 -05:00
adapryor c283aebd1c Add XGMI read/write sum metrics
Change-Id: I898b779ea7f5336edf0d047fb1e5d3ec40085baa


[ROCm/rdc commit: e20bc58b1c]
2024-10-09 17:02:55 -05:00
Galantsev, Dmitrii cc3c3ce9b6 Add OAM_ID
Change-Id: I771b2f7f088940838c09ba3521a7955faa64e7ec
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bffe4e22fa]
2024-09-09 21:19:33 -05:00
Galantsev, Dmitrii 9a2806ac95 SWDEV-452795 - Disable RAS plugin, fix XGMI
RAS plugin loaded rocm-smi which is in conflict with amd-smi library

Main source of grief was the map 'devInfoTypesStrings' that is defined
in both rocm-smi and amd-smi

We assume that rocm-smi would get lazy-loaded by RAS library and
overwrite symbols defined in amd-smi. devInfoTypesStrings in rocm-smi
contains different number of elements, the enums are also different.
RDC relies on amd-smi's enums.

One such enum is kDevGpuMetrics:
  rocm-smi: kDevGpuMetrics = 68
  amd-smi:  kDevGpuMetrics = 75

Example of overlapping map definitions:

  $ objdump --dynamic-syms /opt/rocm/lib/libamd_smi.so | grep devInfoTypesStrings
  00000000003c4980 g    DO .data.rel.ro  0000000000000008  Base        devInfoTypesStrings
  00000000003db830 g    DO .bss          0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  $ objdump --dynamic-syms /opt/rocm/lib/librocm_smi64.so  | grep devInfoTypesStrings
  00000000003dc590 g    DO .bss          0000000000000030  Base        _ZN3amd3smi6Device19devInfoTypesStringsE
  00000000003c9c68 g    DO .data.rel.ro  0000000000000008  Base        devInfoTypesStrings

Change-Id: Ib2f2db32b6abd7ebe84e7807c25581461eb86bae
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: d85657e5f2]
2024-06-26 03:42:07 -05:00
Galantsev, Dmitrii 29b86095ed Fix rocprofiler plugin
- Replace non-working fields with working ones
    - remove CU_OCCUPANCY completely as it isn't well supported
- Fix rocprofiler initialization with shared_ptr and rdc_module_init
- Replace env var ROCPROFILER_METRICS_PATH with ROCP_METRICS
    - ROCPROFILER_METRICS_PATH is only relevant for rocprofv2
    - ROCP_METRICS is only relevant for rocprofv1 (which we are using)

Change-Id: I21e6fa3f0e1694c38f44ca0e5659d672559f7380
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 20ca2ce574]
2024-06-06 01:51:39 -05:00
Galantsev, Dmitrii c2a75bbe4c Finalize the rocprofiler fields
Change-Id: I4ed1c4309f21bdcc7281d911663036caf5947182
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 07c414af5e]
2024-06-04 19:49:06 -05:00
Galantsev, Dmitrii f73e123900 Add GPU indexing and fix check for fields in rocprof
- Fix RUNPATH for tests

Change-Id: I79517592b49d27080a010a2e41e5878adf24a157
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: e11afbf60f]
2024-06-04 12:56:22 -05:00
Galantsev, Dmitrii a80dfd4f00 Add memory bandwidth metrics
Change-Id: I310ca8af0536497be619d2bda1e540d1f11c2565
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 53033a5b77]
2024-05-17 14:55:01 -05:00
Galantsev, Dmitrii 83cf97e280 Profiler - Add all required metrics
Change-Id: Iea3938df9407789c061c3a6ead9167a69069d6e6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: c3a4c899d5]
2024-05-09 23:24:02 -05:00
Galantsev, Dmitrii 8b317a6490 Add rocprofiler plugin
Rename ROCR -> Runtime and ROCP -> Profiler

Change-Id: If90953da8fa5d695b681813dad4a3e7ec26a9c7e
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 234b2d835b]
2024-05-07 04:39:39 -05:00
Galantsev, Dmitrii 93b990ffa0 AMDSMI - Add ring hang event
Change-Id: I84696e3cc1a4eba8de48e464f1a208ed9c6e489d
Depends-On: I2e73ba08ee0004f6f30660b2fa425ea94bafceca
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5525bf8c86]
2024-05-03 16:45:42 -05:00
Bill(Shuzhou) Liu 79897be094 Add new XGMI and PCIE bandwidth fields from gpu_metrics
For new ASICs, the RDC_EVNT_XGMI, RDC_FI_PCIE_RX and RDC_FI_PCIE_TX
fields are not supported. The new fields RDC_FI_XGMI and RDC_FI_PCIE_BANDWIDTH
should be used instead.

Change-Id: Iff5bbef4c07994090fa7c4e9b319966215525283


[ROCm/rdc commit: 61a75d346b]
2024-05-03 16:18:17 -04:00
Galantsev, Dmitrii 39d6f482b8 Remove unsupported rocprofiler metrics
Change-Id: If6cfbcbe018227c591733471ab203fc6675d50af
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 81e3a78b1f]
2024-02-09 15:18:54 -06:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii 45d7a2df04 Server - Add -a/--address option
Change-Id: Ia9e8d76b9a4ba0aadc567142601a87f0ad0b69e4
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: ed3cfffd7e]
2023-12-04 15:26:44 -06:00
Galantsev, Dmitrii a337dc062b SWDEV-392942 - Disable rocmtools
Temporarily disable rocmtools because of hsa_shut_down issues

Change-Id: I5e8b6729b8200ccdd5c399862bfc632ba69f884c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 90e824c63b]
2023-04-05 13:20:19 -05:00
Galantsev, Dmitrii 6be2c8784d SWDEV-342533 - Hide WIP fields
Provide support for reliable metrics and hide experimental ones in the
current release.

Further ROCMTools integration development is pushed out to ROCm 5.6.

Change-Id: Iae7a0ed3991588c833bd8ef580b02b9c71390d55
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 4536a453db]
2023-01-23 15:31:46 -06:00
Galantsev, Dmitrii eccb4e202c Add rocmtools support
This commit adds integration with ROCmTools

Additional changes:
- Fix DEB and RPM installation issue when systemd is not present
- Fix typos in rdc.h
- Wrap negative values in parentheses in rdc.h
- CMAKE: Improve rocm_smi searching
- README: Improve formatting, add info about ROCmTools

Metrics added: 700-714
Metrics can be listed with `rdci dmon --list-all`
The majority of the metrics are only supported by Instinct (MI) series GPUs
700 RDC_FI_PROF_ELAPSED_CYCLES should be available on most devices
See README for more information

Change-Id: I907d3eacdc92fc5588ca6c76c2fa1ce0ad900770
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 861a843ed7]
2022-12-16 12:19:59 -06:00
Chris Freehill 8b1c887834 Turn on/off DAC capabilities as needed
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.

To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.

Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f


[ROCm/rdc commit: 6b5aeaaa23]
2021-02-04 12:25:26 -06:00
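
The scope-limited access described in the commit above can be illustrated with a small sketch. This is not the ScopedCapability struct from RDC (which is C++ and not shown in this log); it is a Python context manager that mimics the same idea under stated assumptions. The `effective_caps` set, the `scoped_capability` and `configure_device` names, and the use of `CAP_DAC_OVERRIDE` as the capability are all hypothetical stand-ins.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the process's effective capability set.
effective_caps = set()


@contextmanager
def scoped_capability(cap):
    """Enable `cap` only for the duration of the with-block."""
    effective_caps.add(cap)
    try:
        yield
    finally:
        # Runs on normal exit and when an exception propagates,
        # so the process is never left holding the extra capability.
        effective_caps.discard(cap)


def configure_device():
    """Hypothetical configuration step that needs write access."""
    if "CAP_DAC_OVERRIDE" not in effective_caps:
        raise PermissionError("write access required")
    return "configured"


if __name__ == "__main__":
    with scoped_capability("CAP_DAC_OVERRIDE"):
        print(configure_device())
    print("still held:", "CAP_DAC_OVERRIDE" in effective_caps)
```

The `finally` clause plays the role that a destructor would play in a C++ scope guard: the capability is dropped even if configuration fails partway through.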
Chris Freehill 7cf47fb5c9 Add to/correct handling of RDC_EVNT_XGMI_*_THRPUT events
RDC_EVNT_XGMI_[2-5]_THRPUT were missing from RDC. Additionally,
these were handled as "pseudo" events, but this is not
necessary.

Change-Id: I3478365ac0d78f60a7b63235bea484f3edb8bd16


[ROCm/rdc commit: a9d0e037b5]
2021-01-29 14:56:46 -05:00
Bill(Shuzhou) Liu 17d5758923 Add raslib fields to RDC
The new raslib fields are added to RDC for dmon.
* The rdc_field.data, rdc.h and rdc_bootstrap.py are changed
  for new fields.
* The RDC_FI_ECC_CORRECT_TOTAL and RDC_FI_ECC_UNCORRECT_TOTAL are
  removed from RdcSmiLib.cc, and will be gotten from raslib.

Change-Id: I4ee016e3d52e9d38b54406ca129da511f741c6d6


[ROCm/rdc commit: 81ad23343c]
2020-12-01 10:56:36 -05:00
Chris Freehill 79b5e54d3b Add event notification support and rdci timestamps
Also:
* print header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc


[ROCm/rdc commit: b278cd379b]
2020-11-22 07:10:39 -05:00
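
The periodic header mentioned in the commit above ("print header line every 50 lines") can be sketched as follows. This is a minimal Python illustration, not the rdci implementation; the `format_rows` function and its parameters are hypothetical.

```python
def format_rows(rows, header, every=50):
    """Return the output lines with `header` repeated before every `every` rows."""
    out = []
    for i, row in enumerate(rows):
        if i % every == 0:
            # Start of a new block of rows: repeat the header.
            out.append(header)
        out.append(row)
    return out


if __name__ == "__main__":
    sample = [f"row {n}" for n in range(5)]
    for line in format_rows(sample, "GPU  VALUE", every=2):
        print(line)
```

Repeating the header keeps column names visible when a long-running monitor scrolls past the first screen of output.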
Bill(Shuzhou) Liu 32e91cf8d2 RDC module framework
The framework is required for RAS integration. When the RAS fields
need to be retrieved, the framework will load the RAS library at run time,
and then call the RAS function to retrieve RAS metrics.

* The RdcModuleMgr will be used to manage different modules. RDC
  only has the telemetry module now.
* When RDCTelemetryModule is loaded, it will load the RAS library.
  It will also call rdc_telemetry_fields_query() defined in the RAS
  library for the list of fields RAS supports.
* The RdcSmiLib is a wrapper for the rocm_smi_lib to provide the
  interface required by the RDCTelemetryModule.
* The RdcWatchTable will use the RdcModuleMgr to get the
  RDCTelemetryModule to bulk fetch multiple fields.
* The RdcTelemetryModule will dispatch those fields to different
  libraries: RdcSmiLib or RdcRasLib.

The watch() and unwatch() in the RDCTelemetryModule will be implemented
in the next task.

Change-Id: I81b01d5b52d1ea3cdcec7c09af86b6622dd5899e


[ROCm/rdc commit: ba35cdcfe2]
2020-09-02 14:46:40 -04:00
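
The dispatch described in the commit above (a telemetry module asking each library which fields it supports, then routing a bulk fetch accordingly) can be sketched roughly as below. This is an illustrative Python model only: `FakeLib`, `TelemetryModule`, `fields_query`, `fetch`, and `bulk_fetch` are hypothetical names, and the real RDC classes are C++ with different interfaces.

```python
class FakeLib:
    """Hypothetical stand-in for a backend such as RdcSmiLib or RdcRasLib."""

    def __init__(self, name, supported):
        self.name = name
        self.supported = set(supported)

    def fields_query(self):
        # Report which fields this library can provide.
        return self.supported

    def fetch(self, fields):
        # Return a fake value per field so routing is visible.
        return {f: f"{self.name}:{f}" for f in fields}


class TelemetryModule:
    """Routes each requested field to the library that supports it."""

    def __init__(self, libs):
        self.owner = {}
        for lib in libs:
            for field in lib.fields_query():
                # First library to claim a field keeps it.
                self.owner.setdefault(field, lib)

    def bulk_fetch(self, fields):
        by_lib = {}
        for field in fields:
            if field not in self.owner:
                raise KeyError(f"unsupported field: {field}")
            by_lib.setdefault(self.owner[field], []).append(field)
        result = {}
        for lib, lib_fields in by_lib.items():
            # One call per library, covering all of its fields.
            result.update(lib.fetch(lib_fields))
        return result


if __name__ == "__main__":
    module = TelemetryModule([FakeLib("smi", ["temp"]), FakeLib("ras", ["ecc"])])
    print(module.bulk_fetch(["temp", "ecc"]))
```

Grouping fields per library before fetching keeps the number of backend calls equal to the number of libraries involved rather than the number of fields.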
Chris Freehill 17430dde45 Add event counter support
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.

Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.

Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
  a field should be hidden or not, by default. This will
  allow us to support many fields, even those that are not
  typically of interest (but sometimes may be), without
  confusing the user or unnecessary clutter.
* added a --list-all option which lists all available fields,
  including the more obscure ones.

Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83


[ROCm/rdc commit: d7c9625fc6]
2020-08-17 19:45:18 -04:00
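
The hidden-by-default field listing described in the commit above can be sketched as follows. This is a minimal Python model, not the rdci code: the `Field` record, the example field names, and the `list_fields` function are hypothetical, and the real hidden flag lives in the field data file mentioned in the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int
    name: str
    hidden: bool = False  # hidden fields are skipped unless explicitly requested


# Hypothetical field table; names are placeholders, not real RDC fields.
FIELDS = [
    Field(1, "EXAMPLE_GPU_CLOCK"),
    Field(2, "EXAMPLE_GPU_TEMP"),
    Field(3, "EXAMPLE_OBSCURE_COUNTER", hidden=True),
]


def list_fields(fields, list_all=False):
    """Return field names, including hidden ones only when list_all is set."""
    return [f.name for f in fields if list_all or not f.hidden]


if __name__ == "__main__":
    print(list_fields(FIELDS))
    print(list_fields(FIELDS, list_all=True))
```

Here `list_all=True` corresponds to passing `--list-all`: the default listing stays short while obscure fields remain reachable.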