52 Commits

Author SHA1 Message Date
Adam Pryor 5bf6e366dd [SWDEV-548460] Add RDC Policy Reset Message (#2180)
* [SWDEV-548460] Add RDC Policy Reset Message

* [rdc] Bump version to 1.3.0

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

* chore: [rdc] Format CMakeLists.txt

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-12-29 08:31:13 -08:00
Yazen AL Musaffar c0d773c47b Fix for created rdc groups not listing when running rdci dmon & rdci group -l -u (#1983)
Signed-off-by: yalmusaf_amdeng <Yazen.ALMusaffar@amd.com>
2025-12-03 15:21:17 -06:00
Dmitrii 8abe24d3b0 rdc: Add CPU support and CPU metrics infrastructure (#770) 2025-09-12 16:14:38 -05:00
Galantsev, Dmitrii 2d41f97290 Bump version to 1.2.0
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 543543ff1b]
2025-08-05 20:06:12 -05:00
Pryor, Adam 010ac416b1 [SWDEV-379269] Add all gpus as default to dmon (#211)
Change-Id: Idb17e9018c39479830a4366f2002d02725d66873

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 816f7a850f]
2025-07-15 16:03:28 -05:00
Pryor, Adam d075194597 [SWDEV-531379] Fix config (#183)
* [SWDEV-531379] Fix config

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ie1bd6903235016a185dd93fbac0a87658fb12a62

* Fix group field find

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I1f8c62615327df4b5ca916b158b4882a3d5a59d0

* fixes

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I971f3e12e293ea9e5d4d67db64d8d7217b87561c

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 8663702737]
2025-06-09 13:55:15 -05:00
Pryor, Adam 151b0301f1 [SWDEV-535739] Align RDC with amdsmi 26.0 (#191)
* Align RDC with amdsmi 26.0.0
* Remove RDCI_IOLINK_TYPE_NUMIOLINKTYPES

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ib7f2a22bd9544e0bf74afb1ed8d8f8b79b129b1a

[ROCm/rdc commit: cc7ccf507a]
2025-06-02 18:27:19 -05:00
Pryor, Adam ec661d5d17 [SWDEV-243250] RDC Process Start/Stop integration (#189)
Change-Id: I3d2be33b5d23cd259b3d06fb572f81d19e6c3798

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 0e9c3b2c4f]
2025-06-02 14:42:21 -05:00
Galantsev, Dmitrii ff8704cf76 RDCI - Fix misaligned fields
Change-Id: I7914c01b82e7e2fb5c63521d6d4803570447790c
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 7b06b778b9]
2025-05-21 19:11:17 -05:00
Pryor, Adam 2cb7903b06 [SWDEV-523349/SWDEV-527257] Fix Rdci Config (#161)
Change-Id: Iae21ea8061205f186086a3ed59c6259ddeb1dbe7

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 2db6ddea69]
2025-04-28 11:57:51 -05:00
Galantsev, Dmitrii e15c5a15fa CMAKE - Bump version to 1.1.0
Change-Id: I0fbc0f6d842c034ad858f30fa6418afd01e11a4f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: ac50573e67]
2025-04-11 17:27:27 -05:00
Galantsev, Dmitrii 5276903800 Revert "Implement CPU discovery support"
This reverts commit f967f8a17d15e148464393fcd145af01dc0e1525.


[ROCm/rdc commit: 24024f0e4f]
2025-04-07 20:45:19 -05:00
Yuan, Perry f0f44d977f Implement CPU discovery support (#77)
* Implement CPU discovery support

SWDEV-482949:

Enable CPU model name support in RDC so that the rdci command
can detect GPU and CPU modules at the same time.
The CPU info is queried through the amdsmi interface, as shown below:

1 GPUs found.
-----------------------------------------------------------------
GPU Index        Device Information
0               AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index        Device Information
0               AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------

Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>

* CMAKE - Add required version for amdsmi

Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 3bdca8b8b6]
2025-03-31 10:58:36 +08:00
Galantsev, Dmitrii e80760c890 RVS - Add long-running tests
Change-Id: Iddeb7f2d4fdcd69d7ac1ae94b2fa128ee3011b1a
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: bdb2367010]
2025-03-27 23:42:56 -05:00
Pryor, Adam fe868f6763 [SWDEV-498711] RDC Partition Implementation (#119)
* [SWDEV-498711] RDC Partition Implementation

Change-Id: Ibfc3709793770537e4c9d36458f34c6b4f461724
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 47692d3ed5]
2025-03-27 14:10:11 -05:00
Pryor, Adam c5560793e8 SWDEV-500382 fix energy consumed (#105)
Change-Id: I3f180f34abed763db1287bf01581753534f32828

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: af56e460c4]
2025-01-30 09:38:00 -06:00
stali 01990d5121 fix topology issue
[ROCm/rdc commit: e36d3fae22]
2025-01-24 09:22:42 +08:00
stali 7f4e5c85cb fixed rdc link state print issue
[ROCm/rdc commit: b427c07ffe]
2025-01-22 09:05:49 +08:00
limeng12 4f3b114740 [SWDEV-230863] Improve the functionality of RdcSmiHealth module.
Memory check: get the threshold for the number of retired pages
EEPROM check: read and verify the checksum
Power/Thermal check: power/thermal throttle status counter

Signed-off-by: Meng Li <li.meng@amd.com>
Change-Id: Id2c751416eb5bf007e6e1da8dc05966a6ba1324e


[ROCm/rdc commit: 016a1d9d39]
2025-01-14 08:14:36 +08:00
Galantsev, Dmitrii b78295c8f8 RVS - Add IET and PEBB tests
Change-Id: Ia032901d74c882e5cbfa5a3164199cd4d571341f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5861ec7663]
2025-01-08 18:23:13 -06:00
Galantsev, Dmitrii 9d32387925 RVS - Add memory bandwidth test
Change-Id: I4c8990170861f6a0f3853615db68634fdaa7a622
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: b058cbecf1]
2025-01-08 18:23:13 -06:00
Li, Star 474eb81053 Fix unit issue in policy feature (#78)
1. For temperature, the unit is millidegrees Celsius.
2. For power, the unit is microwatts.
3. Fix the second register call to rdcd not working because of the start flag.

Co-authored-by: Chao Fei <chao.fei@amd.com>

[ROCm/rdc commit: bd7d7c99c1]
2025-01-06 09:21:08 +08:00
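The unit conventions named in commit 474eb81053 above (millidegrees Celsius for temperature, microwatts for power) can be sketched as a pair of conversions. This is an illustrative Python sketch, not RDC code; the function names `to_millidegrees` and `to_microwatts` are hypothetical.

```python
def to_millidegrees(celsius: float) -> int:
    """Convert degrees Celsius to millidegrees Celsius."""
    return round(celsius * 1000)


def to_microwatts(watts: float) -> int:
    """Convert watts to microwatts."""
    return round(watts * 1_000_000)


# Example thresholds as a user might think of them (values are made up).
print(to_millidegrees(85.5))
print(to_microwatts(150))
```

Keeping the conversion in one place makes it harder to compare a value in watts against a threshold stored in microwatts.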
Pryor, Adam 20f3ba845c Implementation for adding pcie_total (#40)
* Implementation for adding pcie_total

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I4b0cfd7095e9d984e939283ee7169d01f55a1847
Signed-off-by: adapryor <Adam.pryor@amd.com>

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I021f29083de651cab9fbe7db98acbe20f65948d4

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I42f3207b745fa787dabe30a85c8e063159d1337d

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 60b7359161]
2024-12-26 18:36:41 -06:00
stali 52bb0d6466 Enable RDC link Status feature
1. Add link status APIs
2. Add link status example for link status API usage


[ROCm/rdc commit: 29b6699b62]
2024-12-23 09:30:21 +08:00
Adam Pryor 1c26bf4304 Implementation for SWDEV-479728:[RDC] - Clock Speed/Power Cap Control
Change-Id: I767a71325527aa3c691e9607953ceafebacfb4d5
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: df170c8801]
2024-12-20 16:03:33 -06:00
stali 1e45293968 Enable RDC topology feature
1. Add topology APIs
2. Add topology example for topology API usage

Change-Id: Ib79c06d0bac85119672f194ba685ebf25029979c


[ROCm/rdc commit: 8bcb5f7068]
2024-12-16 10:02:41 +08:00
limeng12 71e2727a8f Background health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support the following health checks:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
stali f34e245ba1 Enable RDCI policy subsystem
- Enable set and get for policy settings
- Enable register and clear policy events

Change-Id: If4eaaf9b80e668fb21691757210e0aa1532cecae
Signed-off-by: stali <Star.Li@amd.com>


[ROCm/rdc commit: d8fec06bab]
2024-11-12 20:40:08 -06:00
Galantsev, Dmitrii 73c79fcd83 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9c77312c51]
2024-11-07 11:21:22 -06:00
Chen Gong dc905e20ff Implement the discovery -v command line interface
Call the previously implemented get_rdcd_version and rdc_get_smiversion

Change-Id: If76037d462fa9328c3af8c85423ee4547882e36e
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 0cfca6d93d]
2024-09-10 10:06:44 -05:00
Chen Gong d19c6dfa36 Reorganize the code path of the rdci Discovery Subsystem
Prepare for adding 'detection version information' later

Change-Id: Ib2b5e70b2360b1c5ff87a537f41f34f23c7ed61f
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 45c6d0b03b]
2024-09-10 10:06:44 -05:00
Chen Gong 2aba92bdce Add the function of outputting rdci version information
Change-Id: Iabeec48ba2e109ead7fb6fb07454ebcdc74a11e6
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 6591563d53]
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii 38c60ff90b RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eaa1862a80]
2024-01-10 00:27:04 -06:00
Galantsev, Dmitrii ea624cbb7c LINT: Add cpplint, clang-format and pre-commit support
Change-Id: I3cbb787ef27d90486b212dfb1a8c77c460acc2ac
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 434e40305d]
2024-01-09 11:37:11 -06:00
Galantsev, Dmitrii 8fc6d04a54 Format DOUBLE as a fixed floating point number
previous format:
1.20758e+06
0.370689
0.00014128

new format:
1207583.000
0.371
0.000

Change-Id: I00f41d841e5e62c4b25dc5e646b6487449773e01
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 4d35ff6092]
2023-01-18 11:18:57 -05:00
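The formatting change in commit 8fc6d04a54 above can be illustrated with a short sketch. It is written in Python for illustration only and is not RDC's C++ code; the helper name `format_double` is hypothetical, and the three decimal places and the input value 1207583.0 are inferred from the sample output in the commit message.

```python
def format_double(value: float) -> str:
    """Render a value with exactly three digits after the decimal point."""
    return f"{value:.3f}"


# Inputs chosen to match the samples in the commit message (assumed values).
samples = [1207583.0, 0.370689, 0.00014128]

for v in samples:
    # "g" approximates the previous general format; ".3f" is the fixed format.
    print(f"{v:g} -> {format_double(v)}")
```

Fixed notation keeps columns aligned in tabular output, at the cost of showing very small values as 0.000.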
Bill(Shuzhou) Liu 6b700f8005 Support GPU memory test and compute queue test using Rocr
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.

Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9


[ROCm/rdc commit: 78e2f2486b]
2021-10-21 11:01:12 -04:00
Bill(Shuzhou) Liu fa9c6ad6f8 Add the RdcSmiDiagnostic module
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.

It will support the following diagnostics: get the GPU topology, check GPU
parameters, and check processes running on the GPUs.

The grpc client and server side diagnostics function is added.

The diag module is added to the rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd


[ROCm/rdc commit: 76ccf58008]
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu f41c146bc4 Bulk fetch metrics from rocm_smi_lib
The RDC provides a wrapper to bulk fetch metrics from rocm_smi_lib.

If the video card does not support bulk fetch, or the metrics cannot be
bulk fetched, it will fall back to fetching them one by one.

Change-Id: I8852ba1ed67e0fabc805c93b1080f74c233516e1


[ROCm/rdc commit: 51efe26442]
2021-01-07 16:40:37 -05:00
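The bulk-fetch-with-fallback behaviour described in commit f41c146bc4 above might look roughly like the following. This is a Python sketch under stated assumptions: `bulk_read`, `single_read`, and `BulkFetchUnsupported` are hypothetical stand-ins and do not correspond to rocm_smi_lib or RDC functions.

```python
from typing import Callable, Dict, Iterable, List


class BulkFetchUnsupported(Exception):
    """Raised by the hypothetical bulk reader when a device cannot bulk fetch."""


def fetch_metrics(
    fields: Iterable[str],
    bulk_read: Callable[[List[str]], Dict[str, float]],
    single_read: Callable[[str], float],
) -> Dict[str, float]:
    """Try one bulk call first, then fetch whatever is missing one by one."""
    wanted = list(fields)
    try:
        values = dict(bulk_read(wanted))
    except BulkFetchUnsupported:
        values = {}
    # Any field the bulk call did not return is fetched individually.
    for name in wanted:
        if name not in values:
            values[name] = single_read(name)
    return values


def partial_bulk(names: List[str]) -> Dict[str, float]:
    # Pretend only "temp" is available through the bulk path.
    return {n: 42.0 for n in names if n == "temp"}


print(fetch_metrics(["temp", "power"], partial_bulk, lambda name: 7.0))
```

The same function covers both cases from the commit message: a device with no bulk support at all, and a device where only some metrics can be bulk fetched.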
Bill(Shuzhou) Liu acded1f944 rdci dmon Segmentation fault if fields do not contain events
Fix the core dump observed in dev test.

Change-Id: Ib008aeeee2f415174dbb0c4ba301b3f9d6d2d54b


[ROCm/rdc commit: 9bf6e630d6]
2020-12-07 16:52:14 -05:00
Chris Freehill 79b5e54d3b Add event notification support and rdci timestamps
Also:
* print header line every 50 lines on output
* print events that are being listened for with header
* cpplint clean-up

Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc


[ROCm/rdc commit: b278cd379b]
2020-11-22 07:10:39 -05:00
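The periodic header line mentioned in commit 79b5e54d3b above can be sketched as a small generator. This Python sketch is illustrative only; the name `with_headers` and the sample header text are hypothetical, and only the interval of 50 comes from the commit message.

```python
from typing import Iterable, Iterator


def with_headers(rows: Iterable[str], header: str, every: int = 50) -> Iterator[str]:
    """Yield the header before the first row and again after each `every` rows."""
    for index, row in enumerate(rows):
        if index % every == 0:
            yield header
        yield row


# Small demonstration with a shorter interval so the repeat is visible.
for line in with_headers(["0  35.0", "0  36.0", "0  37.0"], "GPU  TEMP", every=2):
    print(line)
```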
Chris Freehill 17430dde45 Add event counter support
Adds support for RSMI event counters. This also includes
"macro" or "pseudo" events, in which an event value is
obtained from RSMI, followed by some post processing before
being displayed in rdci.

Aside from the support of new fields, the main update here
is to introduce an initialization and "shutdown" call for
new fields that will require this.

Also, includes some modifications to the rdci dmon list
command:
* in rdc_field_data.data, added the ability to specify whether
  a field should be hidden or not, by default. This will
  allow us to support many fields, even those that are not
  typically of interest (but sometimes may be), without
  confusing the user or unnecessary clutter.
* added a --list-all option which lists all available fields,
  including the more obscure fields.

Change-Id: I01dd0edea963c12f82c6e44f893a390711ef3e83


[ROCm/rdc commit: d7c9625fc6]
2020-08-17 19:45:18 -04:00
Chris Freehill 6b246dcf4b rdc_field_t replaces uint32_t; centralize field data
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.

Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.

Finally, cleaned up several cpplint issues.

Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec


[ROCm/rdc commit: 5950ebadc4]
2020-08-17 14:09:37 -05:00
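The two ideas in commit 6b246dcf4b above, a dedicated field type instead of a bare integer and a single central table of field data, can be sketched as follows. This is Python rather than RDC's C++, and the field names, ids, and units are hypothetical placeholders, not RDC's real `rdc_field_t` values.

```python
from enum import IntEnum


class Field(IntEnum):
    # Hypothetical ids for illustration; not RDC's real values.
    GPU_CLOCK = 100
    GPU_TEMP = 200
    POWER_USAGE = 300


# One central table: adding a field means adding one enum member and one row.
FIELD_INFO = {
    Field.GPU_CLOCK: {"label": "GPU_CLOCK", "unit": "Hz"},
    Field.GPU_TEMP: {"label": "GPU_TEMP", "unit": "millidegrees C"},
    Field.POWER_USAGE: {"label": "POWER_USAGE", "unit": "microwatts"},
}


def describe(field: Field) -> str:
    """Look up display data for a field, rejecting plain integers."""
    if not isinstance(field, Field):
        raise TypeError("expected a Field, got " + type(field).__name__)
    info = FIELD_INFO[field]
    return f"{info['label']} ({info['unit']})"


print(describe(Field.GPU_TEMP))
```

Converting an arbitrary integer with `Field(999)` raises `ValueError`, which is the Python analogue of the invalid-field protection the commit describes.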
Bill(Shuzhou) Liu 588ea96dd2 Support standard deviation and json output for job stats
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.

A new option --json is added to the rdci to output the results
in json format.

The job stats now use GMT time instead of a raw timestamp
for the start and end times.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca


[ROCm/rdc commit: e6d910f67a]
2020-08-17 14:09:37 -05:00
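The job summary described in commit 588ea96dd2 above (max, min, average, standard deviation, JSON output, GMT start and end times) can be sketched like this. The sketch is Python for illustration only; the key names and time format are hypothetical, and the use of the population standard deviation is an assumption, since the commit message does not say which variant RDC computes.

```python
import json
import statistics
import time


def summarize(samples, start_ts, end_ts):
    """Build a JSON-serializable summary of one metric over a job."""

    def gmt(ts):
        # Render an epoch timestamp as GMT rather than printing the raw number.
        return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(ts))

    return {
        "start_time": gmt(start_ts),
        "end_time": gmt(end_ts),
        "max": max(samples),
        "min": min(samples),
        "average": statistics.mean(samples),
        # Assumption: population standard deviation.
        "standard_deviation": statistics.pstdev(samples),
    }


print(json.dumps(summarize([50.0, 60.0, 70.0], 0, 60), indent=2))
```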
Bill(Shuzhou) Liu 0f8f345992 Add support to use the field name in rdci
In the rdci dmon and fieldgroup, now the fields can be specified
using either number id or the field name.

When the rdc is async fetching metrics, it will not report that fetch
as an error.

Change-Id: I81331e2c239af987181147be5ac0e29ba1617ab4


[ROCm/rdc commit: d30cb81fdb]
2020-08-17 14:07:25 -05:00
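Accepting either a numeric id or a field name, as described in commit 0f8f345992 above, can be sketched as a small parser. This Python sketch is illustrative: the field table, the case-insensitive name matching, and the comma-separated list syntax are assumptions, not taken from rdci.

```python
# Hypothetical id-to-name table; not RDC's real field list.
FIELDS = {100: "GPU_CLOCK", 150: "GPU_TEMP", 300: "POWER_USAGE"}
_BY_NAME = {name: field_id for field_id, name in FIELDS.items()}


def parse_field(token: str) -> int:
    """Resolve one token, given as a numeric id or a field name, to a field id."""
    token = token.strip()
    if token.isdigit():
        field_id = int(token)
        if field_id in FIELDS:
            return field_id
        raise ValueError(f"unknown field id: {token}")
    try:
        return _BY_NAME[token.upper()]
    except KeyError:
        raise ValueError(f"unknown field name: {token}") from None


def parse_field_list(spec: str) -> list:
    """Parse a comma-separated mix of ids and names (assumed syntax)."""
    return [parse_field(part) for part in spec.split(",")]


print(parse_field_list("100,gpu_temp"))
```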
Bill(Shuzhou) Liu b7cf5bc94c Rename description of job stats
Change the job stats description.

Change-Id: I9b56a40d648c05e5327ad1b640277302d0e5e00c


[ROCm/rdc commit: 2772d3f238]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add support for memory temperature.
Add support for memory clock.
Add support for reporting ECC errors.
Add support for reporting PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 39f3d3af8a Implement the stats subsystem in rdci
Add support for the stats subsystem in rdci
Modify the dmon system to handle the case when there are no GPUs in a group

Change-Id: I5a18e1201d24b5318b8e324a77551a757b108f25


[ROCm/rdc commit: 096dc2dadb]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 0813e7052f Implement the rdc_lib API to support the job stats
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.

Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY in gRPC. When customers run into issues, they can enable verbose
logging to help us troubleshoot.

Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.

Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769


[ROCm/rdc commit: a547dc7efd]
2020-08-17 14:07:25 -05:00
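The environment-controlled log level introduced with RdcLogger in commit 0813e7052f above can be sketched as follows. This is a Python illustration, not the RdcLogger implementation; only the variable name `RDC_LOG` comes from the commit message, while the accepted level names and the default of ERROR are assumptions.

```python
import logging
import os

# Hypothetical level names; the commit does not list the values RDC_LOG accepts.
_LEVELS = {"DEBUG": logging.DEBUG, "INFO": logging.INFO, "ERROR": logging.ERROR}


def level_from_env(environ=os.environ, default=logging.ERROR):
    """Map the RDC_LOG variable to a logging level, falling back to a default."""
    name = environ.get("RDC_LOG", "").strip().upper()
    return _LEVELS.get(name, default)


def make_logger(environ=os.environ):
    """Create a logger whose verbosity follows RDC_LOG."""
    logger = logging.getLogger("rdc_sketch")
    logger.setLevel(level_from_env(environ))
    return logger


log = make_logger({"RDC_LOG": "debug"})
print(logging.getLevelName(log.level))
```

Reading the level from the environment lets a user turn on verbose logging without rebuilding or changing command-line arguments.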
Bill(Shuzhou) Liu aef3d29925 Implement the rdci subsystem: group, fieldgroup and dmon
Add the support for rdci subsystem group create, delete and query

Add the support for rdci subsystem fieldgroup create, delete and query

Add support for the rdci dmon system. The dmon system shows the stats every
few seconds until Ctrl-C is pressed. To clean up the resources (for example, unwatch),
a signal handler is added.

Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c


[ROCm/rdc commit: 16bce67835]
2020-08-17 14:07:25 -05:00
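The Ctrl-C cleanup pattern described for dmon in commit aef3d29925 above can be sketched as a loop that a signal handler stops cleanly. This is a Python illustration only; the `Monitor` class, its `watching` flag (standing in for the watch/unwatch step), and the `max_iterations` parameter are hypothetical.

```python
import signal
import time


class Monitor:
    """Print samples periodically until asked to stop, then release resources."""

    def __init__(self):
        self.running = True
        self.watching = False

    def handle_sigint(self, signum, frame):
        # Only set a flag here; the loop exits and cleanup runs in run().
        self.running = False

    def install(self):
        """Register the handler for Ctrl-C (must be called from the main thread)."""
        signal.signal(signal.SIGINT, self.handle_sigint)

    def run(self, interval=1.0, max_iterations=None):
        self.watching = True  # stands in for starting a watch
        shown = 0
        try:
            while self.running and (max_iterations is None or shown < max_iterations):
                print(f"sample {shown}")
                shown += 1
                time.sleep(interval)
        finally:
            self.watching = False  # stands in for the unwatch step
        return shown


monitor = Monitor()
print(monitor.run(interval=0.0, max_iterations=2))
```

For interactive use, `monitor.install()` would be called before `monitor.run()`, so that pressing Ctrl-C ends the loop and the cleanup in the `finally` block still runs.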
Bill(Shuzhou) Liu ce4890f88c Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC calls to the daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in the daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611


[ROCm/rdc commit: 7ee29b6cdd]
2020-08-17 14:07:25 -05:00