28 Commits

Autor SHA1 Nachricht Datum
Adam Pryor 5bf6e366dd [SWDEV-548460] Add RDC Policy Reset Message (#2180)
* [SWDEV-548460] Add RDC Policy Reset Message

* [rdc] Bump version to 1.3.0

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

* chore: [rdc] Format CMakeLists.txt

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
2025-12-29 08:31:13 -08:00
Pryor, Adam 151b0301f1 [SWDEV-535739] Align RDC with amdsmi 26.0 (#191)
* Align RDC with amdsmi 26.0.0
* Remove RDCI_IOLINK_TYPE_NUMIOLINKTYPES

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: Ib7f2a22bd9544e0bf74afb1ed8d8f8b79b129b1a

[ROCm/rdc commit: cc7ccf507a]
2025-06-02 18:27:19 -05:00
Pryor, Adam ec661d5d17 [SWDEV-243250] RDC Process Start/Stop integration (#189)
Change-Id: I3d2be33b5d23cd259b3d06fb572f81d19e6c3798

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 0e9c3b2c4f]
2025-06-02 14:42:21 -05:00
Galantsev, Dmitrii 5276903800 Revert "Implement CPU discovery support"
This reverts commit f967f8a17d15e148464393fcd145af01dc0e1525.


[ROCm/rdc commit: 24024f0e4f]
2025-04-07 20:45:19 -05:00
Yuan, Perry f0f44d977f Implement CPU discovery support (#77)
* Implement CPU discovery support

SWDEV-482949:

enable the CPU model name info support to the RDC, rdci command
can detect GPU and CPU modules at the same time.
It will query the CPU info through the amdsmi interface like below:

1 GPUs found.
-----------------------------------------------------------------
GPU Index        Device Information
0               AMD Radeon PRO W7800
=================================================================
1 CPUs found.
-----------------------------------------------------------------
CPU Index        Device Information
0               AMD Ryzen Threadripper PRO 7995WX 96-Cores
-----------------------------------------------------------------

Change-Id: Ibc6533c9a61000cd86c45b1bae14c3eb6788c119
Signed-off-by: Perry Yuan <perry.yuan@amd.com>

* CMAKE - Add required version for amdsmi

Change-Id: I341a89351d196ec66cce215a5d1d3953302fcc66
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

---------

Signed-off-by: Perry Yuan <perry.yuan@amd.com>
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

[ROCm/rdc commit: 3bdca8b8b6]
2025-03-31 10:58:36 +08:00
Pryor, Adam fe868f6763 [SWDEV-498711] RDC Partition Implementation (#119)
* [SWDEV-498711] RDC Partition Implementation

Change-Id: Ibfc3709793770537e4c9d36458f34c6b4f461724
Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 47692d3ed5]
2025-03-27 14:10:11 -05:00
Galantsev, Dmitrii b78295c8f8 RVS - Add IET and PEBB tests
Change-Id: Ia032901d74c882e5cbfa5a3164199cd4d571341f
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 5861ec7663]
2025-01-08 18:23:13 -06:00
Galantsev, Dmitrii 9d32387925 RVS - Add memory bandwidth test
Change-Id: I4c8990170861f6a0f3853615db68634fdaa7a622
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: b058cbecf1]
2025-01-08 18:23:13 -06:00
Pryor, Adam 20f3ba845c Implementation for adding pcie_total (#40)
* Implementation for adding pcie_total

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I4b0cfd7095e9d984e939283ee7169d01f55a1847
Signed-off-by: adapryor <Adam.pryor@amd.com>

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I021f29083de651cab9fbe7db98acbe20f65948d4

* Updates

Signed-off-by: adapryor <Adam.pryor@amd.com>
Change-Id: I42f3207b745fa787dabe30a85c8e063159d1337d

---------

Signed-off-by: adapryor <Adam.pryor@amd.com>

[ROCm/rdc commit: 60b7359161]
2024-12-26 18:36:41 -06:00
stali 52bb0d6466 Enable RDC link Status feature
1.add link status APIs
   2.Add link status example for link status API usage


[ROCm/rdc commit: 29b6699b62]
2024-12-23 09:30:21 +08:00
Adam Pryor 1c26bf4304 Implementation for SWDEV-479728:[RDC] - Clock Speed/Power Cap Control
Change-Id: I767a71325527aa3c691e9607953ceafebacfb4d5
Signed-off-by: adapryor <Adam.pryor@amd.com>


[ROCm/rdc commit: df170c8801]
2024-12-20 16:03:33 -06:00
stali 1e45293968 Enable RDC topology feature
1.Add topology APIs
2.Add topology example for topology API usage

Change-Id: Ib79c06d0bac85119672f194ba685ebf25029979c


[ROCm/rdc commit: 8bcb5f7068]
2024-12-16 10:02:41 +08:00
limeng12 71e2727a8f Backgroud health check
Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support following health:
 - XGMI error detected
 - PCIE replay count detected
 - Memory check
 - InfoROM check
 - Power/Thermal check
The grpc client and server side health function is added.
The health module is added to the rdci.

At present, XGMI/PCIE and a part of Memory have been implemented.
Others will be added as soon as possible.

Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112


[ROCm/rdc commit: 853d3b0cc5]
2024-11-19 14:00:49 +08:00
Galantsev, Dmitrii 73c79fcd83 Finish basic logging impl
Change-Id: Ia3d6ac80f4832f1bfb63573c543659abd5f84341
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9c77312c51]
2024-11-07 11:21:22 -06:00
Chao Fei d489245fbe Enable RDC policy feature
1. Add policy APIs
2. Add policy example for policy API usage

Change-Id: I14deb7c809d0b865b7bb083842092fc37868025e
Signed-off-by: Chao Fei <Chao.Fei@amd.com>


[ROCm/rdc commit: 345ac64a43]
2024-10-23 20:37:27 -04:00
Chen Gong dc85a9e385 Provide a way for rdci to get component version
For rdci, the version information of some components(such as RDCD),
cannot be obtained through the rdc_device_get_component_version API.

Here create a fake rdc API to get them.

Change-Id: I75d8bcd1993873cff209995b58362f75787a4598
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 8db404f84f]
2024-09-10 10:06:44 -05:00
Chen Gong 2ae8557614 Add an rdc API to get component version
Change-Id: I56250a6101debeb78628f1fd1dfff7f21c52cdc0
Signed-off-by: Chen Gong <curry.gong@amd.com>


[ROCm/rdc commit: 69a9f24b6e]
2024-09-10 10:06:44 -05:00
Galantsev, Dmitrii 028355dff0 SWDEV-439576 - rocmsmi -> amdsmi
- Migrate to amdsmi library
- NOTE: raslib still uses rocmsmi
- Remove unused rocmsmi service
- Remove unused RDC client code
- Remove RSMI calls from protos/rdc.proto

Change-Id: Ifc34a264c506b0ec5792307ee56b34526268762d
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: 9702d0f2d7]
2024-04-09 20:19:28 -05:00
Galantsev, Dmitrii 38c60ff90b RVS: Finish initial RVS integration
NOTE: RVS Build is disabled by default due to CI build issues.

Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>


[ROCm/rdc commit: eaa1862a80]
2024-01-10 00:27:04 -06:00
Bill(Shuzhou) Liu fa9c6ad6f8 Add the RdcSmiDiagnostic module
Provides a RdcSmiDiagnostic module, which will call rocm_smi_lib.

It will support following diagnostics: Get GPU Topology, Check GPU
parameters and check processes running on the GPUs.

The grpc client and server side diagnostics function is added.

The diag module is added to the rdci.

Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd


[ROCm/rdc commit: 76ccf58008]
2021-07-26 14:56:17 -04:00
Bill(Shuzhou) Liu 588ea96dd2 Support standard deviation and json output for job stats
In the job stats, in addition to the max, min and average,
it will also display the standard deviation.

A new option --json is added to the rdci to output the results
in json format.

In the job stats, using the GMT time instead of timestamp
for start and end time.

Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca


[ROCm/rdc commit: e6d910f67a]
2020-08-17 14:09:37 -05:00
Bill(Shuzhou) Liu 5c2a56e069 Support extra metrics in the RDC
Remove the * in the rdci stats
When a group is created, the GPUs can be added in the same command.
Add the support to the memory temperature.
Add the support to the memory clock.
Add the support to report the ECC errors.
Add the support to report the PCIe bandwidth throughput.

Since the RX/TX throughput may take 1 second to retreive, an async fetch is implemented
in the RdcMetricFetcherImpl.

Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a


[ROCm/rdc commit: f4a3fd4dda]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu dc48d8c977 Implement the gRPC APIs for the job stats
Add the job stats APIs in the rdc_api_service at the server side rdcd
Add the job stats APIs for the RdcStandaloneHandler at the client side
Make the load librdc.so and librdc_client.so thread safe.
Impelement async update all fields in RdcEmbeddedHandler.

Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce


[ROCm/rdc commit: fe3e75edfa]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu ce4890f88c Implement the APIs for gRPC calls in client/server
Implement the APIs defined in the RdcStandaloneHandler to make gRPC call to daemon

Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in daemon

Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids()
Those two APIs are required by the rdci group and fieldgroup
sub-modules.

Change-Id: I066091423146dea180c16af212688ed43dc44611


[ROCm/rdc commit: 7ee29b6cdd]
2020-08-17 14:07:25 -05:00
Bill(Shuzhou) Liu 199f085ce3 SWDEV-209060 - Create the Skeleton RDC CLI and daemon
Create the skeleton implementation of rdc_client.so and rdci. Modify current rdcd to
integrate the RDC API service:

rdc.proto is changed to add a new RdcAPI service which defined the interfaces for the RDC API.

RdcStandaloneHandler.cpp is added to send the request using gRPC to the rdcd. It is built into
the rdc_client.so

rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement skeleton rdci.
Currently, the discovery subsystem is supported.

rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so which should also be
removed in the future.

Add the instruction how to run the rdcd and rdci in the build folder in the README.md.

Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c


[ROCm/rdc commit: 020f6939f7]
2020-08-17 14:07:25 -05:00
Chris Freehill 87aa4ff77c Add read fan values and associated tests
Change-Id: I89322e93d5f3110adace15e5a576f00d4934be79


[ROCm/rdc commit: 4729c47866]
2020-08-17 14:07:25 -05:00
Chris Freehill ba14edbb4d Break srvs. into rsmi & admin srvs. Add VerifyConnection api.
Change-Id: I67567264c37e31f3409062a14e56eba4801cd944


[ROCm/rdc commit: dc6f6f3e9a]
2020-01-09 20:02:33 -06:00
Chris Freehill bc7f01e992 Initial RDC commit
Includes server, client and example targets.

Change-Id: I30596fb0453af71d49b8390a8468a6d073200836


[ROCm/rdc commit: 5898345d17]
2020-01-09 17:57:29 -06:00