Add the RdcSmiHealth module, which will call rocm_smi_lib.
It will support the following health checks:
- XGMI error detected
- PCIE replay count detected
- Memory check
- InfoROM check
- Power/Thermal check
The gRPC client-side and server-side health functions are added.
The health module is added to rdci.
At present, the XGMI and PCIe checks and part of the Memory check are implemented.
The remaining checks will be added as soon as possible.
Change-Id: I1bd99290bdc7dea733f21a41a8c4bcefb2138112
[ROCm/rdc commit: 853d3b0cc5]
Detach the thread that handles shutdown signals instead of joining
it. This avoids a segfault on specific ASICs.
Signed-off-by: Li Ma <li.ma@amd.com>
Change-Id: I74ac53c027ac370605caaa87115c83fd8027526a
[ROCm/rdc commit: ca569346a3]
Implement an API to obtain the version information of the rdc calling component.
See rdc_component_t for details on available components.
It can be expanded later if necessary.
Change-Id: I03b48f774179c52c57b606704283add74ca39a02
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: 5a3fd9fbc1]
Display version information along with the hash value.
Change-Id: I0f9ad576f8f66747ce2e84d4f524ccd16d399927
Signed-off-by: Chen Gong <curry.gong@amd.com>
[ROCm/rdc commit: ac874d3921]
Editing the /opt/rocm/etc/rdc file changes the RDC launch options. If
the file doesn't exist, the service should still launch (though a new
file should likely be included with the next released package of 'rdc').
Change-Id: I1a1891e9c5c3e6048754eb555779a97a170754c0
[ROCm/rdc commit: de3cb36ce0]
The executable rdcd was using an absolute path in rdc.service. Using update-alternatives allows the binary to be invoked from anywhere, so no absolute path is required.
Change-Id: I2f3d6fcbf9dd854870cfc2e00532c504ce6cd6fc
[ROCm/rdc commit: 0ca6d6fa59]
These RUNPATH changes make it so libraries can be found without setting
LD_LIBRARY_PATH.
Mostly tested on installed RDC binaries and libraries. The
build binaries should also work.
Change-Id: Ifd908a5b61d24dfcbb1d08d21b4ee830156d8643
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 32806681ca]
Also add stddef.h workaround for old GCC.
RHEL-8 still uses GCC 8.5 and templates are not well supported.
Change-Id: Ia4dae23892ec63682ea848c46ba81de85cf6d209
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: f9e80cc37a]
NOTE: RVS Build is disabled by default due to CI build issues.
Change-Id: I1593f0fe22075a9f86f54afa3ac151e109f1f7bd
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: eaa1862a80]
Join the signal handling thread instead of cancelling it to prevent a
crash with "terminate called without an active exception".
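For reference, one common way to hit this message in C++ is destroying a
std::thread object while it is still joinable, which calls std::terminate().
The sketch below shows a join-based shutdown under that assumption. The
SignalWatcher class and its polling loop are hypothetical stand-ins, not the
rdcd code, which this message does not show.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for a signal handling thread (not the rdcd class).
class SignalWatcher {
 public:
  SignalWatcher() : worker_([this] { Run(); }) {}

  // Joining here guarantees the std::thread is never destroyed while
  // joinable, which would call std::terminate().
  ~SignalWatcher() { Stop(); }

  SignalWatcher(const SignalWatcher&) = delete;
  SignalWatcher& operator=(const SignalWatcher&) = delete;

  // Asks the worker to exit and waits for it. Safe to call more than once.
  void Stop() {
    stop_.store(true);
    if (worker_.joinable()) {
      worker_.join();
    }
    assert(!worker_.joinable());
  }

  bool Running() const { return worker_.joinable(); }

 private:
  void Run() {
    // A real handler would likely block waiting for a signal; polling a
    // flag keeps the sketch small.
    while (!stop_.load()) {
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
  }

  std::atomic<bool> stop_{false};  // declared first so it is initialized before worker_
  std::thread worker_;
};
```

Calling Stop() explicitly before teardown, or simply letting the destructor
run, both leave the thread joined.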
Change-Id: I2e18eb825728fd3a94f67b1b0049516bb7b6ebbc
[ROCm/rdc commit: 1ab4110d46]
- Replace gRPC library with gRPC package
- Relax RUNPATH
- Make LINKER_FLAGS global
The gRPC package includes its dependencies:
SSL, UPB, ABSL, etc.
Change-Id: Ieb198ad96e26e89b09cb85986214a5b1451b17a6
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
[ROCm/rdc commit: 3e4c55ec6c]
- Respect CMAKE_INSTALL_PREFIX and ignore RDC_CLIENT_INSTALL_PREFIX
- Move example and rdctst from rocm/bin to rocm/share/rdc
- Add README for examples
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Change-Id: I0b1d996d206327fd1b51ac6e82d548829bdb1570
[ROCm/rdc commit: f6efd7fbf6]
Main CMake improvements:
* Add rdctst with -DBUILD_TESTS=ON
* Set default ROCM_DIR to /opt/rocm/
* Split rdc_libs/CMakeLists.txt into subdirectories
* Package tests into rdc-tests.deb and .rpm
Misc improvements:
* Add .editorconfig to normalize code formatting
* Add .gitignore
* Expand RPATH for gRPC to reduce LD_LIBRARY_PATH usage
* Export compile_commands.json
* Show warning and do not install gRPC if GRPC_ROOT is left as default
* Move .in files into relevant subdirectories
* Move most variables into project CMakeLists.txt to avoid redefinitions
* Normalize CMakeLists.txt formatting (4 spaces indentation)
* Rename DIAGNOSTIC_LIB to RDC_ROCR_LIB
* Update gRPC version in README to 1.44.0
* Remove gtest source
* Pull gtest from github if not installed
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Depends-On: I1039ef61247e3f0ff822925cc869fb0c2bf3af85
Change-Id: I879b21428e6642f19fda67092b365d8b78b7ba7b
[ROCm/rdc commit: 2c171767b3]
With the file reorganization changes, binaries are moved to /opt/rocm-ver/bin.
Similarly, rdc.service is moved to /opt/rocm-ver/libexec/rdc.
The test suites still use the old paths.
Once the test suites are updated, backward compatibility for the binaries and rdc.service can be removed.
Corrected the binary path in rdc.service.in.
Corrected the gRPC RUNPATH.
Change-Id: I306924d81cedc19586305a79d51eea8af6e70e83
[ROCm/rdc commit: c3ea96dd71]
SWDEV-291455 - Binaries, header files and libraries are installed in the bin, include and lib folders under /opt/rocm-ver
Prebuilt ras library with updated search path
cmake config files in lib/cmake/rdc
grpc, sp3, hsaco and private libraries installed in lib/rdc
config installed in share/rdc
authentication and python_binding installed in libexec/rdc
Backward compatibility added for header files and libraries
Depends-On: I3f3d192935923f71737b3fe55ded536654a73dd7
Change-Id: Ia1a6cadc59034b155631a1ee5fdbe692d2a8a71b
[ROCm/rdc commit: 52a3463147]
grpc v1.44.0 needs to link to library absl_synchronization. The
CMakeLists.txt is changed to link to that library if available.
Change-Id: I92f7247473a70e7a83416b9744e788e45d104565
[ROCm/rdc commit: 2a46ee2ab2]
Provide an RdcSmiDiagnostic module, which will call rocm_smi_lib.
It will support the following diagnostics: get the GPU topology, check GPU
parameters and check the processes running on the GPUs.
The gRPC client-side and server-side diagnostics functions are added.
The diag module is added to rdci.
Change-Id: I10a0cf3c20556a61373ab686f82cae75acaa40dd
[ROCm/rdc commit: 76ccf58008]
Fix lintian errors related to maintainer, postinst script and
permissions.
Change-Id: I6924ff92ff5453fa7e562a6188c2c91cea87df68
[ROCm/rdc commit: 7a05145542]
Write access is required for some RSMI services. This change
temporarily permits write access so configuration can be done,
and then turns it off.
To help with this, the ScopedCapability struct is introduced to
provide scope limited access, helping to ensure a process is not
left with extra capability, should an exception occur.
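The scope-limited idea can be sketched with a small RAII guard. This is an
illustration only: ScopedAccess, RunWithWriteAccess and the bool flag are
hypothetical, and the flag stands in for the real capability calls, which
this message does not show.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical RAII guard: turns something on in the constructor and
// always turns it off again in the destructor.
class ScopedAccess {
 public:
  ScopedAccess(std::function<void()> enable, std::function<void()> disable)
      : disable_(std::move(disable)) {
    assert(enable && disable_);
    enable();
  }
  ~ScopedAccess() { disable_(); }

  ScopedAccess(const ScopedAccess&) = delete;
  ScopedAccess& operator=(const ScopedAccess&) = delete;

 private:
  std::function<void()> disable_;
};

// Runs `work` with write access enabled. The flag is cleared again on both
// the normal path and the exception path, because the guard's destructor
// runs during stack unwinding.
bool RunWithWriteAccess(bool& write_enabled, const std::function<void()>& work) {
  try {
    ScopedAccess guard([&] { write_enabled = true; },
                       [&] { write_enabled = false; });
    work();
    return true;
  } catch (const std::runtime_error&) {
    return false;
  }
}
```

The point of the guard is that no code path after construction has to
remember to turn the access off.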
Change-Id: I4978a1a688db935b8bfc27b3b537a0dd07959d3f
[ROCm/rdc commit: 6b5aeaaa23]
Install the grpc lib to rdc/grpc/lib and add the missing libraries.
Add "--no-as-needed" and all extra grpc libraries in rdci/rdcd, because
RUNPATH is only searched for direct dependencies.
Change-Id: I596acb2eb3a7228d703e79db64699bc20d0e7c09
[ROCm/rdc commit: 07d4d5376e]
Installing files to a standard path for every version and relying on
ldconfig causes problems with side-by-side installs.
Use RUNPATH/RPATH for ROCm so that all ROCm libraries are
found without the need for ldconfig.
For the RDC server to be picked up by systemctl, the service config file
must be a symlink from /lib/systemctl/system/rdc.service to the
corresponding RDC file path in a given version of ROCm.
For side-by-side install packages of RDC, the post-install scripts
will be removed. Hence the user will have to set the symlink explicitly
for now.
Change-Id: I916da7cf132f0f9c667e2470fac2b0875e3db9d0
[ROCm/rdc commit: fe1593dda5]
Also:
* print header line every 50 lines of output
* print events that are being listened for with header
* cpplint clean-up
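A minimal sketch of repeating a header at a fixed interval, assuming rows are
already formatted as strings. FormatWithHeaders is a hypothetical helper, not
the rdci code.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: emits `header` before the first row and again
// before every `interval`-th row after that.
std::string FormatWithHeaders(const std::vector<std::string>& rows,
                              const std::string& header,
                              std::size_t interval = 50) {
  assert(interval > 0);
  std::ostringstream out;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (i % interval == 0) {
      out << header << '\n';
    }
    out << rows[i] << '\n';
  }
  return out.str();
}
```

With the default interval, 120 rows produce three header lines: before rows
0, 50 and 100.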
Change-Id: Ic049eb79156a9528b556e56f0fa43e1344f898cc
[ROCm/rdc commit: b278cd379b]
Make the RDC use the new rdc_field_t enum instead of uint32_t.
This will help prevent invalid field types from being passed in.
Also, centralize where data related to fields is kept. This will
reduce the number of places where changes are required each
time a new field is added.
Finally, cleaned up several cpplint issues.
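The two ideas, a typed field identifier and one central table of per-field
data, can be sketched as follows. The enumerators, names and descriptions are
hypothetical placeholders, not the actual rdc_field_t values.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical field identifiers (placeholders, not the rdc_field_t values).
enum class FieldId : std::uint32_t { kGpuTemp, kPowerUsage, kMemClock, kCount };

struct FieldInfo {
  FieldId id;
  std::string_view name;
  std::string_view description;
};

// Single place that holds per-field data. Adding a field means adding one
// enumerator and one row here.
inline constexpr std::array<FieldInfo, 3> kFieldTable{{
    {FieldId::kGpuTemp, "GPU_TEMP", "GPU temperature"},
    {FieldId::kPowerUsage, "POWER_USAGE", "Power usage"},
    {FieldId::kMemClock, "MEM_CLOCK", "Memory clock"},
}};

static_assert(kFieldTable.size() == static_cast<std::size_t>(FieldId::kCount),
              "every field needs exactly one table row");

// Takes the enum rather than a raw integer, so GetFieldInfo(42) does not
// compile.
inline const FieldInfo& GetFieldInfo(FieldId id) {
  for (const FieldInfo& info : kFieldTable) {
    if (info.id == id) {
      return info;
    }
  }
  assert(false && "FieldId without a table row");
  return kFieldTable[0];
}
```

The static_assert catches an enumerator added without its table row at
compile time.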
Change-Id: I48e4512e18c164411d8b09ae3d4bed99fba359ec
[ROCm/rdc commit: 5950ebadc4]
The job stats now display the standard deviation in addition to the
max, min and average.
A new option, --json, is added to rdci to output the results
in JSON format.
The job stats now use GMT time instead of a timestamp
for the start and end times.
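A minimal sketch of the summary computation and the two output forms. All
names, the JSON keys and the time format are assumptions for illustration.
The sketch uses the population standard deviation (divide by N); this message
does not say which variant rdci uses.

```cpp
#include <time.h>

#include <algorithm>
#include <cassert>
#include <cmath>
#include <ctime>
#include <sstream>
#include <string>
#include <vector>

struct Summary {
  double max;
  double min;
  double average;
  double stddev;
};

// Two-pass summary of a non-empty sample set.
Summary Summarize(const std::vector<double>& samples) {
  assert(!samples.empty());
  const auto [min_it, max_it] =
      std::minmax_element(samples.begin(), samples.end());
  double sum = 0.0;
  for (double v : samples) sum += v;
  const double average = sum / samples.size();
  double squared = 0.0;
  for (double v : samples) squared += (v - average) * (v - average);
  // Population standard deviation; the sample variant divides by N - 1.
  const double stddev = std::sqrt(squared / samples.size());
  return {*max_it, *min_it, average, stddev};
}

// Hypothetical JSON layout; the key names are assumptions.
std::string ToJson(const Summary& s) {
  std::ostringstream out;
  out << "{\"max\": " << s.max << ", \"min\": " << s.min
      << ", \"average\": " << s.average
      << ", \"standard_deviation\": " << s.stddev << "}";
  return out.str();
}

// Formats seconds since the epoch as GMT; the format string is an assumption.
std::string FormatGmt(std::time_t t) {
  std::tm tm_utc{};
  gmtime_r(&t, &tm_utc);  // POSIX, thread-safe variant of gmtime()
  char buf[32];
  std::strftime(buf, sizeof(buf), "%Y-%m-%d %H:%M:%S GMT", &tm_utc);
  return buf;
}
```

For the samples 2, 4, 4, 4, 5, 5, 7, 9 the average is 5 and the population
standard deviation is 2.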
Change-Id: If245c4fc4854a1dc867f97ff5aa9112af7962eca
[ROCm/rdc commit: e6d910f67a]
Also:
* update README documentation
* correct postinst scripts for deb and rpm
* add lib64/ to link_directories (needed for CentOS and others)
* remove a redundant "rdc" from the package names
* rearrange the package names to conform to convention
For example:
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty_amd64.deb
rdc-server_1.0.0.0.local-build-0-c3187fb-dirty.x86_64.rpm
* fix issues that result from having, in essence, 2 different
install prefixes, 1 for the client and 1 for the server.
Change-Id: I88f0e1b8b72df2793c35ed71534afd91142da012
[ROCm/rdc commit: 4008dd8eac]
Remove the check for whether rdcd is started by the rdc user.
Add a read access check for the private key and certificates when
authentication is enabled.
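A read access check of this kind can be sketched with POSIX access().
FirstUnreadable is a hypothetical helper and the file list is supplied by the
caller; the actual key and certificate paths are not given in this message.

```cpp
#include <unistd.h>

#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Hypothetical helper: returns the first path that cannot be read, or
// std::nullopt if every path is readable.
std::optional<std::string> FirstUnreadable(
    const std::vector<std::string>& paths) {
  for (const std::string& path : paths) {
    assert(!path.empty());
    // access() checks against the real user and group IDs of the process.
    if (access(path.c_str(), R_OK) != 0) {
      return path;
    }
  }
  return std::nullopt;
}
```

A daemon could run such a check at startup and refuse to enable
authentication, with a clear error naming the file, if anything is returned.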
Change-Id: I0e7a7eafb7985801572f809da0cb3e4012683153
[ROCm/rdc commit: 96afb24845]
Remove the * in the rdci stats.
When a group is created, the GPUs can be added in the same command.
Add support for the memory temperature.
Add support for the memory clock.
Add support for reporting ECC errors.
Add support for reporting PCIe bandwidth throughput.
Since the RX/TX throughput may take 1 second to retrieve, an async fetch is implemented
in the RdcMetricFetcherImpl.
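One way to keep a slow read off the caller's path is to return the last
completed value and refresh it in the background. The sketch below uses
std::async for this. AsyncMetric is hypothetical and is not the
RdcMetricFetcherImpl code, which may be structured differently.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <optional>
#include <utility>

// Hypothetical wrapper around a slow fetch. Intended for a single caller;
// it does no locking of its own.
class AsyncMetric {
 public:
  explicit AsyncMetric(std::function<long()> slow_fetch)
      : fetch_(std::move(slow_fetch)) {
    assert(fetch_);
  }

  // Never waits for the slow fetch: harvests a finished result if there is
  // one, makes sure a refresh is in flight, and returns the cached value
  // (empty until the first fetch has completed).
  std::optional<long> Get() {
    if (pending_.valid() && pending_.wait_for(std::chrono::seconds(0)) ==
                                std::future_status::ready) {
      cached_ = pending_.get();
    }
    if (!pending_.valid()) {
      pending_ = std::async(std::launch::async, fetch_);
    }
    return cached_;
  }

  // Blocks until the in-flight fetch, if any, has finished.
  void WaitForPending() {
    if (pending_.valid()) {
      pending_.wait();
    }
  }

 private:
  std::function<long()> fetch_;
  std::future<long> pending_;  // its destructor waits for a running fetch
  std::optional<long> cached_;
};
```

The trade-off is staleness: a caller sees the previous sample while the next
one is still being read.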
Change-Id: If04f602fe1f2d14dbf7c2fb189549fd030523f9a
[ROCm/rdc commit: f4a3fd4dda]
Pass in GRPC root (or use default location) for RDC to use
when building RDC components.
Change-Id: I89db2ac2be27ab6449c817d210a94c11fef965fd
[ROCm/rdc commit: 1b58033183]
Add the job stats APIs to the rdc_api_service on the server side (rdcd).
Add the job stats APIs to the RdcStandaloneHandler on the client side.
Make the loading of librdc.so and librdc_client.so thread safe.
Implement an async update of all fields in RdcEmbeddedHandler.
Change-Id: I659d91efb32d1094d3b7f0f2cec39518cd7336ce
[ROCm/rdc commit: fe3e75edfa]
Depending on how a user starts rdcd, rdcd will either have
full monitor/control capabilities or have just monitoring
capabilities.
The only 2 user ids allowed are "rdc" and root.
Change-Id: Ie296a2f68c9723bef5945b1af1070ef99eeea93b
[ROCm/rdc commit: a6acf24ae7]
Implement the APIs defined in the RdcStandaloneHandler to make gRPC calls to the daemon.
Implement the APIs defined in the RdcAPIServiceImpl to handle the gRPC calls in the daemon.
Add two APIs to get all GPU groups and field groups: rdc_group_get_all_ids()
and rdc_group_field_all_ids().
These two APIs are required by the rdci group and fieldgroup
sub-modules.
Change-Id: I066091423146dea180c16af212688ed43dc44611
[ROCm/rdc commit: 7ee29b6cdd]
Create the skeleton implementation of rdc_client.so and rdci. Modify the current rdcd to
integrate the RDC API service:
rdc.proto is changed to add a new RdcAPI service, which defines the interfaces for the RDC API.
RdcStandaloneHandler.cpp is added to send requests to rdcd using gRPC. It is built into
rdc_client.so.
rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement a skeleton rdci.
Currently, the discovery subsystem is supported.
rdc_api_service.cc is added to the server as a skeleton implementation of the RdcAPI service. Currently,
only the discovery API is implemented. Note: rdc_rsmi_service is disabled and will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so, which should also be
removed in the future.
Add instructions to README.md on how to run rdcd and rdci from the build folder.
Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c
[ROCm/rdc commit: 020f6939f7]
The rdc account will be created on installation if it does
not already exist. It will be a system account with no
home directory.
rdcd will be started as a systemd service, but will change to
user "rdc". The rdc user will drop all privileges except
CAP_DAC_OVERRIDE, which is kept in the permitted set. This means
the default mode will have no special privileges, but will have the
ability to gain write access (e.g., to sysfs) when needed.
rdc tests were being inadvertently added to the
installation. This was adversely impacting the new
functionality, so it was corrected in this commit.
Also included are a few small formatting changes.
Change-Id: I9c6bb132fee28119fd3960594dfb97bd2e7b282a
[ROCm/rdc commit: 5cc498c6aa]