Remove occupancy metrics and replace with OccupancyPercent
Add OCCUPANCY_PERCENT which uses OccupancyPercent
Add GR_ENGINE_ACTIVE which uses GPU_UTIL/100
Add TENSOR_ACTIVE_PERCENT which uses MfmaUtil
Modify FLOPS_64 to use FP64_ACTIVE
Change-Id: I5f30d77a0c80f5ac78abd1a9e57f8a0a3c6cc00b
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Provide support for reliable metrics and hide experimental in current
release.
Further ROCMTools integration development is pushed out to ROCm 5.6.
Change-Id: Iae7a0ed3991588c833bd8ef580b02b9c71390d55
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
This commit adds integration with ROCmTools
Additional changes:
- Fix DEB and RPM installation issue when systemd is not present
- Fix typos in rdc.h
- Wrap negative values in parentheses in rdc.h
- CMAKE: Improve rocm_smi searching
- README: Improve formatting, add info about ROCmTools
Metrics added: 700-714
Metrics can be listed with `rdci dmon --list-all`
Majority of the metrics are only supported by Instict (MI) series GPUs
700 RDC_FI_PROF_ELAPSED_CYCLES should be available on most devices
See README for more information
Change-Id: I907d3eacdc92fc5588ca6c76c2fa1ce0ad900770
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Main CMake improvements:
* Add rdctst with -DBUILD_TESTS=ON
* Set default ROCM_DIR to /opt/rocm/
* Split rdc_libs/CMakeLists.txt into subdirectories
* Package tests into rdc-tests.deb and .rpm
Misc improvements:
* Add .editorconfig to normalize code formatting
* Add .gitignore
* Expand RPATH for gRPC to reduce LD_LIBRARY_PATH usage
* Export compile_commands.json
* Show warning and do not install gRPC if GRPC_ROOT is left as default
* Move .in files into relevant subdirectories
* Move most variables into project CMakeLists.txt to avoid redefinitions
* Normalize CMakeLists.txt formatting (4 spaces indentation)
* Rename DIAGNOSTIC_LIB to RDC_ROCR_LIB
* Update gRPC version in README to 1.44.0
* Remove gtest source
* Pull gtest from github if not installed
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Depends-On: I1039ef61247e3f0ff822925cc869fb0c2bf3af85
Change-Id: I879b21428e6642f19fda67092b365d8b78b7ba7b
A new diagnostic module librdc_rocr.so is created. The
module uses Rocr to test the memory allocation, memory access
and compute queue ready status.
Change-Id: I9098f4fc3209bf381b7cb3658a4e94c2e22f2fe9
Installing files to standard path across each version and using
ldconfig has issues with side-by-side install.
Usage of RUNPATH/RPATH for ROCm to ensure all ROCm libraries are
picked without the need for ldconfig.
For RDC server to be picked up by systemctl, service config file
shall be a symlink from /lib/systemctl/system/rdc.service to
corresponding RDC file path in a given version of ROCm
For side-by-side install packages of RDC post install scripts
will be removed. Hence Use will have to set the symlink explicitly
for now.
Change-Id: I916da7cf132f0f9c667e2470fac2b0875e3db9d0
When RDC are only used as the libraries, the user can choose not to build
the rdci and rdcd, which will remove the dependencies to the gRPC and protoc.
The -DBUILD_STANDALONE=off should be pass to the cmake.
* Change README.md for the instructions.
* Move the python_binding installation from client/CMakeLists.txt to CMakeLists.txt
so that the RDC library only build will also install the folder.
* Change CMakeLists.txt and rdc_libs/CMakeLists.txt to build with gRPC only if
the BUILD_STANDALONE is enabled.
Change-Id: If9cfe9fc298a83636d85fe352a311fe2fe041661
Also:
* fix typo in rpm post install script
* for RPM, tell CPack to exclude intermediate directories
in rpm file
Change-Id: I9dbb4901298d3699e092b53b339f5cb1d77b4edb
(cherry picked from commit e894cfa757aae8343afb373ce4ae60a1aa950a91)
Also:
* consolidated the info in the previous rdc/README.md into
the README.md that was moved from docs/ directory.
* added missing information to get grpc into the default
library path (needed to add the grpc dir with ldconfig).
* formatting fixes
Change-Id: Id61e761ad7bdee40364bb8837be8705ed5ca53d1
Add the function to start and stop the job recording.
Add the function to get the job stats for each GPU and summary of multiple GPUs
Add the function to remove the jobs.
Add a class RdcLogger which can control the log level using the environment variable RDC_LOG.
This is similar to GRPC_VERBOSITY gRPC. When the customer has the issues, he can enable the verbose
log to help us to troubleshoot the issues.
Add the -u support in the rdci group, fieldgroup and dmon for connecting to rdcd without authentication.
Change-Id: I22c591823c1ee6485db106b911bed8271d1b2769
Add the support for rdci subsystem group create, delete and query
Add the support for rdci subsystem fieldgroup create, delete and query
Add the support for rdci dmon system. The dmon system may show the stats every
a few seconds until press Ctrl-C. To cleanup the resources (for example, unwatch),
a signal handler is added.
Change-Id: Ib22a8a43b7083c7c72819ca21145e22702d9ad6c
The RDC API is changed to pass the certificates to the gRPC.
Add the support to add all GPUs in the host to a group. Also before
add a GPU to a group, the RDC API will verify that GPU exists or not.
Add the support to fetch the temperature metrics.
Change-Id: I5857ef03fede233d16e8b2836be120f33172da93
Create the skeleton implementation of rdc_client.so and rdci. Modify current rdcd to
integrate the RDC API service:
rdc.proto is changed to add a new RdcAPI service which defined the interfaces for the RDC API.
RdcStandaloneHandler.cpp is added to send the request using gRPC to the rdcd. It is built into
the rdc_client.so
rdci.cc, RdciDisCoverySubSystem.cc and RdciSubSystem.cc are added to implement skeleton rdci.
Currently, the discovery subsystem is supported.
rdc_api_service.cc is added to the server as a skeleton to implement the RdcAPI service. Currently,
only discovery API is implemented. Note: we disabled the rdc_rsmi_service, which will be removed
in the future. The original rdc_client.so is renamed to rdc_client_smi.so which should also be
removed in the future.
Add the instruction how to run the rdcd and rdci in the build folder in the README.md.
Change-Id: Id232f9f83787e5812d4a295dc8cf0daa7728b06c