ROCm™ Data Center Tool (RDC) 🚀
The ROCm™ Data Center Tool (RDC) simplifies the administration of AMD GPUs in cluster and datacenter environments and addresses key infrastructure challenges there. RDC offers a suite of features for GPU management and monitoring.
🌟 Main Features
- GPU Telemetry 📊
- GPU Statistics for Jobs 📈
- Integration with Third-Party Tools 🔗
- Open Source 🛠️
Note
The published documentation is available at ROCm Data Center Tool in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the rdc/docs folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see Contribute to ROCm documentation.
🛠️ Installation Guide
📋 Prerequisites
Before setting up RDC, ensure your system meets the following requirements:
- Supported Platforms: RDC runs on AMD ROCm-supported platforms. Refer to the List of Supported Operating Systems for details.
- Dependencies:
🔐 Certificate Generation
For certificate generation, refer to the RDC Developer Handbook (Generate Files for Authentication) or consult the concise guide located at authentication/readme.txt.
🚀 Running RDC
RDC supports two primary modes of operation: Standalone and Embedded. Choose the mode that best fits your deployment needs.
🗂️ Standalone Mode
Standalone mode allows RDC to run independently with all its components installed.
- Start RDCD with Authentication (Monitor-Only Capabilities):
  /opt/rocm/bin/rdcd
- Start RDCD with Authentication (Full Capabilities):
  sudo /opt/rocm/bin/rdcd
- Start RDCD without Authentication (Monitor-Only):
  /opt/rocm/bin/rdcd -u
- Start RDCD without Authentication (Full Capabilities):
  sudo /opt/rocm/bin/rdcd -u
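The four standalone variants differ only in whether `sudo` is used (full capabilities) and whether `-u` is passed (no authentication). As a minimal sketch of that mapping, the helper below assembles the command line; the function name `rdcd_command` and its parameters are hypothetical and not part of RDC.

```python
from typing import List


def rdcd_command(authenticated: bool, full_capabilities: bool,
                 rdcd_path: str = "/opt/rocm/bin/rdcd") -> List[str]:
    """Build the rdcd command line for one of the four standalone variants.

    Hypothetical helper for illustration; it only mirrors the commands
    listed in this README and does not start the daemon.
    """
    cmd = []
    if full_capabilities:
        cmd.append("sudo")   # full capabilities: run rdcd via sudo
    cmd.append(rdcd_path)
    if not authenticated:
        cmd.append("-u")     # -u: run without authentication
    return cmd


if __name__ == "__main__":
    for authenticated in (True, False):
        for full in (False, True):
            print(" ".join(rdcd_command(authenticated, full)))
```

The returned list could be passed to `subprocess.run` by a wrapper script; whether that is appropriate depends on how the daemon is managed on your system.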
🔗 Embedded Mode
Embedded mode integrates RDC directly into your existing management tools using its library format.
- Run RDC in Embedded Mode:
  python your_management_tool.py --rdc_embedded
Note: Ensure that the rdcd daemon is not running separately when using embedded mode.
🛠️ Starting RDCD Using systemd
- Copy the Service File:
  sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
- Configure Capabilities:
  - Full Capabilities: Ensure the following lines are uncommented in /etc/systemd/system/rdc.service:
    CapabilityBoundingSet=CAP_DAC_OVERRIDE
    AmbientCapabilities=CAP_DAC_OVERRIDE
  - Monitor-Only Capabilities: Comment out the above lines to restrict RDCD to monitoring.
- Start the Service:
  sudo systemctl start rdc
  sudo systemctl status rdc
- Modify RDCD Options:
  Edit /opt/rocm/share/rdc/conf/rdc_options.conf to append any additional RDCD parameters.
  sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
  Example Configuration:
  RDC_OPTS="-p 50051 -u -d"
  - Flags:
    - -p 50051: Use port 50051
    - -u: Unauthenticated mode
    - -d: Enable debug messages
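To make the meaning of the example `RDC_OPTS` line concrete, the sketch below parses it into a small dictionary. It is an illustration only: the function `parse_rdc_opts` is hypothetical, it handles just the three flags documented above (`-p`, `-u`, `-d`), and it assumes the file holds a single shell-style `RDC_OPTS="..."` assignment.

```python
import shlex


def parse_rdc_opts(line: str) -> dict:
    """Parse an RDC_OPTS="..." assignment into port/auth/debug settings.

    Hypothetical helper; only the -p, -u and -d flags are recognised.
    """
    words = shlex.split(line)  # removes the shell quoting around the value
    if len(words) != 1 or not words[0].startswith("RDC_OPTS="):
        raise ValueError("expected a single RDC_OPTS=\"...\" assignment")
    tokens = words[0][len("RDC_OPTS="):].split()

    opts = {"port": None, "unauthenticated": False, "debug": False}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token == "-p":
            if i + 1 >= len(tokens):
                raise ValueError("-p requires a port number")
            opts["port"] = int(tokens[i + 1])
            i += 2
        elif token == "-u":
            opts["unauthenticated"] = True
            i += 1
        elif token == "-d":
            opts["debug"] = True
            i += 1
        else:
            raise ValueError(f"flag not covered by this sketch: {token}")
    return opts


if __name__ == "__main__":
    print(parse_rdc_opts('RDC_OPTS="-p 50051 -u -d"'))
```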
🏗️ Building RDC from Source
If you prefer to build RDC from source, follow the steps below.
🔧 Building gRPC and protoc
Important: RDC requires gRPC and protoc to be built from source as pre-built packages are not available.
- Install Required Tools:
  sudo apt-get update
  sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
- Clone and Build gRPC:
  git clone -b v1.67.1 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
  cd grpc
  export GRPC_ROOT=/opt/grpc
  cmake -B build \
      -DgRPC_INSTALL=ON \
      -DgRPC_BUILD_TESTS=OFF \
      -DBUILD_SHARED_LIBS=ON \
      -DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
      -DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
      -DCMAKE_INSTALL_LIBDIR=lib \
      -DCMAKE_BUILD_TYPE=Release
  make -C build -j $(nproc)
  sudo make -C build install
  echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
  sudo ldconfig
  cd ..
🔧 Building RDC
- Clone the RDC Repository:
  git clone https://github.com/ROCm/rdc
  cd rdc
- Configure the Build:
  cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
  - Optional Features:
    - Enable ROCm Profiler:
      cmake -B build -DBUILD_PROFILER=ON
    - Enable RVS:
      cmake -B build -DBUILD_RVS=ON
    - Build RDC Library Only (without rdci and rdcd):
      cmake -B build -DBUILD_STANDALONE=OFF
    - Build RDC Library Without ROCm Run-time:
      cmake -B build -DBUILD_RUNTIME=OFF
- Build and Install:
  make -C build -j $(nproc)
  sudo make -C build install
- Update System Library Path:
  export RDC_LIB_DIR=/opt/rocm/lib/rdc
  export GRPC_LIB_DIR="/opt/grpc/lib"
  echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
  echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
  sudo ldconfig
📊 Features Overview
🔍 Discovery
Locate and display information about GPUs present in a compute node.
Example:
rdci discovery <host_name> -l
Output:
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information                           |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
+-----------+----------------------------------------------+
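If a script needs to consume this listing, the table can be turned into structured data. The following is a rough sketch that assumes the output keeps the two-column, pipe-delimited layout shown in the sample; `parse_discovery` is a hypothetical name, and the real `rdci` output may differ between versions.

```python
SAMPLE = """\
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information                           |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
+-----------+----------------------------------------------+
"""


def parse_discovery(output: str) -> list:
    """Extract (index, name) records from a discovery table.

    Hypothetical helper; assumes rows look like '| <index> | Name: <name> |'.
    """
    gpus = []
    for line in output.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip the count line and the +---+ borders
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or not cells[0].isdigit():
            continue  # skip the header row
        name = cells[1]
        if name.startswith("Name:"):
            name = name[len("Name:"):].strip()
        gpus.append({"index": int(cells[0]), "name": name})
    return gpus


if __name__ == "__main__":
    for gpu in parse_discovery(SAMPLE):
        print(gpu["index"], gpu["name"])
```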
👥 Groups
🖥️ GPU Groups
Create, delete, and list logical groups of GPUs.
Create a Group:
rdci group -c GPU_GROUP
Add GPUs to Group:
rdci group -g 1 -a 0,1
List Groups:
rdci group -l
Delete a Group:
rdci group -d 1
🗂️ Field Groups
Manage field groups to monitor specific GPU metrics.
Create a Field Group:
rdci fieldgroup -c <fgroup> -f 150,155
List Field Groups:
rdci fieldgroup -l
Delete a Field Group:
rdci fieldgroup -d 1
🛑 Monitor Errors
Define fields to monitor RAS ECC counters.
Correctable ECC Errors:
312 RDC_FI_ECC_CORRECT_TOTAL
Uncorrectable ECC Errors:
313 RDC_FI_ECC_UNCORRECT_TOTAL
📈 Device Monitoring
Monitor GPU fields such as temperature, power usage, and utilization.
Command:
rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
Sample Output:
1 group found
+-----------+-------------+---------------+
| GPU Index | TEMP (m°C)  | POWER (µW)    |
+-----------+-------------+---------------+
| 0         | 25000       | 520500        |
+-----------+-------------+---------------+
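The sample reports temperature in millidegrees Celsius (m°C) and power in microwatts (µW), which are easy to misread. A small conversion sketch, using the values from the sample row; the function names are hypothetical.

```python
def millidegrees_to_celsius(value_mc: int) -> float:
    """Convert millidegrees Celsius (m°C) to degrees Celsius."""
    return value_mc / 1000.0


def microwatts_to_watts(value_uw: int) -> float:
    """Convert microwatts (µW) to watts."""
    return value_uw / 1_000_000.0


if __name__ == "__main__":
    temp_mc, power_uw = 25000, 520500  # values from the sample output row
    print(f"{millidegrees_to_celsius(temp_mc):.1f} °C")
    print(f"{microwatts_to_watts(power_uw):.4f} W")
```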
📊 Job Stats
Display GPU statistics for any given workload.
Start Recording Stats:
rdci stats -s 2 -g 1
Stop Recording Stats:
rdci stats -x 2
Display Job Stats:
rdci stats -j 2
Sample Output:
Summary:
Executive Status:
Start time: 1586795401
End time: 1586795445
Total execution time: 44
Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
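The start and end times in the summary appear to be Unix timestamps in seconds, so the total execution time should equal their difference. The sketch below checks that on the sample, assuming each summary line has the form `Label: value`; `parse_scalar_fields` is a hypothetical helper and ignores the multi-value Max/Min/Avg lines.

```python
SAMPLE = """\
Summary:
Executive Status:
Start time: 1586795401
End time: 1586795445
Total execution time: 44
Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
"""


def parse_scalar_fields(summary: str) -> dict:
    """Collect 'Label: <integer>' lines into a dict (hypothetical helper)."""
    fields = {}
    for line in summary.splitlines():
        label, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():  # skips headers and Max/Min/Avg lines
            fields[label.strip()] = int(value)
    return fields


if __name__ == "__main__":
    fields = parse_scalar_fields(SAMPLE)
    elapsed = fields["End time"] - fields["Start time"]
    print("elapsed seconds:", elapsed)
    print("matches reported total:", elapsed == fields["Total execution time"])
```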
🩺 Diagnostic
Run diagnostics on a GPU group to ensure system health.
Command:
rdci diag -g <gpu_group>
Sample Output:
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
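For automated health checks, the summary portion of this output can be reduced to a single pass/fail decision. This is a minimal sketch, assuming the summary lines come before the `=====` separator and end in a status word; the helper names are hypothetical, and any status other than `Pass` is treated as a failure.

```python
SAMPLE = """\
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
"""


def parse_diag_summary(output: str) -> dict:
    """Map each test name to its status, reading only the summary part."""
    results = {}
    for line in output.splitlines():
        if line.startswith("="):
            break  # the detail section follows; stop at the separator
        name, sep, status = line.rpartition(":")
        if sep:
            results[name.strip()] = status.strip()
    return results


def all_passed(results: dict) -> bool:
    """True only if there is at least one result and every status is 'Pass'."""
    return bool(results) and all(s == "Pass" for s in results.values())


if __name__ == "__main__":
    summary = parse_diag_summary(SAMPLE)
    print(summary)
    print("healthy:", all_passed(summary))
```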
🔌 Integration with Third-Party Tools
RDC integrates with third-party tools such as Prometheus and Grafana, and provides a Reliability, Availability, and Serviceability (RAS) plugin, to enhance monitoring and visualization.
🐍 Python Bindings
RDC provides a generic Python class RdcReader to simplify telemetry gathering.
Sample Program:
from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

# Fields to collect: power usage and GPU utilization.
default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]

class SimpleRdcReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        # Print one line per sample: timestamp, GPU index, field name, value.
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")

if __name__ == '__main__':
    reader = SimpleRdcReader()
    # Poll once per second until interrupted.
    while True:
        time.sleep(1)
        reader.process()
Running the Example:
# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py
📈 Prometheus Plugin
The Prometheus plugin allows you to monitor events and send alerts.
Installation:
- Install Prometheus Client:
  pip install prometheus_client
- Run the Prometheus Plugin:
  python rdc_prometheus.py
- Verify Plugin:
  curl localhost:5000
Integration Steps:
- Download and Install Prometheus:
- Configure Prometheus Targets:
  - Modify prometheus_targets.json to point to your compute nodes.
    [
      {
        "targets": [
          "rdc_test1.amd.com:5000",
          "rdc_test2.amd.com:5000"
        ]
      }
    ]
- Start Prometheus with Configuration File:
  prometheus --config.file=/path/to/rdc_prometheus_example.yml
- Access Prometheus UI:
  - Open http://localhost:9090 in your browser.
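For clusters with many compute nodes, the targets file can be generated rather than edited by hand. The sketch below builds the same JSON structure as the `prometheus_targets.json` example from a list of host names; `build_targets` is a hypothetical helper, and port 5000 is taken from the examples in this README.

```python
import json


def build_targets(hosts, port: int = 5000) -> list:
    """Return the targets structure used in prometheus_targets.json.

    Hypothetical helper; each host becomes a "<host>:<port>" entry.
    """
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


if __name__ == "__main__":
    hosts = ["rdc_test1.amd.com", "rdc_test2.amd.com"]
    # Print the JSON; redirect it to prometheus_targets.json if desired.
    print(json.dumps(build_targets(hosts), indent=2))
```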
📊 Grafana Integration
Grafana provides advanced visualization capabilities for RDC metrics.
Installation:
- Download Grafana:
- Install Grafana:
  - Follow the Installation Instructions.
- Start Grafana Server:
  sudo systemctl start grafana-server
  sudo systemctl status grafana-server
- Access Grafana:
  - Open http://localhost:3000 in your browser and log in with the default credentials (admin/admin).
Configuration Steps:
- Add Prometheus Data Source:
  - Navigate to Configuration → Data Sources → Add data source → Prometheus.
  - Set the URL to http://localhost:9090 and save.
- Import RDC Dashboard:
  - Click the + icon and select Import.
  - Upload rdc_grafana_dashboard_example.json from the python_binding folder.
  - Select the desired compute node for visualization.
🛡️ Reliability, Availability, and Serviceability (RAS) Plugin
The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
Installation:
- Ensure GPU Supports RAS:
  - The GPU must support RAS features.
- RDC Installation Includes RAS Library:
  - librdc_ras.so is located in /opt/rocm-4.2.0/rdc/lib.
Usage:
- Monitor ECC Errors:
  rdci dmon -i 0 -e 600,601
  Sample Output:
  GPU     ECC_CORRECT         ECC_UNCORRECT
  0       0                   0
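A monitoring script might flag any GPU that reports uncorrectable ECC errors. The sketch below assumes the `dmon` output is a whitespace-separated header line followed by one row per GPU, as in the sample; the helper names are hypothetical, and the second row in the demo data is invented purely to show a non-zero count.

```python
def parse_dmon(output: str) -> list:
    """Parse whitespace-separated dmon output into one dict per GPU row."""
    lines = [line.split() for line in output.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, row)) for row in rows]


def gpus_with_uncorrectable_errors(rows: list) -> list:
    """Return the indices of GPUs whose ECC_UNCORRECT counter is above zero."""
    return [int(row["GPU"]) for row in rows if int(row["ECC_UNCORRECT"]) > 0]


if __name__ == "__main__":
    # First data row matches the sample output; the second is made up.
    demo = "GPU ECC_CORRECT ECC_UNCORRECT\n0 0 0\n1 3 2\n"
    rows = parse_dmon(demo)
    print("GPUs needing attention:", gpus_with_uncorrectable_errors(rows))
```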
🐞 Troubleshooting
Known Issues
🛑 dmon Fields Return N/A
Missing Libraries:
- Verify /opt/rocm/lib/rdc/librdc_*.so exists.
- Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
Unsupported GPU:
- Most metrics work on MI300 and newer.
- Limited metrics on MI200.
- Consumer GPUs (e.g., RX6800) have fewer supported metrics.
🐍 dmon RocProfiler Fields Return Zeros
Solution:
Set the HSA_TOOLS_LIB environment variable before running a compute job.
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
Example:
# Terminal 1
rdcd -u
# Terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn
# Terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1
# Output:
GPU     OCCUPANCY_PERCENT      ACTIVE_WAVES
0       001.000                32640.000
⚠️ HSA_STATUS_ERROR_OUT_OF_RESOURCES
Error Message:
terminate called after throwing an instance of 'std::runtime_error'
  what():  hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)
Solution:
Missing Groups:
- Ensure the video and render groups exist and that your user belongs to them:
  sudo usermod -aG video,render $USER
- Log out and log back in to apply group changes.
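Before changing group membership, it can help to confirm which of the required groups the current user is actually missing. The sketch below does that on Linux with the standard `os` and `grp` modules; the function names are hypothetical, and the required group names `video` and `render` are taken from the step above.

```python
import grp
import os


def missing_groups(user_groups, required=("video", "render")) -> list:
    """Return the required group names that are absent from user_groups."""
    present = set(user_groups)
    return [name for name in required if name not in present]


def current_group_names() -> set:
    """Names of the groups the current process belongs to (Unix only)."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            pass  # a gid without a matching group database entry
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    if missing:
        print("Not a member of:", ", ".join(missing))
    else:
        print("User is in both the video and render groups.")
```

Note that group changes made with `usermod` only show up in a new login session, so the check should be repeated after logging back in.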
🐛 Troubleshooting RDCD
View RDCD Logs:
sudo journalctl -u rdc
Run RDCD with Debug Logs:
RDC_LOG=DEBUG /opt/rocm/bin/rdcd
- Logging Levels Supported: ERROR, INFO, DEBUG
Enable Additional Logging Messages:
export RSMI_LOGGING=3
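When RDCD is launched from a script, the same logging variables can be set in the child process environment. This is a minimal sketch, not an RDC API: `rdcd_logging_env` is a hypothetical helper that validates the level against the three documented values and optionally sets `RSMI_LOGGING=3` as shown above.

```python
import os

SUPPORTED_LEVELS = ("ERROR", "INFO", "DEBUG")  # levels listed in this README


def rdcd_logging_env(level: str = "DEBUG", rsmi_logging: bool = False,
                     base_env=None) -> dict:
    """Return a copy of the environment with RDC logging variables set."""
    if level not in SUPPORTED_LEVELS:
        raise ValueError(f"unsupported RDC_LOG level: {level}")
    env = dict(os.environ if base_env is None else base_env)
    env["RDC_LOG"] = level
    if rsmi_logging:
        env["RSMI_LOGGING"] = "3"  # value used in the example above
    return env


if __name__ == "__main__":
    env = rdcd_logging_env("DEBUG", rsmi_logging=True, base_env={})
    print(env)
    # To launch the daemon with this environment, one could use:
    # subprocess.run(["/opt/rocm/bin/rdcd"], env=rdcd_logging_env("DEBUG"))
```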
📄 License
RDC is open-source and available under the MIT License.
📧 Support
For support and further inquiries, please refer to the ROCm Documentation or contact the maintainers through the repository's issue tracker.