# ROCm™ Data Center Tool (RDC) 🚀
The ROCm™ Data Center Tool (RDC) simplifies administration of AMD GPUs in cluster and datacenter environments and addresses key infrastructure challenges there. RDC offers a suite of features for GPU management and monitoring.
## 🌟 Main Features
- **GPU Telemetry** 📊
- **GPU Statistics for Jobs** 📈
- **Integration with Third-Party Tools** 🔗
- **Open Source** 🛠️
> [!NOTE]
> The published documentation is available at [ROCm Data Center Tool](https://rocm.docs.amd.com/projects/rdc/en/latest/index.html) in an organized, easy-to-read format, with search and a table of contents. The documentation source files reside in the `rdc/docs` folder of this repository. As with all ROCm projects, the documentation is open source. For more information on contributing to the documentation, see [Contribute to ROCm documentation](https://rocm.docs.amd.com/en/latest/contribute/contributing.html).
## 🛠️ Installation Guide
### 📋 Prerequisites
Before setting up RDC, ensure your system meets the following requirements:
- **Supported Platforms**: RDC runs on AMD ROCm-supported platforms. Refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) for details.
- **Dependencies**:
  - **CMake** ≥ 3.15
  - **g++** (5.4.0)
  - **Doxygen** (1.8.11)
  - **LaTeX** (pdfTeX 3.14159265-2.6-1.40.16)
  - **gRPC and protoc**
  - **libcap-dev**
  - **AMD ROCm Platform** ([GitHub](https://github.com/ROCm/ROCm))
    - **AMDSMI Library** ([GitHub](https://github.com/ROCm/amdsmi))
    - **ROCK Kernel Driver** ([GitHub](https://github.com/ROCm/ROCK-Kernel-Driver))
### 🔐 Certificate Generation
For certificate generation, refer to the [**RDC Developer Handbook (Generate Files for Authentication)**](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication) or consult the concise guide located at `authentication/readme.txt`.
---
## 🚀 Running RDC
RDC supports two primary modes of operation: **Standalone** and **Embedded**. Choose the mode that best fits your deployment needs.
### 🗂️ Standalone Mode
Standalone mode allows RDC to run independently with all its components installed.
1. **Start RDCD with Authentication (Monitor-Only Capabilities):**
```bash
/opt/rocm/bin/rdcd
```
2. **Start RDCD with Authentication (Full Capabilities):**
```bash
sudo /opt/rocm/bin/rdcd
```
3. **Start RDCD without Authentication (Monitor-Only):**
```bash
/opt/rocm/bin/rdcd -u
```
4. **Start RDCD without Authentication (Full Capabilities):**
```bash
sudo /opt/rocm/bin/rdcd -u
```
### 🔗 Embedded Mode
Embedded mode integrates RDC directly into your existing management tools by loading it as a library.
- **Run RDC in Embedded Mode:**
```bash
python your_management_tool.py --rdc_embedded
```
**Note:** Ensure that the `rdcd` daemon is not running separately when using embedded mode.
### 🛠️ Starting RDCD Using systemd
1. **Copy the Service File:**
```bash
sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
```
2. **Configure Capabilities:**
- **Full Capabilities:** Ensure the following lines are **uncommented** in `/etc/systemd/system/rdc.service`:
```ini
CapabilityBoundingSet=CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_DAC_OVERRIDE
```
- **Monitor-Only Capabilities:** **Comment out** the above lines to restrict RDCD to monitoring.
3. **Start the Service:**
```bash
sudo systemctl start rdc
sudo systemctl status rdc
```
4. **Modify RDCD Options:**
Edit `/opt/rocm/share/rdc/conf/rdc_options.conf` to append any additional RDCD parameters.
```bash
sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
```
**Example Configuration:**
```bash
RDC_OPTS="-p 50051 -u -d"
```
- **Flags:**
- `-p 50051` : Use port 50051
- `-u` : Unauthenticated mode
- `-d` : Enable debug messages
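
To make the flag list above concrete, here is a minimal Python sketch that reads an `RDC_OPTS` line and reports which of the three documented flags are set. The function name `parse_rdc_opts` and the result keys are hypothetical, and only the flags listed above are recognized; this is not how RDCD itself parses the file.

```python
import shlex

# Flags documented above: name to report, and whether the flag takes an argument.
KNOWN_FLAGS = {
    "-p": ("port", True),
    "-u": ("unauthenticated", False),
    "-d": ("debug", False),
}


def parse_rdc_opts(line):
    """Parse a line such as RDC_OPTS="-p 50051 -u -d" into a dict (illustrative helper)."""
    words = shlex.split(line)  # removes the shell quoting around the value
    if len(words) != 1 or not words[0].startswith("RDC_OPTS="):
        raise ValueError("expected a single RDC_OPTS=... assignment")
    tokens = shlex.split(words[0].partition("=")[2])

    options = {}
    i = 0
    while i < len(tokens):
        flag = tokens[i]
        if flag not in KNOWN_FLAGS:
            raise ValueError(f"flag not covered by this sketch: {flag}")
        name, takes_value = KNOWN_FLAGS[flag]
        if takes_value:
            if i + 1 >= len(tokens):
                raise ValueError(f"{flag} needs a value")
            options[name] = tokens[i + 1]
            i += 2
        else:
            options[name] = True
            i += 1
    return options


print(parse_rdc_opts('RDC_OPTS="-p 50051 -u -d"'))
```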
---
## 🏗️ Building RDC from Source
If you prefer to build RDC from source, follow the steps below.
### 🔧 Building gRPC and protoc
**Important:** RDC requires gRPC and protoc to be built from source as pre-built packages are not available.
1. **Install Required Tools:**
```bash
sudo apt-get update
sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl libcap-dev
```
2. **Clone and Build gRPC:**
```bash
git clone -b v1.67.1 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
cd grpc
export GRPC_ROOT=/opt/grpc
cmake -B build \
-DgRPC_INSTALL=ON \
-DgRPC_BUILD_TESTS=OFF \
-DBUILD_SHARED_LIBS=ON \
-DCMAKE_SHARED_LINKER_FLAGS_INIT=-Wl,--enable-new-dtags,--build-id=sha1,--rpath,'$ORIGIN' \
-DCMAKE_INSTALL_PREFIX="$GRPC_ROOT" \
-DCMAKE_INSTALL_LIBDIR=lib \
-DCMAKE_BUILD_TYPE=Release
make -C build -j $(nproc)
sudo make -C build install
echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
sudo ldconfig
cd ..
```
### 🔧 Building RDC
1. **Clone the RDC Repository:**
```bash
git clone https://github.com/ROCm/rocm-systems --recursive
cd rocm-systems/projects/rdc
```
2. **Configure the Build:**
```bash
cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
```
- **Optional Features:**
- **Enable ROCm Profiler:**
```bash
cmake -B build -DBUILD_PROFILER=ON
```
- **Enable RVS:**
```bash
cmake -B build -DBUILD_RVS=ON
```
- **Build RDC Library Only (without rdci and rdcd):**
```bash
cmake -B build -DBUILD_STANDALONE=OFF
```
- **Build RDC Library Without ROCm Run-time:**
```bash
cmake -B build -DBUILD_RUNTIME=OFF
```
3. **Build and Install:**
```bash
make -C build -j $(nproc)
sudo make -C build install
```
4. **Update System Library Path:**
```bash
export RDC_LIB_DIR=/opt/rocm/lib/rdc
export GRPC_LIB_DIR="/opt/grpc/lib"
echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
sudo ldconfig
```
---
## 📊 Features Overview
### 🔍 Discovery
Locate and display information about GPUs present in a compute node.
**Example:**
```bash
rdci discovery <host_name> -l
```
**Output:**
```
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information |
+-----------+----------------------------------------------+
| 0 | Name: AMD Radeon Instinct MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct MI50 Accelerator |
+-----------+----------------------------------------------+
```
### 👥 Groups
#### 🖥️ GPU Groups
Create, delete, and list logical groups of GPUs.
**Create a Group:**
```bash
rdci group -c GPU_GROUP
```
**Add GPUs to Group:**
```bash
rdci group -g 1 -a 0,1
```
**List Groups:**
```bash
rdci group -l
```
**Delete a Group:**
```bash
rdci group -d 1
```
#### 🗂️ Field Groups
Manage field groups to monitor specific GPU metrics.
**Create a Field Group:**
```bash
rdci fieldgroup -c <fgroup> -f 150,155
```
**List Field Groups:**
```bash
rdci fieldgroup -l
```
**Delete a Field Group:**
```bash
rdci fieldgroup -d 1
```
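When scripting group management, it can help to assemble the `rdci` command lines in one place. The following is a minimal Python sketch that builds argument lists using only the `group` and `fieldgroup` flags shown above; the helper names are hypothetical, and the sketch only prints the commands rather than running them (pass a list to `subprocess.run` to execute one).

```python
import shlex


def _csv(values):
    """Join IDs the way the examples above do, for example 0,1 or 150,155."""
    return ",".join(str(v) for v in values)


def group_create(name):
    return ["rdci", "group", "-c", name]


def group_add(group_id, gpu_indexes):
    return ["rdci", "group", "-g", str(group_id), "-a", _csv(gpu_indexes)]


def group_delete(group_id):
    return ["rdci", "group", "-d", str(group_id)]


def fieldgroup_create(name, field_ids):
    return ["rdci", "fieldgroup", "-c", name, "-f", _csv(field_ids)]


def fieldgroup_delete(fieldgroup_id):
    return ["rdci", "fieldgroup", "-d", str(fieldgroup_id)]


# Print the commands for a typical setup; the names are placeholders.
for argv in (
    group_create("GPU_GROUP"),
    group_add(1, [0, 1]),
    fieldgroup_create("my_fields", [150, 155]),
):
    print(shlex.join(argv))
```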
> [!IMPORTANT]
>### 🛑 Monitor Errors
>
>Define fields to monitor RAS ECC counters.
>
>- **Correctable ECC Errors:**
>
> ```bash
> 312 RDC_FI_ECC_CORRECT_TOTAL
> ```
>
>- **Uncorrectable ECC Errors:**
>
> ```bash
> 313 RDC_FI_ECC_UNCORRECT_TOTAL
> ```
### 📈 Device Monitoring
Monitor GPU fields such as temperature, power usage, and utilization.
**Command:**
```bash
rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
```
**Sample Output:**
```
1 group found
+-----------+-------------+---------------+
| GPU Index | TEMP (m°C) | POWER (µW) |
+-----------+-------------+---------------+
| 0 | 25000 | 520500 |
+-----------+-------------+---------------+
```
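The sample output reports temperature in millidegrees Celsius and power in microwatts. As a rough illustration, the Python sketch below parses a table in that layout and converts the values to °C and W. It assumes exactly the three-column format shown above; `parse_dmon_rows` is a hypothetical helper, not part of RDC.

```python
SAMPLE = """\
1 group found
+-----------+-------------+---------------+
| GPU Index | TEMP (m°C)  | POWER (µW)    |
+-----------+-------------+---------------+
| 0         | 25000       | 520500        |
+-----------+-------------+---------------+
"""


def parse_dmon_rows(text):
    """Return one dict per GPU row, with temperature in °C and power in W."""
    rows = []
    for line in text.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows have three cells and start with a numeric GPU index;
        # borders, the header, and the "group found" line are skipped.
        if len(cells) == 3 and cells[0].isdigit():
            rows.append({
                "gpu": int(cells[0]),
                "temp_c": int(cells[1]) / 1000,        # m°C -> °C
                "power_w": int(cells[2]) / 1_000_000,  # µW -> W
            })
    return rows


for row in parse_dmon_rows(SAMPLE):
    print(f"GPU {row['gpu']}: {row['temp_c']:.1f} °C, {row['power_w']:.4f} W")
```

For the sample row, this gives 25.0 °C and 0.5205 W.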
### 📊 Job Stats
Display GPU statistics for any given workload.
**Start Recording Stats:**
```bash
rdci stats -s 2 -g 1
```
**Stop Recording Stats:**
```bash
rdci stats -x 2
```
**Display Job Stats:**
```bash
rdci stats -j 2
```
**Sample Output:**
```
Summary:
Executive Status:
Start time: 1586795401
End time: 1586795445
Total execution time: 44
Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
```
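The start and end times in the sample look like Unix epoch seconds; their difference, 44, matches the reported total execution time. Under that assumption, this short Python sketch converts them to readable UTC timestamps and recomputes the duration. `job_window` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def job_window(start_epoch, end_epoch):
    """Return (start ISO string, end ISO string, duration in seconds), all in UTC."""
    start = datetime.fromtimestamp(start_epoch, tz=timezone.utc)
    end = datetime.fromtimestamp(end_epoch, tz=timezone.utc)
    return start.isoformat(), end.isoformat(), int((end - start).total_seconds())


# Values taken from the sample output above.
start, end, seconds = job_window(1586795401, 1586795445)
print(f"{start} -> {end} ({seconds} s)")
```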
### 🩺 Diagnostic
Run diagnostics on a GPU group to ensure system health.
**Command:**
```bash
rdci diag -g <gpu_group>
```
**Sample Output:**
```
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
```
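For automated health checks, the summary section of this output can be reduced to a pass/fail decision. The Python sketch below assumes the layout of the sample above: `name: status` lines, followed by a separator line that begins with `=`. The helper names are hypothetical.

```python
SAMPLE = """\
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
"""


def parse_diag_summary(text):
    """Map each check name to its status, reading only the summary section."""
    results = {}
    for line in text.splitlines():
        if line.startswith("="):
            break  # the details section starts here
        name, sep, status = line.partition(":")
        if sep:
            results[name.strip()] = status.strip()
    return results


def failed_checks(results):
    """Return the names of checks whose status is anything other than Pass."""
    return [name for name, status in results.items() if status != "Pass"]


summary = parse_diag_summary(SAMPLE)
print("all checks passed" if not failed_checks(summary) else failed_checks(summary))
```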
---
## 🔌 Integration with Third-Party Tools
RDC integrates with third-party tools such as **Prometheus** and **Grafana** for monitoring and visualization, and ships a **Reliability, Availability, and Serviceability (RAS)** plugin for tracking ECC errors.
### 🐍 Python Bindings
RDC provides a generic Python class `RdcReader` to simplify telemetry gathering.
**Sample Program:**
```python
from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

# Fields to collect: power usage and GPU utilization
default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]


class SimpleRdcReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        # Print the timestamp, GPU index, field name, and integer value
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")


if __name__ == '__main__':
    reader = SimpleRdcReader()
    # Process collected fields once per second
    while True:
        time.sleep(1)
        reader.process()
```
**Running the Example:**
```bash
# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py
```
### 📈 Prometheus Plugin
The Prometheus plugin allows you to monitor events and send alerts.
**Installation:**
1. **Install Prometheus Client:**
```bash
pip install prometheus_client
```
2. **Run the Prometheus Plugin:**
```bash
python rdc_prometheus.py
```
3. **Verify Plugin:**
```bash
curl localhost:5000
```
**Integration Steps:**
1. **Download and Install Prometheus:**
- [Prometheus GitHub](https://github.com/prometheus/prometheus)
2. **Configure Prometheus Targets:**
- Modify `prometheus_targets.json` to point to your compute nodes.
```json
[
{
"targets": [
"rdc_test1.amd.com:5000",
"rdc_test2.amd.com:5000"
]
}
]
```
3. **Start Prometheus with Configuration File:**
```bash
prometheus --config.file=/path/to/rdc_prometheus_example.yml
```
4. **Access Prometheus UI:**
- Open [http://localhost:9090](http://localhost:9090) in your browser.
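
For clusters with many compute nodes, the targets file from step 2 can be generated rather than edited by hand. This is a minimal Python sketch that produces JSON in the same shape as the example above; the helper names are hypothetical, and port 5000 is taken from the `curl localhost:5000` check earlier in this section.

```python
import json


def build_targets(hosts, port=5000):
    """Return the structure used in prometheus_targets.json for the given hosts."""
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


def render_targets(hosts, port=5000):
    """Serialize the targets as indented JSON text."""
    return json.dumps(build_targets(hosts, port), indent=2)


# Host names copied from the example above; replace them with your compute nodes.
print(render_targets(["rdc_test1.amd.com", "rdc_test2.amd.com"]))
```

Redirect the printed text into `prometheus_targets.json`, or write it from Python with `open(...).write(...)`.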
### 📊 Grafana Integration
Grafana provides advanced visualization capabilities for RDC metrics.
**Installation:**
1. **Download Grafana:**
- [Grafana Download](https://grafana.com/grafana/download)
2. **Install Grafana:**
- Follow the [Installation Instructions](https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/).
3. **Start Grafana Server:**
```bash
sudo systemctl start grafana-server
sudo systemctl status grafana-server
```
4. **Access Grafana:**
- Open [http://localhost:3000](http://localhost:3000/) in your browser and log in with the default credentials (`admin`/`admin`).
**Configuration Steps:**
1. **Add Prometheus Data Source:**
- Navigate to **Configuration → Data Sources → Add data source → Prometheus**.
- Set the URL to [http://localhost:9090](http://localhost:9090) and save.
2. **Import RDC Dashboard:**
- Click the **+** icon and select **Import**.
- Upload `rdc_grafana_dashboard_example.json` from the `python_binding` folder.
- Select the desired compute node for visualization.
### 🛡️ Reliability, Availability, and Serviceability (RAS) Plugin
The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
**Installation:**
1. **Ensure GPU Supports RAS:**
- The GPU must support RAS features.
2. **RDC Installation Includes RAS Library:**
- `librdc_ras.so` is located in `/opt/rocm-4.2.0/rdc/lib`.
**Usage:**
- **Monitor ECC Errors:**
```bash
rdci dmon -i 0 -e 600,601
```
**Sample Output:**
```
GPU ECC_CORRECT ECC_UNCORRECT
0 0 0
```
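A monitoring script might flag GPUs that report uncorrectable errors. The Python sketch below parses output in the whitespace-separated layout of the sample above and lists the affected GPUs. The column names come from that sample; the helper names and the second data row are made up for illustration.

```python
def parse_ecc(text):
    """Parse a header line plus integer rows into a list of dicts keyed by column name."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header, rows = lines[0], lines[1:]
    return [dict(zip(header, (int(v) for v in row))) for row in rows]


def gpus_with_uncorrectable(rows):
    """Return the GPU indexes whose uncorrectable ECC count is above zero."""
    return [row["GPU"] for row in rows if row["ECC_UNCORRECT"] > 0]


# The first data row is from the sample output; the second is invented for the demo.
output = """\
GPU ECC_CORRECT ECC_UNCORRECT
0 0 0
1 4 2
"""
print(gpus_with_uncorrectable(parse_ecc(output)))
```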
---
> [!IMPORTANT]
>## 🐞 Troubleshooting
>
> ### Known Issues
>#### 🛑 dmon Fields Return N/A
>
>1. **Missing Libraries:**
> - Verify `/opt/rocm/lib/rdc/librdc_*.so` exists.
> - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
>
>2. **Unsupported GPU:**
> - Most metrics work on MI300 and newer.
> - Limited metrics on MI200.
> - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
>
>#### 🐍 dmon RocProfiler Fields Return Zeros
>
>**Error Message:**
>
>```
>terminate called after throwing an instance of 'std::runtime_error'
> what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
>Aborted (core dumped)
>```
>
>**Solution:**
>
>1. **Missing Groups:**
> - Ensure the `video` and `render` groups exist and that your user belongs to both.
>
> ```bash
> sudo usermod -aG video,render $USER
> ```
>
> - Log out and log back in to apply group changes.
>
>### 🐛 Troubleshooting RDCD
>
>- **View RDCD Logs:**
>
> ```bash
> sudo journalctl -u rdc
> ```
>
>- **Run RDCD with Debug Logs:**
>
> ```bash
> RDC_LOG=DEBUG /opt/rocm/bin/rdcd
> ```
>
> - **Logging Levels Supported:** ERROR, INFO, DEBUG
>
>- **Enable Additional Logging Messages:**
>
> ```bash
> export RSMI_LOGGING=3
> ```
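
The "Missing Libraries" check from the troubleshooting steps above can be scripted. The Python sketch below lists the files matching `librdc_*.so` in the library directory named there; `find_rdc_libs` is a hypothetical helper, and an empty result suggests that the RDC libraries are missing or installed elsewhere.

```python
from pathlib import Path


def find_rdc_libs(lib_dir="/opt/rocm/lib/rdc"):
    """Return the sorted names of librdc_*.so files in lib_dir, or [] if it is absent."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        return []
    return sorted(path.name for path in directory.glob("librdc_*.so"))


libs = find_rdc_libs()
print(libs if libs else "no librdc_*.so files found")
```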
---
## 📄 License
RDC is open-source and available under the [MIT License](https://opensource.org/licenses/MIT).
---
## 📧 Support
For support and further inquiries, please refer to the [**ROCm Documentation**](https://rocm.docs.amd.com/projects/rdc/en/latest/) or contact the maintainers through the repository's issue tracker.