From a70aa81cfdf4c8a0ef81696145fba5f9441ca2de Mon Sep 17 00:00:00 2001
From: "Pryor, Adam"
Date: Thu, 30 Jan 2025 12:08:11 -0600
Subject: [PATCH] Dgalants/add auth script location (#108)

* DOCS: Add authentication scripts location

Change-Id: Ie285d80ea6d9bb8f710998208d0aa7c6db661d02
Signed-off-by: Galantsev, Dmitrii

* Make README.md pretty (#44)

Change-Id: I7c3341deaf3621ebbc9e495b023b1dd4971a5f1d

---------

Signed-off-by: Galantsev, Dmitrii
Co-authored-by: Galantsev, Dmitrii
Co-authored-by: Williams, Justin
---
 README.md | 715 +++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 567 insertions(+), 148 deletions(-)

diff --git a/README.md b/README.md
index 2170d5c8ce..cc460b7daf 100644
--- a/README.md
+++ b/README.md
@@ -1,68 +1,151 @@
-# ROCmTM Data Center Tool (RDC)
+# ROCm™ Data Center Tool (RDC) 🚀

-The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments. The main features are:
+The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges in AMD GPUs within cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.

-- GPU telemetry
-- GPU statistics for jobs
-- Integration with third-party tools
-- Open source
+## 🌟 Main Features

-For up-to-date document and how to start using RDC from pre-built packages, please refer to the [**ROCm DataCenter Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/)
+- **GPU Telemetry** 📊
+- **GPU Statistics for Jobs** 📈
+- **Integration with Third-Party Tools** 🔗
+- **Open Source** 🛠️

-## Certificate generation
+For comprehensive documentation and to get started with RDC using pre-built packages, refer to the [**ROCm Data Center Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
-For certificate generation, please refer to
-[**RDC Developer Handbook**#generate-files-for-authentication](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication)
-Or read the concise guide under authentication/readme.txt
+---

-## Supported platforms
+## 🛠️ Installation Guide

-RDC can run on AMD ROCm supported platforms, please refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems)
+### 📋 Prerequisites

-## Important notes
+Before setting up RDC, ensure your system meets the following requirements:

-### RocProfiler metrics usage
+- **Supported Platforms**: RDC runs on AMD ROCm-supported platforms. Refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) for details.
+- **Dependencies**:
+  - **CMake** ≥ 3.15
+  - **g++** (5.4.0)
+  - **Doxygen** (1.8.11)
+  - **LaTeX** (pdfTeX 3.14159265-2.6-1.40.16)
+  - **gRPC and protoc**
+  - **libcap-dev**
+  - **AMD ROCm Platform** ([GitHub](https://github.com/ROCm/ROCm))
+  - **AMDSMI Library** ([GitHub](https://github.com/ROCm/amdsmi))
+  - **ROCK Kernel Driver** ([GitHub](https://github.com/ROCm/ROCK-Kernel-Driver))

-When using rocprofiler fields (800-899) you must call
-`export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1`
-before starting a compute load.
+### 🔐 Certificate Generation

-[***See: dmon-rocprofiler-fields-return-zeros***](#dmon-rocprofiler-fields-return-zeros)
+For certificate generation, refer to the [**RDC Developer Handbook (Generate Files for Authentication)**](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication) or consult the concise guide located at `authentication/readme.txt`.
-## Building RDC from source
+---

-### Dependencies
+## 🚀 Running RDC

-    CMake 3.15 ## 3.15 or greater is required for gRPC
-    g++ (5.4.0)
-    Doxygen (1.8.11) ## required to build the latest documentation
-    Latex (pdfTeX 3.14159265-2.6-1.40.16) ## required to build the latest documentation
-    gRPC and protoc ## required for communication
-    libcap-dev ## required to manage the privileges.
+RDC supports two primary modes of operation: **Standalone** and **Embedded**. Choose the mode that best fits your deployment needs.

-    AMD ROCm platform (https://github.com/ROCm/ROCm)
-    * It is recommended to install the complete AMD ROCm platform.
-    For installation instruction see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
-    * At the minimum, these two components are required
-    (i) AMDSMI Library (https://github.com/ROCm/amdsmi)
-    (ii) AMD ROCk Kernel driver (https://github.com/ROCm/ROCK-Kernel-Driver)
+### 🗂️ Standalone Mode

-## Building gRPC and protoc
+Standalone mode allows RDC to run independently with all its components installed.

-**NOTE:** gRPC and protoc compiler must be built when building RDC from source as pre-built packages are not available. When installing RDC from a package, gRPC and protoc will be installed from the package.
+1. **Start RDCD with Authentication (Monitor-Only Capabilities):**

-**IMPORTANT:** Building gRPC and protocol buffers requires CMake 3.15 or greater. With an older version build will quietly succeed with a *message*. However, all components of gRPC will not be installed and RDC will ***fail*** to run
+   ```bash
+   /opt/rocm/bin/rdcd
+   ```

-The following tools are required for gRPC build & installation
+2. **Start RDCD with Authentication (Full Capabilities):**

-    automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang-5.0 libc++-dev curl
+   ```bash
+   sudo /opt/rocm/bin/rdcd
+   ```

-### Download and build gRPC
+3. **Start RDCD without Authentication (Monitor-Only):**

-By default (without using CMAKE_INSTALL_PREFIX option), gRPC will install to `/usr/local` lib, include and bin directories.
-It is highly recommended to install gRPC into a unique directory.
-Below example installs gRPC into `/opt/grpc`
+   ```bash
+   /opt/rocm/bin/rdcd -u
+   ```

+4. **Start RDCD without Authentication (Full Capabilities):**
+
+   ```bash
+   sudo /opt/rocm/bin/rdcd -u
+   ```
+
+### 🔗 Embedded Mode
+
+Embedded mode integrates RDC directly into your existing management tools using its library format.
+
+- **Run RDC in Embedded Mode:**
+
+  ```bash
+  python your_management_tool.py --rdc_embedded
+  ```
+
+**Note:** Ensure that the `rdcd` daemon is not running separately when using embedded mode.
+
+### 🛠️ Starting RDCD Using systemd
+
+1. **Copy the Service File:**
+
+   ```bash
+   sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
+   ```
+
+2. **Configure Capabilities:**
+
+   - **Full Capabilities:** Ensure the following lines are **uncommented** in `/etc/systemd/system/rdc.service`:
+
+     ```ini
+     CapabilityBoundingSet=CAP_DAC_OVERRIDE
+     AmbientCapabilities=CAP_DAC_OVERRIDE
+     ```
+
+   - **Monitor-Only Capabilities:** **Comment out** the above lines to restrict RDCD to monitoring.
+
+3. **Start the Service:**
+
+   ```bash
+   sudo systemctl start rdc
+   sudo systemctl status rdc
+   ```
+
+4. **Modify RDCD Options:**
+
+   Edit `/opt/rocm/share/rdc/conf/rdc_options.conf` to append any additional RDCD parameters.
+
+   ```bash
+   sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
+   ```
+
+   **Example Configuration:**
+
+   ```bash
+   RDC_OPTS="-p 50051 -u -d"
+   ```
+
+   - **Flags:**
+     - `-p 50051` : Use port 50051
+     - `-u` : Unauthenticated mode
+     - `-d` : Enable debug messages
+
+---
+
+## 🏗️ Building RDC from Source
+
+If you prefer to build RDC from source, follow the steps below.
+
+### 🔧 Building gRPC and protoc
+
+**Important:** RDC requires gRPC and protoc to be built from source as pre-built packages are not available.
+
+1. **Install Required Tools:**
+
+   ```bash
+   sudo apt-get update
+   sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
+   ```
+
+2. **Clone and Build gRPC:**
+
+   ```bash
     git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
     cd grpc
     export GRPC_ROOT=/opt/grpc
@@ -77,166 +160,502 @@ Below example installs gRPC into `/opt/grpc`
     make -C build -j $(nproc)
     sudo make -C build install
     echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
+   sudo ldconfig
+   cd ..
+   ```

-## Building RDC
+### 🔧 Building RDC

-Clone the RDC source code from GitHub and use CMake to build and install
+1. **Clone the RDC Repository:**

+   ```bash
     git clone https://github.com/ROCm/rdc
     cd rdc
-    # default installation location is /opt/rocm, specify with -DROCM_DIR or -DCMAKE_INSTALL_PREFIX
+   ```
+
+2. **Configure the Build:**
+
+   ```bash
     cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
-    # enable rocprofiler (optional)
-    cmake -B build -DBUILD_PROFILER=ON
-    # enable RVS (optional)
-    cmake -B build -DBUILD_RVS=ON
-    # build and install
+   ```
+
+   - **Optional Features:**
+     - **Enable ROCm Profiler:**
+
+       ```bash
+       cmake -B build -DBUILD_PROFILER=ON
+       ```
+
+     - **Enable RVS:**
+
+       ```bash
+       cmake -B build -DBUILD_RVS=ON
+       ```
+
+     - **Build RDC Library Only (without rdci and rdcd):**
+
+       ```bash
+       cmake -B build -DBUILD_STANDALONE=OFF
+       ```
+
+     - **Build RDC Library Without ROCm Run-time:**
+
+       ```bash
+       cmake -B build -DBUILD_RUNTIME=OFF
+       ```
+
+3. **Build and Install:**
+
+   ```bash
     make -C build -j $(nproc)
-    make -C build install
+   sudo make -C build install
+   ```

-## Building RDC library only without gRPC (optional)
+4. **Update System Library Path:**

-If only the RDC libraries are needed (i.e. only "embedded mode" is required), the user can choose to not build rdci and rdcd. This will eliminate the need for gRPC and protoc. To build in this way, -DBUILD_STANDALONE=off should be passed on the the cmake command line:
+   ```bash
+   export RDC_LIB_DIR=/opt/rocm/lib/rdc
+   export GRPC_LIB_DIR="/opt/grpc/lib"
+   echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
+   echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
+   sudo ldconfig
+   ```

-    cmake -B build -DBUILD_STANDALONE=off
+---

-## Building RDC library without ROCM Run time (optional)
+## 📊 Features Overview

-The user can choose to not build RDC diagnostic ROCM Run time. This will eliminate the need for ROCM Run time. To build in this way, -DBUILD_RUNTIME=off should be passed on the the cmake command line:
+### 🔍 Discovery

-    cmake -B build -DBUILD_RUNTIME=off
+Locate and display information about GPUs present in a compute node.

-## Update System Library Path
+**Example:**

-    RDC_LIB_DIR=/opt/rocm/lib/rdc
-    GRPC_LIB_DIR="${RDC_LIB_DIR}/grpc/lib\n/opt/grpc/lib"
-    echo -e "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
-    echo -e "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
-    ldconfig
+```bash
+rdci discovery -l
+```

-## Running RDC
+**Output:**

-RDC supports encrypted communications between clients and servers. The
-communication can be configured to be *authenticated* or *not authenticated*. The [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/) has information on how to generate and install SSL keys and certificates for authentication. By default, authentication is enabled.
+```
+2 GPUs found

-## Starting ROCm™ Data Center Daemon (RDCD)
++-----------+----------------------------------------------+
+| GPU Index | Device Information                           |
++-----------+----------------------------------------------+
+| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
+| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
++-----------+----------------------------------------------+
+```

-For an RDC client application to monitor and/or control a remote system, the RDC server daemon, *rdcd*, must be running on the remote system. *rdcd* can be configured to run with (a) full-capabilities which includes ability to set or change GPU configuration or (b) monitor-only capabilities which limits to monitoring GPU metrics.
+## 👥 Groups

-### Start RDCD from command-line
+#### 🖥️ GPU Groups

-When *rdcd* is started from a command-line the *capabilities* are determined by privilege of the *user* starting *rdcd*
+Create, delete, and list logical groups of GPUs.

-    ## NOTE: Replace /opt/rocm with specific rocm version if needed
+**Create a Group:**

-    ## To run with authentication. Ensure SSL keys are setup properly
-    /opt/rocm/bin/rdcd ## rdcd is started with monitor-only capabilities
-    sudo /opt/rocm/bin/rdcd ## rdcd is started will full-capabilities
+```bash
+rdci group -c GPU_GROUP
+```

-    ## To run without authentication. SSL key & certificates are not required.
-    /opt/rocm/bin/rdcd -u ## rdcd is started with monitor-only capabilities
-    sudo /opt/rocm/bin/rdcd -u ## rdcd is started will full-capabilities
+**Add GPUs to Group:**

-### Start RDCD using systemd
+```bash
+rdci group -g 1 -a 0,1
+```

-*rdcd* can be started by using the systemctl command. You can copy `/opt/rocm/libexec/rdc/rdc.service`, which is installed with RDC, to the systemd folder. This file has 2 lines that control what *capabilities* with which *rdcd* will run. If left uncommented, rdcd will run with full-capabilities.
+**List Groups:**

-    ## file: /opt/rocm/libexec/rdc/rdc.service
-    ## Comment the following two lines to run with monitor-only capabilities
-    CapabilityBoundingSet=CAP_DAC_OVERRIDE
-    AmbientCapabilities=CAP_DAC_OVERRIDE
+```bash
+rdci group -l
+```

-    systemctl start rdc ## start rdc as systemd service
+**Delete a Group:**

-Additional options can be passed to *rdcd* by modifying `/opt/rocm/share/rdc/conf/rdc_options.conf`
+```bash
+rdci group -d 1
+```

-    ## file: /opt/rocm/share/rdc/conf/rdc_options.conf
-    # Append 'rdc' daemon parameters here
-    RDC_OPTS="-p 50051 -u -d"
+#### 🗂️ Field Groups

-Example above does the following:
+Manage field groups to monitor specific GPU metrics.

-- Use port 50051
-- Use unauthenticated mode
-- Enable debug messages
-- **NOTE:** You must add `-u` flag to `rdci` calls as well
+**Create a Field Group:**

-## Invoke RDC using ROCm™ Data Center Interface (RDCI)
+```bash
+rdci fieldgroup -c -f 150,155
+```

-RDCI provides command-line interface to all RDC features. This CLI can be run locally or remotely. Refer to [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/how-to/features.html) for the current list of features.
+**List Field Groups:**

-    ## sample rdci commands to test RDC functionality
-    ## discover devices in a local or remote compute node
-    ## NOTE: option -u (for unauthenticated) is required if rdcd was started in this mode
-    ## Assuming that rdc is installed into /opt/rocm
+```bash
+rdci fieldgroup -l
+```

-    cd /opt/rocm/bin
-    ./rdci discovery -l <-u> ## list available GPUs in localhost
-    ./rdci discovery -l <-u> ## list available GPUs in host machine
-    ./rdci dmon <-u> -l ## list most GPU counters
-    # assuming rdcd is running locally, using -u instead of
-    ./rdci dmon -u --list-all ## list all GPU counters
-    ./rdci dmon -u -i 0 -c 1 -e 100 ## monitor field 100 on gpu 0 for count of 1
-    ./rdci dmon -u -i 0 -c 1 -e 1,2 ## monitor fields 1,2 on gpu 0 for count of 1
+**Delete a Field Group:**

-## Known issues
+```bash
+rdci fieldgroup -d 1
+```
+> [!IMPORTANT]
+>### 🛑 Monitor Errors
+>
+>Define fields to monitor RAS ECC counters.
+>
+>- **Correctable ECC Errors:**
+>
+>  ```bash
+>  312 RDC_FI_ECC_CORRECT_TOTAL
+>  ```
+>
+>- **Uncorrectable ECC Errors:**
+>
+>  ```bash
+>  313 RDC_FI_ECC_UNCORRECT_TOTAL
+>  ```

-### dmon fields return N/A
+### 📈 Device Monitoring

-1. An optional library might be missing. Do you have
-   `/opt/rocm/lib/rdc/librdc_*.so`? Do you have the library it's related to?
-   (rocprofiler, rocruntime...)
-2. The GPU you're using might not be supported. As a rule of thumb - most
-   metrics should work on MI300 and up. Less metrics are supported for MI200.
-   NV21 (aka RX6800) and other consumer GPUs have even less metrics.
+Monitor GPU fields such as temperature, power usage, and utilization.

-### dmon rocprofiler fields return zeros
+**Command:**

-Due to a rocprofiler limitation - you must set `HSA_TOOLS_LIB` environmental
-variable *before* running a compute job.
+```bash
+rdci dmon -f -g -c 5 -d 1000
+```

-If `HSA_TOOLS_LIB` is not set - most rocprofiler metrics will return all zeros.
+**Sample Output:**

-E.g. Correct output on MI300 using [gpu-burn](https://github.com/ROCm/HIP-Examples/tree/master/gpu-burn)
+```
+1 group found

-    # terminal 1
-    rdcd -u
-    # terminal 2
-    export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
-    gpu-burn
-    # terminal 3
-    rdci dmon -u -e 800,801 -i 0 -c 1
-    # output:
-    # GPU OCCUPANCY_PERCENT ACTIVE_WAVES
-    # 0 001.000 32640.000
++-----------+-------------+---------------+
+| GPU Index | TEMP (m°C)  | POWER (µW)    |
++-----------+-------------+---------------+
+| 0         | 25000       | 520500        |
++-----------+-------------+---------------+
+```

-### `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
+### 📊 Job Stats

-error:
+Display GPU statistics for any given workload.

-    terminate called after throwing an instance of 'std::runtime_error'
-    what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
-    Aborted (core dumped)
+**Start Recording Stats:**

-1. Missing groups. Run `groups`. You're expected to have `video` and `render`.
-   This can be fixed with `sudo usermod -aG video,render $USER` followed by
-   logging out and logging back in.
+```bash
+rdci stats -s 2 -g 1
+```

-## Troubleshooting rdcd
+**Stop Recording Stats:**

-- Log messages that can provide useful debug information.
+```bash
+rdci stats -x 2
+```

-If rdcd was started as a systemd service, then use journalctl to view rdcd logs
+**Display Job Stats:**

-    sudo journalctl -u rdc
+```bash
+rdci stats -j 2
+```

-To run rdcd with debug log from command-line use
-version will be the version number(ex:3.10.0) of ROCm where RDC was packaged with
+**Sample Output:**

-    RDC_LOG=DEBUG /opt/rocm/bin/rdcd
+```
+Summary:
+Executive Status:

-RDC_LOG=DEBUG also works on rdci
+Start time: 1586795401
+End time: 1586795445
+Total execution time: 44

-ERROR, INFO, DEBUG logging levels are supported
+Energy Consumed (Joules): 21682
+Power Usage (Watts): Max: 49 Min: 13 Avg: 34
+GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
+GPU Utilization (%): Max: 69 Min: 0 Avg: 2
+Max GPU Memory Used (bytes): 524320768
+Memory Utilization (%): Max: 12 Min: 11 Avg: 12
+```

-Additional logging messages can be enabled with `RSMI_LOGGING=3`
+### 🩺 Diagnostic
+
+Run diagnostics on a GPU group to ensure system health.
+
+**Command:**
+
+```bash
+rdci diag -g
+```
+
+**Sample Output:**
+
+```
+No compute process: Pass
+Node topology check: Pass
+GPU parameters check: Pass
+Compute Queue ready: Pass
+System memory check: Pass
+=============== Diagnostic Details ==================
+No compute process: No processes running on any devices.
+Node topology check: No link detected.
+GPU parameters check: GPU 0 Critical Edge temperature in range.
+Compute Queue ready: Run binary search task on GPU 0 Pass.
+System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
+```
+
+---
+
+## 🔌 Integration with Third-Party Tools
+
+RDC integrates seamlessly with tools like **Prometheus**, **Grafana**, and **Reliability, Availability, and Serviceability (RAS)** to enhance monitoring and visualization.
+
+### 🐍 Python Bindings
+
+RDC provides a generic Python class `RdcReader` to simplify telemetry gathering.
+
+**Sample Program:**
+
+```python
+from RdcReader import RdcReader
+from RdcUtil import RdcUtil
+from rdc_bootstrap import *
+import time
+
+default_field_ids = [
+    rdc_field_t.RDC_FI_POWER_USAGE,
+    rdc_field_t.RDC_FI_GPU_UTIL
+]
+
+class SimpleRdcReader(RdcReader):
+    def __init__(self):
+        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)
+
+    def handle_field(self, gpu_index, value):
+        field_name = self.rdc_util.field_id_string(value.field_id).lower()
+        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")
+
+if __name__ == '__main__':
+    reader = SimpleRdcReader()
+    while True:
+        time.sleep(1)
+        reader.process()
+```
+
+**Running the Example:**
+
+```bash
+# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
+python SimpleReader.py
+```
+
+### 📈 Prometheus Plugin
+
+The Prometheus plugin allows you to monitor events and send alerts.
+
+**Installation:**
+
+1. **Install Prometheus Client:**
+
+   ```bash
+   pip install prometheus_client
+   ```
+
+2. **Run the Prometheus Plugin:**
+
+   ```bash
+   python rdc_prometheus.py
+   ```
+
+3. **Verify Plugin:**
+
+   ```bash
+   curl localhost:5000
+   ```
+
+**Integration Steps:**
+
+1. **Download and Install Prometheus:**
+   - [Prometheus GitHub](https://github.com/prometheus/prometheus)
+
+2. **Configure Prometheus Targets:**
+   - Modify `prometheus_targets.json` to point to your compute nodes.
+
+   ```json
+   [
+     {
+       "targets": [
+         "rdc_test1.amd.com:5000",
+         "rdc_test2.amd.com:5000"
+       ]
+     }
+   ]
+   ```
+
+3. **Start Prometheus with Configuration File:**
+
+   ```bash
+   prometheus --config.file=/path/to/rdc_prometheus_example.yml
+   ```
+
+4. **Access Prometheus UI:**
+   - Open [http://localhost:9090](http://localhost:9090) in your browser.
+
+### 📊 Grafana Integration
+
+Grafana provides advanced visualization capabilities for RDC metrics.
+
+**Installation:**
+
+1. **Download Grafana:**
+   - [Grafana Download](https://grafana.com/grafana/download)
+
+2. **Install Grafana:**
+   - Follow the [Installation Instructions](https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/).
+
+3. **Start Grafana Server:**
+
+   ```bash
+   sudo systemctl start grafana-server
+   sudo systemctl status grafana-server
+   ```
+
+4. **Access Grafana:**
+   - Open [http://localhost:3000](http://localhost:3000/) in your browser and log in with the default credentials (`admin`/`admin`).
+
+**Configuration Steps:**
+
+1. **Add Prometheus Data Source:**
+   - Navigate to **Configuration → Data Sources → Add data source → Prometheus**.
+   - Set the URL to [http://localhost:9090](http://localhost:9090) and save.
+
+2. **Import RDC Dashboard:**
+   - Click the **+** icon and select **Import**.
+   - Upload `rdc_grafana_dashboard_example.json` from the `python_binding` folder.
+   - Select the desired compute node for visualization.
+
+### 🛡️ Reliability, Availability, and Serviceability (RAS) Plugin
+
+The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
+
+**Installation:**
+
+1. **Ensure GPU Supports RAS:**
+   - The GPU must support RAS features.
+
+2. **RDC Installation Includes RAS Library:**
+   - `librdc_ras.so` is located in `/opt/rocm-4.2.0/rdc/lib`.
+
+**Usage:**
+
+- **Monitor ECC Errors:**
+
+  ```bash
+  rdci dmon -i 0 -e 600,601
+  ```
+
+  **Sample Output:**
+
+  ```
+  GPU ECC_CORRECT ECC_UNCORRECT
+  0 0 0
+  ```
+
+---
+> [!IMPORTANT]
+>## 🐞 Troubleshooting
+>
+> ### Known Issues
+>#### 🛑 dmon Fields Return N/A
+>
+>1. **Missing Libraries:**
+>   - Verify `/opt/rocm/lib/rdc/librdc_*.so` exists.
+>   - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
+>
+>2. **Unsupported GPU:**
+>   - Most metrics work on MI300 and newer.
+>   - Limited metrics on MI200.
+>   - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
+>
+>#### 🐍 dmon RocProfiler Fields Return Zeros
+>
+>**Solution:**
+>
+>Set the `HSA_TOOLS_LIB` environment variable **before** running a compute job.
+>
+>```bash
+>export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
+>```
+>
+>**Example:**
+>
+>```bash
+># Terminal 1
+>rdcd -u
+>
+># Terminal 2
+>export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
+>gpu-burn
+>
+># Terminal 3
+>rdci dmon -u -e 800,801 -i 0 -c 1
+>
+># Output:
+>GPU OCCUPANCY_PERCENT ACTIVE_WAVES
+>0 001.000 32640.000
+>```
+>
+>#### ⚠️ `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
+>
+>**Error Message:**
+>
+>```
+>terminate called after throwing an instance of 'std::runtime_error'
+>  what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
+>Aborted (core dumped)
+>```
+>
+>**Solution:**
+>
+>1. **Missing Groups:**
+>   - Ensure `video` and `render` groups exist.
+>
+>   ```bash
+>   sudo usermod -aG video,render $USER
+>   ```
+>
+>   - Log out and log back in to apply group changes.
+>
+>### 🐛 Troubleshooting RDCD
+>
+>- **View RDCD Logs:**
+>
+>  ```bash
+>  sudo journalctl -u rdc
+>  ```
+>
+>- **Run RDCD with Debug Logs:**
+>
+>  ```bash
+>  RDC_LOG=DEBUG /opt/rocm/bin/rdcd
+>  ```
+>
+>  - **Logging Levels Supported:** ERROR, INFO, DEBUG
+>
+>- **Enable Additional Logging Messages:**
+>
+>  ```bash
+>  export RSMI_LOGGING=3
+>  ```
+
+---
+
+## 📄 License
+
+RDC is open-source and available under the [MIT License](https://opensource.org/licenses/MIT).
+
+---
+
+## 📧 Support
+
+For support and further inquiries, please refer to the [**ROCm Documentation**](https://rocm.docs.amd.com/projects/rdc/en/latest/) or contact the maintainers through the repository's issue tracker.
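
The `prometheus_targets.json` layout shown in the Prometheus integration steps of this patch (a list holding one object with a `targets` array of `host:port` strings) can be generated instead of hand-edited when the node list changes. The following is a minimal sketch and not part of RDC: `build_targets` and `write_targets` are hypothetical helper names, and the default port 5000 is assumed only because the README example uses it.

```python
import json


def build_targets(hosts, port=5000):
    """Return the targets structure used in the README's prometheus_targets.json example."""
    # Assumption: every node exposes the plugin on the same port (5000 in the README example).
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


def write_targets(path, hosts, port=5000):
    """Serialize the targets structure to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_targets(hosts, port), handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    # Host names are the placeholders that appear in the README example.
    nodes = ["rdc_test1.amd.com", "rdc_test2.amd.com"]
    print(json.dumps(build_targets(nodes), indent=2))
```

Calling `write_targets("prometheus_targets.json", nodes)` would overwrite the file in the current directory, so point it at a scratch path first and compare the result with the file shipped alongside the plugin.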