diff --git a/README.md b/README.md
index cc460b7daf..2170d5c8ce 100644
--- a/README.md
+++ b/README.md
@@ -1,151 +1,68 @@
-# ROCm™ Data Center Tool (RDC) π
+# ROCm™ Data Center Tool (RDC)
-The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges in AMD GPUs within cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.
+The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges for AMD GPUs in cluster and datacenter environments. The main features are:
-## π Main Features
+- GPU telemetry
+- GPU statistics for jobs
+- Integration with third-party tools
+- Open source
-- **GPU Telemetry** π
-- **GPU Statistics for Jobs** π
-- **Integration with Third-Party Tools** π
-- **Open Source** π οΈ
+For up-to-date documentation and instructions on getting started with RDC from pre-built packages, refer to the [**ROCm Data Center Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
-For comprehensive documentation and to get started with RDC using pre-built packages, refer to the [**ROCm Data Center Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
+## Certificate generation
----
+For certificate generation, refer to the
+[**RDC Developer Handbook** (Generate Files for Authentication)](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication),
+or read the concise guide in `authentication/readme.txt`.
-## π οΈ Installation Guide
+## Supported platforms
-### π Prerequisites
+RDC runs on AMD ROCm-supported platforms. Refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) for details.
-Before setting up RDC, ensure your system meets the following requirements:
+## Important notes
-- **Supported Platforms**: RDC runs on AMD ROCm-supported platforms. Refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) for details.
-- **Dependencies**:
- - **CMake** ≥ 3.15
- - **g++** (5.4.0)
- - **Doxygen** (1.8.11)
- - **LaTeX** (pdfTeX 3.14159265-2.6-1.40.16)
- - **gRPC and protoc**
- - **libcap-dev**
- - **AMD ROCm Platform** ([GitHub](https://github.com/ROCm/ROCm))
- - **AMDSMI Library** ([GitHub](https://github.com/ROCm/amdsmi))
- - **ROCK Kernel Driver** ([GitHub](https://github.com/ROCm/ROCK-Kernel-Driver))
+### RocProfiler metrics usage
-### π Certificate Generation
+When using rocprofiler fields (800-899), you must run
+`export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1`
+before starting a compute load.
-For certificate generation, refer to the [**RDC Developer Handbook (Generate Files for Authentication)**](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication) or consult the concise guide located at `authentication/readme.txt`.
+See [dmon rocprofiler fields return zeros](#dmon-rocprofiler-fields-return-zeros).
----
+## Building RDC from source
-## π Running RDC
+### Dependencies
-RDC supports two primary modes of operation: **Standalone** and **Embedded**. Choose the mode that best fits your deployment needs.
+ CMake 3.15 ## 3.15 or greater is required for gRPC
+ g++ (5.4.0)
+ Doxygen (1.8.11) ## required to build the latest documentation
+ LaTeX (pdfTeX 3.14159265-2.6-1.40.16) ## required to build the latest documentation
+ gRPC and protoc ## required for communication
+ libcap-dev ## required to manage the privileges.
-### ποΈ Standalone Mode
+ AMD ROCm platform (https://github.com/ROCm/ROCm)
+ * It is recommended to install the complete AMD ROCm platform.
+ For installation instructions, see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
+ * At a minimum, these two components are required:
+ (i) AMDSMI Library (https://github.com/ROCm/amdsmi)
+ (ii) AMD ROCk Kernel driver (https://github.com/ROCm/ROCK-Kernel-Driver)
-Standalone mode allows RDC to run independently with all its components installed.
+## Building gRPC and protoc
-1. **Start RDCD with Authentication (Monitor-Only Capabilities):**
+**NOTE:** gRPC and the protoc compiler must be built from source when building RDC from source, because pre-built packages are not available. When RDC is installed from a package, gRPC and protoc are installed from the package.
- ```bash
- /opt/rocm/bin/rdcd
- ```
+**IMPORTANT:** Building gRPC and protocol buffers requires CMake 3.15 or greater. With an older version, the build quietly succeeds with only a *message*. However, not all components of gRPC are installed, and RDC will ***fail*** to run.
-2. **Start RDCD with Authentication (Full Capabilities):**
+The following tools are required to build and install gRPC:
- ```bash
- sudo /opt/rocm/bin/rdcd
- ```
+ automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang-5.0 libc++-dev curl
-3. **Start RDCD without Authentication (Monitor-Only):**
+### Download and build gRPC
- ```bash
- /opt/rocm/bin/rdcd -u
- ```
+By default (without the `CMAKE_INSTALL_PREFIX` option), gRPC installs into the `lib`, `include`, and `bin` directories under `/usr/local`.
+It is highly recommended to install gRPC into a unique directory.
+The example below installs gRPC into `/opt/grpc`:
-4. **Start RDCD without Authentication (Full Capabilities):**
-
- ```bash
- sudo /opt/rocm/bin/rdcd -u
- ```
-
-### π Embedded Mode
-
-Embedded mode integrates RDC directly into your existing management tools using its library format.
-
-- **Run RDC in Embedded Mode:**
-
- ```bash
- python your_management_tool.py --rdc_embedded
- ```
-
-**Note:** Ensure that the `rdcd` daemon is not running separately when using embedded mode.
-
-### π οΈ Starting RDCD Using systemd
-
-1. **Copy the Service File:**
-
- ```bash
- sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
- ```
-
-2. **Configure Capabilities:**
-
- - **Full Capabilities:** Ensure the following lines are **uncommented** in `/etc/systemd/system/rdc.service`:
-
- ```ini
- CapabilityBoundingSet=CAP_DAC_OVERRIDE
- AmbientCapabilities=CAP_DAC_OVERRIDE
- ```
-
- - **Monitor-Only Capabilities:** **Comment out** the above lines to restrict RDCD to monitoring.
-
-3. **Start the Service:**
-
- ```bash
- sudo systemctl start rdc
- sudo systemctl status rdc
- ```
-
-4. **Modify RDCD Options:**
-
- Edit `/opt/rocm/share/rdc/conf/rdc_options.conf` to append any additional RDCD parameters.
-
- ```bash
- sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
- ```
-
- **Example Configuration:**
-
- ```bash
- RDC_OPTS="-p 50051 -u -d"
- ```
-
- - **Flags:**
- - `-p 50051` : Use port 50051
- - `-u` : Unauthenticated mode
- - `-d` : Enable debug messages
-
----
-
-## ποΈ Building RDC from Source
-
-If you prefer to build RDC from source, follow the steps below.
-
-### π§ Building gRPC and protoc
-
-**Important:** RDC requires gRPC and protoc to be built from source as pre-built packages are not available.
-
-1. **Install Required Tools:**
-
- ```bash
- sudo apt-get update
- sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
- ```
-
-2. **Clone and Build gRPC:**
-
- ```bash
git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
cd grpc
export GRPC_ROOT=/opt/grpc
@@ -160,502 +77,166 @@ If you prefer to build RDC from source, follow the steps below.
make -C build -j $(nproc)
sudo make -C build install
echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
- sudo ldconfig
- cd ..
- ```
-### π§ Building RDC
+## Building RDC
-1. **Clone the RDC Repository:**
+Clone the RDC source code from GitHub, then use CMake to build and install it:
- ```bash
git clone https://github.com/ROCm/rdc
cd rdc
- ```
-
-2. **Configure the Build:**
-
- ```bash
+ # default installation location is /opt/rocm, specify with -DROCM_DIR or -DCMAKE_INSTALL_PREFIX
cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
- ```
-
- - **Optional Features:**
- - **Enable ROCm Profiler:**
-
- ```bash
- cmake -B build -DBUILD_PROFILER=ON
- ```
-
- - **Enable RVS:**
-
- ```bash
- cmake -B build -DBUILD_RVS=ON
- ```
-
- - **Build RDC Library Only (without rdci and rdcd):**
-
- ```bash
- cmake -B build -DBUILD_STANDALONE=OFF
- ```
-
- - **Build RDC Library Without ROCm Run-time:**
-
- ```bash
- cmake -B build -DBUILD_RUNTIME=OFF
- ```
-
-3. **Build and Install:**
-
- ```bash
+ # enable rocprofiler (optional)
+ cmake -B build -DBUILD_PROFILER=ON
+ # enable RVS (optional)
+ cmake -B build -DBUILD_RVS=ON
+ # build and install
make -C build -j $(nproc)
- sudo make -C build install
- ```
+ make -C build install
-4. **Update System Library Path:**
+## Building RDC library only without gRPC (optional)
- ```bash
- export RDC_LIB_DIR=/opt/rocm/lib/rdc
- export GRPC_LIB_DIR="/opt/grpc/lib"
- echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
- echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
- sudo ldconfig
- ```
+If only the RDC libraries are needed (i.e. only "embedded mode" is required), the user can choose not to build rdci and rdcd. This eliminates the need for gRPC and protoc. To build this way, pass `-DBUILD_STANDALONE=off` on the cmake command line:
----
+ cmake -B build -DBUILD_STANDALONE=off
-## π Features Overview
+## Building RDC library without ROCm runtime (optional)
-### π Discovery
+The user can choose not to build the RDC diagnostics that use the ROCm runtime. This eliminates the need for the ROCm runtime. To build this way, pass `-DBUILD_RUNTIME=off` on the cmake command line:
-Locate and display information about GPUs present in a compute node.
+ cmake -B build -DBUILD_RUNTIME=off
-**Example:**
+## Update System Library Path
-```bash
-rdci discovery -l
-```
+ RDC_LIB_DIR=/opt/rocm/lib/rdc
+ GRPC_LIB_DIR="${RDC_LIB_DIR}/grpc/lib\n/opt/grpc/lib"
+ echo -e "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
+ echo -e "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
+ sudo ldconfig
-**Output:**
+## Running RDC
-```
-2 GPUs found
+RDC supports encrypted communications between clients and servers. The
+communication can be configured to be *authenticated* or *not authenticated*. The [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/) has information on how to generate and install SSL keys and certificates for authentication. By default, authentication is enabled.
-+-----------+----------------------------------------------+
-| GPU Index | Device Information |
-+-----------+----------------------------------------------+
-| 0 | Name: AMD Radeon Instinct MI50 Accelerator |
-| 1 | Name: AMD Radeon Instinct MI50 Accelerator |
-+-----------+----------------------------------------------+
-```
+## Starting ROCm™ Data Center Daemon (RDCD)
-## π₯ Groups
+For an RDC client application to monitor and/or control a remote system, the RDC server daemon, *rdcd*, must be running on the remote system. *rdcd* can be configured to run with (a) full capabilities, which include the ability to set or change GPU configuration, or (b) monitor-only capabilities, which limit it to monitoring GPU metrics.
-#### π₯οΈ GPU Groups
+### Start RDCD from command-line
-Create, delete, and list logical groups of GPUs.
+When *rdcd* is started from the command line, its *capabilities* are determined by the privileges of the *user* starting *rdcd*:
-**Create a Group:**
+ ## NOTE: Replace /opt/rocm with specific rocm version if needed
-```bash
-rdci group -c GPU_GROUP
-```
+ ## To run with authentication. Ensure SSL keys are set up properly
+ /opt/rocm/bin/rdcd ## rdcd is started with monitor-only capabilities
+ sudo /opt/rocm/bin/rdcd ## rdcd is started with full capabilities
-**Add GPUs to Group:**
+ ## To run without authentication. SSL key & certificates are not required.
+ /opt/rocm/bin/rdcd -u ## rdcd is started with monitor-only capabilities
+ sudo /opt/rocm/bin/rdcd -u ## rdcd is started with full capabilities
-```bash
-rdci group -g 1 -a 0,1
-```
+### Start RDCD using systemd
-**List Groups:**
+*rdcd* can be started with the systemctl command. Copy `/opt/rocm/libexec/rdc/rdc.service`, which is installed with RDC, to the systemd folder. This file has two lines that control the *capabilities* with which *rdcd* runs. If they are left uncommented, rdcd runs with full capabilities.
-```bash
-rdci group -l
-```
+ ## file: /opt/rocm/libexec/rdc/rdc.service
+ ## Comment out the following two lines to run with monitor-only capabilities
+ CapabilityBoundingSet=CAP_DAC_OVERRIDE
+ AmbientCapabilities=CAP_DAC_OVERRIDE
-**Delete a Group:**
+ systemctl start rdc ## start rdc as systemd service
-```bash
-rdci group -d 1
-```
+Additional options can be passed to *rdcd* by modifying `/opt/rocm/share/rdc/conf/rdc_options.conf`:
-#### ποΈ Field Groups
+ ## file: /opt/rocm/share/rdc/conf/rdc_options.conf
+ # Append 'rdc' daemon parameters here
+ RDC_OPTS="-p 50051 -u -d"
-Manage field groups to monitor specific GPU metrics.
+The example above does the following:
-**Create a Field Group:**
+- Use port 50051
+- Use unauthenticated mode
+- Enable debug messages
+- **NOTE:** You must add the `-u` flag to `rdci` calls as well.
-```bash
-rdci fieldgroup -c -f 150,155
-```
+## Invoke RDC using ROCm™ Data Center Interface (RDCI)
-**List Field Groups:**
+RDCI provides a command-line interface to all RDC features. This CLI can be run locally or remotely. Refer to the [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/how-to/features.html) for the current list of features.
-```bash
-rdci fieldgroup -l
-```
+ ## sample rdci commands to test RDC functionality
+ ## discover devices in a local or remote compute node
+ ## NOTE: option -u (for unauthenticated) is required if rdcd was started in this mode
+ ## Assuming that rdc is installed into /opt/rocm
-**Delete a Field Group:**
+ cd /opt/rocm/bin
+ ./rdci discovery -l <-u> ## list available GPUs in localhost
+ ./rdci discovery -l <-u> ## list available GPUs in host machine
+ ./rdci dmon <-u> -l ## list most GPU counters
+ # assuming rdcd is running locally, using -u instead of
+ ./rdci dmon -u --list-all ## list all GPU counters
+ ./rdci dmon -u -i 0 -c 1 -e 100 ## monitor field 100 on gpu 0 for count of 1
+ ./rdci dmon -u -i 0 -c 1 -e 1,2 ## monitor fields 1,2 on gpu 0 for count of 1
-```bash
-rdci fieldgroup -d 1
-```
-> [!IMPORTANT]
->### π Monitor Errors
->
->Define fields to monitor RAS ECC counters.
->
->- **Correctable ECC Errors:**
->
-> ```bash
-> 312 RDC_FI_ECC_CORRECT_TOTAL
-> ```
->
->- **Uncorrectable ECC Errors:**
->
-> ```bash
-> 313 RDC_FI_ECC_UNCORRECT_TOTAL
-> ```
+## Known issues
-### π Device Monitoring
+### dmon fields return N/A
-Monitor GPU fields such as temperature, power usage, and utilization.
+1. An optional library might be missing. Do you have
+ `/opt/rocm/lib/rdc/librdc_*.so`? Do you have the library it's related to?
+ (rocprofiler, rocruntime...)
+2. The GPU you're using might not be supported. As a rule of thumb, most
+ metrics should work on MI300 and up. Fewer metrics are supported on MI200.
+ NV21 (aka RX6800) and other consumer GPUs support even fewer metrics.
-**Command:**
+### dmon rocprofiler fields return zeros
-```bash
-rdci dmon -f -g -c 5 -d 1000
-```
+Due to a rocprofiler limitation, you must set the `HSA_TOOLS_LIB` environment
+variable *before* running a compute job.
-**Sample Output:**
+If `HSA_TOOLS_LIB` is not set, most rocprofiler metrics will return all zeros.
-```
-1 group found
+For example, correct output on MI300 using [gpu-burn](https://github.com/ROCm/HIP-Examples/tree/master/gpu-burn):
-+-----------+-------------+---------------+
-| GPU Index | TEMP (m°C) | POWER (µW) |
-+-----------+-------------+---------------+
-| 0 | 25000 | 520500 |
-+-----------+-------------+---------------+
-```
+ # terminal 1
+ rdcd -u
+ # terminal 2
+ export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
+ gpu-burn
+ # terminal 3
+ rdci dmon -u -e 800,801 -i 0 -c 1
+ # output:
+ # GPU OCCUPANCY_PERCENT ACTIVE_WAVES
+ # 0 001.000 32640.000
-### π Job Stats
+### `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
-Display GPU statistics for any given workload.
+Error:
-**Start Recording Stats:**
+ terminate called after throwing an instance of 'std::runtime_error'
+ what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
+ Aborted (core dumped)
-```bash
-rdci stats -s 2 -g 1
-```
+1. Missing groups. Run `groups`. You're expected to have `video` and `render`.
+ This can be fixed with `sudo usermod -aG video,render $USER` followed by
+ logging out and logging back in.
-**Stop Recording Stats:**
+## Troubleshooting rdcd
-```bash
-rdci stats -x 2
-```
+Log messages can provide useful debug information.
-**Display Job Stats:**
+If rdcd was started as a systemd service, use journalctl to view the rdcd logs:
-```bash
-rdci stats -j 2
-```
+ sudo journalctl -u rdc
-**Sample Output:**
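+To run rdcd with debug logging from the command line, use the command below.
+If RDC is installed under a versioned ROCm directory, replace `/opt/rocm` with that path; the version is the version number (e.g. 3.10.0) of the ROCm release that RDC was packaged with.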
-```
-Summary:
-Executive Status:
+ RDC_LOG=DEBUG /opt/rocm/bin/rdcd
-Start time: 1586795401
-End time: 1586795445
-Total execution time: 44
+`RDC_LOG=DEBUG` also works with rdci.
-Energy Consumed (Joules): 21682
-Power Usage (Watts): Max: 49 Min: 13 Avg: 34
-GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
-GPU Utilization (%): Max: 69 Min: 0 Avg: 2
-Max GPU Memory Used (bytes): 524320768
-Memory Utilization (%): Max: 12 Min: 11 Avg: 12
-```
+The ERROR, INFO, and DEBUG logging levels are supported.
-### π©Ί Diagnostic
-
-Run diagnostics on a GPU group to ensure system health.
-
-**Command:**
-
-```bash
-rdci diag -g
-```
-
-**Sample Output:**
-
-```
-No compute process: Pass
-Node topology check: Pass
-GPU parameters check: Pass
-Compute Queue ready: Pass
-System memory check: Pass
-=============== Diagnostic Details ==================
-No compute process: No processes running on any devices.
-Node topology check: No link detected.
-GPU parameters check: GPU 0 Critical Edge temperature in range.
-Compute Queue ready: Run binary search task on GPU 0 Pass.
-System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
-```
-
----
-
-## π Integration with Third-Party Tools
-
-RDC integrates seamlessly with tools like **Prometheus**, **Grafana**, and **Reliability, Availability, and Serviceability (RAS)** to enhance monitoring and visualization.
-
-### π Python Bindings
-
-RDC provides a generic Python class `RdcReader` to simplify telemetry gathering.
-
-**Sample Program:**
-
-```python
-from RdcReader import RdcReader
-from RdcUtil import RdcUtil
-from rdc_bootstrap import *
-import time
-
-default_field_ids = [
- rdc_field_t.RDC_FI_POWER_USAGE,
- rdc_field_t.RDC_FI_GPU_UTIL
-]
-
-class SimpleRdcReader(RdcReader):
- def __init__(self):
- super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)
-
- def handle_field(self, gpu_index, value):
- field_name = self.rdc_util.field_id_string(value.field_id).lower()
- print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")
-
-if __name__ == '__main__':
- reader = SimpleRdcReader()
- while True:
- time.sleep(1)
- reader.process()
-```
-
-**Running the Example:**
-
-```bash
-# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
-python SimpleReader.py
-```
-
-### π Prometheus Plugin
-
-The Prometheus plugin allows you to monitor events and send alerts.
-
-**Installation:**
-
-1. **Install Prometheus Client:**
-
- ```bash
- pip install prometheus_client
- ```
-
-2. **Run the Prometheus Plugin:**
-
- ```bash
- python rdc_prometheus.py
- ```
-
-3. **Verify Plugin:**
-
- ```bash
- curl localhost:5000
- ```
-
-**Integration Steps:**
-
-1. **Download and Install Prometheus:**
- - [Prometheus GitHub](https://github.com/prometheus/prometheus)
-
-2. **Configure Prometheus Targets:**
- - Modify `prometheus_targets.json` to point to your compute nodes.
-
- ```json
- [
- {
- "targets": [
- "rdc_test1.amd.com:5000",
- "rdc_test2.amd.com:5000"
- ]
- }
- ]
- ```
-
-3. **Start Prometheus with Configuration File:**
-
- ```bash
- prometheus --config.file=/path/to/rdc_prometheus_example.yml
- ```
-
-4. **Access Prometheus UI:**
- - Open [http://localhost:9090](http://localhost:9090) in your browser.
-
-### π Grafana Integration
-
-Grafana provides advanced visualization capabilities for RDC metrics.
-
-**Installation:**
-
-1. **Download Grafana:**
- - [Grafana Download](https://grafana.com/grafana/download)
-
-2. **Install Grafana:**
- - Follow the [Installation Instructions](https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/).
-
-3. **Start Grafana Server:**
-
- ```bash
- sudo systemctl start grafana-server
- sudo systemctl status grafana-server
- ```
-
-4. **Access Grafana:**
- - Open [http://localhost:3000](http://localhost:3000/) in your browser and log in with the default credentials (`admin`/`admin`).
-
-**Configuration Steps:**
-
-1. **Add Prometheus Data Source:**
- - Navigate to **Configuration → Data Sources → Add data source → Prometheus**.
- - Set the URL to [http://localhost:9090](http://localhost:9090) and save.
-
-2. **Import RDC Dashboard:**
- - Click the **+** icon and select **Import**.
- - Upload `rdc_grafana_dashboard_example.json` from the `python_binding` folder.
- - Select the desired compute node for visualization.
-
-### π‘οΈ Reliability, Availability, and Serviceability (RAS) Plugin
-
-The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
-
-**Installation:**
-
-1. **Ensure GPU Supports RAS:**
- - The GPU must support RAS features.
-
-2. **RDC Installation Includes RAS Library:**
- - `librdc_ras.so` is located in `/opt/rocm-4.2.0/rdc/lib`.
-
-**Usage:**
-
-- **Monitor ECC Errors:**
-
- ```bash
- rdci dmon -i 0 -e 600,601
- ```
-
- **Sample Output:**
-
- ```
- GPU ECC_CORRECT ECC_UNCORRECT
- 0 0 0
- ```
-
----
-> [!IMPORTANT]
->## π Troubleshooting
->
-> ### Known Issues
->#### π dmon Fields Return N/A
->
->1. **Missing Libraries:**
-> - Verify `/opt/rocm/lib/rdc/librdc_*.so` exists.
-> - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
->
->2. **Unsupported GPU:**
-> - Most metrics work on MI300 and newer.
-> - Limited metrics on MI200.
-> - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
->
->#### π dmon RocProfiler Fields Return Zeros
->
->**Solution:**
->
->Set the `HSA_TOOLS_LIB` environment variable **before** running a compute job.
->
->```bash
->export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
->```
->
->**Example:**
->
->```bash
-># Terminal 1
->rdcd -u
->
-># Terminal 2
->export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
->gpu-burn
->
-># Terminal 3
->rdci dmon -u -e 800,801 -i 0 -c 1
->
-># Output:
->GPU OCCUPANCY_PERCENT ACTIVE_WAVES
->0 001.000 32640.000
->```
->
->#### β οΈ `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
->
->**Error Message:**
->
->```
->terminate called after throwing an instance of 'std::runtime_error'
-> what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
->Aborted (core dumped)
->```
->
->**Solution:**
->
->1. **Missing Groups:**
-> - Ensure `video` and `render` groups exist.
->
-> ```bash
-> sudo usermod -aG video,render $USER
-> ```
->
-> - Log out and log back in to apply group changes.
->
->### π Troubleshooting RDCD
->
->- **View RDCD Logs:**
->
-> ```bash
-> sudo journalctl -u rdc
-> ```
->
->- **Run RDCD with Debug Logs:**
->
-> ```bash
-> RDC_LOG=DEBUG /opt/rocm/bin/rdcd
-> ```
->
-> - **Logging Levels Supported:** ERROR, INFO, DEBUG
->
->- **Enable Additional Logging Messages:**
->
-> ```bash
-> export RSMI_LOGGING=3
-> ```
-
----
-
-## π License
-
-RDC is open-source and available under the [MIT License](https://opensource.org/licenses/MIT).
-
----
-
-## π§ Support
-
-For support and further inquiries, please refer to the [**ROCm Documentation**](https://rocm.docs.amd.com/projects/rdc/en/latest/) or contact the maintainers through the repository's issue tracker.
+Additional logging messages can be enabled with `RSMI_LOGGING=3`.