Dgalants/add auth script location (#108)

* DOCS: Add authentication scripts location

Change-Id: Ie285d80ea6d9bb8f710998208d0aa7c6db661d02
Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>

* Make README.md pretty (#44)

Change-Id: I7c3341deaf3621ebbc9e495b023b1dd4971a5f1d

---------

Signed-off-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Galantsev, Dmitrii <dmitrii.galantsev@amd.com>
Co-authored-by: Williams, Justin <Justin.Williams@amd.com>
Pryor, Adam
2025-01-30 12:08:11 -06:00
committed by GitHub
parent 4da277a64e
commit a70aa81cfd
+567 -148
@@ -1,68 +1,151 @@
# ROCm<sup>TM</sup> Data Center Tool (RDC)
# ROCm Data Center Tool (RDC) 🚀
The ROCm™ Data Center Tool simplifies the administration and addresses key infrastructure challenges in AMD GPUs in cluster and datacenter environments. The main features are:
The ROCm™ Data Center Tool (RDC) simplifies administration and addresses key infrastructure challenges in AMD GPUs within cluster and datacenter environments. RDC offers a suite of features to enhance your GPU management and monitoring.
- GPU telemetry
- GPU statistics for jobs
- Integration with third-party tools
- Open source
## 🌟 Main Features
For up-to-date documentation and instructions on getting started with RDC from pre-built packages, please refer to the [**ROCm DataCenter Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
- **GPU Telemetry** 📊
- **GPU Statistics for Jobs** 📈
- **Integration with Third-Party Tools** 🔗
- **Open Source** 🛠️
## Certificate generation
For comprehensive documentation and to get started with RDC using pre-built packages, refer to the [**ROCm Data Center Tool User Guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/).
For certificate generation, please refer to
[**RDC Developer Handbook**#generate-files-for-authentication](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication)
Or read the concise guide under authentication/readme.txt
---
## Supported platforms
## 🛠️ Installation Guide
RDC can run on AMD ROCm-supported platforms. Please refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems).
### 📋 Prerequisites
## Important notes
Before setting up RDC, ensure your system meets the following requirements:
### RocProfiler metrics usage
- **Supported Platforms**: RDC runs on AMD ROCm-supported platforms. Refer to the [List of Supported Operating Systems](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/reference/system-requirements.html#supported-operating-systems) for details.
- **Dependencies**:
- **CMake** ≥ 3.15
- **g++** (5.4.0)
- **Doxygen** (1.8.11)
- **LaTeX** (pdfTeX 3.14159265-2.6-1.40.16)
- **gRPC and protoc**
- **libcap-dev**
- **AMD ROCm Platform** ([GitHub](https://github.com/ROCm/ROCm))
- **AMDSMI Library** ([GitHub](https://github.com/ROCm/amdsmi))
- **ROCK Kernel Driver** ([GitHub](https://github.com/ROCm/ROCK-Kernel-Driver))
When using rocprofiler fields (800-899) you must call
`export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1`
before starting a compute load.
### 🔐 Certificate Generation
[***See: dmon-rocprofiler-fields-return-zeros***](#dmon-rocprofiler-fields-return-zeros)
For certificate generation, refer to the [**RDC Developer Handbook (Generate Files for Authentication)**](https://rocm.docs.amd.com/projects/rdc/en/latest/install/handbook.html#generate-files-for-authentication) or consult the concise guide located at `authentication/readme.txt`.
## Building RDC from source
---
### Dependencies
## 🚀 Running RDC
CMake 3.15 ## 3.15 or greater is required for gRPC
g++ (5.4.0)
Doxygen (1.8.11) ## required to build the latest documentation
Latex (pdfTeX 3.14159265-2.6-1.40.16) ## required to build the latest documentation
gRPC and protoc ## required for communication
libcap-dev ## required to manage the privileges.
RDC supports two primary modes of operation: **Standalone** and **Embedded**. Choose the mode that best fits your deployment needs.
AMD ROCm platform (https://github.com/ROCm/ROCm)
* It is recommended to install the complete AMD ROCm platform.
For installation instruction see https://rocm.docs.amd.com/projects/install-on-linux/en/latest/tutorial/quick-start.html
* At the minimum, these two components are required
(i) AMDSMI Library (https://github.com/ROCm/amdsmi)
(ii) AMD ROCk Kernel driver (https://github.com/ROCm/ROCK-Kernel-Driver)
### 🗂️ Standalone Mode
## Building gRPC and protoc
Standalone mode allows RDC to run independently with all its components installed.
**NOTE:** gRPC and protoc compiler must be built when building RDC from source as pre-built packages are not available. When installing RDC from a package, gRPC and protoc will be installed from the package.
1. **Start RDCD with Authentication (Monitor-Only Capabilities):**
**IMPORTANT:** Building gRPC and protocol buffers requires CMake 3.15 or greater. With an older version, the build will quietly succeed with a *message*. However, not all components of gRPC will be installed, and RDC will ***fail*** to run.
```bash
/opt/rocm/bin/rdcd
```
The following tools are required for gRPC build & installation
2. **Start RDCD with Authentication (Full Capabilities):**
automake make g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang-5.0 libc++-dev curl
```bash
sudo /opt/rocm/bin/rdcd
```
### Download and build gRPC
3. **Start RDCD without Authentication (Monitor-Only):**
By default (without the CMAKE_INSTALL_PREFIX option), gRPC installs to the `/usr/local` lib, include, and bin directories.
It is highly recommended to install gRPC into a unique directory.
The example below installs gRPC into `/opt/grpc`.
```bash
/opt/rocm/bin/rdcd -u
```
4. **Start RDCD without Authentication (Full Capabilities):**
```bash
sudo /opt/rocm/bin/rdcd -u
```
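The four launch variants above differ in only two choices: whether authentication is enabled (`-u` turns it off) and whether the daemon is started with `sudo` for full capabilities. As a minimal sketch, the hypothetical helper `build_rdcd_command` below (not part of RDC) assembles the matching argument list. It assumes the default `/opt/rocm/bin/rdcd` install path used in this README.

```python
from typing import List

RDCD_PATH = "/opt/rocm/bin/rdcd"  # default install path used in this README


def build_rdcd_command(authenticated: bool = True,
                       full_capabilities: bool = False,
                       rdcd_path: str = RDCD_PATH) -> List[str]:
    """Return the argv list for one of the four rdcd launch variants."""
    argv: List[str] = []
    if full_capabilities:
        argv.append("sudo")      # root privileges give full capabilities
    argv.append(rdcd_path)
    if not authenticated:
        argv.append("-u")        # -u starts rdcd without authentication
    return argv


if __name__ == "__main__":
    # Print all four combinations listed above.
    for auth in (True, False):
        for full in (False, True):
            print(" ".join(build_rdcd_command(auth, full)))
```

The returned list can be handed to `subprocess.run` if the daemon should be launched from a script.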
### 🔗 Embedded Mode
Embedded mode integrates RDC directly into your existing management tools using its library format.
- **Run RDC in Embedded Mode:**
```bash
python your_management_tool.py --rdc_embedded
```
**Note:** Ensure that the `rdcd` daemon is not running separately when using embedded mode.
### 🛠️ Starting RDCD Using systemd
1. **Copy the Service File:**
```bash
sudo cp /opt/rocm/libexec/rdc/rdc.service /etc/systemd/system/
```
2. **Configure Capabilities:**
- **Full Capabilities:** Ensure the following lines are **uncommented** in `/etc/systemd/system/rdc.service`:
```ini
CapabilityBoundingSet=CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_DAC_OVERRIDE
```
- **Monitor-Only Capabilities:** **Comment out** the above lines to restrict RDCD to monitoring.
3. **Start the Service:**
```bash
sudo systemctl start rdc
sudo systemctl status rdc
```
4. **Modify RDCD Options:**
Edit `/opt/rocm/share/rdc/conf/rdc_options.conf` to append any additional RDCD parameters.
```bash
sudo nano /opt/rocm/share/rdc/conf/rdc_options.conf
```
**Example Configuration:**
```bash
RDC_OPTS="-p 50051 -u -d"
```
- **Flags:**
- `-p 50051` : Use port 50051
- `-u` : Unauthenticated mode
- `-d` : Enable debug messages
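
To double-check what a given `rdc_options.conf` will ask `rdcd` to do, the `RDC_OPTS` line can be read back and decoded. The sketch below is illustrative only: `parse_rdc_opts` is a hypothetical helper, and it understands just the three flags documented above (`-p`, `-u`, `-d`).

```python
from typing import Dict, Optional, Union

OptValue = Union[int, bool, None]


def parse_rdc_opts(conf_text: str) -> Dict[str, OptValue]:
    """Decode the -p, -u and -d flags from an RDC_OPTS="..." line."""
    opts: Dict[str, OptValue] = {
        "port": None, "unauthenticated": False, "debug": False}
    for line in conf_text.splitlines():
        line = line.strip()
        if not line.startswith("RDC_OPTS="):
            continue  # skip comments and unrelated lines
        value = line.split("=", 1)[1]
        tokens = iter(value.strip().strip("\"'").split())
        for token in tokens:
            if token == "-p":
                port: Optional[str] = next(tokens, None)
                if port is not None:
                    opts["port"] = int(port)
            elif token == "-u":
                opts["unauthenticated"] = True
            elif token == "-d":
                opts["debug"] = True
    return opts


if __name__ == "__main__":
    sample = '# Append rdc daemon parameters here\nRDC_OPTS="-p 50051 -u -d"\n'
    print(parse_rdc_opts(sample))
```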
---
## 🏗️ Building RDC from Source
If you prefer to build RDC from source, follow the steps below.
### 🔧 Building gRPC and protoc
**Important:** RDC requires gRPC and protoc to be built from source as pre-built packages are not available.
1. **Install Required Tools:**
```bash
sudo apt-get update
sudo apt-get install automake make cmake g++ unzip build-essential autoconf libtool pkg-config libgflags-dev libgtest-dev clang libc++-dev curl
```
2. **Clone and Build gRPC:**
```bash
git clone -b v1.61.0 https://github.com/grpc/grpc --depth=1 --shallow-submodules --recurse-submodules
cd grpc
export GRPC_ROOT=/opt/grpc
@@ -77,166 +160,502 @@ Below example installs gRPC into `/opt/grpc`
make -C build -j $(nproc)
sudo make -C build install
echo "$GRPC_ROOT" | sudo tee /etc/ld.so.conf.d/grpc.conf
sudo ldconfig
cd ..
```
## Building RDC
### 🔧 Building RDC
Clone the RDC source code from GitHub and use CMake to build and install
1. **Clone the RDC Repository:**
```bash
git clone https://github.com/ROCm/rdc
cd rdc
# default installation location is /opt/rocm, specify with -DROCM_DIR or -DCMAKE_INSTALL_PREFIX
```
2. **Configure the Build:**
```bash
cmake -B build -DGRPC_ROOT="$GRPC_ROOT"
# enable rocprofiler (optional)
cmake -B build -DBUILD_PROFILER=ON
# enable RVS (optional)
cmake -B build -DBUILD_RVS=ON
# build and install
```
- **Optional Features:**
- **Enable ROCm Profiler:**
```bash
cmake -B build -DBUILD_PROFILER=ON
```
- **Enable RVS:**
```bash
cmake -B build -DBUILD_RVS=ON
```
- **Build RDC Library Only (without rdci and rdcd):**
```bash
cmake -B build -DBUILD_STANDALONE=OFF
```
- **Build RDC Library Without ROCm Run-time:**
```bash
cmake -B build -DBUILD_RUNTIME=OFF
```
3. **Build and Install:**
```bash
make -C build -j $(nproc)
make -C build install
sudo make -C build install
```
## Building RDC library only without gRPC (optional)
4. **Update System Library Path:**
If only the RDC libraries are needed (i.e. only "embedded mode" is required), the user can choose not to build rdci and rdcd. This eliminates the need for gRPC and protoc. To build this way, pass -DBUILD_STANDALONE=off on the cmake command line:
```bash
export RDC_LIB_DIR=/opt/rocm/lib/rdc
export GRPC_LIB_DIR="/opt/grpc/lib"
echo "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
echo "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
sudo ldconfig
```
cmake -B build -DBUILD_STANDALONE=off
---
## Building RDC library without ROCM Run time (optional)
## 📊 Features Overview
The user can choose not to build the RDC diagnostics that depend on the ROCm Run time. This eliminates the need for the ROCm Run time. To build this way, pass -DBUILD_RUNTIME=off on the cmake command line:
### 🔍 Discovery
cmake -B build -DBUILD_RUNTIME=off
Locate and display information about GPUs present in a compute node.
## Update System Library Path
**Example:**
RDC_LIB_DIR=/opt/rocm/lib/rdc
GRPC_LIB_DIR="${RDC_LIB_DIR}/grpc/lib\n/opt/grpc/lib"
echo -e "${RDC_LIB_DIR}" | sudo tee /etc/ld.so.conf.d/x86_64-librdc_client.conf
echo -e "${GRPC_LIB_DIR}" | sudo tee -a /etc/ld.so.conf.d/x86_64-librdc_client.conf
ldconfig
```bash
rdci discovery <host_name> -l
```
## Running RDC
**Output:**
RDC supports encrypted communications between clients and servers. The
communication can be configured to be *authenticated* or *not authenticated*. The [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/) has information on how to generate and install SSL keys and certificates for authentication. By default, authentication is enabled.
```
2 GPUs found
## Starting ROCm™ Data Center Daemon (RDCD)
+-----------+----------------------------------------------+
| GPU Index | Device Information |
+-----------+----------------------------------------------+
| 0 | Name: AMD Radeon Instinct MI50 Accelerator |
| 1 | Name: AMD Radeon Instinct MI50 Accelerator |
+-----------+----------------------------------------------+
```
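When the discovery output needs to be consumed by a script, the table can be turned into a mapping from GPU index to device name. The following is a sketch under the assumption that the real `rdci discovery` output keeps the layout of the sample above; `parse_discovery` is a hypothetical helper, not an RDC API.

```python
import re
from typing import Dict

SAMPLE_OUTPUT = """\
2 GPUs found
+-----------+----------------------------------------------+
| GPU Index | Device Information                           |
+-----------+----------------------------------------------+
| 0         | Name: AMD Radeon Instinct MI50 Accelerator   |
| 1         | Name: AMD Radeon Instinct MI50 Accelerator   |
+-----------+----------------------------------------------+
"""

# A data row looks like: | <index> | Name: <device name> |
ROW_PATTERN = re.compile(r"^\|\s*(\d+)\s*\|\s*Name:\s*(.*?)\s*\|\s*$")


def parse_discovery(text: str) -> Dict[int, str]:
    """Map each GPU index to its device name; non-data lines are ignored."""
    gpus: Dict[int, str] = {}
    for line in text.splitlines():
        match = ROW_PATTERN.match(line)
        if match:
            gpus[int(match.group(1))] = match.group(2)
    return gpus


if __name__ == "__main__":
    for index, name in parse_discovery(SAMPLE_OUTPUT).items():
        print(index, name)
```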
For an RDC client application to monitor and/or control a remote system, the RDC server daemon, *rdcd*, must be running on the remote system. *rdcd* can be configured to run with (a) full-capabilities which includes ability to set or change GPU configuration or (b) monitor-only capabilities which limits to monitoring GPU metrics.
## 👥 Groups
### Start RDCD from command-line
#### 🖥️ GPU Groups
When *rdcd* is started from a command-line the *capabilities* are determined by privilege of the *user* starting *rdcd*
Create, delete, and list logical groups of GPUs.
## NOTE: Replace /opt/rocm with specific rocm version if needed
**Create a Group:**
## To run with authentication. Ensure SSL keys are setup properly
/opt/rocm/bin/rdcd ## rdcd is started with monitor-only capabilities
sudo /opt/rocm/bin/rdcd ## rdcd is started with full-capabilities
```bash
rdci group -c GPU_GROUP
```
## To run without authentication. SSL key & certificates are not required.
/opt/rocm/bin/rdcd -u ## rdcd is started with monitor-only capabilities
sudo /opt/rocm/bin/rdcd -u ## rdcd is started with full-capabilities
**Add GPUs to Group:**
### Start RDCD using systemd
```bash
rdci group -g 1 -a 0,1
```
*rdcd* can be started with the `systemctl` command. Copy `/opt/rocm/libexec/rdc/rdc.service`, which is installed with RDC, to the systemd folder. This file has two lines that control the *capabilities* with which *rdcd* runs. If they are left uncommented, *rdcd* runs with full capabilities.
**List Groups:**
## file: /opt/rocm/libexec/rdc/rdc.service
## Comment the following two lines to run with monitor-only capabilities
CapabilityBoundingSet=CAP_DAC_OVERRIDE
AmbientCapabilities=CAP_DAC_OVERRIDE
```bash
rdci group -l
```
systemctl start rdc ## start rdc as systemd service
**Delete a Group:**
Additional options can be passed to *rdcd* by modifying `/opt/rocm/share/rdc/conf/rdc_options.conf`
```bash
rdci group -d 1
```
## file: /opt/rocm/share/rdc/conf/rdc_options.conf
# Append 'rdc' daemon parameters here
RDC_OPTS="-p 50051 -u -d"
#### 🗂️ Field Groups
Example above does the following:
Manage field groups to monitor specific GPU metrics.
- Use port 50051
- Use unauthenticated mode
- Enable debug messages
- **NOTE:** You must add `-u` flag to `rdci` calls as well
**Create a Field Group:**
## Invoke RDC using ROCm™ Data Center Interface (RDCI)
```bash
rdci fieldgroup -c <fgroup> -f 150,155
```
RDCI provides a command-line interface to all RDC features. This CLI can be run locally or remotely. Refer to the [**user guide**](https://rocm.docs.amd.com/projects/rdc/en/latest/how-to/features.html) for the current list of features.
**List Field Groups:**
## sample rdci commands to test RDC functionality
## discover devices in a local or remote compute node
## NOTE: option -u (for unauthenticated) is required if rdcd was started in this mode
## Assuming that rdc is installed into /opt/rocm
```bash
rdci fieldgroup -l
```
cd /opt/rocm/bin
./rdci discovery -l <-u> ## list available GPUs in localhost
./rdci discovery <host> -l <-u> ## list available GPUs in host machine
./rdci dmon <host> <-u> -l ## list most GPU counters
# assuming rdcd is running locally, using -u instead of <host>
./rdci dmon -u --list-all ## list all GPU counters
./rdci dmon -u -i 0 -c 1 -e 100 ## monitor field 100 on gpu 0 for count of 1
./rdci dmon -u -i 0 -c 1 -e 1,2 ## monitor fields 1,2 on gpu 0 for count of 1
**Delete a Field Group:**
## Known issues
```bash
rdci fieldgroup -d 1
```
> [!IMPORTANT]
>### 🛑 Monitor Errors
>
>Define fields to monitor RAS ECC counters.
>
>- **Correctable ECC Errors:**
>
> ```bash
> 312 RDC_FI_ECC_CORRECT_TOTAL
> ```
>
>- **Uncorrectable ECC Errors:**
>
> ```bash
> 313 RDC_FI_ECC_UNCORRECT_TOTAL
> ```
### dmon fields return N/A
### 📈 Device Monitoring
1. An optional library might be missing. Do you have
`/opt/rocm/lib/rdc/librdc_*.so`? Do you have the library it's related to?
(rocprofiler, rocruntime...)
2. The GPU you're using might not be supported. As a rule of thumb, most
metrics should work on MI300 and up. Fewer metrics are supported on MI200.
NV21 (aka RX6800) and other consumer GPUs support even fewer metrics.
Monitor GPU fields such as temperature, power usage, and utilization.
### dmon rocprofiler fields return zeros
**Command:**
Due to a rocprofiler limitation, you must set the `HSA_TOOLS_LIB` environment
variable *before* running a compute job.
```bash
rdci dmon -f <field_group> -g <gpu_group> -c 5 -d 1000
```
If `HSA_TOOLS_LIB` is not set - most rocprofiler metrics will return all zeros.
**Sample Output:**
E.g. Correct output on MI300 using [gpu-burn](https://github.com/ROCm/HIP-Examples/tree/master/gpu-burn)
```
1 group found
# terminal 1
rdcd -u
# terminal 2
export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
gpu-burn
# terminal 3
rdci dmon -u -e 800,801 -i 0 -c 1
# output:
# GPU OCCUPANCY_PERCENT ACTIVE_WAVES
# 0 001.000 32640.000
+-----------+-------------+---------------+
| GPU Index | TEMP (m°C) | POWER (µW) |
+-----------+-------------+---------------+
| 0 | 25000 | 520500 |
+-----------+-------------+---------------+
```
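The column headers in the sample report temperature in millidegrees Celsius (m°C) and power in microwatts (µW), so the raw numbers usually need scaling before they are shown to people. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def millidegrees_to_celsius(millidegrees: int) -> float:
    """Convert a temperature in m°C to °C."""
    return millidegrees / 1000.0


def microwatts_to_watts(microwatts: int) -> float:
    """Convert a power reading in µW to W."""
    return microwatts / 1_000_000.0


if __name__ == "__main__":
    # Values taken from the sample output above.
    print(f"{millidegrees_to_celsius(25000):.1f} °C")
    print(f"{microwatts_to_watts(520500):.4f} W")
```

For the sample row, 25000 m°C corresponds to 25 °C and 520500 µW to roughly 0.52 W.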
### `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
### 📊 Job Stats
error:
Display GPU statistics for any given workload.
terminate called after throwing an instance of 'std::runtime_error'
what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
Aborted (core dumped)
**Start Recording Stats:**
1. Missing groups. Run `groups`. You're expected to have `video` and `render`.
This can be fixed with `sudo usermod -aG video,render $USER` followed by
logging out and logging back in.
```bash
rdci stats -s 2 -g 1
```
## Troubleshooting rdcd
**Stop Recording Stats:**
- Log messages that can provide useful debug information.
```bash
rdci stats -x 2
```
If rdcd was started as a systemd service, then use journalctl to view rdcd logs
**Display Job Stats:**
sudo journalctl -u rdc
```bash
rdci stats -j 2
```
To run rdcd with debug log from command-line use
Here, version is the version number (e.g., 3.10.0) of the ROCm release that RDC was packaged with.
**Sample Output:**
RDC_LOG=DEBUG /opt/rocm/bin/rdcd
```
Summary:
Executive Status:
RDC_LOG=DEBUG also works on rdci
Start time: 1586795401
End time: 1586795445
Total execution time: 44
ERROR, INFO, DEBUG logging levels are supported
Energy Consumed (Joules): 21682
Power Usage (Watts): Max: 49 Min: 13 Avg: 34
GPU Clock (MHz): Max: 1000 Min: 300 Avg: 903
GPU Utilization (%): Max: 69 Min: 0 Avg: 2
Max GPU Memory Used (bytes): 524320768
Memory Utilization (%): Max: 12 Min: 11 Avg: 12
```
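The start and end times in the sample look like Unix epoch seconds (an assumption; the output itself does not say so). Under that assumption, the reported total execution time is simply their difference, which the sketch below reproduces with hypothetical helper names.

```python
from datetime import datetime, timezone


def job_duration_seconds(start_time: int, end_time: int) -> int:
    """Return the elapsed seconds between two epoch timestamps."""
    if end_time < start_time:
        raise ValueError("end_time must not be earlier than start_time")
    return end_time - start_time


def to_utc_iso(timestamp: int) -> str:
    """Render an epoch timestamp as an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).isoformat()


if __name__ == "__main__":
    start, end = 1586795401, 1586795445  # values from the sample output
    print("started :", to_utc_iso(start))
    print("finished:", to_utc_iso(end))
    print("duration:", job_duration_seconds(start, end), "s")
```

With the sample values the difference is 44 seconds, matching the `Total execution time` line.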
Additional logging messages can be enabled with `RSMI_LOGGING=3`
### 🩺 Diagnostic
Run diagnostics on a GPU group to ensure system health.
**Command:**
```bash
rdci diag -g <gpu_group>
```
**Sample Output:**
```
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
Node topology check: No link detected.
GPU parameters check: GPU 0 Critical Edge temperature in range.
Compute Queue ready: Run binary search task on GPU 0 Pass.
System memory check: Max Single Allocation Memory Test for GPU 0 Pass. CPUAccessToGPUMemoryTest for GPU 0 Pass. GPUAccessToCPUMemoryTest for GPU 0 Pass.
```
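For automated health checks, the summary part of the diagnostic output can be reduced to a single pass/fail decision. The sketch below assumes the output keeps the `name: status` layout of the sample and that the detail section starts at the line of `=` characters; both helper functions are hypothetical.

```python
from typing import Dict

SAMPLE_OUTPUT = """\
No compute process: Pass
Node topology check: Pass
GPU parameters check: Pass
Compute Queue ready: Pass
System memory check: Pass
=============== Diagnostic Details ==================
No compute process: No processes running on any devices.
"""


def parse_diag_summary(text: str) -> Dict[str, str]:
    """Collect 'test name -> status' pairs from the summary section only."""
    results: Dict[str, str] = {}
    for line in text.splitlines():
        if line.startswith("="):
            break  # everything after the separator is free-form detail
        name, separator, status = line.partition(":")
        if separator and status.strip():
            results[name.strip()] = status.strip()
    return results


def all_passed(results: Dict[str, str]) -> bool:
    """True only if at least one test ran and every status is 'Pass'."""
    return bool(results) and all(s == "Pass" for s in results.values())


if __name__ == "__main__":
    summary = parse_diag_summary(SAMPLE_OUTPUT)
    print(summary)
    print("healthy:", all_passed(summary))
```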
---
## 🔌 Integration with Third-Party Tools
RDC integrates with third-party tools such as **Prometheus** and **Grafana** for monitoring and visualization, and exposes **Reliability, Availability, and Serviceability (RAS)** error counters through a plugin.
### 🐍 Python Bindings
RDC provides a generic Python class `RdcReader` to simplify telemetry gathering.
**Sample Program:**
```python
from RdcReader import RdcReader
from RdcUtil import RdcUtil
from rdc_bootstrap import *
import time

default_field_ids = [
    rdc_field_t.RDC_FI_POWER_USAGE,
    rdc_field_t.RDC_FI_GPU_UTIL
]


class SimpleRdcReader(RdcReader):
    def __init__(self):
        super().__init__(ip_port=None, field_ids=default_field_ids, update_freq=1000000)

    def handle_field(self, gpu_index, value):
        field_name = self.rdc_util.field_id_string(value.field_id).lower()
        print(f"{value.ts} {gpu_index}:{field_name} {value.value.l_int}")


if __name__ == '__main__':
    reader = SimpleRdcReader()
    while True:
        time.sleep(1)
        reader.process()
```
**Running the Example:**
```bash
# Ensure RDC shared libraries are in the library path and RdcReader.py is in PYTHONPATH
python SimpleReader.py
```
### 📈 Prometheus Plugin
The Prometheus plugin allows you to monitor events and send alerts.
**Installation:**
1. **Install Prometheus Client:**
```bash
pip install prometheus_client
```
2. **Run the Prometheus Plugin:**
```bash
python rdc_prometheus.py
```
3. **Verify Plugin:**
```bash
curl localhost:5000
```
**Integration Steps:**
1. **Download and Install Prometheus:**
- [Prometheus GitHub](https://github.com/prometheus/prometheus)
2. **Configure Prometheus Targets:**
- Modify `prometheus_targets.json` to point to your compute nodes.
```json
[
{
"targets": [
"rdc_test1.amd.com:5000",
"rdc_test2.amd.com:5000"
]
}
]
```
3. **Start Prometheus with Configuration File:**
```bash
prometheus --config.file=/path/to/rdc_prometheus_example.yml
```
4. **Access Prometheus UI:**
- Open [http://localhost:9090](http://localhost:9090) in your browser.
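
For clusters with many compute nodes, the targets file from step 2 can be generated instead of edited by hand. The sketch below is an illustration only: `build_targets` is a hypothetical helper, the host names are the placeholders from the example above, and port 5000 is assumed because the plugin was verified with `curl localhost:5000`.

```python
import json
from typing import Dict, List


def build_targets(hosts: List[str], port: int = 5000) -> List[Dict[str, List[str]]]:
    """Build the structure stored in prometheus_targets.json."""
    return [{"targets": [f"{host}:{port}" for host in hosts]}]


if __name__ == "__main__":
    hosts = ["rdc_test1.amd.com", "rdc_test2.amd.com"]  # placeholder nodes
    # Print the JSON; redirect it to prometheus_targets.json as needed.
    print(json.dumps(build_targets(hosts), indent=2))
```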
### 📊 Grafana Integration
Grafana provides advanced visualization capabilities for RDC metrics.
**Installation:**
1. **Download Grafana:**
- [Grafana Download](https://grafana.com/grafana/download)
2. **Install Grafana:**
- Follow the [Installation Instructions](https://grafana.com/docs/grafana/latest/setup-grafana/installation/debian/).
3. **Start Grafana Server:**
```bash
sudo systemctl start grafana-server
sudo systemctl status grafana-server
```
4. **Access Grafana:**
- Open [http://localhost:3000](http://localhost:3000/) in your browser and log in with the default credentials (`admin`/`admin`).
**Configuration Steps:**
1. **Add Prometheus Data Source:**
- Navigate to **Configuration → Data Sources → Add data source → Prometheus**.
- Set the URL to [http://localhost:9090](http://localhost:9090) and save.
2. **Import RDC Dashboard:**
- Click the **+** icon and select **Import**.
- Upload `rdc_grafana_dashboard_example.json` from the `python_binding` folder.
- Select the desired compute node for visualization.
### 🛡️ Reliability, Availability, and Serviceability (RAS) Plugin
The RAS plugin enables monitoring and counting of ECC (Error-Correcting Code) errors.
**Installation:**
1. **Ensure GPU Supports RAS:**
- The GPU must support RAS features.
2. **RDC Installation Includes RAS Library:**
- `librdc_ras.so` is located in `/opt/rocm-4.2.0/rdc/lib`.
**Usage:**
- **Monitor ECC Errors:**
```bash
rdci dmon -i 0 -e 600,601
```
**Sample Output:**
```
GPU ECC_CORRECT ECC_UNCORRECT
0 0 0
```
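A monitoring script can flag GPUs whose ECC counters are non-zero. The sketch below assumes the whitespace-separated layout and integer values of the sample above; `parse_dmon` and `gpus_with_ecc_errors` are hypothetical helpers rather than RDC functions.

```python
from typing import Dict, List

SAMPLE_OUTPUT = """\
GPU     ECC_CORRECT         ECC_UNCORRECT
0       0                   0
"""


def parse_dmon(text: str) -> List[Dict[str, int]]:
    """Turn a header line plus value rows into one dict per GPU."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return []
    header = lines[0].split()
    return [dict(zip(header, (int(v) for v in line.split())))
            for line in lines[1:]]


def gpus_with_ecc_errors(rows: List[Dict[str, int]]) -> List[int]:
    """Return the indices of GPUs reporting any ECC error."""
    return [row["GPU"] for row in rows
            if row["ECC_CORRECT"] > 0 or row["ECC_UNCORRECT"] > 0]


if __name__ == "__main__":
    rows = parse_dmon(SAMPLE_OUTPUT)
    print("GPUs with ECC errors:", gpus_with_ecc_errors(rows))
```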
---
> [!IMPORTANT]
>## 🐞 Troubleshooting
>
> ### Known Issues
>#### 🛑 dmon Fields Return N/A
>
>1. **Missing Libraries:**
> - Verify `/opt/rocm/lib/rdc/librdc_*.so` exists.
> - Ensure all related libraries (rocprofiler, rocruntime, etc.) are present.
>
>2. **Unsupported GPU:**
> - Most metrics work on MI300 and newer.
> - Limited metrics on MI200.
> - Consumer GPUs (e.g., RX6800) have fewer supported metrics.
>
>#### 🐍 dmon RocProfiler Fields Return Zeros
>
>**Solution:**
>
>Set the `HSA_TOOLS_LIB` environment variable **before** running a compute job.
>
>```bash
>export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
>```
>
>**Example:**
>
>```bash
># Terminal 1
>rdcd -u
>
># Terminal 2
>export HSA_TOOLS_LIB=/opt/rocm/lib/librocprofiler64.so.1
>gpu-burn
>
># Terminal 3
>rdci dmon -u -e 800,801 -i 0 -c 1
>
># Output:
>GPU OCCUPANCY_PERCENT ACTIVE_WAVES
>0 001.000 32640.000
>```
>
>#### ⚠️ `HSA_STATUS_ERROR_OUT_OF_RESOURCES`
>
>**Error Message:**
>
>```
>terminate called after throwing an instance of 'std::runtime_error'
> what(): hsa error code: 4104 HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
>Aborted (core dumped)
>```
>
>**Solution:**
>
>1. **Missing Groups:**
> - Ensure `video` and `render` groups exist.
>
> ```bash
> sudo usermod -aG video,render $USER
> ```
>
> - Log out and log back in to apply group changes.
>
>### 🐛 Troubleshooting RDCD
>
>- **View RDCD Logs:**
>
> ```bash
> sudo journalctl -u rdc
> ```
>
>- **Run RDCD with Debug Logs:**
>
> ```bash
> RDC_LOG=DEBUG /opt/rocm/bin/rdcd
> ```
>
> - **Logging Levels Supported:** ERROR, INFO, DEBUG
>
>- **Enable Additional Logging Messages:**
>
> ```bash
> export RSMI_LOGGING=3
> ```
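
When compute jobs are launched from a script, the `HSA_TOOLS_LIB` requirement from the rocprofiler entry above can be applied per process instead of exported in the shell. This is a minimal sketch: `profiler_env` is a hypothetical helper, and the library path is the one quoted in this README.

```python
import os
from typing import Dict, Mapping, Optional

ROCPROFILER_LIB = "/opt/rocm/lib/librocprofiler64.so.1"  # path from this README


def profiler_env(base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with HSA_TOOLS_LIB set."""
    env = dict(os.environ if base is None else base)
    env["HSA_TOOLS_LIB"] = ROCPROFILER_LIB
    return env


if __name__ == "__main__":
    # To launch a workload with this environment, pass it to subprocess,
    # e.g. subprocess.run(["gpu-burn"], env=profiler_env()).
    print(profiler_env({"PATH": "/usr/bin"}))
```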
---
## 📄 License
RDC is open-source and available under the [MIT License](https://opensource.org/licenses/MIT).
---
## 📧 Support
For support and further inquiries, please refer to the [**ROCm Documentation**](https://rocm.docs.amd.com/projects/rdc/en/latest/) or contact the maintainers through the repository's issue tracker.