[rocm-core] Adding a tool for ROCM Deployment Health Check (#958)

* Adding a tool for ROCM Deployment Health Check

rdhc.py - This simple tool checks whether ROCm is installed on the
current system, whether it is ready for use, and whether it is working.
See the README file for more information.

Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
solaiys
2025-09-24 22:43:42 +05:30
committed by GitHub
parent 83fb0c8c47
commit 8912930840
3 changed files with 1920 additions and 0 deletions
@@ -0,0 +1,81 @@
# rdhc
ROCm Deployment Health Check Tool
## Features of the ROCm Deployment Health Check Tool
1. **Cross-Platform Support**: Works on Ubuntu, RHEL, and SLES distributions
2. **Comprehensive Testing**:
   - Default tests (GPU presence, driver status, rocminfo, rocm-smi)
   - Library dependency verification
   - Checks for the presence of selected kernel parameters and environment variables
   - Component-specific tests
   - Builds and runs, at run time, the test programs from the rocm-examples Git repository
3. **Dynamic Component Detection**: Identifies installed ROCm components using distribution-specific package manager commands
4. **Flexible Reporting**:
   - Pretty table output for terminal display
   - JSON export for further analysis or integration
5. **Configurable Verbosity**: Set through command-line options (`-v` for verbose, `-s` for silent)
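
The dynamic component detection described in item 3 can be pictured with a small sketch. It is not taken from `rdhc.py`: the `LIST_COMMANDS` mapping, the function names, and the keyword filter are all assumptions made for illustration. Only the underlying `dpkg-query` and `rpm` queries are standard package manager commands.

```python
import subprocess

# Hypothetical mapping from a distribution ID to a package-listing command.
# dpkg-query and rpm expand the "\n" escape in the format string themselves.
LIST_COMMANDS = {
    "ubuntu": ["dpkg-query", "-W", "-f=${Package}\\n"],
    "rhel": ["rpm", "-qa", "--qf", "%{NAME}\\n"],
    "sles": ["rpm", "-qa", "--qf", "%{NAME}\\n"],
}


def list_command(distro_id):
    """Return the package-listing command for a distribution ID."""
    try:
        return LIST_COMMANDS[distro_id.lower()]
    except KeyError:
        raise ValueError(f"unsupported distribution: {distro_id}") from None


def filter_packages(names, keyword="rocm"):
    """Keep unique package names containing the keyword (assumed filter)."""
    return sorted({n.strip() for n in names if keyword in n.lower()})


def installed_packages(distro_id):
    """Run the listing command and return the matching package names."""
    result = subprocess.run(
        list_command(distro_id), capture_output=True, text=True, check=True
    )
    return filter_packages(result.stdout.splitlines())
```

Keeping the command lookup and the filtering separate from the `subprocess` call makes both easy to check without a package manager installed.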
## Install pip dependencies
```bash
sudo pip3 install -r requirements.txt
```
## Usage
```bash
./rdhc.py -h
usage: sudo -E rdhc.py [options]

ROCm Deployment Health Check Tool

optional arguments:
  -h, --help            show this help message and exit
  --quick               Run quick tests only (default)
  --all                 Default tests + Compile and executes simple program for each component.
  -v, --verbose         Enable verbose output
  -s, --silent          Silent mode (errors only)
  -j FILE, --json FILE  Export results to JSON file
  -d DIR, --dir DIR     Directory path for temporary files (default: /tmp/rdhc/)

Usage examples:
  # Run quick test (default tests only)
  sudo -E ./rdhc.py
  # Run all tests including compile and execute the rocm-example program for each component
  sudo -E ./rdhc.py --all
  # Run all tests with verbose output
  sudo -E ./rdhc.py --all -v
  # Enable verbose output
  sudo -E ./rdhc.py -v
  # Run in silent mode (only errors shown)
  sudo -E ./rdhc.py -s
  # Export results to a specific JSON file
  sudo -E ./rdhc.py --all --json rdhc-results.json
  # Specify a directory for temp files and logs (default: /tmp/rdhc/)
  sudo -E ./rdhc.py -d /home/user/rdhc-dir/
```
```
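
For readers who want to see how an interface like the one above could be declared, here is a minimal `argparse` sketch. It mirrors the option names and defaults from the help text, but it is an illustration only: `build_parser` and `select_mode` are hypothetical names, and how the real tool resolves `--quick` versus `--all` is an assumption.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the options listed in the help text."""
    parser = argparse.ArgumentParser(
        prog="rdhc.py",
        usage="sudo -E rdhc.py [options]",
        description="ROCm Deployment Health Check Tool",
    )
    parser.add_argument("--quick", action="store_true",
                        help="Run quick tests only (default)")
    parser.add_argument("--all", action="store_true",
                        help="Default tests plus compile-and-run tests")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose output")
    parser.add_argument("-s", "--silent", action="store_true",
                        help="Silent mode (errors only)")
    parser.add_argument("-j", "--json", metavar="FILE",
                        help="Export results to JSON file")
    parser.add_argument("-d", "--dir", metavar="DIR", default="/tmp/rdhc/",
                        help="Directory path for temporary files")
    return parser


def select_mode(args):
    """Assumed rule: --all wins when given, otherwise run the quick tests."""
    return "all" if args.all else "quick"
```

With this sketch, `build_parser().parse_args(["--all", "-v"])` yields a namespace for which `select_mode` returns `"all"`, and the temporary directory falls back to `/tmp/rdhc/`.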
## RDHC Environment Variables
The RDHC tool reads the following environment variables and adjusts its behavior when they are set.
```bash
# The ROCm installation path can be set with the ROCM_PATH environment variable. The default is "/opt/rocm/".
export ROCM_PATH="/opt/rocm"
# For library dependency validation, LIBDIR_MAX_DEPTH sets the library search depth.
# The default is full depth: all library files under ROCM_PATH/lib/ are checked recursively.
export LIBDIR_MAX_DEPTH=""
# To check only the libraries directly inside ROCM_PATH/lib/, set the depth to 1.
export LIBDIR_MAX_DEPTH=1
```
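
The effect of these two variables can be sketched in Python. This is an assumed reading of the behavior described above, not code from `rdhc.py`: the helper names are hypothetical, and the rule for what counts as a library file (a name ending in `.so` or containing `.so.`) is a simplification.

```python
import os


def rocm_path(env=None):
    """Return ROCM_PATH, falling back to the documented default location."""
    env = os.environ if env is None else env
    return env.get("ROCM_PATH") or "/opt/rocm"


def max_depth(env=None):
    """Return LIBDIR_MAX_DEPTH as an int, or None for full depth."""
    env = os.environ if env is None else env
    raw = env.get("LIBDIR_MAX_DEPTH", "").strip()
    return int(raw) if raw else None


def find_libs(lib_dir, depth_limit=None):
    """List shared libraries under lib_dir; depth 1 means lib_dir only."""
    base = lib_dir.rstrip(os.sep).count(os.sep)
    found = []
    for root, dirs, files in os.walk(lib_dir):
        depth = root.rstrip(os.sep).count(os.sep) - base + 1
        if depth_limit is not None and depth >= depth_limit:
            dirs[:] = []  # stop os.walk from descending any further
        for name in files:
            if name.endswith(".so") or ".so." in name:
                found.append(os.path.join(root, name))
    return sorted(found)
```

A caller would combine them as `find_libs(os.path.join(rocm_path(), "lib"), max_depth())`.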
The tool is designed to be easily extended with additional component tests by
adding new test methods following the naming convention `test_check_component_name()`.
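
One way such a naming convention can drive test discovery is sketched below. The `HealthCheck` class, its methods, and the returned values are hypothetical placeholders, and the assumption that methods are found by the `test_check_` prefix is inferred from the convention above rather than taken from `rdhc.py`.

```python
class HealthCheck:
    """Toy stand-in for the tool's test class (hypothetical)."""

    PREFIX = "test_check_"

    def test_check_rocminfo(self):
        return True  # placeholder result, no real check is run

    def test_check_rocm_smi(self):
        return False  # placeholder result, no real check is run

    def helper(self):
        return None  # ignored: the name does not match the convention

    def discover(self):
        """Return the names of all methods that follow the convention."""
        return sorted(
            name for name in dir(self)
            if name.startswith(self.PREFIX) and callable(getattr(self, name))
        )

    def run_all(self):
        """Run each discovered test and map component name to its result."""
        return {
            name[len(self.PREFIX):]: getattr(self, name)()
            for name in self.discover()
        }
```

Under this scheme, adding a new component test only requires defining another method with the expected prefix; no registration list has to be updated.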
+1837
File diff is too large to display
@@ -0,0 +1,2 @@
prettytable>=3.14.0
PyYAML>=5.4.1