* Added PCIE Atomic Operations enable check. Tests if atomic operations are enabled for GPU devices. Displays the Atomic routing capability via Link capability and status. Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
ROCm Deployment Health Check (RDHC)
Overview
RDHC is a comprehensive health check tool for ROCm deployments. It validates GPU presence, driver status, kernel parameters, library dependencies, and tests installed ROCm components.
Features
- Cross-Platform Support: Works on Ubuntu, RHEL, and SLES distributions
- Comprehensive Testing: GPU validation, driver checks, library dependencies, kernel parameters, and component-specific tests
- Dynamic Component Detection: Automatically identifies installed ROCm components
- Flexible Reporting: Pretty table output and JSON export options
- Configurable Verbosity: Support for verbose, normal, and silent modes
Test Categories
Default Tests (Quick Mode)
- GPU Presence - Detects AMD GPUs in the system
- AMDGPU Driver - Validates driver installation and initialization
- Kernel Parameters - Checks ROCm-related kernel settings
- rocminfo - Validates ROCm information utility
- rocm_agent_enumerator - Checks GPU agent enumeration
- amd-smi - Tests AMD System Management Interface
- Library Dependencies - Validates ROCm library dependencies
- Environment Variables - Checks ROCm-related environment settings
- Multinode Cluster Readiness - Validates network and MPI configuration
- Atomic Operations - Checks if atomic operations are enabled on GPUs
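As an illustration of the Atomic Operations check listed above, here is a minimal sketch that looks for the PCIe AtomicOp requester-enable flag in text shaped like `lspci -vvv` output. It assumes the `AtomicOpsCtl: ReqEn+` / `ReqEn-` notation used by pciutils. The function name and the sample text are hypothetical, and this is not necessarily how RDHC implements the check.

```python
import re

# Assumption: lspci -vvv prints the Device Control 2 flag as
# "AtomicOpsCtl: ReqEn+" (enabled) or "AtomicOpsCtl: ReqEn-" (disabled).
_REQ_EN = re.compile(r"AtomicOpsCtl:.*?ReqEn([+-])")


def atomic_requests_enabled(lspci_text):
    """Return True/False for the first AtomicOpsCtl flag found, or None if absent."""
    match = _REQ_EN.search(lspci_text)
    if match is None:
        return None
    return match.group(1) == "+"


# Illustrative sample only; not captured from a real system.
SAMPLE = """\
        DevCap2: Completion Timeout: Range ABCD, TimeoutDis+
                 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                 AtomicOpsCtl: ReqEn+
"""

print(atomic_requests_enabled(SAMPLE))  # True
```

A `None` result means the flag was not present in the text at all, which is worth reporting separately from "disabled".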
Component Tests (--all mode)
Tests installed ROCm components by compiling and executing example programs:
- HIP (hipcc, hip-runtime-amd)
- Math Libraries (hipBLAS, hipFFT, rocBLAS, rocFFT, etc.)
- Primitives (hipCUB, rocPRIM, rocThrust)
- Solvers (hipSOLVER, rocSOLVER, rocSPARSE)
- Deep Learning (MIOpen)
- Applications (from rocm-examples repository)
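The compile-and-execute pattern behind the component tests can be sketched as follows. This is an assumption-laden outline, not RDHC's actual code: the function name, the status strings, and the default timeout are hypothetical. The only external call is `hipcc <source> -o <binary>`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def compile_and_run(source_file, compiler="hipcc", timeout=120):
    """Compile one source file, run the binary, and return (status, detail)."""
    # Skip instead of failing when the compiler is not installed.
    if shutil.which(compiler) is None:
        return ("SKIPPED", f"{compiler} not found in PATH")
    with tempfile.TemporaryDirectory() as workdir:
        binary = Path(workdir) / "a.out"
        try:
            build = subprocess.run(
                [compiler, str(source_file), "-o", str(binary)],
                capture_output=True, text=True, timeout=timeout,
            )
            if build.returncode != 0:
                return ("FAILED", "compile error: " + build.stderr.strip())
            run = subprocess.run(
                [str(binary)], capture_output=True, text=True, timeout=timeout
            )
        except subprocess.TimeoutExpired:
            return ("FAILED", f"timed out after {timeout}s")
        if run.returncode != 0:
            return ("FAILED", f"exit code {run.returncode}")
        return ("PASSED", run.stdout.strip())
```

Returning a status tuple rather than raising keeps one failing component from aborting the remaining tests.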
Output
The tool provides three types of output:
- Console Output - Real-time test progress and results
- Summary Tables - Formatted tables showing:
  - General system information
  - GPU device information
  - Firmware version information
  - Test results with status and details
- JSON Export - Detailed results in JSON format for further analysis
Install the required pip packages
sudo pip3 install -r requirements.txt
Usage
./rdhc.py -h
usage: sudo -E rdhc.py [options]
ROCm Deployment Health Check Tool
optional arguments:
-h, --help show this help message and exit
--quick Run quick tests only (default)
--all Default tests + Compile and executes simple program for each component.
-v, --verbose Enable verbose output
-s, --silent Silent mode (errors only)
-j FILE, --json FILE Export results to JSON file
-d DIR, --dir DIR Directory path for temporary files (default: /tmp/rdhc/)
Usage examples:
# Run quick test (default tests only)
sudo -E ./rdhc.py
# Run all tests, including compiling and executing the rocm-examples program for each component
sudo -E ./rdhc.py --all
# Run all tests with verbose output
sudo -E ./rdhc.py --all -v
# Enable verbose output
sudo -E ./rdhc.py -v
# Run in silent mode (only errors shown)
sudo -E ./rdhc.py -s
# Export results to a specific JSON file
sudo -E ./rdhc.py --all --json rdhc-results.json
# Specify a directory for temp files and logs (default: /tmp/rdhc/)
sudo -E ./rdhc.py -d /home/user/rdhc-dir/
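When RDHC is launched from another script, the command lines above can be assembled programmatically. The helper below is a hypothetical sketch (its name and parameters are not part of RDHC); it only uses the options shown in the help output above.

```python
def build_rdhc_command(run_all=False, verbose=False, silent=False,
                       json_file=None, temp_dir=None, script="./rdhc.py"):
    """Return the argument list for one RDHC invocation under sudo -E."""
    cmd = ["sudo", "-E", script]
    if run_all:
        cmd.append("--all")            # default tests plus component tests
    if verbose:
        cmd.append("-v")
    if silent:
        cmd.append("-s")
    if json_file:
        cmd += ["--json", json_file]   # export results to this file
    if temp_dir:
        cmd += ["-d", temp_dir]        # directory for temporary files
    return cmd


print(build_rdhc_command(run_all=True, json_file="rdhc-results.json"))
```

The resulting list can be passed directly to `subprocess.run`, which avoids shell quoting issues with paths that contain spaces.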
RDHC Environment Variables
RDHC reads the following environment variables, if they are set, and adjusts its behavior accordingly.
# Set the ROCm installation path with the variable below. Default is "/opt/rocm/"
export ROCM_PATH="/opt/rocm"
# For library dependency validation, set the library search depth with the variable below.
# Default is full depth: all library files under ROCM_PATH/lib/ are checked recursively.
export LIBDIR_MAX_DEPTH=""
# To check only the libraries directly inside ROCM_PATH/lib/, set the depth to 1.
export LIBDIR_MAX_DEPTH=1
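The effect of these two variables can be sketched as a depth-limited directory walk. This is an illustration under stated assumptions, not RDHC's implementation: the function name is hypothetical, depth 1 is taken to mean ROCM_PATH/lib/ itself, and a shared library is taken to be any file whose name ends in `.so` or contains `.so.`.

```python
import os


def find_shared_libs(environ=None):
    """List shared libraries under ROCM_PATH/lib, honoring LIBDIR_MAX_DEPTH."""
    env = os.environ if environ is None else environ
    rocm_path = env.get("ROCM_PATH", "/opt/rocm")
    raw_depth = env.get("LIBDIR_MAX_DEPTH", "").strip()
    max_depth = int(raw_depth) if raw_depth else None  # empty means full depth
    lib_root = os.path.join(rocm_path, "lib")

    found = []
    for dirpath, dirnames, filenames in os.walk(lib_root):
        rel = os.path.relpath(dirpath, lib_root)
        depth = 1 if rel == "." else rel.count(os.sep) + 2
        if max_depth is not None and depth >= max_depth:
            dirnames[:] = []  # stop os.walk from descending any further
        for name in filenames:
            if name.endswith(".so") or ".so." in name:
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    print(len(find_shared_libs()), "shared libraries found")
```

Passing a dictionary as `environ` makes the behavior easy to try without changing the real environment.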
Troubleshooting
Python Package Installation Issues (Ubuntu 24.04)
If sudo pip3 install fails with an "externally-managed-environment" error (common on Ubuntu 24.04), use a Python virtual environment instead:
# Create a virtual environment (one-time setup)
python3 -m venv ~/rdhc-venv
# Activate the virtual environment
source ~/rdhc-venv/bin/activate
# Install required packages
pip3 install -r requirements.txt
Note for Ubuntu 24.04 users: Due to enhanced security policies, sudo -E does not preserve the virtual environment PATH. Replace all sudo -E commands with sudo --preserve-env=PATH in the usage examples above.
For example:
# Instead of: sudo -E ./rdhc.py
# Use:
source ~/rdhc-venv/bin/activate
sudo --preserve-env=PATH ./rdhc.py
# Run all tests
sudo --preserve-env=PATH ./rdhc.py --all
# Run with verbose output
sudo --preserve-env=PATH ./rdhc.py -v
RDHC can be extended with additional component tests: add a new test method that follows the naming convention test_check_component_name().
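A naming convention like this is typically paired with prefix-based discovery. The sketch below shows one way such discovery could work. The class `HealthCheck`, its two test methods, and the return values are hypothetical and are not taken from the RDHC source.

```python
class HealthCheck:
    """Hypothetical stand-in for a class that holds the component tests."""

    def test_check_hipcc(self):
        return "PASS"

    def test_check_rocblas(self):
        return "PASS"

    def helper(self):
        return "ignored"  # no test_check_ prefix, so it is never discovered


def discover_component_tests(obj, prefix="test_check_"):
    """Map component name to bound method for every method with the prefix."""
    tests = {}
    for name in sorted(dir(obj)):
        attr = getattr(obj, name)
        if name.startswith(prefix) and callable(attr):
            tests[name[len(prefix):]] = attr
    return tests


results = {
    component: test()
    for component, test in discover_component_tests(HealthCheck()).items()
}
print(results)  # {'hipcc': 'PASS', 'rocblas': 'PASS'}
```

With this pattern, adding a method named `test_check_<component>` is enough for the new test to be picked up without editing a registry.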