Added PCIE Atomic Operations enable check. (#1746)
* Added PCIE Atomic Operations enable check. Tests if atomic operations are enabled for GPU devices. Displays the Atomic routing capability via Link capability and status. Signed-off-by: Saravanan Solaiyappan <saravanan.solaiyappan@amd.com>
这个提交包含在:
@@ -1,28 +1,64 @@
|
||||
# rdhc
|
||||
Rocm Deployment Health Check Tool
|
||||
# ROCm Deployment Health Check (RDHC)
|
||||
|
||||
## Overview
|
||||
|
||||
RDHC is a comprehensive health check tool for ROCm deployments. It validates GPU presence, driver status, kernel parameters, library dependencies, and tests installed ROCm components.
|
||||
|
||||
|
||||
## Features of the ROCm Deployment Health Check Tool
|
||||
## Features
|
||||
|
||||
- **Cross-Platform Support**: Works on Ubuntu, RHEL, and SLES distributions
|
||||
- **Comprehensive Testing**: GPU validation, driver checks, library dependencies, kernel parameters, and component-specific tests
|
||||
- **Dynamic Component Detection**: Automatically identifies installed ROCm components
|
||||
- **Flexible Reporting**: Pretty table output and JSON export options
|
||||
- **Configurable Verbosity**: Support for verbose, normal, and silent modes
|
||||
|
||||
|
||||
## Test Categories
|
||||
|
||||
### Default Tests (Quick Mode)
|
||||
|
||||
1. **GPU Presence** - Detects AMD GPUs in the system
|
||||
2. **AMDGPU Driver** - Validates driver installation and initialization
|
||||
3. **Kernel Parameters** - Checks ROCm-related kernel settings
|
||||
4. **rocminfo** - Validates ROCm information utility
|
||||
5. **rocm_agent_enumerator** - Checks GPU agent enumeration
|
||||
6. **amd-smi** - Tests AMD System Management Interface
|
||||
7. **Library Dependencies** - Validates ROCm library dependencies
|
||||
8. **Environment Variables** - Checks ROCm-related environment settings
|
||||
9. **Multinode Cluster Readiness** - Validates network and MPI configuration
|
||||
10. **Atomic Operations** - Checks if atomic operations are enabled on GPUs
|
||||
|
||||
### Component Tests (--all mode)
|
||||
|
||||
Tests installed ROCm components by compiling and executing example programs:
|
||||
|
||||
- HIP (hipcc, hip-runtime-amd)
|
||||
- Math Libraries (hipBLAS, hipFFT, rocBLAS, rocFFT, etc.)
|
||||
- Primitives (hipCUB, rocPRIM, rocThrust)
|
||||
- Solvers (hipSOLVER, rocSOLVER, rocSPARSE)
|
||||
- Deep Learning (MIOpen)
|
||||
- Applications (from rocm-examples repository)
|
||||
|
||||
## Output
|
||||
|
||||
The tool provides three types of output:
|
||||
|
||||
1. **Console Output** - Real-time test progress and results
|
||||
2. **Summary Tables** - Formatted tables showing:
|
||||
- General system information
|
||||
- GPU device information
|
||||
- Firmware version information
|
||||
- Test results with status and details
|
||||
3. **JSON Export** - Detailed results in JSON format for further analysis
|
||||
|
||||
1. **Cross-Platform Support**: Works on Ubuntu, RHEL, and SLES distributions
|
||||
2. **Comprehensive Testing**:
|
||||
- Default tests (GPU presence, driver status, rocminfo, rocm-smi)
|
||||
- Library dependency verification
|
||||
- Check some kernel parameters and ENV variables presence
|
||||
- Component-specific tests
|
||||
- Build and test the test program available from rocm-examples git repo dynamically.
|
||||
3. **Dynamic Component Detection**: Identifies installed ROCm components using distribution-specific package manager commands
|
||||
4. **Flexible Reporting**:
|
||||
- Pretty table output for terminal display
|
||||
- JSON export for further analysis or integration
|
||||
5. **Configurable Verbosity**: Through command-line options (`-v` for verbose, `-s` for silent)
|
||||
|
||||
## Install dependency pip packages
|
||||
|
||||
```bash
|
||||
sudo pip3 install -r requirements.txt
|
||||
|
||||
```
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
@@ -61,12 +97,12 @@ sudo -E ./rdhc.py --all --json rdhc-results.json
|
||||
|
||||
# Specify a directory for temp files and logs (default: /tmp/rdhc/)
|
||||
sudo -E ./rdhc.py -d /home/user/rdhc-dir/
|
||||
|
||||
```
|
||||
|
||||
## RDHC Environment VARIABLES
|
||||
RDHC tool will use the following ENV varaibles and act accordingly if they are set.
|
||||
RDHC tool will use the following ENV variables and act accordingly if they are set.
|
||||
```bash
|
||||
# ROCm installation path can be set by the below ENV varaible. Default is "/opt/rocm/"
|
||||
# ROCm installation path can be set by the below ENV variable. Default is "/opt/rocm/"
|
||||
export ROCM_PATH="/opt/rocm"
|
||||
|
||||
# For library dependency validation, the lib search depth can be set by the below ENV.
|
||||
@@ -75,7 +111,41 @@ export LIBDIR_MAX_DEPTH=""
|
||||
|
||||
# if you want to check the libs only from the ROCM_PATH/lib/ folder set the depth as 1.
|
||||
export LIBDIR_MAX_DEPTH=1
|
||||
|
||||
```
|
||||
The tool is designed to be easily extended with additional component tests by
|
||||
adding new test methods following the naming convention `test_check_component_name()`.
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Python Package Installation Issues (Ubuntu 24.04)
|
||||
|
||||
If `sudo pip3 install` fails with an "externally-managed-environment" error (common in Ubuntu 24.04), use a Python virtual environment instead:
|
||||
|
||||
```bash
|
||||
# Create a virtual environment (one-time setup)
|
||||
python3 -m venv ~/rdhc-venv
|
||||
|
||||
# Activate the virtual environment
|
||||
source ~/rdhc-venv/bin/activate
|
||||
|
||||
# Install required packages
|
||||
pip3 install -r requirements.txt
|
||||
```
|
||||
|
||||
**Note for Ubuntu 24.04 users:** Due to enhanced security policies, `sudo -E` does not preserve the virtual environment PATH. Replace all `sudo -E` commands with `sudo --preserve-env=PATH` in the usage examples above.
|
||||
|
||||
For example:
|
||||
```bash
|
||||
# Instead of: sudo -E ./rdhc.py
|
||||
# Use:
|
||||
source ~/rdhc-venv/bin/activate
|
||||
sudo --preserve-env=PATH ./rdhc.py
|
||||
|
||||
# Run all tests
|
||||
sudo --preserve-env=PATH ./rdhc.py --all
|
||||
|
||||
# Run with verbose output
|
||||
sudo --preserve-env=PATH ./rdhc.py -v
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
The tool is designed to be easily extended with additional component tests by adding new test methods following the naming convention `test_check_component_name()`.
|
||||
|
||||
+171
-14
@@ -476,8 +476,7 @@ class ROCMHealthCheck:
|
||||
|
||||
# AMD GPUs PCI class codes: 03xx (Display controllers ), 12xx (Processing accelerators)
|
||||
# use class codes also to identify AMD GPUs
|
||||
stdout, _, ret_code = run_command( "lspci -d 1002: -nn | grep -Ei \
|
||||
'Display controller|Processing accelerators|\[03[[:xdigit:]]{2}\]|\[12[[:xdigit:]]{2}\]' ",\
|
||||
stdout, _, ret_code = run_command( r"lspci -d 1002: -nn | grep -Ei 'Display controller|Processing accelerators|\[03[[:xdigit:]]{2}\]|\[12[[:xdigit:]]{2}\]' ",\
|
||||
shell=True)
|
||||
gpu_hw = stdout.strip()
|
||||
if ret_code == 0 and gpu_hw:
|
||||
@@ -1254,7 +1253,7 @@ class ROCMHealthCheck:
|
||||
# 2. Check if network cards (NICs) are present in hardware list
|
||||
self.logger.info("----Checking for network interface cards...")
|
||||
nic_brand = None
|
||||
nic_cards, stderr, ret_code = run_command("lspci -nn | grep -i 'ethernet\|network\|infiniband'", shell=True)
|
||||
nic_cards, stderr, ret_code = run_command("lspci -nn | grep -Ei 'ethernet|network|infiniband'", shell=True)
|
||||
if ret_code != 0 or not nic_cards.strip():
|
||||
errors += 1
|
||||
cluster_readiness_issues.append("No network cards found in hardware")
|
||||
@@ -1348,6 +1347,137 @@ class ROCMHealthCheck:
|
||||
self.logger.info(f" {success_msg}")
|
||||
return TestStatus.PASS.value, success_msg
|
||||
|
||||
def test_check_atomic_operations(self):
|
||||
"""Test if atomic operations are enabled for GPU devices"""
|
||||
self.logger.info("--Checking atomic operations support for GPU devices...")
|
||||
|
||||
# Find AMD GPU devices using lspci
|
||||
stdout, stderr, ret_code = run_command("lspci -d 1002: -nn | grep -Ei 'Display controller|Processing accelerators|VGA compatible controller'", shell=True)
|
||||
|
||||
if ret_code != 0 or not stdout.strip():
|
||||
self.logger.error("!!! No AMD GPU devices found")
|
||||
return TestStatus.FAIL.value, "No AMD GPU devices found to check atomic operations."
|
||||
|
||||
gpu_devices = stdout.strip().split('\n')
|
||||
self.logger.info(f"----Found {len(gpu_devices)} AMD GPU device(s)")
|
||||
|
||||
def parse_atomic_details(stdout_detail, pci_address):
|
||||
"""Parse atomic operations details from lspci output"""
|
||||
atomic_cap_found = False
|
||||
atomic_enabled = False
|
||||
|
||||
for line in stdout_detail.strip().split('\n'):
|
||||
line = line.strip()
|
||||
|
||||
if "AtomicOpsCap:" in line:
|
||||
atomic_cap_found = True
|
||||
self.logger.debug(f"------Device {pci_address}: {line}")
|
||||
|
||||
if "AtomicOpsCtl:" in line:
|
||||
# Check if ReqEn+ (Request Enable is set)
|
||||
if "ReqEn+" in line:
|
||||
atomic_enabled = True
|
||||
self.logger.debug(f"------Device {pci_address}: {line}")
|
||||
|
||||
return atomic_cap_found, atomic_enabled
|
||||
|
||||
def check_device_atomic_ops(gpu_line):
|
||||
"""Check atomic operations for a single GPU device"""
|
||||
# Extract PCI address using regex (e.g., "01:00.0")
|
||||
pci_match = re.match(r'^([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])', gpu_line.strip())
|
||||
|
||||
if not pci_match:
|
||||
self.logger.warning(f"!!! Could not extract PCI address from line: {gpu_line}")
|
||||
return None, "check_failed", f"Invalid format: Could not extract PCI address"
|
||||
|
||||
pci_address = pci_match.group(1)
|
||||
|
||||
# Get atomic operations info using grep to filter relevant lines
|
||||
stdout_detail, stderr_detail, ret_detail = run_command(
|
||||
f"lspci -vvv -s {pci_address} | grep -i atomic",
|
||||
shell=True
|
||||
)
|
||||
|
||||
if ret_detail != 0 or not stdout_detail.strip():
|
||||
self.logger.warning(f"!!! Failed to get atomic operations info for device {pci_address}")
|
||||
self.logger.warning(f"!!! Try running the test with 'sudo -E' ")
|
||||
return pci_address, "check_failed", f"{pci_address}: Check failed"
|
||||
|
||||
# Parse AtomicOpsCap and AtomicOpsCtl
|
||||
atomic_cap_found, atomic_enabled = parse_atomic_details(stdout_detail, pci_address)
|
||||
|
||||
# Determine device status
|
||||
if atomic_cap_found and atomic_enabled:
|
||||
status_msg = f"{pci_address}: Supported and Enabled"
|
||||
self.logger.info(f"------{status_msg}")
|
||||
return pci_address, "enabled", status_msg
|
||||
elif atomic_cap_found and not atomic_enabled:
|
||||
status_msg = f"{pci_address}: Supported but NOT Enabled (ReqEn-)"
|
||||
self.logger.warning(f"!!! {status_msg}")
|
||||
return pci_address, "disabled", status_msg
|
||||
else:
|
||||
status_msg = f"{pci_address}: Capability not found or unclear"
|
||||
self.logger.warning(f"!!! {status_msg}")
|
||||
return pci_address, "check_failed", status_msg
|
||||
|
||||
def check_pcie_atomic_routing_capability(pci_address):
|
||||
"""Check PCIe generation and lane configuration for atomic routing"""
|
||||
|
||||
stdout, stderr, ret_code = run_command(
|
||||
f"lspci -vvv -s {pci_address} | grep -E 'LnkCap:|LnkSta:'",
|
||||
shell=True
|
||||
)
|
||||
|
||||
if ret_code == 0 and stdout.strip():
|
||||
self.logger.debug(f"------PCIe Link Capabilities for {pci_address}:")
|
||||
for line in stdout.strip().split('\n'):
|
||||
self.logger.info(f"--------{line.strip()}")
|
||||
|
||||
# Check for PCIe Gen4/Gen5 which have better atomic support
|
||||
if "LnkSta" in line and "Speed" in line:
|
||||
if "16GT/s" in line: # PCIe Gen4
|
||||
self.logger.info(f"------Device {pci_address}: PCIe Gen4 (16GT/s) - Good atomic routing capability")
|
||||
elif "32GT/s" in line: # PCIe Gen5
|
||||
self.logger.info(f"------Device {pci_address}: PCIe Gen5 (32GT/s) - Excellent atomic routing capability")
|
||||
elif "8GT/s" in line: # PCIe Gen3
|
||||
self.logger.warning(f"!!! Device {pci_address}: PCIe Gen3 (8GT/s) - Limited atomic routing capability")
|
||||
|
||||
|
||||
atomic_ops_status = []
|
||||
devices_with_atomics = 0
|
||||
devices_without_atomics = 0
|
||||
check_failed_devices = 0
|
||||
|
||||
# Check atomic operations for each GPU device
|
||||
for gpu_line in gpu_devices:
|
||||
pci_address, status_type, status_msg = check_device_atomic_ops(gpu_line)
|
||||
atomic_ops_status.append(status_msg)
|
||||
if pci_address is not None:
|
||||
check_pcie_atomic_routing_capability(pci_address)
|
||||
|
||||
if status_type == "enabled":
|
||||
devices_with_atomics += 1
|
||||
elif status_type == "disabled":
|
||||
devices_without_atomics += 1
|
||||
else: # check_failed
|
||||
check_failed_devices += 1
|
||||
|
||||
# Log summary
|
||||
self.logger.info(f"----Atomic operations summary:")
|
||||
self.logger.info(f"------Devices with atomic ops enabled: {devices_with_atomics}")
|
||||
self.logger.info(f"------Devices with atomic ops disabled: {devices_without_atomics}")
|
||||
if check_failed_devices > 0:
|
||||
self.logger.info(f"------Devices with check failed/unclear: {check_failed_devices}")
|
||||
|
||||
# Determine overall status
|
||||
if devices_without_atomics > 0:
|
||||
return TestStatus.FAIL.value, f"Atomic operations not enabled for {devices_without_atomics} device(s). Details: {'; '.join(atomic_ops_status)}"
|
||||
elif check_failed_devices > 0:
|
||||
return TestStatus.FAIL.value, f"Atomic operations check completed with {check_failed_devices} warning(s). Details: {'; '.join(atomic_ops_status)}"
|
||||
else:
|
||||
return TestStatus.PASS.value, f"Atomic operations supported and enabled on all {devices_with_atomics} GPU device(s)."
|
||||
|
||||
|
||||
# Example component specific tests (these should be customized for each component)
|
||||
def test_check_hipcc(self):
|
||||
"""Test hipcc package"""
|
||||
@@ -1566,54 +1696,70 @@ class ROCMHealthCheck:
|
||||
# TODO
|
||||
return TestStatus.PASS.value, f"{component} is installed but no specific test available."
|
||||
|
||||
def _print_test_start(self, test_name):
|
||||
"""Print a separator line and test start message
|
||||
|
||||
Args:
|
||||
test_name (str): Name of the test being run
|
||||
"""
|
||||
separator = "=" * 80
|
||||
print(f"\n{separator}")
|
||||
self.logger.info(f"Running test: {test_name}...")
|
||||
|
||||
def run_default_tests(self):
|
||||
"""Run the default set of tests"""
|
||||
results = {}
|
||||
|
||||
# Test 1: GPU Presence
|
||||
self.logger.info("Running test: GPU Presence...")
|
||||
self._print_test_start("GPU Presence")
|
||||
status, reason = self.test_GPUPresence()
|
||||
results["gpu_presence"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 2: AMDGPU Driver
|
||||
self.logger.info("Running test: AMDGPU Driver...")
|
||||
self._print_test_start("AMDGPU Driver")
|
||||
status, reason = self.test_amdgpu_driver()
|
||||
results["amdgpu_driver"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 3: Kernel Parameters
|
||||
self.logger.info("Running test: Kernel Parameters...")
|
||||
self._print_test_start("Kernel Parameters")
|
||||
status, reason = self.test_check_kernel_parameters()
|
||||
results["kernel_parameters"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 4: rocminfo
|
||||
self.logger.info("Running test: rocminfo...")
|
||||
self._print_test_start("rocminfo")
|
||||
status, reason = self.test_rocminfo()
|
||||
results["rocminfo"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 5: rocm_agent_enumerator
|
||||
self.logger.info("Running test: rocm_agent_enumerator...")
|
||||
self._print_test_start("rocm_agent_enumerator")
|
||||
status, reason = self.test_rocm_agent_enumerator()
|
||||
results["rocm_agent_enumerator"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 6: amd-smi
|
||||
self.logger.info("Running test: amd-smi...")
|
||||
self._print_test_start("amd-smi")
|
||||
status, reason = self.test_amd_smi()
|
||||
results["amd_smi"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 7: Library Dependencies
|
||||
self.logger.info("Running test: Library Dependencies...")
|
||||
self._print_test_start("Library Dependencies")
|
||||
status, reason = self.test_check_lib_dependencies()
|
||||
results["lib_dependencies"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 8: Environment Variables
|
||||
self.logger.info("Running test: ENV variables...")
|
||||
self._print_test_start("ENV variables")
|
||||
status, reason = self.test_check_env_variables()
|
||||
results["env_variables"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 9: Multinode cluster readiness
|
||||
self.logger.info("Running test: Multinode cluster readiness...")
|
||||
self._print_test_start("Multinode cluster readiness")
|
||||
status, reason = self.test_check_multinode_cluster_readiness()
|
||||
results["Multinode_Readiness"] = {"status": status, "reason": reason}
|
||||
|
||||
# Test 10: Atomic Operations
|
||||
self._print_test_start("Is Atomic Operations Enabled")
|
||||
status, reason = self.test_check_atomic_operations()
|
||||
results["atomic_operations"] = {"status": status, "reason": reason}
|
||||
|
||||
return results
|
||||
|
||||
def run_component_tests(self):
|
||||
@@ -1622,7 +1768,7 @@ class ROCMHealthCheck:
|
||||
|
||||
for component in self.installed_components:
|
||||
if component not in self.exclude_list:
|
||||
self.logger.info(f"Running component test: {component}...")
|
||||
self._print_test_start(f"Component - {component}")
|
||||
status, reason = self.test_component(component)
|
||||
results[component] = {"status": status, "reason": reason}
|
||||
|
||||
@@ -1638,7 +1784,7 @@ class ROCMHealthCheck:
|
||||
|
||||
# Run tests for each application target
|
||||
for target in self.rocm_examples_targets.get("applications", []):
|
||||
self.logger.info(f"Running application test: [ {target} ]...")
|
||||
self._print_test_start(f"Application - {target}")
|
||||
status, reason = self._build_target_and_run(target, target)
|
||||
results[target] = {"status": status, "reason": reason}
|
||||
|
||||
@@ -1817,6 +1963,17 @@ def main():
|
||||
"\n"+
|
||||
"# Specify a directory for temp files and logs (default: /tmp/rdhc/)\n" +
|
||||
"sudo -E ./rdhc.py -d /home/user/rdhc-dir/\n" +
|
||||
"\n"+
|
||||
"NOTE for Ubuntu 24.04 (Python 3.12) users:\n" +
|
||||
"Due to enhanced security policies, you must use a virtual environment:\n" +
|
||||
" # Create and activate virtual environment (one-time setup)\n" +
|
||||
" python3 -m venv ~/rdhc-venv\n" +
|
||||
" source ~/rdhc-venv/bin/activate\n" +
|
||||
" pip3 install -r requirements.txt\n" +
|
||||
"\n" +
|
||||
" # Run the tool (use --preserve-env=PATH instead of -E)\n" +
|
||||
" sudo --preserve-env=PATH ./rdhc.py\n" +
|
||||
" sudo --preserve-env=PATH ./rdhc.py --all\n" +
|
||||
" ",
|
||||
)
|
||||
|
||||
|
||||
在新工单中引用
屏蔽一个用户