Files
Atul Kulkarni 30d36661c2 Adds Python-based test runner for RCCL (#2034)
* Added python test runner to execute rccl tests

* Disabled capture output to avoid hangs

* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile

* Converted test_type to boolean gtest flag

* Removed unused return values

* Added custom rccl library usage

* Removed json output

* Updates to test_runner: added num_gpus field

* Address review comments

* Prepend env vars for single node, single process executions

* Added separate enums for exit and result codes

* Update configuration files

* Moved configurations to its own dir

* Address review comments

* Update tools/scripts/test_runner/README.md

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

---------

Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>

[ROCm/rccl commit: 0c2c61d2f1]
2026-01-08 10:04:41 -06:00

985 líneas
27 KiB
Markdown

# RCCL Test Runner
A Python-based test runner focused on RCCL unit and functional tests with hierarchical configuration support and integrated code coverage reporting. Extensible to support performance benchmarks, MPI tests, and custom test scripts.
## Overview
This test runner provides a maintainable, extensible alternative to shell-based test execution. It uses JSON configuration files with hierarchical inheritance, and integrates with LLVM code coverage tools.
## Key Features
- **Multiple Test Types**: Support for GTest, performance tests, and custom executables
- **Hierarchical Configuration**: Use `"extends"` directive to inherit and merge configurations
- **Environment Variable Management**: Global, configuration, suite, and test-specific environment variables
- **Path Variable Expansion**: Use environment variables in paths with nested default value expansion
- **Custom Library Support**: Use pre-built RCCL libraries from custom locations via environment variables
- **Configurable Build System**: Customize CMake options, environment variables, and parallel jobs via config
- **MPI Support**: Full support for multi-rank and multi-node tests
- **Flexible Test Filtering**: Run all tests, specific test suites, or individual tests
- **Build Integration**: Automated RCCL building with CMake
- **Code Coverage**: Integrated LLVM coverage report generation (HTML and text)
- **Clean Output**: Automatic filtering of MPI verbose messages (enable with --verbose)
- **Verbose Logging**: Detailed output for debugging and troubleshooting
## Quick Start
### Basic Usage
```bash
# Run with specific configuration
python test_runner.py --config my_tests.json
# Run with verbose output
python test_runner.py --config my_tests.json --verbose
# Run specific test by name
python test_runner.py --config my_tests.json --test-name SHM_ComprehensiveWorkflow
```
### Generate Coverage Report
```bash
# Build, run tests, and generate coverage report
python test_runner.py --config test_config_sample.json --coverage-report --verbose
# Use existing build and generate coverage
python test_runner.py --config test_config_sample.json --no-build --coverage-report
```
### Use Custom RCCL Library
```bash
# Use pre-built RCCL library from custom location
export RCCL_LIB_PATH=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json
# Or use RCCL_BUILD_DIR (alternative name)
export RCCL_BUILD_DIR=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json
# When set, build step is automatically skipped
# --no-build is not needed
```
## Environment Variables
The test runner supports the following environment variables to customize behavior:
### Library and Build Configuration
| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory (contains `librccl.so` and `test/` subdirectory). When set, the build step is automatically skipped. | `/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. Either variable can be used. | `/path/to/rccl/build` |
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests. | `~/.mpi_hostfile` |
### Configuration Path Variables
These can be overridden via environment variables or specified in the JSON config:
| Variable | Description | Default |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | Current rccl repository root |
| `ROCM_PATH` | ROCm installation path | `/opt/rocm` |
| `MPI_PATH` | MPI installation path | System default or config-specific |
### Priority Order
When determining which RCCL library to use, the test runner follows this priority:
1. **`RCCL_LIB_PATH` or `RCCL_BUILD_DIR` environment variable** (highest priority)
- Skips build automatically
- Must contain `librccl.so` and `test/` subdirectory
2. **`--no-build` flag with local build**
- Uses local `build_debug_cov_on_tests_on/` directory
- Requires prior build
3. **Default build process** (lowest priority)
- Builds RCCL in timestamped directory
- Uses CMake configuration from JSON
**Example Usage:**
```bash
# Priority 1: Use custom library (build skipped automatically)
export RCCL_LIB_PATH=/path/to/prebuilt/rccl/build
python test_runner.py --config my_tests.json
# Priority 2: Use existing local build (no new build)
python test_runner.py --config my_tests.json --no-build
# Priority 3: Fresh build (default)
python test_runner.py --config my_tests.json
```
## Configuration File Format
### Basic Structure
```json
{
"system_configurations": {
"name": "system-name",
"description": "System description"
},
"paths": {
"workdir": "/path/to/rccl",
"rocm_path": "/opt/rocm",
"mpi_path": "/path/to/mpi"
},
"env_variables": {
"GLOBAL_VAR": "value"
},
"test_configurations": {
"config_name": {
"env_variables": {...},
"tests": [...]
}
},
"test_suites": [
{
"name": "Test Suite Name",
"config": "config_name",
"enabled": true
}
]
}
```
### Environment Variable Expansion in Paths
The `paths` section supports environment variable expansion, allowing you to avoid hardcoding paths and make configurations portable across different systems.
#### Supported Syntax
```json
{
"paths": {
"workdir": "${HOME}/code/rccl",
"rocm_path": "$ROCM_PATH",
"mpi_path": "${MPI_PATH:-/opt/mpi}"
}
}
```
**Syntax Options:**
- `${VAR}` - Expands to the value of `VAR`, left as-is if undefined
- `$VAR` - Expands to the value of `VAR`, left as-is if undefined
- `${VAR:-default}` - Expands to the value of `VAR`, or `default` if undefined (bash-style default)
#### Examples
```json
{
"paths": {
"workdir": "${WORKDIR:-${HOME}/code/rti/scripts/rccl}",
"rocm_path": "${ROCM_PATH:-/opt/rocm}",
"mpi_path": "${MPI_PATH:-${HOME}/softwares/ompi}"
}
}
```
**Usage:**
```bash
# Use environment variables
export WORKDIR=/custom/path/to/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/mpi
python test_runner.py --config test_config_sample.json
# Or use defaults (no environment variables set)
python test_runner.py --config test_config_sample.json
```
**Benefits:**
- **Portability**: Share configurations across different systems
- **Flexibility**: Override paths without modifying config files
- **CI/CD**: Easy integration with build systems and pipelines
- **Multi-user**: Same config works for different user environments
### Test Types Supported
The test runner uses the `is_gtest` boolean flag to distinguish between test types:
- **`is_gtest: true`** (default) - GTest-based unit tests using `--gtest_filter` syntax
- **`is_gtest: false`** - Non-GTest tests (performance benchmarks, custom scripts, etc.)
This simplified approach supports all test categories while reducing configuration complexity.
#### GTest Tests (`is_gtest: true`)
Used for unit tests with GTest framework. The `test_filter` field uses GTest filter syntax.
```json
{
"name": "AllReduce_InPlace",
"description": "Test AllReduce collective operation with in-place buffers",
"is_gtest": true,
"binary": "rccl-UnitTests",
"test_filter": "AllReduce.InPlace",
"num_ranks": 1,
"num_nodes": 1,
"timeout": 60
}
```
**Command generated:**
```bash
./rccl-UnitTests --gtest_filter=AllReduce.InPlace
```
#### Performance Tests (`is_gtest: false`)
Used for performance benchmarks. Arguments are passed directly without GTest syntax.
```json
{
"name": "Perf_Bandwidth",
"description": "Bandwidth benchmark for AllReduce",
"is_gtest": false,
"binary": "all_reduce_perf",
"command_args": "-b 8 -e 128M -f 2",
"num_ranks": 2,
"num_nodes": 1,
"timeout": 300
}
```
**Command generated:**
```bash
mpirun -np 2 ./all_reduce_perf -b 8 -e 128M -f 2
```
#### Custom Scripts (`is_gtest: false`)
Used for custom validation scripts or any non-GTest executables.
```json
{
"name": "Custom_Validation",
"description": "Custom GPU validation script",
"is_gtest": false,
"binary": "validate_gpus.sh",
"command_args": "--full-check --verbose",
"num_ranks": 1,
"num_nodes": 1,
"timeout": 120
}
```
**Command generated:**
```bash
./validate_gpus.sh --full-check --verbose
```
**Key Differences:**
| Feature | `is_gtest: true` | `is_gtest: false` |
|---------|------------------|-------------------|
| Test framework | GTest (Google Test) | Any executable |
| Filter syntax | `--gtest_filter=<pattern>` | Plain arguments |
| `test_filter` field | GTest pattern (e.g., `Suite.Test*`) | Passed as plain argument |
| `command_args` field | Appended after filter | Primary argument method |
| Typical use cases | Unit tests, functional tests | Performance tests, custom scripts |
### Test Definition Fields
| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `name` | Yes | string | Unique test identifier |
| `description` | Recommended | string | Human-readable test description |
| `is_gtest` | Optional | boolean | Whether test uses GTest framework (default: true). Set to false for perf or custom tests |
| `binary` | Yes | string | Test binary name (relative to build/test/) |
| `test_filter` | Optional | string | Test filter (GTest filter syntax for gtest, plain argument for non-gtest) |
| `command_args` | Optional | string | Additional command-line arguments |
| `num_ranks` | Optional | integer | Number of MPI ranks (default: 1) |
| `num_nodes` | Optional | integer | Number of nodes (default: 1) |
| `num_gpus` | Optional | integer | GPUs per node - controls rank distribution (default: 8) |
| `timeout` | Optional | integer | Timeout in seconds (0 = unlimited) |
| `env_variables` | Optional | object | Test-specific environment variables |
### Configuration Inheritance
Use the `"extends"` directive to inherit from parent configurations:
```json
{
"test_configurations": {
"base": {
"env_variables": {
"NCCL_DEBUG": "INFO"
}
},
"shm_tests": {
"extends": "base",
"env_variables": {
"NCCL_SHM_DISABLE": "0"
},
"tests": [...]
},
"advanced_shm": {
"extends": ["base", "shm_tests"],
"env_variables": {
"NCCL_SHM_USE_CUDA_MEMCPY": "1"
}
}
}
}
```
### Hierarchical Defaults
To reduce repetition, you can specify default values at multiple levels with a clear override hierarchy:
**Priority Order (highest to lowest):**
1. **Individual test** - highest priority, overrides everything
2. **Test suite level** - overrides configuration defaults
3. **Configuration level** - base defaults for all tests in that config
4. **Built-in defaults** - system fallback values
**Supported default fields:** `is_gtest`, `binary`, `num_ranks`, `num_nodes`, `num_gpus`, `timeout`
#### Example with Three-Level Hierarchy
```json
{
"test_configurations": {
"p2p_tests": {
"is_gtest": true,
"binary": "rccl-UnitTestsMPI",
"num_ranks": 2,
"num_nodes": 1,
"num_gpus": 2,
"timeout": 120,
"env_variables": {
"NCCL_P2P_DISABLE": "0"
},
"tests": [
{
"name": "P2P_Basic",
"description": "Basic P2P test",
"test_filter": "P2pMPITest.Basic"
// Uses config defaults: is_gtest=true, binary, num_ranks=2, num_nodes=1, num_gpus=2, timeout=120
},
{
"name": "P2P_LongRunning",
"description": "Long-running P2P test",
"test_filter": "P2pMPITest.LongRunning",
"timeout": 300
// Overrides timeout=300, inherits other config defaults
}
]
}
},
"test_suites": [
{
"name": "P2P_Basic_Suite",
"config": "p2p_tests",
"num_ranks": 4,
"num_gpus": 4,
"timeout": 180
// Suite-level: overrides config's num_ranks, num_gpus, and timeout
// Tests in this suite will use: num_ranks=4, num_gpus=4, timeout=180
},
{
"name": "P2P_Stress_Suite",
"config": "p2p_tests",
"num_nodes": 2,
"num_ranks": 4,
"num_gpus": 2,
"timeout": 600
// Suite-level: overrides config's num_nodes, num_ranks, num_gpus, and timeout
// Tests in this suite will use: num_nodes=2, num_ranks=4, num_gpus=2, timeout=600
}
]
}
```
**Benefits:**
- **Less Repetition**: Define common values once
- **Easier Maintenance**: Update defaults in one place
- **Flexible Overrides**: Tests can still customize any field
- **Cleaner Config**: Shorter, more readable test definitions
## Command-Line Options
```
Required:
-c, --config CONFIG Test configuration file (JSON format)
Optional:
-v, --verbose Enable verbose output (shows build paths, commands, etc.)
-o, --output DIR Output directory for logs and reports
--test-name NAME Run only specific test by name
--no-build Skip build step and use existing build
--skip-tests Skip test execution (useful with --coverage-report)
--coverage-report Generate code coverage report (HTML + text)
--overwrite Overwrite previous workspace directories
--report-suffix SUFFIX Suffix for report directory (default: blank)
-h, --help Show help message and exit
```
## Code Coverage Reports
The test runner integrates with LLVM tools to generate comprehensive code coverage reports.
### Generating Coverage
```bash
# Build and test with coverage (recommended)
python test_runner.py --config test_config_sample.json --coverage-report --verbose
# Generate report from existing profraw files
python test_runner.py --config test_config_sample.json --no-build --skip-tests --coverage-report
```
### Coverage Output
When `--coverage-report` is specified, the runner generates:
1. **HTML Report**: Visual coverage report in `reports/` directory
- View with: `firefox reports/index.html`
- Shows line-by-line coverage with syntax highlighting
2. **Text Report**: Function-level coverage summary
- Location: `reports/function_coverage_report.txt`
- Includes per-function and per-file statistics
### Coverage Implementation Details
- Uses LLVM instrumentation (`-fprofile-instr-generate -fcoverage-mapping`)
- Collects `.profraw` files during test execution
- Merges profiles with `llvm-profdata`
- Generates reports with `llvm-cov show` and `llvm-cov report`
- Filters out irrelevant files (test/, gtest, external dependencies)
## Examples
### Run All Enabled Test Suites
```bash
python test_runner.py --config test_config_sample.json --verbose
```
### Run Specific Test
```bash
python test_runner.py --config test_config_sample.json --test-name P2P_AllTests
```
### Skip Build (Use Existing)
```bash
python test_runner.py --config test_config_sample.json --no-build
```
### Build and Generate Coverage
```bash
# Full workflow: build, test, coverage
python test_runner.py --config adhoc_test_config.json --coverage-report --verbose
```
### Generate Coverage from Existing Build
```bash
# Skip build, use existing profraw files
python test_runner.py --config adhoc_test_config.json --no-build --skip-tests --coverage-report
```
### Custom Output Directory
```bash
python test_runner.py --config test_config_sample.json -o /path/to/output --verbose
```
### Run with Overwrite (Clean Previous Results)
```bash
python test_runner.py --config test_config_sample.json --overwrite --coverage-report
```
## Environment Variable Merging
Environment variables are merged hierarchically (later values override earlier):
1. **Global** `env_variables` (top-level in config)
2. **Configuration** `env_variables` (test configuration level)
3. **Test Suite** `env_variables` (suite level)
4. **Test-specific** `env_variables` (individual test level)
Example:
```json
{
"env_variables": {
"NCCL_DEBUG": "INFO"
},
"test_configurations": {
"shm_tests": {
"env_variables": {
"NCCL_SHM_DISABLE": "0"
},
"tests": [
{
"name": "SHM_Test",
"env_variables": {
"NCCL_DEBUG": "TRACE"
}
}
]
}
}
}
```
Result: `NCCL_DEBUG=TRACE`, `NCCL_SHM_DISABLE=0`
## Test Execution
### Single-Node Tests
- All ranks run on a single node
- Multiple ranks map to different GPUs
- Examples: SHM tests, P2P tests, unit tests
```json
{
"name": "SHM_Test",
"num_ranks": 2,
"num_nodes": 1
}
```
### Multi-Node Tests
- Ranks distributed across multiple nodes via MPI
- Requires SLURM allocation or hostfile configuration
- Use `num_gpus` to control ranks per node (default: 8)
- Examples: NET transport tests, InfiniBand tests
```json
{
"name": "NET_Test_4Nodes_2GPUs",
"num_ranks": 8,
"num_nodes": 4,
"num_gpus": 2
}
```
**`num_gpus` Field:**
- Controls how many MPI ranks are placed on each node
- Overrides hostfile `slots` specification
- For multi-node tests, uses `--map-by ppr:{num_gpus}:node`
- Default value: 8 (matches typical 8-GPU nodes)
**Example: 2 nodes, 1 GPU per node**
```json
{
"name": "NET_Test_2Nodes_1GPU",
"num_ranks": 2,
"num_nodes": 2,
"num_gpus": 1
}
```
Command: `mpirun -np 2 --hostfile file --map-by ppr:1:node ...`
### Setting Up Multi-Node Tests
**Option 1: MPI Hostfile**
```bash
export RCCL_TEST_MPI_HOSTFILE=/path/to/hostfile
python test_runner.py --config net_ib_test_config.json
```
**Option 2: Default Hostfile**
Create `~/.mpi_hostfile` with node names (one per line):
```
node01 slots=8
node02 slots=8
```
## Advanced Features
### Build Configuration (New!)
Customize the RCCL build process through the `build_configuration` section in your JSON config file.
#### Basic Structure
```json
{
"build_configuration": {
"cmake_options": {
"CMAKE_BUILD_TYPE": "Debug",
"ENABLE_CODE_COVERAGE": "ON",
"ONLY_FUNCS": "SendRecv|AllReduce"
},
"env_variables": {
"HIPCC_COMPILE_FLAGS_APPEND": "-g -O1"
},
"parallel_jobs": 64,
"generator": "Unix Makefiles"
}
}
```
#### Examples
**Fast Development Build (No Coverage):**
```json
{
"build_configuration": {
"cmake_options": {
"ENABLE_CODE_COVERAGE": "OFF"
},
"parallel_jobs": 128
}
}
```
**Release Build:**
```json
{
"build_configuration": {
"cmake_options": {
"CMAKE_BUILD_TYPE": "Release",
"TRACE": "OFF",
"COLLTRACE": "OFF"
}
}
}
```
**Test Specific Functions Only:**
```json
{
"build_configuration": {
"cmake_options": {
"ONLY_FUNCS": "Broadcast|Reduce"
}
}
}
```
**All Options:**
- `cmake_options` - Any CMake option (user values override defaults)
- `env_variables` - Build environment variables
- `parallel_jobs` - Number of parallel build threads (default: 64)
- `generator` - CMake generator: "Unix Makefiles", "Ninja", etc.
See `BUILD_CONFIGURATION_GUIDE.md` for complete documentation.
### Enhanced Environment Variable Expansion
Environment variables in the `paths` section now support **nested expansion** in default values:
```json
{
"paths": {
"workdir": "${WORKDIR:-$HOME/code/rti/scripts/rccl}",
"rocm_path": "${ROCM_PATH:-/opt/rocm}",
"mpi_path": "${MPI_PATH:-$HOME/softwares/ompi}"
}
}
```
**Key Feature:** If `WORKDIR` is not set, the default `$HOME/code/rti/scripts/rccl` will expand `$HOME` automatically!
### Flexible Binary Paths
Specify test binary locations in multiple ways for maximum flexibility:
#### 1. Default (Relative to build_dir/test/)
```json
{
"binary": "all_reduce_perf"
}
```
Result: `<workdir>/build_debug_cov_on_tests_on/test/all_reduce_perf`
#### 2. Absolute Path
```json
{
"binary": "/opt/custom_rccl_build/test/all_reduce_perf"
}
```
Result: Uses the absolute path directly
#### 3. Environment Variable in Binary Name
```json
{
"binary": "${MY_RCCL_TESTS}/all_reduce_perf"
}
```
Result: Expands `$MY_RCCL_TESTS` environment variable
#### 4. Home Directory Expansion
```json
{
"binary": "~/my_builds/rccl/test/all_reduce_perf"
}
```
Result: Expands `~` to home directory
#### 5. Using test_binary_dir in Paths
```json
{
"paths": {
"test_binary_dir": "${RCCL_TEST_BIN_DIR}"
},
"test_configurations": {
"my_tests": {
"binary": "all_reduce_perf"
}
}
}
```
Result: `${RCCL_TEST_BIN_DIR}/all_reduce_perf`
#### 6. Using test_binary_dir in Test Config
```json
{
"test_configurations": {
"my_tests": {
"tests": [
{
"name": "CustomBinary",
"test_binary_dir": "/opt/rccl/tests",
"binary": "all_reduce_perf"
}
]
}
}
}
```
Result: `/opt/rccl/tests/all_reduce_perf`
#### Resolution Priority Order
1. **Absolute path in binary** - Highest priority
2. **Environment variable expansion** (if results in absolute path)
3. **test_binary_dir in test config** + binary
4. **test_binary_dir in paths** + binary
5. **Default:** `build_dir/test/` + binary - Lowest priority
#### Use Cases
- **CI/CD with pre-built binaries:** Use absolute paths or `RCCL_TEST_BIN_DIR`
- **Multiple RCCL versions:** Different `test_binary_dir` per configuration
- **Custom build locations:** Environment variables for flexibility
- **Standard builds:** Use default (no configuration needed)
#### Verbose Mode
Use `--verbose` to see the resolved binary path:
```bash
python test_runner.py --config test.json --verbose
```
Output includes:
```
Binary: all_reduce_perf
Binary path: /home/user/code/rti/scripts/rccl/build_debug_cov_on_tests_on/test/all_reduce_perf
```
### Configuration Best Practices
**Reduce Repetition:** Move common values to configuration level
```json
{
"test_configurations": {
"p2p_tests": {
"timeout": 120,
"env_variables": {
"NCCL_P2P_USE_CUDA_MEMCPY": "1",
"NCCL_LEGACY_CUDA_REGISTER": "1"
},
"tests": [
{
"name": "Test1"
// Inherits timeout and env vars from config level
},
{
"name": "Test2",
"timeout": 300
// Overrides timeout, inherits env vars
}
]
}
}
}
```
**Benefits:**
- ✅ Single source of truth for common settings
- ✅ Easier maintenance
- ✅ Tests can still override when needed
- ✅ Cleaner, more readable configurations
## Development and Testing
### Validate Configuration
```bash
# Test JSON syntax
python3 -m json.tool test_config_sample.json
# Test configuration loading
python3 -c "from lib.test_config import TestConfigProcessor; \
p = TestConfigProcessor('test_config_sample.json'); \
print('Configuration valid!')"
# Dry run (validate without executing)
python test_runner.py --config test_config_sample.json --skip-tests --verbose
```
### Adding New Tests
1. Add test definition to appropriate configuration in JSON file
2. Specify `is_gtest`, `description`, and required fields
3. Test with dry run first: `--skip-tests --verbose`
4. Run actual test: `--test-name YourTest --verbose`
### Test Type Handling
The test runner uses a boolean `is_gtest` flag to distinguish between test types:
- **`is_gtest: true`** (default): Uses GTest framework with `--gtest_filter=<filter>` syntax
- **`is_gtest: false`**: Runs binary with plain arguments (for performance tests, custom scripts, etc.)
This simplified approach eliminates the need for multiple test type conditionals while supporting all test categories (gtest, perf, custom).
## Troubleshooting
### "Configuration file not found"
- Check the path to your JSON config file
- Use absolute paths or ensure you're in the correct directory
- Verify file permissions
### "MPI path not found"
- Update `paths.mpi_path` in your configuration
- Ensure MPI is installed: `which mpirun`
- Check MPI_PATH environment variable
### "Test binary not found"
- Build first: remove `--no-build` flag
- Check binary name in `build/test/` directory
- Verify CMAKE built successfully
### Multi-node tests hang
- Ensure SLURM allocation or hostfile is configured
- Check network connectivity: `ping other_node`
- Verify MPI can reach nodes: `mpirun -np 2 hostname`
- Check firewall settings
### CMake configuration fails
- Check ROCm path: `ls $ROCM_PATH`
- Verify compiler: `$ROCM_PATH/bin/amdclang++ --version`
- Check MPI path: `ls $MPI_PATH/bin/mpirun`
### Coverage report fails
- Ensure LLVM tools are available: `which llvm-profdata llvm-cov`
- Check for `.profraw` files in build directory
- Verify coverage build flags were set correctly
- Run with `--verbose` to see detailed error messages
### "LLVM_PROFILE_FILE not being used"
- Ensure `--coverage-report` flag is specified
- Check that tests are actually executing (not skipped)
- Verify environment variables with `--verbose`
---
## Appendix: Environment Variables Reference
This section provides a quick reference for all environment variables supported by the test runner.
### Library and Build Location
| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory. Automatically skips build. | `export RCCL_LIB_PATH=/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. | `export RCCL_BUILD_DIR=/home/user/rccl_builds/debug` |
**Requirements**: Directory must contain `librccl.so` and `test/` subdirectory.
### Configuration Paths
These override the paths specified in the JSON configuration file:
| Variable | Description | Example |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | `export WORKDIR=/home/user/code/rccl` |
| `ROCM_PATH` | ROCm installation path | `export ROCM_PATH=/opt/rocm-6.0` |
| `MPI_PATH` | MPI installation path | `export MPI_PATH=/usr/local/openmpi` |
### Test Execution
| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests | `export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile` |
**Note**: Falls back to `~/.mpi_hostfile` if not set. For SLURM environments, hostfile is auto-generated from `SLURM_NODELIST`.
### Test-Specific Variables
These can be set globally or specified in the JSON configuration per test:
| Variable | Description | Example |
|----------|-------------|---------|
| `NCCL_DEBUG` | NCCL debug level (VERSION, WARN, INFO, TRACE) | `export NCCL_DEBUG=INFO` |
| `NCCL_DEBUG_SUBSYS` | NCCL debug subsystems to enable | `export NCCL_DEBUG_SUBSYS=INIT,COLL,NET` |
| `HSA_NO_SCRATCH_RECLAIM` | Disable HIP scratch memory reclaim | `export HSA_NO_SCRATCH_RECLAIM=1` |
| `NCCL_LAUNCH_MODE` | NCCL launch mode (GROUP, PARALLEL) | `export NCCL_LAUNCH_MODE=GROUP` |
### Coverage and Profiling
| Variable | Description | Example |
|----------|-------------|---------|
| `LLVM_PROFILE_FILE` | LLVM coverage profile output pattern | `export LLVM_PROFILE_FILE=rccl_%p_%m.profraw` |
**Note**: Automatically set by test runner to prevent collisions. Manual override not recommended.
### Complete Example
```bash
#!/bin/bash
# Configure paths
export WORKDIR=/home/user/code/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/openmpi
# Use pre-built library
export RCCL_LIB_PATH=/home/user/rccl_builds/instrumented
# Configure MPI
export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile
# Enable debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET
# Run tests
python test_runner.py --config my_tests.json --verbose
```
### Variable Priority
When the same configuration can be specified in multiple places, the priority is:
1. **Environment variables** (highest priority)
2. **Test-specific configuration** (in JSON)
3. **Test suite configuration** (in JSON)
4. **Test configuration defaults** (in JSON)
5. **Built-in defaults** (lowest priority)
**Example**: If `ROCM_PATH` is set as an environment variable, it overrides the `rocm_path` value in the JSON configuration file.