30d36661c2
* Added python test runner to execute rccl tests
* Disabled capture output to avoid hangs
* Add RCCL_TEST_MPI_HOSTFILE env var to get the hostfile
* Converted test_type to boolean gtest flag
* Removed unused return values
* Added custom rccl library usage
* Removed json output
* Updates to test_runner: added num_gpus field
* Address review comments
* Prepend env vars for single node, single process executions
* Added separate enums for exit and result codes
* Update configuration files
* Moved configurations to its own dir
* Address review comments
* Update tools/scripts/test_runner/README.md
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
---------
Co-authored-by: Corey Derochie <161367113+corey-derochie-amd@users.noreply.github.com>
[ROCm/rccl commit: 0c2c61d2f1]
# RCCL Test Runner

A Python-based test runner focused on RCCL unit and functional tests with hierarchical configuration support and integrated code coverage reporting. It is extensible to performance benchmarks, MPI tests, and custom test scripts.

## Overview

This test runner provides a maintainable, extensible alternative to shell-based test execution. It uses JSON configuration files with hierarchical inheritance, and integrates with LLVM code coverage tools.

## Key Features

- **Multiple Test Types**: Support for GTest, performance tests, and custom executables
- **Hierarchical Configuration**: Use `"extends"` directive to inherit and merge configurations
- **Environment Variable Management**: Global, configuration, suite, and test-specific environment variables
- **Path Variable Expansion**: Use environment variables in paths with nested default value expansion
- **Custom Library Support**: Use pre-built RCCL libraries from custom locations via environment variables
- **Configurable Build System**: Customize CMake options, environment variables, and parallel jobs via config
- **MPI Support**: Full support for multi-rank and multi-node tests
- **Flexible Test Filtering**: Run all tests, specific test suites, or individual tests
- **Build Integration**: Automated RCCL building with CMake
- **Code Coverage**: Integrated LLVM coverage report generation (HTML and text)
- **Clean Output**: Verbose MPI messages are filtered out automatically (pass `--verbose` to see them)
- **Verbose Logging**: Detailed output for debugging and troubleshooting
## Quick Start

### Basic Usage

```bash
# Run with specific configuration
python test_runner.py --config my_tests.json

# Run with verbose output
python test_runner.py --config my_tests.json --verbose

# Run specific test by name
python test_runner.py --config my_tests.json --test-name SHM_ComprehensiveWorkflow
```

### Generate Coverage Report

```bash
# Build, run tests, and generate coverage report
python test_runner.py --config test_config_sample.json --coverage-report --verbose

# Use existing build and generate coverage
python test_runner.py --config test_config_sample.json --no-build --coverage-report
```

### Use Custom RCCL Library

```bash
# Use pre-built RCCL library from custom location
export RCCL_LIB_PATH=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json

# Or use RCCL_BUILD_DIR (alternative name)
export RCCL_BUILD_DIR=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json

# When set, build step is automatically skipped
# --no-build is not needed
```
## Environment Variables

The test runner supports the following environment variables to customize behavior:

### Library and Build Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory (contains `librccl.so` and `test/` subdirectory). When set, the build step is automatically skipped. | `/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. Either variable can be used. | `/path/to/rccl/build` |
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests. | `~/.mpi_hostfile` |

### Configuration Path Variables

These can be overridden via environment variables or specified in the JSON config:

| Variable | Description | Default |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | Current rccl repository root |
| `ROCM_PATH` | ROCm installation path | `/opt/rocm` |
| `MPI_PATH` | MPI installation path | System default or config-specific |

### Priority Order

When determining which RCCL library to use, the test runner follows this priority:

1. **`RCCL_LIB_PATH` or `RCCL_BUILD_DIR` environment variable** (highest priority)
   - Skips build automatically
   - Must contain `librccl.so` and `test/` subdirectory
2. **`--no-build` flag with local build**
   - Uses local `build_debug_cov_on_tests_on/` directory
   - Requires prior build
3. **Default build process** (lowest priority)
   - Builds RCCL in timestamped directory
   - Uses CMake configuration from JSON

**Example Usage:**

```bash
# Priority 1: Use custom library (build skipped automatically)
export RCCL_LIB_PATH=/path/to/prebuilt/rccl/build
python test_runner.py --config my_tests.json

# Priority 2: Use existing local build (no new build)
python test_runner.py --config my_tests.json --no-build

# Priority 3: Fresh build (default)
python test_runner.py --config my_tests.json
```
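The library priority order can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the runner's actual code: `select_build_dir` and its parameters are hypothetical names, and checking `RCCL_LIB_PATH` before `RCCL_BUILD_DIR` when both are set is a guess, since the README only says either variable can be used.

```python
from pathlib import Path


def select_build_dir(env, no_build, local_build_dir):
    """Return (build_dir, needs_build) following the priority list above.

    Hypothetical helper: `env` is a mapping such as os.environ.
    """
    # Priority 1: a pre-built library named by either environment variable.
    # Assumption: RCCL_LIB_PATH is consulted first when both are set.
    custom = env.get("RCCL_LIB_PATH") or env.get("RCCL_BUILD_DIR")
    if custom:
        root = Path(custom)
        # The directory must contain librccl.so and a test/ subdirectory.
        if not (root / "librccl.so").is_file() or not (root / "test").is_dir():
            raise FileNotFoundError(f"{root} lacks librccl.so or test/")
        return root, False
    # Priority 2: --no-build reuses an existing local build.
    if no_build:
        return Path(local_build_dir), False
    # Priority 3: default, build RCCL before running tests.
    return Path(local_build_dir), True
```

Returning a `needs_build` flag alongside the directory keeps the "build is skipped automatically" rule in one place instead of spreading it over several call sites.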
## Configuration File Format

### Basic Structure

```json
{
  "system_configurations": {
    "name": "system-name",
    "description": "System description"
  },
  "paths": {
    "workdir": "/path/to/rccl",
    "rocm_path": "/opt/rocm",
    "mpi_path": "/path/to/mpi"
  },
  "env_variables": {
    "GLOBAL_VAR": "value"
  },
  "test_configurations": {
    "config_name": {
      "env_variables": {...},
      "tests": [...]
    }
  },
  "test_suites": [
    {
      "name": "Test Suite Name",
      "config": "config_name",
      "enabled": true
    }
  ]
}
```
### Environment Variable Expansion in Paths

The `paths` section supports environment variable expansion, allowing you to avoid hardcoding paths and make configurations portable across different systems.

#### Supported Syntax

```json
{
  "paths": {
    "workdir": "${HOME}/code/rccl",
    "rocm_path": "$ROCM_PATH",
    "mpi_path": "${MPI_PATH:-/opt/mpi}"
  }
}
```

**Syntax Options:**
- `${VAR}` - Expands to the value of `VAR`, left as-is if undefined
- `$VAR` - Expands to the value of `VAR`, left as-is if undefined
- `${VAR:-default}` - Expands to the value of `VAR`, or `default` if undefined (bash-style default)
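The three syntaxes can be modeled with a small recursive scanner. The sketch below is one possible way to get this behavior, not the runner's implementation; `expand` and `_matching_brace` are hypothetical names, and the default is used only when the variable is undefined, as the list above states.

```python
import re

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _matching_brace(text, open_idx):
    """Return the index of the '}' closing the '{' at open_idx, or -1."""
    depth = 0
    for j in range(open_idx, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return j
    return -1


def expand(text, env):
    """Expand $VAR, ${VAR} and ${VAR:-default}; defaults may nest."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "$":
            out.append(text[i])
            i += 1
        elif text.startswith("${", i):
            end = _matching_brace(text, i + 1)
            if end == -1:                      # unbalanced: keep the rest verbatim
                out.append(text[i:])
                break
            name, sep, default = text[i + 2:end].partition(":-")
            if name in env:
                out.append(env[name])
            elif sep:
                out.append(expand(default, env))   # nested expansion of the default
            else:
                out.append(text[i:end + 1])        # undefined: left as-is
            i = end + 1
        else:
            match = _NAME.match(text, i + 1)
            if match is None:                  # a lone '$' is kept literally
                out.append("$")
                i += 1
            else:
                name = match.group()
                out.append(env.get(name, "$" + name))
                i = match.end()
    return "".join(out)
```

For example, `expand("${WORKDIR:-${HOME}/code/rccl}", {"HOME": "/home/u"})` falls through to the default and then expands `${HOME}` inside it.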
#### Examples

```json
{
  "paths": {
    "workdir": "${WORKDIR:-${HOME}/code/rti/scripts/rccl}",
    "rocm_path": "${ROCM_PATH:-/opt/rocm}",
    "mpi_path": "${MPI_PATH:-${HOME}/softwares/ompi}"
  }
}
```

**Usage:**
```bash
# Use environment variables
export WORKDIR=/custom/path/to/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/mpi

python test_runner.py --config test_config_sample.json

# Or use defaults (no environment variables set)
python test_runner.py --config test_config_sample.json
```

**Benefits:**
- **Portability**: Share configurations across different systems
- **Flexibility**: Override paths without modifying config files
- **CI/CD**: Easy integration with build systems and pipelines
- **Multi-user**: Same config works for different user environments
### Test Types Supported

The test runner uses the `is_gtest` boolean flag to distinguish between test types:

- **`is_gtest: true`** (default) - GTest-based unit tests using `--gtest_filter` syntax
- **`is_gtest: false`** - Non-GTest tests (performance benchmarks, custom scripts, etc.)

This simplified approach supports all test categories while reducing configuration complexity.

#### GTest Tests (`is_gtest: true`)

Used for unit tests written with the GTest framework. The `test_filter` field uses GTest filter syntax.

```json
{
  "name": "AllReduce_InPlace",
  "description": "Test AllReduce collective operation with in-place buffers",
  "is_gtest": true,
  "binary": "rccl-UnitTests",
  "test_filter": "AllReduce.InPlace",
  "num_ranks": 1,
  "num_nodes": 1,
  "timeout": 60
}
```

**Command generated:**
```bash
./rccl-UnitTests --gtest_filter=AllReduce.InPlace
```
#### Performance Tests (`is_gtest: false`)

Used for performance benchmarks. Arguments are passed directly without GTest syntax.

```json
{
  "name": "Perf_Bandwidth",
  "description": "Bandwidth benchmark for AllReduce",
  "is_gtest": false,
  "binary": "all_reduce_perf",
  "command_args": "-b 8 -e 128M -f 2",
  "num_ranks": 2,
  "num_nodes": 1,
  "timeout": 300
}
```

**Command generated:**
```bash
mpirun -np 2 ./all_reduce_perf -b 8 -e 128M -f 2
```
#### Custom Scripts (`is_gtest: false`)

Used for custom validation scripts or any non-GTest executables.

```json
{
  "name": "Custom_Validation",
  "description": "Custom GPU validation script",
  "is_gtest": false,
  "binary": "validate_gpus.sh",
  "command_args": "--full-check --verbose",
  "num_ranks": 1,
  "num_nodes": 1,
  "timeout": 120
}
```

**Command generated:**
```bash
./validate_gpus.sh --full-check --verbose
```
**Key Differences:**

| Feature | `is_gtest: true` | `is_gtest: false` |
|---------|------------------|-------------------|
| Test framework | GTest (Google Test) | Any executable |
| Filter syntax | `--gtest_filter=<pattern>` | Plain arguments |
| `test_filter` field | GTest pattern (e.g., `Suite.Test*`) | Passed as plain argument |
| `command_args` field | Appended after filter | Primary argument method |
| Typical use cases | Unit tests, functional tests | Performance tests, custom scripts |
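A rough Python model of how a test definition could turn into a command line, consistent with the three "Command generated" examples in this section. `build_command` is a hypothetical helper, and prefixing `mpirun` only when `num_ranks` is greater than 1 is inferred from those examples rather than confirmed by the source.

```python
import shlex


def build_command(test):
    """Build the argument vector for one test definition (a dict)."""
    argv = ["./" + test["binary"]]
    test_filter = test.get("test_filter")
    if test_filter:
        if test.get("is_gtest", True):
            # GTest binaries receive the filter through --gtest_filter.
            argv.append("--gtest_filter=" + test_filter)
        else:
            # Non-GTest binaries receive it as a plain argument.
            argv.append(test_filter)
    # command_args is a single string; split it the way a shell would.
    argv += shlex.split(test.get("command_args", ""))
    num_ranks = test.get("num_ranks", 1)
    if num_ranks > 1:
        # Assumption: single-rank tests run directly, without mpirun.
        argv = ["mpirun", "-np", str(num_ranks)] + argv
    return argv
```

Building a list instead of a shell string avoids quoting problems when the result is handed to `subprocess.run`.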
### Test Definition Fields

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `name` | Yes | string | Unique test identifier |
| `description` | Recommended | string | Human-readable test description |
| `is_gtest` | Optional | boolean | Whether test uses GTest framework (default: true). Set to false for perf or custom tests |
| `binary` | Yes | string | Test binary name (relative to build/test/) |
| `test_filter` | Optional | string | Test filter (GTest filter syntax for gtest, plain argument for non-gtest) |
| `command_args` | Optional | string | Additional command-line arguments |
| `num_ranks` | Optional | integer | Number of MPI ranks (default: 1) |
| `num_nodes` | Optional | integer | Number of nodes (default: 1) |
| `num_gpus` | Optional | integer | GPUs per node - controls rank distribution (default: 8) |
| `timeout` | Optional | integer | Timeout in seconds (0 = unlimited) |
| `env_variables` | Optional | object | Test-specific environment variables |
### Configuration Inheritance

Use the `"extends"` directive to inherit from parent configurations:

```json
{
  "test_configurations": {
    "base": {
      "env_variables": {
        "NCCL_DEBUG": "INFO"
      }
    },
    "shm_tests": {
      "extends": "base",
      "env_variables": {
        "NCCL_SHM_DISABLE": "0"
      },
      "tests": [...]
    },
    "advanced_shm": {
      "extends": ["base", "shm_tests"],
      "env_variables": {
        "NCCL_SHM_USE_CUDA_MEMCPY": "1"
      }
    }
  }
}
```
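One way to picture how `"extends"` could be resolved is the sketch below. The merge rules are assumptions, not documented behavior: parents are applied left to right, nested objects such as `env_variables` are merged key by key, and every other field in the child replaces the inherited value. `merge` and `resolve` are hypothetical names.

```python
def merge(parent, child):
    """Overlay child on parent; dict-valued fields are merged key by key."""
    out = dict(parent)
    for key, value in child.items():
        if key == "extends":
            continue                      # the directive itself is not inherited
        if isinstance(value, dict) and isinstance(out.get(key), dict):
            out[key] = {**out[key], **value}
        else:
            out[key] = value
    return out


def resolve(name, configs, _seen=()):
    """Return configuration `name` with all of its parents folded in."""
    if name in _seen:
        raise ValueError("circular extends: " + " -> ".join(_seen + (name,)))
    config = configs[name]
    parents = config.get("extends", [])
    if isinstance(parents, str):          # "extends" may be a string or a list
        parents = [parents]
    merged = {}
    for parent in parents:
        merged = merge(merged, resolve(parent, configs, _seen + (name,)))
    return merge(merged, config)
```

Under these assumptions, resolving `advanced_shm` from the example yields all three environment variables: `NCCL_DEBUG`, `NCCL_SHM_DISABLE`, and `NCCL_SHM_USE_CUDA_MEMCPY`.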
### Hierarchical Defaults

To reduce repetition, you can specify default values at multiple levels with a clear override hierarchy:

**Priority Order (highest to lowest):**
1. **Individual test** - highest priority, overrides everything
2. **Test suite level** - overrides configuration defaults
3. **Configuration level** - base defaults for all tests in that config
4. **Built-in defaults** - system fallback values

**Supported default fields:** `is_gtest`, `binary`, `num_ranks`, `num_nodes`, `num_gpus`, `timeout`
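The lookup order can be modeled with `collections.ChainMap`, which returns the first mapping that defines a key. This is a sketch, not the runner's code: `effective_fields` is a hypothetical name, and `BUILT_IN` lists only the defaults stated in the field table (no default is documented for `timeout`, and `binary` is required).

```python
from collections import ChainMap

FIELDS = ("is_gtest", "binary", "num_ranks", "num_nodes", "num_gpus", "timeout")
# Defaults taken from the Test Definition Fields table.
BUILT_IN = {"is_gtest": True, "num_ranks": 1, "num_nodes": 1, "num_gpus": 8}


def effective_fields(test, suite, config):
    """Resolve each supported field: test, then suite, then config, then built-in."""
    layered = ChainMap(test, suite, config, BUILT_IN)
    return {field: layered[field] for field in FIELDS if field in layered}
```

Called with a test that sets `timeout: 300`, a suite that sets `num_ranks: 4`, and a configuration that sets `num_ranks: 2` and `timeout: 120`, the result uses `timeout=300` from the test and `num_ranks=4` from the suite.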
#### Example with Three-Level Hierarchy

```json
{
  "test_configurations": {
    "p2p_tests": {
      "is_gtest": true,
      "binary": "rccl-UnitTestsMPI",
      "num_ranks": 2,
      "num_nodes": 1,
      "num_gpus": 2,
      "timeout": 120,
      "env_variables": {
        "NCCL_P2P_DISABLE": "0"
      },
      "tests": [
        {
          "name": "P2P_Basic",
          "description": "Basic P2P test",
          "test_filter": "P2pMPITest.Basic"
          // Uses config defaults: is_gtest=true, binary, num_ranks=2, num_nodes=1, num_gpus=2, timeout=120
        },
        {
          "name": "P2P_LongRunning",
          "description": "Long-running P2P test",
          "test_filter": "P2pMPITest.LongRunning",
          "timeout": 300
          // Overrides timeout=300, inherits other config defaults
        }
      ]
    }
  },
  "test_suites": [
    {
      "name": "P2P_Basic_Suite",
      "config": "p2p_tests",
      "num_ranks": 4,
      "num_gpus": 4,
      "timeout": 180
      // Suite-level: overrides config's num_ranks, num_gpus, and timeout
      // Tests in this suite will use: num_ranks=4, num_gpus=4, timeout=180
    },
    {
      "name": "P2P_Stress_Suite",
      "config": "p2p_tests",
      "num_nodes": 2,
      "num_ranks": 4,
      "num_gpus": 2,
      "timeout": 600
      // Suite-level: overrides config's num_nodes, num_ranks, num_gpus, and timeout
      // Tests in this suite will use: num_nodes=2, num_ranks=4, num_gpus=2, timeout=600
    }
  ]
}
```

Standard JSON has no comment syntax, so treat the `//` lines above as annotations rather than part of the configuration.

**Benefits:**
- **Less Repetition**: Define common values once
- **Easier Maintenance**: Update defaults in one place
- **Flexible Overrides**: Tests can still customize any field
- **Cleaner Config**: Shorter, more readable test definitions
## Command-Line Options

```
Required:
  -c, --config CONFIG       Test configuration file (JSON format)

Optional:
  -v, --verbose             Enable verbose output (shows build paths, commands, etc.)
  -o, --output DIR          Output directory for logs and reports
  --test-name NAME          Run only specific test by name
  --no-build                Skip build step and use existing build
  --skip-tests              Skip test execution (useful with --coverage-report)
  --coverage-report         Generate code coverage report (HTML + text)
  --overwrite               Overwrite previous workspace directories
  --report-suffix SUFFIX    Suffix for report directory (default: blank)
  -h, --help                Show help message and exit
```
## Code Coverage Reports

The test runner integrates with LLVM tools to generate comprehensive code coverage reports.

### Generating Coverage

```bash
# Build and test with coverage (recommended)
python test_runner.py --config test_config_sample.json --coverage-report --verbose

# Generate report from existing profraw files
python test_runner.py --config test_config_sample.json --no-build --skip-tests --coverage-report
```

### Coverage Output

When `--coverage-report` is specified, the runner generates:

1. **HTML Report**: Visual coverage report in `reports/` directory
   - View with: `firefox reports/index.html`
   - Shows line-by-line coverage with syntax highlighting

2. **Text Report**: Function-level coverage summary
   - Location: `reports/function_coverage_report.txt`
   - Includes per-function and per-file statistics

### Coverage Implementation Details

- Uses LLVM instrumentation (`-fprofile-instr-generate -fcoverage-mapping`)
- Collects `.profraw` files during test execution
- Merges profiles with `llvm-profdata`
- Generates reports with `llvm-cov show` and `llvm-cov report`
- Filters out irrelevant files (test/, gtest, external dependencies)
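The merge-and-report steps above map onto three LLVM tool invocations. The sketch below only assembles the command lines and does not run them; `coverage_commands`, the `merged.profdata` file name, and the default `ignore_regex` are hypothetical, and the exact options the runner passes are not shown in this README.

```python
def coverage_commands(profraw_files, binary, report_dir,
                      ignore_regex=r"(test/|gtest)"):
    """Return (merge, show, report) command lists for the LLVM coverage tools."""
    profdata = f"{report_dir}/merged.profdata"   # hypothetical output name
    # Step 1: merge the raw profiles written during test execution.
    merge = ["llvm-profdata", "merge", "-sparse", *profraw_files, "-o", profdata]
    # Options shared by both llvm-cov subcommands; the regex drops unwanted files.
    common = [binary, f"-instr-profile={profdata}",
              f"-ignore-filename-regex={ignore_regex}"]
    # Step 2: line-by-line HTML report.
    show = ["llvm-cov", "show", *common, "-format=html",
            f"-output-dir={report_dir}"]
    # Step 3: textual summary.
    report = ["llvm-cov", "report", *common]
    return merge, show, report
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the instrumented binary and `.profraw` files exist.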
## Examples

### Run All Enabled Test Suites

```bash
python test_runner.py --config test_config_sample.json --verbose
```

### Run Specific Test

```bash
python test_runner.py --config test_config_sample.json --test-name P2P_AllTests
```

### Skip Build (Use Existing)

```bash
python test_runner.py --config test_config_sample.json --no-build
```

### Build and Generate Coverage

```bash
# Full workflow: build, test, coverage
python test_runner.py --config adhoc_test_config.json --coverage-report --verbose
```

### Generate Coverage from Existing Build

```bash
# Skip build, use existing profraw files
python test_runner.py --config adhoc_test_config.json --no-build --skip-tests --coverage-report
```

### Custom Output Directory

```bash
python test_runner.py --config test_config_sample.json -o /path/to/output --verbose
```

### Run with Overwrite (Clean Previous Results)

```bash
python test_runner.py --config test_config_sample.json --overwrite --coverage-report
```
## Environment Variable Merging

Environment variables are merged hierarchically (later values override earlier):

1. **Global** `env_variables` (top-level in config)
2. **Configuration** `env_variables` (test configuration level)
3. **Test Suite** `env_variables` (suite level)
4. **Test-specific** `env_variables` (individual test level)

Example:
```json
{
  "env_variables": {
    "NCCL_DEBUG": "INFO"
  },
  "test_configurations": {
    "shm_tests": {
      "env_variables": {
        "NCCL_SHM_DISABLE": "0"
      },
      "tests": [
        {
          "name": "SHM_Test",
          "env_variables": {
            "NCCL_DEBUG": "TRACE"
          }
        }
      ]
    }
  }
}
```

Result: `NCCL_DEBUG=TRACE`, `NCCL_SHM_DISABLE=0`
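The merge reduces to applying the four levels in order so that later levels win. A minimal sketch, with `merge_env` as a hypothetical helper name:

```python
def merge_env(global_env, config_env=None, suite_env=None, test_env=None):
    """Merge env_variables from least to most specific; later levels override."""
    merged = {}
    for level in (global_env, config_env, suite_env, test_env):
        merged.update(level or {})       # a missing level contributes nothing
    return merged


# The example above: the test-level NCCL_DEBUG replaces the global one.
result = merge_env(
    {"NCCL_DEBUG": "INFO"},
    {"NCCL_SHM_DISABLE": "0"},
    None,
    {"NCCL_DEBUG": "TRACE"},
)
```

To launch a process with the result, it would typically be layered over the inherited environment, for example `subprocess.run(cmd, env={**os.environ, **result})`.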
## Test Execution

### Single-Node Tests

- All ranks run on a single node
- Multiple ranks map to different GPUs
- Examples: SHM tests, P2P tests, unit tests

```json
{
  "name": "SHM_Test",
  "num_ranks": 2,
  "num_nodes": 1
}
```

### Multi-Node Tests

- Ranks distributed across multiple nodes via MPI
- Requires SLURM allocation or hostfile configuration
- Use `num_gpus` to control ranks per node (default: 8)
- Examples: NET transport tests, InfiniBand tests

```json
{
  "name": "NET_Test_4Nodes_2GPUs",
  "num_ranks": 8,
  "num_nodes": 4,
  "num_gpus": 2
}
```

**`num_gpus` Field:**
- Controls how many MPI ranks are placed on each node
- Overrides hostfile `slots` specification
- For multi-node tests, uses `--map-by ppr:{num_gpus}:node`
- Default value: 8 (matches typical 8-GPU nodes)

**Example: 2 nodes, 1 GPU per node**
```json
{
  "name": "NET_Test_2Nodes_1GPU",
  "num_ranks": 2,
  "num_nodes": 2,
  "num_gpus": 1
}
```
Command: `mpirun -np 2 --hostfile file --map-by ppr:1:node ...`
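The relationship between `num_ranks`, `num_nodes`, and `num_gpus` can be made concrete with a small sketch that assembles the `mpirun` prefix shown above. `mpirun_prefix` is a hypothetical name, and the capacity check (ranks must fit in `num_nodes * num_gpus` slots) is an assumption added for illustration, not documented runner behavior.

```python
def mpirun_prefix(num_ranks, num_nodes=1, num_gpus=8, hostfile=None):
    """Assemble the mpirun arguments that precede the test binary."""
    # Assumption: with num_gpus ranks per node, at most
    # num_nodes * num_gpus ranks can be placed.
    if num_ranks > num_nodes * num_gpus:
        raise ValueError(
            f"{num_ranks} ranks do not fit on {num_nodes} node(s) "
            f"with {num_gpus} rank(s) per node"
        )
    cmd = ["mpirun", "-np", str(num_ranks)]
    if num_nodes > 1:
        if hostfile:
            cmd += ["--hostfile", hostfile]
        # ppr = processes per resource: num_gpus ranks on every node.
        cmd += ["--map-by", f"ppr:{num_gpus}:node"]
    return cmd
```

For the 4-node example above, `mpirun_prefix(8, 4, 2, "file")` places 2 ranks on each of the 4 nodes, which accounts for all 8 ranks.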
### Setting Up Multi-Node Tests

**Option 1: MPI Hostfile**
```bash
export RCCL_TEST_MPI_HOSTFILE=/path/to/hostfile
python test_runner.py --config net_ib_test_config.json
```

**Option 2: Default Hostfile**
Create `~/.mpi_hostfile` with one node entry per line:
```
node01 slots=8
node02 slots=8
```
## Advanced Features

### Build Configuration (New!)

Customize the RCCL build process through the `build_configuration` section in your JSON config file.

#### Basic Structure

```json
{
  "build_configuration": {
    "cmake_options": {
      "CMAKE_BUILD_TYPE": "Debug",
      "ENABLE_CODE_COVERAGE": "ON",
      "ONLY_FUNCS": "SendRecv|AllReduce"
    },
    "env_variables": {
      "HIPCC_COMPILE_FLAGS_APPEND": "-g -O1"
    },
    "parallel_jobs": 64,
    "generator": "Unix Makefiles"
  }
}
```
#### Examples

**Fast Development Build (No Coverage):**
```json
{
  "build_configuration": {
    "cmake_options": {
      "ENABLE_CODE_COVERAGE": "OFF"
    },
    "parallel_jobs": 128
  }
}
```

**Release Build:**
```json
{
  "build_configuration": {
    "cmake_options": {
      "CMAKE_BUILD_TYPE": "Release",
      "TRACE": "OFF",
      "COLLTRACE": "OFF"
    }
  }
}
```

**Test Specific Functions Only:**
```json
{
  "build_configuration": {
    "cmake_options": {
      "ONLY_FUNCS": "Broadcast|Reduce"
    }
  }
}
```

**All Options:**
- `cmake_options` - Any CMake option (user values override defaults)
- `env_variables` - Build environment variables
- `parallel_jobs` - Number of parallel build threads (default: 64)
- `generator` - CMake generator: "Unix Makefiles", "Ninja", etc.

See `BUILD_CONFIGURATION_GUIDE.md` for complete documentation.
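To illustrate how a `build_configuration` section might translate into CMake invocations, here is a sketch that only assembles the command lines. `cmake_commands` is a hypothetical helper and `DEFAULT_OPTIONS` is an invented stand-in for the runner's real defaults, which this README does not list; the `-S`, `-B`, `-G`, `-D`, `--build`, and `--parallel` flags are standard CMake options.

```python
# Hypothetical defaults; the runner's actual default options are not shown here.
DEFAULT_OPTIONS = {"CMAKE_BUILD_TYPE": "Debug", "ENABLE_CODE_COVERAGE": "ON"}


def cmake_commands(build_config, source_dir, build_dir):
    """Return (configure, build) command lists for a build_configuration dict."""
    # User-supplied cmake_options override the defaults key by key.
    options = {**DEFAULT_OPTIONS, **build_config.get("cmake_options", {})}
    configure = ["cmake", "-S", source_dir, "-B", build_dir]
    if "generator" in build_config:
        configure += ["-G", build_config["generator"]]
    configure += [f"-D{key}={value}" for key, value in options.items()]
    jobs = build_config.get("parallel_jobs", 64)      # documented default: 64
    build = ["cmake", "--build", build_dir, "--parallel", str(jobs)]
    return configure, build
```

The `env_variables` entry is not handled here; it would be merged into the environment passed to the subprocess that runs these commands.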
### Enhanced Environment Variable Expansion

Environment variables in the `paths` section now support **nested expansion** in default values:

```json
{
  "paths": {
    "workdir": "${WORKDIR:-$HOME/code/rti/scripts/rccl}",
    "rocm_path": "${ROCM_PATH:-/opt/rocm}",
    "mpi_path": "${MPI_PATH:-$HOME/softwares/ompi}"
  }
}
```

**Key Feature:** If `WORKDIR` is not set, the default `$HOME/code/rti/scripts/rccl` will expand `$HOME` automatically!
### Flexible Binary Paths

Specify test binary locations in multiple ways for maximum flexibility:

#### 1. Default (Relative to build_dir/test/)

```json
{
  "binary": "all_reduce_perf"
}
```
Result: `<workdir>/build_debug_cov_on_tests_on/test/all_reduce_perf`

#### 2. Absolute Path

```json
{
  "binary": "/opt/custom_rccl_build/test/all_reduce_perf"
}
```
Result: Uses the absolute path directly

#### 3. Environment Variable in Binary Name

```json
{
  "binary": "${MY_RCCL_TESTS}/all_reduce_perf"
}
```
Result: Expands `$MY_RCCL_TESTS` environment variable

#### 4. Home Directory Expansion

```json
{
  "binary": "~/my_builds/rccl/test/all_reduce_perf"
}
```
Result: Expands `~` to home directory
#### 5. Using test_binary_dir in Paths

```json
{
  "paths": {
    "test_binary_dir": "${RCCL_TEST_BIN_DIR}"
  },
  "test_configurations": {
    "my_tests": {
      "binary": "all_reduce_perf"
    }
  }
}
```
Result: `${RCCL_TEST_BIN_DIR}/all_reduce_perf`

#### 6. Using test_binary_dir in Test Config

```json
{
  "test_configurations": {
    "my_tests": {
      "tests": [
        {
          "name": "CustomBinary",
          "test_binary_dir": "/opt/rccl/tests",
          "binary": "all_reduce_perf"
        }
      ]
    }
  }
}
```
Result: `/opt/rccl/tests/all_reduce_perf`

#### Resolution Priority Order

1. **Absolute path in binary** - Highest priority
2. **Environment variable expansion** (if results in absolute path)
3. **test_binary_dir in test config** + binary
4. **test_binary_dir in paths** + binary
5. **Default:** `build_dir/test/` + binary - Lowest priority
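The resolution order can be expressed as a short function. This is a sketch under assumptions rather than the runner's code: `resolve_binary` is a hypothetical name, `string.Template.safe_substitute` stands in for whatever expansion the runner uses (it handles `$VAR` and `${VAR}` but not `${VAR:-default}`), and POSIX paths are assumed.

```python
import os
from pathlib import Path
from string import Template


def _expand(value, env):
    """Expand $VAR / ${VAR} from env, then a leading ~."""
    return os.path.expanduser(Template(value).safe_substitute(env))


def resolve_binary(test, paths, build_dir, env):
    """Resolve a test's binary following the five-step priority list above."""
    binary = Path(_expand(test["binary"], env))
    # Priorities 1 and 2: an absolute path, written directly or after expansion.
    if binary.is_absolute():
        return binary
    # Priorities 3 and 4: test_binary_dir on the test, then in "paths".
    for source in (test, paths):
        if source.get("test_binary_dir"):
            return Path(_expand(source["test_binary_dir"], env)) / binary
    # Priority 5: the default location inside the build directory.
    return Path(build_dir) / "test" / binary
```

Passing `env` explicitly (for example `os.environ`) keeps the function easy to exercise with a plain dictionary.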
#### Use Cases

- **CI/CD with pre-built binaries:** Use absolute paths or `RCCL_TEST_BIN_DIR`
- **Multiple RCCL versions:** Different `test_binary_dir` per configuration
- **Custom build locations:** Environment variables for flexibility
- **Standard builds:** Use default (no configuration needed)

#### Verbose Mode

Use `--verbose` to see the resolved binary path:
```bash
python test_runner.py --config test.json --verbose
```

Output includes:
```
Binary: all_reduce_perf
Binary path: /home/user/code/rti/scripts/rccl/build_debug_cov_on_tests_on/test/all_reduce_perf
```
### Configuration Best Practices

**Reduce Repetition:** Move common values to configuration level

```json
{
  "test_configurations": {
    "p2p_tests": {
      "timeout": 120,
      "env_variables": {
        "NCCL_P2P_USE_CUDA_MEMCPY": "1",
        "NCCL_LEGACY_CUDA_REGISTER": "1"
      },
      "tests": [
        {
          "name": "Test1"
          // Inherits timeout and env vars from config level
        },
        {
          "name": "Test2",
          "timeout": 300
          // Overrides timeout, inherits env vars
        }
      ]
    }
  }
}
```

As before, the `//` lines are annotations only; standard JSON has no comment syntax.

**Benefits:**
- ✅ Single source of truth for common settings
- ✅ Easier maintenance
- ✅ Tests can still override when needed
- ✅ Cleaner, more readable configurations
## Development and Testing

### Validate Configuration

```bash
# Test JSON syntax
python3 -m json.tool test_config_sample.json

# Test configuration loading
python3 -c "from lib.test_config import TestConfigProcessor; \
    p = TestConfigProcessor('test_config_sample.json'); \
    print('Configuration valid!')"

# Dry run (validate without executing)
python test_runner.py --config test_config_sample.json --skip-tests --verbose
```

### Adding New Tests

1. Add test definition to appropriate configuration in JSON file
2. Specify `is_gtest`, `description`, and required fields
3. Test with dry run first: `--skip-tests --verbose`
4. Run actual test: `--test-name YourTest --verbose`

### Test Type Handling

The test runner uses a boolean `is_gtest` flag to distinguish between test types:

- **`is_gtest: true`** (default): Uses GTest framework with `--gtest_filter=<filter>` syntax
- **`is_gtest: false`**: Runs binary with plain arguments (for performance tests, custom scripts, etc.)

This simplified approach eliminates the need for multiple test type conditionals while supporting all test categories (gtest, perf, custom).
## Troubleshooting

### "Configuration file not found"
- Check the path to your JSON config file
- Use absolute paths or ensure you're in the correct directory
- Verify file permissions

### "MPI path not found"
- Update `paths.mpi_path` in your configuration
- Ensure MPI is installed: `which mpirun`
- Check the `MPI_PATH` environment variable

### "Test binary not found"
- Build first: remove `--no-build` flag
- Check the binary name in the `build/test/` directory
- Verify that the CMake build succeeded

### Multi-node tests hang
- Ensure SLURM allocation or hostfile is configured
- Check network connectivity: `ping other_node`
- Verify MPI can reach nodes: `mpirun -np 2 hostname`
- Check firewall settings

### CMake configuration fails
- Check ROCm path: `ls $ROCM_PATH`
- Verify compiler: `$ROCM_PATH/bin/amdclang++ --version`
- Check MPI path: `ls $MPI_PATH/bin/mpirun`

### Coverage report fails
- Ensure LLVM tools are available: `which llvm-profdata llvm-cov`
- Check for `.profraw` files in build directory
- Verify coverage build flags were set correctly
- Run with `--verbose` to see detailed error messages

### "LLVM_PROFILE_FILE not being used"
- Ensure `--coverage-report` flag is specified
- Check that tests are actually executing (not skipped)
- Verify environment variables with `--verbose`

---
## Appendix: Environment Variables Reference

This section provides a quick reference for all environment variables supported by the test runner.

### Library and Build Location

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory. Automatically skips build. | `export RCCL_LIB_PATH=/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. | `export RCCL_BUILD_DIR=/home/user/rccl_builds/debug` |

**Requirements**: Directory must contain `librccl.so` and `test/` subdirectory.

### Configuration Paths

These override the paths specified in the JSON configuration file:

| Variable | Description | Example |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | `export WORKDIR=/home/user/code/rccl` |
| `ROCM_PATH` | ROCm installation path | `export ROCM_PATH=/opt/rocm-6.0` |
| `MPI_PATH` | MPI installation path | `export MPI_PATH=/usr/local/openmpi` |

### Test Execution

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests | `export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile` |

**Note**: Falls back to `~/.mpi_hostfile` if not set. For SLURM environments, hostfile is auto-generated from `SLURM_NODELIST`.
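The hostfile fallback in the note can be sketched as follows. `resolve_hostfile` is a hypothetical helper; it covers only the explicit variable and the `~/.mpi_hostfile` fallback. The SLURM auto-generation step is left out because this README does not say how it is ordered relative to the other two.

```python
from pathlib import Path


def resolve_hostfile(env, home):
    """Return the hostfile to use, or None when neither source is available."""
    # 1. An explicit RCCL_TEST_MPI_HOSTFILE wins.
    explicit = env.get("RCCL_TEST_MPI_HOSTFILE")
    if explicit:
        return Path(explicit).expanduser()
    # 2. Otherwise fall back to ~/.mpi_hostfile, if that file exists.
    fallback = Path(home) / ".mpi_hostfile"
    return fallback if fallback.is_file() else None
```

In real use this would be called as `resolve_hostfile(os.environ, Path.home())`; taking both as parameters keeps the lookup testable without touching the user's home directory.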
### Test-Specific Variables

These can be set globally or specified in the JSON configuration per test:

| Variable | Description | Example |
|----------|-------------|---------|
| `NCCL_DEBUG` | NCCL debug level (VERSION, WARN, INFO, TRACE) | `export NCCL_DEBUG=INFO` |
| `NCCL_DEBUG_SUBSYS` | NCCL debug subsystems to enable | `export NCCL_DEBUG_SUBSYS=INIT,COLL,NET` |
| `HSA_NO_SCRATCH_RECLAIM` | Disable HIP scratch memory reclaim | `export HSA_NO_SCRATCH_RECLAIM=1` |
| `NCCL_LAUNCH_MODE` | NCCL launch mode (GROUP, PARALLEL) | `export NCCL_LAUNCH_MODE=GROUP` |

### Coverage and Profiling

| Variable | Description | Example |
|----------|-------------|---------|
| `LLVM_PROFILE_FILE` | LLVM coverage profile output pattern | `export LLVM_PROFILE_FILE=rccl_%p_%m.profraw` |

**Note**: Automatically set by test runner to prevent collisions. Manual override not recommended.

### Complete Example

```bash
#!/bin/bash
# Configure paths
export WORKDIR=/home/user/code/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/openmpi

# Use pre-built library
export RCCL_LIB_PATH=/home/user/rccl_builds/instrumented

# Configure MPI
export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile

# Enable debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET

# Run tests
python test_runner.py --config my_tests.json --verbose
```

### Variable Priority

When the same configuration can be specified in multiple places, the priority is:

1. **Environment variables** (highest priority)
2. **Test-specific configuration** (in JSON)
3. **Test suite configuration** (in JSON)
4. **Test configuration defaults** (in JSON)
5. **Built-in defaults** (lowest priority)

**Example**: If `ROCM_PATH` is set as an environment variable, it overrides the `rocm_path` value in the JSON configuration file.