
# RCCL Test Runner

A Python-based test runner focused on RCCL unit and functional tests with hierarchical configuration support and integrated code coverage reporting. Extensible to support performance benchmarks, MPI tests, and custom test scripts.

## Overview

This test runner provides a maintainable, extensible alternative to shell-based test execution. It uses JSON configuration files with hierarchical inheritance and integrates with LLVM code coverage tools.

## Key Features

- **Multiple Test Types**: Support for GTest, performance tests, and custom executables
- **Hierarchical Configuration**: Use `"extends"` directive to inherit and merge configurations
- **Environment Variable Management**: Global, configuration, suite, and test-specific environment variables
- **Path Variable Expansion**: Use environment variables in paths with nested default value expansion
- **Custom Library Support**: Use pre-built RCCL libraries from custom locations via environment variables
- **Configurable Build System**: Customize CMake options, environment variables, and parallel jobs via config
- **MPI Support**: Full support for multi-rank and multi-node tests
- **Flexible Test Filtering**: Run all tests, specific test suites, or individual tests
- **Build Integration**: Automated RCCL building with CMake
- **Code Coverage**: Integrated LLVM coverage report generation (HTML and text)
- **Clean Output**: Verbose MPI messages are filtered out automatically (pass `--verbose` to show them)
- **Verbose Logging**: Detailed output for debugging and troubleshooting

## Quick Start

### Basic Usage

```bash
# Run with specific configuration
python test_runner.py --config my_tests.json

# Run with verbose output
python test_runner.py --config my_tests.json --verbose

# Run specific test by name
python test_runner.py --config my_tests.json --test-name SHM_ComprehensiveWorkflow
```

### Generate Coverage Report

```bash
# Build, run tests, and generate coverage report
python test_runner.py --config test_config_sample.json --coverage-report --verbose

# Use existing build and generate coverage
python test_runner.py --config test_config_sample.json --no-build --coverage-report
```

### Use Custom RCCL Library

```bash
# Use pre-built RCCL library from custom location
export RCCL_LIB_PATH=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json

# Or use RCCL_BUILD_DIR (alternative name)
export RCCL_BUILD_DIR=/path/to/custom/rccl/build
python test_runner.py --config test_config_sample.json

# When set, build step is automatically skipped
# --no-build is not needed
```

## Environment Variables

The test runner supports the following environment variables to customize behavior:

### Library and Build Configuration

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory (contains `librccl.so` and `test/` subdirectory). When set, the build step is automatically skipped. | `/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. Either variable can be used. | `/path/to/rccl/build` |
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests. | `~/.mpi_hostfile` |

### Configuration Path Variables

These can be overridden via environment variables or specified in the JSON config:

| Variable | Description | Default |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | Current rccl repository root |
| `ROCM_PATH` | ROCm installation path | `/opt/rocm` |
| `MPI_PATH` | MPI installation path | System default or config-specific |

### Priority Order

When determining which RCCL library to use, the test runner follows this priority:

1. **`RCCL_LIB_PATH` or `RCCL_BUILD_DIR` environment variable** (highest priority)
   - Skips build automatically
   - Must contain `librccl.so` and `test/` subdirectory
2. **`--no-build` flag with local build**
   - Uses local `build_debug_cov_on_tests_on/` directory
   - Requires prior build
3. **Default build process** (lowest priority)
   - Builds RCCL in timestamped directory
   - Uses CMake configuration from JSON
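
The priority list above can be sketched as a small selection function. This is an illustration only, not the runner's actual code: the function names are hypothetical, and the order in which the two environment variables are checked when both are set is an assumption.

```python
from pathlib import Path
from typing import Mapping, Optional, Tuple

# Directory name taken from this README.
LOCAL_BUILD_DIR = "build_debug_cov_on_tests_on"


def select_rccl_build(
    env: Mapping[str, str], no_build: bool, workdir: str
) -> Tuple[Optional[Path], bool]:
    """Return (build_dir, needs_build) following the documented priority.

    build_dir is None when a fresh build has to choose its own directory.
    """
    # Priority 1: a pre-built library named by either environment variable.
    # Assumption: RCCL_LIB_PATH is checked first when both are set.
    for var in ("RCCL_LIB_PATH", "RCCL_BUILD_DIR"):
        value = env.get(var)
        if value:
            return Path(value), False
    # Priority 2: reuse the existing local build.
    if no_build:
        return Path(workdir) / LOCAL_BUILD_DIR, False
    # Priority 3: default build process.
    return None, True


def looks_like_rccl_build(build_dir: Path) -> bool:
    """Check the documented layout: librccl.so plus a test/ subdirectory."""
    return (build_dir / "librccl.so").is_file() and (build_dir / "test").is_dir()
```

Passing the environment as a mapping (for example `os.environ`) keeps the function easy to test without modifying the real environment.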

**Example Usage:**

```bash
# Priority 1: Use custom library (build skipped automatically)
export RCCL_LIB_PATH=/path/to/prebuilt/rccl/build
python test_runner.py --config my_tests.json

# Priority 2: Use existing local build (no new build)
python test_runner.py --config my_tests.json --no-build

# Priority 3: Fresh build (default)
python test_runner.py --config my_tests.json
```

## Configuration File Format

### Basic Structure

```json
{
  "system_configurations": {
    "name": "system-name",
    "description": "System description"
  },
  "paths": {
    "workdir": "/path/to/rccl",
    "rocm_path": "/opt/rocm",
    "mpi_path": "/path/to/mpi"
  },
  "env_variables": {
    "GLOBAL_VAR": "value"
  },
  "test_configurations": {
    "config_name": {
      "env_variables": {...},
      "tests": [...]
    }
  },
  "test_suites": [
    {
      "name": "Test Suite Name",
      "config": "config_name",
      "enabled": true
    }
  ]
}
```

### Environment Variable Expansion in Paths

The `paths` section supports environment variable expansion, allowing you to avoid hardcoding paths and make configurations portable across different systems.

#### Supported Syntax

```json
{
  "paths": {
    "workdir": "${HOME}/code/rccl",
    "rocm_path": "$ROCM_PATH",
    "mpi_path": "${MPI_PATH:-/opt/mpi}"
  }
}
```

**Syntax Options:**
- `${VAR}` - Expands to the value of `VAR`, left as-is if undefined
- `$VAR` - Expands to the value of `VAR`, left as-is if undefined
- `${VAR:-default}` - Expands to the value of `VAR`, or `default` if undefined (bash-style default)
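
To make the three forms concrete, here is a minimal sketch of an expander with the behavior described above, including nested defaults. It is not the runner's implementation; `expand_vars` and its helpers are hypothetical names. Note that, following the wording above, a default is used only when the variable is undefined (bash also applies `:-` to empty values).

```python
import re
from typing import Mapping

_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def expand_vars(text: str, env: Mapping[str, str]) -> str:
    """Expand $VAR, ${VAR} and ${VAR:-default}; unknown variables stay as-is."""
    out = []
    i = 0
    while i < len(text):
        if text[i] != "$":
            out.append(text[i])
            i += 1
        elif text.startswith("${", i):
            end = _matching_brace(text, i + 1)
            if end == -1:  # unbalanced braces: keep the rest literally
                out.append(text[i:])
                break
            out.append(_expand_braced(text[i + 2:end], text[i:end + 1], env))
            i = end + 1
        else:
            match = _NAME.match(text, i + 1)
            if match:  # $VAR form
                out.append(env.get(match.group(), text[i:match.end()]))
                i = match.end()
            else:  # a lone "$"
                out.append("$")
                i += 1
    return "".join(out)


def _matching_brace(text: str, open_idx: int) -> int:
    """Return the index of the '}' that closes the '{' at open_idx, or -1."""
    depth = 0
    for j in range(open_idx, len(text)):
        if text[j] == "{":
            depth += 1
        elif text[j] == "}":
            depth -= 1
            if depth == 0:
                return j
    return -1


def _expand_braced(body: str, original: str, env: Mapping[str, str]) -> str:
    """Expand the inside of ${...}; the default may contain further variables."""
    name, sep, default = body.partition(":-")
    if not _NAME.fullmatch(name):
        return original
    if name in env:
        return env[name]
    if sep:
        return expand_vars(default, env)  # nested expansion of the default
    return original


env = {"HOME": "/home/u"}
print(expand_vars("${WORKDIR:-${HOME}/code/rccl}", env))  # /home/u/code/rccl
```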

#### Examples

```json
{
  "paths": {
    "workdir": "${WORKDIR:-${HOME}/code/rti/scripts/rccl}",
    "rocm_path": "${ROCM_PATH:-/opt/rocm}",
    "mpi_path": "${MPI_PATH:-${HOME}/softwares/ompi}"
  }
}
```

**Usage:**
```bash
# Use environment variables
export WORKDIR=/custom/path/to/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/mpi

python test_runner.py --config test_config_sample.json

# Or use defaults (no environment variables set)
python test_runner.py --config test_config_sample.json
```

**Benefits:**
- **Portability**: Share configurations across different systems
- **Flexibility**: Override paths without modifying config files
- **CI/CD**: Easy integration with build systems and pipelines
- **Multi-user**: Same config works for different user environments

### Test Types Supported

The test runner uses the `is_gtest` boolean flag to distinguish between test types:

- **`is_gtest: true`** (default) - GTest-based unit tests using `--gtest_filter` syntax
- **`is_gtest: false`** - Non-GTest tests (performance benchmarks, custom scripts, etc.)

This simplified approach supports all test categories while reducing configuration complexity.

#### GTest Tests (`is_gtest: true`)

Used for unit tests written with the GTest framework. The `test_filter` field uses GTest filter syntax.

```json
{
  "name": "AllReduce_InPlace",
  "description": "Test AllReduce collective operation with in-place buffers",
  "is_gtest": true,
  "binary": "rccl-UnitTests",
  "test_filter": "AllReduce.InPlace",
  "num_ranks": 1,
  "num_nodes": 1,
  "timeout": 60
}
```

**Command generated:**
```bash
./rccl-UnitTests --gtest_filter=AllReduce.InPlace
```

#### Performance Tests (`is_gtest: false`)

Used for performance benchmarks. Arguments are passed directly without GTest syntax.

```json
{
  "name": "Perf_Bandwidth",
  "description": "Bandwidth benchmark for AllReduce",
  "is_gtest": false,
  "binary": "all_reduce_perf",
  "command_args": "-b 8 -e 128M -f 2",
  "num_ranks": 2,
  "num_nodes": 1,
  "timeout": 300
}
```

**Command generated:**
```bash
mpirun -np 2 ./all_reduce_perf -b 8 -e 128M -f 2
```

#### Custom Scripts (`is_gtest: false`)

Used for custom validation scripts or any non-GTest executables.

```json
{
  "name": "Custom_Validation",
  "description": "Custom GPU validation script",
  "is_gtest": false,
  "binary": "validate_gpus.sh",
  "command_args": "--full-check --verbose",
  "num_ranks": 1,
  "num_nodes": 1,
  "timeout": 120
}
```

**Command generated:**
```bash
./validate_gpus.sh --full-check --verbose
```

**Key Differences:**

| Feature | `is_gtest: true` | `is_gtest: false` |
|---------|------------------|-------------------|
| Test framework | GTest (Google Test) | Any executable |
| Filter syntax | `--gtest_filter=<pattern>` | Plain arguments |
| `test_filter` field | GTest pattern (e.g., `Suite.Test*`) | Passed as plain argument |
| `command_args` field | Appended after filter | Primary argument method |
| Typical use cases | Unit tests, functional tests | Performance tests, custom scripts |
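
The table can be read as the following argument-building rule. This is a sketch based only on the generated commands shown above, not the runner's code; `build_test_command` is a hypothetical name, and any MPI launcher prefix (`mpirun -np ...`) is deliberately left out.

```python
import shlex
from typing import Any, Dict, List


def build_test_command(test: Dict[str, Any]) -> List[str]:
    """Build the argv for one test, without an MPI launcher prefix."""
    argv = [f"./{test['binary']}"]
    test_filter = test.get("test_filter")
    if test_filter:
        if test.get("is_gtest", True):
            # GTest: the filter becomes a --gtest_filter option.
            argv.append(f"--gtest_filter={test_filter}")
        else:
            # Non-GTest: the filter is passed through as a plain argument.
            argv.append(test_filter)
    # command_args is appended in both cases.
    argv.extend(shlex.split(test.get("command_args", "")))
    return argv


print(build_test_command({"binary": "rccl-UnitTests", "test_filter": "AllReduce.InPlace"}))
# ['./rccl-UnitTests', '--gtest_filter=AllReduce.InPlace']
```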

### Test Definition Fields

| Field | Required | Type | Description |
|-------|----------|------|-------------|
| `name` | Yes | string | Unique test identifier |
| `description` | Recommended | string | Human-readable test description |
| `is_gtest` | Optional | boolean | Whether test uses GTest framework (default: true). Set to false for perf or custom tests |
| `binary` | Yes | string | Test binary name (relative to `<build_dir>/test/` by default) |
| `test_filter` | Optional | string | Test filter (GTest filter syntax for gtest, plain argument for non-gtest) |
| `command_args` | Optional | string | Additional command-line arguments |
| `num_ranks` | Optional | integer | Number of MPI ranks (default: 1) |
| `num_nodes` | Optional | integer | Number of nodes (default: 1) |
| `num_gpus` | Optional | integer | GPUs per node - controls rank distribution (default: 8) |
| `timeout` | Optional | integer | Timeout in seconds (0 = unlimited) |
| `env_variables` | Optional | object | Test-specific environment variables |
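
A lightweight check against this table can catch typos before a run. The sketch below is an assumption about how such a check could look, not part of the runner; `validate_test` is a hypothetical helper. Because `binary` may be inherited from a higher level, it makes most sense to validate a test after defaults have been applied.

```python
from typing import Any, Dict, List

# Field names and types are taken from the table above.
FIELD_TYPES = {
    "name": str, "description": str, "is_gtest": bool, "binary": str,
    "test_filter": str, "command_args": str, "num_ranks": int,
    "num_nodes": int, "num_gpus": int, "timeout": int, "env_variables": dict,
}
REQUIRED = ("name", "binary")


def validate_test(test: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the test looks valid."""
    errors = []
    for field in REQUIRED:
        if field not in test:
            errors.append(f"missing required field: {field}")
    for field, expected in FIELD_TYPES.items():
        if field not in test:
            continue
        value = test[field]
        ok = isinstance(value, expected)
        # bool is a subclass of int in Python, so reject it for integer fields.
        if expected is int and isinstance(value, bool):
            ok = False
        if not ok:
            errors.append(
                f"{field}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors
```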

### Configuration Inheritance

Use the `"extends"` directive to inherit from parent configurations:

```json
{
  "test_configurations": {
    "base": {
      "env_variables": {
        "NCCL_DEBUG": "INFO"
      }
    },
    "shm_tests": {
      "extends": "base",
      "env_variables": {
        "NCCL_SHM_DISABLE": "0"
      },
      "tests": [...]
    },
    "advanced_shm": {
      "extends": ["base", "shm_tests"],
      "env_variables": {
        "NCCL_SHM_USE_CUDA_MEMCPY": "1"
      }
    }
  }
}
```
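
One way to picture how `"extends"` resolves is the sketch below. It is illustrative only: `resolve_config` is a hypothetical name, and the merge rules (parents applied in list order, `env_variables` merged key by key, every other key replaced by the most specific definition) are assumptions rather than confirmed behavior.

```python
from typing import Any, Dict, Tuple


def resolve_config(
    name: str, configs: Dict[str, Dict[str, Any]], _seen: Tuple[str, ...] = ()
) -> Dict[str, Any]:
    """Return the configuration `name` with all parents merged in."""
    if name in _seen:
        raise ValueError("circular extends: " + " -> ".join(_seen + (name,)))
    if name not in configs:
        raise KeyError(f"unknown configuration: {name}")
    config = configs[name]
    parents = config.get("extends", [])
    if isinstance(parents, str):  # "extends" may be a string or a list
        parents = [parents]
    merged: Dict[str, Any] = {"env_variables": {}}
    for parent in parents:  # earlier parents are overridden by later ones
        _merge(merged, resolve_config(parent, configs, _seen + (name,)))
    _merge(merged, {k: v for k, v in config.items() if k != "extends"})
    return merged


def _merge(target: Dict[str, Any], source: Dict[str, Any]) -> None:
    for key, value in source.items():
        if key == "env_variables":
            target["env_variables"].update(value)  # merged per variable
        else:
            target[key] = value  # replaced wholesale (assumption)
```

Under these assumptions, resolving `advanced_shm` from the example above yields all three environment variables (`NCCL_DEBUG`, `NCCL_SHM_DISABLE`, `NCCL_SHM_USE_CUDA_MEMCPY`).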

### Hierarchical Defaults

To reduce repetition, you can specify default values at multiple levels with a clear override hierarchy:

**Priority Order (highest to lowest):**
1. **Individual test** - highest priority, overrides everything
2. **Test suite level** - overrides configuration defaults
3. **Configuration level** - base defaults for all tests in that config
4. **Built-in defaults** - system fallback values

**Supported default fields:** `is_gtest`, `binary`, `num_ranks`, `num_nodes`, `num_gpus`, `timeout`

#### Example with Three-Level Hierarchy

```json
{
  "test_configurations": {
    "p2p_tests": {
      "is_gtest": true,
      "binary": "rccl-UnitTestsMPI",
      "num_ranks": 2,
      "num_nodes": 1,
      "num_gpus": 2,
      "timeout": 120,
      "env_variables": {
        "NCCL_P2P_DISABLE": "0"
      },
      "tests": [
        {
          "name": "P2P_Basic",
          "description": "Basic P2P test",
          "test_filter": "P2pMPITest.Basic"
          // Uses config defaults: is_gtest=true, binary, num_ranks=2, num_nodes=1, num_gpus=2, timeout=120
        },
        {
          "name": "P2P_LongRunning",
          "description": "Long-running P2P test",
          "test_filter": "P2pMPITest.LongRunning",
          "timeout": 300
          // Overrides timeout=300, inherits other config defaults
        }
      ]
    }
  },
  "test_suites": [
    {
      "name": "P2P_Basic_Suite",
      "config": "p2p_tests",
      "num_ranks": 4,
      "num_gpus": 4,
      "timeout": 180
      // Suite-level: overrides config's num_ranks, num_gpus, and timeout
      // Tests in this suite use num_ranks=4, num_gpus=4, timeout=180,
      // except P2P_LongRunning, whose own timeout=300 takes priority
    },
    {
      "name": "P2P_Stress_Suite",
      "config": "p2p_tests",
      "num_nodes": 2,
      "num_ranks": 4,
      "num_gpus": 2,
      "timeout": 600
      // Suite-level: overrides config's num_nodes, num_ranks, num_gpus, and timeout
      // Tests in this suite use num_nodes=2, num_ranks=4, num_gpus=2, timeout=600,
      // except P2P_LongRunning, whose own timeout=300 takes priority
    }
  ]
}
```

The `//` comments in the JSON examples in this README are annotations only. Standard JSON has no comment syntax, so remove them before using a snippet as a real configuration file.

**Benefits:**
- **Less Repetition**: Define common values once
- **Easier Maintenance**: Update defaults in one place
- **Flexible Overrides**: Tests can still customize any field
- **Cleaner Config**: Shorter, more readable test definitions
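
The override order can be expressed as a short lookup loop. The sketch below is an illustration under stated assumptions, not the runner's code: `apply_defaults` is a hypothetical name, and the built-in fallbacks are limited to the defaults documented in the field table.

```python
from typing import Any, Dict

DEFAULT_FIELDS = ("is_gtest", "binary", "num_ranks", "num_nodes", "num_gpus", "timeout")
# Built-in fallbacks from the field table; binary and timeout have no
# documented default, so they are not listed here.
BUILTIN_DEFAULTS = {"is_gtest": True, "num_ranks": 1, "num_nodes": 1, "num_gpus": 8}


def apply_defaults(
    test: Dict[str, Any], suite: Dict[str, Any], config: Dict[str, Any]
) -> Dict[str, Any]:
    """Resolve one test's fields: test > suite > configuration > built-in."""
    resolved = dict(BUILTIN_DEFAULTS)
    for level in (config, suite, test):  # lowest to highest priority
        for field in DEFAULT_FIELDS:
            if field in level:
                resolved[field] = level[field]
    # Remaining fields (name, test_filter, ...) come from the test itself.
    for key, value in test.items():
        resolved.setdefault(key, value)
    return resolved


config = {"binary": "rccl-UnitTestsMPI", "num_ranks": 2, "num_gpus": 2, "timeout": 120}
suite = {"name": "P2P_Basic_Suite", "num_ranks": 4, "num_gpus": 4, "timeout": 180}
test = {"name": "P2P_LongRunning", "timeout": 300}
print(apply_defaults(test, suite, config)["timeout"])  # 300
```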

## Command-Line Options

```
Required:
  -c, --config CONFIG       Test configuration file (JSON format)

Optional:
  -v, --verbose             Enable verbose output (shows build paths, commands, etc.)
  -o, --output DIR          Output directory for logs and reports
  --test-name NAME          Run only specific test by name
  --no-build                Skip build step and use existing build
  --skip-tests              Skip test execution (useful with --coverage-report)
  --coverage-report         Generate code coverage report (HTML + text)
  --overwrite               Overwrite previous workspace directories
  --report-suffix SUFFIX    Suffix for report directory (default: blank)
  -h, --help                Show help message and exit
```

## Code Coverage Reports

The test runner integrates with LLVM tools to generate comprehensive code coverage reports.

### Generating Coverage

```bash
# Build and test with coverage (recommended)
python test_runner.py --config test_config_sample.json --coverage-report --verbose

# Generate report from existing profraw files
python test_runner.py --config test_config_sample.json --no-build --skip-tests --coverage-report
```

### Coverage Output

When `--coverage-report` is specified, the runner generates:

1. **HTML Report**: Visual coverage report in `reports/` directory
   - View with: `firefox reports/index.html`
   - Shows line-by-line coverage with syntax highlighting

2. **Text Report**: Function-level coverage summary
   - Location: `reports/function_coverage_report.txt`
   - Includes per-function and per-file statistics

### Coverage Implementation Details

- Uses LLVM instrumentation (`-fprofile-instr-generate -fcoverage-mapping`)
- Collects `.profraw` files during test execution
- Merges profiles with `llvm-profdata`
- Generates reports with `llvm-cov show` and `llvm-cov report`
- Filters out irrelevant files (test/, gtest, external dependencies)
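
For orientation, the merge and report steps correspond roughly to the commands built below. This is a sketch, not the runner's code: the function name, the `merged.profdata` file name, the choice of instrumented binary, and the exclusion regex are all assumptions. The `llvm-profdata` and `llvm-cov` options used are standard ones.

```python
from pathlib import Path
from typing import List, Sequence


def coverage_commands(
    profraw_files: Sequence[str],
    binary: str,
    report_dir: str,
    ignore_regex: str = r"(test/|gtest)",  # hypothetical exclusion pattern
) -> List[List[str]]:
    """Return the argv lists for merging profiles and generating reports."""
    profdata = str(Path(report_dir) / "merged.profdata")  # hypothetical name
    common = [
        binary,
        f"-instr-profile={profdata}",
        f"-ignore-filename-regex={ignore_regex}",
    ]
    return [
        # 1. Merge the raw profiles written during test execution.
        ["llvm-profdata", "merge", "-sparse", *profraw_files, "-o", profdata],
        # 2. Line-by-line HTML report.
        ["llvm-cov", "show", *common, "-format=html", f"-output-dir={report_dir}"],
        # 3. Text summary, printed to stdout.
        ["llvm-cov", "report", *common],
    ]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; the output of the last command would be redirected to a file to obtain a text report.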

## Examples

### Run All Enabled Test Suites

```bash
python test_runner.py --config test_config_sample.json --verbose
```

### Run Specific Test

```bash
python test_runner.py --config test_config_sample.json --test-name P2P_AllTests
```

### Skip Build (Use Existing)

```bash
python test_runner.py --config test_config_sample.json --no-build
```

### Build and Generate Coverage

```bash
# Full workflow: build, test, coverage
python test_runner.py --config adhoc_test_config.json --coverage-report --verbose
```

### Generate Coverage from Existing Build

```bash
# Skip build, use existing profraw files
python test_runner.py --config adhoc_test_config.json --no-build --skip-tests --coverage-report
```

### Custom Output Directory

```bash
python test_runner.py --config test_config_sample.json -o /path/to/output --verbose
```

### Run with Overwrite (Clean Previous Results)

```bash
python test_runner.py --config test_config_sample.json --overwrite --coverage-report
```

## Environment Variable Merging

Environment variables are merged hierarchically (later values override earlier):

1. **Global** `env_variables` (top-level in config)
2. **Configuration** `env_variables` (test configuration level)
3. **Test Suite** `env_variables` (suite level)
4. **Test-specific** `env_variables` (individual test level)

Example:
```json
{
  "env_variables": {
    "NCCL_DEBUG": "INFO"
  },
  "test_configurations": {
    "shm_tests": {
      "env_variables": {
        "NCCL_SHM_DISABLE": "0"
      },
      "tests": [
        {
          "name": "SHM_Test",
          "env_variables": {
            "NCCL_DEBUG": "TRACE"
          }
        }
      ]
    }
  }
}
```

Result: `NCCL_DEBUG=TRACE`, `NCCL_SHM_DISABLE=0`
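
The merge order amounts to successive dictionary updates, as in this minimal sketch (`merge_env` is a hypothetical helper, not the runner's API):

```python
from typing import Dict, Mapping, Optional


def merge_env(*levels: Optional[Mapping[str, str]]) -> Dict[str, str]:
    """Merge env_variables from lowest to highest priority; later levels win."""
    merged: Dict[str, str] = {}
    for level in levels:
        merged.update(level or {})  # a missing level contributes nothing
    return merged


result = merge_env(
    {"NCCL_DEBUG": "INFO"},      # global
    {"NCCL_SHM_DISABLE": "0"},   # configuration
    None,                        # suite (not set in the example)
    {"NCCL_DEBUG": "TRACE"},     # test
)
print(result)  # {'NCCL_DEBUG': 'TRACE', 'NCCL_SHM_DISABLE': '0'}
```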

## Test Execution

### Single-Node Tests

- All ranks run on a single node
- Multiple ranks map to different GPUs
- Examples: SHM tests, P2P tests, unit tests

```json
{
  "name": "SHM_Test",
  "num_ranks": 2,
  "num_nodes": 1
}
```

### Multi-Node Tests

- Ranks distributed across multiple nodes via MPI
- Requires SLURM allocation or hostfile configuration
- Use `num_gpus` to control ranks per node (default: 8)
- Examples: NET transport tests, InfiniBand tests

```json
{
  "name": "NET_Test_4Nodes_2GPUs",
  "num_ranks": 8,
  "num_nodes": 4,
  "num_gpus": 2
}
```

**`num_gpus` Field:**
- Controls how many MPI ranks are placed on each node
- Overrides hostfile `slots` specification
- For multi-node tests, uses `--map-by ppr:{num_gpus}:node`
- Default value: 8 (matches typical 8-GPU nodes)

**Example: 2 nodes, 1 GPU per node**
```json
{
  "name": "NET_Test_2Nodes_1GPU",
  "num_ranks": 2,
  "num_nodes": 2,
  "num_gpus": 1
}
```
Command: `mpirun -np 2 --hostfile file --map-by ppr:1:node ...`
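
The launcher prefix shown above could be assembled as in the following sketch. It is based only on the commands shown in this README; `mpirun_prefix` is a hypothetical name, and the capacity check for multi-node runs is an added sanity check, not documented runner behavior.

```python
from typing import List, Optional


def mpirun_prefix(
    num_ranks: int,
    num_nodes: int = 1,
    num_gpus: int = 8,
    hostfile: Optional[str] = None,
) -> List[str]:
    """Return the mpirun arguments that precede the test binary."""
    if num_nodes > 1 and num_ranks > num_nodes * num_gpus:
        # Assumed sanity check: ppr:<num_gpus>:node cannot place more ranks.
        raise ValueError(
            f"{num_ranks} ranks do not fit on {num_nodes} nodes "
            f"with {num_gpus} ranks per node"
        )
    if num_ranks == 1 and num_nodes == 1:
        return []  # single-rank tests run the binary directly
    argv = ["mpirun", "-np", str(num_ranks)]
    if num_nodes > 1:
        if hostfile:
            argv += ["--hostfile", hostfile]
        argv += ["--map-by", f"ppr:{num_gpus}:node"]
    return argv


print(" ".join(mpirun_prefix(2, num_nodes=2, num_gpus=1, hostfile="file")))
# mpirun -np 2 --hostfile file --map-by ppr:1:node
```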

### Setting Up Multi-Node Tests

**Option 1: MPI Hostfile**
```bash
export RCCL_TEST_MPI_HOSTFILE=/path/to/hostfile
python test_runner.py --config net_ib_test_config.json
```

**Option 2: Default Hostfile**

Create `~/.mpi_hostfile` with one node per line, each followed by its `slots` count:
```
node01 slots=8
node02 slots=8
```

## Advanced Features

### Build Configuration

Customize the RCCL build process through the `build_configuration` section in your JSON config file.

#### Basic Structure

```json
{
  "build_configuration": {
    "cmake_options": {
      "CMAKE_BUILD_TYPE": "Debug",
      "ENABLE_CODE_COVERAGE": "ON",
      "ONLY_FUNCS": "SendRecv|AllReduce"
    },
    "env_variables": {
      "HIPCC_COMPILE_FLAGS_APPEND": "-g -O1"
    },
    "parallel_jobs": 64,
    "generator": "Unix Makefiles"
  }
}
```

#### Examples

**Fast Development Build (No Coverage):**
```json
{
  "build_configuration": {
    "cmake_options": {
      "ENABLE_CODE_COVERAGE": "OFF"
    },
    "parallel_jobs": 128
  }
}
```

**Release Build:**
```json
{
  "build_configuration": {
    "cmake_options": {
      "CMAKE_BUILD_TYPE": "Release",
      "TRACE": "OFF",
      "COLLTRACE": "OFF"
    }
  }
}
```

**Test Specific Functions Only:**
```json
{
  "build_configuration": {
    "cmake_options": {
      "ONLY_FUNCS": "Broadcast|Reduce"
    }
  }
}
```

**All Options:**
- `cmake_options` - Any CMake option (user values override defaults)
- `env_variables` - Build environment variables
- `parallel_jobs` - Number of parallel build threads (default: 64)
- `generator` - CMake generator: "Unix Makefiles", "Ninja", etc.

See `BUILD_CONFIGURATION_GUIDE.md` for complete documentation.

### Enhanced Environment Variable Expansion

Default values in the `paths` section support **nested expansion**: a default can itself contain environment variables.

```json
{
  "paths": {
    "workdir": "${WORKDIR:-$HOME/code/rti/scripts/rccl}",
    "rocm_path": "${ROCM_PATH:-/opt/rocm}",
    "mpi_path": "${MPI_PATH:-$HOME/softwares/ompi}"
  }
}
```

**Key Feature:** If `WORKDIR` is not set, the default `$HOME/code/rti/scripts/rccl` is used and the `$HOME` inside it is expanded as well.

### Flexible Binary Paths

Specify test binary locations in multiple ways for maximum flexibility:

#### 1. Default (Relative to build_dir/test/)

```json
{
  "binary": "all_reduce_perf"
}
```
Result: `<workdir>/build_debug_cov_on_tests_on/test/all_reduce_perf`

#### 2. Absolute Path

```json
{
  "binary": "/opt/custom_rccl_build/test/all_reduce_perf"
}
```
Result: Uses the absolute path directly

#### 3. Environment Variable in Binary Name

```json
{
  "binary": "${MY_RCCL_TESTS}/all_reduce_perf"
}
```
Result: Expands `$MY_RCCL_TESTS` environment variable

#### 4. Home Directory Expansion

```json
{
  "binary": "~/my_builds/rccl/test/all_reduce_perf"
}
```
Result: Expands `~` to home directory

#### 5. Using test_binary_dir in Paths

```json
{
  "paths": {
    "test_binary_dir": "${RCCL_TEST_BIN_DIR}"
  },
  "test_configurations": {
    "my_tests": {
      "binary": "all_reduce_perf"
    }
  }
}
```
Result: `${RCCL_TEST_BIN_DIR}/all_reduce_perf`

#### 6. Using test_binary_dir in Test Config

```json
{
  "test_configurations": {
    "my_tests": {
      "tests": [
        {
          "name": "CustomBinary",
          "test_binary_dir": "/opt/rccl/tests",
          "binary": "all_reduce_perf"
        }
      ]
    }
  }
}
```
Result: `/opt/rccl/tests/all_reduce_perf`

#### Resolution Priority Order

1. **Absolute path in binary** - Highest priority
2. **Environment variable expansion** (if results in absolute path)
3. **test_binary_dir in test config** + binary
4. **test_binary_dir in paths** + binary
5. **Default:** `build_dir/test/` + binary - Lowest priority
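
The resolution order can be sketched as follows. This is an approximation, not the runner's code: `resolve_binary` is a hypothetical name, and it uses `os.path.expandvars`, which reads the process environment and does not understand the `${VAR:-default}` form.

```python
import os
from typing import Any, Mapping


def resolve_binary(
    test: Mapping[str, Any], paths: Mapping[str, str], build_dir: str
) -> str:
    """Resolve a test's binary path following the priority list above."""

    def expand(value: str) -> str:
        # Expand $VAR / ${VAR} first, then a leading ~.
        return os.path.expanduser(os.path.expandvars(value))

    binary = expand(test["binary"])
    # Priorities 1 and 2: an absolute path, written directly or after expansion.
    if os.path.isabs(binary):
        return binary
    # Priorities 3 and 4: test_binary_dir from the test, then from "paths".
    for source in (test, paths):
        directory = source.get("test_binary_dir")
        if directory:
            return os.path.join(expand(directory), binary)
    # Priority 5: the default location inside the build directory.
    return os.path.join(build_dir, "test", binary)
```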

#### Use Cases

- **CI/CD with pre-built binaries:** Use absolute paths or `RCCL_TEST_BIN_DIR`
- **Multiple RCCL versions:** Different `test_binary_dir` per configuration
- **Custom build locations:** Environment variables for flexibility
- **Standard builds:** Use default (no configuration needed)

#### Verbose Mode

Use `--verbose` to see the resolved binary path:
```bash
python test_runner.py --config test.json --verbose
```

Output includes:
```
Binary:  all_reduce_perf
Binary path: /home/user/code/rti/scripts/rccl/build_debug_cov_on_tests_on/test/all_reduce_perf
```

### Configuration Best Practices

**Reduce Repetition:** Move common values to configuration level

```json
{
  "test_configurations": {
    "p2p_tests": {
      "timeout": 120,
      "env_variables": {
        "NCCL_P2P_USE_CUDA_MEMCPY": "1",
        "NCCL_LEGACY_CUDA_REGISTER": "1"
      },
      "tests": [
        {
          "name": "Test1"
          // Inherits timeout and env vars from config level
        },
        {
          "name": "Test2",
          "timeout": 300
          // Overrides timeout, inherits env vars
        }
      ]
    }
  }
}
```

**Benefits:**
- ✅ Single source of truth for common settings
- ✅ Easier maintenance
- ✅ Tests can still override when needed
- ✅ Cleaner, more readable configurations

## Development and Testing

### Validate Configuration

```bash
# Test JSON syntax
python3 -m json.tool test_config_sample.json

# Test configuration loading
python3 -c "from lib.test_config import TestConfigProcessor; \
            p = TestConfigProcessor('test_config_sample.json'); \
            print('Configuration valid')"

# Dry run (validate without executing)
python test_runner.py --config test_config_sample.json --skip-tests --verbose
```

### Adding New Tests

1. Add test definition to appropriate configuration in JSON file
2. Specify `is_gtest`, `description`, and required fields
3. Test with dry run first: `--skip-tests --verbose`
4. Run actual test: `--test-name YourTest --verbose`

### Test Type Handling

The test runner uses a boolean `is_gtest` flag to distinguish between test types:

- **`is_gtest: true`** (default): Uses GTest framework with `--gtest_filter=<filter>` syntax
- **`is_gtest: false`**: Runs binary with plain arguments (for performance tests, custom scripts, etc.)

This simplified approach eliminates the need for multiple test type conditionals while supporting all test categories (gtest, perf, custom).

## Troubleshooting

### "Configuration file not found"
- Check the path to your JSON config file
- Use absolute paths or ensure you're in the correct directory
- Verify file permissions

### "MPI path not found"
- Update `paths.mpi_path` in your configuration
- Ensure MPI is installed: `which mpirun`
- Check the `MPI_PATH` environment variable

### "Test binary not found"
- Build first: remove `--no-build` flag
- Check binary name in `build/test/` directory
- Verify that the CMake build completed successfully

### Multi-node tests hang
- Ensure SLURM allocation or hostfile is configured
- Check network connectivity: `ping other_node`
- Verify MPI can reach the nodes: `mpirun -np 2 --hostfile <hostfile> hostname`
- Check firewall settings

### CMake configuration fails
- Check ROCm path: `ls $ROCM_PATH`
- Verify compiler: `$ROCM_PATH/bin/amdclang++ --version`
- Check MPI path: `ls $MPI_PATH/bin/mpirun`

### Coverage report fails
- Ensure LLVM tools are available: `which llvm-profdata llvm-cov`
- Check for `.profraw` files in build directory
- Verify coverage build flags were set correctly
- Run with `--verbose` to see detailed error messages

### "LLVM_PROFILE_FILE not being used"
- Ensure `--coverage-report` flag is specified
- Check that tests are actually executing (not skipped)
- Verify environment variables with `--verbose`

---

## Appendix: Environment Variables Reference

This section provides a quick reference for all environment variables supported by the test runner.

### Library and Build Location

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_LIB_PATH` | Path to pre-built RCCL library directory. Automatically skips build. | `export RCCL_LIB_PATH=/path/to/rccl/build` |
| `RCCL_BUILD_DIR` | Alternative name for `RCCL_LIB_PATH`. | `export RCCL_BUILD_DIR=/home/user/rccl_builds/debug` |

**Requirements**: Directory must contain `librccl.so` and `test/` subdirectory.

### Configuration Paths

These override the paths specified in the JSON configuration file:

| Variable | Description | Example |
|----------|-------------|---------|
| `WORKDIR` | RCCL source and build directory | `export WORKDIR=/home/user/code/rccl` |
| `ROCM_PATH` | ROCm installation path | `export ROCM_PATH=/opt/rocm-6.0` |
| `MPI_PATH` | MPI installation path | `export MPI_PATH=/usr/local/openmpi` |

### Test Execution

| Variable | Description | Example |
|----------|-------------|---------|
| `RCCL_TEST_MPI_HOSTFILE` | Path to MPI hostfile for multi-node tests | `export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile` |

**Note**: If `RCCL_TEST_MPI_HOSTFILE` is not set, the runner falls back to `~/.mpi_hostfile`. In SLURM environments, the hostfile is generated automatically from `SLURM_NODELIST`.

### Test-Specific Variables

These can be set globally or specified in the JSON configuration per test:

| Variable | Description | Example |
|----------|-------------|---------|
| `NCCL_DEBUG` | NCCL debug level (VERSION, WARN, INFO, TRACE) | `export NCCL_DEBUG=INFO` |
| `NCCL_DEBUG_SUBSYS` | NCCL debug subsystems to enable | `export NCCL_DEBUG_SUBSYS=INIT,COLL,NET` |
| `HSA_NO_SCRATCH_RECLAIM` | Disable HIP scratch memory reclaim | `export HSA_NO_SCRATCH_RECLAIM=1` |
| `NCCL_LAUNCH_MODE` | NCCL launch mode (GROUP, PARALLEL) | `export NCCL_LAUNCH_MODE=GROUP` |

### Coverage and Profiling

| Variable | Description | Example |
|----------|-------------|---------|
| `LLVM_PROFILE_FILE` | LLVM coverage profile output pattern | `export LLVM_PROFILE_FILE=rccl_%p_%m.profraw` |

**Note**: The test runner sets this variable automatically to prevent profile file collisions. Overriding it manually is not recommended.

### Complete Example

```bash
#!/bin/bash
# Configure paths
export WORKDIR=/home/user/code/rccl
export ROCM_PATH=/opt/rocm-6.0
export MPI_PATH=/usr/local/openmpi

# Use pre-built library
export RCCL_LIB_PATH=/home/user/rccl_builds/instrumented

# Configure MPI
export RCCL_TEST_MPI_HOSTFILE=~/.mpi_hostfile

# Enable debug output
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET

# Run tests
python test_runner.py --config my_tests.json --verbose
```

### Variable Priority

When the same configuration can be specified in multiple places, the priority is:

1. **Environment variables** (highest priority)
2. **Test-specific configuration** (in JSON)
3. **Test suite configuration** (in JSON)
4. **Test configuration defaults** (in JSON)
5. **Built-in defaults** (lowest priority)

**Example**: If `ROCM_PATH` is set as an environment variable, it overrides the `rocm_path` value in the JSON configuration file.