# NCCL Tuner Configuration Scripts

This directory contains scripts for optimizing NCCL tuner configurations based on performance data.

## optimize_config.py

A Python script that reads performance data from CSV files and generates optimal NCCL tuner configurations.

### Usage

```bash
python scripts/optimize_config.py [options] <input_csv_file>
```

### Options

- `-o, --output FILE`: Output NCCL tuner config file (default: `nccl_tuner.conf`)
- `-m, --metric METRIC`: Metric to optimize for, one of `cost_metric`, `bandwidth_gbps`, or `latency_us`
- `--no-header`: Don't add header comments to output file
- `--dry-run`: Print configurations without writing to file

### CSV Input Format

The input CSV file should have the following columns:

```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
```

**Required columns:**
- `collective`: NCCL collective type (`allreduce`, `broadcast`, `reduce`, etc.)
- `size_bytes`: Message size in bytes
- `algorithm`: NCCL algorithm (`tree`, `ring`, `nvls`, etc.)
- `protocol`: NCCL protocol (`simple`, `ll`, `ll128`)
- `channels`: Number of channels (or `-1` for default)
- `nodes`: Number of nodes (or `-1` for any)
- `ranks`: Number of ranks (or `-1` for any)
- `pipeOps`: Number of pipeline operations (or `-1` for any)
- `regBuff`: Registered buffer flag (`0`, `1`, or `-1` for any)

**Metric columns (at least one must be present):**
- `cost_metric`: Cost value (selected with `-m cost_metric`)
- `bandwidth_gbps`: Bandwidth in GB/s (higher is better)
- `latency_us`: Latency in microseconds (lower is better)
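
Before running the optimizer, it can help to confirm that a CSV file has the columns described above. The following is a minimal sketch of such a check. `check_header` is a hypothetical helper and not part of `optimize_config.py`, and the data row is made up for illustration.

```python
import csv
import io

# Column names taken from the CSV header documented above.
REQUIRED_COLUMNS = [
    "collective", "size_bytes", "algorithm", "protocol",
    "channels", "nodes", "ranks", "pipeOps", "regBuff",
]
METRIC_COLUMNS = ["cost_metric", "bandwidth_gbps", "latency_us"]


def check_header(fieldnames):
    """Return (missing_required, present_metrics) for a CSV header.

    Hypothetical helper; not part of optimize_config.py.
    """
    names = set(fieldnames or [])
    missing = [c for c in REQUIRED_COLUMNS if c not in names]
    metrics = [c for c in METRIC_COLUMNS if c in names]
    return missing, metrics


# Example with an in-memory CSV; the data row is illustrative only.
sample = io.StringIO(
    "collective,size_bytes,algorithm,protocol,channels,nodes,ranks,"
    "pipeOps,regBuff,bandwidth_gbps\n"
    "allreduce,1024,tree,simple,-1,1,8,-1,-1,12.5\n"
)
reader = csv.DictReader(sample)
missing, metrics = check_header(reader.fieldnames)
print("missing required columns:", missing)
print("metric columns found:", metrics)
```

A file is usable when the first list is empty and the second is not.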

### Examples

**Basic usage with cost optimization:**
```bash
python scripts/optimize_config.py sample_performance_data.csv
```

**Optimize for bandwidth and write to custom file:**
```bash
python scripts/optimize_config.py -m bandwidth_gbps -o my_tuner.conf performance_data.csv
```

**Preview configurations without writing:**
```bash
python scripts/optimize_config.py --dry-run performance_data.csv
```

### How It Works

1. **Data Loading**: Reads CSV performance data and validates format
2. **Grouping**: Groups data by collective type, topology (nodes/ranks), and other parameters
3. **Size Ranges**: Automatically bins data into size ranges for optimization
4. **Optimization**: Finds the best performing configuration for each group/size combination
5. **Output**: Generates entries in the NCCL tuner config format and appends them to the specified file
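
The grouping and optimization steps above can be sketched as follows. This is an illustration of the idea, not the script's actual implementation: `pick_best` is a hypothetical function, the grouping key is a simplification, the rows are made-up values, and treating `cost_metric` as lower-is-better is an assumption.

```python
from collections import defaultdict

# bandwidth_gbps is documented as higher-is-better and latency_us as
# lower-is-better. Treating cost_metric as lower-is-better is an assumption.
HIGHER_IS_BETTER = {"bandwidth_gbps"}


def pick_best(rows, metric, size_ranges):
    """Return the best row per (collective, nodes, ranks, size range)."""
    groups = defaultdict(list)
    for row in rows:
        size = int(row["size_bytes"])
        for lo, hi in size_ranges:
            if lo <= size <= hi:
                key = (row["collective"], row["nodes"], row["ranks"], (lo, hi))
                groups[key].append(row)
                break  # ranges are assumed not to overlap

    choose = max if metric in HIGHER_IS_BETTER else min
    return {
        key: choose(candidates, key=lambda r: float(r[metric]))
        for key, candidates in groups.items()
    }


# Illustrative rows only; these are not real measurements.
rows = [
    {"collective": "allreduce", "size_bytes": "512", "algorithm": "tree",
     "protocol": "ll", "nodes": "1", "ranks": "8", "latency_us": "10.0"},
    {"collective": "allreduce", "size_bytes": "512", "algorithm": "ring",
     "protocol": "simple", "nodes": "1", "ranks": "8", "latency_us": "14.0"},
    {"collective": "allreduce", "size_bytes": "32768", "algorithm": "tree",
     "protocol": "ll", "nodes": "1", "ranks": "8", "latency_us": "55.0"},
    {"collective": "allreduce", "size_bytes": "32768", "algorithm": "ring",
     "protocol": "simple", "nodes": "1", "ranks": "8", "latency_us": "40.0"},
]

for key, row in pick_best(rows, "latency_us", [(0, 1024), (1025, 65536)]).items():
    print(key, "->", row["algorithm"], row["protocol"])
```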

### Default Size Ranges

The script uses these default size ranges (in bytes):
- Small: 0 - 1,024
- Medium: 1,025 - 65,536
- Large: 65,537 - 1,048,576
- XLarge: 1,048,577 - 16,777,216
- XXLarge: 16,777,217 - 4,294,967,295
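
As a quick illustration of how a message size maps onto these ranges, here is a small lookup sketch. The range bounds are copied from the list above; `size_range_for` is a hypothetical helper, and the script itself may implement binning differently.

```python
# (name, min_bytes, max_bytes), both bounds inclusive, as listed above.
DEFAULT_SIZE_RANGES = [
    ("Small", 0, 1_024),
    ("Medium", 1_025, 65_536),
    ("Large", 65_537, 1_048_576),
    ("XLarge", 1_048_577, 16_777_216),
    ("XXLarge", 16_777_217, 4_294_967_295),
]


def size_range_for(size_bytes):
    """Return (name, min_bytes, max_bytes) for a size, or None if out of range."""
    for name, lo, hi in DEFAULT_SIZE_RANGES:
        if lo <= size_bytes <= hi:
            return name, lo, hi
    return None


for size in (1_024, 1_025, 1_048_576, 4_294_967_296):
    print(size, "->", size_range_for(size))
```

The boundaries are inclusive on both ends, so 1,024 bytes falls in Small and 1,025 bytes in Medium. Sizes above 4,294,967,295 match no default range.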

### Sample Data

See `sample_performance_data.csv` for an example of the expected input format.

### Integration with NCCL

The generated configuration file can be used directly with the NCCL tuner plugin:

```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/optimized_config.conf
export NCCL_TUNER_PLUGIN=/path/to/libnccl-tuner.so
mpirun -np 8 your_nccl_application
```
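
When the launch is driven from a Python harness rather than a shell, the same setup can be sketched as below. `build_launch` is a hypothetical helper; the environment variable names and the `mpirun` command line are the ones from the shell example above, and the paths are placeholders.

```python
import os
import shlex


def build_launch(config_file, plugin_lib, app, nprocs=8):
    """Return (env, cmd) for launching `app` under mpirun with the tuner set up."""
    env = dict(os.environ)  # copy, so the current process is left untouched
    env["NCCL_TUNER_CONFIG_FILE"] = config_file
    env["NCCL_TUNER_PLUGIN"] = plugin_lib
    cmd = ["mpirun", "-np", str(nprocs), app]
    return env, cmd


env, cmd = build_launch(
    "/path/to/optimized_config.conf",  # placeholder path
    "/path/to/libnccl-tuner.so",       # placeholder path
    "your_nccl_application",
)
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```

Some MPI launchers need an explicit option to forward environment variables to processes on remote nodes, so check your launcher's documentation for multi-node runs.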

### Performance Data Collection

To collect performance data for optimization, you can:

1. **Use NCCL benchmarks** with different algorithm/protocol combinations
2. **Profile your applications** with various tuner settings
3. **Run systematic sweeps** across parameter combinations
4. **Use NCCL debug output** to collect timing information

The key is to have comprehensive data covering:
- Different message sizes (small to large)
- Various topologies (single node, multi-node)
- All relevant algorithm/protocol combinations
- Different channel counts and pipeline configurations
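
One way to organize a systematic sweep is to enumerate the parameter combinations first and write them to a CSV template whose metric columns are filled in as measurements come back. The sketch below does only that planning step. `sweep_plan` and `write_template` are hypothetical helpers, and how each configuration is actually measured is left to your benchmark setup.

```python
import csv
import io
import itertools

# Column order from the CSV input format documented in this README.
FIELDNAMES = [
    "collective", "size_bytes", "algorithm", "protocol", "channels",
    "nodes", "ranks", "pipeOps", "regBuff",
    "cost_metric", "bandwidth_gbps", "latency_us",
]


def sweep_plan(collectives, sizes, algorithms, protocols, nodes=-1, ranks=-1):
    """Yield one row per parameter combination, without metric values."""
    for coll, size, algo, proto in itertools.product(
        collectives, sizes, algorithms, protocols
    ):
        yield {
            "collective": coll, "size_bytes": size,
            "algorithm": algo, "protocol": proto,
            "channels": -1, "nodes": nodes, "ranks": ranks,
            "pipeOps": -1, "regBuff": -1,
        }


def write_template(rows, out):
    """Write rows as CSV; metric columns stay empty until measured."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


buf = io.StringIO()
plan = sweep_plan(["allreduce"], [1024, 65536], ["tree", "ring"], ["simple", "ll"])
write_template(plan, buf)
print(buf.getvalue())
```

Rows must have at least one metric column filled in before the file is passed to `optimize_config.py`.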