# NCCL Tuner Configuration Scripts

This directory contains scripts for optimizing NCCL tuner configurations based on performance data.

## optimize_config.py

A Python script that reads performance data from CSV files and generates optimal NCCL tuner configurations.

### Usage

```bash
python scripts/optimize_config.py [options] <input_csv_file>
```

### Options

- `-o, --output FILE`: Output NCCL tuner config file (default: `nccl_tuner.conf`)
- `-m, --metric METRIC`: Optimization metric (`cost_metric`, `bandwidth_gbps`, or `latency_us`)
- `--no-header`: Do not add header comments to the output file
- `--dry-run`: Print configurations without writing to a file
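
The options above map naturally onto an `argparse` parser. The following is a minimal sketch, not the script's actual source: the function name `build_parser` is hypothetical, and the default metric `cost_metric` is an assumption inferred from the "Basic usage with cost optimization" example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented options."""
    parser = argparse.ArgumentParser(
        description="Generate an NCCL tuner config from CSV performance data."
    )
    parser.add_argument("input_csv_file", help="CSV file with performance data")
    parser.add_argument("-o", "--output", metavar="FILE", default="nccl_tuner.conf",
                        help="output NCCL tuner config file")
    parser.add_argument("-m", "--metric", metavar="METRIC",
                        default="cost_metric",  # assumed default, see lead-in
                        choices=["cost_metric", "bandwidth_gbps", "latency_us"],
                        help="optimization metric")
    parser.add_argument("--no-header", action="store_true",
                        help="do not add header comments to the output file")
    parser.add_argument("--dry-run", action="store_true",
                        help="print configurations without writing to a file")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["-m", "bandwidth_gbps", "--dry-run", "perf.csv"])
print(args.metric, args.output, args.dry_run)
```

Restricting `--metric` with `choices` makes a mistyped metric name fail at parse time rather than partway through the optimization.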

### CSV Input Format

The input CSV file should have the following columns:

```csv
collective,size_bytes,algorithm,protocol,channels,nodes,ranks,pipeOps,regBuff,cost_metric,bandwidth_gbps,latency_us
```

**Required columns:**
- `collective`: NCCL collective type (`allreduce`, `broadcast`, `reduce`, etc.)
- `size_bytes`: Message size in bytes
- `algorithm`: NCCL algorithm (`tree`, `ring`, `nvls`, etc.)
- `protocol`: NCCL protocol (`simple`, `ll`, `ll128`)
- `channels`: Number of channels (or `-1` for default)
- `nodes`: Number of nodes (or `-1` for any)
- `ranks`: Number of ranks (or `-1` for any)
- `pipeOps`: Number of pipeline operations (or `-1` for any)
- `regBuff`: Registered buffer flag (`0`, `1`, or `-1` for any)

**Metric columns (at least one must be present):**
- `cost_metric`: Cost value (selected with `-m cost_metric`)
- `bandwidth_gbps`: Bandwidth in GB/s (higher is better)
- `latency_us`: Latency in microseconds (lower is better)
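
To check an input file against these rules before running the optimizer, a header check along the following lines may help. This is a sketch under stated assumptions rather than the script's own validation: `check_columns` is a hypothetical helper, treating `cost_metric` as a metric column is inferred from the header line and the `-m` choices, and the sample row contains made-up values.

```python
import csv
import io

REQUIRED_COLUMNS = ["collective", "size_bytes", "algorithm", "protocol",
                    "channels", "nodes", "ranks", "pipeOps", "regBuff"]
# Assumption: cost_metric counts as a metric column alongside the other two.
METRIC_COLUMNS = ["cost_metric", "bandwidth_gbps", "latency_us"]


def check_columns(fieldnames):
    """Return the metric columns present; raise ValueError on an invalid header."""
    missing = [c for c in REQUIRED_COLUMNS if c not in fieldnames]
    if missing:
        raise ValueError(f"missing required columns: {missing}")
    present = [c for c in METRIC_COLUMNS if c in fieldnames]
    if not present:
        raise ValueError("at least one metric column is required")
    return present


# In-memory stand-in for a CSV file; the values are illustrative only.
sample = io.StringIO(
    "collective,size_bytes,algorithm,protocol,channels,nodes,ranks,"
    "pipeOps,regBuff,bandwidth_gbps\n"
    "allreduce,1024,ring,simple,-1,1,8,-1,-1,12.5\n"
)
reader = csv.DictReader(sample)
metrics = check_columns(reader.fieldnames)
rows = list(reader)
print(metrics, rows[0]["algorithm"])
```

Note that `csv.DictReader` returns every field as a string, so numeric columns such as `size_bytes` still need converting before comparison.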

### Examples

**Basic usage with cost optimization:**
```bash
python scripts/optimize_config.py sample_performance_data.csv
```

**Optimize for bandwidth and write to custom file:**
```bash
python scripts/optimize_config.py -m bandwidth_gbps -o my_tuner.conf performance_data.csv
```

**Preview configurations without writing:**
```bash
python scripts/optimize_config.py --dry-run performance_data.csv
```

### How It Works

1. **Data Loading**: Reads the CSV performance data and validates its format
2. **Grouping**: Groups data by collective type, topology (nodes/ranks), and other parameters
3. **Size Ranges**: Automatically bins data into size ranges for optimization
4. **Optimization**: Finds the best-performing configuration for each group/size combination
5. **Output**: Generates entries in NCCL tuner config format and appends them to the specified file
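
The grouping and optimization steps can be sketched as follows. This is an illustration, not the script's implementation: `best_configs` is a hypothetical function, size binning is omitted so each distinct `size_bytes` is its own group, the "lower is better" direction for `cost_metric` is an assumption, and the measurements are made up.

```python
from collections import defaultdict

# Whether a larger value is better for each metric. The direction for
# cost_metric (lower is better) is an assumption, not stated in this README.
HIGHER_IS_BETTER = {"bandwidth_gbps": True, "latency_us": False, "cost_metric": False}


def best_configs(rows, metric):
    """Pick the best row per (collective, nodes, ranks, size_bytes) group."""
    groups = defaultdict(list)
    for row in rows:
        key = (row["collective"], row["nodes"], row["ranks"], row["size_bytes"])
        groups[key].append(row)
    pick = max if HIGHER_IS_BETTER[metric] else min
    return {key: pick(members, key=lambda r: r[metric])
            for key, members in groups.items()}


rows = [  # made-up measurements for illustration only
    {"collective": "allreduce", "nodes": 1, "ranks": 8, "size_bytes": 1024,
     "algorithm": "tree", "protocol": "ll",
     "bandwidth_gbps": 1.2, "latency_us": 9.0},
    {"collective": "allreduce", "nodes": 1, "ranks": 8, "size_bytes": 1024,
     "algorithm": "ring", "protocol": "simple",
     "bandwidth_gbps": 0.8, "latency_us": 14.0},
]
best = best_configs(rows, "bandwidth_gbps")[("allreduce", 1, 8, 1024)]
print(best["algorithm"], best["protocol"])
```

Keeping the comparison direction in one table means a new metric only needs one extra entry rather than another branch in the selection logic.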

### Default Size Ranges

The script uses these default size ranges (in bytes):
- Small: 0 - 1,024
- Medium: 1,025 - 65,536
- Large: 65,537 - 1,048,576
- XLarge: 1,048,577 - 16,777,216
- XXLarge: 16,777,217 - 4,294,967,295
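
To see which range a given message size falls into, the table above can be encoded directly. This is a minimal sketch: `SIZE_RANGES` and `size_range` are hypothetical names, and treating both bounds as inclusive is an assumption based on how the ranges are listed.

```python
# (name, min_bytes, max_bytes), assumed inclusive on both ends.
SIZE_RANGES = [
    ("Small", 0, 1_024),
    ("Medium", 1_025, 65_536),
    ("Large", 65_537, 1_048_576),
    ("XLarge", 1_048_577, 16_777_216),
    ("XXLarge", 16_777_217, 4_294_967_295),
]


def size_range(size_bytes):
    """Return the name of the default range containing size_bytes."""
    for name, low, high in SIZE_RANGES:
        if low <= size_bytes <= high:
            return name
    raise ValueError(f"size {size_bytes} is outside the default ranges")


# Boundary values: 1 KiB, 1 KiB + 1, 1 MiB, and the largest covered size.
print(size_range(1_024), size_range(1_025), size_range(2**20), size_range(2**32 - 1))
```

The upper bounds are powers of two (2^10, 2^16, 2^20, 2^24), with the last range ending at 2^32 - 1.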

### Sample Data

See `sample_performance_data.csv` for an example of the expected input format.

### Integration with NCCL

The generated configuration file can be used directly with the NCCL tuner plugin:

```bash
export NCCL_TUNER_CONFIG_FILE=/path/to/optimized_config.conf
export NCCL_TUNER_PLUGIN=/path/to/libnccl-tuner.so
mpirun -np 8 your_nccl_application
```

### Performance Data Collection

To collect performance data for optimization, you can:

1. **Use NCCL benchmarks** with different algorithm/protocol combinations
2. **Profile your applications** with various tuner settings
3. **Run systematic sweeps** across parameter combinations
4. **Use NCCL debug output** to collect timing information

The key is to have comprehensive data covering:
- Different message sizes (small to large)
- Various topologies (single node, multi-node)
- All relevant algorithm/protocol combinations
- Different channel counts and pipeline configurations
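
One way to plan a systematic sweep is to enumerate the parameter grid up front and benchmark each combination in turn. The sketch below only generates the grid; it does not run NCCL. The value lists and the `sweep` function are illustrative assumptions to adapt to your hardware and NCCL build.

```python
import itertools

# Illustrative value sets; adjust to what your NCCL build and hardware support.
collectives = ["allreduce", "broadcast"]
sizes = [1_024, 65_536, 1_048_576]
algorithms = ["tree", "ring"]
protocols = ["simple", "ll", "ll128"]
channels = [-1, 4]


def sweep():
    """Yield one dict per parameter combination to benchmark."""
    for coll, size, algo, proto, ch in itertools.product(
            collectives, sizes, algorithms, protocols, channels):
        yield {"collective": coll, "size_bytes": size, "algorithm": algo,
               "protocol": proto, "channels": ch}


combos = list(sweep())
print(len(combos))  # 2 * 3 * 2 * 3 * 2 = 72 combinations
```

Each yielded dict uses the CSV column names, so a measured metric can be added to it and the result written out as one input row.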