2025-09-02 13:21:14 -07:00
# NCCL Inspector Performance Summary Exporter
This tool processes NCCL Inspector log files and generates comprehensive performance analysis reports including visualizations and statistical summaries.
You can build similar exporters to integrate with observability systems such as Elastic, Prometheus, or other custom metric systems.
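As a rough illustration of what such an exporter might look like, the sketch below renders one Inspector record as lines in the Prometheus text exposition format. It is not part of this tool: the function name `to_prometheus_lines`, the `nccl_inspector` metric prefix, and the chosen label set are all assumptions made for the example. The record fields are the ones shown under "Expected JSON Structure".

```python
def to_prometheus_lines(record, prefix="nccl_inspector"):
    """Render one Inspector record as Prometheus text-exposition lines.

    Hypothetical helper: metric names and labels are illustrative only.
    """
    header = record["header"]
    perf = record["coll_perf"]
    labels = {
        "coll": perf["coll"],
        "rank": str(header["rank"]),
        "hostname": record["metadata"]["hostname"],
    }

    def escape(value):
        # Prometheus label values escape backslash, double quote, and newline.
        return (
            value.replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n")
        )

    label_text = ",".join(
        f'{key}="{escape(value)}"' for key, value in sorted(labels.items())
    )
    metrics = {
        "coll_exec_time_us": perf["coll_exec_time_us"],
        "coll_algobw_gbs": perf["coll_algobw_gbs"],
        "coll_busbw_gbs": perf["coll_busbw_gbs"],
    }
    # One sample line per metric: name{labels} value
    return [
        f"{prefix}_{name}{{{label_text}}} {value}"
        for name, value in metrics.items()
    ]
```

The returned lines could be written to a file for a textfile-style collector or served over HTTP; which delivery mechanism fits depends on your monitoring setup.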
## Features
- **Performance Analysis**: Generates statistical summaries for collective operations
- **Communication Type Classification**: Automatically categorizes communication patterns
- **Visualizations**: Creates scatter plots, histograms, and box plots for performance metrics
- **Data Export**: Converts logs to Parquet format for efficient processing
- **Multi-format Log Support**: Processes `.log`, `.log.gz`, `.jsonl`, and `.jsonl.gz` files
- **Parallel Processing**: Utilizes multi-core processing for faster analysis
## Requirements
- Python 3.7+
- Access to NCCL Inspector log files
## Installation
### Clone the Repository
```bash
git clone https://github.com/NVIDIA/nccl.git
cd nccl/ext-profiler/inspector/exporter/example
```
Install the required dependencies using the provided `requirements.txt` file:
```bash
pip install -r requirements.txt
```
## Usage
The script processes NCCL Inspector log files from a specified directory.
**Note:** To generate NCCL Inspector log files, you need to run your NCCL application with the inspector plugin enabled. The log files will be output to a directory specified by the `NCCL_INSPECTOR_DUMP_DIR` environment variable. For detailed setup instructions and environment variable configuration, see the [Inspector README](../../../README.md).
### Basic Usage
```bash
python perf_summary_exporter.py --input_dir /path/to/nccl/inspector/logs
```
This command processes all log files in the specified directory and, recursively, in its subdirectories.
### Command Line Arguments
- `--input_dir <path>`: **Required**. Directory containing NCCL Inspector log files (searches recursively in subdirectories)
- `--output_dir <name>`: **Optional**. Custom output directory name (default: `<input_directory_name>-analysis`)
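The two arguments above could be parsed along the following lines. This is a minimal sketch using `argparse`, not the script's actual implementation; the function name `parse_args` and the help strings are assumptions, and the sketch only derives the default directory name without deciding where that directory is created.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse --input_dir and --output_dir (hypothetical re-creation)."""
    parser = argparse.ArgumentParser(
        description="Summarize NCCL Inspector logs (illustrative sketch)"
    )
    parser.add_argument(
        "--input_dir",
        required=True,
        help="Directory containing NCCL Inspector log files",
    )
    parser.add_argument(
        "--output_dir",
        default=None,
        help="Output directory name (default: <input_directory_name>-analysis)",
    )
    args = parser.parse_args(argv)
    if args.output_dir is None:
        # Documented default: "<input_directory_name>-analysis".
        args.output_dir = f"{Path(args.input_dir).name}-analysis"
    return args
```

For example, `parse_args(["--input_dir", "/data/run1"])` yields an `output_dir` of `run1-analysis`.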
## Output
The tool generates:
1. **Parquet Files**: One per log file containing processed log data (stored in `parquet_files/` subdirectory)
2. **Summary Directory**: Contains comprehensive analysis results
3. **Visualizations**: Scatter plots, histograms, and box plots for each message size
4. **CSV Files**: Detailed summaries for each message size and collective type
5. **Log File**: Processing log with detailed information
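To give a feel for the per-message-size summaries, here is a small standard-library sketch that groups records by collective type and message size and computes a few statistics on bus bandwidth. The exact statistics and columns the tool writes to its CSV files are not listed here, so the ones below (count, mean, median, min, max) and the function name `summarize_busbw` are assumptions.

```python
import statistics
from collections import defaultdict


def summarize_busbw(records):
    """Group records by (coll, message size) and summarize bus bandwidth.

    Illustrative only: the real tool may report different statistics.
    """
    groups = defaultdict(list)
    for record in records:
        perf = record["coll_perf"]
        key = (perf["coll"], perf["coll_msg_size_bytes"])
        groups[key].append(perf["coll_busbw_gbs"])

    summary = {}
    for key, values in groups.items():
        summary[key] = {
            "count": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            "min": min(values),
            "max": max(values),
        }
    return summary
```

Each key of the returned dictionary corresponds roughly to one `summary_<comm_type>_<coll_type>_<size>.csv` file, minus the communicator-type dimension, which this sketch does not attempt to classify.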
## Example Output Structure
```
<output_dir_name>/
├── output.log
├── parquet_files/
│   ├── <filename1>.parquet
│   ├── <filename2>.parquet
│   └── ...
└── summary/
    ├── scatter_plot_<comm_type>_<coll_type>.png
    ├── combined_scatter_plot_<comm_type>_<coll_type>.png
    └── msg_size_<human_readable_size>/
        ├── histograms/
        │   └── histogram_<comm_type>_<coll_type>_<size>.png
        ├── boxplots/
        │   └── boxplot_<comm_type>_<coll_type>_<size>.png
        └── summary_<comm_type>_<coll_type>_<size>.csv
```
## Supported Communicator Types
- `single-rank`
- `nvlink-only`
- `hca-only`
- `mixed`
## Supported Collective Types
- `AllReduce`
- `AllGather`
- `ReduceScatter`
- `Broadcast`
## Log File Formats
### Supported Formats
- `.log` - Plain-text JSON lines
- `.log.gz` - Gzip-compressed JSON lines
- `.jsonl` - Plain-text JSON lines
- `.jsonl.gz` - Gzip-compressed JSON lines
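If you want to read these files yourself, the sketch below shows one way to discover them recursively and iterate over their records, handling the compressed and uncompressed variants transparently. The helper names `find_log_files` and `read_records` are hypothetical, and the sketch assumes one JSON object per line, encoded as UTF-8.

```python
import gzip
import json
from pathlib import Path

# File name endings listed under "Supported Formats".
SUPPORTED_SUFFIXES = (".log", ".log.gz", ".jsonl", ".jsonl.gz")


def find_log_files(input_dir):
    """Recursively collect files whose names end in a supported suffix."""
    return sorted(
        path
        for path in Path(input_dir).rglob("*")
        if path.is_file() and path.name.endswith(SUPPORTED_SUFFIXES)
    )


def read_records(path):
    """Yield one parsed JSON object per non-empty line of a log file."""
    # Use gzip for compressed files; plain open() otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because `read_records` is a generator, large log files are processed line by line rather than loaded into memory at once.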
### Expected JSON Structure
```json
{
  "header": {
    "id": "0x9e7a479f95a66c",
    "rank": 31,
    "n_ranks": 32,
    "nnodes": 4
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "75e61acda-dirty",
    "rec_mechanism": "nccl_profiler_interface",
    "dump_timestamp_us": 1749490229087081,
    "hostname": "example-hostname",
    "pid": 468528
  },
  "coll_perf": {
    "coll": "ReduceScatter",
    "coll_sn": 129,
    "coll_msg_size_bytes": 65536,
    "coll_exec_time_us": 110,
    "coll_timing_source": "kernel_gpu",
    "coll_algobw_gbs": 19.065018,
    "coll_busbw_gbs": 18.469236
  }
}
```
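A nested record like the one above usually needs to be flattened before it can be stored in a columnar format such as Parquet. The following sketch shows one possible flattening into a single-level dictionary. The dotted column names (for example `coll_perf.coll`) and the function name `flatten_record` are assumptions; the tool's actual Parquet schema may differ.

```python
def flatten_record(record, sep="."):
    """Flatten one level of nesting: {"a": {"b": 1}} -> {"a.b": 1}.

    Hypothetical helper; column naming is an assumption.
    """
    flat = {}
    for section, fields in record.items():
        if isinstance(fields, dict):
            # Prefix each field with its section name to avoid collisions.
            for name, value in fields.items():
                flat[f"{section}{sep}{name}"] = value
        else:
            # Keep any top-level scalar values unchanged.
            flat[section] = fields
    return flat
```

Applied to the sample record, this produces keys such as `header.rank`, `metadata.hostname`, and `coll_perf.coll_busbw_gbs`, one per leaf field.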
## Troubleshooting
### Common Issues
1. **No log files found**: Ensure the log directory path is correct and contains valid log files
2. **Missing dependencies**: Ensure all requirements are installed in your virtual environment
3. **Mixed file formats**: The tool exits if it detects a mix of `.log`, `.log.gz`, `.jsonl`, and `.jsonl.gz` files in the same directory. This usually indicates a corrupted input directory, caused by multiple overlapping NCCL Inspector runs with different output format options. Clean the directory and re-run with consistent settings.
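To check an input tree for the mixed-format condition before running the tool, something like the sketch below may help. It reports every directory that contains more than one of the supported file formats. The helper names are hypothetical, and the tool's own detection logic may differ in detail.

```python
from collections import defaultdict
from pathlib import Path

# Longer suffixes first so ".log.gz" is not mistaken for a shorter match.
LOG_SUFFIXES = (".jsonl.gz", ".log.gz", ".jsonl", ".log")


def log_format(path):
    """Return the supported suffix a file name ends with, or None."""
    name = Path(path).name
    for suffix in LOG_SUFFIXES:
        if name.endswith(suffix):
            return suffix
    return None


def find_mixed_dirs(input_dir):
    """Map each directory holding more than one log format to its formats."""
    formats_by_dir = defaultdict(set)
    for path in Path(input_dir).rglob("*"):
        fmt = log_format(path) if path.is_file() else None
        if fmt is not None:
            formats_by_dir[path.parent].add(fmt)
    return {
        directory: sorted(formats)
        for directory, formats in formats_by_dir.items()
        if len(formats) > 1
    }
```

An empty result means no directory under `input_dir` mixes formats; otherwise the returned mapping shows which directories to clean up.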
### Log Files
The tool creates detailed logs in the output directory. Check `output.log` for processing information and any error messages.
## Support
Please refer to the GitHub issues page at https://github.com/NVIDIA/nccl/issues. Your question may already have been asked by another user. If not, feel free to create a new issue and mention the "inspector plugin" in the title.