NCCL Inspector Performance Summary Exporter
This tool processes NCCL Inspector log files and generates performance analysis reports, including visualizations and statistical summaries. Similar exporters can be built to integrate with observability systems such as Elastic, Prometheus, or custom metric systems.
Features
- Performance Analysis: Generates statistical summaries for collective operations
- Communication Type Classification: Automatically categorizes communication patterns
- Visualizations: Creates scatter plots, histograms, and box plots for performance metrics
- Data Export: Converts logs to Parquet format for efficient processing
- Multi-format Log Support: Processes .log, .log.gz, .jsonl, and .jsonl.gz files
- Parallel Processing: Utilizes multi-core processing for faster analysis
Requirements
- Python 3.7+
- Access to NCCL Inspector log files
Installation
Clone the Repository
git clone https://github.com/NVIDIA/nccl.git
cd nccl/ext-profiler/inspector/exporter/example
Install the required dependencies using the provided requirements.txt file:
pip install -r requirements.txt
Usage
The script processes NCCL Inspector log files from a specified directory.
Note: To generate NCCL Inspector log files, you need to run your NCCL application with the inspector plugin enabled. The log files will be output to a directory specified by the NCCL_INSPECTOR_DUMP_DIR environment variable. For detailed setup instructions and environment variable configuration, see the Inspector README.
Basic Usage
python perf_summary_exporter.py --input_dir /path/to/nccl/inspector/logs
This mode processes all log files in the specified directory and its subdirectories recursively.
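The recursive search described above can be sketched with the standard library. This is a minimal illustration, not the exporter's actual code; the function name `find_log_files` is hypothetical, and the suffix list is taken from the "Supported Formats" section of this README.

```python
from pathlib import Path

# Suffixes listed under "Supported Formats" in this README.
SUPPORTED_SUFFIXES = (".log", ".log.gz", ".jsonl", ".jsonl.gz")


def find_log_files(input_dir):
    """Return every supported log file under input_dir, searched recursively.

    Hypothetical helper for illustration; the real script may differ.
    """
    root = Path(input_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(SUPPORTED_SUFFIXES)
    )
```

Calling `find_log_files("/path/to/nccl/inspector/logs")` would return the matching files from the top-level directory and all of its subdirectories.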
Command Line Arguments
- --input_dir <path>: Required. Directory containing NCCL Inspector log files (searches recursively in subdirectories)
- --output_dir <name>: Optional. Custom output directory name (default: <input_directory_name>-analysis)
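For readers building their own exporter, the two arguments above could be declared roughly as follows. This is a sketch under assumptions: the function name `parse_args` and the help strings are illustrative, and the way the default output name is derived from the input directory is inferred from the documented `<input_directory_name>-analysis` default rather than copied from the script.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two documented options (illustrative sketch only)."""
    parser = argparse.ArgumentParser(
        description="Summarize NCCL Inspector logs (sketch)."
    )
    parser.add_argument(
        "--input_dir", required=True,
        help="Directory containing NCCL Inspector log files",
    )
    parser.add_argument(
        "--output_dir", default=None,
        help="Output directory name (default: <input_directory_name>-analysis)",
    )
    args = parser.parse_args(argv)
    if args.output_dir is None:
        # Assumption: the default is built from the last path component.
        args.output_dir = f"{Path(args.input_dir).resolve().name}-analysis"
    return args
```

With this sketch, `parse_args(["--input_dir", "/data/run42/"])` yields an `output_dir` of `run42-analysis`.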
Output
The tool generates:
- Parquet Files: One per log file containing processed log data (stored in the parquet_files/ subdirectory)
- Summary Directory: Contains comprehensive analysis results
- Visualizations: Scatter plots, histograms, and box plots for each message size
- CSV Files: Detailed summaries for each message size and collective type
- Log File: Processing log with detailed information
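To give a sense of what a per-size, per-collective summary involves, the sketch below groups records by collective type and message size and computes a few basic statistics on bus bandwidth using only the standard library. The function names and the CSV column names are hypothetical; the actual CSV layout produced by the tool is not documented here and may differ.

```python
import csv
import io
import statistics
from collections import defaultdict

# Hypothetical column names; the tool's real CSV schema may differ.
FIELDNAMES = [
    "coll", "msg_size_bytes", "count",
    "mean_busbw_gbs", "median_busbw_gbs", "min_busbw_gbs", "max_busbw_gbs",
]


def summarize(records):
    """Group records by (collective, message size) and summarize busbw."""
    groups = defaultdict(list)
    for rec in records:
        perf = rec["coll_perf"]
        key = (perf["coll"], perf["coll_msg_size_bytes"])
        groups[key].append(perf["coll_busbw_gbs"])
    rows = []
    for (coll, size), values in sorted(groups.items()):
        rows.append({
            "coll": coll,
            "msg_size_bytes": size,
            "count": len(values),
            "mean_busbw_gbs": statistics.mean(values),
            "median_busbw_gbs": statistics.median(values),
            "min_busbw_gbs": min(values),
            "max_busbw_gbs": max(values),
        })
    return rows


def to_csv(rows):
    """Render summary rows as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```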
Example Output Structure
<output_dir_name>/
├── output.log
├── parquet_files/
│   ├── <filename1>.parquet
│   ├── <filename2>.parquet
│   └── ...
└── summary/
    ├── scatter_plot_<comm_type>_<coll_type>.png
    ├── combined_scatter_plot_<comm_type>_<coll_type>.png
    └── msg_size_<human_readable_size>/
        ├── histograms/
        │   └── histogram_<comm_type>_<coll_type>_<size>.png
        ├── boxplots/
        │   └── boxplot_<comm_type>_<coll_type>_<size>.png
        └── summary_<comm_type>_<coll_type>_<size>.csv
Supported Communicator Types
- single-rank
- nvlink-only
- hca-only
- mixed
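This README does not spell out how a communicator is assigned to one of these types. One plausible rule, based only on the `n_ranks` and `nnodes` header fields, is sketched below. Treat it as an assumption for illustration, not a description of the tool's actual logic; the function name `classify_comm_type` is hypothetical.

```python
def classify_comm_type(n_ranks, nnodes):
    """Guess a communicator type from header fields (assumed rule).

    Assumptions, not confirmed by this README:
      - one rank in total            -> "single-rank"
      - all ranks on a single node   -> "nvlink-only"
      - exactly one rank per node    -> "hca-only"
      - anything else                -> "mixed"
    """
    if n_ranks == 1:
        return "single-rank"
    if nnodes == 1:
        return "nvlink-only"
    if n_ranks == nnodes:
        return "hca-only"
    return "mixed"
```

Under this assumed rule, the sample record shown later in this README (32 ranks across 4 nodes) would be classified as `mixed`.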
Supported Collective Types
- AllReduce
- AllGather
- ReduceScatter
- Broadcast
Log File Formats
Supported Formats
- .log - Plain text JSON lines
- .log.gz - Compressed JSON lines
- .jsonl - JSON lines format
- .jsonl.gz - Compressed JSON lines
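Since all four formats carry JSON lines, either plain or gzip-compressed, a single reader can cover them. The sketch below assumes one JSON object per line and picks the opener from the file name; `iter_records` is a hypothetical helper name, not part of the tool.

```python
import gzip
import json


def iter_records(path):
    """Yield one parsed JSON object per non-empty line of a log file.

    Files ending in ".gz" are read through gzip; everything else is
    opened as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)
```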
Expected JSON Structure
{
  "header": {
    "id": "0x9e7a479f95a66c",
    "rank": 31,
    "n_ranks": 32,
    "nnodes": 4
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "75e61acda-dirty",
    "rec_mechanism": "nccl_profiler_interface",
    "dump_timestamp_us": 1749490229087081,
    "hostname": "example-hostname",
    "pid": 468528
  },
  "coll_perf": {
    "coll": "ReduceScatter",
    "coll_sn": 129,
    "coll_msg_size_bytes": 65536,
    "coll_exec_time_us": 110,
    "coll_timing_source": "kernel_gpu",
    "coll_algobw_gbs": 19.065018,
    "coll_busbw_gbs": 18.469236
  }
}
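When feeding records like the one above into a tabular store or a metrics system, it is often convenient to flatten the three sections into a single flat mapping. The sketch below is one way to do that; the `section.key` naming and the function name `flatten_record` are illustrative choices, not the tool's schema.

```python
import json

# The three top-level sections shown in the expected structure above.
SECTIONS = ("header", "metadata", "coll_perf")


def flatten_record(record):
    """Flatten a nested Inspector record into {"section.key": value}.

    Prefixing with the section name avoids key collisions between
    sections. Missing sections are skipped rather than treated as errors.
    """
    flat = {}
    for section in SECTIONS:
        for key, value in record.get(section, {}).items():
            flat[f"{section}.{key}"] = value
    return flat


# Example: a shortened single-line record, as it might appear in a log file.
line = (
    '{"header": {"rank": 31, "n_ranks": 32, "nnodes": 4}, '
    '"coll_perf": {"coll": "ReduceScatter", "coll_msg_size_bytes": 65536}}'
)
flat = flatten_record(json.loads(line))
```

Here `flat["coll_perf.coll"]` holds the collective name and `flat["header.n_ranks"]` holds the communicator size.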
Troubleshooting
Common Issues
- No log files found: Ensure the log directory path is correct and contains valid log files
- Missing dependencies: Ensure all requirements are installed in your virtual environment
- Mixed file formats: The tool will exit if it detects mixed .log, .log.gz, .jsonl, and .jsonl.gz files in the same directory. This typically indicates a corrupted input directory caused by multiple overlapping NCCL Inspector runs with different output format options. Clean the directory and re-run with consistent settings.
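To check a directory for the mixed-format condition before running the tool, something like the following could be used. It is a sketch of the idea only: the helper names are hypothetical, and the tool's own detection logic may apply different rules.

```python
from collections import defaultdict
from pathlib import Path

SUPPORTED_SUFFIXES = (".log", ".log.gz", ".jsonl", ".jsonl.gz")


def log_suffix(path):
    """Return the supported suffix a file name ends with, or None."""
    name = Path(path).name
    for suffix in SUPPORTED_SUFFIXES:
        if name.endswith(suffix):
            return suffix
    return None


def find_mixed_dirs(paths):
    """Map each directory holding more than one log format to its formats."""
    by_dir = defaultdict(set)
    for p in paths:
        suffix = log_suffix(p)
        if suffix is not None:
            by_dir[Path(p).parent].add(suffix)
    return {d: sorted(s) for d, s in by_dir.items() if len(s) > 1}
```

An empty result means every directory contains a single format; any entries point to directories that should be cleaned up before re-running.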
Log Files
The tool creates detailed logs in the output directory. Check output.log for processing information and any error messages.
Support
Please refer to the GitHub issues page at https://github.com/NVIDIA/nccl/issues. Your question may already have been asked by another user. If not, feel free to create a new issue and mention the "inspector plugin" in the title.