2025-09-02 13:21:14 -07:00
# NCCL Inspector Plugin
The NCCL Inspector is a plugin for the NVIDIA Collective Communications Library (NCCL) that provides detailed, per-communicator, per-collective performance and metadata logging. It is designed to help users analyze and debug NCCL collective operations by generating structured JSON output for each operation.
## Related Documentation
- **[Performance Exporter](exporter/example/README.md)** - Tool for analyzing and visualizing NCCL performance data from inspector logs
## Folder Location
The Inspector plugin source is located in:
```
ext-profiler/inspector/
```
## Building the Inspector Plugin
To build the Inspector plugin, run:
```bash
make
```
The build system detects the CUDA and NCCL installations from your environment. To use custom paths, set the `CUDA_HOME` and `NCCL_HOME` environment variables or pass them as make arguments.
### Build Options
The Makefile supports several build options:
- **DEBUG=1**: Enable debug build with additional debugging information
- **ASAN=1**: Enable Address Sanitizer for memory error detection
- **UBSAN=1**: Enable Undefined Behavior Sanitizer
Example debug build:
```bash
make DEBUG=1
```
### Build Output
The build process creates:
- `libnccl-profiler-inspector.so`: The main inspector plugin library
- `version.cc`: Auto-generated version information from git
## Using NCCL Inspector
### Key Differences from Normal NCCL Usage
Running NCCL with the Inspector plugin differs from a normal NCCL run mainly in the environment variables that load the plugin and enable detailed performance logging:
**Normal NCCL Run:**
```bash
# Standard NCCL execution
./your_nccl_application
```
**NCCL Inspector Run:**
```bash
# NCCL Inspector enabled execution
export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
./your_nccl_application
```
### Required Environment Variables
- `NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so`
  Loads the Inspector plugin into NCCL.
- `NCCL_INSPECTOR_ENABLE=1`
  Enables the Inspector plugin.
- `NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=<interval>`
  Sets the interval (in microseconds) at which the internal dump thread writes output. Example: `500`.
- `NCCL_INSPECTOR_DUMP_DIR=<output_dir>` (optional)
  Sets the output directory for logs. If not set, the directory defaults to `nccl-inspector-<slurm_job_id>` when running under SLURM and to `nccl-inspector-unknown-jobid` otherwise.
- `NCCL_INSPECTOR_DUMP_VERBOSE=<0|1>` (optional)
  Enables verbose output, including event trace information. Set to `1` to enable or `0` to disable (default).
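The variables above can also be set programmatically when launching a job from a script. The following is a minimal sketch, not part of the plugin: the helper names `inspector_env` and `run_with_inspector` are hypothetical, and only the environment variable names come from this document.

```python
import os
import subprocess


def inspector_env(plugin_path, interval_us=500, dump_dir=None,
                  verbose=False, base_env=None):
    """Return a copy of the environment with the Inspector variables set.

    Hypothetical helper: it only sets the variables documented above.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_PROFILER_PLUGIN"] = plugin_path
    env["NCCL_INSPECTOR_ENABLE"] = "1"
    env["NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS"] = str(interval_us)
    if dump_dir is not None:
        # Optional: without it the plugin picks its default directory.
        env["NCCL_INSPECTOR_DUMP_DIR"] = dump_dir
    env["NCCL_INSPECTOR_DUMP_VERBOSE"] = "1" if verbose else "0"
    return env


def run_with_inspector(command, **inspector_options):
    """Run `command` (a list of arguments) with the Inspector enabled."""
    env = inspector_env(**inspector_options)
    return subprocess.run(command, env=env, check=True)
```

For example, `run_with_inspector(["./your_nccl_application"], plugin_path="/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so")` would be equivalent to exporting the three required variables before starting the application.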
### Example Usage
**Single Node:**
```bash
export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
./build/test/perf/all_reduce_perf -b 8 -e 16G -f 2 -g 8
```
**Multi-Node (SLURM):**
```bash
# Add these environment variables to your SLURM script
export NCCL_PROFILER_PLUGIN=/path/to/nccl/ext-profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=500
export NCCL_INSPECTOR_DUMP_DIR=/path/to/logs/${SLURM_JOB_ID}/
# Then run your normal NCCL application
srun your_nccl_application
```
## Example Scripts
For detailed example scripts showing how to integrate NCCL Inspector with different workloads, see the **[test/examples/](test/examples/)** directory:
- **Single Node Example**: Basic NCCL performance testing with inspector
- **Multi-Node SLURM Example**: Comprehensive multi-node testing with various collective operations
- **Training Workload Example**: Integration with distributed training workloads
## Output Example
Each output file contains JSON objects with the following structure:
```json
{
  "header": {
    "id": "0x7f8c496ae9f661",
    "rank": 2,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "",
    "rec_mechanism": "profiler_plugin",
    "dump_timestamp_us": 1748030377748202,
    "hostname": "example-hostname",
    "pid": 1639453
  },
  "coll_perf": {
    "coll": "AllReduce",
    "coll_sn": 1407,
    "coll_msg_size_bytes": 17179869184,
    "coll_exec_time_us": 61974,
    "coll_algobw_gbs": 277.210914,
    "coll_busbw_gbs": 485.119099
  }
}
```
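The bandwidth fields in the samples in this document can be reproduced from the size and time fields. The sketch below is a consistency check, not the plugin's implementation. It assumes that `coll_algobw_gbs` is bytes divided by execution time, in units of 10^9 bytes per second, and that `coll_busbw_gbs` applies the bus-bandwidth factors used by the nccl-tests convention (2(n-1)/n for AllReduce, (n-1)/n for ReduceScatter). It also assumes that ReduceScatter reports a per-rank message size, an inference drawn only from the ReduceScatter sample in the verbose section. The function names are hypothetical.

```python
def algobw_gbs(coll, msg_size_bytes, exec_time_us, n_ranks):
    """Algorithm bandwidth in GB/s (10^9 bytes per second)."""
    # Assumption inferred from the samples: ReduceScatter reports the
    # per-rank size, so the total moved is msg_size_bytes * n_ranks.
    if coll == "ReduceScatter":
        total_bytes = msg_size_bytes * n_ranks
    else:
        total_bytes = msg_size_bytes
    # Bytes per microsecond equals 10^6 bytes/s; divide by 1e3 for GB/s.
    return total_bytes / exec_time_us / 1e3


def busbw_gbs(coll, algobw, n_ranks):
    """Bus bandwidth, using the assumed nccl-tests correction factors."""
    factors = {
        "AllReduce": 2 * (n_ranks - 1) / n_ranks,
        "ReduceScatter": (n_ranks - 1) / n_ranks,
    }
    return algobw * factors[coll]


# Values taken from the AllReduce sample above.
algo = algobw_gbs("AllReduce", 17179869184, 61974, 8)
bus = busbw_gbs("AllReduce", algo, 8)
print(f"algobw={algo:.6f} GB/s busbw={bus:.6f} GB/s")
```

With these assumptions the computed values agree with the `coll_algobw_gbs` and `coll_busbw_gbs` fields of both samples to within rounding. Other collectives are left out because this document contains no sample to check them against.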
## Output Example Verbose
To enable verbose output with event trace information, set the `NCCL_INSPECTOR_DUMP_VERBOSE=1` environment variable:
```bash
export NCCL_INSPECTOR_DUMP_VERBOSE=1
```
This will include additional event trace information in the JSON output, showing the sequence of callbacks and timestamps for each individual event.
```json
{
  "header": {
    "id": "0xe62dedaa97644a",
    "rank": 4,
    "n_ranks": 8,
    "nnodes": 1
  },
  "metadata": {
    "inspector_output_format_version": "v4.0",
    "git_rev": "9019a1912-dirty",
    "rec_mechanism": "nccl_profiler_interface",
    "dump_timestamp_us": 1752867229276385,
    "hostname": "example-hostname",
    "pid": 438776
  },
  "coll_perf": {
    "coll": "ReduceScatter",
    "coll_sn": 1231,
    "coll_msg_size_bytes": 2147483648,
    "coll_exec_time_us": 41057,
    "coll_timing_source": "kernel_gpu",
    "coll_algobw_gbs": 418.439467,
    "coll_busbw_gbs": 366.134533,
    "event_trace_sn": {
      "coll_start_sn": 1,
      "coll_stop_sn": 2,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_sn": 3,
          "kernel_stop_sn": 48,
          "kernel_record_sn": 47
        }
      ]
    },
    "event_trace_ts": {
      "coll_start_ts": 1752867229235059,
      "coll_stop_ts": 1752867229235064,
      "kernel_events": [
        {
          "channel_id": 0,
          "kernel_start_ts": 1752867229235181,
          "kernel_stop_ts": 1752867229275811,
          "kernel_record_ts": 1752867229275811
        }
      ]
    }
  }
}
```
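The timestamp trace above can be turned into per-channel durations. The sketch below assumes that the `*_ts` fields are microsecond timestamps on the same clock as `dump_timestamp_us`. This document does not state that, so treat it as an assumption. The function name is hypothetical.

```python
def kernel_callback_durations_us(record):
    """Map channel_id to (kernel_stop_ts - kernel_start_ts).

    Assumes the *_ts fields are microsecond timestamps.
    """
    events = record["coll_perf"]["event_trace_ts"]["kernel_events"]
    return {
        event["channel_id"]: event["kernel_stop_ts"] - event["kernel_start_ts"]
        for event in events
    }


# Trimmed copy of the verbose sample above.
sample = {
    "coll_perf": {
        "coll_exec_time_us": 41057,
        "event_trace_ts": {
            "kernel_events": [
                {
                    "channel_id": 0,
                    "kernel_start_ts": 1752867229235181,
                    "kernel_stop_ts": 1752867229275811,
                }
            ]
        },
    }
}
print(kernel_callback_durations_us(sample))
```

For the sample, the difference for channel 0 is 40630, while `coll_exec_time_us` is 41057 with `coll_timing_source` set to `kernel_gpu`. The callback timestamps and the reported execution time should therefore not be assumed to be identical.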
Multiple such JSON objects are written, one per collective operation per communicator.
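Because a file holds many such objects, a reader has to decode them one after another. The sketch below assumes only that the objects are separated by whitespace; this document does not say whether each object occupies a single line. It uses `json.JSONDecoder.raw_decode` from the Python standard library, and the helper names are hypothetical.

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a whitespace-separated sequence."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            return
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def exec_times_by_collective(text):
    """Group coll_exec_time_us by (communicator id, collective name)."""
    groups = {}
    for record in iter_json_objects(text):
        key = (record["header"]["id"], record["coll_perf"]["coll"])
        groups.setdefault(key, []).append(
            record["coll_perf"]["coll_exec_time_us"]
        )
    return groups
```

A record that is still being written when the file is read would raise `json.JSONDecodeError`. A caller that reads logs from a running job may want to catch that error and retry later.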
## Output Directory
- By default, output files are written to:
  - `nccl-inspector-unknown-jobid` (if no SLURM job ID is present)
  - `nccl-inspector-<slurm_job_id>` (if running under SLURM)
- You can override this with the `NCCL_INSPECTOR_DUMP_DIR` environment variable.
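A script that collects logs can apply the same rules to predict where the output will be written. This is a sketch of the rules as described above, not the plugin's code. It assumes the SLURM job ID is read from `SLURM_JOB_ID`, the variable used in the SLURM example in this document, and the function name is hypothetical.

```python
import os


def inspector_dump_dir(env=None):
    """Return the directory the Inspector is expected to write to."""
    env = os.environ if env is None else env
    override = env.get("NCCL_INSPECTOR_DUMP_DIR")
    if override:
        return override
    # Assumption: the job ID comes from SLURM_JOB_ID.
    job_id = env.get("SLURM_JOB_ID")
    if job_id:
        return f"nccl-inspector-{job_id}"
    return "nccl-inspector-unknown-jobid"
```

Passing an explicit mapping, as in `inspector_dump_dir({"SLURM_JOB_ID": "12345"})`, makes the helper easy to check without changing the process environment.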
## Additional Notes
- The plugin is compatible with standard NCCL workflows and can be used in both single-node and multi-node (SLURM) environments.
- For more details, see the source code and comments in `ext-profiler/inspector/`.