[RcclReplayer] JSON <-> BIN log format conversion tool (#2056)

* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer

[ROCm/rccl commit: 5fd86021a8]
This commit is contained in:
Kapil S. Pawar
2025-11-24 11:51:36 -06:00
committed by GitHub
parent 8c928e60f9
commit 566671910a
2 changed files with 1350 additions and 0 deletions
@@ -85,3 +85,37 @@ Replayer is a separate tool which aims to re-run the same set of RCCL calls as r
Each rank will print out its progress as it goes through every line of calls, including its rank, line number, RCCL API name, status (INFO/WARNING/ERROR).
It will also report time and bandwidth (if the line is a communication call) for that call. In the end, it will report the total time taken by all communication calls.
Replayer is still under development and experimentations, so the formats of logging or contents of replayer output will be subject to changes.
## Log Converter
`replay_log_converter.py` is a utility to convert between binary and JSON log formats, standardize JSON logs for easier parsing, and sanitize logs for comparison.
**Usage:**
* **Binary to JSON:** `python3 replay_log_converter.py <basename> tojson`
* **JSON to Binary:** `python3 replay_log_converter.py <basename> tobin`
* **Standardize JSON:** `python3 replay_log_converter.py <basename> --standardize`
* **Sanitize JSON:** `python3 replay_log_converter.py <basename> --sanitize`
An optional output basename can be provided after the mode (tojson/tobin) to customize the output filename:
* `python3 replay_log_converter.py <basename> <mode> <output_basename>`
The converter automatically finds all matching log files with pattern `basename.PID.hostname` and processes them.
**Output Files:**
* Standardized JSON output is saved with `.standard.json` extension and can be parsed with standard JSON libraries.
* Sanitized files are modified in-place (original files are overwritten with sanitized versions).
**Examples:**
* `python3 replay_log_converter.py replayer_log tojson` produces `replayer_log.{1270-1278}.quanta-cx77-11.json`
* `python3 replay_log_converter.py replayer_log tojson converted_log` produces `converted_log.{1270-1278}.quanta-cx77-11.json`
* `python3 replay_log_converter.py replayer_log --sanitize` sanitizes existing JSON files in-place
* `python3 replay_log_converter.py replayer_log tojson --sanitize` converts to JSON and sanitizes in one step
* `python3 replay_log_converter.py replayer_log --sanitize --no-timestamp` (or `--nts`) sets all timestamps to 0.0
**Sanitization:**
The `--sanitize` option normalizes logs for easier comparison by:
* Remapping pointers to readable identifiers (e.g., `comm : 0x7fb680328010``comm : comm_001`)
* Normalizing timestamps relative to the first call (e.g., `time : 1762969171532.248535``time : 0.000000`)
* Use `--no-timestamp` (or `--nts`) to set all timestamps to 0.0 instead
* Preserving relationships: same pointer values get the same sanitized identifier
* Sanitized fields: communicators (`comm`), unique IDs (`uniqueID`), streams (`stream`), buffer addresses (`addr`/`base`/`ptr`/`acc`), handles (`handle`), thread IDs (`thread`), and process IDs (`pid`)
File diff suppressed because it is too large Load Diff