Файли
Kapil S. Pawar 566671910a [RcclReplayer] JSON <-> BIN log format conversion tool (#2056)
* Add replay log format converter

* Add Log Sanitizer

* Add no timestamp option (nts) to sanitizer

[ROCm/rccl commit: 5fd86021a8]
2025-11-24 11:51:36 -06:00

12 KiB

Structured Logging

As part of the efforts to enhance RCCL Replayer functionality, RCCL now provides detailed logging of API calls.

Usage

  • Structured logging is a built-in module of RCCL source. For RCCL library in ROCm release, it is present starting from ROCm 7.0. To enable structured logging, point LD_LIBRARY_PATH to supporting RCCL library, then run with environment variable RCCL_REPLAY_FILE="${filename}".
  • If the value of RCCL_REPLAY_FILE contains “.json” extension, the log will be exported as text in JSON format. Otherwise, the log will be a binary file by default, with ".bin" extension.
  • Each process will export its own log local to the directory and node of the executable. Names of each process' output will be in the format of filename.PID.hostname.extension. For example, run workload with RCCL_REPLAY_FILE="replayer_log" will produce logs named such as replayer_log.1275.quanta-cyxtera-cx77-11.bin.
  • Log level is controlled by RCCL_LOG_LEVEL environment. Currently, by default, log level is 1 and will record most essential RCCL APIs which perform actual operations or incur changes to the communicator. Otherwise, other informational RCCL calls such as ncclGetAsyncError or third party function calls like mscclRunAlgo will be recorded, too.

What it does

  • Every public RCCL API call is tracked and logged in order of execution
  • Whenever possible, a log entry is flushed to file immediately after a call is made. Exceptions are when a function returns a handle, in which case the RCCL API may not be registered in case of deadlock or error.
  • HIP Graph compatible - will report if a call is made in graph capture mode. If in capture mode, RCCL will append a CPU callback node to the graph which logs the call when the graph runs.

Output format

Each line in the log is an entry of RCCL API call. ncclGroupStart() and ncclGroupEnd() correspond to opening and closing brackets, each on a separate line, forming a scope. Each line has 2 space of indentation per level of group depth.

Each entry will take the format of Name : [Parameters : Value, ..., context : [...]](, Trailing Data)

Parameters

will contain all the parameters and their values of the function call as defined in nccl.h header. Collectives will contain additional data about the communicator size, number of tasks, its rank, and its opCount in communicator.

Wherever applicable, the structured logging preserves underlying RCCL data constructs and how they are filled.

Device context

contains the following fields: timestamp, thread (caller thread ID), device ( GPU ID which the caller was bound to), captured (graph capture mode or not), and the graphID. If a call was made in graph capture mode and not actually running on GPU device, captured will be 1. All entries would have a graphID, but their validity depend on whether there were previous captures.

Trailing Info

Certain calls will have additional data following the parameters and contexts:

  • ncclCommInitAll will print of list of devices
  • ncclCommInitRankConfig and ncclCommSplit will have fields of ncclConfig_t
  • ncclGroupSimulateEnd will print all fields of ncclSimInfo_t
  • ncclAllToAllv will be followed by four comma separated lists of send/recv counts and displacements

Example

Here is an example log of the CommSplit_Reduces Unit Test from RCCL, run with RCCL_REPLAY_FILE="log.json" UT_DATATYPES="ncclInt32" UT_PROCESS_MASK=2 UT_MIN_GPUS=4 ./rccl-UnitTests --gtest_filter="Standalone.SplitComms_Reduce" UT_DATATYPES="ncclInt32"

{
  hostname : banff-pla-r27-29, version : 0,
  CommInitAll : [# of device : 4, context : [time : 1745623615674.781006, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  GetUniqueId : [uniqueID : 17168916771200794912, context : [time : 1745623615676.041748, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  {
    CommInitDev : [comm : 0x7fb680328010, size : 4, uniqueID : 17168916771200794912, rank : 0, dev : 0, context : [time : 1745623615676.157959, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
    CommInitDev : [comm : 0x7fa67bb28010, size : 4, uniqueID : 17168916771200794912, rank : 1, dev : 1, context : [time : 1745623615676.207520, thread : 1085917, device : 1, captured : -1, graphID : 0 ]],
    CommInitDev : [comm : 0x7f967b650010, size : 4, uniqueID : 17168916771200794912, rank : 6, dev : 6, context : [time : 1745623615676.407227, thread : 1085917, device : 6, captured : -1, graphID : 0 ]],
    CommInitDev : [comm : 0x7f8e7ad27010, size : 4, uniqueID : 17168916771200794912, rank : 7, dev : 7, context : [time : 1745623615676.468750, thread : 1085917, device : 7, captured : -1, graphID : 0 ]]
  },
  {
    CommSplit : [comm : 0x7fb680328010, color : 0, key : 0, newcomm : (nil), context : [time : 1745623619986.908691, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
    CommSplit : [comm : 0x7fa67bb28010, color : 0, key : 0, newcomm : 0x1, context : [time : 1745623619987.017090, thread : 1085917, device : 1, captured : -1, graphID : 0 ]],
    CommSplit : [comm : 0x7f967b650010, color : 0, key : -1, newcomm : 0x6, context : [time : 1745623619987.202393, thread : 1085917, device : 6, captured : -1, graphID : 0 ]],
    CommSplit : [comm : 0x7f8e7ad27010, color : 0, key : -1, newcomm : 0x7, context : [time : 1745623619987.205322, thread : 1085917, device : 7, captured : -1, graphID : 0 ]]
  },
  CommDestroy : [comm : 0x126d9a0, context : [time : 1745623620419.313721, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : 0x1345370, context : [time : 1745623620419.422363, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : (nil), context : [time : 1745623620888.018555, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : (nil), context : [time : 1745623620888.020996, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : 0x7fb680328010, context : [time : 1745623620888.023193, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : 0x7fa67bb28010, context : [time : 1745623620888.228027, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : 0x7f967b650010, context : [time : 1745623620889.099121, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
  CommDestroy : [comm : 0x7f8e7ad27010, context : [time : 1745623620889.293213, thread : 1085917, device : 0, captured : -1, graphID : 0 ]]
}

RCCL Replayer

Replayer is a separate tool which aims to re-run the same set of RCCL calls as recorded and report the cumulative time taken by these calls, provided with the binary output of structured logging.

Installation

  • Replayer relies on MPI for out of band communication.
  • Under rccl/tools/RcclReplayer, run MPI_DIR=${MPI_PATH} make
  • Replayer can be built from RCCL source using internal headers directly. It links against a RCCL library specified by RCCL_DIR (defaults to ../../build/release). For compatibility, it is recommended that logs are collected using the same RCCL library version.

Running

  • Replayer requires the exact same number of processes and processes per node as the recorded job. And all log files must be accessible by all processes in Replayer, either through shared filesystem or copies.
  • To run Replayer, simply call mpirun -np ${np} ./rcclReplayer ${filename}.${extension}
  • For example, we are on node quanta-cyxtera-cx77-11, with 8 logs: replayer_log.{1270-1278}.quanta-cyxtera-cx77-11.bin. Run mpirun -np 8 ./rcclReplayer replayer_log.bin
  • Replayer will have a parse and replay phase. During parsing it will create communicators as original RCCL job, assign log files to individual processes, and allocate resources. Then actual replay happens, re-running all the RCCL APIs with same parameters and device assignment. It is also able to capture and launch graphs involving RCCL calls, as recorded by structured logging. Actual data in original job such as message payload or vector values are not recorded therefore not replicated.

Output

Each rank will print out its progress as it goes through every line of calls, including its rank, line number, RCCL API name, status (INFO/WARNING/ERROR). It will also report time and bandwidth (if the line is a communication call) for that call. In the end, it will report the total time taken by all communication calls. Replayer is still under development and experimentations, so the formats of logging or contents of replayer output will be subject to changes.

Log Converter

replay_log_converter.py is a utility to convert between binary and JSON log formats, standardize JSON logs for easier parsing, and sanitize logs for comparison.

Usage:

  • Binary to JSON: python3 replay_log_converter.py <basename> tojson
  • JSON to Binary: python3 replay_log_converter.py <basename> tobin
  • Standardize JSON: python3 replay_log_converter.py <basename> --standardize
  • Sanitize JSON: python3 replay_log_converter.py <basename> --sanitize

An optional output basename can be provided after the mode (tojson/tobin) to customize the output filename:

  • python3 replay_log_converter.py <basename> <mode> <output_basename>

The converter automatically finds all matching log files with pattern basename.PID.hostname and processes them.

Output Files:

  • Standardized JSON output is saved with .standard.json extension and can be parsed with standard JSON libraries.
  • Sanitized files are modified in-place (original files are overwritten with sanitized versions).

Examples:

  • python3 replay_log_converter.py replayer_log tojson produces replayer_log.{1270-1278}.quanta-cx77-11.json
  • python3 replay_log_converter.py replayer_log tojson converted_log produces converted_log.{1270-1278}.quanta-cx77-11.json
  • python3 replay_log_converter.py replayer_log --sanitize sanitizes existing JSON files in-place
  • python3 replay_log_converter.py replayer_log tojson --sanitize converts to JSON and sanitizes in one step
  • python3 replay_log_converter.py replayer_log --sanitize --no-timestamp (or --nts) sets all timestamps to 0.0

Sanitization: The --sanitize option normalizes logs for easier comparison by:

  • Remapping pointers to readable identifiers (e.g., comm : 0x7fb680328010comm : comm_001)
  • Normalizing timestamps relative to the first call (e.g., time : 1762969171532.248535time : 0.000000)
    • Use --no-timestamp (or --nts) to set all timestamps to 0.0 instead
  • Preserving relationships: same pointer values get the same sanitized identifier
  • Sanitized fields: communicators (comm), unique IDs (uniqueID), streams (stream), buffer addresses (addr/base/ptr/acc), handles (handle), thread IDs (thread), and process IDs (pid)