[ROCm/rccl commit: 858b4e76eb]
Structured Logging
As part of the efforts to enhance RCCL Replayer functionality, RCCL now provides detailed logging of API calls.
Usage
- Structured logging is a built-in module of RCCL source. For RCCL library in ROCm release, it is present starting from ROCm 7.0. To enable structured logging, point LD_LIBRARY_PATH to supporting RCCL library, then run with environment variable
RCCL_REPLAY_FILE="${filename}". - If the value of
RCCL_REPLAY_FILEcontains “.json” extension, the log will be exported as text in JSON format. Otherwise, the log will be a binary file by default, with ".bin" extension. - Each process will export its own log local to the directory and node of the executable. Names of each process' output will be in the format of
filename.PID.hostname.extension. For example, run workload withRCCL_REPLAY_FILE="replayer_log"will produce logs named such asreplayer_log.1275.quanta-cyxtera-cx77-11.bin. - Log level is controlled by
RCCL_LOG_LEVELenvironment. Currently, by default, log level is 1 and will record most essential RCCL APIs which perform actual operations or incur changes to the communicator. Otherwise, other informational RCCL calls such asncclGetAsyncErroror third party function calls likemscclRunAlgowill be recorded, too.
What it does
- Every public RCCL API call is tracked and logged in order of execution
- Whenever possible, a log entry is flushed to file immediately after a call is made. Exceptions are when a function returns a handle, in which case the RCCL API may not be registered in case of deadlock or error.
- HIP Graph compatible - will report if a call is made in graph capture mode. If in capture mode, RCCL will append a CPU callback node to the graph which logs the call when the graph runs.
Output format
Each line in the log is an entry of RCCL API call. ncclGroupStart() and ncclGroupEnd() correspond to opening and closing brackets, each on a separate line, forming a scope. Each line has 2 space of indentation per level of group depth.
Each entry will take the format of
Name : [Parameters : Value, ..., context : [...]](, Trailing Data)
Parameters
will contain all the parameters and their values of the function call as defined in nccl.h header. Collectives will contain additional data about the communicator size, number of tasks, its rank, and its opCount in communicator.
Wherever applicable, the structured logging preserves underlying RCCL data constructs and how they are filled.
Device context
contains the following fields: timestamp, thread (caller thread ID), device ( GPU ID which the caller was bound to), captured (graph capture mode or not), and the graphID. If a call was made in graph capture mode and not actually running on GPU device, captured will be 1. All entries would have a graphID, but their validity depend on whether there were previous captures.
Trailing Info
Certain calls will have additional data following the parameters and contexts:
ncclCommInitAllwill print of list of devicesncclCommInitRankConfigandncclCommSplitwill have fields ofncclConfig_tncclGroupSimulateEndwill print all fields ofncclSimInfo_tncclAllToAllvwill be followed by four comma separated lists of send/recv counts and displacements
Example
Here is an example log of the CommSplit_Reduces Unit Test from RCCL, run with RCCL_REPLAY_FILE="log.json" UT_DATATYPES="ncclInt32" UT_PROCESS_MASK=2 UT_MIN_GPUS=4 ./rccl-UnitTests --gtest_filter="Standalone.SplitComms_Reduce" UT_DATATYPES="ncclInt32"
{
hostname : banff-pla-r27-29, version : 0,
CommInitAll : [# of device : 4, context : [time : 1745623615674.781006, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
GetUniqueId : [uniqueID : 17168916771200794912, context : [time : 1745623615676.041748, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
{
CommInitDev : [comm : 0x7fb680328010, size : 4, uniqueID : 17168916771200794912, rank : 0, dev : 0, context : [time : 1745623615676.157959, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommInitDev : [comm : 0x7fa67bb28010, size : 4, uniqueID : 17168916771200794912, rank : 1, dev : 1, context : [time : 1745623615676.207520, thread : 1085917, device : 1, captured : -1, graphID : 0 ]],
CommInitDev : [comm : 0x7f967b650010, size : 4, uniqueID : 17168916771200794912, rank : 6, dev : 6, context : [time : 1745623615676.407227, thread : 1085917, device : 6, captured : -1, graphID : 0 ]],
CommInitDev : [comm : 0x7f8e7ad27010, size : 4, uniqueID : 17168916771200794912, rank : 7, dev : 7, context : [time : 1745623615676.468750, thread : 1085917, device : 7, captured : -1, graphID : 0 ]]
},
{
CommSplit : [comm : 0x7fb680328010, color : 0, key : 0, newcomm : (nil), context : [time : 1745623619986.908691, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommSplit : [comm : 0x7fa67bb28010, color : 0, key : 0, newcomm : 0x1, context : [time : 1745623619987.017090, thread : 1085917, device : 1, captured : -1, graphID : 0 ]],
CommSplit : [comm : 0x7f967b650010, color : 0, key : -1, newcomm : 0x6, context : [time : 1745623619987.202393, thread : 1085917, device : 6, captured : -1, graphID : 0 ]],
CommSplit : [comm : 0x7f8e7ad27010, color : 0, key : -1, newcomm : 0x7, context : [time : 1745623619987.205322, thread : 1085917, device : 7, captured : -1, graphID : 0 ]]
},
CommDestroy : [comm : 0x126d9a0, context : [time : 1745623620419.313721, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : 0x1345370, context : [time : 1745623620419.422363, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : (nil), context : [time : 1745623620888.018555, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : (nil), context : [time : 1745623620888.020996, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : 0x7fb680328010, context : [time : 1745623620888.023193, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : 0x7fa67bb28010, context : [time : 1745623620888.228027, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : 0x7f967b650010, context : [time : 1745623620889.099121, thread : 1085917, device : 0, captured : -1, graphID : 0 ]],
CommDestroy : [comm : 0x7f8e7ad27010, context : [time : 1745623620889.293213, thread : 1085917, device : 0, captured : -1, graphID : 0 ]]
}
RCCL Replayer
Replayer is a separate tool which aims to re-run the same set of RCCL calls as recorded and report the cumulative time taken by these calls, provided with the binary output of structured logging.
Installation
- Replayer relies on MPI for out of band communication.
- Under
rccl/tools/RcclReplayer, runMPI_DIR=${MPI_PATH} make - Replayer can be built from RCCL source using internal headers directly. It links against a RCCL library specified by
RCCL_DIR(defaults to../../build/release). For compatibility, it is recommended that logs are collected using the same RCCL library version.
Running
- Replayer requires the exact same number of processes and processes per node as the recorded job. And all log files must be accessible by all processes in Replayer, either through shared filesystem or copies.
- To run Replayer, simply call
mpirun -np ${np} ./rcclReplayer ${filename}.${extension} - For example, we are on node quanta-cyxtera-cx77-11, with 8 logs:
replayer_log.{1270-1278}.quanta-cyxtera-cx77-11.bin. Runmpirun -np 8 ./rcclReplayer replayer_log.bin - Replayer will have a parse and replay phase. During parsing it will create communicators as original RCCL job, assign log files to individual processes, and allocate resources. Then actual replay happens, re-running all the RCCL APIs with same parameters and device assignment. It is also able to capture and launch graphs involving RCCL calls, as recorded by structured logging. Actual data in original job such as message payload or vector values are not recorded therefore not replicated.
Output
Each rank will print out its progress as it goes through every line of calls, including its rank, line number, RCCL API name, status (INFO/WARNING/ERROR). It will also report time and bandwidth (if the line is a communication call) for that call. In the end, it will report the total time taken by all communication calls. Replayer is still under development and experimentations, so the formats of logging or contents of replayer output will be subject to changes.
Log Converter
replay_log_converter.py is a utility to convert between binary and JSON log formats, standardize JSON logs for easier parsing, and sanitize logs for comparison.
Usage:
- Binary to JSON:
python3 replay_log_converter.py <basename> tojson - JSON to Binary:
python3 replay_log_converter.py <basename> tobin - Standardize JSON:
python3 replay_log_converter.py <basename> --standardize - Sanitize JSON:
python3 replay_log_converter.py <basename> --sanitize
An optional output basename can be provided after the mode (tojson/tobin) to customize the output filename:
python3 replay_log_converter.py <basename> <mode> <output_basename>
The converter automatically finds all matching log files with pattern basename.PID.hostname and processes them.
Output Files:
- Standardized JSON output is saved with
.standard.jsonextension and can be parsed with standard JSON libraries. - Sanitized files are modified in-place (original files are overwritten with sanitized versions).
Examples:
python3 replay_log_converter.py replayer_log tojsonproducesreplayer_log.{1270-1278}.quanta-cx77-11.jsonpython3 replay_log_converter.py replayer_log tojson converted_logproducesconverted_log.{1270-1278}.quanta-cx77-11.jsonpython3 replay_log_converter.py replayer_log --sanitizesanitizes existing JSON files in-placepython3 replay_log_converter.py replayer_log tojson --sanitizeconverts to JSON and sanitizes in one steppython3 replay_log_converter.py replayer_log --sanitize --no-timestamp(or--nts) sets all timestamps to 0.0
Sanitization:
The --sanitize option normalizes logs for easier comparison by:
- Remapping pointers to readable identifiers (e.g.,
comm : 0x7fb680328010→comm : comm_001) - Normalizing timestamps relative to the first call (e.g.,
time : 1762969171532.248535→time : 0.000000)- Use
--no-timestamp(or--nts) to set all timestamps to 0.0 instead
- Use
- Preserving relationships: same pointer values get the same sanitized identifier
- Sanitized fields: communicators (
comm), unique IDs (uniqueID), streams (stream), buffer addresses (addr/base/ptr/acc), handles (handle), thread IDs (thread), and process IDs (pid)