From cb1027de978e244e7b2056eac07e4d4f10853836 Mon Sep 17 00:00:00 2001
From: gilbertlee-amd <44450918+gilbertlee-amd@users.noreply.github.com>
Date: Tue, 5 Nov 2024 08:06:11 -0700
Subject: [PATCH] Updating RCCL Replayer README (#1408)

---
 tools/rccl_replayer/README.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/tools/rccl_replayer/README.md b/tools/rccl_replayer/README.md
index 60de03a845..e61c8c3455 100644
--- a/tools/rccl_replayer/README.md
+++ b/tools/rccl_replayer/README.md
@@ -20,7 +20,7 @@ Replayer is a dubugging tool designed to analyze and replay collective logs obta
 - Replays collective calls based on the recorded data.
 - Skips faulty group calls during replay.
 - Supports various MPI ranks and GPU configurations.
-- Supports multi-node environment.
+- Supports multi-node environment.
 
 *Note: RCCL Replayer executes collective calls with dummy data.*
 
@@ -54,6 +54,19 @@ Depending on the MPI library used and your installation path, you may need to se
 
 ## Usage
 
+First, collect per-rank logs from the run by setting the following environment variables.
+Writing each rank's log to its own file prevents race conditions in which one rank interrupts another rank's lines of output.
+
+```bash
+ NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG_FILE=some_name_here.%h.%p.log
+```
+
+Second, combine all the logs into a single file, which will be the input to the replayer:
+
+```bash
+ cat some_name_here.*.log > some_name_here.log
+```
+
 After successfully building the replayer, you can run it using the following command:
 
 ```bash