Updating RCCL Replayer README (#1408)

gilbertlee-amd
2024-11-05 08:06:11 -07:00
committed by GitHub
parent 69b2b712ab
commit cb1027de97
+14 -1
@@ -20,7 +20,7 @@ Replayer is a debugging tool designed to analyze and replay collective logs obta
- Replays collective calls based on the recorded data.
- Skips faulty group calls during replay.
- Supports various MPI ranks and GPU configurations.
- Supports multi-node environment.
*Note: RCCL Replayer executes collective calls with dummy data.*
@@ -54,6 +54,19 @@ Depending on the MPI library used and your installation path, you may need to se
## Usage
First, collect per-rank logs from the run by adding the following environment variables.
Writing a separate log file per rank prevents race conditions in which one rank's output interrupts another rank's lines of output.
```bash
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG_FILE=some_name_here.%h.%p.log
```
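As an illustration, the variables can be exported in the shell before launching the job. This is a minimal sketch: the launch line is left commented out because `./my_rccl_app` and the rank count are placeholders, not part of the replayer. In `NCCL_DEBUG_FILE`, `%h` expands to the hostname and `%p` to the process ID, so each rank writes its own file.

```bash
# Enable INFO-level logging, restricted to the collective (COLL) subsystem.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=COLL

# One file per process: %h becomes the hostname, %p the process ID.
export NCCL_DEBUG_FILE=some_name_here.%h.%p.log

# Hypothetical launch line; replace with your own application and rank count.
# mpirun -np 2 ./my_rccl_app
```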
Second, combine all the logs into a single file, which will be the input to the replayer:
```bash
cat some_name_here.*.log > some_name_here.log
```
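If the merge step silently produces an empty file, the usual cause is a glob that matched nothing. The sketch below is a slightly more defensive variant. It assumes the per-rank files follow the `<prefix>.<host>.<pid>.log` shape produced by the `NCCL_DEBUG_FILE` pattern shown earlier. `merge_rank_logs` is a hypothetical helper name, not something shipped with the replayer.

```bash
# Merge per-rank logs named <prefix>.<host>.<pid>.log into <prefix>.log.
# Hypothetical helper for illustration only.
merge_rank_logs() {
  local prefix=$1
  local files=()
  local f

  # Collect only files that exist; an unmatched glob stays as a literal string.
  for f in "$prefix".*.log; do
    if [ -e "$f" ]; then
      files+=("$f")
    fi
  done

  if [ "${#files[@]}" -eq 0 ]; then
    echo "no per-rank logs found for prefix '$prefix'" >&2
    return 1
  fi

  # The output name <prefix>.log does not match <prefix>.*.log,
  # so re-running the merge never feeds the merged file back into itself.
  cat "${files[@]}" > "$prefix.log"
  echo "merged ${#files[@]} log(s) into $prefix.log"
}
```

Calling `merge_rank_logs some_name_here` in the directory holding the per-rank logs writes `some_name_here.log`, or returns a non-zero status if no matching files are found.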
After successfully building the replayer, you can run it using the following command:
```bash