* Support fused all reduce and elementwise operations
Add additional "acc" parameter to RCCL Replayer logs
Add flag which indicates availability of new API
* Fix Recorder json parsing
* Remove unreachable code
* Remove extra acc pointer check
* .
* Revert "[DEVICE] Adding ability to choose unroll factor at runtime (#1734)"
This reverts commit 9d72be7b2f.
* Use noinline to reduce kernels linking time
* Don't use noinline for gfx942 and gfx950 to avoid perf regression
---------
Co-authored-by: AtlantaPepsi <timhu102@amd.com>
Co-authored-by: BertanDogancay <bertan.dogancay@gmail.com>
RCCL REPLAYER
Collective log replayer tool for RCCL.
Table of Contents
Introduction
Replayer is a dubugging tool designed to analyze and replay collective logs obtained from RCCL (ROCm Communication Collectives Library) runs. It can be a useful tool when trying to recreate problem situations (without as much setup), or as a user-directed utility to run collectives (by crafting their own 'logfile').
Features
- Parses and validates collective logs from RCCL runs.
- Detects missing/faulty group calls and provides report.
- Replays collective calls based on the recorded data.
- Skips faulty group calls during replay.
- Supports various MPI ranks and GPU configurations.
- Supports multi-node environment.
Note: RCCL Replayer executes collective calls with dummy data.
How It Works
Replayer operates in the following steps:
-
Collective Log Collection: During your RCCL runs, the collective logs are generated when NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL enabled, capturing important information like hostname, deviceIdx, collective call type, number of elements used, data type, operation type, task number, and global rank number about collective communication patterns.
-
Data Aggregation: Replayer collects and pareses the collective logs. organizing them based on opCount (collective count in the group call), and global rank information.
-
Group Call Validation: After acquiring data from the collective logs and generating group calls, the replayer validates the results using two different methods. For Non-Send/Recv collectives, it checks if each MPI rank has the required number of collective tasks. For Send/Recv collectives, it verifies if they all have a matching pair.
-
Replaying RCCL: Based on the aggregated and validated data, Replayer will replay the collective logs to reproduce the RCCL runs from your application.
-
Reporting and Skipping. Replayer outputs the detected faulty group calls and skips them during replay. It provides a report showing which group calls were skipped and why and, at the end, summarizes how many group calls were replayed and how many were skipped.
Installation
To build the replayer, follow these steps:
- Navigate to the rccl_replayer directory.
- Make sure 'MPI_DIR' is set to the path where your MPI installation is located.
cd rccl/tools/rccl_replayer
MPI_DIR=/path/to/mpi make
Depending on the MPI library used and your installation path, you may need to set the MPI_DIR path accordingly.
Usage
First Collect per-rank logs from the run by adding the following environment variables: This prevents any race-conditions that might cause ranks to interupt other ranks lines of output.
NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=COLL NCCL_DEBUG_FILE=some_name_here.%h.%p.log
Secondly, combine all the logs into a single file which will be the input to the replayer:
cat some_name_here_*.log > some_name_here.log
After successfully building the replayer, you can run it using the following command:
mpirun -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
Replace with the number of MPI processes you want to run during the replay, </path/to/logfile> with the path to the collective log file generated during your RCCL runs, and with the number of GPUs per MPI rank used in your application.
Depending on the MPI library you use, you may need to modify the mpirun command accordingly.
Multi-Node Environment:
If multiple nodes were used for your application, you can also replay the collective logs using multiple nodes. See the following command:
mpirun --hostfile <path/to/hostfile.txt> -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
SLURM:
For systems using SLURM, you can use the following command to replay the collective logs:
srun -N <numNodes> -n <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
Replace with the number of nodes used in your application.