Files
Jack Taylor ebcac26530 Add pytorch rccl/intra node all-reduce benchmark (#1221)
* Add gpt-fast pytorch all reduce benchmark script

* Update readme instructions

* Minor changes

[ROCm/rccl commit: 5f2b88bc28]
2024-06-25 08:04:38 -07:00

1.0 KiB

Small benchmark utility for gpt-fast's all reduce.

How to run

Out of box run (This will try various sequence lengths and dump perf results to terminal output)

torchrun --nproc_per_node=8 all_reduce.py 

To enable intra node all-reduce algorithms use:

ENABLE_INTRA_NODE_COMM=1 torchrun --nproc_per_node=8 python3 all_reduce.py

Rocprof trace script

To create perfetto traces for each rank of each all reduce a bash script is provided.

ENABLE_INTRA_NODE_COMM=1 bash trace_runs.sh

Additional options:

The tensor size is dependent on sequence_length and dim supplied in gpt fast. There are 4 different all-reduce calls in gpt-fast at runtime:

  • 1: [seq_len, dim]
  • 2: [seq_len, 2, dim]
  • 3: [1, dim]
  • 4: [1, 2, dim]
--sequence_lengths (defaults to [50, 64, 128, 256, 512, 1024, 2048, 4096])
--dim (defaults to 6144)
--all_reduce (defaults to [0,1,2,3]) - Can be modified to only run a single all-reduce, mapping to the 4 all reduces listed above 
--tracing - Enables tracing mode to skip CPU timers in recording