From 98b958afbda32f34923c5fb06910f41a9bf200a5 Mon Sep 17 00:00:00 2001
From: David Addison
Date: Tue, 30 Jul 2024 14:50:45 -0700
Subject: [PATCH] Added some missing command line options to README.md

Also updated single and multi-node examples.

[ROCm/rccl-tests commit: 0d86b5a6e755c52be6f23ef3f4792385f5e255b1]
---
 projects/rccl-tests/README.md | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/projects/rccl-tests/README.md b/projects/rccl-tests/README.md
index 4281799430..44e406a633 100644
--- a/projects/rccl-tests/README.md
+++ b/projects/rccl-tests/README.md
@@ -24,14 +24,15 @@ NCCL tests can run on multiple processes, multiple threads, and multiple CUDA de
 
 ### Quick examples
 
-Run on 8 GPUs (`-g 8`), scanning from 8 Bytes to 128MBytes :
+Run on single node with 8 GPUs (`-g 8`), scanning from 8 Bytes to 128MBytes :
 ```shell
 $ ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 8
 ```
 
-Run with MPI on 10 processes (potentially on multiple nodes) with 4 GPUs each, for a total of 40 GPUs:
+Run 64 MPI processes on nodes with 8 GPUs each, for a total of 64 GPUs spread across 8 nodes :
+(NB: The nccl-tests binaries must be compiled with `MPI=1` for this case)
 ```shell
-$ mpirun -np 10 ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
+$ mpirun -np 64 -N 8 ./build/all_reduce_perf -b 8 -e 8G -f 2 -g 1
 ```
 
 ### Performance
@@ -59,14 +60,18 @@ All tests support the same set of arguments :
   * `-n,--iters ` number of iterations. Default : 20.
   * `-w,--warmup_iters ` number of warmup iterations (not timed). Default : 5.
   * `-m,--agg_iters ` number of operations to aggregate together in each iteration. Default : 1.
+  * `-N,--run_cycles ` run & print each cycle. Default : 1; 0=infinite.
   * `-a,--average <0/1/2/3>` Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default : 1.
 * Test operation
   * `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0.
   * `-c,--check ` perform count iterations, checking correctness of results on each iteration. This can be quite slow on large numbers of GPUs. Default : 1.
   * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
   * `-G,--cudagraph ` Capture iterations as a CUDA graph and then replay specified number of times. Default : 0.
+  * `-C,--report_cputime <0/1>]` Report CPU time instead of latency. Default : 0.
+  * `-R,--local_register <1/0>` enable local buffer registration on send/recv buffers. Default : 0.
+  * `-T,--timeout