diff --git a/projects/rccl-tests/README.md b/projects/rccl-tests/README.md
index 1532a658fb..92b122c2f2 100644
--- a/projects/rccl-tests/README.md
+++ b/projects/rccl-tests/README.md
@@ -40,23 +40,24 @@ All tests support the same set of arguments :
 
 * Number of GPUs
   * `-t,--nthreads ` number of threads per process. Default : 1.
-  * `-g,--ngpus ` number of gpus per process. Default : 1.
+  * `-g,--ngpus ` number of gpus per thread. Default : 1.
 * Sizes to scan
   * `-b,--minbytes ` minimum size to start with. Default : 32M.
   * `-e,--maxbytes ` maximum size to end at. Default : 32M.
   * Increments can be either fixes of a multiplication factor. Only one of those should be used
-    * `-i,--stepbytes ` fixed increment between sizes. Default : (max-min)/10.
-    * `-f,--stepfactor ` multiplication factor between sizes. Default : disabled.
+    * `-i,--stepbytes ` fixed increment between sizes. Default : (max-min)/10.
+    * `-f,--stepfactor ` multiplication factor between sizes. Default : disabled.
+* NCCL operations arguments
+  * `-o,--op ` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum.
+  * `-d,--datatype ` Specify which datatype to use. Default : Float.
+  * `-r,--root ` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0.
 * Performance
   * `-n,--iters ` number of iterations. Default : 20.
   * `-w,--warmup_iters ` number of warmup iterations (not timed). Default : 5.
-* `-s,--swap_args <0/1>` when used with multiple threads, have threads manage different GPUs for each iteration. Default : 0.
-* `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel.
-* `-c,--check <0/1>` check correctness of results. This can be quite slow on large numbers of GPUs. Default : 1.
-* NCCL operations arguments
-  * `-o,--op ` Specify which reduction operation to perform. Only relevant for reduction operations. Default : Sum.
-  * `-d,--datatype ` Specify which datatype to use. Default : Float.
-  * `-r,--root ` Specify which root to use. Only for operations with a root like broadcast or reduce.
+* Test operation
+  * `-s,--swap_args <0/1>` when used with multiple threads, have threads manage different GPUs for each iteration. Default : 0.
+  * `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0.
+  * `-c,--check <0/1>` check correctness of results. This can be quite slow on large numbers of GPUs. Default : 1.
   * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0.
 
 ## Copyright
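
Note on the "Sizes to scan" options touched by this diff: the README text says the sweep runs from `-b` to `-e` using either a fixed increment (`-i`) or a multiplication factor (`-f`), with a default increment of (max-min)/10. The sketch below illustrates that description only; it is not taken from the test sources. The function name `sizes_to_scan`, the integer division, and the guard against a zero step when min equals max are all assumptions made for illustration, and the real tool may round or terminate differently.

```python
def sizes_to_scan(min_bytes, max_bytes, step_bytes=None, step_factor=None):
    """Hypothetical helper: list the sizes implied by -b, -e, -i and -f."""
    if step_bytes is not None and step_factor is not None:
        # The README says only one of the two increments should be used.
        raise ValueError("use only one of step_bytes (-i) and step_factor (-f)")
    if min_bytes > max_bytes:
        raise ValueError("min_bytes must not exceed max_bytes")

    sizes = []
    size = min_bytes
    if step_factor is not None:
        if step_factor <= 1 or size <= 0:
            raise ValueError("a multiplicative sweep needs factor > 1 and min > 0")
        while size <= max_bytes:
            sizes.append(size)
            size *= step_factor
    else:
        if step_bytes is None:
            # Documented default: (max-min)/10. Integer division is an assumption.
            step_bytes = (max_bytes - min_bytes) // 10
        # Assumption: never step by zero, so min == max yields a single size.
        step = max(step_bytes, 1)
        while size <= max_bytes:
            sizes.append(size)
            size += step
    return sizes
```

Under these assumptions, `-b 8 -e 128 -f 2` would correspond to `sizes_to_scan(8, 128, step_factor=2)`, and leaving both size flags at their defaults (min equal to max) would produce a single size.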