diff --git a/README.md b/README.md index 791bed2599..bff6433b89 100644 --- a/README.md +++ b/README.md @@ -52,19 +52,21 @@ All tests support the same set of arguments : * `-i,--stepbytes ` fixed increment between sizes. Default : (max-min)/10. * `-f,--stepfactor ` multiplication factor between sizes. Default : disabled. * NCCL operations arguments - * `-o,--op ` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum. + * `-o,--op ` Specify which reduction operation to perform. Only relevant for reduction operations like Allreduce, Reduce or ReduceScatter. Default : Sum. * `-d,--datatype ` Specify which datatype to use. Default : Float. * `-r,--root ` Specify which root to use. Only for operations with a root like broadcast or reduce. Default : 0. * Performance * `-n,--iters ` number of iterations. Default : 20. * `-w,--warmup_iters ` number of warmup iterations (not timed). Default : 5. * `-m,--agg_iters ` number of operations to aggregate together in each iteration. Default : 1. + * `-a,--average <0/1/2/3>` Report performance as an average across all ranks (MPI=1 only). <0=Rank0,1=Avg,2=Min,3=Max>. Default : 1. * Test operation * `-p,--parallel_init <0/1>` use threads to initialize NCCL in parallel. Default : 0. * `-c,--check <0/1>` check correctness of results. This can be quite slow on large numbers of GPUs. Default : 1. * `-z,--blocking <0/1>` Make NCCL collective blocking, i.e. have CPUs wait and sync after each collective. Default : 0. + * `-G,--cudagraph ` Capture iterations as a CUDA graph and then replay specified number of times. Default : 0. ## Copyright -NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2019, NVIDIA CORPORATION. All rights reserved. +NCCL tests are provided under the BSD license. All source code and accompanying documentation is copyright (c) 2016-2021, NVIDIA CORPORATION. All rights reserved.