* addressing hip_fp8 support compatibility issue
* skipping mulsum and avg test for fp8, using hip_fp8 for product
* syncing with nccl-tests
removing the fp8 filter for pre-hopper gpus and resolving the merge conflict
---------
Co-authored-by: Marzieh Berenjkoub <mberenjk@amd.com>
Build option DSO=1 generates libverifiable.so which can be
used to reduce the combined binary size.
Build option NAME_SUFFIX can be used to a add suffix to all
generated binaries. e.g. NAME_SUFFIX=_mpi
Added new make target: clean_intermediates
avoid a division by zero which seems to only occur for op=prod and
datatype=half, since the maximum exponent is small (15) and can exceed
the number of ranks.
* Added "verifiable", a suite of kernels for generating and verifying reduction
input and output arrays in a bit-precise way.
* Data corruption errors now reported in number of wrong elements instead of max
deviation.
* Use ncclGetLastError.
* Don't run hypercube on non-powers of 2 ranks.
* Fix to hypercube data verification.
* Use "thread local" as the defaut CUDA capture mode.
* Replaced pthread_yield -> sched_yield()
* Bugfix to the cpu-side barrier/allreduce implementations.