* Add fault injection of starting warps with random variations
This is done by inserting randomly delays after __syncthreads().
The feature can be turned off by FAULT_INJECTION=OFF in cmake.
* Remove manually introduced bug for demo purpose
* Use only one thread per warp for checking wall clock
[ROCm/rccl commit: 90ad586d94]