70804da15b
* Refactor primitive test to support multiple GPUs in rings
* Make GPU sync before transfer optional
* Use the same ring format as RCCL
* Extend to 8 GPUs and report errors if there is no P2P access
* Control GPU sync before ops from the command line with the "-s" option
* Change buffer size through the command-line option "-n"
* Rename the iterations command-line option to "-i"