- use the reduce_psync buffers for synchronization in allreduce instead
of the barrier_psync buffers.
- execute a wwg barrier after the allreduce operation. Internal
discussion determined that this barrier is required for correctness.
[ROCm/rocshmem commit: 6f512e92a5]