If the number of elements used in the allreduce operation is not an
exact multiple of the work-array buffer size and the number of PEs, we need
to adjust the algorithm to:
- initially perform a ring_allreduce on n_segments * chunk_size elements
(which is the integer division of the number of elements by the work-buffer
size, i.e. it will not cover the entire buffer)
- perform another ring_allreduce where chunk_size is reduced to match
the remaining elements
- if the elements remaining after the previous step cannot be evenly
divided by the number of PEs, perform a direct_allreduce on
the outstanding elements.
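
The three steps above can be sketched as a partition of the element count.
This is a minimal illustration, not the rocSHMEM implementation: the names
AllreducePlan and plan_allreduce are hypothetical, and it assumes that one
ring pass processes a segment of n_pes chunks, with chunk_size taken as the
work-buffer size divided by the number of PEs.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: how many elements each phase would handle.
struct AllreducePlan {
  std::size_t full_ring_elems;  // phase 1: ring_allreduce over full segments
  std::size_t tail_ring_elems;  // phase 2: ring_allreduce with reduced chunk_size
  std::size_t direct_elems;     // phase 3: direct_allreduce on the leftover
};

// Assumption (not taken from the commit): a segment is n_pes chunks and
// fits in the work buffer, so chunk_size = work_buf_elems / n_pes.
inline AllreducePlan plan_allreduce(std::size_t nelems,
                                    std::size_t work_buf_elems,
                                    std::size_t n_pes) {
  assert(n_pes > 0);

  AllreducePlan plan{0, 0, 0};
  const std::size_t chunk_size = work_buf_elems / n_pes;
  const std::size_t seg_size = chunk_size * n_pes;

  // Work buffer too small for even one element per PE: no ring pass possible.
  if (seg_size == 0) {
    plan.direct_elems = nelems;
    return plan;
  }

  // Phase 1: as many full segments as fit (integer division).
  const std::size_t n_segments = nelems / seg_size;
  plan.full_ring_elems = n_segments * seg_size;

  // Phase 2: shrink the chunk so that n_pes chunks cover the remainder
  // as far as it divides evenly among the PEs.
  const std::size_t remaining = nelems - plan.full_ring_elems;
  const std::size_t tail_chunk_size = remaining / n_pes;
  plan.tail_ring_elems = tail_chunk_size * n_pes;

  // Phase 3: whatever is still left (fewer than n_pes elements).
  plan.direct_elems = remaining - plan.tail_ring_elems;
  return plan;
}
```

For example, under these assumptions 1003 elements with a 256-element work
buffer and 4 PEs split into 768 elements for the full ring passes, 232 for
the reduced-chunk ring pass, and 3 for the direct allreduce.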
[ROCm/rocshmem commit: a4b4281f50]