282fc7fe71
* The QueuePair object went out of scope at the end of the for loop,
so its destructor was called.
* Although this is correct C++ behavior, it ignores that we had copied the QueuePair object
into device memory, where an instance was still in use.
* The early destruction called ibv_dereg_mr on the atomics memory region.
So when the GPU kernel tried to use the memory region, it was no longer
registered, which resulted in a protection domain error.
* The solution was to allocate our QueuePair object with the new operator, which leaves memory
management to us, so that we can call the destructor manually.
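
The lifetime problem described above can be sketched in plain C++ without
any RDMA hardware. This is only an illustration under stated assumptions:
FakeQueuePair, DeviceCopy, the registered set, and both setup functions are
hypothetical names, and the set stands in for the ibv_reg_mr / ibv_dereg_mr
calls made by the real QueuePair.

```cpp
#include <set>

// Hypothetical stand-in for the set of currently registered memory regions.
// The real code would call ibv_reg_mr / ibv_dereg_mr instead.
static std::set<int> registered;

// Hypothetical host-side object that owns a registration.
struct FakeQueuePair {
    int mr_key;
    explicit FakeQueuePair(int key) : mr_key(key) { registered.insert(key); }
    FakeQueuePair(const FakeQueuePair&) = delete;
    FakeQueuePair& operator=(const FakeQueuePair&) = delete;
    ~FakeQueuePair() { registered.erase(mr_key); }  // deregisters the region
};

// Hypothetical stand-in for the copy of the object held in device memory.
struct DeviceCopy {
    int mr_key;
};

// The "kernel" can only use a region that is still registered.
bool device_can_use(const DeviceCopy& d) {
    return registered.count(d.mr_key) != 0;
}

// Buggy pattern: the host object is a local, so its destructor runs when the
// scope ends, even though the device copy still refers to the region.
DeviceCopy setup_scoped(int key) {
    FakeQueuePair qp(key);
    return DeviceCopy{qp.mr_key};
}

// Fixed pattern: the host object is allocated with new, so it stays alive
// until the caller explicitly deletes it.
DeviceCopy setup_heap(int key, FakeQueuePair*& owner) {
    owner = new FakeQueuePair(key);
    return DeviceCopy{owner->mr_key};
}
```

With setup_scoped the device copy refers to a region that has already been
deregistered. With setup_heap the region stays registered until the caller
runs delete on the owner pointer, which should happen only after the device
has finished using it.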
[ROCm/rocshmem commit: e856fbb0eb]