Use all available threads for polling the cq to increase the maximum
message rate. Even when posting a single wqe in the wave, use all
available theads for polling the cq to reserve space in the sq.
Changes were needed in the rocshmem abstraction to avoid disabling gpu
threads, like taking turns or using only the first thread in a wave or
wavefront. To avoid breaking other gda implementations, reimplement
turn-based or single thread strategy in post_wqe_rma_turn and
post_wqe_rma_single.
Signed-off-by: Allen Hubbe <allen.hubbe@amd.com>
* Update primitive tests for multi-workgroup support
* Update workgroup primitive tests for multi-workgroup support
* Update workfront primitive tests for multi-workgroup support
* Update team based primitive tests for multi-workgroup support
* Update RMA functional tests to capture timing after quiet call
- Modified RMA functional tests to record the time after a `quiet` call in thread, wavefront, and workgroup RMA calls.
* Improve error handling and memory management
- Replaced `cout` with `cerr` for improved error reporting.
- Ensured all allocated memory is freed when `rocshmem_malloc` fails.
* Update start time in primitive tests and latency calculations
- Modified primitive tests to capture the earliest start time.
- Updated latency calculations in functional tests.
* Remove `GetSwarmTester`
* Update start time in team primitive tests
* Invoke quiet call from a single thread within a block on a rocshmem context