Add host API for alltoallmem_on_stream collective operation (#333)

* Add host-side rocshmem_alltoallmem_on_stream function

Function signature:
  rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
                                 const void *source, size_t size,
                                 hipStream_t stream)

- The function launches rocshmem_alltoallmem_kernel, which calls the
device-side alltoall<char> workgroup collective through the default context.
- The block size is determined dynamically via the occupancy API.
- Implemented for all backends.
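
To illustrate the data movement this collective is expected to perform, here is a
minimal host-only sketch. It makes no rocSHMEM or HIP calls. It assumes the usual
OpenSHMEM alltoallmem layout, where `size` is the number of bytes each PE sends
to every other PE; the commit message does not state this explicitly. The name
model_alltoallmem and the vector-of-buffers representation are hypothetical and
exist only for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-only model of alltoallmem data movement (hypothetical helper, not part
// of rocSHMEM). source[i] and dest[i] stand in for PE i's buffers.
// Assumption: `size` is the per-PE block size in bytes, so every buffer must
// hold at least n_pes * size bytes.
void model_alltoallmem(std::vector<std::vector<char>> &dest,
                       const std::vector<std::vector<char>> &source,
                       std::size_t size) {
  const std::size_t n_pes = source.size();
  assert(dest.size() == n_pes);
  for (std::size_t i = 0; i < n_pes; ++i) {    // sending PE
    assert(source[i].size() >= n_pes * size);
    for (std::size_t j = 0; j < n_pes; ++j) {  // receiving PE
      assert(dest[j].size() >= n_pes * size);
      // Block j of PE i's source lands in block i of PE j's dest.
      std::memcpy(dest[j].data() + i * size,
                  source[i].data() + j * size, size);
    }
  }
}
```

In the real API the copies would be issued by the kernel enqueued on `stream`,
so the caller would presumably need to synchronize the stream before reading
`dest`.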

* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends

When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.

* Add functional test for team_alltoallmem_on_stream

This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.

* Add documentation for rocshmem_alltoallmem_on_stream

This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:

[ROCm/rocshmem commit: 5577feb70d]
Anatolii Rozanov
2025-12-03 14:40:24 +01:00
committed by GitHub
parent 0f32739b52
commit 4b04b540bf
24 changed files with 479 additions and 6 deletions
@@ -109,6 +109,7 @@ declare -A TEST_NUMBERS=(
["teamctxsingleinfra"]="73"
["teamctxblockinfra"]="74"
["teamctxoddeveninfra"]="75"
["alltoallmem_on_stream"]="76"
)
ExecTest() {
@@ -428,6 +429,8 @@ TestColl() {
ExecTest "fcollect" 2 1 1 32768
ExecTest "teamreduction" 2 1 1 32768
ExecTest "alltoallmem_on_stream" 2 1 1 32768
}
TestOther() {