* Add functional test for barrier_all_on_stream
* Add rocshmem_barrier_all_on_stream support for GDA and RO backends
Implements rocshmem_barrier_all_on_stream operation for
GPU Direct Access and Reverse Offload backends.
Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend.
* Add functional test for rocshmem_broadcastmem_on_stream
* Add host-side rocshmem_broadcastmem_on_stream API
Implement stream-based broadcast collective operation
- Add rocshmem_broadcastmem_on_stream host API and kernel implementation
- Add functional test TeamBroadcastmemOnStreamTester with multi-stream
support and correctness verification
- Use per-workgroup contexts to avoid contention across parallel streams
API:
rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream)
* Add functional test for rocshmem_getmem_on_stream
* Add host-side rocshmem_getmem_on_stream API
Implement stream-based point-to-point RMA get operation
- Add rocshmem_getmem_on_stream host API and kernel implementation
- Support for asynchronous getmem operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group collective getmem for efficient memory transfer
API:
rocshmem_getmem_on_stream(dest, source, nelems, pe, stream)
(AI Assist)
* Add host-side rocshmem_putmem_on_stream API
- Add rocshmem_putmem_on_stream for asynchronous remote writes
- Support for concurrent RMA operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group device collective operation
API:
rocshmem_putmem_on_stream(dest, source, bytes, pe, stream)
(AI Assist)
* Add functional test for rocshmem_putmem_on_stream
* Add host-side rocshmem_putmem_signal_on_stream API
Enables asynchronous putmem operations with signaling on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_putmem_signal_kernel
- Host interface putmem_signal_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Public API
Function signature:
void rocshmem_putmem_signal_on_stream(void *dest, const void *source,
size_t bytes, uint64_t *sig_addr,
uint64_t signal, int sig_op,
int pe, hipStream_t stream);
* Add functional test for rocshmem_putmem_signal_on_stream
* Add host-side rocshmem_signal_wait_until_on_stream API
Enables asynchronous signal wait operations on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_signal_wait_until_kernel
- Host interface signal_wait_until_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Native uint64_t support in wait_until API (generated from P2P_SYNC.py)
Function signature:
void rocshmem_signal_wait_until_on_stream(uint64_t *sig_addr, int cmp,
uint64_t cmp_value,
hipStream_t stream);
(AI Assist)
* Add functional test for rocshmem_signal_wait_until_on_stream
* Add documentation for stream API functions
This commit adds API documentation for the following host-side
stream functions:
- rocshmem_barrier_all_on_stream (collective routines)
- rocshmem_broadcastmem_on_stream (collective routines)
- rocshmem_getmem_on_stream (RMA operations)
- rocshmem_putmem_on_stream (RMA operations)
- rocshmem_putmem_signal_on_stream (signaling operations)
- rocshmem_signal_wait_until_on_stream (point-to-point sync)
The documentation includes function signatures, parameter descriptions,
and detailed explanations of asynchronous behavior and stream handling.
(AI Assist)
* Rename "bytes" -> "nelems"
* Add "_TEST_" to the variables used in tests
* Remove incorrect hipStreamDefault usage
hipStreamDefault is not a default stream. This is a flag.
If stream == nullptr, then just pass it to kernel. It will launch the kernel on the default stream
* Add host-side rocshmem_alltoallmem_on_stream function
Function signature:
rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
const void *source, size_t size,
hipStream_t stream)
- The function launches rocshmem_alltoallmem_kernel which calls
device-side alltoall<char> workgroup collective through default context.
- Uses dynamic block size determination via occupancy API.
- Implemented for all backends.
* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends
When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.
* Add functional test for team_alltoallmem_on_stream
This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.
* Add documentation for rocshmem_alltoallmem_on_stream
This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:
* Add `ROCSHMEM_CTX_INVALID` for invalid context handling
- Define `ROCSHMEM_CTX_INVALID` as {nullptr, nullptr}
- Add == and != operators to rocshmem_ctx_t
- Use `ROCSHMEM_CTX_INVALID` on failed context creation
- Skip ctx destroy if context is invalid
* Update docs for context create and destroy APIs usage and behavior
* use dlsym for MPI functions
to allow compiling without MPI support, convert the usage of MPI functions and symbols to be based on a dlopen/dlsym based mechanism. Turns out this cannot be done entirely vendor neutral, slightly different solutions might be required for Open MPI, MPICH and the new MPI ABI.
* checkpoint
more work to be done.
* checkpoint 2
* checkpoint 3
* checkpoint 4
examples compile and link correctly
* checkpoitn 5 (I think)
* Checkpoitn 6
* dyld-mpi: adapt GDA
* dyldmpi: tests that depend on MPI need to link with it themselves
* do not ../mpi_instance.h
* dyldmpi: make the symetricHeapTestFixture compile
* dyldmpi: Change cmakery, compiles and run gda w/o external MPI
* Make it also compile in external MPI mode
* dyldmpi: ipc unit tests compile but do not link
* dyldmpi: new approach, if external mpi required, link with mpi,
otherwise use ompi5 abi
* C-style comments in cmakelist..
* dyldmpi: examples: do not fail compiling if MPI not found at build time,
instead do not compile the MPI required examples
* more updates to CMake logic
* convert RO backend
and a few other cleanups
* update some unit tests
to work with the dlopen MPI environment correctly.
---------
Co-authored-by: Aurelien Bouteiller <abouteil@amd.com>
* Add host APIs for querying device ctx and remote heap pointer
* Host API to query device pointer for ROCSHMEM_DEFAULT_CONTEXT,
this is needed to support dynamic module initialization via device kernel
library bitcode.
* Host API to query remote symmetric heap pointer that can be used in
custom device kernel for RMA operations.
* Added rocshmem_ptr implementation within the Host Context class
* Enables pointer retrieval functionality for symmetric data objects
* Copy IPC pointers to host memory in RO host context
---------
Co-authored-by: avinashkethineedi <avinash.kethineedi@amd.com>
* Refactor `Barrier_all` and `Sync_all` to use default context
- Removed context-specific implementations of barrier_all and sync_all
- Added barrier_all and sync_all to the default context implementation
- Updated functional tests to use the default context for barrier_all and sync_all
* Update `Barrier_all` and `Sync_all` API usage in documentation
* Update `CHANGELOG`
---------
Co-authored-by: Yiltan <ytemucin@amd.com>
* add code for bootstrapping
the bootstrapping code has been extracted from the MSCCLPP library,
which in parts is based on the code from NVIDIA. The code has been
modified to match the specific requirements of the rocSHMEM library.
* add code to use the new uniqueId bootstrapping
* adjust init_attr example
extend the rocshmem_init_attr example to use two disjoint groups
of processe, in order to trigger the new code path.
* add env variable for bootstrap timeout
* Update examples/rocshmem_init_attr_test.cc
Co-authored-by: Aurelien Bouteiller <Aurelien.bouteiller@gmail.com>
* Update src/rocshmem.cpp
Co-authored-by: Aurelien Bouteiller <Aurelien.bouteiller@gmail.com>
---------
Co-authored-by: Aurelien Bouteiller <Aurelien.bouteiller@gmail.com>
* Update the naming convention for collective APIs to ensure consistency across the interface.
* Move all collective API declarations to rocshmem_COLL.hpp
* The following APIs were updated as part of this change:
- `barrier`
- `barrier_all`
- `sync`
- `sync_all`
- `all_to_all`
- `broadcast`
- `fcollect`
- `all_reduce`
* Update header file generation code for collective APIs
* Add thread, wavefront, and workgroup-level `barrier` APIs in IPC and RO conduits; remove collectives on default context
- Implemented `barrier` APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `barrier` APIs
- Removed collective operations on default context
* Add thread, wavefront, and workgroup-level `sync` APIs in IPC and RO conduits.
- Implemented `sync` APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `sync` APIs
* update naming convention for context-based `barrier` APIs
* Fix deadlock in `rocshmem_ctx_wg_barrier_all` API in IPC conduit by adding per-context pSync buffers and context IDs
- Added separate pSync buffers for each device context
- Resolved deadlock when invoking barrier API (`rocshmem_ctx_wg_barrier_all`) concurrently from multiple contexts
* Update barrier_all functional tests for multi-context support
* Add thread, wavefront, and workgroup-level barrier_all APIs in IPC and RO conduits
- Implemented barrier_all APIs at thread, wavefront, and workgroup granularity
- Added support in both IPC and RO conduits
- Updated functional tests to cover all `barrier_all` APIs
* Add thread, wavefront, and workgroup-level sync_all APIs in IPC and RO conduits
- Implemented sync_all APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `sync_all` APIs
add the interfaces required to support rocshmem initialization
through the uniqueID mechanism. At the moment this still maps to
MPI initialization underneath the hood, but adding the functions might
simplify the porting of some applications to rocshmem. In addition, if
we need to transition away from MPI one day, this is also one step into
this direction.
* add team-barrier implementation
add a team-barrier API and implementation in the IPC and RO conduit.
Clean up some of the logic in the RO Conduit to distinguish between
sync, sync_all, barrier, and barrier_all.
* add team_barrier_tests to functional tests