* Add put to all pes from all lanes concurrently
* Remove wg_init, use size_t for size params, 64bit data exchange (more
bits for verification masking)
* Rename to flood-test, add put, putnbi, p, get, getnbi, g variants, count time
correctly
* Add flood tester to the testing script
* Add to the GDA test case without the _g variant, which is not implemented.
[ROCm/rocshmem commit: cca7872bcf]
* Add functional test for barrier_all_on_stream
* Add rocshmem_barrier_all_on_stream support for GDA and RO backends
Implements rocshmem_barrier_all_on_stream operation for
GPU Direct Access and Reverse Offload backends.
Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend.
* Add functional test for rocshmem_broadcastmem_on_stream
* Add host-side rocshmem_broadcastmem_on_stream API
Implement stream-based broadcast collective operation
- Add rocshmem_broadcastmem_on_stream host API and kernel implementation
- Add functional test TeamBroadcastmemOnStreamTester with multi-stream
support and correctness verification
- Use per-workgroup contexts to avoid contention across parallel streams
API:
rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream)
* Add functional test for rocshmem_getmem_on_stream
* Add host-side rocshmem_getmem_on_stream API
Implement stream-based point-to-point RMA get operation
- Add rocshmem_getmem_on_stream host API and kernel implementation
- Support for asynchronous getmem operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group collective getmem for efficient memory transfer
API:
rocshmem_getmem_on_stream(dest, source, nelems, pe, stream)
(AI Assist)
* Add host-side rocshmem_putmem_on_stream API
- Add rocshmem_putmem_on_stream for asynchronous remote writes
- Support for concurrent RMA operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group device collective operation
API:
rocshmem_putmem_on_stream(dest, source, bytes, pe, stream)
(AI Assist)
* Add functional test for rocshmem_putmem_on_stream
* Add host-side rocshmem_putmem_signal_on_stream API
Enables asynchronous putmem operations with signaling on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_putmem_signal_kernel
- Host interface putmem_signal_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Public API
Function signature:
void rocshmem_putmem_signal_on_stream(void *dest, const void *source,
size_t bytes, uint64_t *sig_addr,
uint64_t signal, int sig_op,
int pe, hipStream_t stream);
* Add functional test for rocshmem_putmem_signal_on_stream
* Add host-side rocshmem_signal_wait_until_on_stream API
Enables asynchronous signal wait operations on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_signal_wait_until_kernel
- Host interface signal_wait_until_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Native uint64_t support in wait_until API (generated from P2P_SYNC.py)
Function signature:
void rocshmem_signal_wait_until_on_stream(uint64_t *sig_addr, int cmp,
uint64_t cmp_value,
hipStream_t stream);
(AI Assist)
* Add functional test for rocshmem_signal_wait_until_on_stream
* Add documentation for stream API functions
This commit adds API documentation for the following host-side
stream functions:
- rocshmem_barrier_all_on_stream (collective routines)
- rocshmem_broadcastmem_on_stream (collective routines)
- rocshmem_getmem_on_stream (RMA operations)
- rocshmem_putmem_on_stream (RMA operations)
- rocshmem_putmem_signal_on_stream (signaling operations)
- rocshmem_signal_wait_until_on_stream (point-to-point sync)
The documentation includes function signatures, parameter descriptions,
and detailed explanations of asynchronous behavior and stream handling.
(AI Assist)
* Rename "bytes" -> "nelems"
* Add "_TEST_" to the variables used in tests
* Remove incorrect hipStreamDefault usage
hipStreamDefault is not the default stream; it is a flag.
If stream == nullptr, pass it through to the kernel launch unchanged. The kernel then launches on the default stream.
[ROCm/rocshmem commit: d0c8380650]
* Add host-side rocshmem_alltoallmem_on_stream function
Function signature:
rocshmem_alltoallmem_on_stream(rocshmem_team_t team, void *dest,
const void *source, size_t size,
hipStream_t stream)
- The function launches rocshmem_alltoallmem_kernel which calls
device-side alltoall<char> workgroup collective through default context.
- Uses dynamic block size determination via occupancy API.
- Implemented for all backends.
* Fix incorrect sync buffer size allocation for alltoall in GDA and IPC backends
When allocating memory for alltoall_pSync_pool in setup_teams() and
teams_init() functions, the code incorrectly used ROCSHMEM_BCAST_SYNC_SIZE
instead of ROCSHMEM_ALLTOALL_SYNC_SIZE.
* Add functional test for team_alltoallmem_on_stream
This commit adds a new functional test to verify the correctness of
the host-side rocshmem_team_alltoallmem_on_stream API.
* Add documentation for rocshmem_alltoallmem_on_stream
This commit adds API documentation for the host-side
rocshmem_alltoallmem_on_stream function in the collective routines
section. The documentation includes:
[ROCm/rocshmem commit: 5577feb70d]
* Remove testing of data types
As the collective is templated, we are just testing if sizeof(T) works
* Added single threaded variants
* Applied thread puts optimization to barrier
* Apply single threaded optimization to alltoall
* This optimization only works on bnxt, so place a switch to protect it
* Handle the edge case where the thread count is smaller than the number of PEs
[ROCm/rocshmem commit: 1347d5d628]
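A minimal host-side sketch of the edge case handled in the commit above, where the thread count is smaller than the number of PEs. It assumes the work is split with a strided loop so that each thread covers several PEs; the helper names `pes_for_thread` and `coverage` are hypothetical and are not taken from the rocSHMEM sources.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: the PEs that thread `tid` is responsible for when
// `n_threads` threads share the work of reaching `n_pes` PEs.
std::vector<int> pes_for_thread(int tid, int n_threads, int n_pes) {
  assert(n_threads > 0);  // a zero stride would never terminate
  std::vector<int> pes;
  // Strided loop: each thread keeps stepping by n_threads, so every PE is
  // still covered when n_threads < n_pes.
  for (int pe = tid; pe < n_pes; pe += n_threads) {
    pes.push_back(pe);
  }
  return pes;
}

// Counts how many times each PE is visited across all threads.
std::vector<int> coverage(int n_threads, int n_pes) {
  std::vector<int> hits(n_pes, 0);
  for (int tid = 0; tid < n_threads; ++tid) {
    for (int pe : pes_for_thread(tid, n_threads, n_pes)) {
      ++hits[pe];
    }
  }
  return hits;
}
```

With 4 threads and 10 PEs, thread 1 handles PEs 1, 5 and 9, and every PE is visited exactly once. With more threads than PEs, the surplus threads simply get an empty list.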
* SyncAll test case would run Sync
* Despecialized name for argument reader
* Rename sync-test to team-sync-test as it uses teams
* Another stab at probing NUM_GPUS
[ROCm/rocshmem commit: 054bc33dc4]
* use dlsym for MPI functions
To allow compiling without MPI support, convert the use of MPI functions and symbols to a dlopen/dlsym-based mechanism. It turns out this cannot be done in an entirely vendor-neutral way; slightly different solutions might be required for Open MPI, MPICH, and the new MPI ABI.
* checkpoint
more work to be done.
* checkpoint 2
* checkpoint 3
* checkpoint 4
examples compile and link correctly
* checkpoint 5 (I think)
* Checkpoint 6
* dyld-mpi: adapt GDA
* dyldmpi: tests that depend on MPI need to link with it themselves
* do not include ../mpi_instance.h
* dyldmpi: make the symetricHeapTestFixture compile
* dyldmpi: Change cmakery, compiles and runs gda w/o external MPI
* Make it also compile in external MPI mode
* dyldmpi: ipc unit tests compile but do not link
* dyldmpi: new approach, if external mpi required, link with mpi,
otherwise use ompi5 abi
* C-style comments in cmakelist..
* dyldmpi: examples: do not fail compiling if MPI not found at build time,
instead do not compile the MPI required examples
* more updates to CMake logic
* convert RO backend
and a few other cleanups
* update some unit tests
to work with the dlopen MPI environment correctly.
---------
Co-authored-by: Aurelien Bouteiller <abouteil@amd.com>
[ROCm/rocshmem commit: e4c427a736]
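A minimal sketch of the dlopen/dlsym pattern that the commit above describes, resolving a function pointer at run time instead of linking against the library at build time. This is not the rocSHMEM implementation: the table `MiniLibc`, the helpers `resolve_symbol` and `load_mini_libc`, and the use of `atoi` as a stand-in for an MPI entry point are all assumptions made so the example runs without MPI installed.

```cpp
#include <dlfcn.h>

#include <stdexcept>
#include <string>

// Hypothetical table of entry points resolved at run time. Real code would
// hold MPI entry points here; atoi stands in so the sketch runs anywhere.
struct MiniLibc {
  using atoi_fn = int (*)(const char*);
  atoi_fn atoi_ptr = nullptr;
};

// Looks up `name` in `handle` and casts it to the requested function type.
template <typename Fn>
Fn resolve_symbol(void* handle, const char* name) {
  dlerror();  // clear any stale error state before the lookup
  void* sym = dlsym(handle, name);
  if (const char* err = dlerror()) {
    throw std::runtime_error(std::string("dlsym failed: ") + err);
  }
  return reinterpret_cast<Fn>(sym);
}

// Opens `library_path` (nullptr means the running program and the libraries
// it has already loaded) and fills the table.
MiniLibc load_mini_libc(const char* library_path) {
  void* handle = dlopen(library_path, RTLD_NOW | RTLD_GLOBAL);
  if (handle == nullptr) {
    throw std::runtime_error(std::string("dlopen failed: ") + dlerror());
  }
  MiniLibc table;
  table.atoi_ptr = resolve_symbol<MiniLibc::atoi_fn>(handle, "atoi");
  return table;
}
```

Calling `load_mini_libc(nullptr).atoi_ptr("42")` goes through the resolved pointer rather than a link-time reference. Because the symbols are looked up by name, a missing library becomes a run-time error that can be handled, not a build failure. On older glibc versions the program may need to be linked with `-ldl`.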
Create teams in the functional test that are not a duplicate of
ROCSHMEM_TEAM_WORLD. This commit contains only infra-tests to make sure
that n_pes and my_pe on the new teams are indeed correct.
[ROCm/rocshmem commit: e95360961d]
* Implement `rocshmem_ptr` in IPC conduit
* tests: add functional test for `rocshmem_ptr`
- Add safety check for pointer access and condition check before printing results for `rocshmem_ptr` test
- Use `rocshmem_put` to store `rocshmem_ptr` availability for data validation
[ROCm/rocshmem commit: 526105d315]
* relax MPI dependency from code
This commit (series) removes the strict dependency on MPI in the code base.
rocSHMEM will still be compiled with MPI, but the goal is to make the
code work even if MPI_Init_thread has not been invoked, at least for
certain, well-defined scenarios. Hence, the goal is not to remove every
mention of MPI from rocSHMEM, but to ensure correct execution of the
IPC conduit even if the library has been initialized using other means.
Details:
- add non-MPI version of remote_heap and WindowInfo classes
- host interfaces work on WindowInfoMPI, they will not work with the
non-MPI code path. Since it is unclear whether we plan to support the
host interfaces at all, this is probably not a major limitation.
* update symmetric_heap structures and backend
* first cut on initialization
and enabling non-MPI initialization of the IPCBackend
* add non-MPI hostInterface methods
at the moment, only barrier_all and sync_all are explicitly supported.
* add non-mpi version of ipc_policy
and a number of smaller fixes required in other files.
A small init/finalize test already passes now with the branch.
* add non-mpi team_split_strided code
* minor fixes for non-MPI use-case
* disable symmetric-heap-window-info test
disable this test for now just to make the compilation pass. Will have
to rework it.
* make no-mpi great again
after rebasing on top of the MPI singleton changes.
* enable running functional tests with uuid init
to run the functional tests using rocshmem_init_attr and the uuid
mechanism requires
a) a PMIx installation on the system
b) setting the environment variable ROCSHMEM_TEST_UUID=1
* fix multi-team creation bug
fix a bug occurring when creating many teams, which was the result of
incorrectly applying two indices in our own implementation of Allreduce.
* make unit tests pass again
* reverse offload was impacted by code change
fix the RO conduit to cope with the non-MPI path introduced for the IPC
conduit.
* update to cmake logic to find pmix
* Update src/memory/window_info.hpp
Co-authored-by: Yiltan <ytemucin@amd.com>
* Update CMakeLists.txt
Co-authored-by: Yiltan <ytemucin@amd.com>
* document ROCSHMEM_UNIQUEID_NO_MPI
* rename env. variable to UNIQUEID_WITH_MPI
* update host.cpp to use USE_HDP_FLUSH macro
instead of the deprecated USE_COHERENT_HEAP.
* add note for running example with RO conduit
add a note clarifying that running init_attr_test from the example
directory requires setting an additional environment variable with the
RO conduit.
* Find PMIx in more cases; only apply PMIx build options to the test that
needs it; abort if OMPI_COMM_WORLD_LOCAL_RANK is not set
---------
Co-authored-by: Yiltan <ytemucin@amd.com>
Co-authored-by: Aurelien Bouteiller <abouteil@amd.com>
[ROCm/rocshmem commit: 6ea5edc951]
* Add thread, wavefront, and workgroup-level `barrier` APIs in IPC and RO conduits; remove collectives on default context
- Implemented `barrier` APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `barrier` APIs
- Removed collective operations on default context
* Add thread, wavefront, and workgroup-level `sync` APIs in IPC and RO conduits.
- Implemented `sync` APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `sync` APIs
* update naming convention for context-based `barrier` APIs
[ROCm/rocshmem commit: dc61bca066]
* Fix deadlock in `rocshmem_ctx_wg_barrier_all` API in IPC conduit by adding per-context pSync buffers and context IDs
- Added separate pSync buffers for each device context
- Resolved deadlock when invoking barrier API (`rocshmem_ctx_wg_barrier_all`) concurrently from multiple contexts
* Update barrier_all functional tests for multi-context support
* Add thread, wavefront, and workgroup-level barrier_all APIs in IPC and RO conduits
- Implemented barrier_all APIs at thread, wavefront, and workgroup granularity
- Added support in both IPC and RO conduits
- Updated functional tests to cover all `barrier_all` APIs
* Add thread, wavefront, and workgroup-level sync_all APIs in IPC and RO conduits
- Implemented sync_all APIs for thread, wavefront, and workgroup scopes
- Added support into both IPC and RO conduits
- Added functional tests to cover all `sync_all` APIs
[ROCm/rocshmem commit: c652f58cef]
* Allocate default context buffers and initialize queue for management
- Allocated the status flag, g return, and atomic return buffers for
the default context.
- Initialized `AtomicWFQueueProxy` instances to manage these buffers
efficiently for concurrent access.
* Update `BlockHandle` with default context buffers
* Add default context flag and update buffer retrieval functions
- Added a flag to distinguish the default context from other contexts.
- Modified return buffer functions and `get_status_flag` function to accommodate
the default context
* Add default context primitive tests
- get, put, get_nbi, put_nbi, g, and p APIs.
[ROCm/rocshmem commit: 867519e1d0]
* add team-barrier implementation
add a team-barrier API and implementation in the IPC and RO conduit.
Clean up some of the logic in the RO Conduit to distinguish between
sync, sync_all, barrier, and barrier_all.
* add team_barrier_tests to functional tests
[ROCm/rocshmem commit: bcbc42e78f]
* Update primitive tests for multi-workgroup support
* Update workgroup primitive tests for multi-workgroup support
* Update wavefront primitive tests for multi-workgroup support
* Update team based primitive tests for multi-workgroup support
* Update RMA functional tests to capture timing after quiet call
- Modified RMA functional tests to record the time after a `quiet` call in thread, wavefront, and workgroup RMA calls.
* Improve error handling and memory management
- Replaced `cout` with `cerr` for improved error reporting.
- Ensured all allocated memory is freed when `rocshmem_malloc` fails.
* Update start time in primitive tests and latency calculations
- Modified primitive tests to capture the earliest start time.
- Updated latency calculations in functional tests.
* Remove `GetSwarmTester`
* Update start time in team primitive tests
* Invoke quiet call from a single thread within a block on a rocshmem context
[ROCm/rocshmem commit: aa3121a967]
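A minimal host-side sketch of the timing change in the commit above, where the tests capture the earliest start time across workgroups. Pairing it with the latest end time is an assumption, as are the struct `WgTimestamps`, the helpers `elapsed_ticks` and `avg_latency_us`, and the tick-to-microsecond conversion; none of these names come from the rocSHMEM test sources.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical per-workgroup timestamps, in device clock ticks.
struct WgTimestamps {
  std::uint64_t start;
  std::uint64_t end;  // assumed to be recorded after the quiet call returns
};

// Elapsed ticks for the whole launch: earliest start to latest end.
std::uint64_t elapsed_ticks(const std::vector<WgTimestamps>& wgs) {
  assert(!wgs.empty());
  std::uint64_t first_start = wgs.front().start;
  std::uint64_t last_end = wgs.front().end;
  for (const WgTimestamps& t : wgs) {
    first_start = std::min(first_start, t.start);
    last_end = std::max(last_end, t.end);
  }
  return last_end - first_start;
}

// Average latency per iteration in microseconds.
double avg_latency_us(std::uint64_t ticks, double ticks_per_us,
                      std::uint64_t iterations) {
  return static_cast<double>(ticks) / ticks_per_us /
         static_cast<double>(iterations);
}
```

Using the earliest start avoids under-reporting when workgroups begin at different times: for workgroups spanning ticks 100 to 500, 90 to 480 and 120 to 590, the measured window is 90 to 590.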
- Added multi-workgroup support for the All-to-all, Fcollect, Broadcast, Barrier and Sync collective functional tests
- Renamed All-to-all and Fcollect tests to TeamAlltoAll and TeamFcollect
[ROCm/rocshmem commit: 57d60aa727]
* Implemented tiled version of put*_wave and get*_wave functions
* Maintain single kernel that supports both tiled and untiled versions
* Disable IPC in the default RO build script
[ROCm/rocshmem commit: b6d31ac7ef]
* These functional tests are simple puts and gets, where every wave will get/put the same amount of data
* Enabled workgroup level puts and gets tests
[ROCm/rocshmem commit: ff954237dd]