d0c8380650
* Add functional test for barrier_all_on_stream
* Add rocshmem_barrier_all_on_stream support for GDA and RO backends
Implements rocshmem_barrier_all_on_stream operation for
GPU Direct Access and Reverse Offload backends.
Previously, rocshmem_barrier_all_on_stream was only supported for IPC backend.
* Add functional test for rocshmem_broadcastmem_on_stream
* Add host-side rocshmem_broadcastmem_on_stream API
Implement stream-based broadcast collective operation
- Add rocshmem_broadcastmem_on_stream host API and kernel implementation
- Add functional test TeamBroadcastmemOnStreamTester with multi-stream
support and correctness verification
- Use per-workgroup contexts to avoid contention across parallel streams
API:
rocshmem_broadcastmem_on_stream(team, dest, source, nelems, pe_root, stream)
* Add functional test for rocshmem_getmem_on_stream
* Add host-side rocshmem_getmem_on_stream API
Implement stream-based point-to-point RMA get operation
- Add rocshmem_getmem_on_stream host API and kernel implementation
- Support for asynchronous getmem operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group collective getmem for efficient memory transfer
API:
rocshmem_getmem_on_stream(dest, source, nelems, pe, stream)
(AI Assist)
* Add host-side rocshmem_putmem_on_stream API
- Add rocshmem_putmem_on_stream for asynchronous remote writes
- Support for concurrent RMA operations on HIP streams
- Add backend support for GDA, RO, and IPC contexts
- Use work-group device collective operation
API:
rocshmem_putmem_on_stream(dest, source, bytes, pe, stream)
(AI Assist)
* Add functional test for rocshmem_putmem_on_stream
* Add host-side rocshmem_putmem_signal_on_stream API
Enables asynchronous putmem operations with signaling on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_putmem_signal_kernel
- Host interface putmem_signal_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Public API
Function signature:
void rocshmem_putmem_signal_on_stream(void *dest, const void *source,
size_t bytes, uint64_t *sig_addr,
uint64_t signal, int sig_op,
int pe, hipStream_t stream);
* Add functional test for rocshmem_putmem_signal_on_stream
* Add host-side rocshmem_signal_wait_until_on_stream API
Enables asynchronous signal wait operations on HIP streams.
The implementation includes:
- Kernel wrapper rocshmem_signal_wait_until_kernel
- Host interface signal_wait_until_on_stream method
- Context layer support across all backends (IPC, GDA, RO)
- Native uint64_t support in wait_until API (generated from P2P_SYNC.py)
Function signature:
void rocshmem_signal_wait_until_on_stream(uint64_t *sig_addr, int cmp,
uint64_t cmp_value,
hipStream_t stream);
(AI Assist)
* Add functional test for rocshmem_signal_wait_until_on_stream
* Add documentation for stream API functions
This commit adds API documentation for the following host-side
stream functions:
- rocshmem_barrier_all_on_stream (collective routines)
- rocshmem_broadcastmem_on_stream (collective routines)
- rocshmem_getmem_on_stream (RMA operations)
- rocshmem_putmem_on_stream (RMA operations)
- rocshmem_putmem_signal_on_stream (signaling operations)
- rocshmem_signal_wait_until_on_stream (point-to-point sync)
The documentation includes function signatures, parameter descriptions,
and detailed explanations of asynchronous behavior and stream handling.
(AI Assist)
* Rename "bytes" -> "nelems"
* Add "_TEST_" to the variables used in tests
* Remove incorrect hipStreamDefault usage
hipStreamDefault is not a default stream. This is a flag.
If stream == nullptr, then just pass it to kernel. It will launch the kernel on the default stream
126 line
8.4 KiB
ReStructuredText
126 line
8.4 KiB
ReStructuredText
.. meta::
|
|
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
|
|
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
|
|
|
|
.. _rocshmem-api-sigops:
|
|
|
|
---------------------
|
|
Signaling operations
|
|
---------------------
|
|
|
|
ROCSHMEM_PUTMEM_SIGNAL
|
|
----------------------
|
|
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal_wave(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal_wg(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi_wave(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi_wg(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
|
|
:param ctx: Context with which to perform this operation.
|
|
:param dest: Destination address. Must be an address on the symmetric heap.
|
|
:param source: Source address. Must be an address on the symmetric heap.
|
|
:param nelems: The number of bytes to transfer.
|
|
:param sig_addr: Signal address. Must be an address on the symmetric heap.
|
|
:param signal: Signal value.
|
|
:param sig_op: Atomic operation to apply the signal value.
|
|
:param pe: PE of the remote process.
|
|
:returns: None.
|
|
|
|
**Description:**
|
|
This function writes contiguous data of ``nelems`` bytes from source on the calling PE to ``dest`` at ``pe``,
|
|
then applies ``sig_op`` at ``sig_addr`` with the signal value.
|
|
Valid ``sig_op values`` are listed in SIGNAL_OPERATORS_.
|
|
|
|
ROCSHMEM_PUT_SIGNAL
|
|
-------------------
|
|
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_wave(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_wg(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi_wave(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi_wg(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
|
|
|
|
:param ctx: Context with which to perform this operation.
|
|
:param dest: Destination address. Must be an address on the symmetric heap.
|
|
:param source: Source address. Must be an address on the symmetric heap.
|
|
:param nelems: The number of elements of size ``TYPE`` to transfer.
|
|
:param sig_addr: Signal address. Must be an address on the symmetric heap.
|
|
:param signal: Signal value.
|
|
:param sig_op: Atomic operation to apply the signal value.
|
|
:param pe: PE of the remote process.
|
|
:returns: None.
|
|
|
|
**Description:**
|
|
This function writes contiguous data of ``nelems`` elements of ``TYPE`` from source on the calling PE to ``dest`` at ``pe``,
|
|
then applies ``sig_op`` at ``sig_addr`` with the signal value.
|
|
Valid ``sig_op values`` are listed in SIGNAL_OPERATORS_.
|
|
Valid ``TYPENAME`` and ``TYPE`` values are listed in :ref:`RMA_TYPES`.
|
|
|
|
ROCSHMEM_PUTMEM_SIGNAL_ON_STREAM
|
|
---------------------------------
|
|
|
|
.. cpp:function:: __host__ void rocshmem_putmem_signal_on_stream(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe, hipStream_t stream)
|
|
|
|
:param dest: Destination address on the remote PE. Must be an address on the symmetric heap.
|
|
:param source: Source address on the local PE. Must be an address on the symmetric heap.
|
|
:param nelems: Size of the transfer in bytes.
|
|
:param sig_addr: Address of signal variable on the remote PE. Must be an address on the symmetric heap.
|
|
:param signal: Signal value to be written.
|
|
:param sig_op: Signal operation (ROCSHMEM_SIGNAL_SET or ROCSHMEM_SIGNAL_ADD).
|
|
:param pe: PE number of the remote PE.
|
|
:param stream: HIP stream on which to enqueue the operation.
|
|
:returns: None.
|
|
|
|
**Description:**
|
|
This routine enqueues a put-with-signal operation on a HIP stream. The function writes contiguous
|
|
data of ``nelems`` bytes from source on the calling PE to ``dest`` at ``pe``, then applies ``sig_op``
|
|
at ``sig_addr`` with the signal value. The operation is enqueued on the specified stream and will
|
|
execute asynchronously. The caller must synchronize the stream (e.g., using ``hipStreamSynchronize``)
|
|
to ensure completion.
|
|
|
|
Valid ``sig_op`` values are listed in SIGNAL_OPERATORS_.
|
|
|
|
ROCSHMEM_SIGNAL_FETCH
|
|
---------------------
|
|
|
|
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch(const uint64_t *sig_addr)
|
|
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch_wg(const uint64_t *sig_addr)
|
|
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch_wave(const uint64_t *sig_addr)
|
|
|
|
:param sig_addr: Signal address. Must be an address on the symmetric heap.
|
|
:returns: Value at ``sig_addr``.
|
|
|
|
**Description:**
|
|
This function atomically fetches the value stored at ``sig_addr``.
|
|
|
|
Signal operators
|
|
----------------
|
|
.. _SIGNAL_OPERATORS:
|
|
|
|
.. list-table:: Signal Operators
|
|
:widths: 20 40
|
|
:header-rows: 1
|
|
|
|
* - Value
|
|
- Description
|
|
* - ROCSHMEM_SIGNAL_SET
|
|
- The signaling operation routines will atomically set the signal value at ``sig_addr``.
|
|
* - ROCSHMEM_SIGNAL_ADD
|
|
- The signaling operation routines will atomically add the signal value at ``sig_addr``.
|
|
|