Initial ROCm-docs (#92)

* Initial ROCm-docs commit

Co-authored-by: Aurélien Bouteiller <bouteill@icl.utk.edu>
Co-authored-by: Alex Xu <alex.xu@amd.com>
Co-authored-by: yugang-amd <yugang.wang@amd.com>
This commit is contained in:
Yiltan
2025-05-08 13:39:28 -04:00
committed by GitHub
parent 87179b1ffd
commit f693c98fb2
23 changed files with 4401 additions and 0 deletions
+18
@@ -0,0 +1,18 @@
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
sphinx:
  configuration: docs/conf.py
formats: []
python:
  install:
    - requirements: docs/sphinx/requirements.txt
build:
  os: ubuntu-22.04
  tools:
    python: "3.10"
+5
@@ -0,0 +1,5 @@
_build/
_doxygen/
doxygen/html/
doxygen/xml/
sphinx/_toc.yml
+21
@@ -0,0 +1,21 @@
# Building the rocSHMEM documentation
## macOS
To build the HTML documentation locally:
```
brew install doxygen sphinx-doc
pip3.10 install -r ./requirements.txt
python3.10 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
open _build/html/index.html
```
Building the PDF documentation requires a LaTeX installation on your machine.
Once LaTeX is installed, run the following:
```
pip3.10 install -r ./requirements.txt
sphinx-build -M latexpdf . _build
open _build/latex/rocshmem.pdf
```
+419
@@ -0,0 +1,419 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-amo:
---------------------------
Atomic Memory Operations
---------------------------
- These functions can be called from divergent control paths at per-thread
granularity.
ROCSHMEM_ATOMIC_FETCH
---------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch(TYPE *source, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch(rocshmem_ctx_t ctx, TYPE *source, int pe)
:param ctx: Context with which to perform this operation
:param source: Source address; Must be an address on the symmetric heap
:param pe: PE of the remote process
:returns: The value of source
**Description:**
Atomically fetches the value at source on pe and returns it to the calling PE.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in EXTENDED_AMO_TYPES_.
ROCSHMEM_ATOMIC_SET
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_set(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_set(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to be atomically set
:param pe: PE of the remote process
:returns: None
**Description:**
Atomically sets dest on pe to value.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in EXTENDED_AMO_TYPES_.
ROCSHMEM_ATOMIC_COMPARE_SWAP
----------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_compare_swap(TYPE *dest, TYPE cond, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_compare_swap(rocshmem_ctx_t ctx, TYPE *dest, TYPE cond, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param cond: The value to compare against the current value of dest
:param value: The value to be atomically swapped in
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically compares the value at dest on pe with cond and, if they are equal, writes value to dest.
The operation returns the old value of dest to the calling PE.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in STANDARD_AMO_TYPES_.
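The compare-and-swap semantics described above can be sketched on the host. This is only a model, assuming ``std::atomic`` as a stand-in for a remote symmetric variable; ``model_compare_swap`` is a hypothetical name and not part of rocSHMEM.

```cpp
#include <atomic>
#include <cassert>

// Host-side model only: std::atomic stands in for a variable on the target PE.
// model_compare_swap is a hypothetical name, not a rocSHMEM function.
template <typename T>
T model_compare_swap(std::atomic<T>& dest, T cond, T value) {
  T expected = cond;
  // On success dest becomes value and expected still equals the old value.
  // On failure dest is unchanged and expected is overwritten with its contents.
  dest.compare_exchange_strong(expected, value);
  return expected;  // in both cases this is the old value of dest
}
```

A typical use is lock acquisition: the caller that gets back the "unlocked" value knows its swap took effect, while every other caller sees the value written by the winner.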
ROCSHMEM_ATOMIC_SWAP
--------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_swap(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_swap(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to be atomically swapped
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically writes value to dest on pe and returns the old value of dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in EXTENDED_AMO_TYPES_.
ROCSHMEM_ATOMIC_FETCH_INC
-------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch_inc(TYPE *dest, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch_inc(rocshmem_ctx_t ctx, TYPE *dest, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically adds 1 to dest on pe and returns the old value of dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in STANDARD_AMO_TYPES_.
ROCSHMEM_ATOMIC_INC
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_inc(TYPE *dest, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_inc(rocshmem_ctx_t ctx, TYPE *dest, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param pe: PE of the remote process
:return: None
**Description:**
Atomically adds 1 to dest on pe.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in STANDARD_AMO_TYPES_.
ROCSHMEM_ATOMIC_FETCH_ADD
-------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch_add(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch_add(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to be atomically added
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically adds value to dest on pe.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in STANDARD_AMO_TYPES_.
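The difference between a fetching and a non-fetching add is easiest to see with a slot counter. The following host-side sketch assumes ``std::atomic`` as a stand-in for a counter on the target PE; ``ModelCounter`` and ``model_reserve_slots`` are hypothetical names, not rocSHMEM API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side stand-in for a counter living on the target PE.
struct ModelCounter {
  std::atomic<std::uint64_t> next{0};
};

// Reserve `count` consecutive slots. The fetched (old) value is the index of
// the first reserved slot, which is why the fetching variant is needed here.
std::uint64_t model_reserve_slots(ModelCounter& counter, std::uint64_t count) {
  return counter.next.fetch_add(count);
}
```

Because every caller receives a distinct old value, concurrent callers obtain non-overlapping slot ranges without further coordination.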
ROCSHMEM_ATOMIC_ADD
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_add(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_add(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to be atomically added
:param pe: PE of the remote process
:return: None
**Description:**
Atomically adds value to dest on pe.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in STANDARD_AMO_TYPES_.
ROCSHMEM_ATOMIC_FETCH_AND
-------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch_and(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch_and(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically AND with dest
:param pe: PE of the remote process
:return: The old value of dest.
**Description:**
Atomically performs a bitwise AND of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
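As an illustration of a fetching bitwise operation, the sketch below clears one flag bit and reports whether it was previously set. It is a host-side model under the assumption that ``std::atomic`` behaves like the remote variable; ``model_fetch_and`` and ``model_clear_flag`` are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical host-side model of a fetching bitwise AND; not rocSHMEM API.
std::uint32_t model_fetch_and(std::atomic<std::uint32_t>& dest, std::uint32_t mask) {
  return dest.fetch_and(mask);  // dest becomes (old & mask), old is returned
}

// Clear one flag bit and report whether it was set beforehand.
bool model_clear_flag(std::atomic<std::uint32_t>& flags, unsigned bit) {
  const std::uint32_t bit_mask = std::uint32_t{1} << bit;
  const std::uint32_t old = model_fetch_and(flags, ~bit_mask);
  return (old & bit_mask) != 0;
}
```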
ROCSHMEM_ATOMIC_AND
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_and(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_and(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically AND with dest
:param pe: PE of the remote process
:return: None
**Description:**
Atomically performs a bitwise AND of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
ROCSHMEM_ATOMIC_FETCH_OR
------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch_or(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch_or(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically OR with dest
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically performs a bitwise OR of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
ROCSHMEM_ATOMIC_OR
------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_or(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_or(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically OR with dest
:param pe: PE of the remote process
:return: None
**Description:**
Atomically performs a bitwise OR of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
ROCSHMEM_ATOMIC_FETCH_XOR
-------------------------
.. cpp:function:: __device__ TYPE rocshmem_TYPENAME_atomic_fetch_xor(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ TYPE rocshmem_ctx_TYPENAME_atomic_fetch_xor(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically XOR with dest
:param pe: PE of the remote process
:return: The old value of dest
**Description:**
Atomically performs a bitwise XOR of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
ROCSHMEM_ATOMIC_XOR
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_atomic_xor(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_atomic_xor(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: The value to atomically XOR with dest
:param pe: PE of the remote process
:return: None
**Description:**
Atomically performs a bitwise XOR of value with the value at dest on pe and stores the result in dest.
The operation is blocking.
Valid ``TYPENAME`` and ``TYPE`` values can be seen in BITWISE_AMO_TYPES_.
SUPPORTED AMO DATA TYPES
------------------------
.. _STANDARD_AMO_TYPES:
.. list-table:: Standard AMO Datatypes
:widths: 10 20 20
:header-rows: 1
* - TYPE
- TYPENAME
- Supported
* - int
- int
- Yes
* - long
- long
- Yes
* - long long
- longlong
- Yes
* - unsigned int
- uint
- Yes
* - unsigned long
- ulong
- Yes
* - unsigned long long
- ulonglong
- Yes
* - int32_t
- int32
- Yes
* - int64_t
- int64
- Yes
* - uint32_t
- uint32
- Yes
* - uint64_t
- uint64
- Yes
* - size_t
- size
- Yes
* - ptrdiff_t
- ptrdiff
- Yes
.. _EXTENDED_AMO_TYPES:
.. list-table:: Extended AMO Datatypes
:widths: 10 20 20
:header-rows: 1
* - TYPE
- TYPENAME
- Supported
* - float
- float
- Yes
* - double
- double
- Yes
* - int
- int
- Yes
* - long
- long
- Yes
* - long long
- longlong
- Yes
* - unsigned int
- uint
- Yes
* - unsigned long
- ulong
- Yes
* - unsigned long long
- ulonglong
- Yes
* - int32_t
- int32
- Yes
* - int64_t
- int64
- Yes
* - uint32_t
- uint32
- Yes
* - uint64_t
- uint64
- Yes
* - size_t
- size
- Yes
* - ptrdiff_t
- ptrdiff
- Yes
.. _BITWISE_AMO_TYPES:
.. list-table:: Bitwise AMO Datatypes
:widths: 10 20 20
:header-rows: 1
* - TYPE
- TYPENAME
- Supported
* - unsigned int
- uint
- Yes
* - unsigned long
- ulong
- Yes
* - unsigned long long
- ulonglong
- Yes
* - int32_t
- int32
- Yes
* - int64_t
- int64
- Yes
* - uint32_t
- uint32
- Yes
* - uint64_t
- uint64
- Yes
+248
@@ -0,0 +1,248 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-coll:
---------------------------
Collective Routines
---------------------------
ROCSHMEM_BARRIER_ALL
--------------------
.. cpp:function:: __device__ void rocshmem_ctx_wg_barrier_all(rocshmem_ctx_t ctx)
.. cpp:function:: __device__ void rocshmem_wg_barrier_all()
:param ctx: Context with which to perform this operation
:returns: None
**Description:**
Perform a collective barrier between all PEs in the system.
The caller is blocked until the barrier is resolved.
ROCSHMEM_TEAM_SYNC
------------------
.. cpp:function:: __device__ void rocshmem_ctx_wg_team_sync(rocshmem_ctx_t ctx, rocshmem_team_t team)
.. cpp:function:: __device__ void rocshmem_wg_team_sync(rocshmem_team_t team)
:param ctx: Context with which to perform this operation
:param team: Team with which to perform this operation
:returns: None
**Description:**
Registers the arrival of a PE at a barrier.
The caller is blocked until the synchronization is resolved.
In contrast with ``rocshmem_wg_barrier_all``, ``rocshmem_wg_team_sync`` only ensures
completion and visibility of previously issued memory stores and does not
ensure completion of remote memory updates issued via rocSHMEM routines.
ROCSHMEM_SYNC_ALL
-----------------
.. cpp:function:: __device__ void rocshmem_ctx_wg_sync_all(rocshmem_ctx_t ctx)
.. cpp:function:: __device__ void rocshmem_wg_sync_all()
:param ctx: Context with which to perform this operation
:returns: None
**Description:**
This routine is equivalent to calling ``rocshmem_wg_team_sync`` on the world team.
ROCSHMEM_ALLTOALL
-----------------
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_wg_alltoall(rocshmem_ctx_t ctx, rocshmem_team_t team, TYPE *dest, const TYPE *source, int nelems)
:param ctx: Context with which to perform this collective
:param team: The team participating in the collective
:param dest: Destination address; Must be an address on the
symmetric heap
:param source: Source address; Must be an address on the symmetric
heap
:param nelems: Number of elements exchanged between each pair of PEs
:returns: None
**Description:**
Exchanges a fixed amount of contiguous data blocks between all pairs
of PEs participating in the collective routine.
This function must be called as a work-group collective.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`RMA_TYPES`.
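The data movement of an all-to-all exchange can be modelled on the host as follows. This sketch assumes the usual OpenSHMEM block layout, which this document does not spell out: block ``j`` of the source on PE ``i`` ends up in block ``i`` of the destination on PE ``j``. ``PeBuffers`` and ``model_alltoall`` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: buffers[pe] plays the role of that PE's
// symmetric buffer. None of these names belong to rocSHMEM.
using PeBuffers = std::vector<std::vector<int>>;

// Block j of PE i's source lands in block i of PE j's dest.
// Each block holds nelems elements.
PeBuffers model_alltoall(const PeBuffers& source, std::size_t nelems) {
  const std::size_t npes = source.size();
  PeBuffers dest(npes, std::vector<int>(npes * nelems));
  for (std::size_t i = 0; i < npes; ++i)
    for (std::size_t j = 0; j < npes; ++j)
      for (std::size_t k = 0; k < nelems; ++k)
        dest[j][i * nelems + k] = source[i][j * nelems + k];
  return dest;
}
```

Under this layout both the source and the destination buffers need room for ``nelems`` elements per participating PE.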
ROCSHMEM_BROADCAST
------------------
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_wg_broadcast(rocshmem_ctx_t ctx, rocshmem_team_t team, TYPE *dest, const TYPE *source, int nelems, int pe_root)
:param ctx: Context with which to perform this collective
:param team: The team participating in the collective
:param dest: Destination address; Must be an address on the
symmetric heap
:param source: Source address; Must be an address on the symmetric
heap
:param nelems: Number of elements to broadcast
:param pe_root: PE whose source data is broadcast (the root)
:returns: None
**Description:**
Perform a broadcast between PEs in the team.
The caller is blocked until the broadcast completes.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`RMA_TYPES`.
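A host-side model of the broadcast result is sketched below. It assumes the OpenSHMEM 1.5 team-broadcast convention in which dest is updated on every PE, including the root; this document does not state that detail. ``PeBuffers`` and ``model_broadcast`` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: source[pe] is that PE's source buffer.
using PeBuffers = std::vector<std::vector<int>>;

// Every PE's dest receives the first nelems elements of the root's source.
PeBuffers model_broadcast(const PeBuffers& source, std::size_t nelems,
                          std::size_t pe_root) {
  PeBuffers dest(source.size(), std::vector<int>(nelems));
  for (auto& d : dest)
    std::copy_n(source[pe_root].begin(), nelems, d.begin());
  return dest;
}
```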
ROCSHMEM_FCOLLECT
-----------------
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_wg_fcollect(rocshmem_ctx_t ctx, rocshmem_team_t team, TYPE *dest, const TYPE *source, int nelems)
:param ctx: Context with which to perform this collective
:param team: The team participating in the collective
:param dest: Destination address; Must be an address on the
symmetric heap
:param source: Source address; Must be an address on the symmetric
heap
:param nelems: Number of elements contributed by each PE
:returns: None
**Description:**
Concatenates blocks of data from multiple PEs to an array in every
PE participating in the collective routine.
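The concatenation performed by fcollect can be modelled on the host as shown here. The sketch assumes blocks are concatenated in PE order, as in OpenSHMEM; ``PeBuffers`` and ``model_fcollect`` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: source[pe] is that PE's source buffer.
using PeBuffers = std::vector<std::vector<int>>;

// Concatenate nelems elements from each PE, in PE order, and give every PE
// an identical copy of the result.
PeBuffers model_fcollect(const PeBuffers& source, std::size_t nelems) {
  std::vector<int> gathered;
  for (const auto& s : source)
    gathered.insert(gathered.end(), s.begin(),
                    s.begin() + static_cast<std::ptrdiff_t>(nelems));
  return PeBuffers(source.size(), gathered);
}
```

The destination on each PE therefore needs room for ``nelems`` elements times the number of participating PEs.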
ROCSHMEM_REDUCTION
------------------
.. cpp:function:: __device__ int rocshmem_ctx_TYPENAME_OPNAME_wg_reduce(rocshmem_ctx_t ctx, rocshmem_team_t team, TYPE *dest, const TYPE *source, int nreduce)
:param ctx: Context with which to perform this collective
:param team: The team participating in the collective
:param dest: Destination address; Must be an address on the
symmetric heap
:param source: Source address; Must be an address on the symmetric
heap
:param nreduce: Number of elements in the source and dest arrays to reduce
:returns: Zero on successful local completion; Nonzero otherwise
**Description:**
Perform an allreduce between PEs in the team.
Valid ``TYPENAME``, ``TYPE``, and ``OPNAME`` values can be seen at :ref:`REDUCE_TYPES`.
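An allreduce combines the arrays element by element and leaves the same result on every PE. The host-side sketch below models the ``sum`` operation only; ``PeBuffers`` and ``model_sum_reduce`` are hypothetical names, not rocSHMEM API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: source[pe] is that PE's source buffer.
using PeBuffers = std::vector<std::vector<int>>;

// Element-wise sum across PEs; every PE receives the same nreduce results.
PeBuffers model_sum_reduce(const PeBuffers& source, std::size_t nreduce) {
  std::vector<int> total(nreduce, 0);
  for (const auto& s : source)
    for (std::size_t k = 0; k < nreduce; ++k) total[k] += s[k];
  return PeBuffers(source.size(), total);
}
```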
SUPPORTED REDUCTION TYPES AND OPERATIONS
----------------------------------------
.. _REDUCE_TYPES:
.. list-table:: Reduction Types, Names and Operations
:widths: 20 20 20 20
:header-rows: 1
* - TYPE
- TYPENAME
- OPNAME
- Supported
* - char
- char
- max, min, sum, prod
- No
* - signed char
- schar
- max, min, sum, prod
- No
* - short
- short
- max, min, sum, prod
- Yes
* - int
- int
- max, min, sum, prod
- Yes
* - long
- long
- max, min, sum, prod
- Yes
* - long long
- longlong
- max, min, sum, prod
- Yes
* - ptrdiff_t
- ptrdiff
- max, min, sum, prod
- No
* - unsigned char
- uchar
- and, or, xor, max, min, sum, prod
- No
* - unsigned short
- ushort
- and, or, xor, max, min, sum, prod
- No
* - unsigned int
- uint
- and, or, xor, max, min, sum, prod
- No
* - unsigned long
- ulong
- and, or, xor, max, min, sum, prod
- No
* - unsigned long long
- ulonglong
- and, or, xor, max, min, sum, prod
- No
* - int8_t
- int8
- and, or, xor, max, min, sum, prod
- No
* - int16_t
- int16
- and, or, xor, max, min, sum, prod
- No
* - int32_t
- int32
- and, or, xor, max, min, sum, prod
- No
* - int64_t
- int64
- and, or, xor, max, min, sum, prod
- No
* - uint8_t
- uint8
- and, or, xor, max, min, sum, prod
- No
* - uint16_t
- uint16
- and, or, xor, max, min, sum, prod
- No
* - uint32_t
- uint32
- and, or, xor, max, min, sum, prod
- No
* - uint64_t
- uint64
- and, or, xor, max, min, sum, prod
- No
* - size_t
- size
- and, or, xor, max, min, sum, prod
- No
* - float
- float
- max, min, sum, prod
- Yes
* - double
- double
- max, min, sum, prod
- Yes
* - long double
- longdouble
- max, min, sum, prod
- No
* - double _Complex
- complexd
- sum, prod
- No
* - float _Complex
- complexf
- sum, prod
- No
+40
@@ -0,0 +1,40 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-ctx:
-----------------------------------
Context Management Routines
-----------------------------------
ROCSHMEM_CTX_CREATE
-------------------
.. cpp:function:: __device__ int rocshmem_wg_ctx_create(int64_t options, rocshmem_ctx_t *ctx)
.. cpp:function:: __device__ int rocshmem_wg_team_create_ctx(rocshmem_team_t team, long options, rocshmem_ctx_t *ctx)
:param team: Team handle to derive the context from
:param options: Options for context creation (Ignored in current design, please use a value of 0)
:param ctx: Context handle
:returns: All threads return 0 if the context was created successfully;
If any thread returns a nonzero value, the operation failed and a larger value of
``ROCSHMEM_MAX_NUM_CONTEXTS`` is required
**Description:**
Creates an OpenSHMEM context. By design, the context is private to the calling work-group.
Must be called collectively by all threads in the work-group.
ROCSHMEM_CTX_DESTROY
--------------------
.. cpp:function:: __device__ void rocshmem_wg_ctx_destroy(rocshmem_ctx_t *ctx)
:param ctx: Context handle
:returns: None
**Description:**
Destroys a rocSHMEM context.
Must be called collectively by all threads in the work-group.
+98
@@ -0,0 +1,98 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-init:
---------------------------------------
Library Setup, Exit, and Query Routines
---------------------------------------
ROCSHMEM_INIT
-------------
.. cpp:function:: __host__ void rocshmem_init(void)
:Parameters: None
:returns: None
**Description:**
This routine initializes the rocSHMEM runtime and underlying transport layer.
Before ``rocshmem_init`` is called,
the user must select the device that this PE is associated with by calling
`hipSetDevice
<https://rocm.docs.amd.com/projects/HIP/en/docs-6.0.0/doxygen/html/group___device.html#ga43c1e7f15925eeb762195ccb5e063eae>`_.
.. cpp:function:: __device__ void rocshmem_wg_init(void)
:Parameters: None
:returns: None
**Description:**
Initializes device-side rocSHMEM resources.
Must be called before any threads in this work-group invoke other rocSHMEM functions.
Must be called collectively by all threads in the work-group.
ROCSHMEM_FINALIZE
-----------------
.. cpp:function:: __host__ void rocshmem_finalize(void)
:Parameters: None
:returns: None
**Description:**
Finalize the rocSHMEM runtime.
.. cpp:function:: __device__ void rocshmem_wg_finalize(void)
:Parameters: None
:returns: None
**Description:**
Finalizes device-side rocSHMEM resources.
Must be called before work-group completion if the work-group also called ``rocshmem_wg_init``.
Must be called collectively by all threads in the work-group.
ROCSHMEM_N_PES
--------------
.. cpp:function:: __host__ int rocshmem_n_pes(void)
:Parameters: None
:returns: Total number of PEs
**Description:**
Query the total number of PEs.
This routine can be called before ``rocshmem_init``.
.. cpp:function:: __device__ int rocshmem_n_pes(void)
.. cpp:function:: __device__ int rocshmem_ctx_n_pes(rocshmem_ctx_t ctx)
:param ctx: GPU side context handle
:returns: Total number of PEs
**Description:**
Query the total number of PEs for a given context.
Can be called per thread with no performance penalty.
ROCSHMEM_MY_PE
--------------
.. cpp:function:: __host__ int rocshmem_my_pe(void)
:Parameters: None
:returns: PE ID of the caller
**Description:**
Query the PE ID of the caller.
This routine can be called before ``rocshmem_init``.
.. cpp:function:: __device__ int rocshmem_my_pe(void)
.. cpp:function:: __device__ int rocshmem_ctx_my_pe(rocshmem_ctx_t ctx)
:param ctx: GPU side context handle
:returns: PE ID of the caller
**Description:**
Query the PE ID of the caller.
Can be called per thread with no performance penalty.
+35
@@ -0,0 +1,35 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-memory-management:
---------------------------
Memory Management Routines
---------------------------
ROCSHMEM_MALLOC
---------------
.. cpp:function:: __host__ void *rocshmem_malloc(size_t size)
:param size: Memory allocation size in bytes
:returns: A pointer to the allocated memory on the symmetric heap;
If a valid allocation cannot be made, it returns NULL
**Description:**
Allocate memory of ``size`` bytes from the symmetric heap.
This is a collective operation and must be called by all PEs.
ROCSHMEM_FREE
-------------
.. cpp:function:: __host__ void rocshmem_free(void *ptr)
:param ptr: Pointer to previously allocated memory on the symmetric heap
:returns: None
**Description:**
Free a memory allocation from the symmetric heap.
This is a collective operation and must be called by all PEs.
+36
@@ -0,0 +1,36 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-memory-ordering:
---------------------------
Memory Ordering Routines
---------------------------
ROCSHMEM_FENCE
--------------
.. cpp:function:: __device__ void rocshmem_fence()
.. cpp:function:: __device__ void rocshmem_fence(int pe)
.. cpp:function:: __device__ void rocshmem_ctx_fence(rocshmem_ctx_t ctx)
.. cpp:function:: __device__ void rocshmem_ctx_fence(rocshmem_ctx_t ctx, int pe)
:param ctx: Context with which to perform this operation
:param pe: Destination PE
:returns: None
**Description:**
Guarantees order between messages in this context in accordance with OpenSHMEM semantics.
ROCSHMEM_QUIET
--------------
.. cpp:function:: __device__ void rocshmem_ctx_quiet(rocshmem_ctx_t ctx)
.. cpp:function:: __device__ void rocshmem_quiet()
:param ctx: Context with which to perform this operation
:returns: None
**Description:**
Completes all previous operations posted to this context.
+122
@@ -0,0 +1,122 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-pt2pt-sync:
-----------------------------------------
Point-to-Point Synchronization Routines
-----------------------------------------
ROCSHMEM_WAIT_UNTIL
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_wait_until(TYPE *ivars, int cmp, TYPE val)
:param ivars: Pointer to memory on the symmetric heap to wait for
:param cmp: Operation for the comparison
:param val: Value to compare the memory at ivars to
:returns: None
**Description:**
Block the caller until the condition ``(*ivars cmp val)`` is true.
Valid ``cmp`` values can be seen at :ref:`CMP_VALUES`.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`STANDARD_AMO_TYPES`.
ROCSHMEM_WAIT_UNTIL_ALL
-----------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_wait_until_all(TYPE *ivars, size_t nelems, const int* status, int cmp, TYPE val)
:param ivars: Pointer to memory on the symmetric heap to wait for
:param nelems: Number of elements in the ivars array
:param status: Array of length nelems that is used to exclude elements from wait
:param cmp: Operation for the comparison
:param val: Value to compare
:returns: None
**Description:**
Block the caller until the condition ``(ivars[i] cmp val)`` is true for all elements of ivars in the wait set.
Valid ``cmp`` values can be seen at :ref:`CMP_VALUES`.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`STANDARD_AMO_TYPES`.
ROCSHMEM_WAIT_UNTIL_ANY
-----------------------
.. cpp:function:: __device__ size_t rocshmem_TYPENAME_wait_until_any(TYPE *ivars, size_t nelems, const int* status, int cmp, TYPE val)
:param ivars: Pointer to memory on the symmetric heap to wait for
:param nelems: Number of elements in the ivars array
:param status: Array of length nelems that is used to exclude elements from wait
:param cmp: Operation for the comparison
:param val: Value to compare
:returns: The index of an element in the ivars array that satisfies the wait condition; If the wait set is empty, this routine returns SIZE_MAX
**Description:**
Block the caller until the condition ``(ivars[i] cmp val)`` is true for at least one element of ivars in the wait set.
Valid ``cmp`` values can be seen at :ref:`CMP_VALUES`.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`STANDARD_AMO_TYPES`.
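The role of the ``status`` mask can be sketched with a host-side model. It assumes the OpenSHMEM 1.5 convention, which this document does not state explicitly: an element with ``status[i] == 0`` is in the wait set, and a nonzero entry excludes it. The model is fixed to an equality comparison, and ``model_wait_until_any_eq`` is a hypothetical name.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a wait-until-any with an "equal" comparison.
// Assumption: status[i] == 0 means ivars[i] is part of the wait set.
std::size_t model_wait_until_any_eq(const std::vector<std::atomic<int>>& ivars,
                                    const std::vector<int>& status, int val) {
  bool any_included = false;
  for (int s : status) any_included = any_included || (s == 0);
  if (!any_included) return SIZE_MAX;  // empty wait set: return immediately

  // Poll until some included element satisfies the condition.
  for (;;)
    for (std::size_t i = 0; i < ivars.size(); ++i)
      if (status[i] == 0 && ivars[i].load() == val) return i;
}
```

In this model an excluded element is never returned, even if it already satisfies the comparison.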
ROCSHMEM_WAIT_UNTIL_SOME
------------------------
.. cpp:function:: __device__ size_t rocshmem_TYPENAME_wait_until_some(TYPE *ivars, size_t nelems, size_t* indices, const int* status, int cmp, TYPE val)
:param ivars: Pointer to memory on the symmetric heap to wait for
:param nelems: Number of elements in the ivars array
:param indices: Array of at least nelems entries that receives the indices of the elements satisfying the condition
:param status: Array of length nelems that is used to exclude elements from wait
:param cmp: Operation for the comparison
:param val: Value to compare
:returns: The number of indices returned in the indices array; If the wait set is empty, this routine returns 0
**Description:**
Block the caller until the condition ``(ivars[i] cmp val)`` is true for at least one element of ivars in the wait set.
Valid ``cmp`` values can be seen at :ref:`CMP_VALUES`.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`STANDARD_AMO_TYPES`.
ROCSHMEM_TEST
-------------
.. cpp:function:: __device__ int rocshmem_TYPENAME_test(TYPE *ivars, int cmp, TYPE val)
:param ivars: Pointer to memory on the symmetric heap to test
:param cmp: Operation for the comparison
:param val: Value to compare the memory at ivars to
:returns: 1 if the evaluation is true, 0 otherwise
**Description:**
Test if the condition ``(*ivars cmp val)`` is true.
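How a comparison constant selects the condition can be sketched on the host. ``ModelCmp`` is a hypothetical stand-in for the ``ROCSHMEM_CMP_*`` constants, whose actual values are not given in this document, and ``model_test`` is not a rocSHMEM function.

```cpp
#include <cassert>

// Hypothetical stand-ins for ROCSHMEM_CMP_*; the real constants come from the
// rocSHMEM headers and their values are not assumed here.
enum class ModelCmp { EQ, NE, GT, GE, LT, LE };

// Evaluate (*ivars cmp val) once, returning 1 if true and 0 otherwise.
template <typename T>
int model_test(const T* ivars, ModelCmp cmp, T val) {
  switch (cmp) {
    case ModelCmp::EQ: return *ivars == val;
    case ModelCmp::NE: return *ivars != val;
    case ModelCmp::GT: return *ivars > val;
    case ModelCmp::GE: return *ivars >= val;
    case ModelCmp::LT: return *ivars < val;
    case ModelCmp::LE: return *ivars <= val;
  }
  return 0;
}
```

The wait routines can be thought of as repeating this evaluation until it yields 1, whereas the test routine evaluates it once and returns.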
SUPPORTED COMPARISONS
---------------------
.. _CMP_VALUES:
.. list-table:: Point-to-Point Comparison Constants
:widths: 20 20
:header-rows: 1
* - Constant
- Description
* - ROCSHMEM_CMP_EQ
- Equal
* - ROCSHMEM_CMP_NE
- Not equal
* - ROCSHMEM_CMP_GT
- Greater than
* - ROCSHMEM_CMP_GE
- Greater than or equal to
* - ROCSHMEM_CMP_LT
- Less than
* - ROCSHMEM_CMP_LE
- Less than or equal to
+239
@@ -0,0 +1,239 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-rma:
-----------------------------------------
Remote Memory Access Routines
-----------------------------------------
- Routines with the ``_wave`` and ``_wg`` suffixes
require all threads in a wavefront and work-group, respectively,
to call into the routine with the same parameters.
- Routines with the ``_nbi`` substring will return as soon as the request is posted.
- Routines without the ``_nbi`` substring block until the operation completes locally.
- Valid ``TYPENAME`` and ``TYPE`` values can be seen in RMA_TYPES_.
ROCSHMEM_PUT
------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_put(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_wave(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_wg(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_nbi_wave(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_nbi_wg(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_nbi(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_nbi_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_nbi_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: The number of elements to transfer
:param pe: PE of the remote process
:returns: None
**Description:**
Writes contiguous data of nelems elements from source on the calling PE to dest at pe.
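The effect of a put can be modelled on the host with one array per PE. This is a sketch under the assumption that a symmetric address can be represented as the same offset into every PE's copy of the array; ``SymmetricArray`` and ``model_put`` are hypothetical names, not rocSHMEM API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model: heap[pe] is PE pe's copy of one symmetric array.
using SymmetricArray = std::vector<std::vector<int>>;

// Copy nelems elements from the caller's source to offset dest_offset of the
// array owned by the target PE. Other PEs' copies are left untouched.
void model_put(SymmetricArray& heap, std::size_t dest_offset, const int* source,
               std::size_t nelems, std::size_t pe) {
  std::copy_n(source, nelems,
              heap[pe].begin() + static_cast<std::ptrdiff_t>(dest_offset));
}
```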
ROCSHMEM_PUTMEM
---------------
.. cpp:function:: __device__ void rocshmem_putmem(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_wave(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_wg(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_nbi(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_nbi_wave(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_nbi_wg(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_nbi(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_nbi_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_nbi_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: Size of the transfer in bytes
:param pe: PE of the remote process
:returns: None
**Description:**
Writes contiguous data of nelems bytes from source on the calling PE to dest at pe.
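The typed ``put`` routines and ``putmem`` differ only in how ``nelems`` is counted: elements of ``TYPE`` for the former, bytes for the latter. The following is a host-side Python illustration of that arithmetic, not rocSHMEM code. The helper names ``bytes_for_put`` and ``bytes_for_putmem`` are hypothetical, and ``ctypes`` is used only to obtain C type sizes on the machine running the script.

```python
import ctypes

def bytes_for_put(ctype, nelems):
    """Bytes moved by a typed put of nelems elements of the given C type."""
    return nelems * ctypes.sizeof(ctype)

def bytes_for_putmem(nelems):
    """Bytes moved by putmem: nelems is already a byte count."""
    return nelems

print(bytes_for_put(ctypes.c_float, 256))   # 256 elements of type float
print(bytes_for_put(ctypes.c_double, 256))  # 256 elements of type double
print(bytes_for_putmem(256))                # 256 bytes
```

When replacing a typed call with ``putmem``, the element count therefore has to be multiplied by the element size.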
ROCSHMEM_P
----------
.. cpp:function:: __device__ void rocshmem_TYPENAME_p(TYPE *dest, TYPE value, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_p(rocshmem_ctx_t ctx, TYPE *dest, TYPE value, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param value: Value to write to dest at pe
:param pe: PE of the remote process
:returns: None
**Description:**
Writes a single value to dest at pe.
ROCSHMEM_GET
------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_get(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_get_wave(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_get_wg(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_get_nbi(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_get_nbi_wave(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_get_nbi_wg(TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get_nbi(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get_nbi_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_get_nbi_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: The number of elements to transfer
:param pe: PE of the remote process
:returns: None
**Description:**
Reads contiguous data of nelems elements from source on pe to dest on the calling PE.
ROCSHMEM_GETMEM
---------------
.. cpp:function:: __device__ void rocshmem_getmem(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_getmem_wave(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_getmem_wg(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_getmem_nbi(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_getmem_nbi_wave(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_getmem_nbi_wg(void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem_nbi(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem_nbi_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_getmem_nbi_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: Size of the transfer in bytes
:param pe: PE of the remote process
:returns: None
**Description:**
Reads contiguous data of nelems bytes from source on pe to dest on the calling PE.
ROCSHMEM_G
----------
.. cpp:function:: __device__ float rocshmem_ctx_float_g(rocshmem_ctx_t ctx, const float *source, int pe)
.. cpp:function:: __device__ float rocshmem_float_g(const float *source, int pe)
:param ctx: Context with which to perform this operation
:param source: Source address; Must be an address on the symmetric heap
:param pe: PE of the remote process
:returns: The value read from source at pe
**Description:**
Reads and returns a single value from source at pe.
SUPPORTED RMA DATA TYPES
------------------------
.. _RMA_TYPES:
.. list-table:: RMA Datatypes
:widths: 10 20 20
:header-rows: 1
* - TYPE
- TYPENAME
- Supported
* - float
- float
- Yes
* - double
- double
- Yes
* - long double
- longdouble
- No
* - char
- char
- Yes
* - signed char
- schar
- Yes
* - short
- short
- Yes
* - int
- int
- Yes
* - long
- long
- Yes
* - long long
- longlong
- Yes
* - unsigned char
- uchar
- Yes
* - unsigned short
- ushort
- Yes
* - unsigned int
- uint
- Yes
* - unsigned long
- ulong
- Yes
* - unsigned long long
- ulonglong
- Yes
* - int8_t
- int8
- No
* - int16_t
- int16
- No
* - int32_t
- int32
- No
* - int64_t
- int64
- No
* - uint8_t
- uint8
- No
* - uint16_t
- uint16
- No
* - uint32_t
- uint32
- No
* - uint64_t
- uint64
- No
* - size_t
- size
- No
* - ptrdiff_t
- ptrdiff
- No
+101
@@ -0,0 +1,101 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-sigops:
---------------------
Signaling Operations
---------------------
ROCSHMEM_PUTMEM_SIGNAL
----------------------
.. cpp:function:: __device__ void rocshmem_putmem_signal(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_signal_wave(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_signal_wg(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi_wave(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_putmem_signal_nbi_wg(void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi_wave(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_putmem_signal_nbi_wg(rocshmem_ctx_t ctx, void *dest, const void *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: The number of bytes to transfer
:param sig_addr: Signal address; Must be an address on the symmetric heap
:param signal: Signal value
:param sig_op: Atomic operation to apply the signal value
:param pe: PE of the remote process
:returns: None
**Description:**
Writes contiguous data of nelems bytes from source on the calling PE to dest at pe.
Then applies sig_op at sig_addr using the signal value.
Valid sig_op values can be seen at SIGNAL_OPERATORS_.
ROCSHMEM_PUT_SIGNAL
-------------------
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_wave(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_wg(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi_wave(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_TYPENAME_put_signal_nbi_wg(TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi_wave(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
.. cpp:function:: __device__ void rocshmem_ctx_TYPENAME_put_signal_nbi_wg(rocshmem_ctx_t ctx, TYPE *dest, const TYPE *source, size_t nelems, uint64_t *sig_addr, uint64_t signal, int sig_op, int pe)
:param ctx: Context with which to perform this operation
:param dest: Destination address; Must be an address on the symmetric heap
:param source: Source address; Must be an address on the symmetric heap
:param nelems: The number of elements of type TYPE to transfer
:param sig_addr: Signal address; Must be an address on the symmetric heap
:param signal: Signal value
:param sig_op: Atomic operation to apply the signal value
:param pe: PE of the remote process
:returns: None
**Description:**
Writes contiguous data of nelems elements of TYPE from source on the calling PE to dest at pe.
Then applies sig_op at sig_addr using the signal value.
Valid sig_op values can be seen at SIGNAL_OPERATORS_.
Valid ``TYPENAME`` and ``TYPE`` values can be seen at :ref:`RMA_TYPES`.
ROCSHMEM_SIGNAL_FETCH
---------------------
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch(const uint64_t *sig_addr)
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch_wg(const uint64_t *sig_addr)
.. cpp:function:: __device__ uint64_t rocshmem_signal_fetch_wave(const uint64_t *sig_addr)
:param sig_addr: Signal address; Must be an address on the symmetric heap
:returns: Value at sig_addr
**Description:**
Atomically fetches the value stored at sig_addr.
SIGNAL OPERATORS
----------------
.. _SIGNAL_OPERATORS:
.. list-table:: Signal Operators
:widths: 20 40
:header-rows: 1
* - Value
- Description
* - ROCSHMEM_SIGNAL_SET
- The signaling operation routines atomically set the value at sig_addr to the signal value.
* - ROCSHMEM_SIGNAL_ADD
- The signaling operation routines atomically add the signal value to the value at sig_addr.
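The effect of the two operators on the 64-bit signal word can be modelled on the host. This is a toy Python sketch, not the rocSHMEM implementation: ``apply_signal`` is a hypothetical helper, and the string labels only stand in for the real C++ constants.

```python
MASK64 = (1 << 64) - 1  # the signal word is a uint64_t

# Stand-in labels for this sketch only; the real operators are C++ constants.
SIGNAL_SET = "ROCSHMEM_SIGNAL_SET"
SIGNAL_ADD = "ROCSHMEM_SIGNAL_ADD"

def apply_signal(current, signal, sig_op):
    """Return the value stored at sig_addr after sig_op is applied."""
    if sig_op == SIGNAL_SET:
        return signal & MASK64
    if sig_op == SIGNAL_ADD:
        return (current + signal) & MASK64  # unsigned 64-bit addition wraps
    raise ValueError(f"unknown sig_op: {sig_op!r}")

word = 0
word = apply_signal(word, 1, SIGNAL_ADD)  # first sender adds 1
word = apply_signal(word, 1, SIGNAL_ADD)  # second sender adds 1
print(word)
word = apply_signal(word, 7, SIGNAL_SET)  # overwrite with a flag value
print(word)
```

In this model, the add operator suits counting arrivals from several senders, while the set operator suits publishing a single flag value.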
+90
@@ -0,0 +1,90 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-api-teams:
-------------------------
Team Management Routines
-------------------------
ROCSHMEM_TEAM_MY_PE
-------------------
.. cpp:function:: __host__ int rocshmem_team_my_pe(rocshmem_team_t team)
:param team: The team to query
:returns: PE ID of the caller in the provided team
**Description:**
Query the PE ID of the caller in a team.
ROCSHMEM_TEAM_N_PES
-------------------
.. cpp:function:: __host__ int rocshmem_team_n_pes(rocshmem_team_t team)
:param team: The team to query
:returns: Number of PEs in the provided team
**Description:**
Query the number of PEs in a team.
ROCSHMEM_TEAM_TRANSLATE_PE
--------------------------
.. cpp:function:: __host__ int rocshmem_team_translate_pe(rocshmem_team_t src_team, int src_pe, rocshmem_team_t dest_team)
:param src_team: Handle of the team from which to translate
:param src_pe: PE-of-interest's index in src_team
:param dest_team: Handle of the team to which to translate
:returns: PE of src_pe in dest_team;
If any input is invalid or if src_pe is
not in both source and destination teams, a value of -1 is returned
**Description:**
Translate the PE in src_team to that in dest_team.
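The translation rule, including the -1 result, can be sketched on the host in Python. This models a team as a list of world PE numbers in team order, which is an assumption made for illustration. ``translate_pe`` here is a hypothetical helper, not the library routine.

```python
def translate_pe(src_members, src_pe, dest_members):
    """Map index src_pe in src_members to the matching index in dest_members.

    Each *_members list holds the world PE numbers of a team, in team order.
    Returns -1 when src_pe is out of range or the PE is not in dest_members.
    """
    if not 0 <= src_pe < len(src_members):
        return -1
    world_pe = src_members[src_pe]
    if world_pe not in dest_members:
        return -1
    return dest_members.index(world_pe)

world = list(range(8))  # hypothetical 8-PE job
evens = [0, 2, 4, 6]    # hypothetical team made of the even world PEs

print(translate_pe(evens, 3, world))  # PE 3 of `evens`, expressed as a world PE
print(translate_pe(world, 4, evens))  # world PE 4, expressed as a member of `evens`
print(translate_pe(world, 5, evens))  # world PE 5 is not a member of `evens`
```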
ROCSHMEM_TEAM_SPLIT_STRIDED
---------------------------
.. cpp:function:: __host__ int rocshmem_team_split_strided(rocshmem_team_t parent_team, int start, int stride, int size, const rocshmem_team_config_t *config, long config_mask, rocshmem_team_t *new_team)
:param parent_team: The team to split from
:param start: The lowest PE number of the subset of the PEs
from the parent team that will form the new
team
:param stride: The stride between team PE members in the
parent team that comprise the subset of PEs
that will form the new team
:param size: The number of PEs in the new team
:param config: Pointer to the config parameters for the new team
:param config_mask: Bitwise mask representing parameters to use from config
:param new_team: Pointer to the newly created team;
If an error occurs during team creation, or if the PE in
the parent team is not in the new team, the value will be
ROCSHMEM_TEAM_INVALID
:returns: Zero upon successful team creation; non-zero if erroneous
**Description:**
Create a new team of PEs. Must be called by all PEs in the parent team.
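The ``start``, ``stride``, and ``size`` parameters select PEs from the parent team. The following is a host-side Python sketch of that selection, not library code. Both helpers are hypothetical, and numbering the new team's PEs in ascending parent order is an assumption carried over from OpenSHMEM.

```python
def strided_members(start, stride, size):
    """Parent-team PE numbers selected by (start, stride, size)."""
    return [start + i * stride for i in range(size)]

def new_team_pe(parent_pe, start, stride, size):
    """Index of parent_pe in the new team, or None if it is not a member."""
    members = strided_members(start, stride, size)
    return members.index(parent_pe) if parent_pe in members else None

print(strided_members(1, 2, 3))  # parent PEs that form the new team
print(new_team_pe(3, 1, 2, 3))   # parent PE 3 is a member
print(new_team_pe(2, 1, 2, 3))   # parent PE 2 is not a member
```

A parent PE for which the second helper returns ``None`` corresponds to the case where ``new_team`` is set to ROCSHMEM_TEAM_INVALID.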
ROCSHMEM_TEAM_DESTROY
---------------------
.. cpp:function:: __host__ void rocshmem_team_destroy(rocshmem_team_t team)
:param team: The team to destroy; The behavior is undefined if
the input team is ROCSHMEM_TEAM_WORLD or any other
invalid team; If the input is ROCSHMEM_TEAM_INVALID,
this function will not perform any operation
:returns: None
**Description:**
Destroy a team. Must be called by all PEs in the team.
The user must destroy all private contexts created in the
team before destroying this team. Otherwise, the behavior
is undefined. This call will destroy only the shareable contexts
created from the referenced team.
+80
@@ -0,0 +1,80 @@
-------------------------
Running rocSHMEM Programs
-------------------------
Compiling and Linking with rocSHMEM
-----------------------------------
rocSHMEM is built as a library that can be statically
linked to your application during compilation using ``hipcc``.
When compiling your application with ``hipcc``, include the rocSHMEM header files
and link against the rocSHMEM library.
As of version 6.4.0, rocSHMEM depends on MPI, so you also need to link with an MPI library;
this requirement may be dropped in future versions.
The arguments for MPI linkage must be added manually rather than through ``mpicc``.
When using ``hipcc`` directly (as opposed to through a build system), we
recommend performing the compilation and linking steps separately.
For example, the example files (``./examples/*`` in
the source tarball) can be compiled and linked with the following commands:
.. code-block:: bash
# Compile
hipcc -c -fgpu-rdc -x hip rocshmem_allreduce_test.cc \
-I/opt/rocm/include \
-I$ROCSHMEM_INSTALL_DIR/include \
-I$OPENMPI_UCX_INSTALL_DIR/include/
# Link
hipcc -fgpu-rdc --hip-link rocshmem_allreduce_test.o -o rocshmem_allreduce_test \
$ROCSHMEM_INSTALL_DIR/lib/librocshmem.a \
$OPENMPI_UCX_INSTALL_DIR/lib/libmpi.so \
-L/opt/rocm/lib -lamdhip64 -lhsa-runtime64
If your project uses CMake, refer to
`Using CMake with AMD ROCm <https://rocmdocs.amd.com/en/latest/conceptual/cmake-packages.html>`_.
Running a rocSHMEM program
--------------------------
Programs that use rocSHMEM typically deploy multiple processes (usually one per GPU).
The MPI launcher (e.g., ``mpiexec`` when using Open MPI) is used to start the required number
of processes. For example, two processes of the getmem example (available when rocSHMEM is compiled from source) can be launched with the following command line:
.. code-block:: bash
mpiexec --map-by numa --mca pml ucx --mca osc ucx -np 2 ./build/examples/rocshmem_getmem_test
Please refer to the Open MPI documentation for more information about ``mpiexec`` command line parameters.
.. note::
Some systems may have multiple installs of MPI, some of which would not
have GPU support enabled. Make sure you use the ``mpiexec`` from the expected
MPI library, notably when using the MPI you built yourself
as part of :ref:`install-dependencies`.
Environment Variables
---------------------
The behavior of rocSHMEM can be controlled with the following environment variables:
.. list-table:: Environment Variables
:widths: 30 10 20
:header-rows: 1
* - Name
- Default Value
- Description
* - ROCSHMEM_HEAP_SIZE
- 1 GB
- Defines the size of the rocSHMEM symmetric heap.
Note the heap is on the GPU memory.
* - ROCSHMEM_MAX_NUM_CONTEXTS
- 1024
- Defines the maximum number of contexts an application can use.
* - ROCSHMEM_MAX_NUM_TEAMS
- 40
- Defines the maximum number of teams an application can use.
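One way to apply these variables is to set them in the environment of the launcher process. The Python sketch below only builds the command and environment. It assumes the two limits are read as plain decimal integers, which is not stated here. It leaves out ``ROCSHMEM_HEAP_SIZE`` because its value format is not given. ``launch_command`` and ``launch_env`` are hypothetical helper names, and whether the environment reaches every rank depends on the MPI launcher.

```python
import os
import shlex

def launch_command(binary, nprocs):
    """mpiexec command line mirroring the launch example in this document."""
    return ["mpiexec", "--map-by", "numa", "--mca", "pml", "ucx",
            "--mca", "osc", "ucx", "-np", str(nprocs), binary]

def launch_env(base=None, max_contexts=None, max_teams=None):
    """Copy an environment and add the rocSHMEM limits that were requested."""
    env = dict(os.environ if base is None else base)
    if max_contexts is not None:
        env["ROCSHMEM_MAX_NUM_CONTEXTS"] = str(max_contexts)
    if max_teams is not None:
        env["ROCSHMEM_MAX_NUM_TEAMS"] = str(max_teams)
    return env

cmd = launch_command("./build/examples/rocshmem_getmem_test", 2)
env = launch_env(base={}, max_teams=64)  # 64 is an arbitrary illustrative value
print(shlex.join(cmd))
print(env)
# To launch for real: subprocess.run(cmd, env=launch_env(max_teams=64), check=True)
```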
+36
@@ -0,0 +1,36 @@
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
import re
from rocm_docs import ROCmDocs
with open('../include/rocshmem/rocshmem.hpp', encoding='utf-8') as f:
    match = re.search(r'constexpr char VERSION\[\] = "([0-9.]+)[^0-9.]+', f.read())
    if not match:
        raise ValueError("VERSION not found!")
    version_number = match[1]
left_nav_title = f"rocSHMEM {version_number} Documentation"
# for PDF output on Read the Docs
project = "rocSHMEM Documentation"
author = "Advanced Micro Devices, Inc."
copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved."
version = version_number
release = version_number
external_toc_path = "./sphinx/_toc.yml"
docs_core = ROCmDocs(left_nav_title)
docs_core.run_doxygen(doxygen_root="doxygen", doxygen_path="doxygen/xml")
docs_core.setup()
external_projects_current_project = "rocshmem"
cpp_id_attributes = ["__host__", "__global__", "__device__"]
exclude_patterns = ["README.md"]
for sphinx_var in ROCmDocs.SPHINX_VARS:
    globals()[sphinx_var] = getattr(docs_core, sphinx_var)
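The version lookup in this configuration file can be exercised in isolation. The sketch below reuses the same regular expression on an in-memory string instead of opening the header. ``extract_version`` is a hypothetical wrapper. The first sample mirrors the ``VERSION`` line in the ``rocshmem.hpp`` hunk of this commit; the second uses a made-up ``-rc1`` suffix to show that only the leading digits and dots are captured.

```python
import re

VERSION_RE = r'constexpr char VERSION\[\] = "([0-9.]+)[^0-9.]+'

def extract_version(header_text):
    """Return the dotted version captured from a VERSION definition."""
    match = re.search(VERSION_RE, header_text)
    if not match:
        raise ValueError("VERSION not found!")
    return match[1]

sample = 'namespace rocshmem {\nconstexpr char VERSION[] = "2.0.0";\n'
print(extract_version(sample))

# Hypothetical pre-release suffix: the capture group stops before "-rc1".
print(extract_version('constexpr char VERSION[] = "2.0.0-rc1";'))
```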
File diff is too large to display.
+46
@@ -0,0 +1,46 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
****************************
rocSHMEM Documentation
****************************
The ROCm OpenSHMEM (rocSHMEM) runtime is part of an AMD and AMD Research initiative
to provide GPU-centric networking through an OpenSHMEM-like interface.
This intra-kernel networking library simplifies application code complexity and
enables more fine-grained communication/computation overlap
than traditional host-driven networking.
rocSHMEM uses a single symmetric heap (SHEAP) that is allocated in GPU memory. To learn more, see :doc:`introduction`.
The code is open and hosted at `<https://github.com/ROCm/rocSHMEM>`_.
.. grid:: 2
:gutter: 3
.. grid-item-card:: Install
* :doc:`Install rocSHMEM <./install>`
.. grid-item-card:: How to
* :doc:`Compile and Run rocSHMEM Programs <./compile_and_run>`
.. grid-item-card:: API Reference
* :doc:`Library Setup, Exit, and Query Routines <./api/init>`
* :doc:`Memory Management Routines <./api/memory_management>`
* :doc:`Team Management Routines <./api/teams>`
* :doc:`Context Management Routines <./api/ctx>`
* :doc:`Remote Memory Access Routines <./api/rma>`
* :doc:`Atomic Memory Operations <./api/amo>`
* :doc:`Signaling Operations <./api/sigops>`
* :doc:`Collective Routines <./api/coll>`
* :doc:`Point-to-Point Synchronization Routines <./api/pt2pt_sync>`
* :doc:`Memory Ordering Routines <./api/memory_ordering>`
To contribute to the documentation, refer to
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
You can find licensing information on the
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
+116
@@ -0,0 +1,116 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _install-rocshmem:
---------------------------
Installing rocSHMEM
---------------------------
This topic describes how to install rocSHMEM.
The file `README.md <https://github.com/ROCm/rocSHMEM/blob/rocm-6.4.0/README.md>`_ in the rocSHMEM sources may contain additional information.
Requirements
---------------------------
1. ROCm stack installed on the system (HIP runtime)
* ROCm v6.4.0 or later
2. AMD GPUs
* MI250X
* MI300X
3. ROCm-aware Open MPI and UCX as described in Building Dependencies
Installing from a Package Manager
---------------------------------
On Ubuntu, rocSHMEM can be installed with the following command:
.. code-block:: bash
apt install rocshmem-dev
.. note::
This installation method requires ROCm 6.4 or newer. The dependencies
(Open MPI and UCX) still need to be built following the instructions
in the next section, as the distribution-packaged versions do not
include full accelerator support.
.. _install-dependencies:
Building Dependencies
---------------------------
rocSHMEM requires ROCm-aware builds of Open MPI and UCX.
Other MPI implementations, such as MPICH,
*should* be compatible if rocSHMEM is built from source,
but they have not been thoroughly tested.
To build and configure ROCm-Aware UCX (1.17.0 or later), you need to:
.. code-block:: bash
git clone https://github.com/ROCm/ucx.git -b v1.17.x
cd ucx
./autogen.sh
./configure --prefix=<prefix_dir> --with-rocm=<rocm_path> --enable-mt
make -j 8
make -j 8 install
Then, you need to build Open MPI (5.0.7 or later) with UCX support.
.. code-block:: bash
git clone --recursive https://github.com/open-mpi/ompi.git -b v5.0.x
cd ompi
./autogen.pl
./configure --prefix=<prefix_dir> --with-rocm=<rocm_path> --with-ucx=<ucx_path>
make -j 8
make -j 8 install
Alternatively, a script is provided to install the dependencies.
Configuration options are platform dependent, so please review the script to
check that it suits your system.
.. code-block:: bash
export BUILD_DIR=/path/to/not_rocshmem_src_or_build/dependencies
/path/to/rocshmem_src/scripts/install_dependencies.sh
For more information on OpenMPI-UCX support, please visit:
`GPU-enabled Message Passing Interface <https://rocm.docs.amd.com/en/latest/how-to/gpu-enabled-mpi.html>`_
Installing rocSHMEM from Source
--------------------------------
The following method can be used to build and install rocSHMEM with the IPC
on-node, GPU-to-GPU backend:
.. code-block:: bash
git clone git@github.com:ROCm/rocSHMEM.git
cd rocSHMEM
mkdir build
cd build
../scripts/build_configs/ipc_single
The build script passes configuration options to CMake to set up a canonical
build.
There are other scripts for experimental configurations in the
``./scripts/build_configs`` directory, but currently only ``ipc_single``
is supported.
By default, the library is installed in ``~/rocshmem``. You may provide a
custom install path by supplying it as an argument. For example:
.. code-block:: bash
../scripts/build_configs/ipc_single /path/to/install
+54
@@ -0,0 +1,54 @@
.. meta::
:description: rocSHMEM intra-kernel networking runtime for AMD dGPUs on the ROCm platform.
:keywords: rocSHMEM, API, ROCm, documentation, HIP, Networking, Communication
.. _rocshmem-introduction:
---------------------------
What is rocSHMEM?
---------------------------
The ROCm OpenSHMEM (rocSHMEM) runtime is part of an AMD and AMD Research initiative
to provide GPU-centric networking through an OpenSHMEM-like interface.
This intra-kernel networking library simplifies application code complexity and
enables more fine-grained communication/computation overlap
than traditional host-driven networking.
rocSHMEM uses a single symmetric heap (SHEAP) that is allocated in GPU memory.
The code is open and hosted at `<https://github.com/ROCm/rocSHMEM>`_.
The rocSHMEM Programming Model
-------------------------------
Defining how OpenSHMEM applications interact with GPUs remains an
ongoing active discussion within the OpenSHMEM community, and the OpenSHMEM
specification has yet to coalesce on this topic.
rocSHMEM extends beyond the OpenSHMEM specification to add semantics that
support GPU kernel communication, while maintaining a close resemblance to
the original OpenSHMEM specification semantics.
Applications that use HIP can easily interface with rocSHMEM.
As per the HIP programming model,
rocSHMEM has ``__host__`` APIs, which are to be called from host code,
and ``__device__`` APIs, which can be called within GPU kernels.
Any device API without a special suffix or infix (e.g. ``_wg`` or ``_wave``)
must be called by a single thread.
GPU-specific ``_wg`` and ``_wave`` APIs are expected to be called from multiple GPU threads
and block until the calling scope completes.
These APIs can be called in divergent code paths, but this is not recommended.
Wavefront APIs
==============
The wavefront APIs are any API calls that have the suffix ``_wave``.
The parameters with which these routines are called must be
the same for every thread in the wavefront.
If any thread calls these routines with differing parameters, the behavior is undefined.
These APIs will block until the calling wavefront completes.
Workgroup APIs
==============
The workgroup APIs are any API calls that have the suffix ``_wg`` or infix ``_wg_``.
The parameters with which these routines are called must be
the same for every thread in the workgroup.
If any thread calls these routines with differing parameters, the behavior is undefined.
These APIs will block until the calling workgroup completes.
+4
@@ -0,0 +1,4 @@
# License
```{include} ../LICENSE.md
```
+93
@@ -0,0 +1,93 @@
accessible-pygments==0.0.5
alabaster==1.0.0
appnope==0.1.4
asttokens==3.0.0
attrs==25.1.0
babel==2.17.0
beautifulsoup4==4.13.3
breathe==4.35.0
certifi==2025.1.31
cffi==1.17.1
charset-normalizer==3.4.1
click==8.1.8
comm==0.2.2
cryptography==44.0.1
debugpy==1.8.12
decorator==5.1.1
Deprecated==1.2.18
docutils==0.21.2
exceptiongroup==1.2.2
executing==2.2.0
fastjsonschema==2.21.1
gitdb==4.0.12
GitPython==3.1.44
idna==3.10
imagesize==1.4.1
importlib_metadata==8.6.1
ipykernel==6.29.5
ipython==8.32.0
jedi==0.19.2
Jinja2==3.1.5
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter-cache==1.0.1
jupyter_client==8.6.3
jupyter_core==5.7.2
markdown-it-py==3.0.0
MarkupSafe==3.0.2
matplotlib-inline==0.1.7
mdit-py-plugins==0.4.2
mdurl==0.1.2
myst-nb==1.2.0
myst-parser==4.0.1
nbclient==0.10.2
nbformat==5.10.4
nest-asyncio==1.6.0
packaging==24.2
parso==0.8.4
pexpect==4.9.0
platformdirs==4.3.6
prompt_toolkit==3.0.50
psutil==7.0.0
ptyprocess==0.7.0
pure_eval==0.2.3
pycparser==2.22
pydata-sphinx-theme==0.15.4
PyGithub==2.6.1
Pygments==2.19.1
PyJWT==2.10.1
PyNaCl==1.5.0
python-dateutil==2.9.0.post0
PyYAML==6.0.2
pyzmq==26.2.1
referencing==0.36.2
requests==2.32.3
rocm-docs-core==1.17.0
rpds-py==0.23.1
six==1.17.0
smmap==5.0.2
snowballstemmer==2.2.0
soupsieve==2.6
Sphinx==8.1.3
sphinx-book-theme==1.1.4
sphinx-copybutton==0.5.2
sphinx-notfound-page==1.1.0
sphinx_design==0.6.1
sphinx_external_toc==1.0.1
sphinxcontrib-applehelp==2.0.0
sphinxcontrib-devhelp==2.0.0
sphinxcontrib-htmlhelp==2.1.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==2.0.0
sphinxcontrib-serializinghtml==2.0.0
SQLAlchemy==2.0.38
stack-data==0.6.3
tabulate==0.9.0
tomli==2.2.1
tornado==6.4.2
traitlets==5.14.3
typing_extensions==4.12.2
urllib3==2.3.0
wcwidth==0.2.13
wrapt==1.17.2
zipp==3.21.0
+46
@@ -0,0 +1,46 @@
defaults:
  numbered: False
root: index
subtrees:
- caption: Introduction
  entries:
  - file: introduction.rst
    title: What is rocSHMEM?
- caption: Install
  entries:
  - file: install.rst
    title: Install rocSHMEM
- caption: How to
  entries:
  - file: compile_and_run.rst
    title: Compile and Run rocSHMEM Programs
- caption: API Reference
  entries:
  - file: api/init.rst
    title: Library Setup, Exit, and Query Routines
  - file: api/memory_management.rst
    title: Memory Management Routines
  - file: api/teams.rst
    title: Team Management Routines
  - file: api/ctx.rst
    title: Context Management Routines
  - file: api/rma.rst
    title: Remote Memory Access Routines
  - file: api/amo.rst
    title: Atomic Memory Operations
  - file: api/sigops.rst
    title: Signaling Operations
  - file: api/coll.rst
    title: Collective Routines
  - file: api/pt2pt_sync.rst
    title: Point-to-Point Synchronization Routines
  - file: api/memory_ordering.rst
    title: Memory Ordering Routines
- caption: About
  entries:
  - file: license.rst
+2
@@ -52,6 +52,8 @@
namespace rocshmem {
constexpr char VERSION[] = "2.0.0";
/******************************************************************************
**************************** HOST INTERFACE **********************************
*****************************************************************************/