SWDEV-452786 - Adding Updates in kernel_language rst

Change-Id: I1fc96d598bcdb8fdaabe0b84d8b638de76d98905
Этот коммит содержится в:
Julia Jiang
2024-03-25 12:41:23 -04:00
родитель 11d368b6ac
Коммит fc4e762c43
+109 -35
Просмотреть файл
@@ -81,9 +81,9 @@ Calling ``__global__`` functions
=============================================================
`__global__` functions are often referred to as *kernels*. When you call a global function, you're
*launching a kernel*. When launching a kernel, you must specify a run configuration that includes the
grid and block dimensions. The run configuration can also include other information for the launch,
such as the amount of additional shared memory to allocate and the stream where you want to run the
*launching a kernel*. When launching a kernel, you must specify an execution configuration that includes the
grid and block dimensions. The execution configuration can also include other information for the launch,
such as the amount of additional shared memory to allocate and the stream where you want to execute the
kernel.
HIP introduces a standard C++ calling convention (``hipLaunchKernelGGL``) to pass the run
@@ -186,13 +186,13 @@ launch parameter.
Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the
``HIP_DYNAMIC_SHARED`` macro in order to ensure accuracy. This is because using static shared
memory in the same kernel could've resulted in overlapping memory ranges and data-races. The
HIP-Clang compiler provides support for ``extern`` shared declarations, so ``HIP_DYNAMIC_SHARED``
HIP-Clang compiler provides support for ``extern __shared_`` declarations, so ``HIP_DYNAMIC_SHARED``
is no longer required.
``__managed__``
-----------------------------------------------------------------------------
Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device
Managed memory, including the ``__managed__`` keyword, is supported in HIP combined host/device
compilation.
``__restrict__``
@@ -375,7 +375,9 @@ the instructions are serialized in an undefined order.
To support system scope atomic operations, you can use the HIP APIs that contain the ``_system`` suffix.
For example:
* ``atomicAnd``: This function is atomic and coherent within the GPU device running the function
* ``atomicAnd_system``: This function extends the atomic operation from the GPU device to other CPUs
and GPU devices in the system
@@ -683,9 +685,10 @@ RMW functions produce unsafe atomic RMW instructions:
Warp cross-lane functions
========================================================
Threads in a warp are referred to as `lanes` and are numbered from 0 to warpSize - 1.
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp
lanes are run in lockstep, meaning that additional synchronization is unnecessary. The instructions
don't use shared memory.
lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions
use no shared memory.
Note that NVIDIA and AMD devices have different warp sizes. You can use ``warpSize`` built-ins in you
portable code to query the warp size.
@@ -699,15 +702,26 @@ To get the default warp size of a GPU device, use ``hipGetDeviceProperties`` in
.. code:: cpp
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
cudaDeviceProp props;
cudaGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
// implement portable algorithm based on w (rather than assume 32 or 64)
Only use ``warpSize`` built-ins in device functions, and don't assume ``warpSize`` to be a compile-time
constant.
Note that assembly kernels may be built for a warp size that is different from the default.
All mask values either returned or accepted by these builtins are 64-bit
unsigned integer values, even when compiled for a wave-32 device, where all the
higher bits are unused. CUDA code ported to HIP requires changes to ensure that
the correct type is used.
Note that the ``__sync`` variants are made available in ROCm 6.2, but disabled by
default to help with the transition to 64-bit masks. They can be enabled by
setting the preprocessor macro ``HIP_ENABLE_WARP_SYNC_BUILTINS``. These builtins
will be enabled unconditionally in ROCm 6.3. Wherever possible, the
implementation includes a static assert to check that the program source uses
the correct type for the mask.
Warp vote and ballot functions
-------------------------------------------------------------------------------------------------------------
@@ -716,29 +730,76 @@ Warp vote and ballot functions
int __all(int predicate)
int __any(int predicate)
uint64_t __ballot(int predicate)
unsigned long long __ballot(int predicate)
unsigned long long __activemask()
Threads in a warp are referred to as *lanes* and are numbered from 0 to :math:` warpSize - 1`. Each
warp lane contributes 1 minus the bit value (the predicate), which is efficiently broadcast to all lanes in
the warp.
int __all_sync(unsigned long long mask, int predicate)
int __any_sync(unsigned long long mask, int predicate)
int __ballot(unsigned long long mask, int predicate)
The 32-bit int predicate from each lane reduces to a 1-bit value of 0 ``(predicate = 0)`` or 1
``(predicate != 0)``. To get a summary view of the predicates that are contributed by other warp lanes, you
can use:
You can use ``__any`` and ``__all`` to get a summary view of the predicates evaluated by the
participating lanes.
* ``__any()``: Returns 1 if any warp lane contributes a nonzero predicate, otherwise it returns 0
* ``__all()``: Returns 1 if all other warp lanes contribute nonzero predicates, otherwise it returns 0
* ``__any()``: Returns 1 if the predicate is non-zero for any participating lane, otherwise it returns 0.
* ``__all()``: Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0.
To determine if the target platform supports the any/all instruction, you can use the ``hasWarpVote``
device property or the ``HIP_ARCH_HAS_WARP_VOTE`` compiler definition.
HIP's ``__ballot`` function provides a bit mask that contains the 1-bit predicate value from each lane.
The nth bit of this result contains the 1 bit contributed by the nth warp lane. Note that ``__ballot``
supports a 64-bit return value (versus CUDA's 32 bits). Code ported from CUDA should support these
larger warp sizes.
``__ballot`` returns a bit mask containing the 1-bit predicate value from each
lane. The nth bit of the result contains the 1 bit contributed by the nth warp
lane.
To determine if the target platform supports the ballot instruction, you ca use the ``hasWarpBallot``
device property or the ``HIP_ARCH_HAS_WARP_BALLOT`` compiler definition.
``__activemask()`` returns a bit mask of currently active warp lanes. The nth bit
of the result is 1 if the nth warp lane is active.
Note that the ``__ballot`` and ``__activemask`` builtins in HIP have a 64-bit return
value (unlike the 32-bit value returned by the CUDA builtins). Code ported from
CUDA should be adapted to support the larger warp sizes that the HIP version
requires.
Applications can test whether the target platform supports the ``__ballot`` or
``__activemask`` instructions using the ``hasWarpBallot`` device property in host
code or the ``HIP_ARCH_HAS_WARP_BALLOT`` macro defined by the compiler for device
code.
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Warp match functions
-------------------------------------------------------------------------------------------------------------
.. code:: cpp
unsigned long long __match_any(T value)
unsigned long long __match_all(T value, int *pred)
unsigned long long __match_any_sync(unsigned long long mask, T value)
unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)
``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
double precision floating point type.
``__match_any`` returns a bit mask containing a 1-bit for every participating lane
if and only if that lane has the same value in ``value`` as the current lane, and
a 0-bit for all other lanes.
``__match_all`` returns a bit mask containing a 1-bit for every participating lane
if and only if they all have the same value in ``value`` as the current lane, and
a 0-bit for all other lanes. The predicate ``pred`` is set to true if and only if
all participating threads have the same value in ``value``.
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Warp shuffle functions
-------------------------------------------------------------------------------------------------------------
@@ -747,14 +808,25 @@ The default width is ``warpSize`` (see :ref:`warp-cross-lane`). Half-float shuff
.. code:: cpp
int __shfl (int var, int srcLane, int width=warpSize);
float __shfl (float var, int srcLane, int width=warpSize);
int __shfl_up (int var, unsigned int delta, int width=warpSize);
float __shfl_up (float var, unsigned int delta, int width=warpSize);
int __shfl_down (int var, unsigned int delta, int width=warpSize);
float __shfl_down (float var, unsigned int delta, int width=warpSize);
int __shfl_xor (int var, int laneMask, int width=warpSize);
float __shfl_xor (float var, int laneMask, int width=warpSize);
int __shfl (T var, int srcLane, int width=warpSize);
int __shfl_up (T var, unsigned int delta, int width=warpSize);
int __shfl_down (T var, unsigned int delta, int width=warpSize);
int __shfl_xor (T var, int laneMask, int width=warpSize);
int __shfl_sync (unsigned long long mask, T var, int srcLane, int width=warpSize);
int __shfl_up_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
int __shfl_down_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
int __shfl_xor_sync (unsigned long long mask, T var, int laneMask, int width=warpSize);
``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
double precision floating point type.
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
specifies the lanes in the warp that will participate in cross-lane
communication with the calling lane. Each participating thread must have its own
bit set in its mask argument, and all active threads specified in any mask
argument must execute the same call with the same mask, otherwise the result is
undefined.
Cooperative groups functions
==============================================================
@@ -933,6 +1005,8 @@ HIP provides the function ``abort()`` which can be used to terminate the applica
This function produces a similar effect of using ``asm("trap")`` in the CUDA code.
.. note:: In HIP, the function terminates the entire application, while in CUDA, `asm("trap")`only terminates the dispatch and the application continues to run.
Printf
============================================================
@@ -942,9 +1016,9 @@ The following is a simple example to print information in the kernel.
.. code:: cpp
#include <hip/hip_runtime.h>
__global__ void run_printf() { printf("Hello World\n"); }
int main() {
run_printf<<<dim3(1), dim3(1), 0, 0>>>();
}