SWDEV-452786 - Adding Updates in kernel_language rst
Change-Id: I1fc96d598bcdb8fdaabe0b84d8b638de76d98905
Этот коммит содержится в:
@@ -81,9 +81,9 @@ Calling ``__global__`` functions
|
||||
=============================================================
|
||||
|
||||
`__global__` functions are often referred to as *kernels*. When you call a global function, you're
|
||||
*launching a kernel*. When launching a kernel, you must specify a run configuration that includes the
|
||||
grid and block dimensions. The run configuration can also include other information for the launch,
|
||||
such as the amount of additional shared memory to allocate and the stream where you want to run the
|
||||
*launching a kernel*. When launching a kernel, you must specify an execution configuration that includes the
|
||||
grid and block dimensions. The execution configuration can also include other information for the launch,
|
||||
such as the amount of additional shared memory to allocate and the stream where you want to execute the
|
||||
kernel.
|
||||
|
||||
HIP introduces a standard C++ calling convention (``hipLaunchKernelGGL``) to pass the run
|
||||
@@ -186,13 +186,13 @@ launch parameter.
|
||||
Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the
|
||||
``HIP_DYNAMIC_SHARED`` macro in order to ensure accuracy. This is because using static shared
|
||||
memory in the same kernel could've resulted in overlapping memory ranges and data-races. The
|
||||
HIP-Clang compiler provides support for ``extern`` shared declarations, so ``HIP_DYNAMIC_SHARED``
|
||||
HIP-Clang compiler provides support for ``extern __shared_`` declarations, so ``HIP_DYNAMIC_SHARED``
|
||||
is no longer required.
|
||||
|
||||
``__managed__``
|
||||
-----------------------------------------------------------------------------
|
||||
|
||||
Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device
|
||||
Managed memory, including the ``__managed__`` keyword, is supported in HIP combined host/device
|
||||
compilation.
|
||||
|
||||
``__restrict__``
|
||||
@@ -375,7 +375,9 @@ the instructions are serialized in an undefined order.
|
||||
|
||||
To support system scope atomic operations, you can use the HIP APIs that contain the ``_system`` suffix.
|
||||
For example:
|
||||
|
||||
* ``atomicAnd``: This function is atomic and coherent within the GPU device running the function
|
||||
|
||||
* ``atomicAnd_system``: This function extends the atomic operation from the GPU device to other CPUs
|
||||
and GPU devices in the system
|
||||
|
||||
@@ -683,9 +685,10 @@ RMW functions produce unsafe atomic RMW instructions:
|
||||
Warp cross-lane functions
|
||||
========================================================
|
||||
|
||||
Threads in a warp are referred to as `lanes` and are numbered from 0 to warpSize - 1.
|
||||
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp
|
||||
lanes are run in lockstep, meaning that additional synchronization is unnecessary. The instructions
|
||||
don't use shared memory.
|
||||
lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions
|
||||
use no shared memory.
|
||||
|
||||
Note that NVIDIA and AMD devices have different warp sizes. You can use ``warpSize`` built-ins in you
|
||||
portable code to query the warp size.
|
||||
@@ -699,15 +702,26 @@ To get the default warp size of a GPU device, use ``hipGetDeviceProperties`` in
|
||||
|
||||
.. code:: cpp
|
||||
|
||||
cudaDeviceProp props;
|
||||
cudaGetDeviceProperties(&props, deviceID);
|
||||
int w = props.warpSize;
|
||||
cudaDeviceProp props;
|
||||
cudaGetDeviceProperties(&props, deviceID);
|
||||
int w = props.warpSize;
|
||||
// implement portable algorithm based on w (rather than assume 32 or 64)
|
||||
|
||||
Only use ``warpSize`` built-ins in device functions, and don't assume ``warpSize`` to be a compile-time
|
||||
constant.
|
||||
|
||||
Note that assembly kernels may be built for a warp size that is different from the default.
|
||||
All mask values either returned or accepted by these builtins are 64-bit
|
||||
unsigned integer values, even when compiled for a wave-32 device, where all the
|
||||
higher bits are unused. CUDA code ported to HIP requires changes to ensure that
|
||||
the correct type is used.
|
||||
|
||||
Note that the ``__sync`` variants are made available in ROCm 6.2, but disabled by
|
||||
default to help with the transition to 64-bit masks. They can be enabled by
|
||||
setting the preprocessor macro ``HIP_ENABLE_WARP_SYNC_BUILTINS``. These builtins
|
||||
will be enabled unconditionally in ROCm 6.3. Wherever possible, the
|
||||
implementation includes a static assert to check that the program source uses
|
||||
the correct type for the mask.
|
||||
|
||||
Warp vote and ballot functions
|
||||
-------------------------------------------------------------------------------------------------------------
|
||||
@@ -716,29 +730,76 @@ Warp vote and ballot functions
|
||||
|
||||
int __all(int predicate)
|
||||
int __any(int predicate)
|
||||
uint64_t __ballot(int predicate)
|
||||
unsigned long long __ballot(int predicate)
|
||||
unsigned long long __activemask()
|
||||
|
||||
Threads in a warp are referred to as *lanes* and are numbered from 0 to :math:` warpSize - 1`. Each
|
||||
warp lane contributes 1 minus the bit value (the predicate), which is efficiently broadcast to all lanes in
|
||||
the warp.
|
||||
int __all_sync(unsigned long long mask, int predicate)
|
||||
int __any_sync(unsigned long long mask, int predicate)
|
||||
int __ballot(unsigned long long mask, int predicate)
|
||||
|
||||
The 32-bit int predicate from each lane reduces to a 1-bit value of 0 ``(predicate = 0)`` or 1
|
||||
``(predicate != 0)``. To get a summary view of the predicates that are contributed by other warp lanes, you
|
||||
can use:
|
||||
You can use ``__any`` and ``__all`` to get a summary view of the predicates evaluated by the
|
||||
participating lanes.
|
||||
|
||||
* ``__any()``: Returns 1 if any warp lane contributes a nonzero predicate, otherwise it returns 0
|
||||
* ``__all()``: Returns 1 if all other warp lanes contribute nonzero predicates, otherwise it returns 0
|
||||
* ``__any()``: Returns 1 if the predicate is non-zero for any participating lane, otherwise it returns 0.
|
||||
|
||||
* ``__all()``: Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0.
|
||||
|
||||
To determine if the target platform supports the any/all instruction, you can use the ``hasWarpVote``
|
||||
device property or the ``HIP_ARCH_HAS_WARP_VOTE`` compiler definition.
|
||||
|
||||
HIP's ``__ballot`` function provides a bit mask that contains the 1-bit predicate value from each lane.
|
||||
The nth bit of this result contains the 1 bit contributed by the nth warp lane. Note that ``__ballot``
|
||||
supports a 64-bit return value (versus CUDA's 32 bits). Code ported from CUDA should support these
|
||||
larger warp sizes.
|
||||
``__ballot`` returns a bit mask containing the 1-bit predicate value from each
|
||||
lane. The nth bit of the result contains the 1 bit contributed by the nth warp
|
||||
lane.
|
||||
|
||||
To determine if the target platform supports the ballot instruction, you ca use the ``hasWarpBallot``
|
||||
device property or the ``HIP_ARCH_HAS_WARP_BALLOT`` compiler definition.
|
||||
``__activemask()`` returns a bit mask of currently active warp lanes. The nth bit
|
||||
of the result is 1 if the nth warp lane is active.
|
||||
|
||||
Note that the ``__ballot`` and ``__activemask`` builtins in HIP have a 64-bit return
|
||||
value (unlike the 32-bit value returned by the CUDA builtins). Code ported from
|
||||
CUDA should be adapted to support the larger warp sizes that the HIP version
|
||||
requires.
|
||||
|
||||
Applications can test whether the target platform supports the ``__ballot`` or
|
||||
``__activemask`` instructions using the ``hasWarpBallot`` device property in host
|
||||
code or the ``HIP_ARCH_HAS_WARP_BALLOT`` macro defined by the compiler for device
|
||||
code.
|
||||
|
||||
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
|
||||
specifies the lanes in the warp that will participate in cross-lane
|
||||
communication with the calling lane. Each participating thread must have its own
|
||||
bit set in its mask argument, and all active threads specified in any mask
|
||||
argument must execute the same call with the same mask, otherwise the result is
|
||||
undefined.
|
||||
|
||||
Warp match functions
|
||||
-------------------------------------------------------------------------------------------------------------
|
||||
|
||||
.. code:: cpp
|
||||
|
||||
unsigned long long __match_any(T value)
|
||||
unsigned long long __match_all(T value, int *pred)
|
||||
|
||||
unsigned long long __match_any_sync(unsigned long long mask, T value)
|
||||
unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred)
|
||||
|
||||
``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
|
||||
double precision floating point type.
|
||||
|
||||
``__match_any`` returns a bit mask containing a 1-bit for every participating lane
|
||||
if and only if that lane has the same value in ``value`` as the current lane, and
|
||||
a 0-bit for all other lanes.
|
||||
|
||||
``__match_all`` returns a bit mask containing a 1-bit for every participating lane
|
||||
if and only if they all have the same value in ``value`` as the current lane, and
|
||||
a 0-bit for all other lanes. The predicate ``pred`` is set to true if and only if
|
||||
all participating threads have the same value in ``value``.
|
||||
|
||||
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
|
||||
specifies the lanes in the warp that will participate in cross-lane
|
||||
communication with the calling lane. Each participating thread must have its own
|
||||
bit set in its mask argument, and all active threads specified in any mask
|
||||
argument must execute the same call with the same mask, otherwise the result is
|
||||
undefined.
|
||||
|
||||
Warp shuffle functions
|
||||
-------------------------------------------------------------------------------------------------------------
|
||||
@@ -747,14 +808,25 @@ The default width is ``warpSize`` (see :ref:`warp-cross-lane`). Half-float shuff
|
||||
|
||||
.. code:: cpp
|
||||
|
||||
int __shfl (int var, int srcLane, int width=warpSize);
|
||||
float __shfl (float var, int srcLane, int width=warpSize);
|
||||
int __shfl_up (int var, unsigned int delta, int width=warpSize);
|
||||
float __shfl_up (float var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_down (int var, unsigned int delta, int width=warpSize);
|
||||
float __shfl_down (float var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_xor (int var, int laneMask, int width=warpSize);
|
||||
float __shfl_xor (float var, int laneMask, int width=warpSize);
|
||||
int __shfl (T var, int srcLane, int width=warpSize);
|
||||
int __shfl_up (T var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_down (T var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_xor (T var, int laneMask, int width=warpSize);
|
||||
|
||||
int __shfl_sync (unsigned long long mask, T var, int srcLane, int width=warpSize);
|
||||
int __shfl_up_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_down_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize);
|
||||
int __shfl_xor_sync (unsigned long long mask, T var, int laneMask, int width=warpSize);
|
||||
|
||||
``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
|
||||
double precision floating point type.
|
||||
|
||||
The ``_sync`` variants require a 64-bit unsigned integer mask argument that
|
||||
specifies the lanes in the warp that will participate in cross-lane
|
||||
communication with the calling lane. Each participating thread must have its own
|
||||
bit set in its mask argument, and all active threads specified in any mask
|
||||
argument must execute the same call with the same mask, otherwise the result is
|
||||
undefined.
|
||||
|
||||
Cooperative groups functions
|
||||
==============================================================
|
||||
@@ -933,6 +1005,8 @@ HIP provides the function ``abort()`` which can be used to terminate the applica
|
||||
|
||||
This function produces a similar effect of using ``asm("trap")`` in the CUDA code.
|
||||
|
||||
.. note:: In HIP, the function terminates the entire application, while in CUDA, `asm("trap")`only terminates the dispatch and the application continues to run.
|
||||
|
||||
Printf
|
||||
============================================================
|
||||
|
||||
@@ -942,9 +1016,9 @@ The following is a simple example to print information in the kernel.
|
||||
.. code:: cpp
|
||||
|
||||
#include <hip/hip_runtime.h>
|
||||
|
||||
|
||||
__global__ void run_printf() { printf("Hello World\n"); }
|
||||
|
||||
|
||||
int main() {
|
||||
run_printf<<<dim3(1), dim3(1), 0, 0>>>();
|
||||
}
|
||||
|
||||
Ссылка в новой задаче
Block a user