diff --git a/docs/old/reference/kernel_language.rst b/docs/old/reference/kernel_language.rst index a5d7a26ef0..7f2f123ac5 100644 --- a/docs/old/reference/kernel_language.rst +++ b/docs/old/reference/kernel_language.rst @@ -81,9 +81,9 @@ Calling ``__global__`` functions ============================================================= `__global__` functions are often referred to as *kernels*. When you call a global function, you're -*launching a kernel*. When launching a kernel, you must specify a run configuration that includes the -grid and block dimensions. The run configuration can also include other information for the launch, -such as the amount of additional shared memory to allocate and the stream where you want to run the +*launching a kernel*. When launching a kernel, you must specify an execution configuration that includes the +grid and block dimensions. The execution configuration can also include other information for the launch, +such as the amount of additional shared memory to allocate and the stream where you want to execute the kernel. HIP introduces a standard C++ calling convention (``hipLaunchKernelGGL``) to pass the run @@ -186,13 +186,13 @@ launch parameter. Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the ``HIP_DYNAMIC_SHARED`` macro in order to ensure accuracy. This is because using static shared memory in the same kernel could've resulted in overlapping memory ranges and data-races. The - HIP-Clang compiler provides support for ``extern`` shared declarations, so ``HIP_DYNAMIC_SHARED`` + HIP-Clang compiler provides support for ``extern __shared_`` declarations, so ``HIP_DYNAMIC_SHARED`` is no longer required. ``__managed__`` ----------------------------------------------------------------------------- -Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device +Managed memory, including the ``__managed__`` keyword, is supported in HIP combined host/device compilation. ``__restrict__`` @@ -375,7 +375,9 @@ the instructions are serialized in an undefined order. To support system scope atomic operations, you can use the HIP APIs that contain the ``_system`` suffix. For example: + * ``atomicAnd``: This function is atomic and coherent within the GPU device running the function + * ``atomicAnd_system``: This function extends the atomic operation from the GPU device to other CPUs and GPU devices in the system @@ -683,9 +685,10 @@ RMW functions produce unsafe atomic RMW instructions: Warp cross-lane functions ======================================================== +Threads in a warp are referred to as `lanes` and are numbered from 0 to warpSize - 1. Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp -lanes are run in lockstep, meaning that additional synchronization is unnecessary. The instructions -don't use shared memory. +lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions +use no shared memory. Note that NVIDIA and AMD devices have different warp sizes. You can use ``warpSize`` built-ins in you portable code to query the warp size. @@ -699,15 +702,26 @@ To get the default warp size of a GPU device, use ``hipGetDeviceProperties`` in .. code:: cpp - cudaDeviceProp props; - cudaGetDeviceProperties(&props, deviceID); - int w = props.warpSize; + cudaDeviceProp props; + cudaGetDeviceProperties(&props, deviceID); + int w = props.warpSize; // implement portable algorithm based on w (rather than assume 32 or 64) Only use ``warpSize`` built-ins in device functions, and don't assume ``warpSize`` to be a compile-time constant. Note that assembly kernels may be built for a warp size that is different from the default. +All mask values either returned or accepted by these builtins are 64-bit +unsigned integer values, even when compiled for a wave-32 device, where all the +higher bits are unused. CUDA code ported to HIP requires changes to ensure that +the correct type is used. + +Note that the ``__sync`` variants are made available in ROCm 6.2, but disabled by +default to help with the transition to 64-bit masks. They can be enabled by +setting the preprocessor macro ``HIP_ENABLE_WARP_SYNC_BUILTINS``. These builtins +will be enabled unconditionally in ROCm 6.3. Wherever possible, the +implementation includes a static assert to check that the program source uses +the correct type for the mask. Warp vote and ballot functions ------------------------------------------------------------------------------------------------------------- @@ -716,29 +730,76 @@ Warp vote and ballot functions int __all(int predicate) int __any(int predicate) - uint64_t __ballot(int predicate) + unsigned long long __ballot(int predicate) + unsigned long long __activemask() -Threads in a warp are referred to as *lanes* and are numbered from 0 to :math:` warpSize - 1`. Each -warp lane contributes 1 minus the bit value (the predicate), which is efficiently broadcast to all lanes in -the warp. + int __all_sync(unsigned long long mask, int predicate) + int __any_sync(unsigned long long mask, int predicate) + int __ballot(unsigned long long mask, int predicate) -The 32-bit int predicate from each lane reduces to a 1-bit value of 0 ``(predicate = 0)`` or 1 -``(predicate != 0)``. To get a summary view of the predicates that are contributed by other warp lanes, you -can use: +You can use ``__any`` and ``__all`` to get a summary view of the predicates evaluated by the +participating lanes. -* ``__any()``: Returns 1 if any warp lane contributes a nonzero predicate, otherwise it returns 0 -* ``__all()``: Returns 1 if all other warp lanes contribute nonzero predicates, otherwise it returns 0 +* ``__any()``: Returns 1 if the predicate is non-zero for any participating lane, otherwise it returns 0. + +* ``__all()``: Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0. To determine if the target platform supports the any/all instruction, you can use the ``hasWarpVote`` device property or the ``HIP_ARCH_HAS_WARP_VOTE`` compiler definition. -HIP's ``__ballot`` function provides a bit mask that contains the 1-bit predicate value from each lane. -The nth bit of this result contains the 1 bit contributed by the nth warp lane. Note that ``__ballot`` -supports a 64-bit return value (versus CUDA's 32 bits). Code ported from CUDA should support these -larger warp sizes. +``__ballot`` returns a bit mask containing the 1-bit predicate value from each +lane. The nth bit of the result contains the 1 bit contributed by the nth warp +lane. -To determine if the target platform supports the ballot instruction, you ca use the ``hasWarpBallot`` -device property or the ``HIP_ARCH_HAS_WARP_BALLOT`` compiler definition. +``__activemask()`` returns a bit mask of currently active warp lanes. The nth bit +of the result is 1 if the nth warp lane is active. + +Note that the ``__ballot`` and ``__activemask`` builtins in HIP have a 64-bit return +value (unlike the 32-bit value returned by the CUDA builtins). Code ported from +CUDA should be adapted to support the larger warp sizes that the HIP version +requires. + +Applications can test whether the target platform supports the ``__ballot`` or +``__activemask`` instructions using the ``hasWarpBallot`` device property in host +code or the ``HIP_ARCH_HAS_WARP_BALLOT`` macro defined by the compiler for device +code. + +The ``_sync`` variants require a 64-bit unsigned integer mask argument that +specifies the lanes in the warp that will participate in cross-lane +communication with the calling lane. Each participating thread must have its own +bit set in its mask argument, and all active threads specified in any mask +argument must execute the same call with the same mask, otherwise the result is +undefined. + +Warp match functions +------------------------------------------------------------------------------------------------------------- + +.. code:: cpp + + unsigned long long __match_any(T value) + unsigned long long __match_all(T value, int *pred) + + unsigned long long __match_any_sync(unsigned long long mask, T value) + unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred) + +``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or +double precision floating point type. + +``__match_any`` returns a bit mask containing a 1-bit for every participating lane +if and only if that lane has the same value in ``value`` as the current lane, and +a 0-bit for all other lanes. + +``__match_all`` returns a bit mask containing a 1-bit for every participating lane +if and only if they all have the same value in ``value`` as the current lane, and +a 0-bit for all other lanes. The predicate ``pred`` is set to true if and only if +all participating threads have the same value in ``value``. + +The ``_sync`` variants require a 64-bit unsigned integer mask argument that +specifies the lanes in the warp that will participate in cross-lane +communication with the calling lane. Each participating thread must have its own +bit set in its mask argument, and all active threads specified in any mask +argument must execute the same call with the same mask, otherwise the result is +undefined. Warp shuffle functions ------------------------------------------------------------------------------------------------------------- @@ -747,14 +808,25 @@ The default width is ``warpSize`` (see :ref:`warp-cross-lane`). Half-float shuff .. code:: cpp - int __shfl (int var, int srcLane, int width=warpSize); - float __shfl (float var, int srcLane, int width=warpSize); - int __shfl_up (int var, unsigned int delta, int width=warpSize); - float __shfl_up (float var, unsigned int delta, int width=warpSize); - int __shfl_down (int var, unsigned int delta, int width=warpSize); - float __shfl_down (float var, unsigned int delta, int width=warpSize); - int __shfl_xor (int var, int laneMask, int width=warpSize); - float __shfl_xor (float var, int laneMask, int width=warpSize); + int __shfl (T var, int srcLane, int width=warpSize); + int __shfl_up (T var, unsigned int delta, int width=warpSize); + int __shfl_down (T var, unsigned int delta, int width=warpSize); + int __shfl_xor (T var, int laneMask, int width=warpSize); + + int __shfl_sync (unsigned long long mask, T var, int srcLane, int width=warpSize); + int __shfl_up_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize); + int __shfl_down_sync (unsigned long long mask, T var, unsigned int delta, int width=warpSize); + int __shfl_xor_sync (unsigned long long mask, T var, int laneMask, int width=warpSize); + +``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or +double precision floating point type. + +The ``_sync`` variants require a 64-bit unsigned integer mask argument that +specifies the lanes in the warp that will participate in cross-lane +communication with the calling lane. Each participating thread must have its own +bit set in its mask argument, and all active threads specified in any mask +argument must execute the same call with the same mask, otherwise the result is +undefined. Cooperative groups functions ============================================================== @@ -933,6 +1005,8 @@ HIP provides the function ``abort()`` which can be used to terminate the applica This function produces a similar effect of using ``asm("trap")`` in the CUDA code. +.. note:: In HIP, the function terminates the entire application, while in CUDA, `asm("trap")`only terminates the dispatch and the application continues to run. + Printf ============================================================ @@ -942,9 +1016,9 @@ The following is a simple example to print information in the kernel. .. code:: cpp #include - + __global__ void run_printf() { printf("Hello World\n"); } - + int main() { run_printf<<>>(); }