SWDEV-452786 - Adding Updates in kernel_language rst

Change-Id: I1fc96d598bcdb8fdaabe0b84d8b638de76d98905
2024-03-25 12:41:23 -04:00
@@ -81,9 +81,9 @@ Calling ``__global__`` functions
 =============================================================

 `__global__` functions are often referred to as *kernels*. When you call a global function, you're
-*launching a kernel*. When launching a kernel, you must specify a run configuration that includes the
-grid and block dimensions. The run configuration can also include other information for the launch,
-such as the amount of additional shared memory to allocate and the stream where you want to run the
+*launching a kernel*. When launching a kernel, you must specify an execution configuration that includes the
+grid and block dimensions. The execution configuration can also include other information for the launch,
+such as the amount of additional shared memory to allocate and the stream where you want to execute the
 kernel.

 HIP introduces a standard C++ calling convention (``hipLaunchKernelGGL``) to pass the run
@@ -186,13 +186,13 @@ launch parameter.
  Prior to the HIP-Clang compiler, dynamic shared memory had to be declared using the
  ``HIP_DYNAMIC_SHARED`` macro in order to ensure accuracy. This is because using static shared
  memory in the same kernel could've resulted in overlapping memory ranges and data-races. The
-  HIP-Clang compiler provides support for ``extern`` shared declarations, so ``HIP_DYNAMIC_SHARED``
+  HIP-Clang compiler provides support for ``extern __shared_`` declarations, so ``HIP_DYNAMIC_SHARED``
  is no longer required.

 ``__managed__``
 -----------------------------------------------------------------------------

-Managed memory, including the `__managed__` keyword, is supported in HIP combined host/device
+Managed memory, including the ``__managed__`` keyword, is supported in HIP combined host/device
 compilation.

 ``__restrict__``
@@ -375,7 +375,9 @@ the instructions are serialized in an undefined order.

 To support system scope atomic operations, you can use the HIP APIs that contain the ``_system`` suffix.
 For example:
+
 * ``atomicAnd``: This function is atomic and coherent within the GPU device running the function
+
 * ``atomicAnd_system``: This function extends the atomic operation from the GPU device to other CPUs
 and GPU devices in the system

@@ -683,9 +685,10 @@ RMW functions produce unsafe atomic RMW instructions:
 Warp cross-lane functions
 ========================================================

+Threads in a warp are referred to as `lanes` and are numbered from 0 to warpSize - 1.
 Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all warp
-lanes are run in lockstep, meaning that additional synchronization is unnecessary. The instructions
-don't use shared memory.
+lanes will execute in lockstep, so additional synchronization is unnecessary, and the instructions
+use no shared memory.

 Note that NVIDIA and AMD devices have different warp sizes. You can use ``warpSize`` built-ins in you
 portable code to query the warp size.
@@ -699,15 +702,26 @@ To get the default warp size of a GPU device, use ``hipGetDeviceProperties`` in

 .. code:: cpp

-	cudaDeviceProp props;
-	cudaGetDeviceProperties(&props, deviceID);
-    int w = props.warpSize;
+  cudaDeviceProp props;
+  cudaGetDeviceProperties(&props, deviceID);
+  int w = props.warpSize;
    // implement portable algorithm based on w (rather than assume 32 or 64)

 Only use ``warpSize`` built-ins in device functions, and don't assume ``warpSize`` to be a compile-time
 constant.

 Note that assembly kernels may be built for a warp size that is different from the default.
+All mask values either returned or accepted by these builtins are 64-bit
+unsigned integer values, even when compiled for a wave-32 device, where all the
+higher bits are unused. CUDA code ported to HIP requires changes to ensure that
+the correct type is used.
+
+Note that the ``__sync`` variants are made available in ROCm 6.2, but disabled by
+default to help with the transition to 64-bit masks. They can be enabled by
+setting the preprocessor macro ``HIP_ENABLE_WARP_SYNC_BUILTINS``. These builtins
+will be enabled unconditionally in ROCm 6.3. Wherever possible, the
+implementation includes a static assert to check that the program source uses
+the correct type for the mask.

 Warp vote and ballot functions
 -------------------------------------------------------------------------------------------------------------
@@ -716,29 +730,76 @@ Warp vote and ballot functions

  int __all(int predicate)
  int __any(int predicate)
-  uint64_t __ballot(int predicate)
+  unsigned long long __ballot(int predicate)
+  unsigned long long __activemask()

-Threads in a warp are referred to as *lanes* and are numbered from 0 to :math:` warpSize - 1`. Each
-warp lane contributes 1 minus the bit value (the predicate), which is efficiently broadcast to all lanes in
-the warp.
+  int __all_sync(unsigned long long mask, int predicate)
+  int __any_sync(unsigned long long mask, int predicate)
+  int __ballot(unsigned long long mask, int predicate)

-The 32-bit int predicate from each lane reduces to a 1-bit value of 0 ``(predicate = 0)`` or 1
-``(predicate != 0)``. To get a summary view of the predicates that are contributed by other warp lanes, you
-can use:
+You can use ``__any`` and ``__all`` to get a summary view of the predicates evaluated by the
+participating lanes.

-* ``__any()``: Returns 1 if any warp lane contributes a nonzero predicate, otherwise it returns 0
-* ``__all()``: Returns 1 if all other warp lanes contribute nonzero predicates, otherwise it returns 0
+* ``__any()``: Returns 1 if the predicate is non-zero for any participating lane,  otherwise it returns 0.
+
+* ``__all()``: Returns 1 if the predicate is non-zero for all participating lanes, otherwise it returns 0.

 To determine if the target platform supports the any/all instruction, you can use the ``hasWarpVote``
 device property or the ``HIP_ARCH_HAS_WARP_VOTE`` compiler definition.

-HIP's ``__ballot`` function provides a bit mask that contains the 1-bit predicate value from each lane.
-The nth bit of this result contains the 1 bit contributed by the nth warp lane. Note that ``__ballot``
-supports a 64-bit return value (versus CUDA's 32 bits). Code ported from CUDA should support these
-larger warp sizes.
+``__ballot`` returns a bit mask containing the 1-bit predicate value from each
+lane. The nth bit of the result contains the 1 bit contributed by the nth warp
+lane.

-To determine if the target platform supports the ballot instruction, you ca use the ``hasWarpBallot``
-device property or the ``HIP_ARCH_HAS_WARP_BALLOT`` compiler definition.
+``__activemask()`` returns a bit mask of currently active warp lanes. The nth bit
+of the result is 1 if the nth warp lane is active.
+
+Note that the ``__ballot`` and ``__activemask`` builtins in HIP have a 64-bit return
+value (unlike the 32-bit value returned by the CUDA builtins). Code ported from
+CUDA should be adapted to support the larger warp sizes that the HIP version
+requires.
+
+Applications can test whether the target platform supports the ``__ballot`` or
+``__activemask`` instructions using the ``hasWarpBallot`` device property in host
+code or the ``HIP_ARCH_HAS_WARP_BALLOT`` macro defined by the compiler for device
+code.
+
+The ``_sync`` variants require a 64-bit unsigned integer mask argument that
+specifies the lanes in the warp that will participate in cross-lane
+communication with the calling lane. Each participating thread must have its own
+bit set in its mask argument, and all active threads specified in any mask
+argument must execute the same call with the same mask, otherwise the result is
+undefined.
+
+Warp match functions
+-------------------------------------------------------------------------------------------------------------
+
+.. code:: cpp
+
+  unsigned long long __match_any(T value)
+  unsigned long long __match_all(T value, int *pred)
+
+  unsigned long long __match_any_sync(unsigned long long mask, T value)
+  unsigned long long __match_all_sync(unsigned long long mask, T value, int *pred) 
+
+``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
+double precision floating point type.
+
+``__match_any`` returns a bit mask containing a 1-bit for every participating lane
+if and only if that lane has the same value in ``value`` as the current lane, and
+a 0-bit for all other lanes.
+
+``__match_all`` returns a bit mask containing a 1-bit for every participating lane
+if and only if they all have the same value in ``value`` as the current lane, and
+a 0-bit for all other lanes. The predicate ``pred`` is set to true if and only if
+all participating threads have the same value in ``value``.
+
+The ``_sync`` variants require a 64-bit unsigned integer mask argument that
+specifies the lanes in the warp that will participate in cross-lane
+communication with the calling lane. Each participating thread must have its own
+bit set in its mask argument, and all active threads specified in any mask
+argument must execute the same call with the same mask, otherwise the result is
+undefined.

 Warp shuffle functions
 -------------------------------------------------------------------------------------------------------------
@@ -747,14 +808,25 @@ The default width is ``warpSize`` (see :ref:`warp-cross-lane`). Half-float shuff

 .. code:: cpp

-  int   __shfl      (int var,   int srcLane, int width=warpSize);
-  float __shfl      (float var, int srcLane, int width=warpSize);
-  int   __shfl_up   (int var,   unsigned int delta, int width=warpSize);
-  float __shfl_up   (float var, unsigned int delta, int width=warpSize);
-  int   __shfl_down (int var,   unsigned int delta, int width=warpSize);
-  float __shfl_down (float var, unsigned int delta, int width=warpSize);
-  int   __shfl_xor  (int var,   int laneMask, int width=warpSize);
-  float __shfl_xor  (float var, int laneMask, int width=warpSize);
+  int   __shfl      (T var,   int srcLane, int width=warpSize);
+  int   __shfl_up   (T var,   unsigned int delta, int width=warpSize);
+  int   __shfl_down (T var,   unsigned int delta, int width=warpSize);
+  int   __shfl_xor  (T var,   int laneMask, int width=warpSize);
+
+  int   __shfl_sync      (unsigned long long mask, T var,   int srcLane, int width=warpSize);
+  int   __shfl_up_sync   (unsigned long long mask, T var,   unsigned int delta, int width=warpSize);
+  int   __shfl_down_sync (unsigned long long mask, T var,   unsigned int delta, int width=warpSize);
+  int   __shfl_xor_sync  (unsigned long long mask, T var,   int laneMask, int width=warpSize);
+
+``T`` can be a 32-bit integer type, 64-bit integer type or a single precision or
+double precision floating point type.
+
+The ``_sync`` variants require a 64-bit unsigned integer mask argument that
+specifies the lanes in the warp that will participate in cross-lane
+communication with the calling lane. Each participating thread must have its own
+bit set in its mask argument, and all active threads specified in any mask
+argument must execute the same call with the same mask, otherwise the result is
+undefined.

 Cooperative groups functions
 ==============================================================
@@ -933,6 +1005,8 @@ HIP provides the function ``abort()`` which can be used to terminate the applica

 This function produces a similar effect of using ``asm("trap")`` in the CUDA code.

+.. note:: In HIP, the function terminates the entire application, while in CUDA, `asm("trap")`only terminates the dispatch and the application continues to run.
+
 Printf
 ============================================================

@@ -942,9 +1016,9 @@ The following is a simple example to print information in the kernel.
 .. code:: cpp

  #include <hip/hip_runtime.h>
-  
+
  __global__ void run_printf() { printf("Hello World\n"); }
-  
+
  int main() {
    run_printf<<<dim3(1), dim3(1), 0, 0>>>();
  }