diff --git a/projects/hip/docs/how-to/hip_porting_guide.rst b/projects/hip/docs/how-to/hip_porting_guide.rst index 418d38766b..7b074306d6 100644 --- a/projects/hip/docs/how-to/hip_porting_guide.rst +++ b/projects/hip/docs/how-to/hip_porting_guide.rst @@ -909,6 +909,70 @@ code, while the host can query it during runtime via the device properties. See the :ref:`HIP language extension for warpSize ` for information on how to write portable warpSize-aware code. +Lane masks bit-shift +==================== + +A thread in a warp is also called a lane, and a lane mask is a bitmask where +each bit corresponds to a thread in a warp. A bit is 1 if the thread is active, +0 if it's inactive. Bit-shift operations are typically used to create lane masks +and on AMD GPUs the ``warpSize`` can differ between different architectures, +that's why it's essential to use correct bitmask type, when porting code. + +Example: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1 << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1 << val) : 0); + +Lane masks are 32-bit integer types as this is the integer precision that C +assigns to such constants by default. GCN/CDNA architectures have a warp size of +64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain +values greater than 31. Consequently, shifting by such values would clear the +32-bit register to which the shift operation is applied. For AMD +architectures, a straightforward fix could look as follows: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((1ull << (threadIdx.x % warpSize) )-1 ); + + // Shift 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0); + +For portability reasons, it is better to introduce appropriately +typed placeholders as shown below: + +.. code-block:: cpp + + #if defined(__GFX8__) || defined(__GFX9__) + typedef uint64_t lane_mask_t; + #else + typedef uint32_t lane_mask_t; + #endif + +The use of :code:`lane_mask_t` with the previous example: + +.. code-block:: cpp + + // Get the thread's position in the warp + unsigned int laneId = threadIdx.x % warpSize; + + // Use lane ID for bit-shift + val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 ); + + // Shift 32 or 64 bit integer with val variable + WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0); + Porting from CUDA __launch_bounds__ ===================================