Restore Lane masks bit shift content (#2411)
Co-authored-by: Christophe Paquot <35546540+chrispaquot@users.noreply.github.com>
This commit is contained in:
@@ -909,6 +909,70 @@ code, while the host can query it during runtime via the device properties. See
|
||||
the :ref:`HIP language extension for warpSize <warp_size>` for information on
|
||||
how to write portable warpSize-aware code.
|
||||
|
||||
Lane masks bit-shift
|
||||
====================
|
||||
|
||||
A thread in a warp is also called a lane, and a lane mask is a bitmask where
|
||||
each bit corresponds to a thread in a warp. A bit is 1 if the thread is active,
|
||||
0 if it's inactive. Bit-shift operations are typically used to create lane masks
|
||||
and on AMD GPUs the ``warpSize`` can differ between different architectures,
|
||||
that's why it's essential to use correct bitmask type, when porting code.
|
||||
|
||||
Example:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// Get the thread's position in the warp
|
||||
unsigned int laneId = threadIdx.x % warpSize;
|
||||
|
||||
// Use lane ID for bit-shift
|
||||
val & ((1 << (threadIdx.x % warpSize) )-1 );
|
||||
|
||||
// Shift 32 bit integer with val variable
|
||||
WarpReduce::sum( (val < warpSize) ? (1 << val) : 0);
|
||||
|
||||
Lane masks are 32-bit integer types as this is the integer precision that C
|
||||
assigns to such constants by default. GCN/CDNA architectures have a warp size of
|
||||
64, :code:`threadIdx.x % warpSize` and :code:`val` in the example may obtain
|
||||
values greater than 31. Consequently, shifting by such values would clear the
|
||||
32-bit register to which the shift operation is applied. For AMD
|
||||
architectures, a straightforward fix could look as follows:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// Get the thread's position in the warp
|
||||
unsigned int laneId = threadIdx.x % warpSize;
|
||||
|
||||
// Use lane ID for bit-shift
|
||||
val & ((1ull << (threadIdx.x % warpSize) )-1 );
|
||||
|
||||
// Shift 64 bit integer with val variable
|
||||
WarpReduce::sum( (val < warpSize) ? (1ull << val) : 0);
|
||||
|
||||
For portability reasons, it is better to introduce appropriately
|
||||
typed placeholders as shown below:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
#if defined(__GFX8__) || defined(__GFX9__)
|
||||
typedef uint64_t lane_mask_t;
|
||||
#else
|
||||
typedef uint32_t lane_mask_t;
|
||||
#endif
|
||||
|
||||
The use of :code:`lane_mask_t` with the previous example:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// Get the thread's position in the warp
|
||||
unsigned int laneId = threadIdx.x % warpSize;
|
||||
|
||||
// Use lane ID for bit-shift
|
||||
val & ((lane_mask_t{1} << (threadIdx.x % warpSize) )-1 );
|
||||
|
||||
// Shift 32 or 64 bit integer with val variable
|
||||
WarpReduce::sum( (val < warpSize) ? (lane_mask_t{1} << val) : 0);
|
||||
|
||||
Porting from CUDA __launch_bounds__
|
||||
===================================
|
||||
|
||||
|
||||
مرجع در شماره جدید
Block a user