b385b1230c
Signed-off-by: Jeffrey Novotny <jnovotny@amd.com>
457 wiersze
35 KiB
ReStructuredText
457 wiersze
35 KiB
ReStructuredText
.. _occupancy-example:
|
||
|
||
Occupancy limiters example
|
||
==========================
|
||
|
||
For this example, consider the
|
||
:dev-sample:`occupancy <occupancy.hip>` included with Omniperf. We will
|
||
investigate the use of the resource allocation panel in the
|
||
:ref:`Workgroup Manager <desc-spi>`’s metrics section to determine occupancy
|
||
limiters. This code contains several kernels to explore how both various
|
||
kernel resources impact achieved occupancy, and how this is reported in
|
||
Omniperf.
|
||
|
||
This example was compiled and run on a MI250 accelerator using ROCm
|
||
v5.6.0, and Omniperf v2.0.0:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ hipcc -O3 occupancy.hip -o occupancy --save-temps
|
||
|
||
We have again included the ``--save-temps`` flag to get the
|
||
corresponding assembly.
|
||
|
||
Finally, we generate our Omniperf profile as:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf profile -n occupancy --no-roof -- ./occupancy
|
||
|
||
.. _occupancy-experiment-design:
|
||
|
||
Design note
|
||
-----------
|
||
|
||
For our occupancy test, we need to create a kernel that is resource
|
||
heavy, in various ways. For this purpose, we use the following (somewhat
|
||
funny-looking) kernel:
|
||
|
||
.. code-block:: cpp
|
||
|
||
constexpr int bound = 16;
|
||
__launch_bounds__(256)
|
||
__global__ void vgprbound(int N, double* ptr) {
|
||
double intermediates[bound];
|
||
for (int i = 0 ; i < bound; ++i) intermediates[i] = N * threadIdx.x;
|
||
double x = ptr[threadIdx.x];
|
||
for (int i = 0; i < 100; ++i) {
|
||
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % bound], intermediates[i % bound]));
|
||
intermediates[i % bound] = x;
|
||
}
|
||
if (x == N) ptr[threadIdx.x] = x;
|
||
}
|
||
|
||
Here we try to use as many :ref:`VGPRs <desc-valu>` as possible, to this end:
|
||
|
||
* We create a small array of double precision floats, that we size to try
|
||
to fit into registers (i.e., ``bound``, this may need to be tuned
|
||
depending on the ROCm version).
|
||
|
||
* We specify ``__launch_bounds___(256)``
|
||
to increase the number of VPGRs available to the kernel (by limiting the
|
||
number of wavefronts that can be resident on a
|
||
:doc:`CU </conceptual/compute-unit>`).
|
||
|
||
* Write a unique non-compile time constant to each element of the array.
|
||
|
||
* Repeatedly permute and call relatively expensive math functions on our
|
||
array elements.
|
||
|
||
* Keep the compiler from optimizing out any operations by faking a write to the
|
||
``ptr`` based on a run-time conditional.
|
||
|
||
This yields a total of 122 VGPRs, but it is expected this number will
|
||
depend on the exact ROCm/compiler version.
|
||
|
||
.. code-block:: asm
|
||
|
||
.size _Z9vgprboundiPd, .Lfunc_end1-_Z9vgprboundiPd
|
||
; -- End function
|
||
.section .AMDGPU.csdata
|
||
; Kernel info:
|
||
; codeLenInByte = 4732
|
||
; NumSgprs: 68
|
||
; NumVgprs: 122
|
||
; NumAgprs: 0
|
||
; <...>
|
||
; AccumOffset: 124
|
||
|
||
We will use various permutations of this kernel to limit occupancy, and
|
||
more importantly for the purposes of this example, demonstrate how this
|
||
is reported in Omniperf.
|
||
|
||
.. _vgpr-occupancy:
|
||
|
||
VGPR limited
|
||
------------
|
||
|
||
For our first test, we use the ``vgprbound`` kernel discussed in the
|
||
:ref:`design note <occupancy-experiment-design>`. After profiling, we run
|
||
the analyze step on this kernel:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
|
||
<...>
|
||
--------------------------------------------------------------------------------
|
||
0. Top Stat
|
||
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
|
||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
|
||
│ 0 │ vgprbound(int, double*) │ 1.00 │ 923093822.50 │ 923093822.50 │ 923093822.50 │ 100.00 │
|
||
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
2. System Speed-of-Light
|
||
2.1 Speed-of-Light
|
||
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
|
||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
|
||
│ 2.1.15 │ Wavefront Occupancy │ 1661.24 │ Wavefronts │ 3328.00 │ 49.92 │
|
||
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
6. Workgroup Manager (SPI)
|
||
6.2 Workgroup Manager - Resource Allocation
|
||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.64 │ 0.64 │ 0.64 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.94 │ 24.94 │ 24.94 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.49 │ 24.49 │ 24.49 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 94.90 │ 94.90 │ 94.90 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
7. Wavefront
|
||
7.1 Wavefront Launch Stats
|
||
╒═════════╤══════════╤════════╤════════╤════════╤═══════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪══════════╪════════╪════════╪════════╪═══════════╡
|
||
│ 7.1.5 │ VGPRs │ 124.00 │ 124.00 │ 124.00 │ Registers │
|
||
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
|
||
│ 7.1.6 │ AGPRs │ 4.00 │ 4.00 │ 4.00 │ Registers │
|
||
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
|
||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||
╘═════════╧══════════╧════════╧════════╧════════╧═══════════╛
|
||
|
||
Here we see that the kernel indeed does use *around* (but not exactly)
|
||
122 VGPRs, with the difference due to granularity of VGPR allocations.
|
||
In addition, we see that we have allocated 4 “:ref:`AGPRs <desc-agprs>`”. We
|
||
note that on current CDNA2 accelerators, the ``AccumOffset`` field of
|
||
the assembly metadata:
|
||
|
||
.. code-block:: asm
|
||
|
||
; AccumOffset: 124
|
||
|
||
denotes the divide between ``VGPRs`` and ``AGPRs``.
|
||
|
||
Next, we examine our wavefront occupancy (**2.1.15**), and see that we are
|
||
reaching only :math:`\sim50\%` of peak occupancy. As a result, we see
|
||
that:
|
||
|
||
- We are not scheduling workgroups :math:`\sim25\%` of
|
||
:ref:`total scheduler-pipe cycles <total-pipe-cycles>` (**6.2.1**); recall
|
||
from the discussion of the `workgroup manager <desc-spi>`, 25% is the maximum.
|
||
|
||
- The scheduler-pipe is stalled (**6.2.2**) from scheduling workgroups due to
|
||
resource constraints for the same :math:`\sim25\%` of the time.
|
||
|
||
- And finally, :math:`\sim91\%` of those stalls are due to a lack of SIMDs
|
||
with the appropriate number of VGPRs available (6.2.5).
|
||
|
||
That is, the reason we can’t reach full occupancy is due to our VGPR
|
||
usage, as expected!
|
||
|
||
LDS limited
|
||
-----------
|
||
|
||
To examine an LDS limited example, we must change our kernel slightly:
|
||
|
||
.. code-block:: cpp
|
||
|
||
constexpr size_t fully_allocate_lds = 64ul * 1024ul / sizeof(double);
|
||
__launch_bounds__(256)
|
||
__global__ void ldsbound(int N, double* ptr) {
|
||
__shared__ double intermediates[fully_allocate_lds];
|
||
for (int i = threadIdx.x ; i < fully_allocate_lds; i += blockDim.x) intermediates[i] = N * threadIdx.x;
|
||
__syncthreads();
|
||
double x = ptr[threadIdx.x];
|
||
for (int i = threadIdx.x; i < fully_allocate_lds; i += blockDim.x) {
|
||
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % fully_allocate_lds], intermediates[i % fully_allocate_lds]));
|
||
__syncthreads();
|
||
intermediates[i % fully_allocate_lds] = x;
|
||
}
|
||
if (x == N) ptr[threadIdx.x] = x;
|
||
}
|
||
|
||
Where we now:
|
||
|
||
* Allocate an 64 KiB LDS array per workgroup, and
|
||
|
||
* Use our allocated LDS array instead of a register array
|
||
|
||
Analyzing this:
|
||
|
||
.. code-block:: shell
|
||
|
||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
|
||
<...>
|
||
--------------------------------------------------------------------------------
|
||
2. System Speed-of-Light
|
||
2.1 Speed-of-Light
|
||
╒═════════╤═════════════════════╤════════╤════════════╤═════════╤═══════════════╕
|
||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||
╞═════════╪═════════════════════╪════════╪════════════╪═════════╪═══════════════╡
|
||
│ 2.1.15 │ Wavefront Occupancy │ 415.52 │ Wavefronts │ 3328.00 │ 12.49 │
|
||
╘═════════╧═════════════════════╧════════╧════════════╧═════════╧═══════════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
6. Workgroup Manager (SPI)
|
||
6.2 Workgroup Manager - Resource Allocation
|
||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.13 │ 0.13 │ 0.13 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.87 │ 24.87 │ 24.87 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.84 │ 24.84 │ 24.84 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.7 │ Insufficient CU LDS │ 96.47 │ 96.47 │ 96.47 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
7. Wavefront
|
||
7.1 Wavefront Launch Stats
|
||
╒═════════╤════════════════╤══════════╤══════════╤══════════╤═══════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════╪══════════╪══════════╪══════════╪═══════════╡
|
||
│ 7.1.5 │ VGPRs │ 96.00 │ 96.00 │ 96.00 │ Registers │
|
||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
|
||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||
│ 7.1.8 │ LDS Allocation │ 65536.00 │ 65536.00 │ 65536.00 │ Bytes │
|
||
╘═════════╧════════════════╧══════════╧══════════╧══════════╧═══════════╛
|
||
|
||
We see that our VGPR allocation has gone down to 96 registers, but now
|
||
we see our 64KiB LDS allocation (**7.1.8**). In addition, we see a similar
|
||
non-schedule rate (**6.2.1**) and stall rate (**6.2.2**) as in our
|
||
:ref:`VGPR example <vgpr-occupancy>`. However, our occupancy limiter has now
|
||
shifted from VGPRs (**6.2.5**) to LDS (**6.2.7**).
|
||
|
||
We note that although we see the around the same scheduler/stall rates
|
||
(with our LDS limiter), our wave occupancy (**2.1.15**) is significantly
|
||
lower (:math:`\sim12\%`)! This is important to remember: the occupancy
|
||
limiter metrics in the resource allocation section tell you what the
|
||
limiter was, but *not* how much the occupancy was limited. These metrics
|
||
should always be analyzed in concert with the wavefront occupancy
|
||
metric!
|
||
|
||
.. _sgpr-occupancy:
|
||
|
||
SGPR limited
|
||
------------
|
||
|
||
Finally, we modify our kernel once more to make it limited by
|
||
:ref:`SGPRs <desc-salu>`:
|
||
|
||
.. code-block:: cpp
|
||
|
||
constexpr int sgprlim = 1;
|
||
__launch_bounds__(1024, 8)
|
||
__global__ void sgprbound(int N, double* ptr) {
|
||
double intermediates[sgprlim];
|
||
for (int i = 0 ; i < sgprlim; ++i) intermediates[i] = i;
|
||
double x = ptr[0];
|
||
#pragma unroll 1
|
||
for (int i = 0; i < 100; ++i) {
|
||
x += sin(pow(intermediates[(i - 1) % sgprlim], intermediates[i % sgprlim]));
|
||
intermediates[i % sgprlim] = x;
|
||
}
|
||
if (x == N) ptr[0] = x;
|
||
}
|
||
|
||
The major changes here are to: - make as much as possible provably
|
||
uniform across the wave (notice the lack of ``threadIdx.x`` in the
|
||
``intermediates`` initialization and elsewhere), - addition of
|
||
``__launch_bounds__(1024, 8)``, which reduces our maximum VGPRs to 64
|
||
(such that 8 waves can fit per SIMD), but causes some register spills
|
||
(i.e., :ref:`scratch <memory-spaces>` usage), and - lower the ``bound`` (here we
|
||
use ``sgprlim``) of the array to reduce VGPR/Scratch usage.
|
||
|
||
This results in the following assembly metadata for this kernel:
|
||
|
||
.. code-block:: asm
|
||
|
||
.size _Z9sgprboundiPd, .Lfunc_end3-_Z9sgprboundiPd
|
||
; -- End function
|
||
.section .AMDGPU.csdata
|
||
; Kernel info:
|
||
; codeLenInByte = 4872
|
||
; NumSgprs: 76
|
||
; NumVgprs: 64
|
||
; NumAgprs: 0
|
||
; TotalNumVgprs: 64
|
||
; ScratchSize: 60
|
||
; <...>
|
||
; AccumOffset: 64
|
||
; Occupancy: 8
|
||
|
||
Analyzing this workload yields:
|
||
|
||
.. code-block:: shell-session
|
||
|
||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
|
||
<...>
|
||
--------------------------------------------------------------------------------
|
||
0. Top Stat
|
||
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
|
||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
|
||
│ 0 │ sgprbound(int, double*) │ 1.00 │ 782069812.00 │ 782069812.00 │ 782069812.00 │ 100.00 │
|
||
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
2. System Speed-of-Light
|
||
2.1 Speed-of-Light
|
||
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
|
||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
|
||
│ 2.1.15 │ Wavefront Occupancy │ 3291.76 │ Wavefronts │ 3328.00 │ 98.91 │
|
||
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
6. Workgroup Manager (SPI)
|
||
6.2 Workgroup Manager - Resource Allocation
|
||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 7.72 │ 7.72 │ 7.72 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 15.17 │ 15.17 │ 15.17 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 7.38 │ 7.38 │ 7.38 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.3 │ Scratch Stall Rate │ 39.76 │ 39.76 │ 39.76 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 26.32 │ 26.32 │ 26.32 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 26.32 │ 26.32 │ 26.32 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 25.52 │ 25.52 │ 25.52 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||
|
||
|
||
--------------------------------------------------------------------------------
|
||
7. Wavefront
|
||
7.1 Wavefront Launch Stats
|
||
╒═════════╤════════════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||
╞═════════╪════════════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||
│ 7.1.5 │ VGPRs │ 64.00 │ 64.00 │ 64.00 │ Registers │
|
||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
|
||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||
│ 7.1.8 │ LDS Allocation │ 0.00 │ 0.00 │ 0.00 │ Bytes │
|
||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||
│ 7.1.9 │ Scratch Allocation │ 60.00 │ 60.00 │ 60.00 │ Bytes/workitem │
|
||
╘═════════╧════════════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||
|
||
Here we see that our wavefront launch stats (**7.1**) have changed to
|
||
reflect the metadata seen in the ``--save-temps`` output. Of particular
|
||
interest, we see:
|
||
|
||
* The SGPR allocation (**7.1.7**) is 80 registers, slightly more than the 76
|
||
requested by the compiler due to allocation granularity, and
|
||
|
||
* We have a :ref:`"scratch" <memory-spaces>`, that is, private memory,
|
||
allocation of 60 bytes per work-item.
|
||
|
||
Analyzing the resource allocation block (**6.2**) we now see that for the
|
||
first time, the "Not-scheduled Rate (Workgroup Manager)" metric (**6.2.0**)
|
||
has become non-zero. This is because the workgroup manager is
|
||
responsible for management of scratch, which we see also contributes to
|
||
our occupancy limiters in the "Scratch Stall Rate" (**6.2.3**). Note that
|
||
the sum of the workgroup manager not-scheduled rate and the
|
||
scheduler-pipe non-scheduled rate is still :math:`\sim25\%`, as in our
|
||
previous examples.
|
||
|
||
Next, we see that the scheduler-pipe stall rate (**6.2.2**), that is, how often
|
||
we could not schedule a workgroup to a CU, was only about
|
||
:math:`\sim8\%`. This hints that perhaps, our kernel is not
|
||
*particularly* occupancy limited by resources. Indeed, checking the
|
||
wave occupancy metric (**2.1.15**) shows that this kernel is reaching nearly
|
||
99% occupancy.
|
||
|
||
Finally, we inspect the occupancy limiter metrics and see a roughly even
|
||
split between :ref:`waveslots <desc-valu>` (**6.2.4**), :ref:`VGPRs <desc-valu>`
|
||
(**6.2.5**), and :ref:`SGPRs <desc-salu>` (**6.2.6**) along with the scratch stalls
|
||
(**6.2.3**) previously mentioned.
|
||
|
||
This is yet another reminder to view occupancy holistically. While these
|
||
metrics tell you why a workgroup cannot be scheduled, they do *not* tell
|
||
you what your occupancy was (consult wavefront occupancy) *nor* whether
|
||
increasing occupancy will be beneficial to performance.
|