Docs: refactor and integrate into ROCm docs portal (#362)

* pip-compile docs/requirements.txt

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add Sphinx docs config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add Sphinx config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update docs build config

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* style(conf.py): Apply black formatting to docs/conf.py

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

* Update docs requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update to rocm-docs-core 1.3.0

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update docs requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

pip-compile requirements

Signed-off-by: Peter Jun Park <peter.park@amd.com>

bump rocm-docs-core to 1.5.0

bump rocm-docs-core to 1.4.1

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* Add dependabot.yml and update CODEOWNERS

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Update toc and conf

Signed-off-by: Peter Jun Park <peter.park@amd.com>

update dependabot

* Port docs to rocm-docs standard

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add toc and Diataxis cards

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add basic file structure

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add glossary

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add includes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

Add license.rst

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add compatible hw

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix spelling and license

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up index

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up installation guides

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add basic usage (quickstart)

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add ref to global options

update toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

modularize modes and global options

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add profile mode

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg and clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add dynamic omniperf version number in installation guide

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add datatemplate

more reorg

Signed-off-by: Peter Jun Park <peter.park@amd.com>

clean up

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg images

move profile mode

reorg

reorg

reorg more

fix formatting

fix headings

ref anchor mi2xx note

add extlinks

add extlinks

Signed-off-by: Peter Jun Park <peter.park@amd.com>

black format

fix formatting, anchors

Signed-off-by: Peter Jun Park <peter.park@amd.com>

reorg

fix words and formatting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

formatting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

same

reorg

format

fix formatting

fix toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

format

* impr internal linking and fix sphinx warnings

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add spellcheck/linting from rocm-docs-core

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix rst directives

satisfy spellcheck

fix more spelling

rm unused files

fix spelling and update wordlist

* bump rocm-docs-core to 1.6.0

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add fixes from @skyreflectedinmirrors and @lpaoletti

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add references to toc

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add more fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add package manager install section

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* add fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add metadata and fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add fixes

bump to 1.6.1

more fixes

fix fmt in profiling examples

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add missing mem type table

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix formatting

fmt

* add custom css

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix css fs

* make images/figs click-to-expand

Signed-off-by: Peter Jun Park <peter.park@amd.com>

add missed image

update

fix link

* update documentation link in README

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* formatting fixes

Signed-off-by: Peter Jun Park <peter.park@amd.com>

more formatting

* fix heading

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* move archived docs

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* exclude archived docs from docs build

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* update archived docs workflow

Signed-off-by: Peter Jun Park <peter.park@amd.com>

move files

update archived docs workflow

Signed-off-by: Peter Jun Park <peter.park@amd.com>

fix version number

clean up workflow

workflow test

workflow test

another workflow test

* rm docs linting

Signed-off-by: Peter Jun Park <peter.park@amd.com>

* Apply cmake-format suggested changes

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

* Apply cmake-format

Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

---------

Signed-off-by: Peter Jun Park <peter.park@amd.com>
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>

[ROCm/rocprofiler-compute commit: a0dc485ceb]
Этот коммит содержится в:
Peter Park
2024-07-31 10:42:27 -04:00
коммит произвёл David Galiffi
родитель ddbc208489
Коммит 5d22d5ac8e
307 изменённых файлов: 16783 добавлений и 34 удалений
+675
Просмотреть файл
@@ -0,0 +1,675 @@
.. _infinity-fabric-example:
Infinity Fabric transactions
============================
For this example, consider the
:dev-sample:`Infinity Fabric™ sample <fabric.hip>` distributed as a part of
Omniperf.
This following code snippet launches a simple read-only kernel.
.. code-block:: cpp
// the main streaming kernel
__global__ void kernel(int* x, size_t N, int zero) {
int sum = 0;
const size_t offset_start = threadIdx.x + blockIdx.x * blockDim.x;
for (int i = 0; i < 10; ++i) {
for (size_t offset = offset_start; offset < N; offset += blockDim.x * gridDim.x) {
sum += x[offset];
}
}
if (sum != 0) {
x[offset_start] = sum;
}
}
This happens twice -- once as a warm-up and once for analysis. Note that the
buffer ``x`` is initialized to all zeros via a call to ``hipMemcpy`` on the
host before the kernel is ever launched. Therefore, the following conditional
is identically false -- and thus we expect no writes.
.. code-block:: cpp
if (sum != 0) { ...
.. note::
The actual sample included with Omniperf also includes the ability to select
different operation types (such as atomics, writes). This abbreviated version
is presented here for reference only.
Finally, this sample code lets the user control the
:ref:`granularity of an allocation <memory-type>`, the owner of an allocation
(local HBM, CPU DRAM or remote HBM), and the size of an allocation (the default
is :math:`\sim4`\ GiB) via command line arguments. In doing so, we can explore
the impact of these parameters on the L2-Fabric metrics reported by Omniperf to
further understand their meaning.
.. note::
All results in this section were generated an a node of Infinity
Fabric connected MI250 accelerators using ROCm version 5.6.0, and Omniperf
version 2.0.0. Although results may vary with ROCm versions and accelerator
connectivity, we expect the lessons learned here to be broadly applicable.
.. _infinity-fabric-ex1:
Experiment 1: Coarse-grained, accelerator-local HBM reads
-----------------------------------------------------------
In our first experiment, we consider the simplest possible case, a
``hipMalloc``\ d buffer that is local to our current accelerator:
.. code-block:: shell-session
$ omniperf profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
Using:
mtype:CoarseGrained
mowner:Device
mspace:Global
mop:Read
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/coarse_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0 │ L2-Fabric Read BW │ 42947428672.00 │ 42947428672.00 │ 42947428672.00 │ Bytes per kernel │
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1 │ HBM Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2 │ Remote Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.07 │ 0.07 │ 0.07 │ Pct │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1 │ Read (Uncached) │ 1450.00 │ 1450.00 │ 1450.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2 │ Read (64B) │ 671053573.00 │ 671053573.00 │ 671053573.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3 │ HBM Read │ 671053565.00 │ 671053565.00 │ 671053565.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4 │ Remote Read │ 8.00 │ 8.00 │ 8.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Here, you can make the following observations.
- The vast majority of L2-Fabric requests (>99%) are 64B
read requests (**17.5.2**).
- Nearly 100% of the read requests (**17.2.1**) are homed in on the
accelerator-local HBM (**17.5.3**), while some small fraction of these reads are
routed to a “remote” device (**17.5.4**).
- These drive a :math:`\sim40`\ GiB per kernel read-bandwidth (**17.2.0**).
In addition, we see a small amount of :ref:`uncached <memory-type>` reads
(**17.5.1**), these correspond to things like:
* The assembly code to execute the kernel
* Kernel arguments
* Coordinate parameters (such as ``blockDim.z``) that were not initialized by the
hardware, etc. and may account for some of our "remote" read requests
(**17.5.4**), for example, reading from CPU DRAM
The above list is not exhaustive, nor are all of these guaranteed to be
"uncached" – the exact implementation depends on the accelerator and
ROCm versions used. These read requests could be interrogated further in
the :ref:`Scalar L1 Data Cache <desc-sl1d>` and
:ref:`Instruction Cache <desc-l1i>` metric sections.
.. note::
The Traffic metrics in Sec **17.2** are presented as a percentage of the total
number of requests. For example, "HBM Read Traffic" is the percent of read requests
(**17.5.0** - **17.5.2**) that were directed to the accelerators' local HBM (**17.5.3**).
.. _infinity-fabric-ex2:
Experiment 2: Fine-grained, accelerator-local HBM reads
---------------------------------------------------------
In this experiment, we change the :ref:`granularity <memory-type>` of our
device-allocation to be fine-grained device memory, local to the current
accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
``hipDeviceMallocFinegrained`` flag to accomplish this.
.. note::
On some systems (such as those with only PCIe® connected accelerators), you need
to set the environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` to enable
this memory type.
.. code-block:: shell-session
$ omniperf profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
Using:
mtype:FineGrained
mowner:Device
mspace:Global
mop:Read
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0 │ L2-Fabric Read BW │ 42948661824.00 │ 42948661824.00 │ 42948661824.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1 │ HBM Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2 │ Remote Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3 │ Uncached Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.07 │ 0.07 │ 0.07 │ Pct │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1 │ Read (Uncached) │ 1334.00 │ 1334.00 │ 1334.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2 │ Read (64B) │ 671072841.00 │ 671072841.00 │ 671072841.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3 │ HBM Read │ 671072835.00 │ 671072835.00 │ 671072835.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4 │ Remote Read │ 6.00 │ 6.00 │ 6.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Comparing with our :ref:`previous example <infinity-fabric-ex1>`, we see a
relatively similar result, namely:
- The vast majority of L2-Fabric requests are 64B read requests (**17.5.2**)
- Nearly all these read requests are directed to the accelerator-local HBM (**17.2.1**)
In addition, we now see a small percentage of HBM Read Stalls (**17.4.2**),
as streaming fine-grained memory is putting more stress on Infinity
Fabric.
.. note::
The stalls in Sec 17.4 are presented as a percentage of the total number
active L2 cycles, summed over :doc:`all L2 channels </conceptual/l2-cache>`.
.. _infinity-fabric-ex3:
Experiment 3: Fine-grained, remote-accelerator HBM reads
----------------------------------------------------------
In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
be owned by a remote accelerator. We accomplish this by first changing
the HIP device using, for instance, the ``hipSetDevice(1)`` API, then allocating
fine-grained memory (as described :ref:`previously <infinity-fabric-ex2>`), and
finally resetting the device back to the default, for instance,
``hipSetDevice(0)``.
Although we have not changed our code significantly, we do see a
substantial change in the L2-Fabric metrics:
.. code-block:: shell-session
$ omniperf profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
Using:
mtype:FineGrained
mowner:Remote
mspace:Global
mop:Read
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_remote/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0 │ L2-Fabric Read BW │ 42949692736.00 │ 42949692736.00 │ 42949692736.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3 │ Uncached Read Traffic │ 200.00 │ 200.00 │ 200.00 │ Pct │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 17.85 │ 17.85 │ 17.85 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.1 │ Read (Uncached) │ 1342177894.00 │ 1342177894.00 │ 1342177894.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.2 │ Read (64B) │ 671088949.00 │ 671088949.00 │ 671088949.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.3 │ HBM Read │ 307.00 │ 307.00 │ 307.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.4 │ Remote Read │ 671088642.00 │ 671088642.00 │ 671088642.00 │ Req per kernel │
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛
First, we see that while we still observe approximately the same number
of 64B Read Requests (**17.5.2**), we now see an even larger number of
Uncached Read Requests (**17.5.3**). Some simple division reveals:
.. math::
342177894.00 / 671088949.00 ≈ 2
That is, each 64B Read Request is *also* counted as two Uncached Read
Requests, as reflected in the :ref:`request-flow diagram <l2-request-flow>`.
This is also why the Uncached Read Traffic metric (**17.2.3**) is at the
counter-intuitive value of 200%!
In addition, observe that:
- We no longer see any significant number of HBM Read Requests (**17.2.1**,
**17.5.3**), nor HBM Read Stalls (**17.4.2**), but instead,
- we see that almost all of these requests are considered “remote”
(**17.2.2**, **17.5.4**) are being routed to another
accelerator, or the CPU — in this case HIP Device 1 — and,
- we see a significantly larger percentage of AMD Infinity Fabric Read Stalls
(**17.4.1**) as compared to the HBM Read Stalls in the
:ref:`previous example <infinity-fabric-ex2>`.
These stalls correspond to reads that are going out over the AMD
Infinity Fabric connection to another MI250 accelerator. In
addition, because these are crossing between accelerators, we expect
significantly lower achievable bandwidths as compared to the local
accelerators HBM – this is reflected (indirectly) in the magnitude of
the stall metric (**17.4.1**). Finally, we note that if our system contained
only PCIe connected accelerators, these observations will differ.
.. _infinity-fabric-ex4:
Experiment 4: Fine-grained, CPU-DRAM reads
--------------------------------------------
In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
be owned by the CPUs DRAM. We accomplish this by allocating host-pinned
fine-grained memory using the ``hipHostMalloc`` API:
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
Using:
mtype:FineGrained
mowner:Host
mspace:Global
mop:Read
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0 │ L2-Fabric Read BW │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3 │ Uncached Read Traffic │ 200.00 │ 200.00 │ 200.00 │ Pct │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 91.29 │ 91.29 │ 91.29 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.1 │ Read (Uncached) │ 1342177848.00 │ 1342177848.00 │ 1342177848.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.2 │ Read (64B) │ 671088926.00 │ 671088926.00 │ 671088926.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.3 │ HBM Read │ 284.00 │ 284.00 │ 284.00 │ Req per kernel │
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
│ 17.5.4 │ Remote Read │ 671088642.00 │ 671088642.00 │ 671088642.00 │ Req per kernel │
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛
Here we see *almost* the same results as in the
:ref:`previous experiment <infinity-fabric-ex3>`, however now as we are crossing
a PCIe bus to the CPU, we see that the Infinity Fabric Read stalls (**17.4.1**)
have shifted to be a PCIe stall (**17.4.2**). In addition, as (on this
system) the PCIe bus has a lower peak bandwidth than the AMD Infinity
Fabric connection between two accelerators, we once again observe an
increase in the percentage of stalls on this interface.
.. note::
Had we performed this same experiment on an
`MI250X system <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_,
these transactions would again have been marked as Infinity Fabric Read
stalls (**17.4.1**), as the CPU is connected to the accelerator via AMD Infinity
Fabric.
.. _infinity-fabric-ex5:
Experiment 5: Coarse-grained, CPU-DRAM reads
----------------------------------------------
In our next fabric experiment, we change our CPU memory allocation to be
`coarse-grained <Mtype>`__. We accomplish this by passing the
``hipHostMalloc`` API the ``hipHostMallocNonCoherent`` flag, to mark the
allocation as coarse-grained:
.. code-block:: shell-session
$ omniperf profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
Using:
mtype:CoarseGrained
mowner:Host
mspace:Global
mop:Read
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/coarse_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.0 │ L2-Fabric Read BW │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.3 │ Uncached Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 91.27 │ 91.27 │ 91.27 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.1 │ Read (Uncached) │ 562.00 │ 562.00 │ 562.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.2 │ Read (64B) │ 671088926.00 │ 671088926.00 │ 671088926.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.3 │ HBM Read │ 281.00 │ 281.00 │ 281.00 │ Req per kernel │
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.4 │ Remote Read │ 671088645.00 │ 671088645.00 │ 671088645.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Here we see a similar result to our
:ref:`previous experiment <infinity-fabric-ex4>`, with one key difference: our
accesses are no longer marked as Uncached Read requests (**17.2.3, 17.5.1**), but instead
are 64B read requests (**17.5.2**), as observed in our
:ref:`Coarse-grained, accelerator-local HBM <infinity-fabric-ex1>` experiment.
.. _infinity-fabric-ex6:
Experiment 6: Fine-grained, CPU-DRAM writes
--------------------------------------------
Thus far in our exploration of the L2-Fabric interface, we have
primarily focused on read operations. However, in
:ref:`our request flow diagram <l2-request-flow>`, we note that writes are
counted separately. To observe this, we use the ``-p`` flag to trigger write
operations to fine-grained memory allocated on the host:
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
Using:
mtype:FineGrained
mowner:Host
mspace:Global
mop:Write
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host_writes/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
│ 17.2.4 │ L2-Fabric Write and Atomic BW │ 42949672960.00 │ 42949672960.00 │ 42949672960.00 │ Bytes per kernel │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.5 │ HBM Write and Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.6 │ Remote Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.7 │ Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
│ 17.2.8 │ Uncached Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
╘═════════╧═══════════════════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.3 │ Write - PCIe Stall │ PCIe Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.4 │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.5 │ Write - HBM Stall │ HBM Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.6 │ Write - Credit Starvation │ Credit Starvation │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
│ 17.5.5 │ Write (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.6 │ Write (Uncached) │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.7 │ Write (64B) │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.8 │ HBM Write and Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.9 │ Remote Write and Atomic │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
│ 17.5.10 │ Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
╘═════════╧═════════════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Here we notice a few changes in our request pattern:
* As expected, the requests have changed from 64B Reads to 64B Write requests
(**17.5.7**),
* these requests are homed in on a “remote” destination (**17.2.6, 17.5.9**), as
expected, and
* these are also counted as a single Uncached Write request (**17.5.6**).
In addition, there are rather significant changes in the bandwidth values
reported:
- The “L2-Fabric Write and Atomic” bandwidth metric (**17.2.4**)
reports about 40GiB of data written across Infinity Fabric while
- The “Remote Write and Traffic” metric (**17.2.5**) indicates that nearly
100% of these request are being directed to a remote source.
The precise meaning of these metrics are explored in the
:ref:`subsequent experiment <infinity-fabric-ex7>`.
Finally, we note that we see no write stalls on the PCIe bus
(**17.4.3**). This is because writes over a PCIe bus `are
non-posted <https://members.pcisig.com/wg/PCI-SIG/document/10912>`_,
that is, they do not require acknowledgement.
.. _infinity-fabric-ex7:
Experiment 7: Fine-grained, CPU-DRAM atomicAdd
------------------------------------------------
Next, we change our experiment to instead target ``atomicAdd``
operations to the CPUs DRAM.
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
Using:
mtype:FineGrained
mowner:Host
mspace:Global
mop:Add
mdata:Unsigned
remoteId:-1
<...>
$ omniperf analyze -p workloads/fine_grained_host_add/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
<...>
17. L2 Cache
17.2 L2 - Fabric Transactions
╒═════════╤═══════════════════════════════════╤══════════════╤══════════════╤══════════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════════════════╪══════════════╪══════════════╪══════════════╪══════════════════╡
│ 17.2.4 │ L2-Fabric Write and Atomic BW │ 429496736.00 │ 429496736.00 │ 429496736.00 │ Bytes per kernel │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.5 │ HBM Write and Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.6 │ Remote Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.7 │ Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
│ 17.2.8 │ Uncached Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
╘═════════╧═══════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════════╛
17.4 L2 - Fabric Interface Stalls
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
│ 17.4.3 │ Write - PCIe Stall │ PCIe Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.4 │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.5 │ Write - HBM Stall │ HBM Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
│ 17.4.6 │ Write - Credit Starvation │ Credit Starvation │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
17.5 L2 - Fabric Detailed Transaction Breakdown
╒═════════╤═════════════════════════╤═════════════╤═════════════╤═════════════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════════╪═════════════╪═════════════╪═════════════╪════════════════╡
│ 17.5.5 │ Write (32B) │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.6 │ Write (Uncached) │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.7 │ Write (64B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.8 │ HBM Write and Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.9 │ Remote Write and Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
│ 17.5.10 │ Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
╘═════════╧═════════════════════════╧═════════════╧═════════════╧═════════════╧════════════════╛
In this case, there is quite a lot to unpack:
- For the first time, the 32B Write requests (**17.5.5**) are heavily used.
- These correspond to Atomic requests (**17.2.7, 17.5.10**), and are counted as
Uncached Writes (**17.5.6**).
- The L2-Fabric Write and Atomic bandwidth metric (**17.2.4**) shows about 0.4
GiB of traffic. For convenience, the sample reduces the default problem size
for this case due to the speed of atomics across a PCIe bus, and finally,
- The traffic is directed to a remote device (**17.2.6, 17.5.9**).
Let's consider what an “atomic” request means in this context. Recall
that we are discussing memory traffic flowing from the L2 cache, the
device-wide coherence point on current CDNA accelerators such as the
MI250, to for example, the CPUs DRAM. In this light, we see that these
requests correspond to *system scope* atomics, and specifically in the
case of the MI250, to fine-grained memory.
.. rubric:: Disclaimer
PCIe® is a registered trademark of PCI-SIG Corporation.
..
`Leave as possible future experiment to add
### Experiment #2 - Non-temporal writes
If we take the same code (for convenience only) as previously described, we can demonstrate how to achieve 'streaming' writes, as described in the [L2 Cache Access metrics](L2_cache_metrics) section.
To see this, we use the Clang built-in [`__builtin_nontemporal_store`](https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins), for example
```
template<typename T>
__device__ void store (T* ptr, T val) {
__builtin_nontemporal_store(val, ptr);
}
```
On an AMD MI2XX accelerator, for FP32 values this will generate a `global_store_dword` instruction, with the `glc` and `slc` bits set, described in [section 10.1](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf) of the CDNA2 ISA guide.`
@@ -0,0 +1,486 @@
.. _ipc-example:
Instructions-per-cycle and utilizations example
===============================================
For this example, consider the
:dev-sample:`instructions-per-cycle (IPC) example <ipc.hip>` included with
Omniperf.
This example is compiled using ``c++17`` support:
.. code-block:: shell
$ hipcc -O3 ipc.hip -o ipc -std=c++17
and was run on an MI250 CDNA2 accelerator:
.. code-block:: shell
$ omniperf profile -n ipc --no-roof -- ./ipc
The results shown in this section are *generally* applicable to CDNA
accelerators, but may vary between generations and specific products.
.. _ipc-experiment-design-note:
Design note
-----------
The kernels in this example all execute a specific assembly operation
``N`` times (1000, by default), for instance the ``vmov`` kernel:
.. code-block:: cpp
template<int N=1000>
__device__ void vmov_op() {
int dummy;
if constexpr (N >= 1) {
asm volatile("v_mov_b32 v0, v1\n" : : "{v31}"(dummy));
vmov_op<N - 1>();
}
}
template<int N=1000>
__global__ void vmov() {
vmov_op<N>();
}
The kernels are then launched twice, once for a warm-up run, and once
for measurement.
.. _ipc-valu-utilization:
VALU utilization and IPC
------------------------
Now we can use our test to measure the achieved instructions-per-cycle
of various types of instructions. We start with a simple :ref:`VALU <desc-valu>`
operation, i.e., a ``v_mov_b32`` instruction, e.g.:
.. code-block:: asm
v_mov_b32 v0, v1
This instruction simply copies the contents from the source register
(``v1``) to the destination register (``v0``). Investigating this kernel
with Omniperf, we see:
.. code-block:: shell-session
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 7 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│ 0 │ void vmov<1000>() [clone .kd] │ 1.00 │ 99317423.00 │ 99317423.00 │ 99317423.00 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2 │ SALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3 │ VALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5 │ Branch Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6 │ VALU Active Threads │ 64.0 │ 64.0 │ 64.0 │ Threads │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
Here we see that:
1. Both the IPC (**11.2.0**) and “Issued” IPC (**11.2.1**) metrics are
:math:`\sim 1`
2. The VALU Utilization metric (**11.2.3**) is also :math:`\sim100\%`, and
finally
3. The VALU Active Threads metric (**11.2.4**) is 64, i.e., the wavefront
size on CDNA accelerators, as all threads in the wavefront are
active.
We will explore the difference between the IPC (**11.2.0**) and “Issued” IPC
(**11.2.1**) metrics in the :ref:`next section <issued-ipc>`.
Additionally, we notice a small (0.1%) Branch utilization (**11.2.5**).
Inspecting the assembly of this kernel shows there are no branch
operations, however recalling the note in the :ref:`Pipeline
statistics <pipeline-stats>` section:
The branch utilization <…> includes time spent in other instruction
types (namely: ``s_endpgm``) that are *typically* a very small
percentage of the overall kernel execution.
We see that this is coming from execution of the ``s_endpgm``
instruction at the end of every wavefront.
.. note::
Technically, the cycle counts used in the denominators of our IPC metrics are
actually in units of quad-cycles, a group of 4 consecutive cycles. However, a
typical :ref:`VALU <desc-valu>` instruction on CDNA accelerators runs for a
single quad-cycle (see :gcn-crash-course:`30`). Therefore, for simplicity, we
simply report these metrics as "instructions per cycle".
.. _issued-ipc:
Exploring “issued” IPC via MFMA operations
------------------------------------------
.. warning::
The MFMA assembly operations used in this example are inherently not portable
to older CDNA architectures.
Unlike the simple quad-cycle ``v_mov_b32`` operation discussed in our
:ref:`previous example <ipc-valu-utilization>`, some operations take many
quad-cycles to execute. For example, using the
`AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator#example-of-querying-instruction-information>`_
we can see that some :ref:`MFMA <desc-mfma>` operations take 64 cycles, e.g.:
.. code-block:: shell
$ ./matrix_calculator.py --arch CDNA2 --detail-instruction --instruction v_mfma_f32_32x32x8bf16_1k
Architecture: CDNA2
Instruction: V_MFMA_F32_32X32X8BF16_1K
<...>
Execution statistics:
FLOPs: 16384
Execution cycles: 64
FLOPs/CU/cycle: 1024
Can co-execute with VALU: True
VALU co-execution cycles possible: 60
What happens to our IPC when we utilize this ``v_mfma_f32_32x32x8bf16_1k``
instruction on a CDNA2 accelerator? To find out, we turn to our ``mfma`` kernel
in the IPC example:
.. code-block:: shell
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 8 -b 11.2 --decimal 4
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤══════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪══════════╡
0 │ void mfma<1000>() [clone .kd] │ 1.0000 │ 1623167595.0000 │ 1623167595.0000 │ 1623167595.0000 │ 100.0000 │
╘════╧═══════════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧══════════╛
--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═════════╤═════════╤═════════╤══════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪═════════╪═════════╪═════════╪══════════════╡
│ 11.2.0 │ IPC │ 0.0626 │ 0.0626 │ 0.0626 │ Instr/cycle │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.1 │ IPC (Issued) │ 1.0000 │ 1.0000 │ 1.0000 │ Instr/cycle │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.2 │ SALU Util │ 0.0000 │ 0.0000 │ 0.0000 │ Pct │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.3 │ VALU Util │ 6.2496 │ 6.2496 │ 6.2496 │ Pct │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.4 │ VMEM Util │ 0.0000 │ 0.0000 │ 0.0000 │ Pct │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.5 │ Branch Util │ 0.0062 │ 0.0062 │ 0.0062 │ Pct │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.6 │ VALU Active Threads │ 64.0000 │ 64.0000 │ 64.0000 │ Threads │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.7 │ MFMA Util │ 99.9939 │ 99.9939 │ 99.9939 │ Pct │
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
│ 11.2.8 │ MFMA Instr Cycles │ 64.0000 │ 64.0000 │ 64.0000 │ Cycles/instr │
╘═════════╧═════════════════════╧═════════╧═════════╧═════════╧══════════════╛
In contrast to our :ref:`VALU IPC example <ipc-valu-utilization>`, we now see
that the IPC metric (**11.2.0**) and Issued IPC (**11.2.1**) metric differ
substantially. First, we see the VALU utilization (**11.2.3**) has decreased
substantially, from nearly 100% to :math:`\sim6.25\%`. We note that this matches
the ratio of: :math:`((Execution\ cycles) - (VALU\ coexecution\ cycles)) / (Execution\ cycles)`
reported by the matrix calculator, while the MFMA utilization (**11.2.7**)
has increased to nearly 100%.
Recall that our ``v_mfma_f32_32x32x8bf16_1k`` instruction takes 64 cycles to
execute, or 16 quad-cycles, matching our observed MFMA Instruction
Cycles (**11.2.8**). That is, we have a single instruction executed every 16
quad-cycles, or :math:`1/16 = 0.0625`, which is almost identical to our IPC
metric (**11.2.0**). Why then is the Issued IPC metric (**11.2.1**) equal to 1.0?
Instead of simply counting the number of instructions issued and
dividing by the number of cycles the :doc:`CUs </conceptual/compute-unit>` on
the accelerator were active (as is done for **11.2.0**), this metric is formulated
differently, and instead counts the number of
(non-:ref:`internal <ipc-internal-instructions>`) instructions issued divided
by the number of (quad-) cycles where the :ref:`scheduler <desc-scheduler>` was
actively working on issuing instructions. Thus the Issued IPC metric
(**11.2.1**) gives more of a sense of “what percent of the total number of
:ref:`scheduler <desc-scheduler>` cycles did a wave schedule an instruction?”
while the IPC metric (**11.2.0**) indicates the ratio of the number of
instructions executed over the total
:ref:`active CU cycles <total-active-cu-cycles>`.
.. warning::
There are further complications of the Issued IPC metric (**11.2.1**) that make
its use more complicated. We will be explore that in the
:ref:`following section <ipc-internal-instructions>`. For these reasons,
Omniperf typically promotes use of the regular IPC metric (**11.2.0**), e.g., in
the top-level Speed-of-Light chart.
.. _ipc-internal-instructions:
Internal instructions and IPC
-----------------------------
Next, we explore the concept of an “internal” instruction. From
:gcn-crash-course:`29`, we see a few candidates for internal instructions, and
we choose a ``s_nop`` instruction, which according to the
:mi200-isa-pdf:`CDNA2 ISA guide <>`:
Does nothing; it can be repeated in hardware up to eight times.
Here we choose to use the following no-op to make our point:
.. code-block:: asm
s_nop 0x0
Running this kernel through Omniperf yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 9 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│ 0 │ void snop<1000>() [clone .kd] │ 1.00 │ 14221851.50 │ 14221851.50 │ 14221851.50 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0 │ IPC │ 6.79 │ 6.79 │ 6.79 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2 │ SALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3 │ VALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5 │ Branch Util │ 0.68 │ 0.68 │ 0.68 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6 │ VALU Active Threads │ │ │ │ Threads │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
First, we see that the IPC metric (**11.2.0**) tops our theoretical maximum
of 5 instructions per cycle (discussed in the :ref:`scheduler <desc-scheduler>`
section). How can this be?
Recall that :gcn-crash-course:`27` say “no functional unit” for the internal
instructions. This removes the limitation on the IPC. If we are *only*
issuing internal instructions, we are not issuing to any execution
units! However, workloads such as these are almost *entirely* artificial
(that is, repeatedly issuing internal instructions almost exclusively). In
practice, a maximum of IPC of 5 is expected in almost all cases.
Secondly, note that our “Issued” IPC (**11.2.1**) is still identical to
the one here. Again, this has to do with the details of “internal”
instructions. Recall in our :ref:`previous example <issued-ipc>` we defined
this metric as explicitly excluding internal instruction counts. The
logical question then is, "what *is* this metric counting in our
``s_nop`` kernel?"
The generated assembly looks something like:
.. code-block:: asm
;;#ASMSTART
s_nop 0x0
;;#ASMEND
;;#ASMSTART
s_nop 0x0
;;#ASMEND
;;<... omitting many more ...>
s_endpgm
.section .rodata,#alloc
.p2align 6, 0x0
.amdhsa_kernel _Z4snopILi1000EEvv
Of particular interest here is the ``s_endpgm`` instruction, of which
the `CDNA2 ISA
guide <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__
states:
End of program; terminate wavefront.
This is not on our list of internal instructions from
:gcn-crash-course:`The AMD GCN Architecture <>`, and is therefore counted as part
of our Issued IPC (**11.2.1**). Thus, the issued IPC being equal to one here
indicates that we issued an ``s_endpgm`` instruction every cycle the
:ref:`scheduler <desc-scheduler>` was active for non-internal instructions, which
is expected as this was our *only* non-internal instruction.
SALU Utilization
----------------
Next, we explore a simple :ref:`SALU <desc-salu>` kernel in our on-going IPC and
utilization example. For this case, we select a simple scalar move
operation, for instance:
.. code-block:: asm
s_mov_b32 s0, s1
which, in analogue to our :ref:`v_mov <ipc-valu-utilization>` example, copies the
contents of the source scalar register (``s1``) to the destination
scalar register (``s0``). Running this kernel through Omniperf yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 10 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│ 0 │ void smov<1000>() [clone .kd] │ 1.00 │ 96246554.00 │ 96246554.00 │ 96246554.00 │ 100.00 │
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2 │ SALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3 │ VALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5 │ Branch Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6 │ VALU Active Threads │ │ │ │ Threads │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
Here we see that:
- Both our IPC (**11.2.0**) and Issued IPC (**11.2.1**) are
:math:`\sim1.0` as expected, and
- The SALU Utilization (**11.2.2**) was
nearly 100% as it was active for almost the entire kernel.
VALU Active Threads
-------------------
For our final IPC/Utilization example, we consider a slight modification
of our :ref:`v_mov <ipc-valu-utilization>` example:
.. code-block:: cpp
template<int N=1000>
__global__ void vmov_with_divergence() {
if (threadIdx.x % 64 == 0)
vmov_op<N>();
}
That is, we wrap our :ref:`VALU <desc-valu>` operation inside a conditional
where only one lane in our wavefront is active. Running this kernel
through Omniperf yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 11 -b 11.2
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
│ 0 │ void vmov_with_divergence<1000>() [clone │ 1.00 │ 97125097.00 │ 97125097.00 │ 97125097.00 │ 100.00 │
│ │ .kd] │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
11. Compute Units - Compute Pipeline
11.2 Pipeline Stats
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.2 │ SALU Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.3 │ VALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.5 │ Branch Util │ 0.2 │ 0.2 │ 0.2 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.6 │ VALU Active Threads │ 1.13 │ 1.13 │ 1.13 │ Threads │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
Here we see that once again, our VALU Utilization (**11.2.3**) is nearly
100%. However, we note that the VALU Active Threads metric (**11.2.6**) is
:math:`\sim 1`, which matches our conditional in the source code. So
VALU Active Threads reports the average number of lanes of our wavefront
that are active over all :ref:`VALU <desc-valu>` instructions, or thread
“convergence” (i.e., 1 - :ref:`divergence <desc-divergence>`).
.. note::
1. The act of evaluating a vector conditional in this example typically triggers VALU operations, contributing to why the VALU Active Threads metric is not identically one.
2. This metric is a time (cycle) averaged value, and thus contains an implicit dependence on the duration of various VALU instructions.
Nonetheless, this metric serves as a useful measure of thread-convergence.
Finally, we note that our branch utilization (**11.2.5**) has increased
slightly from our baseline, as we now have a branch (checking the value
of ``threadIdx.x``).
+272
Просмотреть файл
@@ -0,0 +1,272 @@
.. _lds-examples:
LDS examples
============
For this example, consider the
:dev-sample:`LDS sample <lds.hip>` distributed as a part of Omniperf. This
code contains two kernels to explore how both :doc:`LDS </conceptual/local-data-share>` bandwidth and
bank conflicts are calculated in Omniperf.
This example was compiled and run on an MI250 accelerator using ROCm
v5.6.0, and Omniperf v2.0.0.
.. code-block:: shell-session
$ hipcc -O3 lds.hip -o lds
Finally, we generate our ``omniperf profile`` as:
.. code-block:: shell-session
$ omniperf profile -n lds --no-roof -- ./lds
.. _lds-bandwidth:
LDS bandwidth
-------------
To explore our *theoretical LDS bandwidth* metric, we use a simple
kernel:
.. code-block:: cpp
constexpr unsigned max_threads = 256;
__global__ void load(int* out, int flag) {
__shared__ int array[max_threads];
int index = threadIdx.x;
// fake a store to the LDS array to avoid unwanted behavior
if (flag)
array[max_threads - index] = index;
__syncthreads();
int x = array[index];
if (x == int(-1234567))
out[threadIdx.x] = x;
}
Here we:
* Create an array of 256 integers in :doc:`LDS </conceptual/local-data-share>`
* Fake a write to the LDS using the ``flag`` variable (always set to zero on the
host) to avoid dead-code elimination
* Read a single integer per work-item from ``threadIdx.x`` of the LDS array
* If the integer is equal to a magic number (always false), write the value out
to global memory to again, avoid dead-code elimination
Finally, we launch this kernel repeatedly, varying the number of threads
in our workgroup:
.. code-block:: cpp
void bandwidth_demo(int N) {
for (int i = 1; i <= N; ++i)
load<<<1,i>>>(nullptr, 0);
hipDeviceSynchronize();
}
Next, lets analyze the first of our bandwidth kernel dispatches:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤═══════════════════════╤════════╤════════╤════════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪════════╪════════╪════════╪══════════════════╡
│ 12.2.1 │ Theoretical Bandwidth │ 256.00 │ 256.00 │ 256.00 │ Bytes per kernel │
╘═════════╧═══════════════════════╧════════╧════════╧════════╧══════════════════╛
Here we see that our Theoretical Bandwidth metric (**12.2.1**) is reporting
256 Bytes were loaded even though we launched a single work-item
workgroup, and thus only loaded a single integer from LDS. Why is this?
Recall our definition of this metric:
Indicates the maximum amount of bytes that could have been loaded
from/stored to/atomically updated in the LDS per
:ref:`normalization unit <normalization-units>`.
Here we see that this instruction *could* have loaded up to 256 bytes of
data (4 bytes for each work-item in the wavefront), and therefore this
is the expected value for this metric in Omniperf, hence why this metric
is named the “theoretical” bandwidth.
To further illustrate this point we plot the relationship of the
theoretical bandwidth metric (**12.2.1**) as compared to the effective (or
achieved) bandwidth of this kernel, varying the number of work-items
launched from 1 to 256:
.. figure:: ../data/profiling-by-example/ldsbandwidth.png
:align: center
:alt: Comparison of effective bandwidth versus the theoretical bandwidth
metric in Omniperf for our simple example.
:width: 800
Comparison of effective bandwidth versus the theoretical bandwidth
metric in Omniperf for our simple example.
Here we see that the theoretical bandwidth metric follows a step-function. It
increases only when another wavefront issues an LDS instruction for up to 256
bytes of data. Such increases are marked in the plot using dashed lines. In
contrast, the effective bandwidth increases linearly, by 4 bytes, with the
number of work-items in the kernel, N.
.. _lds-bank-conflicts:
Bank conflicts
--------------
Next we explore bank conflicts using a slight modification of our bandwidth
kernel:
.. code-block:: cpp
constexpr unsigned nbanks = 32;
__global__ void conflicts(int* out, int flag) {
constexpr unsigned nelements = nbanks * max_threads;
__shared__ int array[nelements];
// each thread reads from the same bank
int index = threadIdx.x * nbanks;
// fake a store to the LDS array to avoid unwanted behavior
if (flag)
array[max_threads - index] = index;
__syncthreads();
int x = array[index];
if (x == int(-1234567))
out[threadIdx.x] = x;
}
Here we:
* Allocate an :doc:`LDS </conceptual/local-data-share>` array of size
:math:`32*256*4{B}=32{KiB}`
* Fake a write to the LDS using the ``flag``
variable (always set to zero on the host) to avoid dead-code elimination
* Read a single integer per work-item from index
``threadIdx.x * nbanks`` of the LDS array
* If the integer is equal to a
magic number (always false), write the value out to global memory to,
again, avoid dead-code elimination.
On the host, we again repeatedly launch this kernel, varying the number
of work-items:
.. code-block:: cpp
void conflicts_demo(int N) {
for (int i = 1; i <= N; ++i)
conflicts<<<1,i>>>(nullptr, 0);
hipDeviceSynchronize();
}
Analyzing our first ``conflicts`` kernel (i.e., a single work-item), we
see:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4 │ Index Accesses │ 2.00 │ 2.00 │ 2.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6 │ Bank Conflict │ 0.00 │ 0.00 │ 0.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
In our :ref:`previous example <lds-bank-conflicts>`, we showed how a load
from a single work-item is considered to have a theoretical bandwidth of
256B. Recall, the :doc:`LDS </conceptual/local-data-share>` can load up to :math:`128B` per
cycle (i.e, 32 banks x 4B / bank / cycle). Hence, we see that loading an 4B
integer spends two cycles accessing the LDS
(:math:`2\ {cycle} = (256B) / (128\ B/{cycle})`).
Looking at the next ``conflicts`` dispatch (i.e., two work-items) yields:
.. code-block:: shell
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
<...>
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
│ 12.2.4 │ Index Accesses │ 3.00 │ 3.00 │ 3.00 │ Cycles per kernel │
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
│ 12.2.6 │ Bank Conflict │ 1.00 │ 1.00 │ 1.00 │ Cycles per kernel │
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
Here we see a bank conflict! What happened?
Recall that the index for each thread was calculated as:
.. code-block:: cpp
int index = threadIdx.x * nbanks;
Or, precisely 32 elements, and each element is 4B wide (for a standard
integer). That is, each thread strides back to the same bank in the LDS,
such that each work-item we add to the dispatch results in another bank
conflict!
Recalling our discussion of bank conflicts in our
:doc:`LDS </conceptual/local-data-share>` description:
A bank conflict occurs when two (or more) work-items in a wavefront
want to read, write, or atomically update different addresses that
map to the same bank in the same cycle. In this case, the conflict
detection hardware will determined a new schedule such that the
access is split into multiple cycles with no conflicts in any
single cycle.
Here we see the conflict resolution hardware in action! Because we have
engineered our kernel to generate conflicts, we expect our bank conflict
metric to scale linearly with the number of work-items:
.. figure:: ../data/profiling-by-example/ldsconflicts.png
:align: center
:alt: Comparison of LDS conflict cycles versus access cycles for our simple
example.
:width: 800
Comparison of LDS conflict cycles versus access cycles for our simple
example.
Here we show the comparison of the Index Accesses (**12.2.4**), to the Bank
Conflicts (**12.2.6**) for the first 20 kernel invocations. We see that each grows
linearly, and there is a constant gap of 2 cycles between them (i.e., the first
access is never considered a conflict).
Finally, we can use these two metrics to derive the Bank Conflict Rate (**12.1.4**).
Since within an Index Access we have 32 banks that may need to be updated, we
use:
$$
Bank\ Conflict\ Rate = 100 * ((Bank\ Conflicts / 32) / (Index\ Accesses - Bank\ Conflicts))
$$
Plotting this, we see:
.. figure:: ../data/profiling-by-example/ldsconflictrate.png
:align: center
:alt: LDS bank conflict rate example
:width: 800
LDS Bank Conflict rate for our simple example.
The bank conflict rate linearly increases with the number of work-items
within a wavefront that are active, *approaching* 100%, but never quite
reaching it.
+456
Просмотреть файл
@@ -0,0 +1,456 @@
.. _occupancy-example:
Occupancy limiters example
==========================
For this example, consider the
:dev-sample:`occupancy <occupancy.hip>` included with Omniperf. We will
investigate the use of the resource allocation panel in the
:ref:`Workgroup Manager <desc-spi>`s metrics section to determine occupancy
limiters. This code contains several kernels to explore how both various
kernel resources impact achieved occupancy, and how this is reported in
Omniperf.
This example was compiled and run on a MI250 accelerator using ROCm
v5.6.0, and Omniperf v2.0.0:
.. code-block:: shell
$ hipcc -O3 occupancy.hip -o occupancy --save-temps
We have again included the ``--save-temps`` flag to get the
corresponding assembly.
Finally, we generate our Omniperf profile as:
.. code-block:: shell
$ omniperf profile -n occupancy --no-roof -- ./occupancy
.. _occupancy-experiment-design:
Design note
-----------
For our occupancy test, we need to create a kernel that is resource
heavy, in various ways. For this purpose, we use the following (somewhat
funny-looking) kernel:
.. code-block:: cpp
constexpr int bound = 16;
__launch_bounds__(256)
__global__ void vgprbound(int N, double* ptr) {
double intermediates[bound];
for (int i = 0 ; i < bound; ++i) intermediates[i] = N * threadIdx.x;
double x = ptr[threadIdx.x];
for (int i = 0; i < 100; ++i) {
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % bound], intermediates[i % bound]));
intermediates[i % bound] = x;
}
if (x == N) ptr[threadIdx.x] = x;
}
Here we try to use as many :ref:`VGPRs <desc-valu>` as possible, to this end:
* We create a small array of double precision floats, that we size to try
to fit into registers (i.e., ``bound``, this may need to be tuned
depending on the ROCm version).
* We specify ``__launch_bounds___(256)``
to increase the number of VPGRs available to the kernel (by limiting the
number of wavefronts that can be resident on a
:doc:`CU </conceptual/compute-unit>`).
* Write a unique non-compile time constant to each element of the array.
* Repeatedly permute and call relatively expensive math functions on our
array elements.
* Keep the compiler from optimizing out any operations by faking a write to the
``ptr`` based on a run-time conditional.
This yields a total of 122 VGPRs, but it is expected this number will
depend on the exact ROCm/compiler version.
.. code-block:: asm
.size _Z9vgprboundiPd, .Lfunc_end1-_Z9vgprboundiPd
; -- End function
.section .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 4732
; NumSgprs: 68
; NumVgprs: 122
; NumAgprs: 0
; <...>
; AccumOffset: 124
We will use various permutations of this kernel to limit occupancy, and
more importantly for the purposes of this example, demonstrate how this
is reported in Omniperf.
.. _vgpr-occupancy:
VGPR limited
------------
For our first test, we use the ``vgprbound`` kernel discussed in the
:ref:`design note <occupancy-experiment-design>`. After profiling, we run
the analyze step on this kernel:
.. code-block:: shell
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
0 │ vgprbound(int, double*) │ 1.00 │ 923093822.50 │ 923093822.50 │ 923093822.50 │ 100.00 │
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15 │ Wavefront Occupancy │ 1661.24 │ Wavefronts │ 3328.00 │ 49.92 │
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.64 │ 0.64 │ 0.64 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.94 │ 24.94 │ 24.94 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.49 │ 24.49 │ 24.49 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5 │ Insufficient SIMD VGPRs │ 94.90 │ 94.90 │ 94.90 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤══════════╤════════╤════════╤════════╤═══════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪══════════╪════════╪════════╪════════╪═══════════╡
│ 7.1.5 │ VGPRs │ 124.00 │ 124.00 │ 124.00 │ Registers │
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
│ 7.1.6 │ AGPRs │ 4.00 │ 4.00 │ 4.00 │ Registers │
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
╘═════════╧══════════╧════════╧════════╧════════╧═══════════╛
Here we see that the kernel indeed does use *around* (but not exactly)
122 VGPRs, with the difference due to granularity of VGPR allocations.
In addition, we see that we have allocated 4 “:ref:`AGPRs <desc-agprs>`”. We
note that on current CDNA2 accelerators, the ``AccumOffset`` field of
the assembly metadata:
.. code-block:: asm
; AccumOffset: 124
denotes the divide between ``VGPRs`` and ``AGPRs``.
Next, we examine our wavefront occupancy (**2.1.15**), and see that we are
reaching only :math:`\sim50\%` of peak occupancy. As a result, we see
that:
- We are not scheduling workgroups :math:`\sim25\%` of
:ref:`total scheduler-pipe cycles <total-pipe-cycles>` (**6.2.1**); recall
from the discussion of the `workgroup manager <desc-spi>`, 25% is the maximum.
- The scheduler-pipe is stalled (**6.2.2**) from scheduling workgroups due to
resource constraints for the same :math:`\sim25\%` of the time.
- And finally, :math:`\sim91\%` of those stalls are due to a lack of SIMDs
with the appropriate number of VGPRs available (6.2.5).
That is, the reason we cant reach full occupancy is due to our VGPR
usage, as expected!
LDS limited
-----------
To examine an LDS limited example, we must change our kernel slightly:
.. code-block:: cpp
constexpr size_t fully_allocate_lds = 64ul * 1024ul / sizeof(double);
__launch_bounds__(256)
__global__ void ldsbound(int N, double* ptr) {
__shared__ double intermediates[fully_allocate_lds];
for (int i = threadIdx.x ; i < fully_allocate_lds; i += blockDim.x) intermediates[i] = N * threadIdx.x;
__syncthreads();
double x = ptr[threadIdx.x];
for (int i = threadIdx.x; i < fully_allocate_lds; i += blockDim.x) {
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % fully_allocate_lds], intermediates[i % fully_allocate_lds]));
__syncthreads();
intermediates[i % fully_allocate_lds] = x;
}
if (x == N) ptr[threadIdx.x] = x;
}
Where we now:
* Allocate an 64 KiB LDS array per workgroup, and
* Use our allocated LDS array instead of a register array
Analyzing this:
.. code-block:: shell
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
<...>
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤════════╤════════════╤═════════╤═══════════════╕
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
╞═════════╪═════════════════════╪════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15 │ Wavefront Occupancy │ 415.52 │ Wavefronts │ 3328.00 │ 12.49 │
╘═════════╧═════════════════════╧════════╧════════════╧═════════╧═══════════════╛
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.13 │ 0.13 │ 0.13 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.87 │ 24.87 │ 24.87 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.84 │ 24.84 │ 24.84 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5 │ Insufficient SIMD VGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7 │ Insufficient CU LDS │ 96.47 │ 96.47 │ 96.47 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤════════════════╤══════════╤══════════╤══════════╤═══════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════╪══════════╪══════════╪══════════╪═══════════╡
│ 7.1.5 │ VGPRs │ 96.00 │ 96.00 │ 96.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
│ 7.1.8 │ LDS Allocation │ 65536.00 │ 65536.00 │ 65536.00 │ Bytes │
╘═════════╧════════════════╧══════════╧══════════╧══════════╧═══════════╛
We see that our VGPR allocation has gone down to 96 registers, but now
we see our 64KiB LDS allocation (**7.1.8**). In addition, we see a similar
non-schedule rate (**6.2.1**) and stall rate (**6.2.2**) as in our
:ref:`VGPR example <vgpr-occupancy>`. However, our occupancy limiter has now
shifted from VGPRs (**6.2.5**) to LDS (**6.2.7**).
We note that although we see the around the same scheduler/stall rates
(with our LDS limiter), our wave occupancy (**2.1.15**) is significantly
lower (:math:`\sim12\%`)! This is important to remember: the occupancy
limiter metrics in the resource allocation section tell you what the
limiter was, but *not* how much the occupancy was limited. These metrics
should always be analyzed in concert with the wavefront occupancy
metric!
.. _sgpr-occupancy:
SGPR limited
------------
Finally, we modify our kernel once more to make it limited by
`SGPRs <salu>`__:
.. code-block:: cpp
constexpr int sgprlim = 1;
__launch_bounds__(1024, 8)
__global__ void sgprbound(int N, double* ptr) {
double intermediates[sgprlim];
for (int i = 0 ; i < sgprlim; ++i) intermediates[i] = i;
double x = ptr[0];
#pragma unroll 1
for (int i = 0; i < 100; ++i) {
x += sin(pow(intermediates[(i - 1) % sgprlim], intermediates[i % sgprlim]));
intermediates[i % sgprlim] = x;
}
if (x == N) ptr[0] = x;
}
The major changes here are to: - make as much as possible provably
uniform across the wave (notice the lack of ``threadIdx.x`` in the
``intermediates`` initialization and elsewhere), - addition of
``__launch_bounds__(1024, 8)``, which reduces our maximum VGPRs to 64
(such that 8 waves can fit per SIMD), but causes some register spills
(i.e., :ref:`scratch <memory-spaces>` usage), and - lower the ``bound`` (here we
use ``sgprlim``) of the array to reduce VGPR/Scratch usage.
This results in the following assembly metadata for this kernel:
.. code-block:: asm
.size _Z9sgprboundiPd, .Lfunc_end3-_Z9sgprboundiPd
; -- End function
.section .AMDGPU.csdata
; Kernel info:
; codeLenInByte = 4872
; NumSgprs: 76
; NumVgprs: 64
; NumAgprs: 0
; TotalNumVgprs: 64
; ScratchSize: 60
; <...>
; AccumOffset: 64
; Occupancy: 8
Analyzing this workload yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
│ 0 │ sgprbound(int, double*) │ 1.00 │ 782069812.00 │ 782069812.00 │ 782069812.00 │ 100.00 │
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
2. System Speed-of-Light
2.1 Speed-of-Light
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
│ 2.1.15 │ Wavefront Occupancy │ 3291.76 │ Wavefronts │ 3328.00 │ 98.91 │
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
--------------------------------------------------------------------------------
6. Workgroup Manager (SPI)
6.2 Workgroup Manager - Resource Allocation
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 7.72 │ 7.72 │ 7.72 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 15.17 │ 15.17 │ 15.17 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 7.38 │ 7.38 │ 7.38 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.3 │ Scratch Stall Rate │ 39.76 │ 39.76 │ 39.76 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.4 │ Insufficient SIMD Waveslots │ 26.32 │ 26.32 │ 26.32 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.5 │ Insufficient SIMD VGPRs │ 26.32 │ 26.32 │ 26.32 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.6 │ Insufficient SIMD SGPRs │ 25.52 │ 25.52 │ 25.52 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
--------------------------------------------------------------------------------
7. Wavefront
7.1 Wavefront Launch Stats
╒═════════╤════════════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 7.1.5 │ VGPRs │ 64.00 │ 64.00 │ 64.00 │ Registers │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.8 │ LDS Allocation │ 0.00 │ 0.00 │ 0.00 │ Bytes │
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
│ 7.1.9 │ Scratch Allocation │ 60.00 │ 60.00 │ 60.00 │ Bytes/workitem │
╘═════════╧════════════════════╧═══════╧═══════╧═══════╧════════════════╛
Here we see that our wavefront launch stats (**7.1**) have changed to
reflect the metadata seen in the ``--save-temps`` output. Of particular
interest, we see:
* The SGPR allocation (**7.1.7**) is 80 registers, slightly more than the 76
requested by the compiler due to allocation granularity, and
* We have a :ref:`"scratch" <memory-spaces>`, that is, private memory,
allocation of 60 bytes per work-item.
Analyzing the resource allocation block (**6.2**) we now see that for the
first time, the "Not-scheduled Rate (Workgroup Manager)" metric (**6.2.0**)
has become non-zero. This is because the workgroup manager is
responsible for management of scratch, which we see also contributes to
our occupancy limiters in the "Scratch Stall Rate" (**6.2.3**). Note that
the sum of the workgroup manager not-scheduled rate and the
scheduler-pipe non-scheduled rate is still :math:`\sim25\%`, as in our
previous examples.
Next, we see that the scheduler-pipe stall rate (**6.2.2**), that is, how often
we could not schedule a workgroup to a CU, was only about
:math:`\sim8\%`. This hints that perhaps, our kernel is not
*particularly* occupancy limited by resources. Indeed, checking the
wave occupancy metric (**2.1.15**) shows that this kernel is reaching nearly
99% occupancy.
Finally, we inspect the occupancy limiter metrics and see a roughly even
split between :ref:`waveslots <desc-valu>` (**6.2.4**), :ref:`VGPRs <desc-valu>`
(**6.2.5**), and :ref:`SGPRs <desc-salu>` (**6.2.6**) along with the scratch stalls
(**6.2.3**) previously mentioned.
This is yet another reminder to view occupancy holistically. While these
metrics tell you why a workgroup cannot be scheduled, they do *not* tell
you what your occupancy was (consult wavefront occupancy) *nor* whether
increasing occupancy will be beneficial to performance.
+113
Просмотреть файл
@@ -0,0 +1,113 @@
.. _valu-arith-instruction-mix-ex:
VALU arithmetic instruction mix
===============================
For this example, consider the
:dev-sample:`instruction mix sample <instmix.hip>` distributed as a part
of Omniperf.
.. note::
The examples in the section are expected to work on all CDNA™ accelerators.
However, the actual experiment results in this section were collected on an
:ref:`MI2XX <mixxx-note>` accelerator.
.. _valu-experiment-design:
Design note
-----------
This code uses a number of inline assembly instructions to cleanly
identify the types of instructions being issued, as well as to avoid
optimization / dead-code elimination by the compiler. While inline
assembly is inherently not portable, this example is expected to work on
all GCN™ GPUs and CDNA accelerators.
We reproduce a sample of the kernel as follows:
.. code-block:: cpp
// fp32: add, mul, transcendental and fma
float f1, f2;
asm volatile(
"v_add_f32_e32 %0, %1, %0\n"
"v_mul_f32_e32 %0, %1, %0\n"
"v_sqrt_f32 %0, %1\n"
"v_fma_f32 %0, %1, %0, %1\n"
: "=v"(f1)
: "v"(f2));
These instructions correspond to:
* A 32-bit floating point addition,
* a 32-bit floating point multiplication,
* a 32-bit floating point square-root transcendental operation, and
* a 32-bit floating point fused multiply-add operation.
For more detail, refer to the `CDNA2 ISA
Guide <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__.
Instruction mix
^^^^^^^^^^^^^^^
This example was compiled and run on a MI250 accelerator using ROCm
v5.6.0, and Omniperf v2.0.0.
.. code-block:: shell
$ hipcc -O3 instmix.hip -o instmix
Generate the profile for this example using the following command.
.. code-block:: shell
$ omniperf profile -n instmix --no-roof -- ./instmix
Analyze the instruction mix section.
.. code-block:: shell
$ omniperf analyze -p workloads/instmix/mi200/ -b 10.2
<...>
10. Compute Units - Instruction Mix
10.2 VALU Arithmetic Instr Mix
╒═════════╤════════════╤═════════╤════════════════╕
│ Index │ Metric │ Count │ Unit │
╞═════════╪════════════╪═════════╪════════════════╡
│ 10.2.0 │ INT32 │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.1 │ INT64 │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.2 │ F16-ADD │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.3 │ F16-MUL │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.4 │ F16-FMA │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.5 │ F16-Trans │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.6 │ F32-ADD │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.7 │ F32-MUL │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.8 │ F32-FMA │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.9 │ F32-Trans │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.10 │ F64-ADD │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.11 │ F64-MUL │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.12 │ F64-FMA │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.13 │ F64-Trans │ 1.00 │ Instr per wave │
├─────────┼────────────┼─────────┼────────────────┤
│ 10.2.14 │ Conversion │ 1.00 │ Instr per wave │
╘═════════╧════════════╧═════════╧════════════════╛
This shows that we have exactly one of each type of VALU arithmetic instruction
by construction.
+698
Просмотреть файл
@@ -0,0 +1,698 @@
.. _vmem-example:
Vector memory operation counting
================================
.. _flat-memory-ex:
Global / Generic (FLAT)
-----------------------
For this example, consider the
:dev-sample:`vector memory sample <vmem.hip>` distributed as a part of
Omniperf. This code launches many different versions of a simple
read/write/atomic-only kernels targeting various address spaces. For example,
below is our simple ``global_write`` kernel:
.. code-block:: cpp
// write to a global pointer
__global__ void global_write(int* ptr, int zero) {
ptr[threadIdx.x] = zero;
}
.. note::
This example was compiled and run on an MI250 accelerator using ROCm
v5.6.0, and Omniperf v2.0.0.
.. code-block:: shell-session
$ hipcc -O3 --save-temps vmem.hip -o vmem
We have also chosen to include the ``--save-temps`` flag to save the
compiler temporary files, such as the generated CDNA assembly code, for
inspection.
Finally, we generate our ``omniperf profile`` as follows.
.. code-block:: shell-session
$ omniperf profile -n vmem --no-roof -- ./vmem
.. _flat-experiment-design:
Design note
^^^^^^^^^^^
This section explains some of the more peculiar lines of code in the
example, for example, the use of compiler built-ins and explicit address space
casting, and so forth.
.. code-block:: cpp
// write to a generic pointer
typedef int __attribute__((address_space(0)))* generic_ptr;
__attribute__((noinline)) __device__ void generic_store(generic_ptr ptr, int zero) { *ptr = zero; }
__global__ void generic_write(int* ptr, int zero, int filter) {
__shared__ int lds[1024];
int* generic = (threadIdx.x < filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
generic_store((generic_ptr)generic, zero);
}
One of the aims of this example is to demonstrate the use of the
:llvm-docs:`"generic" FLAT <address-space-identifier>` address space. This
address space is typically used when the compiler cannot statically prove where
the backing memory is located.
To try to *force* the compiler to use this address space, we applied
``__attribute__((noinline))`` to the ``generic_store`` function to have the
compiler treat it as a function call (that is, on the other side of which, the
address space may not be known). However, in a trivial example such as this, the
compiler may choose to specialize the ``generic_store`` function to the two
address spaces that might provably be used from our translation unit, that is,
:ref:`"local" (or, LDS) <memory-spaces>` and :ref:`"global" <memory-spaces>`.
Hence, we forcibly cast the address space to
:ref:`"generic" (or, FLAT) <memory-spaces>` to avoid this compiler
optimization.
.. warning::
While convenient for this example, this sort of explicit address space
casting can lead to strange compilation errors, and in the worst case,
incorrect results. As a result, use is discouraged in production code.
For more details on address spaces, refer to
:ref:`memory-spaces`.
Global write
^^^^^^^^^^^^
First, we demonstrate our simple ``global_write`` kernel:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 1 -b 10.3 15.1.4 15.1.5 15.1.6 15.1.7 15.1.8 15.1.9 15.1.10 15.1.11 -n per_kernel
<...>
--------------------------------------------------------------------------------
0. Top Stat
╒════╤═════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪═════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ global_write(int*, int) [clone .kd] │ 1.00 │ 2400.00 │ 2400.00 │ 2400.00 │ 100.00 │
╘════╧═════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
15. Address Processing Unit and Data Return Path (TA/TD)
15.1 Address Processing Unit
╒═════════╤═════════════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 15.1.4 │ Total Instructions │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.5 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.6 │ Global/Generic Read Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.7 │ Global/Generic Write Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.8 │ Global/Generic Atomic Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.9 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.10 │ Spill/Stack Read Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 15.1.11 │ Spill/Stack Write Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═════════════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
Here, we have presented both the information in the VMEM Instruction Mix
table (**10.3**) and the Address Processing Unit (**15.1**). We note that this
data is expected to be identical, and hence we omit table 15.1 in our
subsequent examples.
In addition, as expected, we see a single Global/Generic Write
instruction (**10.3.2**, **15.1.7**). Inspecting the generated assembly, we get:
.. code-block:: asm
.protected _Z12global_writePii ; -- Begin function _Z12global_writePii
.globl _Z12global_writePii
.p2align 8
.type _Z12global_writePii,@function
_Z12global_writePii: ; @_Z12global_writePii
; %bb.0:
s_load_dword s2, s[4:5], 0x8
s_load_dwordx2 s[0:1], s[4:5], 0x0
v_lshlrev_b32_e32 v0, 2, v0
s_waitcnt lgkmcnt(0)
v_mov_b32_e32 v1, s2
global_store_dword v0, v1, s[0:1]
s_endpgm
.section .rodata,#alloc
.p2align 6, 0x0
.amdhsa_kernel _Z12global_writePii
Notice that this corresponds to an instance of a ``global_store_dword``
operation.
.. note::
The assembly in these experiments were generated for an
:ref:`MI2XX <mixxx-note>` accelerator using ROCm 5.6.0, and may change
depending on ROCm versions and the targeted hardware architecture.
.. _generic-write-ex:
Generic write to LDS
^^^^^^^^^^^^^^^^^^^^
Next, we examine a generic write. As discussed
:ref:`previously <flat-experiment-design>`, our ``generic_write`` kernel uses an
address space cast to *force* the compiler to choose our desired address
space, regardless of other optimizations that may be possible.
Also note that the ``filter`` parameter passed in as a kernel argument (see
:dev-sample:`example <vmem.hip>` and
:ref:`design note <flat-experiment-design>`) is set to zero on the host, such
that we always write to the :doc:`local </conceptual/local-data-share>` (LDS)
memory allocation ``lds``.
Examining this kernel in the VMEM Instruction Mix table yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ generic_write(int*, int, int) [clone .kd │ 1.00 │ 2880.00 │ 2880.00 │ 2880.00 │ 100.00 │
│ │ ] │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
As expected we see a single generic write (**10.3.2**). In the assembly
generated for this kernel (in particular, we care about the
``generic_store`` function), we see that this corresponds to a
``flat_store_dword`` instruction:
.. code-block:: asm
.type _Z13generic_storePii,@function
_Z13generic_storePii: ; @_Z13generic_storePii
; %bb.0:
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
flat_store_dword v[0:1], v2
s_waitcnt vmcnt(0) lgkmcnt(0)
s_setpc_b64 s[30:31]
.Lfunc_end0:
In addition, we note that we can observe the destination of this request
by looking at the LDS Instructions metric (**12.2.0**) -- which indicates one LDS
access.
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 12.2.0 -n per_kernel
<...>
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0 │ LDS Instrs │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
.. note::
Exercise for the reader: if this access had been targeted at global memory
(for instance, by changing value of ``filter``), where should we look for the
memory traffic? Hint: see the :ref:`generic read <generic-read-ex>` example.
.. _global-read-ex:
Global read
^^^^^^^^^^^
Next, we examine a simple global read operation:
.. code-block:: cpp
__global__ void global_read(int* ptr, int zero) {
int x = ptr[threadIdx.x];
if (x != zero) {
ptr[threadIdx.x] = x + 1;
}
}
Here we observe a now familiar pattern:
- Read a value in from global memory.
- Have a write hidden behind a conditional that is impossible for
the compiler to statically eliminate, but is identically false. In this
case, our ``main()`` function initializes the data in ``ptr`` to zero.
Running Omniperf on this kernel yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 3 -b 10.3 -n per_kernel
<...>
0. Top Stat
╒════╤════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ global_read(int*, int) [clone .kd] │ 1.00 │ 4480.00 │ 4480.00 │ 4480.00 │ 100.00 │
╘════╧════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
Here we see a single global/generic instruction (**10.3.0**) which, as
expected, is a read (**10.3.1**).
.. _generic-read-ex:
Generic read from global memory
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For our generic read example, we choose to change our target for the
generic read to be global memory:
.. code-block:: cpp
__global__ void generic_read(int* ptr, int zero, int filter) {
__shared__ int lds[1024];
if (static_cast<int>(filter - 1) == zero) {
lds[threadIdx.x] = 0; // initialize to zero to avoid conditional, but hide behind _another_ conditional
}
int* generic;
if (static_cast<int>(threadIdx.x) > filter - 1) {
generic = &ptr[threadIdx.x];
} else {
generic = &lds[threadIdx.x];
abort();
}
int x = generic_load((generic_ptr)generic);
if (x != zero) {
ptr[threadIdx.x] = x + 1;
}
}
In addition to our usual ``if (condition_that_wont_happen)`` guard
around the write operation, there is an additional conditional around
the initialization of the ``lds`` buffer. We note that its typically
required to write to this buffer to prevent the compiler from
eliminating the local memory branch entirely due to undefined behavior
(use of an uninitialized value). However, to report *only* our global
memory read, we again hide this initialization behind an identically
false conditional (both ``zero`` and ``filter`` are set to zero in the
kernel launch). Note that this is a *different* conditional from our
pointer assignment (to avoid combination of the two).
Running Omniperf on this kernel reports:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 4 -b 10.3 12.2.0 16.3.10 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ generic_read(int*, int, int) [clone .kd] │ 1.00 │ 2240.00 │ 2240.00 │ 2240.00 │ 100.00 │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0 │ LDS Instrs │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.10 │ L1-L2 Read │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧════════════════╛
Here we observe:
- A single global/generic read operation (**10.3.1**), which
- Is not an LDS instruction (**12.2**), as seen in the
:ref:`generic write <generic-write-ex>` example, but is instead
- An L1-L2 read operation (**16.3.10**)
That is, we have successfully targeted our generic read at global
memory. Inspecting the assembly shows this corresponds to a
``flat_load_dword`` instruction.
.. _global-atomic-ex:
Global atomic
^^^^^^^^^^^^^
Our global atomic kernel simply atomically adds a (non-compile-time) zero value
to a pointer.
.. code-block:: cpp
__global__ void global_atomic(int* ptr, int zero) {
atomicAdd(ptr, zero);
}
Running Omniperf on this kernel yields:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 5 -b 10.3 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ global_atomic(int*, int) [clone .kd] │ 1.00 │ 4640.00 │ 4640.00 │ 4640.00 │ 100.00 │
╘════╧══════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
--------------------------------------------------------------------------------
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.12 │ L1-L2 Atomic │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛
Here we see a single global/generic atomic instruction (**10.3.3**), which
corresponds to an L1-L2 atomic request (**16.3.12**).
.. _generic-mixed-atomic-ex:
Generic, mixed atomic
^^^^^^^^^^^^^^^^^^^^^
In our final global/generic example, we look at a case where our generic
operation targets both LDS and global memory:
.. code-block:: cpp
__global__ void generic_atomic(int* ptr, int filter, int zero) {
__shared__ int lds[1024];
int* generic = (threadIdx.x % 2 == filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
generic_atomic((generic_ptr)generic, zero);
}
This assigns every other work-item to atomically update global memory or
local memory.
Running this kernel through Omniperf shows:
.. code-block:: shell-session
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 6 -b 10.3 12.2.0 16.3.12 -n per_kernel
<...>
0. Top Stat
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
│ 0 │ generic_atomic(int*, int, int) [clone .k │ 1.00 │ 3360.00 │ 3360.00 │ 3360.00 │ 100.00 │
│ │ d] │ │ │ │ │ │
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
12. Local Data Share (LDS)
12.2 LDS Stats
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 12.2.0 │ LDS Instrs │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.12 │ L1-L2 Atomic │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛
That is, we see:
- A single generic atomic instruction (**10.3.3**) that maps to both
- An LDS instruction (**12.2.0**), and
- An L1-L2 atomic request (**16.3**)
We have demonstrated the ability of the generic address space to
*dynamically* target different backing memory.
.. _spill-scratch:
Spill/Scratch (BUFFER)
----------------------
Next we examine the use of "Spill/Scratch" memory. On current CDNA
accelerators such as the :ref:`MI2XX <mixxx-note>`, this is implemented using
the :ref:`private <memory-spaces>` memory space, which maps to
:llvm-docs:`"scratch" memory <amdgpu-address-spaces>` in AMDGPU hardware
terminology. This type of memory can be accessed via different instructions
depending on the specific architecture targeted. However, current CDNA
accelerators such as the :ref:`MI2XX <mixxx-note>` use so called ``buffer``
instructions to access private memory in a simple (and typically) coalesced
manner. See
:mi200-isa-pdf:`Sec. 9.1, "Vector Memory Buffer Instructions" of the CDNA2 ISA guide <>`
for further reading on this instruction type.
We develop a `simple
kernel <https://github.com/ROCm/omniperf/blob/dev/sample/stack.hip>`__
that uses stack memory:
.. code-block:: cpp
#include <hip/hip_runtime.h>
__global__ void knl(int* out, int filter) {
int x[1024];
x[filter] = 0;
if (threadIdx.x < filter)
out[threadIdx.x] = x[threadIdx.x];
}
Our strategy here is to:
* Create a large stack buffer (that cannot reasonably fit into registers) - Write to a compile-time unknown
location on the stack, and then
* Behind the typical compile-time unknown ``if(condition_that_wont_happen)``
* Read from a different, compile-time unknown, location on the stack and write
to global memory to prevent the compiler from optimizing it out.
This example was compiled and run on an MI250 accelerator using ROCm v5.6.0, and
Omniperf v2.0.0.
.. code-block:: shell-session
$ hipcc -O3 stack.hip -o stack.hip
And profiled using Omniperf:
.. code-block:: shell-session
$ omniperf profile -n stack --no-roof -- ./stack
<...>
$ omniperf analyze -p workloads/stack/mi200/ -b 10.3 16.3.11 -n per_kernel
<...>
10. Compute Units - Instruction Mix
10.3 VMEM Instr Mix
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
│ 10.3.0 │ Global/Generic Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.4 │ Spill/Stack Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.6 │ Spill/Stack Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
--------------------------------------------------------------------------------
16. Vector L1 Data Cache
16.3 L1D Cache Accesses
╒═════════╤═════════════╤═══════╤═══════╤═══════╤════════════════╕
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
╞═════════╪═════════════╪═══════╪═══════╪═══════╪════════════════╡
│ 16.3.11 │ L1-L2 Write │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
╘═════════╧═════════════╧═══════╧═══════╧═══════╧════════════════╛
Here we see a single write to the stack (**10.3.6**), which corresponds to
an L1-L2 write request (**16.3.11**), that is, the stack is backed by global
memory and travels through the same memory hierarchy.
+22
Просмотреть файл
@@ -0,0 +1,22 @@
.. meta::
:description: Omniperf external training resources
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
training, examples
******************
Learning resources
******************
This section is a catalog of external resources and third-party content that
can help you learn Omniperf. Some areas of the following content might be
outdated.
Introduction to Omniperf
:fab:`youtube` `AMD profiling workshop (Pawsey Supercomputing Research Centre) <https://www.youtube.com/watch?v=9AkxBCiInCw>`_
Omniperf example exercises
`<https://github.com/amd/HPCTrainingExamples/tree/main/OmniperfExamples>`__
AMD Instinct™ tuning guides
:doc:`rocm:how-to/tuning-guides/mi300x/workload`
+23
Просмотреть файл
@@ -0,0 +1,23 @@
.. meta::
:description: Omniperf: Profiling by example
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD
********************
Profiling by example
********************
The following examples refer to sample :doc:`HIP <hip:index>` code located in
:fab:`github` :dev-sample:`ROCm/omniperf/blob/dev/sample <>` and distributed
as part of Omniperf.
.. include:: ./includes/valu-arithmetic-instruction-mix.rst
.. include:: ./includes/infinity-fabric-transactions.rst
.. include:: ./includes/vector-memory-operation-counting.rst
.. include:: ./includes/instructions-per-cycle-and-utilizations.rst
.. include:: ./includes/lds-examples.rst
.. include:: ./includes/occupancy-limiters-example.rst