Docs: refactor and integrate into ROCm docs portal (#362)
* pip-compile docs/requirements.txt
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add Sphinx docs config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add Sphinx config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update docs build config
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* style(conf.py): Apply black formatting to docs/conf.py
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
* Update docs requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update to rocm-docs-core 1.3.0
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update docs requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
pip-compile requirements
Signed-off-by: Peter Jun Park <peter.park@amd.com>
bump rocm-docs-core to 1.5.0
bump rocm-docs-core to 1.4.1
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* Add dependabot.yml and update CODEOWNERS
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Update toc and conf
Signed-off-by: Peter Jun Park <peter.park@amd.com>
update dependabot
* Port docs to rocm-docs standard
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add toc and Diataxis cards
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add basic file structure
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add glossary
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add includes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Add license.rst
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add compatible hw
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix spelling and license
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up index
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up installation guides
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add basic usage (quickstart)
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add ref to global options
update toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
modularize modes and global options
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add profile mode
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg and clean up
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add dynamic omniperf version number in installation guide
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add datatemplate
more reorg
Signed-off-by: Peter Jun Park <peter.park@amd.com>
clean up
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg images
move profile mode
reorg
reorg
reorg more
fix formatting
fix headings
ref anchor mi2xx note
add extlinks
add extlinks
Signed-off-by: Peter Jun Park <peter.park@amd.com>
black format
fix formatting, anchors
Signed-off-by: Peter Jun Park <peter.park@amd.com>
reorg
fix words and formatting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
formatting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
same
reorg
format
fix formatting
fix toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
format
* impr internal linking and fix sphinx warnings
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add spellcheck/linting from rocm-docs-core
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix rst directives
satisfy spellcheck
fix more spelling
rm unused files
fix spelling and update wordlist
* bump rocm-docs-core to 1.6.0
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add fixes from @skyreflectedinmirrors and @lpaoletti
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add references to toc
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add more fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add package manager install section
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* add fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add metadata and fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add fixes
bump to 1.6.1
more fixes
fix fmt in profiling examples
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add missing mem type table
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix formatting
fmt
* add custom css
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix css fs
* make images/figs click-to-expand
Signed-off-by: Peter Jun Park <peter.park@amd.com>
add missed image
update
fix link
* update documentation link in README
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* formatting fixes
Signed-off-by: Peter Jun Park <peter.park@amd.com>
more formatting
* fix heading
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* move archived docs
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* exclude archived docs from docs build
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* update archived docs workflow
Signed-off-by: Peter Jun Park <peter.park@amd.com>
move files
update archived docs workflow
Signed-off-by: Peter Jun Park <peter.park@amd.com>
fix version number
clean up workflow
workflow test
workflow test
another workflow test
* rm docs linting
Signed-off-by: Peter Jun Park <peter.park@amd.com>
* Apply cmake-format suggested changes
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
* Apply cmake-format
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
---------
Signed-off-by: Peter Jun Park <peter.park@amd.com>
Signed-off-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
Co-authored-by: Sam Wu <22262939+samjwu@users.noreply.github.com>
[ROCm/rocprofiler-compute commit: a0dc485ceb]
Этот коммит содержится в:
коммит произвёл
David Galiffi
родитель
ddbc208489
Коммит
5d22d5ac8e
+675
@@ -0,0 +1,675 @@
|
||||
.. _infinity-fabric-example:
|
||||
|
||||
Infinity Fabric transactions
|
||||
============================
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`Infinity Fabric™ sample <fabric.hip>` distributed as a part of
|
||||
Omniperf.
|
||||
|
||||
This following code snippet launches a simple read-only kernel.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// the main streaming kernel
|
||||
__global__ void kernel(int* x, size_t N, int zero) {
|
||||
int sum = 0;
|
||||
const size_t offset_start = threadIdx.x + blockIdx.x * blockDim.x;
|
||||
for (int i = 0; i < 10; ++i) {
|
||||
for (size_t offset = offset_start; offset < N; offset += blockDim.x * gridDim.x) {
|
||||
sum += x[offset];
|
||||
}
|
||||
}
|
||||
if (sum != 0) {
|
||||
x[offset_start] = sum;
|
||||
}
|
||||
}
|
||||
|
||||
This happens twice -- once as a warm-up and once for analysis. Note that the
|
||||
buffer ``x`` is initialized to all zeros via a call to ``hipMemcpy`` on the
|
||||
host before the kernel is ever launched. Therefore, the following conditional
|
||||
is identically false -- and thus we expect no writes.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
if (sum != 0) { ...
|
||||
|
||||
.. note::
|
||||
|
||||
The actual sample included with Omniperf also includes the ability to select
|
||||
different operation types (such as atomics, writes). This abbreviated version
|
||||
is presented here for reference only.
|
||||
|
||||
Finally, this sample code lets the user control the
|
||||
:ref:`granularity of an allocation <memory-type>`, the owner of an allocation
|
||||
(local HBM, CPU DRAM or remote HBM), and the size of an allocation (the default
|
||||
is :math:`\sim4`\ GiB) via command line arguments. In doing so, we can explore
|
||||
the impact of these parameters on the L2-Fabric metrics reported by Omniperf to
|
||||
further understand their meaning.
|
||||
|
||||
.. note::
|
||||
|
||||
All results in this section were generated an a node of Infinity
|
||||
Fabric connected MI250 accelerators using ROCm version 5.6.0, and Omniperf
|
||||
version 2.0.0. Although results may vary with ROCm versions and accelerator
|
||||
connectivity, we expect the lessons learned here to be broadly applicable.
|
||||
|
||||
.. _infinity-fabric-ex1:
|
||||
|
||||
Experiment 1: Coarse-grained, accelerator-local HBM reads
|
||||
-----------------------------------------------------------
|
||||
|
||||
In our first experiment, we consider the simplest possible case, a
|
||||
``hipMalloc``\ ’d buffer that is local to our current accelerator:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
|
||||
Using:
|
||||
mtype:CoarseGrained
|
||||
mowner:Device
|
||||
mspace:Global
|
||||
mop:Read
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/coarse_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.0 │ L2-Fabric Read BW │ 42947428672.00 │ 42947428672.00 │ 42947428672.00 │ Bytes per kernel │
|
||||
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.1 │ HBM Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼─────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.2 │ Remote Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.07 │ 0.07 │ 0.07 │ Pct │
|
||||
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
|
||||
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.1 │ Read (Uncached) │ 1450.00 │ 1450.00 │ 1450.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.2 │ Read (64B) │ 671053573.00 │ 671053573.00 │ 671053573.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.3 │ HBM Read │ 671053565.00 │ 671053565.00 │ 671053565.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.4 │ Remote Read │ 8.00 │ 8.00 │ 8.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
|
||||
|
||||
Here, you can make the following observations.
|
||||
|
||||
- The vast majority of L2-Fabric requests (>99%) are 64B
|
||||
read requests (**17.5.2**).
|
||||
|
||||
- Nearly 100% of the read requests (**17.2.1**) are homed in on the
|
||||
accelerator-local HBM (**17.5.3**), while some small fraction of these reads are
|
||||
routed to a “remote” device (**17.5.4**).
|
||||
|
||||
- These drive a :math:`\sim40`\ GiB per kernel read-bandwidth (**17.2.0**).
|
||||
|
||||
In addition, we see a small amount of :ref:`uncached <memory-type>` reads
|
||||
(**17.5.1**), these correspond to things like:
|
||||
|
||||
* The assembly code to execute the kernel
|
||||
|
||||
* Kernel arguments
|
||||
|
||||
* Coordinate parameters (such as ``blockDim.z``) that were not initialized by the
|
||||
hardware, etc. and may account for some of our "remote" read requests
|
||||
(**17.5.4**), for example, reading from CPU DRAM
|
||||
|
||||
The above list is not exhaustive, nor are all of these guaranteed to be
|
||||
"uncached" – the exact implementation depends on the accelerator and
|
||||
ROCm versions used. These read requests could be interrogated further in
|
||||
the :ref:`Scalar L1 Data Cache <desc-sl1d>` and
|
||||
:ref:`Instruction Cache <desc-l1i>` metric sections.
|
||||
|
||||
.. note::
|
||||
|
||||
The Traffic metrics in Sec **17.2** are presented as a percentage of the total
|
||||
number of requests. For example, "HBM Read Traffic" is the percent of read requests
|
||||
(**17.5.0** - **17.5.2**) that were directed to the accelerators' local HBM (**17.5.3**).
|
||||
|
||||
.. _infinity-fabric-ex2:
|
||||
|
||||
Experiment 2: Fine-grained, accelerator-local HBM reads
|
||||
---------------------------------------------------------
|
||||
|
||||
In this experiment, we change the :ref:`granularity <memory-type>` of our
|
||||
device-allocation to be fine-grained device memory, local to the current
|
||||
accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
|
||||
``hipDeviceMallocFinegrained`` flag to accomplish this.
|
||||
|
||||
.. note::
|
||||
|
||||
On some systems (such as those with only PCIe® connected accelerators), you need
|
||||
to set the environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` to enable
|
||||
this memory type.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
|
||||
Using:
|
||||
mtype:FineGrained
|
||||
mowner:Device
|
||||
mspace:Global
|
||||
mop:Read
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/fine_grained_local/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.0 │ L2-Fabric Read BW │ 42948661824.00 │ 42948661824.00 │ 42948661824.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.1 │ HBM Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.2 │ Remote Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.3 │ Uncached Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.07 │ 0.07 │ 0.07 │ Pct │
|
||||
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
|
||||
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.1 │ Read (Uncached) │ 1334.00 │ 1334.00 │ 1334.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.2 │ Read (64B) │ 671072841.00 │ 671072841.00 │ 671072841.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.3 │ HBM Read │ 671072835.00 │ 671072835.00 │ 671072835.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.4 │ Remote Read │ 6.00 │ 6.00 │ 6.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
|
||||
|
||||
Comparing with our :ref:`previous example <infinity-fabric-ex1>`, we see a
|
||||
relatively similar result, namely:
|
||||
|
||||
- The vast majority of L2-Fabric requests are 64B read requests (**17.5.2**)
|
||||
|
||||
- Nearly all these read requests are directed to the accelerator-local HBM (**17.2.1**)
|
||||
|
||||
In addition, we now see a small percentage of HBM Read Stalls (**17.4.2**),
|
||||
as streaming fine-grained memory is putting more stress on Infinity
|
||||
Fabric.
|
||||
|
||||
.. note::
|
||||
|
||||
The stalls in Sec 17.4 are presented as a percentage of the total number
|
||||
active L2 cycles, summed over :doc:`all L2 channels </conceptual/l2-cache>`.
|
||||
|
||||
.. _infinity-fabric-ex3:
|
||||
|
||||
Experiment 3: Fine-grained, remote-accelerator HBM reads
|
||||
----------------------------------------------------------
|
||||
|
||||
In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
|
||||
be owned by a remote accelerator. We accomplish this by first changing
|
||||
the HIP device using, for instance, the ``hipSetDevice(1)`` API, then allocating
|
||||
fine-grained memory (as described :ref:`previously <infinity-fabric-ex2>`), and
|
||||
finally resetting the device back to the default, for instance,
|
||||
``hipSetDevice(0)``.
|
||||
|
||||
Although we have not changed our code significantly, we do see a
|
||||
substantial change in the L2-Fabric metrics:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
|
||||
Using:
|
||||
mtype:FineGrained
|
||||
mowner:Remote
|
||||
mspace:Global
|
||||
mop:Read
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/fine_grained_remote/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.0 │ L2-Fabric Read BW │ 42949692736.00 │ 42949692736.00 │ 42949692736.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.3 │ Uncached Read Traffic │ 200.00 │ 200.00 │ 200.00 │ Pct │
|
||||
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 17.85 │ 17.85 │ 17.85 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
|
||||
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.1 │ Read (Uncached) │ 1342177894.00 │ 1342177894.00 │ 1342177894.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.2 │ Read (64B) │ 671088949.00 │ 671088949.00 │ 671088949.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.3 │ HBM Read │ 307.00 │ 307.00 │ 307.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.4 │ Remote Read │ 671088642.00 │ 671088642.00 │ 671088642.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛
|
||||
|
||||
First, we see that while we still observe approximately the same number
|
||||
of 64B Read Requests (**17.5.2**), we now see an even larger number of
|
||||
Uncached Read Requests (**17.5.3**). Some simple division reveals:
|
||||
|
||||
.. math::
|
||||
|
||||
342177894.00 / 671088949.00 ≈ 2
|
||||
|
||||
That is, each 64B Read Request is *also* counted as two Uncached Read
|
||||
Requests, as reflected in the :ref:`request-flow diagram <l2-request-flow>`.
|
||||
This is also why the Uncached Read Traffic metric (**17.2.3**) is at the
|
||||
counter-intuitive value of 200%!
|
||||
|
||||
In addition, observe that:
|
||||
|
||||
- We no longer see any significant number of HBM Read Requests (**17.2.1**,
|
||||
**17.5.3**), nor HBM Read Stalls (**17.4.2**), but instead,
|
||||
|
||||
- we see that almost all of these requests are considered “remote”
|
||||
(**17.2.2**, **17.5.4**) are being routed to another
|
||||
accelerator, or the CPU — in this case HIP Device 1 — and,
|
||||
|
||||
- we see a significantly larger percentage of AMD Infinity Fabric Read Stalls
|
||||
(**17.4.1**) as compared to the HBM Read Stalls in the
|
||||
:ref:`previous example <infinity-fabric-ex2>`.
|
||||
|
||||
These stalls correspond to reads that are going out over the AMD
|
||||
Infinity Fabric connection to another MI250 accelerator. In
|
||||
addition, because these are crossing between accelerators, we expect
|
||||
significantly lower achievable bandwidths as compared to the local
|
||||
accelerator’s HBM – this is reflected (indirectly) in the magnitude of
|
||||
the stall metric (**17.4.1**). Finally, we note that if our system contained
|
||||
only PCIe connected accelerators, these observations will differ.
|
||||
|
||||
.. _infinity-fabric-ex4:
|
||||
|
||||
Experiment 4: Fine-grained, CPU-DRAM reads
|
||||
--------------------------------------------
|
||||
|
||||
In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
|
||||
be owned by the CPU’s DRAM. We accomplish this by allocating host-pinned
|
||||
fine-grained memory using the ``hipHostMalloc`` API:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
|
||||
Using:
|
||||
mtype:FineGrained
|
||||
mowner:Host
|
||||
mspace:Global
|
||||
mop:Read
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/fine_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.0 │ L2-Fabric Read BW │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.3 │ Uncached Read Traffic │ 200.00 │ 200.00 │ 200.00 │ Pct │
|
||||
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 91.29 │ 91.29 │ 91.29 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════╤═══════════════╤═══════════════╤═══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════╪═══════════════╪═══════════════╪═══════════════╪════════════════╡
|
||||
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.1 │ Read (Uncached) │ 1342177848.00 │ 1342177848.00 │ 1342177848.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.2 │ Read (64B) │ 671088926.00 │ 671088926.00 │ 671088926.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.3 │ HBM Read │ 284.00 │ 284.00 │ 284.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼───────────────┼───────────────┼───────────────┼────────────────┤
|
||||
│ 17.5.4 │ Remote Read │ 671088642.00 │ 671088642.00 │ 671088642.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════╧═══════════════╧═══════════════╧═══════════════╧════════════════╛
|
||||
|
||||
Here we see *almost* the same results as in the
|
||||
:ref:`previous experiment <infinity-fabric-ex3>`, however now as we are crossing
|
||||
a PCIe bus to the CPU, we see that the Infinity Fabric Read stalls (**17.4.1**)
|
||||
have shifted to be a PCIe stall (**17.4.2**). In addition, as (on this
|
||||
system) the PCIe bus has a lower peak bandwidth than the AMD Infinity
|
||||
Fabric connection between two accelerators, we once again observe an
|
||||
increase in the percentage of stalls on this interface.
|
||||
|
||||
.. note::
|
||||
|
||||
Had we performed this same experiment on an
|
||||
`MI250X system <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_,
|
||||
these transactions would again have been marked as Infinity Fabric Read
|
||||
stalls (**17.4.1**), as the CPU is connected to the accelerator via AMD Infinity
|
||||
Fabric.
|
||||
|
||||
.. _infinity-fabric-ex5:
|
||||
|
||||
Experiment 5: Coarse-grained, CPU-DRAM reads
|
||||
----------------------------------------------
|
||||
|
||||
In our next fabric experiment, we change our CPU memory allocation to be
|
||||
`coarse-grained <Mtype>`__. We accomplish this by passing the
|
||||
``hipHostMalloc`` API the ``hipHostMallocNonCoherent`` flag, to mark the
|
||||
allocation as coarse-grained:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
|
||||
Using:
|
||||
mtype:CoarseGrained
|
||||
mowner:Host
|
||||
mspace:Global
|
||||
mop:Read
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/coarse_grained_host/mi200 -b 17.2.0 17.2.1 17.2.2 17.2.3 17.4.0 17.4.1 17.4.2 17.5.0 17.5.1 17.5.2 17.5.3 17.5.4 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.0 │ L2-Fabric Read BW │ 42949691264.00 │ 42949691264.00 │ 42949691264.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.1 │ HBM Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.2 │ Remote Read Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.3 │ Uncached Read Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═══════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤═══════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.0 │ Read - PCIe Stall │ PCIe Stall │ Read │ 91.27 │ 91.27 │ 91.27 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.1 │ Read - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.2 │ Read - HBM Stall │ HBM Stall │ Read │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧═══════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
|
||||
│ 17.5.0 │ Read (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.1 │ Read (Uncached) │ 562.00 │ 562.00 │ 562.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.2 │ Read (64B) │ 671088926.00 │ 671088926.00 │ 671088926.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.3 │ HBM Read │ 281.00 │ 281.00 │ 281.00 │ Req per kernel │
|
||||
├─────────┼─────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.4 │ Remote Read │ 671088645.00 │ 671088645.00 │ 671088645.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
|
||||
|
||||
Here we see a similar result to our
|
||||
:ref:`previous experiment <infinity-fabric-ex4>`, with one key difference: our
|
||||
accesses are no longer marked as Uncached Read requests (**17.2.3, 17.5.1**), but instead
|
||||
are 64B read requests (**17.5.2**), as observed in our
|
||||
:ref:`Coarse-grained, accelerator-local HBM <infinity-fabric-ex1>` experiment.
|
||||
|
||||
.. _infinity-fabric-ex6:
|
||||
|
||||
Experiment 6: Fine-grained, CPU-DRAM writes
|
||||
--------------------------------------------
|
||||
|
||||
Thus far in our exploration of the L2-Fabric interface, we have
|
||||
primarily focused on read operations. However, in
|
||||
:ref:`our request flow diagram <l2-request-flow>`, we note that writes are
|
||||
counted separately. To observe this, we use the ``-p`` flag to trigger write
|
||||
operations to fine-grained memory allocated on the host:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
|
||||
Using:
|
||||
mtype:FineGrained
|
||||
mowner:Host
|
||||
mspace:Global
|
||||
mop:Write
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/fine_grained_host_writes/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════════════════╤════════════════╤════════════════╤════════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════════╪════════════════╪════════════════╪════════════════╪══════════════════╡
|
||||
│ 17.2.4 │ L2-Fabric Write and Atomic BW │ 42949672960.00 │ 42949672960.00 │ 42949672960.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.5 │ HBM Write and Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.6 │ Remote Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.7 │ Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼────────────────┼────────────────┼────────────────┼──────────────────┤
|
||||
│ 17.2.8 │ Uncached Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
╘═════════╧═══════════════════════════════════╧════════════════╧════════════════╧════════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.3 │ Write - PCIe Stall │ PCIe Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.4 │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.5 │ Write - HBM Stall │ HBM Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.6 │ Write - Credit Starvation │ Credit Starvation │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════════════╤══════════════╤══════════════╤══════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════════╪══════════════╪══════════════╪══════════════╪════════════════╡
|
||||
│ 17.5.5 │ Write (32B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.6 │ Write (Uncached) │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.7 │ Write (64B) │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.8 │ HBM Write and Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.9 │ Remote Write and Atomic │ 671088640.00 │ 671088640.00 │ 671088640.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼──────────────┼──────────────┼──────────────┼────────────────┤
|
||||
│ 17.5.10 │ Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
|
||||
|
||||
Here we notice a few changes in our request pattern:
|
||||
|
||||
* As expected, the requests have changed from 64B Reads to 64B Write requests
|
||||
(**17.5.7**),
|
||||
|
||||
* these requests are homed in on a “remote” destination (**17.2.6, 17.5.9**), as
|
||||
expected, and
|
||||
|
||||
* these are also counted as a single Uncached Write request (**17.5.6**).
|
||||
|
||||
In addition, there are rather significant changes in the bandwidth values
|
||||
reported:
|
||||
|
||||
- The “L2-Fabric Write and Atomic” bandwidth metric (**17.2.4**)
|
||||
reports about 40GiB of data written across Infinity Fabric while
|
||||
|
||||
- The “Remote Write and Traffic” metric (**17.2.5**) indicates that nearly
|
||||
100% of these request are being directed to a remote source.
|
||||
|
||||
The precise meaning of these metrics are explored in the
|
||||
:ref:`subsequent experiment <infinity-fabric-ex7>`.
|
||||
|
||||
Finally, we note that we see no write stalls on the PCIe bus
|
||||
(**17.4.3**). This is because writes over a PCIe bus `are
|
||||
non-posted <https://members.pcisig.com/wg/PCI-SIG/document/10912>`_,
|
||||
that is, they do not require acknowledgement.
|
||||
|
||||
.. _infinity-fabric-ex7:
|
||||
|
||||
Experiment 7: Fine-grained, CPU-DRAM atomicAdd
|
||||
------------------------------------------------
|
||||
|
||||
Next, we change our experiment to instead target ``atomicAdd``
|
||||
operations to the CPU’s DRAM.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
|
||||
Using:
|
||||
mtype:FineGrained
|
||||
mowner:Host
|
||||
mspace:Global
|
||||
mop:Add
|
||||
mdata:Unsigned
|
||||
remoteId:-1
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/fine_grained_host_add/mi200 -b 17.2.4 17.2.5 17.2.6 17.2.7 17.2.8 17.4.3 17.4.4 17.4.5 17.4.6 17.5.5 17.5.6 17.5.7 17.5.8 17.5.9 17.5.10 -n per_kernel --dispatch 2
|
||||
<...>
|
||||
17. L2 Cache
|
||||
17.2 L2 - Fabric Transactions
|
||||
╒═════════╤═══════════════════════════════════╤══════════════╤══════════════╤══════════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════════════════╪══════════════╪══════════════╪══════════════╪══════════════════╡
|
||||
│ 17.2.4 │ L2-Fabric Write and Atomic BW │ 429496736.00 │ 429496736.00 │ 429496736.00 │ Bytes per kernel │
|
||||
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
|
||||
│ 17.2.5 │ HBM Write and Atomic Traffic │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
|
||||
│ 17.2.6 │ Remote Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
|
||||
│ 17.2.7 │ Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
├─────────┼───────────────────────────────────┼──────────────┼──────────────┼──────────────┼──────────────────┤
|
||||
│ 17.2.8 │ Uncached Write and Atomic Traffic │ 100.00 │ 100.00 │ 100.00 │ Pct │
|
||||
╘═════════╧═══════════════════════════════════╧══════════════╧══════════════╧══════════════╧══════════════════╛
|
||||
17.4 L2 - Fabric Interface Stalls
|
||||
╒═════════╤════════════════════════════════╤════════════════════════╤═══════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Type │ Transaction │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════════════════╪════════════════════════╪═══════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 17.4.3 │ Write - PCIe Stall │ PCIe Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.4 │ Write - Infinity Fabric™ Stall │ Infinity Fabric™ Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.5 │ Write - HBM Stall │ HBM Stall │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────┼────────────────────────┼───────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 17.4.6 │ Write - Credit Starvation │ Credit Starvation │ Write │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧════════════════════════════════╧════════════════════════╧═══════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
17.5 L2 - Fabric Detailed Transaction Breakdown
|
||||
╒═════════╤═════════════════════════╤═════════════╤═════════════╤═════════════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════════╪═════════════╪═════════════╪═════════════╪════════════════╡
|
||||
│ 17.5.5 │ Write (32B) │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
|
||||
│ 17.5.6 │ Write (Uncached) │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
|
||||
│ 17.5.7 │ Write (64B) │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
|
||||
│ 17.5.8 │ HBM Write and Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
|
||||
│ 17.5.9 │ Remote Write and Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
|
||||
├─────────┼─────────────────────────┼─────────────┼─────────────┼─────────────┼────────────────┤
|
||||
│ 17.5.10 │ Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
|
||||
╘═════════╧═════════════════════════╧═════════════╧═════════════╧═════════════╧════════════════╛
|
||||
|
||||
In this case, there is quite a lot to unpack:
|
||||
|
||||
- For the first time, the 32B Write requests (**17.5.5**) are heavily used.
|
||||
|
||||
- These correspond to Atomic requests (**17.2.7, 17.5.10**), and are counted as
|
||||
Uncached Writes (**17.5.6**).
|
||||
|
||||
- The L2-Fabric Write and Atomic bandwidth metric (**17.2.4**) shows about 0.4
|
||||
GiB of traffic. For convenience, the sample reduces the default problem size
|
||||
for this case due to the speed of atomics across a PCIe bus, and finally,
|
||||
|
||||
- The traffic is directed to a remote device (**17.2.6, 17.5.9**).
|
||||
|
||||
Let's consider what an “atomic” request means in this context. Recall
|
||||
that we are discussing memory traffic flowing from the L2 cache, the
|
||||
device-wide coherence point on current CDNA accelerators such as the
|
||||
MI250, to for example, the CPU’s DRAM. In this light, we see that these
|
||||
requests correspond to *system scope* atomics, and specifically in the
|
||||
case of the MI250, to fine-grained memory.
|
||||
|
||||
|
||||
.. rubric:: Disclaimer
|
||||
|
||||
PCIe® is a registered trademark of PCI-SIG Corporation.
|
||||
|
||||
..
|
||||
`Leave as possible future experiment to add
|
||||
|
||||
|
||||
### Experiment #2 - Non-temporal writes
|
||||
|
||||
If we take the same code (for convenience only) as previously described, we can demonstrate how to achieve 'streaming' writes, as described in the [L2 Cache Access metrics](L2_cache_metrics) section.
|
||||
To see this, we use the Clang built-in [`__builtin_nontemporal_store`](https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins), for example
|
||||
|
||||
```
|
||||
template<typename T>
|
||||
__device__ void store (T* ptr, T val) {
|
||||
__builtin_nontemporal_store(val, ptr);
|
||||
}
|
||||
```
|
||||
|
||||
On an AMD MI2XX accelerator, for FP32 values this will generate a `global_store_dword` instruction, with the `glc` and `slc` bits set, described in [section 10.1](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf) of the CDNA2 ISA guide.`
|
||||
+486
@@ -0,0 +1,486 @@
|
||||
.. _ipc-example:
|
||||
|
||||
Instructions-per-cycle and utilizations example
|
||||
===============================================
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`instructions-per-cycle (IPC) example <ipc.hip>` included with
|
||||
Omniperf.
|
||||
|
||||
This example is compiled using ``c++17`` support:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ hipcc -O3 ipc.hip -o ipc -std=c++17
|
||||
|
||||
and was run on an MI250 CDNA2 accelerator:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf profile -n ipc --no-roof -- ./ipc
|
||||
|
||||
The results shown in this section are *generally* applicable to CDNA
|
||||
accelerators, but may vary between generations and specific products.
|
||||
|
||||
.. _ipc-experiment-design-note:
|
||||
|
||||
Design note
|
||||
-----------
|
||||
|
||||
The kernels in this example all execute a specific assembly operation
|
||||
``N`` times (1000, by default), for instance the ``vmov`` kernel:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
template<int N=1000>
|
||||
__device__ void vmov_op() {
|
||||
int dummy;
|
||||
if constexpr (N >= 1) {
|
||||
asm volatile("v_mov_b32 v0, v1\n" : : "{v31}"(dummy));
|
||||
vmov_op<N - 1>();
|
||||
}
|
||||
}
|
||||
|
||||
template<int N=1000>
|
||||
__global__ void vmov() {
|
||||
vmov_op<N>();
|
||||
}
|
||||
|
||||
The kernels are then launched twice, once for a warm-up run, and once
|
||||
for measurement.
|
||||
|
||||
.. _ipc-valu-utilization:
|
||||
|
||||
VALU utilization and IPC
|
||||
------------------------
|
||||
|
||||
Now we can use our test to measure the achieved instructions-per-cycle
|
||||
of various types of instructions. We start with a simple :ref:`VALU <desc-valu>`
|
||||
operation, i.e., a ``v_mov_b32`` instruction, e.g.:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
v_mov_b32 v0, v1
|
||||
|
||||
This instruction simply copies the contents from the source register
|
||||
(``v1``) to the destination register (``v0``). Investigating this kernel
|
||||
with Omniperf, we see:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 7 -b 11.2
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
|
||||
│ 0 │ void vmov<1000>() [clone .kd] │ 1.00 │ 99317423.00 │ 99317423.00 │ 99317423.00 │ 100.00 │
|
||||
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
11. Compute Units - Compute Pipeline
|
||||
11.2 Pipeline Stats
|
||||
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
|
||||
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.2 │ SALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.3 │ VALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.5 │ Branch Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.6 │ VALU Active Threads │ 64.0 │ 64.0 │ 64.0 │ Threads │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
|
||||
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
|
||||
|
||||
Here we see that:
|
||||
|
||||
1. Both the IPC (**11.2.0**) and “Issued” IPC (**11.2.1**) metrics are
|
||||
:math:`\sim 1`
|
||||
2. The VALU Utilization metric (**11.2.3**) is also :math:`\sim100\%`, and
|
||||
finally
|
||||
3. The VALU Active Threads metric (**11.2.4**) is 64, i.e., the wavefront
|
||||
size on CDNA accelerators, as all threads in the wavefront are
|
||||
active.
|
||||
|
||||
We will explore the difference between the IPC (**11.2.0**) and “Issued” IPC
|
||||
(**11.2.1**) metrics in the :ref:`next section <issued-ipc>`.
|
||||
|
||||
Additionally, we notice a small (0.1%) Branch utilization (**11.2.5**).
|
||||
Inspecting the assembly of this kernel shows there are no branch
|
||||
operations, however recalling the note in the :ref:`Pipeline
|
||||
statistics <pipeline-stats>` section:
|
||||
|
||||
The branch utilization <…> includes time spent in other instruction
|
||||
types (namely: ``s_endpgm``) that are *typically* a very small
|
||||
percentage of the overall kernel execution.
|
||||
|
||||
We see that this is coming from execution of the ``s_endpgm``
|
||||
instruction at the end of every wavefront.
|
||||
|
||||
.. note::
|
||||
|
||||
Technically, the cycle counts used in the denominators of our IPC metrics are
|
||||
actually in units of quad-cycles, a group of 4 consecutive cycles. However, a
|
||||
typical :ref:`VALU <desc-valu>` instruction on CDNA accelerators runs for a
|
||||
single quad-cycle (see :gcn-crash-course:`30`). Therefore, for simplicity, we
|
||||
simply report these metrics as "instructions per cycle".
|
||||
|
||||
.. _issued-ipc:
|
||||
|
||||
Exploring “issued” IPC via MFMA operations
|
||||
------------------------------------------
|
||||
|
||||
.. warning::
|
||||
|
||||
The MFMA assembly operations used in this example are inherently not portable
|
||||
to older CDNA architectures.
|
||||
|
||||
Unlike the simple quad-cycle ``v_mov_b32`` operation discussed in our
|
||||
:ref:`previous example <ipc-valu-utilization>`, some operations take many
|
||||
quad-cycles to execute. For example, using the
|
||||
`AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator#example-of-querying-instruction-information>`_
|
||||
we can see that some :ref:`MFMA <desc-mfma>` operations take 64 cycles, e.g.:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ ./matrix_calculator.py --arch CDNA2 --detail-instruction --instruction v_mfma_f32_32x32x8bf16_1k
|
||||
Architecture: CDNA2
|
||||
Instruction: V_MFMA_F32_32X32X8BF16_1K
|
||||
<...>
|
||||
Execution statistics:
|
||||
FLOPs: 16384
|
||||
Execution cycles: 64
|
||||
FLOPs/CU/cycle: 1024
|
||||
Can co-execute with VALU: True
|
||||
VALU co-execution cycles possible: 60
|
||||
|
||||
What happens to our IPC when we utilize this ``v_mfma_f32_32x32x8bf16_1k``
|
||||
instruction on a CDNA2 accelerator? To find out, we turn to our ``mfma`` kernel
|
||||
in the IPC example:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 8 -b 11.2 --decimal 4
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═══════════════════════════════╤═════════╤═════════════════╤═════════════════╤═════════════════╤══════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═══════════════════════════════╪═════════╪═════════════════╪═════════════════╪═════════════════╪══════════╡
|
||||
│ 0 │ void mfma<1000>() [clone .kd] │ 1.0000 │ 1623167595.0000 │ 1623167595.0000 │ 1623167595.0000 │ 100.0000 │
|
||||
╘════╧═══════════════════════════════╧═════════╧═════════════════╧═════════════════╧═════════════════╧══════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
11. Compute Units - Compute Pipeline
|
||||
11.2 Pipeline Stats
|
||||
╒═════════╤═════════════════════╤═════════╤═════════╤═════════╤══════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪═════════╪═════════╪═════════╪══════════════╡
|
||||
│ 11.2.0 │ IPC │ 0.0626 │ 0.0626 │ 0.0626 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.1 │ IPC (Issued) │ 1.0000 │ 1.0000 │ 1.0000 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.2 │ SALU Util │ 0.0000 │ 0.0000 │ 0.0000 │ Pct │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.3 │ VALU Util │ 6.2496 │ 6.2496 │ 6.2496 │ Pct │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.4 │ VMEM Util │ 0.0000 │ 0.0000 │ 0.0000 │ Pct │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.5 │ Branch Util │ 0.0062 │ 0.0062 │ 0.0062 │ Pct │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.6 │ VALU Active Threads │ 64.0000 │ 64.0000 │ 64.0000 │ Threads │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.7 │ MFMA Util │ 99.9939 │ 99.9939 │ 99.9939 │ Pct │
|
||||
├─────────┼─────────────────────┼─────────┼─────────┼─────────┼──────────────┤
|
||||
│ 11.2.8 │ MFMA Instr Cycles │ 64.0000 │ 64.0000 │ 64.0000 │ Cycles/instr │
|
||||
╘═════════╧═════════════════════╧═════════╧═════════╧═════════╧══════════════╛
|
||||
|
||||
In contrast to our :ref:`VALU IPC example <ipc-valu-utilization>`, we now see
|
||||
that the IPC metric (**11.2.0**) and Issued IPC (**11.2.1**) metric differ
|
||||
substantially. First, we see the VALU utilization (**11.2.3**) has decreased
|
||||
substantially, from nearly 100% to :math:`\sim6.25\%`. We note that this matches
|
||||
the ratio of: :math:`((Execution\ cycles) - (VALU\ coexecution\ cycles)) / (Execution\ cycles)`
|
||||
reported by the matrix calculator, while the MFMA utilization (**11.2.7**)
|
||||
has increased to nearly 100%.
|
||||
|
||||
Recall that our ``v_mfma_f32_32x32x8bf16_1k`` instruction takes 64 cycles to
|
||||
execute, or 16 quad-cycles, matching our observed MFMA Instruction
|
||||
Cycles (**11.2.8**). That is, we have a single instruction executed every 16
|
||||
quad-cycles, or :math:`1/16 = 0.0625`, which is almost identical to our IPC
|
||||
metric (**11.2.0**). Why then is the Issued IPC metric (**11.2.1**) equal to 1.0?
|
||||
|
||||
Instead of simply counting the number of instructions issued and
|
||||
dividing by the number of cycles the :doc:`CUs </conceptual/compute-unit>` on
|
||||
the accelerator were active (as is done for **11.2.0**), this metric is formulated
|
||||
differently, and instead counts the number of
|
||||
(non-:ref:`internal <ipc-internal-instructions>`) instructions issued divided
|
||||
by the number of (quad-) cycles where the :ref:`scheduler <desc-scheduler>` was
|
||||
actively working on issuing instructions. Thus the Issued IPC metric
|
||||
(**11.2.1**) gives more of a sense of “what percent of the total number of
|
||||
:ref:`scheduler <desc-scheduler>` cycles did a wave schedule an instruction?”
|
||||
while the IPC metric (**11.2.0**) indicates the ratio of the number of
|
||||
instructions executed over the total
|
||||
:ref:`active CU cycles <total-active-cu-cycles>`.
|
||||
|
||||
.. warning::
|
||||
|
||||
There are further complications of the Issued IPC metric (**11.2.1**) that make
|
||||
its use more complicated. We will be explore that in the
|
||||
:ref:`following section <ipc-internal-instructions>`. For these reasons,
|
||||
Omniperf typically promotes use of the regular IPC metric (**11.2.0**), e.g., in
|
||||
the top-level Speed-of-Light chart.
|
||||
|
||||
.. _ipc-internal-instructions:
|
||||
|
||||
Internal instructions and IPC
|
||||
-----------------------------
|
||||
|
||||
Next, we explore the concept of an “internal” instruction. From
|
||||
:gcn-crash-course:`29`, we see a few candidates for internal instructions, and
|
||||
we choose a ``s_nop`` instruction, which according to the
|
||||
:mi200-isa-pdf:`CDNA2 ISA guide <>`:
|
||||
|
||||
Does nothing; it can be repeated in hardware up to eight times.
|
||||
|
||||
Here we choose to use the following no-op to make our point:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
s_nop 0x0
|
||||
|
||||
Running this kernel through Omniperf yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 9 -b 11.2
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
|
||||
│ 0 │ void snop<1000>() [clone .kd] │ 1.00 │ 14221851.50 │ 14221851.50 │ 14221851.50 │ 100.00 │
|
||||
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
11. Compute Units - Compute Pipeline
|
||||
11.2 Pipeline Stats
|
||||
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
|
||||
│ 11.2.0 │ IPC │ 6.79 │ 6.79 │ 6.79 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.2 │ SALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.3 │ VALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.5 │ Branch Util │ 0.68 │ 0.68 │ 0.68 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.6 │ VALU Active Threads │ │ │ │ Threads │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
|
||||
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
|
||||
|
||||
First, we see that the IPC metric (**11.2.0**) tops our theoretical maximum
|
||||
of 5 instructions per cycle (discussed in the :ref:`scheduler <desc-scheduler>`
|
||||
section). How can this be?
|
||||
|
||||
Recall that :gcn-crash-course:`27` say “no functional unit” for the internal
|
||||
instructions. This removes the limitation on the IPC. If we are *only*
|
||||
issuing internal instructions, we are not issuing to any execution
|
||||
units! However, workloads such as these are almost *entirely* artificial
|
||||
(that is, repeatedly issuing internal instructions almost exclusively). In
|
||||
practice, a maximum of IPC of 5 is expected in almost all cases.
|
||||
|
||||
Secondly, note that our “Issued” IPC (**11.2.1**) is still identical to
|
||||
the one here. Again, this has to do with the details of “internal”
|
||||
instructions. Recall in our :ref:`previous example <issued-ipc>` we defined
|
||||
this metric as explicitly excluding internal instruction counts. The
|
||||
logical question then is, "what *is* this metric counting in our
|
||||
``s_nop`` kernel?"
|
||||
|
||||
The generated assembly looks something like:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
;;#ASMSTART
|
||||
s_nop 0x0
|
||||
;;#ASMEND
|
||||
;;#ASMSTART
|
||||
s_nop 0x0
|
||||
;;#ASMEND
|
||||
;;<... omitting many more ...>
|
||||
s_endpgm
|
||||
.section .rodata,#alloc
|
||||
.p2align 6, 0x0
|
||||
.amdhsa_kernel _Z4snopILi1000EEvv
|
||||
|
||||
Of particular interest here is the ``s_endpgm`` instruction, of which
|
||||
the `CDNA2 ISA
|
||||
guide <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__
|
||||
states:
|
||||
|
||||
End of program; terminate wavefront.
|
||||
|
||||
This is not on our list of internal instructions from
|
||||
:gcn-crash-course:`The AMD GCN Architecture <>`, and is therefore counted as part
|
||||
of our Issued IPC (**11.2.1**). Thus, the issued IPC being equal to one here
|
||||
indicates that we issued an ``s_endpgm`` instruction every cycle the
|
||||
:ref:`scheduler <desc-scheduler>` was active for non-internal instructions, which
|
||||
is expected as this was our *only* non-internal instruction.
|
||||
|
||||
SALU Utilization
|
||||
----------------
|
||||
|
||||
Next, we explore a simple :ref:`SALU <desc-salu>` kernel in our on-going IPC and
|
||||
utilization example. For this case, we select a simple scalar move
|
||||
operation, for instance:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
s_mov_b32 s0, s1
|
||||
|
||||
which, in analogue to our :ref:`v_mov <ipc-valu-utilization>` example, copies the
|
||||
contents of the source scalar register (``s1``) to the destination
|
||||
scalar register (``s0``). Running this kernel through Omniperf yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 10 -b 11.2
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═══════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═══════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
|
||||
│ 0 │ void smov<1000>() [clone .kd] │ 1.00 │ 96246554.00 │ 96246554.00 │ 96246554.00 │ 100.00 │
|
||||
╘════╧═══════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
11. Compute Units - Compute Pipeline
|
||||
11.2 Pipeline Stats
|
||||
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
|
||||
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.2 │ SALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.3 │ VALU Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.5 │ Branch Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.6 │ VALU Active Threads │ │ │ │ Threads │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
|
||||
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
|
||||
|
||||
Here we see that:
|
||||
|
||||
- Both our IPC (**11.2.0**) and Issued IPC (**11.2.1**) are
|
||||
:math:`\sim1.0` as expected, and
|
||||
|
||||
- The SALU Utilization (**11.2.2**) was
|
||||
nearly 100% as it was active for almost the entire kernel.
|
||||
|
||||
VALU Active Threads
|
||||
-------------------
|
||||
|
||||
For our final IPC/Utilization example, we consider a slight modification
|
||||
of our :ref:`v_mov <ipc-valu-utilization>` example:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
template<int N=1000>
|
||||
__global__ void vmov_with_divergence() {
|
||||
if (threadIdx.x % 64 == 0)
|
||||
vmov_op<N>();
|
||||
}
|
||||
|
||||
That is, we wrap our :ref:`VALU <desc-valu>` operation inside a conditional
|
||||
where only one lane in our wavefront is active. Running this kernel
|
||||
through Omniperf yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/ipc/mi200/ --dispatch 11 -b 11.2
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═════════════╤═════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═════════════╪═════════════╪══════════════╪════════╡
|
||||
│ 0 │ void vmov_with_divergence<1000>() [clone │ 1.00 │ 97125097.00 │ 97125097.00 │ 97125097.00 │ 100.00 │
|
||||
│ │ .kd] │ │ │ │ │ │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═════════════╧═════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
11. Compute Units - Compute Pipeline
|
||||
11.2 Pipeline Stats
|
||||
╒═════════╤═════════════════════╤═══════╤═══════╤═══════╤══════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════╪═══════╪═══════╪═══════╪══════════════╡
|
||||
│ 11.2.0 │ IPC │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.1 │ IPC (Issued) │ 1.0 │ 1.0 │ 1.0 │ Instr/cycle │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.2 │ SALU Util │ 0.1 │ 0.1 │ 0.1 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.3 │ VALU Util │ 99.98 │ 99.98 │ 99.98 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.4 │ VMEM Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.5 │ Branch Util │ 0.2 │ 0.2 │ 0.2 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.6 │ VALU Active Threads │ 1.13 │ 1.13 │ 1.13 │ Threads │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.7 │ MFMA Util │ 0.0 │ 0.0 │ 0.0 │ Pct │
|
||||
├─────────┼─────────────────────┼───────┼───────┼───────┼──────────────┤
|
||||
│ 11.2.8 │ MFMA Instr Cycles │ │ │ │ Cycles/instr │
|
||||
╘═════════╧═════════════════════╧═══════╧═══════╧═══════╧══════════════╛
|
||||
|
||||
Here we see that once again, our VALU Utilization (**11.2.3**) is nearly
|
||||
100%. However, we note that the VALU Active Threads metric (**11.2.6**) is
|
||||
:math:`\sim 1`, which matches our conditional in the source code. So
|
||||
VALU Active Threads reports the average number of lanes of our wavefront
|
||||
that are active over all :ref:`VALU <desc-valu>` instructions, or thread
|
||||
“convergence” (i.e., 1 - :ref:`divergence <desc-divergence>`).
|
||||
|
||||
.. note::
|
||||
|
||||
1. The act of evaluating a vector conditional in this example typically triggers VALU operations, contributing to why the VALU Active Threads metric is not identically one.
|
||||
2. This metric is a time (cycle) averaged value, and thus contains an implicit dependence on the duration of various VALU instructions.
|
||||
|
||||
Nonetheless, this metric serves as a useful measure of thread-convergence.
|
||||
|
||||
Finally, we note that our branch utilization (**11.2.5**) has increased
|
||||
slightly from our baseline, as we now have a branch (checking the value
|
||||
of ``threadIdx.x``).
|
||||
@@ -0,0 +1,272 @@
|
||||
.. _lds-examples:
|
||||
|
||||
LDS examples
|
||||
============
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`LDS sample <lds.hip>` distributed as a part of Omniperf. This
|
||||
code contains two kernels to explore how both :doc:`LDS </conceptual/local-data-share>` bandwidth and
|
||||
bank conflicts are calculated in Omniperf.
|
||||
|
||||
This example was compiled and run on an MI250 accelerator using ROCm
|
||||
v5.6.0, and Omniperf v2.0.0.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ hipcc -O3 lds.hip -o lds
|
||||
|
||||
Finally, we generate our ``omniperf profile`` as:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n lds --no-roof -- ./lds
|
||||
|
||||
.. _lds-bandwidth:
|
||||
|
||||
LDS bandwidth
|
||||
-------------
|
||||
|
||||
To explore our *theoretical LDS bandwidth* metric, we use a simple
|
||||
kernel:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
constexpr unsigned max_threads = 256;
|
||||
__global__ void load(int* out, int flag) {
|
||||
__shared__ int array[max_threads];
|
||||
int index = threadIdx.x;
|
||||
// fake a store to the LDS array to avoid unwanted behavior
|
||||
if (flag)
|
||||
array[max_threads - index] = index;
|
||||
__syncthreads();
|
||||
int x = array[index];
|
||||
if (x == int(-1234567))
|
||||
out[threadIdx.x] = x;
|
||||
}
|
||||
|
||||
Here we:
|
||||
|
||||
* Create an array of 256 integers in :doc:`LDS </conceptual/local-data-share>`
|
||||
|
||||
* Fake a write to the LDS using the ``flag`` variable (always set to zero on the
|
||||
host) to avoid dead-code elimination
|
||||
|
||||
* Read a single integer per work-item from ``threadIdx.x`` of the LDS array
|
||||
|
||||
* If the integer is equal to a magic number (always false), write the value out
|
||||
to global memory to again, avoid dead-code elimination
|
||||
|
||||
Finally, we launch this kernel repeatedly, varying the number of threads
|
||||
in our workgroup:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
void bandwidth_demo(int N) {
|
||||
for (int i = 1; i <= N; ++i)
|
||||
load<<<1,i>>>(nullptr, 0);
|
||||
hipDeviceSynchronize();
|
||||
}
|
||||
|
||||
Next, let’s analyze the first of our bandwidth kernel dispatches:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.1 --dispatch 0 -n per_kernel
|
||||
<...>
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤═══════════════════════╤════════╤════════╤════════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪════════╪════════╪════════╪══════════════════╡
|
||||
│ 12.2.1 │ Theoretical Bandwidth │ 256.00 │ 256.00 │ 256.00 │ Bytes per kernel │
|
||||
╘═════════╧═══════════════════════╧════════╧════════╧════════╧══════════════════╛
|
||||
|
||||
Here we see that our Theoretical Bandwidth metric (**12.2.1**) is reporting
|
||||
256 Bytes were loaded even though we launched a single work-item
|
||||
workgroup, and thus only loaded a single integer from LDS. Why is this?
|
||||
|
||||
Recall our definition of this metric:
|
||||
|
||||
Indicates the maximum amount of bytes that could have been loaded
|
||||
from/stored to/atomically updated in the LDS per
|
||||
:ref:`normalization unit <normalization-units>`.
|
||||
|
||||
Here we see that this instruction *could* have loaded up to 256 bytes of
|
||||
data (4 bytes for each work-item in the wavefront), and therefore this
|
||||
is the expected value for this metric in Omniperf, hence why this metric
|
||||
is named the “theoretical” bandwidth.
|
||||
|
||||
To further illustrate this point we plot the relationship of the
|
||||
theoretical bandwidth metric (**12.2.1**) as compared to the effective (or
|
||||
achieved) bandwidth of this kernel, varying the number of work-items
|
||||
launched from 1 to 256:
|
||||
|
||||
.. figure:: ../data/profiling-by-example/ldsbandwidth.png
|
||||
:align: center
|
||||
:alt: Comparison of effective bandwidth versus the theoretical bandwidth
|
||||
metric in Omniperf for our simple example.
|
||||
:width: 800
|
||||
|
||||
Comparison of effective bandwidth versus the theoretical bandwidth
|
||||
metric in Omniperf for our simple example.
|
||||
|
||||
Here we see that the theoretical bandwidth metric follows a step-function. It
|
||||
increases only when another wavefront issues an LDS instruction for up to 256
|
||||
bytes of data. Such increases are marked in the plot using dashed lines. In
|
||||
contrast, the effective bandwidth increases linearly, by 4 bytes, with the
|
||||
number of work-items in the kernel, N.
|
||||
|
||||
.. _lds-bank-conflicts:
|
||||
|
||||
Bank conflicts
|
||||
--------------
|
||||
|
||||
Next we explore bank conflicts using a slight modification of our bandwidth
|
||||
kernel:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
constexpr unsigned nbanks = 32;
|
||||
__global__ void conflicts(int* out, int flag) {
|
||||
constexpr unsigned nelements = nbanks * max_threads;
|
||||
__shared__ int array[nelements];
|
||||
// each thread reads from the same bank
|
||||
int index = threadIdx.x * nbanks;
|
||||
// fake a store to the LDS array to avoid unwanted behavior
|
||||
if (flag)
|
||||
array[max_threads - index] = index;
|
||||
__syncthreads();
|
||||
int x = array[index];
|
||||
if (x == int(-1234567))
|
||||
out[threadIdx.x] = x;
|
||||
}
|
||||
|
||||
Here we:
|
||||
|
||||
* Allocate an :doc:`LDS </conceptual/local-data-share>` array of size
|
||||
:math:`32*256*4{B}=32{KiB}`
|
||||
|
||||
* Fake a write to the LDS using the ``flag``
|
||||
variable (always set to zero on the host) to avoid dead-code elimination
|
||||
|
||||
* Read a single integer per work-item from index
|
||||
``threadIdx.x * nbanks`` of the LDS array
|
||||
|
||||
* If the integer is equal to a
|
||||
magic number (always false), write the value out to global memory to,
|
||||
again, avoid dead-code elimination.
|
||||
|
||||
On the host, we again repeatedly launch this kernel, varying the number
|
||||
of work-items:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
void conflicts_demo(int N) {
|
||||
for (int i = 1; i <= N; ++i)
|
||||
conflicts<<<1,i>>>(nullptr, 0);
|
||||
hipDeviceSynchronize();
|
||||
}
|
||||
|
||||
Analyzing our first ``conflicts`` kernel (i.e., a single work-item), we
|
||||
see:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 256 -n per_kernel
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
|
||||
│ 12.2.4 │ Index Accesses │ 2.00 │ 2.00 │ 2.00 │ Cycles per kernel │
|
||||
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
|
||||
│ 12.2.6 │ Bank Conflict │ 0.00 │ 0.00 │ 0.00 │ Cycles per kernel │
|
||||
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
|
||||
|
||||
In our :ref:`previous example <lds-bank-conflicts>`, we showed how a load
|
||||
from a single work-item is considered to have a theoretical bandwidth of
|
||||
256B. Recall, the :doc:`LDS </conceptual/local-data-share>` can load up to :math:`128B` per
|
||||
cycle (i.e, 32 banks x 4B / bank / cycle). Hence, we see that loading an 4B
|
||||
integer spends two cycles accessing the LDS
|
||||
(:math:`2\ {cycle} = (256B) / (128\ B/{cycle})`).
|
||||
|
||||
Looking at the next ``conflicts`` dispatch (i.e., two work-items) yields:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/lds/mi200/ -b 12.2.4 12.2.6 --dispatch 257 -n per_kernel
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤════════════════╤═══════╤═══════╤═══════╤═══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════╪═══════╪═══════╪═══════╪═══════════════════╡
|
||||
│ 12.2.4 │ Index Accesses │ 3.00 │ 3.00 │ 3.00 │ Cycles per kernel │
|
||||
├─────────┼────────────────┼───────┼───────┼───────┼───────────────────┤
|
||||
│ 12.2.6 │ Bank Conflict │ 1.00 │ 1.00 │ 1.00 │ Cycles per kernel │
|
||||
╘═════════╧════════════════╧═══════╧═══════╧═══════╧═══════════════════╛
|
||||
|
||||
Here we see a bank conflict! What happened?
|
||||
|
||||
Recall that the index for each thread was calculated as:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
int index = threadIdx.x * nbanks;
|
||||
|
||||
Or, precisely 32 elements, and each element is 4B wide (for a standard
|
||||
integer). That is, each thread strides back to the same bank in the LDS,
|
||||
such that each work-item we add to the dispatch results in another bank
|
||||
conflict!
|
||||
|
||||
Recalling our discussion of bank conflicts in our
|
||||
:doc:`LDS </conceptual/local-data-share>` description:
|
||||
|
||||
A bank conflict occurs when two (or more) work-items in a wavefront
|
||||
want to read, write, or atomically update different addresses that
|
||||
map to the same bank in the same cycle. In this case, the conflict
|
||||
detection hardware will determined a new schedule such that the
|
||||
access is split into multiple cycles with no conflicts in any
|
||||
single cycle.
|
||||
|
||||
Here we see the conflict resolution hardware in action! Because we have
|
||||
engineered our kernel to generate conflicts, we expect our bank conflict
|
||||
metric to scale linearly with the number of work-items:
|
||||
|
||||
.. figure:: ../data/profiling-by-example/ldsconflicts.png
|
||||
:align: center
|
||||
:alt: Comparison of LDS conflict cycles versus access cycles for our simple
|
||||
example.
|
||||
:width: 800
|
||||
|
||||
Comparison of LDS conflict cycles versus access cycles for our simple
|
||||
example.
|
||||
|
||||
Here we show the comparison of the Index Accesses (**12.2.4**), to the Bank
|
||||
Conflicts (**12.2.6**) for the first 20 kernel invocations. We see that each grows
|
||||
linearly, and there is a constant gap of 2 cycles between them (i.e., the first
|
||||
access is never considered a conflict).
|
||||
|
||||
Finally, we can use these two metrics to derive the Bank Conflict Rate (**12.1.4**).
|
||||
Since within an Index Access we have 32 banks that may need to be updated, we
|
||||
use:
|
||||
|
||||
$$
|
||||
Bank\ Conflict\ Rate = 100 * ((Bank\ Conflicts / 32) / (Index\ Accesses - Bank\ Conflicts))
|
||||
$$
|
||||
|
||||
Plotting this, we see:
|
||||
|
||||
.. figure:: ../data/profiling-by-example/ldsconflictrate.png
|
||||
:align: center
|
||||
:alt: LDS bank conflict rate example
|
||||
:width: 800
|
||||
|
||||
LDS Bank Conflict rate for our simple example.
|
||||
|
||||
The bank conflict rate linearly increases with the number of work-items
|
||||
within a wavefront that are active, *approaching* 100%, but never quite
|
||||
reaching it.
|
||||
+456
@@ -0,0 +1,456 @@
|
||||
.. _occupancy-example:
|
||||
|
||||
Occupancy limiters example
|
||||
==========================
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`occupancy <occupancy.hip>` included with Omniperf. We will
|
||||
investigate the use of the resource allocation panel in the
|
||||
:ref:`Workgroup Manager <desc-spi>`’s metrics section to determine occupancy
|
||||
limiters. This code contains several kernels to explore how both various
|
||||
kernel resources impact achieved occupancy, and how this is reported in
|
||||
Omniperf.
|
||||
|
||||
This example was compiled and run on a MI250 accelerator using ROCm
|
||||
v5.6.0, and Omniperf v2.0.0:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ hipcc -O3 occupancy.hip -o occupancy --save-temps
|
||||
|
||||
We have again included the ``--save-temps`` flag to get the
|
||||
corresponding assembly.
|
||||
|
||||
Finally, we generate our Omniperf profile as:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf profile -n occupancy --no-roof -- ./occupancy
|
||||
|
||||
.. _occupancy-experiment-design:
|
||||
|
||||
Design note
|
||||
-----------
|
||||
|
||||
For our occupancy test, we need to create a kernel that is resource
|
||||
heavy, in various ways. For this purpose, we use the following (somewhat
|
||||
funny-looking) kernel:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
constexpr int bound = 16;
|
||||
__launch_bounds__(256)
|
||||
__global__ void vgprbound(int N, double* ptr) {
|
||||
double intermediates[bound];
|
||||
for (int i = 0 ; i < bound; ++i) intermediates[i] = N * threadIdx.x;
|
||||
double x = ptr[threadIdx.x];
|
||||
for (int i = 0; i < 100; ++i) {
|
||||
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % bound], intermediates[i % bound]));
|
||||
intermediates[i % bound] = x;
|
||||
}
|
||||
if (x == N) ptr[threadIdx.x] = x;
|
||||
}
|
||||
|
||||
Here we try to use as many :ref:`VGPRs <desc-valu>` as possible, to this end:
|
||||
|
||||
* We create a small array of double precision floats, that we size to try
|
||||
to fit into registers (i.e., ``bound``, this may need to be tuned
|
||||
depending on the ROCm version).
|
||||
|
||||
* We specify ``__launch_bounds___(256)``
|
||||
to increase the number of VPGRs available to the kernel (by limiting the
|
||||
number of wavefronts that can be resident on a
|
||||
:doc:`CU </conceptual/compute-unit>`).
|
||||
|
||||
* Write a unique non-compile time constant to each element of the array.
|
||||
|
||||
* Repeatedly permute and call relatively expensive math functions on our
|
||||
array elements.
|
||||
|
||||
* Keep the compiler from optimizing out any operations by faking a write to the
|
||||
``ptr`` based on a run-time conditional.
|
||||
|
||||
This yields a total of 122 VGPRs, but it is expected this number will
|
||||
depend on the exact ROCm/compiler version.
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
.size _Z9vgprboundiPd, .Lfunc_end1-_Z9vgprboundiPd
|
||||
; -- End function
|
||||
.section .AMDGPU.csdata
|
||||
; Kernel info:
|
||||
; codeLenInByte = 4732
|
||||
; NumSgprs: 68
|
||||
; NumVgprs: 122
|
||||
; NumAgprs: 0
|
||||
; <...>
|
||||
; AccumOffset: 124
|
||||
|
||||
We will use various permutations of this kernel to limit occupancy, and
|
||||
more importantly for the purposes of this example, demonstrate how this
|
||||
is reported in Omniperf.
|
||||
|
||||
.. _vgpr-occupancy:
|
||||
|
||||
VGPR limited
|
||||
------------
|
||||
|
||||
For our first test, we use the ``vgprbound`` kernel discussed in the
|
||||
:ref:`design note <occupancy-experiment-design>`. After profiling, we run
|
||||
the analyze step on this kernel:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 --dispatch 1
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
|
||||
│ 0 │ vgprbound(int, double*) │ 1.00 │ 923093822.50 │ 923093822.50 │ 923093822.50 │ 100.00 │
|
||||
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
2. System Speed-of-Light
|
||||
2.1 Speed-of-Light
|
||||
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
|
||||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||||
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
|
||||
│ 2.1.15 │ Wavefront Occupancy │ 1661.24 │ Wavefronts │ 3328.00 │ 49.92 │
|
||||
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
6. Workgroup Manager (SPI)
|
||||
6.2 Workgroup Manager - Resource Allocation
|
||||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.64 │ 0.64 │ 0.64 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.94 │ 24.94 │ 24.94 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.49 │ 24.49 │ 24.49 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 94.90 │ 94.90 │ 94.90 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
7. Wavefront
|
||||
7.1 Wavefront Launch Stats
|
||||
╒═════════╤══════════╤════════╤════════╤════════╤═══════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪══════════╪════════╪════════╪════════╪═══════════╡
|
||||
│ 7.1.5 │ VGPRs │ 124.00 │ 124.00 │ 124.00 │ Registers │
|
||||
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
|
||||
│ 7.1.6 │ AGPRs │ 4.00 │ 4.00 │ 4.00 │ Registers │
|
||||
├─────────┼──────────┼────────┼────────┼────────┼───────────┤
|
||||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||||
╘═════════╧══════════╧════════╧════════╧════════╧═══════════╛
|
||||
|
||||
Here we see that the kernel indeed does use *around* (but not exactly)
|
||||
122 VGPRs, with the difference due to granularity of VGPR allocations.
|
||||
In addition, we see that we have allocated 4 “:ref:`AGPRs <desc-agprs>`”. We
|
||||
note that on current CDNA2 accelerators, the ``AccumOffset`` field of
|
||||
the assembly metadata:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
; AccumOffset: 124
|
||||
|
||||
denotes the divide between ``VGPRs`` and ``AGPRs``.
|
||||
|
||||
Next, we examine our wavefront occupancy (**2.1.15**), and see that we are
|
||||
reaching only :math:`\sim50\%` of peak occupancy. As a result, we see
|
||||
that:
|
||||
|
||||
- We are not scheduling workgroups :math:`\sim25\%` of
|
||||
:ref:`total scheduler-pipe cycles <total-pipe-cycles>` (**6.2.1**); recall
|
||||
from the discussion of the `workgroup manager <desc-spi>`, 25% is the maximum.
|
||||
|
||||
- The scheduler-pipe is stalled (**6.2.2**) from scheduling workgroups due to
|
||||
resource constraints for the same :math:`\sim25\%` of the time.
|
||||
|
||||
- And finally, :math:`\sim91\%` of those stalls are due to a lack of SIMDs
|
||||
with the appropriate number of VGPRs available (6.2.5).
|
||||
|
||||
That is, the reason we can’t reach full occupancy is due to our VGPR
|
||||
usage, as expected!
|
||||
|
||||
LDS limited
|
||||
-----------
|
||||
|
||||
To examine an LDS limited example, we must change our kernel slightly:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
constexpr size_t fully_allocate_lds = 64ul * 1024ul / sizeof(double);
|
||||
__launch_bounds__(256)
|
||||
__global__ void ldsbound(int N, double* ptr) {
|
||||
__shared__ double intermediates[fully_allocate_lds];
|
||||
for (int i = threadIdx.x ; i < fully_allocate_lds; i += blockDim.x) intermediates[i] = N * threadIdx.x;
|
||||
__syncthreads();
|
||||
double x = ptr[threadIdx.x];
|
||||
for (int i = threadIdx.x; i < fully_allocate_lds; i += blockDim.x) {
|
||||
x += sin(pow(__shfl(x, i % warpSize) * intermediates[(i - 1) % fully_allocate_lds], intermediates[i % fully_allocate_lds]));
|
||||
__syncthreads();
|
||||
intermediates[i % fully_allocate_lds] = x;
|
||||
}
|
||||
if (x == N) ptr[threadIdx.x] = x;
|
||||
}
|
||||
|
||||
Where we now:
|
||||
|
||||
* Allocate an 64 KiB LDS array per workgroup, and
|
||||
|
||||
* Use our allocated LDS array instead of a register array
|
||||
|
||||
Analyzing this:
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 --dispatch 3
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
2. System Speed-of-Light
|
||||
2.1 Speed-of-Light
|
||||
╒═════════╤═════════════════════╤════════╤════════════╤═════════╤═══════════════╕
|
||||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||||
╞═════════╪═════════════════════╪════════╪════════════╪═════════╪═══════════════╡
|
||||
│ 2.1.15 │ Wavefront Occupancy │ 415.52 │ Wavefronts │ 3328.00 │ 12.49 │
|
||||
╘═════════╧═════════════════════╧════════╧════════════╧═════════╧═══════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
6. Workgroup Manager (SPI)
|
||||
6.2 Workgroup Manager - Resource Allocation
|
||||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 0.13 │ 0.13 │ 0.13 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 24.87 │ 24.87 │ 24.87 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 24.84 │ 24.84 │ 24.84 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.3 │ Scratch Stall Rate │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.7 │ Insufficient CU LDS │ 96.47 │ 96.47 │ 96.47 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
7. Wavefront
|
||||
7.1 Wavefront Launch Stats
|
||||
╒═════════╤════════════════╤══════════╤══════════╤══════════╤═══════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════╪══════════╪══════════╪══════════╪═══════════╡
|
||||
│ 7.1.5 │ VGPRs │ 96.00 │ 96.00 │ 96.00 │ Registers │
|
||||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||||
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
|
||||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||||
├─────────┼────────────────┼──────────┼──────────┼──────────┼───────────┤
|
||||
│ 7.1.8 │ LDS Allocation │ 65536.00 │ 65536.00 │ 65536.00 │ Bytes │
|
||||
╘═════════╧════════════════╧══════════╧══════════╧══════════╧═══════════╛
|
||||
|
||||
We see that our VGPR allocation has gone down to 96 registers, but now
|
||||
we see our 64KiB LDS allocation (**7.1.8**). In addition, we see a similar
|
||||
non-schedule rate (**6.2.1**) and stall rate (**6.2.2**) as in our
|
||||
:ref:`VGPR example <vgpr-occupancy>`. However, our occupancy limiter has now
|
||||
shifted from VGPRs (**6.2.5**) to LDS (**6.2.7**).
|
||||
|
||||
We note that although we see the around the same scheduler/stall rates
|
||||
(with our LDS limiter), our wave occupancy (**2.1.15**) is significantly
|
||||
lower (:math:`\sim12\%`)! This is important to remember: the occupancy
|
||||
limiter metrics in the resource allocation section tell you what the
|
||||
limiter was, but *not* how much the occupancy was limited. These metrics
|
||||
should always be analyzed in concert with the wavefront occupancy
|
||||
metric!
|
||||
|
||||
.. _sgpr-occupancy:
|
||||
|
||||
SGPR limited
|
||||
------------
|
||||
|
||||
Finally, we modify our kernel once more to make it limited by
|
||||
`SGPRs <salu>`__:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
constexpr int sgprlim = 1;
|
||||
__launch_bounds__(1024, 8)
|
||||
__global__ void sgprbound(int N, double* ptr) {
|
||||
double intermediates[sgprlim];
|
||||
for (int i = 0 ; i < sgprlim; ++i) intermediates[i] = i;
|
||||
double x = ptr[0];
|
||||
#pragma unroll 1
|
||||
for (int i = 0; i < 100; ++i) {
|
||||
x += sin(pow(intermediates[(i - 1) % sgprlim], intermediates[i % sgprlim]));
|
||||
intermediates[i % sgprlim] = x;
|
||||
}
|
||||
if (x == N) ptr[0] = x;
|
||||
}
|
||||
|
||||
The major changes here are to: - make as much as possible provably
|
||||
uniform across the wave (notice the lack of ``threadIdx.x`` in the
|
||||
``intermediates`` initialization and elsewhere), - addition of
|
||||
``__launch_bounds__(1024, 8)``, which reduces our maximum VGPRs to 64
|
||||
(such that 8 waves can fit per SIMD), but causes some register spills
|
||||
(i.e., :ref:`scratch <memory-spaces>` usage), and - lower the ``bound`` (here we
|
||||
use ``sgprlim``) of the array to reduce VGPR/Scratch usage.
|
||||
|
||||
This results in the following assembly metadata for this kernel:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
.size _Z9sgprboundiPd, .Lfunc_end3-_Z9sgprboundiPd
|
||||
; -- End function
|
||||
.section .AMDGPU.csdata
|
||||
; Kernel info:
|
||||
; codeLenInByte = 4872
|
||||
; NumSgprs: 76
|
||||
; NumVgprs: 64
|
||||
; NumAgprs: 0
|
||||
; TotalNumVgprs: 64
|
||||
; ScratchSize: 60
|
||||
; <...>
|
||||
; AccumOffset: 64
|
||||
; Occupancy: 8
|
||||
|
||||
Analyzing this workload yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/occupancy/mi200/ -b 2.1.15 6.2 7.1.5 7.1.6 7.1.7 7.1.8 7.1.9 --dispatch 5
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═════════════════════════╤═════════╤══════════════╤══════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═════════════════════════╪═════════╪══════════════╪══════════════╪══════════════╪════════╡
|
||||
│ 0 │ sgprbound(int, double*) │ 1.00 │ 782069812.00 │ 782069812.00 │ 782069812.00 │ 100.00 │
|
||||
╘════╧═════════════════════════╧═════════╧══════════════╧══════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
2. System Speed-of-Light
|
||||
2.1 Speed-of-Light
|
||||
╒═════════╤═════════════════════╤═════════╤════════════╤═════════╤═══════════════╕
|
||||
│ Index │ Metric │ Avg │ Unit │ Peak │ Pct of Peak │
|
||||
╞═════════╪═════════════════════╪═════════╪════════════╪═════════╪═══════════════╡
|
||||
│ 2.1.15 │ Wavefront Occupancy │ 3291.76 │ Wavefronts │ 3328.00 │ 98.91 │
|
||||
╘═════════╧═════════════════════╧═════════╧════════════╧═════════╧═══════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
6. Workgroup Manager (SPI)
|
||||
6.2 Workgroup Manager - Resource Allocation
|
||||
╒═════════╤════════════════════════════════════════╤═══════╤═══════╤═══════╤════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════════════════════════╪═══════╪═══════╪═══════╪════════╡
|
||||
│ 6.2.0 │ Not-scheduled Rate (Workgroup Manager) │ 7.72 │ 7.72 │ 7.72 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.1 │ Not-scheduled Rate (Scheduler-Pipe) │ 15.17 │ 15.17 │ 15.17 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.2 │ Scheduler-Pipe Stall Rate │ 7.38 │ 7.38 │ 7.38 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.3 │ Scratch Stall Rate │ 39.76 │ 39.76 │ 39.76 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.4 │ Insufficient SIMD Waveslots │ 26.32 │ 26.32 │ 26.32 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.5 │ Insufficient SIMD VGPRs │ 26.32 │ 26.32 │ 26.32 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.6 │ Insufficient SIMD SGPRs │ 25.52 │ 25.52 │ 25.52 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.7 │ Insufficient CU LDS │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.8 │ Insufficient CU Barriers │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.9 │ Reached CU Workgroup Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
├─────────┼────────────────────────────────────────┼───────┼───────┼───────┼────────┤
|
||||
│ 6.2.10 │ Reached CU Wavefront Limit │ 0.00 │ 0.00 │ 0.00 │ Pct │
|
||||
╘═════════╧════════════════════════════════════════╧═══════╧═══════╧═══════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
7. Wavefront
|
||||
7.1 Wavefront Launch Stats
|
||||
╒═════════╤════════════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||||
│ 7.1.5 │ VGPRs │ 64.00 │ 64.00 │ 64.00 │ Registers │
|
||||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||||
│ 7.1.6 │ AGPRs │ 0.00 │ 0.00 │ 0.00 │ Registers │
|
||||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||||
│ 7.1.7 │ SGPRs │ 80.00 │ 80.00 │ 80.00 │ Registers │
|
||||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||||
│ 7.1.8 │ LDS Allocation │ 0.00 │ 0.00 │ 0.00 │ Bytes │
|
||||
├─────────┼────────────────────┼───────┼───────┼───────┼────────────────┤
|
||||
│ 7.1.9 │ Scratch Allocation │ 60.00 │ 60.00 │ 60.00 │ Bytes/workitem │
|
||||
╘═════════╧════════════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||||
|
||||
Here we see that our wavefront launch stats (**7.1**) have changed to
|
||||
reflect the metadata seen in the ``--save-temps`` output. Of particular
|
||||
interest, we see:
|
||||
|
||||
* The SGPR allocation (**7.1.7**) is 80 registers, slightly more than the 76
|
||||
requested by the compiler due to allocation granularity, and
|
||||
|
||||
* We have a :ref:`"scratch" <memory-spaces>`, that is, private memory,
|
||||
allocation of 60 bytes per work-item.
|
||||
|
||||
Analyzing the resource allocation block (**6.2**) we now see that for the
|
||||
first time, the "Not-scheduled Rate (Workgroup Manager)" metric (**6.2.0**)
|
||||
has become non-zero. This is because the workgroup manager is
|
||||
responsible for management of scratch, which we see also contributes to
|
||||
our occupancy limiters in the "Scratch Stall Rate" (**6.2.3**). Note that
|
||||
the sum of the workgroup manager not-scheduled rate and the
|
||||
scheduler-pipe non-scheduled rate is still :math:`\sim25\%`, as in our
|
||||
previous examples.
|
||||
|
||||
Next, we see that the scheduler-pipe stall rate (**6.2.2**), that is, how often
|
||||
we could not schedule a workgroup to a CU, was only about
|
||||
:math:`\sim8\%`. This hints that perhaps, our kernel is not
|
||||
*particularly* occupancy limited by resources. Indeed, checking the
|
||||
wave occupancy metric (**2.1.15**) shows that this kernel is reaching nearly
|
||||
99% occupancy.
|
||||
|
||||
Finally, we inspect the occupancy limiter metrics and see a roughly even
|
||||
split between :ref:`waveslots <desc-valu>` (**6.2.4**), :ref:`VGPRs <desc-valu>`
|
||||
(**6.2.5**), and :ref:`SGPRs <desc-salu>` (**6.2.6**) along with the scratch stalls
|
||||
(**6.2.3**) previously mentioned.
|
||||
|
||||
This is yet another reminder to view occupancy holistically. While these
|
||||
metrics tell you why a workgroup cannot be scheduled, they do *not* tell
|
||||
you what your occupancy was (consult wavefront occupancy) *nor* whether
|
||||
increasing occupancy will be beneficial to performance.
|
||||
+113
@@ -0,0 +1,113 @@
|
||||
.. _valu-arith-instruction-mix-ex:
|
||||
|
||||
VALU arithmetic instruction mix
|
||||
===============================
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`instruction mix sample <instmix.hip>` distributed as a part
|
||||
of Omniperf.
|
||||
|
||||
.. note::
|
||||
|
||||
The examples in the section are expected to work on all CDNA™ accelerators.
|
||||
However, the actual experiment results in this section were collected on an
|
||||
:ref:`MI2XX <mixxx-note>` accelerator.
|
||||
|
||||
.. _valu-experiment-design:
|
||||
|
||||
Design note
|
||||
-----------
|
||||
|
||||
This code uses a number of inline assembly instructions to cleanly
|
||||
identify the types of instructions being issued, as well as to avoid
|
||||
optimization / dead-code elimination by the compiler. While inline
|
||||
assembly is inherently not portable, this example is expected to work on
|
||||
all GCN™ GPUs and CDNA accelerators.
|
||||
|
||||
We reproduce a sample of the kernel as follows:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// fp32: add, mul, transcendental and fma
|
||||
float f1, f2;
|
||||
asm volatile(
|
||||
"v_add_f32_e32 %0, %1, %0\n"
|
||||
"v_mul_f32_e32 %0, %1, %0\n"
|
||||
"v_sqrt_f32 %0, %1\n"
|
||||
"v_fma_f32 %0, %1, %0, %1\n"
|
||||
: "=v"(f1)
|
||||
: "v"(f2));
|
||||
|
||||
These instructions correspond to:
|
||||
|
||||
* A 32-bit floating point addition,
|
||||
|
||||
* a 32-bit floating point multiplication,
|
||||
|
||||
* a 32-bit floating point square-root transcendental operation, and
|
||||
|
||||
* a 32-bit floating point fused multiply-add operation.
|
||||
|
||||
For more detail, refer to the `CDNA2 ISA
|
||||
Guide <https://www.amd.com/system/files/TechDocs/instinct-mi200-cdna2-instruction-set-architecture.pdf>`__.
|
||||
|
||||
Instruction mix
|
||||
^^^^^^^^^^^^^^^
|
||||
|
||||
This example was compiled and run on a MI250 accelerator using ROCm
|
||||
v5.6.0, and Omniperf v2.0.0.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ hipcc -O3 instmix.hip -o instmix
|
||||
|
||||
Generate the profile for this example using the following command.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf profile -n instmix --no-roof -- ./instmix
|
||||
|
||||
Analyze the instruction mix section.
|
||||
|
||||
.. code-block:: shell
|
||||
|
||||
$ omniperf analyze -p workloads/instmix/mi200/ -b 10.2
|
||||
<...>
|
||||
10. Compute Units - Instruction Mix
|
||||
10.2 VALU Arithmetic Instr Mix
|
||||
╒═════════╤════════════╤═════════╤════════════════╕
|
||||
│ Index │ Metric │ Count │ Unit │
|
||||
╞═════════╪════════════╪═════════╪════════════════╡
|
||||
│ 10.2.0 │ INT32 │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.1 │ INT64 │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.2 │ F16-ADD │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.3 │ F16-MUL │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.4 │ F16-FMA │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.5 │ F16-Trans │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.6 │ F32-ADD │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.7 │ F32-MUL │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.8 │ F32-FMA │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.9 │ F32-Trans │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.10 │ F64-ADD │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.11 │ F64-MUL │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.12 │ F64-FMA │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.13 │ F64-Trans │ 1.00 │ Instr per wave │
|
||||
├─────────┼────────────┼─────────┼────────────────┤
|
||||
│ 10.2.14 │ Conversion │ 1.00 │ Instr per wave │
|
||||
╘═════════╧════════════╧═════════╧════════════════╛
|
||||
|
||||
This shows that we have exactly one of each type of VALU arithmetic instruction
|
||||
by construction.
|
||||
+698
@@ -0,0 +1,698 @@
|
||||
.. _vmem-example:
|
||||
|
||||
Vector memory operation counting
|
||||
================================
|
||||
|
||||
.. _flat-memory-ex:
|
||||
|
||||
Global / Generic (FLAT)
|
||||
-----------------------
|
||||
|
||||
For this example, consider the
|
||||
:dev-sample:`vector memory sample <vmem.hip>` distributed as a part of
|
||||
Omniperf. This code launches many different versions of a simple
|
||||
read/write/atomic-only kernels targeting various address spaces. For example,
|
||||
below is our simple ``global_write`` kernel:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// write to a global pointer
|
||||
__global__ void global_write(int* ptr, int zero) {
|
||||
ptr[threadIdx.x] = zero;
|
||||
}
|
||||
|
||||
.. note::
|
||||
|
||||
This example was compiled and run on an MI250 accelerator using ROCm
|
||||
v5.6.0, and Omniperf v2.0.0.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ hipcc -O3 --save-temps vmem.hip -o vmem
|
||||
|
||||
We have also chosen to include the ``--save-temps`` flag to save the
|
||||
compiler temporary files, such as the generated CDNA assembly code, for
|
||||
inspection.
|
||||
|
||||
Finally, we generate our ``omniperf profile`` as follows.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n vmem --no-roof -- ./vmem
|
||||
|
||||
.. _flat-experiment-design:
|
||||
|
||||
Design note
|
||||
^^^^^^^^^^^
|
||||
|
||||
This section explains some of the more peculiar lines of code in the
|
||||
example, for example, the use of compiler built-ins and explicit address space
|
||||
casting, and so forth.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
// write to a generic pointer
|
||||
typedef int __attribute__((address_space(0)))* generic_ptr;
|
||||
|
||||
__attribute__((noinline)) __device__ void generic_store(generic_ptr ptr, int zero) { *ptr = zero; }
|
||||
|
||||
__global__ void generic_write(int* ptr, int zero, int filter) {
|
||||
__shared__ int lds[1024];
|
||||
int* generic = (threadIdx.x < filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
|
||||
generic_store((generic_ptr)generic, zero);
|
||||
}
|
||||
|
||||
One of the aims of this example is to demonstrate the use of the
|
||||
:llvm-docs:`"generic" FLAT <address-space-identifier>` address space. This
|
||||
address space is typically used when the compiler cannot statically prove where
|
||||
the backing memory is located.
|
||||
|
||||
To try to *force* the compiler to use this address space, we applied
|
||||
``__attribute__((noinline))`` to the ``generic_store`` function to have the
|
||||
compiler treat it as a function call (that is, on the other side of which, the
|
||||
address space may not be known). However, in a trivial example such as this, the
|
||||
compiler may choose to specialize the ``generic_store`` function to the two
|
||||
address spaces that might provably be used from our translation unit, that is,
|
||||
:ref:`"local" (or, LDS) <memory-spaces>` and :ref:`"global" <memory-spaces>`.
|
||||
Hence, we forcibly cast the address space to
|
||||
:ref:`"generic" (or, FLAT) <memory-spaces>` to avoid this compiler
|
||||
optimization.
|
||||
|
||||
.. warning::
|
||||
|
||||
While convenient for this example, this sort of explicit address space
|
||||
casting can lead to strange compilation errors, and in the worst case,
|
||||
incorrect results. As a result, use is discouraged in production code.
|
||||
|
||||
For more details on address spaces, refer to
|
||||
:ref:`memory-spaces`.
|
||||
|
||||
Global write
|
||||
^^^^^^^^^^^^
|
||||
|
||||
First, we demonstrate our simple ``global_write`` kernel:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 1 -b 10.3 15.1.4 15.1.5 15.1.6 15.1.7 15.1.8 15.1.9 15.1.10 15.1.11 -n per_kernel
|
||||
<...>
|
||||
--------------------------------------------------------------------------------
|
||||
0. Top Stat
|
||||
╒════╤═════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪═════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ global_write(int*, int) [clone .kd] │ 1.00 │ 2400.00 │ 2400.00 │ 2400.00 │ 100.00 │
|
||||
╘════╧═════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
15. Address Processing Unit and Data Return Path (TA/TD)
|
||||
15.1 Address Processing Unit
|
||||
╒═════════╤═════════════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 15.1.4 │ Total Instructions │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.5 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.6 │ Global/Generic Read Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.7 │ Global/Generic Write Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.8 │ Global/Generic Atomic Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.9 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.10 │ Spill/Stack Read Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼─────────────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 15.1.11 │ Spill/Stack Write Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═════════════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
Here, we have presented both the information in the VMEM Instruction Mix
|
||||
table (**10.3**) and the Address Processing Unit (**15.1**). We note that this
|
||||
data is expected to be identical, and hence we omit table 15.1 in our
|
||||
subsequent examples.
|
||||
|
||||
In addition, as expected, we see a single Global/Generic Write
|
||||
instruction (**10.3.2**, **15.1.7**). Inspecting the generated assembly, we get:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
.protected _Z12global_writePii ; -- Begin function _Z12global_writePii
|
||||
.globl _Z12global_writePii
|
||||
.p2align 8
|
||||
.type _Z12global_writePii,@function
|
||||
_Z12global_writePii: ; @_Z12global_writePii
|
||||
; %bb.0:
|
||||
s_load_dword s2, s[4:5], 0x8
|
||||
s_load_dwordx2 s[0:1], s[4:5], 0x0
|
||||
v_lshlrev_b32_e32 v0, 2, v0
|
||||
s_waitcnt lgkmcnt(0)
|
||||
v_mov_b32_e32 v1, s2
|
||||
global_store_dword v0, v1, s[0:1]
|
||||
s_endpgm
|
||||
.section .rodata,#alloc
|
||||
.p2align 6, 0x0
|
||||
.amdhsa_kernel _Z12global_writePii
|
||||
|
||||
Notice that this corresponds to an instance of a ``global_store_dword``
|
||||
operation.
|
||||
|
||||
.. note::
|
||||
|
||||
The assembly in these experiments were generated for an
|
||||
:ref:`MI2XX <mixxx-note>` accelerator using ROCm 5.6.0, and may change
|
||||
depending on ROCm versions and the targeted hardware architecture.
|
||||
|
||||
.. _generic-write-ex:
|
||||
|
||||
Generic write to LDS
|
||||
^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
Next, we examine a generic write. As discussed
|
||||
:ref:`previously <flat-experiment-design>`, our ``generic_write`` kernel uses an
|
||||
address space cast to *force* the compiler to choose our desired address
|
||||
space, regardless of other optimizations that may be possible.
|
||||
|
||||
Also note that the ``filter`` parameter passed in as a kernel argument (see
|
||||
:dev-sample:`example <vmem.hip>` and
|
||||
:ref:`design note <flat-experiment-design>`) is set to zero on the host, such
|
||||
that we always write to the :doc:`local </conceptual/local-data-share>` (LDS)
|
||||
memory allocation ``lds``.
|
||||
|
||||
Examining this kernel in the VMEM Instruction Mix table yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 10.3 -n per_kernel
|
||||
<...>
|
||||
0. Top Stat
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ generic_write(int*, int, int) [clone .kd │ 1.00 │ 2880.00 │ 2880.00 │ 2880.00 │ 100.00 │
|
||||
│ │ ] │ │ │ │ │ │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
As expected we see a single generic write (**10.3.2**). In the assembly
|
||||
generated for this kernel (in particular, we care about the
|
||||
``generic_store`` function), we see that this corresponds to a
|
||||
``flat_store_dword`` instruction:
|
||||
|
||||
.. code-block:: asm
|
||||
|
||||
.type _Z13generic_storePii,@function
|
||||
_Z13generic_storePii: ; @_Z13generic_storePii
|
||||
; %bb.0:
|
||||
s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
|
||||
flat_store_dword v[0:1], v2
|
||||
s_waitcnt vmcnt(0) lgkmcnt(0)
|
||||
s_setpc_b64 s[30:31]
|
||||
.Lfunc_end0:
|
||||
|
||||
In addition, we note that we can observe the destination of this request
|
||||
by looking at the LDS Instructions metric (**12.2.0**) -- which indicates one LDS
|
||||
access.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 2 -b 12.2.0 -n per_kernel
|
||||
<...>
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 12.2.0 │ LDS Instrs │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
.. note::
|
||||
|
||||
Exercise for the reader: if this access had been targeted at global memory
|
||||
(for instance, by changing value of ``filter``), where should we look for the
|
||||
memory traffic? Hint: see the :ref:`generic read <generic-read-ex>` example.
|
||||
|
||||
.. _global-read-ex:
|
||||
|
||||
Global read
|
||||
^^^^^^^^^^^
|
||||
|
||||
Next, we examine a simple global read operation:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
__global__ void global_read(int* ptr, int zero) {
|
||||
int x = ptr[threadIdx.x];
|
||||
if (x != zero) {
|
||||
ptr[threadIdx.x] = x + 1;
|
||||
}
|
||||
}
|
||||
|
||||
Here we observe a now familiar pattern:
|
||||
|
||||
- Read a value in from global memory.
|
||||
|
||||
- Have a write hidden behind a conditional that is impossible for
|
||||
the compiler to statically eliminate, but is identically false. In this
|
||||
case, our ``main()`` function initializes the data in ``ptr`` to zero.
|
||||
|
||||
Running Omniperf on this kernel yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 3 -b 10.3 -n per_kernel
|
||||
<...>
|
||||
0. Top Stat
|
||||
╒════╤════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ global_read(int*, int) [clone .kd] │ 1.00 │ 4480.00 │ 4480.00 │ 4480.00 │ 100.00 │
|
||||
╘════╧════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
Here we see a single global/generic instruction (**10.3.0**) which, as
|
||||
expected, is a read (**10.3.1**).
|
||||
|
||||
.. _generic-read-ex:
|
||||
|
||||
Generic read from global memory
|
||||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
For our generic read example, we choose to change our target for the
|
||||
generic read to be global memory:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
__global__ void generic_read(int* ptr, int zero, int filter) {
|
||||
__shared__ int lds[1024];
|
||||
if (static_cast<int>(filter - 1) == zero) {
|
||||
lds[threadIdx.x] = 0; // initialize to zero to avoid conditional, but hide behind _another_ conditional
|
||||
}
|
||||
int* generic;
|
||||
if (static_cast<int>(threadIdx.x) > filter - 1) {
|
||||
generic = &ptr[threadIdx.x];
|
||||
} else {
|
||||
generic = &lds[threadIdx.x];
|
||||
abort();
|
||||
}
|
||||
int x = generic_load((generic_ptr)generic);
|
||||
if (x != zero) {
|
||||
ptr[threadIdx.x] = x + 1;
|
||||
}
|
||||
}
|
||||
|
||||
In addition to our usual ``if (condition_that_wont_happen)`` guard
|
||||
around the write operation, there is an additional conditional around
|
||||
the initialization of the ``lds`` buffer. We note that it’s typically
|
||||
required to write to this buffer to prevent the compiler from
|
||||
eliminating the local memory branch entirely due to undefined behavior
|
||||
(use of an uninitialized value). However, to report *only* our global
|
||||
memory read, we again hide this initialization behind an identically
|
||||
false conditional (both ``zero`` and ``filter`` are set to zero in the
|
||||
kernel launch). Note that this is a *different* conditional from our
|
||||
pointer assignment (to avoid combination of the two).
|
||||
|
||||
Running Omniperf on this kernel reports:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 4 -b 10.3 12.2.0 16.3.10 -n per_kernel
|
||||
<...>
|
||||
0. Top Stat
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ generic_read(int*, int, int) [clone .kd] │ 1.00 │ 2240.00 │ 2240.00 │ 2240.00 │ 100.00 │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 12.2.0 │ LDS Instrs │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
16. Vector L1 Data Cache
|
||||
16.3 L1D Cache Accesses
|
||||
╒═════════╤════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||||
│ 16.3.10 │ L1-L2 Read │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
|
||||
╘═════════╧════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||||
|
||||
Here we observe:
|
||||
|
||||
- A single global/generic read operation (**10.3.1**), which
|
||||
|
||||
- Is not an LDS instruction (**12.2**), as seen in the
|
||||
:ref:`generic write <generic-write-ex>` example, but is instead
|
||||
|
||||
- An L1-L2 read operation (**16.3.10**)
|
||||
|
||||
That is, we have successfully targeted our generic read at global
|
||||
memory. Inspecting the assembly shows this corresponds to a
|
||||
``flat_load_dword`` instruction.
|
||||
|
||||
.. _global-atomic-ex:
|
||||
|
||||
Global atomic
|
||||
^^^^^^^^^^^^^
|
||||
|
||||
Our global atomic kernel simply atomically adds a (non-compile-time) zero value
|
||||
to a pointer.
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
__global__ void global_atomic(int* ptr, int zero) {
|
||||
atomicAdd(ptr, zero);
|
||||
}
|
||||
|
||||
|
||||
Running Omniperf on this kernel yields:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 5 -b 10.3 16.3.12 -n per_kernel
|
||||
<...>
|
||||
0. Top Stat
|
||||
╒════╤══════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ global_atomic(int*, int) [clone .kd] │ 1.00 │ 4640.00 │ 4640.00 │ 4640.00 │ 100.00 │
|
||||
╘════╧══════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
16. Vector L1 Data Cache
|
||||
16.3 L1D Cache Accesses
|
||||
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||||
│ 16.3.12 │ L1-L2 Atomic │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
|
||||
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||||
|
||||
Here we see a single global/generic atomic instruction (**10.3.3**), which
|
||||
corresponds to an L1-L2 atomic request (**16.3.12**).
|
||||
|
||||
.. _generic-mixed-atomic-ex:
|
||||
|
||||
Generic, mixed atomic
|
||||
^^^^^^^^^^^^^^^^^^^^^
|
||||
|
||||
In our final global/generic example, we look at a case where our generic
|
||||
operation targets both LDS and global memory:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
__global__ void generic_atomic(int* ptr, int filter, int zero) {
|
||||
__shared__ int lds[1024];
|
||||
int* generic = (threadIdx.x % 2 == filter) ? &ptr[threadIdx.x] : &lds[threadIdx.x];
|
||||
generic_atomic((generic_ptr)generic, zero);
|
||||
}
|
||||
|
||||
This assigns every other work-item to atomically update global memory or
|
||||
local memory.
|
||||
|
||||
Running this kernel through Omniperf shows:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf analyze -p workloads/vmem/mi200/ --dispatch 6 -b 10.3 12.2.0 16.3.12 -n per_kernel
|
||||
<...>
|
||||
0. Top Stat
|
||||
╒════╤══════════════════════════════════════════╤═════════╤═══════════╤════════════╤══════════════╤════════╕
|
||||
│ │ KernelName │ Count │ Sum(ns) │ Mean(ns) │ Median(ns) │ Pct │
|
||||
╞════╪══════════════════════════════════════════╪═════════╪═══════════╪════════════╪══════════════╪════════╡
|
||||
│ 0 │ generic_atomic(int*, int, int) [clone .k │ 1.00 │ 3360.00 │ 3360.00 │ 3360.00 │ 100.00 │
|
||||
│ │ d] │ │ │ │ │ │
|
||||
╘════╧══════════════════════════════════════════╧═════════╧═══════════╧════════════╧══════════════╧════════╛
|
||||
|
||||
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
12. Local Data Share (LDS)
|
||||
12.2 LDS Stats
|
||||
╒═════════╤════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 12.2.0 │ LDS Instrs │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
╘═════════╧════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
16. Vector L1 Data Cache
|
||||
16.3 L1D Cache Accesses
|
||||
╒═════════╤══════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪══════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||||
│ 16.3.12 │ L1-L2 Atomic │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
|
||||
╘═════════╧══════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||||
|
||||
That is, we see:
|
||||
|
||||
- A single generic atomic instruction (**10.3.3**) that maps to both
|
||||
|
||||
- An LDS instruction (**12.2.0**), and
|
||||
|
||||
- An L1-L2 atomic request (**16.3**)
|
||||
|
||||
We have demonstrated the ability of the generic address space to
|
||||
*dynamically* target different backing memory.
|
||||
|
||||
.. _spill-scratch:
|
||||
|
||||
Spill/Scratch (BUFFER)
|
||||
----------------------
|
||||
|
||||
Next we examine the use of "Spill/Scratch" memory. On current CDNA
|
||||
accelerators such as the :ref:`MI2XX <mixxx-note>`, this is implemented using
|
||||
the :ref:`private <memory-spaces>` memory space, which maps to
|
||||
:llvm-docs:`"scratch" memory <amdgpu-address-spaces>` in AMDGPU hardware
|
||||
terminology. This type of memory can be accessed via different instructions
|
||||
depending on the specific architecture targeted. However, current CDNA
|
||||
accelerators such as the :ref:`MI2XX <mixxx-note>` use so called ``buffer``
|
||||
instructions to access private memory in a simple (and typically) coalesced
|
||||
manner. See
|
||||
:mi200-isa-pdf:`Sec. 9.1, "Vector Memory Buffer Instructions" of the CDNA2 ISA guide <>`
|
||||
for further reading on this instruction type.
|
||||
|
||||
We develop a `simple
|
||||
kernel <https://github.com/ROCm/omniperf/blob/dev/sample/stack.hip>`__
|
||||
that uses stack memory:
|
||||
|
||||
.. code-block:: cpp
|
||||
|
||||
#include <hip/hip_runtime.h>
|
||||
__global__ void knl(int* out, int filter) {
|
||||
int x[1024];
|
||||
x[filter] = 0;
|
||||
if (threadIdx.x < filter)
|
||||
out[threadIdx.x] = x[threadIdx.x];
|
||||
}
|
||||
|
||||
Our strategy here is to:
|
||||
|
||||
* Create a large stack buffer (that cannot reasonably fit into registers) - Write to a compile-time unknown
|
||||
location on the stack, and then
|
||||
|
||||
* Behind the typical compile-time unknown ``if(condition_that_wont_happen)``
|
||||
|
||||
* Read from a different, compile-time unknown, location on the stack and write
|
||||
to global memory to prevent the compiler from optimizing it out.
|
||||
|
||||
This example was compiled and run on an MI250 accelerator using ROCm v5.6.0, and
|
||||
Omniperf v2.0.0.
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ hipcc -O3 stack.hip -o stack.hip
|
||||
|
||||
And profiled using Omniperf:
|
||||
|
||||
.. code-block:: shell-session
|
||||
|
||||
$ omniperf profile -n stack --no-roof -- ./stack
|
||||
<...>
|
||||
$ omniperf analyze -p workloads/stack/mi200/ -b 10.3 16.3.11 -n per_kernel
|
||||
<...>
|
||||
10. Compute Units - Instruction Mix
|
||||
10.3 VMEM Instr Mix
|
||||
╒═════════╤═══════════════════════╤═══════╤═══════╤═══════╤══════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═══════════════════════╪═══════╪═══════╪═══════╪══════════════════╡
|
||||
│ 10.3.0 │ Global/Generic Instr │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.1 │ Global/Generic Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.2 │ Global/Generic Write │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.3 │ Global/Generic Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.4 │ Spill/Stack Instr │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.5 │ Spill/Stack Read │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.6 │ Spill/Stack Write │ 1.00 │ 1.00 │ 1.00 │ Instr per kernel │
|
||||
├─────────┼───────────────────────┼───────┼───────┼───────┼──────────────────┤
|
||||
│ 10.3.7 │ Spill/Stack Atomic │ 0.00 │ 0.00 │ 0.00 │ Instr per kernel │
|
||||
╘═════════╧═══════════════════════╧═══════╧═══════╧═══════╧══════════════════╛
|
||||
|
||||
|
||||
--------------------------------------------------------------------------------
|
||||
16. Vector L1 Data Cache
|
||||
16.3 L1D Cache Accesses
|
||||
╒═════════╤═════════════╤═══════╤═══════╤═══════╤════════════════╕
|
||||
│ Index │ Metric │ Avg │ Min │ Max │ Unit │
|
||||
╞═════════╪═════════════╪═══════╪═══════╪═══════╪════════════════╡
|
||||
│ 16.3.11 │ L1-L2 Write │ 1.00 │ 1.00 │ 1.00 │ Req per kernel │
|
||||
╘═════════╧═════════════╧═══════╧═══════╧═══════╧════════════════╛
|
||||
|
||||
Here we see a single write to the stack (**10.3.6**), which corresponds to
|
||||
an L1-L2 write request (**16.3.11**), that is, the stack is backed by global
|
||||
memory and travels through the same memory hierarchy.
|
||||
@@ -0,0 +1,22 @@
|
||||
.. meta::
|
||||
:description: Omniperf external training resources
|
||||
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
|
||||
training, examples
|
||||
|
||||
******************
|
||||
Learning resources
|
||||
******************
|
||||
|
||||
This section is a catalog of external resources and third-party content that
|
||||
can help you learn Omniperf. Some areas of the following content might be
|
||||
outdated.
|
||||
|
||||
Introduction to Omniperf
|
||||
:fab:`youtube` `AMD profiling workshop (Pawsey Supercomputing Research Centre) <https://www.youtube.com/watch?v=9AkxBCiInCw>`_
|
||||
|
||||
Omniperf example exercises
|
||||
`<https://github.com/amd/HPCTrainingExamples/tree/main/OmniperfExamples>`__
|
||||
|
||||
AMD Instinct™ tuning guides
|
||||
:doc:`rocm:how-to/tuning-guides/mi300x/workload`
|
||||
|
||||
@@ -0,0 +1,23 @@
|
||||
.. meta::
|
||||
:description: Omniperf: Profiling by example
|
||||
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD
|
||||
|
||||
********************
|
||||
Profiling by example
|
||||
********************
|
||||
|
||||
The following examples refer to sample :doc:`HIP <hip:index>` code located in
|
||||
:fab:`github` :dev-sample:`ROCm/omniperf/blob/dev/sample <>` and distributed
|
||||
as part of Omniperf.
|
||||
|
||||
.. include:: ./includes/valu-arithmetic-instruction-mix.rst
|
||||
|
||||
.. include:: ./includes/infinity-fabric-transactions.rst
|
||||
|
||||
.. include:: ./includes/vector-memory-operation-counting.rst
|
||||
|
||||
.. include:: ./includes/instructions-per-cycle-and-utilizations.rst
|
||||
|
||||
.. include:: ./includes/lds-examples.rst
|
||||
|
||||
.. include:: ./includes/occupancy-limiters-example.rst
|
||||
Ссылка в новой задаче
Block a user