.. meta::
:description: ROCm Compute Profiler performance model: Shader engine (SE)
:keywords: Omniperf, ROCm Compute Profiler, ROCm, profiler, tool, Instinct, accelerator, shader, engine, sL1D, L1I, workgroup manager, SPI
******************
Shader engine (SE)
******************
The :doc:`compute units <compute-unit>` on a CDNA™ accelerator are grouped
together into a higher-level organizational unit called a shader engine (SE):
.. figure:: ../data/performance-model/selayout.png
:align: center
:alt: Example of CU-grouping into shader engines
:width: 800
Example of CU-grouping into shader engines on AMD Instinct MI-series
accelerators.
The number of CUs on an SE varies from chip to chip -- see, for example,
:hip-training-pdf:`20`. The number of SEs varies as well: newer accelerators
such as the AMD Instinct™ MI250X have 8 SEs per accelerator.
For the purposes of ROCm Compute Profiler, we consider resources that are shared between
multiple CUs on a single SE as part of the SE's metrics.
These include:
* The :ref:`scalar L1 data cache <desc-sl1d>`
* The :ref:`L1 instruction cache <desc-l1i>`
* The :ref:`workgroup manager <desc-spi>`
.. _desc-sl1d:
Scalar L1 data cache (sL1D)
===========================
The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load
instructions (and scalar store instructions on architectures where they exist)
from wavefronts in the :doc:`CUs <compute-unit>`. The sL1D is shared between
multiple CUs (:gcn-crash-course:`36`) -- the exact number of CUs depends on the
architecture in question (3 CUs in GCN™ GPUs and MI100, 2 CUs in
:ref:`MI2XX <mixxx-note>`) -- and is backed by the :doc:`L2 cache <l2-cache>`.
In typical usage, the data in the sL1D consists of:
* Kernel arguments, such as pointers,
`non-populated <https://llvm.org/docs/AMDGPUUsage.html#amdgpu-amdhsa-sgpr-register-set-up-order-table>`_
grid and block dimensions, and others
* HIP's ``__constant__`` memory, when accessed in a provably uniform manner
[#uniform-access]_
* Other memory, when accessed in a provably uniform manner, *and* the backing
memory is provably constant [#uniform-access]_
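To make "provably uniform" concrete, the following Python sketch models the
property the compiler has to establish: every work-item in a wavefront reads
the same address. It is an illustration only. The function name
``is_uniform_access``, the 64-lane wavefront width, and the addresses are
assumptions made for this example, and a real compiler proves uniformity
statically rather than inspecting run-time addresses.

```python
WAVEFRONT_SIZE = 64  # assumed wavefront width for this illustration


def is_uniform_access(lane_addresses):
    """Return True if every lane in the wavefront reads the same address."""
    return len(set(lane_addresses)) == 1


# All lanes read the same location (for example, a kernel argument):
# a candidate for a scalar load served by the sL1D.
uniform = [0x1000] * WAVEFRONT_SIZE

# Each lane indexes by its lane ID: this needs a vector load instead.
divergent = [0x1000 + 4 * lane for lane in range(WAVEFRONT_SIZE)]

print(is_uniform_access(uniform))    # True
print(is_uniform_access(divergent))  # False
```

Uniformity alone is not sufficient: as the list above states, the backing
memory must also be provably constant.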
.. _desc-sl1d-sol:
Scalar L1D Speed-of-Light
-------------------------
.. warning::
The theoretical maximum throughput for some metrics in this section is
currently computed with the maximum achievable clock frequency, as reported
by ``rocminfo``, for an accelerator. This may not be realistic for all
workloads.
The Scalar L1D speed-of-light chart shows some key metrics of the sL1D
cache as a comparison with the peak achievable values of those metrics:
.. jinja:: desc-sl1d-sol
:file: _templates/metrics_table.j2
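A speed-of-light comparison of this kind reduces to a ratio of an achieved
rate to a theoretical peak derived from the maximum clock frequency. The
sketch below shows that arithmetic only. The function name
``percent_of_peak``, its parameters, and the numbers are hypothetical and are
not taken from the profiler's actual metric definitions.

```python
def percent_of_peak(achieved_per_sec, peak_per_cycle, max_clock_mhz):
    """Express an achieved rate as a percentage of the theoretical peak.

    The peak assumes the unit runs at its maximum clock for the whole
    kernel, which is why it can overstate what a real workload reaches.
    """
    peak_per_sec = peak_per_cycle * max_clock_mhz * 1e6
    return 100.0 * achieved_per_sec / peak_per_sec


# Hypothetical numbers: a peak of 16 bytes/cycle at 1700 MHz,
# and 13.6 GB/s achieved by the kernel.
print(percent_of_peak(13.6e9, 16, 1700))  # 50.0
```

If the accelerator runs below its maximum clock during the kernel, the same
achieved rate corresponds to a higher fraction of what was attainable, which
is the caveat raised in the warning above.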
.. _desc-sl1d-stats:
Scalar L1D cache accesses
-------------------------
This panel gives more detail on the types of accesses made to the sL1D,
and the hit/miss statistics.
.. jinja:: desc-sl1d-stats
:file: _templates/metrics_table.j2
.. _desc-sl1d-l2-interface:
sL1D ↔ L2 interface
-------------------
This panel gives more detail on the data requested across the
sL1D ↔ :doc:`L2 <l2-cache>` interface.
.. jinja:: desc-sl1d-l2-interface
:file: _templates/metrics_table.j2
.. rubric:: Footnotes
.. [#uniform-access] The scalar data cache is used when the compiler emits
scalar loads to access data. This requires that the data be *provably*
uniformly accessed (that is, the compiler can verify that all work-items in a
wavefront access the same data), *and* that the data can be proven to be
read-only (for instance, HIP's ``__constant__`` memory, or properly
``__restrict__``\ed pointers to avoid write-aliasing). For example, an access
to ``__constant__`` memory is not guaranteed to go through the sL1D if the
wavefront loads a non-uniform value.
.. [#sl1d-cache] Unlike the :doc:`vL1D <vector-l1-cache>` and
:doc:`L2 <l2-cache>` caches, the sL1D cache on AMD Instinct MI-series CDNA
accelerators does *not* use the "hit-on-miss" approach to reporting cache
hits. That is, if another request arrives for a cache line whose miss is still
being serviced, that request is counted as a *duplicated miss* rather than as
a hit.
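The difference between the two accounting policies can be seen in a small
Python model. This is a sketch under simplifying assumptions, not a
description of the hardware: the function ``count_accesses``, the fixed fill
latency, and the absence of evictions are all invented for illustration.

```python
def count_accesses(requests, fill_latency, hit_on_miss):
    """Count hits and misses for a stream of (cycle, line) requests.

    A miss at cycle t leaves the line pending until t + fill_latency.
    A request to a pending line is counted as a hit when hit_on_miss is
    True, and as a duplicated miss otherwise. Lines are never evicted.
    """
    ready_at = {}  # cache line -> cycle at which its fill completes
    counts = {"hit": 0, "miss": 0, "duplicated_miss": 0}
    for cycle, line in requests:
        if line not in ready_at:
            ready_at[line] = cycle + fill_latency
            counts["miss"] += 1
        elif cycle < ready_at[line]:
            counts["hit" if hit_on_miss else "duplicated_miss"] += 1
        else:
            counts["hit"] += 1
    return counts


# Three requests to line A: a miss, one while the fill is pending, one after.
stream = [(0, "A"), (5, "A"), (200, "A")]
print(count_accesses(stream, fill_latency=100, hit_on_miss=True))
print(count_accesses(stream, fill_latency=100, hit_on_miss=False))
```

With the same request stream, the hit-on-miss policy reports two hits and one
miss, while the duplicated-miss policy reports one hit, one miss, and one
duplicated miss. Hit rates from caches that use different policies are
therefore not directly comparable.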
.. _desc-l1i:
L1 Instruction Cache (L1I)
==========================
As with the :ref:`sL1D <desc-sl1d>`, the L1 Instruction (L1I) cache is shared
between multiple CUs on a shader engine. The precise number of CUs sharing an
L1I depends on the architecture in question (:gcn-crash-course:`36`). The L1I
is backed by the :doc:`L2 cache <l2-cache>`. Unlike the sL1D, the
instruction cache is read-only.
.. _desc-l1i-sol:
L1I Speed-of-Light
------------------
.. warning::
The theoretical maximum throughput for some metrics in this section is
currently computed with the maximum achievable clock frequency, as reported
by ``rocminfo``, for an accelerator. This may not be realistic for all
workloads.
The L1 Instruction Cache speed-of-light chart shows some key metrics of
the L1I cache as a comparison with the peak achievable values of those
metrics:
.. jinja:: desc-l1i-sol
:file: _templates/metrics_table.j2
.. _desc-l1i-stats:
L1I cache accesses
------------------
This panel gives more detail on the hit/miss statistics of the L1I:
.. jinja:: desc-l1i-stats
:file: _templates/metrics_table.j2
.. _desc-l1i-l2-interface:
L1I - L2 interface
------------------
This panel gives more detail on the data requested across the
L1I-:doc:`L2 <l2-cache>` interface.
.. jinja:: desc-l1i-l2-interface
:file: _templates/metrics_table.j2
.. rubric:: Footnotes
.. [#l1i-cache] Unlike the :doc:`vL1D <vector-l1-cache>` and
:doc:`L2 <l2-cache>` caches, the L1I cache on AMD Instinct MI-series CDNA
accelerators does *not* use the "hit-on-miss" approach to reporting cache
hits. That is, if another request arrives for a cache line whose miss is still
being serviced, that request is counted as a *duplicated miss* rather than as
a hit.
.. _desc-spi:
Workgroup manager (SPI)
=======================
The workgroup manager (SPI) is the bridge between the
:doc:`command processor <command-processor>` and the
:doc:`compute units <compute-unit>`. After the command processor processes a
kernel dispatch, it will then pass the dispatch off to the workgroup manager,
which then schedules :ref:`workgroups <desc-workgroup>` onto the compute units.
As workgroups complete execution and resources become available, the
workgroup manager will schedule new workgroups onto compute units. The workgroup
manager's metrics therefore focus on reporting the following:
* Utilizations of various parts of the accelerator that the workgroup
manager interacts with (and the workgroup manager itself)
* How many workgroups were dispatched, their size, and how many
resources they used
* Percent of scheduler opportunities (cycles) where workgroups failed
to dispatch, and
* Percent of scheduler opportunities (cycles) where workgroups failed
to dispatch due to lack of a specific resource on the CUs (for instance, too
many VGPRs allocated)
This gives you an idea of why the workgroup manager couldn't schedule more
wavefronts onto the device, and is most useful for workloads that you suspect
are limited by scheduling or launch rate.
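The percentage-style metrics in the list above share one shape: a count of
stalled cycles divided by the number of scheduling opportunities. The sketch
below shows that calculation with made-up counter values. The function
``stall_percentages``, its parameter names, and the resource labels are
hypothetical and do not correspond to the profiler's actual counters or
formulas.

```python
def stall_percentages(scheduler_cycles, stalled_cycles, stalls_by_resource):
    """Return overall and per-resource stall rates as percentages.

    scheduler_cycles: cycles in which the workgroup manager tried to dispatch
    stalled_cycles: cycles in which no workgroup could be dispatched
    stalls_by_resource: mapping of resource name -> cycles stalled on it
    """
    if scheduler_cycles == 0:
        return 0.0, {name: 0.0 for name in stalls_by_resource}
    overall = 100.0 * stalled_cycles / scheduler_cycles
    per_resource = {
        name: 100.0 * cycles / scheduler_cycles
        for name, cycles in stalls_by_resource.items()
    }
    return overall, per_resource


# Hypothetical counter values for one kernel.
overall, per_resource = stall_percentages(
    scheduler_cycles=1000,
    stalled_cycles=250,
    stalls_by_resource={"VGPR": 200, "LDS": 50},
)
print(overall)       # 25.0
print(per_resource)  # {'VGPR': 20.0, 'LDS': 5.0}
```

In this made-up case, most failed dispatches trace back to VGPR allocation,
which would point toward register usage as the first thing to examine.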
As discussed in :doc:`Command processor <command-processor>`, the command
processor on AMD Instinct MI-series architectures contains four hardware
scheduler-pipes, each with eight software threads (:mantor-vega10-pdf:`19`). Each
scheduler-pipe can issue a kernel dispatch to the workgroup manager to schedule
concurrently. Therefore, some workgroup manager metrics are presented relative
to the utilization of these scheduler-pipes (for instance, whether all four are
issuing concurrently).
.. note::
Current versions of the profiling libraries underlying ROCm Compute Profiler attempt to
serialize concurrent kernels running on the accelerator, as the performance
counters on the device are global (that is, shared between concurrent
kernels). As a result, these scheduler-pipe utilization metrics are expected to
peak at one active pipe out of four, that is, 25%.
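The 25% ceiling mentioned in the note above follows directly from one active
pipe out of four. The short sketch below spells out that arithmetic. The
function ``pipe_utilization`` and the per-sample lists are assumptions for
illustration and are not how the profiler samples the hardware.

```python
NUM_SCHEDULER_PIPES = 4  # hardware scheduler-pipes, as described above


def pipe_utilization(active_pipes_per_sample):
    """Average share of scheduler-pipes that are busy, as a percentage."""
    if not active_pipes_per_sample:
        return 0.0
    mean_active = sum(active_pipes_per_sample) / len(active_pipes_per_sample)
    return 100.0 * mean_active / NUM_SCHEDULER_PIPES


# With kernels serialized during profiling, at most one pipe is active.
print(pipe_utilization([1, 1, 1, 1]))  # 25.0

# Hypothetical unprofiled run with all four pipes issuing concurrently.
print(pipe_utilization([4, 4, 4, 4]))  # 100.0
```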
.. _spi-util:
Workgroup manager utilizations
------------------------------
This section describes the utilization of the workgroup manager, and the
hardware components it interacts with.
.. jinja:: spi-util
:file: _templates/metrics_table.j2
.. _spi-resc-util:
Resource allocation
-------------------
This panel gives more detail on how workgroups and wavefronts were scheduled
onto compute units, and what occupancy limiters they hit -- if any. When
analyzing these metrics, you should also take into account the occupancy that
the kernel actually achieved -- such as
:ref:`wavefront occupancy <wavefront-runtime-stats>`. A kernel may be occupancy
limited by LDS usage, for example, but may still achieve occupancy high enough
that improving it further does not improve performance. See
:ref:`occupancy-example` for details.
.. jinja:: spi-resc-util
:file: _templates/metrics_table.j2
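An occupancy limiter, as discussed in this section, is the resource that runs
out first when wavefronts are placed. The sketch below finds that resource for
a single scheduling unit. It is a deliberately simplified model: the function
``occupancy_limit``, the resource names, and every capacity value are
hypothetical, and real limits depend on the architecture and on allocation
granularity.

```python
def occupancy_limit(per_wave_usage, capacity, max_waves):
    """Return (waves that fit, limiting resource) for one scheduling unit.

    per_wave_usage: resource name -> amount one wavefront needs
    capacity: resource name -> amount available
    max_waves: architectural cap on resident wavefronts
    The limiter is None when only the architectural cap applies.
    """
    waves, limiter = max_waves, None
    for name, usage in per_wave_usage.items():
        if usage == 0:
            continue  # an unused resource cannot limit occupancy
        fit = capacity[name] // usage
        if fit < waves:
            waves, limiter = fit, name
    return waves, limiter


# Hypothetical capacities; real values depend on the architecture.
capacity = {"VGPR": 512, "LDS_bytes": 65536}
usage = {"VGPR": 128, "LDS_bytes": 4096}
print(occupancy_limit(usage, capacity, max_waves=8))  # (4, 'VGPR')
```

Identifying the limiter is only the first step. As noted above, a kernel that
is limited by one resource may already achieve enough occupancy that raising
it further brings no benefit.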