Unified configuration for metrics (#726)

* Show description of metrics during analysis
* Use --include-cols Description to show the Description column in analyze mode (this is hidden by default)
* Remove tips field from analysis config
* Align metric names in analysis config and documentation
* Add unified config utils/unified_config.yaml
* Add Python script utils/split_config.py to auto-generate analysis configuration and documentation metrics description
* Add test case to ensure unified config is older than auto-generated config
* Auto-generate analysis config and documentation metrics description
* Update CONTRIBUTING.md to add instructions to build documentation assets
* Add Docker image and compose file to build documentation
* Update CHANGELOG and Documentation
* Use Jinja template instead of hardcoding metric tables in documentation

[ROCm/rocprofiler-compute commit: bb44e90b2d]
2025-07-25 14:01:34 -04:00
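
The split described in the commit message can be illustrated with a minimal sketch. It is not the contents of utils/split_config.py. The only thing taken from this commit is that the documentation template reads an ``rst`` and a ``unit`` field for each metric. The ``plain`` field, the shape of ``UNIFIED``, and the helper names ``docs_view``, ``analysis_view`` and ``render_table`` are hypothetical. The table is rendered with plain string formatting in place of Jinja so the example runs with the standard library alone.

```python
# Hypothetical unified entries. Only "rst" and "unit" are known from the
# template in this commit; "plain" and the nesting are assumptions.
UNIFIED = {
    "cpf-metrics": {
        "CPF Utilization": {
            "plain": "Percent of total cycles where the CPF was busy actively doing any work.",
            "rst": "Percent of total cycles where the CPF was busy actively doing any work.",
            "unit": "Percent",
        },
        "CPF-L2 Stall": {
            "plain": "Percent of CPF-L2 busy cycles where the CPF-L2 interface was stalled for any reason.",
            "rst": "Percent of CPF-:doc:`L2 <l2-cache>` busy cycles where the CPF-L2 interface was stalled for any reason.",
            "unit": "Percent",
        },
    }
}


def docs_view(section):
    """Keep only the fields the documentation template reads."""
    return {
        name: {"rst": info["rst"], "unit": info["unit"]}
        for name, info in section.items()
    }


def analysis_view(section):
    """Hypothetical analysis-side view: metric name mapped to plain text."""
    return {name: info["plain"] for name, info in section.items()}


def render_table(data):
    """Emit the same list-table layout as the template, row by row."""
    lines = [
        ".. list-table::",
        "    :header-rows: 1",
        "",
        "    * - Metric",
        "      - Description",
        "      - Unit",
        "",
    ]
    for metric, info in data.items():
        lines.append(f"    * - {metric}")
        lines.append(f"      - {info['rst']}")
        lines.append(f"      - {info['unit']}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_table(docs_view(UNIFIED["cpf-metrics"])))
```

Keeping one source entry per metric and deriving both views from it is what lets the metric names stay aligned between the analysis output and the documentation.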
@@ -0,0 +1,12 @@
+.. list-table::
+    :header-rows: 1
+
+    * - Metric
+      - Description
+      - Unit
+
+    {% for metric, metric_info in data.items() %}
+    * - {{ metric }}
+      - {{ metric_info.rst }}
+      - {{ metric_info.unit }}
+    {% endfor %}
@@ -46,108 +46,13 @@ processor’s metrics therefore are focused on reporting, for example:
 Command processor fetcher (CPF)
 ===============================

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - CPF Utilization
-
-     - Percent of total cycles where the CPF was busy actively doing any work.
-       The ratio of CPF busy cycles over total cycles counted by the CPF.
-
-     - Percent
-
-   * - CPF Stall
-
-     - Percent of CPF busy cycles where the CPF was stalled for any reason.
-
-     - Percent
-
-   * - CPF-L2 Utilization
-
-     - Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
-       where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
-       busy cycles over total cycles counted by the CPF-L2.
-
-     - Percent
-
-   * - CPF-L2 Stall
-
-     - Percent of CPF-:doc:`L2 <l2-cache>` L2 busy cycles where the CPF-L2
-       interface was stalled for any reason.
-
-     - Percent
-
-   * - CPF-UTCL1 Stall
-
-     - Percent of CPF busy cycles where the CPF was stalled by address
-       translation.
-
-     - Percent
+.. jinja:: cpf-metrics
+   :file: _templates/metrics_table.j2

 .. _cpc-metrics:

 Command processor packet processor (CPC)
 ========================================

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - CPC Utilization
-
-     - Percent of total cycles where the CPC was busy actively doing any work.
-       The ratio of CPC busy cycles over total cycles counted by the CPC.
-
-     - Percent
-
-   * - CPC Stall
-
-     - Percent of CPC busy cycles where the CPC was stalled for any reason.
-
-     - Percent
-
-   * - CPC Packet Decoding Utilization
-
-     - Percent of CPC busy cycles spent decoding commands for processing.
-
-     - Percent
-
-   * - CPC-Workgroup Manager Utilization
-
-     - Percent of CPC busy cycles spent dispatching workgroups to the
-       :ref:`workgroup manager <desc-spi>`.
-
-     - Percent
-
-   * - CPC-L2 Utilization
-
-     - Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
-       where the CPC-L2 interface was active doing any work.
-
-     - Percent
-
-   * - CPC-UTCL1 Stall
-
-     - Percent of CPC busy cycles where the CPC was stalled by address
-       translation.
-
-     - Percent
-
-   * - CPC-UTCL2 Utilization
-
-     - Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address
-       translation interface where the CPC was busy doing address translation
-       work.
-
-     - Percent
+.. jinja:: cpc-metrics
+   :file: _templates/metrics_table.j2
@@ -48,56 +48,8 @@ The L2 cache’s speed-of-light table contains a few key metrics about the
 performance of the L2 cache, aggregated over all the L2 channels, as a
 comparison with the peak achievable values of those metrics:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Utilization
-
-     - The ratio of the
-       :ref:`number of cycles an L2 channel was active, summed over all L2 channels on the accelerator <total-active-l2-cycles>`
-       over the :ref:`total L2 cycles <total-l2-cycles>`.
-
-     - Percent
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the L2 cache, as a percent of the peak
-       theoretical bandwidth achievable on the specific accelerator. The number
-       of bytes is calculated as the number of cache lines requested multiplied
-       by the cache line size. This value does not consider partial requests, so
-       e.g., if only a single value is requested in a cache line, the data
-       movement will still be counted as a full cache line.
-
-     - Percent
-
-   * - Hit Rate
-
-     - The ratio of the number of L2 cache line requests that hit in the L2
-       cache over the total number of incoming cache line requests to the L2
-       cache.
-
-     - Percent
-
-   * - L2-Fabric Read BW
-
-     - The number of bytes read by the L2 over the
-       :ref:`Infinity Fabric interface <l2-fabric>` per unit time.
-
-     - GB/s
-
-   * - L2-Fabric Write and Atomic BW
-
-     - The number of bytes sent by the L2 over the
-       :ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
-       operations per unit time.
-
-     - GB/s
+.. jinja:: l2-sol
+   :file: _templates/metrics_table.j2

 .. note::

@@ -117,168 +69,8 @@ This section details the incoming requests to the L2 cache from the
 :doc:`vL1D <vector-l1-cache>` and other clients -- for instance, the
 :ref:`sL1D <desc-sL1D>` and :ref:`L1I <desc-l1i>` caches.

-.. list-table::
-   :header-rows: 1
-   :widths: 13 70 17
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the L2 cache, per
-       :ref:`normalization unit <normalization-units>`.  The number of bytes is
-       calculated as the number of cache lines requested multiplied by the cache
-       line size. This value does not consider partial requests, so for example,
-       if only a single value is requested in a cache line, the data movement
-       will still be counted as a full cache line.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - Requests
-
-     - The total number of incoming requests to the L2 from all clients for all
-       request types, per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Read Requests
-
-     - The total number of read requests to the L2 from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Write Requests
-
-     - The total number of write requests to the L2 from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Requests
-
-     - The total number of atomic requests (with and without return) to the L2
-       from all clients.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Streaming Requests
-
-     - The total number of incoming requests to the L2 that are marked as
-       *streaming*. The exact meaning of this may differ depending on the
-       targeted accelerator, however on an :ref:`MI2XX <mixxx-note>` this
-       corresponds to
-       `non-temporal load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
-       The L2 cache attempts to evict *streaming* requests before normal
-       requests when the L2 is at capacity.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Probe Requests
-
-     - The number of coherence probe requests made to the L2 cache from outside
-       the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be
-       generated by, for example, writes to
-       :ref:`fine-grained device <memory-type>` memory or by writes to
-       :ref:`coarse-grained <memory-type>` device memory.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hit Rate
-
-     - The ratio of the number of L2 cache line requests that hit in the L2
-       cache over the total number of incoming cache line requests to the L2
-       cache.
-
-     - Percent
-
-   * - Hits
-
-     - The total number of requests to the L2 from all clients that hit in the
-       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, this
-       includes hit-on-miss requests.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses
-
-     - The total number of requests to the L2 from all clients that miss in the
-       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do
-       not include hit-on-miss requests.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Writebacks
-
-     - The total number of L2 cache lines written back to memory for any reason.
-       Write-backs may occur due to user code (such as HIP kernel calls to
-       ``__threadfence_system`` or atomic built-ins) by the
-       :doc:`command processor <command-processor>`'s memory acquire/release
-       fences, or for other internal hardware reasons.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`
-
-   * - Writebacks (Internal)
-
-     - The total number of L2 cache lines written back to memory for internal
-       hardware reasons, per :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Writebacks (vL1D Req)
-
-     - The total number of L2 cache lines written back to memory due to requests
-       initiated by the :doc:`vL1D cache <vector-l1-cache>`, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Evictions (Normal)
-
-     - The total number of L2 cache lines evicted from the cache due to capacity
-       limits, per :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Evictions (vL1D Req)
-
-     - The total number of L2 cache lines evicted from the cache due to
-       invalidation requests initiated by the
-       :doc:`vL1D cache <vector-l1-cache>`, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`.
-
-   * - Non-hardware-Coherent Requests
-
-     - The total number of requests to the L2 to Not-hardware-Coherent (NC)
-       memory allocations, per :ref:`normalization unit <normalization-units>`.
-       See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Requests
-
-     - The total number of requests to the L2 that go to Uncached (UC) memory
-       allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Coherently Cached Requests
-
-     - The total number of requests to the L2 that go to Coherently Cacheable (CC)
-       memory allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Read/Write Coherent Requests
-
-     - The total number of requests to the L2 that go to Read-Write coherent memory
-       (RW) allocations. See the :ref:`memory-type` for more information.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
+.. jinja:: l2-cache-accesses
+   :file: _templates/metrics_table.j2

 .. note::

@@ -300,7 +92,7 @@ is responsible for routing these memory requests/data to the correct
 location and returning any fetched data to the L2 cache. The
 :ref:`l2-request-flow` describes the flow of these requests through
 Infinity Fabric in more detail, as described by ROCm Compute Profiler metrics,
-while :ref:`l2-request-metrics` give detailed definitions of
+while :ref:`l2-fabric` gives detailed definitions of
 individual metrics.

 .. _l2-request-flow:
@@ -363,176 +155,15 @@ to uncached memory (denoted by the dashed line), they will also be
 counted as *two* uncached read requests (that is, the request is split).


-.. _l2-request-metrics:
+.. _l2-fabric-metrics:

 Metrics
 -------

 The following metrics are reported for the L2-Fabric interface:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - L2-Fabric Read Bandwidth
-
-     - The total number of bytes read by the L2 cache from Infinity Fabric per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       the accelerator's local high-bandwidth memory (HBM). This breakdown does
-       not consider the *size* of the request (meaning that 32B and 64B requests
-       are both counted as a single request), so this metric only *approximates*
-       the percent of the L2-Fabric Read bandwidth directed to the local HBM.
-
-     - Percent
-
-   * - Remote Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       any memory location other than the accelerator's local high-bandwidth
-       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
-       HBM. This breakdown does not consider the *size* of the request (meaning
-       that 32B and 64B requests are both counted as a single request), so this
-       metric only *approximates* the percent of the L2-Fabric Read bandwidth
-       directed to a remote location.
-
-     - Percent
-
-   * - Uncached Read Traffic
-
-     - The percent of read requests generated by the L2 cache that are reading
-       from an :ref:`uncached memory allocation <memory-type>`. Note, as
-       described in the :ref:`request flow <l2-request-flow>` section, a single
-       64B read request is typically counted as two uncached read requests. So,
-       it is possible for the Uncached Read Traffic to reach up to 200% of the
-       total number of read requests. This breakdown does not consider the
-       *size* of the request (i.e., 32B and 64B requests are both counted as a
-       single request), so this metric only *approximates* the percent of the
-       L2-Fabric read bandwidth directed to an uncached memory location.
-
-     - Percent
-
-   * - L2-Fabric Write and Atomic Bandwidth
-
-     - The total number of bytes written by the L2 over Infinity Fabric by write
-       and atomic operations per
-       :ref:`normalization unit <normalization-units>`. Note that on current
-       CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are
-       only considered *atomic* by Infinity Fabric if they are targeted at
-       non-write-cacheable memory, for example,
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations on the
-       MI2XX.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Write and Atomic Traffic
-
-     - The percent of write and atomic requests generated by the L2 cache that
-       are routed to the accelerator's local high-bandwidth memory (HBM). This
-       breakdown does not consider the *size* of the request (meaning that 32B
-       and 64B requests are both counted as a single request), so this metric
-       only *approximates* the percent of the L2-Fabric Write and Atomic
-       bandwidth directed to the local HBM. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Remote Write and Atomic Traffic
-
-     - The percent of read requests generated by the L2 cache that are routed to
-       any memory location other than the accelerator's local high-bandwidth
-       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
-       HBM. This breakdown does not consider the *size* of the request (meaning
-       that 32B and 64B requests are both counted as a single request), so this
-       metric only *approximates* the percent of the L2-Fabric Read bandwidth
-       directed to a remote location. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Atomic Traffic
-
-     - The percent of write requests generated by the L2 cache that are atomic
-       requests to *any* memory location. This breakdown does not consider the
-       *size* of the request (meaning that 32B and 64B requests are both counted
-       as a single request), so this metric only *approximates* the percent of
-       the L2-Fabric Read bandwidth directed to a remote location. Note that on
-       current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
-       requests are only considered *atomic* by Infinity Fabric if they are
-       targeted at :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations.
-
-     - Percent
-
-   * - Uncached Write and Atomic Traffic
-
-     - The percent of write and atomic requests generated by the L2 cache that
-       are targeting :ref:`uncached memory allocations <memory-type>`. This
-       breakdown does not consider the *size* of the request (meaning that 32B
-       and 64B requests are both counted as a single request), so this metric
-       only *approximates* the percent of the L2-Fabric read bandwidth directed
-       to uncached memory allocations.
-
-     - Percent
-
-   * - Read Latency
-
-     - The time-averaged number of cycles read requests spent in Infinity Fabric
-       before data was returned to the L2.
-
-     - Cycles
-
-   * - Write Latency
-
-     - The time-averaged number of cycles write requests spent in Infinity
-       Fabric before a completion acknowledgement was returned to the L2.
-
-     - Cycles
-
-   * - Atomic Latency
-
-     - The time-averaged number of cycles atomic requests spent in Infinity
-       Fabric before a completion acknowledgement (atomic without return value)
-       or data (atomic with return value) was returned to the L2.
-
-     - Cycles
-
-   * - Read Stall
-
-     - The ratio of the total number of cycles the L2-Fabric interface was
-       stalled on a read request to any destination (local HBM, remote PCIe®
-       connected accelerator or CPU, or remote Infinity Fabric connected
-       accelerator [#inf]_ or CPU) over the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write Stall
-
-     - The ratio of the total number of cycles the L2-Fabric interface was
-       stalled on a write or atomic request to any destination (local HBM,
-       remote accelerator or CPU, PCIe connected accelerator or CPU, or remote
-       Infinity Fabric connected accelerator [#inf]_ or CPU) over the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
+.. jinja:: l2-fabric-metrics
+   :file: _templates/metrics_table.j2

 .. _l2-detailed-metrics:

@@ -542,121 +173,8 @@ Detailed transaction metrics
 The following metrics are available in the detailed L2-Fabric
 transaction breakdown table:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - 32B Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B of data
-       from any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail. Typically unused on CDNA
-       accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read
-       :ref:`uncached data <memory-type>` from any memory location, per
-       :ref:`normalization unit <normalization-units>`. 64B requests for
-       uncached data are counted as two 32B uncached data requests. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 64B Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 64B of data
-       from any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
-       data from the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Remote Read Requests
-
-     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
-       data from any source other than the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 32B Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B of data to any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Uncached Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of :ref:`uncached data <memory-type>`, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - 64B Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 64B of data in any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - HBM Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of data in the accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Remote Write and Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to write or atomically
-       update 32B or 64B of data in any memory location other than the
-       accelerator's local HBM, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
-
-   * - Atomic Requests
-
-     - The total number of L2 requests to Infinity Fabric to atomically update
-       32B or 64B of data in any memory location, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`l2-request-flow` for more detail. Note that on current CDNA
-       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
-       considered *atomic* by Infinity Fabric if they are targeted at
-       non-write-cacheable memory, such as
-       :ref:`fine-grained memory <memory-type>` allocations or
-       :ref:`uncached memory <memory-type>` allocations on the MI2XX.
-
-     - Requests per :ref:`normalization unit <normalization-units>`.
+.. jinja:: l2-detailed-metrics
+   :file: _templates/metrics_table.j2

 .. _l2-fabric-stalls:

@@ -670,72 +188,8 @@ what types of requests in a kernel caused a stall (like read versus write), and
 to which locations -- for instance, to the accelerator’s local memory, or to
 remote accelerators or CPUs.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Read - PCIe Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Read - Infinity Fabric Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a
-       percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Read - HBM Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on read requests
-       to the accelerator's local HBM as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - PCIe Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to remote PCIe connected accelerators [#inf]_ or CPUs as
-       a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - Infinity Fabric Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to remote Infinity Fabric connected accelerators [#inf]_
-       or CPUs as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - HBM Stall
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to accelerator's local HBM as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
-
-   * - Write - Credit Starvation
-
-     - The number of cycles the L2-Fabric interface was stalled on write or
-       atomic requests to any memory location because too many write/atomic
-       requests were currently in flight, as a percent of the
-       :ref:`total active L2 cycles <total-active-l2-cycles>`.
-
-     - Percent
+.. jinja:: l2-fabric-stalls
+   :file: _templates/metrics_table.j2

 .. warning::

@@ -21,53 +21,8 @@ LDS Speed-of-Light
 The :ref:`LDS <desc-lds>` speed-of-light chart shows a number of key metrics for
 the LDS as a comparison with the peak achievable values of those metrics.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Utilization
-
-     - Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`
-       was actively executing instructions (including, but not limited to, load,
-       store, atomic and HIP's ``__shfl`` operations).  Calculated as the ratio
-       of the total number of cycles LDS was active over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - Access Rate
-
-     - Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
-       actively issuing LDS instructions, averaged over the lifetime of the
-       kernel. Calculated as the ratio of the total number of cycles spent by
-       the :ref:`scheduler <desc-scheduler>` issuing :ref:`LDS <desc-lds>`
-       instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - Theoretical Bandwidth (% of Peak)
-
-     - Indicates the maximum amount of bytes that *could* have been loaded from,
-       stored to, or atomically updated in the LDS in this kernel, as a percent
-       of the peak LDS bandwidth achievable. See the
-       :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
-
-     - Percent
-
-   * - Bank Conflict Rate
-
-     - Indicates the percentage of active LDS cycles that were spent servicing
-       bank conflicts. Calculated as the ratio of LDS cycles spent servicing
-       bank conflicts over the number of LDS cycles that would have been
-       required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_
-
-     - Percent
+.. jinja:: lds-sol
+   :file: _templates/metrics_table.j2

 .. rubric:: Footnotes

@@ -90,93 +45,5 @@ Statistics

 The LDS statistics panel gives a more detailed view of the hardware:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - LDS Instructions
-
-     - The total number of LDS instructions (including, but not limited to,
-       read/write/atomics and HIP's ``__shfl`` instructions) executed per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Theoretical Bandwidth
-
-     - Indicates the maximum amount of bytes that could have been loaded from,
-       stored to, or atomically updated in the LDS per
-       :ref:`normalization unit <normalization-units>`. Does *not* take into
-       account the execution mask of the wavefront when the instruction was
-       executed. See the
-       :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`
-
-   * - LDS Latency
-
-     - The average number of round-trip cycles (i.e., from issue to data-return
-       / acknowledgment) required for an LDS instruction to complete.
-
-     - Cycles
-
-   * - Bank Conflicts/Access
-
-     - The ratio of the number of cycles spent in the
-       :ref:`LDS scheduler <desc-lds>` due to bank conflicts (as determined by
-       the conflict resolution hardware) to the base number of cycles that would
-       be spent in the LDS scheduler in a completely uncontended case. This is
-       the unnormalized form of the Bank Conflict Rate.
-
-     - Conflicts/Access
-
-   * - Index Accesses
-
-     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
-       over all operations per :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Return Cycles
-
-     - The total number of cycles spent on LDS atomics with return per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Bank Conflicts
-
-     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
-       due to bank conflicts (as determined by the conflict resolution hardware)
-       per :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Address Conflicts
-
-     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
-       due to address conflicts (as determined by the conflict resolution
-       hardware) per :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Unaligned Stall
-
-     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
-       due to stalls from non-dword aligned addresses per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Memory Violations
-
-     - The total number of out-of-bounds accesses made to the LDS, per
-       :ref:`normalization unit <normalization-units>`. This is unused and
-       expected to be zero in most configurations for modern CDNA™ accelerators.
-
-     - Accesses per :ref:`normalization unit <normalization-units>`
+.. jinja:: lds-stats
+   :file: _templates/metrics_table.j2
@@ -23,97 +23,8 @@ Wavefront launch stats
 The wavefront launch stats panel gives general information about the
 kernel launch:

-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Grid Size
-
-     - The total number of work-items (or, threads) launched as a part of
-       the kernel dispatch.  In HIP, this is equivalent to the total grid size
-       multiplied by the total workgroup (or, block) size.
-
-     - :ref:`Work-items <desc-work-item>`
-
-   * - Workgroup Size
-
-     - The total number of work-items (or, threads) in each workgroup
-       (or, block) launched as part of the kernel dispatch.  In HIP, this is
-       equivalent to the total block size.
-
-     - :ref:`Work-items <desc-work-item>`
-
-   * - Total Wavefronts
-
-     - The total number of wavefronts launched as part of the kernel dispatch.
-       On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is
-       always 64 work-items.  Thus, the total number of wavefronts should be
-       equivalent to the ceiling of grid size divided by 64.
-
-     - :ref:`Wavefronts <desc-wavefront>`
-
-   * - Saved Wavefronts
-
-     - The total number of wavefronts saved at a context-save. See
-       `cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
-
-     - :ref:`Wavefronts <desc-wavefront>`
-
-   * - Restored Wavefronts
-
-     - The total number of wavefronts restored from a context-save. See
-       `cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
-
-     - :ref:`Wavefronts <desc-wavefront>`
-
-   * - VGPRs
-
-     - The number of architected vector general-purpose registers allocated for
-       the kernel, see :ref:`VALU <desc-valu>`.  Note: this may not exactly
-       match the number of VGPRs requested by the compiler due to allocation
-       granularity.
-
-     - :ref:`VGPRs <desc-valu>`
-
-   * - AGPRs
-
-     - The number of accumulation vector general-purpose registers allocated for
-       the kernel, see :ref:`AGPRs <desc-agprs>`.  Note: this may not exactly
-       match the number of AGPRs requested by the compiler due to allocation
-       granularity.
-
-     - :ref:`AGPRs <desc-agprs>`
-
-   * - SGPRs
-
-     - The number of scalar general-purpose registers allocated for the kernel,
-       see :ref:`SALU <desc-salu>`.  Note: this may not exactly match the number
-       of SGPRs requested by the compiler due to allocation granularity.
-
-     - :ref:`SGPRs <desc-salu>`
-
-   * - LDS Allocation
-
-     - The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared
-       memory) allocated for this kernel.  Note: This may also be larger than
-       what was requested at compile time due to both allocation granularity and
-       dynamic per-dispatch LDS allocations.
-
-     - Bytes per :ref:`workgroup <desc-workgroup>`
-
-   * - Scratch Allocation
-
-     - The number of bytes of :ref:`scratch memory <memory-spaces>` requested
-       per work-item for this kernel. Scratch memory is used for stack memory
-       on the accelerator, as well as for register spills and restores.
-
-     - Bytes per :ref:`work-item <desc-work-item>`
+.. jinja:: wavefront-launch-stats
+   :file: _templates/metrics_table.j2

 .. _wavefront-runtime-stats:

@@ -123,96 +34,8 @@ Wavefront runtime stats
 The wavefront runtime statistics give a high-level overview of the
 execution of wavefronts in a kernel:

-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - :ref:`Kernel time <kernel-time>`
-
-     - The total duration of the executed kernel. Note: this should not be
-       directly compared to the wavefront cycles / timings below.
-
-     - Nanoseconds
-
-   * - :ref:`Kernel cycles <kernel-cycles>`
-
-     - The total duration of the executed kernel in cycles. Note: this should
-       not be directly compared to the wavefront cycles / timings below.
-
-     - Cycles
-
-   * - Instructions per wavefront
-
-     - The average number of instructions (of all types) executed per wavefront.
-       This is averaged over all wavefronts in a kernel dispatch.
-
-     - Instructions / wavefront
-
-   * - Wave cycles
-
-     - The number of cycles a wavefront in the kernel dispatch spent resident on
-       a compute unit per :ref:`normalization unit <normalization-units>`. This
-       is averaged over all wavefronts in a kernel dispatch.  Note: this should
-       not be directly compared to the kernel cycles above.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Dependency wait cycles
-
-     - The number of cycles a wavefront in the kernel dispatch stalled waiting
-       on memory of any kind (e.g., instruction fetch, vector or scalar memory,
-       etc.) per :ref:`normalization unit <normalization-units>`. This counter
-       is incremented at every cycle by *all* wavefronts on a CU stalled at a
-       memory operation.  As such, it is most useful to get a sense of how waves
-       were spending their time, rather than identification of a precise limiter
-       because another wave could be actively executing while a wave is stalled.
-       The sum of this metric, Issue Wait Cycles and Active Cycles should be
-       equal to the total Wave Cycles metric.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Issue Wait Cycles
-
-     - The number of cycles a wavefront in the kernel dispatch was unable to
-       issue an instruction for any reason (e.g., execution pipe back-pressure,
-       arbitration loss, etc.) per
-       :ref:`normalization unit <normalization-units>`.  This counter is
-       incremented at every cycle by *all* wavefronts on a CU unable to issue an
-       instruction.  As such, it is most useful to get a sense of how waves were
-       spending their time, rather than identification of a precise limiter
-       because another wave could be actively executing while a wave is issue
-       stalled.  The sum of this metric, Dependency Wait Cycles and Active
-       Cycles should be equal to the total Wave Cycles metric.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Active Cycles
-
-     - The average number of cycles a wavefront in the kernel dispatch was
-       actively executing instructions per
-       :ref:`normalization unit <normalization-units>`. This measurement is made
-       on a per-wavefront basis, and may include cycles that another wavefront
-       spent actively executing (on another execution unit, for example) or was
-       stalled.  As such, it is most useful to get a sense of how waves were
-       spending their time, rather than identification of a precise limiter. The
-       sum of this metric, Issue Wait Cycles and Active Wait Cycles should be
-       equal to the total Wave Cycles metric.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Wavefront Occupancy
-
-     - The time-averaged number of wavefronts resident on the accelerator over
-       the lifetime of the kernel. Note: this metric may be inaccurate for
-       short-running kernels (less than 1ms).
-
-     - :ref:`Wavefronts <desc-wavefront>`
+.. jinja:: wavefront-runtime-stats
+   :file: _templates/metrics_table.j2

 .. note::

@@ -256,71 +79,8 @@ This panel shows the total number of each type of instruction issued to
 the :doc:`various compute pipelines </conceptual/pipeline-descriptions>` on the
 :doc:`CU </conceptual/compute-unit>`. These are:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - :ref:`VALU <desc-valu>` instructions
-
-     - The total number of vector arithmetic logic unit (VALU) operations
-       issued. These are the workhorses of the
-       :doc:`compute unit <compute-unit>`, and are used to execute a wide range of
-       instruction types including floating point operations, non-uniform
-       address calculations, transcendental operations, integer operations,
-       shifts, conditional evaluation, etc.
-
-     - Instructions
-
-   * - VMEM instructions
-
-     - The total number of vector memory operations issued. These include most
-       loads, stores and atomic operations and all accesses to
-       :ref:`generic, global, private and texture <memory-spaces>` memory.
-
-     - Instructions
-
-   * - :doc:`LDS <local-data-share>` instructions
-
-     - The total number of LDS (also known as shared memory) operations issued.
-       These include loads, stores, atomics, and HIP's ``__shfl`` operations.
-
-     - Instructions
-
-   * - :ref:`MFMA <desc-mfma>` instructions
-
-     - The total number of matrix fused multiply-add instructions issued.
-
-     - Instructions
-
-   * - :ref:`SALU <desc-salu>` instructions
-
-     - The total number of scalar arithmetic logic unit (SALU) operations
-       issued. Typically these are used for address calculations, literal
-       constants, and other operations that are *provably* uniform across a
-       wavefront. Although scalar memory (SMEM) operations are issued by the
-       SALU, they are counted separately in this section.
-
-     - Instructions
-
-   * - SMEM instructions
-
-     - The total number of scalar memory (SMEM) operations issued. These are
-       typically used for loading kernel arguments, base-pointers and loads
-       from HIP's ``__constant__`` memory.
-
-     - Instructions
-
-   * - :ref:`Branch <desc-branch>` instructions
-
-     - The total number of branch operations issued. These typically consist of
-       jump or branch operations and are used to implement control flow.
-
-     - Instructions
+.. jinja:: instruction-mix
+   :file: _templates/metrics_table.j2

 .. note::

@@ -345,133 +105,8 @@ include :ref:`MFMA <desc-mfma>` instructions using the same precision; for
 instance, the “F16-ADD” metric does not include any 16-bit floating point
 additions executed as part of an MFMA instruction using the same precision.

-.. list-table::
-   :header-rows: 1
-   :widths: 15 65 20
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - INT32
-
-     - The total number of instructions operating on 32-bit integer operands
-       issued to the VALU per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - INT64
-
-     - The total number of instructions operating on 64-bit integer operands
-       issued to the VALU per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F16-ADD
-
-     - The total number of addition instructions operating on 16-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F16-MUL
-
-     - The total number of multiplication instructions operating on 16-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F16-FMA
-
-     - The total number of fused multiply-add instructions operating on 16-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F16-TRANS
-
-     - The total number of transcendental instructions (e.g., `sqrt`) operating
-       on 16-bit floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F32-ADD
-
-     - The total number of addition instructions operating on 32-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F32-MUL
-
-     - The total number of multiplication instructions operating on 32-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F32-FMA
-
-     - The total number of fused multiply-add instructions operating on 32-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F32-TRANS
-
-     - The total number of transcendental instructions (such as ``sqrt``)
-       operating on 32-bit floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F64-ADD
-
-     - The total number of addition instructions operating on 64-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F64-MUL
-
-     - The total number of multiplication instructions operating on 64-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F64-FMA
-
-     - The total number of fused multiply-add instructions operating on 64-bit
-       floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - F64-TRANS
-
-     - The total number of transcendental instructions (such as `sqrt`)
-       operating on 64-bit floating-point operands issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Conversion
-
-     - The total number of type conversion instructions (such as converting data
-       to or from F32↔F64) issued to the VALU per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
+.. jinja:: valu-arith-instruction-mix
+   :file: _templates/metrics_table.j2

 For an example of these counters in action, refer to
 :ref:`valu-arith-instruction-mix-ex`.
@@ -502,57 +137,8 @@ This section details the types of Matrix Fused Multiply-Add
 MFMA instructions are classified by the type of input data they operate on, and
 *not* the data type the result is accumulated to.

-.. list-table::
-   :header-rows: 1
-   :widths: 25 60 17
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - MFMA-I8 Instructions
-
-     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions
-       issued per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - MFMA-F8 Instructions
-
-     - The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
-       instructions issued per :ref:`normalization unit <normalization-units>`. This is supported in AMD Instinct MI300 series and later only.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - MFMA-F16 Instructions
-
-     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
-       instructions issued per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - MFMA-BF16 Instructions
-
-     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
-       instructions issued per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - MFMA-F32 Instructions
-
-     - The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`
-       instructions issued per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - MFMA-F64 Instructions
-
-     - The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`
-       instructions issued per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
+.. jinja:: mfma-instruction-mix
+   :file: _templates/metrics_table.j2

 Compute pipeline
 ================
@@ -612,84 +198,8 @@ various precisions. We note that unlike the
 are reported as FLOPs and IOPs, that is, the total number of operations
 executed.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - VALU FLOPs
-
-     - The total floating-point operations executed per second on the
-       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
-       theoretical FLOPs achievable on the specific accelerator. Note: this does
-       not include any floating-point operations from :ref:`MFMA <desc-mfma>`
-       instructions.
-
-     - GFLOPs
-
-   * - VALU IOPs
-
-     - The total integer operations executed per second on the
-       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
-       theoretical IOPs achievable on the specific accelerator. Note: this does
-       not include any integer operations from :ref:`MFMA <desc-mfma>`
-       instructions.
-
-     - GIOPs
-
-   * - MFMA FLOPs (BF16)
-
-     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 16-bit
-       brain floating point operations from :ref:`VALU <desc-valu>`
-       instructions. This is also presented as a percent of the peak theoretical
-       BF16 MFMA operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - MFMA FLOPs (F16)
-
-     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 16-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F16 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - MFMA FLOPs (F32)
-
-     - The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 32-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F32 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - MFMA FLOPs (F64)
-
-     - The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 64-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F64 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - MFMA IOPs (INT8)
-
-     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
-       executed per second. Note: this does not include any 8-bit integer
-       operations from :ref:`VALU <desc-valu>` instructions. This is also
-       presented as a percent of the peak theoretical INT8 MFMA operations
-       achievable on the specific accelerator.
-
-     - GIOPs
+.. jinja:: compute-speed-of-light
+   :file: _templates/metrics_table.j2

 .. _pipeline-stats:

@@ -702,120 +212,8 @@ various execution units on the :doc:`CU <compute-unit>`. Refer to
 the :ref:`scheduler <desc-scheduler>` for a high-level overview of execution
 units and instruction issue.

-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - IPC
-
-     - The ratio of the total number of instructions executed on the
-       :doc:`CU <compute-unit>` over the
-       :ref:`total active CU cycles <total-active-cu-cycles>`.
-
-     - Instructions per-cycle
-
-   * - IPC (Issued)
-
-     - The ratio of the total number of
-       (non-:ref:`internal <ipc-internal-instructions>`) instructions issued over
-       the number of cycles where the :ref:`scheduler <desc-scheduler>` was
-       actively working on issuing instructions. Refer to the
-       :ref:`Issued IPC <issued-ipc>` example for further detail.
-
-     - Instructions per-cycle
-
-   * - SALU utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
-       ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM <desc-smem>`
-       instructions over the :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - VALU utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`VALU <desc-valu>` was busy executing instructions. Does not include
-       :ref:`VMEM <desc-vmem>` operations. Computed as the ratio of the total
-       number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
-       VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - VMEM utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`VMEM <desc-vmem>` unit was busy executing instructions, including
-       both global/generic and spill/scratch operations (see the
-       :ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
-       detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed
-       as the ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - Branch utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`branch <desc-branch>` unit was busy executing instructions.
-       Computed as the ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing branch instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - VALU active threads
-
-     - Indicates the average level of :ref:`divergence <desc-divergence>` within
-       a wavefront over the lifetime of the kernel. The number of work-items
-       that were active in a wavefront during execution of each
-       :ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
-       instructions run on all wavefronts in the kernel.
-
-     - Work-items
-
-   * - MFMA utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
-       the ratio of the total number of cycles spent by the
-       :ref:`MFMA <desc-salu>` was busy over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - MFMA instruction cycles
-
-     - The average duration of :ref:`MFMA <desc-mfma>` instructions in this
-       kernel in cycles. Computed as the ratio of the total number of cycles the
-       MFMA unit was busy over the total number of MFMA instructions. Compare
-       to, for example, the
-       `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
-
-     - Cycles per instruction
-
-   * - VMEM latency
-
-     - The average number of round-trip cycles (that is, from issue to data
-       return / acknowledgment) required for a VMEM instruction to complete.
-
-     - Cycles
-
-   * - SMEM latency
-
-     - The average number of round-trip cycles (that is, from issue to data
-       return / acknowledgment) required for a SMEM instruction to complete.
-
-     - Cycles
+.. jinja:: pipeline-stats
+   :file: _templates/metrics_table.j2

 .. note::

@@ -846,70 +244,5 @@ not. For more detail on how operations are counted see the
   take into account the execution mask of the operation, and will report the
   same value even if EXEC is identically zero.

-.. list-table::
-   :header-rows: 1
-   :widths: 18 65 17
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - FLOPs (Total)
-
-     - The total number of floating-point operations executed on either the
-       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - FLOP per :ref:`normalization unit <normalization-units>`
-
-   * - IOPs (Total)
-
-     - The total number of integer operations executed on either the
-       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - IOP per :ref:`normalization unit <normalization-units>`
-
-   * - F16 OPs
-
-     - The total number of 16-bit floating-point operations executed on either the
-       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - FLOP per :ref:`normalization unit <normalization-units>`
-
-   * - BF16 OPs
-
-     - The total number of 16-bit brain floating-point operations executed on either the
-       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`. Note: on current CDNA
-       accelerators, the VALU has no native BF16 instructions.
-
-     - FLOP per :ref:`normalization unit <normalization-units>`
-
-   * - F32 OPs
-
-     - The total number of 32-bit floating-point operations executed on either
-       the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - FLOP per :ref:`normalization unit <normalization-units>`
-
-   * - F64 OPs
-
-     - The total number of 64-bit floating-point operations executed on either
-       the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - FLOP per :ref:`normalization unit <normalization-units>`
-
-   * - INT8 OPs
-
-     - The total number of 8-bit integer operations executed on either the
-       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
-       :ref:`normalization unit <normalization-units>`. Note: on current CDNA
-       accelerators, the VALU has no native INT8 instructions.
-
-     - IOPs per :ref:`normalization unit <normalization-units>`
+.. jinja:: arithmetic-operations
+   :file: _templates/metrics_table.j2
@@ -71,40 +71,8 @@ Scalar L1D Speed-of-Light
 The Scalar L1D speed-of-light chart shows some key metrics of the sL1D
 cache as a comparison with the peak achievable values of those metrics:

-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the sL1D cache, as a percent of the peak
-       theoretical bandwidth. Calculated as the ratio of sL1D requests over the
-       :ref:`total sL1D cycles <total-sl1d-cycles>`.
-
-     - Percent
-
-   * - Cache Hit Rate
-
-     - The percent of sL1D requests that hit [#sl1d-cache]_ on a previously
-       loaded line in the cache. Calculated as the ratio of the number of sL1D
-       requests that hit over the number of all sL1D requests.
-
-     - Percent
-
-   * - sL1D-L2 BW
-
-     - The number of bytes requested by the sL1D from the L2 cache, as a percent
-       of the peak theoretical sL1D → L2 cache bandwidth.  Calculated as the
-       ratio of the total number of requests from the sL1D to the L2 cache over
-       the :ref:`total sL1D-L2 interface cycles <total-sl1d-cycles>`.
-
-     - Percent
+.. jinja:: desc-sl1d-sol
+   :file: _templates/metrics_table.j2

 .. _desc-sl1d-stats:

@@ -114,104 +82,8 @@ Scalar L1D cache accesses
 This panel gives more detail on the types of accesses made to the sL1D,
 and the hit/miss statistics.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Requests
-
-     - The total number of requests, of any size or type, made to the sL1D per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hits
-
-     - The total number of sL1D requests that hit on a previously loaded cache
-       line, per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses - Non Duplicated
-
-     - The total number of sL1D requests that missed on a cache line that *was
-       not* already pending due to another request, per
-       :ref:`normalization unit <normalization-units>`. See :ref:`desc-sl1d-sol`
-       for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses - Duplicated
-
-     - The total number of sL1D requests that missed on a cache line that *was*
-       already pending due to another request, per
-       :ref:`normalization unit <normalization-units>`. See
-       :ref:`desc-sl1d-sol` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Cache Hit Rate
-
-     - Indicates the percent of sL1D requests that hit on a previously loaded
-       line the cache. The ratio of the number of sL1D requests that hit
-       [#sl1d-cache]_ over the number of all sL1D requests.
-
-     - Percent
-
-   * - Read Requests (Total)
-
-     - The total number of sL1D read requests of any size, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Requests
-
-     - The total number of sL1D atomic requests of any size, per
-       :ref:`normalization unit <normalization-units>`. Typically unused on CDNA
-       accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests (1 DWord)
-
-     - The total number of sL1D read requests made for a single dword of data
-       (4B), per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests (2 DWord)
-
-     - The total number of sL1D read requests made for a two dwords of data
-       (8B), per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests (4 DWord)
-
-     - The total number of sL1D read requests made for a four dwords of data
-       (16B), per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests (8 DWord)
-
-     - The total number of sL1D read requests made for a eight dwords of data
-       (32B), per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests (16 DWord)
-
-     - The total number of sL1D read requests made for a sixteen dwords of data
-       (64B), per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
+.. jinja:: desc-sl1d-stats
+   :file: _templates/metrics_table.j2

 .. _desc-sl1d-l2-interface:

@@ -222,56 +94,8 @@ This panel gives more detail on the data requested across the
 sL1D↔
 :doc:`L2 <l2-cache>` interface.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - sL1D-L2 BW
-
-     - The total number of bytes read from, written to, or atomically updated
-       across the sL1D↔:doc:`L2 <l2-cache>` interface, per
-       :ref:`normalization unit <normalization-units>`. Note that sL1D writes
-       and atomics are typically unused on current CDNA accelerators, so in the
-       majority of cases this can be interpreted as an sL1D→L2 read bandwidth.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`
-
-   * - Read Requests
-
-     - The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,
-       per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Write Requests
-
-     - The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,
-       per :ref:`normalization unit <normalization-units>`. Typically unused on
-       current CDNA accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Requests
-
-     - The total number of atomic requests from sL1D to the
-       :doc:`L2 <l2-cache>`, per
-       :ref:`normalization unit <normalization-units>`. Typically unused on
-       current CDNA accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Stall Cycles
-
-     - The total number of cycles the sL1D↔
-       :doc:`L2 <l2-cache>` interface was stalled, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
+.. jinja:: desc-sl1d-l2-interface
+   :file: _templates/metrics_table.j2

 .. rubric:: Footnotes

@@ -318,46 +142,8 @@ The L1 Instruction Cache speed-of-light chart shows some key metrics of
 the L1I cache as a comparison with the peak achievable values of those
 metrics:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the L1I cache, as a percent of the peak
-       theoretical bandwidth. Calculated as the ratio of L1I requests over the
-       :ref:`total L1I cycles <total-l1i-cycles>`.
-
-     - Percent
-
-   * - Cache Hit Rate
-
-     - The percent of L1I requests that hit on a previously loaded line in the
-       cache. Calculated as the ratio of the number of L1I requests that hit
-       [#l1i-cache]_ over the number of all L1I requests.
-
-     - Percent
-
-   * - L1I-L2 BW
-
-     - The percent of the peak theoretical L1I → L2 cache request bandwidth
-       achieved. Calculated as the ratio of the total number of requests from
-       the L1I to the L2 cache over the
-       :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
-
-     - Percent
-
-   * - Instruction Fetch Latency
-
-     - The average number of cycles spent to fetch instructions to a
-       :doc:`CU <compute-unit>`.
-
-     - Cycles
+.. jinja:: desc-l1i-sol
+   :file: _templates/metrics_table.j2

 .. _desc-l1i-stats:

@@ -366,54 +152,10 @@ L1I cache accesses

 This panel gives more detail on the hit/miss statistics of the L1I:

-.. list-table::
-   :header-rows: 1
+.. jinja:: desc-l1i-stats
+   :file: _templates/metrics_table.j2

-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Requests
-
-     - The total number of requests made to the L1I per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hits
-
-     - The total number of L1I requests that hit on a previously loaded cache
-       line, per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses - Non Duplicated
-
-     - The total number of L1I requests that missed on a cache line that
-       *were not* already pending due to another request, per
-       :ref:`normalization unit <normalization-units>`. See note in
-       :ref:`desc-l1i-sol` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Misses - Duplicated
-
-     - The total number of L1I requests that missed on a cache line that *were*
-       already pending due to another request, per
-       :ref:`normalization unit <normalization-units>`. See note in
-       :ref:`desc-l1i-sol` for more detail.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Cache Hit Rate
-
-     - The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
-       line in the cache. Calculated as the ratio of the number of L1I requests
-       that hit over the number of all L1I requests.
-
-     - Percent
+.. _desc-l1i-l2-interface:

 L1I - L2 interface
 ------------------
@@ -421,21 +163,8 @@ L1I - L2 interface
 This panel gives more detail on the data requested across the
 L1I-:doc:`L2 <l2-cache>` interface.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - L1I-L2 BW
-
-     - The total number of bytes read across the L1I-:doc:`L2 <l2-cache>`
-       interface, per :ref:`normalization unit <normalization-units>`.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`
+.. jinja:: desc-l1i-l2-interface
+   :file: _templates/metrics_table.j2

 .. rubric:: Footnotes

@@ -493,90 +222,18 @@ issuing concurrently).
   kernels). This means that these scheduler-pipe utilization metrics are
   expected to reach (for example) a maximum of one pipe active -- only 25%.

+.. _spi-util:
+
 Workgroup manager utilizations
 ------------------------------

 This section describes the utilization of the workgroup manager, and the
 hardware components it interacts with.

-.. list-table::
-   :header-rows: 1
-   :widths: 20 65 15
+.. jinja:: spi-util
+   :file: _templates/metrics_table.j2

-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Accelerator utilization
-
-     - The percent of cycles in the kernel where the accelerator was actively
-       doing any work.
-
-     - Percent
-
-   * - Scheduler-pipe utilization
-
-     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
-       the kernel where the scheduler-pipes were actively doing any work. Note:
-       this value is expected to range between 0% and 25%. See :ref:`desc-spi`.
-
-     - Percent
-
-   * - Workgroup manager utilization
-
-     - The percent of cycles in the kernel where the workgroup manager was
-       actively doing any work.
-
-     - Percent
-
-   * - Shader engine utilization
-
-     - The percent of :ref:`total shader engine cycles <total-se-cycles>` in the
-       kernel where any CU in a shader-engine was actively doing any work,
-       normalized over all shader-engines. Low values (e.g., << 100%) indicate
-       that the accelerator was not fully saturated by the kernel, or a
-       potential load-imbalance issue.
-
-     - Percent
-
-   * - SIMD utilization
-
-     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
-       where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work,
-       summed over all CUs. Low values (less than 100%) indicate that the
-       accelerator was not fully saturated by the kernel, or a potential
-       load-imbalance issue.
-
-     - Percent
-
-   * - Dispatched workgroups
-
-     - The total number of workgroups forming this kernel launch.
-
-     - Workgroups
-
-   * - Dispatched wavefronts
-
-     - The total number of wavefronts, summed over all workgroups, forming this
-       kernel launch.
-
-     - Wavefronts
-
-   * - VGPR writes
-
-     - The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`
-       at wave creation.
-
-     - Cycles/wave
-
-   * - SGPR Writes
-
-     - The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`
-       at wave creation.
-
-     - Cycles/wave
+.. _spi-resc-util:

 Resource allocation
 -------------------
@@ -590,117 +247,5 @@ limited by LDS usage, for example, but may still achieve high occupancy levels
 such that improving occupancy further may not improve performance. See
 :ref:`occupancy-example` for details.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Not-scheduled rate (Workgroup Manager)
-
-     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
-       the kernel where a workgroup could not be scheduled to a
-       :doc:`CU <compute-unit>` due to a bottleneck within the workgroup manager
-       rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
-       resources. Note: this value is expected to range between 0-25%. See note
-       in :ref:`workgroup manager <desc-spi>` description.
-
-     - Percent
-
-   * - Not-scheduled rate (Scheduler-Pipe)
-
-     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
-       the kernel where a workgroup could not be scheduled to a
-       :doc:`CU <compute-unit>` due to a bottleneck within the scheduler-pipes
-       rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
-       resources. Note: this value is expected to range between 0-25%, see note
-       in :ref:`workgroup manager <desc-spi>` description.
-
-     - Percent
-
-   * - Scheduler-Pipe Stall Rate
-
-     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
-       the kernel where a workgroup could not be scheduled to a
-       :doc:`CU <compute-unit>` due to occupancy limitations (like a lack of a
-       CU or :ref:`SIMD <desc-valu>` with sufficient resources). Note: this
-       value is expected to range between 0-25%, see note in
-       :ref:`workgroup manager <desc-spi>` description.
-
-     - Percent
-
-   * - Scratch Stall Rate
-
-     - The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the
-       kernel where a workgroup could not be scheduled to a
-       :doc:`CU <compute-unit>` due to lack of
-       :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While this
-       can reach up to 100%, note that the actual occupancy limitations on a
-       kernel using private memory are typically quite small (for example, less
-       than 1% of the total number of waves that can be scheduled to an
-       accelerator).
-
-     - Percent
-
-   * - Insufficient SIMD Waveslots
-
-     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
-       due to lack of available :ref:`waveslots <desc-valu>`.
-
-     - Percent
-
-   * - Insufficient SIMD VGPRs
-
-     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
-       due to lack of available :ref:`VGPRs <desc-valu>`.
-
-     - Percent
-
-   * - Insufficient SIMD SGPRs
-
-     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
-       due to lack of available :ref:`SGPRs <desc-salu>`.
-
-     - Percent
-
-   * - Insufficient CU LDS
-
-     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
-       due to lack of available :doc:`LDS <local-data-share>`.
-
-     - Percent
-
-   * - Insufficient CU Barriers
-
-     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
-       due to lack of available :ref:`barriers <desc-barrier>`.
-
-     - Percent
-
-   * - Reached CU Workgroup Limit
-
-     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
-       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
-       due to limits within the workgroup manager. This is expected to
-       always be zero on CDNA2 or newer accelerators (and small for previous
-       accelerators).
-
-     - Percent
-
-   * - Reached CU Wavefront Limit
-
-     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
-       where a wavefront could not be scheduled to a :doc:`CU <compute-unit>`
-       due to limits within the workgroup manager. This is expected to
-       always be zero on CDNA2 or newer accelerators (and small for previous
-       accelerators).
-
-     - Percent
+.. jinja:: spi-resc-util
+   :file: _templates/metrics_table.j2
@@ -2,6 +2,8 @@
   :description: ROCm Compute Profiler performance model: System Speed-of-Light
   :keywords: Omniperf, ROCm Compute Profiler, ROCm, profiler, tool, Instinct, accelerator, AMD, system, speed of light

+.. _sys-sol:
+
 *********************
 System Speed-of-Light
 *********************
@@ -20,308 +22,5 @@ of ROCm Compute Profiler’s profiling report.
   Instinct™ MI-series accelerators. For more detail on how operations are
   counted, see the :ref:`metrics-flop-count` section.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - :ref:`VALU <desc-valu>` FLOPs
-
-     - The total floating-point operations executed per second on the
-       :ref:`VALU <desc-valu>`.  This is also presented as a percent of the peak
-       theoretical FLOPs achievable on the specific accelerator. Note: this does
-       not include any floating-point operations from :ref:`MFMA <desc-mfma>`
-       instructions.
-
-     - GFLOPs
-
-   * - :ref:`VALU <desc-valu>` IOPs
-
-     - The total integer operations executed per second on the
-       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
-       theoretical IOPs achievable on the specific accelerator. Note: this does
-       not include any integer operations from :ref:`MFMA <desc-mfma>`
-       instructions.
-
-     - GIOPs
-
-   * - :ref:`MFMA <desc-mfma>` FLOPs (F8)
-
-     - The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. This does not include any 16-bit
-       brain floating point operations from :ref:`VALU <desc-valu>`
-       instructions. This is also presented as a percent of the peak theoretical
-       F8 MFMA operations achievable on the specific accelerator. It is supported on AMD Instinct MI300 series and later only.
-
-     - GFLOPs
-
-   * - :ref:`MFMA <desc-mfma>` FLOPs (BF16)
-
-     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 16-bit
-       brain floating point operations from :ref:`VALU <desc-valu>`
-       instructions. This is also presented as a percent of the peak theoretical
-       BF16 MFMA operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - :ref:`MFMA <desc-mfma>` FLOPs (F16)
-
-     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 16-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F16 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - :ref:`MFMA <desc-mfma>` FLOPs (F32)
-
-     - The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 32-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F32 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - :ref:`MFMA <desc-mfma>` FLOPs (F64)
-
-     - The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
-       operations executed per second. Note: this does not include any 64-bit
-       floating point operations from :ref:`VALU <desc-valu>` instructions. This
-       is also presented as a percent of the peak theoretical F64 MFMA
-       operations achievable on the specific accelerator.
-
-     - GFLOPs
-
-   * - :ref:`MFMA <desc-mfma>` IOPs (INT8)
-
-     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
-       executed per second. Note: this does not include any 8-bit integer
-       operations from :ref:`VALU <desc-valu>` instructions. This is also
-       presented as a percent of the peak theoretical INT8 MFMA operations
-       achievable on the specific accelerator.
-
-     - GIOPs
-
-   * - :ref:`SALU <desc-salu>` utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
-       ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing :ref:`SALU <desc-salu>` or
-       :ref:`SMEM <desc-salu>` instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - :ref:`VALU <desc-valu>` utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`VALU <desc-valu>` was busy executing instructions. Does not include
-       :ref:`VMEM <desc-vmem>` operations.  Computed as the ratio of the total
-       number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
-       :ref:`VALU <desc-valu>` instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - :ref:`MFMA <desc-mfma>` utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
-       the ratio of the total number of cycles the MFMA was busy over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - :ref:`VMEM <desc-valu>` utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`VMEM <desc-valu>` unit was busy executing instructions, including
-       both global/generic and spill/scratch operations (see the
-       :ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
-       detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
-       the ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
-       :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - :ref:`Branch <desc-branch>` utilization
-
-     - Indicates what percent of the kernel's duration the
-       :ref:`branch <desc-branch>` unit was busy executing instructions.
-       Computed as the ratio of the total number of cycles spent by the
-       :ref:`scheduler <desc-scheduler>` issuing :ref:`branch <desc-branch>`
-       instructions over the :ref:`total CU cycles <total-cu-cycles>`.
-
-     - Percent
-
-   * - :ref:`VALU <desc-valu>` active threads
-
-     - Indicates the average level of :ref:`divergence <desc-divergence>` within
-       a wavefront over the lifetime of the kernel. The number of work-items
-       that were active in a wavefront during execution of each
-       :ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
-       instructions run on all wavefronts in the kernel.
-
-     - Work-items
-
-   * - IPC
-
-     - The ratio of the total number of instructions executed on the
-       :doc:`CU <compute-unit>` over the
-       :ref:`total active CU cycles <total-active-cu-cycles>`. This is also
-       presented as a percent of the peak theoretical bandwidth achievable on
-       the specific accelerator.
-
-     - Instructions per-cycle
-
-   * - Wavefront occupancy
-
-     - The time-averaged number of wavefronts resident on the accelerator over
-       the lifetime of the kernel. Note: this metric may be inaccurate for
-       short-running kernels (less than 1ms). This is also presented as a
-       percent of the peak theoretical occupancy achievable on the specific
-       accelerator.
-
-     - Wavefronts
-
-   * - :doc:`LDS <local-data-share>` theoretical bandwidth
-
-     - Indicates the maximum amount of bytes that could have been loaded from,
-       stored to, or atomically updated in the LDS per unit time (see
-       :ref:`LDS Bandwidth <lds-bandwidth>` example for more detail). This is
-       also presented as a percent of the peak theoretical LDS bandwidth
-       achievable on the specific accelerator.
-
-     - GB/s
-
-   * - :doc:`LDS <local-data-share>` bank conflicts/access
-
-     - The ratio of the number of cycles spent in the
-       :doc:`LDS scheduler <local-data-share>` due to bank conflicts (as
-       determined by the conflict resolution hardware) to the base number of
-       cycles that would be spent in the LDS scheduler in a completely
-       uncontended case. This is also presented in normalized form (i.e., the
-       Bank Conflict Rate).
-
-     - Conflicts/Access
-
-   * - :doc:`vL1D <vector-l1-cache>` cache hit rate
-
-     - The ratio of the number of vL1D cache line requests that hit in vL1D
-       cache over the total number of cache line requests to the
-       :ref:`vL1D cache RAM <desc-tc>`.
-
-     - Percent
-
-   * - :doc:`vL1D <vector-l1-cache>` cache bandwidth
-
-     - The number of bytes looked up in the vL1D cache as a result of
-       :ref:`VMEM <desc-vmem>` instructions per unit time. The number of bytes
-       is calculated as the number of cache lines requested multiplied by the
-       cache line size. This value does not consider partial requests, so e.g.,
-       if only a single value is requested in a cache line, the data movement
-       will still be counted as a full cache line. This is also presented as a
-       percent of the peak theoretical bandwidth achievable on the specific
-       accelerator.
-
-     - GB/s
-
-   * - :doc:`L2 <l2-cache>` cache hit rate
-
-     - The ratio of the number of L2 cache line requests that hit in the L2
-       cache over the total number of incoming cache line requests to the L2
-       cache.
-
-     - Percent
-
-   * - :doc:`L2 <l2-cache>` cache bandwidth
-
-     - The number of bytes looked up in the L2 cache per unit time.  The number
-       of bytes is calculated as the number of cache lines requested multiplied
-       by the cache line size. This value does not consider partial requests, so
-       e.g., if only a single value is requested in a cache line, the data
-       movement will still be counted as a full cache line. This is also
-       presented as a percent of the peak theoretical bandwidth achievable on
-       the specific accelerator.
-
-     - GB/s
-
-   * - :doc:`L2 <l2-cache>`-fabric read BW
-
-     - The number of bytes read by the L2 over the
-       :ref:`Infinity Fabric™ interface <l2-fabric>` per unit time. This is also
-       presented as a percent of the peak theoretical bandwidth achievable on
-       the specific accelerator.
-
-     - GB/s
-
-   * - :doc:`L2 <l2-cache>`-fabric write and atomic BW
-
-     - The number of bytes sent by the L2 over the
-       :ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
-       operations per unit time. This is also presented as a percent of the peak
-       theoretical bandwidth achievable on the specific accelerator.
-
-     - GB/s
-
-   * - :doc:`L2 <l2-cache>`-fabric read latency
-
-     - The time-averaged number of cycles read requests spent in Infinity Fabric
-       before data was returned to the L2.
-
-     - Cycles
-
-   * - :doc:`L2 <l2-cache>`-fabric write latency
-
-     - The time-averaged number of cycles write requests spent in Infinity
-       Fabric before a completion acknowledgement was returned to the L2.
-
-     - Cycles
-
-   * - :ref:`sL1D <desc-sl1d>` cache hit rate
-
-     - The percent of sL1D requests that hit on a previously loaded line in the
-       cache. Calculated as the ratio of the number of sL1D requests that hit
-       over the number of all sL1D requests.
-
-     - Percent
-
-   * - :ref:`sL1D <desc-sl1d>` bandwidth
-
-     - The number of bytes looked up in the sL1D cache per unit time. This is
-       also presented as a percent of the peak theoretical bandwidth achievable
-       on the specific accelerator.
-
-     - GB/s
-
-   * - :ref:`L1I <desc-l1i>` bandwidth
-
-     - The number of bytes looked up in the L1I cache per unit time. This is
-       also presented as a percent of the peak theoretical bandwidth achievable
-       on the specific accelerator.
-
-     - GB/s
-
-   * - :ref:`L1I <desc-l1i>` cache hit rate
-
-     - The percent of L1I requests that hit on a previously loaded line in the
-       cache. Calculated as the ratio of the number of L1I requests that hit
-       over the number of all L1I requests.
-
-     - Percent
-
-   * - :ref:`L1I <desc-l1i>` fetch latency
-
-     - The average number of cycles spent to fetch instructions to a
-       :doc:`CU <compute-unit>`.
-
-     - Cycles
+.. jinja:: sys-sol
+   :file: _templates/metrics_table.j2
@@ -63,53 +63,8 @@ vL1D Speed-of-Light
 The vL1D’s speed-of-light chart shows several key metrics for the vL1D
 as a comparison with the peak achievable values of those metrics.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Hit Rate
-
-     - The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_
-       in vL1D cache over the total number of cache line requests to the
-       :ref:`vL1D Cache RAM <desc-tc>`.
-
-     - Percent
-
-   * - Bandwidth
-
-     - The number of bytes looked up in the vL1D cache as a result of
-       :ref:`VMEM <desc-vmem>` instructions, as a percent of the peak
-       theoretical bandwidth achievable on the specific accelerator. The number
-       of bytes is calculated as the number of cache lines requested multiplied
-       by the cache line size. This value does not consider partial requests, so
-       for instance, if only a single value is requested in a cache line, the
-       data movement will still be counted as a full cache line.
-
-     - Percent
-
-   * - Utilization
-
-     - Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the
-       kernel execution. The number of cycles where the vL1D Cache RAM is
-       actively processing any request divided by the number of cycles where the
-       vL1D is active [#vl1d-activity]_.
-
-     - Percent
-
-   * - Coalescing
-
-     - Indicates how well memory instructions were coalesced by the
-       :ref:`address processing unit <desc-ta>`, ranging from uncoalesced (25%)
-       to fully coalesced (100%). Calculated as the average number of
-       :ref:`thread-requests <thread-requests>` generated per instruction
-       divided by the ideal number of thread-requests per instruction.
-
-     - Percent
+.. jinja:: vl1d-sol
+   :file: _templates/metrics_table.j2

 .. _desc-ta:

@@ -145,45 +100,8 @@ processing unit. When the front-end cannot accept any more addresses, it
 must backpressure the wave-issue logic for the VMEM pipe and prevent the
 issue of further vector memory instructions.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Busy
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
-       processor was busy.
-
-     - Percent
-
-   * - Address Stall
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
-       processor was stalled from sending address requests further into the vL1D
-       pipeline.
-
-     - Percent
-
-   * - Data Stall
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
-       processor was stalled from sending write/atomic data further into the
-       vL1D pipeline.
-
-     - Percent
-
-   * - Data-Processor → Address Stall
-
-     - Percent of :ref:`total CU cycles <total-cu-cycles>` the address processor
-       was stalled waiting to send command data to the
-       :ref:`data processor <desc-td>`.
-
-     - Percent
+.. jinja:: ta-busy-stall
+   :file: _templates/metrics_table.j2

 .. _ta-instruction-counts:

@@ -232,80 +150,8 @@ kernel. These are broken down into a few major categories:

 The address processor counts these instruction types as follows:

-.. list-table::
-   :header-rows: 1
-
-   * - Type
-
-     - Description
-
-     - Unit
-
-   * - Global/Generic
-
-     - The total number of global & generic memory instructions executed on all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Global/Generic Read
-
-     - The total number of global & generic memory read instructions executed on
-       all :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Global/Generic Write
-
-     - The total number of global & generic memory write instructions executed
-       on all :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Global/Generic Atomic
-
-     - The total number of global & generic memory atomic (with and without
-       return) instructions executed on all :doc:`compute units <compute-unit>`
-       on the accelerator, per :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack
-
-     - The total number of spill/stack memory instructions executed on all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack Read
-
-     - The total number of spill/stack memory read instructions executed on all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack Write
-
-     - The total number of spill/stack memory write instructions executed on all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack Atomic
-
-     - The total number of spill/stack memory atomic (with and without return)
-       instructions executed on all :doc:`compute units <compute-unit>` on the
-       accelerator, per :ref:`normalization unit <normalization-units>`.
-       Typically unused as these memory operations are typically used to
-       implement thread-local storage.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
+.. jinja:: ta-instruction-counts
+   :file: _templates/metrics_table.j2

 .. note::

@@ -333,38 +179,8 @@ Spill / stack metrics
 Finally, the address processing unit contains a separate coalescing
 stage for spill/stack memory, and thus reports:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Spill/Stack Total Cycles
-
-     - The number of cycles the address processing unit spent working on
-       spill/stack instructions, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack Coalesced Read Cycles
-
-     - The number of cycles the address processing unit spent working on
-       coalesced spill/stack read instructions, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
-
-   * - Spill/Stack Coalesced Write Cycles
-
-     - The number of cycles the address processing unit spent working on
-       coalesced spill/stack write instructions, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cycles per :ref:`normalization unit <normalization-units>`
+.. jinja:: ta-spill-stack
+   :file: _templates/metrics_table.j2

 .. _desc-utcl1:

@@ -380,52 +196,8 @@ reduce the cost of subsequent re-translations.

 ROCm Compute Profiler reports the following L1 TLB metrics:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Requests
-
-     - The number of translation requests made to the UTCL1 per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hits
-
-     - The number of translation requests that hit in the UTCL1, and could be
-       reused, per :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Hit Ratio
-
-     - The ratio of the number of translation requests that hit in the UTCL1
-       divided by the total number of translation requests made to the UTCL1.
-
-     - Percent
-
-   * - Translation Misses
-
-     - The total number of translation requests that missed in the UTCL1 due to
-       translation not being present in the cache, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Permission Misses
-
-     - The total number of translation requests that missed in the UTCL1 due to
-       a permission error, per :ref:`normalization unit <normalization-units>`.
-       This is unused and expected to be zero in most configurations for modern
-       CDNA™ accelerators.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
+.. jinja:: desc-utcl1
+   :file: _templates/metrics_table.j2

 .. note::

@@ -464,39 +236,8 @@ L2 requests may backpressure the wave-issue logic of the :ref:`VMEM <desc-vmem>`
 pipe and prevent it from issuing more vector memory instructions until
 the vL1D’s outstanding requests are completed.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Stalled on L2 Data
-
-     - The ratio of the number of cycles where the vL1D is stalled waiting for
-       requested data to return from the :doc:`L2 cache <l2-cache>` divided by
-       the number of cycles where the vL1D is active [#vl1d-activity]_.
-
-     - Percent
-
-   * - Stalled on L2 Requests
-
-     - The ratio of the number of cycles where the vL1D is stalled waiting to
-       issue a request for data to the :doc:`L2 cache <l2-cache>` divided by the
-       number of cycles where the vL1D is active [#vl1d-activity]_.
-
-     - Percent
-
-   * - Tag RAM Stall (Read/Write/Atomic)
-
-     - The ratio of the number of cycles where the vL1D is stalled due to
-       Read/Write/Atomic requests with conflicting tags being looked up
-       concurrently, divided by the number of cycles where the
-       vL1D is active [#vl1d-activity]_.
-
-     - Percent
+.. jinja:: vl1d-cache-stall-metrics
+   :file: _templates/metrics_table.j2

 .. _vl1d-cache-access-metrics:

@@ -510,135 +251,8 @@ the :doc:`L2 cache <l2-cache>`. In addition, this section includes the
 approximate latencies of accesses to the cache itself, along with
 latencies of read/write memory operations to the :doc:`L2 cache <l2-cache>`.

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Total Requests
-
-     - The total number of incoming requests from the
-       :ref:`address processing unit <desc-ta>` after coalescing.
-
-     - Requests
-
-   * - Total read/write/atomic requests
-
-     - The total number of incoming read/write/atomic requests from the
-       :ref:`address processing unit <desc-ta>` after coalescing per
-       :ref:`normalization unit <normalization-units>`
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - Cache Bandwidth
-
-     - The number of bytes looked up in the vL1D cache as a result of
-       :ref:`VMEM <desc-vmem>` instructions per
-       :ref:`normalization unit <normalization-units>`.  The number of bytes is
-       calculated as the number of cache lines requested multiplied by the cache
-       line size.  This value does not consider partial requests, so for
-       instance, if only a single value is requested in a cache line, the data
-       movement will still be counted as a full cache line.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`
-
-   * - Cache Hit Rate [#vl1d-hit]_
-
-     - The ratio of the number of vL1D cache line requests that hit in vL1D
-       cache over the total number of cache line requests to the
-       :ref:`vL1D Cache RAM <desc-tc>`.
-
-     - Percent
-
-   * - Cache Accesses
-
-     - The total number of cache line lookups in the vL1D.
-
-     - Cache lines
-
-   * - Cache Hits [#vl1d-hit]_
-
-     - The number of cache accesses minus the number of outgoing requests to the
-       :doc:`L2 cache <l2-cache>`, that is, the number of cache line requests
-       serviced by the :ref:`vL1D Cache RAM <desc-tc>` per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Cache lines per :ref:`normalization unit <normalization-units>`
-
-   * - Invalidations
-
-     - The number of times the vL1D was issued a write-back invalidate command
-       during the kernel's execution per
-       :ref:`normalization unit <normalization-units>`.  This may be triggered
-       by, for instance, the ``buffer_wbinvl1`` instruction.
-
-     - Invalidations per :ref:`normalization unit <normalization-units>`
-
-   * - L1-L2 Bandwidth
-
-     - The number of bytes transferred across the vL1D-L2 interface as a result
-       of :ref:`VMEM <desc-vmem>` instructions, per
-       :ref:`normalization unit <normalization-units>`. The number of bytes is
-       calculated as the number of cache lines requested multiplied by the cache
-       line size. This value does not consider partial requests, so for
-       instance, if only a single value is requested in a cache line, the data
-       movement will still be counted as a full cache line.
-
-     - Bytes per :ref:`normalization unit <normalization-units>`
-
-   * - L1-L2 Reads
-
-     - The number of read requests for a vL1D cache line that were not satisfied
-       by the vL1D and must be retrieved from the to the
-       :doc:`L2 Cache <l2-cache>` per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - L1-L2 Writes
-
-     - The number of write requests to a vL1D cache line that were sent through
-       the vL1D to the :doc:`L2 cache <l2-cache>`, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - L1-L2 Atomics
-
-     - The number of atomic requests that are sent through the vL1D to the
-       :doc:`L2 cache <l2-cache>`, per
-       :ref:`normalization unit <normalization-units>`. This includes requests
-       for atomics with, and without return.
-
-     - Requests per :ref:`normalization unit <normalization-units>`
-
-   * - L1 Access Latency
-
-     - Calculated as the average number of cycles that a vL1D cache line request
-       spent in the vL1D cache pipeline.
-
-     - Cycles
-
-   * - L1-L2 Read Access Latency
-
-     - Calculated as the average number of cycles that the vL1D cache took to
-       issue and receive read requests from the :doc:`L2 Cache <l2-cache>`. This
-       number also includes requests for atomics with return values.
-
-     - Cycles
-
-   * - L1-L2 Write Access Latency
-
-     - Calculated as the average number of cycles that the vL1D cache took to
-       issue and receive acknowledgement of a write request to the
-       :doc:`L2 Cache <l2-cache>`. This number also includes requests for
-       atomics without return values.
-
-     - Cycles
+.. jinja:: vl1d-cache-access-metrics
+   :file: _templates/metrics_table.j2

 .. note::

@@ -687,80 +301,5 @@ data, and returned to the appropriate SIMD.

 ROCm Compute Profiler reports the following vL1D data-return path metrics:

-.. list-table::
-   :header-rows: 1
-
-   * - Metric
-
-     - Description
-
-     - Unit
-
-   * - Data-return Busy
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
-       unit was busy processing or waiting on data to return to the
-       :doc:`CU <compute-unit>`.
-
-     - Percent
-
-   * - Cache RAM → Data-return Stall
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
-       unit was stalled on data to be returned from the
-       :ref:`vL1D Cache RAM <desc-tc>`.
-
-     - Percent
-
-   * - Workgroup manager → Data-return Stall
-
-     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
-       unit was stalled by the :ref:`workgroup manager <desc-spi>` due to
-       initialization of registers as a part of launching new workgroups.
-
-     - Percent
-
-   * - Coalescable Instructions
-
-     - The number of instructions submitted to the
-       :ref:`data-return unit <desc-td>` by the
-       :ref:`address processor <desc-ta>` that were found to be coalescable, per
-       :ref:`normalization unit <normalization-units>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Read Instructions
-
-     - The number of read instructions submitted to the
-       :ref:`data-return unit <desc-td>` by the
-       :ref:`address processor <desc-ta>` summed over all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`. This is expected to be
-       the sum of global/generic and spill/stack reads in the
-       :ref:`address processor <desc-ta>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Write Instructions
-
-     - The number of store instructions submitted to the
-       :ref:`data-return unit <desc-td>` by the
-       :ref:`address processor <desc-ta>` summed over all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`. This is expected to be
-       the sum of global/generic and spill/stack stores counted by the
-       :ref:`vL1D cache-front-end <ta-instruction-counts>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
-
-   * - Atomic Instructions
-
-     - The number of atomic instructions submitted to the
-       :ref:`data-return unit <desc-td>` by the
-       :ref:`address processor <desc-ta>` summed over all
-       :doc:`compute units <compute-unit>` on the accelerator, per
-       :ref:`normalization unit <normalization-units>`. This is expected to be
-       the sum of global/generic and spill/stack atomics in the
-       :ref:`address processor <desc-ta>`.
-
-     - Instructions per :ref:`normalization unit <normalization-units>`
+.. jinja:: desc-td
+   :file: _templates/metrics_table.j2
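For reviewers who want to see what each `.. jinja::` directive expands to, here is a minimal sketch that mimics `_templates/metrics_table.j2` in plain Python. It is not how sphinx-jinja renders the template. The `render_metrics_table` helper and the sample entry are hypothetical, and the entry only assumes the `rst` and `unit` keys that the template reads.

```python
def render_metrics_table(data):
    """Mimic metrics_table.j2: emit one list-table row per metric."""
    lines = [
        ".. list-table::",
        "    :header-rows: 1",
        "",
        "    * - Metric",
        "      - Description",
        "      - Unit",
        "",
    ]
    for metric, metric_info in data.items():
        # Each entry is assumed to carry the `rst` and `unit` keys used by the template.
        lines.append(f"    * - {metric}")
        lines.append(f"      - {metric_info['rst']}")
        lines.append(f"      - {metric_info['unit']}")
    return "\n".join(lines)


# Illustrative stand-in for one section of the generated metrics YAML.
sample = {
    "Hit Ratio": {
        "rst": (
            "The ratio of the number of translation requests that hit in the "
            "UTCL1 divided by the total number of translation requests made "
            "to the UTCL1."
        ),
        "unit": "Percent",
    },
}

table = render_metrics_table(sample)
print(table)
```

Every metrics section shares this one template, so a new table only needs a named context in `conf.py` and a two-line directive in the page.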
@@ -30,6 +30,8 @@

 import re

+import yaml
+
 with open("../VERSION", encoding="utf-8") as f:
    match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
    if not match:
@@ -43,7 +45,12 @@ copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved
 version = version_number
 release = version_number

-extensions = ["rocm_docs", "sphinx.ext.extlinks", "sphinxcontrib.datatemplates"]
+extensions = [
+    "rocm_docs",
+    "sphinx.ext.extlinks",
+    "sphinxcontrib.datatemplates",
+    "sphinx_jinja",
+]
 html_theme = "rocm_docs_theme"
 html_theme_options = {"flavor": "rocm"}
 html_title = f"{project} {version_number} documentation"
@@ -52,6 +59,113 @@ exclude_patterns = ["archive", "*/includes"]
 html_static_path = ["sphinx/static/css"]
 html_css_files = ["o_custom.css"]

+with open("data/metrics_description.yaml", "r", encoding="utf-8") as f:
+    metrics_data = yaml.safe_load(f)
+jinja_contexts = {
+    "wavefront-launch-stats": {
+        "data": metrics_data["Wavefront launch stats"],
+    },
+    "wavefront-runtime-stats": {
+        "data": metrics_data["Wavefront runtime stats"],
+    },
+    "instruction-mix": {
+        "data": metrics_data["Overall instruction mix"],
+    },
+    "valu-arith-instruction-mix": {
+        "data": metrics_data["VALU arithmetic instruction mix"],
+    },
+    "mfma-instruction-mix": {
+        "data": metrics_data["MFMA instruction mix"],
+    },
+    "compute-speed-of-light": {
+        "data": metrics_data["Compute Speed-of-Light"],
+    },
+    "pipeline-stats": {
+        "data": metrics_data["Pipeline statistics"],
+    },
+    "arithmetic-operations": {
+        "data": metrics_data["Arithmetic operations"],
+    },
+    "lds-sol": {
+        "data": metrics_data["LDS Speed-of-Light"],
+    },
+    "lds-stats": {
+        "data": metrics_data["LDS Statistics"],
+    },
+    "vl1d-sol": {
+        "data": metrics_data["vL1D Speed-of-Light"],
+    },
+    "ta-busy-stall": {
+        "data": metrics_data["Busy / stall metrics"],
+    },
+    "ta-instruction-counts": {
+        "data": metrics_data["Instruction counts"],
+    },
+    "ta-spill-stack": {
+        "data": metrics_data["Spill / stack metrics"],
+    },
+    "desc-utcl1": {
+        "data": metrics_data["L1 Unified Translation Cache (UTCL1)"],
+    },
+    "vl1d-cache-stall-metrics": {
+        "data": metrics_data["vL1D cache stall metrics"],
+    },
+    "vl1d-cache-access-metrics": {
+        "data": metrics_data["vL1D cache access metrics"],
+    },
+    "desc-td": {
+        "data": metrics_data["Vector L1 data-return path or Texture Data (TD)"],
+    },
+    "l2-sol": {
+        "data": metrics_data["L2 Speed-of-Light"],
+    },
+    "l2-cache-accesses": {
+        "data": metrics_data["L2 cache accesses"],
+    },
+    "l2-fabric-metrics": {
+        "data": metrics_data["L2-Fabric interface metrics"],
+    },
+    "l2-detailed-metrics": {
+        "data": metrics_data["L2 - Fabric interface detailed metrics"],
+    },
+    "l2-fabric-stalls": {
+        "data": metrics_data["L2 - Fabric Interface stalls"],
+    },
+    "desc-sl1d-sol": {
+        "data": metrics_data["Scalar L1D Speed-of-Light"],
+    },
+    "desc-sl1d-stats": {
+        "data": metrics_data["Scalar L1D cache accesses"],
+    },
+    "desc-sl1d-l2-interface": {
+        "data": metrics_data["Scalar L1D Cache - L2 Interface"],
+    },
+    "desc-l1i-sol": {
+        "data": metrics_data["L1I Speed-of-Light"],
+    },
+    "desc-l1i-stats": {
+        "data": metrics_data["L1I cache accesses"],
+    },
+    "desc-l1i-l2-interface": {
+        "data": metrics_data["L1I <-> L2 interface"],
+    },
+    "spi-util": {
+        "data": metrics_data["Workgroup manager utilizations"],
+    },
+    "spi-resc-util": {
+        "data": metrics_data["Workgroup Manager - Resource Allocation"],
+    },
+    "cpf-metrics": {
+        "data": metrics_data["Command processor fetcher (CPF)"],
+    },
+    "cpc-metrics": {
+        "data": metrics_data["Command processor packet processor (CPC)"],
+    },
+    "sys-sol": {
+        "data": metrics_data["System Speed-of-Light"],
+    },
+}
+
 external_toc_path = "./sphinx/_toc.yml"
 external_projects_current_project = "rocprofiler-compute"

@@ -96,3 +210,6 @@ extlinks = {
        "HSA Runtime Programmer's Reference Manual (page %s)",
    ),
 }
+
+# Setting this to an empty string works around rate-limit-exceeded errors in local builds
+external_projects_remote_repository = ""
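The `jinja_contexts` dictionary in `conf.py` repeats the same `{"data": metrics_data[...]}` shape for every entry. One possible alternative, shown only as a sketch and not as the approach this commit takes, builds the dictionary from a compact mapping of context name to section title and fails early if a title is missing. The `build_jinja_contexts` helper and the placeholder metric entries are hypothetical. The context names and section titles are copied from the diff.

```python
def build_jinja_contexts(metrics_data, context_to_section):
    """Return {context: {"data": section_data}} for every mapped section."""
    missing = sorted(
        section
        for section in context_to_section.values()
        if section not in metrics_data
    )
    if missing:
        # Fail with all missing titles at once instead of on the first lookup.
        raise KeyError(f"sections missing from metrics data: {missing}")
    return {
        context: {"data": metrics_data[section]}
        for context, section in context_to_section.items()
    }


# Illustrative stand-in for the parsed YAML; real entries hold descriptions and units.
metrics_data = {
    "Spill / stack metrics": {
        "Spill/Stack Total Cycles": {"rst": "placeholder", "unit": "placeholder"},
    },
    "L1 Unified Translation Cache (UTCL1)": {
        "Requests": {"rst": "placeholder", "unit": "placeholder"},
    },
}

context_to_section = {
    "ta-spill-stack": "Spill / stack metrics",
    "desc-utcl1": "L1 Unified Translation Cache (UTCL1)",
}

jinja_contexts = build_jinja_contexts(metrics_data, context_to_section)
print(sorted(jinja_contexts))
```

The explicit dictionary in the commit is easier to search for a given context name, while the mapping form reports every missing section title in a single error.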
@@ -242,6 +242,11 @@ List metrics

     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-metrics gfx90a

+Show the Description column, which is excluded by default in CLI output
+  .. code-block:: shell
+
+     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-metrics gfx90a --include-cols Description
+
 Show System Speed-of-Light and CS_Busy blocks only
  .. code-block:: shell

@@ -1,2 +1,3 @@
 rocm-docs-core==1.21.1
 sphinxcontrib.datatemplates==0.11.0
+sphinx-jinja==2.0.2
@@ -53,7 +53,8 @@ docutils==0.21.2
    #   myst-parser
    #   pydata-sphinx-theme
    #   sphinx
-exceptiongroup==1.2.2
+    #   sphinx-jinja
+exceptiongroup==1.3.0
    # via ipython
 executing==2.2.0
    # via stack-data
@@ -87,6 +88,7 @@ jinja2==3.1.5
    # via
    #   myst-parser
    #   sphinx
+    #   sphinx-jinja
 jsonschema==4.23.0
    # via nbformat
 jsonschema-specifications==2024.10.1
@@ -215,6 +217,7 @@ sphinx==8.1.3
    #   sphinx-copybutton
    #   sphinx-design
    #   sphinx-external-toc
+    #   sphinx-jinja
    #   sphinx-notfound-page
    #   sphinxcontrib-datatemplates
    #   sphinxcontrib-runcmd
@@ -226,6 +229,8 @@ sphinx-design==0.6.1
    # via rocm-docs-core
 sphinx-external-toc==1.0.1
    # via rocm-docs-core
+sphinx-jinja==2.0.2
+    # via -r requirements.in
 sphinx-notfound-page==1.0.4
    # via rocm-docs-core
 sphinxcontrib-applehelp==2.0.0
@@ -268,6 +273,7 @@ traitlets==5.14.3
    #   nbformat
 typing-extensions==4.12.2
    # via
+    #   exceptiongroup
    #   ipython
    #   myst-nb
    #   pydata-sphinx-theme