Unified configuration for metrics (#726)

* Show description of metrics during analysis
* Use --include-cols Description to show the Description column in analyze mode (this is hidden by default)
* Remove tips field from analysis config
* Align metric names in analysis config and documentation
* Add unified config utils/unified_config.yaml
* Add Python script utils/split_config.py to auto-generate analysis configuration and documentation metrics description
* Add test case to ensure unified config is older than auto-generated config
* Auto-generate analysis config and documentation metrics description
* Update CONTRIBUTING.md to add instructions to build documentation assets
* Add docker image and compose file to build documentation
* Update CHANGELOG and Documentation
* Use jinja template instead of hardcoding metric tables in documentation

[ROCm/rocprofiler-compute commit: bb44e90b2d]
2025-07-25 14:01:34 -04:00
Commit 354fe5f52c
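
The commit message describes one unified config feeding two generated outputs: the analysis configuration and the documentation metric descriptions. The following is a minimal sketch of that idea, not the actual `utils/split_config.py`. The `plain` field and the in-memory dictionary layout are assumptions made for illustration; only the `rst` and `unit` field names appear in this commit (in the Jinja template).

```python
def split_unified(unified):
    """Split one metric mapping into an analysis view and a docs view.

    Hypothetical schema: each entry carries a `plain` description for CLI
    output plus the `rst` and `unit` fields consumed by the docs template.
    """
    analysis = {}
    docs = {}
    for name, entry in unified.items():
        # The analysis side only needs a plain-text description.
        analysis[name] = entry["plain"]
        # The docs side keeps the RST markup and the unit.
        docs[name] = {"rst": entry["rst"], "unit": entry["unit"]}
    return analysis, docs


# Example entry based on a metric documented in this commit.
unified = {
    "CPC-Workgroup Manager Utilization": {
        "plain": "Percent of CPC busy cycles spent dispatching workgroups "
                 "to the workgroup manager.",
        "rst": "Percent of CPC busy cycles spent dispatching workgroups "
               "to the :ref:`workgroup manager <desc-spi>`.",
        "unit": "Percent",
    },
}

analysis, docs = split_unified(unified)
```

Keeping both descriptions in a single entry is what lets metric names stay aligned between the two outputs, since each name is written exactly once.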
@@ -66,6 +66,9 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
 * Add deprecation warning for database update mode.
 * Show description of metrics during analysis
  * Use `--include-cols Description` to show `Description` column which is excluded by default from cli output
 ### Changed
 * Change the default rocprof version to rocprofv3, this is used when environment variable "ROCPROF" is not set
@@ -101,6 +104,7 @@ Full documentation for ROCm Compute Profiler is available at [https://rocm.docs.
 * Fixed not detecting memory clock issue when using amd-smi
 * Fixed standalone GUI crashing
 * Fixed L2 read/write/atomic bandwidths on MI350
 * Update metric names for better alignment between analysis configuration and documentation
 ### Known issues
@@ -335,6 +335,16 @@ add_test(
            ${PROJECT_SOURCE_DIR}/tests/test_utils.py
    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR})
 # -----------------------------------
 # Autogenerated configuration tests
 # -----------------------------------
 add_test(
    NAME test_autogen_config
    COMMAND ${Python3_EXECUTABLE} -m pytest --junitxml=tests/test_autogen_config.xml
            ${COV_OPTION} ${PROJECT_SOURCE_DIR}/tests/test_autogen_config.py
    WORKING_DIRECTORY ${PROJECT_SOURCE_DIR})
 # ---------
 # Install
 # ---------
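
The `test_autogen_config` target above runs `tests/test_autogen_config.py`, which per the commit message ensures the unified config is older than the auto-generated config. The contents of that test are not shown in this diff, so the following is only a sketch of one way such a freshness check could be written, using modification times. The helper names and the generated file name are hypothetical.

```python
import os
import tempfile
from pathlib import Path
from typing import List


def is_older(source: Path, generated: Path) -> bool:
    """Return True when `source` was modified no later than `generated`."""
    return source.stat().st_mtime <= generated.stat().st_mtime


def stale_outputs(source: Path, generated: List[Path]) -> List[Path]:
    """List generated files that predate the source and need regenerating."""
    return [path for path in generated if not is_older(source, path)]


with tempfile.TemporaryDirectory() as tmp:
    unified = Path(tmp) / "unified_config.yaml"
    derived = Path(tmp) / "analysis_config.yaml"  # hypothetical output name
    unified.write_text("metrics: {}\n")
    derived.write_text("metrics: {}\n")

    # Pin explicit timestamps so the demo does not depend on filesystem timing.
    os.utime(unified, (1_000, 1_000))
    os.utime(derived, (2_000, 2_000))
    fresh = stale_outputs(unified, [derived])

    # Simulate editing the unified config without regenerating the output.
    os.utime(unified, (3_000, 3_000))
    stale = stale_outputs(unified, [derived])
```

A check of this kind fails whenever someone edits the unified config and forgets to rerun the generator, which is the situation the test case in this commit is meant to catch.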
@@ -57,3 +57,7 @@ Please see the [pre-commit documentation](https://pre-commit.com/#quick-start) f
 Below are some repository specific guidelines which are followed throughout the repository.
 Any future contributions should adhere to these guidelines:
 * Use the `pathlib` library functions instead of `os.path` for manipulating the file paths.
 ## Build and test documentation changes
 For instructions on how to build and test documentation changes (files under docs folder), please see https://rocm.docs.amd.com/en/latest/contribute/contributing.html
@@ -3,11 +3,6 @@ services:
    build:
      context: ../
      dockerfile: docker/Dockerfile.doctest
    devices:
      - /dev/kfd
      - /dev/dri
    security_opt:
      - seccomp:unconfined
    volumes:
      - ../:/app
    tty: true
@@ -0,0 +1,12 @@
 .. list-table::
    :header-rows: 1
    * - Metric
      - Description
      - Unit
    {% for metric, metric_info in data.items() %}
    * - {{ metric }}
      - {{ metric_info.rst }}
      - {{ metric_info.unit }}
    {% endfor %}
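
To make the template's output concrete, the sketch below reproduces its expansion in plain Python rather than calling Jinja itself. It assumes `data` maps each metric name to a dictionary with `rst` and `unit` keys, matching the `metric_info.rst` and `metric_info.unit` lookups in the template. The function name is hypothetical, and the two sample rows are taken from the CPF table that this commit replaces.

```python
def render_metrics_table(data):
    """Build the RST list-table that the metrics template would expand to."""
    lines = [
        ".. list-table::",
        "   :header-rows: 1",
        "",
        "   * - Metric",
        "     - Description",
        "     - Unit",
    ]
    # One three-cell row per metric, in the mapping's insertion order.
    for metric, metric_info in data.items():
        lines += [
            f"   * - {metric}",
            f"     - {metric_info['rst']}",
            f"     - {metric_info['unit']}",
        ]
    return "\n".join(lines) + "\n"


data = {
    "CPF Utilization": {
        "rst": "Percent of total cycles where the CPF was busy actively "
               "doing any work.",
        "unit": "Percent",
    },
    "CPF Stall": {
        "rst": "Percent of CPF busy cycles where the CPF was stalled for "
               "any reason.",
        "unit": "Percent",
    },
}

table = render_metrics_table(data)
print(table)
```

Because every table shares the same three columns, one template plus a per-section data mapping can replace each hand-written `list-table` in the docs.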
@@ -46,108 +46,13 @@ processor’s metrics therefore are focused on reporting, for example:
 Command processor fetcher (CPF)
 ===============================
-.. list-table::
+.. jinja:: cpf-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - CPF Utilization
     - Percent of total cycles where the CPF was busy actively doing any work.
       The ratio of CPF busy cycles over total cycles counted by the CPF.
     - Percent
   * - CPF Stall
     - Percent of CPF busy cycles where the CPF was stalled for any reason.
     - Percent
   * - CPF-L2 Utilization
     - Percent of total cycles counted by the CPF-:doc:`L2 <l2-cache>` interface
       where the CPF-L2 interface was active doing any work. The ratio of CPF-L2
       busy cycles over total cycles counted by the CPF-L2.
     - Percent
   * - CPF-L2 Stall
     - Percent of CPF-:doc:`L2 <l2-cache>` L2 busy cycles where the CPF-L2
       interface was stalled for any reason.
     - Percent
   * - CPF-UTCL1 Stall
     - Percent of CPF busy cycles where the CPF was stalled by address
       translation.
     - Percent
 .. _cpc-metrics:
 Command processor packet processor (CPC)
 ========================================
-.. list-table::
+.. jinja:: cpc-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - CPC Utilization
     - Percent of total cycles where the CPC was busy actively doing any work.
       The ratio of CPC busy cycles over total cycles counted by the CPC.
     - Percent
   * - CPC Stall
     - Percent of CPC busy cycles where the CPC was stalled for any reason.
     - Percent
   * - CPC Packet Decoding Utilization
     - Percent of CPC busy cycles spent decoding commands for processing.
     - Percent
   * - CPC-Workgroup Manager Utilization
     - Percent of CPC busy cycles spent dispatching workgroups to the
       :ref:`workgroup manager <desc-spi>`.
     - Percent
   * - CPC-L2 Utilization
     - Percent of total cycles counted by the CPC-:doc:`L2 <l2-cache>` interface
       where the CPC-L2 interface was active doing any work.
     - Percent
   * - CPC-UTCL1 Stall
     - Percent of CPC busy cycles where the CPC was stalled by address
       translation.
     - Percent
   * - CPC-UTCL2 Utilization
     - Percent of total cycles counted by the CPC's :doc:`L2 <l2-cache>` address
       translation interface where the CPC was busy doing address translation
       work.
     - Percent
@@ -48,56 +48,8 @@ The L2 cache’s speed-of-light table contains a few key metrics about the
 performance of the L2 cache, aggregated over all the L2 channels, as a
 comparison with the peak achievable values of those metrics:
-.. list-table::
+.. jinja:: l2-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Utilization
     - The ratio of the
       :ref:`number of cycles an L2 channel was active, summed over all L2 channels on the accelerator <total-active-l2-cycles>`
       over the :ref:`total L2 cycles <total-l2-cycles>`.
     - Percent
   * - Bandwidth
     - The number of bytes looked up in the L2 cache, as a percent of the peak
       theoretical bandwidth achievable on the specific accelerator. The number
       of bytes is calculated as the number of cache lines requested multiplied
       by the cache line size. This value does not consider partial requests, so
       e.g., if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
     - Percent
   * - Hit Rate
     - The ratio of the number of L2 cache line requests that hit in the L2
       cache over the total number of incoming cache line requests to the L2
       cache.
     - Percent
   * - L2-Fabric Read BW
     - The number of bytes read by the L2 over the
       :ref:`Infinity Fabric interface <l2-fabric>` per unit time.
     - GB/s
   * - L2-Fabric Write and Atomic BW
     - The number of bytes sent by the L2 over the
       :ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
       operations per unit time.
     - GB/s
 .. note::
@@ -117,168 +69,8 @@ This section details the incoming requests to the L2 cache from the
 :doc:`vL1D <vector-l1-cache>` and other clients -- for instance, the
 :ref:`sL1D <desc-sL1D>` and :ref:`L1I <desc-l1i>` caches.
-.. list-table::
+.. jinja:: l2-cache-accesses
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 13 70 17
   * - Metric
     - Description
     - Unit
   * - Bandwidth
     - The number of bytes looked up in the L2 cache, per
       :ref:`normalization unit <normalization-units>`.  The number of bytes is
       calculated as the number of cache lines requested multiplied by the cache
       line size. This value does not consider partial requests, so for example,
       if only a single value is requested in a cache line, the data movement
       will still be counted as a full cache line.
     - Bytes per :ref:`normalization unit <normalization-units>`.
   * - Requests
     - The total number of incoming requests to the L2 from all clients for all
       request types, per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Read Requests
     - The total number of read requests to the L2 from all clients.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Write Requests
     - The total number of write requests to the L2 from all clients.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Atomic Requests
     - The total number of atomic requests (with and without return) to the L2
       from all clients.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Streaming Requests
     - The total number of incoming requests to the L2 that are marked as
       *streaming*. The exact meaning of this may differ depending on the
       targeted accelerator, however on an :ref:`MI2XX <mixxx-note>` this
       corresponds to
       `non-temporal load or stores <https://clang.llvm.org/docs/LanguageExtensions.html#non-temporal-load-store-builtins>`_.
       The L2 cache attempts to evict *streaming* requests before normal
       requests when the L2 is at capacity.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Probe Requests
     - The number of coherence probe requests made to the L2 cache from outside
       the accelerator. On an :ref:`MI2XX <mixxx-note>`, probe requests may be
       generated by, for example, writes to
       :ref:`fine-grained device <memory-type>` memory or by writes to
       :ref:`coarse-grained <memory-type>` device memory.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Hit Rate
     - The ratio of the number of L2 cache line requests that hit in the L2
       cache over the total number of incoming cache line requests to the L2
       cache.
     - Percent
   * - Hits
     - The total number of requests to the L2 from all clients that hit in the
       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, this
       includes hit-on-miss requests.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Misses
     - The total number of requests to the L2 from all clients that miss in the
       cache. As noted in the :ref:`Speed-of-Light <l2-sol>` section, these do
       not include hit-on-miss requests.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Writebacks
     - The total number of L2 cache lines written back to memory for any reason.
       Write-backs may occur due to user code (such as HIP kernel calls to
       ``__threadfence_system`` or atomic built-ins) by the
       :doc:`command processor <command-processor>`'s memory acquire/release
       fences, or for other internal hardware reasons.
     - Cache lines per :ref:`normalization unit <normalization-units>`
   * - Writebacks (Internal)
     - The total number of L2 cache lines written back to memory for internal
       hardware reasons, per :ref:`normalization unit <normalization-units>`.
     - Cache lines per :ref:`normalization unit <normalization-units>`.
   * - Writebacks (vL1D Req)
     - The total number of L2 cache lines written back to memory due to requests
       initiated by the :doc:`vL1D cache <vector-l1-cache>`, per
       :ref:`normalization unit <normalization-units>`.
     - Cache lines per :ref:`normalization unit <normalization-units>`.
   * - Evictions (Normal)
     - The total number of L2 cache lines evicted from the cache due to capacity
       limits, per :ref:`normalization unit <normalization-units>`.
     - Cache lines per :ref:`normalization unit <normalization-units>`.
   * - Evictions (vL1D Req)
     - The total number of L2 cache lines evicted from the cache due to
       invalidation requests initiated by the
       :doc:`vL1D cache <vector-l1-cache>`, per
       :ref:`normalization unit <normalization-units>`.
     - Cache lines per :ref:`normalization unit <normalization-units>`.
   * - Non-hardware-Coherent Requests
     - The total number of requests to the L2 to Not-hardware-Coherent (NC)
       memory allocations, per :ref:`normalization unit <normalization-units>`.
       See the :ref:`memory-type` for more information.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Uncached Requests
     - The total number of requests to the L2 that go to Uncached (UC) memory
       allocations. See the :ref:`memory-type` for more information.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Coherently Cached Requests
     - The total number of requests to the L2 that go to Coherently Cacheable (CC)
       memory allocations. See the :ref:`memory-type` for more information.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Read/Write Coherent Requests
     - The total number of requests to the L2 that go to Read-Write coherent memory
       (RW) allocations. See the :ref:`memory-type` for more information.
     - Requests per :ref:`normalization unit <normalization-units>`.
 .. note::
@@ -300,7 +92,7 @@ is responsible for routing these memory requests/data to the correct
 location and returning any fetched data to the L2 cache. The
 :ref:`l2-request-flow` describes the flow of these requests through
 Infinity Fabric in more detail, as described by ROCm Compute Profiler metrics,
-while :ref:`l2-request-metrics` give detailed definitions of
+while :ref:`l2-fabric` give detailed definitions of
 individual metrics.
 .. _l2-request-flow:
@@ -363,176 +155,15 @@ to uncached memory (denoted by the dashed line), they will also be
 counted as *two* uncached read requests (that is, the request is split).
-.. _l2-request-metrics:
+.. _l2-fabric-metrics:
 Metrics
 -------
 The following metrics are reported for the L2-Fabric interface:
-.. list-table::
+.. jinja:: l2-fabric-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - L2-Fabric Read Bandwidth
     - The total number of bytes read by the L2 cache from Infinity Fabric per
       :ref:`normalization unit <normalization-units>`.
     - Bytes per :ref:`normalization unit <normalization-units>`.
   * - HBM Read Traffic
     - The percent of read requests generated by the L2 cache that are routed to
       the accelerator's local high-bandwidth memory (HBM). This breakdown does
       not consider the *size* of the request (meaning that 32B and 64B requests
       are both counted as a single request), so this metric only *approximates*
       the percent of the L2-Fabric Read bandwidth directed to the local HBM.
     - Percent
   * - Remote Read Traffic
     - The percent of read requests generated by the L2 cache that are routed to
       any memory location other than the accelerator's local high-bandwidth
       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
       HBM. This breakdown does not consider the *size* of the request (meaning
       that 32B and 64B requests are both counted as a single request), so this
       metric only *approximates* the percent of the L2-Fabric Read bandwidth
       directed to a remote location.
     - Percent
   * - Uncached Read Traffic
     - The percent of read requests generated by the L2 cache that are reading
       from an :ref:`uncached memory allocation <memory-type>`. Note, as
       described in the :ref:`request flow <l2-request-flow>` section, a single
       64B read request is typically counted as two uncached read requests. So,
       it is possible for the Uncached Read Traffic to reach up to 200% of the
       total number of read requests. This breakdown does not consider the
       *size* of the request (i.e., 32B and 64B requests are both counted as a
       single request), so this metric only *approximates* the percent of the
       L2-Fabric read bandwidth directed to an uncached memory location.
     - Percent
   * - L2-Fabric Write and Atomic Bandwidth
     - The total number of bytes written by the L2 over Infinity Fabric by write
       and atomic operations per
       :ref:`normalization unit <normalization-units>`. Note that on current
       CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are
       only considered *atomic* by Infinity Fabric if they are targeted at
       non-write-cacheable memory, for example,
       :ref:`fine-grained memory <memory-type>` allocations or
       :ref:`uncached memory <memory-type>` allocations on the
       MI2XX.
     - Bytes per :ref:`normalization unit <normalization-units>`.
   * - HBM Write and Atomic Traffic
     - The percent of write and atomic requests generated by the L2 cache that
       are routed to the accelerator's local high-bandwidth memory (HBM). This
       breakdown does not consider the *size* of the request (meaning that 32B
       and 64B requests are both counted as a single request), so this metric
       only *approximates* the percent of the L2-Fabric Write and Atomic
       bandwidth directed to the local HBM. Note that on current CDNA
       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
       considered *atomic* by Infinity Fabric if they are targeted at
       :ref:`fine-grained memory <memory-type>` allocations or
       :ref:`uncached memory <memory-type>` allocations.
     - Percent
   * - Remote Write and Atomic Traffic
     - The percent of read requests generated by the L2 cache that are routed to
       any memory location other than the accelerator's local high-bandwidth
       memory (HBM) -- for example, the CPU's DRAM or a remote accelerator's
       HBM. This breakdown does not consider the *size* of the request (meaning
       that 32B and 64B requests are both counted as a single request), so this
       metric only *approximates* the percent of the L2-Fabric Read bandwidth
       directed to a remote location. Note that on current CDNA
       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
       considered *atomic* by Infinity Fabric if they are targeted at
       :ref:`fine-grained memory <memory-type>` allocations or
       :ref:`uncached memory <memory-type>` allocations.
     - Percent
   * - Atomic Traffic
     - The percent of write requests generated by the L2 cache that are atomic
       requests to *any* memory location. This breakdown does not consider the
       *size* of the request (meaning that 32B and 64B requests are both counted
       as a single request), so this metric only *approximates* the percent of
       the L2-Fabric Read bandwidth directed to a remote location. Note that on
       current CDNA accelerators, such as the :ref:`MI2XX <mixxx-note>`,
       requests are only considered *atomic* by Infinity Fabric if they are
       targeted at :ref:`fine-grained memory <memory-type>` allocations or
       :ref:`uncached memory <memory-type>` allocations.
     - Percent
   * - Uncached Write and Atomic Traffic
     - The percent of write and atomic requests generated by the L2 cache that
       are targeting :ref:`uncached memory allocations <memory-type>`. This
       breakdown does not consider the *size* of the request (meaning that 32B
       and 64B requests are both counted as a single request), so this metric
       only *approximates* the percent of the L2-Fabric read bandwidth directed
       to uncached memory allocations.
     - Percent
   * - Read Latency
     - The time-averaged number of cycles read requests spent in Infinity Fabric
       before data was returned to the L2.
     - Cycles
   * - Write Latency
     - The time-averaged number of cycles write requests spent in Infinity
       Fabric before a completion acknowledgement was returned to the L2.
     - Cycles
   * - Atomic Latency
     - The time-averaged number of cycles atomic requests spent in Infinity
       Fabric before a completion acknowledgement (atomic without return value)
       or data (atomic with return value) was returned to the L2.
     - Cycles
   * - Read Stall
     - The ratio of the total number of cycles the L2-Fabric interface was
       stalled on a read request to any destination (local HBM, remote PCIe®
       connected accelerator or CPU, or remote Infinity Fabric connected
       accelerator [#inf]_ or CPU) over the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Write Stall
     - The ratio of the total number of cycles the L2-Fabric interface was
       stalled on a write or atomic request to any destination (local HBM,
       remote accelerator or CPU, PCIe connected accelerator or CPU, or remote
       Infinity Fabric connected accelerator [#inf]_ or CPU) over the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
 .. _l2-detailed-metrics:
@@ -542,121 +173,8 @@ Detailed transaction metrics
 The following metrics are available in the detailed L2-Fabric
 transaction breakdown table:
-.. list-table::
+.. jinja:: l2-detailed-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - 32B Read Requests
     - The total number of L2 requests to Infinity Fabric to read 32B of data
       from any memory location, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail. Typically unused on CDNA
       accelerators.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Uncached Read Requests
     - The total number of L2 requests to Infinity Fabric to read
       :ref:`uncached data <memory-type>` from any memory location, per
       :ref:`normalization unit <normalization-units>`. 64B requests for
       uncached data are counted as two 32B uncached data requests. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - 64B Read Requests
     - The total number of L2 requests to Infinity Fabric to read 64B of data
       from any memory location, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - HBM Read Requests
     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
       data from the accelerator's local HBM, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Remote Read Requests
     - The total number of L2 requests to Infinity Fabric to read 32B or 64B of
       data from any source other than the accelerator's local HBM, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - 32B Write and Atomic Requests
     - The total number of L2 requests to Infinity Fabric to write or atomically
       update 32B of data to any memory location, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Uncached Write and Atomic Requests
     - The total number of L2 requests to Infinity Fabric to write or atomically
       update 32B or 64B of :ref:`uncached data <memory-type>`, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - 64B Write and Atomic Requests
     - The total number of L2 requests to Infinity Fabric to write or atomically
       update 64B of data in any memory location, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - HBM Write and Atomic Requests
     - The total number of L2 requests to Infinity Fabric to write or atomically
       update 32B or 64B of data in the accelerator's local HBM, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Remote Write and Atomic Requests
     - The total number of L2 requests to Infinity Fabric to write or atomically
       update 32B or 64B of data in any memory location other than the
       accelerator's local HBM, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`.
   * - Atomic Requests
     - The total number of L2 requests to Infinity Fabric to atomically update
       32B or 64B of data in any memory location, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`l2-request-flow` for more detail. Note that on current CDNA
       accelerators, such as the :ref:`MI2XX <mixxx-note>`, requests are only
       considered *atomic* by Infinity Fabric if they are targeted at
       non-write-cacheable memory, such as
       :ref:`fine-grained memory <memory-type>` allocations or
       :ref:`uncached memory <memory-type>` allocations on the MI2XX.
     - Requests per :ref:`normalization unit <normalization-units>`.
 .. _l2-fabric-stalls:
@@ -670,72 +188,8 @@ what types of requests in a kernel caused a stall (like read versus write), and
 to which locations -- for instance, to the accelerator’s local memory, or to
 remote accelerators or CPUs.
-.. list-table::
+.. jinja:: l2-fabric-stalls
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Read - PCIe Stall
     - The number of cycles the L2-Fabric interface was stalled on read requests
       to remote PCIe connected accelerators [#inf]_ or CPUs as a percent of the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Read - Infinity Fabric Stall
     - The number of cycles the L2-Fabric interface was stalled on read requests
       to remote Infinity Fabric connected accelerators [#inf]_ or CPUs as a
       percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Read - HBM Stall
     - The number of cycles the L2-Fabric interface was stalled on read requests
       to the accelerator's local HBM as a percent of the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Write - PCIe Stall
     - The number of cycles the L2-Fabric interface was stalled on write or
       atomic requests to remote PCIe connected accelerators [#inf]_ or CPUs as
       a percent of the :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Write - Infinity Fabric Stall
     - The number of cycles the L2-Fabric interface was stalled on write or
       atomic requests to remote Infinity Fabric connected accelerators [#inf]_
       or CPUs as a percent of the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Write - HBM Stall
     - The number of cycles the L2-Fabric interface was stalled on write or
       atomic requests to accelerator's local HBM as a percent of the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
   * - Write - Credit Starvation
     - The number of cycles the L2-Fabric interface was stalled on write or
       atomic requests to any memory location because too many write/atomic
       requests were currently in flight, as a percent of the
       :ref:`total active L2 cycles <total-active-l2-cycles>`.
     - Percent
 .. warning::
@@ -21,53 +21,8 @@ LDS Speed-of-Light
 The :ref:`LDS <desc-lds>` speed-of-light chart shows a number of key metrics for
 the LDS as a comparison with the peak achievable values of those metrics.
-.. list-table::
+.. jinja:: lds-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Utilization
     - Indicates what percent of the kernel's duration the :ref:`LDS <desc-lds>`
       was actively executing instructions (including, but not limited to, load,
       store, atomic and HIP's ``__shfl`` operations).  Calculated as the ratio
       of the total number of cycles LDS was active over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - Access Rate
     - Indicates the percentage of SIMDs in the :ref:`VALU <desc-valu>` [#lds-workload]_
       actively issuing LDS instructions, averaged over the lifetime of the
       kernel. Calculated as the ratio of the total number of cycles spent by
       the :ref:`scheduler <desc-scheduler>` issuing :ref:`LDS <desc-lds>`
       instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - Theoretical Bandwidth (% of Peak)
     - Indicates the maximum amount of bytes that *could* have been loaded from,
       stored to, or atomically updated in the LDS in this kernel, as a percent
       of the peak LDS bandwidth achievable. See the
       :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
     - Percent
   * - Bank Conflict Rate
     - Indicates the percentage of active LDS cycles that were spent servicing
       bank conflicts. Calculated as the ratio of LDS cycles spent servicing
       bank conflicts over the number of LDS cycles that would have been
       required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_
     - Percent
 .. rubric:: Footnotes
@@ -90,93 +45,5 @@ Statistics
 The LDS statistics panel gives a more detailed view of the hardware:
-.. list-table::
+.. jinja:: lds-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - LDS Instructions
     - The total number of LDS instructions (including, but not limited to,
       read/write/atomics and HIP's ``__shfl`` instructions) executed per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Theoretical Bandwidth
     - Indicates the maximum amount of bytes that could have been loaded from,
       stored to, or atomically updated in the LDS per
       :ref:`normalization unit <normalization-units>`. Does *not* take into
       account the execution mask of the wavefront when the instruction was
       executed. See the
       :ref:`LDS bandwidth example <lds-bandwidth>` for more detail.
     - Bytes per :ref:`normalization unit <normalization-units>`
   * - LDS Latency
     - The average number of round-trip cycles (i.e., from issue to data-return
       / acknowledgment) required for an LDS instruction to complete.
     - Cycles
   * - Bank Conflicts/Access
     - The ratio of the number of cycles spent in the
       :ref:`LDS scheduler <desc-lds>` due to bank conflicts (as determined by
       the conflict resolution hardware) to the base number of cycles that would
       be spent in the LDS scheduler in a completely uncontended case. This is
       the unnormalized form of the Bank Conflict Rate.
     - Conflicts/Access
   * - Index Accesses
     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
       over all operations per :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Atomic Return Cycles
     - The total number of cycles spent on LDS atomics with return per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Bank Conflicts
     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
       due to bank conflicts (as determined by the conflict resolution hardware)
       per :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Address Conflicts
     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
       due to address conflicts (as determined by the conflict resolution
       hardware) per :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Unaligned Stall
     - The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
       due to stalls from non-dword aligned addresses per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Memory Violations
     - The total number of out-of-bounds accesses made to the LDS, per
       :ref:`normalization unit <normalization-units>`. This is unused and
       expected to be zero in most configurations for modern CDNA™ accelerators.
     - Accesses per :ref:`normalization unit <normalization-units>`
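The Bank Conflicts/Access ratio described in the table can be sketched in a few lines. This is a minimal illustration, not the profiler's implementation: the function name ``bank_conflicts_per_access`` and both inputs are hypothetical, and the caller is assumed to already know the base (uncontended) cycle count.

```python
def bank_conflicts_per_access(conflict_cycles: float, uncontended_cycles: float) -> float:
    """Ratio of LDS scheduler cycles lost to bank conflicts over the
    cycles the same accesses would need with no contention.

    Both arguments are hypothetical inputs; how the uncontended cycle
    count is derived from hardware counters is not covered here.
    """
    if uncontended_cycles <= 0:
        raise ValueError("uncontended_cycles must be positive")
    return conflict_cycles / uncontended_cycles


# Example: 300 conflict cycles against a 100-cycle uncontended baseline.
ratio = bank_conflicts_per_access(300, 100)
```

A ratio of zero means the accesses were conflict free; larger values mean more scheduler cycles were spent resolving conflicts per uncontended cycle.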
@@ -23,97 +23,8 @@ Wavefront launch stats
 The wavefront launch stats panel gives general information about the
 kernel launch:
-.. list-table::
+.. jinja:: wavefront-launch-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 20 65 15
   * - Metric
     - Description
     - Unit
   * - Grid Size
     - The total number of work-items (or, threads) launched as a part of
       the kernel dispatch.  In HIP, this is equivalent to the total grid size
       multiplied by the total workgroup (or, block) size.
     - :ref:`Work-items <desc-work-item>`
   * - Workgroup Size
     - The total number of work-items (or, threads) in each workgroup
       (or, block) launched as part of the kernel dispatch.  In HIP, this is
       equivalent to the total block size.
     - :ref:`Work-items <desc-work-item>`
   * - Total Wavefronts
     - The total number of wavefronts launched as part of the kernel dispatch.
       On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is
       always 64 work-items.  Thus, the total number of wavefronts should be
       equivalent to the ceiling of grid size divided by 64.
     - :ref:`Wavefronts <desc-wavefront>`
   * - Saved Wavefronts
     - The total number of wavefronts saved at a context-save. See
       `cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
     - :ref:`Wavefronts <desc-wavefront>`
   * - Restored Wavefronts
     - The total number of wavefronts restored from a context-save. See
       `cwsr_enable <https://docs.kernel.org/gpu/amdgpu/module-parameters.html?highlight=cwsr>`_.
     - :ref:`Wavefronts <desc-wavefront>`
   * - VGPRs
     - The number of architected vector general-purpose registers allocated for
       the kernel, see :ref:`VALU <desc-valu>`.  Note: this may not exactly
       match the number of VGPRs requested by the compiler due to allocation
       granularity.
     - :ref:`VGPRs <desc-valu>`
   * - AGPRs
     - The number of accumulation vector general-purpose registers allocated for
       the kernel, see :ref:`AGPRs <desc-agprs>`.  Note: this may not exactly
       match the number of AGPRs requested by the compiler due to allocation
       granularity.
     - :ref:`AGPRs <desc-agprs>`
   * - SGPRs
     - The number of scalar general-purpose registers allocated for the kernel,
       see :ref:`SALU <desc-salu>`.  Note: this may not exactly match the number
       of SGPRs requested by the compiler due to allocation granularity.
     - :ref:`SGPRs <desc-salu>`
   * - LDS Allocation
     - The number of bytes of :doc:`LDS <local-data-share>` memory (or, shared
       memory) allocated for this kernel.  Note: This may also be larger than
       what was requested at compile time due to both allocation granularity and
       dynamic per-dispatch LDS allocations.
     - Bytes per :ref:`workgroup <desc-workgroup>`
   * - Scratch Allocation
     - The number of bytes of :ref:`scratch memory <memory-spaces>` requested
       per work-item for this kernel. Scratch memory is used for stack memory
       on the accelerator, as well as for register spills and restores.
     - Bytes per :ref:`work-item <desc-work-item>`
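The Total Wavefronts relationship stated in the table (the ceiling of the grid size divided by 64) can be checked with a short calculation. This is a sketch that assumes the 64 work-item wavefront size given above; the function name ``total_wavefronts`` is hypothetical and not part of ROCm Compute Profiler.

```python
def total_wavefronts(grid_size: int, wavefront_size: int = 64) -> int:
    """Ceiling of grid_size / wavefront_size using integer arithmetic.

    grid_size is the total number of work-items in the dispatch, as
    reported by the Grid Size metric.
    """
    if grid_size < 0 or wavefront_size <= 0:
        raise ValueError("grid_size must be >= 0 and wavefront_size > 0")
    # -(-a // b) is ceiling division for non-negative a and positive b.
    return -(-grid_size // wavefront_size)


# A 1000 work-item grid does not divide evenly by 64, so the final
# wavefront is only partially filled.
waves = total_wavefronts(1000)
```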
 .. _wavefront-runtime-stats:
@@ -123,96 +34,8 @@ Wavefront runtime stats
 The wavefront runtime statistics panel gives a high-level overview of the
 execution of wavefronts in a kernel:
-.. list-table::
+.. jinja:: wavefront-runtime-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 18 65 17
   * - Metric
     - Description
     - Unit
   * - :ref:`Kernel time <kernel-time>`
     - The total duration of the executed kernel. Note: this should not be
       directly compared to the wavefront cycles / timings below.
     - Nanoseconds
   * - :ref:`Kernel cycles <kernel-cycles>`
     - The total duration of the executed kernel in cycles. Note: this should
       not be directly compared to the wavefront cycles / timings below.
     - Cycles
   * - Instructions per wavefront
     - The average number of instructions (of all types) executed per wavefront.
       This is averaged over all wavefronts in a kernel dispatch.
     - Instructions / wavefront
   * - Wave cycles
     - The number of cycles a wavefront in the kernel dispatch spent resident on
       a compute unit per :ref:`normalization unit <normalization-units>`. This
       is averaged over all wavefronts in a kernel dispatch.  Note: this should
       not be directly compared to the kernel cycles above.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Dependency wait cycles
     - The number of cycles a wavefront in the kernel dispatch stalled waiting
       on memory of any kind (e.g., instruction fetch, vector or scalar memory,
       etc.) per :ref:`normalization unit <normalization-units>`. This counter
       is incremented at every cycle by *all* wavefronts on a CU stalled at a
       memory operation.  As such, it is most useful to get a sense of how waves
       were spending their time, rather than identification of a precise limiter
       because another wave could be actively executing while a wave is stalled.
       The sum of this metric, Issue Wait Cycles and Active Cycles should be
       equal to the total Wave Cycles metric.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Issue Wait Cycles
     - The number of cycles a wavefront in the kernel dispatch was unable to
       issue an instruction for any reason (e.g., execution pipe back-pressure,
       arbitration loss, etc.) per
       :ref:`normalization unit <normalization-units>`.  This counter is
       incremented at every cycle by *all* wavefronts on a CU unable to issue an
       instruction.  As such, it is most useful to get a sense of how waves were
       spending their time, rather than identification of a precise limiter
       because another wave could be actively executing while a wave is issue
       stalled.  The sum of this metric, Dependency Wait Cycles and Active
       Cycles should be equal to the total Wave Cycles metric.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Active Cycles
     - The average number of cycles a wavefront in the kernel dispatch was
       actively executing instructions per
       :ref:`normalization unit <normalization-units>`. This measurement is made
       on a per-wavefront basis, and may include cycles that another wavefront
       spent actively executing (on another execution unit, for example) or was
       stalled.  As such, it is most useful to get a sense of how waves were
       spending their time, rather than identification of a precise limiter. The
       sum of this metric, Issue Wait Cycles and Dependency Wait Cycles should be
       equal to the total Wave Cycles metric.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Wavefront Occupancy
     - The time-averaged number of wavefronts resident on the accelerator over
       the lifetime of the kernel. Note: this metric may be inaccurate for
       short-running kernels (less than 1ms).
     - :ref:`Wavefronts <desc-wavefront>`
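The table states that Dependency Wait Cycles, Issue Wait Cycles and Active Cycles should sum to Wave Cycles. The sketch below uses that relationship to express each component as a share of the total. The function name and the sample values are hypothetical and do not come from a real profile.

```python
def wave_cycle_breakdown(dependency_wait: float, issue_wait: float, active: float) -> dict:
    """Return each component as a percent of their sum (the Wave Cycles total)."""
    total = dependency_wait + issue_wait + active
    if total <= 0:
        raise ValueError("at least one component must be positive")
    return {
        "Dependency Wait Cycles": 100.0 * dependency_wait / total,
        "Issue Wait Cycles": 100.0 * issue_wait / total,
        "Active Cycles": 100.0 * active / total,
    }


# Made-up cycle counts for illustration only.
shares = wave_cycle_breakdown(dependency_wait=500, issue_wait=250, active=250)
```

As the descriptions note, these shares show how waves spent their time and do not identify a precise limiter, because other waves may be executing while one wave is stalled.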
 .. note::
@@ -256,71 +79,8 @@ This panel shows the total number of each type of instruction issued to
 the :doc:`various compute pipelines </conceptual/pipeline-descriptions>` on the
 :doc:`CU </conceptual/compute-unit>`. These are:
-.. list-table::
+.. jinja:: instruction-mix
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - :ref:`VALU <desc-valu>` instructions
     - The total number of vector arithmetic logic unit (VALU) operations
       issued. These are the workhorses of the
       :doc:`compute unit <compute-unit>`, and are used to execute a wide range of
       instruction types including floating point operations, non-uniform
       address calculations, transcendental operations, integer operations,
       shifts, conditional evaluation, etc.
     - Instructions
   * - VMEM instructions
     - The total number of vector memory operations issued. These include most
       loads, stores and atomic operations and all accesses to
       :ref:`generic, global, private and texture <memory-spaces>` memory.
     - Instructions
   * - :doc:`LDS <local-data-share>` instructions
     - The total number of LDS (also known as shared memory) operations issued.
       These include loads, stores, atomics, and HIP's ``__shfl`` operations.
     - Instructions
   * - :ref:`MFMA <desc-mfma>` instructions
     - The total number of matrix fused multiply-add instructions issued.
     - Instructions
   * - :ref:`SALU <desc-salu>` instructions
     - The total number of scalar arithmetic logic unit (SALU) operations
       issued. Typically these are used for address calculations, literal
       constants, and other operations that are *provably* uniform across a
       wavefront. Although scalar memory (SMEM) operations are issued by the
       SALU, they are counted separately in this section.
     - Instructions
   * - SMEM instructions
     - The total number of scalar memory (SMEM) operations issued. These are
       typically used for loading kernel arguments, base-pointers and loads
       from HIP's ``__constant__`` memory.
     - Instructions
   * - :ref:`Branch <desc-branch>` instructions
     - The total number of branch operations issued. These typically consist of
       jump or branch operations and are used to implement control flow.
     - Instructions
 .. note::
@@ -345,133 +105,8 @@ include :ref:`MFMA <desc-mfma>` instructions using the same precision; for
 instance, the “F16-ADD” metric does not include any 16-bit floating point
 additions executed as part of an MFMA instruction using the same precision.
-.. list-table::
+.. jinja:: valu-arith-instruction-mix
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 15 65 20
   * - Metric
     - Description
     - Unit
   * - INT32
     - The total number of instructions operating on 32-bit integer operands
       issued to the VALU per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - INT64
     - The total number of instructions operating on 64-bit integer operands
       issued to the VALU per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F16-ADD
     - The total number of addition instructions operating on 16-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F16-MUL
     - The total number of multiplication instructions operating on 16-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F16-FMA
     - The total number of fused multiply-add instructions operating on 16-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F16-TRANS
     - The total number of transcendental instructions (such as ``sqrt``)
       operating on 16-bit floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F32-ADD
     - The total number of addition instructions operating on 32-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F32-MUL
     - The total number of multiplication instructions operating on 32-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F32-FMA
     - The total number of fused multiply-add instructions operating on 32-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F32-TRANS
     - The total number of transcendental instructions (such as ``sqrt``)
       operating on 32-bit floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F64-ADD
     - The total number of addition instructions operating on 64-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F64-MUL
     - The total number of multiplication instructions operating on 64-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F64-FMA
     - The total number of fused multiply-add instructions operating on 64-bit
       floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - F64-TRANS
     - The total number of transcendental instructions (such as ``sqrt``)
       operating on 64-bit floating-point operands issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Conversion
     - The total number of type conversion instructions (such as converting data
       to or from F32↔F64) issued to the VALU per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
 For an example of these counters in action, refer to
 :ref:`valu-arith-instruction-mix-ex`.
@@ -502,57 +137,8 @@ This section details the types of Matrix Fused Multiply-Add
 MFMA instructions are classified by the type of input data they operate on, and
 *not* the data type the result is accumulated to.
-.. list-table::
+.. jinja:: mfma-instruction-mix
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 25 60 17
   * - Metric
     - Description
     - Unit
   * - MFMA-I8 Instructions
     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions
       issued per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - MFMA-F8 Instructions
     - The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
       instructions issued per :ref:`normalization unit <normalization-units>`.
       This is supported only on AMD Instinct MI300 series and later.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - MFMA-F16 Instructions
     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
       instructions issued per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - MFMA-BF16 Instructions
     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
       instructions issued per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - MFMA-F32 Instructions
     - The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`
       instructions issued per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - MFMA-F64 Instructions
     - The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`
       instructions issued per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
 Compute pipeline
 ================
@@ -612,84 +198,8 @@ various precisions. We note that unlike the
 are reported as FLOPs and IOPs, that is, the total number of operations
 executed.
-.. list-table::
+.. jinja:: compute-speed-of-light
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - VALU FLOPs
     - The total floating-point operations executed per second on the
       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
       theoretical FLOPs achievable on the specific accelerator. Note: this does
       not include any floating-point operations from :ref:`MFMA <desc-mfma>`
       instructions.
     - GFLOPs
   * - VALU IOPs
     - The total integer operations executed per second on the
       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
       theoretical IOPs achievable on the specific accelerator. Note: this does
       not include any integer operations from :ref:`MFMA <desc-mfma>`
       instructions.
     - GIOPs
   * - MFMA FLOPs (BF16)
     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 16-bit
       brain floating point operations from :ref:`VALU <desc-valu>`
       instructions. This is also presented as a percent of the peak theoretical
       BF16 MFMA operations achievable on the specific accelerator.
     - GFLOPs
   * - MFMA FLOPs (F16)
     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 16-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F16 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - MFMA FLOPs (F32)
     - The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 32-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F32 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - MFMA FLOPs (F64)
     - The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 64-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F64 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - MFMA IOPs (INT8)
     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
       executed per second. Note: this does not include any 8-bit integer
       operations from :ref:`VALU <desc-valu>` instructions. This is also
       presented as a percent of the peak theoretical INT8 MFMA operations
       achievable on the specific accelerator.
     - GIOPs
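The speed-of-light metrics above are rates (operations per second) that are also shown as a percent of a theoretical peak. The sketch below shows that arithmetic under stated assumptions: the function names, the operation count, the kernel duration and the peak value are all hypothetical, and real peak values depend on the specific accelerator.

```python
def achieved_gflops(total_flop: float, duration_ns: float) -> float:
    """Convert an operation count and a duration in nanoseconds to GFLOP/s.

    FLOP per nanosecond equals GFLOP per second, so no extra scaling
    factor is required.
    """
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    return total_flop / duration_ns


def percent_of_peak(achieved: float, peak: float) -> float:
    """Express an achieved rate as a percent of a theoretical peak rate."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * achieved / peak


# Made-up example: 2e9 FLOP executed in 1 ms against an 8000 GFLOP/s peak.
rate = achieved_gflops(2_000_000_000, 1_000_000)
share = percent_of_peak(rate, 8000.0)
```

The same two steps apply to the integer (GIOPs) rows, using an operation count and peak that match the data type.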
 .. _pipeline-stats:
@@ -702,120 +212,8 @@ various execution units on the :doc:`CU <compute-unit>`. Refer to
 the :ref:`scheduler <desc-scheduler>` for a high-level overview of execution
 units and instruction issue.
-.. list-table::
+.. jinja:: pipeline-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 20 65 15
   * - Metric
     - Description
     - Unit
   * - IPC
     - The ratio of the total number of instructions executed on the
       :doc:`CU <compute-unit>` over the
       :ref:`total active CU cycles <total-active-cu-cycles>`.
     - Instructions per-cycle
   * - IPC (Issued)
     - The ratio of the total number of
       (non-:ref:`internal <ipc-internal-instructions>`) instructions issued over
       the number of cycles where the :ref:`scheduler <desc-scheduler>` was
       actively working on issuing instructions. Refer to the
       :ref:`Issued IPC <issued-ipc>` example for further detail.
     - Instructions per-cycle
   * - SALU utilization
     - Indicates what percent of the kernel's duration the
       :ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
       ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing SALU / :ref:`SMEM <desc-smem>`
       instructions over the :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - VALU utilization
     - Indicates what percent of the kernel's duration the
       :ref:`VALU <desc-valu>` was busy executing instructions. Does not include
       :ref:`VMEM <desc-vmem>` operations. Computed as the ratio of the total
       number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
       VALU instructions over the :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - VMEM utilization
     - Indicates what percent of the kernel's duration the
       :ref:`VMEM <desc-vmem>` unit was busy executing instructions, including
       both global/generic and spill/scratch operations (see the
       :ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
       detail).  Does not include :ref:`VALU <desc-valu>` operations. Computed
       as the ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - Branch utilization
     - Indicates what percent of the kernel's duration the
       :ref:`branch <desc-branch>` unit was busy executing instructions.
       Computed as the ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing branch instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - VALU active threads
     - Indicates the average level of :ref:`divergence <desc-divergence>` within
       a wavefront over the lifetime of the kernel. The number of work-items
       that were active in a wavefront during execution of each
       :ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
       instructions run on all wavefronts in the kernel.
     - Work-items
   * - MFMA utilization
     - Indicates what percent of the kernel's duration the
       :ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
       the ratio of the total number of cycles the
       :ref:`MFMA <desc-mfma>` unit was busy over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - MFMA instruction cycles
     - The average duration of :ref:`MFMA <desc-mfma>` instructions in this
       kernel in cycles. Computed as the ratio of the total number of cycles the
       MFMA unit was busy over the total number of MFMA instructions. Compare
       to, for example, the
       `AMD Matrix Instruction Calculator <https://github.com/RadeonOpenCompute/amd_matrix_instruction_calculator>`_.
     - Cycles per instruction
   * - VMEM latency
     - The average number of round-trip cycles (that is, from issue to data
       return / acknowledgment) required for a VMEM instruction to complete.
     - Cycles
   * - SMEM latency
     - The average number of round-trip cycles (that is, from issue to data
       return / acknowledgment) required for a SMEM instruction to complete.
     - Cycles
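Several pipeline statistics in the table are simple ratios of cycle and instruction counts. The following sketch restates three of them in code. The function names and the sample counts are hypothetical; the profiler derives the real values from hardware counters that are not shown here.

```python
def ipc(instructions: float, active_cu_cycles: float) -> float:
    """Instructions executed per active CU cycle."""
    if active_cu_cycles <= 0:
        raise ValueError("active_cu_cycles must be positive")
    return instructions / active_cu_cycles


def utilization_percent(busy_cycles: float, total_cu_cycles: float) -> float:
    """Percent of total CU cycles in which a given unit was busy."""
    if total_cu_cycles <= 0:
        raise ValueError("total_cu_cycles must be positive")
    return 100.0 * busy_cycles / total_cu_cycles


def mfma_instruction_cycles(mfma_busy_cycles: float, mfma_instructions: float) -> float:
    """Average MFMA instruction duration in cycles per instruction."""
    if mfma_instructions <= 0:
        raise ValueError("mfma_instructions must be positive")
    return mfma_busy_cycles / mfma_instructions


# Made-up counts for illustration only.
example_ipc = ipc(2000, 1000)
example_util = utilization_percent(250, 1000)
example_mfma = mfma_instruction_cycles(6400, 100)
```

Note that IPC divides by active CU cycles while the utilization metrics divide by total CU cycles, so the two denominators are not interchangeable.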
 .. note::
@@ -846,70 +244,5 @@ not. For more detail on how operations are counted see the
   take into account the execution mask of the operation, and will report the
   same value even if EXEC is identically zero.
-.. list-table::
+.. jinja:: arithmetic-operations
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 18 65 17
   * - Metric
     - Description
     - Unit
   * - FLOPs (Total)
     - The total number of floating-point operations executed on either the
       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`.
     - FLOP per :ref:`normalization unit <normalization-units>`
   * - IOPs (Total)
     - The total number of integer operations executed on either the
       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`.
     - IOP per :ref:`normalization unit <normalization-units>`
   * - F16 OPs
     - The total number of 16-bit floating-point operations executed on either the
       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`.
     - FLOP per :ref:`normalization unit <normalization-units>`
   * - BF16 OPs
     - The total number of 16-bit brain floating-point operations executed on either the
       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`. Note: on current CDNA
       accelerators, the VALU has no native BF16 instructions.
     - FLOP per :ref:`normalization unit <normalization-units>`
   * - F32 OPs
     - The total number of 32-bit floating-point operations executed on either
       the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`.
     - FLOP per :ref:`normalization unit <normalization-units>`
   * - F64 OPs
     - The total number of 64-bit floating-point operations executed on either
       the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`.
     - FLOP per :ref:`normalization unit <normalization-units>`
   * - INT8 OPs
     - The total number of 8-bit integer operations executed on either the
       :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
       :ref:`normalization unit <normalization-units>`. Note: on current CDNA
       accelerators, the VALU has no native INT8 instructions.
     - IOP per :ref:`normalization unit <normalization-units>`
@@ -71,40 +71,8 @@ Scalar L1D Speed-of-Light
 The Scalar L1D speed-of-light chart shows some key metrics of the sL1D
 cache as a comparison with the peak achievable values of those metrics:
-.. list-table::
+.. jinja:: desc-sl1d-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 20 65 15
   * - Metric
     - Description
     - Unit
   * - Bandwidth
     - The number of bytes looked up in the sL1D cache, as a percent of the peak
       theoretical bandwidth. Calculated as the ratio of sL1D requests over the
       :ref:`total sL1D cycles <total-sl1d-cycles>`.
     - Percent
   * - Cache Hit Rate
     - The percent of sL1D requests that hit [#sl1d-cache]_ on a previously
       loaded line in the cache. Calculated as the ratio of the number of sL1D
       requests that hit over the number of all sL1D requests.
     - Percent
   * - sL1D-L2 BW
     - The number of bytes requested by the sL1D from the L2 cache, as a percent
       of the peak theoretical sL1D → L2 cache bandwidth.  Calculated as the
       ratio of the total number of requests from the sL1D to the L2 cache over
       the :ref:`total sL1D-L2 interface cycles <total-sl1d-cycles>`.
     - Percent
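The Cache Hit Rate row is defined as the number of sL1D requests that hit over the number of all sL1D requests. A minimal sketch of that ratio follows. The function name is hypothetical, and returning 0.0 when there are no requests is an assumption made for this example, not documented profiler behavior.

```python
def sl1d_hit_rate_percent(hits: int, requests: int) -> float:
    """Percent of sL1D requests that hit on a previously loaded line."""
    if hits < 0 or requests < 0 or hits > requests:
        raise ValueError("expected 0 <= hits <= requests")
    if requests == 0:
        # Assumption for this sketch: report 0.0 when nothing was requested.
        return 0.0
    return 100.0 * hits / requests


# Made-up counts: 75 of 100 requests hit in the cache.
hit_rate = sl1d_hit_rate_percent(75, 100)
```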
 .. _desc-sl1d-stats:
@@ -114,104 +82,8 @@ Scalar L1D cache accesses
 This panel gives more detail on the types of accesses made to the sL1D,
 and the hit/miss statistics.
-.. list-table::
+.. jinja:: desc-sl1d-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Requests
     - The total number of requests, of any size or type, made to the sL1D per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Hits
     - The total number of sL1D requests that hit on a previously loaded cache
       line, per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Misses - Non Duplicated
     - The total number of sL1D requests that missed on a cache line that *was
       not* already pending due to another request, per
       :ref:`normalization unit <normalization-units>`. See :ref:`desc-sl1d-sol`
       for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Misses - Duplicated
     - The total number of sL1D requests that missed on a cache line that *was*
       already pending due to another request, per
       :ref:`normalization unit <normalization-units>`. See
       :ref:`desc-sl1d-sol` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Cache Hit Rate
     - Indicates the percent of sL1D requests that hit on a previously loaded
       line in the cache. The ratio of the number of sL1D requests that hit
       [#sl1d-cache]_ over the number of all sL1D requests.
     - Percent
   * - Read Requests (Total)
     - The total number of sL1D read requests of any size, per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Atomic Requests
     - The total number of sL1D atomic requests of any size, per
       :ref:`normalization unit <normalization-units>`. Typically unused on CDNA
       accelerators.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Read Requests (1 DWord)
     - The total number of sL1D read requests made for a single dword of data
       (4B), per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Read Requests (2 DWord)
     - The total number of sL1D read requests made for two dwords of data
       (8B), per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Read Requests (4 DWord)
     - The total number of sL1D read requests made for four dwords of data
       (16B), per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Read Requests (8 DWord)
     - The total number of sL1D read requests made for eight dwords of data
       (32B), per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Read Requests (16 DWord)
     - The total number of sL1D read requests made for sixteen dwords of data
       (64B), per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
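The per-size read request rows map directly to byte counts, because a dword is 4 bytes (1 dword = 4B through 16 dwords = 64B, as listed above). The sketch below totals the bytes requested from a histogram of request sizes. The function name and the histogram values are hypothetical.

```python
BYTES_PER_DWORD = 4
VALID_DWORD_SIZES = (1, 2, 4, 8, 16)


def sl1d_read_bytes(requests_by_dwords: dict) -> int:
    """Total bytes requested, given {dwords_per_request: request_count}."""
    total = 0
    for dwords, count in requests_by_dwords.items():
        if dwords not in VALID_DWORD_SIZES:
            raise ValueError(f"unexpected request size: {dwords} dwords")
        total += dwords * BYTES_PER_DWORD * count
    return total


# Made-up histogram: ten 1-dword, five 2-dword and one 4-dword requests.
total_bytes = sl1d_read_bytes({1: 10, 2: 5, 4: 1})
```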
 .. _desc-sl1d-l2-interface:
@@ -222,56 +94,8 @@ This panel gives more detail on the data requested across the
 sL1D↔
 :doc:`L2 <l2-cache>` interface.
-.. list-table::
+.. jinja:: desc-sl1d-l2-interface
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - sL1D-L2 BW
     - The total number of bytes read from, written to, or atomically updated
       across the sL1D↔:doc:`L2 <l2-cache>` interface, per
       :ref:`normalization unit <normalization-units>`. Note that sL1D writes
       and atomics are typically unused on current CDNA accelerators, so in the
       majority of cases this can be interpreted as an sL1D→L2 read bandwidth.
     - Bytes per :ref:`normalization unit <normalization-units>`
   * - Read Requests
     - The total number of read requests from sL1D to the :doc:`L2 <l2-cache>`,
       per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Write Requests
     - The total number of write requests from sL1D to the :doc:`L2 <l2-cache>`,
       per :ref:`normalization unit <normalization-units>`. Typically unused on
       current CDNA accelerators.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Atomic Requests
     - The total number of atomic requests from sL1D to the
       :doc:`L2 <l2-cache>`, per
       :ref:`normalization unit <normalization-units>`. Typically unused on
       current CDNA accelerators.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Stall Cycles
     - The total number of cycles the sL1D↔
       :doc:`L2 <l2-cache>` interface was stalled, per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
 .. rubric:: Footnotes
@@ -318,46 +142,8 @@ The L1 Instruction Cache speed-of-light chart shows some key metrics of
 the L1I cache as a comparison with the peak achievable values of those
 metrics:
-.. list-table::
+.. jinja:: desc-l1i-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Bandwidth
     - The number of bytes looked up in the L1I cache, as a percent of the peak
       theoretical bandwidth. Calculated as the ratio of L1I requests over the
       :ref:`total L1I cycles <total-l1i-cycles>`.
     - Percent
   * - Cache Hit Rate
     - The percent of L1I requests that hit on a previously loaded line in the
       cache. Calculated as the ratio of the number of L1I requests that hit
       [#l1i-cache]_ over the number of all L1I requests.
     - Percent
   * - L1I-L2 BW
     - The percent of the peak theoretical L1I → L2 cache request bandwidth
       achieved. Calculated as the ratio of the total number of requests from
       the L1I to the L2 cache over the
       :ref:`total L1I-L2 interface cycles <total-l1i-cycles>`.
     - Percent
   * - Instruction Fetch Latency
     - The average number of cycles spent to fetch instructions to a
       :doc:`CU <compute-unit>`.
     - Cycles
 .. _desc-l1i-stats:
@@ -366,54 +152,10 @@ L1I cache accesses
 This panel gives more detail on the hit/miss statistics of the L1I:
-.. list-table::
+.. jinja:: desc-l1i-stats
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
-   * - Metric
+.. _desc-l1i-l2-interface:
     - Description
     - Unit
   * - Requests
     - The total number of requests made to the L1I, per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Hits
     - The total number of L1I requests that hit on a previously loaded cache
       line, per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Misses - Non Duplicated
     - The total number of L1I requests that missed on a cache line that
       *was not* already pending due to another request, per
       :ref:`normalization unit <normalization-units>`. See note in
       :ref:`desc-l1i-sol` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Misses - Duplicated
     - The total number of L1I requests that missed on a cache line that *was*
       already pending due to another request, per
       :ref:`normalization unit <normalization-units>`. See note in
       :ref:`desc-l1i-sol` for more detail.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Cache Hit Rate
     - The percent of L1I requests that hit [#l1i-cache]_ on a previously loaded
       line in the cache. Calculated as the ratio of the number of L1I requests
       that hit over the number of all L1I requests.
     - Percent
 L1I - L2 interface
 ------------------
@@ -421,21 +163,8 @@ L1I - L2 interface
 This panel gives more detail on the data requested across the
 L1I-:doc:`L2 <l2-cache>` interface.
-.. list-table::
+.. jinja:: desc-l1i-l2-interface
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - L1I-L2 BW
     - The total number of bytes read across the L1I-:doc:`L2 <l2-cache>`
       interface, per :ref:`normalization unit <normalization-units>`.
     - Bytes per :ref:`normalization unit <normalization-units>`
 .. rubric:: Footnotes
@@ -493,90 +222,18 @@ issuing concurrently).
   kernels). This means that these scheduler-pipe utilization metrics are
   expected to reach (for example) a maximum of one pipe active -- only 25%.
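The 25% ceiling mentioned in the note can be illustrated with a small calculation. This is a sketch under two stated assumptions: that there are four scheduler-pipes (implied by the 25% figure, not stated here), and that total scheduler-pipe cycles equal kernel cycles multiplied by the pipe count. The function name is hypothetical.

```python
# Hypothetical illustration; the pipe count of 4 is an assumption implied
# by the 25% ceiling described in the note above.
def scheduler_pipe_utilization(
    busy_pipe_cycles: int, kernel_cycles: int, num_pipes: int = 4
) -> float:
    """Percent of total scheduler-pipe cycles in which a pipe was busy."""
    total_pipe_cycles = kernel_cycles * num_pipes
    if total_pipe_cycles == 0:
        return 0.0
    return 100.0 * busy_pipe_cycles / total_pipe_cycles


if __name__ == "__main__":
    # One pipe busy for the whole kernel, the other pipes idle.
    print(scheduler_pipe_utilization(busy_pipe_cycles=1000, kernel_cycles=1000))
```

With a single pipe busy for every cycle of the kernel, the busy cycles make up one quarter of the total pipe cycles, which is why a fully occupied single pipe still reads as 25%.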
 .. _spi-util:
 Workgroup manager utilizations
 ------------------------------
 This section describes the utilization of the workgroup manager, and the
 hardware components it interacts with.
-.. list-table::
+.. jinja:: spi-util
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   :widths: 20 65 15
-   * - Metric
+.. _spi-resc-util:
     - Description
     - Unit
   * - Accelerator utilization
     - The percent of cycles in the kernel where the accelerator was actively
       doing any work.
     - Percent
   * - Scheduler-pipe utilization
     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
       the kernel where the scheduler-pipes were actively doing any work. Note:
       this value is expected to range between 0% and 25%. See :ref:`desc-spi`.
     - Percent
   * - Workgroup manager utilization
     - The percent of cycles in the kernel where the workgroup manager was
       actively doing any work.
     - Percent
   * - Shader engine utilization
     - The percent of :ref:`total shader engine cycles <total-se-cycles>` in the
       kernel where any CU in a shader-engine was actively doing any work,
       normalized over all shader-engines. Low values (much less than 100%)
       indicate that the accelerator was not fully saturated by the kernel, or
       a potential load-imbalance issue.
     - Percent
   * - SIMD utilization
     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
       where any :ref:`SIMD <desc-valu>` on a CU was actively doing any work,
       summed over all CUs. Low values (less than 100%) indicate that the
       accelerator was not fully saturated by the kernel, or a potential
       load-imbalance issue.
     - Percent
   * - Dispatched workgroups
     - The total number of workgroups forming this kernel launch.
     - Workgroups
   * - Dispatched wavefronts
     - The total number of wavefronts, summed over all workgroups, forming this
       kernel launch.
     - Wavefronts
   * - VGPR writes
     - The average number of cycles spent initializing :ref:`VGPRs <desc-valu>`
       at wave creation.
     - Cycles/wave
   * - SGPR Writes
     - The average number of cycles spent initializing :ref:`SGPRs <desc-salu>`
       at wave creation.
     - Cycles/wave
 Resource allocation
 -------------------
@@ -590,117 +247,5 @@ limited by LDS usage, for example, but may still achieve high occupancy levels
 such that improving occupancy further may not improve performance. See
 :ref:`occupancy-example` for details.
-.. list-table::
+.. jinja:: spi-resc-util
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Not-scheduled rate (Workgroup Manager)
     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
       the kernel where a workgroup could not be scheduled to a
       :doc:`CU <compute-unit>` due to a bottleneck within the workgroup manager
       rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
       resources. Note: this value is expected to range between 0-25%. See note
       in :ref:`workgroup manager <desc-spi>` description.
     - Percent
   * - Not-scheduled rate (Scheduler-Pipe)
     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
       the kernel where a workgroup could not be scheduled to a
       :doc:`CU <compute-unit>` due to a bottleneck within the scheduler-pipes
       rather than a lack of a CU or :ref:`SIMD <desc-valu>` with sufficient
       resources. Note: this value is expected to range between 0-25%. See note
       in :ref:`workgroup manager <desc-spi>` description.
     - Percent
   * - Scheduler-Pipe Stall Rate
     - The percent of :ref:`total scheduler-pipe cycles <total-pipe-cycles>` in
       the kernel where a workgroup could not be scheduled to a
       :doc:`CU <compute-unit>` due to occupancy limitations (like a lack of a
       CU or :ref:`SIMD <desc-valu>` with sufficient resources). Note: this
       value is expected to range between 0-25%. See note in
       :ref:`workgroup manager <desc-spi>` description.
     - Percent
   * - Scratch Stall Rate
     - The percent of :ref:`total shader-engine cycles <total-se-cycles>` in the
       kernel where a workgroup could not be scheduled to a
       :doc:`CU <compute-unit>` due to lack of
       :ref:`private (a.k.a., scratch) memory <memory-type>` slots. While this
       can reach up to 100%, note that the actual occupancy limitations on a
       kernel using private memory are typically quite small (for example, less
       than 1% of the total number of waves that can be scheduled to an
       accelerator).
     - Percent
   * - Insufficient SIMD Waveslots
     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
       due to lack of available :ref:`waveslots <desc-valu>`.
     - Percent
   * - Insufficient SIMD VGPRs
     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
       due to lack of available :ref:`VGPRs <desc-valu>`.
     - Percent
   * - Insufficient SIMD SGPRs
     - The percent of :ref:`total SIMD cycles <total-simd-cycles>` in the kernel
       where a workgroup could not be scheduled to a :ref:`SIMD <desc-valu>`
       due to lack of available :ref:`SGPRs <desc-salu>`.
     - Percent
   * - Insufficient CU LDS
     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
       due to lack of available :doc:`LDS <local-data-share>`.
     - Percent
   * - Insufficient CU Barriers
     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
       due to lack of available :ref:`barriers <desc-barrier>`.
     - Percent
   * - Reached CU Workgroup Limit
     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
       where a workgroup could not be scheduled to a :doc:`CU <compute-unit>`
       due to limits within the workgroup manager. This is expected to always
       be zero on CDNA2 or newer accelerators (and small for previous
       accelerators).
     - Percent
   * - Reached CU Wavefront Limit
     - The percent of :ref:`total CU cycles <total-cu-cycles>` in the kernel
       where a wavefront could not be scheduled to a :doc:`CU <compute-unit>`
       due to limits within the workgroup manager. This is expected to always
       be zero on CDNA2 or newer accelerators (and small for previous
       accelerators).
     - Percent
@@ -2,6 +2,8 @@
   :description: ROCm Compute Profiler performance model: System Speed-of-Light
   :keywords: Omniperf, ROCm Compute Profiler, ROCm, profiler, tool, Instinct, accelerator, AMD, system, speed of light
 .. _sys-sol:
 *********************
 System Speed-of-Light
 *********************
@@ -20,308 +22,5 @@ of ROCm Compute Profiler’s profiling report.
   Instinct™ MI-series accelerators. For more detail on how operations are
   counted, see the :ref:`metrics-flop-count` section.
-.. list-table::
+.. jinja:: sys-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - :ref:`VALU <desc-valu>` FLOPs
     - The total floating-point operations executed per second on the
       :ref:`VALU <desc-valu>`.  This is also presented as a percent of the peak
       theoretical FLOPs achievable on the specific accelerator. Note: this does
       not include any floating-point operations from :ref:`MFMA <desc-mfma>`
       instructions.
     - GFLOPs
   * - :ref:`VALU <desc-valu>` IOPs
     - The total integer operations executed per second on the
       :ref:`VALU <desc-valu>`. This is also presented as a percent of the peak
       theoretical IOPs achievable on the specific accelerator. Note: this does
       not include any integer operations from :ref:`MFMA <desc-mfma>`
       instructions.
     - GIOPs
   * - :ref:`MFMA <desc-mfma>` FLOPs (F8)
     - The total number of 8-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 8-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F8 MFMA
       operations achievable on the specific accelerator. Supported only on
       AMD Instinct MI300 series and later.
     - GFLOPs
   * - :ref:`MFMA <desc-mfma>` FLOPs (BF16)
     - The total number of 16-bit brain floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 16-bit
       brain floating point operations from :ref:`VALU <desc-valu>`
       instructions. This is also presented as a percent of the peak theoretical
       BF16 MFMA operations achievable on the specific accelerator.
     - GFLOPs
   * - :ref:`MFMA <desc-mfma>` FLOPs (F16)
     - The total number of 16-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 16-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F16 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - :ref:`MFMA <desc-mfma>` FLOPs (F32)
     - The total number of 32-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 32-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F32 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - :ref:`MFMA <desc-mfma>` FLOPs (F64)
     - The total number of 64-bit floating point :ref:`MFMA <desc-mfma>`
       operations executed per second. Note: this does not include any 64-bit
       floating point operations from :ref:`VALU <desc-valu>` instructions. This
       is also presented as a percent of the peak theoretical F64 MFMA
       operations achievable on the specific accelerator.
     - GFLOPs
   * - :ref:`MFMA <desc-mfma>` IOPs (INT8)
     - The total number of 8-bit integer :ref:`MFMA <desc-mfma>` operations
       executed per second. Note: this does not include any 8-bit integer
       operations from :ref:`VALU <desc-valu>` instructions. This is also
       presented as a percent of the peak theoretical INT8 MFMA operations
       achievable on the specific accelerator.
     - GIOPs
   * - :ref:`SALU <desc-salu>` utilization
     - Indicates what percent of the kernel's duration the
       :ref:`SALU <desc-salu>` was busy executing instructions. Computed as the
       ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing :ref:`SALU <desc-salu>` or
       :ref:`SMEM <desc-salu>` instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - :ref:`VALU <desc-valu>` utilization
     - Indicates what percent of the kernel's duration the
       :ref:`VALU <desc-valu>` was busy executing instructions. Does not include
       :ref:`VMEM <desc-vmem>` operations.  Computed as the ratio of the total
       number of cycles spent by the :ref:`scheduler <desc-scheduler>` issuing
       :ref:`VALU <desc-valu>` instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - :ref:`MFMA <desc-mfma>` utilization
     - Indicates what percent of the kernel's duration the
       :ref:`MFMA <desc-mfma>` unit was busy executing instructions. Computed as
       the ratio of the total number of cycles the MFMA was busy over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - :ref:`VMEM <desc-valu>` utilization
     - Indicates what percent of the kernel's duration the
       :ref:`VMEM <desc-valu>` unit was busy executing instructions, including
       both global/generic and spill/scratch operations (see the
       :ref:`VMEM instruction count metrics <ta-instruction-counts>` for more
       detail). Does not include :ref:`VALU <desc-valu>` operations. Computed as
       the ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing VMEM instructions over the
       :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - :ref:`Branch <desc-branch>` utilization
     - Indicates what percent of the kernel's duration the
       :ref:`branch <desc-branch>` unit was busy executing instructions.
       Computed as the ratio of the total number of cycles spent by the
       :ref:`scheduler <desc-scheduler>` issuing :ref:`branch <desc-branch>`
       instructions over the :ref:`total CU cycles <total-cu-cycles>`.
     - Percent
   * - :ref:`VALU <desc-valu>` active threads
     - Indicates the average level of :ref:`divergence <desc-divergence>` within
       a wavefront over the lifetime of the kernel. The number of work-items
       that were active in a wavefront during execution of each
       :ref:`VALU <desc-valu>` instruction, time-averaged over all VALU
       instructions run on all wavefronts in the kernel.
     - Work-items
   * - IPC
     - The ratio of the total number of instructions executed on the
       :doc:`CU <compute-unit>` over the
       :ref:`total active CU cycles <total-active-cu-cycles>`. This is also
       presented as a percent of the peak theoretical IPC achievable on the
       specific accelerator.
     - Instructions per cycle
   * - Wavefront occupancy
     - The time-averaged number of wavefronts resident on the accelerator over
       the lifetime of the kernel. Note: this metric may be inaccurate for
       short-running kernels (less than 1ms). This is also presented as a
       percent of the peak theoretical occupancy achievable on the specific
       accelerator.
     - Wavefronts
   * - :doc:`LDS <local-data-share>` theoretical bandwidth
     - Indicates the maximum number of bytes that could have been loaded from,
       stored to, or atomically updated in the LDS per unit time (see
       :ref:`LDS Bandwidth <lds-bandwidth>` example for more detail). This is
       also presented as a percent of the peak theoretical bandwidth
       achievable on the specific accelerator.
     - GB/s
   * - :doc:`LDS <local-data-share>` bank conflicts/access
     - The ratio of the number of cycles spent in the
       :doc:`LDS scheduler <local-data-share>` due to bank conflicts (as
       determined by the conflict resolution hardware) to the base number of
       cycles that would be spent in the LDS scheduler in a completely
       uncontended case. This is also presented in normalized form (i.e., the
       Bank Conflict Rate).
     - Conflicts/Access
   * - :doc:`vL1D <vector-l1-cache>` cache hit rate
     - The ratio of the number of vL1D cache line requests that hit in vL1D
       cache over the total number of cache line requests to the
       :ref:`vL1D cache RAM <desc-tc>`.
     - Percent
   * - :doc:`vL1D <vector-l1-cache>` cache bandwidth
     - The number of bytes looked up in the vL1D cache as a result of
       :ref:`VMEM <desc-vmem>` instructions per unit time. The number of bytes
       is calculated as the number of cache lines requested multiplied by the
       cache line size. This value does not consider partial requests, so, for
       example, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line. This is also
       presented as a percent of the peak theoretical bandwidth achievable on
       the specific accelerator.
     - GB/s
   * - :doc:`L2 <l2-cache>` cache hit rate
     - The ratio of the number of L2 cache line requests that hit in the L2
       cache over the total number of incoming cache line requests to the L2
       cache.
     - Percent
   * - :doc:`L2 <l2-cache>` cache bandwidth
     - The number of bytes looked up in the L2 cache per unit time. The number
       of bytes is calculated as the number of cache lines requested multiplied
       by the cache line size. This value does not consider partial requests,
       so, for example, if only a single value is requested in a cache line, the
       data movement will still be counted as a full cache line. This is also
       presented as a percent of the peak theoretical bandwidth achievable on
       the specific accelerator.
     - GB/s
   * - :doc:`L2 <l2-cache>`-fabric read BW
     - The number of bytes read by the L2 over the
       :ref:`Infinity Fabric™ interface <l2-fabric>` per unit time. This is also
       presented as a percent of the peak theoretical bandwidth achievable on
       the specific accelerator.
     - GB/s
   * - :doc:`L2 <l2-cache>`-fabric write and atomic BW
     - The number of bytes sent by the L2 over the
       :ref:`Infinity Fabric interface <l2-fabric>` by write and atomic
       operations per unit time. This is also presented as a percent of the peak
       theoretical bandwidth achievable on the specific accelerator.
     - GB/s
   * - :doc:`L2 <l2-cache>`-fabric read latency
     - The time-averaged number of cycles read requests spent in Infinity Fabric
       before data was returned to the L2.
     - Cycles
   * - :doc:`L2 <l2-cache>`-fabric write latency
     - The time-averaged number of cycles write requests spent in Infinity
       Fabric before a completion acknowledgement was returned to the L2.
     - Cycles
   * - :ref:`sL1D <desc-sl1d>` cache hit rate
     - The percent of sL1D requests that hit on a previously loaded line in the
       cache. Calculated as the ratio of the number of sL1D requests that hit
       over the number of all sL1D requests.
     - Percent
   * - :ref:`sL1D <desc-sl1d>` bandwidth
     - The number of bytes looked up in the sL1D cache per unit time. This is
       also presented as a percent of the peak theoretical bandwidth achievable
       on the specific accelerator.
     - GB/s
   * - :ref:`L1I <desc-l1i>` bandwidth
     - The number of bytes looked up in the L1I cache per unit time. This is
       also presented as a percent of the peak theoretical bandwidth achievable
       on the specific accelerator.
     - GB/s
   * - :ref:`L1I <desc-l1i>` cache hit rate
     - The percent of L1I requests that hit on a previously loaded line in the
       cache. Calculated as the ratio of the number of L1I requests that hit
       over the number of all L1I requests.
     - Percent
   * - :ref:`L1I <desc-l1i>` fetch latency
     - The average number of cycles spent to fetch instructions to a
       :doc:`CU <compute-unit>`.
     - Cycles
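The cache bandwidth rows above describe bytes as the number of cache lines requested multiplied by the cache line size, reported per unit time and as a percent of a peak value. The sketch below follows that description. The function names, the cache line size, the duration, and the peak figure are all placeholder assumptions, not values taken from any specific accelerator.

```python
# Hypothetical helpers; every numeric input here is a made-up sample value.
def cache_bandwidth_gbps(
    lines_requested: int, cache_line_bytes: int, duration_ns: float
) -> float:
    """Bandwidth in GB/s, counting every request as a full cache line."""
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    bytes_looked_up = lines_requested * cache_line_bytes
    # Bytes per nanosecond equals GB/s (1e9 bytes over 1e9 nanoseconds).
    return bytes_looked_up / duration_ns


def percent_of_peak(value: float, peak: float) -> float:
    """Express a measured value as a percent of a peak value."""
    if peak <= 0:
        raise ValueError("peak must be positive")
    return 100.0 * value / peak


if __name__ == "__main__":
    bw = cache_bandwidth_gbps(lines_requested=1000, cache_line_bytes=64,
                              duration_ns=1000.0)
    print(bw, percent_of_peak(bw, peak=256.0))
```

Because partial requests are not considered, a request for a single value contributes the same number of bytes as a request for a whole line, so this figure can overstate the data a kernel actually consumes.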
@@ -63,53 +63,8 @@ vL1D Speed-of-Light
 The vL1D’s speed-of-light chart shows several key metrics for the vL1D
 as a comparison with the peak achievable values of those metrics.
-.. list-table::
+.. jinja:: vl1d-sol
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Hit Rate
     - The ratio of the number of vL1D cache line requests that hit [#vl1d-hit]_
       in vL1D cache over the total number of cache line requests to the
       :ref:`vL1D Cache RAM <desc-tc>`.
     - Percent
   * - Bandwidth
     - The number of bytes looked up in the vL1D cache as a result of
       :ref:`VMEM <desc-vmem>` instructions, as a percent of the peak
       theoretical bandwidth achievable on the specific accelerator. The number
       of bytes is calculated as the number of cache lines requested multiplied
       by the cache line size. This value does not consider partial requests, so
       for instance, if only a single value is requested in a cache line, the
       data movement will still be counted as a full cache line.
     - Percent
   * - Utilization
     - Indicates how busy the :ref:`vL1D Cache RAM <desc-tc>` was during the
       kernel execution. The number of cycles where the vL1D Cache RAM is
       actively processing any request divided by the number of cycles where the
       vL1D is active [#vl1d-activity]_.
     - Percent
   * - Coalescing
     - Indicates how well memory instructions were coalesced by the
       :ref:`address processing unit <desc-ta>`, ranging from uncoalesced (25%)
       to fully coalesced (100%). Calculated as the average number of
       :ref:`thread-requests <thread-requests>` generated per instruction
       divided by the ideal number of thread-requests per instruction.
     - Percent
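The Coalescing row above defines the metric as the average number of thread-requests per instruction divided by the ideal number. A minimal sketch of that ratio follows. The function name is hypothetical, and the ideal value is passed in as a parameter because this section does not state it. The sample numbers are illustrative only.

```python
# Hypothetical helper following the ratio described in the Coalescing row.
def coalescing_percent(
    thread_requests: int,
    instructions: int,
    ideal_requests_per_instruction: float,
) -> float:
    """Average thread-requests per instruction as a percent of the ideal."""
    if instructions == 0:
        return 0.0
    if ideal_requests_per_instruction <= 0:
        raise ValueError("ideal_requests_per_instruction must be positive")
    average = thread_requests / instructions
    return 100.0 * average / ideal_requests_per_instruction


if __name__ == "__main__":
    # Sample inputs chosen so the average is one quarter of the assumed ideal.
    print(coalescing_percent(thread_requests=1600, instructions=100,
                             ideal_requests_per_instruction=64))
```

Returning 0.0 when no memory instructions were issued is a design choice of this sketch; a real report might instead leave the value empty.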
 .. _desc-ta:
@@ -145,45 +100,8 @@ processing unit. When the front-end cannot accept any more addresses, it
 must backpressure the wave-issue logic for the VMEM pipe and prevent the
 issue of further vector memory instructions.
-.. list-table::
+.. jinja:: ta-busy-stall
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Busy
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
       processor was busy.
     - Percent
   * - Address Stall
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
       processor was stalled from sending address requests further into the vL1D
       pipeline.
     - Percent
   * - Data Stall
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
       processor was stalled from sending write/atomic data further into the
       vL1D pipeline.
     - Percent
   * - Data-Processor → Address Stall
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the address
       processor was stalled waiting to send command data to the
       :ref:`data processor <desc-td>`.
     - Percent
 .. _ta-instruction-counts:
@@ -232,80 +150,8 @@ kernel. These are broken down into a few major categories:
 The address processor counts these instruction types as follows:
-.. list-table::
+.. jinja:: ta-instruction-counts
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Type
     - Description
     - Unit
   * - Global/Generic
     - The total number of global & generic memory instructions executed on all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Global/Generic Read
     - The total number of global & generic memory read instructions executed on
       all :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Global/Generic Write
     - The total number of global & generic memory write instructions executed
       on all :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Global/Generic Atomic
     - The total number of global & generic memory atomic (with and without
       return) instructions executed on all :doc:`compute units <compute-unit>`
       on the accelerator, per :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack
     - The total number of spill/stack memory instructions executed on all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack Read
     - The total number of spill/stack memory read instructions executed on all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack Write
     - The total number of spill/stack memory write instructions executed on all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack Atomic
     - The total number of spill/stack memory atomic (with and without return)
       instructions executed on all :doc:`compute units <compute-unit>` on the
       accelerator, per :ref:`normalization unit <normalization-units>`.
       Typically unused, as these memory operations are generally used to
       implement thread-local storage.
     - Instructions per :ref:`normalization unit <normalization-units>`
 .. note::
@@ -333,38 +179,8 @@ Spill / stack metrics
 Finally, the address processing unit contains a separate coalescing
 stage for spill/stack memory, and thus reports:
-.. list-table::
+.. jinja:: ta-spill-stack
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Spill/Stack Total Cycles
     - The number of cycles the address processing unit spent working on
       spill/stack instructions, per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack Coalesced Read Cycles
     - The number of cycles the address processing unit spent working on
       coalesced spill/stack read instructions, per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
   * - Spill/Stack Coalesced Write Cycles
     - The number of cycles the address processing unit spent working on
       coalesced spill/stack write instructions, per
       :ref:`normalization unit <normalization-units>`.
     - Cycles per :ref:`normalization unit <normalization-units>`
 .. _desc-utcl1:
@@ -380,52 +196,8 @@ reduce the cost of subsequent re-translations.
 ROCm Compute Profiler reports the following L1 TLB metrics:
-.. list-table::
+.. jinja:: desc-utcl1
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Requests
     - The number of translation requests made to the UTCL1 per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Hits
     - The number of translation requests that hit in the UTCL1, and could be
       reused, per :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Hit Ratio
     - The ratio of the number of translation requests that hit in the UTCL1
       divided by the total number of translation requests made to the UTCL1.
     - Percent
   * - Translation Misses
     - The total number of translation requests that missed in the UTCL1 due to
       translation not being present in the cache, per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Permission Misses
     - The total number of translation requests that missed in the UTCL1 due to
       a permission error, per :ref:`normalization unit <normalization-units>`.
       This is unused and expected to be zero in most configurations for modern
       CDNA™ accelerators.
     - Requests per :ref:`normalization unit <normalization-units>`
 .. note::
@@ -464,39 +236,8 @@ L2 requests may backpressure the wave-issue logic of the :ref:`VMEM <desc-vmem>`
 pipe and prevent it from issuing more vector memory instructions until
 the vL1D’s outstanding requests are completed.
-.. list-table::
+.. jinja:: vl1d-cache-stall-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Stalled on L2 Data
     - The ratio of the number of cycles where the vL1D is stalled waiting for
       requested data to return from the :doc:`L2 cache <l2-cache>` divided by
       the number of cycles where the vL1D is active [#vl1d-activity]_.
     - Percent
   * - Stalled on L2 Requests
     - The ratio of the number of cycles where the vL1D is stalled waiting to
       issue a request for data to the :doc:`L2 cache <l2-cache>` divided by the
       number of cycles where the vL1D is active [#vl1d-activity]_.
     - Percent
   * - Tag RAM Stall (Read/Write/Atomic)
     - The ratio of the number of cycles where the vL1D is stalled due to
       Read/Write/Atomic requests with conflicting tags being looked up
       concurrently, divided by the number of cycles where the
       vL1D is active [#vl1d-activity]_.
     - Percent
 .. _vl1d-cache-access-metrics:
@@ -510,135 +251,8 @@ the :doc:`L2 cache <l2-cache>`. In addition, this section includes the
 approximate latencies of accesses to the cache itself, along with
 latencies of read/write memory operations to the :doc:`L2 cache <l2-cache>`.
-.. list-table::
+.. jinja:: vl1d-cache-access-metrics
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Total Requests
     - The total number of incoming requests from the
       :ref:`address processing unit <desc-ta>` after coalescing.
     - Requests
   * - Total read/write/atomic requests
     - The total number of incoming read/write/atomic requests from the
       :ref:`address processing unit <desc-ta>` after coalescing per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - Cache Bandwidth
     - The number of bytes looked up in the vL1D cache as a result of
       :ref:`VMEM <desc-vmem>` instructions per
       :ref:`normalization unit <normalization-units>`.  The number of bytes is
       calculated as the number of cache lines requested multiplied by the cache
       line size.  This value does not consider partial requests, so for
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
     - Bytes per :ref:`normalization unit <normalization-units>`
   * - Cache Hit Rate [#vl1d-hit]_
     - The ratio of the number of vL1D cache line requests that hit in vL1D
       cache over the total number of cache line requests to the
       :ref:`vL1D Cache RAM <desc-tc>`.
     - Percent
   * - Cache Accesses
     - The total number of cache line lookups in the vL1D.
     - Cache lines
   * - Cache Hits [#vl1d-hit]_
     - The number of cache accesses minus the number of outgoing requests to the
       :doc:`L2 cache <l2-cache>`, that is, the number of cache line requests
       serviced by the :ref:`vL1D Cache RAM <desc-tc>` per
       :ref:`normalization unit <normalization-units>`.
     - Cache lines per :ref:`normalization unit <normalization-units>`
   * - Invalidations
     - The number of times the vL1D was issued a write-back invalidate command
       during the kernel's execution per
       :ref:`normalization unit <normalization-units>`.  This may be triggered
       by, for instance, the ``buffer_wbinvl1`` instruction.
     - Invalidations per :ref:`normalization unit <normalization-units>`
   * - L1-L2 Bandwidth
     - The number of bytes transferred across the vL1D-L2 interface as a result
       of :ref:`VMEM <desc-vmem>` instructions, per
       :ref:`normalization unit <normalization-units>`. The number of bytes is
       calculated as the number of cache lines requested multiplied by the cache
       line size. This value does not consider partial requests, so for
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
     - Bytes per :ref:`normalization unit <normalization-units>`
   * - L1-L2 Reads
     - The number of read requests for a vL1D cache line that were not satisfied
       by the vL1D and must be retrieved from the
       :doc:`L2 Cache <l2-cache>` per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - L1-L2 Writes
     - The number of write requests to a vL1D cache line that were sent through
       the vL1D to the :doc:`L2 cache <l2-cache>`, per
       :ref:`normalization unit <normalization-units>`.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - L1-L2 Atomics
     - The number of atomic requests that are sent through the vL1D to the
       :doc:`L2 cache <l2-cache>`, per
       :ref:`normalization unit <normalization-units>`. This includes requests
       for atomics with, and without return.
     - Requests per :ref:`normalization unit <normalization-units>`
   * - L1 Access Latency
     - Calculated as the average number of cycles that a vL1D cache line request
       spent in the vL1D cache pipeline.
     - Cycles
   * - L1-L2 Read Access Latency
     - Calculated as the average number of cycles that the vL1D cache took to
       issue and receive read requests from the :doc:`L2 Cache <l2-cache>`. This
       number also includes requests for atomics with return values.
     - Cycles
   * - L1-L2 Write Access Latency
     - Calculated as the average number of cycles that the vL1D cache took to
       issue and receive acknowledgement of a write request to the
       :doc:`L2 Cache <l2-cache>`. This number also includes requests for
       atomics without return values.
     - Cycles
 .. note::
@@ -687,80 +301,5 @@ data, and returned to the appropriate SIMD.
 ROCm Compute Profiler reports the following vL1D data-return path metrics:
-.. list-table::
+.. jinja:: desc-td
-   :header-rows: 1
+   :file: _templates/metrics_table.j2
   * - Metric
     - Description
     - Unit
   * - Data-return Busy
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
       unit was busy processing or waiting on data to return to the
       :doc:`CU <compute-unit>`.
     - Percent
   * - Cache RAM → Data-return Stall
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
       unit was stalled on data to be returned from the
       :ref:`vL1D Cache RAM <desc-tc>`.
     - Percent
   * - Workgroup manager → Data-return Stall
     - Percent of the :ref:`total CU cycles <total-cu-cycles>` the data-return
       unit was stalled by the :ref:`workgroup manager <desc-spi>` due to
       initialization of registers as a part of launching new workgroups.
     - Percent
   * - Coalescable Instructions
     - The number of instructions submitted to the
       :ref:`data-return unit <desc-td>` by the
       :ref:`address processor <desc-ta>` that were found to be coalescable, per
       :ref:`normalization unit <normalization-units>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Read Instructions
     - The number of read instructions submitted to the
       :ref:`data-return unit <desc-td>` by the
       :ref:`address processor <desc-ta>` summed over all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`. This is expected to be
       the sum of global/generic and spill/stack reads in the
       :ref:`address processor <desc-ta>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Write Instructions
     - The number of store instructions submitted to the
       :ref:`data-return unit <desc-td>` by the
       :ref:`address processor <desc-ta>` summed over all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`. This is expected to be
       the sum of global/generic and spill/stack stores counted by the
       :ref:`vL1D cache-front-end <ta-instruction-counts>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
   * - Atomic Instructions
     - The number of atomic instructions submitted to the
       :ref:`data-return unit <desc-td>` by the
       :ref:`address processor <desc-ta>` summed over all
       :doc:`compute units <compute-unit>` on the accelerator, per
       :ref:`normalization unit <normalization-units>`. This is expected to be
       the sum of global/generic and spill/stack atomics in the
       :ref:`address processor <desc-ta>`.
     - Instructions per :ref:`normalization unit <normalization-units>`
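The hunks above replace each hand-written `list-table` with a `jinja` directive that renders `_templates/metrics_table.j2` against a named context. The template itself is not shown in this diff, so the following is only a rough sketch of the kind of output it would need to produce. Plain Python string formatting stands in for Jinja here, and both `render_list_table` and the `description`/`unit` field names are assumptions, not the project's actual schema.

```python
def render_list_table(data):
    """Render {metric: {"description": ..., "unit": ...}} as an RST list-table.

    Hypothetical stand-in for _templates/metrics_table.j2; the field names
    are assumed, and descriptions are expected to be single-line strings.
    """
    lines = [
        ".. list-table::",
        "   :header-rows: 1",
        "",
        "   * - Metric",
        "     - Description",
        "     - Unit",
    ]
    for name, fields in data.items():
        lines.append(f"   * - {name}")
        lines.append(f"     - {fields['description']}")
        lines.append(f"     - {fields['unit']}")
    return "\n".join(lines)


# Illustrative data only; the wording is taken from the table removed above.
sample = {
    "Hit Ratio": {
        "description": "Ratio of UTCL1 hits to total UTCL1 translation requests.",
        "unit": "Percent",
    },
}
table = render_list_table(sample)
print(table)
```

Keeping the table layout in one template means a metric renamed in the unified config only has to change in one place.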
@@ -30,6 +30,8 @@
 import re
 import yaml
 with open("../VERSION", encoding="utf-8") as f:
    match = re.search(r"([0-9.]+)[^0-9.]+", f.read())
    if not match:
@@ -43,7 +45,12 @@ copyright = "Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved
 version = version_number
 release = version_number
-extensions = ["rocm_docs", "sphinx.ext.extlinks", "sphinxcontrib.datatemplates"]
+extensions = [
    "rocm_docs",
    "sphinx.ext.extlinks",
    "sphinxcontrib.datatemplates",
    "sphinx_jinja",
 ]
 html_theme = "rocm_docs_theme"
 html_theme_options = {"flavor": "rocm"}
 html_title = f"{project} {version_number} documentation"
@@ -52,6 +59,113 @@ exclude_patterns = ["archive", "*/includes"]
 html_static_path = ["sphinx/static/css"]
 html_css_files = ["o_custom.css"]
 with open("data/metrics_description.yaml", "r") as f:
    metrics_data = yaml.safe_load(f)
 jinja_contexts = {
    "wavefront-launch-stats": {
        "data": metrics_data["Wavefront launch stats"],
    },
    "wavefront-runtime-stats": {
        "data": metrics_data["Wavefront runtime stats"],
    },
    "instruction-mix": {
        "data": metrics_data["Overall instruction mix"],
    },
    "valu-arith-instruction-mix": {
        "data": metrics_data["VALU arithmetic instruction mix"],
    },
    "mfma-instruction-mix": {
        "data": metrics_data["MFMA instruction mix"],
    },
    "compute-speed-of-light": {
        "data": metrics_data["Compute Speed-of-Light"],
    },
    "pipeline-stats": {
        "data": metrics_data["Pipeline statistics"],
    },
    "arithmetic-operations": {
        "data": metrics_data["Arithmetic operations"],
    },
    "lds-sol": {
        "data": metrics_data["LDS Speed-of-Light"],
    },
    "lds-stats": {
        "data": metrics_data["LDS Statistics"],
    },
    "vl1d-sol": {
        "data": metrics_data["vL1D Speed-of-Light"],
    },
    "ta-busy-stall": {
        "data": metrics_data["Busy / stall metrics"],
    },
    "ta-instruction-counts": {
        "data": metrics_data["Instruction counts"],
    },
    "ta-spill-stack": {
        "data": metrics_data["Spill / stack metrics"],
    },
    "desc-utcl1": {
        "data": metrics_data["L1 Unified Translation Cache (UTCL1)"],
    },
    "vl1d-cache-stall-metrics": {
        "data": metrics_data["vL1D cache stall metrics"],
    },
    "vl1d-cache-access-metrics": {
        "data": metrics_data["vL1D cache access metrics"],
    },
    "desc-td": {
        "data": metrics_data["Vector L1 data-return path or Texture Data (TD)"],
    },
    "l2-sol": {
        "data": metrics_data["L2 Speed-of-Light"],
    },
    "l2-cache-accesses": {
        "data": metrics_data["L2 cache accesses"],
    },
    "l2-fabric-metrics": {
        "data": metrics_data["L2-Fabric interface metrics"],
    },
    "l2-detailed-metrics": {
        "data": metrics_data["L2 - Fabric interface detailed metrics"],
    },
    "l2-fabric-stalls": {
        "data": metrics_data["L2 - Fabric Interface stalls"],
    },
    "desc-sl1d-sol": {
        "data": metrics_data["Scalar L1D Speed-of-Light"],
    },
    "desc-sl1d-stats": {
        "data": metrics_data["Scalar L1D cache accesses"],
    },
    "desc-sl1d-l2-interface": {
        "data": metrics_data["Scalar L1D Cache - L2 Interface"],
    },
    "desc-l1i-sol": {
        "data": metrics_data["L1I Speed-of-Light"],
    },
    "desc-l1i-stats": {
        "data": metrics_data["L1I cache accesses"],
    },
    "desc-l1i-l2-interface": {
        "data": metrics_data["L1I <-> L2 interface"],
    },
    "spi-util": {
        "data": metrics_data["Workgroup manager utilizations"],
    },
    "spi-resc-util": {
        "data": metrics_data["Workgroup Manager - Resource Allocation"],
    },
    "cpf-metrics": {
        "data": metrics_data["Command processor fetcher (CPF)"],
    },
    "cpc-metrics": {
        "data": metrics_data["Command processor packet processor (CPC)"],
    },
    "sys-sol": {
        "data": metrics_data["System Speed-of-Light"],
    },
 }
 external_toc_path = "./sphinx/_toc.yml"
 external_projects_current_project = "rocprofiler-compute"
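Every entry in the `jinja_contexts` mapping above has the same `{"data": metrics_data[...]}` shape. As a minimal sketch, the mapping could be built from a context-name-to-section table instead. `SECTION_BY_CONTEXT` and `build_jinja_contexts` are hypothetical names that do not appear in the commit, and the sample dictionary stands in for the parsed `data/metrics_description.yaml`, whose real layout is not shown in this diff.

```python
# Hypothetical helper, not part of the commit.
SECTION_BY_CONTEXT = {
    "ta-spill-stack": "Spill / stack metrics",
    "desc-utcl1": "L1 Unified Translation Cache (UTCL1)",
    "sys-sol": "System Speed-of-Light",
}


def build_jinja_contexts(metrics_data, section_by_context):
    """Return {context_name: {"data": section}} for each mapped section."""
    missing = [s for s in section_by_context.values() if s not in metrics_data]
    if missing:
        # Fail early with every missing section name, not just the first.
        raise KeyError(f"Sections missing from metrics data: {missing}")
    return {
        name: {"data": metrics_data[section]}
        for name, section in section_by_context.items()
    }


# Stand-in for the yaml.safe_load(...) result; the structure is assumed.
sample_metrics_data = {
    "Spill / stack metrics": {"Spill/Stack Total Cycles": {"unit": "Cycles"}},
    "L1 Unified Translation Cache (UTCL1)": {"Hit Ratio": {"unit": "Percent"}},
    "System Speed-of-Light": {"Active CUs": {"unit": "CUs"}},
}
jinja_contexts = build_jinja_contexts(sample_metrics_data, SECTION_BY_CONTEXT)
```

One possible benefit of this shape is that a section renamed in the generated YAML raises a single error listing all stale names, rather than a bare `KeyError` on the first one.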
@@ -96,3 +210,6 @@ extlinks = {
        "HSA Runtime Programmer's Reference Manual (page %s)",
    ),
 }
 # Uncomment if the local build hits a rate-limit-exceeded error
 external_projects_remote_repository = ""
@@ -242,6 +242,11 @@ List metrics
     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-metrics gfx90a
 Show the Description column, which is excluded from CLI output by default
  .. code-block:: shell
     $ rocprof-compute analyze -p workloads/vcopy/MI200/  --list-metrics gfx90a --include-cols Description
 Show System Speed-of-Light and CS_Busy blocks only
  .. code-block:: shell
@@ -1,2 +1,3 @@
 rocm-docs-core==1.21.1
 sphinxcontrib.datatemplates==0.11.0
 sphinx-jinja==2.0.2
@@ -53,7 +53,8 @@ docutils==0.21.2
    #   myst-parser
    #   pydata-sphinx-theme
    #   sphinx
-exceptiongroup==1.2.2
+    #   sphinx-jinja
 exceptiongroup==1.3.0
    # via ipython
 executing==2.2.0
    # via stack-data
@@ -87,6 +88,7 @@ jinja2==3.1.5
    # via
    #   myst-parser
    #   sphinx
    #   sphinx-jinja
 jsonschema==4.23.0
    # via nbformat
 jsonschema-specifications==2024.10.1
@@ -215,6 +217,7 @@ sphinx==8.1.3
    #   sphinx-copybutton
    #   sphinx-design
    #   sphinx-external-toc
    #   sphinx-jinja
    #   sphinx-notfound-page
    #   sphinxcontrib-datatemplates
    #   sphinxcontrib-runcmd
@@ -226,6 +229,8 @@ sphinx-design==0.6.1
    # via rocm-docs-core
 sphinx-external-toc==1.0.1
    # via rocm-docs-core
 sphinx-jinja==2.0.2
    # via -r requirements.in
 sphinx-notfound-page==1.0.4
    # via rocm-docs-core
 sphinxcontrib-applehelp==2.0.0
@@ -268,6 +273,7 @@ traitlets==5.14.3
    #   nbformat
 typing-extensions==4.12.2
    # via
    #   exceptiongroup
    #   ipython
    #   myst-nb
    #   pydata-sphinx-theme
@@ -202,7 +202,7 @@ Examples:
        nargs="?",
        const="",
        # Argument to --list-metrics is optional
-        choices=[""] + list(supported_archs.keys()),  # ["gfx906", "gfx908", "gfx90a"],
+        choices=[""] + list(supported_archs.keys()),  # ["gfx908", "gfx90a"],
        help=print_avail_arch(supported_archs.keys()),
    )
    profile_group.add_argument(
@@ -623,7 +623,18 @@ Examples:
        dest="cols",
        metavar="",
        nargs="+",
-        help="\t\tSpecify column indices to display.",
+        help="\t\tSpecify column indices to display.\n\t\tDefaults to display all columns.",
    )
    analyze_advanced_group.add_argument(
        "--include-cols",
        dest="include_cols",
        metavar="",
        nargs="+",
        help=(
            "\t\tSpecify which hidden column names should be included in cli output.\n"
            "\t\tFor example, to show 'Description' column which is hidden by default in cli output,\n"
            "\t\tuse the option --include-cols Description."
        ),
    )
    analyze_advanced_group.add_argument(
        "-g", dest="debug", action="store_true", help="\t\tDebug single metric."
@@ -28,7 +28,8 @@ from pathlib import Path
 rocprof_compute_home = Path(__file__).resolve().parent
 PROJECT_NAME = "rocprofiler-compute"
-HIDDEN_COLUMNS = ["Tips", "coll_level"]
+HIDDEN_COLUMNS = ["coll_level"]
 HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]
 HIDDEN_SECTIONS = [400, 1900, 2000]
 TIME_UNITS = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}
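The two hunks above add the `--include-cols` option and the `HIDDEN_COLUMNS_CLI` constant, but the code that combines them is not part of this excerpt. The following sketch shows one way the filtering could work; `visible_columns` is a hypothetical helper and the column names in the example are illustrative.

```python
import argparse

# Mirrors the constant added in this commit.
HIDDEN_COLUMNS_CLI = ["Description", "coll_level"]


def visible_columns(all_columns, include_cols=None):
    """Drop CLI-hidden columns unless the user asked to include them."""
    include = set(include_cols or [])
    hidden = {c for c in HIDDEN_COLUMNS_CLI if c not in include}
    return [c for c in all_columns if c not in hidden]


parser = argparse.ArgumentParser()
parser.add_argument("--include-cols", dest="include_cols", nargs="+")

columns = ["Metric", "Avg", "Unit", "Description", "coll_level"]

# With the option: Description is shown, coll_level stays hidden.
args = parser.parse_args(["--include-cols", "Description"])
shown = visible_columns(columns, args.include_cols)

# Without the option: both hidden columns are dropped.
default_shown = visible_columns(columns, parser.parse_args([]).include_cols)
```

Treating the option as an allow-list over the hidden set keeps the default output unchanged for existing users.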
@@ -25,6 +25,7 @@
 import copy
 import os
 import sys
 import textwrap
 from abc import ABC, abstractmethod
 from collections import OrderedDict
 from pathlib import Path
@@ -96,15 +97,28 @@ class OmniAnalyze_Base:
                    sys_info.iloc[0],
                )
            metric_descriptions = {
                k: v
                for dfs in self._arch_configs[args.list_metrics].dfs.values()
                for k, v in dfs.to_dict().get("Description", {}).items()
            }
            for key, value in self._arch_configs[args.list_metrics].metric_list.items():
                prefix = ""
                description = ""
                if "." not in str(key):
                    prefix = ""
                elif str(key).count(".") == 1:
                    prefix = "\t"
                else:
                    prefix = "\t\t"
-                print(prefix + key, "->", value)
+                    description = metric_descriptions.get(key, "")
                print(prefix + key, "->", value + "\n")
                if description:
                    print(
                        prefix
                        + f"\n{prefix}".join(textwrap.wrap(description, width=40))
                        + "\n"
                    )
            sys.exit(0)
        else:
            console_error("Unsupported arch")
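The loop above indents each metric by the number of dots in its key and, for leaf metrics, prints the description wrapped to 40 columns. A standalone sketch of that formatting, returning lines rather than printing them so it is easy to check, might look like the following. `format_metric_entry` is a hypothetical name and the key `2.1.0` is only an illustrative identifier.

```python
import textwrap


def format_metric_entry(key, value, description="", width=40):
    """Return the output lines for one metric, indented by its depth."""
    depth = str(key).count(".")
    # No dot: panel, one dot: table, two or more dots: individual metric.
    prefix = "\t" * min(depth, 2)
    lines = [f"{prefix}{key} -> {value}"]
    # As in the diff, only the deepest level carries a description.
    if depth >= 2 and description:
        wrapped = textwrap.wrap(description, width=width)
        lines.extend(prefix + part for part in wrapped)
    return lines


lines = format_metric_entry(
    "2.1.0",
    "VALU FLOPs",
    "The total floating-point operations executed per second on the VALU.",
)
print("\n".join(lines))
```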
@@ -1,14 +1,14 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
-  id: 000
+  id: 0
  title: Top Stats
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 001
+      id: 1
-        title: Top Kernels
+      title: Top Kernels
-        source: pmc_kernel_top.csv
+      source: pmc_kernel_top.csv
-
+  - raw_csv_table:
-    - raw_csv_table:
+      id: 2
-        id: 002
+      title: Dispatch List
-        title: Dispatch List
+      source: pmc_dispatch_info.csv
        source: pmc_dispatch_info.csv
@@ -1,9 +1,10 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 100
  title: System Info
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 101
+      id: 101
-        source: sysinfo.csv
+      source: sysinfo.csv
-        columnwise: True
+      columnwise: true
@@ -1,236 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
  SALU: &SALU_anchor Scalar Arithmetic Logic Unit
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 200
  title: System Speed-of-Light
  data source:
    - metric_table:
        id: 201
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
          VALU FLOPs:
            value: None # No perf counter
            unit: GFLOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: None # No perf counter
            tips:
          VALU IOPs:
            value: None # No perf counter
            unit: GIOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: None # No perf counter
            tips:
          MFMA FLOPs (BF16):
            value: None # No perf counter
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 512) / 1000)
            pop: None # No perf counter
            tips:
          MFMA FLOPs (F16):
            value: None # No perf counter
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: None # No perf counter
            tips:
          MFMA FLOPs (F32):
            value: None # No perf counter
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: None # No perf counter
            tips:
          MFMA FLOPs (F64):
            value: None # No perf counter
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: None # No perf counter
            tips:
          MFMA IOPs (Int8):
            value: None # No perf counter
            unit: GIOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: None # No perf counter
            tips:
          Active CUs:
            value: $numActiveCUs
            unit: CUs
            peak: $cu_per_gpu
            pop: ((100 * $numActiveCUs) / $cu_per_gpu)
            tips:
          SALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          VALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          MFMA Utilization:
            value: None # No HW module
            unit: pct
            peak: 100
            pop: None # No HW module
            tips:
          VMEM Utilization:
            value: None # No HW module
            unit: pct
            peak: 100
            pop: None # No HW module
            tips:
          Branch Utilization:
            value: None # No HW module
            unit: pct
            peak: 100
            pop: None # No HW module
            tips:
          VALU Active Threads:
            value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            peak: $wave_size
            pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size) if (SQ_ACTIVE_INST_VALU != 0) else None))
            tips:
          IPC:
            value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            peak: 5
            pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
            tips:
          Wavefront Occupancy:
            value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            peak: ($max_waves_per_cu * $cu_per_gpu)
            pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
              * $cu_per_gpu))))
            coll_level: SQ_LEVEL_WAVES
            tips:
          Theoretical LDS Bandwidth:
            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: (($max_sclk * $cu_per_gpu) * 0.128)
            pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
            tips:
          LDS Bank Conflicts/Access:
            value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Conflicts/access
            peak: 32
            pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
            tips:
          vL1D Cache Hit Rate:
            value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: pct
            peak: 100
            pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            tips:
          vL1D Cache BW:
            value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
            pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
            tips:
          L2 Cache Hit Rate:
            value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            tips:
          L2 Cache BW:
            value: AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
            pop: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
            tips:
          L2-Fabric Read BW:
            value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Write BW:
            value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Read Latency:
            value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          L2-Fabric Write Latency:
            value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          sL1D Cache Hit Rate:
            value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            tips:
          sL1D Cache BW:
            value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Hit Rate:
            value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            tips:
          L1I BW:
            value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Fetch Latency:
            value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            unit: Cycles
            peak: None
            pop: None
            coll_level: SQ_IFETCH_LEVEL
            tips:
@@ -0,0 +1,317 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 200
  title: System Speed-of-Light
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
      executed per second. This does not include any 8-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA was busy over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      IPC achievable on the specific accelerator.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
      occupancy achievable on the specific accelerator.'
    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
      been loaded from, stored to, or atomically updated in the LDS per unit time
      (see LDS Bandwidth example for more detail). This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
      to the base number of cycles that would be spent in the LDS scheduler in a completely
      uncontended case. This is also presented in normalized form (i.e., the Bank
      Conflict Rate).
    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
      hit in vL1D cache over the total number of cache line requests to the vL1D cache
      RAM.
    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
      VMEM instructions per unit time. The number of bytes is calculated as the number
      of cache lines requested multiplied by the cache line size. This value does
      not consider partial requests, so e.g., if only a single value is requested
      in a cache line, the data movement will still be counted as a full cache line.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
      in the L2 cache over the total number of incoming cache line requests to the
      L2 cache.
    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
      number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so e.g.,
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
      \ interface per unit time. This is also presented as a percent of the peak theoretical\
      \ bandwidth achievable on the specific accelerator."
    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
      interface by write and atomic operations per unit time. This is also presented
      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
      in Infinity Fabric before data was returned to the L2.
    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
      in Infinity Fabric before a completion acknowledgement was returned to the L2.
    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
      line in the cache. Calculated as the ratio of the number of sL1D requests that
      hit over the number of all sL1D requests.
    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in
      the cache. Calculated as the ratio of the number of L1I requests that hit over
      the number of all L1I requests.
    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
      presented as a percent of the peak theoretical bandwidth achievable on the
      specific accelerator.
    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
      a CU.
  data source:
  - metric_table:
      id: 201
      title: System Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
        peak: Peak
        pop: Pct of Peak
      metric:
        VALU FLOPs:
          value: None
          unit: GFLOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: None
        VALU IOPs:
          value: None
          unit: GIOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: None
        MFMA FLOPs (BF16):
          value: None
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 512) / 1000)
          pop: None
        MFMA FLOPs (F16):
          value: None
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: None
        MFMA FLOPs (F32):
          value: None
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: None
        MFMA FLOPs (F64):
          value: None
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: None
        MFMA IOPs (Int8):
          value: None
          unit: GIOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: None
        Active CUs:
          value: $numActiveCUs
          unit: CUs
          peak: $cu_per_gpu
          pop: ((100 * $numActiveCUs) / $cu_per_gpu)
        SALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        VALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        MFMA Utilization:
          value: None
          unit: pct
          peak: 100
          pop: None
        VMEM Utilization:
          value: None
          unit: pct
          peak: 100
          pop: None
        Branch Utilization:
          value: None
          unit: pct
          peak: 100
          pop: None
        VALU Active Threads:
          value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          unit: Threads
          peak: $wave_size
          pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size)
            if (SQ_ACTIVE_INST_VALU != 0) else None))
        IPC:
          value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit: Instr/cycle
          peak: 5
          pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
        Wavefront Occupancy:
          value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          peak: ($max_waves_per_cu * $cu_per_gpu)
          pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
            * $cu_per_gpu))))
          coll_level: SQ_LEVEL_WAVES
        Theoretical LDS Bandwidth:
          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: (($max_sclk * $cu_per_gpu) * 0.128)
          pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
        LDS Bank Conflicts/Access:
          value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Conflicts/access
          peak: 32
          pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
        vL1D Cache Hit Rate:
          value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: pct
          peak: 100
          pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
        vL1D Cache BW:
          value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
          pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
        L2 Cache Hit Rate:
          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
        L2 Cache BW:
          value: AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
          pop: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
            / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
        L2-Fabric Read BW:
          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
        L2-Fabric Write BW:
          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
        L2-Fabric Read Latency:
          value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        L2-Fabric Write Latency:
          value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        sL1D Cache Hit Rate:
          value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
        sL1D Cache BW:
          value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Hit Rate:
          value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
        L1I BW:
          value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Fetch Latency:
          value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          unit: Cycles
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
@@ -1,310 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 300
  title: Memory Chart
  data source:
    - metric_table:
        id: 301
        title: Memory Chart
        header:
          metric: Metric
          #alias: #alias
          value: Value
          tips: Tips
        metric:
          # ----------------------------------------
          # Instr Buff Block
          #TODO: double check wave_occupancy
          Wavefront Occupancy:
            #alias: wave_occ_
            value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
            coll_level: SQ_LEVEL_WAVES
            tips:
          Wave Life:
            #alias: wave_life_
            value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
            tips:
          # ----------------------------------------
          # Instr Dispatch Block
          SALU:
            #alias: salu_
            value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
            tips:
          SMEM:
            #alias: smem_
            value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
            tips:
          VALU:
            #alias: valu_
            value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
            tips:
          VMEM:
            #alias: vmem_
            value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
            tips:
          LDS:
            #alias: lds_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          GWS:
            #alias: gws_
            value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
            tips:
          BR:
            #alias: br_
            value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
            tips:
          # ----------------------------------------
          # Exec Block
          Active CUs:
            #alias: active_cu_
            value: $numActiveCUs
            tips:
          Num CUs:
            #alias: num_cu_
            value: $cu_per_gpu
            tips:
          VGPR:
            #alias: vgpr_
            value: ROUND(AVG(Arch_VGPR), 0)
            tips:
          SGPR:
            #alias: sgpr_
            value: ROUND(AVG(SGPR), 0)
            tips:
          LDS Allocation:
            #alias: lds_alloc_
            value: ROUND(AVG(LDS_Per_Workgroup), 0)
            tips:
          Scratch Allocation:
            #alias: scratch_alloc_
            value: ROUND(AVG(Scratch_Per_Workitem), 0)
            tips:
          Wavefronts:
            #alias: wavefronts_
            value: ROUND(AVG(SPI_CSN_WAVE), 0)
            tips:
          Workgroups:
            #alias: workgroups_
            value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
            tips:
          # ----------------------------------------
          # LDS Block
          LDS Req:
            #alias: lds_req_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          LDS Util:
            #alias: lds_util_
            value:
              ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
              0)
            tips:
          LDS Latency:
            #alias: lds_lat
            value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
            coll_level: SQ_INST_LEVEL_LDS
            tips:
          # ----------------------------------------
          # Vector L1 Cache Block
          VL1 Rd:
            #alias: vl1_rd_
            value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
            tips:
          VL1 Wr:
            #alias: vl1_wr_
            value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
            tips:
          VL1 Atomic:
            #alias: vl1_atom_
            value:
              ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom)), 0)
            tips:
          VL1 Hit:
            #alias: vl1_hit_
            value:
              ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None )), 0)
            tips:
          VL1 Lat:
            #alias: vl1_lat_
            value:
              ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None)), 0)
            tips:
          VL1 Coalesce:
            #alias: vl1_coales_
            value:
              ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
              * 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
            tips:
          VL1 Stall:
            #alias: vl1_stall_
            value:
              ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)), 0)
            tips:
          VL1_L2 Rd:
            #alias: vl1_l2_rd_
            value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Wr:
            #alias: vl1_l2_wr_
            value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Atomic:
            #alias: vl1_l2_atom_
            value:
              ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom)), 0)
            tips:
          # ----------------------------------------
          # Scalar L1D Cache Block
          VL1D Rd:
            #alias: sl1_rd_
            value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
            tips:
          VL1D Hit:
            #alias: sl1_hit_
            value:
              ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            tips:
          VL1D Lat:
            #alias: sl1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            coll_level: SQC_DCACHE_INFLIGHT_LEVEL
            tips:
          VL1D_L2 Rd:
            #alias: sl1_l2_rd_
            value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
            tips:
          VL1D_L2 Wr:
            #alias: sl1_l2_wr_
            value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
            tips:
          VL1D_L2 Atomic:
            #alias: sl1_l2_atom_
            value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # Instr L1  Cache Block
          IL1 Fetch:
            #alias: il1_fetch_
            value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
            tips:
          IL1 Hit:
            #alias: il1_hit_
            value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
            tips:
          IL1 Lat:
            #alias: il1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
              0) else None)) * 100), 0)
            tips: # ??? coll_level: SQ_IFETCH_LEVEL
          IL1_L2 Rd:
            #alias: il1_l2_req_
            value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # L2 Cache Block(inside)
          L2 Rd:
            #alias: l2_rd_
            value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
            tips:
          L2 Wr:
            #alias: l2_wr_
            value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
            tips:
          L2 Atomic:
            #alias: l2_atom_
            value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
            tips:
          L2 Hit:
            #alias: l2_hit_
            value:
              ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else 0)), 0)
            tips:
          L2 Rd Lat:
            #alias: l2_rd_lat_
            value:
              ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
              0)
            tips:
          L2 Wr Lat:
            #alias: l2_wr_lat_
            value:
              ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              != 0) else None)), 0)
            tips:
          # ----------------------------------------
          # Fabric Block
          Fabric_L2 Rd:
            #alias: l2_fabric_rd_
            value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Wr:
            #alias: l2_fabric_wr_
            value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Atomic:
            #alias: l2_fabric_atom_
            value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
            tips:
          Fabric Rd Lat:
            #alias: fabric_rd_lat_
            value:
              ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Wr Lat:
            #alias: fabric_wr_lat_
            value:
              ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Atomic Lat:
            #alias: fabric_atom_lat_
            value:
              ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else  0)), 0)
            tips:
          HBM Rd:
            #alias: hbm_rd_
            value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
            tips:
          HBM Wr:
            #alias: hbm_wr_
            value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
            tips:
        comparable: false # for now
        cli_style: mem_chart
@@ -0,0 +1,267 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 300
  title: Memory Chart
  metrics_description:
    Wavefront Occupancy: Wavefronts per active CU.
    Wave Life: Average number of cycles executing a wave.
    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
      unit.
    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
      unit.
    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
      normalization unit.
    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
      memory) per normalization unit.
    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
      and HIP's __shfl instructions) executed per normalization unit.
    GWS: Total number of GDS (global data sync) instructions issued per normalization
      unit.
    BR: Total number of BRANCH instructions issued per normalization unit.
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
    VGPR: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
      this kernel launch.
    Workgroups: The total number of workgroups forming this kernel launch.
    LDS Req: The total number of LDS instructions (including, but not limited to,
      read/write/atomics and HIP's __shfl instructions) executed per normalization
      unit.
    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
      executing instructions (including, but not limited to, load, store, atomic and
      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
      LDS was active over the total CU cycles.
    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
      / acknowledgment) required for an LDS instruction to complete.
    VL1 Rd: The total number of incoming read requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Wr: The total number of incoming write requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Atomic: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
      as the average number of thread-requests generated per instruction divided by
      the ideal number of thread-requests per instruction.
    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
      to issue a request for data to the L2 cache divided by the number of cycles
      where the vL1D is active.
    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
      by the vL1D and must be retrieved from the L2 cache, per normalization
      unit.
    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
      the vL1D to the L2 cache, per normalization unit.
    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
      the L2 cache, per normalization unit. This includes requests for atomics with
      and without return.
    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
      normalization unit.
    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
      unit.
    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
      cache. Calculated as the ratio of the number of L1I requests that hit over the
      number of all L1I requests.
    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
    L2 Rd: The total number of read requests to the L2 from all clients.
    L2 Wr: The total number of write requests to the L2 from all clients.
    L2 Atomic: The total number of atomic requests (with and without return) to the
      L2 from all clients.
    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
      over the total number of incoming cache line requests to the L2 cache.
    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive read requests from the L2 Cache. This number also includes
      requests for atomics with return values.
    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive acknowledgement of a write request to the L2 Cache. This
      number also includes requests for atomics without return values.
    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
      per normalization unit.
    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
      Fabric before data was returned to the L2.
    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
      Fabric before a completion acknowledgement was returned to the L2.
    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
      Infinity Fabric before a completion acknowledgement (atomic without return value)
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator''s local HBM, per normalization
      unit.'
  data source:
  - metric_table:
      id: 301
      title: Memory Chart
      header:
        metric: Metric
        value: Value
      metric:
        Wavefront Occupancy:
          value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
            0)
          coll_level: SQ_LEVEL_WAVES
        Wave Life:
          value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
            0)), 0)
        SALU:
          value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
        SMEM:
          value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
        VALU:
          value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
        MFMA:
          value: None
        VMEM:
          value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
        LDS:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        GWS:
          value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
        BR:
          value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
        Active CUs:
          value: $numActiveCUs
        Num CUs:
          value: $cu_per_gpu
        VGPR:
          value: ROUND(AVG(Arch_VGPR), 0)
        SGPR:
          value: ROUND(AVG(SGPR), 0)
        LDS Allocation:
          value: ROUND(AVG(LDS_Per_Workgroup), 0)
        Scratch Allocation:
          value: ROUND(AVG(Scratch_Per_Workitem), 0)
        Wavefronts:
          value: ROUND(AVG(SPI_CSN_WAVE), 0)
        Workgroups:
          value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
        LDS Req:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        LDS Util:
          value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu))), 0)
        LDS Latency:
          value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
            != 0) else None)),0)
          coll_level: SQ_INST_LEVEL_LDS
        VL1 Rd:
          value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
        VL1 Wr:
          value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
        VL1 Atomic:
          value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom)), 0)
        VL1 Hit:
          value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None )), 0)
        VL1 Lat:
          value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None)), 0)
        VL1 Coalesce:
          value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
            * 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None)), 0)
        VL1 Stall:
          value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)), 0)
        VL1_L2 Rd:
          value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
        VL1_L2 Wr:
          value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
        VL1_L2 Atomic:
          value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom)), 0)
        sL1D Rd:
          value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
        sL1D Hit:
          value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
        sL1D Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
          coll_level: SQC_DCACHE_INFLIGHT_LEVEL
        sL1D_L2 Rd:
          value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
        sL1D_L2 Wr:
          value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
        sL1D_L2 Atomic:
          value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
        IL1 Fetch:
          value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
        IL1 Hit:
          value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
        IL1 Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
            != 0) else None)) * 100), 0)
        IL1_L2 Rd:
          value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
        L2 Rd:
          value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
        L2 Wr:
          value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
        L2 Atomic:
          value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
        L2 Hit:
          value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
            ((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
        L2 Rd Lat:
          value: ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            != 0) else None)), 0)
        L2 Wr Lat:
          value: ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            != 0) else None)), 0)
        Fabric_L2 Rd:
          value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
        Fabric_L2 Wr:
          value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
        Fabric_L2 Atomic:
          value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else  0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else  0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else  0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
          value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
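 # Worked example (hypothetical counter values, for illustration only) of the
 # guarded-ratio pattern used throughout this table, applied to "L2 Hit":
 #   TCC_HIT_sum = 900, TCC_MISS_sum = 100
 #   (100 * 900) / (900 + 100) = 90
 # When TCC_HIT_sum + TCC_MISS_sum is 0, the "else" branch applies and the
 # expression yields 0 instead of dividing by zero.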
@@ -0,0 +1,9 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 400
  title: Roofline
  metrics_description: {}
  data source:
  - None:
      id: 401
      title: Roofline
@@ -1,135 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  data source:
    - metric_table:
        id: 501
        title: Command Processor Fetcher
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPF Utilization:
            avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF Stall:
            avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-L2 Utilization:
            avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF-L2 Stall:
            avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-UTCL1 Stall:
            avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
    - metric_table:
        id: 502
        title: Packet Processor
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPC Utilization:
            avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC Stall Rate:
            avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPC Packet Decoding Utilization:
            avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: pct
            tips:
          CPC-Workgroup Manager Utilization:
            avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: Pct
            tips:
          CPC-L2 Utilization:
            avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC-UTCL1 Stall:
            avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
          CPC-UTCL2 Utilization:
            avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            unit: pct
            tips:
@@ -0,0 +1,145 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  metrics_description:
    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
      over total cycles counted by the CPF-L2.
    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
      stalled for any reason.
    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
      translation.
    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
      for processing.
    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
      workgroups to the workgroup manager.
    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation.
    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
      translation interface where the CPC was busy doing address translation work.'
  data source:
  - metric_table:
      id: 501
      title: Command processor fetcher (CPF)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPF Utilization:
          avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          unit: pct
        CPF Stall:
          avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          unit: pct
        CPF-L2 Utilization:
          avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          unit: pct
        CPF-L2 Stall:
          avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          unit: pct
        CPF-UTCL1 Stall:
          avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          unit: pct
  - metric_table:
      id: 502
      title: Command processor packet processor (CPC)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPC Utilization:
          avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          unit: pct
        CPC Stall Rate:
          avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          unit: pct
        CPC Packet Decoding Utilization:
          avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-Workgroup Manager Utilization:
          avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-L2 Utilization:
          avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          unit: pct
        CPC-UTCL1 Stall:
          avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          unit: pct
        CPC-UTCL2 Utilization:
          avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
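 # Worked example (hypothetical counter values, for illustration only):
 #   CPF_CPF_STAT_BUSY = 750, CPF_CPF_STAT_IDLE = 250, CPF_CPF_STAT_STALL = 150
 #   CPF Utilization = (100 * 750) / (750 + 250) = 75 pct
 #   CPF Stall       = (100 * 150) / 750         = 20 pct
 # If the denominator counter is 0, each expression evaluates to None rather
 # than dividing by zero.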
@@ -1,167 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  data source:
    - metric_table:
        id: 601
        title: Workgroup Manager Utilizations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Accelerator Utilization:
            avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            unit: Pct
            tips:
          Scheduler-Pipe Utilization:
            avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            unit: Pct
            tips:
          Workgroup Manager Utilization:
            avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            unit: Pct
            tips:
          Shader Engine Utilization:
            avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            unit: Pct
            tips:
          SIMD Utilization:
            avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Dispatched Workgroups:
            avg: AVG(SPI_CSN_NUM_THREADGROUPS)
            min: MIN(SPI_CSN_NUM_THREADGROUPS)
            max: MAX(SPI_CSN_NUM_THREADGROUPS)
            unit: Workgroups
            tips:
          Dispatched Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          VGPR Writes:
            avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
          SGPR Writes:
            avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
    - metric_table:
        id: 602
        title: Workgroup Manager - Resource Allocation
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Not-scheduled Rate (Workgroup Manager):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Not-scheduled Rate (Scheduler-Pipe):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Scheduler-Pipe Stall Rate:
            avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            unit: Pct
            tips:
          Scratch Stall Rate:
            avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            unit: Pct
            tips:
          Insufficient SIMD Waveslots:
            avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD VGPRs:
            avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD SGPRs:
            avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU LDS:
            avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU Barriers:
            avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Workgroup Limit:
            avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Wavefront Limit:
            avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
@@ -0,0 +1,201 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  metrics_description:
    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
      was actively doing any work.
    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
      kernel where the scheduler-pipes were actively doing any work.
    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
      manager was actively doing any work.
    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
      where any CU in a shader engine was actively doing any work, normalized over
      all shader engines. Low values (much less than 100%) indicate that the accelerator
      was not fully saturated by the kernel, or a potential load-imbalance issue.
    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
      on a CU was actively doing any work, summed over all CUs. Low values (less than
      100%) indicate that the accelerator was not fully saturated by the kernel, or
      a potential load-imbalance issue.
    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
      forming this kernel launch.
    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
    Not-scheduled Rate (Scheduler-Pipe): The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
      resources.
    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
      where a workgroup could not be scheduled to a CU due to occupancy limitations
      (like a lack of a CU or SIMD with sufficient resources).
    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
      memory slots. While this can reach up to 100%, note that the actual occupancy
      limitations on a kernel using private memory are typically quite small (for
      example, less than 1% of the total number of waves that can be scheduled to
      an accelerator).
    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
      could not be scheduled to a CU due to lack of available LDS.
    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
      workgroup could not be scheduled to a CU due to lack of available barriers.
    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
      a workgroup could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
      a wavefront could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
      title: Workgroup manager utilizations
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Accelerator Utilization:
          avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          unit: Pct
        Scheduler-Pipe Utilization:
          avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          unit: Pct
        Workgroup Manager Utilization:
          avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          unit: Pct
        Shader Engine Utilization:
          avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          unit: Pct
        SIMD Utilization:
          avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Dispatched Workgroups:
          avg: AVG(SPI_CSN_NUM_THREADGROUPS)
          min: MIN(SPI_CSN_NUM_THREADGROUPS)
          max: MAX(SPI_CSN_NUM_THREADGROUPS)
          unit: Workgroups
        Dispatched Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        VGPR Writes:
          avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
        SGPR Writes:
          avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
  - metric_table:
      id: 602
      title: Workgroup Manager - Resource Allocation
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Not-scheduled Rate (Workgroup Manager):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Not-scheduled Rate (Scheduler-Pipe):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Scheduler-Pipe Stall Rate:
          avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          unit: Pct
        Scratch Stall Rate:
          avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Insufficient SIMD Waveslots:
          avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD VGPRs:
          avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD SGPRs:
          avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU LDS:
          avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU Barriers:
          avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Workgroup Limit:
          avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Wavefront Limit:
          avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
@@ -1,142 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 700
  title: Wavefront
  data source:
    - metric_table:
        id: 701
        title: Wavefront Launch Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Grid Size:
            avg: AVG(Grid_Size)
            min: MIN(Grid_Size)
            max: MAX(Grid_Size)
            unit: Work Items
            tips:
          Workgroup Size:
            avg: AVG(Workgroup_Size)
            min: MIN(Workgroup_Size)
            max: MAX(Workgroup_Size)
            unit: Work Items
            tips:
          Total Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          Saved Wavefronts:
            avg: AVG(SQ_WAVES_SAVED)
            min: MIN(SQ_WAVES_SAVED)
            max: MAX(SQ_WAVES_SAVED)
            unit: Wavefronts
            tips:
          Restored Wavefronts:
            avg: AVG(SQ_WAVES_RESTORED)
            min: MIN(SQ_WAVES_RESTORED)
            max: MAX(SQ_WAVES_RESTORED)
            unit: Wavefronts
            tips:
          VGPRs:
            avg: AVG(Arch_VGPR)
            min: MIN(Arch_VGPR)
            max: MAX(Arch_VGPR)
            unit: Registers
            tips:
          AGPRs:
            avg: AVG(Accum_VGPR)
            min: MIN(Accum_VGPR)
            max: MAX(Accum_VGPR)
            unit: Registers
            tips:
          SGPRs:
            avg: AVG(SGPR)
            min: MIN(SGPR)
            max: MAX(SGPR)
            unit: Registers
            tips:
          LDS Allocation:
            avg: AVG(LDS_Per_Workgroup)
            min: MIN(LDS_Per_Workgroup)
            max: MAX(LDS_Per_Workgroup)
            unit: Bytes
            tips:
          Scratch Allocation:
            avg: AVG(Scratch_Per_Workitem)
            min: MIN(Scratch_Per_Workitem)
            max: MAX(Scratch_Per_Workitem)
            unit: Bytes/Workitem
            tips:
    - metric_table:
        id: 702
        title: Wavefront Runtime Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Kernel Time:
            avg: AVG((End_Timestamp - Start_Timestamp))
            min: MIN((End_Timestamp - Start_Timestamp))
            max: MAX((End_Timestamp - Start_Timestamp))
            unit: ns
            tips:
          Kernel Time (Cycles):
            avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
            unit: Cycle
            tips:
          Instructions per wavefront:
            avg: AVG((SQ_INSTS / SQ_WAVES))
            min: MIN((SQ_INSTS / SQ_WAVES))
            max: MAX((SQ_INSTS / SQ_WAVES))
            unit: Instr/wavefront
            tips:
          Wave Cycles:
            avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
            min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
            max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Dependency Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Issue Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Active Cycles:
            avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Wavefront Occupancy:
            avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            coll_level: SQ_LEVEL_WAVES
            tips:
@@ -0,0 +1,173 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 700
  title: Wavefront
  metrics_description:
    Grid Size: The total number of work-items (or, threads) launched as a part of
      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
      by the total workgroup (or, block) size.
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
      \ should be equivalent to the ceiling of grid size divided by 64."
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
    VGPRs: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    AGPRs: 'The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
      requested by the compiler due to allocation granularity.'
    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Kernel Time: The total duration of the executed kernel.
    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
    Instructions per wavefront: The average number of instructions (of all types)
      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per normalization unit. This is averaged over all wavefronts
      in a kernel dispatch.
    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
      memory) per normalization unit. This is averaged over all wavefronts in a kernel dispatch.
    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
      arbitration loss, etc.) per normalization unit. This counter is incremented
      at every cycle by all wavefronts on a CU unable to issue an instruction. As
      such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter because another wave could be
      actively executing while a wave is issue stalled. The sum of this metric, Dependency
      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
      was actively executing instructions per normalization unit. This measurement
      is made on a per-wavefront basis, and may include cycles that another wavefront
      spent actively executing (on another execution unit, for example) or was stalled.
      As such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
      metric.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
      title: Wavefront Launch Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Grid Size:
          avg: AVG(Grid_Size)
          min: MIN(Grid_Size)
          max: MAX(Grid_Size)
          unit: Work Items
        Workgroup Size:
          avg: AVG(Workgroup_Size)
          min: MIN(Workgroup_Size)
          max: MAX(Workgroup_Size)
          unit: Work Items
        Total Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        Saved Wavefronts:
          avg: AVG(SQ_WAVES_SAVED)
          min: MIN(SQ_WAVES_SAVED)
          max: MAX(SQ_WAVES_SAVED)
          unit: Wavefronts
        Restored Wavefronts:
          avg: AVG(SQ_WAVES_RESTORED)
          min: MIN(SQ_WAVES_RESTORED)
          max: MAX(SQ_WAVES_RESTORED)
          unit: Wavefronts
        VGPRs:
          avg: AVG(Arch_VGPR)
          min: MIN(Arch_VGPR)
          max: MAX(Arch_VGPR)
          unit: Registers
        AGPRs:
          avg: AVG(Accum_VGPR)
          min: MIN(Accum_VGPR)
          max: MAX(Accum_VGPR)
          unit: Registers
        SGPRs:
          avg: AVG(SGPR)
          min: MIN(SGPR)
          max: MAX(SGPR)
          unit: Registers
        LDS Allocation:
          avg: AVG(LDS_Per_Workgroup)
          min: MIN(LDS_Per_Workgroup)
          max: MAX(LDS_Per_Workgroup)
          unit: Bytes
        Scratch Allocation:
          avg: AVG(Scratch_Per_Workitem)
          min: MIN(Scratch_Per_Workitem)
          max: MAX(Scratch_Per_Workitem)
          unit: Bytes/Workitem
  - metric_table:
      id: 702
      title: Wavefront Runtime Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Kernel Time:
          avg: AVG((End_Timestamp - Start_Timestamp))
          min: MIN((End_Timestamp - Start_Timestamp))
          max: MAX((End_Timestamp - Start_Timestamp))
          unit: ns
        Kernel Time (Cycles):
          avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
          unit: Cycle
        Instructions per wavefront:
          avg: AVG((SQ_INSTS / SQ_WAVES))
          min: MIN((SQ_INSTS / SQ_WAVES))
          max: MAX((SQ_INSTS / SQ_WAVES))
          unit: Instr/wavefront
        Wave Cycles:
          avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
          min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
          max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
          unit: (Cycles + $normUnit)
        Dependency Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Issue Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Active Cycles:
          avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Wavefront Occupancy:
          avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
@@ -1,129 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
  data source:
    - metric_table:
        id: 1001
        title: Overall Instruction Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          LDS:
            avg: AVG((SQ_INSTS_LDS / $denom))
            min: MIN((SQ_INSTS_LDS / $denom))
            max: MAX((SQ_INSTS_LDS / $denom))
            unit: (instr + $normUnit)
            tips:
          SALU:
            avg: AVG((SQ_INSTS_SALU / $denom))
            min: MIN((SQ_INSTS_SALU / $denom))
            max: MAX((SQ_INSTS_SALU / $denom))
            unit: (instr + $normUnit)
            tips:
          SMEM:
            avg: AVG((SQ_INSTS_SMEM / $denom))
            min: MIN((SQ_INSTS_SMEM / $denom))
            max: MAX((SQ_INSTS_SMEM / $denom))
            unit: (instr + $normUnit)
            tips:
          Branch:
            avg: AVG((SQ_INSTS_BRANCH / $denom))
            min: MIN((SQ_INSTS_BRANCH / $denom))
            max: MAX((SQ_INSTS_BRANCH / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1002
        title: VALU Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
    - metric_table:
        id: 1003
        title: VMEM Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Global/Generic Instr:
            avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Read:
            avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Write:
            avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Atomic:
            avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Instr:
            avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Read:
            avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Write:
            avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Atomic:
            avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1004
        title: MFMA Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
@@ -0,0 +1,189 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
  metrics_description:
    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
      These are the workhorses of the compute unit, and are used to execute a wide
      range of instruction types including floating point operations, non-uniform
      address calculations, transcendental operations, integer operations, shifts,
      conditional evaluation, etc.
    VMEM: The total number of vector memory operations issued. These include most
      loads, stores and atomic operations and all accesses to generic, global, private
      and texture memory.
    LDS: The total number of LDS (also known as shared memory) operations issued.
      These include loads, stores, atomics, and HIP's __shfl operations.
    MFMA: The total number of matrix fused multiply-add instructions issued.
    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
      memory.
    Branch: The total number of branch operations issued. These typically consist
      of jump or branch operations and are used to implement control flow.
    INT32: The total number of instructions operating on 32-bit integer operands issued
      to the VALU per normalization unit.
    INT64: The total number of instructions operating on 64-bit integer operands issued
      to the VALU per normalization unit.
    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
      operands issued to the VALU per normalization unit.
    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
      operands issued to the VALU per normalization unit.
    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
      floating-point operands issued to the VALU per normalization unit.
    F16-Trans: The total number of transcendental instructions (e.g., sqrt) operating
      on 16-bit floating-point operands issued to the VALU per normalization unit.
    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
      operands issued to the VALU per normalization unit.
    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
      operands issued to the VALU per normalization unit.
    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
      floating-point operands issued to the VALU per normalization unit.
    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 32-bit floating-point operands issued to the VALU per normalization unit.
    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
      operands issued to the VALU per normalization unit.
    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
      operands issued to the VALU per normalization unit.
    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
    Conversion: "The total number of type conversion instructions (such as converting\
      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
    Global/Generic Instr: The total number of global & generic memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Read: The total number of global & generic memory read instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Write: The total number of global & generic memory write instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Atomic: The total number of global & generic memory atomic (with
      and without return) instructions executed on all compute units on the accelerator,
      per normalization unit.
    Spill/Stack Instr: The total number of spill/stack memory instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Read: The total number of spill/stack memory read instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Write: The total number of spill/stack memory write instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
      return) instructions executed on all compute units on the accelerator, per normalization
      unit. Typically unused, as these memory operations are mainly used to implement
      thread-local storage.
    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
      unit.
    MFMA-F8: The total number of 8-bit floating-point MFMA instructions issued per
      normalization unit. This is supported only on AMD Instinct MI300 series and
      later.
    MFMA-F16: The total number of 16-bit floating-point MFMA instructions issued per
      normalization unit.
    MFMA-BF16: The total number of 16-bit brain floating-point MFMA instructions issued
      per normalization unit.
    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
      normalization unit.
    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
      normalization unit.
  data source:
  - metric_table:
      id: 1001
      title: Overall Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        LDS:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
          unit: (instr + $normUnit)
        SALU:
          avg: AVG((SQ_INSTS_SALU / $denom))
          min: MIN((SQ_INSTS_SALU / $denom))
          max: MAX((SQ_INSTS_SALU / $denom))
          unit: (instr + $normUnit)
        SMEM:
          avg: AVG((SQ_INSTS_SMEM / $denom))
          min: MIN((SQ_INSTS_SMEM / $denom))
          max: MAX((SQ_INSTS_SMEM / $denom))
          unit: (instr + $normUnit)
        Branch:
          avg: AVG((SQ_INSTS_BRANCH / $denom))
          min: MIN((SQ_INSTS_BRANCH / $denom))
          max: MAX((SQ_INSTS_BRANCH / $denom))
          unit: (instr + $normUnit)
  - metric_table:
      id: 1002
      title: VALU Arithmetic Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric: {}
  - metric_table:
      id: 1003
      title: VMEM Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Global/Generic Instr:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Read:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Write:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Atomic:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Instr:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Read:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Write:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Atomic:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
  - metric_table:
      id: 1004
      title: MFMA Arithmetic Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric: {}
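The commit aligns metric names between the analysis configuration and the documentation, so each metric listed under a `metric_table` should have a matching key in `metrics_description`. A minimal sketch of such a check follows. It assumes the panel file has already been parsed into a plain dictionary (the mapping under `Panel Config`); the helper name `missing_descriptions` and the sample data are hypothetical and are not part of `utils/split_config.py`.

```python
def missing_descriptions(panel):
    """Return metric names used in tables that have no description entry."""
    described = set(panel.get("metrics_description", {}))
    missing = []
    for source in panel.get("data source", []):
        table = source.get("metric_table", {})
        # `metric: {}` (an empty table) simply contributes no names.
        for name in table.get("metric") or {}:
            if name not in described:
                missing.append(name)
    return missing


# Hand-written stand-in for a parsed panel file (values abbreviated).
panel_config = {
    "metrics_description": {"LDS": "...", "SALU": "..."},
    "data source": [
        {"metric_table": {"id": 1001, "metric": {"LDS": {}, "SALU": {}, "Branch": {}}}},
        {"metric_table": {"id": 1002, "metric": {}}},
    ],
}

print(missing_descriptions(panel_config))  # prints ['Branch']
```

The check is one-directional on purpose: a description without a metric is allowed, because tables such as 1002 and 1004 above carry descriptions while their `metric` mapping is empty in this file.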
@@ -1,84 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  data source:
    - metric_table:
        id: 1101
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
    - metric_table:
        id: 1102
        title: Pipeline Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          IPC:
            avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            tips:
          IPC (Issued):
            avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            unit: Instr/cycle
            tips:
          SALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Active Threads:
            avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            tips:
    - metric_table:
        id: 1103
        title: Arithmetic Operations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
@@ -0,0 +1,147 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles.
    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
      over the number of cycles where the scheduler was actively working on issuing
      instructions.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA unit was busy over the total CU cycles.
    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
      was busy over the total number of MFMA instructions.
    VMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for a VMEM instruction to complete.
    SMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for a SMEM instruction to complete.
    FLOPs (Total): The total number of floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    IOPs (Total): The total number of integer operations executed on either the VALU
      or MFMA units, per normalization unit.
    F16 OPs: The total number of 16-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    BF16 OPs: The total number of 16-bit brain floating-point operations executed
      on either the VALU or MFMA units, per normalization unit.
    F32 OPs: The total number of 32-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    F64 OPs: The total number of 64-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    INT8 OPs: The total number of 8-bit integer operations executed on either the
      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
      title: Compute Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
        peak: Peak
        pop: Pct of Peak
      metric: {}
  - metric_table:
      id: 1102
      title: Pipeline Statistics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        IPC:
          avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
          avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
        VALU Utilization:
          avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
        VALU Active Threads:
          avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          unit: Threads
  - metric_table:
      id: 1103
      title: Arithmetic Operations
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric: {}
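Several expressions in this panel guard a division with `if (... != 0) else None`, for example `VALU Active Threads`. The sketch below shows how such a guarded ratio could be evaluated per dispatch and then reduced to avg/min/max. How the analysis tool's `AVG`, `MIN`, and `MAX` actually treat `None` is not shown in this diff; skipping undefined samples is an assumption made here, and the function names and counter values are made up for illustration.

```python
from statistics import mean


def valu_active_threads(thread_cycles, active_inst):
    """Mirror `(A / B) if (B != 0) else None` for one kernel dispatch."""
    return thread_cycles / active_inst if active_inst != 0 else None


def summarize(values):
    """Reduce samples to avg/min/max, skipping None entries (assumption)."""
    defined = [v for v in values if v is not None]
    if not defined:
        return {"avg": None, "min": None, "max": None}
    return {"avg": mean(defined), "min": min(defined), "max": max(defined)}


# Made-up (SQ_THREAD_CYCLES_VALU, SQ_ACTIVE_INST_VALU) pairs, one per dispatch.
samples = [(6400, 100), (3200, 100), (0, 0)]
stats = summarize([valu_active_threads(t, a) for t, a in samples])
print(stats)  # {'avg': 48.0, 'min': 32.0, 'max': 64.0}
```

The third sample has no active VALU cycles, so it yields `None` and does not pull the average toward zero.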
@@ -1,118 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
  data source:
    - metric_table:
        id: 1201
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Utilization:
            value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: Pct of Peak
            tips:
          Access Rate:
            value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: Pct of Peak
            tips:
          Theoretical Bandwidth:
            value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
            unit: Pct of Peak
            tips:
          Bank Conflict Rate:
            value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1202
        title: LDS Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          LDS Instrs:
            avg: AVG((SQ_INSTS_LDS / $denom))
            min: MIN((SQ_INSTS_LDS / $denom))
            max: MAX((SQ_INSTS_LDS / $denom))
            unit: (Instr  + $normUnit)
            tips:
          Theoretical Bandwidth:
            avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          LDS Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_LDS
            tips:
          Bank Conflicts/Access:
            avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Conflicts/Access
            tips:
          Index Accesses:
            avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
            min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
            max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Atomic Return Cycles:
            avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
            min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
            max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Bank Conflict:
            avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
            min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
            max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Addr Conflict:
            avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
            min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
            max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Unaligned Stall:
            avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
            min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
            max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Mem Violations:
            avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
            min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
            max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
            unit: (Accesses + $normUnit)
            tips:
@@ -0,0 +1,141 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
  metrics_description:
    Utilization: Indicates what percent of the kernel's duration the LDS was actively
      executing instructions (including, but not limited to, load, store, atomic and
      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
      LDS was active over the total CU cycles.
    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
    Theoretical Bandwidth: Indicates the maximum number of bytes that could have been
      loaded from, stored to, or atomically updated in the LDS per normalization unit.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
      bank conflicts over the number of LDS cycles that would have been required to
      move the same amount of data in an uncontended access.
    LDS Instructions: The total number of LDS instructions (including, but not limited
      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
      unit.
    LDS Latency: The average number of round-trip cycles (that is, from issue to data
      return / acknowledgment) required for an LDS instruction to complete.
    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
      due to bank conflicts (as determined by the conflict resolution hardware) to
      the base number of cycles that would be spent in the LDS scheduler in a completely
      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
    Index Accesses: The total number of cycles spent in the LDS scheduler over all
      operations per normalization unit.
    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
      per normalization unit.
    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
      conflicts (as determined by the conflict resolution hardware) per normalization
      unit.
    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
      conflicts (as determined by the conflict resolution hardware) per normalization
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
      \ normalization unit. This is unused and expected to be zero in most configurations\
      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
      title: LDS Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Utilization:
          value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
        Theoretical Bandwidth:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
        Bank Conflict Rate:
          value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1202
      title: LDS Statistics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        LDS Instructions:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          unit: (Bytes  + $normUnit)
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          unit: Cycles
          coll_level: SQ_INST_LEVEL_LDS
        Bank Conflicts/Access:
          avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Conflicts/Access
        Index Accesses:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
          unit: (Cycles  + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
          unit: (Cycles  + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
          unit: (Cycles  + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
          unit: (Cycles  + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
          unit: (Cycles  + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
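The description of `Bank Conflicts/Access` calls it the unnormalized form of `Bank Conflict Rate`. In the expressions above the two differ only by the constant 3.125, which equals 100/32. The sketch below reproduces both formulas for a single sample to make that relationship concrete. The function names and counter values are illustrative, and reading the 32 as a bank count is a guess rather than something the configuration states.

```python
def bank_conflicts_per_access(bank_conflict, idx_active):
    """SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT), guarded."""
    uncontended = idx_active - bank_conflict
    return bank_conflict / uncontended if uncontended != 0 else None


def bank_conflict_rate(bank_conflict, idx_active):
    """Same ratio with the conflict count scaled by 3.125 (= 100 / 32), guarded."""
    uncontended = idx_active - bank_conflict
    return (bank_conflict * 3.125) / uncontended if uncontended != 0 else None


# Made-up counter values for one dispatch.
conflicts, active = 200, 1000
print(bank_conflicts_per_access(conflicts, active))  # 0.25
print(bank_conflict_rate(conflicts, active))         # 0.78125
```

When every active cycle is a conflict cycle the denominator is zero, and both helpers return `None` instead of raising, matching the `else None` branch in the YAML expressions.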
@@ -1,105 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1300
  title: Instruction Cache
  data source:
    - metric_table:
        id: 1301
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
          Cache Hit Rate:
            value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            unit: Pct of Peak
            tips:
          L1I-L2 Bandwidth:
            value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1302
        title: Instruction Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Req:
            avg: AVG((SQC_ICACHE_REQ / $denom))
            min: MIN((SQC_ICACHE_REQ / $denom))
            max: MAX((SQC_ICACHE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Hits:
            avg: AVG((SQC_ICACHE_HITS / $denom))
            min: MIN((SQC_ICACHE_HITS / $denom))
            max: MAX((SQC_ICACHE_HITS / $denom))
            unit: (Hits  + $normUnit)
            tips:
          Misses - Non Duplicated:
            avg: AVG((SQC_ICACHE_MISSES / $denom))
            min: MIN((SQC_ICACHE_MISSES / $denom))
            max: MAX((SQC_ICACHE_MISSES / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Misses - Duplicated:
            avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            unit: pct
            tips:
          Instruction Fetch Latency:
            avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            unit: Cycles
            coll_level: SQ_IFETCH_LEVEL
            tips:
    - metric_table:
        id: 1303
        title: Instruction Cache - L2 Interface
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          L1I-L2 Bandwidth:
            avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
            min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
            max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
            unit: (Bytes + $normUnit)
            tips:
@@ -0,0 +1,106 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
      total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line in the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
    Req: The total number of requests made to the L1I, per normalization-unit.
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
      line that was not already pending due to another request, per normalization-unit.
    Misses - Duplicated: The total number of L1I requests that missed on a cache line
      that was already pending due to another request, per normalization-unit.
    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
      to a CU.
  data source:
  - metric_table:
      id: 1301
      title: L1I Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Bandwidth:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
        Cache Hit Rate:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
        L1I-L2 Bandwidth:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1302
      title: L1I cache accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Req:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
          unit: (Req  + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
          unit: (Hits  + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
          unit: (Misses  + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          unit: (Misses  + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: pct
        Instruction Fetch Latency:
          avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          unit: Cycles
          coll_level: SQ_IFETCH_LEVEL
  - metric_table:
      id: 1303
      title: L1I <-> L2 interface
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
          unit: (Bytes + $normUnit)
@@ -1,171 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  data source:
    - metric_table:
        id: 1401
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
          Cache Hit Rate:
            value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            unit: Pct of Peak
            tips:
          sL1D-L2 BW:
            value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000)
                        / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1402
        title: Scalar L1D Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Req:
            avg: AVG((SQC_DCACHE_REQ / $denom))
            min: MIN((SQC_DCACHE_REQ / $denom))
            max: MAX((SQC_DCACHE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Hits:
            avg: AVG((SQC_DCACHE_HITS / $denom))
            min: MIN((SQC_DCACHE_HITS / $denom))
            max: MAX((SQC_DCACHE_HITS / $denom))
            unit: (Req  + $normUnit)
            tips:
          Misses - Non Duplicated:
            avg: AVG((SQC_DCACHE_MISSES / $denom))
            min: MIN((SQC_DCACHE_MISSES / $denom))
            max: MAX((SQC_DCACHE_MISSES / $denom))
            unit: (Req  + $normUnit)
            tips:
          Misses- Duplicated:
            avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            unit: pct
            tips:
          Read Req (Total):
            avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((SQC_DCACHE_ATOMIC / $denom))
            min: MIN((SQC_DCACHE_ATOMIC / $denom))
            max: MAX((SQC_DCACHE_ATOMIC / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (1 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (2 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (4 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (8 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (16 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1403
        title: Scalar L1D Cache - L2 Interface
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          sL1D-L2 BW:
            avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          Read Req:
            avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
            min: MIN((SQC_TC_DATA_READ_REQ / $denom))
            max: MAX((SQC_TC_DATA_READ_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
            min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
            max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
            min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
            max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Stall Cycles:
            avg: AVG((SQC_TC_STALL / $denom))
            min: MIN((SQC_TC_STALL / $denom))
            max: MAX((SQC_TC_STALL / $denom))
            unit: (Cycles  + $normUnit)
            tips:
@@ -0,0 +1,186 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
      total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line in the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
    Misses - Non Duplicated: 'The total number of sL1D requests that missed on a cache
      line that was not already pending due to another request, per normalization
      unit. '
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
      that was already pending due to another request, per normalization unit.
    Read Req (Total): The total number of sL1D read requests of any size, per normalization
      unit.
    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
      of data (4B), per normalization unit.
    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
      of data (8B), per normalization unit.
    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
      of data (16B), per normalization unit.
    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
      of data (32B), per normalization unit.
    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
      dwords of data (64B), per normalization unit.
    Read Req: The total number of read requests from sL1D to the L2, per normalization
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
      title: Scalar L1D Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Bandwidth:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
        Cache Hit Rate:
          value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
        sL1D-L2 BW:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1402
      title: Scalar L1D cache accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Req:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
          unit: (Req  + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
          unit: (Req  + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
          unit: (Req  + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          unit: (Req  + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: pct
        Read Req (Total):
          avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
          unit: (Req  + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
          unit: (Req  + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
          unit: (Req  + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
          unit: (Req  + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
          unit: (Req  + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
          unit: (Req  + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          unit: (Bytes + $normUnit)
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
          unit: (Req  + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
          unit: (Cycles  + $normUnit)
@@ -1,168 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
  data source:
    - metric_table:
        id: 1501
        title: Address Processing Unit
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Address Processing Unit Busy:
            avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Address Stall:
            avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Data Stall:
            avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Data-Processor → Address Stall:
            avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Total Instructions:
            avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
            min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
            max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Instructions:
            avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Read Instructions:
            avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Write Instructions:
            avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Atomic Instructions:
            avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Instructions:
            avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Read Instructions:
            avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Write Instructions:
            avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Atomic Instructions:
            avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Total Cycles:
            avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Spill/Stack Coalesced Read:
            avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Spill/Stack Coalesced Write:
            avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
    - metric_table:
        id: 1502
        title: Data-Return Path
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Data-Return Busy:
            avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Cache RAM → Data-Return Stall:
            avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Coalescable Instructions:
            avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Read Instructions:
            avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Write Instructions:
            avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
            min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
            max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Atomic Instructions:
            avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
            min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
            max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
@@ -0,0 +1,233 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
  metrics_description:
    Address Processing Unit Busy: Percent of the total CU cycles the address processor
      was busy.
    Address Stall: Percent of the total CU cycles the address processor was stalled
      from sending address requests further into the vL1D pipeline.
    Data Stall: Percent of the total CU cycles the address processor was stalled from
      sending write/atomic data further into the vL1D pipeline.
    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
      processor was stalled waiting to send command data to the data processor.
    Total Instructions: The total number of memory instructions executed by the address
      processor over all compute units on the accelerator, per normalization unit.
    Global/Generic Instructions: The total number of global & generic memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Read Instructions: The total number of global & generic memory
      read instructions executed on all compute units on the accelerator, per normalization
      unit.
    Global/Generic Write Instructions: The total number of global & generic memory
      write instructions executed on all compute units on the accelerator, per normalization
      unit.
    Global/Generic Atomic Instructions: The total number of global & generic memory
      atomic (with and without return) instructions executed on all compute units
      on the accelerator, per normalization unit.
    Spill/Stack Instructions: The total number of spill/stack memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
      (with and without return) instructions executed on all compute units on the
      accelerator, per normalization unit. Typically unused, as these memory operations
      are mainly used to implement thread-local storage.
    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
      working on spill/stack instructions, per normalization unit.
    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
      working on coalesced spill/stack read instructions, per normalization unit.
    Spill/Stack Coalesced Write: The number of cycles the address processing unit
      spent working on coalesced spill/stack write instructions, per normalization
      unit.
    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
      processing or waiting on data to return to the CU.
    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
      unit was stalled on data to be returned from the vL1D Cache RAM.
    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
      data-return unit was stalled by the workgroup manager due to initialization
      of registers as a part of launching new workgroups.
    Coalescable Instructions: The number of instructions submitted to the data-return
      unit by the address processor that were found to be coalescable, per normalization
      unit.
    Read Instructions: The number of read instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack reads in the address processor.
    Write Instructions: The number of store instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack stores in the address processor.
    Atomic Instructions: The number of atomic instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack atomics in the address processor.
  data source:
  - metric_table:
      id: 1501
      title: Busy and stall metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Address Processing Unit Busy:
          avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        Address Stall:
          avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
        Data Stall:
          avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
        "Data-Processor \u2192 Address Stall":
          avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
  - metric_table:
      id: 1502
      title: Instruction counts
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Total Instructions:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Spill/Stack Total Cycles:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Data-Return Busy:
          avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        "Cache RAM \u2192 Data-Return Stall":
          avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        "Workgroup manager \u2192 Data-Return Stall":
          avg: null
          min: null
          max: null
          unit: pct
        Coalescable Instructions:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          unit: (Instructions  + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
@@ -1,414 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
  data source:
    - metric_table:
        id: 1601
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Hit rate:
            value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: Pct of Peak
            tips:
          Bandwidth:
            value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
            unit: Pct of Peak
            tips:
          Utilization:
            value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None))
            unit: Pct of Peak
            tips:
          Coalescing:
            value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
              * 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1602
        title: L1D Cache Stalls (%)
        header:
          metric: Metric
          expr: Expression
          tips: Tips
        metric:
          Stalled on L2 Data:
            expr:
              (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None)
            tips:
          Stalled on L2 Req:
            expr:
              (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None)
            tips:
          Stalled on Address:
            expr:
              None
            tips:
          Stalled on Data:
            expr:
              None
            tips:
          Stalled on Latency FIFO:
            expr:
              None
            tips:
          Stalled on Request FIFO:
            expr:
              None
            tips:
          Stalled on Read Return:
            expr:
              None
            tips:
          Tag RAM Stall (Read):
            expr:
              (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
          Tag RAM Stall (Write):
            expr:
              (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
          Tag RAM Stall (Atomic):
            expr:
              (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
        cli_style: simple_box
    - metric_table:
        id: 1603
        title: L1D Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Total Req:
            avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
            min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
            max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req:
            avg: AVG((TCP_TOTAL_READ_sum / $denom))
            min: MIN((TCP_TOTAL_READ_sum / $denom))
            max: MAX((TCP_TOTAL_READ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
            min: MIN((TCP_TOTAL_WRITE_sum / $denom))
            max: MAX((TCP_TOTAL_WRITE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache BW:
            avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: pct
            tips:
          Cache Accesses:
            avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hits:
            avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          Invalidations:
            avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            unit: (Req + $normUnit)
            tips:
          L1-L2 BW:
            avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          L1-L2 Read:
            avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1-L2 Write:
            avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1-L2 Atomic:
            avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1 Access Latency:
            avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            unit: Cycles
            tips:
          L1-L2 Read Latency:
            avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            unit: Cycles
            tips:
          L1-L2 Write Latency:
            avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            unit: Cycles
            tips:
    - metric_table:
        id: 1604
        title: L1D - L2 Transactions
        header:
          metric: Metric
          xfer: Xfer
          coherency: Coherency
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          NC - Read:
            xfer: Read
            coherency: NC
            avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Read:
            xfer: Read
            coherency: UC
            avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Read:
            xfer: Read
            coherency: CC
            avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Read:
            xfer: Read
            coherency: RW
            avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Write:
            xfer: Write
            coherency: RW
            avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          NC - Write:
            xfer: Write
            coherency: NC
            avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Write:
            xfer: Write
            coherency: UC
            avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Write:
            xfer: Write
            coherency: CC
            avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          NC - Atomic:
            xfer: Atomic
            coherency: NC
            avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Atomic:
            xfer: Atomic
            coherency: UC
            avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Atomic:
            xfer: Atomic
            coherency: CC
            avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Atomic:
            xfer: Atomic
            coherency: RW
            avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1605
        title: L1D Addr Translation
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          units: Units
          tips: Tips
        metric:
          Req:
            avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
            min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
            max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Inflight Req:
            avg:  None # Missing perfmon
            min:  None # Missing perfmon
            max:  None # Missing perfmon
            units: (Req + $normUnit)
            tips:
          Hit Ratio:
            avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            units: pct
            tips:
          Hits:
            avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Translation Misses:
            avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Permission Misses:
            avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            units: (Req + $normUnit)
            tips:
    - metric_table:
        id: 1606
        title: L1D Addr Translation Stalls
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          units: Units
          tips: Tips
        metric:
@@ -0,0 +1,442 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
      instructions, as a percent of the peak theoretical bandwidth achievable on the
      specific accelerator. The number of bytes is calculated as the number of cache
      lines requested multiplied by the cache line size. This value does not consider
      partial requests, so for instance, if only a single value is requested in a
      cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
    Coalescing: Indicates how well memory instructions were coalesced by the address
      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
      as the average number of thread-requests generated per instruction divided by
      the ideal number of thread-requests per instruction.
    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
      waiting for requested data to return from the L2 cache divided by the number
      of cycles where the vL1D is active.
    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
      waiting to issue a request for data to the L2 cache divided by the number of
      cycles where the vL1D is active.
    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
      due to Read requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
      due to Write requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
      due to Atomic requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Total Req: The total number of incoming requests from the address processing unit
      after coalescing, per normalization unit.
    Read Req: The total number of incoming read requests from the address processing
      unit after coalescing per normalization unit.
    Write Req: The total number of incoming write requests from the address processing
      unit after coalescing per normalization unit.
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
      instructions per normalization unit. The number of bytes is calculated as the
      number of cache lines requested multiplied by the cache line size. This value
      does not consider partial requests, so for instance, if only a single value
      is requested in a cache line, the data movement will still be counted as a full
      cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D, per normalization
      unit.
    Cache Hits: The number of cache accesses minus the number of outgoing requests
      to the L2 cache, that is, the number of cache line requests serviced by the
      vL1D Cache RAM per normalization unit.
    Invalidations: The number of times the vL1D was issued a write-back invalidate
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
      of VMEM instructions, per normalization unit. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
      as a full cache line.
    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
      by the vL1D and must be retrieved from the L2 cache, per normalization unit.
    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
      through the vL1D to the L2 cache, per normalization unit.
    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
      the L2 cache, per normalization unit. This includes requests for atomics with
      and without return.
    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
      line request spent in the vL1D cache pipeline.
    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
      took to issue and receive read requests from the L2 Cache. This number also
      includes requests for atomics with return values.
    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
      cache took to issue and receive acknowledgement of a write request to the L2
      Cache. This number also includes requests for atomics without return values.
    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    Req: The number of translation requests made to the UTCL1 per normalization unit.
    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
      divided by the total number of translation requests made to the UTCL1.
    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to the translation not being present in the cache, per normalization
      unit.
    Permission Misses: "The total number of translation requests that missed in the\
      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
      title: vL1D Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Hit rate:
          value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
        Bandwidth:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
          unit: Pct of Peak
        Utilization:
          value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None))
          unit: Pct of Peak
        Coalescing:
          value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
            * 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1602
      title: vL1D cache stall metrics
      header:
        metric: Metric
        expr: Expression
      metric:
        Stalled on L2 Data:
          expr: (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None)
        Stalled on L2 Req:
          expr: (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None)
        Tag RAM Stall (Read):
          expr: (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
        Tag RAM Stall (Write):
          expr: (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
        Tag RAM Stall (Atomic):
          expr: (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1603
      title: vL1D cache access metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Total Req:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
          unit: (Req  + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          unit: (Bytes + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: pct
        Cache Accesses:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          unit: (Req  + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          unit: (Req  + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          unit: (Bytes + $normUnit)
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          unit: (Req  + $normUnit)
        L1 Access Latency:
          avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          unit: Cycles
        L1-L2 Read Latency:
          avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          unit: Cycles
        L1-L2 Write Latency:
          avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          unit: Cycles
  - metric_table:
      id: 1604
      title: L1D - L2 Transactions
      header:
        metric: Metric
        xfer: Xfer
        coherency: Coherency
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        NC - Read:
          xfer: Read
          coherency: NC
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        units: Units
      metric:
        Req:
          avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
          min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
          max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
          units: (Req + $normUnit)
        Hit Ratio:
          avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          units: pct
        Hits:
          avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          units: (Req + $normUnit)
        Translation Misses:
          avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          units: (Req + $normUnit)
        Permission Misses:
          avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          units: (Req + $normUnit)
  - metric_table:
      id: 1606
      title: L1D Addr Translation Stalls
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        units: Units
      metric: {}
@@ -1,388 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1700
  title: L2 Cache
  data source:
    - metric_table:
        id: 1701
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Utilization:
            value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
            unit: pct
            tips:
          Bandwidth:
            value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
            unit: pct
            tips:
          Hit Rate:
            value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else 0))
            unit: pct
            tips:
          L2-Fabric Read BW:
            value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            tips:
          L2-Fabric Write and Atomic BW:
            value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            tips:
          HBM Bandwidth:
            value: $hbmBandwidth
            unit: GB/s
            tips:
    - metric_table:
        id: 1702
        title: L2 - Fabric Transactions
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Read BW:
            avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          HBM Read Traffic:
            avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Remote Read Traffic:
            avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Uncached Read Traffic:
            avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Write and Atomic BW:
            avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          HBM Write and Atomic Traffic:
            avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Remote Write and Atomic Traffic:
            avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Atomic Traffic:
            avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Uncached Write and Atomic Traffic:
            avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Read Latency:
            avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            unit: Cycles
            tips:
          Write and Atomic Latency:
            avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            unit: Cycles
            tips:
          Atomic Latency:
            avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            unit: Cycles
            tips:
    - metric_table:
        id: 1703
        title: L2 Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            avg: AVG((TCC_REQ_sum * 64) / $denom)
            min: MIN((TCC_REQ_sum * 64) / $denom)
            max: MAX((TCC_REQ_sum * 64) / $denom)
            unit: (Bytes + $normUnit)
            tips:
          Req:
            avg: AVG((TCC_REQ_sum / $denom))
            min: MIN((TCC_REQ_sum / $denom))
            max: MAX((TCC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req:
            avg: AVG((TCC_READ_sum / $denom))
            min: MIN((TCC_READ_sum / $denom))
            max: MAX((TCC_READ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((TCC_WRITE_sum / $denom))
            min: MIN((TCC_WRITE_sum / $denom))
            max: MAX((TCC_WRITE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((TCC_ATOMIC_sum / $denom))
            min: MIN((TCC_ATOMIC_sum / $denom))
            max: MAX((TCC_ATOMIC_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Streaming Req:
            avg: AVG((TCC_STREAMING_REQ_sum / $denom))
            min: MIN((TCC_STREAMING_REQ_sum / $denom))
            max: MAX((TCC_STREAMING_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Probe Req:
            avg: AVG((TCC_PROBE_sum / $denom))
            min: MIN((TCC_PROBE_sum / $denom))
            max: MAX((TCC_PROBE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hit:
            avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
            tips:
          Hits:
            avg: AVG((TCC_HIT_sum / $denom))
            min: MIN((TCC_HIT_sum / $denom))
            max: MAX((TCC_HIT_sum / $denom))
            unit: (Hits  + $normUnit)
            tips:
          Misses:
            avg: AVG((TCC_MISS_sum / $denom))
            min: MIN((TCC_MISS_sum / $denom))
            max: MAX((TCC_MISS_sum / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Writeback:
            avg: AVG((TCC_WRITEBACK_sum / $denom))
            min: MIN((TCC_WRITEBACK_sum / $denom))
            max: MAX((TCC_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Writeback (Internal):
            avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
            min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
            max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Writeback (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Evict (Internal):
            avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
            min: MIN((TCC_NORMAL_EVICT_sum / $denom))
            max: MAX((TCC_NORMAL_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Evict (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          NC Req:
            avg: AVG((TCC_NC_REQ_sum / $denom))
            min: MIN((TCC_NC_REQ_sum / $denom))
            max: MAX((TCC_NC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC Req:
            avg: AVG((TCC_UC_REQ_sum / $denom))
            min: MIN((TCC_UC_REQ_sum / $denom))
            max: MAX((TCC_UC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC Req:
            avg: AVG((TCC_CC_REQ_sum / $denom))
            min: MIN((TCC_CC_REQ_sum / $denom))
            max: MAX((TCC_CC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW Req:
            avg: AVG((TCC_RW_REQ_sum / $denom))
            min: MIN((TCC_RW_REQ_sum / $denom))
            max: MAX((TCC_RW_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1704
        title: L2 Cache Stalls
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
    - metric_table:
        id: 1705
        title: L2 - Fabric Interface Stalls
        header:
          metric: Metric
          type: Type
          transaction: Transaction
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        style:
          type: simple_multi_bar
        metric:
          Write - Credit Starvation:
            type: Credit Starvation
            transaction: Write
            avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            unit: pct
            tips:
    - metric_table:
        id: 1706
        title: L2 - Fabric Detailed Transaction Breakdown
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Read (32B):
            avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
            min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
            max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read (64B):
            avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read (Uncached):
            avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          HBM Read:
            avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Remote Read:
            avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (32B):
            avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (Uncached):
            avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (64B):
            avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
            min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
            max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          HBM Write and Atomic:
            avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Remote Write and Atomic:
            avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic:
            avg: AVG((TCC_EA_ATOMIC_sum / $denom))
            min: MIN((TCC_EA_ATOMIC_sum / $denom))
            max: MAX((TCC_EA_ATOMIC_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
@@ -0,0 +1,536 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1700
  title: L2 Cache
  metrics_description:
    Utilization: The ratio of the number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator over the total L2 cycles.
    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator. The number
      of bytes is calculated as the number of cache lines requested multiplied by
      the cache line size. This value does not consider partial requests, so e.g.,
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line.
    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
      cache over the total number of incoming cache line requests to the L2 cache.
    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
      interface per unit time.
    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
      Fabric interface by write and atomic operations per unit time.
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
      normalization unit.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
      are both counted as a single request), so this metric only approximates the
      percent of the L2-Fabric Read bandwidth directed to the local HBM.
    Remote Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the size of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only approximates
      the percent of the L2-Fabric Read bandwidth directed to a remote location.
    Uncached Read Traffic: The percent of read requests generated by the L2 cache
      that are reading from an uncached memory allocation. Note, as described in the
      request flow section, a single 64B read request is typically counted as two
      uncached read requests. So, it is possible for the Uncached Read Traffic to
      reach up to 200% of the total number of read requests. This breakdown does not
      consider the size of the request (i.e., 32B and 64B requests are both counted
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
      Fabric by write and atomic operations per normalization unit. Note that on current
      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
      (HBM). This breakdown does not consider the size of the request (meaning that
      32B and 64B requests are both counted as a single request), so this metric only
      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at fine-grained memory allocations or uncached memory allocations.
    Remote Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to any memory location other than the accelerator's
      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
      accelerator's HBM. This breakdown does not consider the size of the request
      (meaning that 32B and 64B requests are both counted as a single request), so
      this metric only approximates the percent of the L2-Fabric Write and Atomic
      bandwidth directed to a remote location. Note that on current CDNA accelerators,
      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
      they are targeted at fine-grained memory allocations or uncached memory allocations.
    Atomic Traffic: The percent of write requests generated by the L2 cache that are
      atomic requests to any memory location. This breakdown does not consider the
      size of the request (meaning that 32B and 64B requests are both counted as a
      single request), so this metric only approximates the percent of the L2-Fabric
      Write and Atomic bandwidth that is due to atomic requests. Note that on current
      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
      Infinity Fabric if they are targeted at fine-grained memory allocations or uncached
      memory allocations.
    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are targeting uncached memory allocations. This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
      are both counted as a single request), so this metric only approximates the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
    Read Latency: The time-averaged number of cycles read requests spent in Infinity
      Fabric before data was returned to the L2.
    Write and Atomic Latency: The time-averaged number of cycles write requests spent
      in Infinity Fabric before a completion acknowledgement was returned to the L2.
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients.
    Write Req: The total number of write requests to the L2 from all clients.
    Atomic Req: The total number of atomic requests (with and without return) to the
      L2 from all clients.
    Streaming Req: The total number of incoming requests to the L2 that are marked
      as streaming. The exact meaning of this may differ depending on the targeted
      accelerator, however on an MI2XX this corresponds to non-temporal loads or stores.
      The L2 cache attempts to evict streaming requests before normal requests when
      the L2 is at capacity.
    Probe Req: The number of coherence probe requests made to the L2 cache from outside
      the accelerator. On an MI2XX, probe requests may be generated by, for example,
      writes to fine-grained device memory or by writes to coarse-grained device memory.
    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
      cache over the total number of incoming cache line requests to the L2 cache.
    Hits: The total number of requests to the L2 from all clients that hit in the
      cache. As noted in the Speed-of-Light section, this includes hit-on-miss requests.
    Misses: The total number of requests to the L2 from all clients that miss in the
      cache. As noted in the Speed-of-Light section, these do not include hit-on-miss
      requests.
    Writeback: The total number of L2 cache lines written back to memory for any reason.
      Write-backs may occur due to user code (such as HIP kernel calls to __threadfence_system
      or atomic built-ins), by the command processor's memory acquire/release fences,
      or for other internal hardware reasons.
    Writeback (Internal): The total number of L2 cache lines written back to memory
      for internal hardware reasons, per normalization unit.
    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
      due to requests initiated by the vL1D cache, per normalization unit.
    Evict (Internal): The total number of L2 cache lines evicted from the cache due
      to capacity limits, per normalization unit.
    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
      to invalidation requests initiated by the vL1D cache, per normalization unit.
    NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
      allocations, per normalization unit.
    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
      allocations.
    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
      (CC) memory allocations.
    RW Req: The total number of requests to the L2 that go to Read-Write coherent
      memory (RW) allocations.
    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
      on write or atomic requests to any memory location because too many write/atomic
      requests were currently in flight, as a percent of the total active L2 cycles.
    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
      data from any memory location, per normalization unit.
    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
      data from any memory location, per normalization unit.
    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
      data from any memory location, per normalization unit. 64B requests for uncached
      data are counted as two 32B uncached data requests.
    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
      to write or atomically update 32B or 64B of uncached data, per normalization
      unit.
    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 64B of data in any memory location, per normalization
      unit.
    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
      or atomically update 32B or 64B of data in the accelerator's local HBM, per
      normalization unit.
    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
      memory allocations on the MI2XX.
    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
      \ over the total active L2 cycles."
    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
      stalled on a write or atomic request to any destination (local HBM, remote PCIe
      connected accelerator or CPU, or remote Infinity Fabric connected accelerator
      or CPU) over the total active L2 cycles.
    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
      read requests to remote PCIe connected accelerators or CPUs as a percent of
      the total active L2 cycles.
    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
      stalled on read requests to remote Infinity Fabric connected accelerators or
      CPUs as a percent of the total active L2 cycles.
    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
      read requests to the accelerator's local HBM as a percent of the total active
      L2 cycles.
    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
      write or atomic requests to remote PCIe connected accelerators or CPUs as a
      percent of the total active L2 cycles.
    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
      or CPUs as a percent of the total active L2 cycles.
    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
      write or atomic requests to the accelerator's local HBM as a percent of the total
      active L2 cycles.
  data source:
  - metric_table:
      id: 1701
      title: L2 Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Utilization:
          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
          unit: pct
        Peak Bandwidth:
          value: ((100 * AVG(((TCC_REQ_sum * 64) / (End_Timestamp - Start_Timestamp))))
            / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
          unit: pct
        Hit Rate:
          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else 0))
          unit: pct
        L2-Fabric Read BW:
          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
        L2-Fabric Write and Atomic BW:
          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
        HBM Bandwidth:
          value: $hbmBandwidth
          unit: GB/s
  - metric_table:
      id: 1702
      title: L2-Fabric interface metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Read BW:
          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          unit: (Bytes  + $normUnit)
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: pct
        Remote Read Traffic:
          avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          unit: pct
        Uncached Read Traffic:
          avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          unit: (Bytes  + $normUnit)
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Remote Write and Atomic Traffic:
          avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          unit: pct
        Atomic Traffic:
          avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Uncached Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Read Latency:
          avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: Cycles
        Write and Atomic Latency:
          avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: Cycles
        Atomic Latency:
          avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          unit: Cycles
  - metric_table:
      id: 1703
      title: L2 Cache Accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Bandwidth:
          avg: AVG((TCC_REQ_sum * 64) / $denom)
          min: MIN((TCC_REQ_sum * 64) / $denom)
          max: MAX((TCC_REQ_sum * 64) / $denom)
          unit: (Bytes + $normUnit)
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
          max: MAX((TCC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        Read Req:
          avg: AVG((TCC_READ_sum / $denom))
          min: MIN((TCC_READ_sum / $denom))
          max: MAX((TCC_READ_sum / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((TCC_WRITE_sum / $denom))
          min: MIN((TCC_WRITE_sum / $denom))
          max: MAX((TCC_WRITE_sum / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((TCC_ATOMIC_sum / $denom))
          min: MIN((TCC_ATOMIC_sum / $denom))
          max: MAX((TCC_ATOMIC_sum / $denom))
          unit: (Req  + $normUnit)
        Streaming Req:
          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
          min: MIN((TCC_STREAMING_REQ_sum / $denom))
          max: MAX((TCC_STREAMING_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        Probe Req:
          avg: AVG((TCC_PROBE_sum / $denom))
          min: MIN((TCC_PROBE_sum / $denom))
          max: MAX((TCC_PROBE_sum / $denom))
          unit: (Req  + $normUnit)
        Cache Hit:
          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          unit: pct
        Hits:
          avg: AVG((TCC_HIT_sum / $denom))
          min: MIN((TCC_HIT_sum / $denom))
          max: MAX((TCC_HIT_sum / $denom))
          unit: (Hits  + $normUnit)
        Misses:
          avg: AVG((TCC_MISS_sum / $denom))
          min: MIN((TCC_MISS_sum / $denom))
          max: MAX((TCC_MISS_sum / $denom))
          unit: (Misses  + $normUnit)
        Writeback:
          avg: AVG((TCC_WRITEBACK_sum / $denom))
          min: MIN((TCC_WRITEBACK_sum / $denom))
          max: MAX((TCC_WRITEBACK_sum / $denom))
          unit: (Cachelines + $normUnit)
        Writeback (Internal):
          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
          unit: (Cachelines + $normUnit)
        Writeback (vL1D Req):
          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          unit: (Cachelines + $normUnit)
        Evict (Internal):
          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
          unit: (Cachelines + $normUnit)
        Evict (vL1D Req):
          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          unit: (Cachelines + $normUnit)
        NC Req:
          avg: AVG((TCC_NC_REQ_sum / $denom))
          min: MIN((TCC_NC_REQ_sum / $denom))
          max: MAX((TCC_NC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC Req:
          avg: AVG((TCC_UC_REQ_sum / $denom))
          min: MIN((TCC_UC_REQ_sum / $denom))
          max: MAX((TCC_UC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC Req:
          avg: AVG((TCC_CC_REQ_sum / $denom))
          min: MIN((TCC_CC_REQ_sum / $denom))
          max: MAX((TCC_CC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW Req:
          avg: AVG((TCC_RW_REQ_sum / $denom))
          min: MIN((TCC_RW_REQ_sum / $denom))
          max: MAX((TCC_RW_REQ_sum / $denom))
          unit: (Req  + $normUnit)
  - metric_table:
      id: 1704
      title: L2 Cache Stalls
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric: {}
  - metric_table:
      id: 1705
      title: L2 - Fabric Interface stalls
      header:
        metric: Metric
        type: Type
        transaction: Transaction
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      style:
        type: simple_multi_bar
      metric:
        Write - Credit Starvation:
          type: Credit Starvation
          transaction: Write
          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          unit: pct
  - metric_table:
      id: 1706
      title: L2 - Fabric interface detailed metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Read (32B):
          avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
          min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
          max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
          unit: (Req  + $normUnit)
        Read (64B):
          avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          unit: (Req  + $normUnit)
        Read (Uncached):
          avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          unit: (Req  + $normUnit)
        HBM Read:
          avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
          unit: (Req  + $normUnit)
        Remote Read:
          avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (32B):
          avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (Uncached):
          avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (64B):
          avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
          min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
          max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
          unit: (Req  + $normUnit)
        HBM Write and Atomic:
          avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
          unit: (Req  + $normUnit)
        Remote Write and Atomic:
          avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Atomic:
          avg: AVG((TCC_EA_ATOMIC_sum / $denom))
          min: MIN((TCC_EA_ATOMIC_sum / $denom))
          max: MAX((TCC_EA_ATOMIC_sum / $denom))
          unit: (Req  + $normUnit)
@@ -1,350 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  data source:
    - metric_table:
        id: 1801
        title: Aggregate Stats (All channels)
        header:
          metric: Metric
          avg: Avg
          std dev: Std Dev
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          L2 Cache Hit Rate:
            avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            unit: pct
            tips:
        # FIXME: add the remaining aggregate metrics
    - metric_table:
        id: 1802
        title: L2 Cache Hit Rate (pct)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
              + TCC_MISS[::_1]) != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1803
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_REQ[::_1]) / $denom)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1804
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          read req: L2 Read
          write req: L2 Write
          atomic req: L2 Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1805
        title: L2-Fabric Requests (per normUnit)
        header:
          metric: Channel
          read req: L2-Fabric Read
          write req: L2-Fabric Write and Atomic
          atomic req: L2-Fabric Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    # - metric_table:
    #     id: 1806
    #     title: L2-EA Latency (Cycles)
    #     header:
    #       metric: Metric
    #       read lat: L2-EA Read
    #       write lat: L2-EA Write
    #       atomic lat: L2-EA Atomic
    #     metric:
    #       "::_1":
    #         read lat:
    #           AVG(((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
    #           != 0) else None))
    #         write lat:
    #           AVG(((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
    #           != 0) else None))
    #         atomic lat:
    #           AVG(((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
    #           (TCC_EA_ATOMIC[::_1] != 0) else 0))
    #       placeholder_range:
    #         "::_1": 32
    #     cli_style: simple_multiple_bar
    - metric_table:
        id: 1806
        title: L2-Fabric Read Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
              != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1807
        title: L2-Fabric Write and Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
              != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1808
        title: L2-Fabric Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
              (TCC_EA_ATOMIC[::_1] != 0) else 0)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1809
        title: L2-Fabric Read Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea read stall - pcie: L2-Fabric Read Stall (PCIe)
          ea read stall - if: L2-Fabric Read Stall (Infinity Fabric™)
          ea read stall - hbm: L2-Fabric Read Stall (HBM)
        metric:
          "::_1":
            ea read stall - pcie: None # Missing perfmon
            ea read stall - if: None # Missing perfmon
            ea read stall - hbm: None # Missing perfmon
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1810
        title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea write stall - pcie: L2-Fabric Write Stall (PCIe)
          ea write stall - if: L2-Fabric Write Stall (Infinity Fabric™)
          ea write stall - hbm: L2-Fabric Write Stall (HBM)
          ea write stall - starve: L2-Fabric Write Starve
        metric:
          "::_1":
            ea write stall - pcie: None # Missing perfmon
            ea write stall - if: None # Missing perfmon
            ea write stall - hbm: None # Missing perfmon
            ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1812
        title: L2-Fabric (128B read requests per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
          placeholder_range:
            "::_1": $total_l2_chan
          # tips: Number of 128-byte read requests sent to EA
        cli_style: simple_box
        tui_style: simple_box
@@ -0,0 +1,323 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  metrics_description:
    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
      clients that hit in the cache. As noted in the Speed-of-Light section, this
      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
      title: Aggregate Stats (All channels)
      header:
        metric: Metric
        avg: Avg
        std dev: Std Dev
        min: Min
        max: Max
        unit: Unit
      metric:
        L2 Cache Hit Rate:
          avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100
            * TCC_HIT[1])) + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4]))
            + (100 * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100
            * TCC_HIT[8])) + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11]))
            + (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) +
            (100 * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100
            * TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 *
            TCC_HIT[21])) + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24]))
            + (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) +
            (100 * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100
            * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          unit: pct
  - metric_table:
      id: 1802
      title: L2 Cache Hit Rate (pct)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
            + TCC_MISS[::_1]) != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1803
      title: L2 Requests (per normUnit)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (TO_INT(TCC_REQ[::_1]) / $denom)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1804
      title: L2 Requests (per normUnit)
      header:
        metric: Channel
        read req: L2 Read
        write req: L2 Write
        atomic req: L2 Atomic
      metric:
        ::_1:
          read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
          write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
          atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1805
      title: L2-Fabric Requests (per normUnit)
      header:
        metric: Channel
        read req: L2-Fabric Read
        write req: L2-Fabric Write and Atomic
        atomic req: L2-Fabric Atomic
      metric:
        ::_1:
          read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
          write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
          atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1806
      title: L2-Fabric Read Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
            != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1807
      title: L2-Fabric Write and Atomic Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
            != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1808
      title: L2-Fabric Atomic Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if (TCC_EA_ATOMIC[::_1]
            != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1809
      title: L2-Fabric Read Stall (Cycles per normUnit)
      header:
        metric: Channel
        ea read stall - pcie: L2-Fabric Read Stall (PCIe)
        ea read stall - if: "L2-Fabric Read Stall (Infinity Fabric\u2122)"
        ea read stall - hbm: L2-Fabric Read Stall (HBM)
      metric:
        ::_1:
          ea read stall - pcie: None
          ea read stall - if: None
          ea read stall - hbm: None
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1810
      title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
      header:
        metric: Channel
        ea write stall - pcie: L2-Fabric Write Stall (PCIe)
        ea write stall - if: "L2-Fabric Write Stall (Infinity Fabric\u2122)"
        ea write stall - hbm: L2-Fabric Write Stall (HBM)
        ea write stall - starve: L2-Fabric Write Starve
      metric:
        ::_1:
          ea write stall - pcie: None
          ea write stall - if: None
          ea write stall - hbm: None
          ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1])
            / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1812
      title: L2-Fabric (128B read requests per normUnit)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
@@ -1,10 +1,11 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 2100
  title: PC Sampling
  metrics_description: {}
  data source:
-    - pc_sampling_table:
+  - pc_sampling_table:
-        id: 2101
+      id: 2101
-        title: PC Sampling
+      title: PC Sampling
-        source: None # not support
+      source: ps_file
-        comparable: false # enable it later
+      comparable: false
@@ -1,14 +1,14 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
-  id: 000
+  id: 0
  title: Top Stats
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 001
+      id: 1
-        title: Top Kernels
+      title: Top Kernels
-        source: pmc_kernel_top.csv
+      source: pmc_kernel_top.csv
-
+  - raw_csv_table:
-    - raw_csv_table:
+      id: 2
-        id: 002
+      title: Dispatch List
-        title: Dispatch List
+      source: pmc_dispatch_info.csv
        source: pmc_dispatch_info.csv
@@ -1,9 +1,10 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 100
  title: System Info
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 101
+      id: 101
-        source: sysinfo.csv
+      source: sysinfo.csv
-        columnwise: True
+      columnwise: true
@@ -1,254 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
  SALU: &SALU_anchor Scalar Arithmetic Logic Unit
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 200
  title: System Speed-of-Light
  data source:
    - metric_table:
        id: 201
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
          VALU FLOPs:
            value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
              / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
              + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
              * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          VALU IOPs:
            value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          MFMA FLOPs (BF16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F32):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA FLOPs (F64):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA IOPs (Int8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          Active CUs:
            value: $numActiveCUs
            unit: CUs
            peak: $cu_per_gpu
            pop: ((100 * $numActiveCUs) / $cu_per_gpu)
            tips:
          SALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          VALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          MFMA Utilization:
            value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
              * 4)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
              * 4)))
            tips:
          VMEM Utilization:
            value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            peak: 100
            pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            tips:
          Branch Utilization:
            value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            peak: 100
            pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            tips:
          VALU Active Threads:
            value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            peak: 64
            pop: (AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None)) * 1.5625)
            tips:
          IPC:
            value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            peak: 5
            pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
            tips:
          Wavefront Occupancy:
            value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            peak: ($max_waves_per_cu * $cu_per_gpu)
            pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
              * $cu_per_gpu))))
            coll_level: SQ_LEVEL_WAVES
            tips:
          Theoretical LDS Bandwidth:
            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: (($max_sclk * $cu_per_gpu) * 0.128)
            pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
            tips:
          LDS Bank Conflicts/Access:
            value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Conflicts/access
            peak: 32
            pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
            tips:
          vL1D Cache Hit Rate:
            value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: pct
            peak: 100
            pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            tips:
          vL1D Cache BW:
            value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
            pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
            tips:
          L2 Cache Hit Rate:
            value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            tips:
          L2 Cache BW:
            value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
            pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
            tips:
          L2-Fabric Read BW:
            value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Write BW:
            value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Read Latency:
            value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          L2-Fabric Write Latency:
            value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          sL1D Cache Hit Rate:
            value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            tips:
          sL1D Cache BW:
            value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Hit Rate:
            value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            tips:
          L1I BW:
            value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Fetch Latency:
            value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            unit: Cycles
            peak: None
            pop: None
            coll_level: SQ_IFETCH_LEVEL
            tips:
@@ -0,0 +1,337 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 200
  title: System Speed-of-Light
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
      executed per second. This does not include any 8-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA was busy over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      IPC achievable on the specific accelerator.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
      occupancy achievable on the specific accelerator.'
    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
      been loaded from, stored to, or atomically updated in the LDS per unit time
      (see LDS Bandwidth example for more detail). This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
      to the base number of cycles that would be spent in the LDS scheduler in a completely
      uncontended case. This is also presented in normalized form (i.e., the Bank
      Conflict Rate).
    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
      hit in vL1D cache over the total number of cache line requests to the vL1D cache
      RAM.
    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
      VMEM instructions per unit time. The number of bytes is calculated as the number
      of cache lines requested multiplied by the cache line size. This value does
      not consider partial requests, so e.g., if only a single value is requested
      in a cache line, the data movement will still be counted as a full cache line.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
      in the L2 cache over the total number of incoming cache line requests to the
      L2 cache.
    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
      number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so e.g.,
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
      \ interface per unit time. This is also presented as a percent of the peak theoretical\
      \ bandwidth achievable on the specific accelerator."
    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
      interface by write and atomic operations per unit time. This is also presented
      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
      in Infinity Fabric before data was returned to the L2.
    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
      in Infinity Fabric before a completion acknowledgement was returned to the L2.
    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
      line in the cache. Calculated as the ratio of the number of sL1D requests that
      hit over the number of all sL1D requests.
    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line
      in the cache. Calculated as the ratio of the number of L1I requests that hit
      over the number of all L1I requests.
    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is
      also presented as a percent of the peak theoretical bandwidth achievable on
      the specific accelerator.
    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
      a CU.
  data source:
  - metric_table:
      id: 201
      title: System Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
        peak: Peak
        pop: Pct of Peak
      metric:
        VALU FLOPs:
          value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
            + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
            / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        VALU IOPs:
          value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))
          unit: GIOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        MFMA FLOPs (BF16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
        MFMA FLOPs (F16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
        MFMA FLOPs (F32):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA FLOPs (F64):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA IOPs (Int8):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GIOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
        Active CUs:
          value: $numActiveCUs
          unit: CUs
          peak: $cu_per_gpu
          pop: ((100 * $numActiveCUs) / $cu_per_gpu)
        SALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        VALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        MFMA Utilization:
          value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
        VMEM Utilization:
          value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          unit: pct
          peak: 100
          pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
        Branch Utilization:
          value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
          peak: 100
          pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
        VALU Active Threads:
          value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          unit: Threads
          peak: 64
          pop: (AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None)) * 1.5625)
        IPC:
          value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit: Instr/cycle
          peak: 5
          pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
        Wavefront Occupancy:
          value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          peak: ($max_waves_per_cu * $cu_per_gpu)
          pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
            * $cu_per_gpu))))
          coll_level: SQ_LEVEL_WAVES
        Theoretical LDS Bandwidth:
          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: (($max_sclk * $cu_per_gpu) * 0.128)
          pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
        LDS Bank Conflicts/Access:
          value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Conflicts/access
          peak: 32
          pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
        vL1D Cache Hit Rate:
          value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: pct
          peak: 100
          pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
        vL1D Cache BW:
          value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $cu_per_gpu)
          pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
        L2 Cache Hit Rate:
          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
        L2 Cache BW:
          value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan))
          pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
            / ((($max_sclk / 1000) * 64) * TO_INT($total_l2_chan)))
        L2-Fabric Read BW:
          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
        L2-Fabric Write BW:
          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
        L2-Fabric Read Latency:
          value: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        L2-Fabric Write Latency:
          value: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        sL1D Cache Hit Rate:
          value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
        sL1D Cache BW:
          value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Hit Rate:
          value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
        L1I BW:
          value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Fetch Latency:
          value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          unit: Cycles
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
@@ -1,315 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 300
  title: Memory Chart
  data source:
    - metric_table:
        id: 301
        title: Memory Chart
        header:
          metric: Metric
          #alias: #alias
          value: Value
          tips: Tips
        metric:
          # ----------------------------------------
          # Instr Buff Block
          #TODO: double check wave_occupancy
          Wavefront Occupancy:
            #alias: wave_occ_
            value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
            coll_level: SQ_LEVEL_WAVES
            tips:
          Wave Life:
            #alias: wave_life_
            value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
            tips:
          # ----------------------------------------
          # Instr Dispatch Block
          SALU:
            #alias: salu_
            value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
            tips:
          SMEM:
            #alias: smem_
            value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
            tips:
          VALU:
            #alias: valu_
            value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
            tips:
          MFMA:
            #alias: mfma_
            value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
            tips:
          VMEM:
            #alias: vmem_
            value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
            tips:
          LDS:
            #alias: lds_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          GWS:
            #alias: gws_
            value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
            tips:
          BR:
            #alias: br_
            value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
            tips:
          # ----------------------------------------
          # Exec Block
          Active CUs:
            #alias: active_cu_
            value: $numActiveCUs
            tips:
          Num CUs:
            #alias: num_cu_
            value: $cu_per_gpu
            tips:
          VGPR:
            #alias: vgpr_
            value: ROUND(AVG(Arch_VGPR), 0)
            tips:
          # Todo: add AGPRs
          SGPR:
            #alias: sgpr_
            value: ROUND(AVG(SGPR), 0)
            tips:
          LDS Allocation:
            #alias: lds_alloc_
            value: ROUND(AVG(LDS_Per_Workgroup), 0)
            tips:
          Scratch Allocation:
            #alias: scratch_alloc_
            value: ROUND(AVG(Scratch_Per_Workitem), 0)
            tips:
          Wavefronts:
            #alias: wavefronts_
            value: ROUND(AVG(SPI_CSN_WAVE), 0)
            tips:
          Workgroups:
            #alias: workgroups_
            value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
            tips:
          # ----------------------------------------
          # LDS Block
          LDS Req:
            #alias: lds_req_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          LDS Util:
            #alias: lds_util_
            value:
              ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
              0)
            tips:
          LDS Latency:
            #alias: lds_lat
            value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
            coll_level: SQ_INST_LEVEL_LDS
            tips:
          # ----------------------------------------
          # Vector L1 Cache Block
          VL1 Rd:
            #alias: vl1_rd_
            value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
            tips:
          VL1 Wr:
            #alias: vl1_wr_
            value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
            tips:
          VL1 Atomic:
            #alias: vl1_atom_
            value:
              ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom)), 0)
            tips:
          VL1 Hit:
            #alias: vl1_hit_
            value:
              ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None )), 0)
            tips:
          VL1 Lat:
            #alias: vl1_lat_
            value:
              ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None)), 0)
            tips:
          VL1 Coalesce:
            #alias: vl1_coales_
            value:
              ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
              * 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
            tips:
          VL1 Stall:
            #alias: vl1_stall_
            value:
              ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)), 0)
            tips:
          VL1_L2 Rd:
            #alias: vl1_l2_rd_
            value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Wr:
            #alias: vl1_l2_wr_
            value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Atomic:
            #alias: vl1_l2_atom_
            value:
              ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom)), 0)
            tips:
          # ----------------------------------------
          # Scalar L1D Cache Block
          VL1D Rd:
            #alias: sl1_rd_
            value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
            tips:
          VL1D Hit:
            #alias: sl1_hit_
            value:
              ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            tips:
          VL1D Lat:
            #alias: sl1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            coll_level: SQC_DCACHE_INFLIGHT_LEVEL
            tips:
          VL1D_L2 Rd:
            #alias: sl1_l2_rd_
            value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
            tips:
          VL1D_L2 Wr:
            #alias: sl1_l2_wr_
            value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
            tips:
          VL1D_L2 Atomic:
            #alias: sl1_l2_atom_
            value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # Instr L1  Cache Block
          IL1 Fetch:
            #alias: il1_fetch_
            value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
            tips:
          IL1 Hit:
            #alias: il1_hit_
            value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
            tips:
          IL1 Lat:
            #alias: il1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
              0) else None)) * 100), 0)
            tips: # ??? coll_level: SQ_IFETCH_LEVEL
          IL1_L2 Rd:
            #alias: il1_l2_req_
            value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # L2 Cache Block(inside)
          L2 Rd:
            #alias: l2_rd_
            value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
            tips:
          L2 Wr:
            #alias: l2_wr_
            value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
            tips:
          L2 Atomic:
            #alias: l2_atom_
            value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
            tips:
          L2 Hit:
            #alias: l2_hit_
            value:
              ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else 0)), 0)
            tips:
          L2 Rd Lat:
            #alias: l2_rd_lat_
            value:
              ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
              0)
            tips:
          L2 Wr Lat:
            #alias: l2_wr_lat_
            value:
              ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
              TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              != 0) else None)), 0)
            tips:
          # ----------------------------------------
          # Fabric Block
          Fabric_L2 Rd:
            #alias: l2_fabric_rd_
            value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Wr:
            #alias: l2_fabric_wr_
            value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Atomic:
            #alias: l2_fabric_atom_
            value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
            tips:
          Fabric Rd Lat:
            #alias: fabric_rd_lat_
            value:
              ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Wr Lat:
            #alias: fabric_wr_lat_
            value:
              ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Atomic Lat:
            #alias: fabric_atom_lat_
            value:
              ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else  0)), 0)
            tips:
          HBM Rd:
            #alias: hbm_rd_
            value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
            tips:
          HBM Wr:
            #alias: hbm_wr_
            value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
            tips:
        comparable: false # for now
        cli_style: mem_chart
@@ -0,0 +1,267 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 300
  title: Memory Chart
  metrics_description:
    Wavefront Occupancy: Wavefronts per active CU.
    Wave Life: Average number of cycles executing a wave.
    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
      unit.
    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
      unit.
    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
      normalization unit.
    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
      memory) per normalization unit.
    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
      and HIP's __shfl instructions) executed per normalization unit.
    GWS: Total number of GDS (global data sync) instructions issued per normalization
      unit.
    BR: Total number of BRANCH instructions issued per normalization unit.
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
    VGPR: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
      this kernel launch.
    Workgroups: The total number of workgroups forming this kernel launch.
    LDS Req: The total number of LDS instructions (including, but not limited to,
      read/write/atomics and HIP's __shfl instructions) executed per normalization
      unit.
    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
      executing instructions (including, but not limited to, load, store, atomic and
      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
      LDS was active over the total CU cycles.
    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
      / acknowledgment) required for an LDS instruction to complete.
    VL1 Rd: The total number of incoming read requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Wr: The total number of incoming write requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Atomic: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
      as the average number of thread-requests generated per instruction divided by
      the ideal number of thread-requests per instruction.
    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
      to issue a request for data to the L2 cache divided by the number of cycles
      where the vL1D is active.
    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
      by the vL1D and must be retrieved from the L2 Cache per normalization
      unit.
    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
      the vL1D to the L2 cache, per normalization unit.
    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
      the L2 cache, per normalization unit. This includes requests for atomics with,
      and without return.
    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
      normalization unit.
    sL1D Hit: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
      unit.
    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    IL1 Fetch: The total number of requests made to the L1I per normalization-unit.
    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
      cache. Calculated as the ratio of the number of L1I requests that hit over the
      number of all L1I requests.
    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization-unit.
    L2 Rd: The total number of read requests to the L2 from all clients.
    L2 Wr: The total number of write requests to the L2 from all clients.
    L2 Atomic: The total number of atomic requests (with and without return) to the
      L2 from all clients.
    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
      over the total number of incoming cache line requests to the L2 cache.
    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive read requests from the L2 Cache. This number also includes
      requests for atomics with return values.
    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive acknowledgement of a write request to the L2 Cache. This
      number also includes requests for atomics without return values.
    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
      per normalization unit.
    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
      Fabric before data was returned to the L2.
    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
      Fabric before a completion acknowledgement was returned to the L2.
    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
      Infinity Fabric before a completion acknowledgement (atomic without return value)
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator''s local HBM, per normalization
      unit.'
  data source:
  - metric_table:
      id: 301
      title: Memory Chart
      header:
        metric: Metric
        value: Value
      metric:
        Wavefront Occupancy:
          value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
            0)
          coll_level: SQ_LEVEL_WAVES
        Wave Life:
          value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
            0)), 0)
        SALU:
          value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
        SMEM:
          value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
        VALU:
          value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
        MFMA:
          value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
        VMEM:
          value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
        LDS:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        GWS:
          value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
        BR:
          value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
        Active CUs:
          value: $numActiveCUs
        Num CUs:
          value: $cu_per_gpu
        VGPR:
          value: ROUND(AVG(Arch_VGPR), 0)
        SGPR:
          value: ROUND(AVG(SGPR), 0)
        LDS Allocation:
          value: ROUND(AVG(LDS_Per_Workgroup), 0)
        Scratch Allocation:
          value: ROUND(AVG(Scratch_Per_Workitem), 0)
        Wavefronts:
          value: ROUND(AVG(SPI_CSN_WAVE), 0)
        Workgroups:
          value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
        LDS Req:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        LDS Util:
          value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu))), 0)
        LDS Latency:
          value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
            != 0) else None)), 0)
          coll_level: SQ_INST_LEVEL_LDS
        VL1 Rd:
          value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
        VL1 Wr:
          value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
        VL1 Atomic:
          value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom)), 0)
        VL1 Hit:
          value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None)), 0)
        VL1 Lat:
          value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None)), 0)
        VL1 Coalesce:
          value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
            * 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
        VL1 Stall:
          value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)), 0)
        VL1_L2 Rd:
          value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
        VL1_L2 Wr:
          value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
        VL1_L2 Atomic:
          value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom)), 0)
        sL1D Rd:
          value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
        sL1D Hit:
          value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
        sL1D Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
          coll_level: SQC_DCACHE_INFLIGHT_LEVEL
        sL1D_L2 Rd:
          value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
        sL1D_L2 Wr:
          value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
        sL1D_L2 Atomic:
          value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
        IL1 Fetch:
          value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
        IL1 Hit:
          value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
        IL1 Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
            != 0) else None)) * 100), 0)
        IL1_L2 Rd:
          value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
        L2 Rd:
          value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
        L2 Wr:
          value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
        L2 Atomic:
          value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
        L2 Hit:
          value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
            ((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
        L2 Rd Lat:
          value: ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)) if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            != 0) else None)), 0)
        L2 Wr Lat:
          value: ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            != 0) else None)), 0)
        Fabric_L2 Rd:
          value: ROUND(AVG((TCC_EA_RDREQ_sum / $denom)), 0)
        Fabric_L2 Wr:
          value: ROUND(AVG((TCC_EA_WRREQ_sum / $denom)), 0)
        Fabric_L2 Atomic:
          value: ROUND(AVG((TCC_EA_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else 0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else 0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else 0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
          value: ROUND(AVG((TCC_EA_WRREQ_DRAM_sum / $denom)), 0)
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
@@ -0,0 +1,9 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 400
  title: Roofline
  metrics_description: {}
  data source:
  - None:
      id: 401
      title: Roofline
@@ -1,8 +0,0 @@
 ---
 Panel Config:
  id: 400
  title: Roofline
  data source:
    - None:
        id: 401
        title: Roofline
@@ -1,135 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  data source:
    - metric_table:
        id: 501
        title: Command Processor Fetcher
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPF Utilization:
            avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF Stall:
            avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-L2 Utilization:
            avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF-L2 Stall:
            avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-UTCL1 Stall:
            avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
    - metric_table:
        id: 502
        title: Packet Processor
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPC Utilization:
            avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC Stall Rate:
            avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPC Packet Decoding Utilization:
            avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: pct
            tips:
          CPC-Workgroup Manager Utilization:
            avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: Pct
            tips:
          CPC-L2 Utilization:
            avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC-UTCL1 Stall:
            avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
          CPC-UTCL2 Utilization:
            avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            unit: pct
            tips:
@@ -0,0 +1,145 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  metrics_description:
    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
      over total cycles counted by the CPF-L2.
    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
      stalled for any reason.
    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
      translation.
    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
      for processing.
    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
      workgroups to the workgroup manager.
    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation.
    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
      translation interface where the CPC was busy doing address translation work.'
  data source:
  - metric_table:
      id: 501
      title: Command processor fetcher (CPF)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPF Utilization:
          avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          unit: pct
        CPF Stall:
          avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          unit: pct
        CPF-L2 Utilization:
          avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          unit: pct
        CPF-L2 Stall:
          avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          unit: pct
        CPF-UTCL1 Stall:
          avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          unit: pct
  - metric_table:
      id: 502
      title: Command processor packet processor (CPC)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPC Utilization:
          avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          unit: pct
        CPC Stall Rate:
          avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          unit: pct
        CPC Packet Decoding Utilization:
          avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-Workgroup Manager Utilization:
          avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-L2 Utilization:
          avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          unit: pct
        CPC-UTCL1 Stall:
          avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          unit: pct
        CPC-UTCL2 Utilization:
          avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
@@ -1,167 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  data source:
    - metric_table:
        id: 601
        title: Workgroup Manager Utilizations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Accelerator Utilization:
            avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            unit: Pct
            tips:
          Scheduler-Pipe Utilization:
            avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            unit: Pct
            tips:
          Workgroup Manager Utilization:
            avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            unit: Pct
            tips:
          Shader Engine Utilization:
            avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            unit: Pct
            tips:
          SIMD Utilization:
            avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Dispatched Workgroups:
            avg: AVG(SPI_CSN_NUM_THREADGROUPS)
            min: MIN(SPI_CSN_NUM_THREADGROUPS)
            max: MAX(SPI_CSN_NUM_THREADGROUPS)
            unit: Workgroups
            tips:
          Dispatched Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          VGPR Writes:
            avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
          SGPR Writes:
            avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
    - metric_table:
        id: 602
        title: Workgroup Manager - Resource Allocation
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Not-scheduled Rate (Workgroup Manager):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Not-scheduled Rate (Scheduler-Pipe):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Scheduler-Pipe Stall Rate:
            avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            unit: Pct
            tips:
          Scratch Stall Rate:
            avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            unit: Pct
            tips:
          Insufficient SIMD Waveslots:
            avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD VGPRs:
            avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD SGPRs:
            avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU LDS:
            avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU Barriers:
            avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Workgroup Limit:
            avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Wavefront Limit:
            avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
@@ -0,0 +1,201 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  metrics_description:
    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
      was actively doing any work.
    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
      kernel where the scheduler-pipes were actively doing any work.
    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
      manager was actively doing any work.
    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
      where any CU in a shader-engine was actively doing any work, normalized over
      all shader-engines. Low values (much less than 100%) indicate that the accelerator
      was not fully saturated by the kernel, or a potential load-imbalance issue.
    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
      on a CU was actively doing any work, summed over all CUs. Low values (less than
      100%) indicate that the accelerator was not fully saturated by the kernel, or
      a potential load-imbalance issue.
    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
      forming this kernel launch.
    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
      resources.'
    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
      where a workgroup could not be scheduled to a CU due to occupancy limitations
      (like a lack of a CU or SIMD with sufficient resources).
    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
      memory slots. While this can reach up to 100%, note that the actual occupancy
      limitations on a kernel using private memory are typically quite small (for
      example, less than 1% of the total number of waves that can be scheduled to
      an accelerator).
    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
      could not be scheduled to a CU due to lack of available LDS.
    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
      workgroup could not be scheduled to a CU due to lack of available barriers.
    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
      a workgroup could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
      a wavefront could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
      title: Workgroup manager utilizations
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Accelerator Utilization:
          avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          unit: Pct
        Scheduler-Pipe Utilization:
          avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          unit: Pct
        Workgroup Manager Utilization:
          avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          unit: Pct
        Shader Engine Utilization:
          avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          unit: Pct
        SIMD Utilization:
          avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Dispatched Workgroups:
          avg: AVG(SPI_CSN_NUM_THREADGROUPS)
          min: MIN(SPI_CSN_NUM_THREADGROUPS)
          max: MAX(SPI_CSN_NUM_THREADGROUPS)
          unit: Workgroups
        Dispatched Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        VGPR Writes:
          avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
        SGPR Writes:
          avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
  - metric_table:
      id: 602
      title: Workgroup Manager - Resource Allocation
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Not-scheduled Rate (Workgroup Manager):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Not-scheduled Rate (Scheduler-Pipe):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Scheduler-Pipe Stall Rate:
          avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          unit: Pct
        Scratch Stall Rate:
          avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Insufficient SIMD Waveslots:
          avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD VGPRs:
          avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD SGPRs:
          avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU LDS:
          avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU Barriers:
          avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Workgroup Limit:
          avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Wavefront Limit:
          avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
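The stall-rate entries above share one pattern: a counter is scaled to a percentage, divided by busy cycles times a per-GPU unit count, and guarded so that a zero busy-cycle count yields `None` instead of a division error. The following is a minimal sketch of that pattern only, not the profiler's own expression evaluator. The helper names `guarded_pct` and `summarize`, the sample values, and the choice to skip `None` entries when aggregating are all assumptions made for illustration.

```python
from statistics import mean
from typing import Optional, Sequence


def guarded_pct(count: float, busy_cycles: float, units: int,
                scale: float = 100.0) -> Optional[float]:
    """Return scale * count / (busy_cycles * units), or None when busy_cycles is 0."""
    if busy_cycles == 0:
        return None
    return scale * count / (busy_cycles * units)


def summarize(values: Sequence[Optional[float]]) -> dict:
    """Compute avg/min/max over the samples that produced a value.

    Assumption: None entries are skipped; the real tool may treat them differently.
    """
    valid = [v for v in values if v is not None]
    if not valid:
        return {"avg": None, "min": None, "max": None}
    return {"avg": mean(valid), "min": min(valid), "max": max(valid)}


# Hypothetical per-dispatch samples: (SPI_RA_TMP_STALL_CSN, GRBM_SPI_BUSY_PER_XCD)
samples = [(50, 1000), (0, 0), (200, 1000)]
se_per_gpu = 4  # assumed value, for illustration only

rates = [guarded_pct(stall, busy, se_per_gpu) for stall, busy in samples]
stats = summarize(rates)
```

The second sample has zero busy cycles, so its rate is `None` and it does not contribute to the aggregate in this sketch.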
@@ -1,142 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 700
  title: Wavefront
  data source:
    - metric_table:
        id: 701
        title: Wavefront Launch Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Grid Size:
            avg: AVG(Grid_Size)
            min: MIN(Grid_Size)
            max: MAX(Grid_Size)
            unit: Work Items
            tips:
          Workgroup Size:
            avg: AVG(Workgroup_Size)
            min: MIN(Workgroup_Size)
            max: MAX(Workgroup_Size)
            unit: Work Items
            tips:
          Total Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          Saved Wavefronts:
            avg: AVG(SQ_WAVES_SAVED)
            min: MIN(SQ_WAVES_SAVED)
            max: MAX(SQ_WAVES_SAVED)
            unit: Wavefronts
            tips:
          Restored Wavefronts:
            avg: AVG(SQ_WAVES_RESTORED)
            min: MIN(SQ_WAVES_RESTORED)
            max: MAX(SQ_WAVES_RESTORED)
            unit: Wavefronts
            tips:
          VGPRs:
            avg: AVG(Arch_VGPR)
            min: MIN(Arch_VGPR)
            max: MAX(Arch_VGPR)
            unit: Registers
            tips:
          AGPRs:
            avg: AVG(Accum_VGPR)
            min: MIN(Accum_VGPR)
            max: MAX(Accum_VGPR)
            unit: Registers
            tips:
          SGPRs:
            avg: AVG(SGPR)
            min: MIN(SGPR)
            max: MAX(SGPR)
            unit: Registers
            tips:
          LDS Allocation:
            avg: AVG(LDS_Per_Workgroup)
            min: MIN(LDS_Per_Workgroup)
            max: MAX(LDS_Per_Workgroup)
            unit: Bytes
            tips:
          Scratch Allocation:
            avg: AVG(Scratch_Per_Workitem)
            min: MIN(Scratch_Per_Workitem)
            max: MAX(Scratch_Per_Workitem)
            unit: Bytes/Workitem
            tips:
    - metric_table:
        id: 702
        title: Wavefront Runtime Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Kernel Time:
            avg: AVG((End_Timestamp - Start_Timestamp))
            min: MIN((End_Timestamp - Start_Timestamp))
            max: MAX((End_Timestamp - Start_Timestamp))
            unit: ns
            tips:
          Kernel Time (Cycles):
            avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
            unit: Cycle
            tips:
          Instructions per wavefront:
            avg: AVG((SQ_INSTS / SQ_WAVES))
            min: MIN((SQ_INSTS / SQ_WAVES))
            max: MAX((SQ_INSTS / SQ_WAVES))
            unit: Instr/wavefront
            tips:
          Wave Cycles:
            avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
            min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
            max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Dependency Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Issue Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Active Cycles:
            avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Wavefront Occupancy:
            avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            coll_level: SQ_LEVEL_WAVES
            tips:
@@ -0,0 +1,173 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 700
  title: Wavefront
  metrics_description:
    Grid Size: The total number of work-items (or, threads) launched as a part of
      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
      by the total workgroup (or, block) size.
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
      \ should be equivalent to the ceiling of grid size divided by 64."
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
    VGPRs: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    AGPRs: 'The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
      requested by the compiler due to allocation granularity.'
    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Kernel Time: The total duration of the executed kernel.
    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
    Instructions per wavefront: The average number of instructions (of all types)
      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per normalization unit. This is averaged over all wavefronts
      in a kernel dispatch.
     Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
       stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
       memory, etc.) per normalization unit. This is averaged over all wavefronts in
       a kernel dispatch.
    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
      arbitration loss, etc.) per normalization unit. This counter is incremented
      at every cycle by all wavefronts on a CU unable to issue an instruction. As
      such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter because another wave could be
      actively executing while a wave is issue stalled. The sum of this metric, Dependency
      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
      was actively executing instructions per normalization unit. This measurement
      is made on a per-wavefront basis, and may include cycles that another wavefront
      spent actively executing (on another execution unit, for example) or was stalled.
      As such, it is most useful to get a sense of how waves were spending their time,
       rather than identification of a precise limiter. The sum of this metric, Issue
       Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
       metric.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms).'
  data source:
  - metric_table:
      id: 701
      title: Wavefront Launch Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Grid Size:
          avg: AVG(Grid_Size)
          min: MIN(Grid_Size)
          max: MAX(Grid_Size)
          unit: Work Items
        Workgroup Size:
          avg: AVG(Workgroup_Size)
          min: MIN(Workgroup_Size)
          max: MAX(Workgroup_Size)
          unit: Work Items
        Total Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        Saved Wavefronts:
          avg: AVG(SQ_WAVES_SAVED)
          min: MIN(SQ_WAVES_SAVED)
          max: MAX(SQ_WAVES_SAVED)
          unit: Wavefronts
        Restored Wavefronts:
          avg: AVG(SQ_WAVES_RESTORED)
          min: MIN(SQ_WAVES_RESTORED)
          max: MAX(SQ_WAVES_RESTORED)
          unit: Wavefronts
        VGPRs:
          avg: AVG(Arch_VGPR)
          min: MIN(Arch_VGPR)
          max: MAX(Arch_VGPR)
          unit: Registers
        AGPRs:
          avg: AVG(Accum_VGPR)
          min: MIN(Accum_VGPR)
          max: MAX(Accum_VGPR)
          unit: Registers
        SGPRs:
          avg: AVG(SGPR)
          min: MIN(SGPR)
          max: MAX(SGPR)
          unit: Registers
        LDS Allocation:
          avg: AVG(LDS_Per_Workgroup)
          min: MIN(LDS_Per_Workgroup)
          max: MAX(LDS_Per_Workgroup)
          unit: Bytes
        Scratch Allocation:
          avg: AVG(Scratch_Per_Workitem)
          min: MIN(Scratch_Per_Workitem)
          max: MAX(Scratch_Per_Workitem)
          unit: Bytes/Workitem
  - metric_table:
      id: 702
      title: Wavefront Runtime Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Kernel Time:
          avg: AVG((End_Timestamp - Start_Timestamp))
          min: MIN((End_Timestamp - Start_Timestamp))
          max: MAX((End_Timestamp - Start_Timestamp))
          unit: ns
        Kernel Time (Cycles):
          avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
          unit: Cycle
        Instructions per wavefront:
          avg: AVG((SQ_INSTS / SQ_WAVES))
          min: MIN((SQ_INSTS / SQ_WAVES))
          max: MAX((SQ_INSTS / SQ_WAVES))
          unit: Instr/wavefront
        Wave Cycles:
          avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
          min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
          max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
          unit: (Cycles + $normUnit)
        Dependency Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Issue Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Active Cycles:
          avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Wavefront Occupancy:
          avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
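Because the generated panel keeps descriptions in `metrics_description` and formulas in the `metric_table` entries, a simple consistency check is to confirm that every metric listed in a table has a matching description key. The sketch below illustrates that check on an in-memory dictionary shaped like the panel above. The function name `missing_descriptions` and the abbreviated sample panel are hypothetical, and this is not the logic of `utils/split_config.py` or of the repository's own test.

```python
def missing_descriptions(panel: dict) -> list:
    """Return (table id, metric name) pairs that have no entry in metrics_description."""
    described = set(panel.get("metrics_description", {}))
    missing = []
    for source in panel.get("data source", []):
        table = source.get("metric_table", {})
        for name in table.get("metric", {}):
            if name not in described:
                missing.append((table.get("id"), name))
    return missing


# Abbreviated, hypothetical panel with one description deliberately left out.
panel = {
    "id": 700,
    "title": "Wavefront",
    "metrics_description": {
        "Grid Size": "placeholder description",
        "Kernel Time": "placeholder description",
    },
    "data source": [
        {"metric_table": {"id": 701,
                          "metric": {"Grid Size": {}, "Workgroup Size": {}}}},
        {"metric_table": {"id": 702,
                          "metric": {"Kernel Time": {}}}},
    ],
}

result = missing_descriptions(panel)
```

In practice the panel would be loaded from the YAML file with a YAML parser; a plain dictionary is used here to keep the sketch dependency-free.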
@@ -1,267 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
  data source:
    - metric_table:
        id: 1001
        title: Overall Instruction Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          VALU:
            avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            unit: (instr + $normUnit)
            tips:
          VMEM:
            avg: AVG(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
            min: MIN(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
            max: MAX(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
            unit: (instr + $normUnit)
            tips:
          LDS:
            avg: AVG((SQ_INSTS_LDS / $denom))
            min: MIN((SQ_INSTS_LDS / $denom))
            max: MAX((SQ_INSTS_LDS / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA:
            avg: AVG((SQ_INSTS_MFMA / $denom))
            min: MIN((SQ_INSTS_MFMA / $denom))
            max: MAX((SQ_INSTS_MFMA / $denom))
            unit: (instr + $normUnit)
            tips:
          SALU:
            avg: AVG((SQ_INSTS_SALU / $denom))
            min: MIN((SQ_INSTS_SALU / $denom))
            max: MAX((SQ_INSTS_SALU / $denom))
            unit: (instr + $normUnit)
            tips:
          SMEM:
            avg: AVG((SQ_INSTS_SMEM / $denom))
            min: MIN((SQ_INSTS_SMEM / $denom))
            max: MAX((SQ_INSTS_SMEM / $denom))
            unit: (instr + $normUnit)
            tips:
          Branch:
            avg: AVG((SQ_INSTS_BRANCH / $denom))
            min: MIN((SQ_INSTS_BRANCH / $denom))
            max: MAX((SQ_INSTS_BRANCH / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1002
        title: VALU Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          INT32:
            avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
            min: MIN((SQ_INSTS_VALU_INT32 / $denom))
            max: MAX((SQ_INSTS_VALU_INT32 / $denom))
            unit: (instr + $normUnit)
            tips:
          INT64:
            avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
            min: MIN((SQ_INSTS_VALU_INT64 / $denom))
            max: MAX((SQ_INSTS_VALU_INT64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          Conversion:
            avg: AVG((SQ_INSTS_VALU_CVT / $denom))
            min: MIN((SQ_INSTS_VALU_CVT / $denom))
            max: MAX((SQ_INSTS_VALU_CVT / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1003
        title: VMEM Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Global/Generic Instr:
            avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Read:
            avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Write:
            avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Atomic:
            avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Instr:
            avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Read:
            avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Write:
            avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Atomic:
            avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1004
        title: MFMA Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          MFMA-I8:
            avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F16:
            avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-BF16:
            avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F32:
            avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F64:
            avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
@@ -0,0 +1,304 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
  metrics_description:
    VALU: The total number of vector arithmetic logic unit (VALU) operations issued.
      These are the workhorses of the compute unit, and are used to execute a wide
      range of instruction types including floating point operations, non-uniform
      address calculations, transcendental operations, integer operations, shifts,
      conditional evaluation, etc.
    VMEM: The total number of vector memory operations issued. These include most
      loads, stores and atomic operations and all accesses to generic, global, private
      and texture memory.
    LDS: The total number of LDS (also known as shared memory) operations issued.
      These include loads, stores, atomics, and HIP's __shfl operations.
    MFMA: The total number of matrix fused multiply-add instructions issued.
    SALU: The total number of scalar arithmetic logic unit (SALU) operations issued.
      Typically these are used for address calculations, literal constants, and other
      operations that are provably uniform across a wavefront. Although scalar memory
      (SMEM) operations are issued by the SALU, they are counted separately in this
      section.
    SMEM: The total number of scalar memory (SMEM) operations issued. These are typically
      used for loading kernel arguments, base-pointers and loads from HIP's __constant__
      memory.
    Branch: The total number of branch operations issued. These typically consist
      of jump or branch operations and are used to implement control flow.
    INT32: The total number of instructions operating on 32-bit integer operands issued
      to the VALU per normalization unit.
    INT64: The total number of instructions operating on 64-bit integer operands issued
      to the VALU per normalization unit.
    F16-ADD: The total number of addition instructions operating on 16-bit floating-point
      operands issued to the VALU per normalization unit.
    F16-MUL: The total number of multiplication instructions operating on 16-bit floating-point
      operands issued to the VALU per normalization unit.
    F16-FMA: The total number of fused multiply-add instructions operating on 16-bit
      floating-point operands issued to the VALU per normalization unit.
     F16-Trans: The total number of transcendental instructions (such as sqrt) operating
       on 16-bit floating-point operands issued to the VALU per normalization unit.
    F32-ADD: The total number of addition instructions operating on 32-bit floating-point
      operands issued to the VALU per normalization unit.
    F32-MUL: The total number of multiplication instructions operating on 32-bit floating-point
      operands issued to the VALU per normalization unit.
    F32-FMA: The total number of fused multiply-add instructions operating on 32-bit
      floating-point operands issued to the VALU per normalization unit.
    F32-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 32-bit floating-point operands issued to the VALU per normalization unit.
    F64-ADD: The total number of addition instructions operating on 64-bit floating-point
      operands issued to the VALU per normalization unit.
    F64-MUL: The total number of multiplication instructions operating on 64-bit floating-point
      operands issued to the VALU per normalization unit.
    F64-FMA: The total number of fused multiply-add instructions operating on 64-bit
      floating-point operands issued to the VALU per normalization unit.
    F64-Trans: The total number of transcendental instructions (such as sqrt) operating
      on 64-bit floating-point operands issued to the VALU per normalization unit.
    Conversion: "The total number of type conversion instructions (such as converting\
      \ data to or from F32\u2194F64) issued to the VALU per normalization unit."
    Global/Generic Instr: The total number of global & generic memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Read: The total number of global & generic memory read instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Write: The total number of global & generic memory write instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Atomic: The total number of global & generic memory atomic (with
      and without return) instructions executed on all compute units on the accelerator,
      per normalization unit.
    Spill/Stack Instr: The total number of spill/stack memory instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Read: The total number of spill/stack memory read instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Write: The total number of spill/stack memory write instructions executed
      on all compute units on the accelerator, per normalization unit.
    Spill/Stack Atomic: The total number of spill/stack memory atomic (with and without
      return) instructions executed on all compute units on the accelerator, per normalization
       unit. Typically unused, as these memory operations are generally used to implement
       thread-local storage.
    MFMA-I8: The total number of 8-bit integer MFMA instructions issued per normalization
      unit.
    MFMA-F8: The total number of 8-bit floating point MFMA instructions issued per
      normalization unit. This is supported in AMD Instinct MI300 series and later
      only.
    MFMA-F16: The total number of 16-bit floating point MFMA instructions issued per
      normalization unit.
    MFMA-BF16: The total number of 16-bit brain floating point MFMA instructions issued
      per normalization unit.
    MFMA-F32: The total number of 32-bit floating-point MFMA instructions issued per
      normalization unit.
    MFMA-F64: The total number of 64-bit floating-point MFMA instructions issued per
      normalization unit.
  data source:
  - metric_table:
      id: 1001
      title: Overall Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        VALU:
          avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
          min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
          max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
          unit: (instr + $normUnit)
        VMEM:
          avg: AVG(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
          min: MIN(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
          max: MAX(((SQ_INSTS_VMEM - SQ_INSTS_FLAT_LDS_ONLY) / $denom))
          unit: (instr + $normUnit)
        LDS:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
          unit: (instr + $normUnit)
        MFMA:
          avg: AVG((SQ_INSTS_MFMA / $denom))
          min: MIN((SQ_INSTS_MFMA / $denom))
          max: MAX((SQ_INSTS_MFMA / $denom))
          unit: (instr + $normUnit)
        SALU:
          avg: AVG((SQ_INSTS_SALU / $denom))
          min: MIN((SQ_INSTS_SALU / $denom))
          max: MAX((SQ_INSTS_SALU / $denom))
          unit: (instr + $normUnit)
        SMEM:
          avg: AVG((SQ_INSTS_SMEM / $denom))
          min: MIN((SQ_INSTS_SMEM / $denom))
          max: MAX((SQ_INSTS_SMEM / $denom))
          unit: (instr + $normUnit)
        Branch:
          avg: AVG((SQ_INSTS_BRANCH / $denom))
          min: MIN((SQ_INSTS_BRANCH / $denom))
          max: MAX((SQ_INSTS_BRANCH / $denom))
          unit: (instr + $normUnit)
  - metric_table:
      id: 1002
      title: VALU Arithmetic Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        INT32:
          avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
          min: MIN((SQ_INSTS_VALU_INT32 / $denom))
          max: MAX((SQ_INSTS_VALU_INT32 / $denom))
          unit: (instr + $normUnit)
        INT64:
          avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
          min: MIN((SQ_INSTS_VALU_INT64 / $denom))
          max: MAX((SQ_INSTS_VALU_INT64 / $denom))
          unit: (instr + $normUnit)
        F16-ADD:
          avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
          min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
          max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
          unit: (instr + $normUnit)
        F16-MUL:
          avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
          min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
          max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
          unit: (instr + $normUnit)
        F16-FMA:
          avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
          min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
          max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
          unit: (instr + $normUnit)
        F16-Trans:
          avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
          min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
          max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
          unit: (instr + $normUnit)
        F32-ADD:
          avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
          min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
          max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
          unit: (instr + $normUnit)
        F32-MUL:
          avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
          min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
          max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
          unit: (instr + $normUnit)
        F32-FMA:
          avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
          min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
          max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
          unit: (instr + $normUnit)
        F32-Trans:
          avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
          min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
          max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
          unit: (instr + $normUnit)
        F64-ADD:
          avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
          min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
          unit: (instr + $normUnit)
        F64-MUL:
          avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
          min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
          unit: (instr + $normUnit)
        F64-FMA:
          avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
          min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
          unit: (instr + $normUnit)
        F64-Trans:
          avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
          min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
          unit: (instr + $normUnit)
        Conversion:
          avg: AVG((SQ_INSTS_VALU_CVT / $denom))
          min: MIN((SQ_INSTS_VALU_CVT / $denom))
          max: MAX((SQ_INSTS_VALU_CVT / $denom))
          unit: (instr + $normUnit)
  - metric_table:
      id: 1003
      title: VMEM Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Global/Generic Instr:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Read:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Write:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Global/Generic Atomic:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Instr:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Read:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Write:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
        Spill/Stack Atomic:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (instr + $normUnit)
  - metric_table:
      id: 1004
      title: MFMA Arithmetic Instruction Mix
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        MFMA-I8:
          avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
          min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
          unit: (instr + $normUnit)
        MFMA-F16:
          avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
          min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
          unit: (instr + $normUnit)
        MFMA-BF16:
          avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
          min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
          unit: (instr + $normUnit)
        MFMA-F32:
          avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
          min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
          unit: (instr + $normUnit)
        MFMA-F64:
          avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
          min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
          max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
          unit: (instr + $normUnit)
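
The instruction-mix tables above all share one pattern: a hardware counter is divided by `$denom` and then reduced with `AVG`, `MIN`, and `MAX`. The sketch below illustrates that pattern under stated assumptions: each list entry stands for one kernel dispatch, `$denom` is taken to be a per-dispatch wave count, and the helper name `summarize` and all sample numbers are hypothetical. It is not the tool's actual evaluator.

```python
from statistics import mean


def summarize(counter_values, denom_values):
    """Return (avg, min, max) of counter / denom across dispatches.

    Hypothetical helper mirroring AVG(x / $denom), MIN(...), MAX(...).
    """
    per_dispatch = [c / d for c, d in zip(counter_values, denom_values)]
    return mean(per_dispatch), min(per_dispatch), max(per_dispatch)


# Made-up SQ_INSTS_LDS counts for three dispatches, normalized per wave
# (assumption: $denom resolves to the wave count of each dispatch).
sq_insts_lds = [1200, 800, 1000]
waves = [100, 100, 100]

lds_avg, lds_min, lds_max = summarize(sq_insts_lds, waves)
print(f"LDS instr per wave: avg={lds_avg} min={lds_min} max={lds_max}")
```

With a different normalization unit only the denominator changes; the counter expression stays the same, which is why every `unit` field is written as `(instr + $normUnit)`.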
@@ -1,260 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  data source:
    - metric_table:
        id: 1101
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
          VALU FLOPs:
            value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
              / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
              + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
              * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          VALU IOPs:
            value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          MFMA FLOPs (BF16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
          MFMA FLOPs (F32):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA FLOPs (F64):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA IOPs (INT8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP
            peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
            tips:
    - metric_table:
        id: 1102
        title: Pipeline Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          IPC:
            avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            tips:
          IPC (Issued):
            avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
              + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
              / SQ_ACTIVE_INST_ANY))
            unit: Instr/cycle
            tips:
          SALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VMEM Utilization:
            avg: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          Branch Utilization:
            avg: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            min: MIN((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            max: MAX((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            tips:
          VALU Active Threads:
            avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            tips:
          MFMA Utilization:
            avg: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            min: MIN(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            max: MAX(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
            unit: pct
            tips:
          MFMA Instr Cycles:
            avg: AVG(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            min: MIN(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            max: MAX(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA != 0)
              else None))
            unit: cycles/instr
            tips:
          VMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_VMEM
            tips:
          SMEM Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
              else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_SMEM
            tips:
    - metric_table:
        id: 1103
        title: Arithmetic Operations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          FLOPs (Total):
            avg: AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            min: MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            max: MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16) + (512
              * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) /
              $denom))
            unit: (OPs  + $normUnit)
            tips:
          IOPs (Total):
            avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            min: MIN(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8 * 512)) / $denom)
            unit: (OPs  + $normUnit)
            tips:
          F16 OPs:
            avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            min: MIN(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16)) +
              (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512 *
              SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
            unit: (OPs  + $normUnit)
            tips:
          BF16 OPs:
            avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
            unit: (OPs  + $normUnit)
            tips:
          F32 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
              + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) / $denom))
            unit: (OPs  + $normUnit)
            tips:
          F64 OPs:
            avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
            unit: (OPs  + $normUnit)
            tips:
          INT8 OPs:
            avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
            unit: (OPs  + $normUnit)
            tips:
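
As a worked illustration of the `VALU FLOPs` row in the Speed-of-Light table (id 1101), the sketch below evaluates the `value`, `peak`, and `pop` expressions for a single dispatch. Assumptions, none of which come from the config itself: `$max_sclk` is in MHz, timestamps are in nanoseconds (so FLOP per ns reads as GFLOP/s), and every number is made up. Only the F32 counters are non-zero to keep the arithmetic short.

```python
def valu_flops_per_ns(add, mul, trans, fma, start_ns, end_ns):
    """64 lanes per instruction; an FMA counts as two operations."""
    return 64 * ((add + mul + trans) + 2 * fma) / (end_ns - start_ns)


def valu_peak(max_sclk_mhz, cu_per_gpu):
    """Mirror of (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)."""
    return (max_sclk_mhz * cu_per_gpu * 64 * 2) / 1000


# Hypothetical F32 counter values for one dispatch lasting 1000 ns.
value = valu_flops_per_ns(add=1000, mul=1000, trans=0, fma=4000,
                          start_ns=0, end_ns=1000)

# Hypothetical device parameters, chosen only for illustration.
peak = valu_peak(max_sclk_mhz=2100, cu_per_gpu=304)

# Percent of peak, as in the table's "pop" column.
pop = 100 * value / peak
print(f"value={value:.1f} peak={peak:.1f} pop={pop:.3f}%")
```

The MFMA rows follow the same shape, with `512` operations per counted MOPS unit in `value` and a per-datatype factor (`1024` or `256`) in `peak`.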
@@ -0,0 +1,316 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1100
  title: Compute Units - Compute Pipeline
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating-point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating-point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating-point MFMA operations executed
      per second. Note: this does not include any 16-bit floating-point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating-point MFMA operations executed
      per second. Note: this does not include any 32-bit floating-point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating-point MFMA operations executed
      per second. Note: this does not include any 64-bit floating-point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (INT8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles.
    IPC (Issued): The ratio of the total number of (non-internal) instructions issued
      over the number of cycles where the scheduler was actively working on issuing
      instructions.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU / SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA unit was busy over the total CU cycles.
    MFMA Instruction Cycles: The average duration of MFMA instructions in this kernel
      in cycles. Computed as the ratio of the total number of cycles the MFMA unit
      was busy over the total number of MFMA instructions.
    VMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for a VMEM instruction to complete.
    SMEM Latency: The average number of round-trip cycles (that is, from issue to
      data return / acknowledgment) required for an SMEM instruction to complete.
    FLOPs (Total): The total number of floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    IOPs (Total): The total number of integer operations executed on either the VALU
      or MFMA units, per normalization unit.
    F16 OPs: The total number of 16-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    BF16 OPs: The total number of 16-bit brain floating-point operations executed
      on either the VALU or MFMA units, per normalization unit.
    F32 OPs: The total number of 32-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    F64 OPs: The total number of 64-bit floating-point operations executed on either
      the VALU or MFMA units, per normalization unit.
    INT8 OPs: The total number of 8-bit integer operations executed on either the
      VALU or MFMA units, per normalization unit.
  data source:
  - metric_table:
      id: 1101
      title: Compute Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
        peak: Peak
        pop: Pct of Peak
      metric:
        VALU FLOPs:
          value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
            + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
            / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        VALU IOPs:
          value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))
          unit: GIOP
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        MFMA FLOPs (BF16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
        MFMA FLOPs (F16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
        MFMA FLOPs (F32):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA FLOPs (F64):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA IOPs (INT8):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GIOP
          peak: ((($max_sclk * $cu_per_gpu) * 1024) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 1024) / 1000))
  - metric_table:
      id: 1102
      title: Pipeline Statistics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        IPC:
          avg: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          min: MIN((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          max: MAX((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit: Instr/cycle
        IPC (Issued):
          avg: AVG(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          min: MIN(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          max: MAX(((((((((SQ_INSTS_VALU + SQ_INSTS_VMEM) + SQ_INSTS_SALU) + SQ_INSTS_SMEM))
            + SQ_INSTS_BRANCH) + SQ_INSTS_SENDMSG) + SQ_INSTS_VSKIPPED  + SQ_INSTS_LDS)
            / SQ_ACTIVE_INST_ANY))
          unit: Instr/cycle
        SALU Utilization:
          avg: AVG((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          min: MIN((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          max: MAX((((100 * SQ_ACTIVE_INST_SCA) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
        VALU Utilization:
          avg: AVG((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          min: MIN((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          max: MAX((((100 * SQ_ACTIVE_INST_VALU) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
        VMEM Utilization:
          avg: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          min: MIN((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          max: MAX((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          unit: pct
        Branch Utilization:
          avg: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          min: MIN((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          max: MAX((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
        VALU Active Threads:
          avg: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          min: MIN(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          max: MAX(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          unit: Threads
        MFMA Utilization:
          avg: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
          min: MIN(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
          max: MAX(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / ((4 * $cu_per_gpu) * $GRBM_GUI_ACTIVE_PER_XCD)))
          unit: pct
        MFMA Instruction Cycles:
          avg: AVG(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
            0) else None))
          min: MIN(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
            0) else None))
          max: MAX(((SQ_VALU_MFMA_BUSY_CYCLES / SQ_INSTS_MFMA) if (SQ_INSTS_MFMA !=
            0) else None))
          unit: cycles/instr
        VMEM Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
            else None))
          min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
            else None))
          max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_VMEM) if (SQ_INSTS_VMEM != 0)
            else None))
          unit: Cycles
          coll_level: SQ_INST_LEVEL_VMEM
        SMEM Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
            else None))
          min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
            else None))
          max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_SMEM) if (SQ_INSTS_SMEM != 0)
            else None))
          unit: Cycles
          coll_level: SQ_INST_LEVEL_SMEM
  - metric_table:
      id: 1103
      title: Arithmetic Operations
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        FLOPs (Total):
          avg: AVG((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
            + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
            + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
          min: MIN((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
            + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
            + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
          max: MAX((((((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (SQ_INSTS_VALU_FMA_F16 * 2))) + ((512 * SQ_INSTS_VALU_MFMA_MOPS_F16)
            + (512 * SQ_INSTS_VALU_MFMA_MOPS_BF16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (SQ_INSTS_VALU_FMA_F32
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32)) + (64 * (((SQ_INSTS_VALU_ADD_F64
            + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (SQ_INSTS_VALU_FMA_F64
            * 2)))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64)) / $denom))
          unit: (OPs  + $normUnit)
        IOPs (Total):
          avg: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
          min: MIN(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
          max: MAX(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) + (SQ_INSTS_VALU_MFMA_MOPS_I8
            * 512)) / $denom)
          unit: (OPs  + $normUnit)
        F16 OPs:
          avg: AVG(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
          min: MIN(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
          max: MAX(((((((64 * SQ_INSTS_VALU_ADD_F16) + (64 * SQ_INSTS_VALU_MUL_F16))
            + (64 * SQ_INSTS_VALU_TRANS_F16)) + (128 * SQ_INSTS_VALU_FMA_F16)) + (512
            * SQ_INSTS_VALU_MFMA_MOPS_F16)) / $denom))
          unit: (OPs  + $normUnit)
        BF16 OPs:
          avg: AVG(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          min: MIN(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          max: MAX(((512 * SQ_INSTS_VALU_MFMA_MOPS_BF16) / $denom))
          unit: (OPs  + $normUnit)
        F32 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
          min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32)
            + (SQ_INSTS_VALU_FMA_F32 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F32))
            / $denom))
          unit: (OPs  + $normUnit)
        F64 OPs:
          avg: AVG((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
          min: MIN((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
          max: MAX((((64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (SQ_INSTS_VALU_FMA_F64 * 2))) + (512 * SQ_INSTS_VALU_MFMA_MOPS_F64))
            / $denom))
          unit: (OPs  + $normUnit)
        INT8 OPs:
          avg: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          min: MIN(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          max: MAX(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / $denom))
          unit: (OPs  + $normUnit)
@@ -1,118 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
  data source:
    - metric_table:
        id: 1201
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Utilization:
            value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: Pct of Peak
            tips:
          Access Rate:
            value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: Pct of Peak
            tips:
          Theoretical Bandwidth:
            value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
            unit: Pct of Peak
            tips:
          Bank Conflict Rate:
            value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1202
        title: LDS Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          LDS Instrs:
            avg: AVG((SQ_INSTS_LDS / $denom))
            min: MIN((SQ_INSTS_LDS / $denom))
            max: MAX((SQ_INSTS_LDS / $denom))
            unit: (Instr  + $normUnit)
            tips:
          Theoretical Bandwidth:
            avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          LDS Latency:
            avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None))
            unit: Cycles
            coll_level: SQ_INST_LEVEL_LDS
            tips:
          Bank Conflicts/Access:
            avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Conflicts/Access
            tips:
          Index Accesses:
            avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
            min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
            max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Atomic Return Cycles:
            avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
            min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
            max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Bank Conflict:
            avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
            min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
            max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Addr Conflict:
            avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
            min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
            max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Unaligned Stall:
            avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
            min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
            max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Mem Violations:
            avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
            min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
            max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
            unit: (Accesses + $normUnit)
            tips:
@@ -0,0 +1,141 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1200
  title: Local Data Share (LDS)
  metrics_description:
    Utilization: Indicates what percent of the kernel's duration the LDS was actively
      executing instructions (including, but not limited to, load, store, atomic and
      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
      LDS was active over the total CU cycles.
    Access Rate: Indicates the percentage of SIMDs in the VALU actively issuing LDS
      instructions, averaged over the lifetime of the kernel. Calculated as the ratio
      of the total number of cycles spent by the scheduler issuing LDS instructions
      over the total CU cycles.
    Theoretical Bandwidth: Indicates the maximum number of bytes that could have been
      loaded from, stored to, or atomically updated in the LDS per normalization unit.
      Does not take into account the execution mask of the wavefront when the instruction
      was executed.
    Bank Conflict Rate: Indicates the percentage of active LDS cycles that were spent
      servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing
      bank conflicts over the number of LDS cycles that would have been required to
      move the same amount of data in an uncontended access.
    LDS Instructions: The total number of LDS instructions (including, but not limited
      to, read/write/atomics and HIP's __shfl instructions) executed per normalization
      unit.
    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
      / acknowledgment) required for an LDS instruction to complete.
    Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS scheduler
      due to bank conflicts (as determined by the conflict resolution hardware) to
      the base number of cycles that would be spent in the LDS scheduler in a completely
      uncontended case. This is the unnormalized form of the Bank Conflict Rate.
    Index Accesses: The total number of cycles spent in the LDS scheduler over all
      operations per normalization unit.
    Atomic Return Cycles: The total number of cycles spent on LDS atomics with return
      per normalization unit.
    Bank Conflict: The total number of cycles spent in the LDS scheduler due to bank
      conflicts (as determined by the conflict resolution hardware) per normalization
      unit.
    Addr Conflict: The total number of cycles spent in the LDS scheduler due to address
      conflicts (as determined by the conflict resolution hardware) per normalization
      unit.
    Unaligned Stall: The total number of cycles spent in the LDS scheduler due to
      stalls from non-dword aligned addresses per normalization unit.
    Mem Violations: "The total number of out-of-bounds accesses made to the LDS, per\
      \ normalization unit. This is unused and expected to be zero in most configurations\
      \ for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1201
      title: LDS Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Utilization:
          value: AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
        Access Rate:
          value: AVG(((200 * SQ_ACTIVE_INST_LDS) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: Pct of Peak
        Theoretical Bandwidth:
          value: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
          unit: Pct of Peak
        Bank Conflict Rate:
          value: AVG((((SQ_LDS_BANK_CONFLICT * 3.125) / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1202
      title: LDS Statistics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        LDS Instructions:
          avg: AVG((SQ_INSTS_LDS / $denom))
          min: MIN((SQ_INSTS_LDS / $denom))
          max: MAX((SQ_INSTS_LDS / $denom))
          unit: (Instr  + $normUnit)
        Theoretical Bandwidth:
          avg: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          min: MIN(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          max: MAX(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / $denom))
          unit: (Bytes  + $normUnit)
        LDS Latency:
          avg: AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          min: MIN(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          max: MAX(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else
            None))
          unit: Cycles
          coll_level: SQ_INST_LEVEL_LDS
        Bank Conflicts/Access:
          avg: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          min: MIN(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          max: MAX(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Conflicts/Access
        Index Accesses:
          avg: AVG((SQ_LDS_IDX_ACTIVE / $denom))
          min: MIN((SQ_LDS_IDX_ACTIVE / $denom))
          max: MAX((SQ_LDS_IDX_ACTIVE / $denom))
          unit: (Cycles  + $normUnit)
        Atomic Return Cycles:
          avg: AVG((SQ_LDS_ATOMIC_RETURN / $denom))
          min: MIN((SQ_LDS_ATOMIC_RETURN / $denom))
          max: MAX((SQ_LDS_ATOMIC_RETURN / $denom))
          unit: (Cycles  + $normUnit)
        Bank Conflict:
          avg: AVG((SQ_LDS_BANK_CONFLICT / $denom))
          min: MIN((SQ_LDS_BANK_CONFLICT / $denom))
          max: MAX((SQ_LDS_BANK_CONFLICT / $denom))
          unit: (Cycles  + $normUnit)
        Addr Conflict:
          avg: AVG((SQ_LDS_ADDR_CONFLICT / $denom))
          min: MIN((SQ_LDS_ADDR_CONFLICT / $denom))
          max: MAX((SQ_LDS_ADDR_CONFLICT / $denom))
          unit: (Cycles  + $normUnit)
        Unaligned Stall:
          avg: AVG((SQ_LDS_UNALIGNED_STALL / $denom))
          min: MIN((SQ_LDS_UNALIGNED_STALL / $denom))
          max: MAX((SQ_LDS_UNALIGNED_STALL / $denom))
          unit: (Cycles  + $normUnit)
        Mem Violations:
          avg: AVG((SQ_LDS_MEM_VIOLATIONS / $denom))
          min: MIN((SQ_LDS_MEM_VIOLATIONS / $denom))
          max: MAX((SQ_LDS_MEM_VIOLATIONS / $denom))
          unit: (Accesses + $normUnit)
@@ -1,105 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1300
  title: Instruction Cache
  data source:
    - metric_table:
        id: 1301
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
          Cache Hit Rate:
            value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            unit: Pct of Peak
            tips:
          L1I-L2 Bandwidth:
            value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1302
        title: Instruction Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Req:
            avg: AVG((SQC_ICACHE_REQ / $denom))
            min: MIN((SQC_ICACHE_REQ / $denom))
            max: MAX((SQC_ICACHE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Hits:
            avg: AVG((SQC_ICACHE_HITS / $denom))
            min: MIN((SQC_ICACHE_HITS / $denom))
            max: MAX((SQC_ICACHE_HITS / $denom))
            unit: (Hits  + $normUnit)
            tips:
          Misses - Non Duplicated:
            avg: AVG((SQC_ICACHE_MISSES / $denom))
            min: MIN((SQC_ICACHE_MISSES / $denom))
            max: MAX((SQC_ICACHE_MISSES / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Misses - Duplicated:
            avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
              + SQC_ICACHE_MISSES_DUPLICATE)))
            unit: pct
            tips:
          Instruction Fetch Latency:
            avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            unit: Cycles
            coll_level: SQ_IFETCH_LEVEL
            tips:
    - metric_table:
        id: 1303
        title: Instruction Cache - L2 Interface
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          L1I-L2 Bandwidth:
            avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
            min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
            max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
            unit: (Bytes + $normUnit)
            tips:
@@ -0,0 +1,106 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1300
  title: Instruction Cache
  metrics_description:
    Bandwidth: The number of bytes looked up in the L1I cache, as a percent of the
      peak theoretical bandwidth. Calculated as the ratio of L1I requests over the
      total L1I cycles.
    Cache Hit Rate: The percent of L1I requests that hit [#l1i-cache]_ on a previously
      loaded line in the cache. Calculated as the ratio of the number of L1I requests
      that hit over the number of all L1I requests.
    L1I-L2 Bandwidth: "The percent of the peak theoretical L1I \u2192 L2 cache request\
      \ bandwidth achieved. Calculated as the ratio of the total number of requests\
      \ from the L1I to the L2 cache over the total L1I-L2 interface cycles."
    Req: The total number of requests made to the L1I per normalization-unit.
    Hits: The total number of L1I requests that hit on a previously loaded cache line,
      per normalization-unit.
    Misses - Non Duplicated: The total number of L1I requests that missed on a cache
      line that were not already pending due to another request, per normalization-unit.
    Misses - Duplicated: The total number of L1I requests that missed on a cache line
      that were already pending due to another request, per normalization-unit.
    Instruction Fetch Latency: The average number of cycles spent to fetch instructions
      to a CU.
  data source:
  - metric_table:
      id: 1301
      title: L1I Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Bandwidth:
          value: AVG(((SQC_ICACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
        Cache Hit Rate:
          value: AVG(((SQC_ICACHE_HITS * 100) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: Pct of Peak
        L1I-L2 Bandwidth:
          value: AVG(((SQC_TC_INST_REQ * 100000) / (2 * ($max_sclk * $sqc_per_gpu)
            * (End_Timestamp - Start_Timestamp))))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1302
      title: L1I cache accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Req:
          avg: AVG((SQC_ICACHE_REQ / $denom))
          min: MIN((SQC_ICACHE_REQ / $denom))
          max: MAX((SQC_ICACHE_REQ / $denom))
          unit: (Req  + $normUnit)
        Hits:
          avg: AVG((SQC_ICACHE_HITS / $denom))
          min: MIN((SQC_ICACHE_HITS / $denom))
          max: MAX((SQC_ICACHE_HITS / $denom))
          unit: (Hits  + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_ICACHE_MISSES / $denom))
          min: MIN((SQC_ICACHE_MISSES / $denom))
          max: MAX((SQC_ICACHE_MISSES / $denom))
          unit: (Misses  + $normUnit)
        Misses - Duplicated:
          avg: AVG((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_ICACHE_MISSES_DUPLICATE / $denom))
          unit: (Misses  + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          min: MIN(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          max: MAX(((100 * SQC_ICACHE_HITS) / ((SQC_ICACHE_HITS + SQC_ICACHE_MISSES)
            + SQC_ICACHE_MISSES_DUPLICATE)))
          unit: pct
        Instruction Fetch Latency:
          avg: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          min: MIN((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          max: MAX((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          unit: Cycles
          coll_level: SQ_IFETCH_LEVEL
  - metric_table:
      id: 1303
      title: L1I <-> L2 interface
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        L1I-L2 Bandwidth:
          avg: AVG(((SQC_TC_INST_REQ * 64) / $denom))
          min: MIN(((SQC_TC_INST_REQ * 64) / $denom))
          max: MAX(((SQC_TC_INST_REQ * 64) / $denom))
          unit: (Bytes + $normUnit)
@@ -1,171 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  data source:
    - metric_table:
        id: 1401
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu)
              * (End_Timestamp - Start_Timestamp))))
            unit: Pct of Peak
            tips:
          Cache Hit Rate:
            value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            unit: Pct of Peak
            tips:
          sL1D-L2 BW:
            value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 100000)
                        / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1402
        title: Scalar L1D Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Req:
            avg: AVG((SQC_DCACHE_REQ / $denom))
            min: MIN((SQC_DCACHE_REQ / $denom))
            max: MAX((SQC_DCACHE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Hits:
            avg: AVG((SQC_DCACHE_HITS / $denom))
            min: MIN((SQC_DCACHE_HITS / $denom))
            max: MAX((SQC_DCACHE_HITS / $denom))
            unit: (Req  + $normUnit)
            tips:
          Misses - Non Duplicated:
            avg: AVG((SQC_DCACHE_MISSES / $denom))
            min: MIN((SQC_DCACHE_MISSES / $denom))
            max: MAX((SQC_DCACHE_MISSES / $denom))
            unit: (Req  + $normUnit)
            tips:
          Misses- Duplicated:
            avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
              + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
            unit: pct
            tips:
          Read Req (Total):
            avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
              + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((SQC_DCACHE_ATOMIC / $denom))
            min: MIN((SQC_DCACHE_ATOMIC / $denom))
            max: MAX((SQC_DCACHE_ATOMIC / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (1 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (2 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (4 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (8 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req (16 DWord):
            avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
            min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
            max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1403
        title: Scalar L1D Cache - L2 Interface
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          sL1D-L2 BW:
            avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ) * 64)) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          Read Req:
            avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
            min: MIN((SQC_TC_DATA_READ_REQ / $denom))
            max: MAX((SQC_TC_DATA_READ_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
            min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
            max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
            min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
            max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
            unit: (Req  + $normUnit)
            tips:
          Stall Cycles:
            avg: AVG((SQC_TC_STALL / $denom))
            min: MIN((SQC_TC_STALL / $denom))
            max: MAX((SQC_TC_STALL / $denom))
            unit: (Cycles  + $normUnit)
            tips:
@@ -0,0 +1,186 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1400
  title: Scalar L1 Data Cache
  metrics_description:
    Bandwidth: The number of bytes looked up in the sL1D cache, as a percent of the
      peak theoretical bandwidth. Calculated as the ratio of sL1D requests over the
      total sL1D cycles.
    Cache Hit Rate: Indicates the percent of sL1D requests that hit on a previously
      loaded line in the cache. The ratio of the number of sL1D requests that hit over
      the number of all sL1D requests.
    sL1D-L2 BW: "The total number of bytes read from, written to, or atomically updated\
      \ across the sL1D\u2194L2 interface, per normalization unit. Note that sL1D\
      \ writes and atomics are typically unused on current CDNA accelerators, so in\
      \ the majority of cases this can be interpreted as an sL1D\u2192L2 read bandwidth."
    Req: The total number of requests, of any size or type, made to the sL1D per normalization
      unit.
    Hits: The total number of sL1D requests that hit on a previously loaded cache
      line, per normalization unit.
    Misses - Non Duplicated: The total number of sL1D requests that missed on a cache
      line that was not already pending due to another request, per normalization
      unit.
    Misses- Duplicated: The total number of sL1D requests that missed on a cache line
      that was already pending due to another request, per normalization unit.
    Read Req (Total): The total number of sL1D read requests of any size, per normalization
      unit.
    Atomic Req: The total number of atomic requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    Read Req (1 DWord): The total number of sL1D read requests made for a single dword
      of data (4B), per normalization unit.
    Read Req (2 DWord): The total number of sL1D read requests made for two dwords
      of data (8B), per normalization unit.
    Read Req (4 DWord): The total number of sL1D read requests made for four dwords
      of data (16B), per normalization unit.
    Read Req (8 DWord): The total number of sL1D read requests made for eight dwords
      of data (32B), per normalization unit.
    Read Req (16 DWord): The total number of sL1D read requests made for sixteen
      dwords of data (64B), per normalization unit.
    Read Req: The total number of read requests from sL1D to the L2 per normalization
      unit.
    Write Req: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    Stall Cycles: "The total number of cycles the sL1D\u2194L2 interface was stalled,\
      \ per normalization unit."
  data source:
  - metric_table:
      id: 1401
      title: Scalar L1D Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Bandwidth:
          value: AVG(((SQC_DCACHE_REQ * 100000) / (($max_sclk * $sqc_per_gpu) * (End_Timestamp
            - Start_Timestamp))))
          unit: Pct of Peak
        Cache Hit Rate:
          value: AVG((((SQC_DCACHE_HITS * 100) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE)) if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: Pct of Peak
        sL1D-L2 BW:
          value: AVG(((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 100000) / (2 * ($max_sclk * $sqc_per_gpu) * (End_Timestamp - Start_Timestamp)))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1402
      title: Scalar L1D cache accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Req:
          avg: AVG((SQC_DCACHE_REQ / $denom))
          min: MIN((SQC_DCACHE_REQ / $denom))
          max: MAX((SQC_DCACHE_REQ / $denom))
          unit: (Req  + $normUnit)
        Hits:
          avg: AVG((SQC_DCACHE_HITS / $denom))
          min: MIN((SQC_DCACHE_HITS / $denom))
          max: MAX((SQC_DCACHE_HITS / $denom))
          unit: (Req  + $normUnit)
        Misses - Non Duplicated:
          avg: AVG((SQC_DCACHE_MISSES / $denom))
          min: MIN((SQC_DCACHE_MISSES / $denom))
          max: MAX((SQC_DCACHE_MISSES / $denom))
          unit: (Req  + $normUnit)
        Misses- Duplicated:
          avg: AVG((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          min: MIN((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          max: MAX((SQC_DCACHE_MISSES_DUPLICATE / $denom))
          unit: (Req  + $normUnit)
        Cache Hit Rate:
          avg: AVG((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          min: MIN((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          max: MAX((((100 * SQC_DCACHE_HITS) / ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE)) if (((SQC_DCACHE_HITS + SQC_DCACHE_MISSES)
            + SQC_DCACHE_MISSES_DUPLICATE) != 0) else None))
          unit: pct
        Read Req (Total):
          avg: AVG((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          min: MIN((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          max: MAX((((((SQC_DCACHE_REQ_READ_1 + SQC_DCACHE_REQ_READ_2) + SQC_DCACHE_REQ_READ_4)
            + SQC_DCACHE_REQ_READ_8) + SQC_DCACHE_REQ_READ_16) / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((SQC_DCACHE_ATOMIC / $denom))
          min: MIN((SQC_DCACHE_ATOMIC / $denom))
          max: MAX((SQC_DCACHE_ATOMIC / $denom))
          unit: (Req  + $normUnit)
        Read Req (1 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_1 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_1 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_1 / $denom))
          unit: (Req  + $normUnit)
        Read Req (2 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_2 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_2 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_2 / $denom))
          unit: (Req  + $normUnit)
        Read Req (4 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_4 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_4 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_4 / $denom))
          unit: (Req  + $normUnit)
        Read Req (8 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_8 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_8 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_8 / $denom))
          unit: (Req  + $normUnit)
        Read Req (16 DWord):
          avg: AVG((SQC_DCACHE_REQ_READ_16 / $denom))
          min: MIN((SQC_DCACHE_REQ_READ_16 / $denom))
          max: MAX((SQC_DCACHE_REQ_READ_16 / $denom))
          unit: (Req  + $normUnit)
  - metric_table:
      id: 1403
      title: Scalar L1D Cache - L2 Interface
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        sL1D-L2 BW:
          avg: AVG(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          min: MIN(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          max: MAX(((((SQC_TC_DATA_READ_REQ + SQC_TC_DATA_WRITE_REQ + SQC_TC_DATA_ATOMIC_REQ)
            * 64)) / $denom))
          unit: (Bytes + $normUnit)
        Read Req:
          avg: AVG((SQC_TC_DATA_READ_REQ / $denom))
          min: MIN((SQC_TC_DATA_READ_REQ / $denom))
          max: MAX((SQC_TC_DATA_READ_REQ / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((SQC_TC_DATA_WRITE_REQ / $denom))
          min: MIN((SQC_TC_DATA_WRITE_REQ / $denom))
          max: MAX((SQC_TC_DATA_WRITE_REQ / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((SQC_TC_DATA_ATOMIC_REQ / $denom))
          min: MIN((SQC_TC_DATA_ATOMIC_REQ / $denom))
          max: MAX((SQC_TC_DATA_ATOMIC_REQ / $denom))
          unit: (Req  + $normUnit)
        Stall Cycles:
          avg: AVG((SQC_TC_STALL / $denom))
          min: MIN((SQC_TC_STALL / $denom))
          max: MAX((SQC_TC_STALL / $denom))
          unit: (Cycles  + $normUnit)
@@ -1,174 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
  data source:
    - metric_table:
        id: 1501
        title: Address Processing Unit
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Address Processing Unit Busy:
            avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Address Stall:
            avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Data Stall:
            avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Data-Processor → Address Stall:
            avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Total Instructions:
            avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
            min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
            max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Instructions:
            avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Read Instructions:
            avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Write Instructions:
            avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Global/Generic Atomic Instructions:
            avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Instructions:
            avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Read Instructions:
            avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Write Instructions:
            avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Atomic Instructions:
            avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Spill/Stack Total Cycles:
            avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Spill/Stack Coalesced Read:
            avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
          Spill/Stack Coalesced Write:
            avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
            unit: (Cycles  + $normUnit)
            tips:
    - metric_table:
        id: 1502
        title: Data-Return Path
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Data-Return Busy:
            avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Cache RAM → Data-Return Stall:
            avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Workgroup manager → Data-Return Stall:
            avg: AVG(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            min: MIN(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            max: MAX(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            tips:
          Coalescable Instructions:
            avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Read Instructions:
            avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
              / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Write Instructions:
            avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
            min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
            max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
          Atomic Instructions:
            avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
            min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
            max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
            unit: (Instructions  + $normUnit)
            tips:
@@ -0,0 +1,248 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1500
  title: Address Processing Unit and Data Return Path (TA/TD)
  metrics_description:
    Address Processing Unit Busy: Percent of the total CU cycles the address processor
      was busy.
    Address Stall: Percent of the total CU cycles the address processor was stalled
      from sending address requests further into the vL1D pipeline.
    Data Stall: Percent of the total CU cycles the address processor was stalled from
      sending write/atomic data further into the vL1D pipeline.
    "Data-Processor \u2192 Address Stall": Percent of total CU cycles the address
      processor was stalled waiting to send command data to the data processor.
    Total Instructions: The total number of memory instructions executed by the address
      processor over all compute units on the accelerator, per normalization unit.
    Global/Generic Instructions: The total number of global & generic memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Global/Generic Read Instructions: The total number of global & generic memory
      read instructions executed on all compute units on the accelerator, per normalization
      unit.
    Global/Generic Write Instructions: The total number of global & generic memory
      write instructions executed on all compute units on the accelerator, per normalization
      unit.
    Global/Generic Atomic Instructions: The total number of global & generic memory
      atomic (with and without return) instructions executed on all compute units
      on the accelerator, per normalization unit.
    Spill/Stack Instructions: The total number of spill/stack memory instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Read Instructions: The total number of spill/stack memory read instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Write Instructions: The total number of spill/stack memory write instructions
      executed on all compute units on the accelerator, per normalization unit.
    Spill/Stack Atomic Instructions: The total number of spill/stack memory atomic
      (with and without return) instructions executed on all compute units on the
      accelerator, per normalization unit. Typically unused, as these memory operations
      are mainly used to implement thread-local storage.
    Spill/Stack Total Cycles: The number of cycles the address processing unit spent
      working on spill/stack instructions, per normalization unit.
    Spill/Stack Coalesced Read: The number of cycles the address processing unit spent
      working on coalesced spill/stack read instructions, per normalization unit.
    Spill/Stack Coalesced Write: The number of cycles the address processing unit
      spent working on coalesced spill/stack write instructions, per normalization
      unit.
    Data-Return Busy: Percent of the total CU cycles the data-return unit was busy
      processing or waiting on data to return to the CU.
    "Cache RAM \u2192 Data-Return Stall": Percent of the total CU cycles the data-return
      unit was stalled on data to be returned from the vL1D Cache RAM.
    "Workgroup manager \u2192 Data-Return Stall": Percent of the total CU cycles the
      data-return unit was stalled by the workgroup manager due to initialization
      of registers as a part of launching new workgroups.
    Coalescable Instructions: The number of instructions submitted to the data-return
      unit by the address processor that were found to be coalescable, per normalization
      unit.
    Read Instructions: The number of read instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack reads in the address processor.
    Write Instructions: The number of store instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack stores in the address processor.
    Atomic Instructions: The number of atomic instructions submitted to the data-return
      unit by the address processor summed over all compute units on the accelerator,
      per normalization unit. This is expected to be the sum of global/generic and
      spill/stack atomics in the address processor.
  data source:
  - metric_table:
      id: 1501
      title: Busy and stall metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Address Processing Unit Busy:
          avg: AVG(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TA_TA_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        Address Stall:
          avg: AVG(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_ADDR_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
        Data Stall:
          avg: AVG(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_DATA_STALLED_BY_TC_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
        "Data-Processor \u2192 Address Stall":
          avg: AVG(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          min: MIN(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          max: MAX(((100 * TA_ADDR_STALLED_BY_TD_CYCLES_sum) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu)))
          unit: pct
        "Sequencer \u2192 TA Address Stall":
          avg: AVG((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
          min: MIN((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
          max: MAX((SQ_VMEM_TA_ADDR_FIFO_FULL / $denom))
          unit: (Cycles + $normUnit)
        "Sequencer \u2192 TA Command Stall":
          avg: AVG((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
          min: MIN((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
          max: MAX((SQ_VMEM_TA_CMD_FIFO_FULL / $denom))
          unit: (Cycles + $normUnit)
        "Sequencer \u2192 TA Data Stall":
          avg: AVG((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
          min: MIN((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
          max: MAX((SQ_VMEM_WR_TA_DATA_FIFO_FULL / $denom))
          unit: (Cycles + $normUnit)
  - metric_table:
      id: 1502
      title: Instruction counts
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Total Instructions:
          avg: AVG((TA_TOTAL_WAVEFRONTS_sum / $denom))
          min: MIN((TA_TOTAL_WAVEFRONTS_sum / $denom))
          max: MAX((TA_TOTAL_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Instructions:
          avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Read Instructions:
          avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Write Instructions:
          avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Global/Generic Atomic Instructions:
          avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Instructions:
          avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Read Instructions:
          avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Write Instructions:
          avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
        Spill/Stack Atomic Instructions:
          avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
          unit: (Instructions  + $normUnit)
  - metric_table:
      id: 1503
      title: Spill and stack metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Spill/Stack Total Cycles:
          avg: AVG((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_TOTAL_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
        Spill/Stack Coalesced Read:
          avg: AVG((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_READ_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
        Spill/Stack Coalesced Write:
          avg: AVG((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          min: MIN((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          max: MAX((TA_BUFFER_COALESCED_WRITE_CYCLES_sum / $denom))
          unit: (Cycles  + $normUnit)
  - metric_table:
      id: 1504
      title: Vector L1 data-return path or Texture Data (TD)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Data-Return Busy:
          avg: AVG(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TD_TD_BUSY_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        "Cache RAM \u2192 Data-Return Stall":
          avg: AVG(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TD_TC_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        "Workgroup manager \u2192 Data-Return Stall":
          avg: AVG(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          min: MIN(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          max: MAX(((100 * TD_SPI_STALL_sum) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
        Coalescable Instructions:
          avg: AVG((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          min: MIN((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          max: MAX((TD_COALESCABLE_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
        Read Instructions:
          avg: AVG((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          min: MIN((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          max: MAX((((TD_LOAD_WAVEFRONT_sum - TD_STORE_WAVEFRONT_sum) - TD_ATOMIC_WAVEFRONT_sum)
            / $denom))
          unit: (Instructions  + $normUnit)
        Write Instructions:
          avg: AVG((TD_STORE_WAVEFRONT_sum / $denom))
          min: MIN((TD_STORE_WAVEFRONT_sum / $denom))
          max: MAX((TD_STORE_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
        Atomic Instructions:
          avg: AVG((TD_ATOMIC_WAVEFRONT_sum / $denom))
          min: MIN((TD_ATOMIC_WAVEFRONT_sum / $denom))
          max: MAX((TD_ATOMIC_WAVEFRONT_sum / $denom))
          unit: (Instructions  + $normUnit)
@@ -1,414 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
  data source:
    - metric_table:
        id: 1601
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Hit rate:
            value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: Pct of Peak
            tips:
          Bandwidth:
            value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
            unit: Pct of Peak
            tips:
          Utilization:
            value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None))
            unit: Pct of Peak
            tips:
          Coalescing:
            value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
              * 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
            unit: Pct of Peak
            tips:
        comparable: false # for now
        cli_style: simple_bar
    - metric_table:
        id: 1602
        title: L1D Cache Stalls (%)
        header:
          metric: Metric
          expr: Expression
          tips: Tips
        metric:
          Stalled on L2 Data:
            expr:
              (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None)
            tips:
          Stalled on L2 Req:
            expr:
              (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
              != 0) else None)
            tips:
          Stalled on Address:
            expr:
              None
            tips:
          Stalled on Data:
            expr:
              None
            tips:
          Stalled on Latency FIFO:
            expr:
              None
            tips:
          Stalled on Request FIFO:
            expr:
              None
            tips:
          Stalled on Read Return:
            expr:
              None
            tips:
          Tag RAM Stall (Read):
            expr:
              (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
          Tag RAM Stall (Write):
            expr:
              (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
          Tag RAM Stall (Atomic):
            expr:
              (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)
            tips:
        cli_style: simple_box
    - metric_table:
        id: 1603
        title: L1D Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Total Req:
            avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
            min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
            max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req:
            avg: AVG((TCP_TOTAL_READ_sum / $denom))
            min: MIN((TCP_TOTAL_READ_sum / $denom))
            max: MAX((TCP_TOTAL_READ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
            min: MIN((TCP_TOTAL_WRITE_sum / $denom))
            max: MAX((TCP_TOTAL_WRITE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache BW:
            avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          Cache Hit Rate:
            avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
               None))
            min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
               None))
            max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
               None))
            unit: pct
            tips:
          Cache Accesses:
            avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hits:
            avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          Invalidations:
            avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
            unit: (Req + $normUnit)
            tips:
          L1-L2 BW:
            avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
              + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
            unit: (Bytes + $normUnit)
            tips:
          L1-L2 Read:
            avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1-L2 Write:
            avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1-L2 Atomic:
            avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom))
            unit: (Req  + $normUnit)
            tips:
          L1 Access Latency:
            avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None))
            unit: Cycles
            tips:
          L1-L2 Read Latency:
            avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None))
            unit: Cycles
            tips:
          L1-L2 Write Latency:
            avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0) else
              None))
            unit: Cycles
            tips:
    - metric_table:
        id: 1604
        title: L1D - L2 Transactions
        header:
          metric: Metric
          xfer: Xfer
          coherency: Coherency
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          NC - Read:
            xfer: Read
            coherency: NC
            avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Read:
            xfer: Read
            coherency: UC
            avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Read:
            xfer: Read
            coherency: CC
            avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Read:
            xfer: Read
            coherency: RW
            avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Write:
            xfer: Write
            coherency: RW
            avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          NC - Write:
            xfer: Write
            coherency: NC
            avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Write:
            xfer: Write
            coherency: UC
            avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Write:
            xfer: Write
            coherency: CC
            avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          NC - Atomic:
            xfer: Atomic
            coherency: NC
            avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC - Atomic:
            xfer: Atomic
            coherency: UC
            avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC - Atomic:
            xfer: Atomic
            coherency: CC
            avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW - Atomic:
            xfer: Atomic
            coherency: RW
            avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1605
        title: L1D Addr Translation
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          units: Units
          tips: Tips
        metric:
          Req:
            avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
            min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
            max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Inflight Req:
            avg:  None # Missing perfmon
            min:  None # Missing perfmon
            max:  None # Missing perfmon
            units: (Req + $normUnit)
            tips:
          Hit Ratio:
            avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum) if
              (TCP_UTCL1_REQUEST_sum != 0) else None))
            units: pct
            tips:
          Hits:
            avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Translation Misses:
            avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
            units: (Req + $normUnit)
            tips:
          Permission Misses:
            avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
            units: (Req + $normUnit)
            tips:
    - metric_table:
        id: 1606
        title: L1D Addr Translation Stalls
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          units: Units
          tips: Tips
        metric:
@@ -0,0 +1,442 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1600
  title: Vector L1 Data Cache
  metrics_description:
    Hit rate: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
    Bandwidth: The number of bytes looked up in the vL1D cache as a result of VMEM
      instructions, as a percent of the peak theoretical bandwidth achievable on the
      specific accelerator. The number of bytes is calculated as the number of cache
      lines requested multiplied by the cache line size. This value does not consider
      partial requests, so for instance, if only a single value is requested in a
      cache line, the data movement will still be counted as a full cache line.
    Utilization: Indicates how busy the vL1D Cache RAM was during the kernel execution.
      The number of cycles where the vL1D Cache RAM is actively processing any request
      divided by the number of cycles where the vL1D is active.
    Coalescing: Indicates how well memory instructions were coalesced by the address
      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
      as the average number of thread-requests generated per instruction divided by
      the ideal number of thread-requests per instruction.
    Stalled on L2 Data: The ratio of the number of cycles where the vL1D is stalled
      waiting for requested data to return from the L2 cache divided by the number
      of cycles where the vL1D is active.
    Stalled on L2 Req: The ratio of the number of cycles where the vL1D is stalled
      waiting to issue a request for data to the L2 cache divided by the number of
      cycles where the vL1D is active.
    Tag RAM Stall (Read): The ratio of the number of cycles where the vL1D is stalled
      due to Read requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Tag RAM Stall (Write): The ratio of the number of cycles where the vL1D is stalled
      due to Write requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Tag RAM Stall (Atomic): The ratio of the number of cycles where the vL1D is stalled
      due to Atomic requests with conflicting tags being looked up concurrently, divided
      by the number of cycles where the vL1D is active.
    Total Req: The total number of incoming requests from the address processing unit
      after coalescing per normalization unit.
    Read Req: The total number of incoming read requests from the address processing
      unit after coalescing per normalization unit.
    Write Req: The total number of incoming write requests from the address processing
      unit after coalescing per normalization unit.
    Atomic Req: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    Cache BW: The number of bytes looked up in the vL1D cache as a result of VMEM
      instructions per normalization unit. The number of bytes is calculated as the
      number of cache lines requested multiplied by the cache line size. This value
      does not consider partial requests, so for instance, if only a single value
      is requested in a cache line, the data movement will still be counted as a full
      cache line.
    Cache Hit Rate: The ratio of the number of vL1D cache line requests that hit in
      vL1D cache over the total number of cache line requests to the vL1D Cache RAM.
    Cache Accesses: The total number of cache line lookups in the vL1D.
    Cache Hits: The number of cache accesses minus the number of outgoing requests
      to the L2 cache, that is, the number of cache line requests serviced by the
      vL1D Cache RAM per normalization unit.
    Invalidations: The number of times the vL1D was issued a write-back invalidate
      command during the kernel's execution per normalization unit. This may be triggered
      by, for instance, the buffer_wbinvl1 instruction.
    L1-L2 BW: The number of bytes transferred across the vL1D-L2 interface as a result
      of VMEM instructions, per normalization unit. The number of bytes is calculated
      as the number of cache lines requested multiplied by the cache line size. This
      value does not consider partial requests, so for instance, if only a single
      value is requested in a cache line, the data movement will still be counted
      as a full cache line.
    L1-L2 Read: The number of read requests for a vL1D cache line that were not satisfied
      by the vL1D and must be retrieved from the L2 Cache, per normalization
      unit.
    L1-L2 Write: The number of write requests to a vL1D cache line that were sent
      through the vL1D to the L2 cache, per normalization unit.
    L1-L2 Atomic: The number of atomic requests that are sent through the vL1D to
      the L2 cache, per normalization unit. This includes requests for atomics with
      and without return values.
    L1 Access Latency: Calculated as the average number of cycles that a vL1D cache
      line request spent in the vL1D cache pipeline.
    L1-L2 Read Latency: Calculated as the average number of cycles that the vL1D cache
      took to issue and receive read requests from the L2 Cache. This number also
      includes requests for atomics with return values.
    L1-L2 Write Latency: Calculated as the average number of cycles that the vL1D
      cache took to issue and receive acknowledgement of a write request to the L2
      Cache. This number also includes requests for atomics without return values.
    NC - Read: Total read requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Read: Total read requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Read: Total read requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Read: Total read requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Write: Total write requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    NC - Write: Total write requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Write: Total write requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Write: Total write requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    NC - Atomic: Total atomic requests with NC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    UC - Atomic: Total atomic requests with UC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    CC - Atomic: Total atomic requests with CC mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    RW - Atomic: Total atomic requests with RW mtype from this TCP to all TCCs, summed
      over TCP instances, per normalization unit.
    Req: The number of translation requests made to the UTCL1 per normalization unit.
    Hit Ratio: The ratio of the number of translation requests that hit in the UTCL1
      divided by the total number of translation requests made to the UTCL1.
    Hits: The number of translation requests that hit in the UTCL1, and could be reused,
      per normalization unit.
    Translation Misses: The total number of translation requests that missed in the
      UTCL1 due to the translation not being present in the cache, per normalization
      unit.
    Permission Misses: "The total number of translation requests that missed in the\
      \ UTCL1 due to a permission error, per normalization unit. This is unused and\
      \ expected to be zero in most configurations for modern CDNA\u2122 accelerators."
  data source:
  - metric_table:
      id: 1601
      title: vL1D Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Hit rate:
          value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: Pct of Peak
        Bandwidth:
          value: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 64) * $cu_per_gpu))
          unit: Pct of Peak
        Utilization:
          value: AVG((((TCP_GATE_EN2_sum * 100) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None))
          unit: Pct of Peak
        Coalescing:
          value: AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
            * 4)) if (TCP_TOTAL_ACCESSES_sum != 0) else None))
          unit: Pct of Peak
      comparable: false
      cli_style: simple_bar
      tui_style: simple_bar
  - metric_table:
      id: 1602
      title: vL1D cache stall metrics
      header:
        metric: Metric
        expr: Expression
      metric:
        Stalled on L2 Data:
          expr: (((100 * TCP_PENDING_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None)
        Stalled on L2 Req:
          expr: (((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum) if (TCP_GATE_EN1_sum
            != 0) else None)
        Tag RAM Stall (Read):
          expr: (((100 * TCP_READ_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
        Tag RAM Stall (Write):
          expr: (((100 * TCP_WRITE_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
        Tag RAM Stall (Atomic):
          expr: (((100 * TCP_ATOMIC_TAGCONFLICT_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1603
      title: vL1D cache access metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Total Req:
          avg: AVG((TCP_TOTAL_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_ACCESSES_sum / $denom))
          unit: (Req  + $normUnit)
        Read Req:
          avg: AVG((TCP_TOTAL_READ_sum / $denom))
          min: MIN((TCP_TOTAL_READ_sum / $denom))
          max: MAX((TCP_TOTAL_READ_sum / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((TCP_TOTAL_WRITE_sum / $denom))
          min: MIN((TCP_TOTAL_WRITE_sum / $denom))
          max: MAX((TCP_TOTAL_WRITE_sum / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          min: MIN(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          max: MAX(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom))
          unit: (Req  + $normUnit)
        Cache BW:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum * 64) / $denom))
          unit: (Bytes + $normUnit)
        Cache Hit Rate:
          avg: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          min: MIN(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          max: MAX(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: pct
        Cache Accesses:
          avg: AVG((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          min: MIN((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          max: MAX((TCP_TOTAL_CACHE_ACCESSES_sum / $denom))
          unit: (Req  + $normUnit)
        Cache Hits:
          avg: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          min: MIN(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          max: MAX(((TCP_TOTAL_CACHE_ACCESSES_sum - (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / $denom))
          unit: (Req  + $normUnit)
        Invalidations:
          avg: AVG((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          min: MIN((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          max: MAX((TCP_TOTAL_WRITEBACK_INVALIDATES_sum / $denom))
          unit: (Req + $normUnit)
        L1-L2 BW:
          avg: AVG(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          min: MIN(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          max: MAX(((64 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) + TCP_TCC_ATOMIC_WITH_RET_REQ_sum)
            + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) / $denom))
          unit: (Bytes + $normUnit)
        L1-L2 Read:
          avg: AVG((TCP_TCC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        L1-L2 Write:
          avg: AVG((TCP_TCC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        L1-L2 Atomic:
          avg: AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          min: MIN(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          max: MAX(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom))
          unit: (Req  + $normUnit)
        L1 Access Latency:
          avg: AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          min: MIN(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          max: MAX(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None))
          unit: Cycles
        L1-L2 Read Latency:
          avg: AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          min: MIN(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          max: MAX(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
            if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else
            None))
          unit: Cycles
        L1-L2 Write Latency:
          avg: AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          min: MIN(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          max: MAX(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum) != 0)
            else None))
          unit: Cycles
  - metric_table:
      id: 1604
      title: L1D - L2 Transactions
      header:
        metric: Metric
        xfer: Xfer
        coherency: Coherency
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        NC - Read:
          xfer: Read
          coherency: NC
          avg: AVG((TCP_TCC_NC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Read:
          xfer: Read
          coherency: UC
          avg: AVG((TCP_TCC_UC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Read:
          xfer: Read
          coherency: CC
          avg: AVG((TCP_TCC_CC_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Read:
          xfer: Read
          coherency: RW
          avg: AVG((TCP_TCC_RW_READ_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_READ_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_READ_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Write:
          xfer: Write
          coherency: RW
          avg: AVG((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        NC - Write:
          xfer: Write
          coherency: NC
          avg: AVG((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Write:
          xfer: Write
          coherency: UC
          avg: AVG((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Write:
          xfer: Write
          coherency: CC
          avg: AVG((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_WRITE_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        NC - Atomic:
          xfer: Atomic
          coherency: NC
          avg: AVG((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_NC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC - Atomic:
          xfer: Atomic
          coherency: UC
          avg: AVG((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_UC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC - Atomic:
          xfer: Atomic
          coherency: CC
          avg: AVG((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_CC_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW - Atomic:
          xfer: Atomic
          coherency: RW
          avg: AVG((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          min: MIN((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          max: MAX((TCP_TCC_RW_ATOMIC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
  - metric_table:
      id: 1605
      title: L1 Unified Translation Cache (UTCL1)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        units: Units
      metric:
        Req:
          avg: AVG((TCP_UTCL1_REQUEST_sum / $denom))
          min: MIN((TCP_UTCL1_REQUEST_sum / $denom))
          max: MAX((TCP_UTCL1_REQUEST_sum / $denom))
          units: (Req + $normUnit)
        Hit Ratio:
          avg: AVG((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          min: MIN((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          max: MAX((((100 * TCP_UTCL1_TRANSLATION_HIT_sum) / TCP_UTCL1_REQUEST_sum)
            if (TCP_UTCL1_REQUEST_sum != 0) else None))
          units: pct
        Hits:
          avg: AVG((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          min: MIN((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          max: MAX((TCP_UTCL1_TRANSLATION_HIT_sum / $denom))
          units: (Req + $normUnit)
        Translation Misses:
          avg: AVG((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          min: MIN((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          max: MAX((TCP_UTCL1_TRANSLATION_MISS_sum / $denom))
          units: (Req + $normUnit)
        Permission Misses:
          avg: AVG((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          min: MIN((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          max: MAX((TCP_UTCL1_PERMISSION_MISS_sum / $denom))
          units: (Req + $normUnit)
  - metric_table:
      id: 1606
      title: L1D Addr Translation Stalls
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        units: Units
      metric: {}
@@ -1,388 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1700
  title: L2 Cache
  data source:
    - metric_table:
        id: 1701
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          tips: Tips
        metric:
          Utilization:
            value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
            unit: pct
            tips:
          Bandwidth:
            value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
            unit: pct
            tips:
          Hit Rate:
            value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else 0))
            unit: pct
            tips:
          L2-Fabric Read BW:
            value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            tips:
          L2-Fabric Write and Atomic BW:
            value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            tips:
          HBM Bandwidth:
            value: $hbmBandwidth
            unit: GB/s
            tips:
    - metric_table:
        id: 1702
        title: L2 - Fabric Transactions
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Read BW:
            avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
              * 64)) / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          HBM Read Traffic:
            avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Remote Read Traffic:
            avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Uncached Read Traffic:
            avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum != 0) else None))
            unit: pct
            tips:
          Write and Atomic BW:
            avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
              * 32)) / $denom))
            unit: (Bytes  + $normUnit)
            tips:
          HBM Write and Atomic Traffic:
            avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Remote Write and Atomic Traffic:
            avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Atomic Traffic:
            avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Uncached Write and Atomic Traffic:
            avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum != 0) else None))
            unit: pct
            tips:
          Read Latency:
            avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum !=
              0) else None))
            unit: Cycles
            tips:
          Write and Atomic Latency:
            avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum !=
              0) else None))
            unit: Cycles
            tips:
          Atomic Latency:
            avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
              != 0) else None))
            unit: Cycles
            tips:
    - metric_table:
        id: 1703
        title: L2 Cache Accesses
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Bandwidth:
            avg: AVG((TCC_REQ_sum * 128) / $denom)
            min: MIN((TCC_REQ_sum * 128) / $denom)
            max: MAX((TCC_REQ_sum * 128) / $denom)
            unit: (Bytes + $normUnit)
            tips:
          Req:
            avg: AVG((TCC_REQ_sum / $denom))
            min: MIN((TCC_REQ_sum / $denom))
            max: MAX((TCC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read Req:
            avg: AVG((TCC_READ_sum / $denom))
            min: MIN((TCC_READ_sum / $denom))
            max: MAX((TCC_READ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write Req:
            avg: AVG((TCC_WRITE_sum / $denom))
            min: MIN((TCC_WRITE_sum / $denom))
            max: MAX((TCC_WRITE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic Req:
            avg: AVG((TCC_ATOMIC_sum / $denom))
            min: MIN((TCC_ATOMIC_sum / $denom))
            max: MAX((TCC_ATOMIC_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Streaming Req:
            avg: AVG((TCC_STREAMING_REQ_sum / $denom))
            min: MIN((TCC_STREAMING_REQ_sum / $denom))
            max: MAX((TCC_STREAMING_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Probe Req:
            avg: AVG((TCC_PROBE_sum / $denom))
            min: MIN((TCC_PROBE_sum / $denom))
            max: MAX((TCC_PROBE_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Cache Hit:
            avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
            tips:
          Hits:
            avg: AVG((TCC_HIT_sum / $denom))
            min: MIN((TCC_HIT_sum / $denom))
            max: MAX((TCC_HIT_sum / $denom))
            unit: (Hits  + $normUnit)
            tips:
          Misses:
            avg: AVG((TCC_MISS_sum / $denom))
            min: MIN((TCC_MISS_sum / $denom))
            max: MAX((TCC_MISS_sum / $denom))
            unit: (Misses  + $normUnit)
            tips:
          Writeback:
            avg: AVG((TCC_WRITEBACK_sum / $denom))
            min: MIN((TCC_WRITEBACK_sum / $denom))
            max: MAX((TCC_WRITEBACK_sum / $denom))
            unit: (Cachelines  + $normUnit)
            tips:
          Writeback (Internal):
            avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
            min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
            max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Writeback (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Evict (Internal):
            avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
            min: MIN((TCC_NORMAL_EVICT_sum / $denom))
            max: MAX((TCC_NORMAL_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          Evict (vL1D Req):
            avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
            unit: (Cachelines + $normUnit)
            tips:
          NC Req:
            avg: AVG((TCC_NC_REQ_sum / $denom))
            min: MIN((TCC_NC_REQ_sum / $denom))
            max: MAX((TCC_NC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          UC Req:
            avg: AVG((TCC_UC_REQ_sum / $denom))
            min: MIN((TCC_UC_REQ_sum / $denom))
            max: MAX((TCC_UC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          CC Req:
            avg: AVG((TCC_CC_REQ_sum / $denom))
            min: MIN((TCC_CC_REQ_sum / $denom))
            max: MAX((TCC_CC_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          RW Req:
            avg: AVG((TCC_RW_REQ_sum / $denom))
            min: MIN((TCC_RW_REQ_sum / $denom))
            max: MAX((TCC_RW_REQ_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
    - metric_table:
        id: 1704
        title: L2 Cache Stalls
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
    - metric_table:
        id: 1705
        title: L2 - Fabric Interface Stalls
        header:
          metric: Metric
          type: Type
          transaction: Transaction
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        style:
          type: simple_multi_bar
        metric:
          Write - Credit Starvation:
            type: Credit Starvation
            transaction: Write
            avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum != 0) else None))
            unit: pct
            tips:
    - metric_table:
        id: 1706
        title: L2 - Fabric Detailed Transaction Breakdown
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Read (32B):
            avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
            min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
            max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read (64B):
            avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Read (Uncached):
            avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          HBM Read:
            avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Remote Read:
            avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (32B):
            avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (Uncached):
            avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Write and Atomic (64B):
            avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
            min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
            max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          HBM Write and Atomic:
            avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
            min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
            max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
          Remote Write and Atomic:
            avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
            unit: (Req  + $normUnit)
            tips:
          Atomic:
            avg: AVG((TCC_EA_ATOMIC_sum / $denom))
            min: MIN((TCC_EA_ATOMIC_sum / $denom))
            max: MAX((TCC_EA_ATOMIC_sum / $denom))
            unit: (Req  + $normUnit)
            tips:
@@ -0,0 +1,536 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1700
  title: L2 Cache
  metrics_description:
    Utilization: The ratio of the number of cycles an L2 channel was active, summed
      over all L2 channels on the accelerator over the total L2 cycles.
    Peak Bandwidth: The number of bytes looked up in the L2 cache, as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator. The number
      of bytes is calculated as the number of cache lines requested multiplied by
      the cache line size. This value does not consider partial requests, so for example,
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line.
    Hit Rate: The ratio of the number of L2 cache line requests that hit in the L2
      cache over the total number of incoming cache line requests to the L2 cache.
    L2-Fabric Read BW: The number of bytes read by the L2 over the Infinity Fabric
      interface per unit time.
    L2-Fabric Write and Atomic BW: The number of bytes sent by the L2 over the Infinity
      Fabric interface by write and atomic operations per unit time.
    HBM Bandwidth: Maximum theoretical bandwidth of the accelerator's local high-bandwidth
      memory (HBM) per unit time. This value is calculated as the number of HBM channels
      multiplied by the HBM channel width multiplied by the HBM clock frequency.
    Read BW: The total number of bytes read by the L2 cache from Infinity Fabric per
      normalization unit.
    HBM Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to the accelerator's local high-bandwidth memory (HBM). This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
      are both counted as a single request), so this metric only approximates the
      percent of the L2-Fabric Read bandwidth directed to the local HBM.
    Remote Read Traffic: The percent of read requests generated by the L2 cache that
      are routed to any memory location other than the accelerator's local high-bandwidth
      memory (HBM) - for example, the CPU's DRAM or a remote accelerator's HBM. This
      breakdown does not consider the size of the request (meaning that 32B and 64B
      requests are both counted as a single request), so this metric only approximates
      the percent of the L2-Fabric Read bandwidth directed to a remote location.
    Uncached Read Traffic: The percent of read requests generated by the L2 cache
      that are reading from an uncached memory allocation. Note, as described in the
      request flow section, a single 64B read request is typically counted as two
      uncached read requests. So, it is possible for the Uncached Read Traffic to
      reach up to 200% of the total number of read requests. This breakdown does not
      consider the size of the request (i.e., 32B and 64B requests are both counted
      as a single request), so this metric only approximates the percent of the L2-Fabric
      read bandwidth directed to an uncached memory location.
    Write and Atomic BW: The total number of bytes written by the L2 over Infinity
      Fabric by write and atomic operations per normalization unit. Note that on current
      CDNA accelerators, such as the MI2XX, requests are only considered atomic by
      Infinity Fabric if they are targeted at non-write-cacheable memory, for example,
      fine-grained memory allocations or uncached memory allocations on the MI2XX.
    HBM Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are routed to the accelerator's local high-bandwidth memory
      (HBM). This breakdown does not consider the size of the request (meaning that
      32B and 64B requests are both counted as a single request), so this metric only
      approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
      to the local HBM. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at fine-grained memory allocations or uncached memory allocations.
    Remote Write and Atomic Traffic: The percent of write and atomic requests generated by the
      L2 cache that are routed to any memory location other than the accelerator's
      local high-bandwidth memory (HBM) - for example, the CPU's DRAM or a remote
      accelerator's HBM. This breakdown does not consider the size of the request
      (meaning that 32B and 64B requests are both counted as a single request), so
      this metric only approximates the percent of the L2-Fabric Write and Atomic bandwidth directed
      to a remote location. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at fine-grained memory allocations or uncached memory allocations.
    Atomic Traffic: The percent of write requests generated by the L2 cache that are
      atomic requests to any memory location. This breakdown does not consider the
      size of the request (meaning that 32B and 64B requests are both counted as a
      single request), so this metric only approximates the percent of the L2-Fabric
      Write and Atomic bandwidth used by atomic requests. Note that on current CDNA accelerators,
      such as the MI2XX, requests are only considered atomic by Infinity Fabric if
      they are targeted at fine-grained memory allocations or uncached memory allocations.
    Uncached Write and Atomic Traffic: The percent of write and atomic requests generated
      by the L2 cache that are targeting uncached memory allocations. This breakdown
      does not consider the size of the request (meaning that 32B and 64B requests
      are both counted as a single request), so this metric only approximates the
      percent of the L2-Fabric Write and Atomic bandwidth directed to uncached memory allocations.
    Read Latency: The time-averaged number of cycles read requests spent in Infinity
      Fabric before data was returned to the L2.
    Write and Atomic Latency: The time-averaged number of cycles write requests spent
      in Infinity Fabric before a completion acknowledgement was returned to the L2.
    Atomic Latency: The time-averaged number of cycles atomic requests spent in Infinity
      Fabric before a completion acknowledgement (atomic without return value) or
      data (atomic with return value) was returned to the L2.
    Bandwidth: The number of bytes looked up in the L2 cache, per normalization unit.
      The number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so for
      example, if only a single value is requested in a cache line, the data movement
      will still be counted as a full cache line.
    Req: The total number of incoming requests to the L2 from all clients for all
      request types, per normalization unit.
    Read Req: The total number of read requests to the L2 from all clients, per normalization
      unit.
    Write Req: The total number of write requests to the L2 from all clients, per
      normalization unit.
    Atomic Req: The total number of atomic requests (with and without return) to the
      L2 from all clients, per normalization unit.
    Streaming Req: The total number of incoming requests to the L2 that are marked
      as streaming, per normalization unit. The exact meaning of this may differ
      depending on the targeted accelerator; on an MI2XX, however, this corresponds
      to non-temporal loads or stores. The L2 cache attempts to evict streaming requests
      before normal requests when the L2 is at capacity.
    Probe Req: The number of coherence probe requests made to the L2 cache from outside
      the accelerator, per normalization unit. On an MI2XX, probe requests may be
      generated by, for example, writes to fine-grained device memory or by writes
      to coarse-grained device memory.
    Cache Hit: The ratio of the number of L2 cache line requests that hit in the L2
      cache over the total number of incoming cache line requests to the L2 cache.
    Hits: The total number of requests to the L2 from all clients that hit in the
      cache, per normalization unit. As noted in the Speed-of-Light section, this
      includes hit-on-miss requests.
    Misses: The total number of requests to the L2 from all clients that miss in the
      cache, per normalization unit. As noted in the Speed-of-Light section, these
      do not include hit-on-miss requests.
    Writeback: The total number of L2 cache lines written back to memory for any reason,
      per normalization unit. Write-backs may occur due to user code (such as HIP kernel
      calls to __threadfence_system or atomic built-ins), by the command processor's
      memory acquire/release fences, or for other internal hardware reasons.
    Writeback (Internal): The total number of L2 cache lines written back to memory
      for internal hardware reasons, per normalization unit.
    Writeback (vL1D Req): The total number of L2 cache lines written back to memory
      due to requests initiated by the vL1D cache, per normalization unit.
    Evict (Internal): The total number of L2 cache lines evicted from the cache due
      to capacity limits, per normalization unit.
    Evict (vL1D Req): The total number of L2 cache lines evicted from the cache due
      to invalidation requests initiated by the vL1D cache, per normalization unit.
    NC Req: The total number of requests to the L2 to Not-hardware-Coherent (NC) memory
      allocations, per normalization unit.
    UC Req: The total number of requests to the L2 that go to Uncached (UC) memory
      allocations, per normalization unit.
    CC Req: The total number of requests to the L2 that go to Coherently Cacheable
      (CC) memory allocations, per normalization unit.
    RW Req: The total number of requests to the L2 that go to Read-Write (RW) coherent
      memory allocations, per normalization unit.
    Write - Credit Starvation: The number of cycles the L2-Fabric interface was stalled
      on write or atomic requests to any memory location because too many write/atomic
      requests were currently in flight, as a percent of the total active L2 cycles.
    Read (32B): The total number of L2 requests to Infinity Fabric to read 32B of
      data from any memory location, per normalization unit.
    Read (64B): The total number of L2 requests to Infinity Fabric to read 64B of
      data from any memory location, per normalization unit.
    Read (Uncached): The total number of L2 requests to Infinity Fabric to read uncached
      data from any memory location, per normalization unit. 64B requests for uncached
      data are counted as two 32B uncached data requests.
    HBM Read: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
    Remote Read: The total number of L2 requests to Infinity Fabric to read 32B or
      64B of data from any source other than the accelerator's local HBM, per normalization
      unit.
    Write and Atomic (32B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B of data to any memory location, per normalization
      unit.
    Write and Atomic (Uncached): The total number of L2 requests to Infinity Fabric
      to write or atomically update 32B or 64B of uncached data, per normalization
      unit.
    Write and Atomic (64B): The total number of L2 requests to Infinity Fabric to
      write or atomically update 64B of data in any memory location, per normalization
      unit.
    HBM Write and Atomic: The total number of L2 requests to Infinity Fabric to write
      or atomically update 32B or 64B of data in the accelerator's local HBM, per
      normalization unit.
    Remote Write and Atomic: The total number of L2 requests to Infinity Fabric to
      write or atomically update 32B or 64B of data in any memory location other than
      the accelerator's local HBM, per normalization unit.
    Atomic: The total number of L2 requests to Infinity Fabric to atomically update
      32B or 64B of data in any memory location, per normalization unit. See Request
      flow for more detail. Note that on current CDNA accelerators, such as the MI2XX,
      requests are only considered atomic by Infinity Fabric if they are targeted
      at non-write-cacheable memory, such as fine-grained memory allocations or uncached
      memory allocations.
    Read Stall: "The ratio of the total number of cycles the L2-Fabric interface was\
      \ stalled on a read request to any destination (local HBM, remote PCIe\xAE connected\
      \ accelerator or CPU, or remote Infinity Fabric connected accelerator or CPU)\
      \ over the total active L2 cycles."
    Write Stall: The ratio of the total number of cycles the L2-Fabric interface was
      stalled on a write or atomic request to any destination (local HBM, remote PCIe
      connected accelerator or CPU, or remote Infinity Fabric connected accelerator
      or CPU) over the total active L2 cycles.
    Read - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
      read requests to remote PCIe connected accelerators or CPUs as a percent of
      the total active L2 cycles.
    Read - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
      stalled on read requests to remote Infinity Fabric connected accelerators or
      CPUs as a percent of the total active L2 cycles.
    Read - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
      read requests to the accelerator's local HBM as a percent of the total active
      L2 cycles.
    Write - PCIe Stall: The number of cycles the L2-Fabric interface was stalled on
      write or atomic requests to remote PCIe connected accelerators or CPUs as a
      percent of the total active L2 cycles.
    Write - Infinity Fabric Stall: The number of cycles the L2-Fabric interface was
      stalled on write or atomic requests to remote Infinity Fabric connected accelerators
      or CPUs as a percent of the total active L2 cycles.
    Write - HBM Stall: The number of cycles the L2-Fabric interface was stalled on
      write or atomic requests to the accelerator's local HBM as a percent of the total
      active L2 cycles.
  data source:
  - metric_table:
      id: 1701
      title: L2 Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
      metric:
        Utilization:
          value: AVG(((TCC_BUSY_sum * 100) / (TO_INT($total_l2_chan) * $GRBM_GUI_ACTIVE_PER_XCD)))
          unit: pct
        Peak Bandwidth:
          value: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
            / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
          unit: pct
        Hit Rate:
          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else 0))
          unit: pct
        L2-Fabric Read BW:
          value: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
        L2-Fabric Write and Atomic BW:
          value: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
        HBM Bandwidth:
          value: $hbmBandwidth
          unit: GB/s
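  # Worked example for "L2-Fabric Read BW" above (illustrative numbers, not measured
  # data). Assumes the timestamps are in nanoseconds, which the GB/s unit suggests.
  #   TCC_EA_RDREQ_sum = 1000, TCC_EA_RDREQ_32B_sum = 400, End - Start = 100 ns
  #   bytes = (400 * 32) + ((1000 - 400) * 64) = 12800 + 38400 = 51200
  #   51200 bytes / 100 ns = 512 bytes/ns = 512 GB/s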
  - metric_table:
      id: 1702
      title: L2-Fabric interface metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Read BW:
          avg: AVG((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          min: MIN((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          max: MAX((((TCC_EA_RDREQ_32B_sum * 32) + ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum)
            * 64)) / $denom))
          unit: (Bytes  + $normUnit)
        HBM Read Traffic:
          avg: AVG((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_RDREQ_DRAM_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: pct
        Remote Read Traffic:
          avg: AVG((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          min: MIN((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          max: MAX((100 * ((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum) / TCC_EA_RDREQ_sum)
            if (TCC_EA_RDREQ_sum != 0) else None))
          unit: pct
        Uncached Read Traffic:
          avg: AVG((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_RD_UNCACHED_32B_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: pct
        Write and Atomic BW:
          avg: AVG((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          min: MIN((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          max: MAX((((TCC_EA_WRREQ_64B_sum * 64) + ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum)
            * 32)) / $denom))
          unit: (Bytes  + $normUnit)
        HBM Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_WRREQ_DRAM_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Remote Write and Atomic Traffic:
          avg: AVG((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          min: MIN((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          max: MAX((100 * ((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum) / TCC_EA_WRREQ_sum)
            if (TCC_EA_WRREQ_sum != 0) else None))
          unit: pct
        Atomic Traffic:
          avg: AVG((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_ATOMIC_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Uncached Write and Atomic Traffic:
          avg: AVG((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX((100 * (TCC_EA_WR_UNCACHED_32B_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: pct
        Read Latency:
          avg: AVG(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          min: MIN(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          max: MAX(((TCC_EA_RDREQ_LEVEL_sum / TCC_EA_RDREQ_sum) if (TCC_EA_RDREQ_sum
            != 0) else None))
          unit: Cycles
        Write and Atomic Latency:
          avg: AVG(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          min: MIN(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          max: MAX(((TCC_EA_WRREQ_LEVEL_sum / TCC_EA_WRREQ_sum) if (TCC_EA_WRREQ_sum
            != 0) else None))
          unit: Cycles
        Atomic Latency:
          avg: AVG(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          min: MIN(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          max: MAX(((TCC_EA_ATOMIC_LEVEL_sum / TCC_EA_ATOMIC_sum) if (TCC_EA_ATOMIC_sum
            != 0) else None))
          unit: Cycles
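  # Worked example for the traffic percentages above (illustrative numbers only).
  #   TCC_EA_RDREQ_sum = 1000, TCC_EA_RDREQ_DRAM_sum = 750
  #   HBM Read Traffic    = 100 * (750 / 1000)          = 75 pct
  #   Remote Read Traffic = 100 * ((1000 - 750) / 1000) = 25 pct
  # Both are ratios of request counts, so a 32B and a 64B request weigh the same,
  # and each expression yields None when TCC_EA_RDREQ_sum is 0.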
  - metric_table:
      id: 1703
      title: L2 Cache Accesses
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Bandwidth:
          avg: AVG((TCC_REQ_sum * 128) / $denom)
          min: MIN((TCC_REQ_sum * 128) / $denom)
          max: MAX((TCC_REQ_sum * 128) / $denom)
          unit: (Bytes + $normUnit)
        Req:
          avg: AVG((TCC_REQ_sum / $denom))
          min: MIN((TCC_REQ_sum / $denom))
          max: MAX((TCC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        Read Req:
          avg: AVG((TCC_READ_sum / $denom))
          min: MIN((TCC_READ_sum / $denom))
          max: MAX((TCC_READ_sum / $denom))
          unit: (Req  + $normUnit)
        Write Req:
          avg: AVG((TCC_WRITE_sum / $denom))
          min: MIN((TCC_WRITE_sum / $denom))
          max: MAX((TCC_WRITE_sum / $denom))
          unit: (Req  + $normUnit)
        Atomic Req:
          avg: AVG((TCC_ATOMIC_sum / $denom))
          min: MIN((TCC_ATOMIC_sum / $denom))
          max: MAX((TCC_ATOMIC_sum / $denom))
          unit: (Req  + $normUnit)
        Streaming Req:
          avg: AVG((TCC_STREAMING_REQ_sum / $denom))
          min: MIN((TCC_STREAMING_REQ_sum / $denom))
          max: MAX((TCC_STREAMING_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        Probe Req:
          avg: AVG((TCC_PROBE_sum / $denom))
          min: MIN((TCC_PROBE_sum / $denom))
          max: MAX((TCC_PROBE_sum / $denom))
          unit: (Req  + $normUnit)
        Cache Hit:
          avg: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          min: MIN((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          max: MAX((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          unit: pct
        Hits:
          avg: AVG((TCC_HIT_sum / $denom))
          min: MIN((TCC_HIT_sum / $denom))
          max: MAX((TCC_HIT_sum / $denom))
          unit: (Hits  + $normUnit)
        Misses:
          avg: AVG((TCC_MISS_sum / $denom))
          min: MIN((TCC_MISS_sum / $denom))
          max: MAX((TCC_MISS_sum / $denom))
          unit: (Misses  + $normUnit)
        Writeback:
          avg: AVG((TCC_WRITEBACK_sum / $denom))
          min: MIN((TCC_WRITEBACK_sum / $denom))
          max: MAX((TCC_WRITEBACK_sum / $denom))
          unit: (Cachelines  + $normUnit)
        Writeback (Internal):
          avg: AVG((TCC_NORMAL_WRITEBACK_sum / $denom))
          min: MIN((TCC_NORMAL_WRITEBACK_sum / $denom))
          max: MAX((TCC_NORMAL_WRITEBACK_sum / $denom))
          unit: (Cachelines + $normUnit)
        Writeback (vL1D Req):
          avg: AVG((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          min: MIN((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          max: MAX((TCC_ALL_TC_OP_WB_WRITEBACK_sum / $denom))
          unit: (Cachelines + $normUnit)
        Evict (Internal):
          avg: AVG((TCC_NORMAL_EVICT_sum / $denom))
          min: MIN((TCC_NORMAL_EVICT_sum / $denom))
          max: MAX((TCC_NORMAL_EVICT_sum / $denom))
          unit: (Cachelines + $normUnit)
        Evict (vL1D Req):
          avg: AVG((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          min: MIN((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          max: MAX((TCC_ALL_TC_OP_INV_EVICT_sum / $denom))
          unit: (Cachelines + $normUnit)
        NC Req:
          avg: AVG((TCC_NC_REQ_sum / $denom))
          min: MIN((TCC_NC_REQ_sum / $denom))
          max: MAX((TCC_NC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        UC Req:
          avg: AVG((TCC_UC_REQ_sum / $denom))
          min: MIN((TCC_UC_REQ_sum / $denom))
          max: MAX((TCC_UC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        CC Req:
          avg: AVG((TCC_CC_REQ_sum / $denom))
          min: MIN((TCC_CC_REQ_sum / $denom))
          max: MAX((TCC_CC_REQ_sum / $denom))
          unit: (Req  + $normUnit)
        RW Req:
          avg: AVG((TCC_RW_REQ_sum / $denom))
          min: MIN((TCC_RW_REQ_sum / $denom))
          max: MAX((TCC_RW_REQ_sum / $denom))
          unit: (Req  + $normUnit)
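  # Worked example for "Bandwidth" and "Cache Hit" above (illustrative numbers only,
  # with $denom assumed to be 1 for simplicity).
  #   TCC_REQ_sum = 1000 gives Bandwidth = (1000 * 128) / 1 = 128000 bytes
  #   TCC_HIT_sum = 900, TCC_MISS_sum = 100 gives Cache Hit = (100 * 900) / (900 + 100) = 90 pct
  # The factor 128 counts every request as a full cache line, so Bandwidth overstates
  # the data moved whenever a request touches only part of a line.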
  - metric_table:
      id: 1704
      title: L2 Cache Stalls
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric: {}
  - metric_table:
      id: 1705
      title: L2 - Fabric Interface stalls
      header:
        metric: Metric
        type: Type
        transaction: Transaction
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      style:
        type: simple_multi_bar
      metric:
        Write - Credit Starvation:
          type: Credit Starvation
          transaction: Write
          avg: AVG(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          min: MIN(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          max: MAX(((100 * (TCC_TOO_MANY_EA_WRREQS_STALL_sum / TCC_BUSY_sum)) if (TCC_BUSY_sum
            != 0) else None))
          unit: pct
  - metric_table:
      id: 1706
      title: L2 - Fabric interface detailed metrics
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Read (32B):
          avg: AVG((TCC_EA_RDREQ_32B_sum / $denom))
          min: MIN((TCC_EA_RDREQ_32B_sum / $denom))
          max: MAX((TCC_EA_RDREQ_32B_sum / $denom))
          unit: (Req  + $normUnit)
        Read (64B):
          avg: AVG(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          min: MIN(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          max: MAX(((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_32B_sum) / $denom))
          unit: (Req  + $normUnit)
        Read (Uncached):
          avg: AVG((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA_RD_UNCACHED_32B_sum / $denom))
          unit: (Req  + $normUnit)
        HBM Read:
          avg: AVG((TCC_EA_RDREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA_RDREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA_RDREQ_DRAM_sum / $denom))
          unit: (Req  + $normUnit)
        Remote Read:
          avg: AVG((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA_RDREQ_sum - TCC_EA_RDREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (32B):
          avg: AVG(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          min: MIN(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          max: MAX(((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_64B_sum) / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (Uncached):
          avg: AVG((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          min: MIN((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          max: MAX((TCC_EA_WR_UNCACHED_32B_sum / $denom))
          unit: (Req  + $normUnit)
        Write and Atomic (64B):
          avg: AVG((TCC_EA_WRREQ_64B_sum / $denom))
          min: MIN((TCC_EA_WRREQ_64B_sum / $denom))
          max: MAX((TCC_EA_WRREQ_64B_sum / $denom))
          unit: (Req  + $normUnit)
        HBM Write and Atomic:
          avg: AVG((TCC_EA_WRREQ_DRAM_sum / $denom))
          min: MIN((TCC_EA_WRREQ_DRAM_sum / $denom))
          max: MAX((TCC_EA_WRREQ_DRAM_sum / $denom))
          unit: (Req  + $normUnit)
        Remote Write and Atomic:
          avg: AVG((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          min: MIN((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          max: MAX((MAX((TCC_EA_WRREQ_sum - TCC_EA_WRREQ_DRAM_sum), 0) / $denom))
          unit: (Req  + $normUnit)
        Atomic:
          avg: AVG((TCC_EA_ATOMIC_sum / $denom))
          min: MIN((TCC_EA_ATOMIC_sum / $denom))
          max: MAX((TCC_EA_ATOMIC_sum / $denom))
          unit: (Req  + $normUnit)
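  # Worked example for the request split above (illustrative numbers only, with
  # $denom assumed to be 1).
  #   TCC_EA_RDREQ_sum = 1000, TCC_EA_RDREQ_32B_sum = 400, TCC_EA_RDREQ_DRAM_sum = 750
  #   Read (32B)  = 400
  #   Read (64B)  = 1000 - 400 = 600
  #   HBM Read    = 750
  #   Remote Read = MAX(1000 - 750, 0) = 250
  # The MAX(..., 0) clamp keeps Remote Read at 0 if the DRAM counter ever exceeds
  # the total request counter.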
@@ -1,350 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  data source:
    - metric_table:
        id: 1801
        title: Aggregate Stats (All channels)
        header:
          metric: Metric
          avg: Avg
          std dev: Std Dev
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          L2 Cache Hit Rate:
            avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
              + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100 *
              TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
              + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) + (100
              * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100 * TCC_HIT[15]))
              + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 * TCC_HIT[18])) + (100
              * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21])) + (100 * TCC_HIT[22]))
              + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) + (100 * TCC_HIT[25])) + (100
              * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100 * TCC_HIT[28])) + (100 * TCC_HIT[29]))
              + (100 * TCC_HIT[30])) + (100 * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
              + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
              + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
              + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
              + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
              + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
              + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
              + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
              + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
              + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
              + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
              + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
              + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
              + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
            unit: pct
            tips:
        # FIXME: add the other aggregate metrics
    - metric_table:
        id: 1802
        title: L2 Cache Hit Rate (pct)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
              + TCC_MISS[::_1]) != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1803
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_REQ[::_1]) / $denom)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1804
        title: L2 Requests (per normUnit)
        header:
          metric: Channel
          read req: L2 Read
          write req: L2 Write
          atomic req: L2 Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1805
        title: L2-Fabric Requests (per normUnit)
        header:
          metric: Channel
          read req: L2-Fabric Read
          write req: L2-Fabric Write and Atomic
          atomic req: L2-Fabric Atomic
        metric:
          "::_1":
            read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
            write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
            atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    # - metric_table:
    #     id: 1806
    #     title: L2-EA Latency (Cycles)
    #     header:
    #       metric: Metric
    #       read lat: L2-EA Read
    #       write lat: L2-EA Write
    #       atomic lat: L2-EA Atomic
    #     metric:
    #       "::_1":
    #         read lat:
    #           AVG(((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
    #           != 0) else None))
    #         write lat:
    #           AVG(((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
    #           != 0) else None))
    #         atomic lat:
    #           AVG(((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
    #           (TCC_EA_ATOMIC[::_1] != 0) else 0))
    #       placeholder_range:
    #         "::_1": 32
    #     cli_style: simple_multiple_bar
    - metric_table:
        id: 1806
        title: L2-Fabric Read Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
              != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1807
        title: L2-Fabric Write and Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr:
              ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
              != 0) else None)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1808
        title: L2-Fabric Atomic Latency (Cycles)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if
              (TCC_EA_ATOMIC[::_1] != 0) else 0)
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_box
    - metric_table:
        id: 1809
        title: L2-Fabric Read Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea read stall - pcie: L2-Fabric Read Stall (PCIe)
          ea read stall - if: L2-Fabric Read Stall (Infinity Fabric™)
          ea read stall - hbm: L2-Fabric Read Stall (HBM)
        metric:
          "::_1":
            ea read stall - pcie: None # Missing perfmon
            ea read stall - if: None # Missing perfmon
            ea read stall - hbm: None # Missing perfmon
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1810
        title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
        header:
          metric: Channel
          ea write stall - pcie: L2-Fabric Write Stall (PCIe)
          ea write stall - if: L2-Fabric Write Stall (Infinity Fabric™)
          ea write stall - hbm: L2-Fabric Write Stall (HBM)
          ea write stall - starve: L2-Fabric Write Starve
        metric:
          "::_1":
            ea write stall - pcie: None # Missing perfmon
            ea write stall - if: None # Missing perfmon
            ea write stall - hbm: None # Missing perfmon
            ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1]) / $denom))
          placeholder_range:
            "::_1": $total_l2_chan
        cli_style: simple_multiple_bar
    - metric_table:
        id: 1812
        title: L2-Fabric (128B read requests per normUnit)
        header:
          metric: Channel
          expr: Expression
        metric:
          "::_1":
            expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
          placeholder_range:
            "::_1": $total_l2_chan
          # tips: Number of 128-byte read requests sent to EA
        cli_style: simple_box
        tui_style: simple_box
@@ -0,0 +1,323 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 1800
  title: L2 Cache (per Channel)
  metrics_description:
    L2 Cache Hit Rate: The percent of total number of requests to the L2 from all
      clients that hit in the cache. As noted in the Speed-of-Light section, this
      includes hit-on-miss requests.
  data source:
  - metric_table:
      id: 1801
      title: Aggregate Stats (All channels)
      header:
        metric: Metric
        avg: Avg
        std dev: Std Dev
        min: Min
        max: Max
        unit: Unit
      metric:
        L2 Cache Hit Rate:
          avg: AVG(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          std dev: STD(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100
            * TCC_HIT[1])) + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4]))
            + (100 * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100
            * TCC_HIT[8])) + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11]))
            + (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) +
            (100 * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100
            * TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 *
            TCC_HIT[21])) + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24]))
            + (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) +
            (100 * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100
            * TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          min: MIN(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          max: MAX(((((((((((((((((((((((((((((((((((100 * TCC_HIT[0]) + (100 * TCC_HIT[1]))
            + (100 * TCC_HIT[2])) + (100 * TCC_HIT[3])) + (100 * TCC_HIT[4])) + (100
            * TCC_HIT[5])) + (100 * TCC_HIT[6])) + (100 * TCC_HIT[7])) + (100 * TCC_HIT[8]))
            + (100 * TCC_HIT[9])) + (100 * TCC_HIT[10])) + (100 * TCC_HIT[11])) +
            (100 * TCC_HIT[12])) + (100 * TCC_HIT[13])) + (100 * TCC_HIT[14])) + (100
            * TCC_HIT[15])) + (100 * TCC_HIT[16])) + (100 * TCC_HIT[17])) + (100 *
            TCC_HIT[18])) + (100 * TCC_HIT[19])) + (100 * TCC_HIT[20])) + (100 * TCC_HIT[21]))
            + (100 * TCC_HIT[22])) + (100 * TCC_HIT[23])) + (100 * TCC_HIT[24])) +
            (100 * TCC_HIT[25])) + (100 * TCC_HIT[26])) + (100 * TCC_HIT[27])) + (100
            * TCC_HIT[28])) + (100 * TCC_HIT[29])) + (100 * TCC_HIT[30])) + (100 *
            TCC_HIT[31])) / ((((((((((((((((((((((((((((((((TCC_MISS[0] + TCC_HIT[0])
            + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2])) + (TCC_MISS[3]
            + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5] + TCC_HIT[5]))
            + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7])) + (TCC_MISS[8]
            + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10] + TCC_HIT[10]))
            + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12])) + (TCC_MISS[13]
            + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15] + TCC_HIT[15]))
            + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17])) + (TCC_MISS[18]
            + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20] + TCC_HIT[20]))
            + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22])) + (TCC_MISS[23]
            + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25] + TCC_HIT[25]))
            + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27])) + (TCC_MISS[28]
            + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30] + TCC_HIT[30]))
            + (TCC_MISS[31] + TCC_HIT[31]))) if (((((((((((((((((((((((((((((((((TCC_MISS[0]
            + TCC_HIT[0]) + (TCC_MISS[1] + TCC_HIT[1])) + (TCC_MISS[2] + TCC_HIT[2]))
            + (TCC_MISS[3] + TCC_HIT[3])) + (TCC_MISS[4] + TCC_HIT[4])) + (TCC_MISS[5]
            + TCC_HIT[5])) + (TCC_MISS[6] + TCC_HIT[6])) + (TCC_MISS[7] + TCC_HIT[7]))
            + (TCC_MISS[8] + TCC_HIT[8])) + (TCC_MISS[9] + TCC_HIT[9])) + (TCC_MISS[10]
            + TCC_HIT[10])) + (TCC_MISS[11] + TCC_HIT[11])) + (TCC_MISS[12] + TCC_HIT[12]))
            + (TCC_MISS[13] + TCC_HIT[13])) + (TCC_MISS[14] + TCC_HIT[14])) + (TCC_MISS[15]
            + TCC_HIT[15])) + (TCC_MISS[16] + TCC_HIT[16])) + (TCC_MISS[17] + TCC_HIT[17]))
            + (TCC_MISS[18] + TCC_HIT[18])) + (TCC_MISS[19] + TCC_HIT[19])) + (TCC_MISS[20]
            + TCC_HIT[20])) + (TCC_MISS[21] + TCC_HIT[21])) + (TCC_MISS[22] + TCC_HIT[22]))
            + (TCC_MISS[23] + TCC_HIT[23])) + (TCC_MISS[24] + TCC_HIT[24])) + (TCC_MISS[25]
            + TCC_HIT[25])) + (TCC_MISS[26] + TCC_HIT[26])) + (TCC_MISS[27] + TCC_HIT[27]))
            + (TCC_MISS[28] + TCC_HIT[28])) + (TCC_MISS[29] + TCC_HIT[29])) + (TCC_MISS[30]
            + TCC_HIT[30])) + (TCC_MISS[31] + TCC_HIT[31])) != 0) else None))
          unit: pct
  - metric_table:
      id: 1802
      title: L2 Cache Hit Rate (pct)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (((100 * TCC_HIT[::_1]) / (TCC_HIT[::_1] + TCC_MISS[::_1])) if ((TCC_HIT[::_1]
            + TCC_MISS[::_1]) != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1803
      title: L2 Requests (per normUnit)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (TO_INT(TCC_REQ[::_1]) / $denom)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1804
      title: L2 Requests (per normUnit)
      header:
        metric: Channel
        read req: L2 Read
        write req: L2 Write
        atomic req: L2 Atomic
      metric:
        ::_1:
          read req: AVG((TO_INT(TCC_READ[::_1]) / $denom))
          write req: AVG((TO_INT(TCC_WRITE[::_1]) / $denom))
          atomic req: AVG((TO_INT(TCC_ATOMIC[::_1]) / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1805
      title: L2-Fabric Requests (per normUnit)
      header:
        metric: Channel
        read req: L2-Fabric Read
        write req: L2-Fabric Write and Atomic
        atomic req: L2-Fabric Atomic
      metric:
        ::_1:
          read req: AVG((TO_INT(TCC_EA_RDREQ[::_1]) / $denom))
          write req: AVG((TO_INT(TCC_EA_WRREQ[::_1]) / $denom))
          atomic req: AVG((TO_INT(TCC_EA_ATOMIC[::_1]) / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1806
      title: L2-Fabric Read Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_RDREQ_LEVEL[::_1] / TCC_EA_RDREQ[::_1]) if (TCC_EA_RDREQ[::_1]
            != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1807
      title: L2-Fabric Write and Atomic Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_WRREQ_LEVEL[::_1] / TCC_EA_WRREQ[::_1]) if (TCC_EA_WRREQ[::_1]
            != 0) else None)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1808
      title: L2-Fabric Atomic Latency (Cycles)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: ((TCC_EA_ATOMIC_LEVEL[::_1] / TCC_EA_ATOMIC[::_1]) if (TCC_EA_ATOMIC[::_1]
            != 0) else 0)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
  - metric_table:
      id: 1809
      title: L2-Fabric Read Stall (Cycles per normUnit)
      header:
        metric: Channel
        ea read stall - pcie: L2-Fabric Read Stall (PCIe)
        ea read stall - if: "L2-Fabric Read Stall (Infinity Fabric\u2122)"
        ea read stall - hbm: L2-Fabric Read Stall (HBM)
      metric:
        ::_1:
          ea read stall - pcie: None
          ea read stall - if: None
          ea read stall - hbm: None
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1810
      title: L2-Fabric Write and Atomic Stall (Cycles per normUnit)
      header:
        metric: Channel
        ea write stall - pcie: L2-Fabric Write Stall (PCIe)
        ea write stall - if: "L2-Fabric Write Stall (Infinity Fabric\u2122)"
        ea write stall - hbm: L2-Fabric Write Stall (HBM)
        ea write stall - starve: L2-Fabric Write Starve
      metric:
        ::_1:
          ea write stall - pcie: None
          ea write stall - if: None
          ea write stall - hbm: None
          ea write stall - starve: AVG((TO_INT(TCC_TOO_MANY_EA_WRREQS_STALL[::_1])
            / $denom))
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_multiple_bar
      tui_style: simple_multiple_bar
  - metric_table:
      id: 1812
      title: L2-Fabric (128B read requests per normUnit)
      header:
        metric: Channel
        expr: Expression
      metric:
        ::_1:
          expr: (TO_INT(TCC_BUBBLE[::_1]) / $denom)
        placeholder_range:
          ::_1: $total_l2_chan
      cli_style: simple_box
      tui_style: simple_box
@@ -1,10 +1,11 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 2100
  title: PC Sampling
  metrics_description: {}
  data source:
-    - pc_sampling_table:
+  - pc_sampling_table:
-        id: 2101
+      id: 2101
-        title: PC Sampling
+      title: PC Sampling
-        source: ps_file
+      source: ps_file
-        comparable: false # enable it later
+      comparable: false
@@ -1,14 +1,14 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
-  id: 000
+  id: 0
  title: Top Stats
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 001
+      id: 1
-        title: Top Kernels
+      title: Top Kernels
-        source: pmc_kernel_top.csv
+      source: pmc_kernel_top.csv
-
+  - raw_csv_table:
-    - raw_csv_table:
+      id: 2
-        id: 002
+      title: Dispatch List
-        title: Dispatch List
+      source: pmc_dispatch_info.csv
        source: pmc_dispatch_info.csv
@@ -1,9 +1,10 @@
---
+# AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 100
  title: System Info
  metrics_description: {}
  data source:
-    - raw_csv_table:
+  - raw_csv_table:
-        id: 101
+      id: 101
-        source: sysinfo.csv
+      source: sysinfo.csv
-        columnwise: True
+      columnwise: true
@@ -1,262 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
  SALU: &SALU_anchor Scalar Arithmetic Logic Unit
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 200
  title: System Speed-of-Light
  data source:
    - metric_table:
        id: 201
        title: Speed-of-Light
        header:
          metric: Metric
          value: Avg
          unit: Unit
          peak: Peak
          pop: Pct of Peak
          tips: Tips
        metric:
          VALU FLOPs:
            value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) + SQ_INSTS_VALU_TRANS_F16)
              + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32 + SQ_INSTS_VALU_MUL_F32)
              + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32)))) + (64 * (((SQ_INSTS_VALU_ADD_F64
              + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64) + (2 * SQ_INSTS_VALU_FMA_F64))))
              / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
              + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
              + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
              + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
              + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))) / (((($max_sclk
              * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          VALU IOPs:
            value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP/s
            peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
            pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
              - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
            tips:
          MFMA FLOPs (F8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
            tips:
          MFMA FLOPs (BF16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
            tips:
          MFMA FLOPs (F16):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
            tips:
          MFMA FLOPs (F32):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA FLOPs (F64):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GFLOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
            tips:
          MFMA IOPs (Int8):
            value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
            unit: GIOP/s
            peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
            pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
            tips:
          Active CUs:
            value: $numActiveCUs
            unit: CUs
            peak: $cu_per_gpu
            pop: ((100 * $numActiveCUs) / $cu_per_gpu)
            tips:
          SALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          VALU Utilization:
            value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
            tips:
          MFMA Utilization:
            value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
              * 4)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)
              * 4)))
            tips:
          VMEM Utilization:
            value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            peak: 100
            pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            tips:
          Branch Utilization:
            value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            unit: pct
            peak: 100
            pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
            tips:
          VALU Active Threads:
            value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
              != 0) else None))
            unit: Threads
            peak: $wave_size
            pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size) if (SQ_ACTIVE_INST_VALU != 0) else None))
            tips:
          IPC:
            value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
            unit: Instr/cycle
            peak: 5
            pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
            tips:
          Wavefront Occupancy:
            value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            peak: ($max_waves_per_cu * $cu_per_gpu)
            pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
              * $cu_per_gpu))))
            coll_level: SQ_LEVEL_WAVES
            tips:
          Theoretical LDS Bandwidth:
            value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: (($max_sclk * $cu_per_gpu) * 0.128)
            pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
              / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
            tips:
          LDS Bank Conflicts/Access:
            value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
            unit: Conflicts/access
            peak: 32
            pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
              if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
            tips:
          vL1D Cache Hit Rate:
            value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            unit: pct
            peak: 100
            pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum) +
              TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) /
              TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None))
            tips:
          vL1D Cache BW:
            value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 128) * $cu_per_gpu)
            pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
            tips:
          L2 Cache Hit Rate:
            value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else None))
            tips:
          L2 Cache BW:
            value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan))
            pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
              / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
            tips:
          L2-Fabric Read BW:
            value: AVG((128 * TCC_BUBBLE_sum +
                        64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) +
                        32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * (AVG((128 * TCC_BUBBLE_sum +
                        64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum - TCC_EA0_RDREQ_32B_sum) +
                        32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Write BW:
            value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))
            unit: GB/s
            peak: $hbmBandwidth
            pop: ((100 * AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
              * 32)) / (End_Timestamp - Start_Timestamp)))) / $hbmBandwidth)
            tips:
          L2-Fabric Read Latency:
            value: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          L2-Fabric Write Latency:
            value: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
              != 0) else None))
            unit: Cycles
            peak: None
            pop: None
            tips:
          sL1D Cache Hit Rate:
            value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            unit: pct
            peak: 100
            pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
              if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
            tips:
          sL1D Cache BW:
            value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Hit Rate:
            value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            unit: pct
            peak: 100
            pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
            tips:
          L1I BW:
            value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
            unit: GB/s
            peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
            pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))) / ((($max_sclk
              / 1000) * 64) * $sqc_per_gpu))
            tips:
          L1I Fetch Latency:
            value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
            unit: Cycles
            peak: None
            pop: None
            coll_level: SQ_IFETCH_LEVEL
            tips:
@@ -0,0 +1,346 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 200
  title: System Speed-of-Light
  metrics_description:
    VALU FLOPs: 'The total floating-point operations executed per second on the VALU.
      This is also presented as a percent of the peak theoretical FLOPs achievable
      on the specific accelerator. Note: this does not include any floating-point
      operations from MFMA instructions.'
    VALU IOPs: 'The total integer operations executed per second on the VALU. This
      is also presented as a percent of the peak theoretical IOPs achievable on the
      specific accelerator. Note: this does not include any integer operations from
      MFMA instructions.'
    MFMA FLOPs (F8): The total number of 8-bit floating point MFMA operations
      executed per second. This does not include any 8-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F8 MFMA operations achievable on the specific accelerator. It is supported on
      AMD Instinct MI300 series and later only.
    MFMA FLOPs (BF16): 'The total number of 16-bit brain floating point MFMA operations
      executed per second. Note: this does not include any 16-bit brain floating point
      operations from VALU instructions. This is also presented as a percent of the
      peak theoretical BF16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F16): 'The total number of 16-bit floating point MFMA operations executed
      per second. Note: this does not include any 16-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F16 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F32): 'The total number of 32-bit floating point MFMA operations executed
      per second. Note: this does not include any 32-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F32 MFMA operations achievable on the specific accelerator.'
    MFMA FLOPs (F64): 'The total number of 64-bit floating point MFMA operations executed
      per second. Note: this does not include any 64-bit floating point operations
      from VALU instructions. This is also presented as a percent of the peak theoretical
      F64 MFMA operations achievable on the specific accelerator.'
    MFMA IOPs (Int8): 'The total number of 8-bit integer MFMA operations executed
      per second. Note: this does not include any 8-bit integer operations from VALU
      instructions. This is also presented as a percent of the peak theoretical INT8
      MFMA operations achievable on the specific accelerator.'
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    SALU Utilization: Indicates what percent of the kernel's duration the SALU was
      busy executing instructions. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing SALU or SMEM instructions over the total CU cycles.
    VALU Utilization: Indicates what percent of the kernel's duration the VALU was
      busy executing instructions. Does not include VMEM operations. Computed as the
      ratio of the total number of cycles spent by the scheduler issuing VALU instructions
      over the total CU cycles.
    MFMA Utilization: Indicates what percent of the kernel's duration the MFMA unit
      was busy executing instructions. Computed as the ratio of the total number of
      cycles the MFMA was busy over the total CU cycles.
    VMEM Utilization: Indicates what percent of the kernel's duration the VMEM unit
      was busy executing instructions, including both global/generic and spill/scratch
      operations (see the VMEM instruction count metrics for more detail). Does not
      include VALU operations. Computed as the ratio of the total number of cycles
      spent by the scheduler issuing VMEM instructions over the total CU cycles.
    Branch Utilization: Indicates what percent of the kernel's duration the branch
      unit was busy executing instructions. Computed as the ratio of the total number
      of cycles spent by the scheduler issuing branch instructions over the total
      CU cycles.
    VALU Active Threads: Indicates the average level of divergence within a wavefront
      over the lifetime of the kernel. The number of work-items that were active in
      a wavefront during execution of each VALU instruction, time-averaged over all
      VALU instructions run on all wavefronts in the kernel.
    IPC: The ratio of the total number of instructions executed on the CU over the
      total active CU cycles. This is also presented as a percent of the peak theoretical
      IPC achievable on the specific accelerator.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1ms). This is also presented as a percent of the peak theoretical
      occupancy achievable on the specific accelerator.'
    Theoretical LDS Bandwidth: Indicates the maximum amount of bytes that could have
      been loaded from, stored to, or atomically updated in the LDS per unit time
      (see LDS Bandwidth example for more detail). This is also presented as a percent
      of the peak theoretical bandwidth achievable on the specific accelerator.
    LDS Bank Conflicts/Access: The ratio of the number of cycles spent in the LDS
      scheduler due to bank conflicts (as determined by the conflict resolution hardware)
      to the base number of cycles that would be spent in the LDS scheduler in a completely
      uncontended case. This is also presented in normalized form (i.e., the Bank
      Conflict Rate).
    vL1D Cache Hit Rate: The ratio of the number of vL1D cache line requests that
      hit in vL1D cache over the total number of cache line requests to the vL1D cache
      RAM.
    vL1D Cache BW: The number of bytes looked up in the vL1D cache as a result of
      VMEM instructions per unit time. The number of bytes is calculated as the number
      of cache lines requested multiplied by the cache line size. This value does
      not consider partial requests, so e.g., if only a single value is requested
      in a cache line, the data movement will still be counted as a full cache line.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L2 Cache Hit Rate: The ratio of the number of L2 cache line requests that hit
      in the L2 cache over the total number of incoming cache line requests to the
      L2 cache.
    L2 Cache BW: The number of bytes looked up in the L2 cache per unit time. The
      number of bytes is calculated as the number of cache lines requested multiplied
      by the cache line size. This value does not consider partial requests, so e.g.,
      if only a single value is requested in a cache line, the data movement will
      still be counted as a full cache line. This is also presented as a percent of
      the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read BW: "The number of bytes read by the L2 over the Infinity Fabric\u2122\
      \ interface per unit time. This is also presented as a percent of the peak theoretical\
      \ bandwidth achievable on the specific accelerator."
    L2-Fabric Write BW: The number of bytes sent by the L2 over the Infinity Fabric
      interface by write and atomic operations per unit time. This is also presented
      as a percent of the peak theoretical bandwidth achievable on the specific accelerator.
    L2-Fabric Read Latency: The time-averaged number of cycles read requests spent
      in Infinity Fabric before data was returned to the L2.
    L2-Fabric Write Latency: The time-averaged number of cycles write requests spent
      in Infinity Fabric before a completion acknowledgement was returned to the L2.
    sL1D Cache Hit Rate: The percent of sL1D requests that hit on a previously loaded
      line in the cache. Calculated as the ratio of the number of sL1D requests that
      hit over the number of all sL1D requests.
    sL1D Cache BW: The number of bytes looked up in the sL1D cache per unit time.
      This is also presented as a percent of the peak theoretical bandwidth achievable
      on the specific accelerator.
    L1I Hit Rate: The percent of L1I requests that hit on a previously loaded line in the
      cache. Calculated as the ratio of the number of L1I requests that hit over the
      number of all L1I requests.
    L1I BW: The number of bytes looked up in the L1I cache per unit time. This is also
      presented as a percent of the peak theoretical bandwidth achievable on the
      specific accelerator.
    L1I Fetch Latency: The average number of cycles spent to fetch instructions to
      a CU.
  data source:
  - metric_table:
      id: 201
      title: System Speed-of-Light
      header:
        metric: Metric
        value: Avg
        unit: Unit
        peak: Peak
        pop: Pct of Peak
      metric:
        VALU FLOPs:
          value: AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16) +
            SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((((64 * (((SQ_INSTS_VALU_ADD_F16 + SQ_INSTS_VALU_MUL_F16)
            + SQ_INSTS_VALU_TRANS_F16) + (2 * SQ_INSTS_VALU_FMA_F16))) + (64 * (((SQ_INSTS_VALU_ADD_F32
            + SQ_INSTS_VALU_MUL_F32) + SQ_INSTS_VALU_TRANS_F32) + (2 * SQ_INSTS_VALU_FMA_F32))))
            + (64 * (((SQ_INSTS_VALU_ADD_F64 + SQ_INSTS_VALU_MUL_F64) + SQ_INSTS_VALU_TRANS_F64)
            + (2 * SQ_INSTS_VALU_FMA_F64)))) / (End_Timestamp - Start_Timestamp))))
            / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        VALU IOPs:
          value: AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))
          unit: GIOP/s
          peak: (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000)
          pop: ((100 * AVG(((64 * (SQ_INSTS_VALU_INT32 + SQ_INSTS_VALU_INT64)) / (End_Timestamp
            - Start_Timestamp)))) / (((($max_sclk * $cu_per_gpu) * 64) * 2) / 1000))
        MFMA FLOPs (F8):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
        MFMA FLOPs (BF16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_BF16 * 512) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
        MFMA FLOPs (F16):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 2048) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F16 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 2048) / 1000))
        MFMA FLOPs (F32):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F32 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA FLOPs (F64):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GFLOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 256) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_F64 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 256) / 1000))
        MFMA IOPs (Int8):
          value: AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp - Start_Timestamp)))
          unit: GIOP/s
          peak: ((($max_sclk * $cu_per_gpu) * 4096) / 1000)
          pop: ((100 * AVG(((SQ_INSTS_VALU_MFMA_MOPS_I8 * 512) / (End_Timestamp -
            Start_Timestamp)))) / ((($max_sclk * $cu_per_gpu) * 4096) / 1000))
        Active CUs:
          value: $numActiveCUs
          unit: CUs
          peak: $cu_per_gpu
          pop: ((100 * $numActiveCUs) / $cu_per_gpu)
        SALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_SCA) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        VALU Utilization:
          value: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_ACTIVE_INST_VALU) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu)))
        MFMA Utilization:
          value: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQ_VALU_MFMA_BUSY_CYCLES) / (($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu) * 4)))
        VMEM Utilization:
          value: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
          unit: pct
          peak: 100
          pop: AVG((((100 * (SQ_ACTIVE_INST_FLAT+SQ_ACTIVE_INST_VMEM)) / $GRBM_GUI_ACTIVE_PER_XCD)
            / $cu_per_gpu))
        Branch Utilization:
          value: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
          unit: pct
          peak: 100
          pop: AVG((((100 * SQ_ACTIVE_INST_MISC) / $GRBM_GUI_ACTIVE_PER_XCD) / $cu_per_gpu))
        VALU Active Threads:
          value: AVG(((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU) if (SQ_ACTIVE_INST_VALU
            != 0) else None))
          unit: Threads
          peak: $wave_size
          pop: (100 * AVG((SQ_THREAD_CYCLES_VALU / SQ_ACTIVE_INST_VALU / $wave_size)
            if (SQ_ACTIVE_INST_VALU != 0) else None))
        IPC:
          value: AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))
          unit: Instr/cycle
          peak: 5
          pop: ((100 * AVG((SQ_INSTS / SQ_BUSY_CU_CYCLES))) / 5)
        Wavefront Occupancy:
          value: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          peak: ($max_waves_per_cu * $cu_per_gpu)
          pop: (100 * AVG(((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / ($max_waves_per_cu
            * $cu_per_gpu))))
          coll_level: SQ_LEVEL_WAVES
        Theoretical LDS Bandwidth:
          value: AVG(((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: (($max_sclk * $cu_per_gpu) * 0.128)
          pop: AVG((((((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) * 4) * TO_INT($lds_banks_per_cu))
            / (End_Timestamp - Start_Timestamp)) / (($max_sclk * $cu_per_gpu) * 0.00128)))
        LDS Bank Conflicts/Access:
          value: AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))
          unit: Conflicts/access
          peak: 32
          pop: ((100 * AVG(((SQ_LDS_BANK_CONFLICT / (SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT))
            if ((SQ_LDS_IDX_ACTIVE - SQ_LDS_BANK_CONFLICT) != 0) else None))) / 32)
        vL1D Cache Hit Rate:
          value: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
          unit: pct
          peak: 100
          pop: AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None))
        vL1D Cache BW:
          value: AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 128) * $cu_per_gpu)
          pop: ((100 * AVG(((TCP_TOTAL_CACHE_ACCESSES_sum * 128) / (End_Timestamp
            - Start_Timestamp)))) / ((($max_sclk / 1000) * 128) * $cu_per_gpu))
        L2 Cache Hit Rate:
          value: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
            + TCC_MISS_sum) != 0) else None))
        L2 Cache BW:
          value: AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan))
          pop: ((100 * AVG(((TCC_REQ_sum * 128) / (End_Timestamp - Start_Timestamp))))
            / ((($max_sclk / 1000) * 128) * TO_INT($total_l2_chan)))
        L2-Fabric Read BW:
          value: AVG((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
            - Start_Timestamp))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * (AVG((128 * TCC_BUBBLE_sum + 64 * (TCC_EA0_RDREQ_sum - TCC_BUBBLE_sum
            - TCC_EA0_RDREQ_32B_sum) + 32 * TCC_EA0_RDREQ_32B_sum) / (End_Timestamp
            - Start_Timestamp)))) / $hbmBandwidth)
        L2-Fabric Write BW:
          value: AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum - TCC_EA0_WRREQ_64B_sum)
            * 32)) / (End_Timestamp - Start_Timestamp)))
          unit: GB/s
          peak: $hbmBandwidth
          pop: ((100 * AVG((((TCC_EA0_WRREQ_64B_sum * 64) + ((TCC_EA0_WRREQ_sum -
            TCC_EA0_WRREQ_64B_sum) * 32)) / (End_Timestamp - Start_Timestamp)))) /
            $hbmBandwidth)
        L2-Fabric Read Latency:
          value: AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        L2-Fabric Write Latency:
          value: AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else None))
          unit: Cycles
          peak: None
          pop: None
        sL1D Cache Hit Rate:
          value: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
          unit: pct
          peak: 100
          pop: AVG((((100 * SQC_DCACHE_HITS) / (SQC_DCACHE_HITS + SQC_DCACHE_MISSES))
            if ((SQC_DCACHE_HITS + SQC_DCACHE_MISSES) != 0) else None))
        sL1D Cache BW:
          value: AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_DCACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Hit Rate:
          value: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
          unit: pct
          peak: 100
          pop: AVG(((100 * SQC_ICACHE_HITS) / (SQC_ICACHE_HITS + SQC_ICACHE_MISSES)))
        L1I BW:
          value: AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) * 64))
          unit: GB/s
          peak: ((($max_sclk / 1000) * 64) * $sqc_per_gpu)
          pop: ((100 * AVG(((SQC_ICACHE_REQ / (End_Timestamp - Start_Timestamp)) *
            64))) / ((($max_sclk / 1000) * 64) * $sqc_per_gpu))
        L1I Fetch Latency:
          value: AVG((SQ_ACCUM_PREV_HIRES / SQ_IFETCH))
          unit: Cycles
          peak: None
          pop: None
          coll_level: SQ_IFETCH_LEVEL
@@ -1,315 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 300
  title: Memory Chart
  data source:
    - metric_table:
        id: 301
        title: Memory Chart
        header:
          metric: Metric
          #alias: #alias
          value: Value
          tips: Tips
        metric:
          # ----------------------------------------
          # Instr Buff Block
          #TODO: double check wave_occupancy
          Wavefront Occupancy:
            #alias: wave_occ_
            value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs), 0)
            coll_level: SQ_LEVEL_WAVES
            tips:
          Wave Life:
            #alias: wave_life_
            value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else 0)), 0)
            tips:
          # ----------------------------------------
          # Instr Dispatch Block
          SALU:
            #alias: salu_
            value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
            tips:
          SMEM:
            #alias: smem_
            value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
            tips:
          VALU:
            #alias: valu_
            value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
            tips:
          MFMA:
            #alias: mfma_
            value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
            tips:
          VMEM:
            #alias: vmem_
            value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
            tips:
          LDS:
            #alias: lds_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          GWS:
            #alias: gws_
            value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
            tips:
          BR:
            #alias: br_
            value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
            tips:
          # ----------------------------------------
          # Exec Block
          Active CUs:
            #alias: active_cu_
            value: $numActiveCUs
            tips:
          Num CUs:
            #alias: num_cu_
            value: $cu_per_gpu
            tips:
          VGPR:
            #alias: vgpr_
            value: ROUND(AVG(Arch_VGPR), 0)
            tips:
          # Todo: add AGPRs
          SGPR:
            #alias: sgpr_
            value: ROUND(AVG(SGPR), 0)
            tips:
          LDS Allocation:
            #alias: lds_alloc_
            value: ROUND(AVG(LDS_Per_Workgroup), 0)
            tips:
          Scratch Allocation:
            #alias: scratch_alloc_
            value: ROUND(AVG(Scratch_Per_Workitem), 0)
            tips:
          Wavefronts:
            #alias: wavefronts_
            value: ROUND(AVG(SPI_CSN_WAVE), 0)
            tips:
          Workgroups:
            #alias: workgroups_
            value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
            tips:
          # ----------------------------------------
          # LDS Block
          LDS Req:
            #alias: lds_req_
            value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
            tips:
          LDS Util:
            #alias: lds_util_
            value:
              ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))),
              0)
            tips:
          LDS Latency:
            #alias: lds_lat
            value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS != 0) else None)),0)
            coll_level: SQ_INST_LEVEL_LDS
            tips:
          # ----------------------------------------
          # Vector L1 Cache Block
          VL1 Rd:
            #alias: vl1_rd_
            value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
            tips:
          VL1 Wr:
            #alias: vl1_wr_
            value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
            tips:
          VL1 Atomic:
            #alias: vl1_atom_
            value:
              ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
              / $denom)), 0)
            tips:
          VL1 Hit:
            #alias: vl1_hit_
            value:
              ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
              + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
              / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0) else
              None )), 0)
            tips:
          VL1 Lat:
            #alias: vl1_lat_
            value:
              ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
              != 0) else None)), 0)
            tips:
          VL1 Coalesce:
            #alias: vl1_coales_
            value:
              ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
              * 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
            tips:
          VL1 Stall:
            #alias: vl1_stall_
            value:
              ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
              if (TCP_GATE_EN1_sum != 0) else None)), 0)
            tips:
          VL1_L2 Rd:
            #alias: vl1_l2_rd_
            value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Wr:
            #alias: vl1_l2_wr_
            value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
            tips:
          VL1_L2 Atomic:
            #alias: vl1_l2_atom_
            value:
              ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              / $denom)), 0)
            tips:
          # ----------------------------------------
          # Scalar L1D Cache Block
          VL1D Rd:
            #alias: sl1_rd_
            value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
            tips:
          VL1D Hit:
            #alias: sl1_hit_
            value:
              ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            tips:
          VL1D Lat:
            #alias: sl1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ !=
              0) else None)) * 100), 0)
            coll_level: SQC_DCACHE_INFLIGHT_LEVEL
            tips:
          VL1D_L2 Rd:
            #alias: sl1_l2_rd_
            value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
            tips:
          VL1D_L2 Wr:
            #alias: sl1_l2_wr_
            value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
            tips:
          VL1D_L2 Atomic:
            #alias: sl1_l2_atom_
            value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # Instr L1  Cache Block
          IL1 Fetch:
            #alias: il1_fetch_
            value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
            tips:
          IL1 Hit:
            #alias: il1_hit_
            value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
            tips:
          IL1 Lat:
            #alias: il1_lat_
            value:
              ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ !=
              0) else None)) * 100), 0)
            tips: # ??? coll_level: SQ_IFETCH_LEVEL
          IL1_L2 Rd:
            #alias: il1_l2_req_
            value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
            tips:
          # ----------------------------------------
          # L2 Cache Block(inside)
          L2 Rd:
            #alias: l2_rd_
            value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
            tips:
          L2 Wr:
            #alias: l2_wr_
            value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
            tips:
          L2 Atomic:
            #alias: l2_atom_
            value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
            tips:
          L2 Hit:
            #alias: l2_hit_
            value:
              ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if ((TCC_HIT_sum
              + TCC_MISS_sum) != 0) else 0)), 0)
            tips:
          L2 Rd Lat:
            #alias: l2_rd_lat_
            value:
              # ROUND(AVG(((TCP_TCC_READ_REQ_LATENCY_sum / (TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum))
              # if ((TCP_TCC_READ_REQ_sum + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) != 0) else None)),
              # 0)
            tips:
          L2 Wr Lat:
            #alias: l2_wr_lat_
            value:
              # ROUND(AVG(((TCP_TCC_WRITE_REQ_LATENCY_sum / (TCP_TCC_WRITE_REQ_sum +
              # TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)) if ((TCP_TCC_WRITE_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
              # != 0) else None)), 0)
            tips:
          # ----------------------------------------
          # Fabric Block
          Fabric_L2 Rd:
            #alias: l2_fabric_rd_
            value: ROUND(AVG((TCC_EA0_RDREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Wr:
            #alias: l2_fabric_wr_
            value: ROUND(AVG((TCC_EA0_WRREQ_sum / $denom)), 0)
            tips:
          Fabric_L2 Atomic:
            #alias: l2_fabric_atom_
            value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
            tips:
          Fabric Rd Lat:
            #alias: fabric_rd_lat_
            value:
              ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Wr Lat:
            #alias: fabric_wr_lat_
            value:
              ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
              != 0) else  0)), 0)
            tips:
          Fabric Atomic Lat:
            #alias: fabric_atom_lat_
            value:
              ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
              != 0) else  0)), 0)
            tips:
          HBM Rd:
            #alias: hbm_rd_
            value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
            tips:
          HBM Wr:
            #alias: hbm_wr_
            value: ROUND(AVG((TCC_EA0_WRREQ_DRAM_sum / $denom)), 0)
            tips:
        comparable: false # for now
        cli_style: mem_chart
@@ -0,0 +1,263 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 300
  title: Memory Chart
  metrics_description:
    Wavefront Occupancy: Wavefronts per active CU.
    Wave Life: Average number of cycles executing a wave.
    SALU: Total number of SALU (Scalar ALU) instructions issued per normalization
      unit.
    SMEM: Total number of SMEM (Scalar Memory Read) instructions issued per normalization
      unit.
    VALU: The number of VALU (Vector ALU) instructions issued per normalization unit.
    MFMA: Total number of MFMA (Matrix-Fused-Multiply-Add) instructions issued per
      normalization unit.
    VMEM: The number of VMEM (GPU Memory) read instructions issued (including FLAT/scratch
      memory) per normalization unit.
    LDS: The total number of LDS instructions (including, but not limited to, read/write/atomics
      and HIP's __shfl instructions) executed per normalization unit.
    GWS: Total number of GDS (global data sync) instructions issued per normalization
      unit.
    BR: Total number of BRANCH instructions issued per normalization unit.
    Active CUs: Total number of active compute units (CUs) on the accelerator during
      the kernel execution.
    Num CUs: Total number of compute units (CUs) on the accelerator.
    VGPR: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    SGPR: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Wavefronts: The total number of wavefronts, summed over all workgroups, forming
      this kernel launch.
    Workgroups: The total number of workgroups forming this kernel launch.
    LDS Req: The total number of LDS instructions (including, but not limited to,
      read/write/atomics and HIP's __shfl instructions) executed per normalization
      unit.
    LDS Util: Indicates what percent of the kernel's duration the LDS was actively
      executing instructions (including, but not limited to, load, store, atomic and
      HIP's __shfl operations). Calculated as the ratio of the total number of cycles
      LDS was active over the total CU cycles.
    LDS Latency: The average number of round-trip cycles (i.e., from issue to data-return
      / acknowledgment) required for an LDS instruction to complete.
    VL1 Rd: The total number of incoming read requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Wr: The total number of incoming write requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Atomic: The total number of incoming atomic requests from the address processing
      unit after coalescing per normalization unit.
    VL1 Hit: The ratio of the number of vL1D cache line requests that hit in vL1D
      cache over the total number of cache line requests to the vL1D Cache RAM.
    VL1 Lat: Calculated as the average number of cycles that a vL1D cache line request
      spent in the vL1D cache pipeline.
    VL1 Coalesce: Indicates how well memory instructions were coalesced by the address
      processing unit, ranging from uncoalesced (25%) to fully coalesced (100%). Calculated
      as the average number of thread-requests generated per instruction divided by
      the ideal number of thread-requests per instruction.
    VL1 Stall: The ratio of the number of cycles where the vL1D is stalled waiting
      to issue a request for data to the L2 cache divided by the number of cycles
      where the vL1D is active.
    VL1_L2 Rd: The number of read requests for a vL1D cache line that were not satisfied
      by the vL1D and must be retrieved from the L2 cache, per normalization
      unit.
    VL1_L2 Wr: The number of write requests to a vL1D cache line that were sent through
      the vL1D to the L2 cache, per normalization unit.
    VL1_L2 Atomic: The number of atomic requests that are sent through the vL1D to
      the L2 cache, per normalization unit. This includes requests for atomics with,
      and without return.
    sL1D Rd: The total number of requests, of any size or type, made to the sL1D per
      normalization unit.
    sL1D Hit: The percent of sL1D requests that hit on a previously loaded cache
      line. Calculated as the ratio of sL1D hits over all sL1D requests.
    sL1D_L2 Rd: The total number of read requests from sL1D to the L2, per normalization
      unit.
    sL1D_L2 Wr: The total number of write requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    sL1D_L2 Atomic: The total number of atomic requests from sL1D to the L2, per normalization
      unit. Typically unused on current CDNA accelerators.
    IL1 Fetch: The total number of requests made to the L1I per normalization unit.
    IL1 Hit: The percent of L1I requests that hit on a previously loaded line in the
      cache. Calculated as the ratio of the number of L1I requests that hit over the
      number of all L1I requests.
    IL1 Lat: The average number of cycles spent to fetch instructions to a CU.
    IL1_L2 Rd: The total number of requests across the L1I - L2 interface per normalization unit.
    L2 Rd: The total number of read requests to the L2 from all clients.
    L2 Wr: The total number of write requests to the L2 from all clients.
    L2 Atomic: The total number of atomic requests (with and without return) to the
      L2 from all clients.
    L2 Hit: The ratio of the number of L2 cache line requests that hit in the L2 cache
      over the total number of incoming cache line requests to the L2 cache.
    L2 Rd Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive read requests from the L2 Cache. This number also includes
      requests for atomics with return values.
    L2 Wr Lat: Calculated as the average number of cycles that the vL1D cache took
      to issue and receive acknowledgement of a write request to the L2 Cache. This
      number also includes requests for atomics without return values.
    Fabric_L2 Rd: Number of L2 cache - Infinity Fabric read requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Wr: Number of L2 cache - Infinity Fabric write requests (either 32-byte
      or 64-byte) summed over TCC instances per normalization unit.
    Fabric_L2 Atomic: Number of L2 cache - Infinity Fabric write requests (either
      32-byte or 64-byte) that are actually atomic requests summed over TCC instances
      per normalization unit.
    Fabric Rd Lat: The time-averaged number of cycles read requests spent in Infinity
      Fabric before data was returned to the L2.
    Fabric Wr Lat: The time-averaged number of cycles write requests spent in Infinity
      Fabric before a completion acknowledgement was returned to the L2.
    Fabric Atomic Lat: The time-averaged number of cycles atomic requests spent in
      Infinity Fabric before a completion acknowledgement (atomic without return value)
      or data (atomic with return value) was returned to the L2.
    HBM Rd: The total number of L2 requests to Infinity Fabric to read 32B or 64B
      of data from the accelerator's local HBM, per normalization unit.
    HBM Wr: 'The total number of L2 requests to Infinity Fabric to write or atomically
      update 32B or 64B of data in the accelerator''s local HBM, per normalization
      unit. '
  data source:
  - metric_table:
      id: 301
      title: Memory Chart
      header:
        metric: Metric
        value: Value
      metric:
        Wavefront Occupancy:
          value: ROUND(AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD) / $numActiveCUs),
            0)
          coll_level: SQ_LEVEL_WAVES
        Wave Life:
          value: ROUND(AVG(((4 * (SQ_WAVE_CYCLES / SQ_WAVES)) if (SQ_WAVES != 0) else
            0)), 0)
        SALU:
          value: ROUND(AVG((SQ_INSTS_SALU / $denom)), 0)
        SMEM:
          value: ROUND(AVG((SQ_INSTS_SMEM / $denom)), 0)
        VALU:
          value: ROUND(AVG((SQ_INSTS_VALU / $denom)), 0)
        MFMA:
          value: ROUND(AVG((SQ_INSTS_MFMA / $denom)), 0)
        VMEM:
          value: ROUND(AVG((SQ_INSTS_VMEM / $denom)), 0)
        LDS:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        GWS:
          value: ROUND(AVG((SQ_INSTS_GDS / $denom)), 0)
        BR:
          value: ROUND(AVG((SQ_INSTS_BRANCH / $denom)), 0)
        Active CUs:
          value: $numActiveCUs
        Num CUs:
          value: $cu_per_gpu
        VGPR:
          value: ROUND(AVG(Arch_VGPR), 0)
        SGPR:
          value: ROUND(AVG(SGPR), 0)
        LDS Allocation:
          value: ROUND(AVG(LDS_Per_Workgroup), 0)
        Scratch Allocation:
          value: ROUND(AVG(Scratch_Per_Workitem), 0)
        Wavefronts:
          value: ROUND(AVG(SPI_CSN_WAVE), 0)
        Workgroups:
          value: ROUND(AVG(SPI_CSN_NUM_THREADGROUPS), 0)
        LDS Req:
          value: ROUND(AVG((SQ_INSTS_LDS / $denom)), 0)
        LDS Util:
          value: ROUND(AVG(((100 * SQ_LDS_IDX_ACTIVE) / ($GRBM_GUI_ACTIVE_PER_XCD
            * $cu_per_gpu))), 0)
        LDS Latency:
          value: ROUND(AVG(((SQ_ACCUM_PREV_HIRES / SQ_INSTS_LDS) if (SQ_INSTS_LDS
            != 0) else None)),0)
          coll_level: SQ_INST_LEVEL_LDS
        VL1 Rd:
          value: ROUND(AVG((TCP_TOTAL_READ_sum / $denom)), 0)
        VL1 Wr:
          value: ROUND(AVG((TCP_TOTAL_WRITE_sum / $denom)), 0)
        VL1 Atomic:
          value: ROUND(AVG(((TCP_TOTAL_ATOMIC_WITH_RET_sum + TCP_TOTAL_ATOMIC_WITHOUT_RET_sum)
            / $denom)), 0)
        VL1 Hit:
          value: ROUND(AVG(((100 - ((100 * (((TCP_TCC_READ_REQ_sum + TCP_TCC_WRITE_REQ_sum)
            + TCP_TCC_ATOMIC_WITH_RET_REQ_sum) + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum))
            / TCP_TOTAL_CACHE_ACCESSES_sum)) if (TCP_TOTAL_CACHE_ACCESSES_sum != 0)
            else None )), 0)
        VL1 Lat:
          value: ROUND(AVG(((TCP_TCP_LATENCY_sum / TCP_TA_TCP_STATE_READ_sum) if (TCP_TA_TCP_STATE_READ_sum
            != 0) else None)), 0)
        VL1 Coalesce:
          value: ROUND(AVG(((((TA_TOTAL_WAVEFRONTS_sum * 64) * 100) / (TCP_TOTAL_ACCESSES_sum
            * 4)) if (TCP_TOTAL_ACCESSES_sum != None) else 0)), 0)
        VL1 Stall:
          value: ROUND(AVG((((100 * TCP_TCR_TCP_STALL_CYCLES_sum) / TCP_GATE_EN1_sum)
            if (TCP_GATE_EN1_sum != 0) else None)), 0)
        VL1_L2 Rd:
          value: ROUND(AVG((TCP_TCC_READ_REQ_sum / $denom)), 0)
        VL1_L2 Wr:
          value: ROUND(AVG((TCP_TCC_WRITE_REQ_sum / $denom)), 0)
        VL1_L2 Atomic:
          value: ROUND(AVG(((TCP_TCC_ATOMIC_WITH_RET_REQ_sum + TCP_TCC_ATOMIC_WITHOUT_RET_REQ_sum)
            / $denom)), 0)
        sL1D Rd:
          value: ROUND(AVG((SQC_DCACHE_REQ / $denom)), 0)
        sL1D Hit:
          value: ROUND((AVG(((SQC_DCACHE_HITS / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
        sL1D Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_DCACHE_REQ) if (SQC_DCACHE_REQ
            != 0) else None)) * 100), 0)
          coll_level: SQC_DCACHE_INFLIGHT_LEVEL
        sL1D_L2 Rd:
          value: ROUND(AVG((SQC_TC_DATA_READ_REQ / $denom)), 0)
        sL1D_L2 Wr:
          value: ROUND(AVG((SQC_TC_DATA_WRITE_REQ / $denom)), 0)
        sL1D_L2 Atomic:
          value: ROUND(AVG((SQC_TC_DATA_ATOMIC_REQ / $denom)), 0)
        IL1 Fetch:
          value: ROUND(AVG((SQC_ICACHE_REQ / $denom)), 0)
        IL1 Hit:
          value: ROUND((AVG((SQC_ICACHE_HITS / SQC_ICACHE_REQ)) * 100), 0)
        IL1 Lat:
          value: ROUND((AVG(((SQ_ACCUM_PREV_HIRES / SQC_ICACHE_REQ) if (SQC_ICACHE_REQ
            != 0) else None)) * 100), 0)
        IL1_L2 Rd:
          value: ROUND(AVG((SQC_TC_INST_REQ / $denom)), 0)
        L2 Rd:
          value: ROUND(AVG((TCC_READ_sum / $denom)), 0)
        L2 Wr:
          value: ROUND(AVG((TCC_WRITE_sum / $denom)), 0)
        L2 Atomic:
          value: ROUND(AVG((TCC_ATOMIC_sum / $denom)), 0)
        L2 Hit:
          value: ROUND(AVG((((100 * TCC_HIT_sum) / (TCC_HIT_sum + TCC_MISS_sum)) if
            ((TCC_HIT_sum + TCC_MISS_sum) != 0) else 0)), 0)
        L2 Rd Lat:
          value: null
        L2 Wr Lat:
          value: null
        Fabric_L2 Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_sum / $denom)), 0)
        Fabric_L2 Wr:
          value: ROUND(AVG((TCC_EA0_WRREQ_sum / $denom)), 0)
        Fabric_L2 Atomic:
          value: ROUND(AVG((TCC_EA0_ATOMIC_sum / $denom)), 0)
        Fabric Rd Lat:
          value: ROUND(AVG(((TCC_EA0_RDREQ_LEVEL_sum / TCC_EA0_RDREQ_sum) if (TCC_EA0_RDREQ_sum
            != 0) else  0)), 0)
        Fabric Wr Lat:
          value: ROUND(AVG(((TCC_EA0_WRREQ_LEVEL_sum / TCC_EA0_WRREQ_sum) if (TCC_EA0_WRREQ_sum
            != 0) else  0)), 0)
        Fabric Atomic Lat:
          value: ROUND(AVG(((TCC_EA0_ATOMIC_LEVEL_sum / TCC_EA0_ATOMIC_sum) if (TCC_EA0_ATOMIC_sum
            != 0) else  0)), 0)
        HBM Rd:
          value: ROUND(AVG((TCC_EA0_RDREQ_DRAM_sum / $denom)), 0)
        HBM Wr:
          value: ROUND(AVG((TCC_EA0_WRREQ_DRAM_sum / $denom)), 0)
      comparable: false
      cli_style: mem_chart
      tui_style: mem_chart
@@ -0,0 +1,9 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 400
  title: Roofline
  metrics_description: {}
  data source:
  - None:
      id: 401
      title: Roofline
@@ -1,8 +0,0 @@
 ---
 Panel Config:
  id: 400
  title: Roofline
  data source:
    - None:
        id: 401
        title: Roofline
@@ -1,135 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  data source:
    - metric_table:
        id: 501
        title: Command Processor Fetcher
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPF Utilization:
            avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
              if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF Stall:
            avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-L2 Utilization:
            avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
              if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPF-L2 Stall:
            avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPF-UTCL1 Stall:
            avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
    - metric_table:
        id: 502
        title: Packet Processor
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          CPC Utilization:
            avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
              if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC Stall Rate:
            avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None))
            unit: pct
            tips:
          CPC Packet Decoding Utilization:
            avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: pct
            tips:
          CPC-Workgroup Manager Utilization:
            avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY != 0) else None)
            unit: Pct
            tips:
          CPC-L2 Utilization:
            avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
              if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
            unit: pct
            tips:
          CPC-UTCL1 Stall:
            avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
              != 0) else None)
            unit: pct
            tips:
          CPC-UTCL2 Utilization:
            avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
              if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
            unit: pct
            tips:
@@ -0,0 +1,145 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 500
  title: Command Processor (CPC/CPF)
  metrics_description:
    CPF Utilization: Percent of total cycles where the CPF was busy actively doing
      any work. The ratio of CPF busy cycles over total cycles counted by the CPF.
    CPF Stall: Percent of CPF busy cycles where the CPF was stalled for any reason.
    CPF-L2 Utilization: Percent of total cycles counted by the CPF-L2 interface where
      the CPF-L2 interface was active doing any work. The ratio of CPF-L2 busy cycles
      over total cycles counted by the CPF-L2.
    CPF-L2 Stall: Percent of CPF-L2 busy cycles where the CPF-L2 interface was
      stalled for any reason.
    CPF-UTCL1 Stall: Percent of CPF busy cycles where the CPF was stalled by address
      translation.
    CPC Utilization: Percent of total cycles where the CPC was busy actively doing
      any work. The ratio of CPC busy cycles over total cycles counted by the CPC.
    CPC Stall Rate: Percent of CPC busy cycles where the CPC was stalled for any reason.
    CPC Packet Decoding Utilization: Percent of CPC busy cycles spent decoding commands
      for processing.
    CPC-Workgroup Manager Utilization: Percent of CPC busy cycles spent dispatching
      workgroups to the workgroup manager.
    CPC-L2 Utilization: Percent of total cycles counted by the CPC-L2 interface where
      the CPC-L2 interface was active doing any work.
    CPC-UTCL1 Stall: Percent of CPC busy cycles where the CPC was stalled by address
      translation.
    CPC-UTCL2 Utilization: 'Percent of total cycles counted by the CPC''s L2 address
      translation interface where the CPC was busy doing address translation work.'
  data source:
  - metric_table:
      id: 501
      title: Command processor fetcher (CPF)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPF Utilization:
          avg: AVG((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_BUSY) / (CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE))
            if ((CPF_CPF_STAT_BUSY + CPF_CPF_STAT_IDLE) != 0) else None))
          unit: pct
        CPF Stall:
          avg: AVG((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_STAT_STALL) / CPF_CPF_STAT_BUSY) if (CPF_CPF_STAT_BUSY
            != 0) else None))
          unit: pct
        CPF-L2 Utilization:
          avg: AVG((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_BUSY) / (CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE))
            if ((CPF_CPF_TCIU_BUSY + CPF_CPF_TCIU_IDLE) != 0) else None))
          unit: pct
        CPF-L2 Stall:
          avg: AVG((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          min: MIN((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          max: MAX((((100 * CPF_CPF_TCIU_STALL) / CPF_CPF_TCIU_BUSY) if (CPF_CPF_TCIU_BUSY
            != 0) else None))
          unit: pct
        CPF-UTCL1 Stall:
          avg: AVG(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPF_CMP_UTCL1_STALL_ON_TRANSLATION) / CPF_CPF_STAT_BUSY)
            if (CPF_CPF_STAT_BUSY != 0) else None)
          unit: pct
  - metric_table:
      id: 502
      title: Command processor packet processor (CPC)
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        CPC Utilization:
          avg: AVG((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_BUSY) / (CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE))
            if ((CPC_CPC_STAT_BUSY + CPC_CPC_STAT_IDLE) != 0) else None))
          unit: pct
        CPC Stall Rate:
          avg: AVG((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          min: MIN((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          max: MAX((((100 * CPC_CPC_STAT_STALL) / CPC_CPC_STAT_BUSY) if (CPC_CPC_STAT_BUSY
            != 0) else None))
          unit: pct
        CPC Packet Decoding Utilization:
          avg: AVG((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_BUSY_FOR_PACKET_DECODE) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-Workgroup Manager Utilization:
          avg: AVG((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          min: MIN((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          max: MAX((100 * CPC_ME1_DC0_SPI_BUSY) / CPC_CPC_STAT_BUSY if (CPC_CPC_STAT_BUSY
            != 0) else None)
          unit: pct
        CPC-L2 Utilization:
          avg: AVG((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_TCIU_BUSY) / (CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE))
            if ((CPC_CPC_TCIU_BUSY + CPC_CPC_TCIU_IDLE) != 0) else None))
          unit: pct
        CPC-UTCL1 Stall:
          avg: AVG(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          min: MIN(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          max: MAX(((100 * CPC_UTCL1_STALL_ON_TRANSLATION) / CPC_CPC_STAT_BUSY) if
            (CPC_CPC_STAT_BUSY != 0) else None)
          unit: pct
        CPC-UTCL2 Utilization:
          avg: AVG((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          min: MIN((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          max: MAX((((100 * CPC_CPC_UTCL2IU_BUSY) / (CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE))
            if ((CPC_CPC_UTCL2IU_BUSY + CPC_CPC_UTCL2IU_IDLE) != 0) else None))
          unit: pct
@@ -1,167 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  data source:
    - metric_table:
        id: 601
        title: Workgroup Manager Utilizations
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Accelerator Utilization:
            avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
            unit: Pct
            tips:
          Scheduler-Pipe Utilization:
            avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu * $se_per_gpu))
            unit: Pct
            tips:
          Workgroup Manager Utilization:
            avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
            unit: Pct
            tips:
          Shader Engine Utilization:
            avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
            unit: Pct
            tips:
          SIMD Utilization:
            avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Dispatched Workgroups:
            avg: AVG(SPI_CSN_NUM_THREADGROUPS)
            min: MIN(SPI_CSN_NUM_THREADGROUPS)
            max: MAX(SPI_CSN_NUM_THREADGROUPS)
            unit: Workgroups
            tips:
          Dispatched Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          VGPR Writes:
            avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
          SGPR Writes:
            avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
              None))
            unit: Cycles/wave
            tips:
    - metric_table:
        id: 602
        title: Workgroup Manager - Resource Allocation
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Not-scheduled Rate (Workgroup Manager):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Not-scheduled Rate (Scheduler-Pipe):
            avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None)
            unit: Pct
            tips:
          Scheduler-Pipe Stall Rate:
            avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD !=
              0) else None))
            unit: Pct
            tips:
          Scratch Stall Rate:
            avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu)) if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
            unit: Pct
            tips:
          Insufficient SIMD Waveslots:
            avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD VGPRs:
            avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient SIMD SGPRs:
            avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU LDS:
            avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Insufficient CU Barriers:
            avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Workgroup Limit:
            avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
          Reached CU Wavefront Limit:
            avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
            unit: Pct
            tips:
@@ -0,0 +1,201 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 600
  title: Workgroup Manager (SPI)
  metrics_description:
    Accelerator Utilization: The percent of cycles in the kernel where the accelerator
      was actively doing any work.
    Scheduler-Pipe Utilization: The percent of total scheduler-pipe cycles in the
      kernel where the scheduler-pipes were actively doing any work.
    Workgroup Manager Utilization: The percent of cycles in the kernel where the workgroup
      manager was actively doing any work.
    Shader Engine Utilization: The percent of total shader engine cycles in the kernel
      where any CU in a shader-engine was actively doing any work, normalized over
      all shader-engines. Low values (much less than 100%) indicate that the accelerator
      was not fully saturated by the kernel, or a potential load-imbalance issue.
    SIMD Utilization: The percent of total SIMD cycles in the kernel where any SIMD
      on a CU was actively doing any work, summed over all CUs. Low values (less than
      100%) indicate that the accelerator was not fully saturated by the kernel, or
      a potential load-imbalance issue.
    Dispatched Workgroups: The total number of workgroups forming this kernel launch.
    Dispatched Wavefronts: The total number of wavefronts, summed over all workgroups,
      forming this kernel launch.
    VGPR Writes: The average number of cycles spent initializing VGPRs at wave creation.
    SGPR Writes: The average number of cycles spent initializing SGPRs at wave creation.
    Not-scheduled Rate (Workgroup Manager): The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the workgroup manager rather than a lack of a CU or SIMD with sufficient
      resources.
    Not-scheduled Rate (Scheduler-Pipe): 'The percent of total scheduler-pipe cycles
      in the kernel where a workgroup could not be scheduled to a CU due to a bottleneck
      within the scheduler-pipes rather than a lack of a CU or SIMD with sufficient
      resources.'
    Scheduler-Pipe Stall Rate: The percent of total scheduler-pipe cycles in the kernel
      where a workgroup could not be scheduled to a CU due to occupancy limitations
      (like a lack of a CU or SIMD with sufficient resources).
    Scratch Stall Rate: The percent of total shader-engine cycles in the kernel where
      a workgroup could not be scheduled to a CU due to lack of private (a.k.a., scratch)
      memory slots. While this can reach up to 100%, note that the actual occupancy
      limitations on a kernel using private memory are typically quite small (for
      example, less than 1% of the total number of waves that can be scheduled to
      an accelerator).
    Insufficient SIMD Waveslots: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available waveslots.
    Insufficient SIMD VGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available VGPRs.
    Insufficient SIMD SGPRs: The percent of total SIMD cycles in the kernel where
      a workgroup could not be scheduled to a SIMD due to lack of available SGPRs.
    Insufficient CU LDS: The percent of total CU cycles in the kernel where a workgroup
      could not be scheduled to a CU due to lack of available LDS.
    Insufficient CU Barriers: The percent of total CU cycles in the kernel where a
      workgroup could not be scheduled to a CU due to lack of available barriers.
    Reached CU Workgroup Limit: The percent of total CU cycles in the kernel where
      a workgroup could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
    Reached CU Wavefront Limit: The percent of total CU cycles in the kernel where
      a wavefront could not be scheduled to a CU due to limits within the workgroup
      manager. This is expected to always be zero on CDNA2 or newer accelerators
      (and small for previous accelerators).
  data source:
  - metric_table:
      id: 601
      title: Workgroup manager utilizations
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Accelerator Utilization:
          avg: AVG(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          min: MIN(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          max: MAX(100 * $GRBM_GUI_ACTIVE_PER_XCD / $GRBM_COUNT_PER_XCD)
          unit: Pct
        Scheduler-Pipe Utilization:
          avg: AVG(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          min: MIN(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          max: MAX(100 * SPI_CSN_BUSY / ($GRBM_GUI_ACTIVE_PER_XCD * $pipes_per_gpu
            * $se_per_gpu))
          unit: Pct
        Workgroup Manager Utilization:
          avg: AVG(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX(100 * $GRBM_SPI_BUSY_PER_XCD / $GRBM_GUI_ACTIVE_PER_XCD)
          unit: Pct
        Shader Engine Utilization:
          avg: AVG(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          min: MIN(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          max: MAX(100 * SQ_BUSY_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $se_per_gpu))
          unit: Pct
        SIMD Utilization:
          avg: AVG(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SQ_BUSY_CU_CYCLES / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Dispatched Workgroups:
          avg: AVG(SPI_CSN_NUM_THREADGROUPS)
          min: MIN(SPI_CSN_NUM_THREADGROUPS)
          max: MAX(SPI_CSN_NUM_THREADGROUPS)
          unit: Workgroups
        Dispatched Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        VGPR Writes:
          avg: AVG((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((4 * SPI_VWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
        SGPR Writes:
          avg: AVG((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          min: MIN((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          max: MAX((((1 * SPI_SWC_CSC_WR) / SPI_CSN_WAVE) if (SPI_CSN_WAVE != 0) else
            None))
          unit: Cycles/wave
  - metric_table:
      id: 602
      title: Workgroup Manager - Resource Allocation
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Not-scheduled Rate (Workgroup Manager):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Not-scheduled Rate (Scheduler-Pipe):
          avg: AVG((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_REQ_NO_ALLOC / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Scheduler-Pipe Stall Rate:
          avg: AVG((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          min: MIN((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          max: MAX((((100 * SPI_RA_RES_STALL_CSN) / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None))
          unit: Pct
        Scratch Stall Rate:
          avg: AVG((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          min: MIN((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          max: MAX((100 * SPI_RA_TMP_STALL_CSN / ($GRBM_SPI_BUSY_PER_XCD * $se_per_gpu))
            if ($GRBM_SPI_BUSY_PER_XCD != 0) else None)
          unit: Pct
        Insufficient SIMD Waveslots:
          avg: AVG(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_WAVE_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD VGPRs:
          avg: AVG(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_VGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient SIMD SGPRs:
          avg: AVG(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(100 * SPI_RA_SGPR_SIMD_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU LDS:
          avg: AVG(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_LDS_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Insufficient CU Barriers:
          avg: AVG(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_BAR_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Workgroup Limit:
          avg: AVG(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_TGLIM_CU_FULL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
        Reached CU Wavefront Limit:
          avg: AVG(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          min: MIN(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          max: MAX(400 * SPI_RA_WVLIM_STALL_CSN / ($GRBM_GUI_ACTIVE_PER_XCD * $cu_per_gpu))
          unit: Pct
@@ -1,142 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 700
  title: Wavefront
  data source:
    - metric_table:
        id: 701
        title: Wavefront Launch Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Grid Size:
            avg: AVG(Grid_Size)
            min: MIN(Grid_Size)
            max: MAX(Grid_Size)
            unit: Work Items
            tips:
          Workgroup Size:
            avg: AVG(Workgroup_Size)
            min: MIN(Workgroup_Size)
            max: MAX(Workgroup_Size)
            unit: Work Items
            tips:
          Total Wavefronts:
            avg: AVG(SPI_CSN_WAVE)
            min: MIN(SPI_CSN_WAVE)
            max: MAX(SPI_CSN_WAVE)
            unit: Wavefronts
            tips:
          Saved Wavefronts:
            avg: AVG(SQ_WAVES_SAVED)
            min: MIN(SQ_WAVES_SAVED)
            max: MAX(SQ_WAVES_SAVED)
            unit: Wavefronts
            tips:
          Restored Wavefronts:
            avg: AVG(SQ_WAVES_RESTORED)
            min: MIN(SQ_WAVES_RESTORED)
            max: MAX(SQ_WAVES_RESTORED)
            unit: Wavefronts
            tips:
          VGPRs:
            avg: AVG(Arch_VGPR)
            min: MIN(Arch_VGPR)
            max: MAX(Arch_VGPR)
            unit: Registers
            tips:
          AGPRs:
            avg: AVG(Accum_VGPR)
            min: MIN(Accum_VGPR)
            max: MAX(Accum_VGPR)
            unit: Registers
            tips:
          SGPRs:
            avg: AVG(SGPR)
            min: MIN(SGPR)
            max: MAX(SGPR)
            unit: Registers
            tips:
          LDS Allocation:
            avg: AVG(LDS_Per_Workgroup)
            min: MIN(LDS_Per_Workgroup)
            max: MAX(LDS_Per_Workgroup)
            unit: Bytes
            tips:
          Scratch Allocation:
            avg: AVG(Scratch_Per_Workitem)
            min: MIN(Scratch_Per_Workitem)
            max: MAX(Scratch_Per_Workitem)
            unit: Bytes/Workitem
            tips:
    - metric_table:
        id: 702
        title: Wavefront Runtime Stats
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Kernel Time:
            avg: AVG((End_Timestamp - Start_Timestamp))
            min: MIN((End_Timestamp - Start_Timestamp))
            max: MAX((End_Timestamp - Start_Timestamp))
            unit: ns
            tips:
          Kernel Time (Cycles):
            avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
            min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
            max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
            unit: Cycle
            tips:
          Instructions per wavefront:
            avg: AVG((SQ_INSTS / SQ_WAVES))
            min: MIN((SQ_INSTS / SQ_WAVES))
            max: MAX((SQ_INSTS / SQ_WAVES))
            unit: Instr/wavefront
            tips:
          Wave Cycles:
            avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
            min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
            max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Dependency Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Issue Wait Cycles:
            avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
            min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
            max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Active Cycles:
            avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
            unit: (Cycles + $normUnit)
            tips:
          Wavefront Occupancy:
            avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
            unit: Wavefronts
            coll_level: SQ_LEVEL_WAVES
            tips:
@@ -0,0 +1,173 @@
 # AUTOGENERATED FILE. Only edit for testing purposes, not for development. Generated from utils/unified_config.yaml. Generated by utils/split_config.py
 Panel Config:
  id: 700
  title: Wavefront
  metrics_description:
    Grid Size: The total number of work-items (or, threads) launched as a part of
      the kernel dispatch. In HIP, this is equivalent to the total grid size multiplied
      by the total workgroup (or, block) size.
    Workgroup Size: The total number of work-items (or, threads) in each workgroup
      (or, block) launched as part of the kernel dispatch. In HIP, this is equivalent
      to the total block size.
    Total Wavefronts: "The total number of wavefronts launched as part of the kernel\
      \ dispatch. On AMD Instinct\u2122 CDNA\u2122 accelerators and GCN\u2122 GPUs,\
      \ the wavefront size is always 64 work-items. Thus, the total number of wavefronts\
      \ should be equivalent to the ceiling of grid size divided by 64."
    Saved Wavefronts: The total number of wavefronts saved at a context-save.
    Restored Wavefronts: The total number of wavefronts restored from a context-save.
    VGPRs: 'The number of architected vector general-purpose registers allocated for
      the kernel, see VALU. Note: this may not exactly match the number of VGPRs requested
      by the compiler due to allocation granularity.'
    AGPRs: 'The number of accumulation vector general-purpose registers allocated
      for the kernel, see AGPRs. Note: this may not exactly match the number of AGPRs
      requested by the compiler due to allocation granularity.'
    SGPRs: 'The number of scalar general-purpose registers allocated for the kernel,
      see SALU. Note: this may not exactly match the number of SGPRs requested by
      the compiler due to allocation granularity.'
    LDS Allocation: 'The number of bytes of LDS memory (or, shared memory) allocated
      for this kernel. Note: This may also be larger than what was requested at compile
      time due to both allocation granularity and dynamic per-dispatch LDS allocations.'
    Scratch Allocation: The number of bytes of scratch memory requested per work-item
      for this kernel. Scratch memory is used for stack memory on the accelerator,
      as well as for register spills and restores.
    Kernel Time: The total duration of the executed kernel.
    Kernel Time (Cycles): The total duration of the executed kernel in cycles.
    Instructions per wavefront: The average number of instructions (of all types)
      executed per wavefront. This is averaged over all wavefronts in a kernel dispatch.
    Wave Cycles: The number of cycles a wavefront in the kernel dispatch spent resident
      on a compute unit per normalization unit. This is averaged over all wavefronts
      in a kernel dispatch.
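    Dependency Wait Cycles: The number of cycles a wavefront in the kernel dispatch
      stalled waiting on memory of any kind (e.g., instruction fetch, vector or scalar
      memory, etc.) per normalization unit. The sum of this metric, Issue Wait Cycles
      and Active Cycles should be equal to the total Wave Cycles metric.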
    Issue Wait Cycles: The number of cycles a wavefront in the kernel dispatch was
      unable to issue an instruction for any reason (e.g., execution pipe back-pressure,
      arbitration loss, etc.) per normalization unit. This counter is incremented
      at every cycle by all wavefronts on a CU unable to issue an instruction. As
      such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter because another wave could be
      actively executing while a wave is issue stalled. The sum of this metric, Dependency
      Wait Cycles and Active Cycles should be equal to the total Wave Cycles metric.
    Active Cycles: The average number of cycles a wavefront in the kernel dispatch
      was actively executing instructions per normalization unit. This measurement
      is made on a per-wavefront basis, and may include cycles that another wavefront
      spent actively executing (on another execution unit, for example) or was stalled.
      As such, it is most useful to get a sense of how waves were spending their time,
      rather than identification of a precise limiter. The sum of this metric, Issue
      Wait Cycles and Dependency Wait Cycles should be equal to the total Wave Cycles
      metric.
    Wavefront Occupancy: 'The time-averaged number of wavefronts resident on the accelerator
      over the lifetime of the kernel. Note: this metric may be inaccurate for short-running
      kernels (less than 1 ms).'
  data source:
  - metric_table:
      id: 701
      title: Wavefront Launch Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Grid Size:
          avg: AVG(Grid_Size)
          min: MIN(Grid_Size)
          max: MAX(Grid_Size)
          unit: Work Items
        Workgroup Size:
          avg: AVG(Workgroup_Size)
          min: MIN(Workgroup_Size)
          max: MAX(Workgroup_Size)
          unit: Work Items
        Total Wavefronts:
          avg: AVG(SPI_CSN_WAVE)
          min: MIN(SPI_CSN_WAVE)
          max: MAX(SPI_CSN_WAVE)
          unit: Wavefronts
        Saved Wavefronts:
          avg: AVG(SQ_WAVES_SAVED)
          min: MIN(SQ_WAVES_SAVED)
          max: MAX(SQ_WAVES_SAVED)
          unit: Wavefronts
        Restored Wavefronts:
          avg: AVG(SQ_WAVES_RESTORED)
          min: MIN(SQ_WAVES_RESTORED)
          max: MAX(SQ_WAVES_RESTORED)
          unit: Wavefronts
        VGPRs:
          avg: AVG(Arch_VGPR)
          min: MIN(Arch_VGPR)
          max: MAX(Arch_VGPR)
          unit: Registers
        AGPRs:
          avg: AVG(Accum_VGPR)
          min: MIN(Accum_VGPR)
          max: MAX(Accum_VGPR)
          unit: Registers
        SGPRs:
          avg: AVG(SGPR)
          min: MIN(SGPR)
          max: MAX(SGPR)
          unit: Registers
        LDS Allocation:
          avg: AVG(LDS_Per_Workgroup)
          min: MIN(LDS_Per_Workgroup)
          max: MAX(LDS_Per_Workgroup)
          unit: Bytes
        Scratch Allocation:
          avg: AVG(Scratch_Per_Workitem)
          min: MIN(Scratch_Per_Workitem)
          max: MAX(Scratch_Per_Workitem)
          unit: Bytes/Workitem
  - metric_table:
      id: 702
      title: Wavefront Runtime Stats
      header:
        metric: Metric
        avg: Avg
        min: Min
        max: Max
        unit: Unit
      metric:
        Kernel Time:
          avg: AVG((End_Timestamp - Start_Timestamp))
          min: MIN((End_Timestamp - Start_Timestamp))
          max: MAX((End_Timestamp - Start_Timestamp))
          unit: ns
        Kernel Time (Cycles):
          avg: AVG($GRBM_GUI_ACTIVE_PER_XCD)
          min: MIN($GRBM_GUI_ACTIVE_PER_XCD)
          max: MAX($GRBM_GUI_ACTIVE_PER_XCD)
          unit: Cycle
        Instructions per wavefront:
          avg: AVG((SQ_INSTS / SQ_WAVES))
          min: MIN((SQ_INSTS / SQ_WAVES))
          max: MAX((SQ_INSTS / SQ_WAVES))
          unit: Instr/wavefront
        Wave Cycles:
          avg: AVG(((4 * SQ_WAVE_CYCLES) / $denom))
          min: MIN(((4 * SQ_WAVE_CYCLES) / $denom))
          max: MAX(((4 * SQ_WAVE_CYCLES) / $denom))
          unit: (Cycles + $normUnit)
        Dependency Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Issue Wait Cycles:
          avg: AVG(((4 * SQ_WAIT_INST_ANY) / $denom))
          min: MIN(((4 * SQ_WAIT_INST_ANY) / $denom))
          max: MAX(((4 * SQ_WAIT_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Active Cycles:
          avg: AVG(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          min: MIN(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          max: MAX(((4 * SQ_ACTIVE_INST_ANY) / $denom))
          unit: (Cycles + $normUnit)
        Wavefront Occupancy:
          avg: AVG((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          min: MIN((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          max: MAX((SQ_ACCUM_PREV_HIRES / $GRBM_GUI_ACTIVE_PER_XCD))
          unit: Wavefronts
          coll_level: SQ_LEVEL_WAVES
@@ -1,277 +0,0 @@
 ---
 # Add description/tips for each metric in this section.
 # So it could be shown in hover.
 Metric Description:
 # Define the panel properties and properties of each metric in the panel.
 Panel Config:
  id: 1000
  title: Compute Units - Instruction Mix
  data source:
    - metric_table:
        id: 1001
        title: Overall Instruction Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          VALU:
            avg: AVG(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            min: MIN(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            max: MAX(((SQ_INSTS_VALU - SQ_INSTS_MFMA) / $denom))
            unit: (instr + $normUnit)
            tips:
          VMEM:
            # TODO: need to fix this when the new FLAT/LDS counts
            # are present in ROCm
            avg: AVG(((SQ_INSTS_VMEM) / $denom))
            min: MIN(((SQ_INSTS_VMEM) / $denom))
            max: MAX(((SQ_INSTS_VMEM) / $denom))
            unit: (instr + $normUnit)
            tips:
          LDS:
            # TODO: need to fix this when the new FLAT/LDS counts
            # are present in ROCm
            avg: AVG((SQ_INSTS_LDS / $denom))
            min: MIN((SQ_INSTS_LDS / $denom))
            max: MAX((SQ_INSTS_LDS / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA:
            avg: AVG((SQ_INSTS_MFMA / $denom))
            min: MIN((SQ_INSTS_MFMA / $denom))
            max: MAX((SQ_INSTS_MFMA / $denom))
            unit: (instr + $normUnit)
            tips:
          SALU:
            avg: AVG((SQ_INSTS_SALU / $denom))
            min: MIN((SQ_INSTS_SALU / $denom))
            max: MAX((SQ_INSTS_SALU / $denom))
            unit: (instr + $normUnit)
            tips:
          SMEM:
            avg: AVG((SQ_INSTS_SMEM / $denom))
            min: MIN((SQ_INSTS_SMEM / $denom))
            max: MAX((SQ_INSTS_SMEM / $denom))
            unit: (instr + $normUnit)
            tips:
          Branch:
            avg: AVG((SQ_INSTS_BRANCH / $denom))
            min: MIN((SQ_INSTS_BRANCH / $denom))
            max: MAX((SQ_INSTS_BRANCH / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1002
        title: VALU Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          INT32:
            avg: AVG((SQ_INSTS_VALU_INT32 / $denom))
            min: MIN((SQ_INSTS_VALU_INT32 / $denom))
            max: MAX((SQ_INSTS_VALU_INT32 / $denom))
            unit: (instr + $normUnit)
            tips:
          INT64:
            avg: AVG((SQ_INSTS_VALU_INT64 / $denom))
            min: MIN((SQ_INSTS_VALU_INT64 / $denom))
            max: MAX((SQ_INSTS_VALU_INT64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F16-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F32-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-ADD:
            avg: AVG((SQ_INSTS_VALU_ADD_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_ADD_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_ADD_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-MUL:
            avg: AVG((SQ_INSTS_VALU_MUL_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_MUL_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_MUL_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-FMA:
            avg: AVG((SQ_INSTS_VALU_FMA_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_FMA_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_FMA_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          F64-Trans:
            avg: AVG((SQ_INSTS_VALU_TRANS_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_TRANS_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_TRANS_F64 / $denom))
            unit: (instr + $normUnit)
            tips:
          Conversion:
            avg: AVG((SQ_INSTS_VALU_CVT / $denom))
            min: MIN((SQ_INSTS_VALU_CVT / $denom))
            max: MAX((SQ_INSTS_VALU_CVT / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1003
        title: VMEM Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          Global/Generic Instr:
            avg: AVG((TA_FLAT_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Read:
            avg: AVG((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Write:
            avg: AVG((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Global/Generic Atomic:
            avg: AVG((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_FLAT_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Instr:
            avg: AVG((TA_BUFFER_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Read:
            avg: AVG((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_READ_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Write:
            avg: AVG((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_WRITE_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
          Spill/Stack Atomic:
            avg: AVG((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            min: MIN((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            max: MAX((TA_BUFFER_ATOMIC_WAVEFRONTS_sum / $denom))
            unit: (instr + $normUnit)
            tips:
    - metric_table:
        id: 1004
        title: MFMA Arithmetic Instr Mix
        header:
          metric: Metric
          avg: Avg
          min: Min
          max: Max
          unit: Unit
          tips: Tips
        metric:
          MFMA-I8:
            avg: AVG((SQ_INSTS_VALU_MFMA_I8 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_I8 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_I8 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F8:
            avg: AVG((SQ_INSTS_VALU_MFMA_F8 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F8 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F8 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F16:
            avg: AVG((SQ_INSTS_VALU_MFMA_F16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F16 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-BF16:
            avg: AVG((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_BF16 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F32:
            avg: AVG((SQ_INSTS_VALU_MFMA_F32 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F32 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F32 / $denom))
            unit: (instr + $normUnit)
            tips:
          MFMA-F64:
            avg: AVG((SQ_INSTS_VALU_MFMA_F64 / $denom))
            min: MIN((SQ_INSTS_VALU_MFMA_F64 / $denom))
            max: MAX((SQ_INSTS_VALU_MFMA_F64 / $denom))
            unit: (instr + $normUnit)
            tips: